I have a problem trying to execute shell scripts from AppleScript. I do a "grep", but as soon as the search string contains special characters it doesn't work as intended.
(The script reads a list of subfolders in a directory and checks whether any of the subfolder names appear in a file.)
Here is my script:
set searchFile to "/tmp/output.txt"
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & space & searchFile
do shell script theCommand
tell application "Finder"
set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
set theCommand to "grep -c " & quoted form of company & space & quoted form of searchFile
try
do shell script theCommand
set CompanyName to company as string
return CompanyName
on error
end try
end repeat
return false
The problem occurs e.g. with strings containing umlauts. "theCommand" is somehow encoded differently than when I type the command on the CLI directly.
$ grep -c 'Württemberg' '/tmp/output.txt' --> typed on command line
3
$ grep -c 'Württemberg' '/tmp/output.txt' --> copy & pasted from AppleScript
0
$ grep -c 'rttemberg' '/tmp/output.txt' --> no umlauts, no problems
3
The "ü" from the first and the second line are different; a echo 'Württemberg' | openssl base64 shows this.
I tried several encoding tricks at different places, basically everything I could find or think of.
Does anyone have any idea? How can I check which encoding a string has?
Thanks in advance!
Sebastian
Overview
This can work by escaping each accented character in a company name before it is used in the grep command.
So, you'll need to escape each of those accented characters with a double backslash (i.e. \\ in AppleScript source, which yields a single literal backslash). For example:
The ü in Württemberg will need to become \\ü
The ö in Königsberg will need to become \\ö
The ß in Einbahnstraße will need to become \\ß
Why is this necessary:
These accented characters, such as a u with diaeresis, are certainly being encoded differently. Exactly which encoding they receive is difficult to ascertain. My assumption is that the encoding pattern used begins with a backslash, which is why escaping those characters with backslashes fixes the issue. For comparison, in C/C++ source the ü can be written as the escape sequence \u00FC.
Solution
In the complete script below you'll notice the following:
set accentedChars to {"ü", "ö", "ß", "á", "ė"} has been added to hold a list of all characters that will need to be escaped. You'll need to explicitly state each one as there doesn't seem to be a way to infer whether the character has an accent.
Before assigning the grep command to the theCommand variable we first escape the necessary characters via the line reading:
set company to escapeChars(company, accentedChars)
As you can see, here we are passing two arguments to the escapeChars sub-routine (i.e. the non-escaped company variable and the list of accented characters).
In the escapeChars sub-routine we iterate over each char in the accentedChars list and invoke the findAndReplace sub-routine, which escapes, with backslashes, any instances of those characters found in the company variable.
Complete script:
set searchFile to "/tmp/output.txt"
set accentedChars to {"ü", "ö", "ß", "á", "ė"}
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & ¬
space & searchFile
do shell script theCommand
tell application "Finder"
set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
set company to escapeChars(company, accentedChars)
set theCommand to "grep -c " & quoted form of company & ¬
space & quoted form of searchFile
try
do shell script theCommand
set CompanyName to company as string
return CompanyName
on error
end try
end repeat
return false
(**
* Checks each character of a given word. If any characters of the word
* match a character in the given list of characters they will be escaped.
*
* #param {text} searchWord - The word to check the characters of.
* #param {text} charactersList - List of characters to be escaped.
* #returns {text} The new text with the item(s) replaced.
*)
on escapeChars(searchWord, charactersList)
repeat with char in charactersList
set searchWord to findAndReplace(char, ("\\" & char), searchWord)
end repeat
return searchWord
end escapeChars
(**
* Replaces all occurrences of findString with replaceString
*
* #param {text} findString - The text string to find.
* #param {text} replaceString - The replacement text string.
* #param {text} searchInString - Text string to search.
* #returns {text} The new text with the item(s) replaced.
*)
on findAndReplace(findString, replaceString, searchInString)
-- Save the current text item delimiters so they can be restored
set oldTIDs to text item delimiters of AppleScript
-- Split the string on findString...
set text item delimiters of AppleScript to findString
set searchInString to text items of searchInString
-- ...then join the pieces back together with replaceString
set text item delimiters of AppleScript to replaceString
set searchInString to "" & searchInString
-- Restore the original delimiters
set text item delimiters of AppleScript to oldTIDs
return searchInString
end findAndReplace
Note about current counts:
Currently your grep pattern only reports the number of lines the word was found on, not how many instances of the word were found.
If you want the actual number of instances of the word then use the -o option with grep to output each occurrence. Then pipe that to wc with the -l option to count the number of lines. For example:
grep -o 'Württemberg' /tmp/output.txt | wc -l
and in your AppleScript that would be:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l"
Tip: If you want to remove the leading spaces in the count that gets logged, pipe it to sed to strip the spaces. For example, via your script:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l | sed -e 's/ //g'"
and the equivalent via the command line:
grep -o 'Württemberg' /tmp/output.txt | wc -l | sed -e 's/ //g'
Related
I have a text file using a markup language (similar to Wikipedia articles).
cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.
I need to select the word "having:" only because that is part of regular text. I tried
grep -v '[*:*]' test.txt
This will correctly avoid the tags, but does not select the expected word.
The square brackets specify a character class, so your regular expression looks for any occurrence of one of the characters * or : (or *, but we said that already, didn't we?)
grep has the option -o to only print the matching text, so something like
grep -ow '[^[:space:]]*:[^[:space:]]*' file.txt
would extract any text with a colon in it, surrounded by zero or more non-whitespace characters on each side. The -w option adds the condition that the match needs to be between word boundaries.
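For example, on the test.txt shown above this should print both the plain-text match and the bracketed one, which is why the context restriction discussed next matters:
$ grep -ow '[^[:space:]]*:[^[:space:]]*' test.txt
having:
double: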
However, if you want to restrict in which context you want to match the text, you will probably need to switch to a more capable tool than plain grep. For example, you could use sed to preprocess each line to remove any bracketed text, and then look for matches in the remaining text.
sed -e 's/\[.*]//g' -e 's/ [^: ]*$/ /' -e 's/[^: ]* //g' -e 's/ /\n/' file.txt
(This assumes that your sed recognizes \n in the replacement string as a literal newline. There are simple workarounds available if it doesn't, but let's not go there if it's not necessary.)
In brief, we first replace any text between square brackets. (This needs to be improved if your input could contain multiple sequences of square brackets on a line with normal text between them. Your example only shows nested square brackets, but my approach is probably too simple for either case.) Then, we remove any words which don't contain a colon, with a special provision for the last word on the line, and some subsequent cleanup. Finally, we replace any remaining spaces with newlines, and (implicitly) print whatever is left. (This still ends up printing one newline too many, but that is easy to fix up later.)
Alternatively, we could use sed to remove any bracketed expressions, then use grep on the remaining tokens.
sed -e :a -e 's/\[[^][]*\]//' -e ta file.txt |
grep -ow '[^[:space:]]*:[^[:space:]]*'
The :a creates a label a and ta says to jump back to that label and try again if the regex matched. This one also demonstrates how to handle nested and repeated brackets. (I suppose it could be refactored into the previous attempt, so we could avoid the pipe to grep. But outlining different solution models is also useful here, I suppose.)
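As a quick illustration of the label loop on a made-up input, each pass removes one innermost bracket pair until none remain:
$ echo 'keep [[nested [deep]] text] this' | sed -e :a -e 's/\[[^][]*\]//' -e ta
keep  this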
If you wanted to ensure that there is at least one non-colon character adjacent to the colon, you could do something like
... file.txt |
grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+'
where the -E option selects a slightly more modern regex dialect which allows us to use | between alternatives and + for one or more repetitions. (Basic grep in 1969 did not have these features at all; much later, the POSIX standard grafted them on with a slightly wacky syntax which requires you to backslash them to remove the literal meaning and select the metacharacter behavior... but let's not go there.)
Notice also how [^:[:space:]] matches a single character which is not a colon or a whitespace character, where [:space:] is the (slightly arcane) special POSIX named character class which matches any whitespace character (regular space, horizontal tab, vertical tab, possibly Unicode whitespace characters, depending on locale).
Awk easily lets you iterate over the tokens on a line. The requirement to ignore matches within square brackets complicates matters somewhat; you could keep a separate variable to keep track of whether you are inside brackets or not.
awk '{ for(i=1; i<=NF; ++i) {
if($i ~ /\]/) { brackets=0; next }
if($i ~ /\[/) brackets=1;
if(brackets) next;
if($i ~ /:/) print $i } }' file.txt
This again hard-codes some perhaps incorrect assumptions about how the brackets can be placed. It will behave unexpectedly if a single token contains a closing square bracket followed by an opening one, and has an oversimplified treatment of nested brackets (the first closing bracket after a series of opening brackets will effectively assume we are no longer inside brackets).
A combined solution using sed and awk:
sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
sed will change all spaces to a newline
awk (or gawk) will output all lines matching $0~/:$/, as long as i equals zero
The last part of the awk stuff keeps a count of the opening and closing brackets.
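Run against the sample test.txt, this should print just the one token:
$ sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
having: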
Another solution using sed and grep:
sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt | grep ':$'
's/\[.*\]+//g' will filter the stuff between brackets
's/ /\n/g' will replace a space with a newline
grep will only find lines ending with :
A third on using only awk:
gawk '{ for (t=1;t<=NF;t++){
if(i==0 && $t~/:$/) print $t;
i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt
gsub returns the number of replacements it made.
The variable i is used to count the nesting level of brackets: it is incremented for every [ and decremented for every ], using the fact that gsub(/\[/,"",$t) returns the number of characters it replaced. For a token like [[][ the count increases by (3-1=) 2. When a token contains brackets AND a colon my code will fail, because the token is matched against /:$/ (and possibly printed) before the bracket count is updated.
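A quick illustration of the return value on that example token:
$ echo '[[][' | awk '{ print gsub(/\[/, "") - gsub(/\]/, "") }'
2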
Context:
After running the following command on my server:
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022 > analisis.txt
I get a text file with thousands of lines like this example:
loggers1/PCRF1_17868/PCRF12_01_03_2022_00_15_39.log:[C]|01-03-2022:00:18:20:183401|140404464875264|TRACKING: CCR processing Compleated for SubId-5281181XXXXX, REQNO-1, REQTYPE-3,
SId-mscp01.herpgwXX.epc.mncXXX.mccXXX.XXXXX.org;25b8510c;621dbaab;3341100102036XX-27cf0XXX,
RATTYPE-1004, ResCode-5005 |processCCR|ProcessingUnit.cpp|423
(X represents incrementing numbers)
Problem:
The output is filled with unnecessary data. The only string portions I need are the MSISDN and IMSI, comma-separated, for each line, like this:
5281181XXXXX,3341100102036XX
Steps I tried
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022| grep -o -P
'(?<=SubId-).*?(?=, REQ)' > analisis1.txt
This gave me the first part of the solution:
5281181XXXXX
However, when I tried to get the second string located between '334110' and "-"
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022| grep -o -P
'(?<=SubId-).?(?=, REQ)' | grep -o -P '(?<=334110).?(?=-)' >
analisis1.txt
it doesn't work.
Any input will be appreciated.
To get 5281181XXXXX or the second string located between '334110' and "-" you can use a pattern like:
\b(?:SubId-|334110)\K[^,\s-]+
The pattern matches:
\b A word boundary to prevent a partial word match
(?: Non capture group to match as a whole
SubId- Match literally
| Or
334110 Match literally
) Close the non capture group
\K Forget what is matched so far
[^,\s-]+ Match 1+ occurrences of any char other than a comma, a whitespace char, or -
That will match:
5281181XXXXX
0102036XX
The command could look like
zgrep "ResCode-5005" /loggers1/PCRF*/_01_03_2022 | grep -oP '\b(?:SubId-|334110)\K[^,\s-]+' > analisis1.txt
I'm trying to figure out a more elegant way to replace a unique piece of text in a file with a URL.
It seems that sed is interpreting the URL as part of its evaluation logic instead of just replacing the text from a bash variable.
My file looks something like:
$srcRemoveSoftwareURL = "softwareURL"
and I'm attempting to (case-sensitive) search/replace softwareURL with the actual URL.
I'm using a bash script to help with the manipulation and I'm setting up my variables like so:
STORAGE_ENDPOINT_URL="http://mywebsite.com"
sas_url="se=2021-07-20T18%3A42Z&sp=rl&spr=https&sv=2018-11-09&sr=s&sig=oI/T9oHqzfEtuTjAotLyLN3IXbkiADGTPQllkyJlvEA%3D"
softwareURL="$STORAGE_ENDPOINT_URL/1-remove-software.sh?$sas_url"
# the resulting URL is like this:
# http://mywebsite.com/1-remove-software.sh?se=2021-07-20T18%3A42Z&sp=rl&spr=https&sv=2018-11-09&sr=s&sig=oI/T9oHqzfEtuTjAotLyLN3IXbkiADGTPQllkyJlvEA%3D
I then use sed to replace the text:
sed "s|softwareURL|$softwareURL|" template_file.sh
I recognize that bash expands the $softwareURL variable into the sed script, but sed then interprets part of the URL as its own replacement syntax.
At the moment my resulting template file looks like so:
$srcRemoveSoftwareURL = "http://mywebsite.com/1-remove-software.sh?se=2021-07-20T18%3A42ZsoftwareURLsp=rlsoftwareURLspr=httpssoftwareURLsv=2018-11-09softwareURLsr=ssoftwareURLsig=oI/T9oHqzfEtuTjAotLyLN3IXbkiIIGTPQllkyJlvEA%3D
It seems that sed is also finding the ampersand & characters in the URL and replacing each with the literal text softwareURL.
What I'm doing now is piping the result to sed again to replace the inserted softwareURL text back with &, which seems a little inefficient.
Is there a better way to do this?
Any guidance is most welcome!
Thanks!
The issue with the current result is that in a sed s command, & in the replacement refers to the text matched by the pattern (in this case & == softwareURL).
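A quick demonstration of that & behaviour on a throwaway string:
$ echo 'x' | sed 's|x|[&]|'
[x]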
One idea would be to escape all &'s in the replacement string, eg:
sed "s|softwareURL|$softwareURL|" template_file.sh # old
sed "s|softwareURL|${softwareURL//&/\\&}|" template_file.sh # new
With this new command generating:
$srcRemoveSoftwareURL = "http://mywebsite.com/1-remove-software.sh?se=2021-07-20T18%3A42Z&sp=rl&spr=https&sv=2018-11-09&sr=s&sig=oI/T9oHqzfEtuTjAotLyLN3IXbkiADGTPQllkyJlvEA%3D"
If you find yourself needing to make several replacements:
$ somevariable="abc & % $ 123 & % $ xyz"
$ somevariable="${somevariable//&/\\&}" # escape literal '&'
$ somevariable="${somevariable//%/\\%}" # escape literal '%'
$ somevariable="${somevariable//$/\\$}" # escape literal '$'
$ echo $somevariable
abc \& \% \$ 123 \& \% \$ xyz
NOTE: I'm not saying you need to escape these particular characters for sed ... just pointing out how to go about making multiple replacements in a variable.
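If you'd rather sidestep sed's replacement metacharacters entirely, one alternative (a sketch, assuming perl is available) is to pass the value in through the environment; Perl interpolates $ENV{...} into the replacement, and & has no special meaning there:
softwareURL="$softwareURL" perl -pe 's/softwareURL/$ENV{softwareURL}/' template_file.sh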
I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|frankie1@hotmail.com
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|f**1@hotmail.com
092362936|Joe|Jordan|J*****|joe@joesjoinery.com
928|Bob|Horton|Farmer|b*****n@f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s does not equal the number of characters they replace in the first file. The latter two do, so they are added to the output.
I have tried to reason out ways to do this with awk and join, but I don't know where to even start. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
file1[$0]
next
}
{
# Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
gsub(/[^^*]/,"[&]") # Convert every char X to [X] except ^ and *
gsub(/\^/,"\\^") # Convert every ^ to \^
# Convert every * to .:
gsub(/\*/,".")
# Add line start/end anchors
$0 = "^" $0 "$"
# See if the current file2 line matches any line from file1
# and if so print that line from file1:
for ( line in file1 ) {
if ( line ~ $0 ) {
print line
}
}
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which in a regular expression represents any single character. As a side effect of enabling regular expressions, we need to escape all special characters, particularly the ., in order for them to be taken literally: in a regular expression, \. represents a literal dot (as opposed to any character).
The first step performs these substitutions with sed; the second passes every resulting line as a search pattern to grep and searches file1 for that pattern. The glue that allows us to do this is xargs, where {} is a placeholder representing a single line from the output of the sed command.
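For instance, the first line of File 2 becomes the following grep pattern, with the literal dots escaped and the asterisks turned into regex dots:
$ sed 's/\./\\./g; s/\*/./g' file2 | head -1
1...6|Jane|Johnson|Pharmacist|janejohnson@gmail\.com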
Note:
This is not a general, safe solution you can simply copy and paste: you should watch out for any characters in your file containing the asterisks that are considered special in grep regular expressions.
Update:
jhnc extends the escaping to all of the following characters: .\^$[], thus accounting for almost all sorts of email addresses. They then avoid the use of xargs by employing -f - to pass the results of sed as search expressions to grep:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
This solution is both more general and more efficient.
How can I use grep to find occurrences of words from a dictionary file which contain a given set of letters, with the restriction that each letter occurs once and only once?
E.g. if the letters are abc then the expected output is:
cab
EDIT:
Given a dictionary file (that is, a file containing one word per line, such as /usr/share/dict/words on the Mac OS X operating system) and a set of (unique) characters, I want to print out all of the dictionary file's words that contain each character of the input set once and only once. For example, if the set of characters is {a,b,c}, print out all (3-letter) words that contain each character of the set.
I am looking, preferably, for a solution that uses just grep expressions.
Given a series of letters, for example abc, you can convert each one to a lookahead, each anchored with $ so that the letter occurs exactly once, like this:
^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$)
You will need the -P ("Perl-compatible regex") flag to use lookaheads with grep; extended regexes (-E) do not support them.
To create this regex from a string, you could use sed; a sketch follows.
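A minimal sketch of that generation step (assumes bash and GNU grep with -P; note this asserts exactly one of each letter but does not forbid other letters appearing as well):
letters=abc
regex="^$(printf '%s\n' "$letters" | sed 's/./(?=[^&]*&[^&]*$)/g')"
grep -P "$regex" /usr/share/dict/words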
grep -E '^[abc]{3}.$' <Dictionary file> | grep -v -e 'a.*a' -e 'b.*b' -e 'c.*c'
i.e. find all three-letter strings matching the input and pipe these through an inverse grep to remove strings with double letters.
I'm using the '.' after {3} because my dictionary file is Windows-based, so each line has a trailing carriage return. If yours doesn't, that's probably not necessary.
Below is a Perl solution. Note, you'll need to add more words to the dictionary and read input into the $input variable. An array of valid words will end up in @results.
#!/usr/bin/env perl
use Data::Dumper;
my $input = "abc";
my @dictionary = qw(aaa aac aad aal aam aap aar aas aat aaw aba abc abd abf abg
abh abm abn abo abr abs abv abw aca acc ace aci ack acl acp acs act acv ada adb
adc add adf adh adl adn ado adp adq adr ads adt adw aea aeb aec aed aef aes aev
afb afc afe aff afg afi afk afl afn afp aft afu afv agb agc agl agm agn ago agp
...
PUT A REAL DICTIONARY HERE!
...
zie zif zig zii zij zik zil zim zin zio zip zir zis zit ziu ziv zlm zlo zlx zma
zme zmi zmu zna zoa zob zoe zog zoi zol zom zon zoo zor zos zot zou zov zoy zrn
zsr zub zud zug zui zuk zul zum zun zuo zur zus zut zuz zva zwo zye zzz);
# Generate a lookahead expression for each character in the input word
my $regexp = join("", map { "(?=.*$_)" } split(//, $input));
my @results;
foreach my $word (@dictionary) {
# If the size of the input doesn't match the dictionary word, skip to the
# next word.
if (length($input) != length($word)) {
next;
}
if ($word =~ /$regexp/) {
push(@results, $word);
}
}
print Dumper @results;
The solution I found uses grep first to extract all n-letter words containing only letters from the input set; at this stage some letters might appear more than once and some might not appear at all (again, I am assuming the input letters are unique). Then it does a series of one-letter greps to make sure each letter occurs at least once. Because the words have length n, this ensures each letter occurs once and only once. For example, if the input character set is {a,b,c} the solution would be:
grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c
A simple bash script can be written which creates this grep pipeline and executes it against the word file, using $1 as the input letter set. It might not be the most efficient way of generating the command, but as I am not familiar with sed or awk it does solve my problem. The script I created is:
#!/bin/sh
slen=${#1}
g2="'^[$1]{$slen}\$'"
g3=""
ix1=0
while [ $ix1 -lt $slen ]
do
g3="$g3 | grep ${1:$ix1:1}"
ix1=$((ix1+1))
done
eval grep -E $g2 /usr/share/dict/words $g3
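Called with the letter set as its first argument (the script name here is hypothetical), it should print matches such as:
$ ./findwords.sh abc
cab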