Understanding white spaces and xargs - grep

I've learned that when using find with xargs, it's advisable to use the -print0 and -0 arguments so that file names with spaces are handled correctly.
Now I have a file named patterns with the following content
a a a
b b b
and I have a file named text with the following content
a
b b b
When I run the following command,
cat patterns| xargs -ipattern grep pattern text
I get
b b b
which means that xargs knew to look for a a a and b b b instead of a, a, a, b, b, b.
My question is: why aren't there any problems in the example above? I thought it would look for a, a, a, b, b, b and return both lines in text.
What am I missing?

-i and -I change it from using (possibly quoted) whitespace-separated strings to lines. Quoting man xargs:
-I replace-str
    Replace occurrences of replace-str in the initial-arguments with names read from
    standard input. Also, unquoted blanks do not terminate input items; instead the
    separator is the newline character. Implies -x and -L 1.
--replace[=replace-str]
-i[replace-str]
    This option is a synonym for -Ireplace-str if replace-str is specified, and for
    -I{} otherwise. This option is deprecated; use -I instead.
When you're working with filenames in default (i.e. not -I) mode, you need to protect spaces, tabs, and (not that filenames like this are ever a good idea) newlines; hence -0.
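A quick way to see the difference (my own illustration, not from the original answer): with the default splitting, every whitespace-separated word becomes an item, while -I makes each whole line a single item:
$ printf 'a a a\nb b b\n' | xargs -n 1 echo
a
a
a
b
b
b
$ printf 'a a a\nb b b\n' | xargs -I{} echo {}
a a a
b b b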

Related

select only a word that is part of colon

I have a text file using a markup language (similar to Wikipedia articles)
cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.
I need to select the word "having:" only because that is part of regular text. I tried
grep -v '[*:*]' test.txt
This will correctly avoid the tags, but does not select the expected word.
The square brackets specify a character class, so your regular expression looks for any occurrence of one of the characters * or : (or *, but we said that already, didn't we?). With -v, grep then prints only the lines that contain none of those characters, whole lines rather than individual words.
grep has the option -o to only print the matching text, so something like
grep -ow '[^[:space:]]*:[^[:space:]]*' file.txt
would extract any text with a colon in it, surrounded by zero or more non-whitespace characters on each side. The -w option adds the condition that the match needs to be between word boundaries.
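Applied to the question's test.txt, this prints both the wanted word and the colon-bearing token from inside the brackets, roughly:
$ grep -ow '[^[:space:]]*:[^[:space:]]*' test.txt
having:
double: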
However, if you want to restrict in which context you want to match the text, you will probably need to switch to a more capable tool than plain grep. For example, you could use sed to preprocess each line to remove any bracketed text, and then look for matches in the remaining text.
sed -e 's/\[.*]//g' -e 's/ [^: ]*$/ /' -e 's/[^: ]* //g' -e 's/ /\n/' file.txt
(This assumes that your sed recognizes \n in the replacement string as a literal newline. There are simple workarounds available if it doesn't, but let's not go there if it's not necessary.)
In brief, we first replace any text between square brackets. (This needs to be improved if your input could contain multiple sequences of square brackets on a line with normal text between them. Your example only shows nested square brackets, but my approach is probably too simple for either case.) Then, we remove any words which don't contain a colon, with a special provision for the last word on the line, and some subsequent cleanup. Finally, we replace any remaining spaces with newlines, and (implicitly) print whatever is left. (This still ends up printing one newline too many, but that is easy to fix up later.)
Alternatively, we could use sed to remove any bracketed expressions, then use grep on the remaining tokens.
sed -e :a -e 's/\[[^][]*\]//' -e ta file.txt |
grep -ow '[^[:space:]]*:[^[:space:]]*'
The :a creates a label a and ta says to jump back to that label and try again if the regex matched. This one also demonstrates how to handle nested and repeated brackets. (I suppose it could be refactored into the previous attempt, so we could avoid the pipe to grep. But outlining different solution models is also useful here, I suppose.)
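As a quick illustration of that loop (my own example), repeated passes peel off one innermost bracket pair per iteration until none remain:
$ echo 'keep [[drop: this]] keep' | sed -e :a -e 's/\[[^][]*\]//' -e ta
keep  keep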
If you wanted to ensure that there is at least one non-colon character adjacent to the colon, you could do something like
... file.txt |
grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+'
where the -E option selects a slightly more modern regex dialect which allows us to use | between alternatives and + for one or more repetitions. (Basic grep in 1969 did not have these features at all; much later, the POSIX standard grafted them on with a slightly wacky syntax which requires you to backslash them to remove the literal meaning and select the metacharacter behavior... but let's not go there.)
Notice also how [^:[:space:]] matches a single character which is not a colon or a whitespace character, where [:space:] is the (slightly arcane) special POSIX named character class which matches any whitespace character (regular space, horizontal tab, vertical tab, possibly Unicode whitespace characters, depending on locale).
Awk easily lets you iterate over the tokens on a line. The requirement to ignore matches within square brackets complicates matters somewhat; you could keep a separate variable to keep track of whether you are inside brackets or not.
awk '{ for(i=1; i<=NF; ++i) {
    if($i ~ /\]/) { brackets=0; next }
    if($i ~ /\[/) brackets=1;
    if(brackets) next;
    if($i ~ /:/) print $i } }' file.txt
This again hard-codes some perhaps incorrect assumptions about how the brackets can be placed. It will behave unexpectedly if a single token contains a closing square bracket followed by an opening one, and has an oversimplified treatment of nested brackets (the first closing bracket after a series of opening brackets will effectively assume we are no longer inside brackets).
A combined solution using sed and awk:
sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
sed will change all spaces to newlines
awk (or gawk) will output all lines matching $0~/:$/, as long as i equals zero
The last two rules of the awk program keep a count of the opening and closing brackets.
Another solution using sed and grep:
sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt | grep ':$'
's/\[.*\]+//g' will filter the stuff between brackets
's/ /\n/g' will replace a space with a newline
grep will only find lines ending with :
A third one using only awk:
gawk '{ for (t=1;t<=NF;t++){
if(i==0 && $t~/:$/) print $t;
i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt
gsub returns the number of replacements.
The variable i is used to count the level of brackets. On every [ it is incremented by one, and on every ] it is decremented by one. This works because gsub(/\[/,"",$t) returns the number of replaced characters. With a token like [[][ the count is increased by (3-1=) 2. When a token has brackets AND a colon, my code will fail, because the token is tested (and possibly printed, if it ends with a :) before the bracket count is updated.
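A quick demonstration of the counting (my own example, not part of the original answer), using the [[][ token mentioned above:
$ echo '[[][' | gawk '{ a = gsub(/\[/,"",$1); b = gsub(/\]/,"",$1); print a, b }'
3 1
so i changes by 3-1 = 2 for that token.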

Grep {n} The preceding item is matched exactly n times, is not clear to me in case of "Hair", "Haair" and "Haaair"

Suppose there are three strings "Hair", "Haair" and "Haaair". When I use grep -E '^Ha{1}', it returns all three words; instead I was expecting only "Hair", since I asked it to return lines which start with H followed by the letter 'a' exactly once.
grep does not check that its input matches the given search expression. Grep finds substrings of the input that match the search.
See:
grep test <<< 'This is a test.'
The input does not exactly match test. Only part of the input matches,
This is a test.
but that is enough for grep to output the whole line.
Similarly, when you say
grep -E '^Ha{1}' <<< Haaair
The input does not exactly match the search, but a part of it does,
Haaair
and that is enough. Note that {n,m} syntax is purely a convenience: Ha{1} is exactly equivalent to Ha, Ha{3,} is Haaa+, Ha{2,5} is Haa(a?){3} is Haaa?a?a?, etc. In other words, {1} does not mean "exactly once", it just means "once".
What you want to do is match a Ha that is not followed by another a. You have two options:
If your grep supports PCRE, you can use a negative lookahead:
grep -P '^Ha(?!a)'
(?!a) is a zero-length assertion, like ^. It doesn't match any characters; it simply causes the match to fail if there is an a after the first one.
Or, you can keep it simple and use a negated bracket expression:
grep -E '^Ha([^a]|$)'
Where [^a] matches any single character that is not a, and the alternation with $ handles the case of no character at all.
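For instance, running the negated-class variant against the three sample strings (my own check, assuming GNU grep) selects only the first one:
$ printf 'Hair\nHaair\nHaaair\n' | grep -E '^Ha([^a]|$)'
Hair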

Match Lines From Two Lists With Wildcards In One List

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|frankie1#hotmail.com
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson#gmail.com
09876579|Frank|Roberts|Butcher|f**1#hotmail.com
092362936|Joe|Jordan|J*****|joe#joesjoinery.com
928|Bob|Horton|Farmer|b*****n#f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s is not equal to the number of characters shown in the first file. The latter two are, so they are added to output.
I have tried to reason out ways to do this in AWK and using Join, but I don't know any way to even start trying to achieve this. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]")   # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^")      # Convert every ^ to \^
    # Convert every * to .:
    gsub(/\*/,".")
    # Add line start/end anchors
    $0 = "^" $0 "$"
    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe#joesjoinery.com
928|Bob|Horton|Farmer|bhorton#farmernews.co.uk
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which represents any character in regular expressions. As a side effect of enabling regular expressions, we need to escape all special characters, particularly the ., in order for them to be taken literally. In a regular expression, we need to use \. to represent a dot (as opposed to any character).
The first step is to perform these substitutions with sed; the second is to pass every resulting line as a search pattern to grep and search file1 for that pattern. The glue that allows us to do this is xargs, where {} is a placeholder representing a single line from the results of the sed command.
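To make the transformation concrete (my own trace, not part of the original answer), the sed step turns the file2 lines above into patterns like these:
$ sed 's/\./\\./g; s/\*/./g' file2
1...6|Jane|Johnson|Pharmacist|janejohnson#gmail\.com
09876579|Frank|Roberts|Butcher|f..1#hotmail\.com
092362936|Joe|Jordan|J.....|joe#joesjoinery\.com
928|Bob|Horton|Farmer|b.....n#f.........\.co\.uk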
Note:
This is not a general, safe solution you can simply copy and paste: you should watch out for any characters in your asterisk-containing file that are considered special in grep regular expressions.
Update:
jhnc extends the escaping to all of the following characters: .\^$[], thus accounting for almost all sorts of email addresses, and avoids the use of xargs by employing -f - to pass the results of sed to grep as search expressions:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
This solution is both more general and more efficient.

Searching tabs with grep

I have a file that might contain a line like this.
A B //Separated by a tab
I want to return true to the terminal if the line is found, false if it isn't found.
When I do
grep 'A' 'file.tsv', it returns the row (not true/false)
but
grep 'A \t B' "File.tsv"
or
grep 'A \\t B' "File.tsv"
or
grep 'A\tB'
or
grep 'A<TAB>B' //pressing tab button
doesn't return anything.
How do I search tab-separated values with grep?
How do I return a boolean value with grep?
Use a literal Tab character, not the \t escape. (You may need to press Ctrl+V first.) Also, grep is not Perl 6 (or Perl 5 with the /x modifier); spaces are significant and will be matched literally, so even if \t worked A \t B with the extra spaces around the \t would not unless the spaces were actually there in the original.
As for the return value, know that you get three different kinds of responses from a program: standard output, standard error, and exit code. The latter is 0 for success and non-0 for some error (for most programs that do matching, 1 means not found and 2 and up mean some kind of usage error). In traditional Unix you redirect the output from grep if you only want the exit code; with GNU grep you could use the -q option instead, but be aware that that is not portable. Both traditional and GNU grep allow -s to suppress standard error, but there are some differences in how the two handle it; most portable is grep PATTERN FILE >/dev/null 2>&1.
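A minimal sketch of that idea (my own illustration; it builds the tab with printf so nothing unusual has to be typed, and turns the exit status into a printed boolean):
pattern="$(printf 'A\tB')"
if grep "$pattern" File.tsv >/dev/null 2>&1; then
    echo true
else
    echo false
fi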
Two methods:
use the -P option:
grep -P 'A\tB' "File.tsv"
press Ctrl+V first and then press the Tab key
grep 'A B' "File.tsv"
Here's a handy way to create a variable with a literal tab as its value:
TAB=`echo -e "\t"`
Then, you can use it as follows:
grep "A${TAB}B" File.tsv
This way, there's no literal tab required. Note that with this approach, you'll need to use double quotes (not single quotes) around the pattern string, otherwise the variable reference won't be expanded.
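If your echo doesn't support -e (it isn't required to), a printf-based variant of the same idea (my own substitution, not part of the original answer) works portably:
TAB="$(printf '\t')"
grep "A${TAB}B" File.tsv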

Do not merge the context of contiguous matches with grep

If I run grep -C 1 match over the following file:
a
b
match1
c
d
e
match2
f
match3
g
I get the following output:
b
match1
c
--
e
match2
f
match3
g
As you can see, since the context around the contiguous matches "match2" and "match3" overlap, they are merged. However, I would prefer to get one context description for each match, possibly duplicating lines from the input in the context reporting. In this case, what I would like is:
b
match1
c
--
e
match2
f
--
f
match3
g
What would be the best way to achieve this? I would prefer solutions which are general enough to be trivially adaptable to other grep options (different values for -A, -B, -C, or entirely different flags). Ideally, I was hoping that there was a clever way to do that just with grep....
I don't think it is possible to do that using plain grep.
The sed construct below works to some extent; now I only need to figure out how to add the "--" separator:
$ sed -n -e '/match/{x;1!p;g;$!N;p;D;}' -e h log
b
match1
c
e
match2
f
f
match3
g
I don't think this is possible using plain grep.
Have you ever used Python? In my opinion it's a perfect language for such tasks (this code snippet will work for both Python 2.7 and 3.x):
with open("your_file_name") as f:
lines = [line.rstrip() for line in f.readlines()]
for num, line in enumerate(lines):
if "match" in line:
if num > 0:
print(lines[num - 1])
print(line)
if num < len(lines) - 1:
print(lines[num + 1])
if num < len(lines) - 2:
print("--")
This gives me:
b
match1
c
--
e
match2
f
--
f
match3
g
I'd suggest patching grep instead of working around it. In GNU grep 2.9, in src/main.c:
933 /* We print the SEP_STR_GROUP separator only if our output is
934 discontiguous from the last output in the file. */
935 if ((out_before || out_after) && used && p != lastout && group_separator)
936 {
937 PR_SGR_START_IF(sep_color);
938 fputs (group_separator, stdout);
939 PR_SGR_END_IF(sep_color);
940 fputc('\n', stdout);
941 }
942
A simple additional flag would suffice here.
Edit: Well, d'oh, it is of course not THAT simple since grep would not reproduce the context, just add a few more separators. Due to the linearity of grep, the whole patch is probably not that easy. Nevertheless, if you have a good case for the patch, it could be worth it.
This does not appear possible with grep or GNU grep. However it is possible with standard POSIX tools and a good shell like bash as leverage to obtain the desired output.
Note: neither python nor perl should be necessary for the solution. Worst case, use awk or sed.
One solution I rapidly prototyped is something like this. It does involve the overhead of re-reading the file, so it depends on whether that overhead is acceptable; the give-away is the original question's use of -C 1 as a fixed amount of context, which allows simple use of head & tail:
$ OIFS="$IFS"; lines=`grep -n match greptext.txt | /bin/cut -f1 -d:`;
for l in $lines;
do IFS=""; match=`/bin/tail -n +$(($l-1)) greptext.txt | /bin/head -3`;
echo $match; echo "---";
done; IFS="$OIFS"
This might have some corner case associated with it, and this resets IFS when perhaps not necessary, though it is a hint for trying to use the power of POSIX shell & tools rather than a high level interpreter to get the desired output.
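For completeness, here is a sketch of the awk route mentioned above (my own illustration, assuming the fixed one-line context of -C 1 from the question); it reads the whole file into memory and prints one block per match, duplicating shared context lines:
awk '
    { lines[NR] = $0 }
    END {
        first = 1
        for (i = 1; i <= NR; i++) {
            if (lines[i] ~ /match/) {
                if (!first) print "--"
                first = 0
                if (i > 1)  print lines[i-1]
                print lines[i]
                if (i < NR) print lines[i+1]
            }
        }
    }
' greptext.txt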
Opinion: All good operating systems have: grep, awk, sed, tr, cut, head, tail, more, less, vi as built-ins. On the best operating systems, these are in /bin.
