I am attempting to find difference between two variables:
left='f012 f013' and right='f012 f013 f014'.
I need to find all f* that are absent from the left side. I have tried the following, which also doesn't work:
echo 'f012 f013' | grep -o -v 'f012 f013 f014'
Can anyone tell me what I am doing wrong?
Most *nix tools require you to have one item per line, and comm might be more appropriate:
$ cat haystack
f012
f013
$ cat needles
f012
f013
f014
$ comm -13 haystack needles
f014
From man comm, -13 suppresses column 1 (lines unique to FILE1, i.e. haystack, your command's output) and column 3 (lines that appear in both files), leaving only the lines unique to FILE2, i.e. needles, the f* you are looking for. Note that comm requires both inputs to be sorted, as they are here.
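If your data lives in variables rather than files, as in the question, a minimal sketch in bash (variable names taken from the question) would be:
left='f012 f013'
right='f012 f013 f014'
comm -13 <(tr ' ' '\n' <<<"$left" | sort) <(tr ' ' '\n' <<<"$right" | sort)
# prints: f014
Here tr splits each list into one item per line, and process substitution feeds the sorted results to comm.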
You can use diff to compare the variables, but you need to get the items onto separate lines; otherwise diff compares the whole line rather than the individual fields:
diff <(awk -v RS=" |\n" '$1=$1' <<<"$left") <(awk -v RS=" |\n" '$1=$1' <<<"$right")
2a3
> f014
I'm searching for a fast method to find all files in a folder which contain 2 or more patterns
grep -l -e foo -e bar ./*
or
rg -l -e foo -e bar
show all files containing 'foo' AND 'bar' on the same line, or 'foo' OR 'bar' on different lines, but I want only files that have at least one 'foo' match AND one 'bar' match, on different lines. Files which only have 'foo' matches or only 'bar' matches shall be filtered out.
I know I could chain the grep calls but this will be too slow.
rg with multiline does work; however, it prints everything between the matched criteria as the result, and sometimes that's not useful.
For the use case of chaining searches (in e.g. HTML, JSON, etc.), where the first criterion just narrows down the files and the second criterion is what I am actually looking for, this is a possible solution:
rg -0 -l crit1 | xargs -0 -I % rg -H crit2 %
Alternatively, I have just discovered ugrep, which supports combining multiple criteria with boolean operators at both the line and the file level. This is quite something. It's a bit slower than rg + xargs; however, it nicely prints all lines matching all criteria from the files (instead of just the matches for the last criterion, as above):
ugrep --files -e crit1 --and -e crit2
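For clarity, my understanding of the two modes (crit1/crit2 are placeholders):
ugrep -e crit1 --and -e crit2           # both patterns must match on the same line
ugrep --files -e crit1 --and -e crit2   # both must match somewhere in the same file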
If you want to search for two or more words that occur on multiple lines, you can use ripgrep's --multiline-dotall option in addition to -U/--multiline. You also need to search for foo before bar and for bar before foo, joined with the | operator:
rg -lU --multiline-dotall 'foo.*bar|bar.*foo' .
For any number of words you'll need to |-join all permutations of those words. For that I use a small Python script (which I called rga) that searches the current directory (and downwards) for files that contain all arguments given on the command line:
#! /opt/util/py310/bin/python
import sys
import subprocess
from itertools import permutations

# Build one alternation covering every ordering of the words,
# e.g. for "foo bar": foo.*bar|bar.*foo
rgarg = '|'.join('.*'.join(x) for x in permutations(sys.argv[1:]))
cmd = ['rg', '-lU', '--multiline-dotall', rgarg, '.']
# print(' '.join(cmd))
proc = subprocess.run(cmd, capture_output=True)
sys.stdout.write(proc.stdout.decode('utf-8'))
I have searched successfully with six arguments; above that the command line becomes too long. There are probably ways around that by saving the pattern to a file and adding -f file_name, but I never needed/investigated that.
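For example, assuming the script is saved as rga somewhere on your PATH, a call like
rga foo bar
runs rg -lU --multiline-dotall 'foo.*bar|bar.*foo' . and lists the matching files.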
$ cat f1
afoot
2bar
$ cat f2
foo bar
$ cat f3
foot
$ cat f4
bar
$ cat f5
barred
123
foo3
$ rg -Ul '(?s)foo.*?\n.*?bar|bar.*?\n.*?foo'
f5
f1
You can use the -U option to match across lines. The (?s) flag enables . to match newlines as well. Since you want the matches to be on different lines, you need to match a newline character between the search terms as well.
So this doesn't perfectly answer the question, but this is the StackOverflow question that pops up every time I google "ripgrep multiple patterns". So I'm leaving my answer here for the future googler (including myself)...
I primarily work in PowerShell, so this is how I perform an and search in ripgrep in PowerShell. This will match same line matches, which is why it's not a perfect answer, but it will identify files that match both patterns, and runs relatively quickly:
rg -l 'SecondSearchPattern' (rg -l 'FirstSearchPattern')
Explanation:
First the parens run: rg -l 'FirstSearchPattern', which searches all files for the pattern FirstSearchPattern. By using -l it returns a list of file paths only.
By placing it in (parentheses), it runs the whole command first, then "splats" the results of the command into the external rg command.
The external rg command is now run like this:
rg -l 'SecondSearchPattern' "file.txt" "directory\file.txt"
And yes, it does put them into quotes, so it handles paths with spaces. This searches all provided files that match the pattern SecondSearchPattern. Thus returning only files that match both patterns.
You can go one step further and add on | Get-Item (| gi) to return filesystem objects, and | % FullName to get the full path.
rg -l 'SecondSearchPattern' (rg -l 'FirstSearchPattern') | gi | % FullName
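For non-PowerShell readers: the same nesting works in POSIX shells with command substitution, with the usual caveat that word splitting breaks on paths containing spaces (the NUL-separated rg | xargs pipeline shown earlier avoids that):
rg -l 'SecondSearchPattern' $(rg -l 'FirstSearchPattern')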
I have a bunch of files: some contain the word star, some contain the word start, some contain both.
I'd like to grep for files that contain the word star, but not the word start.
How can this be accomplished using only grep?
grep has some options for inverting the matches at the line or file level. You want the latter option, with the -L switch. The following will print the names of all the files in a folder that don't contain the text start:
grep -LF start *
-F tells grep that start is a literal string and not a regex. It's optional here, but might speed things up a tiny bit.
You can use the resulting list to search for files that contain star:
grep -lF star $(grep -LF start *)
-l prints only the names of files containing a match, not any line-by-line or match-by-match details. If this is not exactly what you want, man grep is your friend.
This uses an additional shell construct to run the inverted match, but it technically doesn't call any additional programs that aren't grep.
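If your file names may contain spaces, the $(...) substitution above will split them apart. With GNU grep you can make the hand-off NUL-separated instead; a sketch:
grep -LZF start * | xargs -0r grep -lF star
-Z terminates each output file name with a NUL byte, and xargs -0 reads them back safely (-r skips running grep if nothing matched).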
Update
Since you mention wanting to look through all the files starting with a given root folder, change -LF to -LFr. Replace * with your root folder if you don't want to change working directories.
-r tells grep to recurse into directories, and search every file it finds along the way.
With GNU grep for -w:
$ cat file
foo star bar
oof start rab
$ grep -w star *
foo star bar
or if you just want the names of the files containing star:
$ grep -lw star *
file
and to just find files to look in:
$ find . -maxdepth 1 -type f -exec grep -w 'star' {} \;
foo star bar
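Note that grep only prefixes file names when given more than one file to search, so the find variant above prints bare matching lines. With GNU grep, -H forces the file name prefix (shown here with the more efficient {} + batching):
find . -maxdepth 1 -type f -exec grep -wH 'star' {} +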
I have a file, for example, "queries.txt" that has hard return separated strings. I want to use this list to find matches in a second file, "biglist.txt".
"biglist.txt" may have multiple matches for each string in "queries.txt". I want to return only the first hit for each query and write this to another file.
grep -m 1 -wf queries.txt biglist.txt > output
only gives me one line of output. The output should have the same number of lines as queries.txt.
Any suggestions for this? Many thanks! I searched for past questions but did not find one that was exactly the same sort of case after a few minutes of reading.
If you want to "reset the counter" after each file, you could do
cat queries.txt | xargs -I{} grep -m 1 -w {} biglist.txt > output
This uses xargs to call grep once for each line in the input… should do the trick for you.
Explanation:
cat queries.txt - produce one "search word" per line
xargs -I{} - take the input one line at a time, and insert it at {}
grep -m 1 -w - find only one match of a whole word
{} - this is where xargs inserts the search term (once per call)
biglist.txt - the file to be searched
> output - the file where the result is to be written
An alternate method without xargs (which one should indeed learn):
(this method assumes there are no spaces in the lines in queries.txt)
cat queries.txt | while read target; do grep -m 1 $target biglist.txt; done > output
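A variant of the same loop that tolerates leading/trailing blanks and patterns starting with a dash (assuming bash):
while IFS= read -r target; do
    grep -m 1 -w -e "$target" biglist.txt
done < queries.txt > output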
I might not fully understand your question, but it sounds like something like this might work.
cat queries.txt | while read word; do grep "$word" biglist.txt | tee -a output.txt; done
I have a CSV file (foo.csv) with 200,000 rows. I need to break it into four files (foo1.csv, foo2.csv... etc.) with 50,000 rows each.
I already tried simple ctrl-v/-c using gui text editors, but the my computer slows to a halt.
What unix command(s) could I use to accomplish this task?
I don't have a terminal handy to try it out, but it should be just split -d -l 50000 foo.csv foo.csv.
Hopefully the naming isn't terribly important because with the -d option, the output files will be named foo.csv00 .. foo.csv03. You can add the -a 1 option so that the suffixes are 0-3, but there's no simple way to get the suffix to be injected into the middle of the filename.
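If the names do matter, one workaround is to split with a short prefix and rename afterwards; a sketch (note the suffixes run 0-3, not 1-4):
split -d -a 1 -l 50000 foo.csv foo
for f in foo?; do mv "$f" "$f.csv"; done    # foo0 -> foo0.csv, etc.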
You should use head and tail:
head -n 50000 myfile > part1.csv
head -n 100000 myfile | tail -n 50000 > part2.csv
head -n 150000 myfile | tail -n 50000 > part3.csv
etc ...
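The repetition is easy to script if you prefer; a sketch assuming bash:
for i in 1 2 3 4; do
    head -n $((i * 50000)) myfile | tail -n 50000 > "part$i.csv"
done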
Otherwise, with no control over the file names, you can use the Unix command split.
You can use sed:
sed -n '2000,4000p' somefile.txt
will print lines 2000 through 4000 to stdout.
split -l50000 foo.csv
I wrote this little shell script for a topic very similar to yours.
This shell script + awk works fine for me:
#!/bin/bash
# Print lines $1 through $2 of file $3.
awk -v initial_line="$1" -v end_line="$2" '{
    if (NR >= initial_line && NR <= end_line)
        print $0
}' "$3"
Used with this sample file (file.txt):
one
two
three
four
five
six
The command (it will extract from second to fourth line in the file):
edu@debian5:~$ ./script.sh 2 4 file.txt
Output of this command:
two
three
four
Of course, you can improve it, for example by testing that all argument values are as expected :-)
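One concrete improvement for large files: exit as soon as the range has been printed instead of reading to end-of-file. A variant of the same awk idea:
awk -v initial_line="$1" -v end_line="$2" '
    NR > end_line { exit }     # stop reading once past the range
    NR >= initial_line         # bare condition: print the current line
' "$3"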
I want to run ack or grep on HTML files that often have very long lines. I don't want to see very long lines that wrap repeatedly. But I do want to see just that portion of a long line that surrounds a string that matches the regular expression. How can I get this using any combination of Unix tools?
You could use the grep options -oE, possibly in combination with changing your pattern to ".{0,10}<original pattern>.{0,10}" in order to see some context around it:
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
-E, --extended-regexp
Interpret pattern as an extended regular expression (i.e., force grep to behave as egrep).
For example (from #Renaud's comment):
grep -oE ".{0,10}mysearchstring.{0,10}" myfile.txt
Alternatively, you could try -c:
-c, --count
Suppress normal output; instead print a count of matching lines
for each input file. With the -v, --invert-match option (see
below), count non-matching lines.
Pipe your results through cut. I'm also considering adding a --cut switch so you could say --cut=80 and only get 80 columns.
You could use less as a pager for ack and chop long lines: ack --pager="less -S". This retains the long line but leaves it on one line instead of wrapping. To see more of the line, scroll left/right in less with the arrow keys.
I have the following alias setup for ack to do this:
alias ick='ack -i --pager="less -R -S"'
grep -oE ".\{0,10\}error.\{0,10\}" mylogfile.txt
In the unusual situation where you cannot use -E, use lowercase -e instead.
Explanation:
cut -c 1-100
keeps only characters 1 through 100 of each line.
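Putting it together (the file name is just a placeholder):
grep -n 'mysearchstring' page.html | cut -c 1-100
This prints the first 100 columns of each output line, with -n prepending the line number.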
The Silver Searcher (ag) supports this natively via the --width NUM option. It replaces the rest of longer lines with [...].
Example (truncate after 120 characters):
$ ag --width 120 '#patternfly'
...
1:{"version":3,"file":"react-icons.js","sources":["../../node_modules/#patternfly/ [...]
In ack3, a similar feature is planned but currently not implemented.
Taken from: http://www.topbug.net/blog/2016/08/18/truncate-long-matching-lines-of-grep-a-solution-that-preserves-color/
The suggested approach ".{0,10}<original pattern>.{0,10}" is perfectly good, except that the highlighting color is often messed up. I've created a script with similar output, but with the color preserved:
#!/bin/bash
# Usage:
# grepl PATTERN [FILE]
# How many characters around the search keyword should be shown?
context_length=10
# What is the length of the color control sequences before and after the
# matching string?
# This is mostly determined by the environment variable GREP_COLORS.
control_length_before=$(($(echo a | grep --color=always a | cut -d a -f '1' | wc -c)-1))
control_length_after=$(($(echo a | grep --color=always a | cut -d a -f '2' | wc -c)-1))
grep -E --color=always "$1" $2 |
grep --color=none -oE \
".{0,$(($control_length_before + $context_length))}$1.{0,$(($control_length_after + $context_length))}"
Assuming the script is saved as grepl, then grepl pattern file_with_long_lines should display the matching lines but with only 10 characters around the matching string.
I put the following into my .bashrc:
grepl() {
    $(which grep) --color=always "$@" | less -RS
}
You can then use grepl on the command line with any arguments that are available for grep. Use the arrow keys to see the tail of longer lines. Use q to quit.
Explanation:
grepl() {: Define a new function that will be available in every (new) bash console.
$(which grep): Get the full path of grep. (Ubuntu defines an alias for grep that is equivalent to grep --color=auto. We don't want that alias but the original grep.)
--color=always: Colorize the output. (--color=auto from the alias won't work since grep detects that the output is put into a pipe and won't color it then.)
"$@": Put all arguments given to the grepl function here.
less: Display the lines using less
-R: Show colors
-S: Don't wrap long lines; chop them instead
Here's what I do:
function grep () {
tput rmam;
command grep "$#";
tput smam;
}
In my .bash_profile, I override grep so that it automatically runs tput rmam before and tput smam after, which disables line wrapping and then re-enables it.
ag can also use the regex trick, if you prefer it:
ag --column -o ".{0,20}error.{0,20}"