How to keep the header of a file when using grep?

I have two files. One is a very large file with a header and several million rows (chrall.txt.gz).
The other (extract0.3.txt) contains a single column of values to cross-reference against the larger file. Every line of the large file that matches one of these values (they all should match) is written to a new file. I am using the grep command below:
gunzip -c chrall.txt.gz | grep -Fwf extract0.3.txt > output
However, this does not print my header line. How can I retain the header line of chrall.txt.gz?
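One possible approach, sketched under the assumption that the header is exactly the first line of the decompressed stream: read that line with the shell's read builtin, print it unchanged, and let grep filter whatever remains on the same pipe. The function name grep_keep_header is made up for illustration.

```shell
# Hypothetical helper: print the first line of stdin unchanged,
# then pass the remaining lines through grep with the given options.
grep_keep_header() {
    IFS= read -r header || return 1   # consume only the header line
    printf '%s\n' "$header"
    grep "$@"                         # grep sees everything after the header
}

# Usage with the files from the question:
#   gunzip -c chrall.txt.gz | grep_keep_header -Fwf extract0.3.txt > output
```

Because read consumes only the first line, the header is printed even if it matches none of the values in extract0.3.txt.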

Related

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it with a fairly long list of words (around 180.108 items) stored in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file File3 containing only those lines in file1 whose first word matches with the word-list of file2, and disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand /^ and \b might play a part here, but I don't know how to fit them in the syntax. I've looked around extensively but no solution seems to fit.
My problem is that here grep reads the entire file1's line, and it can happen that the matching word lies in the webpage address, which I'm not interested in finding out.
sed 's/^/^/' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
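Note that join needs both inputs sorted on the join field, so its output comes out in sorted order. If sorting a 10 GB file is undesirable, an awk lookup on the first field is another option. The following is a sketch only, assuming the fields in file1.sm are separated by blanks and that the word list fits in memory; the function name filter_first_word is made up for illustration.

```shell
# Hypothetical helper: print the lines of the data file ($2) whose first
# whitespace-separated field appears in the word list ($1).
filter_first_word() {
    awk 'FNR == NR { words[$1]; next }   # first file: remember each word
         $1 in words                     # second file: print if field 1 is a known word
        ' "$1" "$2"
}

# Usage with the files from the question:
#   filter_first_word file2.txt file1.sm > file3.sm
```

Because the comparison is an exact match on the first field, a word that occurs only inside the web address, or only as a prefix of the first word, does not select the line. The original line order of file1.sm is preserved.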

Grep Entire File For Strings, Not Line by Line

I want to search for files that contain 'even:suspendcount>0' AND 'even:holdcount>0'. Both strings must be somewhere in the file, not necessarily on the same line. The problem is that my search does not return files that contain one string on, say, line 5 and the other on line 10; it only returns files where both strings are on the same line. How can I search for files that contain multiple strings anywhere in the file, not necessarily on the same line?
Using grep
To use grep to get files that have both strings in either order:
grep -lZ 'even:suspendcount>0' * | xargs --null grep -l 'even:holdcount>0'
How it works:
grep -lZ 'even:suspendcount>0' *
This returns a nul-separated list of the names of files which contain the string even:suspendcount>0.
xargs --null grep -l 'even:holdcount>0'
Of the files selected by the first step, this returns the names of files which also contain even:holdcount>0
Because we are using nul-separation when passing the file names from one process to the next, this approach is safe even for difficult file names.
Using awk
This prints the file name of any file that contains both strings:
awk 'BEGINFILE{f=0;g=0} /even:suspendcount>0/{f=1} /even:holdcount>0/{g=1} f && g{print FILENAME; nextfile}' *
How it works:
BEGINFILE{f=0;g=0}
As we start reading a new file, variables f and g are set to zero (false).
/even:suspendcount>0/{f=1}
If we encounter a line containing even:suspendcount>0, then set variable f to 1.
/even:holdcount>0/{g=1}
Similarly, if we encounter a line containing even:holdcount>0, then set variable g to 1.
f && g{print FILENAME; nextfile}
If both f and g are true (nonzero), then print the filename and skip to the next file.
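BEGINFILE is specific to GNU awk, and nextfile is missing from some older awk implementations. A more portable variant can be sketched as follows, assuming it is acceptable to read each file to the end. The function name files_with_both and the variable reported are made up for illustration, and index() is used so that the strings are matched literally rather than as regular expressions.

```shell
# Hypothetical helper: list files that contain both fixed strings,
# possibly on different lines.
# Usage: files_with_both STRING1 STRING2 FILE...
files_with_both() {
    a=$1 b=$2
    shift 2
    # Note: awk -v interprets backslash escapes in the assigned values.
    awk -v a="$a" -v b="$b" '
        FNR == 1     { f = 0; g = 0; reported = 0 }   # first line of a file: reset flags
        reported     { next }                         # file already listed: skip its lines
        index($0, a) { f = 1 }                        # literal match for the first string
        index($0, b) { g = 1 }                        # literal match for the second string
        f && g       { print FILENAME; reported = 1 }
    ' "$@"
}

# Usage with the strings from the question:
#   files_with_both 'even:suspendcount>0' 'even:holdcount>0' *
```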
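A grep pattern is line-oriented, so a single pattern cannot require two strings that sit on different lines. In your case a single pattern can only express 'even:suspendcount>0' OR 'even:holdcount>0' (namely grep -E 'even:(suspend|hold)count>0'), which matches lines containing either string.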

What is the best way to use tr and grep on a folder?

I'm trying to search through all files in a folder for the following string
<cert>
</cert>
However, I have to remove line returns.
The following code works on one file but how can I pipe an entire folder through the tr and grep? The -l option is to only print the filename and not the whole file.
tr -d '\n' < test | grep -l '<cert></cert>'
The tr/grep approach requires grep to process the whole file as one line. While GNU grep can handle long lines, many others cannot. Also, if the file is large, memory may be taxed.
The following avoids those issues. It searches through all files in the current directory and reports the names of any that contain <cert> on one line and </cert> on the next:
awk 'last ~ "<cert>" && $0 ~ "</cert>" {print FILENAME; nextfile} {last=$0}' *
How it works
awk implicitly loops over all lines in a file.
This script uses one variable, last, which contains the text of the previous line.
last ~ "<cert>" && $0 ~ "</cert>"
This tests if (a) the last line contains the characters <cert> and (b) the current line contains the characters </cert>.
If you actually wanted lines that contain <cert> and no other characters, then replace ~ with ==.
{print FILENAME; nextfile}
If the preceding condition returns true, then this prints the file's name and starts on the next file.
(nextfile was a common awk extension that was accepted into POSIX in 2012.)
{last=$0}
This updates the variable last to have the current line.
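Since last is not cleared between files, a file that ends with <cert> followed by a file that starts with </cert> would be reported as a match. A variant that resets the state at the start of each file can be sketched as follows. It also avoids nextfile, on the assumption that reading each file to the end is acceptable; the function name cert_pair_files and the variable found are made up for illustration.

```shell
# Hypothetical helper: list files in which a line containing <cert>
# is immediately followed by a line containing </cert>.
cert_pair_files() {
    awk '
        FNR == 1 { last = ""; found = 0 }     # new file: forget the previous file
        found    { next }                     # file already listed: skip its lines
        last ~ "<cert>" && $0 ~ "</cert>" { print FILENAME; found = 1 }
        { last = $0 }                         # remember this line for the next one
    ' "$@"
}

# Usage for every file in the current directory:
#   cert_pair_files *
```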

How to clean a CSV file using the 'grep' command

Assuming that we have the following record:
{(XXX1),(XXX2)},whatever
I want to extract information according to the following rule, preferably with 'grep': if {} contains at most two UNIQUE elements (the ones inside the ()), keep them; otherwise delete the whole row. As a further step, I want to extract the values within the () and write the remaining lines in the following form:
XXX1,XXX2,whatever
UPDATE:
For the following input:
{(XXX1),(XXX2)},whatever,unique=2
{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
{(XXX1)},whatever,unique=1
{},whatever,unique=0
{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
I should get the following output:
XXX1,XXX2,whatever,unique=2
XXX1,whatever,unique=1
awk could do it, check this one-liner:
awk -F'[}{]' '{split($2,a,",");delete(b);for(x in a)b[a[x]]}length(b)<=2' file
let's do a small test:
kent$ cat file
ok,{(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1)},whatever,unique=1
ok,{},whatever,unique=0
nok,{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
kent$ awk -F'[}{]' '{split($2,a,",");delete(b);for(x in a)b[a[x]]}length(b)<=2' file
ok,{(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1)},whatever,unique=1
ok,{},whatever,unique=0
As you can see, the nok line was removed.
EDIT
awk -F'[}{]' '{gsub(/[()]/,"");split($2,a,",");delete(b);for(x in a)b[a[x]];l=length(b)}l<=2&&l>0{s="";for(x in b)s=s""x",";sub(/,$/,"",s);y[s]=s $3}END{for(x in y)print y[x]}' file
test
kent$ cat file
{(XXX1),(XXX2)},whatever,unique=2
{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
{(XXX1)},whatever,unique=1
{},whatever,unique=0
{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
kent$ awk -F'[}{]' '{gsub(/[()]/,"");split($2,a,",");delete(b);for(x in a)b[a[x]];l=length(b)}l<=2&&l>0{s="";for(x in b)s=s""x",";sub(/,$/,"",s);y[s]=s $3}END{for(x in y)print y[x]}' file
XXX1,XXX2,whatever,unique=2
XXX1,whatever,unique=1
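The same idea can be written out in a longer form that is easier to follow. This is a sketch rather than a drop-in replacement: it assumes that identical output lines should be printed only once and that the elements should keep the order in which they first appear in the row. The function name clean_rows is made up for illustration.

```shell
# Hypothetical helper: keep rows whose {...} part holds one or two unique
# elements and print them as "elem1,elem2,rest" without braces or parentheses.
clean_rows() {
    awk -F'[}{]' '
    {
        n = split($2, items, ",")
        split("", seen)                   # portable way to empty an array
        list = ""; count = 0
        for (i = 1; i <= n; i++) {
            item = items[i]
            gsub(/[()]/, "", item)        # strip the parentheses
            if (item != "" && !(item in seen)) {
                seen[item]
                if (count++) list = list ","
                list = list item
            }
        }
        line = list $3                    # $3 is the text after the closing brace
        if (count >= 1 && count <= 2 && !(line in printed)) {
            printed[line]
            print line
        }
    }' "$@"
}

# Usage:
#   clean_rows file
```

Rows are printed in input order, whereas the order produced by a for (x in y) loop in the END block is unspecified.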

extract a line from a file using csh

I am writing a csh script that will extract a line from a file xyz.
The xyz file contains a number of lines of code, and the line I am interested in appears after 2-3 lines of the file.
I tried the following code
set product1 = `grep -e '<product_version_info.*/>' xyz`
As soon as the script finds that line, it should save the line in a variable as a string and stop reading the file immediately, i.e. it should not read any further after extracting the line.
Please help !!
grep has an -m or --max-count flag that tells it to stop after a specified number of matches. Hopefully your version of grep supports it.
set product1 = `grep -m 1 -e '<product_version_info.*/>' xyz`
From the grep man page:
-m NUM, --max-count=NUM
Stop reading a file after NUM matching lines. If the input is
standard input from a regular file, and NUM matching lines are
output, grep ensures that the standard input is positioned to
just after the last matching line before exiting, regardless of
the presence of trailing context lines. This enables a calling
process to resume a search. When grep stops after NUM matching
lines, it outputs any trailing context lines. When the -c or
--count option is also used, grep does not output a count
greater than NUM. When the -v or --invert-match option is also
used, grep stops after outputting NUM non-matching lines.
As an alternative, you can always use the command below to check just the first few lines (since the line always occurs in the first 2-3 lines):
set product1 = `head -3 xyz | grep -e '<product_version_info.*/>'`
I think you're asking to return the first matching line in the file. If so, one solution is to pipe the grep result to head:
set product1 = `grep -e '<product_version_info.*/>' xyz | head -1`
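If the installed grep lacks -m, awk can stop at the first match on its own, because exit ends processing as soon as the line has been printed. The sketch below is written as a POSIX sh function for illustration, not as csh; the function name first_match is made up.

```shell
# Hypothetical helper: print the first line of FILE that matches the
# extended regular expression REGEX, then stop reading.
# Usage: first_match REGEX FILE
first_match() {
    # Note: awk -v interprets backslash escapes in the assigned value.
    awk -v re="$1" '$0 ~ re { print; exit }' "$2"
}

# Usage with the pattern from the question:
#   product1=$(first_match '<product_version_info.*/>' xyz)
```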
