How to clean a CSV file using the 'grep' command

Assume we have the following record: {(XXX1),(XXX2)},whatever. What I want is to extract the information based on the following rule, preferably with 'grep': if the {} contains two or fewer UNIQUE elements (the ones inside the ()), keep all of them; otherwise delete the whole row. As a further step, I want to extract the values within the (), and finally write the remaining lines in the following form: XXX1,XXX2,whatever
UPDATE:
For the following input:
{(XXX1),(XXX2)},whatever,unique=2
{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
{(XXX1)},whatever,unique=1
{},whatever,unique=0
{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
I should get the following output:
XXX1,XXX2,whatever,unique=2
XXX1,whatever,unique=1

awk could do it, check this one-liner:
awk -F'[}{]' '{split($2,a,",");delete(b);for(x in a)b[a[x]]}length(b)<=2' file
let's do a small test:
kent$ cat file
ok,{(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1)},whatever,unique=1
ok,{},whatever,unique=0
nok,{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
kent$ awk -F'[}{]' '{split($2,a,",");delete(b);for(x in a)b[a[x]]}length(b)<=2' file
ok,{(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1)},whatever,unique=1
ok,{},whatever,unique=0
As you can see, the nok line was removed.
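For readability, here is the same one-liner expanded and commented (written with split("", b) to empty the array, since delete b on a whole array is a widespread but non-POSIX extension):

awk -F'[}{]' '
{
    split($2, a, ",")      # $2 is the text between { and }
    split("", b)           # empty the set for this line
    for (x in a) b[a[x]]   # array keys act as a set of unique elements
}
length(b) <= 2             # keep the line if at most two unique elements
' file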
EDIT
awk -F'[}{]' '{gsub(/[()]/,"");split($2,a,",");delete(b);for(x in a)b[a[x]];l=length(b)}l<=2&&l>0{s="";for(x in b)s=s""x",";sub(/,$/,"",s);y[s]=s $3}END{for(x in y)print y[x]}' file
test
kent$ cat file
{(XXX1),(XXX2)},whatever,unique=2
{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
{(XXX1)},whatever,unique=1
{},whatever,unique=0
{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
kent$ awk -F'[}{]' '{gsub(/[()]/,"");split($2,a,",");delete(b);for(x in a)b[a[x]];l=length(b)}l<=2&&l>0{s="";for(x in b)s=s""x",";sub(/,$/,"",s);y[s]=s $3}END{for(x in y)print y[x]}' file
XXX1,XXX2,whatever,unique=2
XXX1,whatever,unique=1
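Expanded and commented, the EDIT one-liner looks like this (again using split("", b) to empty the array). Two caveats: for (x in b) iterates in unspecified order, so the elements of a rebuilt line may not keep their original order, and the END loop likewise prints the results in unspecified order:

awk -F'[}{]' '
{
    gsub(/[()]/, "")           # strip parentheses; $0 is re-split into fields
    split($2, a, ",")          # elements between { and }
    split("", b)               # empty the set
    for (x in a) b[a[x]]       # collect unique elements
    l = length(b)
}
l <= 2 && l > 0 {              # keep lines with 1 or 2 unique elements
    s = ""
    for (x in b) s = s x ","   # join the unique elements
    sub(/,$/, "", s)           # drop the trailing comma
    y[s] = s $3                # $3 is ",whatever,..."; keying by s means
}                              # duplicate element sets collapse to one line
END { for (x in y) print y[x] }
' file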

Related

Remove two lines using sed

I'm writing a script which can parse an HTML document. I would like to remove two lines; how does sed handle newlines? I tried
sed 's/<!DOCTYPE.*\n<h1.*/<newstring>/g'
which didn't work. I tried this statement but it removes the whole document because it seems to remove all newlines:
sed ':a;N;$!ba;s/<!DOCTYPE.*\n<h1.*\n<b.*/<newstring>/g'
Any ideas? Maybe I should work with awk?
For the simple task of removing two lines if each matches some pattern, all you need to do is:
sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'
This uses an address matching the first line you want to delete. When the address matches, it executes:
N (Next) - append the next line to the current pattern space (including the \n)
Then, it matches on an address for the contents of the second line (following \n). If that works it executes:
d (delete) - discard the current pattern space and start reading the next unread line
If d isn't executed, then both lines will print by default and execution will continue as normal.
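A quick sanity check on made-up input:

printf '%s\n' '<!DOCTYPE html>' '<h1>Title</h1>' '<p>body</p>' |
sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'

Only <p>body</p> is printed: the first address matches, N pulls in the <h1> line, and d throws both away.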
To adjust this for three lines, you need only use N again. If you want to pull in multiple lines until some delimiter is reached, you can use a line-pump, which looks something like this:
/<!DOCTYPE.*/{
:pump
N
/some-regex-to-stop-pump/!b pump
/regex-which-indicates-we-should-delete/d
}
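As a concrete (hypothetical) instance, this deletes everything from a <!DOCTYPE line through the first subsequent line containing </h1>; here the stop condition doubles as the delete condition, and the one-line label syntax mirrors the GNU sed :a;N;$!ba idiom from the question:

sed '/<!DOCTYPE.*/{:pump;N;/<\/h1>/!bpump;d}' file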
However, writing a full XML parser in sed or awk is a Herculean task and you're likely better off using an existing solution.
If an XML parsing tool is definitely not an option, awk may be an option:
awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1' file
When we encounter a line containing "<!DOCTYPE", set the variable lne to the next line number (NR+1) and skip to the next line. Then, when the current line number equals lne (NR==lne) and the line contains "<h1", skip to the next line. All other lines are printed by the bare 1.
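The same sanity check as before (made-up input) confirms it behaves like the sed version:

printf '%s\n' '<!DOCTYPE html>' '<h1>Title</h1>' '<p>body</p>' |
awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1'

Only <p>body</p> is printed.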
My solution for a document like this:
<b>...
<first...
<second...
<third...
<a ...
this awk command works well:
awk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'
that's all.
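Note that a regular expression as RS is not portable: it works in GNU awk (gawk), while traditional awk uses only the first character of RS. Applied to the sample document above:

printf '%s\n' '<b>...' '<first...' '<second...' '<third...' '<a ...' |
gawk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'

which leaves only the <b>... and <a ... lines.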
This might work for you (GNU sed):
sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D' file
Append the following line and, if the pattern matches both lines in the pattern space, delete them.
Otherwise, print and then delete the first of the two lines, and repeat.
To replace the two lines with another string, use:
sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'
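The sliding window is easy to see on a made-up input where the pair sits at the end:

printf '%s\n' '<p>keep</p>' '<!DOCTYPE html>' '<h1>Title</h1>' |
sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D'

Only <p>keep</p> is printed: the first two-line window does not match, so P prints its first line and D slides the window down by one, where the pair then matches and is deleted.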

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file (file1, around 10 GB) by filtering it using a fairly long list of words (around 180,108 items) in a text file, file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file, file3, containing only those lines in file1 whose first word matches the word list in file2, and disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand ^ and \b might play a part here, but I don't know how to fit them into the syntax. I've looked around extensively but no solution seems to fit.
My problem is that grep reads each entire line of file1, so the match can land inside the web address, which I'm not interested in.
sed 's/^/^/' file2.txt | grep -f - file1.sm
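One caveat with this sketch: ^word also matches lines whose first word merely begins with word (^cat would match a line starting with catalog). Anchoring on the trailing blank as well restricts the match to the whole first word:

sed 's/.*/^& /' file2.txt | grep -f - file1.sm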
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm
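The gawk attempt from the question also works once it tests the first field of file1 and keeps, rather than negates, the matches. Unlike join, this needs no sorting, at the cost of holding the word list in memory:

gawk 'FNR==NR { a[$1]; next } $1 in a' file2.txt file1.sm > file3.sm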

Grep Entire File For Strings, Not Line by Line

I want to search for files that contain 'even:suspendcount>0' AND 'even:holdcount>0'. These two strings must both be somewhere in the file, not necessarily on the same line. The problem I am running into is that my search is not returning files that contain one string on, say, line #5 and the other on line #10; it only returns files where both are on the same line. How would I search for files that contain multiple strings anywhere in the file, not necessarily on the same line?
Using grep
To use grep to get files that have both strings in either order:
grep -lZ 'even:suspendcount>0' * | xargs --null grep -l 'even:holdcount>0'
How it works:
grep -lZ 'even:suspendcount>0' *
This returns a nul-separated list of the names of files which contain the string even:suspendcount>0.
xargs --null grep -l 'even:holdcount>0'
Of the files selected by the first step, this returns the names of those which also contain even:holdcount>0.
Because we are using nul-separation when passing the file names from one process to the next, this approach is safe even for difficult file names.
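A tiny fixture (file names made up) shows the effect:

printf 'even:suspendcount>0\nfoo\neven:holdcount>0\n' > both.txt
printf 'even:suspendcount>0\n' > one.txt
grep -lZ 'even:suspendcount>0' * | xargs --null grep -l 'even:holdcount>0'

Only both.txt is printed.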
Using awk
This prints the file name of any file that contains both strings:
awk 'BEGINFILE{f=0;g=0} /even:suspendcount>0/{f=1} /even:holdcount>0/{g=1} f && g{print FILENAME; nextfile}' *
How it works:
BEGINFILE{f=0;g=0}
As we start reading a new file, variables f and g are set to zero (false).
/even:suspendcount>0/{f=1}
If we encounter a line containing even:suspendcount>0, then set variable f to 1.
/even:holdcount>0/{g=1}
Similarly, if we encounter a line containing even:holdcount>0, then set variable g to 1.
f && g{print FILENAME; nextfile}
If both f and g are true (nonzero), then print the filename and skip to the next file.
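BEGINFILE is a gawk extension; on other awks, resetting the flags on each file's first line achieves the same (a sketch, still relying on nextfile):

awk 'FNR==1{f=0;g=0} /even:suspendcount>0/{f=1} /even:holdcount>0/{g=1} f && g{print FILENAME; nextfile}' *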
A grep pattern is line-oriented, so a single pattern can only express 'even:suspendcount>0' OR 'even:holdcount>0' (namely grep -E 'even:(suspend|hold)count>0'), not both strings anywhere in the file.

What is the best way to use tr and grep on a folder?

I'm trying to search through all files in a folder for the following string
<cert>
</cert>
However, I have to remove line returns.
The following code works on one file, but how can I pipe an entire folder through tr and grep? The -l option is there to print only the filename, not the whole file.
tr -d '\n' < test | grep -l '<cert></cert>'
The tr/grep approach requires grep to process the whole file as one line. While GNU grep can handle long lines, many others cannot. Also, if the file is large, memory may be taxed.
The following avoids those issues. It searches through all files in the current directory and reports the names of any that contain <cert> on one line and </cert> on the next:
awk 'last ~ "<cert>" && $0 ~ "</cert>" {print FILENAME; nextfile} {last=$0}' *
How it works
awk implicitly loops over all lines in a file.
This script uses one variable, last, which contains the text of the previous line.
last ~ "<cert>" && $0 ~ ""`
This tests if (a) the last line contains the characters <cert> and (b) the current line contains the characters </cert>.
If you actually wanted lines that contain <cert> and no other characters, then replace ~ with ==.
{print FILENAME; nextfile}
If the preceding condition returns true, then this prints the file's name and starts on the next file.
(nextfile was a common extension to awk that became POSIX 2012.)
{last=$0}
This updates the variable last to have the current line.
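If you would rather keep the tr/grep idea from the question, a shell loop can feed each file through it; since grep -l on a pipe only reports (standard input), use grep -q for the exit status and print the file name yourself:

for f in ./*; do
    tr -d '\n' < "$f" | grep -q '<cert></cert>' && printf '%s\n' "$f"
done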

difference of lines to file without diff tags

I just want to take the difference of two files and write them to another without patch tags like + or - or diff tags like > or <. I understand how patches work and how to use the following commands:
diff file1.txt file2.txt | grep ">" > difffile.txt
diff -u file1.txt file2.txt > difffile.patch
patch original.txt < difffile.patch
but when I open my difffile.txt from the first command, I get something like this:
> some line of text
> some other line of text
when what I really want is:
some line of text
some other line of text
I thought that maybe indexing the string like
${stringname:2}
would work, but I don't know how to use that with grep or how to index a grep string.
I'm actually parsing HTML and XML and just want the value differences in some file. I don't know how to do that.
If you just want to remove the first two characters of every line, cut is your friend:
cut -c3- file
Test
$ cat a
hello this is me
and this is you
$ cut -c3- a
llo this is me
d this is you
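Putting it together with the diff command from the question (anchoring > at the start of the line so it cannot match inside the text itself):

diff file1.txt file2.txt | grep '^>' | cut -c3- > difffile.txt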
