Remove two lines using sed - parsing

I'm writing a script which can parse an HTML document. I would like to remove two lines; how does sed work with newlines? I tried
sed 's/<!DOCTYPE.*\n<h1.*/<newstring>/g'
which didn't work. I tried this statement but it removes the whole document because it seems to remove all newlines:
sed ':a;N;$!ba;s/<!DOCTYPE.*\n<h1.*\n<b.*/<newstring>/g'
Any ideas? Maybe I should work with awk?

For the simple task of removing two lines if each matches some pattern, all you need to do is:
sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'
This uses an address matching the first line you want to delete. When the address matches, it executes:
N (Next) - append the next line to the current pattern space (including the \n)
Then it matches an address against the contents of that second line (after the \n). If that matches, it executes:
d (delete) - discard the current pattern space and start reading the next unread line
If d isn't executed, then both lines will print by default and execution will continue as normal.
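A quick check with a hypothetical three-line input (the sample HTML here is made up for illustration):
$ printf '<!DOCTYPE html>\n<h1>Title</h1>\n<p>body</p>\n' | sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'
<p>body</p>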
To adjust this for three lines, you need only use N again. If you want to pull in multiple lines until some delimiter is reached, you can use a line-pump, which looks something like this:
/<!DOCTYPE.*/{
:pump
N
/some-regex-to-stop-pump/!b pump
/regex-which-indicates-we-should-delete/d
}
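As a concrete (hypothetical) variant, this collects lines starting at <!DOCTYPE until a line containing </head> has been appended, then deletes the whole collected block:
sed '/<!DOCTYPE.*/{
:pump
N
/<\/head>/!b pump
d
}' file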
However, writing a full XML parser in sed or awk is a Herculean task and you're likely better off using an existing solution.

If an xml parsing tool is definitely not an option, awk may be an option:
awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1' file
When we encounter a line containing "<!DOCTYPE", set the variable lne to the next line number (NR+1) and skip to the next line. Then, when the current line number equals lne (NR==lne) and the line contains "<h1", skip to the next line. All other lines are printed by the final 1.
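A quick check with the same kind of hypothetical input:
$ printf '<!DOCTYPE html>\n<h1>Title</h1>\n<p>body</p>\n' | awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1'
<p>body</p>
Note that this always drops the <!DOCTYPE line itself; only the following line is skipped conditionally on /<h1/.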

My solution for a document like this:
<b>...
<first...
<second...
<third...
<a ...
this awk command works well (note that a regular-expression RS like this is not POSIX; it requires GNU awk or another awk with regex RS support):
awk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'
that's all.

This might work for you (GNU sed):
sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D' file
Append the following line, and if the pattern matches both lines in the pattern space, delete them.
Otherwise, print then delete the first of the two lines and repeat.
To replace the two lines with another string, use:
sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'
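A quick check of the replacement form with a hypothetical three-line input (GNU sed):
$ printf '<!DOCTYPE html>\n<h1>Title</h1>\n<p>body</p>\n' | sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'
another string
<p>body</p>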

Related

How can I find files that match a two-line pattern using grep?

I created a test file with the following:
<cert>
</cert>
I'm now trying to find this with grep and the following command, but it takes forever to run:
tr -d '\n' | grep '<cert></cert>' test.test
How can I search quickly for files that contain adjacent lines like these?
So, from the comments, you're trying to get the filenames that contain an empty <cert>..</cert> element. You're using several tools wrong. As @iiSeymour pointed out, tr only reads from standard input, so if you want to use it to select from lots of filenames, you'll need to use a loop (see the sketch just below). grep prints out matching lines, not filenames, though you could use grep -l to see the filenames instead.
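For completeness, a sketch of that per-file loop (slow on many files; the *.txt glob is just for illustration):
for f in *.txt; do
    tr -d '\n' < "$f" | grep -q '<cert></cert>' && printf '%s\n' "$f"
done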
But you're only joining lines because grep works one line at a time; so let's use a better tool. Here's how to search with awk:
awk '/<cert>/ { started=1; }
/<\/cert>/ { if (started) { print FILENAME; nextfile;} }
!/<cert>/ { started = 0; }' file1 file2 *.txt
It checks each line and keeps track of whether the previous line matched <cert>. (!/pattern/ sets the flag back to zero on lines not matching /pattern/.) Call it with all your files (or with a wildcard like *.txt).
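A hypothetical run, assuming good.txt contains the adjacent <cert>/</cert> pair and other.txt does not:
$ awk '/<cert>/ { started=1; }
  /<\/cert>/ { if (started) { print FILENAME; nextfile;} }
  !/<cert>/ { started = 0; }' good.txt other.txt
good.txt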
And a friendly suggestion: Next time, try each command separately (you've been stuck on this for hours and you still don't know what grep does?). And have a quick look at the manual for the tools you want to use. Unix tools are usually too complex for simple trial and error.

grep from beginning of found word to end of word

I am trying to grep the output of a command that outputs unknown text and a directory per line. Below is an example of what I mean:
.MHuj.5.. /var/log/messages
The text and directory may be different from time to time or system to system. All I want to do though is be able to grep the directory out and send it to a variable.
I have looked around but cannot figure out how to grep to the end of a word. I know I can start the search phrase looking for a "/", but I don't know how to tell grep to stop at the end of the word, or if it will consider the next "/" a new word or not. The directories listed could change, so I can't assume the same amount of directories will be listed each time. In some cases, there will be multiple lines listed and each will have a directory list in its output. Thanks for any help you can provide!
If your directory paths do not contain spaces, then you can do:
$ echo '.MHuj.5.. /var/log/messages' | awk '{print $NF}'
/var/log/messages
It's not clear from a single example whether we can generalize that e.g. the first occurrence of a slash marks the beginning of the data you want to extract. If that holds, try
grep -o '/.*' file
To fetch everything after the last space, try
grep -o '[^ ]*$' file
For more advanced pattern matching and extraction, maybe look at sed, or Awk or Perl or Python.
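Since the goal was to put the directory into a shell variable, a minimal sketch using command substitution (the sample line is hypothetical):
line='.MHuj.5.. /var/log/messages'
dir=$(printf '%s\n' "$line" | grep -o '[^ ]*$')
printf '%s\n' "$dir"    # prints /var/log/messages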
Your line can be described as:
^\S+\s+(\S+)$
That's assuming whitespace is your delimiter between the random text and the directory. It simply separates the whitespace from the non-whitespace and captures the second part.
Or you might want to look into the word boundary character class: \b.
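Plain grep cannot print just the captured group, but if your grep supports -P (PCRE), a sketch along the same lines uses \K to drop the leading part from the reported match:
echo '.MHuj.5.. /var/log/messages' | grep -oP '^\S+\s+\K\S+$'
# prints /var/log/messages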
I know you said to use grep, but I can't help but mention that this is trivially done using awk:
awk '{ print $NF }' input.txt
This is assuming that a whitespace is the delimiter and that the path does not contain any whitespaces.

How to clean a CSV file using the 'grep' command

Assume we have records of the following form: {(XXX1),(XXX2)},whatever. What I want is to extract the information based on the following rule, preferably with 'grep': if the {} contains two or fewer UNIQUE elements (the ones inside the ()), then keep (both of) them; otherwise delete the whole row. As a further step, I want to extract the values within the () and finally write the remaining lines in the following form: XXX1,XXX2,whatever
UPDATE:
For the following input:
{(XXX1),(XXX2)},whatever,unique=2
{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
{(XXX1)},whatever,unique=1
{},whatever,unique=0
{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
I should get the following output:
XXX1,XXX2,whatever,unique=2
XXX1,whatever,unique=1
awk can do it; check this one-liner:
awk -F'[}{]' '{split($2,a,",");delete(b);for(x in a)b[a[x]]}length(b)<=2' file
let's do a small test:
kent$ cat file
ok,{(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1)},whatever,unique=1
ok,{},whatever,unique=0
nok,{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
kent$ awk -F'[}{]' '{split($2,a,",");delete(b);for(x in a)b[a[x]]}length(b)<=2' file
ok,{(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
ok,{(XXX1)},whatever,unique=1
ok,{},whatever,unique=0
As you can see, the nok line was removed.
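For readability, here is the same program spread over lines with comments (behavior unchanged; note that deleting a whole array and length(array) are not in POSIX awk, but GNU awk supports both):
awk -F'[}{]' '
{
    split($2, a, ",")      # $2 is the text between { and }; split it on commas
    delete b               # clear the set of elements seen so far on this line
    for (x in a) b[a[x]]   # use array keys as a set, so duplicates collapse
}
length(b) <= 2             # print the line only if it has at most two unique elements
' file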
EDIT
awk -F'[}{]' '{gsub(/[()]/,"");split($2,a,",");delete(b);for(x in a)b[a[x]];l=length(b)}l<=2&&l>0{s="";for(x in b)s=s""x",";sub(/,$/,"",s);y[s]=s $3}END{for(x in y)print y[x]}' file
test
kent$ cat file
{(XXX1),(XXX2)},whatever,unique=2
{(XXX1),(XXX1),(XXX1),(XXX2)},whatever,unique=2
{(XXX1)},whatever,unique=1
{},whatever,unique=0
{(XXX1),(XXX2),(XXX3),(XXX4)},whatever
kent$ awk -F'[}{]' '{gsub(/[()]/,"");split($2,a,",");delete(b);for(x in a)b[a[x]];l=length(b)}l<=2&&l>0{s="";for(x in b)s=s""x",";sub(/,$/,"",s);y[s]=s $3}END{for(x in y)print y[x]}' file
XXX1,XXX2,whatever,unique=2
XXX1,whatever,unique=1

sed add additional column

I want to add an additional column of ones to a tab separated file.
The file looks like this:
#> cat /tmp/myfile
Aal Fisch_und_Fleisch
Aalsuppe Fisch_und_Fleisch
The way I wanted to do it is with sed, matching the whole line and printing it out together with the new column. However, the additional column is written in the middle of the lines instead of at the end:
#> cat /tmp/myfile | sed 's#^\(.*\)$#\1\t1#g'
Aal 1isch_und_Fleisch
Aalsuppe1 Fisch_und_Fleisch
When I do a sanity check with some manually created lines it works, though:
#> echo -e "aaaaaaaaaa\taaaaaaaaaaaa\nbbbbbbb\tbbbbbbbb" | sed 's#^\(.*\)$#\1\t1#g'
aaaaaaaaaa aaaaaaaaaaaa 1
bbbbbbb bbbbbbbb 1
I guessed it might be an encoding/line-break issue; here is what file says:
#> file /tmp/myfile
/tmp/myfile: ASCII text, with CRLF line terminators
If it is an encoding/line break issue, how do I go about it?
I'm not able to reproduce your exact issue, but I have seen similar things before. Essentially, CRLF line endings can cause strangeness in the visual display, because the CR part, the carriage return, can move the cursor to the beginning of the same line rather than to the beginning of a new line. The easiest fix is probably just to switch to Unix-style line endings.
To switch to Unix-style endings, use one of
dos2unix
tr -d '\r'
As a whole, something like
cat /tmp/myfile | dos2unix | sed 's#^\(.*\)$#\1\t1#g'
If you need to switch back, you could use unix2dos.
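If dos2unix is not installed, the tr variant written out in full would be something like:
tr -d '\r' < /tmp/myfile | sed 's#^\(.*\)$#\1\t1#g'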
This might work for you (GNU sed):
sed 's/$/\t1/' file

Grep for beginning and end of line?

I have a file where I want to grep for lines that start with either -rwx or drwx AND end in any number.
I've got this, but it isnt quite right. Any ideas?
grep [^.rwx]*[0-9] usrLog.txt
The tricky part is a regex that includes a dash as one of the valid characters in a character class. The dash has to come immediately after the start for a (normal) character class and immediately after the caret for a negated character class. If you need a close square bracket too, then you need the close square bracket followed by the dash. Mercifully, you only need dash, hence the notation chosen.
grep '^[-d]rwx.*[0-9]$' "$@"
See: Regular Expressions and grep for POSIX-standard details.
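A quick check with a few hypothetical ls -l style lines in usrLog.txt:
$ cat usrLog.txt
-rwxr-xr-x  1 user user  120 backup1
drwxr-xr-x  2 user user  409 logs2
-rw-r--r--  1 user user  333 readme
$ grep '^[-d]rwx.*[0-9]$' usrLog.txt
-rwxr-xr-x  1 user user  120 backup1
drwxr-xr-x  2 user user  409 logs2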
It looks like you were on the right track... The ^ character matches beginning-of-line, and $ matches end-of-line. Jonathan's pattern will work for you... just wanted to give you the explanation behind it
It should be noted that the caret (^) not only behaves differently within the brackets, it has the opposite effect from placing it outside the brackets. Placing the caret where you have it will match all strings NOT beginning with the content you placed within the brackets. You would also want to place a period before the asterisk that follows the brackets, since with grep the period acts as a "wildcard".
grep ^[.rwx].*[0-9]$
This should work for you. I noticed that some posters used a character class in their expressions, which is an effective method as well, but you were not using one in your original expression, so I tried to get one as close to yours as possible, explaining every minor change along the way so that it is better understood. How can we learn otherwise?
You probably want egrep. Try:
egrep '^[d-]rwx.*[0-9]$' usrLog.txt
Are you parsing the output of ls -l?
If you are, and you just want to get the file name:
find . -iname "*[0-9]"
If you have no choice because usrLog.txt is created by something/someone else and you absolutely must use this file, other options include
awk '/^[-d].*[0-9]$/' file
Ruby(1.9+)
ruby -ne 'print if /^[-d].*[0-9]$/' file
Bash
while read -r line ; do case $line in [-d]*[0-9] ) echo $line; esac; done < file
Many answers have been provided for this question. I just wanted to add one more, which uses a bashism:
#! /bin/bash
while read -r || [[ -n "$REPLY" ]]; do
    [[ "$REPLY" =~ ^(-rwx|drwx).*[[:digit:]]+$ ]] && echo "Got one -> $REPLY"
done <"$1"
@kurumi's answer for bash, which uses case, is also correct, but it will not read the last line of the file if there is no newline sequence at the end (i.e. the file was saved without pressing 'Enter/Return' on the last line).
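A hypothetical way to see that last point, assuming the script above is saved as matchlines.sh:
$ printf -- '-rwxr-xr-x 1 user user 120 backup1' > nonewline.txt   # no final newline
$ bash matchlines.sh nonewline.txt
Got one -> -rwxr-xr-x 1 user user 120 backup1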
