sed add additional column - character-encoding

I want to add an additional column of ones to a tab separated file.
The file looks like this:
#> cat /tmp/myfile
Aal Fisch_und_Fleisch
Aalsuppe Fisch_und_Fleisch
The way I wanted to do it is with sed, matching the whole line and printing it out together with the new column. However, the additional column is written in the middle of the lines instead of at the end:
#> cat /tmp/myfile | sed 's#^\(.*\)$#\1\t1#g'
Aal 1isch_und_Fleisch
Aalsuppe1 Fisch_und_Fleisch
When I do a sanity check with some manually created lines it works, though:
#> echo -e "aaaaaaaaaa\taaaaaaaaaaaa\nbbbbbbb\tbbbbbbbb" | sed 's#^\(.*\)$#\1\t1#g'
aaaaaaaaaa aaaaaaaaaaaa 1
bbbbbbb bbbbbbbb 1
I guessed it might be an encoding/line-break issue; here is what file says:
#> file /tmp/myfile
/tmp/myfile: ASCII text, with CRLF line terminators
If it is an encoding/line break issue, how do I go about it?

I'm not able to reproduce your exact issue, but I have seen similar things before. Essentially, CRLF line endings can cause strangeness in the visual display, because the CR part, the carriage return, moves the cursor back to the beginning of the same line rather than to the beginning of a new line. The easiest fix is probably just to switch to Unix-style line endings.
To switch to Unix-style endings, use one of
dos2unix
tr -d '\r'
As a whole, something like
cat /tmp/myfile | dos2unix | sed 's#^\(.*\)$#\1\t1#g'
If you need to switch back, you could use unix2dos.
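With the tr variant, the equivalent pipeline would be something like:
tr -d '\r' < /tmp/myfile | sed 's#^\(.*\)$#\1\t1#g'
You can check whether the carriage returns are present with cat -A /tmp/myfile (GNU cat shows a CR as ^M and the end of each line as $).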

This might work for you (GNU sed):
sed 's/$/\t1/' file
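A small caveat, as a sketch: if the file still has its CRLF endings, the \t1 is appended after the carriage return and the display problem from the question remains. With GNU sed you can strip the optional \r in the same substitution (this also converts the output to LF endings):
sed 's/\r\?$/\t1/' /tmp/myfile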

Related

Remove two lines using sed

I'm writing a script which can parse an HTML document. I would like to remove two lines, how does sed work with newlines? I tried
sed 's/<!DOCTYPE.*\n<h1.*/<newstring>/g'
which didn't work. I tried this statement but it removes the whole document because it seems to remove all newlines:
sed ':a;N;$!ba;s/<!DOCTYPE.*\n<h1.*\n<b.*/<newstring>/g'
Any ideas? Maybe I should work with awk?
For the simple task of removing two lines if each matches some pattern, all you need to do is:
sed '/<!DOCTYPE.*/{N;/\n<h1.*/d}'
This uses an address matching the first line you want to delete. When the address matches, it executes:
N (next) - append the next line to the current pattern space (including the \n)
Then it matches an address against the contents of the second line (the part after the \n). If that matches, it executes:
d (delete) - discard the current pattern space and start reading the next unread line
If d isn't executed, then both lines will print by default and execution will continue as normal.
To adjust this for three lines, you need only use N again. If you want to pull in multiple lines until some delimiter is reached, you can use a line-pump, which looks something like this:
/<!DOCTYPE.*/{
:pump
N
/some-regex-to-stop-pump/!b pump
/regex-which-indicates-we-should-delete/d
}
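As mentioned above, the three-line variant just uses N a second time; a sketch with the tags from the question:
sed '/<!DOCTYPE.*/{N;N;/\n<h1.*\n<b.*/d}' file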
However, writing a full XML parser in sed or awk is a Herculean task and you're likely better off using an existing solution.
If an XML parsing tool is definitely not an option, awk may be an option:
awk '/<!DOCTYPE/ { lne=NR+1;next } NR==lne && /<h1/ { next }1' file
When we encounter a line containing "<!DOCTYPE", set the variable lne to the next line number (NR+1) and skip to the next line. Then, when the current line number equals lne (NR==lne) and the line contains "<h1", skip that line as well. Print all other lines by using 1.
My solution for a document like this:
<b>...
<first...
<second...
<third...
<a ...
this awk command, which uses the three lines to be deleted as the record separator (RS), works well:
awk -v RS='<first[^\n]*\n<second[^\n]*\n<third[^\n]*\n' '{printf "%s", $0}'
that's all.
This might work for you (GNU sed):
sed 'N;/<!DOCTYPE.*\n<h1.*/d;P;D' file
Append the following line and, if the pattern matches both lines in the pattern space, delete them.
Otherwise, print then delete the first of the two lines and repeat.
To replace the two lines with another string, use:
sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'
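A quick usage sketch of the replacement form (GNU sed; the sample input is purely illustrative):
printf '<!DOCTYPE html>\n<h1>Title</h1>\nkeep me\n' | sed 'N;s/<!DOCTYPE.*\n<h1.*/another string/;P;D'
another string
keep me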

What is the best way to use tr and grep on a folder?

I'm trying to search through all files in a folder for the following string
<cert>
</cert>
However, I have to remove line returns.
The following code works on one file but how can I pipe an entire folder through the tr and grep? The -l option is to only print the filename and not the whole file.
tr -d '\n' < test | grep -l '<cert></cert>'
The tr/grep approach requires grep to process the whole file as one line. While GNU grep can handle long lines, many others cannot. Also, if the file is large, memory may be taxed.
The following avoids those issues. It searches through all files in the current directory and reports the names of any that contain <cert> on one line and </cert> on the next:
awk 'last ~ "<cert>" && $0 ~ "</cert>" {print FILENAME; nextfile} {last=$0}' *
How it works
awk implicitly loops over all lines in a file.
This script uses one variable, last, which contains the text of the previous line.
last ~ "<cert>" && $0 ~ ""`
This tests if (a) the last line contains the characters <cert> and (b) the current line contains the characters </cert>.
If you actually wanted lines that contain <cert> and no other characters, then replace ~ with ==.
{print FILENAME; nextfile}
If the preceding condition returns true, then this prints the file's name and starts on the next file.
(nextfile was a common extension to awk and became part of POSIX in 2012.)
{last=$0}
This updates the variable last to have the current line.
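If you would rather keep the tr/grep idea from the question, a per-file loop is one possible sketch (printing the name yourself, since grep -l on a pipe only reports (standard input)):
for f in ./*; do
  [ -f "$f" ] || continue
  tr -d '\n' < "$f" | grep -q '<cert></cert>' && printf '%s\n' "$f"
done
This still has the whole-file-as-one-line caveat mentioned above, so the awk version is preferable for large files.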

grep from beginning of found word to end of word

I am trying to grep the output of a command that outputs unknown text and a directory per line. Below is an example of what I mean:
.MHuj.5.. /var/log/messages
The text and directory may be different from time to time or system to system. All I want to do though is be able to grep the directory out and send it to a variable.
I have looked around but cannot figure out how to grep to the end of a word. I know I can start the search phrase looking for a "/", but I don't know how to tell grep to stop at the end of the word, or whether it will consider the next "/" a new word or not. The directories listed could change, so I can't assume the same number of directories will be listed each time. In some cases, there will be multiple lines listed and each will have a directory listed in its output. Thanks for any help you can provide!
If your directory paths do not have spaces, then you can do:
$ echo '.MHuj.5.. /var/log/messages' | awk '{print $NF}'
/var/log/messages
It's not clear from a single example whether we can generalize that e.g. the first occurrence of a slash marks the beginning of the data you want to extract. If that holds, try
grep -o '/.*' file
To fetch everything after the last space, try
grep -o '[^ ]*$' file
For more advanced pattern matching and extraction, maybe look at sed, Awk, Perl, or Python.
Your line can be described as:
^\S+\s+(\S+)$
That's assuming whitespace is your delimiter between the random text and the directory. It simply separates the whitespace from the non-whitespace and captures the second part.
Or you might want to look into the word boundary character class: \b.
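If you do want to extract the path with grep itself, one sketch (assuming GNU grep with PCRE support, -P) uses \K to drop the leading text from the match:
echo '.MHuj.5.. /var/log/messages' | grep -Po '^\S+\s+\K\S+$'
/var/log/messages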
I know you said to use grep, but I can't help to mention that this is trivially done using awk:
awk '{ print $NF }' input.txt
This is assuming that a whitespace is the delimiter and that the path does not contain any whitespaces.
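Since the question asks to send the directory to a variable, a usage sketch (your_command is just a stand-in for whatever produces the line):
dir=$(your_command | awk '{print $NF}')
echo "$dir"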

Opposite of "only-matching" in grep?

Is there any way to do the opposite of showing only the matching part of strings in grep (the -o flag), that is, show everything except the part that matches the regex?
That is, the -v flag is not the answer, since that would not show the lines containing a match at all; I want to show these lines, just not the part of the line that matches.
EDIT: I wanted to use grep over sed, since it can do "only-matching" matches across multiple lines, with:
cat file.xml|grep -Pzo "<starttag>.*?(\n.*?)+.*?</starttag>"
This is a rather unusual requirement; I don't think grep can alter the strings like that. You can achieve it with sed, though:
sed -n 's/$PATTERN//gp' file
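For example, with a concrete pattern (foo is purely illustrative), only the lines that contained a match are printed, with the matching part removed:
printf 'foobar\nbaz\n' | sed -n 's/foo//gp'
bar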
EDIT in response to OP's edit:
You can do multiline matching with sed, too, if the file is small enough to load it all into memory:
sed -rn ':r;$!{N;br};s/<starttag>.*?(\n.*?)+.*?<\/starttag>//gp' file.xml
You can do that with a little help from sed:
grep "pattern" input_file | sed 's/pattern//g'
I don't think there is a way in grep.
If you use ack, you could output Perl's special variables $` and $', which contain everything before and after the match, respectively:
ack string --output="\$`\$'"
Similarly, if you wanted to output what did match along with other text, you could use $&, which contains the matched string:
ack string --output="Matched: $&"

Shell Removing Tabs/Spaces

I've used a grep command with sed and cut filters that basically turn my output into something similar to this:
    this line 1

    this line 2

    another line 3

    another line 4
I'm trying to get an output without the spaces in between the lines and in front of the lines so it'd look like
this line 1
this line 2
another line 3
another line 4
I'd like to add another | filter
Add this filter to remove whitespace from the beginning of each line and to remove blank lines. Notice that it uses two sed commands: one to remove leading whitespace and another to delete lines with no content:
| sed -e 's/^\s*//' -e '/^$/d'
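A quick usage sketch (assuming GNU sed, where \s matches any whitespace character):
printf '   this line 1\n\n   this line 2\n' | sed -e 's/^\s*//' -e '/^$/d'
this line 1
this line 2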
There is an example in the Wikipedia article for sed which uses the d command to delete lines that are either blank or only contain spaces. My solution uses the escape sequence \s to match any whitespace character (space, tab, and so on). Here is the Wikipedia example:
sed -e '/^ *$/d' inputFileName
The caret (^) matches the beginning of the line.
The dollar sign ($) matches the end of the line.
The asterisk (*) matches zero or more occurrences of the previous character.
This can be done with the tr command as well. Like so
| tr -s '[:space:]'
or alternatively
| tr -s \\n
if you want to remove the line breaks only, without the space chars in the beginning of each line.
I would do this, short and simple:
sed 's: ::g'
Add this at the end of your command, and all the spaces will go poof. For example, try this command:
cat /proc/meminfo | sed 's: ::g'
You can also use grep:
... | grep -o "[^$(printf '\t') ].*"
Here we print lines that have at least one character that isn't whitespace. By using the -o flag, we print only the match, and we force the match to start on a non-whitespace character (the bracket expression excludes a space and a tab, the latter inserted via the command substitution, so double quotes are needed for it to expand).
EDIT: Changed command so it can remove the leading white space characters.
Hope this helps =)
Use grep "^." filename to remove blank lines while printing.Here,the lines starting with any character is matched so that the blank lines are left out.
^ indicates start of the line.
. checks for any character.
(whateverproducesthisoutput)|sed -E 's/^[[:space:]]+//'|grep -v '^$'
(depending on your sed, you can replace [[:space:]] with \s).
