Print text between delimiters using sed

Suppose I have op(abc)asdfasdf and I need sed to print abc, the text between the brackets. What would work for me? (Note: I only want the text between the first pair of delimiters on a line, and nothing if a particular line of input does not have a pair of brackets.)

$ echo 'op(abc)asdfasdf' | sed 's|[^(]*(\([^)]*\)).*|\1|'
abc

sed -n -e '/^[^(]*(\([^)]*\)).*/s//\1/p'
The pattern looks for lines that start with a list of zero or more characters that are not open parentheses, then an open parenthesis; then start remembering a list of zero or more characters that are not close parentheses, then a close parenthesis, followed by anything. Replace the input with the list you remembered and print it. The -n means 'do not print by default' - any lines of input without the parentheses will not be printed.
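As a quick check of the behaviour described above, here is a minimal sketch that wraps the same command in a shell function and feeds it a few lines. The function name first_parens and the sample lines are made up for illustration.

```shell
# first_parens: hypothetical helper name. Reads lines on stdin and prints the
# text inside the first (...) pair of each line; lines without a pair are
# skipped because of -n.
first_parens() {
    sed -n -e '/^[^(]*(\([^)]*\)).*/s//\1/p'
}

# Three sample lines: one pair, no pair, two pairs.
printf '%s\n' 'op(abc)asdfasdf' 'no brackets here' 'x(1)y(2)z' | first_parens
```

The empty pattern in s//\1/p reuses the regular expression from the address, so the pattern only has to be written once.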

Related

select only a word that is part of colon

I have a text file using markup language (similar to wikipedia articles)
cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.
I need to select the word "having:" only because that is part of regular text. I tried
grep -v '[*:*]' test.txt
This will correctly avoid the tags, but does not select the expected word.
The square brackets specify a character class, so your regular expression looks for any occurrence of one of the characters * or : (or *, but we said that already, didn't we?)
grep has the option -o to only print the matching text, so something like
grep -ow '[^[:space:]]*:[^[:space:]]*' file.txt
would extract any text with a colon in it, surrounded by zero or more non-whitespace characters on each side. The -w option adds the condition that the match needs to be between word boundaries.
However, if you want to restrict in which context you want to match the text, you will probably need to switch to a more capable tool than plain grep. For example, you could use sed to preprocess each line to remove any bracketed text, and then look for matches in the remaining text.
sed -e 's/\[.*]//g' -e 's/ [^: ]*$/ /' -e 's/[^: ]* //g' -e 's/ /\n/' file.txt
(This assumes that your sed recognizes \n in the replacement string as a literal newline. There are simple workarounds available if it doesn't, but let's not go there if it's not necessary.)
In brief, we first remove any text between square brackets. (This needs to be improved if your input could contain multiple sequences of square brackets on a line with normal text between them. Your example only shows nested square brackets, but my approach is probably too simple for either case.) Then, we remove any words which don't contain a colon, with a special provision for the last word on the line, and some subsequent cleanup. Finally, we replace any remaining spaces with newlines, and (implicitly) print whatever is left. (This still ends up printing one newline too many, but that is easy to fix up later.)
Alternatively, we could use sed to remove any bracketed expressions, then use grep on the remaining tokens.
sed -e :a -e 's/\[[^][]*\]//' -e ta file.txt |
grep -ow '[^[:space:]]*:[^[:space:]]*'
The :a creates a label a and ta says to jump back to that label and try again if the regex matched. This one also demonstrates how to handle nested and repeated brackets. (I suppose it could be refactored into the previous attempt, so we could avoid the pipe to grep. But outlining different solution models is also useful here, I suppose.)
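To see the label loop at work, here is a small sketch that runs just the sed half of the pipeline on one made-up line with both nested and repeated brackets. The function name strip_brackets and the sample text are assumptions for illustration.

```shell
# strip_brackets: hypothetical helper name. Each pass removes one innermost
# [...] group; "ta" jumps back to label a while the substitution keeps
# succeeding, so nested and repeated groups are all removed.
strip_brackets() {
    sed -e :a -e 's/\[[^][]*\]//' -e ta
}

# The spaces that surrounded the removed groups are left in place.
printf '%s\n' 'keep [[nested: tag]] this: and [one] more' | strip_brackets
```

The bracket expression [^][] matches any character that is neither ] nor [, which is why only an innermost pair can match on each pass.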
If you wanted to ensure that there is at least one non-colon character adjacent to the colon, you could do something like
... file.txt |
grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+'
where the -E option selects a slightly more modern regex dialect which allows us to use | between alternatives and + for one or more repetitions. (The original grep of the early 1970s did not have these features at all; much later, some implementations, notably GNU grep, grafted them onto the basic dialect with a slightly wacky syntax which requires you to backslash them to remove the literal meaning and select the metacharacter behavior... but let's not go there.)
Notice also how [^:[:space:]] matches a single character which is not a colon or a whitespace character, where [:space:] is the (slightly arcane) special POSIX named character class which matches any whitespace character (regular space, horizontal tab, vertical tab, possibly Unicode whitespace characters, depending on locale).
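A small sketch of the difference the extra condition makes, assuming a grep that supports -o and -E (GNU and BSD grep both do). The -w option is left out here to keep the example focused on the alternation; colon_words and the sample lines are hypothetical.

```shell
# colon_words: hypothetical helper name. Prints every token that contains a
# colon with at least one non-colon, non-whitespace character beside it.
colon_words() {
    grep -oE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^:[:space:]]+'
}

# The lone ":" on the second line has whitespace on both sides, so neither
# alternative can match it.
printf '%s\n' 'having: text' 'a : b' ':tag' | colon_words
```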
Awk easily lets you iterate over the tokens on a line. The requirement to ignore matches within square brackets complicates matters somewhat; you could keep a separate variable to keep track of whether you are inside brackets or not.
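awk '{ for(i=1; i<=NF; ++i) {
    # a closing bracket ends the bracketed region; skip this token
    if($i ~ /\]/) { brackets=0; continue }
    if($i ~ /\[/) brackets=1
    # skip tokens inside brackets, but keep scanning the rest of the line
    if(brackets) continue
    if($i ~ /:/) print $i } }' file.txt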
This again hard-codes some perhaps incorrect assumptions about how the brackets can be placed. It will behave unexpectedly if a single token contains a closing square bracket followed by an opening one, and has an oversimplified treatment of nested brackets (the first closing bracket after a series of opening brackets will effectively assume we are no longer inside brackets).
A combined solution using sed and awk:
sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
sed will change all spaces to a newline
awk (or gawk) will output all lines matching $0~/:$/, as long as i equals zero
The last part of the awk stuff keeps a count of the opening and closing brackets.
Another solution using sed and grep:
sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt | grep ':$'
's/\[.*\]+//g' will filter the stuff between brackets
's/ /\n/g' will replace a space with a newline
grep will only find lines ending with :
A third one using only awk:
gawk '{ for (t=1;t<=NF;t++){
if(i==0 && $t~/:$/) print $t;
i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt
gsub returns the number of replacements.
The variable i tracks the bracket nesting level: it is incremented by 1 for every [ and decremented by 1 for every ]. This works because gsub(/\[/,"",$t) returns the number of replacements it made. For a token like [[][ the count is increased by (3-1=) 2. When a token has brackets AND a colon my code will fail, because the token is tested for a trailing : before the bracket count is updated.
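One possible way around that last limitation, sketched under the assumption that a token containing any bracket should never be printed: count the brackets on a copy of the token first, and only then decide whether to print. The function name colon_outside_brackets and the variable names are made up for this sketch.

```shell
# colon_outside_brackets: hypothetical helper name; reads text on stdin.
colon_outside_brackets() {
    awk '{
        for (t = 1; t <= NF; t++) {
            tok = $t                      # count on a copy; $t stays intact
            opens  = gsub(/\[/, "", tok)  # gsub returns the number of replacements
            closes = gsub(/\]/, "", tok)
            # print only when outside brackets and the token has none itself
            if (depth == 0 && opens + closes == 0 && $t ~ /:$/) print $t
            depth += opens - closes
        }
    }'
}

printf '%s\n' 'a: [b:] c:' 'x [[y: z]] w:' | colon_outside_brackets
```

Because the counting is done on a copy, the original token is printed with its text untouched.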

Getting only grep exact matches

I am trying to grep a file for the exact occurrence of a match, but I get also longer spurious matches:
grep CAT1717O99 myfile.txt -F -w
Output:
CAT1717O99
CAT1717O99.5
I would like to output only the first exactly matching line. Is there any way to get rid of the second line?
Thanks in advance.
Arturo
This is the file 'myfile.txt':
CAT1717O99
CAT1717O99.5
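This will do the job:
grep -Fx "CAT1717O99" myfile.txt
-F treats the pattern as a fixed string rather than a regular expression.
-x selects only lines where the pattern matches the entire line.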
Use the power of Perl-compatible regular expression (PCRE) and search the matches to the given pattern:
grep -Po "\bCAT1717O99(\s|$)" myfile.txt
(\s|$) - alternative group, ensures matching substring CAT1717O99 if it's followed by whitespace or placed at the end of the line
-P option, enables Perl-compatible regular expressions (PCRE)
-o option, prints only matched parts of matching lines
You need to explicitly require a space (or the start or end of the line) on either side of the pattern, so that punctuation such as . does not count as a boundary.
grep -E '(^| )CAT1717O99( |$)' myFile.txt
From the grep manual:
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
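The manual text explains the spurious match: the . in CAT1717O99.5 is not a word constituent, so -w accepts that line too. A minimal sketch comparing -w with -x on the two lines from the question (the helper name sample is made up):

```shell
# sample: hypothetical helper that reproduces the contents of myfile.txt.
sample() {
    printf '%s\n' 'CAT1717O99' 'CAT1717O99.5'
}

# -w: the match is followed by ".", a non-word character, so both lines match.
sample | grep -Fw 'CAT1717O99'

# -x: the pattern must match the whole line, so only the first line matches.
sample | grep -Fx 'CAT1717O99'
```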

Shell Removing Tabs/Spaces

I've used a grep command with sed and cut filters that basically turns my output to something similar to this
this line 1
this line 2
another line 3
another line 4
I'm trying to get an output without the spaces in between the lines and in front of the lines so it'd look like
this line 1
this line 2
another line 3
another line 4
I'd like to add another | filter
Add this filter to remove whitespace from the beginning of each line and to remove blank lines. Notice that it uses two sed commands: one to strip leading whitespace and another to delete lines with no content.
| sed -e 's/^\s*//' -e '/^$/d'
There is an example in the Wikipedia article for sed which uses the d command to delete lines that are either blank or contain only spaces. My solution instead uses the escape sequence \s (a GNU extension) to match any whitespace character (space, tab, and so on). Here is the Wikipedia example:
sed -e '/^ *$/d' inputFileName
The caret (^) matches the beginning of the line.
The dollar sign ($) matches the end of the line.
The asterisk (*) matches zero or more occurrences of the previous character.
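Putting the pieces together, here is a sketch of the two-command filter with the POSIX class [[:space:]] swapped in for \s, on the assumption that not every sed understands \s. The function name trim_lines and the sample input are made up.

```shell
# trim_lines: hypothetical helper name. The first expression strips leading
# whitespace; the second deletes lines that are empty after stripping.
trim_lines() {
    sed -e 's/^[[:space:]]*//' -e '/^$/d'
}

# Input mixes leading spaces, a leading tab, an empty line and a spaces-only line.
printf '  this line 1\n\n\tthis line 2\n   \nanother line 3\n' | trim_lines
```

The order of the two expressions matters: a line holding only spaces becomes empty after the first one, so the second can delete it.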
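This can be done with the tr command as well, whose -s option squeezes each run of a repeated character into a single occurrence. Like so
| tr -s '[:space:]'
or alternatively
| tr -s '\n'
if you want to remove only the blank lines, without touching the space chars at the beginning of each line. (With '[:space:]', a run of leading spaces is squeezed to a single space rather than removed.)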
I would do this, short and simple:
sed 's: ::g'
Add this at the end of your command, and every space character will go poof, including the spaces between words (tabs are left alone). For example try this command:
cat /proc/meminfo | sed 's: ::g'
You can also use grep:
... | grep -o "[^$(printf '\t') ].*"
Here we print lines that have at least one character that isn't white space. By using the "-o" flag, we print only the match, and we force the match to start on a non white space character.
EDIT: Changed command so it can remove the leading white space characters.
Hope this helps =)
Use grep "^." filename to remove blank lines while printing. Here, only lines that start with some character are matched, so the blank lines are left out.
^ indicates start of the line.
. checks for any character.
(whateverproducesthisoutput)|sed -E 's/^[[:space:]]+//'|grep -v '^$'
(depending on your sed, you can replace [[:space:]] with \s).

Print text between ( ) sed

This is an extension of my previous question. In that question, I needed to retrieve the text between parentheses where all the text was on a single line. Now I have this case:
(aop)
(abc
d)
This time, the open parenthesis can be on one line and the close parenthesis on another line, so:
(abc
d)
also counts as text between the delimiters '( )' and I need to print it as
abc
d
EDIT:
In response to possible confusions of my question, let me clarify a little. Basically, I need to print text between delimiters which could span multiple lines.
for example I have this text in my file:
randomtext(1234
567) randomtext
randomtext(abc)randomtext
Now I want Sed to pick out text between the delimiter "(" and ")". So the output would be:
1234
567
abc
Notice that the left and right brackets are not on the same line but they still count as a delimiter for 1234 567, so I need to print that part of the text. (note, I only want the text between the first pair of delimiters).
Any help would be appreciated.
Ah! another tricky sed puzzle :)
I believe this code will work for your problem:
sed -n '/(/,/)/{:a; $!N; /)/!{$!ba}; s/.*(\([^)]*\)).*/\1/p}' file
OUTPUT
For the provided input it produced:
1234
567
abc
Explanation:
-n suppresses the regular sed output
/(/,/)/ is for range selection between ( and )
:a is for marking a label a
$!N means append the next line of input to the current pattern space, unless this is already the last line
/)/! means do some actions if ) is not matched in current pattern space
/)/!{$!ba} means go to label a if ) is not matched in current pattern space and this is not the last line
s/.*(\([^)]*\)).*/\1/ means replace content between ( and ) by just the content thus stripping out parenthesis
\1 is for back reference of group 1 i.e. text between \( and \)
p is for printing the replaced content
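The same script can be spread over several lines, one command per line, which makes the steps listed above easier to follow. This is only a sketch; the function name between_parens is made up, and the input is the sample from the question.

```shell
# between_parens: hypothetical helper name; reads the text on stdin.
# Inside the /(/,/)/ range: keep appending lines (N) until the pattern space
# contains ")", then print what lies between the parentheses.
between_parens() {
    sed -n '/(/,/)/{
        :a
        $!N
        /)/!{
            $!ba
        }
        s/.*(\([^)]*\)).*/\1/p
    }'
}

printf '%s\n' 'randomtext(1234' '567) randomtext' 'randomtext(abc)randomtext' |
    between_parens
```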
This link has the answer. I am paraphrasing to match your need:
sed -n '1h;1!H;${;g;s/.*(\([^)]*\)).*/\1/;p}' < your_input
The answer given didn't work for my case. What worked for me was:
cat file | tr -d '\n'
This puts the whole file on a single line by deleting the line breaks. I then piped the result into the answer here. (Note: instead of brackets, OPEN and CLOSE are used in that question.)

How do I use grep to extract a specific field value from lines

I have lines in a file which look like the following
....... DisplayName="john" ..........
where .... represents variable number of other fields.
Using the following grep command, I am able to extract all the lines which have a valid 'DisplayName' field:
grep DisplayName="[0-9A-Za-z[:space:]]*" e:\test
However, I wish to extract just the name (ie "john") from each line instead of the whole line returned by grep. I tried piping the output into the cut command but it does not accept string delimiters.
This works for me:
awk -F "=" '/DisplayName/ {print $2}'
which returns "john". To remove the quotes for john use:
awk -F "=" '/DisplayName/ {gsub("\"","");print $2}'
Specifically:
sed 's/.*DisplayName="\(.*\)".*/\1/'
That should do it. The sed syntax is s/substitutethis/forthis/, where / is the delimiter. The escaped parentheses, in combination with the escaped 1, keep the part of the pattern enclosed by the parentheses. This expression keeps everything inside the quotes after DisplayName= and throws away the rest.
This can also work without first using grep, if you use:
sed -n 's/.*DisplayName="\(.*\)".*/\1/p'
The -n option and p flag tell sed to print just the changed lines.
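One caveat worth sketching: \(.*\) is greedy, so if other quoted fields follow DisplayName on the same line, the capture runs to the last double quote. A variant that stops at the first closing quote uses [^"]* instead. The function name display_names and the id and role fields below are invented for illustration.

```shell
# display_names: hypothetical helper name. [^"]* cannot cross a double quote,
# so the capture ends at the quote that closes the DisplayName value.
display_names() {
    sed -n 's/.*DisplayName="\([^"]*\)".*/\1/p'
}

# The second line has no DisplayName field and is not printed.
printf '%s\n' 'id="7" DisplayName="john" role="admin"' 'id="8" other="x"' |
    display_names
```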
More in: http://www.grymoire.com/Unix/Sed.html
