Parsing text from .txt files - parsing

I've a tabbed log file but I need only few chracters of the line marked 30.10 in the beginning.
Using the command
awk '/^30.10/{print}' FOOD_ORDERS_201907041307.DEL
i get this output
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
What i need to extract is 3547 and the last nth caracthers from the very end after zeros.
So, expected output will be:
3547
34
29
11
But if the last 10 caracthers contains leading zeros and a number, i need that number

While your question is unclear, your answer to Ed Morton's comment provides a bit more clarity on what you are trying to achieve. Where it is still unclear is just exactly you want from the third field. From your question and the various comments, it appears if the line begins with 30.10 you want the first 4-digits from second field and you want the rightmost digits that are [1-9] from the third field.
If that accurately captures what you need, then awk with a combination of substr, match and length string functions can isolate the digits you are interested in. For example:
awk '/^30.10/ {
l=match ($3, /[1-9]+$/)
print substr ($2, 1, 4) " " substr ($3, l, length($3)-l+1)
}' test
Would take the input file (borrowed from Dudi Boy's answer), e.g.
$ cat test
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000001143
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
and return to you:
3547 34
3547 1143
3547 29
3547 11
Let me know if that accurately captures what you need.

Here is a simple awk script to do the task:
script.awk
/^30.10/ { # for each line starting with 30.10
last2chars = substr($3, length($3)-1); # extract last 2 chars from 3rd field into variable last2chars
if($3 ~ /00001143$/) last2chars = 1143; # if 3rd field ends with 1143, update variable last2chars respectively
print last2chars; # output variable last2chars
}
input.txt
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000001143
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
running:
awk -f script.awk input.txt
outupt:
34
1143
29
11

GOT Part of it!
awk '/^30.10/{print}' FOOD_ORDERS_201907041307.DEL | sed 's/.*(..)/\1/'

Related

How to grep after a certain pattern?

I have a input file such as
file;14;19;;;hello 2019
file2;2019;2020;;;this is a test 2020
file3;25;31;this is a number 31
I would like to grep numbers only after ;;;. For example if I wanted to grep 2019 it would give me
file;14;19;;;hello 2019
instead of if I did grep '2019' file
file;14;19;;;hello 2019
file2;2019;2020;;;this is a test 2020
How can I accomplish this task?
Regular expression can include stuff other than fixed text, it sounds like all you need is:
grep ';;;.*[0-9]' inputFile.txt
This will deliver all lines that have the text ;;; followed by a digit somewhere after that in the line. In terms of explanation:
;;; is the literal text, three semicolons;
.* is zero or more of any character;
[0-9] is any digit.
That will give you lines with any number. If you want a specific number, use that for the final bullet point above.
Just keep in mind that this will also give you the line xyzzy ;;; 920194 if you go looking for 2019.
If you want just the 2019 numbers (i.e., without any digits on either side), you can use the zero-width negative look-behind and look-ahead assertions, assuming your version of grep has Perl-compatible regular expressions (PCRE, which GNU grep does with the -P flag):
grep -P ';;;.*(?<![0-9])2019(?![0-9])' inputFile.txt
This can be read as:
;;; is the literal text, three semicolons;
.* is zero or more of any character;
(?<![0-9]) means next match cannot be preceded by a digit;
2019 is the number you're looking for;
(?![0-9]) means previous match cannot be followed by a digit.
Use this Perl one-liner:
perl -F';' -lane 'print if $F[-1] =~ /2019/' in_file
Example:
( echo 'file;14;19;;;hello 2019' ; echo 'file2;2019;2020;;;this is a test 2020' ) | perl -F';' -lane 'print if $F[-1] =~ /2019/'
Prints:
file;14;19;;;hello 2019
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-F';' : Split into #F on semicolon (;), rather than on whitespace.
$F[-1] : the last element of the array #F = the last element of the input line split on semicolon. Alternatively, use $F[5] (the 6th element - the arrays are 0-indexed), if you need to count from the left.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Counting specific lines that don't contain specific word

Please I have question: I have a file like this
#HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;#?++8A?;C;F92+2A#19:1*1?DDDECDE?B4:BDEEI
#BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
#HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
#C#FF?EDGFDHH#HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
#HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ
I
I'm interested just about the line that starts with '#HWI', but I want to count all the lines that are not starting with '#HWI'. In the example shown, the result will be 1 because there's one line that starts with '#BBB'.
To be more clear: I just want to know know the number of the first line of the patterns (that are 4 line that repeated) that are not '#HWI'; I hope I'm clear enough. Please tell me if you need more clarification
With GNU sed, you can use its extended address to print every fourth line, then use grep to count the ones that don't start with #HWI:
sed -n '1~4p' file.fastq | grep -cv '^#HWI'
Otherwise, you can use e.g. Perl
perl -ne 'print if 1 == $. % 4' -- file.fastq | grep -cv '^#HWI'
$. contains the current line number, % is the modulo operator.
But once we're running Perl, we don't need grep anymore:
perl -lne '++$c if 1 == $. % 4; END { print $c }' -- file.fastq
-l removes newlines from input and adds them to output.

Why does my grep command output "--" between some lines?

I have a fasta file like the test one here:
>HWI-D00196:168:C66U5ANXX:3:1106:16404:19663 1:N:0:GCCAAT
CCTAGCACCATGATTTAATGTTTCTTTTGTACGTTCTTTCTTTGGAAACTGCACTTGTTGCAACCTTGCAAGCCATATAAACACATTTCAGATATAAGGCT
>HWI-D00196:168:C66U5ANXX:3:1106:16404:19663 2:N:0:GCCAAT
AAAACATAAATTTGAGCTTGACAAAAATTAAAAATGAGCCCAGCCTTATATCTGAAATGTGTTTATATGGCTTGCAAGGTTGCAACAAGTGCAGTTTCCAA
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 1:N:0:GCCAAT
ATATTTGAATTATCAGAAATAAACACAAAGAAAACCTAGAACAGATAATTTCTTCCACATTATTGATCAGATACAGATTTCAAGGGTACCGTTGTGAATTG
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 2:N:0:GCCAAT
AAACGATTGATAGATCTATTTGCATTATAAAAACATTAAAAAAACAAAATACTGATTAAATGTCGTCTTTCTATTCCACAATTTTATAGATCTCACTGTAT
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 1:N:0:GCCAAT
CTTACTTTGCCTCTCTCAGCCAATGTCTCCTGAGTCTAATTTTTTGGAGGCTAAGCTATGAGCTAATGATGGGTTCCATTTGGGGCCAATGCTTCAGCCTG
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 2:N:0:GCCAAT
CTATTAGTTCTTATCTTTGCCTGCAAATATAAGACTAGCGCTTGAGTAGCTGACAGAGACAAAGTAAGCTGGAGTGTTTATCACCTGGTCACTCCAATTGT
When i type in a simple grep command like:
grep -B1 "CTT" test.fasta
I get a really strange output in which "--" is sometimes placed on a newline above the grep hit like so:
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 2:N:0:GCCAAT
AAACGATTGATAGATCTATTTGCATTATAAAAACATTAAAAAAACAAAATACTGATTAAATGTCGTCTTTCTATTCCACAATTTTATAGATCTCACTGTAT
--
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 2:N:0:GCCAAT
CTATTAGTTCTTATCTTTGCCTGCAAATATAAGACTAGCGCTTGAGTAGCTGACAGAGACAAAGTAAGCTGGAGTGTTTATCACCTGGTCACTCCAATTGT
I can't figure out why some fasta entries have this and others don't. I don't get this problem when i remove the -B1. I can remove those lines from my file with a grep -v "--" statement, but I'd really like to understand what's going on here.
You are asking for one line of leading context by using the -B1 option. This means grep will display both the line which matched and the line directly before it. Each match will be separated by -- on a line by itself as shown below:
$ man grep | grep -B1 context
-A num, --after-context=num
Print num lines of trailing context after each match. See also
--
-B num, --before-context=num
Print num lines of leading context before each match. See also
--
-C[num, --context=num]
Print num lines of leading and trailing context surrounding each
--
--context[=num]
Print num lines of leading and trailing context. The default is
The reason you aren't seeing -- between every match is that the context is only displayed above a sequence of consecutive matches. So see the following example:
seq 13 | grep -B1 1
1
--
9
10
11
12
13
The seq command produces all the numbers between 1 and 13. Only the first line and the lines from 10 on contain a 1, so you see the 1 in its own group, then --, then the one line context, then the group of consecutive matching lines.
GREP_COLORS section of the grep manpage says :
Specifies the colors and other attributes used to highlight various > parts of the output. Its value is a colon-separated list
of capabilities that defaults to
ms=01;31:mc=01;31:sl=:cx=:fn=35:ln=32:bn=32:se=36 with the rv and
ne boolean capabilities omitted (i.e., false).
and
se=36 SGR substring for separators that are inserted between
selected line fields (:), between context line fields, (-), and
between groups of adjacent lines when nonzero context is
specified (--). The default is a cyan text foreground over the
terminal's default background.
Consider file sample.txt :
$cat sample.txt
ABBB
AAB
AAB
S
S
S
AABB
ABAA
BAA
CCC
$grep -B2 'AAB' sample.txt
ABBB
AAB
AAB
--
S
S
AABB
Here -- is the way of grep to tell you that AAB before -- and S after -- are not adjacent lines in the actual file.

Grep wilcard of unknown length In between pipes

I'm trying to grep the following string:
Line must start with a 15 and the rest of the string can have any length of numbers between the pipes. There must be nothing in between the last 2 pipes.
"15|155702|0101|1||"
So far i have:
grep "^15|" $CONCAT_FILE_NAME >> "VAS-"$CONCAT_FILE_NAME
I'm having trouble specifying any length of numbers when using [0-9]
You need to escape the |
grep -E '^15\|([[:digit:]]+\|)+\|$'
Assuming the beginning must start with 15| and there are a total of 5 pipes(|) and nothing between the last two pipes.. And any number of digits between the 2nd 3rd and 4th pipes.
grep "^15\|[0-9]*\|[0-9]*\|[0-9]*\|\|$" $CONCAT_FILE_NAME >> "VAS-"$CONCAT_FILE_NAME
Using awk
cat file
15|155702|0101|1||
15|155702|0101|1|test|
16|155702|0101|1||
awk -F\| '/^15/ && !$(NF-1)' file
15|155702|0101|1||
This prints a line only if it starts with 15 and the second last field, separated by | is blank
So this would be:
VAS-CONCAT_FILE_NAME=$(awk -F\| '/^15/ && !$(NF-1)' <<<"$CONCAT_FILE_NAME")
Another shorter regex that would work
awk '/^15.*\|\|$/' file
This search for all lines starting with 15 and ends with ||

grep to find words with unique letters

how to use grep to find occurrences of words from a dictionary file which have a given set of letters with the restriction that each letter occurs once and only once.
EG if the letters are abc then the expected output is:
cab
EDIT:
Given a dictionary file (that is a file containing one word per line such as /usr/share/dict/words on mac os x operating system) and a set of (unique) characters, I want to print out all of the dictionary file's words that contain each character of the input set once and only once. For example if the set of characters is {a,b,c} then print out all (3-letter) words that contain each character of the set.
I am looking, preferably, for a solution that uses just grep expressions.
Given a series of letters, for example abc, you can convert each one to a lookahead, like this:
^(?=[^a]*a[^a]*)(?=[^b]*b[^b]*)(?=[^c]*c[^c]*)$
You may need to use the "extended regex" flag -E to use this regex with grep.
To create this regex from a string, you could use sed (an exercise for the reader)
grep -E ^[abc]{3}.$ <Dictionary file> | grep -v -e a.*a -e b.*b -e c.*c
i.e. Find all three letter strings matching the input and pipe these through inverse grep to remove strings with double letters.
I'm using the '.' after {3} because my dictionary file is windows based so has an extra carriage return or line feed. So, that's probably not necessary.
Below is a Perl solution. Note, you'll need to add more words to the dictionary, and read input in to the $input variable. An array of valid words will end up in #results.
#!/usr/bin/env perl
use Data::Dumper;
my $input = "abc";
my #dictionary = qw(aaa aac aad aal aam aap aar aas aat aaw aba abc abd abf abg
abh abm abn abo abr abs abv abw aca acc ace aci ack acl acp acs act acv ada adb
adc add adf adh adl adn ado adp adq adr ads adt adw aea aeb aec aed aef aes aev
afb afc afe aff afg afi afk afl afn afp aft afu afv agb agc agl agm agn ago agp
...
PUT A REAL DICTIONARY HERE!
...
zie zif zig zii zij zik zil zim zin zio zip zir zis zit ziu ziv zlm zlo zlx zma
zme zmi zmu zna zoa zob zoe zog zoi zol zom zon zoo zor zos zot zou zov zoy zrn
zsr zub zud zug zui zuk zul zum zun zuo zur zus zut zuz zva zwo zye zzz);
# Generate a lookahead expression for each character in the input word
my $regexp = join("", map { "(?=.*$_)" } split(//, $input));
my #results;
foreach my $word (#dictionary) {
# If the size of the input doesn't match the dictionary word, skip to the
# next word.
if (length($input) != length($word)) {
next;
}
if ($word =~ /$regexp/) {
push(#results, $word);
}
}
print Dumper #results;
The solution I found involves using grep first to extract all n-letter words that contain only letters from the input set - although some letters might appear more than once, some may not appear; (again I am assuming that the input letters are unique). Then it does a series of 1-letter greps to make sure each letter occurs at least once. Because the words are of length n this ensures the word contains each letter once and only once. For example, if the input character set is (a,b,c} then the solution would be:
grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c
a simple bash script can be written which creates this grep string and executes it against the word file, using $1 as the input letter set. It might not be the most efficient method of generating the string, but as I am not familiar with sed or awk it does seem to solve my problem. The script I created is:
#!/bin/sh
slen=${#1}
g2="'^[$1]{$slen}\$'"
g3=""
ix1=0
while [ $ix1 -lt $slen ]
do
g3="$g3 | grep ${1:$ix1:1}"
ix1=$((ix1+1))
done
eval grep -E $g2 /usr/share/dict/words $g3

Resources