print every nth line into a row using gawk

I have a very large file from which I need to extract every nth line and print each one on its own row.
My data:
1 937 4.320194
2 667 4.913314
3 934 1.783326
4 940 -0.299312
5 939 2.309559
6 936 3.229496
7 611 -1.41808
8 608 -1.154019
9 606 2.159683
10 549 0.767828
I want my data to look like this:
1 937 4.320194
3 934 1.783326
5 939 2.309559
7 611 -1.41808
9 606 2.159683
This is just an example; for my real data file I want every 10th line. This is what I have tried so far:
NF == 6 {
if(NR%10) {print;}
}

To print every second line, starting with the first:
awk 'NR%2==1' file.txt
To print every tenth line, starting with the tenth line:
awk 'NR%10==0' file.txt
To use this in a script, add the following to a file called script.awk:
BEGIN {
print "Processing file"
}
NR%10==0
END {
print "Finished processing"
}
Then execute:
awk -f script.awk file.txt
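Note that the attempt in the question, if(NR%10), is true for every line whose number is not a multiple of 10, which is the opposite of what was wanted. As a minimal sketch, the step and the starting line can also be passed in as variables; the function name every_nth and the variable names n and start are made up for this illustration:

```shell
# every_nth N [START]: print lines START, START+N, START+2N, ... of stdin.
# every_nth, n and start are illustrative names, not part of awk itself.
every_nth() {
    awk -v n="$1" -v start="${2:-1}" 'NR >= start && (NR - start) % n == 0'
}

seq 25 | every_nth 10       # lines 1, 11, 21
seq 25 | every_nth 10 10    # lines 10, 20
```

Subtracting start before taking the remainder avoids the special case where the starting line equals the step.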

With GNU sed, you can do a lot of variations on this quite easily with the first~step address form (a GNU extension). For instance:
# Odd lines
sed -n 1~2p file
# Every tenth line (10, 20, 30, ...)
sed -n 10~10p file
# Every tenth line (1, 11, 21, ...)
sed -n 1~10p file
# First plus every tenth (1, 10, 20, 30, ...)
sed -n -e 1p -e 10~10p file
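Where a sed without the first~step address has to be used, the last variation can be approximated in awk. This is only a sketch, and first_and_every_nth is a hypothetical helper name:

```shell
# Print line 1 plus every Nth line (1, N, 2N, ...),
# roughly what "sed -n -e 1p -e N~Np" does in GNU sed.
first_and_every_nth() {
    awk -v n="$1" 'NR == 1 || NR % n == 0'
}

seq 35 | first_and_every_nth 10    # lines 1, 10, 20, 30
```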

Piece of cake: cat test.txt | awk 'NR % 10 == 1'

It's not (g)awk, but it'll work for this data, because the first column is the line number:
grep '^[[:digit:]]*0[[:blank:]]' myfile
should do the trick.

Doing it directly in the Command Prompt (Windows):
Put gawk.exe in the folder that contains the file, open a Command Prompt in that folder, and run
gawk "NR%n==x" oldfile.txt>newfile.txt
where n is the step (every nth line is printed) and x is the starting line, with x less than n (use x=0 to start at line n).
E.g. n=10 and x=1 prints lines 1, 11, 21, 31, 41, ... of the original file into the new file.
E.g. n=20 and x=5 prints lines 5, 25, 45, 65, ... of the original file into the new file.

Related

Is it possible to show all lines after match with grep/ripgrep? [duplicate]

Question: I'd like to print a single line directly following a line that contains a matching pattern.
My version of sed will not take the following syntax (it bombs out on +1p) which would seem like a simple solution:
sed -n '/ABC/,+1p' infile
I assume awk would be better to do multiline processing, but I am not sure how to do it.

Never use the word "pattern" in this context as it is ambiguous. Always use "string" or "regexp" (or in shell "globbing pattern"), whichever it is you really mean. See How do I find the text that matches a pattern? for more about that.
The specific answer you want is:
awk 'f{print;f=0} /regexp/{f=1}' file
or specializing the more general solution of the Nth record after a regexp (idiom "c" below):
awk 'c&&!--c; /regexp/{c=1}' file
The following idioms describe how to select a range of records given a specific regexp to match:
a) Print all records from some regexp:
awk '/regexp/{f=1}f' file
b) Print all records after some regexp:
awk 'f;/regexp/{f=1}' file
c) Print the Nth record after some regexp:
awk 'c&&!--c;/regexp/{c=N}' file
d) Print every record except the Nth record after some regexp:
awk 'c&&!--c{next}/regexp/{c=N}1' file
e) Print the N records after some regexp:
awk 'c&&c--;/regexp/{c=N}' file
f) Print every record except the N records after some regexp:
awk 'c&&c--{next}/regexp/{c=N}1' file
g) Print the N records from some regexp:
awk '/regexp/{c=N}c&&c--' file
I changed the variable name from "f" for "found" to "c" for "count" where
appropriate as that's more expressive of what the variable actually IS.
f is short for found. It's a Boolean flag that I set to 1 (true) when I find a string matching the regular expression regexp in the input (/regexp/{f=1}). Everywhere else that f appears on its own in a script, it is being tested as a condition, and when true it causes awk to execute its default action of printing the current record. So input records only get output after we see regexp and set f to 1/true.
c && c-- { foo } means "if c is non-zero then execute foo, and decrement c afterwards" (c-- is a post-decrement, so the condition tests the value c had before the decrement). So if c starts at 3, foo is executed and c drops to 2; on the next input line foo is executed again and c drops to 1; on the next input line foo is executed a third time and c drops to 0; on the line after that c is 0, which is a false condition, so foo is not executed and, because && short-circuits, c is not decremented any further. We write c && c-- instead of just testing c-- > 0 so that we can't run into a case with a huge input file where c hits zero and keeps getting decremented so often that it wraps around and becomes positive again.
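To see the counting idioms in action, here is a small sketch that wraps idioms c), e) and g) in shell functions and runs them on seq 10. The function names are invented for this example, and the regexp is passed with -v, so backslashes in it would be subject to awk's escape processing:

```shell
# nth_after REGEXP N: print only the Nth line after each match (idiom c).
nth_after() {
    awk -v re="$1" -v n="$2" 'c && !--c; $0 ~ re { c = n + 0 }'
}

# n_after REGEXP N: print the N lines after each match (idiom e).
n_after() {
    awk -v re="$1" -v n="$2" 'c && c--; $0 ~ re { c = n + 0 }'
}

# n_from REGEXP N: print N lines starting with the matching line (idiom g).
n_from() {
    awk -v re="$1" -v n="$2" '$0 ~ re { c = n + 0 } c && c--'
}

seq 10 | nth_after 3 2    # 5
seq 10 | n_after 3 2      # 4, 5
seq 10 | n_from 3 2       # 3, 4
```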

It's the line after that match that you're interested in, right? In sed, that could be accomplished like so:
sed -n '/ABC/{n;p}' infile
Alternatively, grep's -A option might be what you're looking for.
-A NUM, Print NUM lines of trailing context after matching lines.
For example, given the following input file:
foo
bar
baz
bash
bongo
You could use the following:
$ grep -A 1 "bar" file
bar
baz
$ sed -n '/bar/{n;p}' file
baz

I needed to print ALL lines after the pattern ( ok Ed, REGEX ), so I settled on this one:
sed -n '/pattern/,$p' # prints all lines after ( and including ) the pattern
But since I wanted to print all the lines AFTER ( and exclude the pattern )
sed -n '/pattern/,$p' | tail -n+2 # all lines after first occurrence of pattern
I suppose in your case you can add a head -1 at the end
sed -n '/pattern/,$p' | tail -n+2 | head -1 # prints line after pattern
And I really should include tlwhitec's comment in this answer (since their sed-only approach is more elegant than my suggestions; note that the 0,/regexp/ address form is a GNU sed extension):
sed '0,/pattern/d'
The above script deletes every line from the first one up to and including the first line that matches the pattern. All lines after that are printed.

awk Version:
awk '/regexp/ { getline; print $0; }' filetosearch

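If the pattern matches, append the next line to the pattern space, delete everything up to and including the embedded newline, then quit; as a side effect of quitting, sed prints what is left, which is the line after the match.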
sed '/pattern/ { N; s/.*\n//; q }; d'

Actually sed -n '/pattern/{n;p}' filename will fail if the pattern matches consecutive lines:
$ seq 15 |sed -n '/1/{n;p}'
2
11
13
15
The expected answers should be:
2
11
12
13
14
15
My solution is:
$ sed -n -r 'x;/_/{x;p;x};x;/pattern/!s/.*//;/pattern/s/.*/_/;h' filename
For example:
$ seq 15 |sed -n -r 'x;/_/{x;p;x};x;/1/!s/.*//;/1/s/.*/_/;h'
2
11
12
13
14
15
Explanation:
x;: at the start of each input line, use the x command to exchange the contents of the pattern space and the hold space.
/_/{x;p;x};: if the pattern space (which at this point holds what was in the hold space) contains _ (an indicator that the previous line matched the pattern), use x to swap the current line back into the pattern space, p to print it, and x to undo the swap.
x: restore the pattern space and the hold space to their original contents.
/pattern/!s/.*//: if the current line does NOT match the pattern, the NEXT line should not be printed, so use s/.*// to empty the pattern space.
/pattern/s/.*/_/: if the current line matches the pattern, the NEXT line should be printed, so use s/.*/_/ to replace the contents of the pattern space with _ as the indicator (the second command tests for it to decide whether the previous line matched).
h: overwrite the hold space with the contents of the pattern space; the hold space now contains _ if the current line matches the pattern, or is empty if it does not.
The fourth and fifth commands cannot be swapped: after s/.*/_/ the pattern space no longer matches /pattern/, so s/.*// would then be executed and would wipe the indicator.
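For comparison, the same "remember whether the previous line matched" logic can be sketched in awk with a single flag instead of the hold space. The names line_after and prev_matched are made up for this example:

```shell
# line_after REGEXP: print every line whose previous line matched REGEXP,
# including lines that follow a run of consecutive matches.
line_after() {
    awk -v re="$1" 'prev_matched { print } { prev_matched = ($0 ~ re) }'
}

seq 15 | line_after 1    # 2, 11, 12, 13, 14, 15
```

The flag is updated on every line, so a matching line that itself follows a match is both printed and remembered.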

This might work for you (GNU sed):
sed -n ':a;/regexp/{n;h;p;x;ba}' file
Use sed's grep-like option -n and, if the current line contains the required regexp, replace the current line with the next, copy that line to the hold space (HS), print the line, swap the pattern space (PS) for the HS and repeat.

Piping some greps can do it (it runs in POSIX shell and under BusyBox):
cat my-file | grep -A1 my-regexp | grep -v -- '--' | grep -v my-regexp
-v will show non-matching lines
-- is printed by grep to separate each match, so we skip that too

If you just want the next line after a pattern, this sed command will work
sed -n -e '/pattern/{n;p;}'
-n suppresses output (quiet mode);
-e denotes a sed command (not required in this case);
/pattern/ is a regex search for lines containing the literal combination of the characters pattern (use /^pattern$/ for a line consisting only of “pattern”);
n replaces the pattern space with the next line;
p prints;
For example:
seq 10 | sed -n -e '/5/{n;p;}'
Note that the above command will print a single line after every line containing pattern. If you just want the first one use sed -n -e '/pattern/{n;p;q;}'. This is also more efficient as the whole file is not read.
This strictly sed command will print all lines after your pattern.
sed -n '/pattern/,${/pattern/!p;}'
Formatted as a sed script this would be:
/pattern/,${
/pattern/!p
}
Here’s a short example:
seq 10 | sed -n '/5/,${/5/!p;}'
/pattern/,$ will select all the lines from pattern to the end of the file.
{} groups the next set of commands (c-like block command)
/pattern/!p; prints lines that don’t match pattern. Note that the ; is required in early versions of sed and in some non-GNU ones. This excludes the start of the range from the output; sed ranges are normally inclusive of both the start and the end of the range.
To exclude the end of range you could do something like this:
sed -n '/pattern/,/endpattern/{/pattern/!{/endpattern/d;p;}}'
/pattern/,/endpattern/{
/pattern/!{
/endpattern/d
p
}
}
/endpattern/d deletes the end line from the “pattern space” and the script restarts from the top with the next line, skipping the p command for that line.
Another pithy example:
seq 10 | sed -n '/5/,/8/{/5/!{/8/d;p}}'
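An awk flag gives a similar exclusive range. The sketch below is an approximation rather than an exact port of the sed script (for instance, it behaves differently when a line matches both regexps), and between and inside are names chosen for this example:

```shell
# between START_RE END_RE: print the lines strictly between a line
# matching START_RE and the next line matching END_RE.
between() {
    awk -v s="$1" -v e="$2" '$0 ~ e { inside = 0 } inside; $0 ~ s { inside = 1 }'
}

seq 10 | between 5 8    # 6, 7
```

Clearing the flag before the print and setting it after the print is what keeps both boundary lines out of the output.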
If you have GNU sed you can add the debug switch:
seq 5 | sed -n --debug '/2/,/4/{/2/!{/4/d;p}}'
Output:
SED PROGRAM:
/2/,/4/ {
/2/! {
/4/ d
p
}
}
INPUT: 'STDIN' line 1
PATTERN: 1
COMMAND: /2/,/4/ {
COMMAND: }
END-OF-CYCLE:
INPUT: 'STDIN' line 2
PATTERN: 2
COMMAND: /2/,/4/ {
COMMAND: /2/! {
COMMAND: }
COMMAND: }
END-OF-CYCLE:
INPUT: 'STDIN' line 3
PATTERN: 3
COMMAND: /2/,/4/ {
COMMAND: /2/! {
COMMAND: /4/ d
COMMAND: p
3
COMMAND: }
COMMAND: }
END-OF-CYCLE:
INPUT: 'STDIN' line 4
PATTERN: 4
COMMAND: /2/,/4/ {
COMMAND: /2/! {
COMMAND: /4/ d
END-OF-CYCLE:
INPUT: 'STDIN' line 5
PATTERN: 5
COMMAND: /2/,/4/ {
COMMAND: }
END-OF-CYCLE:

grep every other occurrence of a value

I have an output file with multiple occurrences of a value. E.g.:
value 123
value 111
value 456
value 111
value 789
value 111
If I use the following command, I can print all the occurrences of 'value':
grep 'value ' file_name
But I only want to print every other occurrence. Is there some way that I can edit my command to do this?

Use this Perl one-liner to filter the results of grep:
grep 'value' file_name | perl -ne 'print if $. % 2;'
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
$. : Current input line number.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlvar: Perl predefined variables

Or with awk:
grep value YourFile | awk 'NR%2'
value 123
value 456
value 789
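The grep stage can also be folded into awk by counting matches instead of line numbers. A minimal sketch, with every_other_match as an invented name and n as the match counter:

```shell
# every_other_match REGEXP: print the 1st, 3rd, 5th, ... line that matches.
# n is only incremented when the line matches, because && short-circuits.
every_other_match() {
    awk -v re="$1" '$0 ~ re && ++n % 2'
}

printf 'value 123\nvalue 111\nvalue 456\nvalue 111\nvalue 789\nvalue 111\n' |
    every_other_match '^value '
```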

grep every fourth line in .fastq

I am working on a linux machine using bash.
My question is, how can I skip lines in the query file using grep?
I am working with a large ~16Gb .fastq file named example.fastq which has the following format.
example.fastq
#SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.3 3/1
ATCCANAATGATGTGTTGCTCTGGAGGTACAGAGATAACGTCAGCTGGAATAGTTTCCCCTCACAG
+
AAAAA#EE6E6EEEEEE6EEEEAEEEEEEEEEEE//EAEEEEEAAEAEEEAE/EAEEA6/EEA<E/
#SRR6750041.4 4/1
ACACCNAATGCTCTGGCCTCTCAAGCACGTGGATTATGCCAGAGAGGCCAGAGCATTCTTCGTACA
+
/AAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE/E/<//AEA/EA//E//
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I need to extract the lines containing the strings of interest (#SRR6750041.2 and #SRR6750041.5), which are stored in a bash array called IDarray, as well as the 3 lines following each match. The following grep command allows me to do this:
for ID in "${IDarray[@]}";
do
grep -F -A 3 "$ID " example.fastq
done
This correctly outputs the following.
#SRR6750041.2 2/1
CTATANTATTCTATATTTATTCTAGATAAAAGCATTCTATATTTAGCATATGTCTAGCAAAAAAAA
+
AAAAA#EE6EEEEEEEEEEEEAAEEAEEEEEEEEEEEE/EAE/EAE/EA/EAEAAAE//EEAEAA6
#SRR6750041.5 5/1
CAGCANTTCTCGCTCACCAACTCCAAAGCAAAAGAAGAAGAAAAAGAAGAAAGATAGAGTACGCAG
+
AAAAA#EEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEAEEEAEEE<EE/E
I am looking for ways to speed this process up... one way would be to reduce the number of lines searched by grep by restricting the search to lines beginning with # or skipping lines that can not possibly contain the match #SRR6750041.1 such as lines 2,3,4 and 6,7,8 etc. Is there a way to do this using grep? Alternative methods are also welcome!

Here are some thoughts with examples. For test purposes I created a mini version of your test case: example_mini.fastq is 145 MB and IDarray has 999 elements (IDs of interest).
Your version performs like this (more than 2 minutes in user space):
$ time for i in "${arr[@]}"; do grep -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 3m16.310s
user 2m9.645s
sys 0m53.092s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
The first improvement is to make grep stop after the first match with -m 1; I am assuming that each ID of interest is unique. This roughly halves the work and takes approximately 1 minute in user space:
$ time for i in "${arr[@]}"; do grep -m 1 -A 3 "${i}" example_mini.fastq; done 1> out.txt
real 1m19.325s
user 0m55.844s
sys 0m21.260s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
Both of these solutions scale linearly with the number of IDs: grep is called n times on the huge file.
Now let's do it in awk in a single run. I export IDarray to an input file so that everything can be processed in one pass. I load the big file into an associative array keyed by ID and then loop once through your list of IDs to look them up. This is a generic scenario in which you can define the regexp and the number of lines to print after it. The cost is a single pass through the file plus N lookups. This is roughly 20 times faster (user time drops from over 2 minutes to under 7 seconds):
$ for i in "${arr[@]}"; do echo "$i"; done > IDarray.txt
$ time awk '
(FNR==NR) && (linesafter-- > 0) { arr[interest]=arr[interest] RS $0; next; }
(FNR==NR) && /^#/ { interest=$1; arr[interest]=$0; linesafter=3; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m7.044s
user 0m6.628s
sys 0m0.307s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
As in your title: if you can confirm that every fourth line is an ID line and that the three lines after it are the ones to print, you can simplify to this and speed things up by roughly another 20%:
$ for i in "${arr[@]}"; do echo "$i"; done > IDarray.txt
$ time awk '
(FNR==NR) && (FNR%4==1) { interest=$1; arr[interest]=$0; next; }
(FNR==NR) { arr[interest]=arr[interest] RS $0; next; }
(FNR!=NR) && arr[$1] { print(arr[$1]); }
' example_mini.fastq IDarray.txt 1> out.txt
real 0m5.944s
user 0m5.593s
sys 0m0.242s
$ md5sum out.txt
8f199a78465f561fff3cbe98ab792262 out.txt
On a 1.5 GB file with 999 elements to search for, the time is:
real 1m4.333s
user 0m59.491s
sys 0m3.460s
So, extrapolating from these timings, on my machine your 15 GB example with 10k elements would take approximately 16 minutes in user space to process.
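Both awk scripts above keep the whole FASTQ file in memory. A possible variation, sketched here under the assumptions that every record is exactly 4 lines and that the ID file is not empty, is to read the small ID list first and stream the large file. The names fastq_pick, want and keep are made up for this example, and it has not been timed against the versions above:

```shell
# fastq_pick IDS_FILE FASTQ_FILE
# Assumes 4-line records, one ID per line in IDS_FILE, and a non-empty IDS_FILE.
# Usage: fastq_pick IDarray.txt example_mini.fastq > out.txt
fastq_pick() {
    awk '
        NR == FNR    { want[$1]; next }        # first file: remember the IDs
        FNR % 4 == 1 { keep = ($1 in want) }   # header line: decide for this record
        keep                                   # print every line of a kept record
    ' "$1" "$2"
}
```

Records come out in the order they appear in the FASTQ file rather than in the order of the ID list, so the checksum of the output can differ from the ones shown above even when the same records are selected.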

grep invert match on two files

I have two text files containing one column each, for example -
File_A File_B
1 1
2 2
3 8
If I do grep -f File_A File_B > File_C, I get File_C containing 1 and 2. I want to know how to use grep -v on two files so that I can get the non-matching values, 3 and 8 in the above example.
Thanks.

You can also use comm, if your version allows an empty output delimiter
$ # -3 means suppress lines common to both input files
$ # by default, tab character appears before lines from second file
$ comm -3 f1 f2
3
8
$ # change it to empty string
$ comm -3 --output-delimiter='' f1 f2
3
8
Note: comm requires sorted input, so use comm -3 --output-delimiter='' <(sort f1) <(sort f2) if they are not already sorted
You can also pass the common lines found by grep as input to grep -v. Tested with GNU grep; some versions might not support all of these options
$ grep -Fxf f1 f2 | grep -hxvFf- f1 f2
3
8
-F option to match strings literally, not as regex
-x option to match whole lines only
-h to suppress file name prefix
f- (that is, -f -) to read the patterns from stdin instead of from a file

awk 'NR==FNR{a[$0]=$0;next} !($0 in a) {print a[(FNR)], $0}' f1 f2
3 8
To understand what NR and FNR mean, look at what they print:
awk '{print NR,FNR}' f1 f2
1 1
2 2
3 3
4 4
5 1
6 2
7 3
8 4
The condition NR==FNR is used to read the data from the first file, since NR and FNR are equal only while the first file is being read.
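One caveat: the one-liner above prints a[(FNR)], which looks up f1's values by the line number in f2, so it yields 3 8 here only because the value 3 happens to sit on line 3. A sketch that does not depend on that coincidence follows; sym_diff, a and common are names invented for this example, and the lines unique to the first file come out in no particular order:

```shell
# sym_diff FILE_A FILE_B: print lines that occur in only one of the two files.
# Assumes FILE_A is not empty (otherwise NR == FNR is also true for FILE_B).
sym_diff() {
    awk '
        NR == FNR { a[$0]; next }        # first file: remember every line
        ($0 in a) { common[$0]; next }   # second file: line also in first file
        { print }                        # second file only
        END { for (k in a) if (!(k in common)) print k }   # first file only
    ' "$1" "$2"
}
```

With the sample files this prints 8 and then 3, each on its own line.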

With GNU diff command (to compare files line by line):
diff --suppress-common-lines -y f1 f2 | column -t
The output (the left column contains lines from f1, the right column lines from f2):
3 | 8
-y, --side-by-side - output in two columns

Egrep - find 0 (zero) and ignore previous line

I am trying hard to get the output I want.
Current Output:
###Server1###
2
###Server2###
0
###Server3###
5
###Server4###
0
Required Output:
###Server1###
2
###Server3###
5
All I am looking for is to drop any line that contains 0 (zero) anywhere in it, together with the line before it. I am using the bash shell.

This is a possible approach:
$ grep -B 1 "^\s*[1-9]$" file
###Server1###
2
--
###Server3###
5
To get rid of the group separator, we can also do:
$ grep --no-group-separator -B 1 "^\s*[1-9]$" file
###Server1###
2
###Server3###
5
Explanation
Instead of using grep -v to find the inverse, I think it is easier to look for the lines having a single digit value not being 0. This is done with the "^\s*[1-9]$" expression, that allows spaces before the digit.
With -B 1 we make it print also the line before the matched one.
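The expression above only matches single-digit values, so a count such as 10 would be dropped as well. If multi-digit counts are possible, one alternative is to pair each header with its value in awk. This is a sketch that assumes the input strictly alternates header line and numeric value line; drop_zero_pairs and header are names chosen for this example:

```shell
# drop_zero_pairs: read alternating "header" / "value" lines from stdin and
# print only the pairs whose value is non-zero.
drop_zero_pairs() {
    awk 'NR % 2 { header = $0; next }
         $1 + 0 != 0 { print header; print }'
}

printf '###Server1###\n2\n###Server2###\n0\n###Server3###\n10\n' | drop_zero_pairs
```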

Code for GNU sed:
sed '$!N;/\s*\b0\b\s*/d' file
$ sed '$!N;/\s*\b0\b\s*/d' file
###Server1###
2
###Server3###
5
