Count the number of occurrences of a string in a large file

I have a large 900MB XML file, and the entire file is a single line: there are no line breaks between tags. I need to count the occurrences of a particular tag in that file.
I tried
grep -o '<start tag>' filename | wc -l
but I get a "grep: line too long" error.
How can I get around this?

Here's a bit of a hack:
perl -ne 'BEGIN { $/ = ">"; $c = 0 } $c++ if /<start tag>/; END { print "$c\n" }' filename
The idea is to loop over "lines" that are terminated by > instead of \n (newline). That should avoid "line too long" errors.
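The same record-splitting idea can be sketched without Perl by turning every > into a newline first, so grep only ever sees short lines. This is a minimal sketch, not a tested solution for the 900MB file: count_tag is a hypothetical helper name, and it assumes the tag is a plain name with no attributes and no regex metacharacters.

```shell
# Count opening tags like <item> in input that has no line breaks.
# Assumption: $1 is a plain tag name (no attributes, no regex metacharacters).
count_tag() {
    # After tr, each former ">" ends a line, so "<item>" becomes a line
    # that ends in "<item". grep -c counts those lines.
    tr '>' '\n' | grep -c -- "<$1\$"
}

# Small demo; <items> is deliberately not counted.
printf '<root><item>a</item><item>b</item><items>c</items></root>\n' | count_tag item
```

For a real file the call would be count_tag item < filename.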

Just use awk:
awk -F'<start tag>' '{print NF-1}' file
If that fails, you can do this with GNU awk (for multi-char RS):
awk -v RS='<start tag>' 'END{print NR-1}' file
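If GNU awk is not available, a single-character RS works in any POSIX awk. The sketch below splits records on > and compares the end of each record with the tag text as a literal string, so no regex escaping is needed. count_tag_awk is a hypothetical name, and the sketch assumes the tag carries no attributes.

```shell
# Count opening tags like <item> using a single-character record separator.
# Assumption: the tag has no attributes, so it appears exactly as "<name>".
count_tag_awk() {
    awk -v RS='>' -v tag="<$1" '
        # A record whose last characters equal the tag text is one occurrence.
        length($0) >= length(tag) &&
        substr($0, length($0) - length(tag) + 1) == tag { n++ }
        END { print n + 0 }
    '
}

printf '<root><item>a</item><item>b</item></root>\n' | count_tag_awk item
```

Because each record is only as long as the text between two > characters, awk never has to hold the whole line in memory.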

Related

How do I capture (grep/awk/sed) a substring value from a string in shell

I'm new to scripting. I have one file containing a single line. How do I capture the summer.fruit value (i.e. "mango") from the line below and store it in a variable?
.. abc.dfe summer.fruit=mango summer.vegetable=potato projects.blah ...

If your grep supports Perl-compatible regular expressions (PCRE):
summerfruit=$(grep -Po 'summer\.fruit=\K[^ ]+' file)
The \K drops the already-matched summer.fruit= from the output, and [^ ]+ matches one or more non-space characters after the =.
Without PCRE:
summerfruit=$(grep -o 'summer\.fruit=[^ ]*' file | grep -o '[^=]*$')
With sed:
summerfruit=$(sed -n 's/.*summer\.fruit=\([^ ]*\).*/\1/p' file)
With awk:
summerfruit=$(awk '{
for (i=1;i<=NF;i++)
if ($i ~ /^summer\.fruit=/){ sub(/^[^=]*=/,"",$i); print $i; exit }
}' file)
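If the line is already in a shell variable, plain parameter expansion can do the same job without starting grep, sed or awk. This is only a sketch: get_value is a hypothetical helper, and it assumes values never contain spaces.

```shell
# Print the value that follows "key=" in a space-separated line.
# Assumption: values contain no spaces. Returns 1 if the key is absent.
get_value() {
    key=$1
    line=$2
    case $line in
        *"$key="*)
            rest=${line#*"$key="}        # drop everything up to and including "key="
            printf '%s\n' "${rest%% *}"  # keep the text before the first space
            ;;
        *)
            return 1
            ;;
    esac
}

line='.. abc.dfe summer.fruit=mango summer.vegetable=potato projects.blah ...'
summerfruit=$(get_value summer.fruit "$line")
echo "$summerfruit"
```

Quoting "$key=" inside the expansion keeps the dot literal, so no escaping is needed as it is in the regex versions.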

Cutting a length of specific string with grep

Let's say we have a string "test123" in a text file.
How do we cut out only "test12"? Or, if there is other garbage after "test123", such as test123x19853, how do we cut out "test123x"?
I tried grep -a "test123.\{1,4\}" testasd.txt and so on, but just can't get it right.
I also looked for examples, but never found what I'm looking for.

expr:
kent$ x="test123x19853"
kent$ echo $(expr "$x" : '\(test.\{1,4\}\)')
test123x

What you need is -o, which prints only the matched part:
$ echo "test123x19853"|grep -o "test.\{1,4\}"
test123x
$ echo "test123x19853"|grep -oP "test.{1,4}"
test123x
-o, --only-matching show only the part of a line matching PATTERN
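To see how the upper bound of the interval controls what gets cut out, here is a small sketch. take_after_test is a hypothetical helper, and it assumes a grep that supports -o (GNU and BSD grep do).

```shell
# Print "test" plus at most $2 of the characters that follow it in string $1.
# Assumption: grep supports -o.
take_after_test() {
    printf '%s\n' "$1" | grep -o "test.\{1,$2\}"
}

take_after_test test123x19853 2   # prints test12
take_after_test test123x19853 4   # prints test123x
```

The interval is greedy, so grep takes as many characters as the bound allows, which is why 2 yields "test12" and 4 yields "test123x".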

If you are OK with awk, try the following (note that this looks for a run of letters followed by a run of digits; it doesn't limit the digits to 4 or 5).
echo "test123x19853" | awk 'match($0,/[a-zA-Z]+[0-9]+/){print substr($0,RSTART,RLENGTH)}'
If you want only 1 to 4 digits after the first run of letters, try the following (my awk is an old version, so I use --re-interval; you can drop it if you have a recent version).
echo "test123x19853" | awk --re-interval 'match($0,/[a-zA-Z]+[0-9]{1,4}/){print substr($0,RSTART,RLENGTH)}'

Use awk to parse and modify every CSV field

I need to parse and modify each field of a CSV header line for a dynamic sqlite create table statement. Below is what works from the command line, with the appropriate output:
echo ",header1,header2,header3"| awk 'BEGIN {FS=","}; {for(i=2;i<=NF;i++){printf ",%s text ", $i}; printf "\n"}'
,header1 text ,header2 text ,header3 text
Well, it breaks when it is run from within a bash shell script. I got it to work by writing the output to a file like below:
echo $optionalHeaders | awk 'BEGIN {FS=","}; {for(i=2;i<=NF;i++){printf ",%s text ", $i}; printf "\n"}' > optionalHeaders.txt
This sucks! There are a lot of examples that show how to parse/modify specific Nth fields. This issue requires each field to be modified. Is there a more concise and elegant Awk one liner that can store its contents to a variable rather than writing to a file?

sed is usually the right tool for simple substitutions on a single line. Take your pick:
$ echo ",header1,header2,header3" | sed 's/[^,][^,]*/& text/g'
,header1 text,header2 text,header3 text
$ echo ",header1,header2,header3" | sed -r 's/[^,]+/& text/g'
,header1 text,header2 text,header3 text
The second one above requires GNU sed, whose -r option switches from BREs to EREs. You can do the same in awk using gsub() if you prefer:
$ echo ",header1,header2,header3" | awk '{gsub(/[^,]+/,"& text")}1'
,header1 text,header2 text,header3 text

I found the problem and it was me... I forgot to echo the contents of the variable to the Awk command. Brianadams' comment was so simple that it forced me to re-look at my code and find the problem! Thanks!
I am ok with resolving this but if anyone wants to propose a more concise and elegant Awk one liner - that would be cool.

You can try the following:
#! /bin/bash
header=",header1,header2,header3"
newhead=$(awk 'BEGIN {FS=OFS=","}; {for(i=2;i<=NF;i++) $i=$i" text"}1' <<<"$header")
echo "$newhead"
with output:
,header1 text,header2 text,header3 text

Instead of modifying fields one by one, another option is with a simple substitution:
echo ",header1,header2,header3" | awk '{gsub(/[^,]+/, "& text", $0); print}'
That is, replace a sequence of non-comma characters with text appended.
Another alternative would be replacing the commas, but due to the irregularities of your header line (first comma must be left alone, no comma at the end), that's a bit less easy:
echo ",header1,header2,header3" | awk '{gsub(/,/, " text,", $0); sub(/^ text,/, "", $0); print $0 " text"}'
Btw, the rough equivalent of the two commands in sed:
echo ",header1,header2,header3" | sed -e 's/[^,]\{1,\}/& text/g'
echo ",header1,header2,header3" | sed -e 's/\(.\),/\1 text,/g' -e 's/$/ text/'
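Coming back to the original goal (keeping the result in a variable and feeding it into a create table statement), the substitution can be wrapped like this. It is a sketch only: build_create_table, the table name stats and the leading id integer column are made up for illustration and are not from the question.

```shell
# Build a CREATE TABLE statement from a header string such as ",h1,h2".
# Assumptions: $1 is a hypothetical table name; "id integer" is a made-up
# fixed first column that the leading comma of the header attaches to.
build_create_table() {
    columns=$(printf '%s\n' "$2" | sed 's/[^,][^,]*/& text/g')
    printf 'CREATE TABLE %s (id integer%s);\n' "$1" "$columns"
}

optionalHeaders=",header1,header2,header3"
sql=$(build_create_table stats "$optionalHeaders")
echo "$sql"
```

Command substitution captures the sed output directly, so no temporary file is needed.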

Use grep to report back only line numbers

I have a file that possibly contains bad formatting (in this case, the occurrence of the pattern \\backslash). I would like to use grep to return only the line numbers where this occurs (as in, the match was here, go to line # x and fix it).
However, there doesn't seem to be a way to print the line number (grep -n) and not the match or line itself.
I can use another regex to extract the line numbers, but I want to make sure grep cannot do it by itself. grep -no comes closest, I think, but still displays the match.

try:
grep -n "text to find" file.ext | cut -f1 -d:

If you're open to using AWK:
awk '/textstring/ {print FNR}' textfile
In this case, FNR is the line number. AWK is a great tool when you're looking at grep|cut, or any time you're looking to take grep output and manipulate it.

All of these answers require grep to generate the entire matching lines, then pipe it to another program. If your lines are very long, it might be more efficient to use just sed to output the line numbers:
sed -n '/pattern/=' filename
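If you want the numbers on one line (handy for pasting into an editor command), the sed output can be joined with paste. A small sketch: matching_lines is a hypothetical helper, and it assumes the pattern contains no / character, since / is used as the sed delimiter.

```shell
# Print the numbers of all lines in file $2 that match BRE $1, comma-separated.
# Assumption: the pattern contains no "/" (it is the sed address delimiter).
matching_lines() {
    sed -n "/$1/=" "$2" | paste -s -d , -
}

# Demo on a throwaway file.
tmp=$(mktemp)
printf 'ok\nbad one\nok\nbad two\n' > "$tmp"
matching_lines bad "$tmp"   # prints 2,4
rm -f "$tmp"
```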

Bash version (this yields only the first matching line number):
lineno=$(grep -n "pattern" filename)
lineno=${lineno%%:*}

I recommend the answers with sed and awk for just getting the line number, rather than using grep to get the entire matching line and then removing that from the output with cut or another tool. For completeness, you can also use Perl:
perl -nE 'say $. if /pattern/' filename
or Ruby:
ruby -ne 'puts $. if /pattern/' filename

Using only grep (requires a grep with -P support):
grep -n "text to find" file.ext | grep -Po '^[^:]+'

If grep is also printing file names (for example when searching several files, or with -H), the line number is the second colon-separated field, not the first:
grep -Hn "text to find" file.txt | cut -f2 -d:

To count the number of lines that match the pattern:
grep -c "pattern" in_file.ext
To print the matching lines:
sed -n '/pattern/p' file.ext
To display the line numbers on which the pattern was matched:
grep -n "pattern" file.ext | cut -f1 -d:

Inserting a matched string from previous line to the current line using sed or awk

I have a CSV file that shows the statistics for links on a half-hourly basis. The link name only appears on the 00:00 line.
link1,0:00,0,0,0,0
,00:30,0,0,0,0
,01:00,0,0,0,0
,01:30,0,0,0,0
,02:00,0,0,0,0
,02:30,0,0,0,0
,03:00,0,0,0,0
,03:30,0,0,0,0
,23:30,0,0,0,0
....
....
link2,00:00,0,0,0,0
How do I copy the link name to every other line until the link name is different, using sed or awk?

With awk, just keep track of the last seen non-empty link name, and always use that.
awk -F, -v OFS=, '$1 != "" { link=$1 } { $1 = link; print $0 }'
Omitting the ellipses, this gives:
link1,0:00,0,0,0,0
link1,00:30,0,0,0,0
link1,01:00,0,0,0,0
link1,01:30,0,0,0,0
link1,02:00,0,0,0,0
link1,02:30,0,0,0,0
link1,03:00,0,0,0,0
link1,03:30,0,0,0,0
link1,23:30,0,0,0,0
link2,00:00,0,0,0,0
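The awk command above reads standard input when no file name is given. To try it on a small sample, it can be wrapped as below. fill_links is a hypothetical name and the sample rows are shortened for illustration.

```shell
# Copy the last non-empty first field down into rows where it is empty.
fill_links() {
    awk -F, -v OFS=, '$1 != "" { link = $1 } { $1 = link; print }'
}

fill_links <<'EOF'
link1,0:00,0,0,0,0
,00:30,0,0,0,0
link2,00:00,0,0,0,0
,00:30,1,2,3,4
EOF
```

Assigning to $1 makes awk rebuild the record with OFS, which is why OFS has to be set to a comma as well as FS.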

This is a simpler job with awk, but if you want to use sed:
sed -e '/^[^,]/{h;s/,.*//;x};/^,/{G;s/^\(.*\)\n\(.*\)/\2\1/}'
Below is a commented version in sed script file format that can be run with sed -f script:
# For lines not beginning with a ',', save what precedes the first ',' in the hold space and print the original line.
/^[^,]/{
h
s/,.*//
x}
# For lines beginning with a ',', put what has been saved in the hold space at the beginning of the pattern space and print.
/^,/{
G
s/^\(.*\)\n\(.*\)/\2\1/}

You can also do this in pure bash without starting an external process. That avoids process start-up cost on small files, although for large files awk or sed is usually faster than a shell read loop:
while IFS=, read -r v1 v2; do
    if [[ -n $v1 ]]; then
        link=$v1
    fi
    printf "%s,%s\n" "$link" "$v2"
done < file
