I have the following lines:
123;123;#rss
123;123;#site #design #rss
123;123;#rss
123;123;#rss
123;123;#site #design
and I need to count how many times each tag appears. I do the following:
grep -Eo '#[a-z].*' ./1.txt | tr "\ " "\n" | uniq -c
i.e. first select only the tags from the strings, then break them onto separate lines and count them.
output:
1 #rss
1 #site
1 #design
3 #rss
1 #site
1 #design
instead of the expected:
2 #site
4 #rss
2 #design
It seems that the problem is non-printable characters that make the counting incorrect. Or is it something else? Can anyone suggest a correct solution?
uniq -c works only on sorted input.
Also, you can drop the tr by changing the regex to #[a-z]*.
grep -Eo '#[a-z]*' ./1.txt | sort | uniq -c
prints
2 #design
4 #rss
2 #site
as expected.
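The reason: uniq only collapses adjacent duplicate lines, so the unsorted stream yields a separate count for each run of a tag. A quick illustration, faking the grep output with printf:
printf '#rss\n#site\n#rss\n' | uniq -c
1 #rss
1 #site
1 #rss
printf '#rss\n#site\n#rss\n' | sort | uniq -c
2 #rss
1 #site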
It can be done with a single GNU awk command:
awk -v RS='#[a-zA-Z]+' 'RT {++freq[RT]} END {for (i in freq) print freq[i], i}' file
2 #site
2 #design
4 #rss
Or else a grep + awk solution:
grep -iEo '#[a-z]+' file |
awk '{++freq[$1]} END {for (i in freq) print freq[i], i}'
2 #site
2 #design
4 #rss
Using awk as an alternative:
awk -F '[ ;]' '{ for(i=3;i<=NF;i++) { map[$i]++ } } END { for (i in map) { print map[i]" "i } }' file
Set the field separator to a space or a ";" (quoted so the shell does not glob-expand the bracket expression). Then loop from the third field to the last field (NF), adding each field to the array map, with the field as the index and an incremented counter as the value. At the end of the file processing, loop through the map array and print the indexes and values.
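With the sample input this prints (the traversal order of for (i in map) is unspecified in awk, so the lines may come out in any order):
2 #site
2 #design
4 #rss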
With your shown samples, could you please try the following. Written and tested in GNU awk.
awk '
{
  while(match($0,/#[^ ]*/)){
    count[substr($0,RSTART,RLENGTH)]++
    $0=substr($0,RSTART+RLENGTH)
  }
}
END{
  for(key in count){
    print count[key],key
  }
}' Input_file
Output will be as follows.
2 #site
2 #design
4 #rss
Explanation: a detailed, annotated version of the above.
awk '                                  ##Starting awk program from here.
{
  while(match($0,/#[^ ]*/)){           ##Looping for as long as the regex #[^ ]* matches the current line; this also stops cleanly on lines without a tag.
    count[substr($0,RSTART,RLENGTH)]++ ##Creating count array with the matched substring as index and increasing its value by 1 each time.
    $0=substr($0,RSTART+RLENGTH)       ##Putting the rest of the line, after the match, back into the current line.
  }
}
END{                                   ##Starting END block of this program from here.
  for(key in count){                   ##Using a for loop to go through count here.
    print count[key],key               ##Printing the value of count (whose index is key) and the key here.
  }
}
' Input_file                           ##Mentioning Input_file name here.
$ cut -d';' -f3 file | tr ' ' '\n' | sort | uniq -c
2 #design
4 #rss
2 #site
I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appears in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter immediately to the left of the matched character?
expected.txt
t
h
r
With the shown samples, could you please try the following awk.
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: a detailed explanation of the above.
awk '                                 ##Starting the awk program from here.
/e/{                                  ##If the current line contains an e, then do the following.
  print substr($0,index($0,"e")-1,1)  ##Printing 1 character of the line, starting one position before the index of the first e.
}
' Input_file                          ##Mentioning the Input_file name here.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
grep -oP '.(?=e)' letter.txt
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
Note that the greedy .* means the captured character is the one before the last e on a line; for the sample words, which each contain a single e, that makes no difference.
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find the letter (or an empty string) before each e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexps with -nE.
Focus only on lines that meet the requirements, i.e. that contain a character before an e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. The solution will print each such character on a separate line.
To print all such characters from an input line together on a single line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use an empty string as FS to split the input into individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
Excluding "e" at the beginning in the for loop.
edited
empty string if e is the first character in the word.
For example, this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v
A single entry spans multiple lines, and entries are separated by two blank lines.
Each entry has to be made into a single line followed by a delimiter (;).
Sample Input:
Name:Sid
ID:123
Name:Jai
ID:234
Name:Arun
ID:12
Tried replacing the blank lines with cat test.cap | tr -s '[:space:]' ';'
Output:
Name:Sid;ID:123;Name:Jai;ID:234;Name:Arun;ID:12;
Expected Output:
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;
The same happens with xargs.
I've used a sed command as well, but it only joined two lines into one, whereas one entry is 132 lines and there are 1000 such entries in one file.
You may use
awk 'BEGIN { FS = "\n"; RS = "\n\n"; ORS=";" } { gsub(/\n/, "", $0); print }' file | sed 's/;;*$//' > output.file
Output:
Name:SidID:123;Name:JaiID:234;Name:ArunID:12
Notes:
FS = "\n" will set field separators to a newline`
RS = "\n\n" will set your record separators to double newline
gsub(/\n/, "", $0) will remove all newlines from a found record
sed 's/;;*$//' will remove the trailing ; added by awk
Could you please try the following.
awk 'NF{val=(val?$0~/^ID/?val $0";":val $0:$0)} END{print val}' Input_file
Output will be as follows.
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;
Explanation: adding an explanation of the above code too.
awk '                                     ##Starting awk program here.
NF{                                       ##Checking that the current line is not null, i.e. has some value in it.
  val=(val?$0~/^ID/?val $0";":val $0:$0)  ##Appending the current line to val; if the line starts with ID, a semicolon is appended after it as well. On the first non-empty line val is simply set to the line.
}
END{                                      ##Starting END section of awk here.
  print val                               ##Printing the value of variable val here.
}
' Input_file                              ##Mentioning Input_file name here.
This might work for you (GNU sed):
sed -r '/./{N;s/\n//;H};$!d;x;s/.//;s/\n|$/;/g' file
If it is not a blank line, append the following line and remove the newline between them. Append the result to the hold space and, if it is not the end of the file, delete the current line. At the end of the file, swap in the hold space, remove the first character (which will be a newline), and then replace every newline, and the end of the last line, with a semi-colon.
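With the sample input this produces:
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;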
I have a program whose output is a summary file with a header and a few columns of results.
I want to show only two values, the file name and the best period prediction, and I use this command:
program input_file | gawk 'NR==2 {print $3}; NR==4 {print $2}'
As the result I obtain the two values in one column, on two lines. What do I have to do to get this result in one line, two columns?
You could use:
program input_file | gawk 'NR==2 {heading = $3}; NR==4 {print heading " = " $2}'
This saves the value in $3 on line 2 in variable heading and prints the heading and the value from column 2 when it reads line 4.
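For illustration, suppose the summary looks like this (a made-up layout; the only thing that matters is that the file name is field 3 of line 2 and the period is field 2 of line 4):
=== Summary ===
Input file: lightcurve.dat
Columns: phase flux
Period: 2.345 days
The command would then print:
lightcurve.dat = 2.345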
I have log entries that come in pairs of two lines each. I have to parse the first line to extract a number and check whether it is greater than 5000. If that number is greater than 5000, I need to return the second line, which will also be parsed to retrieve an ID.
I know how to grep all of the info and to parse it. I don't know how to make the grep ignore things if they are below a particular value. Note that I am not committed to using grep if some other means like awk/sed can be substituted.
Raw data (the two lines of one pair, separated here for clarity). The target of my grep is the number 5001 following "credits extracted = "; if this is over 5000 then I want to return the number "12345" from the second line.
--------------------------
2012-03-16T23:26:12.082358 0x214d000 DEBUG ClientExtractAttachmentsPlayerMailTask for envelope 22334455 finished: credits extracted = 5001, items extracted count = 0, status = 0. [Mail.heomega.mail.Mail](PlayerMailTasks.cpp:OnExtractAttachmentsResponse:944)
2012-03-16T23:26:12.082384 0x214d000 DEBUG Mail Cache found cached mailbox for: 12345 [Mail.heomega.mail.Mail](MailCache.cpp:GetCachedMailbox:772)
Snippets --------------------------
-- Find the number of credits extracted, without the comma noise:
grep "credits extracted = " fileName.log | awk '{print $12}' | awk -F',' '{print $1}'
-- Find the second line's ID no matter what the value of credits extracted is:
grep -A1 "credits extracted = " fileName.log | grep "cached mailbox for" | awk -F, '{print $1}' | awk '{print $10}'
-- An 'if' statement symbolizing the logic I need:
v_CredExtr=5001; v_ID=12345; if [ $v_CredExtr -gt 5000 ]; then echo $v_ID; fi
You can do everything with a single awk filter, I believe:
#!/usr/bin/awk -f
/credits extracted =/ {
credits = substr($12, 1, length($12) - 1) + 0
if (credits > 5000)
show_id = 1
next
}
show_id == 1 {
print $10
show_id = 0
}
Obviously, you can put the whole AWK script in a shell string inside a script, even multiline. I showed it here as its own script for clarity.
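For example, the same filter inlined into a shell command (fileName.log as in your snippets):
awk '/credits extracted =/ { credits = substr($12, 1, length($12) - 1) + 0; if (credits > 5000) show_id = 1; next } show_id == 1 { print $10; show_id = 0 }' fileName.log
With your sample pair of lines this prints 12345.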
P.S: Please notify when it works ;-)
Trying to merge some data that I have. The input would look like so:
foo bar
foo baz boo
abc def
abc ghi
And I would like the output to look like:
foo bar baz boo
abc def ghi
I have some ideas using some arrays in a shell script, but I was looking for a more elegant or quicker solution.
How about join?
file="file"
join -a1 -a2 <(sort "$file" | sed -n 1~2p) <(sort "$file" | sed -n 2~2p)
The seds there are just splitting the file into odd and even lines.
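With the sample input, sort orders the lines as abc def, abc ghi, foo bar, foo baz boo; the odd lines go to join's first input and the even lines to its second, so joining on the first field prints:
abc def ghi
foo bar baz boo
Note this relies on every key occurring exactly twice; with a different number of repeats the odd/even split would misalign.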
While pixelbeat's answer works, I can't say I'm very enthused about it. I think I'd use awk something like this:
{ for (i=2; i<=NF; i++) { lines[$1] = lines[$1] " " $i;} }
END { for (i in lines) printf("%s%s\n", i, lines[i]); }
This shouldn't require pre-sorting the data, and should work fine regardless of the number or length of the fields (short of overflowing memory, of course). Its only obvious shortcoming is that its output is in an arbitrary order. If you need it sorted, you'll need to pipe the output through sort (but getting back to the original order would be something else).
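For example, as a one-liner with the suggested sort on the output (infile stands in for your data file):
awk '{ for (i=2; i<=NF; i++) lines[$1] = lines[$1] " " $i } END { for (i in lines) printf("%s%s\n", i, lines[i]) }' infile | sort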
An awk solution
awk '
{key=$1; $1=""; x[key] = x[key] $0}
END {for (key in x) {print key x[key]}}
' filename
If the length of the first field is fixed, you can use uniq with the -w option. Otherwise you might want to use awk (warning: untested code):
awk '
BEGIN { last = "" }
{
    if ($1 == last) {
        for (i = 2; i <= NF; i++) printf " %s", $i
    } else {
        printf "%s%s", (NR > 1 ? "\n" : ""), $0
        last = $1
    }
}
END { print "" }'
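With the sample input from the question this prints:
foo bar baz boo
abc def ghi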
Pure Bash, for truly alternating lines:
infile="paste.dat"
toggle=0
while read -a line ; do
    if [ $toggle -eq 0 ] ; then
        echo -n "${line[@]}"
    else
        unset line[0]          # remove first element
        echo " ${line[@]}"
    fi
    ((toggle=1-toggle))
done < "$infile"
Based on fgm's pure Bash snippet:
text='
foo bar
foo baz boo
abc def
abc ghi
'
count=0
oneline=""
firstword=""
while IFS=" " read -a line ; do
    [[ ${#line[@]} -eq 0 ]] && continue    # skip blank lines
    let count++
    if [[ $count -eq 1 ]]; then
        firstword="${line[0]}"
        oneline="${line[@]}"
    else
        if [[ "$firstword" == "${line[0]}" ]]; then
            unset line[0]                  # remove first word of line
            oneline="${oneline} ${line[@]}"
        else
            printf "%s\n" "${oneline}"
            oneline="${line[@]}"
            firstword="${line[0]}"
        fi
    fi
done <<< "$text"
printf "%s\n" "${oneline}"                 # flush the last entry