I have a log file that, simplified, looks like this (the real file has so many columns that addressing each column directly is not feasible):
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
Most fields appear only once per id, but not all of them (see time, for example).
What I want to achieve is to have one line per id that combines the columns of all records with the same id:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user
For columns that appear more than once per id, I don't care which value is returned, as long as it comes from a record with the same id.
I would exploit GNU AWK's arrays of arrays in the following way. Let file.txt contain
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
then
awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i+=1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt
output
id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
Explanation: First I tell GNU AWK that both the field separator (FS) and the output field separator (OFS) are a comma, and I use the cols variable to hold the number of columns. The first row is simply printed. For each column of every following row, I check whether arr[id][field number] already holds a truthy value, using the so-called ternary operator: if it does, I keep it; otherwise I set it to the current field. In END I use nested for loops: for each id I assign each stored value to the corresponding field of the current line, so GNU AWK rebuilds the record from them, which I then print. Disclaimer: this solution assumes that every line has the same number of columns, that this number is known in advance, and that any output order is acceptable. If that does not hold, you will need to develop your own, superior solution.
(tested in gawk 4.2.1)
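As a variation on the gawk answer above, here is a minimal sketch that relaxes two of its assumptions: it takes the column count from the header line (NF) instead of a hard-coded cols, and it prints ids in order of first appearance. It uses a plain SUBSEP-keyed array rather than arrays of arrays, so it should also run on a POSIX awk. The function name merge_by_id is made up for illustration, and "first non-empty value wins" is an assumption that matches the "I don't care which one" requirement.

```shell
#!/bin/sh
# Hypothetical helper: merge rows sharing an id (column 1), keeping the
# first non-empty value seen in each column.
merge_by_id() {
  awk '
    BEGIN { FS = OFS = "," }
    NR == 1 { ncols = NF; print; next }     # header: remember column count
    {
      # record each id once, in order of first appearance
      if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
      for (i = 1; i <= ncols; i++)
        if ($i != "" && !(($1, i) in val))  # keep first non-empty value
          val[$1, i] = $i
    }
    END {
      for (k = 1; k <= n; k++) {
        id = order[k]
        line = id
        for (i = 2; i <= ncols; i++)
          line = line OFS val[id, i]        # missing values stay empty
        print line
      }
    }
  ' "$@"
}

# Usage with the sample data from the question
merged=$(merge_by_id <<'EOF'
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
EOF
)
printf '%s\n' "$merged"
```

Comparing against the empty string rather than testing truthiness means a field whose value is 0 is kept instead of being treated as missing.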
You can use the Ruby CSV parser to group the rows and then reduce the repeated entries:
ruby -r csv -e '
data=CSV.parse($<.read, **{:col_sep=>","})
puts data[0].to_csv
data[1..].group_by { |row| row[0] }.
each{ |k, arr|
puts arr.transpose().map{ |ta| ta.find { |x| !x.nil? }}.to_csv
}
' file
Prints:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
This assumes that the valid value is the first non-nil, non-blank one encountered in each column.
I would like to straight up merge multiple pipe-delimited files using Awk. Every example I have found on here is several times more complicated than what I am trying to do. I have several text files formatted identically, and just want to merge them together, like a UNION ALL in SQL. I don't need to join on a column, and I don't care about duplicate rows.
Concatenating the files should work for you then:
cat file1.txt file2.txt file3.txt > finalFile.txt
No need for awk.
That is a job for cat (see @mjuarez's answer), but if you really want to use awk for it:
$ awk 1 files* > another_file
(g)awk '{print}' file1 file2 file* >> outputfile
This is in case you want to use awk only.
James' answer is way better;
however, I still want to show what I came up with: the most basic use of awk. :)
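One practical difference between cat and the awk one-liners shows up when an input file lacks a final newline: cat copies bytes verbatim, so the last line of one file is glued to the first line of the next, whereas awk prints every record followed by a newline. A small sketch, with made-up file names and contents:

```shell
#!/bin/sh
# Hypothetical inputs: a.txt has no trailing newline, b.txt does.
dir=$(mktemp -d)
printf '1|x\n2|y' > "$dir/a.txt"
printf '3|z\n'    > "$dir/b.txt"

# cat concatenates the bytes as they are: "2|y" and "3|z" end up on one line
cat "$dir/a.txt" "$dir/b.txt" > "$dir/cat_out.txt"

# awk prints each record followed by a newline, so the lines stay separate
awk 1 "$dir/a.txt" "$dir/b.txt" > "$dir/awk_out.txt"

cat_lines=$(wc -l < "$dir/cat_out.txt" | tr -d ' ')
awk_lines=$(wc -l < "$dir/awk_out.txt" | tr -d ' ')
echo "cat: $cat_lines lines, awk: $awk_lines lines"

rm -rf "$dir"
```

If every input file ends with a newline, the two approaches produce identical output and cat remains the simpler choice.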
I have the following file format:
AAA-12345~TRAX~~AAAAAAAAAAAA111111ETC
AAA-12345~RCV~~BBBBBBBBBBBB222222ETC
BBB-78900~TRAX~~CCCCCCCCCCCC444444ETC
BBB-78900~RCV~~DDDDDDDDDDDD555555ETC
CCC-65432~TRAX~~HHHHHHHHHHHH888888ETC
All lines come in pairs, and the two lines of each pair are identical up to the first ~.
Sometimes there are orphans, like the last record, which has TRAX but no RCV.
The question is: using bash utilities like sed or awk, or commands like grep or cut, how do I find and display only the orphans?
Using awk:
awk -F~ '{a[$1]+=1} END{for(key in a) if(a[key]==1){print key}}'
This just uses the first field (split on the tilde) as the key of an array and increments the value for that key each time it is found. Once the file has been read, it iterates over the array and prints the keys whose value is exactly 1.
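If the whole orphan line is wanted rather than just its key, one possible variation is to also remember the last line seen for each key and print that line when the count is exactly 1. This is only a sketch: the function name find_orphans is made up for illustration, and the order in which for (key in ...) visits keys is unspecified in awk.

```shell
#!/bin/sh
# Hypothetical helper: print full lines whose first ~-separated field
# occurs exactly once in the input.
find_orphans() {
  awk -F'~' '
    { count[$1]++; line[$1] = $0 }   # count each key, remember its latest line
    END {
      for (key in count)
        if (count[key] == 1)
          print line[key]
    }
  ' "$@"
}

# Usage with the sample data from the question
orphans=$(printf '%s\n' \
  'AAA-12345~TRAX~~AAAAAAAAAAAA111111ETC' \
  'AAA-12345~RCV~~BBBBBBBBBBBB222222ETC' \
  'BBB-78900~TRAX~~CCCCCCCCCCCC444444ETC' \
  'BBB-78900~RCV~~DDDDDDDDDDDD555555ETC' \
  'CCC-65432~TRAX~~HHHHHHHHHHHH888888ETC' | find_orphans)
printf '%s\n' "$orphans"
```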
I am searching for a list of names matching the pattern "japconfig". There are many files inside one directory. Those files contain names like ixdf_japconfig_FZ.txt, ixdf_japconfig_AB.txt, ixdf_japconfig_RK.txt, ixdf_japconfig_DK.txt, and ixdf_japconfig_LY.txt, but I don't know in advance which names follow the word japconfig. I need to list all such names. My files also contain ixdf_dbconfig.txt, but I don't want to print ixdf_dbconfig.txt in the output.
Each of my files contains one ixdf_japconfig_*.txt and ixdf_dbconfig.txt, where * can be FZ, AB, RK, DK, or LY. I can achieve the desired result by using grep and then awk to cut the columns, but I don't want to use awk or any other command. I want to achieve this using grep only.
I need to print the names below.
ixdf_japconfig_FZ.txt
ixdf_japconfig_AB.txt
ixdf_japconfig_RK.txt
ixdf_japconfig_DK.txt
ixdf_japconfig_LY.txt
I don't want to print ixdf_dbconfig.txt.
When I tried the command grep -oh "ixdf_japconfig.*.txt" *.dat, I got the output below.
ixdf_japconfig_FZ.txt ixdf_dbconfig.txt
ixdf_japconfig_AB.txt ixdf_dbconfig.txt
ixdf_japconfig_RK.txt ixdf_dbconfig.txt
ixdf_japconfig_DK.txt ixdf_dbconfig.txt
ixdf_japconfig_LY.txt ixdf_dbconfig.txt
Here the first column is the one I want, but I don't want to print the second column. How can I change my command to print only the first column?
grep -oh ixdf_japconfig_...txt *.dat
(Your .*. was matching most of the line.)
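To see why the original pattern over-matched, here is a small sketch on one made-up input line (the real .dat contents are not shown in the question, so the surrounding words are an assumption). Because .* is greedy, the match runs on to the last place in the line where txt can still follow; bounding the wildcard, either to a fixed width or to non-space characters, stops it at the first name.

```shell
#!/bin/sh
# Made-up sample line; the real .dat layout is an assumption.
line='load ixdf_japconfig_FZ.txt ixdf_dbconfig.txt done'

# Greedy .* extends to the last "txt" on the line
greedy=$(printf '%s\n' "$line" | grep -o 'ixdf_japconfig.*.txt')

# Exactly two characters after the underscore, then a literal ".txt"
fixed=$(printf '%s\n' "$line" | grep -o 'ixdf_japconfig_..\.txt')

# Any run of non-space characters, in case the suffix length varies
nospace=$(printf '%s\n' "$line" | grep -o 'ixdf_japconfig_[^ ]*\.txt')

printf 'greedy:  %s\nfixed:   %s\nnospace: %s\n' "$greedy" "$fixed" "$nospace"
```

Quoting the pattern keeps the shell from treating it as a glob, and escaping the dot before txt makes it match only a literal period.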
I have a folder with a lot of text files and would like to get a list of all words in that folder that contain a certain string. So, e.g., there are words of the form 'IND:abc', 'IND:cde', ... and I am looking for a way to get a list of all words starting with IND:, so something like:
[IND:abc, IND:cde, IND:...]
Can grep do that?
Cheers,
Chris
grep -ho 'IND:\w\+' * | sort | uniq
-h suppresses the filenames so that you only get the text. -o prints only the matching part of the text. If you want to see the duplicates, just remove the sort and uniq.
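The \w and \+ in that pattern are GNU extensions to basic regular expressions. As a sketch of a slightly more portable spelling, the same search can be written with -E and a POSIX character class, and sort -u can stand in for sort | uniq. The directory and file contents below are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample files in a throwaway directory
dir=$(mktemp -d)
printf 'see IND:abc and IND:cde\n'      > "$dir/one.txt"
printf 'IND:abc again, plus IND:xyz.\n' > "$dir/two.txt"

# -h: no file names, -o: only the matched text, -E: extended regex
# [[:alnum:]_]+ matches one or more letters, digits, or underscores
words=$(grep -hoE 'IND:[[:alnum:]_]+' "$dir"/*.txt | sort -u)
printf '%s\n' "$words"

rm -rf "$dir"
```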