Combine log lines with awk - parsing

I have a log file that, simplified, looks like this (it has enough columns that directly addressing the columns is not feasible):
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
Most of the fields appear only once per id, but not all of them (see time, for example).
What I want to achieve is to have one line per id that combines the columns of all records with the same id:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user
For the columns that appear more than once in each id, I don't care which one is returned as long as it relates to a record with the same id.

I would exploit GNU AWK 2D arrays in the following way. Let file.txt content be
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
then
awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i+=1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt
output
id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
Explanation: First I inform GNU AWK that both the field separator (FS) and the output field separator (OFS) are a comma, and I use the cols variable to hold the number of columns you wish to have. The first row I simply print; for the following rows, for each column I check whether there is already some truthy value in arr[id][field number] (using the so-called ternary operator): if yes I keep it, otherwise I set it to the current field. In END I use nested for loops: for each id I set the value of each of its fields in the current line, so GNU AWK builds a string from these, which I can then print. Disclaimer: this solution assumes the number of columns is equal in all lines and known a priori, and that any order of output is acceptable. If this does not hold, see the variant sketched below.
(tested in gawk 4.2.1)
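If those assumptions do not hold, here is a minimal sketch in portable (POSIX) awk that derives the width from NF, keeps the first non-empty value per column (so a literal 0 survives, unlike the truthy check above), and prints ids in first-seen order; it assumes the same file.txt as before:
awk '
BEGIN { FS = OFS = "," }
NR == 1 { print; next }                            # header passes through untouched
!($1 in seen) { seen[$1] = 1; order[++n] = $1 }    # remember first-seen id order
{
    for (i = 1; i <= NF; i++)
        if (arr[$1, i] == "") arr[$1, i] = $i      # keep the first non-empty value
    if (NF > nf[$1]) nf[$1] = NF                   # widest row for this id wins
}
END {
    for (k = 1; k <= n; k++) {
        id = order[k]
        line = arr[id, 1]
        for (i = 2; i <= nf[id]; i++)
            line = line OFS arr[id, i]
        print line
    }
}' file.txt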

You can use the ruby csv parser to group then reduce the repeated entries:
ruby -r csv -e '
data = CSV.parse($<.read, col_sep: ",")
puts data[0].to_csv
data[1..].group_by { |row| row[0] }.
  each { |k, arr|
    puts arr.transpose.map { |ta| ta.find { |x| !x.nil? } }.to_csv
  }
' file
Prints:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
This assumes the valid data is the first non-nil, non-blank value encountered for that particular column.

Related

Use both columns and previously defined values in fitnesse ColumnFixture

The rows in my test-table all repeat the same values, except for two columns which are different for each row. I would like to use values I defined earlier for the repeating rows.
The fixture uploads files to FTP; each row in the test-table now has username, password, host and so on, and these are always the same. The name of the file is different.
If your tests use Slim you can use constructor parameters to define the repeated values in the first (i.e. header) row of your table. In that case you only have to define the file names in the table's rows.
If your table is a 'decision table' based on a 'scenario' you can also supply repeated parameters in the header row (using a 'having' syntax). More details can be found in FitNesse's own acceptance tests. For instance:
|scenario |Division _ _ _|numerator, denominator, quotient?|
|setNumerator |#numerator |
|setDenominator|#denominator |
|$quotient= |quotient |
|Division |having|numerator|9|
|denominator|quotient? |
|3 |3.0 |
|2 |4.5 |
Another option, but this seems less appropriate when the values are really the same for ALL rows, is to use a baseline decision table where the first row defines values for all columns and subsequent rows only define the altered values.
You can use FitNesse variables:
!define username {bob}
!define password {secret}
|myfixture|
|username|password|other|stuff|
|${username}|${password}|a|b|
|${username}|${password}|c|d|
|${username}|${password}|x|y|
The answer by Fried Hoeben works for Slim; the following answer is for fit:
If your fixture is a child of Fixture, then you can define extra parameters by adding extra columns to the header row.
|!-UploadFileToFtps-! |ftpPassword=${password} | ftpUserName=${userName}|
|host |ftpDir |localFile |result? |
|${ftpHost}|${ftpSrc}|${folder1}${file1}.xlsx |File '${folder1}${file1}.xlsx' successfully uploaded|
|${ftpHost}|${ftpSrc}|${folder2}${file2}.xlsx |File '${folder2}${file2}.xlsx' successfully uploaded|
|${ftpHost}|${ftpSrc}|${folder2}${file3}.pdf |File '${folder2}${file3}.pdf' successfully uploaded |
You can access the values in those columns with getArgs() which retrieves a String Array.
I use key-value pairs separated by '=', which enables me to use named parameters. Otherwise I would have to reference the parameters in order, which I think is wrong.

How to diff records within same file

I have following file format:
AAA-12345~TRAX~~AAAAAAAAAAAA111111ETC
AAA-12345~RCV~~BBBBBBBBBBBB222222ETC
BBB-78900~TRAX~~CCCCCCCCCCCC444444ETC
BBB-78900~RCV~~DDDDDDDDDDDD555555ETC
CCC-65432~TRAX~~HHHHHHHHHHHH888888ETC
All lines come in pairs, and each pair is identical up to the first ~.
Sometimes there are orphans, like the last record, which has TRAX but no RCV.
The question is: using bash utilities like sed or awk, or commands like grep or cut, how do I find and display only the orphans?
Using awk:
awk -F~ '{a[$1]+=1} END{for(key in a) if(a[key]==1){print key}}' file
This just loads the first field (split on tilde) as the key of an array and increments the value for that key each time it is found. When the file is finished, it iterates over the array and prints the keys whose value is just 1.
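Run against the sample above (assuming it is saved as file):
awk -F~ '{a[$1]+=1} END{for(key in a) if(a[key]==1){print key}}' file
it prints only the unpaired key:
CCC-65432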

Rails query by number of digits in field

I have a Rails app with a clients table. The clients table has a phone field; its data type is string. I'm using PostgreSQL. I would like to write a query which selects all clients that have a phone value containing more than 10 digits. phone does not have a specific format:
+1 781-658-2687
+1 (207) 846-3332
2067891111
(345)222-777
123.234.3443
etc.
I've been trying variations of the following:
Client.where("LENGTH(REGEXP_REPLACE(phone,'[^\d]', '')) > 10")
Any help would be great.
You almost have it, but you're missing the 'g' option to regexp_replace. From the fine manual:
The regexp_replace function provides substitution of new text for substrings that match POSIX regular expression patterns. [...] The flags parameter is an optional text string containing zero or more single-letter flags that change the function's behavior. Flag i specifies case-insensitive matching, while flag g specifies replacement of each matching substring rather than only the first one.
So regexp_replace(string, pattern, replacement) behaves like Ruby's String#sub whereas regexp_replace(string, pattern, replacement, 'g') behaves like Ruby's String#gsub.
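For instance, a quick check from the shell (assuming a local PostgreSQL and psql, and borrowing a sample number from the question) shows the difference:
# without 'g': only the first non-digit (the leading +) is replaced
psql -c "SELECT regexp_replace('+1 781-658-2687', '[^0-9]', '')"
# => 1 781-658-2687
# with 'g': every non-digit is replaced, leaving the 11 digits
psql -c "SELECT regexp_replace('+1 781-658-2687', '[^0-9]', '', 'g')"
# => 17816582687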
You'll also need to get a \d through your double-quoted Ruby string all the way down to PostgreSQL so you'll need to say \\d in your Ruby. Things tend to get messy when everyone wants to use the same escape character.
This should do what you want:
Client.where("LENGTH(REGEXP_REPLACE(phone, '[^\\d]', '', 'g')) > 10")
# --------------------------------------------^^---------^^^
To count the digits of a single number on the Ruby side, try this:
phone_number.gsub(/[^\d]/, '').length

How to aggregate multiple rows into one in CSV?

I have the following problem:
I have a CSV file which looks like this:
1,12
1,15
1,18
2,10
2,11
3,20
And I would like to parse it somehow to get this:
1,12,15,18
2,10,11
3,20
Do you have any solution? Thanks!
Here is one solution for you.
This first part just sets up the example for testing; for the second part of the script, I am assuming you already have a file with values.
$path = "$env:TEMP\csv.txt"
$data =#"
1,12
1,15
1,18
2,10
2,11
3,20
"#
$data | Set-Content $path
This should be all you need:
$path = "$env:TEMP\csv.txt"
$results = @{}
foreach($line in (Get-Content $path))
{
    $split = $line -split ','
    $rowid = $split[0]
    $data  = $split[1]
    if(-not($results.$rowid))
    {
        $results.$rowid = $rowid
    }
    $results.$rowid += "," + $data
}
$results.values | Sort-Object
Your original dataset does not need to be sorted for this one to work. I slice the data up and insert it into a hashtable.
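For the sample data, the script prints:
1,12,15,18
2,10,11
3,20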
I don't know your exact code requirements, but I will try to write some logic which may help you!
CSV means a text file, which I can read into a string or an array.
If one looks at the above CSV data as a single whitespace-separated stream, there is a common pattern, i.e. after each pair there is a space in between.
So my parsing will consist of 2 phases:
parse with ' ', i.e. a single space, and insert into an array (say elements);
then parse each element of elements with ',', i.e. a comma, and save into another array (say details), where odd indexes will contain the left-hand values and even indexes will contain the right-hand values.
So next, while printing or using, skip the odd index if you have an existing value.
Hope this helps...
Satyaranjan,
thanks for your answer! To clarify: I don't have any code requirements; I can use any language to achieve the result. The point is to take the unique values from the first position (1, 2, 3) and put all related numbers on the right (1 - 12, 15 and 18, etc.). It is something like the GROUP_CONCAT function in MySQL, but unfortunately I don't have such a function here, so I am looking for some workaround (see the awk sketch below).
Hope it is more clear now. Thanks
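Since any language is acceptable, here is a minimal GROUP_CONCAT-style sketch in awk; it assumes the rows are already grouped by the first column, as in the sample, and that the input is saved as file.csv (a hypothetical name):
awk -F, '
$1 != prev { if (NR > 1) print out; prev = $1; out = $1 }  # a new key starts a new output line
           { out = out "," $2 }                            # append the value from this row
END        { if (NR) print out }                           # flush the final group
' file.csv
For the sample input it prints:
1,12,15,18
2,10,11
3,20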

Parsing a CSV file with rows of varying lengths

I am calling a web service that returns a comma-separated dataset with varying columns and multiple text-qualified rows (the first row denotes the column names). I need to insert each row into a database while concatenating the fields that vary.
The data is returned like so
"Email Address","First Name","Last Name", "State","Training","Suppression","Events","MEMBER_RATING","OPTIN_TIME","CLEAN_CAMPAIGN_ID"
"scott#example.com","Scott","Staph","NY","Campaigns and activism","Social Media","Fundraiser",1,"2012-03-08 17:17:42","Training"
There can be up to 60 columns between State and MEMBER_RATING, and the data in those fields is to be concatenated and inserted into one database column. The first four fields and the last three fields in the list will always be the same. I'm unsure of the best way to tackle this.
I am not sure if this solution fits your needs; I hope so. It's a Perl script that joins all fields but the first four and the last three with ' - ' (a hyphen surrounded by spaces). It uses a non-standard module, Text::CSV_XS, that must be installed using CPAN or a similar tool.
Content of infile:
"Email Address","First Name","Last Name","State","Training","Suppression","Events","MEMBER_RATING","OPTIN_TIME","CLEAN_CAMPAIGN_ID"
"scott#example.com","Scott","Staph","NY","Campaigns and activism","Social Media","Fundraiser",1,"2012-03-08 17:17:42","Training"
Content of script.pl:
use warnings;
use strict;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    allow_whitespace => 1,
});

open my $fh, q[<], $ARGV[0] or die qq[Open: $!\n];

while ( my $row = $csv->getline( $fh ) ) {
    # join the middle fields (index 4 up to the fourth-from-last) with ' - '
    my $concat = join q[ - ], (@$row)[4 .. @$row - 4];
    # replace those fields with the single concatenated one
    splice @$row, 4, scalar @$row - (3 + 4), $concat;
    $csv->print( \*STDOUT, $row );
    print qq[\n];
}
Run it like:
perl script.pl infile
With following output:
"Email Address","First Name","Last Name",State,"Training - Suppression - Events",MEMBER_RATING,OPTIN_TIME,CLEAN_CAMPAIGN_ID
scott@example.com,Scott,Staph,NY,"Campaigns and activism - Social Media - Fundraiser",1,"2012-03-08 17:17:42",Training
