How to aggregate multiple rows into one in CSV? - parsing

I have the following problem:
I have a CSV file, which looks like this:
1,12
1,15
1,18
2,10
2,11
3,20
And I would like to parse it somehow to get this:
1,12,15,18
2,10,11
3,20
Do you have any solution? Thanks!

Here is one solution for you.
This first part just sets up the example for testing; the second part of the script assumes you already have a file with the values.
$path = "$env:TEMP\csv.txt"
$data =#"
1,12
1,15
1,18
2,10
2,11
3,20
"#
$data | Set-Content $path
This should be all you need:
$path = "$env:TEMP\csv.txt"
$results = @{}
foreach($line in (Get-Content $path))
{
    # Split each line into its id and value
    $split = $line -split ','
    $rowid = $split[0]
    $data = $split[1]
    # First time we see this id, start its row with the id itself
    if(-not($results.$rowid))
    {
        $results.$rowid = $rowid
    }
    # Append this value to the accumulated row for the id
    $results.$rowid += "," + $data
}
$results.values | Sort-Object
Your original dataset does not need to be sorted for this to work. I slice each line up and accumulate the values in a hashtable keyed by the first column.

I don't know your exact code requirements, so I will describe some logic that may help!
CSV means a text file, which can be read into a string or an array.
If you look at the above CSV data as a single stream, there is a common pattern: there is a space between each pair.
So my parsing would run in two phases (sketched below):
parse on ' ' (a single space) and insert the pieces into an array (say, elements);
then parse each entry of elements on ',' (the comma) and save the pieces into another array (say, details), where the odd indexes contain the left-hand values and the even indexes contain the right-hand values.
Then, while printing or using the data, skip the odd index when you already have an existing value.
Hope this helps...
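A minimal Python sketch of that two-phase logic, assuming (as described above) that the pairs arrive as one space-separated line:
data = "1,12 1,15 1,18 2,10 2,11 3,20"  # assumed input shape
elements = data.split(" ")               # phase 1: split on single spaces
details = []                             # alternating left-hand keys and right-hand values
for element in elements:
    left, right = element.split(",")     # phase 2: split each pair on the comma
    details.extend([left, right])
seen = set()
row = []
for i in range(0, len(details), 2):      # walk key/value positions, skipping repeated keys
    key, value = details[i], details[i + 1]
    if key not in seen:
        if row:
            print(",".join(row))
        seen.add(key)
        row = [key]
    row.append(value)
if row:
    print(",".join(row))                 # prints 1,12,15,18 then 2,10,11 then 3,20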

Satyaranjan,
thanks for your answer! To clarify - I don't have any code requirements; I can use any language to achieve the results. The point is to take the unique values from the first position (1, 2, 3) and put all related numbers on the right (1 - 12, 15 and 18, etc.). It is something like the GROUP_CONCAT function in MySQL - but unfortunately I don't have such a function available, so I am looking for a workaround.
Hope it is clearer now. Thanks
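Since any language will do, here is a minimal Python sketch of that GROUP_CONCAT-style grouping (the file name input.csv is just a placeholder):
import csv

groups = {}
with open("input.csv", newline="") as f:
    for rowid, value in csv.reader(f):   # each line is "id,value"
        groups.setdefault(rowid, []).append(value)

for rowid, values in groups.items():     # dicts keep insertion order (Python 3.7+)
    print(",".join([rowid] + values))
For the sample data this prints 1,12,15,18, then 2,10,11, then 3,20.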

Related

Combine log lines with awk

I have a log file that, simplified, looks like this (it has enough columns that directly addressing the columns is not feasible):
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
Most of the fields appear only once by id but not all of them (see time, for example).
What I want to achieve is to have one line per id that combines the columns of all records with the same id:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user
For the columns that appear more than once in each id, I don't care which one is returned as long as it relates to a record with the same id.
I would exploit GNU AWK 2D arrays in the following way. Let file.txt content be
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
then
awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i+=1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt
output
id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
Explanation: First I inform GNU AWK that both the field separator (FS) and the output field separator (OFS) are the comma, and I use the cols variable to hold the number of columns you wish to have. The first row I simply print. For the following rows, for each column I check whether there is already some truthy value in arr[id][number of field] using the so-called ternary operator: if yes, I keep it, otherwise I set it to the current field. In END I use nested for loops: for each id I set the value of each of its fields in the current line, so GNU AWK builds the string from these, which I can then print. Disclaimer: this solution assumes that the number of columns is equal in all lines, that the number of columns is known a priori, and that any order of output is acceptable. If this does not hold, develop your own superior solution.
(tested in gawk 4.2.1)
You can use the Ruby CSV parser to group, then reduce the repeated entries:
ruby -r csv -e '
  data = CSV.parse($<.read, col_sep: ",")
  puts data[0].to_csv
  data[1..].group_by { |row| row[0] }.
    each { |k, arr|
      puts arr.transpose.map { |ta| ta.find { |x| !x.nil? } }.to_csv
    }
' file
Prints:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
This assumes the valid data is the first non-nil value encountered for that particular column.

GSheets - How to query a partial string

I am currently using this formula to get all the data from everyone whose first name is "Peter", but my problem is that if someone is called "Simon Peter", that data also shows up in the formula output.
=QUERY('Data'!1:1000,"select * where B contains 'Peter'")
I know that for other formulas this issue is resolved by adding an * to the string, but the same logic does not apply to the QUERY formula.
Does someone know the correct syntax or a workaround?
How about classic SQL syntax:
=QUERY('Data'!1:1000,"select * where B like 'Peter %'")
The LIKE keyword allows the use of the wildcard % to match any run of characters next to the known parts of the searched string.
See the query reference: developers.google.com/chart/interactive/docs/querylanguage. You could split first name and last name into separate columns, then only search for first names exactly equal to 'Peter'. You may also want to handle case (where lower(B) contains 'peter') or whitespace in unexpected places (e.g., trim()). You could also search only for values that start with Peter by using starts with instead of contains, or use a regular expression with matches. – Brian D
It seems that for my case using 'starts with' is a perfect fit. Thank you!
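For reference, the starts with variant of the formula would look like this (a sketch based on the query language reference):
=QUERY('Data'!1:1000,"select * where B starts with 'Peter'")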

beam.io.WriteToText add new line after each value - can it be removed?

My pipeline looks similar to the following:
parDo returning a list per processed line | beam.io.WriteToText
beam.io.WriteToText adds a new line after each list element. How can I remove this newline and have the values separated by commas, so that I can build a CSV file?
Any help is very appreciated!
Thanks,
eilalan
To remove the newline char, you can use this:
beam.io.WriteToText(append_trailing_newlines=False)
But for adding commas between your values, there's no out-of-the-box feature on TextIO to convert to CSV. You can, however, check this answer for a user-defined PTransform that can be applied to your PCollection to convert dictionary data into CSV data.
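As a rough sketch of that idea, assuming each element of your PCollection is a list of values (the names and output path here are illustrative):
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([[1, 12, 15, 18], [2, 10, 11]])           # stand-in for your parDo output
     | beam.Map(lambda row: ",".join(str(v) for v in row))   # one CSV line per element
     | beam.io.WriteToText("output", file_name_suffix=".csv"))
Once each element is a single CSV line, the trailing newline that WriteToText appends is exactly what you want, so append_trailing_newlines can stay at its default.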

PostgreSql + Query Statement having \r in between the attributes

Suppose we have a textarea into which we put an example string. The textbox contains:
Earth is revolving around the Sun.
But at the time of saving, I pressed the Enter key after "around". Now the statement in the textbox is:
Earth is revolving around
the Sun
Now, where Enter was pressed, a \r is stored in the database. I am trying to fetch the data but am unable to, because my query is of this type:
SELECT * FROM "volume_factors" WHERE lower(volume_notes) like E'atest\\r\\n 100'
The actual data stored in the database field:
atest\r
100
Please suggest how to solve the issue. I have tried gsub to replace it, but nothing changes.
search_text_array[1] = search_text_array[1].gsub('\\r\\n','\r\n')
Thanks in advance!
Try this:
update volume_factors set volume_notes = regexp_replace(volume_notes, '\r\n', ' ');
That replaces CRLF with one space for data that is already in the database. You can use PostgreSQL's psql to run it.
To prevent new data containing CRLF from entering the database, you should handle it in the application. If you use Ruby's gsub, do not use single quotes; use double quotes so that \n is recognized, like this:
thestring.gsub("\n", " ")
Here we can replace \r\n with % to fetch the data.
The search query will look like this:
SELECT * FROM "volume_factors" WHERE lower(volume_notes) like E'atest% 100'
The gsub function:
search_text_array[1] = search_text_array[1].gsub('\\r\\n','%')

Concatenating a Text in front of Individual Database Records with Tcl

In short, I am currently using the following code to pull records from multiple tables in an SQLite Db and insert them into a single combobox ($SearchBar):
set SrchVals1 [db eval {SELECT DISTINCT Stitle From Subcontract Order By Stitle ASC}]
set SrchVals2 [db eval {...
set SrchVals3 ...
set SrchValsALL [concat $SrchVals1 $SrchVals2 $SrchVals3]
$SearchBar configure -value $SrchValsALL
For the variable "SrchVals1", I am trying to figure out a way to prepend the text "Sub: " to each individual record in SrchVals1. For example, if SrchVals1 shows the following records in the combobox:
First Title
Second Title
Third Title
I would like to concatenate so that the records in the combobox look like this:
Sub: First Title
Sub: Second Title
Sub: Third Title
I understand that I might have to use a foreach statement; however, I am having no luck writing one that adds "Sub: " in front of each record, as opposed to just one. This seems like something that should be pretty easy, but I cannot figure it out.
Does anyone know how I can achieve these results?
Thank you,
DFM
You're right. The foreach command is the right way to do it. Here's how:
set SrchValsALL {}
foreach value [concat $SrchVals1 $SrchVals2 $SrchVals3] {
    # Prepend the label to every record before loading the combobox
    lappend SrchValsALL "Sub: $value"
}
$SearchBar configure -value $SrchValsALL
