Parsing a CSV file with rows of varying lengths

I am calling a web service that returns a comma-separated dataset with a varying number of columns and text-qualified fields (the first row holds the column names). I need to insert each row into a database while concatenating the columns that vary.
The data is returned like so
"Email Address","First Name","Last Name", "State","Training","Suppression","Events","MEMBER_RATING","OPTIN_TIME","CLEAN_CAMPAIGN_ID"
"scott#example.com","Scott","Staph","NY","Campaigns and activism","Social Media","Fundraiser",1,"2012-03-08 17:17:42","Training"
There can be up to 60 columns between State and Member_Rating, and the data in those fields is to be concatenated and inserted into a single database column. The first four fields and the last three fields in the list will always be the same. I'm unsure of the best way to tackle this.

I am not sure if this solution fits your needs. I hope so. It's a Perl script that joins all fields except the first four and the last three with " - " (a hyphen surrounded by spaces). It uses a non-core module, Text::CSV_XS, which must be installed from CPAN or a similar tool.
Content of infile:
"Email Address","First Name","Last Name","State","Training","Suppression","Events","MEMBER_RATING","OPTIN_TIME","CLEAN_CAMPAIGN_ID"
"scott#example.com","Scott","Staph","NY","Campaigns and activism","Social Media","Fundraiser",1,"2012-03-08 17:17:42","Training"
Content of script.pl:
use warnings;
use strict;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    allow_whitespace => 1,
});

open my $fh, q[<], $ARGV[0] or die qq[Open: $!\n];

while ( my $row = $csv->getline( $fh ) ) {
    # Join all fields except the first four and the last three with ' - '.
    my $concat = join q[ - ], (@$row)[4 .. @$row - 4];
    # Replace that middle slice with the single concatenated field.
    splice @$row, 4, scalar @$row - (3 + 4), $concat;
    $csv->print( \*STDOUT, $row );
    print qq[\n];
}
Run it like:
perl script.pl infile
With following output:
"Email Address","First Name","Last Name",State,"Training - Suppression - Events",MEMBER_RATING,OPTIN_TIME,CLEAN_CAMPAIGN_ID
scott@example.com,Scott,Staph,NY,"Campaigns and activism - Social Media - Fundraiser",1,"2012-03-08 17:17:42",Training
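
If Perl is not an option, the same idea can be sketched with Python's standard csv module. This is an illustration of the approach only, not the poster's code; the database insert is left as a CSV print to stdout:

import csv
import sys

with open(sys.argv[1], newline="") as fh:
    reader = csv.reader(fh, skipinitialspace=True)
    writer = csv.writer(sys.stdout)
    for row in reader:
        # Keep the first four and last three fields; join the rest with " - ".
        middle = " - ".join(row[4:-3])
        writer.writerow(row[:4] + [middle] + row[-3:])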

Related

Combine log lines with awk

I have a log file that, simplified, looks like this (it has enough columns that addressing the columns directly is not feasible):
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
Most of the fields appear only once per id, but not all of them do (see time, for example).
What I want to achieve is to have one line per id that combines the columns of all records with the same id:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user
For the columns that appear more than once in each id, I don't care which one is returned as long as it relates to a record with the same id.
I would exploit GNU AWK's 2D arrays in the following way; let file.txt content be
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
then
awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i+=1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt
output
id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
Explanation: First I inform GNU AWK that both the field separator (FS) and the output field separator (OFS) are a comma, and I use the cols variable to hold the number of columns you wish to have. The first row I simply print. For the following rows, for each column I check whether there is already a truthy value in arr[id][field number] using the so-called ternary operator; if there is, I keep it, otherwise I set it to the current field. In END I use nested for loops: for each id I set the value of each of its fields in the current line, so GNU AWK builds a string from these which I can print. Disclaimer: this solution assumes that the number of columns is equal in all lines, that the number of columns is known a priori, and that any order of output is acceptable. If this does not hold, you will need a different solution.
(tested in gawk 4.2.1)
You can use the Ruby CSV parser to group and then reduce the repeated entries:
ruby -r csv -e '
  data = CSV.parse($<.read, **{:col_sep=>","})
  puts data[0].to_csv
  data[1..].group_by { |row| row[0] }.
    each { |k, arr|
      puts arr.transpose.map { |ta| ta.find { |x| !x.nil? } }.to_csv
    }
' file
Prints:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
This assumes the valid data is the first non-nil, non-blank value encountered for that particular column.
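
For comparison, the same group-and-coalesce idea can be sketched in Python, keeping the first non-empty value seen for each column within an id (the file name is passed on the command line; this is an illustrative sketch, not tested against your real log):

import csv
import sys

rows_by_id = {}
with open(sys.argv[1], newline="") as fh:
    reader = csv.reader(fh)
    header = next(reader)
    for row in reader:
        merged = rows_by_id.setdefault(row[0], [""] * len(header))
        for i, value in enumerate(row):
            if merged[i] == "":
                merged[i] = value   # first non-empty value wins per column

writer = csv.writer(sys.stdout)
writer.writerow(header)
writer.writerows(rows_by_id.values())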

How do I remove rows based on a comma-separated list of values in a Power BI parameter in Power Query?

I have a list of data with a title column (among many other columns) and I have a Power BI parameter that has, for example, a value of "a,b,c". What I want to do is loop through the parameter's values and remove any rows that begin with those characters.
For example:
Title
a
b
c
d
Should become
Title
d
This comma separated list could have one value or it could have twenty. I know that I can turn the parameter into a list by using
parameterList = Text.Split(<parameter-name>,",")
but then I am unsure how to continue to use that to filter on. For one value I would just use
#"Filtered Rows" = Table.SelectRows(#"Table", each Text.StartsWith([key], <value-to-filter-on>))
but that only allows one value.
EDIT: I may have worded my original question poorly. The comma separated values in the parameterList can be any number of characters (e.g.: a,abcd,foo,bar) and I want to see if the value in [key] starts with that string of characters.
Try using List.Contains to check whether the starting character is in the parameter list.
each List.Contains(parameterList, Text.Start([key], 1))
Edit: Since you've changed the requirement, try this:
Table.SelectRows(
    #"Table",
    (C) => not List.AnyTrue(
        List.Transform(
            parameterList,
            each Text.StartsWith(C[key], _)
        )
    )
)
For each row, this transforms the parameterList into a list of true/false values by checking if the current key starts with each text string in the list. If any are true, then List.AnyTrue returns true and we choose not to select that row.
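Expressed outside Power Query, the test is equivalent to this small Python illustration (the names and sample values are made up for the example):

parameter_list = "a,abcd,foo,bar".split(",")

def keep(key):
    # Keep the row only if the key starts with none of the prefixes.
    return not any(key.startswith(prefix) for prefix in parameter_list)

print(keep("apple"))   # False: starts with "a"
print(keep("delta"))   # True: matches no prefix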
Since you want to filter out all the values from the parameter, you can use something like:
= Table.SelectRows(#"Changed Type", each List.Contains(Parameter1,Text.Start([Title],1))=false)
Another way to do this would be to create a custom column in the table, which has the first character of title:
= Table.AddColumn(#"Changed Type", "FirstChar", each Text.Start([Title],1))
and then use this field in the filter step:
= Table.SelectRows(#"Added Custom", each List.Contains(Parameter1,[FirstChar])=false)
I tested this with a small sample set and it seems to be running fine. You can test both and see if it helps with the performance. If you are still facing performance issues, it would probably be easier if you can share the pbix file.
This seems to work fairly well:
= List.Select(Source[Title], each Text.Contains(Parameter1,Text.Start(_,1))=false)
Replace Source with the name of your table and Parameter1 with the name of your Parameter.

How to aggregate multiple rows into one in CSV?

I have the following problem:
I have a CSV file, which looks like this:
1,12
1,15
1,18
2,10
2,11
3,20
And I would like to parse it somehow to get this:
1,12,15,18
2,10,11
3,20
Do you have any solution? Thanks!
Here is one solution for you.
This first part just sets up the example for testing. For the second part of the script, I am assuming you already have a file with the values in it.
$path = "$env:TEMP\csv.txt"
$data =#"
1,12
1,15
1,18
2,10
2,11
3,20
"#
$data | Set-Content $path
This should be all you need:
$path = "$env:TEMP\csv.txt"
$results = #{}
foreach($line in (Get-Content $path))
{
$split = $line -split ','
$rowid = $split[0]
$data = $split[1]
if(-not($results.$rowid))
{
$results.$rowid = $rowid
}
$results.$rowid += "," + $data
}
$results.values | Sort-Object
Your original dataset does not need to be sorted for this one to work. I slice the data up and insert it into a hashtable.
I don't know your exact code requirements, but I will try to describe some logic that may help you.
A CSV file is just text, which can be read into a string or an array.
If you look at the CSV data above, there is a common pattern: after each pair there is a space in between.
So the parsing would happen in two phases:
parse on ' ' (a single space) and insert the pairs into an array (say, elements);
then parse each element of elements on ',' (comma) and save the pieces into another array (say, details), where odd indexes hold the left-hand values and even indexes hold the right-hand values.
Then, while printing or using the result, skip an odd index if you already have an existing value.
Hope this helps...
Satyaranjan,
thanks for your answer! To clarify - I don't have any code requirements; I can use any language to achieve the result. The point is to take the unique values from the first position (1, 2, 3) and put all the related numbers to their right (1 - 12, 15 and 18, etc.). It is something like the GROUP_CONCAT function in MySQL - but unfortunately I don't have such a function available, so I am looking for a workaround.
Hope it is more clear now. Thanks
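Since any language is acceptable, here is a minimal Python sketch that emulates GROUP_CONCAT by grouping the second field under the first (the input file name is illustrative):

import csv

groups = {}  # insertion-ordered on Python 3.7+
with open("csv.txt", newline="") as fh:
    for row in csv.reader(fh):
        if not row:
            continue
        groups.setdefault(row[0], []).append(row[1])

for key, values in groups.items():
    print(",".join([key] + values))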

How do I parse a table into its meaningful chunks?

I need to extract a table of data on a collection of pages. I can already traverse the pages just fine.
How do I extract the table's data? I'm using Ruby and Nokogiri, but I would assume that this is a pretty general issue.
I underlined the desired data points in each row in the following image.
A sample of the html is: http://pastebin.com/YYFPbFLC
How would I parse this table via Nokogiri into a hash of its meaningful chunks?
The table's xpath is:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table
The table has a variable number of rows of data and formatting rows. I only want to collect the rows with meaningful data, but I don't readily see a way to distinguish them via an XPath, except that the second column will reliably have "keyword" in it. Each of these rows has an XPath of:
1st meaningful row is: /html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]
...
Last meaningful row: /html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[N]
The first meaningful column, whose text content needs to match the "keyword", is:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]
The last column of this first row of data would be:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[6]
Each row is a record and has a timestamp, with this column/td being the time part of the timestamp; the year, month, and day are each in their own variables and can be appended to form a full timestamp:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[5]
The first rule of XPath is: never use the autogenerated XPath from Firebug or other browser tool. This creates brittle XPath that treats all page elements as equally important and required, even parts you don't care about. For example, if a notice went up at the top of the page and it happened to be in a table, it could throw off your parsing.
Instead, think about how a human would identify it. In this case, you want "the first table under the heading with the word 'today' in it". Here's the XPath for that:
//table[preceding-sibling::h2[contains(text(), "today")]][1]
This says take the tables that have a preceding h2 (in other words, that follow the h2), where the h2 contains the word "today". Then take the first such table.
Then you need to identify the rows you are interested in. Note that some rows are just dividers containing a single td, so you want to make sure you only parse the rows that have multiple td tags. In XPath, that is:
//tr[td[2]]
Then you just grab the content of all the columns. In the first one you can remove everything before the words "of magnitude" to get just the value. Putting it all together:
doc = Nokogiri::HTML.parse(html)
events = []
doc.xpath('//table[preceding-sibling::h2[contains(text(), "today")]][1]//tr[td[2]]').each do |row|
  cols = row.search('td/text()').map(&:to_s)
  events << {
    :magnitude   => cols[0].gsub(/^.*of magnitude /, ''),
    :temp_area   => cols[1],
    :time_start  => cols[2],
    :time_middle => cols[3],
    :time_end    => cols[4]
  }
end
The output is:
[
{:magnitude=>"F1.7",
:temp_area=>"0",
:time_start=>"01:11:00",
:time_middle=>"01:24:00",
:time_end=>"01:32:00"},
{:magnitude=>"F3.1",
:temp_area=>"0",
:time_start=>"04:01:00",
:time_middle=>"04:10:00",
:time_end=>"04:26:00"},
{:magnitude=>"F3.5",
:temp_area=>"134F55",
:time_start=>"06:24:00",
:time_middle=>"06:42:00",
:time_end=>"06:53:00"},
{:magnitude=>"F1.4",
:temp_area=>"0",
:time_start=>"11:58:00",
:time_middle=>"12:06:00",
:time_end=>"12:16:00"},
{:magnitude=>"F1.0",
:temp_area=>"0",
:time_start=>"13:02:00",
:time_middle=>"13:05:00",
:time_end=>"13:09:00"},
{:magnitude=>"D53.7",
:temp_area=>"134F55",
:time_start=>"17:37:00",
:time_middle=>"18:37:00",
:time_end=>"18:56:00"}
]

ISQL Perform instruction: after editadd editupdate of table vs. after add update of table

INFORMIX-SQL 7.3 Perform Screens:
According to the documentation, in an "after editadd editupdate of table" control block the instructions are executed before the row is added to or updated in the table, whereas in an "after add update of table" control block the instructions are executed after the row has been added or updated. Supposedly, this would mean that any instructions that alter the values of field-tags linked to table.columns would not be committed to the table, but field-tags linked to displayonly fields would change?
However, when using "after add update of table", I placed instructions that alter the values of field-tags linked to table.columns, and their displayed and committed values also changed! I would have thought that an "after add update of table" block would only alter displayonly fields.
TABLES
customer
transaction
branch
interest
dates
ATTRIBUTES
[...]
q = transaction.trx_type, INCLUDE=("E","C","V","P","T"), ...;
tb = transaction.trx_int_table,
LOOKUP f1 = ta_days1_f,
t1 = ta_days1_t,
i1 = ta_int1,
[...]
JOINING *interest.int_table, ...;
[...]
INSTRUCTIONS
customer MASTER OF transaction
transaction MASTER OF customer
delimiters ". ";
AFTER QUERY DISPLAY ADD UPDATE OF transaction
if z = "E" then let q = "E"
if z = "C" then let q = "C"
if z = "1" then let q = "E"
[...]
END
Is 'z' a column in the transaction table?
Is the trouble that the value in 'z' is causing a change in the value of 'q' (aka transaction.trx_type), and the modified value is being stored in the database?
Is the value in 'z' part of the transaction table?
Have you verified that the value in the DB is indeed changed - using the Query Language option or a simple (default) form?
It might look as if it is, because the instruction is also used AFTER DISPLAY, so when the values are retrieved from the DB, the value displayed in 'q' would be the mapped value corresponding to the value stored in 'z'. You would have to inspect the raw data to see past that mapping.
If this is not the problem, please:
Amend the question to show where 'z' comes from.
Also describe exactly what you do and see.
Confirm that the data in the database, as opposed to on the screen, is amended.
Please can you see whether this table plus form behaves the same for you as it does for me?
Table Transaction
CREATE TABLE TRANSACTION
(
trx_id SERIAL NOT NULL,
trx_type CHAR(1) NOT NULL,
trx_last_type CHAR(1) NOT NULL,
trx_int_table INTEGER NOT NULL
);
Form
DATABASE stores
SCREEN SIZE 24 BY 80
{
trx_id [f000]
trx_type [q]
trx_last_type [z]
trx_int_table [f001 ]
}
END
TABLES
transaction
ATTRIBUTES
f000 = transaction.trx_id;
q = transaction.trx_type, UPSHIFT, AUTONEXT,
INCLUDE=("E","C","V","P","T");
z = transaction.trx_last_type, UPSHIFT, AUTONEXT,
INCLUDE=("E","C","V","P","T","1");
f001 = transaction.trx_int_table;
INSTRUCTIONS
AFTER ADD UPDATE DISPLAY QUERY OF transaction
IF z = "E" THEN LET q = "E"
IF z = "C" THEN LET q = "C"
IF z = "1" THEN LET q = "E"
END
Experiments
[The parenthesized number is automatically generated by IDS/Perform.]
Add a row with data (1), V, E, 23.
Observe that the display is: 1, E, E, 23.
Exit the form.
Observe that the data in the table is: 1, V, E, 23.
Reenter the form and query the data.
Update the data to: (1), T, T, 37.
Observe that the display is: 1, T, T, 37.
Exit the form.
Observe that the data in the table is: 1, T, T, 37.
Reenter the form and query the data.
Update the data to: (1), P, 1, 49
Observe that the display is: 1, E, 1, 49.
Exit the form.
Observe that the data in the table is: 1, P, 1, 49.
Reenter the form and query the data.
Observe that the display is: 1, E, 1, 49.
Choose 'Update', and observe that the display changes to: 1, P, 1, 49.
I did the 'Observe that the data in the table is' steps using:
sqlcmd -d stores -e 'select * from transaction'
This generated lines like these (reflecting different runs):
1|V|E|23
1|P|1|49
That is my SQLCMD program, not Microsoft's upstart of the same name. You can do more or less the same thing with DB-Access, except it is noisier (13 extraneous lines of output) and you would be best off writing the SELECT statement in a file and providing that as an argument:
$ echo "select * from transaction" > check.sql
$ dbaccess stores check
Database selected.
trx_id trx_type trx_last_type trx_int_table
1 P 1 49
1 row(s) retrieved.
Database closed.
$
Conclusions
This is what I observed on Solaris 10 (SPARC) using ISQL 7.50.FC1; it matches what the manual describes, and is also what I suggested in the original part of the answer might be the trouble - what you see on the form is not what is in the database (because of the INSTRUCTIONS section).
Do you see something different? If so, then there could be a bug in ISQL that has been fixed since. Technically, ISQL 7.30 is out of support, I believe. Can you upgrade to a more recent version than that? (I'm not sure whether 7.32 is still supported, but you should really upgrade to 7.50; the current release is 7.50.FC4.)
Transcribing commentary before deleting it:
Up to a point, it is good that you replicate my results. The bad news is that in the bigger form we have different behaviour. I hope that ISQL validates all limits - things like number of columns etc. However, there is a chance that they are not properly validated, given the bug, or maybe there is a separate problem that only shows with the larger form. So, you need to ensure you have a supported version of the product and that the problem reproduces in it. Ideally, you will have a smaller version of the table (or, at least, of the form) that shows the problem, and maybe a still smaller (but not quite as small as my example) version that shows the absence of the problem.
With the test case (table schema and Perform screen that shows the problem) in hand, you can then go to IBM Tech Support with "Look - this works correctly when the form is small; and look, it works incorrectly when the form is large". The bug should then be trackable. You will need to include instructions on how to reproduce the bug similar to those I gave you. And there is no problem with running two forms - one simple and one more complex and displaying the bug - in parallel to show how the data is stored vs displayed. You could describe the steps in terms of 'Form A' and 'Form B', with Form A being Absolutely OK and Form B being Believed to be Buggy. So, add a record with certain values in Form B; show what is displayed in Form B after; show what is stored in the database in Form A after too; show that they are not different when they should be.
Please bear in mind that those who will be fixing the issue have less experience with the product than either you or me - so keep it as simple as possible. Remove as many attributes as you can; leave comments to identify data types etc.

Resources