Importing a CSV with a column of numbers with leading zeros - psql

I have a csv file (tab separated) that looks like:
code description
---- ----------------
0011 it is a car
0123 it is a ball
1434 mirror
0234 chain saw
01312 radiator
I need to do a postgres \copy operation to a table called mytable. I used
\COPY mytable FROM '/home/myuser/Documents/tbfile.csv' DELIMITER E'\t' CSV HEADER;
I defined this table as
CREATE TABLE mytable(
code text,
description text
);
I noticed that the table became
code description
---- ----------------
11 it is a car
123 it is a ball
1434 mirror
234 chain saw
1312 radiator
i.e. the leading 0s were 'eliminated'. How can I avoid this problem and keep those 0s?
Thanks in advance
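A quick way to narrow down where the zeros disappear is to read the file as plain text outside the database. The sketch below is only an illustration: it uses an inline sample standing in for tbfile.csv, and the helper name read_codes is made up. It shows that the zeros survive as long as the value is handled as text and vanish only once something parses it as a number.

```python
import csv
import io

# Hypothetical sample mirroring the tab-separated file from the question.
SAMPLE = "code\tdescription\n0011\tit is a car\n1434\tmirror\n01312\tradiator\n"


def read_codes(text):
    """Return the 'code' column exactly as stored in the file (as strings)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row["code"] for row in reader]


codes_as_text = read_codes(SAMPLE)
# Parsing as integers is what drops the leading zeros.
codes_as_int = [int(code) for code in codes_as_text]

print(codes_as_text)
print(codes_as_int)
```

If the text values printed here still carry their zeros, the file itself is fine and the loss happens in a later step that treats the column as numeric.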

Related

Playwright: Comparing the downloaded xlsx document with the expected one

Can you please advise me how to compare a downloaded file with the expected one? I haven't seen anything like that anywhere. Also, when comparing e.g. two xlsx files, I would like to get information about which cell values differ.
Example
First xlsx file
Column A  Column B
1         2
3         4
Second xlsx file
Column A  Column B
2         1
4         3
Comparing result - example
----------------- DIFF -------------------
DIFF Cell at A2 => '1' v/s '2'
.
.
.
I was thinking of using https://github.com/vrootic/FileCompare for the comparison.
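A minimal sketch of the cell-by-cell comparison described above, written in Python. It assumes both workbooks have already been read into lists of rows by some spreadsheet library (that step is not shown); the function names column_letter and diff_sheets are hypothetical, and the message format only imitates the example output.

```python
def column_letter(index):
    """Convert a 0-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def diff_sheets(expected, actual):
    """Return one message per cell whose value differs between the two grids."""
    diffs = []
    for r in range(max(len(expected), len(actual))):
        row_e = expected[r] if r < len(expected) else []
        row_a = actual[r] if r < len(actual) else []
        for c in range(max(len(row_e), len(row_a))):
            # Missing cells are treated as None so size mismatches are reported too.
            val_e = row_e[c] if c < len(row_e) else None
            val_a = row_a[c] if c < len(row_a) else None
            if val_e != val_a:
                cell = f"{column_letter(c)}{r + 1}"
                diffs.append(f"DIFF Cell at {cell} => {val_e!r} v/s {val_a!r}")
    return diffs


# The two example sheets from the question, header row included.
first = [["Column A", "Column B"], [1, 2], [3, 4]]
second = [["Column A", "Column B"], [2, 1], [4, 3]]

for line in diff_sheets(first, second):
    print(line)
```

In a test, an empty list returned by diff_sheets would mean the downloaded file matches the expected one.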

SPSS: Inconsistent totals due to rounding of numbers

I am using weights when running the data with SPSS custom tables.
Thus, it is expected that the column or row values may not add up to the row total, column total, or table total due to rounding of decimals.
sample table result:
                        variable 2
                        category 1  category 2  Total
variable 1  category 1  45          52          97
            category 2  60          56          115
            Total       105         107         211
Is there a way to force SPSS to output the correct row, column, or table totals?
expected table output:
                        variable 2
                        category 1  category 2  Total
variable 1  category 1  45          52          97
            category 2  60          56          116
            Total       105         108         213
If you are using the CROSSTABS procedure to produce these figures, then you should use the ASIS option.
To be clear: the total displayed by CTABLES is mathematically correct. However, if you instead want the total to be the sum of the displayed (rounded) values in the rows, the only way to do this is to use the STATS TABLE CALC extension command to recompute the totals from the rounded values.
Here is how to do that.
First, you need to create a Python module named customcalc.py with the following contents
def custom(datacells, ncells, roworcol):
    '''Calculate sum of formatted values'''
    total = sum(float(datacells.GetValueAt(roworcol,i)) for i in range(ncells))
    return(total)
This file should be saved in the python\lib\site-packages directory under your Statistics installation or anywhere else that Python can find it.
Then, after your CTABLES command, run this syntax
STATS TABLE CALC SUBTYPE="customtable" PROCESS=PRECEDING
/TARGET custommodule="customcalc"
FORMULA="customcalc.custom(datacells, ncells, roworcol)" DIMENSION=COLUMNS LEVEL = -2 LOCATION="Total"
LABEL="Rounded Count".
That custom function adds up the formatted values in each row instead of the full-precision values. If you have suppressed the default statistic name, Count, so that "Total" is the innermost label, use LEVEL=-1 instead of LEVEL=-2 above.
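To see why the displayed cells need not add up to the displayed total, here is a small arithmetic sketch in plain Python (not SPSS). The weighted counts are invented for illustration; they were chosen only so that the rounded cells come out as 60 and 56, like the second row of the sample table.

```python
# Illustrative weighted counts; the real unrounded values are not in the question.
weighted_counts = [59.6, 55.7]

# What the table shows in each cell: every value rounded on its own.
displayed_cells = [round(value) for value in weighted_counts]

# Total computed from the full-precision values, then rounded for display.
displayed_total = round(sum(weighted_counts))

# Total recomputed from the rounded cells, as the custom function does.
sum_of_displayed = sum(displayed_cells)

print(displayed_cells, displayed_total, sum_of_displayed)
```

The first total is the mathematically correct one; the second is the one that matches what a reader gets by adding the printed cells.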

Google Sheets: Split data and delete first part of each new cell

I'm feeding data from a SAAS into a Google Sheet, and would need to format it a bit to be able to work with it.
Most columns are ok, but one column has multiple parameters in one. Each cell looks like (data anonymized):
affiliate_fees: None
affiliate_percent: 0.X
amount_refunded: 0
author_fees: 0
author_id: xxxx
author_percent: 0.5
coupon_id: xxxx
created_at: 2016-xxxxx
currency: USD
custom_gateway?: None
earnings_usd: None
meta: {u'url': None, u'class': u'transaction', u'image_url': None, u'description': None, u'name': u'xxxx'}
net_charge: xxx
net_charge_usd: xxx
paypal_payment_id: PAY-XXXXXXX
purchased_at: 2016-xxxx
refundable: True
sale_id: xxxx
status: None
stripe_charge_token: None
stripe_invoice_id: None
total_fedora_fee: None
total_processor_fee: None
user_id: xxxx
vat_fees: None
I've already found out how to SPLIT the data into different columns - I'm doing it via =SPLIT(CC2,CHAR(10))
Now what I'd like to do, ideally in the same operation, is to remove the part before the first colon :
So the goal is to end up with only the values (the part after the :) spread into different columns. I can manually enter the column names. For example:
--------------------------------------------------
| affiliate_fees | affiliate_percent |
--------------------------------------------------
| None | 0.X |
--------------------------------------------------
| ... | ... |
--------------------------------------------------
Any hints? Thanks for your time!
Note: I don't really need the meta: line, it can be discarded. I just left it in there because it might (or might not?) make things extra tricky
Alternative 1
A few months ago, Google Sheets introduced "Split text to columns" as a menu command. See Separate cell text into columns for further details.
Once you separate the text, you could use copy & paste > transpose
Alternative 2
A single formula alternative is to use
=ArrayFormula(transpose(REGEXEXTRACT(A1:A25,{"(.*[\w\?])+\:","\: (.*)+"})))
This will return a 25 x 2 array, and you will not have to manually add the column headers.
Alternative 3
If you still want to use SPLIT, you could use ": " as the separator and FALSE as the third argument to treat the two characters as a single separator, but this will also split the meta: ... line into several columns.
Assume that your data start at A1, then the formula to use is:
=SPLIT(A1,": ",FALSE)
To include all the rows with data, you will have to fill down this formula. Then do copy & paste > transpose.
In this spreadsheet I used this formula in cell E2
=ArrayFormula({regexreplace(split(A3, char(10)), "\:(.+)",""); regexreplace(split(A3, char(10)), "(.+)\: ","")})
This will create a row with headers and the values in row 2. If you don't want the headers, just use
=ArrayFormula(regexreplace(split(A3, char(10)), "(.+)\: ",""))
See if that works for you.
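For comparison, the same transformation can be sketched outside Sheets in Python: split the cell on line breaks, then cut each line at the first ": ". This is only an illustration of the logic behind the formulas above; the function name split_cell and the shortened sample cell are made up.

```python
# Shortened, anonymized sample of one cell (hypothetical excerpt).
CELL = (
    "affiliate_fees: None\n"
    "affiliate_percent: 0.X\n"
    "currency: USD\n"
    "meta: {u'url': None, u'name': u'xxxx'}"
)


def split_cell(cell, skip=("meta",)):
    """Split a multi-line cell into parallel lists of headers and values."""
    headers, values = [], []
    for line in cell.split("\n"):
        # partition cuts at the FIRST ": " only, so later colons stay in the value.
        key, separator, value = line.partition(": ")
        if not separator or key in skip:
            continue
        headers.append(key)
        values.append(value)
    return headers, values


headers, values = split_cell(CELL)
print(headers)
print(values)
```

Cutting only at the first separator is what keeps a line like meta: from spreading over several columns; here it is simply skipped, since the question says it can be discarded.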

Formula to conditionally sum an array of values from one sheet to give totals on another

I have two sheets, connected by ID, which contain details of events and charges.
Sheet1 (breakdown of charges):
[Oh, I just discovered I'm not allowed to include screenshots. I apparently need 10 reputation points. Not sure how to show you my spreadsheet now...]
ID DBF PCC Extras
1 200
1 100
3 200
4 350
4 250
4 75
4 25
7 100
[Sorry, this will probably look horrible; I can't figure out how to include a spreadsheet snippet without using an image. I had 3 images all prepared and ready.]
Sheet2 (identification and summary information):
ID Type Name
3 MON Edwards
7 REC Smith
4 WDG Jones
1 FNL West
8 WDG Richards
9 WDG Morrison
11 INT Gray
I am trying to add three additional columns to sheet 2 so that it shows a summary of the charges for each event. I would like the charges information to update automatically in sheet2 as detail is added to sheet 1.
The resulting sheet2 will look like this:
ID Type Name DBF PCC Extras
3 MON Edwards 200
7 REC Smith 100
4 WDG Jones 350 250 100
1 FNL West 100 200
8 WDG Richards
9 WDG Morrison
11 INT Gray
As data for ID 8, 9 and 11 is added to sheet1, the summations should automatically appear in sheet2.
I have been trying to create an array formula to put in sheet2:B2, something like this:
=QUERY('Log Items'!A:F, "select sum(C), sum(D), sum(E), sum(F) where A="&A:A, 0)
This produces the correct result for ID 1 but it stops there and I'm not sure why. Also, despite my 0 as the third parameter, the header row is output.
I tried encapsulating the above in an ARRAYFORMULA but get a parse error.
I have also tried various combinations of ARRAYFORMULA, SUM and IF but have not got anything that works. For example:
=ARRAYFORMULA(SUM(IF('Log Items'!A:A=A:A,'Log Items'!C:E,0)))
This gives #N/A "argument out of range", which I don't understand.
Although I've been working with Excel for a while, I'm really new to Google's array formulas, but I have managed to use them successfully in other parts of my spreadsheet and found them really powerful.
If anyone could help me with this, I would be very grateful.
In Sheet2!D2:
=ARRAYFORMULA(IF(A2:A,MMULT(N(A2:A=TRANSPOSE('Log Items'!A2:A)),N('Log Items'!B2:D)),))
Note: the N() functions have become necessary with different coercion behaviour in the new version of Sheets. They can be omitted in the classic version.
MMULT usage
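The idea behind the MMULT formula can be sketched in plain Python: build a 0/1 matrix marking which log rows belong to each summary ID, then multiply it by the matrix of charges. The charge values below are illustrative only, because the column each amount sits in cannot be recovered from the flattened Sheet1 listing; the helper name mmult is made up.

```python
def mmult(a, b):
    """Plain matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Log sheet: one ID per row plus (DBF, PCC, Extras); column placement is a guess.
log_ids = [1, 1, 3, 4, 4, 4, 4, 7]
log_charges = [
    [100, 0, 0],
    [0, 200, 0],
    [200, 0, 0],
    [350, 0, 0],
    [0, 250, 0],
    [0, 0, 75],
    [0, 0, 25],
    [0, 100, 0],
]

# Summary sheet IDs, in their own order; ID 8 has no log rows yet.
summary_ids = [3, 7, 4, 1, 8]

# match[i][k] is 1 when log row k belongs to summary row i (the A2:A=TRANSPOSE(...) part).
match = [[int(sid == lid) for lid in log_ids] for sid in summary_ids]

totals = mmult(match, log_charges)
for sid, row in zip(summary_ids, totals):
    print(sid, row)
```

Each row of the product is the per-column sum of the matching log rows, which is why a single array formula can fill all three summary columns at once.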

Apache Pig: Join records by shifting

I have records of type:
time | url
==========
34 google.com
42 cnn.com
54 yahoo.com
64 fb.com
I want to add another column to these records time_diff which basically takes the difference of the time of the current record with the previous record. Output should look like:
time | url | time_diff
======================
34 google.com -- <can drop this row>
42 cnn.com 08
54 yahoo.com 12
64 fb.com 10
If I can somehow add another column (same as time) shifting the time by one such that 42 is aligned with 34, 54 is aligned with 42 and so on, then I can take the difference between these columns to calculate time_diff column.
I can project the time column to a new variable T and if I can drop the first record in the original data, then I can join it with T to obtain the desired result.
I appreciate any help. Thanks!
See this question, for example. You'll need to get your tuples in a bag (using GROUP ... ALL in your case), and then in a nested FOREACH, ORDER them and call a UDF to rank them. After you have this rank, you can FLATTEN the bag back out into a set of tuples again, and you'll have three fields: time, url, and rank. Once you have this, create a fourth column which is rank-1, do a self-join on those latter two columns, and you'll have what you need to compute the time_diff.
Since multiple records can have the same time, it would be a good idea to also sort on url so that you are guaranteed the same result every time.
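The rank-and-self-join idea can be sketched in plain Python to make the steps concrete. This is not Pig code, just an illustration of the logic; the variable names are made up and the records are the four from the question, deliberately shuffled.

```python
# Records as (time, url), in arbitrary order.
records = [(54, "yahoo.com"), (34, "google.com"), (64, "fb.com"), (42, "cnn.com")]

# ORDER by time, then url, so ties on time give a deterministic result.
ordered = sorted(records)

# Rank each record (0, 1, 2, ...), like the rank column in the Pig version.
ranked = {rank: record for rank, record in enumerate(ordered)}

# Self-join each record with the record whose rank is one lower.
result = [
    (time, url, time - ranked[rank - 1][0])
    for rank, (time, url) in ranked.items()
    if rank - 1 in ranked  # the first record has no predecessor and is dropped
]

for row in result:
    print(row)
```

The join on rank and rank-1 is what lines each time up with the previous one, which is the shifted column the question asks about.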
I think you can use the "lead" function of PiggyBank's Over UDF. Something like the following might work.
A = LOAD 'T';
B = GROUP A ALL;
C = FOREACH B {
    C1 = ORDER A BY time;
    GENERATE FLATTEN(Stitch(C1, Over(C1.time, 'lead')));
}
D = FOREACH C
    GENERATE stitched::time AS time,
             stitched::url AS url,
             stitched::time - $3 AS time_diff;
https://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/evaluation/Over.html
