Losing column when running dask to_parquet method with partition_on option - dask

I have data that I need to reorganize in order to perform a groupby efficiently.
Currently I have the data in several parquet files (over 2.5B rows) which look as follows:
ID1 | ID2 | Location |
AERPLORDRVA | AOAAATDRLVA | None
ASDFGHJHASA | QWEFRFASEEW | home
I'm adding a new column so I can re-save the file with partitions (and also append to them), which will hopefully help with the groupby:
df['ID4'] = df.ID1.apply(lambda x: x[:2])
When I view the df I see the column like this
ID1 | ID2 | Location | ID4
AERPLORDRVA | AOAAATDRLVA | None | AE
ASDFGHJHASA | QWEFRFASEEW | home | AS
....
But when I run the following code
dd.to_parquet(path2newfile, df, compression='SNAPPY', partition_on=['ID4'], has_nulls=['Location'], fixed_text={'ID1': 11, 'ID2': 11, 'ID4': 2})
the ID4 column changes into a dir0 column when I read the data back:
df2 = dd.read_parquet(path2newfile)
ID1 | ID2 | Location | dir0
AERPLORDRVA | AOAAATDRLVA | None | ID4=AE
ASDFGHJHASA | QWEFRFASEEW | home | ID4=AS
....
Any ideas?
I was planning to include ID4 in the groupby and thus improve the efficiency of the query:
dfc = df.groupby(['ID4','ID1','ID2']).count()
I'm working on a single workstation with 24 cores and 190 GB of RAM (although the dask cluster only recognizes 123.65 GB).

This was a bug in how directory names were parsed: apparently you are the first to use a field name containing numbers since the addition of the option for "drill"-style directory partitioning.
The fix is here: https://github.com/dask/fastparquet/pull/190; it was merged into master on 30-Jul-2017 and will eventually be released.
For the time being, you could rename your column so it doesn't include numbers.
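A minimal sketch of that workaround, using plain pandas for illustration and an arbitrary digit-free replacement name (idgroup is just an illustrative choice):

```python
import pandas as pd

# Sample rows from the question.
df = pd.DataFrame({
    "ID1": ["AERPLORDRVA", "ASDFGHJHASA"],
    "ID2": ["AOAAATDRLVA", "QWEFRFASEEW"],
    "Location": [None, "home"],
})

# Derive the partition key from the first two characters of ID1.
df["ID4"] = df["ID1"].str[:2]

# Workaround until the fastparquet fix is released: rename the
# partition column to a name without digits before writing.
df = df.rename(columns={"ID4": "idgroup"})
```

After the rename, `partition_on=['idgroup']` should round-trip without the `dir0` / `ID4=AE` mangling.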

Can I order a pivot table using a second condition?

I am working with a spreadsheet where I store the books I read. The format is as follows:
A | B | C | D | E | F
year | book | author | mark | language | country of the author
With entries like:
A | B | C | D | E | F
-------------------------------------------------------------
2004 | Hamlet | Shakespeare | 8 | ES | UK
2005 | Crimen y punishment | Dostoevsky | 9 | CAT | Russia
2007 | El mundo es ansí | Baroja | 8 | ES | Spain
2011 | Dersu Uzala | Arsenyev | 8 | EN | Russia
2015 | Brothers Karamazov | Dostoevsky | 8 | ES | Russia
2019 | ... Shanti Andía | Baroja | 7 | ES | Spain
I have several pivot tables to get different data, such as top countries, top books, etc. In one of them I want to group by author and order by the number of books I have read by each of them.
So I defined:
ROWS
author (column C) with
order: Desc for COUNT of author
VALUES
author
summation by: COUNT
show as Default
mark
summation by: AVERAGE
show as Default
This way, the data above shows like this:
author | COUNT of author | AVERAGE of mark
-------------------------------------------------------------
Baroja | 2 | 7,5
Dostoevsky | 2 | 8,5
Shakespeare | 1 | 8
Arsenyev | 1 | 8
That is fine, since it puts the most-read authors on top. However, I would also like to order by AVERAGE of mark, so that when COUNT of author ties, AVERAGE of mark breaks the tie and the author with the better average on their books goes on top.
On my sample data, Dostoevsky would go above Baroja (8,5 > 7,5).
I have been looking for different options, but I could not find any without including an extra column in the pivot table.
How can I use a second option to solve the ties when the first option gives the same value?
You can achieve a customized sort order on a pivot table without any extra columns in the source range. However... you'd definitely need an extra field added to the pivot.
In the Pivot table editor go to Values and add a Calculated Field.
Use any formula that describes the sort order you want. E.g. let's multiply the count by 100 to use it as the first criterion:
=COUNTA(author) * 100 + AVERAGE(mark)
Do notice it is important to select Summarize by: Custom formula.
Now, just add this new calculated field as your row's Sort by field, and you're done!
Notice though, you do get an extra column added to the pivot.
Of course, you could hide it.
Translated from my answer to the cross-posted question on es.SO.
try:
=QUERY(A2:F,
"select C,count(C),avg(D)
where A is not null
group by C
order by count(C) desc
label C'author',count(C)'COUNT of author',avg(D)'AVERAGE of mark'")
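For comparison, the same two-level ordering (count descending, then average mark descending as the tie-breaker) can be sketched in pandas with the question's sample data:

```python
import pandas as pd

# One row per book read, as in the question's sample data.
books = pd.DataFrame({
    "author": ["Shakespeare", "Dostoevsky", "Baroja",
               "Arsenyev", "Dostoevsky", "Baroja"],
    "mark": [8, 9, 8, 8, 8, 7],
})

# Aggregate per author, then sort by count first and average mark second.
summary = (books.groupby("author")
                .agg(count=("mark", "size"), avg_mark=("mark", "mean"))
                .sort_values(["count", "avg_mark"], ascending=False))
```

With these rows, Dostoevsky (count 2, average 8.5) lands above Baroja (count 2, average 7.5), exactly the tie-break the question asks for.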

In a data warehouse, can a fact table contain two identical records?

Suppose a user ordered the same product under two different order_ids, and the orders were created within the same date-hour granularity, for example:
order#1 2019-05-05 17:23:21
order#2 2019-05-05 17:33:21
In the data warehouse, should we put them into two rows like this (Option 1):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 1 |
| 002 | 1111 | 22 | 123 | 456 | 10 | 2 |
Or just put them in one row with the aggregated quantity (Option 2):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 3 |
I know that if I put the order_id in the fact table as a degenerate dimension, it should be Option 1. But in our case, we don't really want to keep the order_id.
Also, I once read an article saying that when all dimensions are filtered on, there should be only one row of data left in the fact table. If that statement is correct, Option 2 would be the choice.
Is there a principle I can refer to?
Conceptually, fact tables in a data warehouse should be designed at the most detailed grain available. You can always aggregate data from the lower granularity to the higher one, while the opposite is not true: if you combine the records, some information is lost permanently. If you ever need it later (even though you might not see it now), you'll regret the decision.
I would recommend the following approach: in the data warehouse, keep the order number as a degenerate dimension. Then, when you publish a star schema, you might build a pre-aggregated version of the table (skip the order number, group identical records by date/hour). This way, you have a smaller/cleaner fact table in your dimensional model and yet preserve more detailed data in the DW.
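The "you can aggregate down but never back up" point can be sketched like this (pandas, using the question's sample keys):

```python
import pandas as pd

# Detailed grain: one row per order (Option 1).
detail = pd.DataFrame({
    "user_key": [1111, 1111],
    "product_key": [22, 22],
    "date_key": [123, 123],
    "time_key": [456, 456],
    "price": [10, 10],
    "quantity": [1, 2],
})

# The pre-aggregated version (Option 2) is always derivable from the
# detail grain; recovering the two original orders from it is not.
agg = (detail.groupby(["user_key", "product_key", "date_key",
                       "time_key", "price"], as_index=False)
             .agg(quantity=("quantity", "sum")))
```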

How should browser and version be one hot encoded?

I need to output browser and version data for one hot encoding. We have come up with a few options (outlined below). I did some searching but couldn't find any existing examples of someone with similar data (searched Kaggle Datasets and DuckDuckGo).
Option 1: One column with browser name and version joined together
e.g. "browser_version" column values: "Safari-1.2.3", "Chrome-4.5.6", "Firefox-7.8.9"
| order_id | browser_version |
| 1 | Safari-1.2.3 |
| 2 | Chrome-4.5.6 |
| 3 | Firefox-7.8.9 |
Option 2: Two columns: one with browser name, another with browser version
e.g. "browser" (column 1) values: "Safari", "Chrome", "Firefox"
e.g. "version" (column 2) values: "1.2.3", "4.5.6", "7.8.9"
| order_id | browser | version |
| 1 | Safari | 1.2.3 |
| 2 | Chrome | 4.5.6 |
| 3 | Firefox | 7.8.9 |
Option 3: Two columns: one with browser name, another with browser name and version joined together
e.g. "browser" (column 1) values: "Safari", "Chrome", "Firefox"
e.g. "browser_version" (column 2) values: "Safari-1.2.3", "Chrome-4.5.6", "Firefox-7.8.9"
| order_id | browser | browser_version |
| 1 | Safari | Safari-1.2.3 |
| 2 | Chrome | Chrome-4.5.6 |
| 3 | Firefox | Firefox-7.8.9 |
What is the most beneficial way to set up the data values (assuming a CSV file, columns) for one hot encoding?
I suppose the correct answer might be to test each option and check the results, but this is likely something that has been done before, so I figured it was worth asking.
I would use the first option. It gives one index per (browser, version) pair.
The second option puts the version numbers of different browsers in the same column, even though these numbers are not comparable: you can compare a Chrome version number with another Chrome version number, but not a Chrome version number with a Firefox one.
And the third option contains the first, with additional redundant data.
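A sketch of Option 1 with pandas (pd.get_dummies), using the example values from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "browser": ["Safari", "Chrome", "Firefox"],
    "version": ["1.2.3", "4.5.6", "7.8.9"],
})

# Option 1: join name and version so each distinct pair
# gets its own indicator column.
df["browser_version"] = df["browser"] + "-" + df["version"]
encoded = pd.get_dummies(df["browser_version"])
```

Each row ends up with exactly one indicator set, and a Chrome version can never be mistaken for a Firefox one.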

How do I count all cells in a column that have emoji?

I have a problem with emoji in my production database. Since it's in production, all I get out of it is an auto-generated Excel spreadsheet (.xls) every so often with tens of thousands of rows. I use Google Sheets to parse this so I can easily share the results.
What formula can I use to get a count of all cells in column n that contain emoji?
For instance:
Data
+----+-----------------+
| ID | Name |
+----+-----------------+
| 1 | Chad |
+----+-----------------+
| 2 | ✨Darla✨ |
+----+-----------------+
| 3 | John Smith |
+----+-----------------+
| 4 | Austin ⚠️ Powers |
+----+-----------------+
| 5 | Missus 🎂 |
+----+-----------------+
Totals
+----------------------------------+---+
| People named Chad | 1 |
+----------------------------------+---+
| People with emoji in their names | 3 |
+----------------------------------+---+
Edit by Ben C. R. Leggiero:
=COUNTA(FILTER(A2:A6;REGEXMATCH(A2:A6;"[^\x{0}-\x{F7}]")))
This should work:
=arrayformula(countif(REGEXMATCH(A2:A6,"[^a-zA-Z\d\s:]"),true))
You cannot extract emoji with a regular formula because Google Sheets uses the lightweight RE2 regex engine, which lacks many features, including those necessary to find emoji.
What you need to do is create a custom function. Select the Tools menu, then Script editor.... In the script editor, add the following:
function find_emoji(s) {
  // Emoji-related character ranges; astral-plane emoji (e.g. 🎂) are
  // caught via the surrogate-pair range at the end.
  var re = /[\u1F60-\u1F64]|[\u2702-\u27B0]|[\u1F68-\u1F6C]|[\u1F30-\u1F70]|[\u2600-\u26ff]|[\uD83C-\uDBFF\uDC00-\uDFFF]+/i;
  if (s instanceof Array) {
    // When given a range, match each cell individually.
    return s.map(function (el) { return el.toString().match(re); });
  }
  return s.toString().match(re);
}
Save the script, go back to your spreadsheet, then test your formula: =find_emoji(A1)
My test yields the following:
| Missus 🎂 | 🎂 |
| Austin ⚠️ Powers | ⚠ |
| ✨Darla✨ | ✨ |
| joke 😆😆 | 😆😆 |
And, to count entries that don't have emojis, you can use this formula:
=countif( arrayformula(isblank( find_emoji(filter(F2:F,not(isblank(F2:F)))))), FALSE)
EDIT
I was wrong: you can use a regular formula to extract emoji. The regex syntax is [\x{1F300}-\x{1F64F}]|[\x{2702}-\x{27B0}]|[\x{1F68}-\x{1F6C}]|[\x{1F30}-\x{1F70}]|[\x{2600}-\x{26ff}]|[\x{D83C}-\x{DBFF}\x{DC00}-\x{DFFF}]
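Outside of Sheets, the same codepoint-range idea can be sketched in Python; the ranges below follow the edit above and are not an exhaustive emoji list:

```python
import re

# Codepoint ranges adapted from the answer's edit; not exhaustive.
EMOJI = re.compile("[\U0001F300-\U0001F64F\u2600-\u26FF\u2702-\u27B0]")

# The names from the question's sample table.
names = ["Chad", "✨Darla✨", "John Smith", "Austin ⚠️ Powers", "Missus 🎂"]

# Count cells that contain at least one emoji character.
with_emoji = sum(1 for n in names if EMOJI.search(n))
```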

Trying to find a Google Spreadsheet formula or query for this example

A     | B   | C     | D      | E      | F | G
name  | num | quant | item   | quant2 |   |
car   | 5   | 100   |        |        |   |
      |     |       | wheel  | 4      |   |
      |     |       | axel   | 2      |   |
      |     |       | engine | 1      |   |
truck | 2   | 20    |        |        |   |
      |     |       | wheel  | 6      |   |
      |     |       | bed    | 1      |   |
      |     |       | axel   | 2      |   |
I need a formula which will compute B*C*E. The tables look like the above, so it needs to be something like =B$2*C$2*E3 and then dragged, and then for the next set =B$6*C$6*E7 and dragged, etc. But I wasn't sure how to do that "nearest value above" lookup: if B5 is empty, look at each cell above until you find one that is filled.
I am trying to use this to get the total quantity of parts per car, truck, etc., and then group by part.
I don't have a set of DB tables to do this, just a spreadsheet.
I had to add some additional information to resolve this.
I was thinking there would be a way to write a Google script that would do this and update the file, but I couldn't seem to find one.
I first summed each group item:
=B$3*E4
and dragged it for that grouping.
Then I went to an empty area and wrote up a query:
=QUERY(D:F, "select D, sum(F) group by D")
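The same "look upward for the nearest filled cell" step can be sketched with pandas' ffill, using the question's data (illustrative only; in the sheet this role is played by the dragged formulas):

```python
import pandas as pd

df = pd.DataFrame({
    "name":   ["car", None, None, None, "truck", None, None, None],
    "num":    [5, None, None, None, 2, None, None, None],
    "quant":  [100, None, None, None, 20, None, None, None],
    "item":   [None, "wheel", "axel", "engine", None, "wheel", "bed", "axel"],
    "quant2": [None, 4, 2, 1, None, 6, 1, 2],
})

# Forward-fill the group-header columns: each blank cell takes the
# nearest non-empty value above it.
df[["name", "num", "quant"]] = df[["name", "num", "quant"]].ffill()

# B*C*E per part row, then total per part (the QUERY group-by step).
parts = df.dropna(subset=["item"]).copy()
parts["total"] = parts["num"] * parts["quant"] * parts["quant2"]
totals = parts.groupby("item")["total"].sum()
```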
