Dask read_json metadata mismatch - dask

I'm trying to load json files into a dask df.
import glob
import dask.dataframe as dd
files = glob.glob('**/*.json', recursive=True)
df = dd.read_json(files, lines=False)
There are some missing values in the data, and some of the files have extra columns.
Is there a way to specify a column list, so all possible columns will exist in the concatenated dask df?
Additionally, can't it handle missing values? I get the following error when trying to compute the df:
ValueError: Metadata mismatch found in `from_delayed`.
Partition type: `DataFrame`
+-----------------+-------+----------+
| Column | Found | Expected |
+-----------------+-------+----------+
| x22 | - | float64 |
| x21 | - | object |
| x20 | - | float64 |
| x19 | - | float64 |
| x18 | - | object |
| x17 | - | float64 |
| x16 | - | object |
| x15 | - | object |
| x14 | - | object |
| x13 | - | object |
| x12 | - | object |
| x11 | - | object |
| x10 | - | object |
| x9 | - | float64 |
| x8 | - | object |
| x7 | - | object |
| x6 | - | object |
| x5 | - | int64 |
| x4 | - | object |
| x3 | - | float64 |
| x2 | - | object |
| x1 | - | object |
+-----------------+-------+----------+

read_json() is new and has only been tested for the "common" case of homogeneous data. Like read_csv, it could fairly easily be extended to cope with column selection and data type coercion. I note that the pandas function accepts a dtype= parameter.
This is not an answer, but perhaps you would be interested in submitting a PR to the repo? The relevant code lives in dask.dataframe.io.json.

I bumped into a similar problem and came up with another solution:
import dask.dataframe as dd
import pandas as pd

def read_data(path, **kwargs):
    # empty pandas DataFrame with the columns and dtypes dask inferred
    meta = dd.read_json(path, **kwargs).head(0)
    # edit the meta dataframe here so it matches what json_engine returns

    def json_engine(*args, **kwargs):
        df = pd.read_json(*args, **kwargs)
        # add or drop the necessary columns here
        return df

    return dd.read_json(path, meta=meta, engine=json_engine, **kwargs)
The idea of this solution is to do two things:
Edit meta as you see fit (for example, remove a column you don't need).
Wrap the JSON engine function and drop or add columns there, so that meta matches what the function returns.
Examples:
You have one irrelevant column that causes your code to fail with this error:
| Column | Found | Expected |
| x22 | - | object |
In this case you simply drop the column from meta and in your json_engine() wrapper.
You have relevant columns that are reported missing for some partitions, which gives an error similar to the one in the question.
In this case you add the columns to meta with the right dtypes (meta is just an empty pandas DataFrame here) and, where needed, also add them as empty columns in your json_engine() wrapper.
Also see the suggestion, in the comments on the answer at https://stackoverflow.com/a/50929229/2727308, to use dask.bag instead.
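The "add or drop necessary columns" step can be sketched with plain pandas. This is a minimal illustration, not the answer's actual code: SCHEMA, make_meta and align_to_meta are hypothetical names, and the three columns are picked from the error table only as an example.

```python
import pandas as pd

# Hypothetical schema for illustration: column name -> dtype.
SCHEMA = {"x1": "object", "x3": "float64", "x9": "float64"}


def make_meta(schema):
    """Build an empty DataFrame carrying the expected columns and dtypes."""
    return pd.DataFrame({name: pd.Series(dtype=dtype) for name, dtype in schema.items()})


def align_to_meta(df, meta):
    """Return df with exactly the columns of meta, in the same order and dtypes."""
    # reindex drops columns absent from meta and adds missing ones filled with NaN
    aligned = df.reindex(columns=meta.columns)
    # float64 and object can both hold missing values, so this cast is safe here
    return aligned.astype(meta.dtypes.to_dict())


meta = make_meta(SCHEMA)
# A partition that lacks x1 and x9 and carries one unwanted column
partition = pd.DataFrame({"x3": [1.5, 2.0], "unwanted": ["a", "b"]})
aligned = align_to_meta(partition, meta)
print(list(aligned.columns))
```

Inside json_engine() the wrapper would return align_to_meta(pd.read_json(*args, **kwargs), meta). Note that a plain int64 column (such as x5 in the question) cannot hold missing values, so a column that may be absent would need float64 or the nullable "Int64" dtype in meta.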

I added the pandas read_json kwarg dtype as object so all the columns are inferred as objects:
df = dd.read_json(files, dtype=object)

Related

Partial transpose of Sheet

I have a Google Sheet with this format:
+---------+---------+---------+------------+------------+------------+------------+--------+--------+
| Field_A | Field_B | Field_C | 24/09/2019 | 25/09/2019 | 26/09/2019 | 27/09/2019 | day... | day... |
+---------+---------+---------+------------+------------+------------+------------+--------+--------+
| ValX | ValY | ValZ | Val1 | Val2 | Val3 | Val4 | | |
| ValW | ValY | ValZ | Val5 | Val6 | Val7 | Val8 | | |
+---------+---------+---------+------------+------------+------------+------------+--------+--------+
First 3 columns are specific fields and all other columns are related to one specific day in a given (and static) range.
I need to convert the table in the following format:
+---------+---------+---------+------------+-----------+
| Field_A | Field_B | Field_C | Date | DateValue |
+---------+---------+---------+------------+-----------+
| ValX | ValY | ValZ | 24/09/2019 | Val1 |
| ValX | ValY | ValZ | 25/09/2019 | Val2 |
| ValX | ValY | ValZ | 26/09/2019 | Val3 |
| ... | | | | |
+---------+---------+---------+------------+-----------+
Basically, the first 3 columns are kept as-is, but the day columns are transposed (is that even the correct term?) into 2 values:
The date
The value in the cell related to date
Can this be achieved with a formula, or do I need to create a bound Apps Script?
Here is a sample Sheet demo: https://docs.google.com/spreadsheets/d/1cprzD96i-4NQ8tieA_nwd8s43yKF-M8Kww4yWNfB6tg/edit#gid=505040170
In sheet Start you can see the initial data and format: 3 static columns and one column for every day.
In sheet End you can see the output format I'm looking for: the same 3 static columns, but the date and the cell value for that date are transposed into rows.
You can see the formula I used: a TRANSPOSE for every row, where I select the days for the IV column and one row at a time for the V column.
For the 3 static columns, I replicated the formula for every instance of the day related to that row.
This works but requires a lot of manual work to set up every single TRANSPOSE. I'm wondering if there is a more automatic way of doing this (other than Apps Script, which I'm already planning to use if no other solution is available).
=ARRAYFORMULA(TRIM(SPLIT(TRANSPOSE(SPLIT(QUERY(TRANSPOSE(QUERY(TRANSPOSE(
IF(Start!D2:F<>""; "♦"&TRANSPOSE(QUERY(TRANSPOSE(Start!A2:C&"♠");;999^99))&
TEXT(Start!D1:F1; "dd/mm/yyyy")&"♠"&Start!D2:F; ));;999^99));;999^99); "♦")); "♠")))
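For readers who find the nested formula hard to follow, the reshaping it performs can be sketched outside Sheets. The Python below is only an illustration of the logic (one output row per non-empty date cell); the function name unpivot and the n_fixed parameter are hypothetical, and the sample rows are taken from the question's table.

```python
def unpivot(header, rows, n_fixed=3):
    """Turn one column per date into one row per (fixed fields, date, value)."""
    dates = header[n_fixed:]
    out = [header[:n_fixed] + ["Date", "DateValue"]]
    for row in rows:
        fixed, values = row[:n_fixed], row[n_fixed:]
        for date, value in zip(dates, values):
            # skip empty cells, similar to the IF(...<>"") test in the formula
            if value != "":
                out.append(fixed + [date, value])
    return out


header = ["Field_A", "Field_B", "Field_C",
          "24/09/2019", "25/09/2019", "26/09/2019", "27/09/2019"]
rows = [
    ["ValX", "ValY", "ValZ", "Val1", "Val2", "Val3", "Val4"],
    ["ValW", "ValY", "ValZ", "Val5", "Val6", "Val7", "Val8"],
]
result = unpivot(header, rows)
for line in result:
    print(line)
```

The formula reaches the same shape by joining each row's fixed fields, date and value into one delimited string per cell and then splitting the strings apart again.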

Try to match string in column and print matching column name

I am trying to build an expense dashboard in google sheets for my personal use.
I have data that I will pull from my receipts like so:
First sheet: "Expenses Feb 18"
+------------+--------+--------+
| Item | Amount | Type |
+------------+--------+--------+
| Tomatoes | 2.39 | veggie |
| Joghurt | 1.45 | dairy |
| mozzarella | 1.99 | dairy |
| macadamia | 4.59 | nuts |
+------------+--------+--------+
Second table: "Categories"
+------------+----------+-----------+---------------+
| dairy | veggie | nuts | uncategorised |
+------------+----------+-----------+---------------+
| joghurt | tomatoes | macadamia | a |
| mozzarella | cucumber | pecan | b |
| feta | | | c |
| | | | d-z |
| | | | 0-9 |
| | | | - |
| | | | _ |
+------------+----------+-----------+---------------+
I want to automatically fill out the type column based on the item name.
So far I have a regex that is able to match an item. It will print the matched string. But what I need is the column name (header). And it has to be able to loop through the columns. This only works for a single column.
=REGEXEXTRACT(C11, JOIN("|", INDIRECT("Categories!A1:A"&COUNTA(Categories!A:A))))
The second table is not a desirable way to enter data. Data should preferably be entered with more rows than columns (not in a pivoted manner).
=ARRAYFORMULA(CONCATENATE(IF(A16=$C$24:$E$25,C$23:E$23,)))
A16 : 🍅
C24:E25: Category table
C23:E23: Category header.
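The lookup the formula performs (find the item in the category table, return the header of the column it sits in) can be sketched as a reverse mapping. This Python is only an illustration of that logic; build_lookup and categorise are hypothetical names, and "bread" is an invented item used to show the fallback.

```python
# Category table from the question, keyed by column header
CATEGORIES = {
    "dairy": ["joghurt", "mozzarella", "feta"],
    "veggie": ["tomatoes", "cucumber"],
    "nuts": ["macadamia", "pecan"],
}


def build_lookup(categories):
    """Invert header -> items into item -> header for direct lookups."""
    return {
        item.lower(): category
        for category, items in categories.items()
        for item in items
    }


def categorise(item, lookup, default="uncategorised"):
    """Return the category header for an item, ignoring case and padding."""
    return lookup.get(item.strip().lower(), default)


lookup = build_lookup(CATEGORIES)
expenses = ["Tomatoes", "Joghurt", "mozzarella", "macadamia", "bread"]
types = [categorise(item, lookup) for item in expenses]
print(types)
```

The inverted mapping corresponds to the "more rows than columns" layout recommended above: one row per item with its category next to it.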

How can I capture and store data from a repeating HL7 segment?

We currently capture data from HL7 messages as shown below and then insert it into the database. This is easy because the value comes from a single segment:
var vACC_NO = checkSize("ACC", msg['PID']['PID.3']['PID.3.1'].toString(), 20);
INSERT INTO adt_tab (SITEID, ACC_NO) VALUES (vSITEID, vACC_NO);
Now I need to capture DG1 segment data, where the HL7 message contains multiple DG1 segments, and store it in the database as well:
DG1|1|ICD10|I22.8^MYOCARDIAL INFARCT^ICD10|MYOCARDIAL INFARCTION|201702010437|B|||||||||7
DG1|2|ICD10|A44.9^ORGANISM^ICD10|ORGANISM|20170201 0437|B|||||||||7
So my database table now has more columns: SITEID, ACC_NO, CODE1, CODE2, ...
From the above message I need to insert I22.8 into CODE1, A44.9 into CODE2, and so on.
How should I first capture these codes in a loop over the multiple DG1 segments in the message?
And then how should I store them in the database?
Thanks
You can iterate over the segments like this:
for each (dg1 in msg['DG1']) {
    // DG1.3.1 holds the diagnosis code, DG1.3.2 its description
    var variable1 = dg1['DG1.3']['DG1.3.1'].toString();
    var variable2 = dg1['DG1.3']['DG1.3.2'].toString();
    // database call with the previous values
    databaseCall(variable1, variable2 /* , ... */);
}
This does one insert for each segment.
Apart from this, I do not think it is a good idea to add more columns to the same row (variable1, variable2, variable3, ...), as that is not normalized and is not good database design practice.
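The normalized alternative (one row per diagnosis instead of CODE1, CODE2, ... columns) can be sketched outside the channel script. The Python below is only an illustration that splits the raw pipe-delimited text itself; extract_diagnoses, the account number "ACC123" and the row layout are hypothetical, and the segments are the two from the question.

```python
def extract_diagnoses(message):
    """Return one (set_id, code, description) tuple per DG1 segment."""
    diagnoses = []
    # HL7 v2 separates segments with carriage returns; splitlines also accepts \n
    for segment in message.splitlines():
        fields = segment.split("|")
        if fields[0] != "DG1" or len(fields) < 4:
            continue
        # DG1.3 has the form code^text^coding system
        components = fields[3].split("^")
        code = components[0]
        description = components[1] if len(components) > 1 else ""
        diagnoses.append((fields[1], code, description))
    return diagnoses


MESSAGE = "\r".join([
    "DG1|1|ICD10|I22.8^MYOCARDIAL INFARCT^ICD10|MYOCARDIAL INFARCTION|201702010437|B|||||||||7",
    "DG1|2|ICD10|A44.9^ORGANISM^ICD10|ORGANISM|20170201 0437|B|||||||||7",
])

diagnoses = extract_diagnoses(MESSAGE)
# Hypothetical child-table rows: (ACC_NO, sequence, code), one per diagnosis
acc_no = "ACC123"
child_rows = [(acc_no, int(set_id), code) for set_id, code, _ in diagnoses]
print(child_rows)
```

With a child table keyed on the account number and the DG1 sequence number, a message with any number of DG1 segments needs no schema change.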

SUMPRODUCT with wildcard

I have the SUMPRODUCT working with hardcoded values; however, I want to use a wildcard for the B column in my example.
Here is my data
+----------+----------+-----------+
| A COLUMN | B COLUMN | C COLUMN |
+----------+----------+-----------+
| Status | Fruit | Quantity |
| | | |
| Fresh | Apple | 6 |
| | | |
| Fresh | Apricot | 7 |
| | | |
| Stale | Apple | 4 |
+----------+----------+-----------+
I would like to match Fresh and AP*, and then sum the matches from column C.
I have the following:
=SUMPRODUCT(--($B$2:$B$840="AP*"),--($A$2:$A$840="Fresh"),$C$2:$C$840)
This version works with the wildcard, but the count is off:
=SUMPRODUCT(ISNUMBER(SEARCH("AP",$B$2:$B$840,1))*($A$2:$A$840="Fresh")*($C$2:$C$840))
The SUMPRODUCT() function does not support wildcards within an array-type expression. The same result can be achieved with:
=SUMPRODUCT((A2:A1000="Fresh")*(LEFT(B2:B1000,2)="Ap")*(C2:C1000))
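The difference between the two approaches can be checked with a small sketch. SEARCH finds its text anywhere in the cell, which is one likely reason the count was off, while LEFT(...,2) only compares the first two characters. The Python below mirrors both tests, case-insensitively; the function names are hypothetical, and the "Grape" row is an invented addition to the question's data that makes the difference visible.

```python
# Rows from the question plus one hypothetical row ("Grape") for illustration
ROWS = [
    ("Fresh", "Apple", 6),
    ("Fresh", "Apricot", 7),
    ("Stale", "Apple", 4),
    ("Fresh", "Grape", 5),
]


def sum_prefix(rows, status, prefix):
    """Sum quantities where status matches and the fruit starts with prefix."""
    status, prefix = status.lower(), prefix.lower()
    return sum(
        qty for s, fruit, qty in rows
        if s.lower() == status and fruit.lower().startswith(prefix)
    )


def sum_contains(rows, status, text):
    """Sum quantities where status matches and the fruit contains text anywhere."""
    status, text = status.lower(), text.lower()
    return sum(
        qty for s, fruit, qty in rows
        if s.lower() == status and text in fruit.lower()
    )


prefix_total = sum_prefix(ROWS, "Fresh", "Ap")
contains_total = sum_contains(ROWS, "Fresh", "Ap")
print(prefix_total, contains_total)
```

The prefix test counts only Apple and Apricot, while the contains test also picks up the invented Grape row.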

How to map values within a data series to different y-axes?

I have a column chart in Highcharts that looks roughly like this:
| |
| |
S | |
e | | M
c | +-+ | e
o | +-+ | | +-+ +-+ +-+ | t
n | +-+ | | | | | | | | +-+ | | | e
d | | | | | | | | | +-+ | | | | | | | r
s | | | +-+ | | | | | | | | | | | | | | | s
| |1| |2| |3| |1| |2| |3| |1| |2| |3| |
+-------------------------------------------------------------+
Fld A (s) Fld B (s) Fld C (m)
The labels "1", "2", and "3" refer to records; while "A", "B", and "C" refer to fields. So record #1 is represented as three separate values over fields A, B, and C, as represented by the labeled columns. I achieved this result by:
Providing an array to the series config option, one series for each record.
Providing an array to the xAxis/categories config option, one element for each field name.
Providing a 2-element array to the yAxis config option.
My problem is that values in field C are shown on the Seconds axis, even though they are in units of Meters. I could change the entire series to be on the Meters axis (via the series/yAxis config option), but then fields A and B would show on the wrong axis.
Is there any way to map values within a series to different axes?
EDIT 9/12/2011: If this is impossible as stated, I'm willing to accept an alternate method, such as a different configuration or modifying Highcharts internals, via a plugin or otherwise.
EDIT 9/13/2011: I asked the same question on the HighCharts forum here: http://highslide.com/forum/viewtopic.php?f=9&t=12315, and no one has answered it there either. I'm beginning to think there is probably not any easy answer. :)
A demo is available here: http://www.highcharts.com/demo/combo-dual-axes
chart.yAxis should be an array of two yAxis objects and your series object should specify the yAxis that it corresponds to.
A highslide support person told me this is not possible.
However, another person gave me a possible workaround: create a separate set of series for field C. Then set the values for fields A and B in the second set to null, and set the values in the first set of series for field C to null.
There is a link to a jsfiddle that demonstrates this workaround in the forum topic: http://highslide.com/forum/viewtopic.php?f=9&t=12315
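The workaround can be sketched as generated configuration. The Python below builds an options dictionary that could be serialized to JSON and passed to Highcharts; it is only an illustration, not the code from the jsfiddle. The record values, the " (m)" name suffix and the build_series helper are hypothetical, while series, xAxis.categories, yAxis, type and opposite are the Highcharts option names.

```python
import json

FIELDS = ["Fld A (s)", "Fld B (s)", "Fld C (m)"]
# Hypothetical values per record, ordered like FIELDS
RECORDS = {"1": [3.0, 5.0, 2.0], "2": [4.0, 5.0, 1.0], "3": [2.0, 5.0, 3.0]}
METER_FIELDS = {2}  # indices of the fields measured in meters


def build_series(records, meter_fields):
    """Create two series per record: one per axis, with nulls for the other axis."""
    series = []
    for name, values in records.items():
        seconds = [None if i in meter_fields else v for i, v in enumerate(values)]
        meters = [v if i in meter_fields else None for i, v in enumerate(values)]
        series.append({"name": name, "type": "column", "yAxis": 0, "data": seconds})
        series.append({"name": name + " (m)", "type": "column", "yAxis": 1, "data": meters})
    return series


options = {
    "xAxis": {"categories": FIELDS},
    "yAxis": [
        {"title": {"text": "Seconds"}},
        {"title": {"text": "Meters"}, "opposite": True},
    ],
    "series": build_series(RECORDS, METER_FIELDS),
}
# None is serialized as null, which Highcharts treats as a missing point
print(json.dumps(options, indent=2))
```

Each record then appears twice in the series list, once per axis, so the legend may need adjusting to hide the duplicates.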
