Calculating a new column in spark df based on another spark df without an explicit join column - join

I have two DataFrames, df1 and df2, without a common join column. Now I need to add a new column to df1, taken from df2, when a condition based on df2's columns is met. I will try to explain myself better with an example:
df1:
+--------+----------+
|label | raw |
+--------+----------+
|0.0 |-1.1088619|
|0.0 |-1.3188809|
|0.0 |-1.3051535|
+--------+----------+
df2:
+--------------------+----------+----------+
| probs | minRaw| maxRaw|
+--------------------+----------+----------+
| 0.1|-1.3195256|-1.6195256|
| 0.2|-1.6195257|-1.7195256|
| 0.3|-1.7195257|-1.8195256|
| 0.4|-1.8195257|-1.9188809|
The expected output is a new column in df1 that gets df2.probs when the df1.raw value is between df2.minRaw and df2.maxRaw.
My first approach was to explode the range between minRaw and maxRaw and then join the DataFrames, but those columns are continuous. The second idea was a UDF like this:
def get_probabilities(raw):
    # look up the probability whose raw range contains this value
    df = isotonic_prob_table.filter((F.col("min_raw") >= raw) &
                                    (F.col("max_raw") <= raw)) \
                            .select("probs")
    df.show()
    return df.first()["probs"]
But it takes a long time on my large DataFrame, and gives me these warnings:
23/02/13 22:02:20 WARN org.apache.spark.sql.execution.window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
23/02/13 22:02:20 WARN org.apache.spark.sql.execution.window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
[Stage 82:> (0 + 1) / 1][Stage 83:====> (4 + 3) / 15]23/02/13 22:04:36 WARN org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
23/02/13 22:04:36 WARN org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.
If the value isn't between minRaw and maxRaw, the expected output is null, and df1 can have duplicates.
I have Spark version 2.4.7 and I'm not a PySpark expert. Thank you in advance for reading!

I think you can just join those DataFrames with a between condition.
df1.join(df2, f.col('raw').between(f.col('maxRaw'), f.col('minRaw')), 'left').show(truncate=False)
+-----+-----+-----+----------+----------+
|label|raw |probs|minRaw |maxRaw |
+-----+-----+-----+----------+----------+
|0.0 |-1.1 |null |null |null |
|0.0 |-1.1 |null |null |null |
|0.0 |-1.32|0.1 |-1.3195256|-1.6195256|
|0.0 |-1.32|0.1 |-1.3195256|-1.6195256|
|0.0 |-1.73|0.3 |-1.7195257|-1.8195256|
|0.0 |-1.88|0.4 |-1.8195257|-1.9188809|
+-----+-----+-----+----------+----------+
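Since there is no equality condition, Spark 2.4 will typically plan this as a broadcast nested loop join. If df2 (the small probability table) comfortably fits in memory, giving the optimizer an explicit broadcast hint is a reasonable safeguard; a minimal sketch, assuming df2 is small:
from pyspark.sql import functions as F

# Broadcast the small lookup table so the range join doesn't shuffle df1.
# In this data maxRaw holds the numerically smaller bound, hence the argument order.
result = df1.join(
    F.broadcast(df2),
    F.col('raw').between(F.col('maxRaw'), F.col('minRaw')),
    'left'
)
result.show(truncate=False)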

Use a range BETWEEN in a SQL expression:
df2.createOrReplaceTempView('df2')
df1.createOrReplaceTempView('df1')
%sql
SELECT minRaw,maxRaw,raw
FROM df1 JOIN df2 ON df1.raw BETWEEN df2.minRaw and df2.maxRaw
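Outside a notebook, where the %sql magic isn't available, the same idea can be run through spark.sql(). A sketch that also selects probs and uses a LEFT JOIN so unmatched rows come back as null; note that in the sample data the maxRaw column holds the numerically smaller bound, which is why the bounds are given in that order here:
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

# LEFT JOIN keeps every row of df1; probs is null when raw falls in no range.
spark.sql("""
    SELECT df1.label, df1.raw, df2.probs
    FROM df1
    LEFT JOIN df2
      ON df1.raw BETWEEN df2.maxRaw AND df2.minRaw
""").show(truncate=False)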

You can perform a cross join between df1 and df2 and apply a filter so that you're only selecting rows where df1.raw is between df2.minRaw and df2.maxRaw – this should be more performant than a UDF.
Note: Since df1 can have duplicates, we want to deduplicate df1 before cross joining with df2 so that after we apply the filter we don't have any duplicate rows, but still have the minimum information we need. Then we can right join on df1 to ensure we have all of the original rows in df1.
I've also modified your df1 slightly to include duplicates for the purpose of demonstrating the result:
df1 = spark.createDataFrame(
[
(0.0,-1.10),
(0.0,-1.10),
(0.0,-1.32),
(0.0,-1.32),
(0.0,-1.73),
(0.0,-1.88)
],
['label','raw']
)
df2 = spark.createDataFrame(
[
(0.1, -1.3195256, -1.6195256),
(0.2, -1.6195257, -1.7195256),
(0.3, -1.7195257, -1.8195256),
(0.4, -1.8195257, -1.9188809)
],
['probs','minRaw','maxRaw']
)
This is the result when you crossjoin df1 and df2 and remove duplicates:
df1.drop_duplicates().crossJoin(df2).show()
+-----+-----+-----+----------+----------+
|label| raw|probs| minRaw| maxRaw|
+-----+-----+-----+----------+----------+
| 0.0| -1.1| 0.1|-1.3195256|-1.6195256|
| 0.0|-1.32| 0.1|-1.3195256|-1.6195256|
| 0.0|-1.73| 0.1|-1.3195256|-1.6195256|
| 0.0|-1.88| 0.1|-1.3195256|-1.6195256|
...
| 0.0| -1.1| 0.4|-1.8195257|-1.9188809|
| 0.0|-1.32| 0.4|-1.8195257|-1.9188809|
| 0.0|-1.73| 0.4|-1.8195257|-1.9188809|
| 0.0|-1.88| 0.4|-1.8195257|-1.9188809|
+-----+-----+-----+----------+----------+
Then we can apply the filter and right join with df1 to make sure all of the original rows exist:
df1.drop_duplicates().crossJoin(df2).filter(
    (F.col('raw') > F.col('maxRaw')) & (F.col('raw') < F.col('minRaw'))
).select(
    'label', 'raw', 'probs'
).join(
    df1, on=['label', 'raw'], how='right'
).show()
+-----+-----+-----+
|label| raw|probs|
+-----+-----+-----+
| 0.0| -1.1| null|
| 0.0| -1.1| null|
| 0.0|-1.32| 0.1|
| 0.0|-1.32| 0.1|
| 0.0|-1.73| 0.3|
| 0.0|-1.88| 0.4|
+-----+-----+-----+
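Since df2 is only a handful of rows, another option, closer to the UDF idea in the question but without filtering a DataFrame per row, is to collect the probability table to the driver once and do the range lookup in a plain Python UDF. A minimal sketch, assuming df2 stays small enough to collect:
from pyspark.sql import functions as F, types as T

# Collect the small probability table once on the driver.
ranges = [(r['probs'], r['minRaw'], r['maxRaw']) for r in df2.collect()]

@F.udf(T.DoubleType())
def lookup_probs(raw):
    # In this data maxRaw holds the smaller number, so the range is [maxRaw, minRaw].
    for probs, min_raw, max_raw in ranges:
        if max_raw <= raw <= min_raw:
            return probs
    return None

df1.withColumn('probs', lookup_probs('raw')).show()
This avoids the per-row DataFrame filter that made the original UDF slow, although the join-based approaches above are usually the better fit.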

Related

Mapping timeseries+static information into an ML model (XGBoost)

So let's say I have multiple probs, where one prob has two input DataFrames:
Input:
One constant stream of data (e.g. from a sensor). Second step: multiple streams from multiple sensors.
> df_prob1_stream1
timestamp | ident | measure1 | measure2 | total_amount |
----------------------------+--------+--------------+----------+--------------+
2019-09-16 20:00:10.053174 | A | 0.380 | 0.08 | 2952618 |
2019-09-16 20:00:00.080592 | A | 0.300 | 0.11 | 2982228 |
... (1 million more rows - until a pre-defined ts) ...
One static DataFrame of information, mapped to a unique identifier called ident, which needs to be assigned to the ident column in each df_probX_streamX so that the system can recognize that this data is related.
> df_global
ident | some1 | some2 | some3 |
--------+--------------+----------+--------------+
A | LARGE | 8137 | 1 |
B | SMALL | 1234 | 2 |
Output:
A binary classifier [0,1]
So how can I suitably train XGBoost to make the best use of one timeseries DataFrame in combination with one static DataFrame (containing additional context information) in one prob? Any help would be appreciated.
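For the data-wiring step described above, attaching the static per-ident attributes to every row of a stream before building a feature matrix, a rough sketch with pandas and xgboost follows. The column names are taken from the tables above; the labels variable and the hyperparameters are placeholders rather than part of the original question:
import pandas as pd
import xgboost as xgb

# Attach the static per-ident attributes to every timeseries row.
df_features = df_prob1_stream1.merge(df_global, on='ident', how='left')

# One-hot encode the categorical static column so XGBoost can consume it.
df_features = pd.get_dummies(df_features, columns=['some1'])

X = df_features.drop(columns=['timestamp', 'ident'])
y = labels  # assumed: a 0/1 target aligned with the rows of df_features

model = xgb.XGBClassifier(n_estimators=100)  # placeholder hyperparameters
model.fit(X, y)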

Building Activerecord / SQL query for jsonb value search

Currently, for a recurring search with different parameters, I have this ActiveRecord query built:
current_user.documents.order(:updated_at).reverse_order.includes(:groups,:rules)
Now, usually I tack on a where clause to this to perform this search. However, I now need to search through the jsonb field for all rows that have a certain value in a key:value pair. I've been able to do something similar to that in SQL, with this syntax (the data field will only be exactly two levels nested):
SELECT
*
FROM
(SELECT
*
FROM
(SELECT
*
FROM
documents
) A,
jsonb_each(A.data)
) B,
jsonb_each_text(B.value) AS C
WHERE
C.value = '30';
However, I want to use the current ActiveRecord search to make this query (which includes the groups/rules eager loading).
I'm struggling with the use of the comma, which I understand is an implicit join, which is executed before explicit joins, so when I try something like this:
select * from documents B join (select * from jsonb_each(B.data)) as A on true;
ERROR: invalid reference to FROM-clause entry for table "b"
LINE 1: ...* from documents B join (select * from jsonb_each(B.data)) a...
^
HINT: There is an entry for table "b", but it cannot be referenced from this part of the query.
But I don't understand how to reference the complete "table" that my ActiveRecord query creates before I make a joins call, nor how to make use of the comma syntax for implicit joins.
Also, I'm an SQL amateur, so if you see some improvements or other ways to do this, please do tell.
EDIT: Description of documents table:
Table "public.documents"
Column | Type | Modifiers | Storage | Stats target | Description
------------+-----------------------------+--------------------------------------------------------+----------+--------------+-------------
id | integer | not null default nextval('documents_id_seq'::regclass) | plain | |
document_id | character varying | | extended | |
name | character varying | | extended | |
size | integer | | plain | |
last_updated| timestamp without time zone | | plain | |
user_id | integer | | plain | |
created_at | timestamp without time zone | | plain | |
updated_at | timestamp without time zone | | plain | |
kind | character varying | | extended | |
uid | character varying | | extended | |
access_token_id | integer | | plain | |
data | jsonb | not null default '{}'::jsonb | extended | |
Indexes:
"documents_pkey" PRIMARY KEY, btree (id)
Sample rows, first would match a search for '30' (data is the last field):
2104 | 24419693037 | LsitHandsBackwards.jpg | | | 1 | 2017-06-25 21:45:49.121686 | 2017-07-01 21:32:37.624184 | box | 221607127 | 15 | {"owner": {"born": "to make history", "price": 30}}
2177 | /all-drive/uml flows/typicaluseractivity.svg | TypicalUserActivity.svg | 12375 | 2014-08-11 02:21:14 | 1 | 2017-07-07 14:00:11.487455 | 2017-07-07 14:00:11.487455 | dropbox | 325694961 | 20 | {"owner": {}}
You can use a query similar to the one you already showed:
SELECT
d.id, d.data
FROM
documents AS d
INNER JOIN jsonb_each(d.data) AS x ON TRUE
INNER JOIN jsonb_each(x.value) AS y ON TRUE
WHERE
cast(y.value as text) = '30';
Assuming your data is the following:
INSERT INTO documents
(data)
VALUES
('{"owner": {"born": "to make history", "price": 30}}'),
('{"owner": {}}'),
('{"owner": {"born": "to make history", "price": 50}, "seller": {"worth": 30}}')
;
The result you'd get is:
id | data
-: | :---------------------------------------------------------------------------
1 | {"owner": {"born": "to make history", "price": 30}}
3 | {"owner": {"born": "to make history", "price": 50}, "seller": {"worth": 30}}
You can check it (together with some step-by-step looks at the data) at dbfiddle here

Transform a table horizontally, with filling NULL columns when needed

I have two tables
ORDERS
_____________________
|OrdNum |Stockid |
----------------------
|1234 |alpha |
|1238 |beta |
|1745 |gamma |
----------------------
and MARKS
______________________________
|OrdNum |RowNum| Mark |
------------------------------
|1234 |1 | AB |
|1238 |1 | XY |
|1238 |2 | XZ |
|1745 |1 | KD |
|1745 |2 | KS |
|1745 |3 | JJ |
|1745 |4 | RT |
|1745 |5 | PJ |
------------------------------
For every order, there can be from 1 up to a maximum of 5 corresponding records in the MARKS table.
I need to do a sort of "horizontal" join, with NULL values where no correspondence occurs. Frankly speaking, I am not even sure it can be done.
The final results should look like the following
_____________________________________________________
|OrdNum |Mark1 |Mark2 |Mark3 |Mark4 |Mark5 |
-----------------------------------------------------
|1234 |AB |NULL |NULL |NULL |NULL |
|1238 |XY |XZ |NULL |NULL |NULL |
|1745 |KD |KS |JJ |RT |PJ |
-----------------------------------------------------
Again, the final view has to be horizontal, with 5 (plus order number) columns.
Does anybody know if this is possible?
Thank you in advance.
In T-SQL (SQL Server 2008) I tried the following join.
For the sake of simplification, let us consider only 2 columns instead of 5.
SELECT o.OrdNum,
m1.Mark AS MARK1,
m2.Mark AS MARK2,
FROM ORDERS o
LEFT JOIN MARKS m1 ON o.OrdNum = m1.OrdNum WHERE m1.RowNum=1
LEFT JOIN MARKS m2 ON o.OrdNum = m2.OrdNum WHERE m2.RowNum=2
which is syntactically wrong.
So I modified the join as it follows
SELECT o.OrdNum,
m1.Mark AS MARK1,
m2.Mark AS MARK2
FROM ORDERS o
LEFT JOIN MARKS m1 ON o.OrdNum = m1.OrdNum
LEFT JOIN MARKS m2 ON o.OrdNum = m2.OrdNum
WHERE m1.RowNum=1 AND m2.RowNum=2
which is not what I want, since it does not return the orders that only have a record for the first RowNum (with the values at the beginning of this post, it will not show order number 1234).

Profiling neo4j query: filter to db hits

I'm curious how filters work in neo4j queries. They result in db hits (according to PROFILE), and it seems that they shouldn't.
An example query:
PROFILE MATCH (a:act)<-[r:relationship]-(n)
WHERE a.chapter='13' and a.year='2009'
RETURN r, n
NodeIndexSeek (I created an index on the act label for the chapter property) returns 6 rows.
Filter: a.year == {AUTOSTRING1} which results in 12 db hits.
Why does it need to do any db hits if it has already fetched the 6 matching instances of a in earlier db reads? Shouldn't it just filter them down without going back to do more db reads?
I realise I'm equating 'db hits' with 'db reads' here, which may not be accurate. If not, what exactly are 'db hits'?
Lastly, the number of db hits incurred by a filter appear to approximately match:
<number of filtering elements> * 2 * <number of already queried nodes to filter on>
where 'number of filtering elements' is the number of filters provided, i.e.
WHERE a.year='2009' and a.property_x='thing'
is two elements.
Thanks for any help.
EDIT:
Here are the results of PROFILE and EXPLAIN on the query.
This is just an example query. I've found the behaviour of
filter db hits = <number of filtering elements> * 2 * <number of already queried nodes to filter on>
to be generally true in queries I've run.
PROFILE MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
8 rows
55 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+------+--------+-------------+---------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------+---------------+------+--------+-------------+---------------------------+
| Projection | 1 | 8 | 0 | a, n, r | r; n |
| Expand(All) | 1 | 8 | 9 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | 1 | 12 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | 6 | 7 | a | :act(chapter) |
+---------------+---------------+------+--------+-------------+---------------------------+
Total database accesses: 28
EXPLAIN MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
4 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+-------------+---------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+---------------+---------------+-------------+---------------------------+
| Projection | 1 | a, n, r | r; n |
| Expand(All) | 1 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | a | :act(chapter) |
+---------------+---------------+-------------+---------------------------+
Total database accesses: ?
Because reading a node (record) and reading its properties (records) are not the same db operation.
You are right that the filter hits should be at most 6, though.
Usually Neo4j pulls filters and predicates to the earliest possible moment, so it should filter directly after the index lookup.
In some situations though (due to the predicate) it can only filter after finding the paths, then the number of db-hits might equal the number of checked paths.
Which Neo4j version are you using? Can you share your full query plan?

Calculating Statistical summary in cypher

I run the following query to get the average likes for each category:
neo4j-sh (?)$ START n=node:node_auto_index(type = "U") match n-[r:likes]->()-[:mapsTo]->items return AVG(r.count) as AVGLIKES, items.name as CATEGORY;
==> +------------------------------------------------------+
==> | AVGLIKES | CATEGORY |
==> +------------------------------------------------------+
==> | 7.122950819672131 | "Culture" |
==> | 1.3333333333333333 | "Food & Drinks" |
==> | 2.111111111111111 | "Albums" |
==> | 2.581081081081081 | "Movies" |
==> | 2.1 | "Musicians" |
==> | 7.810126582278481 | "Culture Celebs" |
==> | 3.1206896551724137 | "TV Shows" |
==> | 1.0 | "Apps/Games" |
==> | 4.0256410256410255 | "Cars" |
But AVG is a built-in function; how do I calculate the standard deviation and other statistical summaries for each category? I am looking for something like "GROUP BY" in SQL that will group everything for each category so that I could write some code, or a better way to do it.
I added the percentile_disc, percentile_cont aggregation functions late last year. I imagine they would be willing to merge in functions for std dev. In theory (I think, my statistics is rusty) you could calculate the standard deviation based on some samples of percentiles along with the average. So, aside from standard deviation, what else are you looking for?
Update:
I made a pull request for stdev/stdevp aggregation functions: https://github.com/neo4j/neo4j/pull/859
