Consider the following two tables, with 3 columns each:
Table 1:
a INTEGER NOT NULL,
b INTEGER NOT NULL,
c INTEGER NOT NULL
Table 2:
d INTEGER NOT NULL,
e INTEGER,
f INTEGER NOT NULL
I'm trying to write a query expression that joins the two tables on a 2 part, composite key: (b, c) = (e, f).
I know that if column e was not Nullable I could just write:
query {
for r1 in c.table1 do
join r2 in c.table2 on ((r1.b, r1.c) = (r2.e, r2.f))
.
.
}
But how do I do it if column e is Nullable but column b is not?
I need to join two tables. The condition is that one column of the first table matches any column from a very long list, i.e., the following:
columns = ['name001', 'name002', ..., 'name298']
df = df1.join(df2, (df1['name']==df2['name1']) | (df1['name']==df2['name2']) | ... | (df1['name']==df2['name298']))
How can I implement this join in PySpark without writing out the long condition? Many thanks!
You can loop over the columns list to build a join expression:
join_expr = (df1["name"] == df2[columns[0]])
for c in columns[1:]:
    join_expr = join_expr | (df1["name"] == df2[c])
Or using functools.reduce:
from functools import reduce
join_expr = reduce(
    lambda e, c: e | (df1["name"] == df2[c]),
    columns[1:],
    df1["name"] == df2[columns[0]]
)
Now use join_expr to join:
df = df1.join(df2, on=join_expr)
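As a variation on the reduce approach, the fold can also be written with operator.or_ over a generator, which avoids the separate seed value. The sketch below is only an illustration: Col is a hypothetical stand-in class (not part of PySpark) that records the expression as text so the example runs without a Spark session, and any_column_matches is a made-up helper name. With real DataFrames, the assumption is that the same reduce call works on df1["name"] == df2[c], since Spark columns support the | operator.

```python
from functools import reduce
import operator


class Col:
    """Hypothetical stand-in for a Spark Column: records the expression as text."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Mimics Column == Column returning a new expression, not a bool.
        return Col(f"({self.text} = {other.text})")

    def __or__(self, other):
        # Mimics Column | Column building a logical OR expression.
        return Col(f"({self.text} OR {other.text})")


def any_column_matches(left, right_columns):
    """OR together `left == c` for every column c, with no seed value."""
    return reduce(operator.or_, (left == c for c in right_columns))


columns = ["name1", "name2", "name3"]
expr = any_column_matches(Col("df1.name"), [Col(f"df2.{c}") for c in columns])
print(expr.text)
```

The printed text shows the left-nested chain of OR conditions that the loop and the lambda version build as well.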
I would like to reduce a list into a string to adhere to a specific output format which requires a pipe ( '|' ) between the elements. I do it as follows:
WITH ["three", "two", "one"] AS a RETURN reduce(acc=head(a), s in tail(a) | acc + "|" + s)
My issue arises from the fact that the array is in the wrong order: you can see that it "counts" in descending order, while I'd like it ascending (in my production environment the array is an intermediate result of a graph query, of course).
So I thought I would just do
WITH ["three", "two", "one"] AS a RETURN reduce(acc=head(a), s in REVERSE(tail(a)) | acc + "|" + s)
Unfortunately, reverse seems to return a collection of some generic type (any) which is not accepted by the string concatenation operator:
Type mismatch: expected Float, Integer, String or List<String> but was Any (line 1, column 98 (offset: 97))
"WITH ["three", "two", "one"] AS a RETURN reduce(acc=head(a), s in reverse(tail(a)) | acc + "|" + s)"
^
Thus I'd like to convert the 's' to a string via toString. This function, however, will only accept integer, float or boolean values and not any.
What can I do? I would also accept a solution without the conversion. I just want to be able to reduce a reversed collection of strings into a single string.
Thank you!
You can avoid using the REVERSE() function by simply reversing the order in which you concatenate (i.e., using s + "|" + acc instead of acc + "|" + s):
WITH ["three", "two", "one"] AS a
RETURN REDUCE(acc=HEAD(a), s in TAIL(a) | s + "|" + acc )
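To see why swapping the operands reverses the result, here is the same fold sketched in Python (not Cypher). It assumes nothing beyond the sample list from the question; the variable names are illustrative.

```python
from functools import reduce

a = ["three", "two", "one"]

# Appending each element keeps the original (descending) order.
appended = reduce(lambda acc, s: acc + "|" + s, a[1:], a[0])

# Prepending each element reverses the order without an explicit reverse step.
prepended = reduce(lambda acc, s: s + "|" + acc, a[1:], a[0])

print(appended)   # three|two|one
print(prepended)  # one|two|three
```

Each step puts the new element in front of everything accumulated so far, so the last element of the list ends up first.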
Good morning,
I'm trying to transpose some data in BigQuery. I've looked at a few other questions about this on Stack Overflow, but the way to do it seems to be legacy SQL (using group_concat_unquoted) rather than standard SQL. I would use legacy SQL, but I've had issues with nested data in the past, so I have used standard SQL only since then.
Here's my example. For context, I'm trying to map out some customer journeys, which I have below:
uniqueid | page_flag  | order_of_pages
A        | Collection | 1
A        | Product    | 2
A        | Product    | 3
A        | Login      | 4
A        | Delivery   | 5
B        | Clearance  | 1
B        | Search     | 2
B        | Product    | 3
C        | Search     | 1
C        | Collection | 2
C        | Product    | 3
However I'd like to transpose the data so it looks like this:
uniqueid | 1 | 2 | 3 | 4 | 5
A | Collection | Product | Product | Login | Delivery
B | Clearance | Search | Product | NULL | NULL
C | Search | Collection | Product | NULL | NULL
I've tried using multiple left joins, but the query below fails with the error shown after it:
select a.uniqueid,
b.page_flag as page1,
c.page_flag as page2,
d.page_flag as page3,
e.page_flag as page4,
f.page_flag as page5
from
(select distinct uniqueid,
(case when uniqueid is not null then 1 end) as page_hit1,
(case when uniqueid is not null then 2 end) as page_hit2,
(case when uniqueid is not null then 3 end) as page_hit3,
(case when uniqueid is not null then 4 end) as page_hit4,
(case when uniqueid is not null then 5 end) as page_hit5
from `mytable`) a
LEFT JOIN (
SELECT *
from `mytable`) b on a.uniqueid = b.uniqueid
and a.page_hit1 = b.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) c on a.uniqueid = c.uniqueid
and a.page_hit2 = c.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) d on a.uniqueid = d.uniqueid
and a.page_hit3 = d.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) e on a.uniqueid = e.uniqueid
and a.page_hit4 = e.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) f on a.uniqueid = f.uniqueid
and a.page_hit5 = f.order_of_pages
Error: Query exceeded resource limits for tier 1. Tier 13 or higher required.
I've looked at using Array function as well but I've never used this before and I'm not sure if this is just for transposing the other way around. Any advice would be grand.
Thank you
for BigQuery Standard SQL
#standardSQL
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid
You can test it with the dummy data from your question, as below:
#standardSQL
WITH `mytable` AS (
SELECT 'A' AS uniqueid, 'Collection' AS page_flag, 1 AS order_of_pages UNION ALL
SELECT 'A', 'Product', 2 UNION ALL
SELECT 'A', 'Product', 3 UNION ALL
SELECT 'A', 'Login', 4 UNION ALL
SELECT 'A', 'Delivery', 5 UNION ALL
SELECT 'B', 'Clearance', 1 UNION ALL
SELECT 'B', 'Search', 2 UNION ALL
SELECT 'B', 'Product', 3 UNION ALL
SELECT 'C', 'Search', 1 UNION ALL
SELECT 'C', 'Collection', 2 UNION ALL
SELECT 'C', 'Product', 3
)
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid
The result is:
uniqueid p1 p2 p3 p4 p5
A Collection Product Product Login Delivery
B Clearance Search Product null null
C Search Collection Product null null
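The conditional-aggregation idea behind MAX(IF(...)) can be mirrored in plain Python to show what each output column collects. This is only a sketch on the sample rows from the question; pivot and max_pages are illustrative names, not part of BigQuery.

```python
# Sample rows copied from the question: (uniqueid, page_flag, order_of_pages).
rows = [
    ("A", "Collection", 1), ("A", "Product", 2), ("A", "Product", 3),
    ("A", "Login", 4), ("A", "Delivery", 5),
    ("B", "Clearance", 1), ("B", "Search", 2), ("B", "Product", 3),
    ("C", "Search", 1), ("C", "Collection", 2), ("C", "Product", 3),
]


def pivot(rows, max_pages=5):
    """Return {uniqueid: [p1, ..., p_max]}, with None where no row matched."""
    table = {}
    for uniqueid, page_flag, position in rows:
        slots = table.setdefault(uniqueid, [None] * max_pages)
        if 1 <= position <= max_pages:
            current = slots[position - 1]
            # Like MAX over non-NULL values: keep the larger value seen so far.
            slots[position - 1] = page_flag if current is None else max(current, page_flag)
    return table


for uniqueid, pages in sorted(pivot(rows).items()):
    print(uniqueid, pages)
```

Each slot plays the role of one pN column: only rows with the matching order_of_pages contribute, and positions with no row stay empty.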
Depending on your needs, you can also consider the approach below (not a pivot, though):
#standardSQL
SELECT uniqueid,
STRING_AGG(page_flag, '>' ORDER BY order_of_pages) AS journey
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid
Run against the same dummy data as above, the result is:
uniqueid journey
A Collection>Product>Product>Login>Delivery
B Clearance>Search>Product
C Search>Collection>Product
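For comparison, the ordered string aggregation can be sketched in Python as a sort followed by a grouped join. The rows are the sample data from the question; journeys is a hypothetical helper name.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows copied from the question: (uniqueid, page_flag, order_of_pages).
rows = [
    ("A", "Collection", 1), ("A", "Product", 2), ("A", "Product", 3),
    ("A", "Login", 4), ("A", "Delivery", 5),
    ("B", "Clearance", 1), ("B", "Search", 2), ("B", "Product", 3),
    ("C", "Search", 1), ("C", "Collection", 2), ("C", "Product", 3),
]


def journeys(rows, separator=">"):
    """Join each uniqueid's page_flag values in order_of_pages order."""
    # Sort by id, then by position, so groupby sees each id as one ordered run.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    return {
        uniqueid: separator.join(flag for _, flag, _ in group)
        for uniqueid, group in groupby(ordered, key=itemgetter(0))
    }


for uniqueid, journey in journeys(rows).items():
    print(uniqueid, journey)
```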
When given the choice to either join or filter in Pig, which is more performance-intensive?
Joins are generally the more expensive operation, because tuples from two relations have to be matched on the join key, whereas a filter only looks at one relation. Consider the example below:
A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = LOAD 'data2' AS (b1:int,b2:int);
DUMP B;
(2,4)
(8,9)
(1,3)
(2,7)
(2,9)
(4,6)
(4,9)
X = JOIN A BY a1, B BY b1;
DUMP X;
(1,2,3,1,3)
(4,2,1,4,6)
(4,3,3,4,6)
(4,2,1,4,9)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
To produce X with a join, the tuples of A have to be matched against the tuples of B on the key, so both relations are read and compared. For a filter, we traverse the dataset just once and evaluate the condition on each tuple.
X = FILTER A BY a3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
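The difference in work can be sketched in Python on the same sample data. This is an in-memory illustration only and not how Pig executes either operator; hash_join and filter_by_a3 are hypothetical helper names. The join has to read both relations and match keys, while the filter makes one pass over a single relation.

```python
from collections import defaultdict

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3), (7, 2, 5), (8, 4, 3)]
B = [(2, 4), (8, 9), (1, 3), (2, 7), (2, 9), (4, 6), (4, 9)]


def hash_join(left, right):
    """Join on the first field: index `right` by key, then probe with `left`."""
    index = defaultdict(list)
    for r in right:                      # one pass over B to build the index
        index[r[0]].append(r)
    # One pass over A, emitting a row for every matching tuple in B.
    return [l + r for l in left for r in index.get(l[0], [])]


def filter_by_a3(relation, value):
    """Keep tuples whose third field equals `value`: a single pass, one relation."""
    return [t for t in relation if t[2] == value]


print(sorted(hash_join(A, B)))
print(filter_by_a3(A, 3))
```

The join output contains the same seven tuples as the DUMP X listing above (order may differ), and the filter returns the same three tuples.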
Situation:
I have a table called "word" which contains a word with the associated translations.
| ID | name | lang_id | parent_id |
|----|----------|---------|-----------|
| 1 | screw | 1 | null |
| 2 | schraube | 2 | 1 |
| 3 | vis | 3 | 1 |
So screw is the main word which has no parent. The other data sets have an association to the parent with the parent_id.
What I want:
I need a query that returns both the word I typed in and its translation in the language I searched for.
I want to get datasets 2 and 3 if I query the word "schraube" from German to French.
I want to get datasets 1 and 3 if I query the word "screw" from English to French.
...
What I tried:
select word.id, word.name, word.lang_id, word.parent_id
from word
left join word w2 on word.parent_id = w2.parent_id
WHERE w2.name = 'screw';
-- and word.lang_id = 2
Unfortunately the result doesn't contain the word I typed. Also this displays all datasets, not only the ones with the specific language.
You can modify the query below to get your answer.
DECLARE @FromLanguageId SMALLINT = 2; --german
DECLARE @ToLanguageId SMALLINT = 3; --french
DECLARE @NAME NVARCHAR(300) = 'schraube';
--DECLARE @FromLanguageId SMALLINT = 1; --english
--DECLARE @ToLanguageId SMALLINT = 3; --french
--DECLARE @NAME NVARCHAR(300) = 'screw';
--Get the matching record
;
WITH ctematch
AS (
    --gets the matching record (child or parent)
    SELECT match.*
    FROM [word] match
    WHERE match.NAME LIKE @NAME),
--Join its siblings, parent and children
ctefamilydata
AS (SELECT *
    FROM ctematch match
    UNION
    --Parent
    SELECT parent.*
    FROM ctematch match
    INNER JOIN [word] parent
        ON match.[parent_id] = parent.[id]
    UNION
    --Child
    SELECT child.*
    FROM ctematch match
    INNER JOIN [word] child
        ON child.[parent_id] = match.[id]
    UNION
    --Siblings
    SELECT siblings.*
    FROM ctematch match
    INNER JOIN [word] siblings
        ON match.[parent_id] = siblings.[parent_id])
--Filter and get the data
SELECT *
FROM ctefamilydata Cte
WHERE Cte.[lang_id] = @ToLanguageId
    OR Cte.[lang_id] = @FromLanguageId
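The same lookup logic (match, then collect parent, children and siblings, then filter by the two languages) can be sketched in Python on the sample table to check both expected results. The translate function and the list-of-dicts layout are assumptions made for illustration, and the name match is exact rather than LIKE.

```python
# Rows copied from the "word" table in the question.
words = [
    {"id": 1, "name": "screw", "lang_id": 1, "parent_id": None},
    {"id": 2, "name": "schraube", "lang_id": 2, "parent_id": 1},
    {"id": 3, "name": "vis", "lang_id": 3, "parent_id": 1},
]


def translate(words, name, from_lang, to_lang):
    """Return the typed word and its relatives in the source or target language."""
    family = {}
    for match in (w for w in words if w["name"] == name):
        has_parent = match["parent_id"] is not None
        for w in words:
            is_self = w["id"] == match["id"]
            is_parent = has_parent and w["id"] == match["parent_id"]
            is_child = w["parent_id"] == match["id"]
            # Two top-level words (parent_id None) are not siblings,
            # just as NULL = NULL does not match in the SQL join.
            is_sibling = has_parent and w["parent_id"] == match["parent_id"]
            if is_self or is_parent or is_child or is_sibling:
                family[w["id"]] = w      # keyed by id, so duplicates collapse like UNION
    return [w for _, w in sorted(family.items())
            if w["lang_id"] in (from_lang, to_lang)]


print([w["id"] for w in translate(words, "schraube", 2, 3)])  # [2, 3]
print([w["id"] for w in translate(words, "screw", 1, 3)])     # [1, 3]
```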