What is the best way to combine these datasets to get my desired format? - join

Doing some ELT work...
What is the best way to combine these sets of data into the form of the desired output:
Dataset A:
| project_id1 | types1 |
| A           | apple  |
| B           | banana |
Dataset B:
| project_id1 | project_id2 | types2     |
| A           | 15          | strawberry |
| A           | 25          | onion      |
| B           | 5           | peach      |
Desired Result:
| project_id1 | project_id2 | types      |
| A           | 15          | strawberry |
| A           | 15          | apple      |
| A           | 25          | onion      |
| A           | 25          | apple      |
| B           | 5           | peach      |
| B           | 5           | banana     |
And is there a name for this type of combination?

You can get that result by joining Dataset A onto Dataset B to attach types1 to each (project_id1, project_id2) pair, then appending Dataset B's own rows with a UNION ALL:
Tables
create table da (
project_id1 char(1),
types1 varchar(100)
);
insert into da values
('A', 'apple'),
('B', 'banana');
create table db (
project_id1 char(1),
project_id2 int,
types2 varchar(100)
);
insert into db values
('A', 15, 'strawberry'),
('A', 25, 'onion'),
('B', 5, 'peach');
Query
select * from (
select da.project_id1, db.project_id2, da.types1 as types
from da
inner join db on da.project_id1 = db.project_id1
UNION ALL
select db.project_id1, db.project_id2, db.types2 as types
from db
) x
order by project_id1, project_id2, types desc;
Result
project_id1 project_id2 types
A 15 strawberry
A 15 apple
A 25 onion
A 25 apple
B 5 peach
B 5 banana
Example
https://rextester.com/ISQA20343
I don't know of a standard name for this kind of data merging; structurally it is just an inner join of the two tables combined with the second table's own rows via UNION ALL.
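Note that the inner join drops any da row whose project_id1 never appears in db. If you wanted to keep such rows (with a NULL project_id2), a LEFT JOIN variant along these lines should work (a sketch, beyond what the example data exercises):
select * from (
select da.project_id1, db.project_id2, da.types1 as types
from da
left join db on da.project_id1 = db.project_id1
UNION ALL
select db.project_id1, db.project_id2, db.types2 as types
from db
) x
order by project_id1, project_id2, types desc;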

Related

Search using ActiveRecord with multiple conditions

I want to search through a table with multiple conditions given in an array.
I have an array like this
[{u_id: 4, lv: 2}, {u_id: 10, lv: 1}, {u_id: 11, lv: 3}, ...]
and a table like this.
==== User Levels Table ====
| id | u_id | lv | desc |
| 1 | 1 | 1 | hoge |
| 2 | 1 | 2 | moke |
| 3 | 2 | 1 | doge |
...
I tried the following code for testing, and I got the results that I wanted.
user_levels = UserLevel.where({u_id: 4, lv: 2})
.or(UserLevel.where({u_id: 10, lv: 1})
.or(UserLevel.where({u_id: 11, lv: 3}) ...
However, I do not know the length of the array or the values inside the hashes in advance, so I cannot hard-code the conditions as in the code above.
Any ideas?
I think what you need is to map the array of conditions to relations and then join them together with or in a reduce:
array_of_conditions = [{u_id: 4, lv: 2}, ...]
user_levels = array_of_conditions
  .map { |cond| UserLevel.where(cond) }
  .reduce { |memo, cond| memo.or(cond) }
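For reference, this builds a single query whose WHERE clause ORs the conditions together. Assuming the default Rails table name user_levels (exact quoting depends on your adapter), the generated SQL should look roughly like:
SELECT "user_levels".* FROM "user_levels"
WHERE ("user_levels"."u_id" = 4 AND "user_levels"."lv" = 2)
   OR ("user_levels"."u_id" = 10 AND "user_levels"."lv" = 1)
   OR ("user_levels"."u_id" = 11 AND "user_levels"."lv" = 3)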

Transpose rows into columns in BigQuery using standard sql [duplicate]

This question already has answers here:
How to Pivot table in BigQuery
(7 answers)
Closed 2 years ago.
Good morning,
I'm trying to transpose some data in BigQuery. I've looked at a few other people who have asked this on Stack Overflow, but the suggested way seems to be legacy SQL (using group_concat_unquoted) rather than standard SQL. I would use legacy, but I've had issues with nested data in the past, so I have since used standard only.
Here's my example. To give some context, I'm trying to map out some customer journeys, which I have below:
uniqueid | page_flag  | order_of_pages
A        | Collection | 1
A        | Product    | 2
A        | Product    | 3
A        | Login      | 4
A        | Delivery   | 5
B        | Clearance  | 1
B        | Search     | 2
B        | Product    | 3
C        | Search     | 1
C        | Collection | 2
C        | Product    | 3
However, I'd like to transpose the data so it looks like this:
uniqueid | 1 | 2 | 3 | 4 | 5
A | Collection | Product | Product | Login | Delivery
B | Clearance | Search | Product | NULL | NULL
C | Search | Collection | Product | NULL | NULL
I've tried using multiple left joins but get the following error:
select a.uniqueid,
b.page_flag as page1,
c.page_flag as page2,
d.page_flag as page3,
e.page_flag as page4,
f.page_flag as page5
from
(select distinct uniqueid,
(case when uniqueid is not null then 1 end) as page_hit1,
(case when uniqueid is not null then 2 end) as page_hit2,
(case when uniqueid is not null then 3 end) as page_hit3,
(case when uniqueid is not null then 4 end) as page_hit4,
(case when uniqueid is not null then 5 end) as page_hit5
from `mytable`) a
LEFT JOIN (
SELECT *
from `mytable`) b on a.uniqueid = b.uniqueid
and a.page_hit1 = b.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) c on a.uniqueid = c.uniqueid
and a.page_hit2 = c.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) d on a.uniqueid = d.uniqueid
and a.page_hit3 = d.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) e on a.uniqueid = e.uniqueid
and a.page_hit4 = e.order_of_pages
LEFT JOIN (
SELECT *
from `mytable`) f on a.uniqueid = f.uniqueid
and a.page_hit5 = f.order_of_pages
Error: Query exceeded resource limits for tier 1. Tier 13 or higher required.
I've looked at using the ARRAY function as well, but I've never used it before and I'm not sure if it's just for transposing the other way around. Any advice would be grand.
Thank you
For BigQuery Standard SQL, you can do this with conditional aggregation in a single pass over the table, which avoids the chain of self-joins that exceeded the resource tier:
#standardSQL
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid
You can play/test with the dummy data from your question:
#standardSQL
WITH `mytable` AS (
SELECT 'A' AS uniqueid, 'Collection' AS page_flag, 1 AS order_of_pages UNION ALL
SELECT 'A', 'Product', 2 UNION ALL
SELECT 'A', 'Product', 3 UNION ALL
SELECT 'A', 'Login', 4 UNION ALL
SELECT 'A', 'Delivery', 5 UNION ALL
SELECT 'B', 'Clearance', 1 UNION ALL
SELECT 'B', 'Search', 2 UNION ALL
SELECT 'B', 'Product', 3 UNION ALL
SELECT 'C', 'Search', 1 UNION ALL
SELECT 'C', 'Collection', 2 UNION ALL
SELECT 'C', 'Product', 3
)
SELECT
uniqueid,
MAX(IF(order_of_pages = 1, page_flag, NULL)) AS p1,
MAX(IF(order_of_pages = 2, page_flag, NULL)) AS p2,
MAX(IF(order_of_pages = 3, page_flag, NULL)) AS p3,
MAX(IF(order_of_pages = 4, page_flag, NULL)) AS p4,
MAX(IF(order_of_pages = 5, page_flag, NULL)) AS p5
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid
The result is:
uniqueid p1 p2 p3 p4 p5
A Collection Product Product Login Delivery
B Clearance Search Product null null
C Search Collection Product null null
Depending on your needs, you can also consider the approach below (not a pivot, though), which concatenates each journey into one ordered string:
#standardSQL
SELECT uniqueid,
STRING_AGG(page_flag, '>' ORDER BY order_of_pages) AS journey
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid
If run with the same dummy data as above, the result is:
uniqueid journey
A Collection>Product>Product>Login>Delivery
B Clearance>Search>Product
C Search>Collection>Product
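If the number of steps per journey is not fixed at 5, one option is to generate the pivot's column list dynamically with BigQuery scripting. This is a sketch that assumes EXECUTE IMMEDIATE is available in your project; it goes beyond the answer above:
#standardSQL
DECLARE cols STRING;
-- build one "MAX(IF(...)) AS pN" expression per distinct step number
SET cols = (
  SELECT STRING_AGG(
           FORMAT('MAX(IF(order_of_pages = %d, page_flag, NULL)) AS p%d', n, n),
           ', ' ORDER BY n)
  FROM (SELECT DISTINCT order_of_pages AS n FROM `mytable`)
);
-- run the generated pivot
EXECUTE IMMEDIATE FORMAT("""
SELECT uniqueid, %s
FROM `mytable`
GROUP BY uniqueid
ORDER BY uniqueid
""", cols);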

How to delete mirrored values in 2 columns

I have a table with 2 columns that contain some rows with unique id pairs and some rows with pairs that are a mirrored duplicate of another row. I want to remove one of the duplicates.
id1 | id2
-----+-----
1 | 9
2 | 10
5 | 4
6 | 16
7 | 11
8 | 12
9 | 1
10 | 2
12 | 14
14 | 8
16 | 6
So 1 | 9 mirrors 9 | 1. I want to keep 1 | 9 but delete 9 | 1.
I've tried:
SELECT
id1,
id2
FROM
(
SELECT
id1, id2, ROW_NUMBER() OVER (PARTITION BY id1, id2 ORDER BY id1) AS occu
FROM
table
) t
WHERE
t.occu = 1;
But it has no effect.
I'm pretty new to this so any help you can give would be greatly appreciated.
====UPDATE====
I accepted the answer from @Mureinik and adapted it to work as a filter in a subquery:
SELECT
*
FROM
table
WHERE
id1 NOT IN (SELECT
id1
FROM
table a
WHERE
id1 > id2
AND
EXISTS (SELECT *
FROM table b
WHERE a.id1 = b.id2 AND a.id2 = b.id1));
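One caveat with this adaptation: filtering on id1 alone will also drop any non-mirrored row that happens to share its id1 with a mirrored pair. A pair-based filter avoids that (a sketch, using the answer's mytable name):
SELECT *
FROM mytable a
WHERE NOT (a.id1 > a.id2
           AND EXISTS (SELECT *
                       FROM mytable b
                       WHERE a.id1 = b.id2
                         AND a.id2 = b.id1));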
Your ROW_NUMBER attempt has no effect because it partitions on the ordered pair (id1, id2), so 1 | 9 and 9 | 1 fall into different partitions and every row gets occu = 1. Instead, you could arbitrarily decide to keep the rows where id1 < id2, and use an exists clause to find their counterparts:
DELETE FROM mytable a
WHERE id1 > id2 AND
EXISTS (SELECT *
FROM mytable b
WHERE a.id1 = b.id2 AND a.id2 = b.id1)
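If you'd rather double-check which rows will be removed before running the destructive DELETE, the same predicate works as a plain SELECT (a sketch, assuming the table is named mytable as above):
SELECT a.*
FROM mytable a
WHERE a.id1 > a.id2
  AND EXISTS (SELECT *
              FROM mytable b
              WHERE a.id1 = b.id2
                AND a.id2 = b.id1);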

How to list most frequent text values within a range?

I'm an intermediate Excel user trying to solve an issue that feels a little over my head. Basically, I'm working with a spreadsheet that contains a number of orders associated with customer account #s, each with up to 5 metadata "tags" attached. I want to use the customer account # to pull the 5 most commonly occurring metadata tags, in order.
Here is a mock up of the first set of data
Account Number Order Number Metadata
5043 1 A B C D
4350 2 B D
4350 3 B C
5043 4 A D
5043 5 C D
1204 6 A B
5043 7 A D
1204 8 D B
4350 9 B D
5043 10 A C D
and the end result I'm trying to create
Account Number Most Common Tag 2nd 3rd 4th 5th
5043 A C B N/A
4350 B D C N/A N/A
1204 B A C N/A N/A
I was trying to work with the formula suggested here:
=ARRAYFORMULA(INDEX(A1:A7,MATCH(MAX(COUNTIF(A1:A7,A1:A7)),COUNTIF(A1:A7,A1:A7),0)))
But I don't know how to a) use the customer account # as a precondition for counting the text values within the range, b) circumvent the fact that the MATCH formula only wants to work with a single column of data, and c) read the 2nd, 3rd, 4th, and 5th most common values from this range.
The way I'm formatting this data isn't set in stone. I suspect the way I'm organizing this information is holding me back from simpler solutions, so any suggestions on re-thinking my organization would be just as helpful as insights on how to create a formula to do this.
Implementing this kind of frequency analysis using built-in functions is likely to be a frustrating exercise. Since you are working with Google Sheets, take advantage of the custom functions, written in JavaScript and placed into a script bound to the sheet (Tools > Script Editor).
The function I wrote for this purpose is below. Entering something like =tagfrequency(A2:G100) in the sheet will produce desired output:
+----------------+-----------------+-----+-----+-----+-----+
| Account Number | Most Common Tag | 2nd | 3rd | 4th | 5th |
| 5043 | D | A | C | B | N/A |
| 4350 | B | D | C | N/A | N/A |
| 1204 | B | A | D | N/A | N/A |
+----------------+-----------------+-----+-----+-----+-----+
Custom function
function tagFrequency(arr) {
var dict = {}; // the object in which to store tag counts
for (var i = 0; i < arr.length; i++) {
var acct = arr[i][0];
if (acct == '') {
continue; // ignore empty rows
}
if (!dict[acct]) {
dict[acct] = {}; // new account number
}
for (var j = 2; j < arr[i].length; j++) {
var tag = arr[i][j];
if (tag) {
if (!dict[acct][tag]) {
dict[acct][tag] = 0; // new tag
}
dict[acct][tag]++; // increment tag count
}
}
}
// end of recording, begin sorting and output
var output = [['Account Number', 'Most Common Tag', '2nd', '3rd', '4th', '5th']];
for (acct in dict) {
var tags = dict[acct];
var row = [acct].concat(Object.keys(tags).sort(function (a,b) {
return (tags[a] < tags[b] ? 1 : (tags[a] > tags[b] ? -1 : (a > b ? 1 : -1)));
})); // sorting by tag count, then tag name
while (row.length < 6) {
row.push('N/A'); // add N/A if needed
}
output.push(row); // add row to output
}
return output;
}
You could also get this report:
Account Number Tag count
1204 B 2
1204 A 1
1204 D 1
4350 B 3
4350 D 2
4350 C 1
5043 D 5
5043 A 4
5043 C 3
5043 B 1
with the formula:
=QUERY(
{TRANSPOSE(SPLIT(JOIN("",ArrayFormula(REPT(FILTER(A2:A,A2:A<>"")&",",5))),",")),
TRANSPOSE(SPLIT(ArrayFormula(CONCATENATE(FILTER(C2:G,A2:A<>"")&" ,")),",")),
TRANSPOSE(SPLIT(rept("1,",counta(A2:A)*5),","))
},
"select Col1, Col2, Count(Col3) where Col2 <>' ' group by Col1, Col2
order by Col1, Count(Col3) desc label Col1 'Account Number', Col2 'Tag'")
The formula will count the number of occurrences of any tag.

How to use average function in neo4j with collection

I want to calculate the covariance of two vectors given as collections:
A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
Cov(A,B) = Sigma[(a_i - AVG_a) * (b_i - AVG_b)] / (n - 1)
My problems with the covariance computation are:
1) I cannot have a nested aggregate function, as in:
SUM((ai - avg(a)) * (bi - avg(b)))
2) Alternatively, how can I process two collections with one REDUCE, such as:
REDUCE(x = 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai - avg(a)) * (bi - avg(b)))
3) If it is not possible to process two collections in one REDUCE, how can their values be related to calculate the covariance when they are computed separately?
REDUCE(x = 0.0, ai IN COLLECT(a) | x + (ai - avg(a)))
REDUCE(y = 0.0, bi IN COLLECT(b) | y + (bi - avg(b)))
In other words, can I write a nested REDUCE?
4) Is there any way to do this with UNWIND or EXTRACT?
Thanks in advance for any help.
cybersam's answer is totally fine but if you want to avoid the n^2 Cartesian product that results from the double UNWIND you can do this instead:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
Edit:
Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
| aa | bb
---+----+----
1 | 1 | 4
2 | 1 | 5
3 | 1 | 6
4 | 2 | 4
5 | 2 | 5
6 | 2 | 6
7 | 3 | 4
8 | 3 | 5
9 | 3 | 6
Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 2.0 | 5.0
Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
Let's say you have two length-1000 vectors. Calculating the average of each with a double UNWIND:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 500.0 | 1500.0
714 ms
Is significantly slower than using REDUCE:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
| e_a | e_b
---+-------+--------
1 | 500.0 | 1500.0
4 ms
To bring it all together, I'll compare the two queries in full on length-1000 vectors:
> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
| covariance
---+------------
1 | 83583.5
9105 ms
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
| cov
---+---------
1 | 83583.5
33 ms
[EDITED]
This should calculate the covariance (according to your formula), given your sample inputs:
WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
This approach is OK when n is small, as is the case with the original sample data.
However, as @NicoleWhite and @jjaderberg point out, when n is not small, this approach will be inefficient. The answer by @NicoleWhite is an elegant general solution.
How do you arrive at collections A and B? The avg function is an aggregating function and cannot be used in the REDUCE context, nor can it be applied to collections. You should calculate your average before you get to that point, but exactly how to do that best depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that's the point when you could use avg. For example:
WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg
and for both collections
WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg,[5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg
Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
Thank you so much, all. However, I wonder which one is most efficient:
1) Nested UNWIND and RANGE inside REDUCE -> @cybersam
2) Nested REDUCE -> @NicoleWhite
3) Nested WITH (resetting the query with WITH) -> @jjaderberg
But the important issue is:
Why is there a difference between your computations and the actual covariance?
I mean, your covariance equals 1.6666666666666667,
but the real-world covariance equals 1.25.
Please check: https://www.easycalculation.com/statistics/covariance.php
Vector X: [1, 2, 3, 4]
Vector Y: [5, 6, 7, 8]
I think the difference is because some computations do not use (n - 1) as the divisor and instead just use n. Growing the divisor from n - 1 to n diminishes the result from 1.67 to 1.25.
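To make the difference concrete with these vectors (AVG_a = 2.5, AVG_b = 6.5):
Sigma[(a_i - 2.5) * (b_i - 6.5)] = 2.25 + 0.25 + 0.25 + 2.25 = 5
sample covariance (divide by n - 1) = 5 / 3 = 1.6666666666666667
population covariance (divide by n) = 5 / 4 = 1.25
Both are correct; they just implement different definitions.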
