I have Users and Checkpoints tables; each User can make multiple Checkpoints per day.
I want to aggregate how many Checkpoints were taken each day over the past 6 months, measured from each user's starting point, to create a graph showing the average user's Checkpoints within their first x months.
For example:
If user1 started on January 1st, user2 started on March 15th, and user3 started on July 6th, those would each be considered day 1; I would want the data from each of their day 1 even though they occur at different periods of time.
This is the current query I came up with, but unfortunately it returns data based on a fixed time range for all of the users.
SELECT dates.date AS date,
checkpoints_count,
checkpoints_users
FROM (SELECT DATE_TRUNC('DAY', ('2000-01-01'::DATE - offs))::DATE AS date
-- 180 = 6 months in days
FROM GENERATE_SERIES(0, 180) AS offs) AS dates
LEFT OUTER JOIN (
SELECT
checkpoints_date::DATE AS date,
COUNT(id) AS checkpoints_count,
COUNT(user_id) AS checkpoints_users
FROM checkpoints
WHERE user_id in (1, 2, 3)
AND checkpoints_date::DATE
BETWEEN '2000-01-01'::DATE AND '2000-06-01'::DATE
GROUP BY checkpoints_date::DATE
) AS ck
ON dates.date = ck.date
ORDER BY dates.date;
EDIT
Here is a working SQL example (if I understand what you are looking for; your SQL seems really complicated for what you are asking, but I'm not a SQL expert)...
SELECT t1.*
FROM checkpoints AS t1
WHERE t1.user_id IN (1, 2, 3)
AND t1.id = (SELECT t2.id
FROM checkpoints AS t2
WHERE t2.user_id = t1.user_id
ORDER BY t2.check_date ASC
LIMIT 1)
SQL FIDDLE here
Since this is tagged Ruby on Rails, I'll give you a Rails answer.
If you know your user IDs (or use some other query to get them), you have:
user_ids = [1, 2, 3]
first_checkpoints = []
user_ids.each do |id|
first_checkpoints << Checkpoint.where(user_id: id).order(:date).first
end
# returns an array of the first checkpoint of each user in the list
This assumes a column in the checkpoints table called date. You didn't give a table structure for the two tables, so this answer is a bit general. There might be a pure ActiveRecord answer, but this will get you what you want.
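If the end goal is the day-1 alignment described in the question, one way to build on the first-checkpoint idea is to derive each user's start date and group by the day offset instead of the calendar date. This is only a minimal Postgres sketch, assuming the checkpoints table has id, user_id and checkpoints_date columns as in the question's query:
SELECT (c.checkpoints_date::DATE - s.start_date) + 1 AS day_number,
       COUNT(c.id)               AS checkpoints_count,
       COUNT(DISTINCT c.user_id) AS checkpoints_users
FROM checkpoints c
JOIN (
    SELECT user_id, MIN(checkpoints_date::DATE) AS start_date
    FROM checkpoints
    GROUP BY user_id
) s ON s.user_id = c.user_id
WHERE c.user_id IN (1, 2, 3)
  -- keep roughly the first 6 months of each user's history
  AND c.checkpoints_date::DATE < s.start_date + 180
GROUP BY 1
ORDER BY 1;
Day numbers with no checkpoints at all simply won't appear in the output; a LEFT JOIN from GENERATE_SERIES(1, 180) can fill those gaps with NULLs if needed.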
Background
I have 2 data tables.
For each row in tableA, I want to find the rows in tableB with the closest dates and join those values onto the row from tableA.
Example tables:
tableA:
p_id  category  l_date
1     catA      2005-01-05
1     catB      2005-06-10
2     catC      2000-01-10
tableB:
p_id  e_id  e_date
1     22    2005-01-01
1     23    2005-01-06
1     24    2005-01-06
1     28    2005-01-10
2     29    2010-08-10
desired result:
p_id  category  l_date      e_id  e_date
1     catA      2005-01-05  23    2005-01-06
1     catA      2005-01-05  24    2005-01-06
1     catB      2005-06-10  28    2005-01-10
2     catC      2000-01-10  29    2010-08-10
Tried
This query does not work, but I think this is the direction I should be going.
select a.p_id, a.category, a.l_date, c.e_id, c.e_date from tableA a
left join lateral
(
select top 1 p_id, e_id, e_date from tableB b
where a.p_id = b.p_id
order by abs(datediff(days, a.l_date, b.e_date))
) c on True;
TableA and tableB are massive: 17M and 150M rows respectively.
Does this sound like the correct approach?
I'm using a Redshift cluster, which is based on Postgres 8.x.
Correlated subquery approaches and full cross join approaches all perform the task of comparing every row in one table with every row in the other (in one manner or another). Comparing (joining) all these rows gets prohibitive when the tables get large. In these cases different approaches are needed.
Brute forcing won't be fast (if it even completes) so we need to be a bit more efficient in going about this. I tell clients to think about how they would do this query (by hand) if I gave them stacks of index cards. A person values their time so they don't go about this by making all possible combinations, they would come up with a more efficient way that they can complete quickly and get back to their lives. In cases like the one you are describing you need to find the more efficient approach. I'd be happy to talk to you more about building these types of queries.
Taking your data (and sprucing it up a bit for some more interesting cases) I created an example of how you can do this. (Yes, you could cross join the small tables and do this with simpler SQL but that won't scale.)
Data setup:
create table tableA (p_id int, category varchar(64), l_date date);
insert into tableA values
(1,'catA','2005-01-05'),
(1,'catB','2005-06-10'),
(2,'catC','2000-01-10');
create table tableB (p_id int, e_id int, e_date date);
insert into tableB values
(1,22,'2005-01-01'),
(1,23,'2005-01-06'),
(1,24,'2005-06-01'),
(1,28,'2005-06-15'),
(2,29,'2010-08-10');
The query looks like:
with combined as
(
select
*,
coalesce(max(l_date) OVER (partition by p_id order by
dt rows between unbounded preceding and 1 preceding), '1970-01-01'::date) cb,
coalesce(min(l_date) OVER (partition by p_id order by
dt desc rows between unbounded preceding and 1 preceding), '2100-01-01'::date) ca
from
(
select
p_id,
category,
l_date,
NULL as e_id,
NULL as e_date,
l_date dt
from
tableA
union all
select
p_id,
NULL as category,
NULL as l_date,
e_id,
e_date,
e_date dt
from
tableB
) c
)
,
closest as
(
select
p_id,
e_id,
e_date,
cb,
ca,
case
when
coalesce(e_date - cb, 0) > (ca - e_date)
then ca
else cb
end closest
from
combined
where
e_date is not NULL
)
select
c.p_id,
a.category,
a.l_date,
c.e_id,
c.e_date
from
closest c
left join tableA a
on c.closest = a.l_date and c.p_id = a.p_id
order by
c.p_id,
c.e_id ;
While this can look like a lot, it isn't that complex. The first CTE finds the closest l_date earlier than e_date (cb) and the closest l_date later than e_date (ca). It does this on a UNIONed set of data to allow for windowing. The second CTE just determines which is closer, ca or cb, and produces this as "closest". It also strips out the tableA rows that were added by the UNION (no longer needed). Lastly, this "closest" date provides the join information needed to build the final result.
Now, this query doesn't account for many possible real-world data issues, and I'm also making some assumptions about your data based on the test data (like no 2 rows in tableA having the same l_date and p_id), so take this as a starting point.
And a last word on performance: while window functions are not cheap and will do more work as your data tables increase in size, they are orders of magnitude more performant than cross-joining massive tables. What you are looking to do is complex, so it will take some time, but this is the fastest way I have found to perform these complex operations that would otherwise be a massive looping problem.
I have data in a table (named TESTING) in dashDB2 on IBM Bluemix (Db2 Warehouse on Cloud) which looks like this:
ID TIMESTAMP NAME VALUE
abc 2017-12-21 19:55:38.762 test1 123
abc 2017-12-21 19:55:42.762 test2 456
abc 2017-12-21 19:57:38.762 test1 789
abc 2017-12-21 19:58:38.762 test3 345
def 2017-12-21 19:59:38.762 test1 678
I am looking for a query that:
1. samples the data (for each NAME) to a given time format (e.g. to a 1-minute based timestamp)
2. averages VALUEs in the same time range (in the same minute); empty times should be NULL
For 1. and 2. I have something like this (working for only one NAME):
with dummy(temporaer) as (
select TIMESTAMP('2017-12-01') from SYSIBM.SYSDUMMY1
union all
select temporaer + 1 MINUTES from dummy where temporaer < TIMESTAMP('2018-02-01')
)
select temporaer, avg(VALUE) as test1 from dummy
LEFT OUTER JOIN TESTING ON temporaer=date_trunc('minute', TIMESTAMP) and ID='abc' and NAME='test1'
group by temporaer
ORDER BY temporaer ASC;
3. joins all the different NAMEs column-wise into a matrix, like:
TIMESTAMP test1 test2 test3
2017-12-01 00:00:00 null null null
...
2017-12-21 19:55:00 123 456 null
2017-12-21 19:56:00 null null null
2017-12-21 19:57:00 789 null null
2017-12-21 19:58:00 678 null 345
...
2018-01-31 23:59:00 null null null
4. returns the query result exported as a CSV, or gives it back as a CSV string
Does anybody know how this could be done in one query, or in a simple and fast way? Or is it necessary to save the data in another table format? Can you give me a hint?
Here is a code snippet that does the job, but takes a very long time:
WITH
-- get all distinct names in table:
header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$') AND DATE(TIMESTAMP)>='2017-12-19' AND DATE(TIMESTAMP)<'2017-12-24'),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
dummie(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID='7b9bbe44d45d8f2ac324849a4951da54' AND REGEXP_LIKE(trim(VALUE),'^\d+(\.\d*)?$')),
-- generate a range of times from date to date in defined steps:
dummy(time, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM dummy
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- add each name (from header) to each time/row (in dummy):
dumpy(time, names) AS (SELECT Dummy.time, Header.names
FROM Dummy
LEFT OUTER JOIN Header
ON Dummy.time IS NOT NULL),
-- averages values by name and timeinterval and sorts result to dummy:
dummj(time, names, avgvalues) AS (SELECT Dummy.time, Dummie.names, AVG(Dummie.values)
FROM Dummy
LEFT OUTER JOIN Dummie
ON Dummie.time = Dummy.time
GROUP BY Dummie.names, Dummy.time),
-- joins the averages (by time, name) values to the times and names in dumpy (on empty value use -9999):
testo(time, names, avgvalues) AS (SELECT Dumpy.time, Dumpy.names, COALESCE(Dummj.avgvalues,-9999)
FROM Dumpy
LEFT OUTER JOIN Dummj
ON Dummj.time = Dumpy.time AND Dummj.names = Dumpy.names),
-- converts the high amount of rows to less rows with delimited strings:
test(time, names, avgvalues) AS (SELECT time, LISTAGG(names,';') WITHIN GROUP(ORDER BY names), LISTAGG(avgvalues,';') WITHIN GROUP(ORDER BY names)
FROM Testo
GROUP BY time)
SELECT * FROM test ORDER BY time ASC, names ASC;
The performance problem is in the "testo" subquery. Does anybody have an idea what the problem is here, or know how to improve the query?
Well, one problem I see is that you keep using functions on columns, but that shouldn't be too big a drain if id is reasonably unique. If this query is very common, it may also be worth it to permanently build and index the range table. Hmm, you probably need several indices (starting with FieldTest.id), but you might also try this version:
-- let's name things properly, too, to keep them straight.
WITH
-- generate a range of times from date to date in defined steps:
Range (rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM Range
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FieldTest
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
-- just make the white space check part of the regex
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data (rangeStart, name, averaged) AS (SELECT Range.rangeStart, Header.names, COALESCE(AVG(FieldTest.value), -9999)
FROM Range
CROSS JOIN Header
LEFT JOIN FieldTest
ON FieldTest.id = '7b9bbe44d45d8f2ac324849a4951da54'
AND FieldTest.name = Header.names
AND FieldTest.timestamp >= Range.rangeStart
AND FieldTest.timestamp < Range.rangeEnd
GROUP BY Range.rangeStart, Header.names)
-- I can't recall if DB2 allows using the new column name this way, you may need to wrap this again
SELECT rangeStart,
-- converts the high amount of rows to fewer rows with delimited strings:
LISTAGG(name, ';') WITHIN GROUP(ORDER BY name) AS names,
LISTAGG(averaged, ';') WITHIN GROUP(ORDER BY name) AS avgvalues
FROM Data
GROUP BY rangeStart
ORDER BY rangeStart, names
(not tested)
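If the end goal is actual columns per NAME (the matrix in point 3) rather than a delimited string, conditional aggregation is another option. This is only a rough sketch against the TESTING table from the question, assuming the NAMEs are known up front and VALUE is numeric (or already filtered and cast):
SELECT DATE_TRUNC('minute', TIMESTAMP) AS minute_ts,
       AVG(CASE WHEN NAME = 'test1' THEN VALUE END) AS test1,
       AVG(CASE WHEN NAME = 'test2' THEN VALUE END) AS test2,
       AVG(CASE WHEN NAME = 'test3' THEN VALUE END) AS test3
FROM TESTING
WHERE ID = 'abc'
GROUP BY DATE_TRUNC('minute', TIMESTAMP)
ORDER BY minute_ts;
This only returns minutes that actually have data; to also get the empty minutes as NULL rows it would still need a LEFT JOIN onto the recursive minute-range CTE, and the NAME list has to be hard-coded because plain SQL has no dynamic pivot here.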
The CROSS JOIN was definitely a nice hint. However, I was not able to implement the LEFT JOIN exactly as you suggested, so I found a workaround which, I am sure, still leaves room for improvement, but at the moment it is acceptable for me (a time saving of about a factor of 30 compared to my first query). Here is the actual code:
WITH
-- generate a range of times from date to date in defined steps:
TimeRange(rangeStart, rangeEnd) AS (SELECT a, a + 1 MINUTE
FROM (VALUES(TIMESTAMP('2017-12-19'))) D(a)
UNION ALL
SELECT rangeEnd, rangeEnd + 1 MINUTE
FROM TimeRange
WHERE rangeEnd < TIMESTAMP('2017-12-24')),
-- get all distinct names in table:
Header(names) AS (SELECT DISTINCT name
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')
AND timestamp >= TIMESTAMP('2017-12-19')
AND timestamp < TIMESTAMP('2017-12-24')),
-- select data (names, values without stringvalues) from table dedicated by timestamp to bigger timeinterval (here minutes):
rawData(time, names, values) AS (SELECT date_trunc('minute', TIMESTAMP), NAME, VALUE
FROM FIELDTEST
WHERE ID = '7b9bbe44d45d8f2ac324849a4951da54'
AND REGEXP_LIKE(VALUE, '^\s*\d+(\.\d*)?\s*$')),
-- I'm assuming the (id, name) tuple is unique, which means we don't need to repeat the regex later
Data(rangeStart, name, averaged) AS (SELECT TimeRange.rangeStart, Header.names, COALESCE(AVG(rawData.values), -9999)
FROM TimeRange
CROSS JOIN Header
LEFT JOIN rawData
ON rawData.names = Header.names
AND rawData.time = TimeRange.rangeStart
GROUP BY TimeRange.rangeStart, Header.names),
test(time, names, avgvalues) AS (SELECT Data.rangeStart,
LISTAGG(Data.name,';') WITHIN GROUP(ORDER BY name),
LISTAGG(Data.averaged,';') WITHIN GROUP(ORDER BY name)
FROM Data
GROUP BY Data.rangeStart)
-- build my own delimited export-string:
SELECT CONCAT(CONCAT(SUBSTR(REPLACE(time,'.',':'),1,19),';'), REPLACE(CAST(avgvalues AS VARCHAR(3980)),'-9999',''))
FROM test
UNION ALL
SELECT CONCAT(CAST('TIME;' AS VARCHAR(5)), CAST(LISTAGG(names,';') WITHIN GROUP(ORDER BY names) AS VARCHAR(3980)))
FROM Header;
I have two tables in hive:
Table1: uid, txid, amt, vendor
Table2: uid, txid
Now I need to join the tables on txid, which basically confirms that a transaction was finally recorded. There will be some transactions which are present only in Table1 and not in Table2.
I need to find the average transaction match rate per user (uid) per vendor. Then I need to find the average of these averages by adding up all the averages and dividing by the number of unique users per vendor.
Let's say I have the data:
Table1:
u1,120,44,vend1
u1,199,33,vend1
u1,100,23,vend1
u1,101,24,vend1
u2,200,34,vend1
u2,202,32,vend2
Table2:
u1,100
u1,101
u2,200
u2,202
Example for vendor vend1:
u1 -> avg transaction find rate = 2 (matches found in both tables, Table1 and Table2) / 4 (total occurrences in Table1) = 0.5
u2 -> avg transaction find rate = 1/1 = 1
Avg of avgs = (0.5 + 1) (sum of avgs) / 2 (total unique users) = 0.75
For vend2 there is only u2 with a single matched transaction, so the result is 1.
Required output:
vend1,0.75
vend2,1
I can't seem to get both the count of matches and the count of total occurrences in Table1 in one Hive query per user per vendor. I have gotten as far as this query and can't figure out how to change it further.
SELECT A.vendor, A.uid, COUNT(*) AS totalmatchesperuser FROM Table1 A JOIN Table2 B ON A.uid = B.uid AND B.txid = A.txid GROUP BY A.vendor, A.uid
Any help would be great.
I think you are running into trouble with your JOIN. When you JOIN by txid and uid, you are losing the total number of uids per group. If I were you, I would assign a column of 1s to Table2, name the column something like success or transaction, and do a LEFT OUTER JOIN. Then in your new table you will have a column with the number 1 in it if there was a completed transaction and NULL otherwise. You can then do a case statement to convert these NULLs to 0.
Query:
select vendor
,(SUM(avg_uid) / COUNT(uid)) as avg_of_avgs
from (
select vendor
,uid
,AVG(complete) as avg_uid
from (
select uid
,txid
,amt
,vendor
,case when success is null then 0
else success
end as complete
from (
select A.*
,B.success
from table1 as A
LEFT OUTER JOIN table2 as B
ON B.txid = A.txid
) x
) y
group by vendor, uid
) z
group by vendor
Output:
vend1 0.75
vend2 1.0
B.success in line 17 is the column of 1s that I put into table2 before the JOIN. If you are curious about case statements in Hive, you can find them here
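As a side note, the case statement that converts NULLs to 0 can also be written with COALESCE, which is just a shorter equivalent of the same expression:
-- equivalent to: case when success is null then 0 else success end
COALESCE(success, 0) AS complete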
Amazing and precise answer by GoBrewers14!! Thank you so much. I was looking at it from the wrong perspective.
I made small changes to the query to finally get things done.
I didn't need to add a "success" column to Table2. I checked B.txid in the above query instead of B.success; B.txid will be NULL if a match is not found and will hold some value if a match is found. That checks the success and failure conditions by itself without adding a new column. Then I set NULL to 0 and NOT NULL to 1 in the part above it. I also changed some variable names, as Hive was finding them ambiguous.
The final query looks like :
select vendr
,(SUM(avg_uid) / COUNT(usrid)) as avg_of_avgs
from (
select vendr
,usrid
,AVG(complete) as avg_uid
from (
select usrid
,txnid
,amnt
,vendr
,case when success is null then 0
else 1
end as complete
from (
select A.uid as usrid,A.vendor as vendr,A.amt as amnt,A.txid as txnid
,B.txid as success
from Table1 as A
LEFT OUTER JOIN Table2 as B
ON B.txid = A.txid
) x
) y
group by vendr, usrid
) z
group by vendr;
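For reference, the same average-of-averages can also be computed a bit more compactly by averaging a 0/1 match flag per user and then averaging those per-user rates per vendor. This is just a sketch using the Table1/Table2 column names from the question, and it joins on both uid and txid to stay safe in case a txid could ever repeat across users:
SELECT vendor,
       AVG(match_rate) AS avg_of_avgs
FROM (
    SELECT a.vendor,
           a.uid,
           AVG(CASE WHEN b.txid IS NULL THEN 0 ELSE 1 END) AS match_rate
    FROM Table1 a
    LEFT OUTER JOIN Table2 b
      ON b.uid = a.uid AND b.txid = a.txid
    GROUP BY a.vendor, a.uid
) per_user
GROUP BY vendor;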
I have a nightly job that runs and computes some data in hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is an easy way to get the incremental changes between today and yesterday.
I was thinking about doing a left outer join, but I'm not sure what that looks like since it's the same table.
This is what it might look like when there are two different tables:
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2013-10-31' ) WHERE a.rank!=b.rank
But on the same table it is
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2013-10-31' ) WHERE a.rank!=a.rank
Suggestions?
This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
Efficiently, this would span only one MapReduce job.
So I figured it out... Using Subqueries and Joins
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff.
You can take the diff and figure out what you need to ADD/UPDATE/DELETE (see the sketch after this list):
UPDATE if a.rank != null and b.rank != null, i.e. the rank changed
DELETE if a.rank = null and b.rank != null, i.e. the user is no longer ranked
ADD if a.rank != null and b.rank = null, i.e. this is a new user
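A hedged sketch of that classification in one Hive query, reusing the FULL OUTER JOIN and the placeholder table name from above; presence is checked through the id columns, which come back NULL on whichever side the row is missing from:
select
  case
    when b.id is null then 'ADD'     -- only in today's partition: new user
    when a.id is null then 'DELETE'  -- only in yesterday's partition: no longer ranked
    else 'UPDATE'                    -- in both partitions, but the rank changed
  end as action,
  coalesce(a.id, b.id) as id,
  b.rank as old_rank,
  a.rank as new_rank
from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null;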