I'm working on a project where I have to make real-time recommendations based on filters. I decided to take a look at graph databases, started playing with Neo4j, and compared its performance with MySQL.
The row counts are roughly:
"broadcast": 159844,
"format": 5,
"genre": 10,
"program": 60495
The MySQL query looks like:
select f.id, sum(weight) as total
from
(
select program.id, 15 as weight
from broadcast
inner join program on broadcast.programId = program.id
where broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 10 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
where genre.id in (13) and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 5 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
inner join format on genre.formatId = format.id
where format.id = 6 and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
) f
group by f.id
order by total desc, id desc
limit 0, 50
On my local machine the query runs in about 300ms. That may be acceptable, but under 100ms would be better for real-time processing.
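For reference, a query like this would normally lean on a few indexes. A minimal sketch (the exact column choices are my assumption, not from the original setup):
create index idx_broadcast_started on broadcast (startedAt, programId);
create index idx_program_genre on program (genreId);
create index idx_genre_format on genre (formatId);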
With some help, I also wrote a TinkerPop3 query:
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.in('programBroadcast')
.dedup()
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.order(local).by(valueDecr)
.limit(local, 50)
This query runs in about 700ms!
===== EDIT =====
I wanted to see where the time was going, so I profiled the query and got:
Step Count Traversers Time (ms) % Dur
=============================================================================================================
Neo4jGraphStep([],vertex) 220513 220513 14135.788 68.52
HasStep([~label.eq(broadcast)]) 159844 159844 391.087 1.90
VertexStep(IN,[programBroadcast],vertex) 159825 159825 267.202 1.30
DedupGlobalStep 60495 60495 95.848 0.46
UnionStep([[LambdaFilterStep(lambda)#[p, w], Pr... 63247 63247 2008.553 9.74
LambdaFilterStep(lambda)#[p, w] 60495 60495 194.406
SelectStep([p, w],[value(id), [ConstantStep(1... 60495 60495 487.158
ConstantStep(15) 60495 60495 24.214
EndStep 60495 60495 110.575
TraversalFilterStep([VertexStep(OUT,[programG... 2070 2070 410.689
VertexStep(OUT,[programGenre],vertex) 22540 22540 191.158
HasStep([id.eq(6)]) 0 0 140.934
SelectStep([p, w],[value(id), [ConstantStep(1... 2070 2070 52.203
ConstantStep(10) 2070 2070 0.654
EndStep 2070 2070 43.120
TraversalFilterStep([VertexStep(OUT,[programG... 682 682 443.347
VertexStep(OUT,[programGenre],vertex) 22540 22540 119.115
VertexStep(OUT,[genreFormat],vertex) 27510 27510 117.410
HasStep([id.eq(1)]) 0 0 133.517
SelectStep([p, w],[value(id), [ConstantStep(5... 682 682 43.247
ConstantStep(5) 682 682 0.217
EndStep 682 682 44.427
GroupStep([SelectOneStep(p), ProfileStep],[Sele... 1 1 3583.249 17.37
SelectOneStep(p) 63247 63247 26.836
SelectOneStep(w) 63247 63247 81.623
SumGlobalStep 60495 60495 3107.593
SelectOneStep(w) 0 0 0.000
SumGlobalStep 0 0 0.000
UnfoldStep 60495 60495 17.227 0.08
OrderGlobalStep([valueDecr]) 60495 60495 114.439 0.55
FoldStep 1 1 16.902 0.08
RangeLocalStep(0,10) 1 1 0.081 0.00
SideEffectCapStep([~metrics]) 1 1 0.215 0.00
This shows that 68% of the time is spent in the g.V() step, which has no index! So I looked for a way to give the traversal a single starting point:
graph.addVertex(label, 'day', 'id', 1)
graph.cypher("CREATE INDEX ON :day(id)")
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.map{it.get().addEdge('broadcastDay', g.V().hasLabel('day').has('id', 1).next())}
Now the query looks like:
g.V(14727)
.in('broadcastDay')
.in('programBroadcast')
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.unfold().order().by(valueDecr).fold()
.limit(local, 50)
The execution time is now 137ms!
===== END EDIT =====
Neo4j is slower than MySQL in my case...
So I decided to write the query in Cypher (thanks to this post) with this naive approach:
WITH [] AS res
MATCH (b:broadcast)-[:programBroadcast]-(p:program)
WHERE b.startedAt > timestamp() and b.startedAt < (timestamp() + 1000 * 60 * 60 * 24)
OPTIONAL MATCH (p)
WITH res, COLLECT({id: p.id, weight: 15}) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre{id:4})
WITH res, (CASE WHEN g IS NOT NULL THEN COLLECT({id: p.id, weight: 10}) ELSE [] END) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre)-[:genreFormat]-(f:format{id:4})
WITH res, (CASE WHEN f IS NOT NULL THEN COLLECT({id: p.id, weight: 5}) ELSE [] END) AS data
WITH res + data AS res
UNWIND res AS result
WITH result, result.id as id, SUM(result.weight) as weight
ORDER BY weight DESC
LIMIT 10
RETURN id, weight
It takes about 68614ms!!
I'm very disappointed with graph databases, but I don't understand why. I created indices on every property and set the Java heap to about 4GB, yet it lags far behind MySQL. Why? Are graph databases only for big data, where MySQL can't perform the joins?
Related
## First I reset the table
UPDATE tbl_npwp set
id_skpd=9999 , saved_from = 9 , update_time = '1999-01-01 00:00:00' ;
## Then I update
UPDATE tbl_npwp
join data_spm on tbl_npwp.id_npwp = data_spm.id_npwp
join tbl_unitskpd on tbl_unitskpd.id_unitskpd = data_spm.id_unitskpd
SET
tbl_npwp.id_skpd = tbl_unitskpd.id_skpd
,tbl_npwp.saved_from = IF(data_spm.id_kontrak > 10 , 1 , 2 )
,tbl_npwp.update_time = data_spm.time_created
WHERE data_spm.id_npwp > 10 ORDER BY data_spm.id_spm desc;
After executing:
tbl_npwp.id_skpd and tbl_npwp.update_time were filled with the tbl_unitskpd.id_skpd and data_spm.time_created of only the last record (record number 2 of the SELECT query result below).
I tried this query to prove that the result should have different update_time and id_skpd values per row:
SELECT a.id_npwp , b.id_spm , c.id_skpd , b.time_created from tbl_npwp a
join data_spm b on a.id_npwp = b.id_npwp
join tbl_unitskpd c on c.id_unitskpd = b.id_unitskpd
WHERE b.id_npwp > 10 ORDER BY b.id_spm desc;
Result :
id_npwp id_spm id_skpd time_created
1026 17407 3 2022-11-21 18:26:24
1495 17406 37 2022-11-17 23:59:10
1191 17405 17 2022-11-15 20:51:35
2027 17404 5 2022-11-15 20:40:22
2026 17403 5 2022-11-15 20:38:32
1878 17402 5 2022-11-15 20:37:15
...
...
Note: all time_created values are different.
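When several data_spm rows match one id_npwp, a multi-table UPDATE gives no guarantee about which row's values win (and in stock MySQL, ORDER BY is not even allowed in the multi-table UPDATE syntax), which would explain the single-record result above. A minimal sketch of a workaround, assuming the intent is to apply the latest id_spm per id_npwp (the `latest` derived table is my own naming):
UPDATE tbl_npwp
join (
    select id_npwp, max(id_spm) as id_spm  -- pin one deterministic row per id_npwp
    from data_spm
    where id_npwp > 10
    group by id_npwp
) latest on latest.id_npwp = tbl_npwp.id_npwp
join data_spm on data_spm.id_spm = latest.id_spm
join tbl_unitskpd on tbl_unitskpd.id_unitskpd = data_spm.id_unitskpd
SET
tbl_npwp.id_skpd = tbl_unitskpd.id_skpd
,tbl_npwp.saved_from = IF(data_spm.id_kontrak > 10 , 1 , 2 )
,tbl_npwp.update_time = data_spm.time_created;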
I have the rank (say 1 to 103) and the count of nodes for each rank from a Cypher query.
day.name rank count
12/14/2016 1 1
12/14/2016 2 2
12/14/2016 3 3
12/14/2016 4 10
...
12/14/2016 11 20000
12/14/2016 12 20500
...
12/14/2016 21 15000
12/14/2016 22 15000
I'd like to display summary data in columns, such as the total count for ranks 1 to 3, ranks 4 to 10, etc.
In this example, the desired result would be
'rank_1_to_3' 'rank_4_to_10' 'rank_11_to_20' 'rank_21_to_100'
---------------------------------------------------------------------------
6 10 40500 30000
Here's the Cypher:
match (d:Domain {name:'hotelguides.com'})
match (c:Temp {name:"top30k_rank_1_to_100"})
match (day:Day {name:'2016-12-22'})
match (day)-[]-(kua:KeywordURLAssociation)-[]-(d)
match (k:Keyword)-[]-(kua)-[]-(r:Rank)
with day, kua, r, k return day.name, r.name, count(k)
I think this query can get you started; I'll leave applying it to your particular domain up to you. Essentially, you can use filter to narrow a collection of results down to a subset, then use reduce to summarize each smaller collection. Something like this:
with [
{name:'12/14/2016', rank:1, count:1},
{name:'12/14/2016', rank:2, count:2},
{name:'12/14/2016', rank:3, count:3},
{name:'12/14/2016', rank:4, count:10},
{name:'12/14/2016', rank:11, count:20000},
{name:'12/14/2016', rank:12, count:20500},
{name:'12/14/2016', rank:21, count:15000},
{name:'12/14/2016', rank:22, count:15000}
] as days
return reduce(sum_c=0, d in filter(d in days where d.rank < 4) | sum_c + d.count) as rank_1_3
, reduce(sum_c=0, d in filter(d in days where d.rank > 3 and d.rank < 11) | sum_c + d.count) as rank_4_to_10
, reduce(sum_c=0, d in filter(d in days where d.rank > 10 and d.rank < 21) | sum_c + d.count) as rank_11_to_20
, reduce(sum_c=0, d in filter(d in days where d.rank > 20 and d.rank < 101) | sum_c + d.count) as rank_21_100
I have a table that contains feedback about a product. It has a feedback type (positive, negative), which is a text column, and the date the comment was made. I need the total count of positive and negative feedback for a particular time period. For example, if the date range is 30 days, I need the totals for each of the 4 weeks; if the date range is 6 months, I need the totals for each month. How do I group the counts based on date?
+------+------+----------+----------+---------------+
| Slno | User | Comments | type     | commenteddate |
+------+------+----------+----------+---------------+
| 1    | a    | aaaa     | positive | 22-jun-2016   |
| 2    | b    | bbb      | positive | 1-jun-2016    |
| 3    | c    | qqq      | negative | 2-jun-2016    |
| 4    | d    | ccc      | neutral  | 3-may-2016    |
| 5    | e    | www      | positive | 2-apr-2016    |
| 6    | f    | s        | negative | 11-nov-2015   |
+------+------+----------+----------+---------------+
The query I tried is:
SELECT type, to_char(commenteddate,'DD-MM-YYYY'), Count(type) FROM comments GROUP BY type, to_char(commenteddate,'DD-MM-YYYY');
Here's a kick at the can...
Assumptions:
you want to be able to switch the groupings to weekly or monthly only
the start of the first period will be the first date in the feedback data; intervals will be calculated from this initial date
output will show feedback value, time period, count
time periods will not overlap so periods will be x -> x + interval - 1 day
time of day is not important (time for commented dates is always 00:00:00)
First, create some sample data (100 rows):
drop table product_feedback purge;
create table product_feedback
as
select rownum as slno
, chr(65 + MOD(rownum, 26)) as userid
, lpad(chr(65 + MOD(rownum, 26)), 5, chr(65 + MOD(rownum, 26))) as comments
, trunc(sysdate) + rownum + trunc(dbms_random.value * 10) as commented_date
, case mod(rownum * TRUNC(dbms_random.value * 10), 3)
when 0 then 'positive'
when 1 then 'negative'
when 2 then 'neutral' end as feedback
from dual
connect by level <= 100
;
Here's what my sample data looks like:
select *
from product_feedback
;
SLNO USERID COMMENTS COMMENTED_DATE FEEDBACK
1 B BBBBB 2016-08-06 neutral
2 C CCCCC 2016-08-06 negative
3 D DDDDD 2016-08-14 positive
4 E EEEEE 2016-08-16 negative
5 F FFFFF 2016-08-09 negative
6 G GGGGG 2016-08-14 positive
7 H HHHHH 2016-08-17 positive
8 I IIIII 2016-08-18 positive
9 J JJJJJ 2016-08-12 positive
10 K KKKKK 2016-08-15 neutral
11 L LLLLL 2016-08-23 neutral
12 M MMMMM 2016-08-19 positive
13 N NNNNN 2016-08-16 neutral
...
Now for the fun part. Here's the gist:
find out what the earliest and latest commented dates are in the data
include a query where you can set the time period (to "WEEKS" or "MONTHS")
generate all of the (weekly or monthly) time periods between the min/max dates
join the product feedback to the time periods (commented date between start and end) with an outer join in case you want to see all time periods whether or not there was any feedback
group the joined result by feedback, period start, and period end, and set up a column to count one of the 3 possible feedback values
with
min_max_dates -- get earliest and latest feedback dates
as
(select min(commented_date) min_date, max(commented_date) max_date
from product_feedback
)
, time_period_interval
as
(select 'MONTHS' as tp_interval -- set the interval/time period here
from dual
)
, -- generate all time periods between the start date and end date
time_periods (start_of_period, end_of_period, max_date, time_period) -- recursive with clause - fun stuff!
as
(select mmd.min_date as start_of_period
, CASE WHEN tpi.tp_interval = 'WEEKS'
THEN mmd.min_date + 7
WHEN tpi.tp_interval = 'MONTHS'
THEN ADD_MONTHS(mmd.min_date, 1)
ELSE NULL
END - 1 as end_of_period
, mmd.max_date
, tpi.tp_interval as time_period
from time_period_interval tpi
cross join
min_max_dates mmd
UNION ALL
select CASE WHEN time_period = 'WEEKS'
THEN start_of_period + 7 * (ROWNUM )
WHEN time_period = 'MONTHS'
THEN ADD_MONTHS(start_of_period, ROWNUM)
ELSE NULL
END as start_of_period
, CASE WHEN time_period = 'WEEKS'
THEN start_of_period + 7 * (ROWNUM + 1)
WHEN time_period = 'MONTHS'
THEN ADD_MONTHS(start_of_period, ROWNUM + 1)
ELSE NULL
END - 1 as end_of_period
, max_date
, time_period
from time_periods
where end_of_period <= max_date
)
-- now put it all together
select pf.feedback
, tp.start_of_period
, tp.end_of_period
, count(*) as feedback_count
from time_periods tp
left outer join
product_feedback pf
on pf.commented_date between tp.start_of_period and tp.end_of_period
group by tp.start_of_period
, tp.end_of_period
, pf.feedback
order by pf.feedback
, tp.start_of_period
;
Output:
negative 2016-08-06 2016-09-05 6
negative 2016-09-06 2016-10-05 7
negative 2016-10-06 2016-11-05 8
negative 2016-11-06 2016-12-05 1
neutral 2016-08-06 2016-09-05 6
neutral 2016-09-06 2016-10-05 5
neutral 2016-10-06 2016-11-05 11
neutral 2016-11-06 2016-12-05 2
positive 2016-08-06 2016-09-05 17
positive 2016-09-06 2016-10-05 16
positive 2016-10-06 2016-11-05 15
positive 2016-11-06 2016-12-05 6
-- EDIT --
New and improved: all in one easy-to-use procedure. (I will assume you can configure the procedure to make use of the query in whatever way you need.) I made some changes to simplify the CASE statements in a few places. Note that, for whatever reason, using a LEFT OUTER JOIN in the main SELECT results in an ORA-600 error for me, so I switched it to an INNER JOIN.
CREATE OR REPLACE PROCEDURE feedback_counts(p_days_chosen IN NUMBER, p_cursor OUT SYS_REFCURSOR)
AS
BEGIN
OPEN p_cursor FOR
with
min_max_dates -- get earliest and latest feedback dates
as
(select min(commented_date) min_date, max(commented_date) max_date
from product_feedback
)
, time_period_interval
as
(select CASE
WHEN p_days_chosen BETWEEN 1 AND 10 THEN 'DAYS'
WHEN p_days_chosen > 10 AND p_days_chosen <=31 THEN 'WEEKS'
WHEN p_days_chosen > 31 AND p_days_chosen <= 365 THEN 'MONTHS'
ELSE '3-MONTHS'
END as tp_interval -- set the interval/time period here
from dual --(SELECT p_days_chosen as days_chosen from dual)
)
, -- generate all time periods between the start date and end date
time_periods (start_of_period, end_of_period, max_date, tp_interval) -- recursive with clause - fun stuff!
as
(select mmd.min_date as start_of_period
, CASE tpi.tp_interval
WHEN 'DAYS'
THEN mmd.min_date + 1
WHEN 'WEEKS'
THEN mmd.min_date + 7
WHEN 'MONTHS'
THEN mmd.min_date + 30
WHEN '3-MONTHS'
THEN mmd.min_date + 90
ELSE NULL
END - 1 as end_of_period
, mmd.max_date
, tpi.tp_interval
from time_period_interval tpi
cross join
min_max_dates mmd
UNION ALL
select CASE tp_interval
WHEN 'DAYS'
THEN start_of_period + 1 * ROWNUM
WHEN 'WEEKS'
THEN start_of_period + 7 * ROWNUM
WHEN 'MONTHS'
THEN start_of_period + 30 * ROWNUM
WHEN '3-MONTHS'
THEN start_of_period + 90 * ROWNUM
ELSE NULL
END as start_of_period
, start_of_period
+ CASE tp_interval
WHEN 'DAYS'
THEN 1
WHEN 'WEEKS'
THEN 7
WHEN 'MONTHS'
THEN 30
WHEN '3-MONTHS'
THEN 90
ELSE NULL
END * (ROWNUM + 1)
- 1 as end_of_period
, max_date
, tp_interval
from time_periods
where end_of_period <= max_date
)
-- now put it all together
select pf.feedback
, tp.start_of_period
, tp.end_of_period
, count(*) as feedback_count
from time_periods tp
inner join -- currently a bug that prevents the procedure from compiling with a LEFT OUTER JOIN
product_feedback pf
on pf.commented_date between tp.start_of_period and tp.end_of_period
group by tp.start_of_period
, tp.end_of_period
, pf.feedback
order by tp.start_of_period
, pf.feedback
;
END;
Test the procedure (in something like SQL*Plus or SQL Developer):
var x refcursor
exec feedback_counts(10, :x)
print :x
Hi, I want to sort my graph results by two criteria. My Cypher query looks like this:
MATCH (n:User{user_id:304020})-[r:know]->(m:User) with m MATCH (m)-[s:like|create|share]->(o{is_active:1})
with m, s, o, (toInt(timestamp()/1000)-toInt(o.created_on))/86400 as days,
(toInt(timestamp()/1000)-toInt(o.created_on))/3600 as hours,
(1- round(o.impression_count_all/20)/50) as low_boost
with m,s,o,days,low_boost,hours,
CASE
WHEN days > 30 THEN 0.05
WHEN days >=20 AND days <=30 THEN 0.1
WHEN days >=10 AND days <=20 THEN 0.2
WHEN days >=5 AND days <=10 THEN 0.4
WHEN days >=2 AND days <=5 THEN 0.5
WHEN days =1 THEN 0.6
WHEN days < 1 THEN
CASE
WHEN hours <= 2 THEN 1
WHEN hours > 2 AND hours <= 8 THEN 0.9
WHEN hours > 8 AND hours <= 16 THEN 0.8
WHEN hours > 16 AND hours < 23 THEN 0.75
WHEN hours >= 23 AND hours <= 24 THEN 0.7
END
END as rs,
CASE
WHEN low_boost > 0 THEN low_boost
WHEN low_boost <= 0 THEN 0
END as lb
where has(o.trending_score_all) and has(o.impression_count_all) and not(o.is_featured=2)
RETURN distinct o.story_id as story_id,
(o.trending_score_all*4) as ts, (o.trending_score_all + rs + lb) as final_score,
count(s) as rel_count,max(s.activity_id) as id, toInt(o.created_on) as created_on
ORDER BY (CASE WHEN ts > 3 THEN final_score desc, rel_count desc ELSE ts END) DESC
skip 0 limit 10;
Now, if ts > 3, I want to sort the results by both final_score and rel_count; otherwise, sort only by ts.
Please help me modify the ORDER BY.
Does this much simplified query (which uses a single argument for ORDER BY) work for you?
MATCH (u:User)-[r:like]->(s:Story)
WITH s, (s.trending_score_all*4) AS ts
RETURN DISTINCT s.story_id, ts, TOINT(s.impression_count_72)
ORDER BY (CASE WHEN ts > 3 THEN ts ELSE TOINT(s.impression_count_72) END) DESC
LIMIT 10;
[EDITED]
If you need to sort by a varying number of values (depending on the situation) you have to use a workaround, as Cypher does not support that directly.
For example, suppose when (ts > 3) you wanted to order by ts DESC and then by s.story_id ASC. In this situation, you could change the above ORDER BY clause to this:
ORDER BY
(CASE WHEN ts > 3 THEN ts ELSE TOINT(s.impression_count_72) END) DESC,
(CASE WHEN ts > 3 THEN s.story_id ELSE NULL END) ASC
By using NULL (or any literal value) in this way, you can have any of the sub-sorts effectively do nothing.
1) You can use pagination (SKIP and LIMIT).
2) If I understand what you need, then add an "else" branch to the sort:
UNWIND RANGE(1,100) as i
WITH i, rand()*5 as x, toInt(rand()*10) as y
RETURN i, x, y, CASE WHEN x>3 THEN 1 ELSE 0 END as for_sort
ORDER BY
CASE
WHEN for_sort=1 THEN x ELSE y END DESC,
CASE
WHEN for_sort=1 THEN y ELSE x END DESC
I'm having some difficulty with a MySQL query I want to write.
SELECT a.timestamp, name, count(b.name)
FROM time a, id b
WHERE a.user = b.user
AND a.id = b.id
AND b.name = 'John'
AND a.timestamp BETWEEN '2010-11-16 10:30:00' AND '2010-11-16 11:00:00'
GROUP BY a.timestamp
This is the current output of the statement:
timestamp name count(b.name)
------------------- ---- -------------
2010-11-16 10:32:22 John 2
2010-11-16 10:35:12 John 7
2010-11-16 10:36:34 John 1
2010-11-16 10:37:45 John 2
2010-11-16 10:48:26 John 8
2010-11-16 10:55:00 John 9
2010-11-16 10:58:08 John 2
How do I group them into 5-minute interval results?
I want my output to be like
timestamp name count(b.name)
------------------- ---- -------------
2010-11-16 10:30:00 John 2
2010-11-16 10:35:00 John 10
2010-11-16 10:40:00 John 0
2010-11-16 10:45:00 John 8
2010-11-16 10:50:00 John 0
2010-11-16 10:55:00 John 11
This works with every interval.
PostgreSQL
SELECT
TIMESTAMP WITH TIME ZONE 'epoch' +
INTERVAL '1 second' * round(extract('epoch' from timestamp) / 300) * 300 as timestamp,
name,
count(b.name)
FROM time a, id b
WHERE …
GROUP BY
round(extract('epoch' from timestamp) / 300), name
MySQL
SELECT
timestamp, -- not sure about that
name,
count(b.name)
FROM time a, id b
WHERE …
GROUP BY
UNIX_TIMESTAMP(timestamp) DIV 300, name
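For the uncertain timestamp column in the MySQL version above, one common variant (a sketch, not part of the original answer) multiplies the bucket number back out into a readable timestamp:
SELECT
    FROM_UNIXTIME((UNIX_TIMESTAMP(timestamp) DIV 300) * 300) AS interval_start,
    name,
    count(b.name)
FROM time a, id b
WHERE …
GROUP BY
    UNIX_TIMESTAMP(timestamp) DIV 300, name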
I came across the same issue.
I found that it is easy to group by any minute interval:
just divide the epoch by the interval in seconds, then either round or floor away the remainder. So if you want a 5-minute interval, you would use 300 seconds.
SELECT COUNT(*) cnt,
to_timestamp(floor((extract('epoch' from timestamp_column) / 300 )) * 300)
AT TIME ZONE 'UTC' as interval_alias
FROM TABLE_NAME GROUP BY interval_alias
interval_alias cnt
------------------- ----
2010-11-16 10:30:00 2
2010-11-16 10:35:00 10
2010-11-16 10:45:00 8
2010-11-16 10:55:00 11
This will return the data correctly grouped by the selected minute interval; however, it will not return intervals that don't contain any data. In order to get those empty intervals, we can use the generate_series function.
SELECT generate_series(MIN(date_trunc('hour',timestamp_column)),
max(date_trunc('minute',timestamp_column)),'5m') as interval_alias FROM
TABLE_NAME
Result:
interval_alias
-------------------
2010-11-16 10:30:00
2010-11-16 10:35:00
2010-11-16 10:40:00
2010-11-16 10:45:00
2010-11-16 10:50:00
2010-11-16 10:55:00
Now, to get the intervals with zero occurrences as well, we just outer join both result sets.
SELECT series.minute as interval, coalesce(cnt.amnt,0) as count from
(
SELECT count(*) amnt,
to_timestamp(floor((extract('epoch' from timestamp_column) / 300 )) * 300)
AT TIME ZONE 'UTC' as interval_alias
from TABLE_NAME group by interval_alias
) cnt
RIGHT JOIN
(
SELECT generate_series(min(date_trunc('hour',timestamp_column)),
max(date_trunc('minute',timestamp_column)),'5m') as minute from TABLE_NAME
) series
on series.minute = cnt.interval_alias
The end result will include the series of all 5-minute intervals, even those that have no values.
interval count
------------------- ----
2010-11-16 10:30:00 2
2010-11-16 10:35:00 10
2010-11-16 10:40:00 0
2010-11-16 10:45:00 8
2010-11-16 10:50:00 0
2010-11-16 10:55:00 11
The interval can easily be changed by adjusting the last parameter of generate_series (along with the matching 300-second divisor in the bucketing expression). In our case we use '5m', but it could be any interval we want.
You should use GROUP BY UNIX_TIMESTAMP(time_stamp) DIV 300 rather than round(... / 300), because with rounding I found that some records were counted into two different grouped result sets.
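A quick illustration of why (the timestamps are made up): ROUND pushes anything in the second half of a bucket up to the next boundary, so a single 5-minute window gets split across two groups, while DIV (a floor) keeps it together:
SELECT ROUND(UNIX_TIMESTAMP('2010-11-16 10:31:00') / 300),  -- lands in the 10:30 bucket
       ROUND(UNIX_TIMESTAMP('2010-11-16 10:34:00') / 300);  -- rounds up into the 10:35 bucket

SELECT UNIX_TIMESTAMP('2010-11-16 10:31:00') DIV 300,       -- 10:30 bucket
       UNIX_TIMESTAMP('2010-11-16 10:34:00') DIV 300;       -- same 10:30 bucket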
For Postgres, I found it easier and more accurate to use the date_trunc function, like:
select name, sum(count), date_trunc('minute',timestamp) as timestamp
FROM table
WHERE xxx
GROUP BY name,date_trunc('minute',timestamp)
ORDER BY timestamp
You can provide various resolutions like 'minute','hour','day' etc... to date_trunc.
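Note that date_trunc only snaps to whole units, so by itself it can't produce 5-minute bins; here is a sketch (table and column names are placeholders) of combining it with a little arithmetic for that case:
select name,
       sum(count),
       date_trunc('hour', timestamp)
         + interval '5 minutes' * floor(date_part('minute', timestamp) / 5) as bucket
FROM table_name
WHERE xxx
GROUP BY name, bucket
ORDER BY bucket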
The query will be something like:
SELECT
DATE_FORMAT(
MIN(timestamp),
'%d/%m/%Y %H:%i:00'
) AS tmstamp,
name,
COUNT(id) AS cnt
FROM
table
GROUP BY ROUND(UNIX_TIMESTAMP(timestamp) / 300), name
Not sure if you still need it.
SELECT FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(timestamp) / 300) * 300) AS t,
       timestamp,
       count(1) AS c
FROM users
GROUP BY t
ORDER BY t;
t                   | timestamp           | c |
--------------------+---------------------+---+
2016-10-29 19:35:00 | 2016-10-29 19:35:50 | 4 |
2016-10-29 19:40:00 | 2016-10-29 19:40:37 | 5 |
2016-10-29 19:45:00 | 2016-10-29 19:45:09 | 6 |
2016-10-29 19:50:00 | 2016-10-29 19:51:14 | 4 |
2016-10-29 19:55:00 | 2016-10-29 19:56:17 | 1 |
You're probably going to have to break your timestamp up into Y-M-D H:M parts and use DIV 5 to split the minutes up into 5-minute bins; something like:
select year(a.timestamp),
month(a.timestamp),
hour(a.timestamp),
minute(a.timestamp) DIV 5,
name,
count(b.name)
FROM time a, id b
WHERE a.user = b.user AND a.id = b.id AND b.name = 'John'
AND a.timestamp BETWEEN '2010-11-16 10:30:00' AND '2010-11-16 11:00:00'
GROUP BY year(a.timestamp),
month(a.timestamp),
hour(a.timestamp),
minute(a.timestamp) DIV 5
...and then futz the output in client code to make it appear the way you like. Or you can build up the whole date string using the SQL concat operator instead of getting separate columns, if you like:
select concat(year(a.timestamp), "-", month(a.timestamp), "-" ,day(a.timestamp),
" " , lpad(hour(a.timestamp),2,'0'), ":",
lpad((minute(a.timestamp) DIV 5) * 5, 2, '0'))
...and then group on that
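Assembled from the fragments above, the grouped version would look something like this (a sketch):
select concat(year(a.timestamp), "-", month(a.timestamp), "-", day(a.timestamp),
       " ", lpad(hour(a.timestamp),2,'0'), ":",
       lpad((minute(a.timestamp) DIV 5) * 5, 2, '0')) as bucket,
       b.name,
       count(b.name)
FROM time a, id b
WHERE a.user = b.user AND a.id = b.id AND b.name = 'John'
  AND a.timestamp BETWEEN '2010-11-16 10:30:00' AND '2010-11-16 11:00:00'
GROUP BY bucket, b.name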
How about this one:
select
from_unixtime(unix_timestamp(timestamp) - unix_timestamp(timestamp) mod 300) as ts,
sum(value)
from group_interval
group by ts
order by ts
;
I found out that with MySQL the correct query is probably the following:
SELECT SUBSTRING( FROM_UNIXTIME( CEILING( timestamp /300 ) *300,
'%Y-%m-%d %H:%i:%S' ) , 1, 19 ) AS ts_CEILING,
SUM(value)
FROM group_interval
GROUP BY SUBSTRING( FROM_UNIXTIME( CEILING( timestamp /300 ) *300,
'%Y-%m-%d %H:%i:%S' ) , 1, 19 )
ORDER BY SUBSTRING( FROM_UNIXTIME( CEILING( timestamp /300 ) *300,
'%Y-%m-%d %H:%i:%S' ) , 1, 19 ) DESC
Let me know what you think.
select
    CONCAT(CAST(CREATEDATE AS DATE), ' ', DATEPART(HOUR, CREATEDATE), ':',
           ROUND(CAST(CAST(CAST(DATEPART(MINUTE, CREATEDATE) AS DECIMAL(18,4)) / 5 AS INT) AS DECIMAL(18,4)) / 12 * 60, 2)) AS '5MINDATE'
    ,count(something)
from TABLE
group by
    CONCAT(CAST(CREATEDATE AS DATE), ' ', DATEPART(HOUR, CREATEDATE), ':',
           ROUND(CAST(CAST(CAST(DATEPART(MINUTE, CREATEDATE) AS DECIMAL(18,4)) / 5 AS INT) AS DECIMAL(18,4)) / 12 * 60, 2))
This will do exactly what you want. Replace:
dt - your datetime column
c - your call field
astro_transit1 - your table
300 - the number of seconds for each time-gap increment
SELECT
FROM_UNIXTIME(300 * ROUND(UNIX_TIMESTAMP(r.dt) / 300)) AS 5datetime,
(SELECT
r.c
FROM
astro_transit1 ra
WHERE
ra.dt = r.dt
ORDER BY ra.dt DESC
LIMIT 1) AS first_val
FROM
astro_transit1 r
GROUP BY UNIX_TIMESTAMP(r.dt) DIV 300
LIMIT 0 , 30
Based on @boecko's answer for MySQL, I used a CTE (Common Table Expression) to speed up the query execution. So this:
SELECT
`timestamp`,
`name`,
count(b.`name`)
FROM `time` a, `id` b
WHERE …
GROUP BY
UNIX_TIMESTAMP(`timestamp`) DIV 300, name
becomes:
WITH cte AS (
    SELECT
        `timestamp`,
        `name`,
        UNIX_TIMESTAMP(`timestamp`) DIV 300 AS `intervals`
    FROM `time` a, `id` b
    WHERE …
)
SELECT `intervals`, `name`, count(`name`) FROM cte GROUP BY `intervals`, `name`
On a large amount of data, the speed improved by more than 10x!
As timestamp and time are reserved words in MySQL, don't forget to use backticks (`...`) around each such table and column name!
Hope this helps some of you.