Hi, I want to sort my graph results by two filters.
My Cypher query looks like this:
MATCH (n:User{user_id:304020})-[r:know]->(m:User) with m MATCH (m)-[s:like|create|share]->(o{is_active:1})
with m, s, o, (toInt(timestamp()/1000)-toInt(o.created_on))/86400 as days,
(toInt(timestamp()/1000)-toInt(o.created_on))/3600 as hours,
(1- round(o.impression_count_all/20)/50) as low_boost
with m,s,o,days,low_boost,hours,
CASE
WHEN days > 30 THEN 0.05
WHEN days >=20 AND days <=30 THEN 0.1
WHEN days >=10 AND days <=20 THEN 0.2
WHEN days >=5 AND days <=10 THEN 0.4
WHEN days >=2 AND days <=5 THEN 0.5
WHEN days =1 THEN 0.6
WHEN days < 1 THEN
CASE
WHEN hours <= 2 THEN 1
WHEN hours > 2 AND hours <= 8 THEN 0.9
WHEN hours > 8 AND hours <= 16 THEN 0.8
WHEN hours > 16 AND hours < 23 THEN 0.75
WHEN hours >= 23 AND hours <= 24 THEN 0.7
END
END as rs,
CASE
WHEN low_boost > 0 THEN low_boost
WHEN low_boost <= 0 THEN 0
END as lb
where has(o.trending_score_all) and has(o.impression_count_all) and not(o.is_featured=2)
RETURN distinct o.story_id as story_id,
(o.trending_score_all*4) as ts, (o.trending_score_all + rs + lb) as final_score,
count(s) as rel_count,max(s.activity_id) as id, toInt(o.created_on) as created_on
ORDER BY (CASE WHEN ts > 3 THEN final_score desc, rel_count desc ELSE ts) END) DESC
skip 0 limit 10;
Now, if ts > 3, I want to sort the results by both final_score and rel_count; otherwise, sort only by ts.
Please modify the ORDER BY.
Does this much-simplified query (which uses a single argument for ORDER BY) work for you?
MATCH (u:User)-[r:like]->(s:Story)
WITH s, (s.trending_score_all*4) AS ts
RETURN DISTINCT s.story_id, ts, TOINT(s.impression_count_72)
ORDER BY (CASE WHEN ts > 3 THEN ts ELSE TOINT(s.impression_count_72) END) DESC
LIMIT 10;
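Because the CASE yields a single comparable value per row, one ORDER BY argument is enough when only the sort key (not the number of keys) varies; the edit below covers the multi-key case.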
[EDITED]
If you need to sort by a varying number of values (depending on the situation) you have to use a workaround, as Cypher does not support that directly.
For example, suppose when (ts > 3) you wanted to order by ts DESC and then by s.story_id ASC. In this situation, you could change the above ORDER BY clause to this:
ORDER BY
(CASE WHEN ts > 3 THEN ts ELSE TOINT(s.impression_count_72) END) DESC,
(CASE WHEN ts > 3 THEN s.story_id ELSE NULL END) ASC
By using NULL (or any literal value) in this way, you can have any of the sub-sorts effectively do nothing.
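Applied to your original query, the ORDER BY could look something like this (a sketch, assuming ts, final_score, and rel_count are projected as in the question):
ORDER BY
(CASE WHEN ts > 3 THEN final_score ELSE ts END) DESC,
(CASE WHEN ts > 3 THEN rel_count ELSE NULL END) DESC
SKIP 0 LIMIT 10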
1) You are using pagination (SKIP and LIMIT).
2) If I understand what you need, then add an ELSE branch to the sort:
UNWIND range(1, 100) AS i
WITH i, rand()*5 AS x, toInt(rand()*10) AS y
RETURN i, x, y, CASE WHEN x > 3 THEN 1 ELSE 0 END AS for_sort
ORDER BY
CASE WHEN for_sort = 1 THEN x ELSE y END DESC,
CASE WHEN for_sort = 1 THEN y ELSE x END DESC
I'm trying to count the number of rows that Neo4j will return, but the count (or the query) is very slow.
Version 1 (70 sec):
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WHERE order.timestamp >= 1632434400 AND size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) <= 20
WITH order
MATCH (order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
RETURN COUNT(product);
Version 2 (68 sec.):
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WITH size((order)<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(:Product)) AS amount
WHERE order.timestamp >= 1632434400 AND amount <= 20
RETURN SUM(amount)
Using Neo4j 4.4 Community with about 800,000 orders and about 17,000,000 order lines.
Is there a more efficient way to count the rows?
These are the indexes:
CREATE INDEX idx_order_torder_id FOR (n:Order) ON (n.order_id);
CREATE INDEX idx_order_timestamp FOR (n:Order) ON (n.timestamp);
CREATE INDEX idx_person_person_id FOR (n:Person) ON (n.person_id);
CREATE INDEX idx_product_product_id FOR (n:Product) ON (n.product_id);
The number of rows is 4,269,011.
Please try the query below; hopefully it will give faster results:
MATCH (person:Person)-[:HAS_ORDER]->(order:Order)
WHERE order.timestamp >= 1632434400
WITH order.order_id AS orderid
MATCH (o:Order { order_id: orderid })<-[:HAS_ORDER]-(:OrderLine)-[:HAS_PRODUCT]->(product:Product)
WITH o, COUNT(product) AS productCount // the non-aggregated o makes this a per-order count
WHERE productCount <= 20
RETURN SUM(productCount);
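Note that the non-aggregated o in the WITH is what makes COUNT(product) a per-order count; without a grouping key, Cypher collapses all rows into one global count, which the WHERE would then reject.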
Because every order line has exactly one product, I can skip counting the order-line-to-product relationships:
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order
MATCH (order)<-[:HAS_ORDER]-(orderLine:OrderLine)
WITH order, COUNT(orderLine) AS productCount // count per order
WHERE productCount <= 20
RETURN SUM(productCount);
This query took 0m17.342s.
But I managed to shave off a couple more seconds with the following query:
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
WITH order, size((order)<-[:HAS_ORDER]-(:OrderLine)) AS amount
WHERE amount <= 20
RETURN SUM(amount);
This query took 0m15.675s.
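If every incoming :HAS_ORDER relationship on an order comes from an :OrderLine anyway, one more tweak worth trying (a sketch, not benchmarked here) is to drop the label from the size() pattern, so Neo4j can answer it with a plain degree lookup instead of expanding and label-checking each neighbour:
MATCH (order:Order)
WHERE order.timestamp >= 1632434400
// size() on a label-free pattern compiles to a GetDegree operation,
// which reads the relationship count straight off the node record
WITH order, size((order)<-[:HAS_ORDER]-()) AS amount
WHERE amount <= 20
RETURN SUM(amount);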
Let's say I have nodes whose values go up in steps of 10. I want to find the first gap in the values.
Here is how I would do it in numpy:
> np.where(np.diff([11,21,31,51,61,71,91]) > 10)[0][0] + 2
> 4 (i.e. the first missing value is 41)
How would I do this in Cypher... ?
match (n) where n.val % 10 = 1
with n.val as val
order by val ....???
I'm using RedisGraph.
PS:
If there is no gap, it should return the next value, i.e. the biggest + 10, if possible!
I'm not sure if this is the most performant solution, but you can accomplish this using a combination of collect() and list comprehensions:
MATCH (n) WHERE n.val % 10 = 1 WITH n.val AS val ORDER BY n.val // collect ordered vals
WITH collect(val) AS vals // combine vals into array
WITH vals, [idx IN range(0, size(vals) - 2) WHERE vals[idx + 1] - vals[idx] > 10] AS gaps // collect the indexes where the diff > 10
RETURN vals[gaps[0]] + 10 // return missing value
To additionally return the next value (biggest + 10) if no gaps are found, change the RETURN clause to use a CASE expression:
RETURN CASE size(gaps) WHEN 0 THEN vals[-1] + 10 ELSE vals[gaps[0]] + 10 END
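As a quick sanity check, the same logic can be run against an inline list with no data loaded (a sketch; vals stands in for the collected node values):
WITH [11, 21, 31, 51, 61, 71, 91] AS vals
WITH vals, [idx IN range(0, size(vals) - 2) WHERE vals[idx + 1] - vals[idx] > 10] AS gaps
RETURN CASE size(gaps) WHEN 0 THEN vals[-1] + 10 ELSE vals[gaps[0]] + 10 END AS missing
// returns 41, the first gap (between 31 and 51)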
I use the following Cypher query:
MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
WHERE v.id = {valueId}
OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
WHERE {fetchCreateUsers}
WITH u, hv
ORDER BY hv.createDate DESC
WITH count(hv) as count, ceil(toFloat(count(hv)) / {maxResults}) as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count
This query works fine and does exactly what I need.
Right now I'm concerned about two possible issues in this query, from the point of view of performance and Cypher best practices.
First of all, as you can see, I use the same count(hv) function twice. Will it cause performance problems during execution, or are Cypher and Neo4j smart enough to optimize it? If not, please show me how to fix it.
The second place is the CASE expression inside the range() function. The same question applies here: will this CASE expression be evaluated only once, or once per iteration over my range? Please show how to fix it if needed.
UPDATED
I tried using a separate WITH for the count, but the query doesn't return any results (it returns an empty result):
MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
WHERE v.id = {valueId}
OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
WHERE {fetchCreateUsers}
WITH u, hv ORDER BY hv.createDate DESC
WITH u, hv, count(hv) as count
WITH u, hv, count, ceil(toFloat(count) / {maxResults}) as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count
1 MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
2 WHERE v.id = {valueId}
3 OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
4 WHERE {fetchCreateUsers}
5 WITH u, hv
6 ORDER BY hv.createDate DESC
7 WITH count(hv) as count, ceil(toFloat(count(hv)) / {maxResults}) as step, COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
8 RETURN REDUCE(s = [], i IN RANGE(0, count - 1, CASE step WHEN 0 THEN 1 ELSE step END) | s + data[i]) AS result, step, count
(1) You need to pass hv in line 5, because its values are collected in line 7. That said, you can still do something like this:
5 WITH u, collect(hv) AS hvs, count(hv) as count
UNWIND hvs AS hv
However, this is not very elegant and probably not worth doing.
(2) You can calculate the CASE expression once, in line 7:
7 WITH count, data, step, CASE step WHEN 0 THEN 1 ELSE step END AS stepFlag
8 RETURN REDUCE(s = [], i IN RANGE(0, count - 1, stepFlag) | s + data[i]) AS result, step, count
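Putting this together, the whole query might read as follows (a sketch; the repeated count(hv) is kept, since per (1) the workaround is probably not worth it):
MATCH (v:Value)-[:CONTAINS]->(hv:HistoryValue)
WHERE v.id = {valueId}
OPTIONAL MATCH (hv)-[:CREATED_BY]->(u:User)
WHERE {fetchCreateUsers}
WITH u, hv
ORDER BY hv.createDate DESC
WITH count(hv) AS count,
     ceil(toFloat(count(hv)) / {maxResults}) AS step,
     COLLECT({userId: u.id, historyValueId: hv.id, historyValue: hv.originalValue, historyValueCreateDate: hv.createDate}) AS data
// evaluate the CASE once, before the RANGE/REDUCE loop
WITH count, data, step, CASE step WHEN 0 THEN 1 ELSE step END AS stepFlag
RETURN REDUCE(s = [], i IN RANGE(0, count - 1, stepFlag) | s + data[i]) AS result, step, count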
I'm working on a project where I have to make real-time recommendations based on filters. I decided to take a look at graph databases, started to play with Neo4j, and compared its performance with MySQL.
The row counts are about:
"broadcast": 159844,
"format": 5,
"genre": 10,
"program": 60495
The MySQL query looks like:
select f.id, sum(weight) as total
from
(
select program.id, 15 as weight
from broadcast
inner join program on broadcast.programId = program.id
where broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 10 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
where genre.id in (13) and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
union all
select program.id, 5 as weight
from broadcast
inner join program on broadcast.programId = program.id
inner join genre ON program.genreId = genre.id
inner join format on genre.formatId = format.id
where format.id = 6 and broadcast.startedAt > now() and broadcast.startedAt < date_add(now(), INTERVAL +1 DAY)
group by program.id
) f
group by f.id
order by total desc, id desc
limit 0, 50
On my local machine the query performs in about 300 ms. That may be acceptable, but under 100 ms would be better for real-time processing.
With some help, I have also written a TinkerPop 3 query:
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.in('programBroadcast')
.dedup()
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.order(local).by(valueDecr)
.limit(local, 50)
The query performs in about 700 ms!
===== EDIT =====
I wanted to display the profiling of the query, and I got:
Step Count Traversers Time (ms) % Dur
=============================================================================================================
Neo4jGraphStep([],vertex) 220513 220513 14135.788 68.52
HasStep([~label.eq(broadcast)]) 159844 159844 391.087 1.90
VertexStep(IN,[programBroadcast],vertex) 159825 159825 267.202 1.30
DedupGlobalStep 60495 60495 95.848 0.46
UnionStep([[LambdaFilterStep(lambda)#[p, w], Pr... 63247 63247 2008.553 9.74
LambdaFilterStep(lambda)#[p, w] 60495 60495 194.406
SelectStep([p, w],[value(id), [ConstantStep(1... 60495 60495 487.158
ConstantStep(15) 60495 60495 24.214
EndStep 60495 60495 110.575
TraversalFilterStep([VertexStep(OUT,[programG... 2070 2070 410.689
VertexStep(OUT,[programGenre],vertex) 22540 22540 191.158
HasStep([id.eq(6)]) 0 0 140.934
SelectStep([p, w],[value(id), [ConstantStep(1... 2070 2070 52.203
ConstantStep(10) 2070 2070 0.654
EndStep 2070 2070 43.120
TraversalFilterStep([VertexStep(OUT,[programG... 682 682 443.347
VertexStep(OUT,[programGenre],vertex) 22540 22540 119.115
VertexStep(OUT,[genreFormat],vertex) 27510 27510 117.410
HasStep([id.eq(1)]) 0 0 133.517
SelectStep([p, w],[value(id), [ConstantStep(5... 682 682 43.247
ConstantStep(5) 682 682 0.217
EndStep 682 682 44.427
GroupStep([SelectOneStep(p), ProfileStep],[Sele... 1 1 3583.249 17.37
SelectOneStep(p) 63247 63247 26.836
SelectOneStep(w) 63247 63247 81.623
SumGlobalStep 60495 60495 3107.593
SelectOneStep(w) 0 0 0.000
SumGlobalStep 0 0 0.000
UnfoldStep 60495 60495 17.227 0.08
OrderGlobalStep([valueDecr]) 60495 60495 114.439 0.55
FoldStep 1 1 16.902 0.08
RangeLocalStep(0,10) 1 1 0.081 0.00
SideEffectCapStep([~metrics]) 1 1 0.215 0.00
This shows that 68% of the time is spent in the g.V() step, which has no index!
So I tried to find a way to have a single starting point, and did this:
graph.addVertex(label, 'day', 'id', 1)
graph.cypher("CREATE INDEX ON :day(id)")
g.V().hasLabel('broadcast')
.has('startedAt', inside(new Date().getTime(), new Date().getTime() + (1000 * 60 * 60 * 24)))
.map{it.get().addEdge('broadcastDay', g.V().hasLabel('day').has('id', 1).next())}
And now the query looks like:
g.V(14727)
.in('broadcastDay')
.in('programBroadcast')
.union(
filter{true}
.as('p', 'w').select('p', 'w').by('id').by(constant(15)),
filter(out('programGenre').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(10)),
filter(out('programGenre').out('genreFormat').has('id', 4))
.as('p', 'w').select('p', 'w').by('id').by(constant(5))
)
.group().by(select("p")).by(select("w").sum())
.unfold().order().by(valueDecr).fold()
.limit(local, 50)
and the execution time is now 137 ms!
===== END EDIT =====
Neo4j is slower than MySQL in my case...
So I decided to write the query in Cypher (thanks to this post) with this naive approach:
WITH [] AS res
MATCH (b:broadcast)-[:programBroadcast]-(p:program)
WHERE b.startedAt > timestamp() and b.startedAt < (timestamp() + 1000 * 60 * 60 * 24)
OPTIONAL MATCH (p)
WITH res, COLLECT({id: p.id, weight: 15}) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre{id:4})
WITH res, (CASE WHEN g IS NOT NULL THEN COLLECT({id: p.id, weight: 10}) ELSE [] END) AS data
WITH res + data AS res
OPTIONAL MATCH (p)-[:programGenre]-(g:genre)-[:genreFormat]-(f:format{id:4})
WITH res, (CASE WHEN f IS NOT NULL THEN COLLECT({id: p.id, weight: 5}) ELSE [] END) AS data
WITH res + data AS res
UNWIND res AS result
WITH result, result.id as id, SUM(result.weight) as weight
ORDER BY weight DESC
LIMIT 10
RETURN id, weight
It takes about 68,614 ms!!
I'm very disappointed with graph databases, but I don't understand why: I created indexes on every property and gave Java about 4 GB of memory, yet it still lags far behind MySQL. Why? Are graph databases only for big data, where MySQL can't perform the joins?
The structure of my database is:
( :node ) -[:give { money: some_int_value } ]-> ( :Org )
One node can have multiple relations.
I need to find the top 3 nodes with the most :give relationships whose money property satisfies vx <= money <= vy.
Using ORDER BY and LIMIT should solve your problem:
MATCH (n:node)-[r:give { money: some_int_value }]->(:Org)
RETURN n, count(r) AS num // count the relationships per node
ORDER BY num DESC // order by the number of relationships each node has
LIMIT 3 // we only want the top 3 nodes
Instead of using the label 'node', maybe use something more descriptive like Person, so the data model is clearer:
MATCH (p:Person)-[r:give]->(o:Org)
WITH count(r) AS num, sum(r.money) AS total, p
RETURN p, num, total ORDER BY num DESC LIMIT 3;
I'm not sure what you mean by "their property money holding: vx <= money <= vy". If you could clarify, I can update my answer accordingly. You can calculate the total of the money properties using the sum() function.
Edit
To only include relationships whose money property is at least 10 and at most 25:
MATCH (p:Person)-[r:give]->(o:Org)
WHERE r.money >= 10 AND r.money <= 25
WITH count(r) AS num, sum(r.money) AS total, p
RETURN p, num, total ORDER BY num DESC LIMIT 3;
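If the bounds come from the application, they can be passed as parameters instead of being hard-coded (a sketch; $vx and $vy are assumed parameter names, not part of the original answer):
MATCH (p:Person)-[r:give]->(o:Org)
WHERE r.money >= $vx AND r.money <= $vy // inclusive range, per the question
WITH count(r) AS num, sum(r.money) AS total, p
RETURN p, num, total ORDER BY num DESC LIMIT 3;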