Would anyone know how to count the number of records using distinct but with multiple columns?
An example of what I want would be:
SELECT COUNT(DISTINCT a, b, c, d)
FROM temp
I am using Informix.
Would this do the job (assuming a sufficiently recent version of Informix):
SELECT COUNT(*)
FROM (SELECT DISTINCT a, b, c, d FROM temp);
The inner SELECT generates the list of distinct combinations of the columns a, b, c, and d; the outer SELECT counts the number of rows generated by the inner SELECT.
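The same "count the rows of a DISTINCT subquery" idea can be tried without an Informix server. This is only a sketch: it uses Python's built-in sqlite3 module rather than Informix, and the table name sample_rows, the sample data and the helper count_distinct_rows are made up for illustration.

```python
import sqlite3


def count_distinct_rows(conn, table, columns):
    """Count distinct combinations of `columns` by counting the rows
    of a SELECT DISTINCT subquery (same shape as the query above)."""
    column_list = ", ".join(columns)
    # Identifiers are interpolated directly, so only pass trusted names.
    sql = f"SELECT COUNT(*) FROM (SELECT DISTINCT {column_list} FROM {table})"
    return conn.execute(sql).fetchone()[0]


# Hypothetical stand-in for the "temp" table from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample_rows (a INTEGER, b INTEGER, c INTEGER, d INTEGER)")
conn.executemany(
    "INSERT INTO sample_rows VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 1),
        (1, 1, 1, 1),  # duplicate of the first row
        (1, 1, 1, 2),
        (2, 1, 1, 1),
        (2, 1, 1, 1),  # duplicate of the fourth row
    ],
)

distinct_count = count_distinct_rows(conn, "sample_rows", ["a", "b", "c", "d"])
print(distinct_count)
```

The five inserted rows contain three distinct (a, b, c, d) combinations, so the outer COUNT(*) sees three rows.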
It's unlikely that you don't have a sufficiently recent version of Informix.
Unfortunately I have version 9 of Informix and that doesn't work for me.
I wish I wasn't as prescient — you don't have a sufficiently recent version of Informix. If you're using an obsolete version of Informix, it is important to say which version when you ask the question. Assuming you're using 9.40 (the latest version in the 9.x family, first released in 2003) and not some even more antique version such as 9.00 (from 1996), then there are versions 10.00, 11.10, 11.50 and 11.70 that have all been released since your version, and these versions have also all been removed from support. The currently supported versions are 12.10 and 14.10. You should upgrade to 14.10. (If you ask more questions about Informix, please include your version number in the question. You'll get a better answer in the first pass.)
Your simplest alternative is to use:
SELECT DISTINCT a, b, c, d FROM temp INTO TEMP distinct_a_b_c_d;
SELECT COUNT(*) FROM distinct_a_b_c_d;
DROP TABLE distinct_a_b_c_d;
Given that you're using version 9.x, you can't even use the protective DROP TABLE IF EXISTS distinct_a_b_c_d; before the sequence to remove a pre-existing table.
I'm not sure whether there are any feasible alternatives to creating the intermediate result table (which could be a permanent table, at a pinch, but that's problematic for a variety of reasons).
select count(*)
from table (
multiset(select distinct a, b, c, d from temp)
) a;
I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113,339,030 dependency relationships and 19,753,269 versions.
My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with recursive common table expression.
with recursive cte as (
select
child_id
from
dependencies
where
dependencies.parent_id = 16674850
union
select
dependencies.child_id
from
cte
left join dependencies on
cte.child_id = dependencies.parent_id
where
cte.child_id is not null
)
select * from cte;
That returns 1,726 elements, which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1,445 dependencies.
However I'd like to get the path to the nodes and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL but the query will be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the maximum depth of the variable-length pattern to 12 or greater.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible paths that match a pattern. That makes it a poor fit when you only want the distinct reachable nodes: you don't care about expanding every distinct path, only about finding the nodes and ignoring alternate paths to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, which can make the problem worse.
I'd recommend using APOC Procedures for this, as there are procedures optimized for expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the one to use.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
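To make the difference between enumerating every path and collecting distinct reachable nodes concrete, here is a small sketch in plain Python. It is not APOC's implementation, only a breadth-first search with a visited set. The layered toy graph and the names layered_graph, count_paths and reachable_nodes are invented for illustration, and count_paths assumes an acyclic graph.

```python
from collections import deque


def layered_graph(layers):
    """Toy DAG: a root, then `layers` layers of two nodes each, where
    every node points at both nodes of the next layer."""
    graph = {"root": ["a0", "b0"]}
    for i in range(layers - 1):
        for name in (f"a{i}", f"b{i}"):
            graph[name] = [f"a{i + 1}", f"b{i + 1}"]
    return graph


def count_paths(graph, start, max_level):
    """Count every path of length 1..max_level starting at `start`.
    Assumes the graph is acyclic."""
    def walk(node, depth):
        if depth == max_level:
            return 0
        return sum(1 + walk(child, depth + 1) for child in graph.get(node, ()))
    return walk(start, 0)


def reachable_nodes(graph, start, max_level=None):
    """Breadth-first search that visits each node once, ignoring any
    alternate paths to nodes that were already seen."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, depth = queue.popleft()
        if max_level is not None and depth == max_level:
            continue  # do not expand beyond the depth limit
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found


graph = layered_graph(10)
path_count = count_paths(graph, "root", 10)
node_count = len(reachable_nodes(graph, "root", 10))
print(path_count, node_count)
```

On this toy graph there are 2,046 paths but only 20 reachable nodes, and the path count roughly doubles with every extra layer while the node count grows by two. That is the kind of blow-up a depth limit of 12 can trigger on a densely shared dependency graph.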
I don't have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
If that works, I would run:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND causes an issue. To get the set of unique nodes, you could also try APOC:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
apoc.coll.flatten(
COLLECT(nodes(p))
)
) AS uniqueNodes
With the following graph:
How can I write a query that would return N latest relationships by the unique target node?
For example, this query: MATCH (p)-[r:RATED_IN]->(s) WHERE id(p)={person} RETURN p,s,r ORDER BY r.measurementDate DESC LIMIT {N} with N = 1 would return only the single latest relationship, whether it is RATED_IN Team Lead or Programming, but I would like to get the N latest for each target node. With N = 2, I would like the 2 latest measurements per skill node.
I would like the latest relationship by a person for Team Lead and the latest one for Programming.
How can I write such a query?
-- EDIT --
MATCH (p:Person) WHERE id(p)=175
CALL apoc.cypher.run('
WITH {p} AS p
MATCH (p)-[r:RATED_IN]->(s)
RETURN DISTINCT s, r ORDER BY r.measurementDate DESC LIMIT 2',
{p:p}) YIELD value
RETURN p,value.r AS r, value.s AS s
Here's a Cypher knowledge base article on limiting MATCH results per row, with a few different suggestions on how to accomplish this given current limitations. Using APOC's apoc.cypher.run() to perform a subquery with a RETURN using a LIMIT will do the trick, as it gets executed per row (thus the LIMIT is per row).
Note that for the upcoming Neo4j 4.0 release at the end of the year we're going to be getting some nice Cypher goodies that will make this significantly easier. Stay tuned as we reveal more details as we approach its release!
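For readers who want to see what "N latest per skill" means independently of Cypher, here is a minimal sketch in plain Python. The rating records, their field names other than measurementDate, and the helper latest_per_skill are hypothetical, and the dates are assumed to be ISO-formatted strings so that they sort chronologically.

```python
from collections import defaultdict


def latest_per_skill(ratings, n):
    """Group ratings by skill and keep the n most recent in each group,
    i.e. a LIMIT applied per skill instead of to the whole result."""
    by_skill = defaultdict(list)
    for rating in ratings:
        by_skill[rating["skill"]].append(rating)
    return {
        skill: sorted(rows, key=lambda r: r["measurementDate"], reverse=True)[:n]
        for skill, rows in by_skill.items()
    }


# Hypothetical RATED_IN relationships of one person.
ratings = [
    {"skill": "Team Lead", "measurementDate": "2017-01-10", "level": 2},
    {"skill": "Team Lead", "measurementDate": "2017-03-05", "level": 3},
    {"skill": "Programming", "measurementDate": "2016-11-20", "level": 4},
    {"skill": "Programming", "measurementDate": "2017-02-14", "level": 5},
    {"skill": "Programming", "measurementDate": "2017-04-01", "level": 5},
]

latest_one = latest_per_skill(ratings, 1)
latest_two = latest_per_skill(ratings, 2)
for skill, rows in latest_two.items():
    print(skill, [row["measurementDate"] for row in rows])
```

A single global ORDER BY ... LIMIT 2 over this data would return two Programming rows and no Team Lead row, which is why the limit has to be applied per group.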
I've got a graph consisting of nodes representing document versions, connected together in paths of versions. These paths can be connected by another type of relationship that represents a change in the way documents were versioned. One problem with the graph is that the sources used to create it were not really clean, which is why I'm trying to write a query that adds a relationship to give the graph a clean path of versions.
The part I'm stuck on is the following:
Say that I've got two paths of nodes from two different versioning periods. These paths are connected by one or more relationships of the second type, which indicate a port of the document to the new system. I want a query that takes the last node satisfying some conditions in the old path and connects it to the first node satisfying some other conditions in the new path.
For example, in the following graph I would want to connect (D) to (2) because (1) does not satisfy my set of conditions:
(A)-[:Version]->(B)-[:Version]->(C)-[:Version]->(D)
| |
Ported Ported
| |
(1)-[:Version]->(2)-[:Version]->(3)
I came up with different queries, but all of them fail in some cases:
This one fails because old documents were sometimes ported and split into multiple documents, meaning different paths, but my query selects only one 'new' node for each 'old' one and thus ignores some paths.
//match all the 'port' and 'ported' relations between old and new versioning system
match (new:Document)-[r:Link]-(old:Document)
where new.num =~'[A-Z].{4}-.*' and old.num =~'[A-Z].{3}-.*' and r.type in ['PORT','PORTED']
//find youngest one satisfying a condition, here a date
optional match(new)<-[:Version*]-(newAncestor:ArticleCode)
where newAncestor.dateBegin >= '2012-01-01'
with old, collect(new) + collect(newAncestor) as potentialNewVersions
unwind potentialNewVersions as potentialNew
with distinct old, potentialNew
order by potentialNew.dateBegin , potentialNew.dateEnd
with distinct old, collect(potentialNew)[0] as youngestNew
//find oldest one satisfying a condition
optional match(old) -[:Version *]->(oldChild:ArticleCode)
where oldChild.dateEnd <= youngestNew.dateBegin
with old, youngestNew, collect(old) + collect(oldChild) as potentialOldVersions
unwind potentialOldVersions as potentialOld
with distinct old, youngestNew, potentialOld
order by potentialOld.dateEnd desc, potentialOld.dateBegin desc
with distinct youngestNew, collect(potentialOld)[0] as oldestOld
merge(youngestNew)<-[:VersionGlobal]-(oldestOld)
The second one is much simpler but selects too many nodes for the 'new' ones, as multiple versions can satisfy the date condition. In addition, it could fail if the only 'ported' relationship between the old and new paths was on a node before the limit date.
//this time I match all path of new versions whose first node satisfy condition
match p=(new:Document)-[:Version*0..]->(:Document)-[r:Link]-(old:ArticleCode)
where new.num =~'[A-Z].{4}-.*' and old.num =~'[A-Z].{3}-.*' and r.type in ['PORT','PORTED'] and new.dateBegin >= '2012-01-01'
//take first node of each path
with distinct nodes(p)[0] as youngestNew, old
//find latest old node
optional match p=(old)-[:Version*0..]->(oldChild:ArticleCode)
where oldChild.dateEnd <= youngestNew.dateBegin
with distinct last(nodes(p)) as oldestOld, old
merge(youngestNew)<-[:VersionGlobal]-(oldestOld)
Thanks
I think we found an answer using optional matches and cases:
match (new:Document)-[:Version*0..]-(:Document)-[r:Link]-(:Document)-[:Version*0..]-(old:Document)
where *myConditions*
optional match (newAncestor:Document)-[:Version]->(new)
with distinct
case
when newAncestor.dateBegin < '2012-01-01' or newAncestor is null
then new
end as youngestNew, old
where not(youngestNew is null)
optional match (old)-[:Version]->(oldChild:Document)
with distinct
youngestNew,
case
when oldChild.dateBegin > youngestNew.dateBegin or oldChild is null
then old
end as oldestOld
where not(oldestOld is null)
*merge part*
First, sorry for my English. I am modeling a railway database in Neo4j. I want to link stations in the order in which the railway connects them, using the stops table. Every stop has a "stop sequence" that unfortunately is not always consecutive like 1, 2, 3 but only increasing, like 1, 3, 4, 6. Because of that, the query I wrote doesn't always work.
MATCH (a:Station)-[:stop]->(c:Stops_times)-[:trip]->(z:Trips)<-[:trip]-(d:Stops_times)<-[:stop]-(b:Station) WHERE toint(c.stop_sequence)=toint(d.stop_sequence)+1 CREATE (a)-[s:next]->(b)
To find the right "next" I need a query similar to this:
MATCH (a:Station)-[:stop]->(c:Stops_times)-[:trip]->(z:Trips)<-[:trip]-(d:Stops_times)<-[:stop]-(b:Station) WITH c as c, d as d, MIN(d.stop_sequence) as min_ WHERE min_>c.stop_sequence CREATE UNIQUE (a)-[s:next]->(b)
Therefore, for every stop, I have to find the minimum "stop_sequence" among those higher than the "stop_sequence" of the stop whose next stop I want to find.
The following query seems to do what you want. It orders all the stops by stop_sequence, aggregates all the stops (still in order) for each trip, pairs up all adjoining stops for each trip, UNWINDs the pairs so that MERGE can use the paired nodes, and then uses MERGE to ensure that the :next relationship exists between all node pairs.
MATCH (a:Station)-[:stop]->(c:Stops_times)-[:trip]->(t:Trips)
WITH a, c, t
ORDER BY c.stop_sequence
WITH t, COLLECT(a) AS s
WITH REDUCE(x =[], i IN RANGE(1, SIZE(s)-1)| x + {a: s[i-1], b: s[i]}) AS pairs
UNWIND pairs AS p
WITH p.a AS a, p.b AS b
MERGE (a)-[n:next]->(b);
It works properly in 2.3.2 on my Mac (but the neo4j versions available at http://console.neo4j.org/ do not work correctly when the query gets to MERGE).
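The order-collect-pair steps described above can also be sketched outside Neo4j. The following plain Python version is only an illustration: the (trip, stop_sequence, station) tuples and the helper next_station_pairs are made up, and the sequence is converted to an integer before sorting, which the Cypher query leaves to how stop_sequence is stored.

```python
from collections import defaultdict


def next_station_pairs(stop_times):
    """For each trip, sort the stops by stop_sequence and pair every stop
    with the one that follows it, even when the sequence has gaps."""
    by_trip = defaultdict(list)
    for trip, sequence, station in stop_times:
        by_trip[trip].append((int(sequence), station))

    pairs = set()  # a set plays the role of MERGE: each pair is kept once
    for stops in by_trip.values():
        stops.sort()
        for (_, current), (_, following) in zip(stops, stops[1:]):
            pairs.add((current, following))
    return pairs


# Hypothetical stop times with non-consecutive sequences, given out of order.
stop_times = [
    ("T1", "4", "C"),
    ("T1", "1", "A"),
    ("T1", "6", "D"),
    ("T1", "3", "B"),
    ("T2", "5", "D"),
    ("T2", "2", "B"),
]

pairs = next_station_pairs(stop_times)
print(sorted(pairs))
```

Pairing each stop with its successor in the sorted list avoids the "+1" comparison from the question, so gaps such as 1, 3, 4, 6 no longer matter.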
I have in my graph places and persons as labels, and a relationship "knows_the_place". Like:
(person)-[knows_the_place]->(place)
A person usually knows multiple places.
Now I want to find the persons with a "strong" relationship via the places (which have a lot of "places" in common), so for example I want to query all persons, that share at least 3 different places, something like this (not working!) query:
MATCH
(a:person)-[:knows_the_place]->(x:place)<-[:knows_the_place]-(b:person),
(a:person)-[:knows_the_place]->(y:place)<-[:knows_the_place]-(b:person),
(a:person)-[:knows_the_place]->(z:place)<-[:knows_the_place]-(b:person)
WHERE NOT x=y and y=z
RETURN a, b
How can I do this with neo4j Query?
Bonus-Question:
Instead of showing me the persons who have x places in common with another person, even better would be if I could get an ordered list like:
a shares 7 places with b
c shares 5 places with b
d shares 2 places with e
f shares 1 places with a
...
Thanks for your help!
Here you go:
MATCH (a:person)-[:knows_the_place]->(x:place)<-[:knows_the_place]-(b:person)
WITH a, b, count(x) AS count
WHERE count >= 3
RETURN a, b, count
To order:
MATCH (a:person)-[:knows_the_place]->(x:place)<-[:knows_the_place]-(b:person)
RETURN a, b, count(x) AS count
ORDER BY count(x) DESC
You can also do both by adding an ORDER BY to the end of the first query.
Keep in mind that this query is a cartesian product of a and b, so it will examine every combination of person nodes, which may not be great performance-wise if you have a lot of person nodes. Neo4j 2.3 should warn you about these sorts of queries.
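As a cross-check of the counting logic, here is a small plain Python sketch of "how many places does each pair of persons share". The sample (person, place) pairs and the helper shared_place_counts are hypothetical. Unlike the Cypher query, which returns both (a, b) and (b, a), this version counts each unordered pair once.

```python
from collections import Counter, defaultdict
from itertools import combinations


def shared_place_counts(knows_the_place):
    """Count, for every unordered pair of persons, how many places both know."""
    people_by_place = defaultdict(set)
    for person, place in knows_the_place:
        people_by_place[place].add(person)

    counts = Counter()
    for people in people_by_place.values():
        # Every pair of persons who know this place shares one more place.
        for pair in combinations(sorted(people), 2):
            counts[pair] += 1
    return counts


# Hypothetical knows_the_place relationships.
knows_the_place = [
    ("a", "p1"), ("a", "p2"), ("a", "p3"), ("a", "p4"),
    ("b", "p1"), ("b", "p2"), ("b", "p3"),
    ("c", "p1"), ("c", "p5"),
]

counts = shared_place_counts(knows_the_place)
ranked = counts.most_common()  # ordered list, most shared places first
strong = [(pair, n) for pair, n in ranked if n >= 3]
for (first, second), n in ranked:
    print(f"{first} shares {n} places with {second}")
```

Grouping by place first means only persons who actually share a place are ever paired, rather than comparing every person with every other person.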