I am new to graphs. I have two sets in a bipartite graph and I need to find a matching across all the possible combinations, so I thought I would use Hopcroft-Karp to find a maximum matching. Being a newbie, I expected to get the resulting matching back, but all it tells me is 42. Ahhh, that really helps. I don't need to know how many matched pairs there are; I need the matched pairs themselves.
Am I missing something? How do I get the resulting matching?
I didn't check the data structures generated by the Hopcroft-Karp match function, only the return value. The return value is the number of matched pairs. However, there was also a self.pair dictionary in the Python code; the pair dictionary contains the matching from "both" sides, which answers my question.
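For anyone landing here who uses networkx rather than a hand-rolled implementation (an assumption on my part), hopcroft_karp_matching returns the matching itself as a dictionary, equivalent to that pair dictionary:

# Minimal sketch, assuming the networkx implementation of Hopcroft-Karp;
# a hand-rolled implementation's self.pair dictionary holds the same information.
import networkx as nx
from networkx.algorithms import bipartite

G = nx.Graph()
left = {"a", "b", "c"}
G.add_edges_from([("a", 1), ("a", 2), ("b", 1), ("c", 3)])

# Returns the matching as a dict with entries for both sides,
# e.g. {'a': 2, 2: 'a', 'b': 1, 1: 'b', 'c': 3, 3: 'c'}
matching = bipartite.hopcroft_karp_matching(G, top_nodes=left)

pairs = {u: v for u, v in matching.items() if u in left}
print(pairs)       # the matched pairs themselves
print(len(pairs))  # the single number you were getting before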
I'm working on a scientific database that contains model statements such as:
"A possible cause of Fibromyalgia is Microglial hyperactivity, as supported by these 10 studies: [...] and contradicted by 1 study [...]."
I need to specify a source for statements in Neo4j and be able to run two-way operations, like:
Find all statements supported by a study
Find all studies supporting a statement
The most immediate idea I had is to use the DOIs of the studies as unique identifiers in a relationship property. The big con of this idea is that I have to scan all the relationships to find the list of all statements supported by a given study.
So, since it is impossible to make a link between a study and a relationship, I had the idea of making two links, one at each end of the relationship. The obvious con is that this carries no information about the nature of the relationship, like "support" or "contradict".
So, I came to the conclusion that I need a node for the hypothesis. However, it overloads the graph, and we are no longer in the classical (node)-[relationship]->(node) design that makes property graphs so easy to understand.
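For concreteness, here is a minimal sketch of that reified "statement node" design using the official Neo4j Python driver; the labels, relationship types, identifiers and connection details are my assumptions, not an established schema:

# Sketch of the "statement as a node" design with the official Neo4j Python driver.
# Labels (Statement, Study, Disease, Cause) and relationship types
# (SUPPORTS, ABOUT, PROPOSES) are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Reify the hypothesis as its own node so studies can point at it directly.
    session.run(
        """
        MERGE (d:Disease {name: $disease})
        MERGE (c:Cause {name: $cause})
        MERGE (st:Statement {id: $stmt_id})
        MERGE (st)-[:ABOUT]->(d)
        MERGE (st)-[:PROPOSES]->(c)
        MERGE (s:Study {doi: $doi})
        MERGE (s)-[:SUPPORTS]->(st)
        """,
        disease="Fibromyalgia", cause="Microglial hyperactivity",
        stmt_id="stmt-1", doi="10.1000/example-doi",
    )

    # Find all statements supported by a study.
    for record in session.run(
        "MATCH (:Study {doi: $doi})-[:SUPPORTS]->(st:Statement) RETURN st.id AS id",
        doi="10.1000/example-doi",
    ):
        print(record["id"])

    # Find all studies supporting a statement.
    for record in session.run(
        "MATCH (s:Study)-[:SUPPORTS]->(:Statement {id: $id}) RETURN s.doi AS doi",
        id="stmt-1",
    ):
        print(record["doi"])

driver.close()

A :CONTRADICTS relationship type would cover the contradicting studies in the same way.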
Using RDF, it is possible to add properties to relationships using subgraphs; however, that takes us into semantic graphs and quad stores, which are more complex tools.
So I'm wondering whether there is a "correct" design pattern in Neo4j to support this kind of need that I may not have thought of?
Thanks
Based on your requirements, I think putting support_study as a property of the edge will do the job. We could then run the queries as follows:
Find all statements supported by a study
MATCH ()-[e:has_cause{support_study: "doi_foo_bar"}]->()
RETURN e;
Find all studies supporting a statement
Given that the statement is "foo is caused by bar":
MATCH (v:disease {name: "foo"})-[e:has_cause]->(v1:symptom {name: "bar"})
RETURN DISTINCT e.support_study;
Note that this is mostly based on NebulaGraph, where:
It speaks Cypher DQL (together with nGQL)
It supports properties on edges
It uses a 4-tuple (src, dst, edge_type, rank) rather than a single key to distinguish an edge, where rank is a unique design that enables multiple has_cause edge instances between one disease -> symptom pair; you could put a hash of the DOI or another number in the rank field (or omit it, of course, in which case it will be 0)
It's distributed and open source (Apache 2.0)
Note:
In NebulaGraph, indexes should be created on has_cause(support_study) and disease(name); see https://www.siwei.io/en/nebula-index-explained/ and https://docs.nebula-graph.io/3.2.0/3.ngql-guide/14.native-index-statements/
But I think the advice applies to Neo4j, too :)
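If you stay on Neo4j with this edge-property approach, the indexing step could look something like the following sketch; it assumes a reasonably recent Neo4j (4.3+ for relationship property indexes), the official Python driver, and made-up index names:

# Sketch: Neo4j counterpart of the indexing advice above, assuming Neo4j 4.3+
# (relationship property indexes) and the official Python driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Index the edge property used to look up supporting studies.
    session.run(
        "CREATE INDEX has_cause_support IF NOT EXISTS "
        "FOR ()-[e:has_cause]-() ON (e.support_study)"
    )
    # Index the node property used to anchor the statement lookup.
    session.run(
        "CREATE INDEX disease_name IF NOT EXISTS "
        "FOR (d:disease) ON (d.name)"
    )

driver.close()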
We have a long list of organizations that we are trying to match to another long list of organizations. Normally, I would just use an index/match formula to match the 2 columns together. However, they both utilize organization names and the spelling can vary mildly.
For example, one column might refer to Organization A as "Tanner Health" and another might refer to it as "Tanner Healthcare". Another example, one column might refer to Organization B as "The Tanner Health Institute" and another might refer to it as "Tanner Health Institute". There are countless examples, but the idea is that we want to try and get the closest possible match between the 2 columns.
It is totally possible that an organization in column 1 does not exist in column 2, in which case we don't want it to return anything.
If it helps at all, we also have each organization's state and zip code. So, I considered an array formula that matches zip code and state and then does the best fuzzy match among the available options.
I am struggling to find a formula that can do a fuzzy match of the 2 columns, and am looking for suggestions on the best way to approach this problem.
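Not an answer within a single spreadsheet formula, but if you can export both lists (say to CSV), the state/zip pre-filter plus fuzzy name match could be sketched like this; the column names and the 0.85 cutoff are assumptions to tune:

# Sketch of the idea: filter candidates by state and zip first, then pick the
# closest name by string similarity. Column names and the cutoff are assumptions.
import csv
from difflib import SequenceMatcher

def best_match(org, candidates, cutoff=0.85):
    """Return the candidate whose name is most similar to org's, or None."""
    scored = []
    for cand in candidates:
        if cand["state"] != org["state"] or cand["zip"] != org["zip"]:
            continue  # only compare organizations in the same state and zip
        score = SequenceMatcher(None, org["name"].lower(), cand["name"].lower()).ratio()
        scored.append((score, cand))
    if not scored:
        return None
    score, cand = max(scored, key=lambda pair: pair[0])
    return cand if score >= cutoff else None  # no match below the cutoff

with open("column1.csv", newline="") as f1, open("column2.csv", newline="") as f2:
    list1 = list(csv.DictReader(f1))  # expects name, state, zip headers
    list2 = list(csv.DictReader(f2))

for org in list1:
    match = best_match(org, list2)
    print(org["name"], "->", match["name"] if match else "")

With "Tanner Health" vs "Tanner Healthcare" the similarity ratio comes out around 0.87, so a cutoff in that neighbourhood keeps your examples while rejecting unrelated names.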
I'm implementing a reduce-side join to find matches between databases A and B. Both dataset files contain one JSON object per line. The join key is the name attribute of each record, so the mapper extracts the name from the JSON and emits it as the key, with the JSON itself as the value. The reducer must merge the JSON objects for the same or a similar person name.
The problem is that I need to group keys using a string similarity matching algorithm, e.g., John White must be considered equal to John White Lennon.
I've tried to do that using a grouping comparator but it is not working as expected.
How can this be implemented?
Thanks in advance!
What you are asking for here could be described as a set similarity join, where the sets are, e.g., the sets of tokens or n-grams of each line. Here is a research paper that describes how you can achieve that in MapReduce. I hope you find it useful.
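As a rough illustration of what "set similarity" means here (outside of Hadoop), treating each name as a set of tokens and grouping on Jaccard similarity might look like this; the 0.5 threshold is an assumption you would have to tune:

# Rough, non-MapReduce illustration of a set similarity join on name tokens.
# The Jaccard threshold of 0.5 is an assumption to tune for your data.

def tokens(name):
    return set(name.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b)

names_a = ["John White", "Paul Smith"]
names_b = ["John White Lennon", "Paula Smith"]

for na in names_a:
    for nb in names_b:
        sim = jaccard(tokens(na), tokens(nb))
        if sim >= 0.5:
            # In the reduce-side join these two records would be merged.
            print(f"match: {na!r} ~ {nb!r} (jaccard={sim:.2f})")

MapReduce formulations of this typically avoid the all-pairs loop by emitting tokens (or token prefixes) as intermediate keys, so that only candidate pairs meet in the same reducer.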
START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
My graph consists of nearly 180,000 name nodes. I have iterated the above process, creating unique relationships, more than 100 times by changing the target, and it is taking far too much time. How can I resolve this?
I build the query with Java and iterate it. I am using Neo4j 2.0.0.5 and Java 1.7.
I edited your Cypher query because I think I understand it, but the rest of the question is hard to follow, so for now here are some thoughts about why your query is slow.
You bind all the nodes in the graph; that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
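For illustration only, labelling the nodes and creating a schema index might look like the sketch below. It uses current Cypher syntax via the official Python driver (on 2.0 you would run the equivalent, e.g. CREATE INDEX ON :Target(target_name), in the shell), and it assumes that target nodes are the ones carrying a target_name property, which is my guess rather than something stated in the question:

# Illustration: add labels and a schema index so the target lookup does not
# scan the whole graph. Assumes modern Neo4j + the official Python driver;
# the label names and the target_name assumption are mine.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    session.run("MATCH (n) WHERE n.target_name IS NOT NULL SET n:Target")
    session.run("MATCH (n) WHERE n.name IS NOT NULL SET n:Name")
    session.run(
        "CREATE INDEX target_name_idx IF NOT EXISTS "
        "FOR (t:Target) ON (t.target_name)"
    )

driver.close()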
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH (target:Target)
WHERE target.target_name="TARGET_1"
WITH target
MATCH (names:Name)
WHERE NOT (names)-[:contains]->()
  AND HAS(names.age)
  AND (names.qualification =~ ".*(?i)B.TECH.*$"
    OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains {type:"declared"}]->(target)
RETURN names.name, names, names.qualification
I have no idea whether such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH (names:Name) will still bind every node in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again, I no longer think that I understand even your Cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explaining in more detail what your model looks like and what you are trying to do. Let me know whether my answer addresses your question at all, or I'll consider removing it.
I have the following nodes in my graph:
Car
Trash
CarToTrash
  [:has_input]-(Car)
  [:has_output]-(Trash)
RecycleTrash
  [:has_input]-(Trash)
  [:has_output]-(Car)
I'm trying to find a query which will give me all shortest paths between the two types, i.e.
(Car)-[has_input]-(CarToTrash)-[has_output]-(Trash)-[has_input]-(RecycleTrash)-[has_output]-(Car)
The length of the path can vary, though. It can have more XToY-style nodes, each with a has_input and a has_output relation. I'd like to find the shortest path between any two types I might add to the graph. CarToTrash and RecycleTrash represent functions, and the has_input and has_output relations are the input type and return type of the function. Basically what I have is a graph of types and functions, and I'd like to check whether there is a path of functions between any two arbitrary types in the graph.
I've tried the following query, which works somewhat, but it would also find paths that do not follow the has_input, has_output pattern if such paths existed. I also tried finding the way from Car back to Car, which I was unable to do; I could only find Car to Trash. I might manage without that, though, if it's not possible to query this kind of loop.
MATCH (car), (trash)
WHERE car.uid='Car' AND trash.uid='trash'
MATCH p = allShortestPaths((car)-[*..15]-(trash))
RETURN p;
Since you want structure in your shortest-path segments, I believe this algorithm is not usable for you.
You should probably devise your own traversal algorithm and use the Java API to do it, based on http://docs.neo4j.org/chunked/stable/tutorial-traversal-java-api.html, which gives you even more flexibility than current Cypher here.
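Outside of Neo4j, the alternating has_input/has_output idea is also easy to prototype; here's a sketch as a plain BFS in Python, where the two example functions come from the question but the adjacency encoding is my own assumption:

# Sketch of the alternating has_input/has_output traversal as a plain BFS.
# The function nodes come from the question; the encoding is an assumption.
from collections import deque

# function node -> (input type, output type)
functions = {
    "CarToTrash": ("Car", "Trash"),
    "RecycleTrash": ("Trash", "Car"),
}

def shortest_function_path(start_type, goal_type):
    """Return the shortest chain of functions leading from start_type to goal_type."""
    queue = deque([(start_type, [])])
    used = set()
    while queue:
        current_type, path = queue.popleft()
        for fn, (inp, out) in functions.items():
            if inp != current_type or fn in used:
                continue
            if out == goal_type:
                return path + [fn]
            used.add(fn)
            queue.append((out, path + [fn]))
    return None

print(shortest_function_path("Car", "Trash"))  # ['CarToTrash']
print(shortest_function_path("Car", "Car"))    # ['CarToTrash', 'RecycleTrash']

This is the same shape of traversal (expand only via has_input, then has_output) that you would express with an expander in the Java traversal framework.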