Essentially, I'm storing a directed graph of entities in CouchDB, and I need to be able to find edges going IN and OUT of a given entity.
SETUP:
The way the data is being stored right now is as follows. Each document represents a RELATION between two entities:
doc: {
entity1: { name: '' ... },
entity2: { name: '' ... }
...
}
I have a view which does a bunch of emits, two of which emit documents keyed on their entity1 component and on their entity2 component, so something like:
function(doc) {
  emit(['entity1', doc.entity1.name]);
  emit(['entity2', doc.entity2.name]);
}
Edges are directed and go from entity1 to entity2. So if I want to find edges going out of an entity, I just query the first emit; if I want edges going into an entity, I query the second emit.
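For example, to fetch the edges of a hypothetical entity named "A" (using that same key layout; the design doc and view names are whatever you have set up), the queries would look like:
outgoing edges of "A": key=["entity1", "A"]
incoming edges of "A": key=["entity2", "A"]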
PROBLEM:
The problem lies in the fact that I also need to capture edges both going INTO and OUT OF entities. Is there a way I can group or reduce these two emits into a single bi-directional set of UNIQUE pairs?
Is there a better way of organizing my view to promote this action?
It might be preferable to just create a second view. But there's nothing stopping you from cramming all sorts of different data into the same view like so:
function(doc) {
  if (doc.entity1.name == doc.entity2.name) {
    emit(['self-ref', doc.entity1.name], 1);
  }
  emit(['both', [doc.entity1.name, doc.entity2.name]], 1);
  emit(['either', [doc.entity1.name, "out"]], 1);
  emit(['either', [doc.entity2.name, "in"]], 1);
  emit(['out', doc.entity1.name], 1);
  emit(['in', doc.entity2.name], 1);
}
Then you could easily do the following:
find all the self-ref's:
startkey=["self-ref"]&endkey=["self-ref", {}].
find all of the edges (incoming or outgoing) for a particular node:
startkey=["either", [nodeName]]&endkey=["either", [nodeName, {}]]
if you don't reduce this, then you'll still be preserving "in" vs "out" in the key. If you never need to query for all nodes with incoming or outgoing edges, then you can replace the last two emits with the "either" emits.
find all of the edges from node1 -> node2:
key=["both", [node1, node2]
as well as your original queries for incoming or outgoing for a particular node.
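Since every emit above has the value 1, you could also attach the built-in _count reduce (an assumption; the view as shown has no reduce) and collapse the "both" rows into unique pairs with edge counts:
startkey=["both"]&endkey=["both", {}]&group_level=2
With group_level=2, each distinct ["both", [node1, node2]] key comes back exactly once, with the number of parallel edges between that pair as its value.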
I'd recommend benchmarking your application's typical use cases before choosing between this combined view approach or a multi-view approach.
I am loading simple CSV data into Neo4j. The data is simple, as follows:
uniqueId      compound     value  category
ACT12_M_609   mesulfen     21     carbon
ACT12_M_609   MNAF         23     carbon
ACT12_M_609   nifluridide  20     suphate
ACT12_M_609   sulfur       23     carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE (t:Transaction { transactionId: row.uniqueId })
MERGE (c:Compound { name: row.compound })
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category = row.category
ON CREATE SET r.price = row.value
Next I do an aggregation to count the total orders for each compound and store it as a property on the node -
MATCH (c:Compound)<-[:CONTAINS]-(t:Transaction)
// carry c itself through the WITH; projecting only c.name would drop c from scope
WITH c, count(DISTINCT t.transactionId) AS ord
SET c.orders = ord
So far so good. I can accomplish what I want, but I have the following 2 questions -
1. How can I create the orders property for the Compound node in the first step itself? i.e., when I am loading the data I would like to perform the aggregation straight away.
2. For a Compound node I am also setting the category property. Theoretically, it could also be modelled as category -contains-> compound by creating a Category node. But what advantage would I gain by doing so? I can already execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible: LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those and then use those to create the real nodes, but that would be way more complicated. Virtual Nodes/Rels
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run queries where the category is a criterion (e.g. MATCH (c:Category {category_id: 12})-[r]-(:Compound)), it might be more performant to model the category as its own node.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
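For illustration, a minimal sketch of what loading the category-as-node model might look like, assuming the same CSV columns and reusing the CONTAINS naming from the question:
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE (c:Compound { name: row.compound })
MERGE (cat:Category { name: row.category })
MERGE (cat)-[:CONTAINS]->(c)
Queries filtering on category can then start from the Category node instead of scanning Compound properties.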
I am currently investigating how to model a bitemporal graph in Neo4j. Unfortunately, no one seems to have publicly undertaken this before.
One particular thing I am looking at is whether I can store in a new node only those values that have changed and then express a query that would merge all those values ordered by a given timestamp:
This creates the data I am playing with:
CREATE (:P1 {id: '1'})<-[:EXPANDS {date:5200, recorded:5100}]-(:P1Data {name:'Joe', wage: 3000})
// New data, recorded 2014-10-1 for 2015-1-1
MATCH (p:P1 {id: '1'}) CREATE (:P1Data { wage:3100 })-[:EXPANDS { date:5479, recorded: 5387}]->(p)
So far, I can get a history for a given point in time, e.g. like this:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WHERE x.recorded < 6000
WITH {date: x.date, data:d} as data
RETURN data
ORDER BY data.date DESC
What I would like to achieve is to merge the name and wage values such that I get a whole view of the data at a given point in time. The answer may also be that this is not really possible.
(PS: I say "only in query" because I found a refactor procedure in APOC which does merge nodes, but that procedure actually merges and persists the nodes, while I just want to merge them at query time.)
As with most things, you can do it using REDUCE like so:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
WITH REDUCE(s = {}, y IN datas|
{name: COALESCE(y.name, s.name),
wage: COALESCE(y.wage, s.wage)})
AS most_recent_fields
RETURN most_recent_fields.name AS name, most_recent_fields.wage AS wage
You can do it in descending order instead (swap s and y inside the COALESCE statements if so), but there isn't really a way to shortcut processing the entire set of results from your queried time back to the start.
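Note that the recorded-time cutoff from the original question drops out of the query above; to get a snapshot "as of" a recorded time, the WHERE clause can simply be reinstated (the 6000 cutoff is taken from the question):
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WHERE x.recorded < 6000
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
WITH REDUCE(s = {}, y IN datas|
{name: COALESCE(y.name, s.name),
wage: COALESCE(y.wage, s.wage)})
AS most_recent_fields
RETURN most_recent_fields.name AS name, most_recent_fields.wage AS wage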
UPDATE: This will, of course, generate a Map and not a Node, but if you only want the properties and don't want to create a permanent record, a Map is actually better suited to your needs.
EXTENDED: If you don't want to specify which keys to use, you can do it without REDUCE like this instead:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
CREATE (t:Temp)
FOREACH(data IN datas|
SET t += data)
DELETE t
RETURN t
This approach does create a node, but since you DELETE it right before you RETURN it, it never persists. += ensures that pre-existing properties aren't removed, only overwritten when the data node has a value for them.
I apologize in advance for my bad English; I'm Italian.
I'm using Neo4j for my thesis, but I still have several doubts about how to use it.
1) I created two nodes from the web interface, and I noticed that Neo4j gave them the internal IDs 0 and 1. Now suppose I made a mistake and have to delete the node with ID 1. Once it is deleted, if I create a new node the system gives it ID 2.
So now I have the first node with ID 0 and the second node with ID 2, but I want the second node to still have ID 1 (basically, I want to reuse the ID of the deleted one). How do I do that?
2) The same problem applies to the relationship between two nodes: if I create it by mistake, delete it and create another one, I lose the ID of the deleted one.
3) How do I create a relationship between two nodes with a double arrow?
I saw that every arrow must have a label, so if I create a relationship from 1 to 2 and a relationship from 2 to 1, I get the double arrow, but with two labels, and that does not suit me. Thank you for your help
sorry for my very bad English
You should really try to use your own IDs or unique identifiers for your nodes; then you can disregard the internal node IDs altogether.
If you begin with this Cypher statement in a new database (you only have to set it once),
CREATE CONSTRAINT ON (node:MyNodeLabel) ASSERT node.myid IS UNIQUE
then you can create nodes and relationship like this,
CREATE (a:MyNodeLabel { myid : 0 })
CREATE (b:MyNodeLabel { myid : 1 })
CREATE (a)-[r:RELTYPE]->(b)
or if you do not write the create statements in the same transaction,
CREATE (:MyNodeLabel { myid : 2 })
CREATE (:MyNodeLabel { myid : 3 })
then later,
MATCH (a:MyNodeLabel { myid : 2 }), (b:MyNodeLabel { myid : 3 })
CREATE (a)-[r:RELTYPE]->(b)
or create two nodes and a relationship at the same time
MERGE (:MyNodeLabel { myid : 4 })-[r:RELTYPE]->(:MyNodeLabel { myid : 5 })
You can of course change MyNodeLabel and myid to any identifier you like.
Is the problem you have with the relationship labels purely visual, or do I misunderstand you?
You know that you can traverse relationships in either direction, so maybe you do not need two relationships?
Here is the documentation for Cypher if you have missed it, http://docs.neo4j.org/chunked/stable/.
OK, sorry, I will get back to this when I return home today. Another question: when you create a relationship with a label on the arrow, how can I change that label at some point in the future without deleting the relationship? Is that possible?
Yes, you can do that easily: search for the node and remove the label from the node only; it will not impact the relationship. But if that was the only label on the node, I suggest assigning another one, so you can still group it properly:
MATCH (n:User { Id: 1 })
REMOVE n:User
SET n:DeletedUser
RETURN n
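Note that the snippet above renames a node label. If what you want to change is the type on the relationship arrow itself, Cypher has no in-place rename; a common workaround (sketched here with hypothetical OLD_TYPE/NEW_TYPE names) is to create a replacement relationship, copy the properties across, and delete the old one:
MATCH (a)-[old:OLD_TYPE]->(b)
CREATE (a)-[new:NEW_TYPE]->(b)
SET new = old
DELETE old
SET new = old copies every property from the old relationship onto the new one.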
I'm trying to figure out the best approach to display combined tables based on matching logic and input search criteria.
Here is the situation:
We have a table of customers stored locally. The fields of interest are ssn, first name, last name and date of birth.
We also have a web service which provides the same information. Some of the customers from the web service are the same as the local file, some different.
SSN is not required in either.
I need to combine this data to be viewed on a Grails display.
The criteria for combination are 1) match on SSN. 2) For any remaining records, exact match on first name, last name and date of birth.
There's no need at this point for soundex or approximate logic.
It looks like what I should do is extract all the records from both inputs into a single collection, somehow making it a set keyed on SSN, and then remove the entries with a blank SSN.
That will handle the SSN matching (once I figure out how to build such a set).
Then I need to go back to the original two input sources (cached in a collection to prevent a re-read) and remove any records that exist in the SSN set derived previously.
Then create another set, this time keyed on first name, last name and date of birth.
Then combine the two derived collections into a single collection, sorted for display purposes.
Does this make sense? I think the search criteria will limit the number of records pulled in, so I can do this in memory.
Essentially, I'm looking for some ideas on how the Grails code would look for achieving the above logic (assuming this is a good approach). The local customer table is a domain object, while what I'm getting from the WS is an array list of objects.
Also, I'm not entirely clear on how the maxresults, firstResult, and order used for the display would be affected. I think I need to read in all the records which match the search criteria first, do the combining, and display from the derived collection.
The traditional Java way of doing this would be to copy both the local and remote objects into TreeSet containers with a custom comparator, first for SSN, second for name/birthdate.
This might look something like:
def localCustomers = Customer.list()
def remoteCustomers = RemoteService.get()
TreeSet ssnFilter = new TreeSet(new ClosureComparator({c1, c2 -> c1.ssn <=> c2.ssn}))
ssnFilter.addAll(localCustomers)
ssnFilter.addAll(remoteCustomers)
TreeSet nameDobFilter = new TreeSet(new ClosureComparator({c1, c2 -> c1.firstName + c1.lastName + c1.dob <=> c2.firstName + c2.lastName + c2.dob}))
nameDobFilter.addAll(ssnFilter)
def filteredCustomers = nameDobFilter as List
At this point, filteredCustomers has all the records, except those that are duplicates by your two criteria.
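One caveat, since the question notes that SSN is not required: records with a null ssn all compare equal in the first comparator, so the TreeSet would silently collapse them into one. A hedged tweak (assuming the customer objects expose an ssn property) is to route only records that actually have an SSN through that filter:
def all = localCustomers + remoteCustomers
def withSsn = all.findAll { it.ssn }      // SSN dedup applies only to these
def withoutSsn = all.findAll { !it.ssn }  // these go straight to the name/dob filter
ssnFilter.addAll(withSsn)
nameDobFilter.addAll(withoutSsn)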
Another approach is to filter the lists by sorting and then doing a fold (inject, in Groovy), combining adjacent elements if they match. This way, you have an opportunity to combine the data from both sources.
For example:
def combineByNameAndDob(customers) {
    customers.sort { c1, c2 ->
        (c1.firstName + c1.lastName + c1.dob) <=>
        (c2.firstName + c2.lastName + c2.dob)
    }.inject([]) { cs, c ->
        // equalsByNameAndDob and combine are domain methods you would supply
        if (cs && c.equalsByNameAndDob(cs[-1])) {
            cs[-1].combine(c) // combine the attributes of both records
            cs
        } else {
            cs << c
        }
    }
}
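Calling it would then look something like this, assuming the same localCustomers and remoteCustomers collections as in the first snippet:
def merged = combineByNameAndDob(localCustomers + remoteCustomers)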
I have a Mnesia table for this record:
-record(peer, {
peer_key, %% key is the tuple {FileId, PeerId}
last_seen,
last_event,
uploaded = 0,
downloaded = 0,
left = 0,
ip_port,
key
}).
peer_key is the tuple {FileId, ClientId}. Now I need to extract the ip_port field from all peers that have a specific FileId.
I came up with a workable solution, but I'm not sure if this is a good approach:
qlc:q([IpPort || #peer{peer_key={FileId,_}, ip_port=IpPort} <- mnesia:table(peer), FileId=:=RequiredFileId])
Thanks.
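For context, a minimal sketch of how that query might run end to end; the function name is hypothetical, and qlc requires its header include:
-include_lib("stdlib/include/qlc.hrl").

ip_ports_for_file(RequiredFileId) ->
    F = fun() ->
            Q = qlc:q([IpPort || #peer{peer_key = {FileId, _}, ip_port = IpPort}
                                     <- mnesia:table(peer),
                                 FileId =:= RequiredFileId]),
            qlc:e(Q)
        end,
    {atomic, IpPorts} = mnesia:transaction(F),
    IpPorts.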
Using an ordered_set table type with a tuple primary key like {FileId, PeerId}, and then partially binding a prefix of the tuple like {RequiredFileId, _}, will be very efficient: only the range of keys with that prefix is examined, not the full table. You can use qlc:info/1 to inspect the query plan and make sure that any selects that occur are binding the key prefix.
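Equivalently, that prefix-bound lookup can be written as a select with a match specification (a sketch against the record above; run it inside mnesia:transaction/1 as usual):
%% Return ip_port for every peer whose key begins with RequiredFileId.
mnesia:select(peer, [{#peer{peer_key = {RequiredFileId, '_'},
                            ip_port = '$1',
                            _ = '_'},
                      [], ['$1']}]).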
Your query time will grow linearly with the table size, as it requires scanning through all rows. So benchmark it with realistic table data to see if it really is workable.
If you need to speed it up, you should focus on being able to quickly find all peers that carry the file id. This could be done with a bag-type table with [file_id, peer_id] as attributes. Given a file id you would get all peer ids, and with those you could construct the peer-table keys to look up.
Of course, you would also need to maintain that bag-type table inside every transaction that changes the peer table.
Another option would be to repeat the file id as its own attribute and add a Mnesia index on that column. I am just not that into Mnesia's own secondary indexes.
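A sketch of that indexed variant; the extra file_id field (and keeping it in sync) is an assumption layered on the original record, with the other fields omitted for brevity:
-record(peer, {
    peer_key,   %% primary key: {FileId, PeerId}
    file_id,    %% FileId repeated here so it can carry a secondary index
    ip_port
}).

%% One-time setup after table creation:
%% mnesia:add_table_index(peer, file_id).

ip_ports_by_index(RequiredFileId) ->
    F = fun() ->
            Peers = mnesia:index_read(peer, RequiredFileId, file_id),
            [P#peer.ip_port || P <- Peers]
        end,
    {atomic, IpPorts} = mnesia:transaction(F),
    IpPorts.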