Expressing, querying and enforcing constraints on graph contents - neo4j

Say you would like to create a graph authoring system that puts constraints on the contents of the graph. Say you have a "contains" relationship where "city" may contain "houses", which in turn contain "bedrooms" and "bathrooms". But it is not legal for a city to contain bedrooms or bathrooms, or for bathrooms to contain bedrooms.
Further, say you want to offer suggestions to graph author - if they select a "city" node, you might want to give them suggestions for what can be added to the city "houses", "hospitals" and "schools", but not "bedrooms".
I am guessing that these constraints, in and of themselves, could be represented as a graph. Has anyone had any luck doing that? What was your experience?

There are a number of ways you could express these rules, for example:
You could use your application layer to check whether Node Label to Relationship Type rules are being respected
You could perhaps use triggers to check/act on any specific rules
You may choose to create user defined rules engine
There will be other approaches too.

Related

Node vs Relationship

So I've just worked through the tutorial and I'm unclear about a few things. The main one, however, is how do you decide when something is a relationship and when it should be a Node?
For example, in the Movies Database,there is a relationship showing who acted in which film. A property of that relationship is the Role. BUT, what if it's a series of films? The role may well be constant between films (say, Jack Ryan in The Hunt for Red October, Patriot Games etc.)
We may also want to have some kind of Character bio, which would obviously remain constant between movies. Worse, the actor may change from one movie to another (Alec Baldwin then Harrison Ford.) There are many others like this (James Bond, for example).
Even if the actor doesn't change (Main roles in Harry Potter) the character is constant. So, at what point would the Role become a node in its own right? When it does, can I have a 3-way relationship (Actor-Role-Movie)? Say I start of with it being a relationship and then, down the line, decide it should've been a node, is there a simple way to go through the database and convert it?
No there is no way to convert your datamodel. When you start your own Database first take time to find a fitting schema. There is no ideal way to create a schema and also there are many different models fitting to the same situation without being totally wrong.
My strategy is to put less information to the relationship itself. I only add properties that directly concern the relationship and store all the other data in the nodes. Also think of properties you could use for traversing the graph. For example you might need some flags or even different labels for relationships even they more or less are the same. The apoc.algo.aStar is only including relationshiptypes you want (you could exclude certain nodes by giving them a special relationshiptype). So keep that in mind that you take a look at procedures that you might use later.
Try to create the schema as simple as possible and find a way to stay consistent in terms of what things are nodes and what deserves a relationship. Dont mix it up.
Choose a design that makes sense for you! (device 1)-[cable]-(device 2) vs (device 1)-[has cable]-(cable)-[has cable]-(device 2) in this case I'd prefer the first because [has cable] wouldn't bring anymore information. Irrespective to what I wrote above I would have a lot of information in this [cable] relationship but it totally makes sense for me because I wouldnt want to search in a device node for cable information.
For your example giving the role a own node is also valid way. For example if you want to espacially query which actors had the same role in common I'll totally go for giving the role a extra node.
Summary:
Think of what you want to do with the data and choose the easiest model.

can predefined keywords exist beside free keywords in DSpace submit?

Default a submitter (uploader) of a document can add self chosen keywords to that document.
It is also possible to configure DSpace in a way that the submitter has to choose from one or more predefined keywords (controlled vocabulary).
The DSpace manual seems to suggest that you - when configuring - have to choose between free and predefined keywords.
I would like to give the submitter the possibility to choose between one or more predefined keywords. But also that he or she can add one or more self chosen keywords.
Is that possible?
The hierarchical taxonomy feature gives you exactly this:
https://wiki.duraspace.org/display/DSDOC5x/Authority+Control+of+Metadata+Values#AuthorityControlofMetadataValues-HierarchicalTaxonomiesandControlledVocabularies
You can see it in the demo installation on the "subject" field: you have a lookup feature that allows lookup in a tree of subjects, but manually entered values are possible as well.
screencast:
http://screencast.com/t/0Cth3mORwxd
I personally would set this up to use two different metadata fields.
Something like dc.subject.whateverdescribesyourlistoffixedterms -- or even localschema.subject.whateverdescribesyourlistoffixedterms -- for the list of terms the user should select from. Note, for "whateverdescribesyourlistoffixedterms" I would choose something related to the name of the list of terms if at all possible (see example below).
dc.subject for "standard" user-supplied keywords
Then just add both to your input forms, perhaps going with Bram's suggestion of a hierarchical taxonomy for the first.
To give you better advice on what's most appropriate, it would be great if you could give some more details about what you're trying to achieve. For example
Is your list of fixed keywords something that's used beyond your own organisation? If yes, this strongly points to having its own metadata field to me, with the qualifier something that's related to the name of the classification system -- eg, dc.subject.anzsrc for the Australia/New Zealand fields of research codes.
Do you want to mix the two types of keywords in browse/facet options? You can do this even when they're in two separate fields. Have a look at the Discovery search filters & sidebar facets documentation and see how that puts dc.contributor.author and dc.creator into the author facet. The documentation for browse indexes has a similar example in the author browse.
Are both types of subject keywords required for submission? Both optional? One type required, the other type optional? You say in a comment (if I read you correctly) that you want the fixed keywords to be mandatory during submission, while the free-text keywords should be optional. That means they must be in separate metadata fields because otherwise you wouldn't know, if the submitter gives keywords, whether they are from the fixed list of terms or not. If you use separate fields, you can make eg dc.subject.anzsrc a required field in the submission form and dc.subject an optional one.

What's the optimal structure for a multi-domain sentence/word graph in Neo4j?

I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding the most optimal way to implement the graph such that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper:
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
It doesn't seem like having sentence and positional information as an array of elements contained within a property of each node is efficient, especially when dealing with multiple word-domains and sub-domains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the most optimal way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain-entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains / positions on the relationships (and perhaps also add a source-id).
OTOH you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.

In Neo4j, faster structure for clustering

Suppose I have a bunch of User node, which has a property named gender, which can be male or female. Now in order to cluster user based on gender, I have two choice of structure:
1) Add an index to the gender property, and use a WHERE to select users under a gender.
2) Create a Male node and a Female node, and edges linking them to relevant users. Then every time when querying upon gender, I use pattern ,say, (:Male)-[]->(:User).
My question is, which one is better?
Indices should never be a replacement for putting things in the graph.
Indexing is great for looking up unique values and, in some cases, groups of values; however, with the caching that Neo4j can do (and the extensibility of modeling your domain).
Only indexing a property with two (give or take) properties is not the best use of an index and likely won't net too much of a performance boost given the number of results per property value.
That said, going with option #2 can create supernodes, a bottle-necking issue which can become a major headache depending on your model.
Maybe consider using labels (:Male and :Female, for example) as they are essentially "schema indices". Also keep in mind you can use multiple labels per node, so you could have (user:User:Male), etc. It also helps to avoid supernodes while not creating a classic or "legacy" index.
HTH

Neo4j node property type

I'm playing around with neo4j, and I was wondering, is it common to have a type property on nodes that specify what type of Node it is? I've tried searching for this practice, and I've seen some people use name for a purpose like this, but I was wondering if it was considered a good practice or if indexes would be the more practical method?
An example would be a "User" node, which would have type: user, this way if the index was bad, I would be able to do an all-node scan and look for types of user.
Labels have been added to neo4j 2.0. They fix this problem.
You can create nodes with labels:
CREATE (me:American {name: "Emil"}) RETURN me;
You can match on labels:
MATCH (n:American)
WHERE n.name = 'Emil'
RETURN n
You can set any number of labels on a node:
MATCH (n)
WHERE n.name='Emil'
SET n :Swedish:Bossman
RETURN n
You can delete any number of labels on a node:
MATCH (n { name: 'Emil' })
REMOVE n:Swedish
Etc...
True, it does depend on your use case.
If you add a type property and then wish to find all users, then you're in potential trouble as you've got to examine that property on every node to get to the users. In that case, the index would probably do better- but not in cases where you need to query for all users with conditions and relations not available in the index (unless of course, your index is the source of the "start").
If you have graphs like mine, where a relation type implies two different node types like A-(knows)-(B) and A or B can be a User or a Customer, then it doesn't work.
So your use case is really important- it's easy to model graphs generically, but important to "tune" it as per your usage pattern.
IMHO you shouldn't have to put a type property on the node. Instead, a common way to reference all nodes of a specific "type" is to connect all user nodes to a node called "Users" maybe. That way starting at the Users node, you can very easily find all user nodes. The "Users" node itself can be indexed so you can find it easily, or it can be connected to the reference node.
I think it's really up to you. Some people like indexed type attributes, but I find that they're mostly useful when you have other indexed attributes to narrow down the number of index hits (search for all users over age 21, for example).
That said, as #Luanne points out, most of us try to solve the problem in-graph first. Another way to do that (and the more natural way, in my opinion) is to use the relationship type to infer a practical node type, i.e. "A - (knows) -> B", so A must be a user or some other thing that can "know", and B must be another user, a topic, or some other object that can "be known".
For client APIs, modeling the element type as a property makes it easy to instantiate the right domain object in your client-side code so I always include a type property on each node/vertex.
The "type" var name is commonly used for this, but in some languages like Python, "type" is a reserved word so I use "element_type" in Bulbs ( http://bulbflow.com/quickstart/#models ).
This is not needed for edges/relationships because they already contain a type (the label) -- note that Neo4j also uses the keyword "type" instead of label for relationships.
I'd say it's common practice. As an example, this is exactly how Spring Data Neo4j knows of which entity type a certain node is. Each node has "type" property that contains the qualified class name of the entity. These properties are automatically indexed in the "types" index, thus nodes can be looked up really fast. You could implement your use case exactly like this.
Labels have recently been added to Neo4j 2.0 ( http://docs.neo4j.org/chunked/milestone/graphdb-neo4j-labels.html ). They are still under development at the moment, but they address this exact problem.

Resources