Py2Neo Label Indexing

I have a dataset containing words and documents associated with those words. I would like to set labels on them to separate them into these two categories. I was able to create the labels by doing this:
if "Social Words" not in graph_db.node_labels:
    graph_db.schema.create_index("Social Words", "word")
if "Documents" not in graph_db.node_labels:
    graph_db.schema.create_index("Documents", "url")
The problem is that I need to enforce uniqueness on the "word" and "url" fields. I am adding the nodes and labels as follows:
doc, = graph_db.create({"url": url})
doc.add_labels("Documents")
My questions are:
1. Is there a way to add the node to the label index using get_or_create?
2. Does the py2neo API have a way to enforce uniqueness on the label index?
3. Is there a better way to do all of this? The documentation is a little fuzzy.

Answers:
No, because there is no need to explicitly add a node to a schema index - nodes are included automatically when the label is present.
Py2neo does not have specific functions for unique constraint management.
You could use Cypher for this instead (http://docs.neo4j.org/chunked/stable/query-constraints.html#constraints-create-uniqueness-constraint)
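A minimal sketch of the Cypher approach, assuming the "Documents" label and "url" property from the question (the example URL is made up):

```cypher
// Create a uniqueness constraint; this also creates a backing index.
CREATE CONSTRAINT ON (d:Documents) ASSERT d.url IS UNIQUE;

// MERGE then behaves like get_or_create: it reuses an existing node
// with the same url, or creates one if none exists.
MERGE (d:Documents {url: "http://example.com/doc1"})
RETURN d;
```

With the constraint in place, any attempt to CREATE a second :Documents node with the same url fails, so uniqueness is enforced by the database rather than the application.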

Related

Is there a way to do a "left outer join" like query in PromQL?

I am trying to use two metrics (that share some labels, including one I can use as a UUID) that should describe the same entities, in order to create alerts/dashboards that will alert me when an entity reports in one metric but not the other.
For example, for the following metrics:
item_purchases{name="item1", count="5"}
item_purchases{name="item2", count="7"}
item_stock{name="item1", in_stock="1"}
item_stock{name="item2", in_stock="0"}
item_stock{name="item3", in_stock="1"}
I use item_stock as my "source of truth", and I'm trying to write a query that will return:
item_stock{name="item3", ...} # I don't care about the other labels, just the name.
I already have a query that helps me filter on certain conditions (for example, if an item was purchased but is not in stock, like "item2") that looks something like:
item_stock{in_stock="1"} * on (name) group_left () (item_purchases)
but unfortunately it just drops all the records in item_stock that don't have a matching time series in item_purchases - like "item3", which is actually the result I'm looking for.
Does anyone have any experience coding these types of queries? Are they even possible in PromQL, or should I revert to some other solution?
It is possible using the unless operator:
item_stock unless item_purchases
This results in a vector consisting of the elements of item_stock for which there are no elements in item_purchases with exactly matching label sets; all matching elements in both vectors are dropped.
Note that the metrics in question do not have "exactly matching label sets" (name is common, but count and in_stock are not). In this case you can use on to specify the list of labels to match:
item_stock unless on(name) item_purchases

Storing multiple independent trees in Neo4j

I want to store multiple independent trees (there is no relation between the trees). My idea is to generate and assign a unique label to every independent tree, and then have every query filter on those labels. But if there are 10,000 trees, I would have to generate 10,000 different labels. Is there a better solution, like a multi-graph or something else?
I recommend using one label, say :Root, for all of your trees and a root_id property that contains the tree's unique identifier.
You can create a unique constraint on root_id to ensure that no two trees have the same ID. The unique constraint has the side effect of creating an index on the property so accessing the :Root nodes by root_id will be very fast.
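A sketch of this scheme in Cypher; the :HAS_CHILD relationship type and the example root_id value are assumptions, not part of the answer:

```cypher
// One :Root label for all trees; root_id distinguishes them.
CREATE CONSTRAINT ON (r:Root) ASSERT r.root_id IS UNIQUE;

// Fetch a whole tree by its identifier via the constraint-backed index.
MATCH (r:Root {root_id: 42})-[:HAS_CHILD*0..]->(n)
RETURN n;
```

The variable-length pattern starts at the indexed :Root node, so each query touches only the one tree it asks for.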

Expressing, querying and enforcing constraints on graph contents

Say you would like to create a graph authoring system that puts constraints on the contents of the graph. Say you have a "contains" relationship where a "city" may contain "houses", which in turn contain "bedrooms" and "bathrooms". But it is not legal for a city to contain bedrooms or bathrooms, or for a bathroom to contain a bedroom.
Further, say you want to offer suggestions to the graph author - if they select a "city" node, you might want to suggest what can be added to the city ("houses", "hospitals" and "schools", but not "bedrooms").
I am guessing that these constraints, in and of themselves, could be represented as a graph. Has anyone had any luck doing that? What was your experience?
There are a number of ways you could express these rules, for example:
You could use your application layer to check whether node-label-to-relationship-type rules are being respected
You could perhaps use triggers to check and act on any specific rules
You could create a user-defined rules engine
There will be other approaches too.
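As a concrete illustration of the application-layer option, here is a minimal sketch of a containment rule table and the two checks the question asks for. The rule table and label names are assumptions for illustration, not a schema from the question:

```python
# Allowed containment rules: which child labels each parent label may contain.
# This table is an assumption for illustration.
ALLOWED_CHILDREN = {
    "City": {"House", "Hospital", "School"},
    "House": {"Bedroom", "Bathroom"},
}

def can_contain(parent_label, child_label):
    """Return True if a node labeled `parent_label` may contain `child_label`."""
    return child_label in ALLOWED_CHILDREN.get(parent_label, set())

def suggestions(parent_label):
    """Labels to suggest to the author when a `parent_label` node is selected."""
    return sorted(ALLOWED_CHILDREN.get(parent_label, set()))
```

The same table could itself be stored as a graph (e.g. (:Label)-[:MAY_CONTAIN]->(:Label)), with the application querying it before committing an edit.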

NEO4J - Best practices to store 40 million text nodes

I've been using Neo4j for some weeks and I think it's awesome.
I'm building an NLP application, and basically, I'm using Neo4j for storing the dependency graph generated by a semantic parser, something like this:
https://explosion.ai/demos/displacy?text=Hi%20dear%2C%20what%20is%20your%20name%3F&model=en_core_web_sm&cpu=1&cph=0
In the nodes, I store the single words contained in the sentences, and I connect them through relations with a number of different types.
For my application, I have the requirement to find all the nodes that contain a given word, so basically I have to search through all the nodes, finding those that contain the input word. Of course, I've already created an index on the word text field.
I'm working on a very big dataset:
On my laptop, the following query takes about 20 ms:
MATCH (t:token) WHERE t.text="avoid" RETURN t.text
Here are the details of the graph.db:
47,108,544 nodes
45,442,034 relationships
13.39 GiB db size
Index created on token.text field
PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text

NodeIndexSeek     251,679 db hits
Projection        251,678 db hits
ProduceResults    251,678 db hits
I wonder if I'm doing something wrong in indexing this many nodes. At the moment, I create a new node for each word I encounter in the text, even if its text is the same as that of existing nodes.
Should I create a new node only when a new word is encountered, managing the sentence structures through relationships?
Could you please help me with a suggestion or best practice to adopt for this specific case?
Thank you very much
For this use case, each of your :Token nodes should be unique. When you create these you should be using MERGE instead of CREATE for the node itself, so if the node already exists it will use the existing one rather than creating a new one.
It may help to also add a unique constraint for this after you've cleaned up your data.
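A sketch of both steps in Cypher, using the :token label and text property from the question (the example word is arbitrary):

```cypher
// After deduplicating existing data, enforce uniqueness going forward.
CREATE CONSTRAINT ON (t:token) ASSERT t.text IS UNIQUE;

// MERGE reuses the existing node for a word, or creates it once.
MERGE (t:token {text: "avoid"})
RETURN t;
```

With unique :token nodes, the earlier query's index seek returns a single node instead of 251,679 duplicates, and sentence structure lives entirely in the relationships.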

What's the optimal structure for a multi-domain sentence/word graph in Neo4j?

I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding the most optimal way to implement the graph such that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper.
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
It doesn't seem like having sentence and positional information as an array of elements contained within a property of each node is efficient, especially when dealing with multiple word-domains and sub-domains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the most optimal way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain-entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains / positions on the relationships (and perhaps also add a source-id).
On the other hand, you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.
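A sketch of the "composite relationship" idea in Cypher; the :NEXT and :NEXT_AGG relationship types and the strength property are assumptions for illustration:

```cypher
// Collapse the per-sentence NEXT relationships between two words
// into a single composite relationship carrying a count.
MATCH (a:Word)-[r:NEXT]->(b:Word)
WITH a, b, count(r) AS strength
MERGE (a)-[c:NEXT_AGG]->(b)
SET c.strength = strength;
```

Queries that only need edge weights can then traverse the single :NEXT_AGG relationship, while the original :NEXT relationships remain available to reconstruct individual sentences.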