Storing text in Neo4j

We're using Neo4j and liking it. We do all sorts of graphy things in it. However, some of what we do is not graphy. For example, we keep a log of all changes to a certain type of node:
(n)-[:CHANGE]->(c1)-[:CHANGE]->(c2) etc etc
This list of changes can get to be 20 or 30 c1 nodes long. While it looks weird, I don't have a real problem with it. (Of course, I am smarter now, and since each :CHANGE relationship has a date in it, I could porcupine all the c1 nodes right from n. But whatever.)
But what if I wanted to store large amounts of text, or images? Is there a problem with storing large amounts of data in a node? I could use a different database for these things, but this just increases the skill set required to run the business. And of course, joining data in two disparate databases is always a PITA.
So do I need to worry about storing large amounts of text in a single property? Do I need to avoid creating logs in the way I did above?

Creating linked lists of events, whatever they may be, is fine. I'd even say it is graphy! We use this approach a lot, in the case of ChangeFeed for something similar to what you're doing.
In Neo4j, such a linked list is actually better than a "porcupine", because you have to traverse all the relationships anyway, if you're looking for all changes. But in the linked list case, you don't have to look at properties to order them. In fact, the best approach is a hybrid approach, like the TimeTree.
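To make the ordering point concrete, here is a small Python sketch. It is an in-memory model, not Neo4j code, and the names `Change`, `linked_list_order` and `porcupine_order` as well as the sample dates are made up for illustration. Walking the chain yields the changes in order purely from the structure, while the "porcupine" layout has to read a date from every change and sort.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Change:
    """Hypothetical stand-in for one change node."""
    summary: str
    date: str                          # ISO date, only needed by the porcupine layout
    next: Optional["Change"] = None    # models the outgoing :CHANGE relationship


def linked_list_order(first: Optional[Change]) -> List[str]:
    """Follow the chain; the order comes from the structure, no dates are read."""
    ordered = []
    node = first
    while node is not None:
        ordered.append(node.summary)
        node = node.next
    return ordered


def porcupine_order(changes: List[Change]) -> List[str]:
    """Every change hangs directly off n, so the dates must be read and sorted."""
    return [c.summary for c in sorted(changes, key=lambda c: c.date)]


# Made-up sample data.
c3 = Change("renamed", "2014-03-01")
c2 = Change("price updated", "2014-02-01", next=c3)
c1 = Change("created", "2014-01-01", next=c2)

print(linked_list_order(c1))
print(porcupine_order([c3, c1, c2]))
```

Both calls return the same sequence; the difference is only in what has to be touched to get it.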
As for storing large amounts of text or images, Neo4j is not the best place for it. Another database would be best, especially if the volume of these is large (more than hundreds of thousands).
But if you do want to store them in Neo4j, one thing to keep in mind is that Neo4j will load all the properties of a node/relationship once a single property is accessed. So to get good performance even with text/images in Neo4j, I would store them in a node of their own. That way, you load them only when you really need them, and not during regular traversals.
For example:
CREATE (b:BlogPost {title:'Neo4j Rocks', author:'Tony Ennis', date:".."})-[:HAS_BODY]->(:BlogPostBody {content:'..'})

Related

When should inferred relationships and nodes be used over explicit ones?

I was looking up how to utilise temporary relationships in Neo4j when I came across this question: Cypher temp relationship
and the comment underneath it made me wonder when they should be used and since no one argued against him, I thought I would bring it up here.
I come from a mainly SQL background and my main reason for using virtual relationships was to eliminate duplicated data and do traversals to get properties of something instead.
For a more specific example, let's say we have a robust cake recipe, which has sugar as an ingredient. The sugar is what makes the cake sweet.
Now imagine a use case where I don't like sweet cakes so I want to get all the ingredients of the recipe that make the cake sweet and possibly remove them or find alternatives.
Then there's another use case where I just want foods that are sweet. I could work backwards from the sweet ingredients to get to the food or just store that a cake is sweet in general, which saves time from traversal and makes a query easier. However, as I mentioned before, this duplicates known data that can be inferred.
Sorry if the example is too strange, I suck at making them. I hope the main question comes across, though.
My feeling is that the only valid scenario for creating redundant "shortcut" relationships is this:
Your use case has a stringent time constraint (e.g., average query time must be less than 200ms), but your Neo4j query -- despite optimization -- exceeds that constraint, and you have verified that adding "shortcut" relationships will indeed make the response time acceptable.
You should be aware that adding redundant "shortcut" relationships comes with its own costs:
Queries that modify the DB would need to be more complex (to modify the redundant relationships) and also slower.
You'd always have to add the redundant relationships -- even if you never actually need some (most?) of them.
If you want to make concurrent updates to the DB, the chances that you may lose some updates and introduce inconsistencies into the DB would increase -- meaning that you'd have to work even harder to avoid inconsistencies.
NOTE: For visualization purposes, you can use virtual nodes and relationships, which are temporary and not actually stored in the DB.
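As a rough illustration of the first cost (more complex, slower writes), here is a Python sketch using the cake example from the question. It is an in-memory model rather than Neo4j code, and the `Recipe` class and its `sweet_shortcut` attribute are hypothetical. The shortcut makes the read trivial, but every write path now has to remember to refresh it.

```python
class Recipe:
    """Hypothetical model: a recipe with ingredients and a redundant 'sweet' shortcut."""

    def __init__(self, name):
        self.name = name
        self.ingredients = {}        # ingredient name -> does it make the food sweet?
        self.sweet_shortcut = False  # redundant copy of a fact that can be inferred

    def is_sweet_inferred(self):
        """The 'traversal': look at the ingredients every time."""
        return any(self.ingredients.values())

    def add_ingredient(self, ingredient, sweet):
        self.ingredients[ingredient] = sweet
        # Extra work on every write; forgetting it leaves the shortcut inconsistent.
        self.sweet_shortcut = self.is_sweet_inferred()

    def remove_ingredient(self, ingredient):
        self.ingredients.pop(ingredient, None)
        self.sweet_shortcut = self.is_sweet_inferred()


cake = Recipe("cake")
cake.add_ingredient("flour", sweet=False)
cake.add_ingredient("sugar", sweet=True)
print(cake.sweet_shortcut)   # shortcut read, no traversal needed

cake.remove_ingredient("sugar")
print(cake.sweet_shortcut)   # only correct because remove_ingredient refreshed it
```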

One Node to Gather Them All

Say I am managing collectibles. I have thousands of baseball trading cards, thousands still of gaming cards (think Magic: the Gathering), and then thousands and thousands of doilies.
The part of me that's been steeped in relational databases for 20+ years is uncomfortable with the idea of thousands of Neo4j nodes floating out in space.
So I am inclined to gather them all with a node such as (:BASEBALL_CARDS), (:MTG_CARDS), and of course (:DOILIES). The idea is that these are singletons.
Now if I want all baseball cards that perhaps refer to a certain player, I could do something like:
(:BASEBALL_CARDS)-[:GATHERS]->(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
It's very comforting to have the :BASEBALL_CARDS singleton, but does it do anything more than could be accomplished by indexing :BASEBALL_CARD?
(:BASEBALL_CARD)-[:FEATURES]->(p:PLAYER {name: '...'})
Is it best-practice to have thousands of free-ranging nodes?
One exceptional strong point of the graph database is the local query: the relationship lives in the instance, not in the type. A particular challenge (apart from modelling well) is determining the starting point of the local query (and keeping it local, i.e., avoiding path explosions). In Neo4j 1.x your One Node was a way to achieve a starting point for a certain kind of query. With 2.x and the introduction of labels, indexing :BaseballCard is the standard way to accomplish the same. If the purpose of that One Node is as a starting point for the kind of query in your example, then you are better off using a label index. A common problem in 1.x was that a node with an increasing number of relationships of the same type and direction eventually becomes a bottleneck for traversals. People started partitioning your One Node into A Paged Handful of Nodes, something like
(:BaseballCards)-[:GATHERS]->(:BaseballCards1to10000)-[:GATHERS]->(:BaseballCard)
The purpose of finding a starting point for the local query is often better served by labels, perhaps in combination with a basic, ordinary, local traversal, than by A Handful of Nodes. Then again, if it calms your nerves or satisfies your sense of the epic to have such a node, by all means have it. Because of the locality of queries, it will do you no harm.
In your example, however, neither the One Node nor an index on :BaseballCard would best serve as the starting point of the local query. The most particular pattern of interest is instead the name of the player. If you index (:Player) on name you will get the best starting point. The traversal across the one or handful* of [:FEATURES] relationships is very cheap and with a simple test on the other end for the :BaseballCard label, you are done. You could of course maintain the One Node for all players that share a name...
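Here is a small Python sketch of that idea. It is an in-memory model, not Neo4j code; the sample cards and the names `featured_in` and `baseball_cards_featuring` are made up. The dictionary plays the role of the index on the player's name, and the label check happens only on the few cards reached from that player.

```python
from collections import defaultdict

# Made-up sample data: (card id, label, featured player name).
cards = [
    (1, "BaseballCard", "Player A"),
    (2, "BaseballCard", "Player B"),
    (3, "MtgCard", "Player A"),
    (4, "BaseballCard", "Player A"),
]

# Models an index on the player's name: name -> cards that FEATURE that player.
featured_in = defaultdict(list)
for card_id, label, player in cards:
    featured_in[player].append((card_id, label))


def baseball_cards_featuring(player_name):
    """Start at the player, hop over FEATURES, keep only baseball cards."""
    return [
        card_id
        for card_id, label in featured_in.get(player_name, [])
        if label == "BaseballCard"
    ]


print(baseball_cards_featuring("Player A"))  # [1, 4]
```

No card that features a different player is ever looked at, which is the point of picking the most selective starting node.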
In my most humble opinion there is little need for discomfort. I do, however, want to affirm and commend your unease, in this one regard: that the graph is most powerful for connected data. The particular connection gathering the baseball cards doesn't seem to add new understanding or improve performance, but wherever there is disconnected data there is the potential for discovering exciting and meaningful patterns. Perhaps in the future the cards will be connected through patterns that signify their range of value, or the quality of their lamination, or a linked list of previous owners, or how well they work as conversation starters on a date. The absence of relationships is a call to find that One Missing Link that brings tremendous insight and value into your data.
* Handful, assuming that more than one baseball card features the same player, or some baseball players are also featured on cards of Magic: The Gathering. I'm illiterate in both domains, so I want to at least allow for the possibility.
It is ironic that you are concerned about nodes "floating out in space", when the whole idea behind graph DBs is making the connections between nodes a first class DB construct.
But I think your actual concern is that nodes do not "belong to a table" (in relational DB parlance). So, you would feel more comfortable creating a special singleton node that in some sense takes the place of a table, from which you can access all the nodes that ought to belong to that table.
A node label can be seen as the equivalent of a "table name". So, not only is there no need for you to also create a singleton "table node", doing so would waste DB resources and would complicate and slow down your queries. And Neo4j can quickly access all the nodes with the same label.

How much does the architecture of data affect the speed of a query

I have the following nodes and relationships in Neo4j database.
The grey and the pink nodes are further connected with more nodes. Running the following query:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..3]-(z)
RETURN DISTINCT ID(z), z.id, n.id AS InternalID
I get a result very fast (the node n:RealNode is not one of the nodes in the image).
If I increase the depth to 4 like:
MATCH (n:RealNode {gid:'$obj_id'})-[:CONTAINS*..4]-(z)
RETURN DISTINCT ID(z), z.id, n.id AS InternalID
The response gets extremely slow. I will never get a response with depth 5 etc.
Depth 4 is actually the relationship between the blue and pink nodes. So my question is: can the architecture of the data (in this case) affect the speed of the query to such a great degree? If so, what should I do?
I have tried to run the query also using parameters but the result was the same. Also the gid of n:RealNode is an indexed value.
The architecture of your data has a huge, no...massive impact on query performance. There's a lot you can do with improving performance by reformulating your query, but you can do even more than that by changing your data model.
The model needs to be chosen in a way that's an accurate depiction of the real-world domain, but it often also has to make certain concessions to usage patterns. If you know you're going to do certain queries over and over, it makes sense to choose a data model that makes it easy for the DBMS to answer that query. In the RDBMS world, that entire line of thinking gets summarized in the word "denormalization". In graph databases, the concept is the same but the way you go about it is different.
The thing to keep in mind when adjusting your data model is that neo4j is good at traversing relationships fast, and that with all queries, the less data you have to consider, the faster the query will go.
So in your case, I don't know how many nodes branch off each node via a :CONTAINS relationship, but I'm guessing that at each level of the hierarchy you have many items below it. Going from level 4 to level 5 then probably doesn't add a fixed number of additional nodes. If, say, each level of the hierarchy has 3x as many nodes as the level above, then the deeper you go, the more you multiply the amount of data you have to consider. If it's 10x...then ouch.
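The arithmetic is easy to check with a few lines of Python. This assumes an idealised tree in which every node has the same number of children, which is certainly a simplification of your real data; `nodes_within_depth` is just a name chosen for this sketch.

```python
def nodes_within_depth(branching: int, depth: int) -> int:
    """Nodes reachable in 1..depth hops when every node has `branching` children."""
    return sum(branching ** level for level in range(1, depth + 1))


for branching in (3, 10):
    counts = [nodes_within_depth(branching, depth) for depth in (3, 4, 5)]
    print(branching, counts)
# 3 [39, 120, 363]
# 10 [1110, 11110, 111110]
```

With a factor of 3, each extra level roughly triples the work; with a factor of 10, depth 5 already means over a hundred thousand nodes from a single starting point. An undirected variable-length match can also walk back up and across, so the real number of paths considered may be higher than this tree estimate.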
You have many different options. One is to create short-cut relationships, and "pre-materialize" the query. Imagine creating :grandfather and :greatgrandfather relationships to "hop" levels of the tree. That would make it faster. Another way would be to filter intermediate nodes, or the return nodes, so that you're not considering everything, but some subset.
In the end, really huge queries will always take longer than really small ones. You must first begin with a careful understanding of what data you want, and how often you have to run this query. I would not attempt to optimize your data model for infrequently run queries, but if you do this all the time, you should look at your options. Your query to me looks like it's going to return a whole lot of data no matter what you do.

Neo4j - individual properties, or embedded in JSON? (ROR)

I want to know which is more efficient in terms of speed and property limitations of Neo4j.. (I'm using Ruby on Rails 3.2 and REST)
I'm wondering whether I should be storing node properties individually, much like columns in a database table, or storing most/all of them for a node in a single node property in JSON format.
Right now in a test system I have 1000 nodes with a total of 10000 properties.. Obviously the number of properties is going to skyrocket as more features and new node types are added to my system.
So I was considering storing all the non-searchable properties for a node in an embedded JSON structure.. Except this seems like it will put more burden on the web servers, having to parse the JSON after retrieving it, etc. (I'm going to use a single property field with JSON for activity feed nodes, but I'm addressing things like photo nodes, profile nodes etc).
Any advice here? Keep things in separate properties? A hybrid of JSON and individual properties?
What is your goal by storing things in JSON? Do you think you'll hit the 67B limit (which will be going up in 2.1 in a few months to something much larger)?
From a low level store standpoint, there isn't much difference between storing a long string and storing many shorter properties. The main thing you're doing is preventing yourself from using those fields in a query.
Also, if you're using REST, you're going to have to do JSON parsing anyway, so it's not like you're going to completely avoid that.
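A quick Python sketch of what "preventing yourself from using those fields in a query" means in practice. This is an in-memory model with made-up photo nodes, not Neo4j or REST code, and `wide_individual` and `wide_blob` are hypothetical names. With individual properties the filter can look at the field directly; with a JSON blob every candidate has to be fetched and parsed before it can be tested.

```python
import json

# Made-up sample nodes with individual properties.
nodes_individual = [
    {"name": "photo-1", "width": 800},
    {"name": "photo-2", "width": 1600},
]

# The same nodes, with everything packed into one JSON string property.
nodes_blob = [{"data": json.dumps(node)} for node in nodes_individual]


def wide_individual(min_width):
    """Filter directly on the property (what a query or index could do for you)."""
    return [n["name"] for n in nodes_individual if n["width"] >= min_width]


def wide_blob(min_width):
    """The store only sees an opaque string, so the client parses every node."""
    names = []
    for node in nodes_blob:
        props = json.loads(node["data"])
        if props["width"] >= min_width:
            names.append(props["name"])
    return names


print(wide_individual(1000))
print(wide_blob(1000))
```

Both return the same answer; the second one just moves all the filtering work to the web server.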

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering whether it would be appropriate to use some sort of key-value store instead of MySQL. I have users that have_many lists, and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it has to create hundreds of word objects at one time.
As an alternative, I am considering using some sort of key value store where the key would just be the word. A list of words could be stored in a text field in mysql. Each list could be a new key value db? It seems like it would be faster to copy a key value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a list table, a word table, and a list-words table relating the two. You are correct that there would be some overhead, but don't overestimate it; because table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
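As a rough sketch of that three-table layout, here is a Python example using the standard-library sqlite3 module in place of MySQL. The table and column names and the `copy_list` helper are assumptions made for illustration, not your actual schema. Copying a list only adds rows to the join table, and it does so in a single INSERT ... SELECT statement; no word rows are duplicated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lists (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
    CREATE TABLE list_words (
        list_id INTEGER REFERENCES lists(id),
        word_id INTEGER REFERENCES words(id),
        PRIMARY KEY (list_id, word_id)
    );
""")


def copy_list(conn, source_list_id, new_user_id):
    """Copy a list for another user; returns the id of the new list."""
    row = conn.execute(
        "SELECT name FROM lists WHERE id = ?", (source_list_id,)
    ).fetchone()
    if row is None:
        raise ValueError("no such list")
    cur = conn.execute(
        "INSERT INTO lists (user_id, name) VALUES (?, ?)", (new_user_id, row[0])
    )
    new_list_id = cur.lastrowid
    # One statement copies every membership row, however many words there are.
    conn.execute(
        "INSERT INTO list_words (list_id, word_id) "
        "SELECT ?, word_id FROM list_words WHERE list_id = ?",
        (new_list_id, source_list_id),
    )
    return new_list_id


# Made-up sample data: one list with 300 words.
conn.execute("INSERT INTO lists (id, user_id, name) VALUES (1, 1, 'animals')")
conn.executemany(
    "INSERT INTO words (id, text) VALUES (?, ?)",
    [(i, f"word{i}") for i in range(1, 301)],
)
conn.executemany(
    "INSERT INTO list_words (list_id, word_id) VALUES (1, ?)",
    [(i,) for i in range(1, 301)],
)

new_id = copy_list(conn, 1, 2)
copied = conn.execute(
    "SELECT COUNT(*) FROM list_words WHERE list_id = ?", (new_id,)
).fetchone()[0]
print(new_id, copied)
```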
If you want very fast copies, you could make lists copy-on-write, meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization. Start simple and only add complications like this if you find they are necessary.
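A minimal copy-on-write sketch in Python, to show the shape of the idea. The `WordList` class is hypothetical and lives purely in memory; in your application the shared storage would be rows in the database. Copying just shares the underlying storage, and the real duplication is deferred until one side writes.

```python
class WordList:
    """Hypothetical copy-on-write word list."""

    def __init__(self, words=()):
        self._words = list(words)
        self._shared = False   # True while the storage may be used by another list

    def copy(self):
        """Cheap copy: both lists point at the same storage until one is modified."""
        clone = WordList.__new__(WordList)
        clone._words = self._words
        clone._shared = True
        self._shared = True
        return clone

    def _detach(self):
        """Take a private copy of the storage before the first write."""
        if self._shared:
            self._words = list(self._words)
            self._shared = False

    def add(self, word):
        self._detach()
        self._words.append(word)

    def remove(self, word):
        self._detach()
        self._words.remove(word)

    def words(self):
        return tuple(self._words)


original = WordList(["apple", "banana"])
duplicate = original.copy()   # nothing is duplicated yet
duplicate.add("cherry")       # the real copy happens here
print(original.words())
print(duplicate.words())
```

The simple flag means a list may occasionally take one copy it did not strictly need, which keeps the sketch short; a reference count would avoid that.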
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined and each word has to be recorded separately in each list). The only performance dimension where I would expect it to do better is massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.

Resources