Neo4j - individual properties, or embedded in JSON? (ROR)

I want to know which is more efficient in terms of speed and the property limitations of Neo4j. (I'm using Ruby on Rails 3.2 and REST.)
I'm wondering whether I should store node properties as individual properties, much like columns in a database table, or store most or all of a node's data in a single node property in JSON format.
Right now, in a test system, I have 1000 nodes with a total of 10000 properties. Obviously the number of properties is going to skyrocket as more features and new node types are added to my system.
So I was considering storing all the non-searchable properties for a node in an embedded JSON structure. However, this seems like it will put more burden on the web servers, which have to parse the JSON after retrieving it, etc. (I'm going to use a single property field with JSON for activity feed nodes, but here I'm asking about things like photo nodes, profile nodes, etc.)
Any advice here? Keep things in separate properties? A hybrid of JSON and individual properties?

What is your goal by storing things in JSON? Do you think you'll hit the 67B limit (which will be going up in 2.1 in a few months to something much larger)?
From a low level store standpoint, there isn't much difference between storing a long string and storing many shorter properties. The main thing you're doing is preventing yourself from using those fields in a query.
Also, if you're using REST, you're going to have to do JSON parsing anyway, so it's not like you're going to completely avoid that.
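A hybrid of the two approaches could look like the following minimal Python sketch. It assumes a hypothetical photo node where only `owner_id` and `created_at` need to be queryable; the property names, the `extra_json` key, and both helper functions are illustrative and not part of any Neo4j API.

```python
import json

# Hypothetical set of properties that queries need to filter or index on.
SEARCHABLE = {"owner_id", "created_at"}

def split_properties(props, searchable=SEARCHABLE):
    """Keep searchable keys as individual properties; pack the rest into one JSON string."""
    node = {k: v for k, v in props.items() if k in searchable}
    extra = {k: v for k, v in props.items() if k not in searchable}
    if extra:
        # sort_keys keeps the stored string stable across runs
        node["extra_json"] = json.dumps(extra, sort_keys=True)
    return node

def merge_properties(node):
    """Reverse of split_properties: unpack the JSON blob back into a flat dict."""
    props = {k: v for k, v in node.items() if k != "extra_json"}
    props.update(json.loads(node.get("extra_json", "{}")))
    return props

photo = {"owner_id": 7, "created_at": "2013-05-01", "caption": "Sunset", "width": 1024}
stored = split_properties(photo)
```

Anything left inside `extra_json` cannot be used in a query, so the choice of `SEARCHABLE` is the real design decision here.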

Related

Neo4j web client fails with large Cypher CREATE query. 144000 lines

I'm new to Neo4j and am currently attempting to migrate existing data into a Neo4j database. I have written a small program to convert the current data (in a bespoke format) into a large CREATE Cypher query for the initial population of the database. My first iteration somewhat retains the structure of the existing object model, i.e. objects become nodes, the node type is the same as the object name in the current object model, and the members become properties (the member name is the property name). This is done for all fundamental types (and strings), and any member objects are decomposed in the same way as in the original object model.
This has been fine in terms of performance, and CREATE Cypher queries of 13000+ lines have been generated which can be executed through the web frontend/client. However, I believe the model is not ideal for a graph database, since there can be many properties. Instead I would like to decompose these 'fundamental' nodes (with members which are fundamental types) into their own nodes, each relating to a more 'abstract' node which represents the higher-level object/class. This means each member is a node with a single property (at first; it may grow), say { value:"42" }, or I could set the node type to the data type (i.e. integer). If my understanding is correct, this would also allow me to create relationships between the 'members' (since they are nodes and not properties), allowing greater freedom when expressing relationships between original members of different objects, rather than just relating the parent objects to each other.
The problem is that this now generates Cypher queries of 144000+ lines (and this isn't a large dataset in comparison to others), which the Neo4j client seems to balk at. The code highlighting appears to work in the query input box of the client (i.e. it highlights correctly, which I assume implies that it parsed correctly and is a valid Cypher query), but when I come to run the query, I get the usual browser-not-responding message and then a stack overflow (no pun intended) error. What's more, the Neo4j client doesn't exit gracefully and always requires me to force end task, and the db is in the 2.5-3GB usage range from what is effectively a small amount of data (144000 lines, approx. 2/3 of which are relationships, so at most ~48000 nodes). Yet I read that I should be able to deal with millions of nodes and relationships in milliseconds?
I have tried it with Firefox and Chrome. I am using the Neo4j Community Edition on Windows 10. The SDK would initially be used with C# and C++. This research is in its initial stages, so I haven't used the SDK yet.
Is this a valid approach, i.e. initially populating the database via a CREATE query?
Also, is my approach of decomposing the data into fundamental types a good one, or are there issues which are likely to arise from it?
That is a very large Cypher query!!!
You would do much better to populate your database using LOAD CSV FROM... and supplying a CSV file containing the data you want to load.
For a detailed explanation, have a look at:
https://neo4j.com/developer/guide-import-csv/
(This page also discusses the batch loader for really large datasets.)
Since you are generating code for the Cypher query I wouldn't imagine you would have too much trouble generating a CSV file.
(As an indication of performance, I have been loading a 1 million record CSV today into Neo4j running on my laptop in under two minutes.)
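As a rough illustration of the CSV route, the generator program could emit one file for nodes and one for relationships instead of a single CREATE statement. The sketch below is in Python and writes to in-memory buffers; the column names (`id`, `label`, `value`, `start_id`, `end_id`, `type`) and the sample data are assumptions made for the example, not a format Neo4j requires.

```python
import csv
import io

def write_nodes_csv(out, nodes):
    """Write one row per node, with a header row that LOAD CSV WITH HEADERS can use."""
    writer = csv.writer(out)
    writer.writerow(["id", "label", "value"])
    for node in nodes:
        writer.writerow([node["id"], node["label"], node["value"]])

def write_rels_csv(out, rels):
    """Write one row per relationship as (start node id, end node id, type)."""
    writer = csv.writer(out)
    writer.writerow(["start_id", "end_id", "type"])
    for start_id, end_id, rel_type in rels:
        writer.writerow([start_id, end_id, rel_type])

# Hypothetical sample data standing in for the converted object model.
nodes = [
    {"id": 1, "label": "Object", "value": "parent"},
    {"id": 2, "label": "Member", "value": "42"},
]
rels = [(1, 2, "HAS_MEMBER")]

nodes_buf = io.StringIO()
rels_buf = io.StringIO()
write_nodes_csv(nodes_buf, nodes)
write_rels_csv(rels_buf, rels)
```

In a real run the buffers would be files on disk, each read by its own `LOAD CSV WITH HEADERS FROM ... AS row` statement, nodes first and relationships second.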

Storing text in Neo4j

We're using Neo4j and liking it. We do all sorts of graphy things in it. However, some of what we do is not graphy. For example, we keep a log of all changes to a certain type of node:
(n)-[:CHANGE]->(c1)-[:CHANGE]->(c2) etc etc
This list of changes can get to be 20 or 30 c1 nodes long. While it looks weird, I don't have a real problem with it. (Of course, I am smarter now, and since each :CHANGE relationship has a date in it, I could porcupine all the c1 nodes right from n. But whatever.)
But what if I wanted to store large amounts of text, or images? Is there a problem with storing large amounts of data in a node? I could use a different database for these things, but this just increases the skill set required to run the business. And of course, joining data in two disparate databases is always a PITA.
So do I need to worry about storing large amounts of text in a single property? Do I need to avoid creating logs in the way I did above?
Creating linked lists of events, whatever they may be, is fine. I'd even say it is graphy! We use this approach a lot, in the case of ChangeFeed for something similar to what you're doing.
In Neo4j, such a linked list is actually better than a "porcupine", because you have to traverse all the relationships anyway, if you're looking for all changes. But in the linked list case, you don't have to look at properties to order them. In fact, the best approach is a hybrid approach, like the TimeTree.
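The ordering argument can be sketched with plain Python objects. This is an in-memory stand-in rather than Neo4j driver code: the `Change` class, its `next` attribute (playing the role of the :CHANGE relationship), and the sample dates are all assumptions made for the illustration.

```python
class Change:
    """Stand-in for a change node; `next` plays the role of the outgoing :CHANGE relationship."""
    def __init__(self, summary, date):
        self.summary = summary
        self.date = date
        self.next = None

def append_change(head, change):
    """Attach a change to the end of the chain and return the head of the chain."""
    if head is None:
        return change
    node = head
    while node.next is not None:
        node = node.next
    node.next = change
    return head

def changes_in_order(head):
    """Linked list: following the chain already yields the changes in order."""
    node = head
    while node is not None:
        yield node
        node = node.next

def porcupine_in_order(changes):
    """Porcupine: every change hangs off n, so ordering has to read the date property."""
    return sorted(changes, key=lambda c: c.date)

c1 = Change("created", "2014-01-01")
c2 = Change("renamed", "2014-02-01")
c3 = Change("archived", "2014-03-01")

head = None
for change in (c1, c2, c3):
    head = append_change(head, change)
```

Both layouts visit every change, but only the porcupine version has to touch a property on each one to recover the order.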
As for storing large amounts of text or images, Neo4j is not the best place for it. Another database would be best, especially if the volume of these is large ( > hundreds of thousands).
But if you do want to store them in Neo4j, one thing to keep in mind is that Neo4j will load all the properties of a node/relationship once a single property is accessed. So in order to achieve good performance even with text/images in Neo4j, I would store them in a node of their own. That way, you load them only when you really need them, and not during regular traversals.
For example:
CREATE (b:BlogPost {title:'Neo4j Rocks', author:'Tony Ennis', date:".."})-[:HAS_BODY]->(:BlogPostBody {content:'..'})

NSKeyedArchiver vs Core Data

I am building an app with Objective-C and I would like to persist data. I am hesitating between NSKeyedArchiver and Core Data. I am aware there are plenty of resources about this on the web (including Objective-C best choice for saving data) but I am still doubtful about the one I should use. Here are the two things that make me wonder:
(1) I am assuming I will have around 1000-10000 objects to handle for a data volume of 1-10 Mb. I will do standard database queries on these objects. I would like to be able to load all these objects on launching and to save them from time to time -- a 1 second processing time for loading or saving would be fine by me.
(2) For the moment my model is rather intricate: for instance, classA contains, among other properties, an array of classB, which itself has (among others) a property of type classC and a property of type classD. And classD itself contains properties of type classE.
Am I right to assume that (1) means that NSKeyedArchiver will still work fine and that (2) means that using Core Data may not be very simple? I have tried to look on the web for cases where Core Data was used with a complex object graph structure like my case (2), but haven't found many resources. For the moment, this is what holds me back the most from using it.
The two things you identify both make me lean towards using CoreData rather than NSKeyedArchiver:
CoreData is well able to cope with 10,000 objects (if not considerably more), and it can support relatively straight-forward "database-like" queries of the data (sorting with NSSortDescriptors, filtering with NSPredicate). There are limitations on what can be achieved, but worst case you can load all the data into memory - which is what you would have to do with the NSKeyedArchiver solution.
Loading in sub-second times should be achievable (I've just tested with 10,000 objects, totalling 14Mb, in 0.17 secs in the simulator), particularly if you optimise to load only essential data initially, and let CoreData's faulting process bring in the additional data when necessary. Again, this will be better than NSKeyedArchiver.
Although most demos/tutorials opt for relatively straight forward data models (enough to demonstrate attributes and relationships), CoreData can cope with much more sophisticated data models. Below is a mock-up of the relationships that you describe, which took a few minutes to put together:
If you generate subclasses for all those entities, then traversing those relationships is simple (both forwards and backwards - inverse relationships are managed automatically for you). Again, there are limitations (CoreData does the SQL work for you, but in so doing it is less flexible than using a relational database directly).
Hope that helps.

Is it bad to change _id type in MongoDB to integer?

MongoDB uses ObjectId type for _id.
Will it be bad if I make _id an incrementing integer?
(With this gem, if you're interested)
No it isn't bad at all and in fact the built in ObjectId is quite sizeable within the index so if you believe you have something better then you are more than welcome to change the default value of the _id field to whatever.
But, and this is a big but, there are some considerations when deciding to move away from the default formulated ObjectId, especially when using the auto incrementing _ids as shown here: https://docs.mongodb.com/v3.0/tutorial/create-an-auto-incrementing-field
Multithreading isn't such a big problem, because findAndModify and the atomic locks can actually take care of that, but then you hit your first problem: findAndModify is neither the fastest nor the lightest function, and significant performance drops have been noticed when using it regularly.
You also have to consider the overhead of doing this yourself anyway, even without findAndModify. For every insert you will need an extra query. Imagine having a unique id that you have to query the uniqueness of every time you want to insert. Eventually your insert rate will drop to a crawl and your lock time will build up.
Of course the ObjectId is really good at being unique without having to check or formulate its own uniqueness by touching the database prior to insertion, hence it doesn't have this overhead.
If you still feel an integer _id suits your scenario, then go for it, but bear in mind the overhead described above.
You can do it, but you are responsible for making sure that the integers are unique.
MongoDB doesn't support auto-increment fields like most SQL databases. When you have a distributed or multithreaded application which has multiple processes and/or threads which create new database entries, you have to make sure that they use the same counter. Otherwise it could happen that two threads try to store a document with the same _id in the database.
When that happens, one of them will fail. That means you have to wait for the database to return a success or error (by calling GetLastError or by setting the write concerns to acknowledged), which takes longer than just sending data in a fire-and-forget manner.
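The shared-counter requirement can be illustrated with a toy in-memory sketch. It uses a Python lock as a stand-in for an atomic increment on a single counter document; `CounterStore` and `next_id` are hypothetical names, and none of this is MongoDB driver code.

```python
import threading

class CounterStore:
    """Toy stand-in for a collection holding one counter document per sequence name."""
    def __init__(self):
        self._counters = {}
        self._lock = threading.Lock()

    def next_id(self, name):
        # The lock makes read-increment-write one atomic step, so no two
        # callers can ever be handed the same value.
        with self._lock:
            value = self._counters.get(name, 0) + 1
            self._counters[name] = value
            return value

store = CounterStore()  # every writer must go through this one shared counter
ids = []
ids_lock = threading.Lock()

def worker():
    for _ in range(100):
        new_id = store.next_id("users")
        with ids_lock:
            ids.append(new_id)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The price is visible in the structure: every insert first has to make a serialized trip to the counter, which is the overhead an ObjectId avoids.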
I had a use case for this: replacing _id with a 64 bit integer that represented a simhash of a document index for searching.
Since I intended to "Get or create", providing the initial simhash, and creating a new record if one didn't exist was perfect. Also, for anyone Googling, MongoDB support explained to me that simhashes are absolutely perfect for sharding and scaling, and even better than the more generic ObjectId, because they will divide up the data across shards perfectly and intrinsically, and you get the key stored for negative space (a uint64 is much smaller than an objectId and would need to be stored anyway).
Also, for you Googlers, replacing a MongoDB _id with something other than an objectId is absolutely simple: Just create an object with the _id being defined; use an integer if you like. That's it: Mongo will simply use it. If you try to create a document with the same _id you'll get an error (E11000/Duplicate key). So like me, if you're using simhashing, this is ideal in all respects.

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering whether it would be appropriate to use some sort of key-value store instead of MySQL. I have users that have_many lists, and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it is going to have to create these hundreds of word objects at one time.
As an alternative, I am considering using some sort of key value store where the key would just be the word. A list of words could be stored in a text field in mysql. Each list could be a new key value db? It seems like it would be faster to copy a key value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a list table, a word table, and a list-words table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
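A minimal sketch of that three-table layout follows, using Python's built-in sqlite3 module as a stand-in for MySQL. The table names, column names, and the `copy_list` helper are assumptions made for the example. With a join table, copying a list creates no new word rows at all, only new rows relating the copy to existing words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lists (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
    CREATE TABLE words (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
    CREATE TABLE list_words (
        list_id INTEGER REFERENCES lists(id),
        word_id INTEGER REFERENCES words(id),
        PRIMARY KEY (list_id, word_id)
    );
""")

def copy_list(conn, source_list_id, new_user_id):
    """Copy a list for another user with a single INSERT ... SELECT over the join table."""
    name = conn.execute(
        "SELECT name FROM lists WHERE id = ?", (source_list_id,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO lists (user_id, name) VALUES (?, ?)", (new_user_id, name)
    )
    new_list_id = cur.lastrowid
    conn.execute(
        "INSERT INTO list_words (list_id, word_id) "
        "SELECT ?, word_id FROM list_words WHERE list_id = ?",
        (new_list_id, source_list_id),
    )
    return new_list_id

# Hypothetical sample data: one list of three words owned by user 10.
conn.execute("INSERT INTO lists (id, user_id, name) VALUES (1, 10, 'animals')")
conn.executemany("INSERT INTO words (id, text) VALUES (?, ?)",
                 [(1, "cat"), (2, "dog"), (3, "emu")])
conn.executemany("INSERT INTO list_words (list_id, word_id) VALUES (1, ?)",
                 [(1,), (2,), (3,)])

new_id = copy_list(conn, 1, 20)
```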
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
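The copy-on-write idea can be sketched in a few lines of Python. The `WordList` class and its `_shared` flag are hypothetical and in-memory only; a database-backed version would share a stored list record in the same spirit.

```python
class WordList:
    """A user's word list whose underlying storage may be shared with copies of it."""
    def __init__(self, words=None):
        self._words = list(words or [])
        self._shared = False

    def copy(self):
        """Cheap copy: both objects point at the same storage until one of them writes."""
        clone = WordList()
        clone._words = self._words
        clone._shared = True
        self._shared = True
        return clone

    def _ensure_private(self):
        # Duplicate the storage only at the moment a write is about to happen.
        if self._shared:
            self._words = list(self._words)
            self._shared = False

    def add(self, word):
        self._ensure_private()
        self._words.append(word)

    def words(self):
        return tuple(self._words)

original = WordList(["cat", "dog"])
clone = original.copy()
shared_before_write = clone._words is original._words
clone.add("emu")
```

The flag is deliberately conservative: after the clone has written, the original still believes it is shared and will make one unnecessary duplicate on its next write, which costs a little time but never corrupts data.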
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.

Resources