Neo4j rebuilding database

We would like to import tens of millions of nodes, with relationships, roughly every day.
neo4j-admin import seems to be very slow (we removed all constraints and indexes beforehand).
What is the best practice for loading a huge amount of data into Neo4j?
Every other day, when we import, we would like to load into a different database and SWITCH between the current database and the newly built one (similar to an index alias in Elasticsearch).
How can this be done?
Thank you
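One way to get the Elasticsearch-style alias switch, assuming a Neo4j version with multi-database support and database aliases (4.4+ / 5.x): bulk-load each import into a fresh database with neo4j-admin import, then repoint an alias that the application connects through. Below is a minimal sketch using the official Java driver; the database name imports2 and the alias graph are made-up examples.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.SessionConfig;

public class SwitchToFreshImport {
    public static void main(String[] args) {
        // Assumption: the database "imports2" was just built offline with neo4j-admin import,
        // and the application always connects through the alias "graph".
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session(SessionConfig.forDatabase("system"))) {
            session.run("START DATABASE imports2");
            // Repointing the alias switches readers to the new data in one step.
            session.run("ALTER ALIAS graph SET DATABASE TARGET imports2");
        }
    }
}
```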

Related

Neo4j json dl import

I'm just getting into Neo4j and I spent a couple hours mapping out some nodes and relationships.
I downloaded the JSON and I'm trying to move the nodes to another computer. It seems like it should be a pretty simple query, but everything I'm finding about batch import is for CSVs and a bit more involved.
Is there a simple Cypher query to import the JSON downloaded from the local Neo4j server?
Moving a full graph db to another box is most simply done by copying over the data/graph.db directory.
Alternatively you can use neo4j-shell's dump command.

Neo4J Subgraphs or Multiple Databases

I have my Neo4j (embedded) database set up like this:
I attach several user nodes to the reference node.
To each user node can be attached one or more project nodes.
To each project node is a complex graph attached.
The complex graph is traversable with a single traverse pattern (there is a hidden tree structure in them).
What I'd like to do is the following:
Remove all the nodes below a project node.
Remove all the project nodes below a user, when there's nothing below the project nodes.
Export all nodes below a specific user node to .graphML (probably using the Gremlin Java API?).
Import a .graphML file back into the database below a specific user node, without removing the information located under different user nodes.
I've already worked with the Gremlin GraphML reader for importing and exporting entire Neo4J databases but I was not able to find something about importing/exporting subgraphs.
If this is in fact possible, how would Neo4j handle two users trying to import something at the same time? For instance, user 1 imports his section under the user1 node while user 2 simultaneously imports his data under the user2 node.
The other possibility is to have one Neo4j database per user, but this is the less preferable option and I'm very unsure whether it's actually possible, be it with the embedded or the server version. I've read something about running multiple server instances on different ports, but our number of users is by definition unlimited...
Any help would be greatly appreciated.
EDIT 1: I've also come across something called Geoff (org.neo4j.geoff) which deals with subgraphs. I'm absolutely clueless as to how this works but I'm looking into it right now.
You might take a lock on the user node when starting the import, so that the second import would have to wait (and would have to check).
With Cypher queries you can delete the subgraphs and also export them to Cypher again; there is export code for query results in the Neo4j Console repository.
There you can also find Geoff export and import code, as well as Cypher importers.
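A minimal sketch of that locking idea, assuming the embedded Java API of that era (exact class and method names vary between Neo4j versions):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class UserImport {
    // Serialize concurrent imports under the same user node: the write lock on the
    // user node makes a second import for that user block until this one commits.
    public static void importUnderUser(GraphDatabaseService db, long userNodeId) {
        Transaction tx = db.beginTx();
        try {
            Node userNode = db.getNodeById(userNodeId);
            tx.acquireWriteLock(userNode);
            // ... create the imported subgraph and attach it to userNode ...
            tx.success();
        } finally {
            tx.finish();
        }
    }
}
```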
One option may be to use something like Tinkerpop blueprints to create a generic Graph when traversing, then doing a GraphML export.
https://github.com/tinkerpop/blueprints/wiki/GraphML-Reader-and-Writer-Library would have more information, but if you are looking to export subgraphs, this is probably your best option.
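To make the Blueprints route concrete, here is a rough sketch against the Blueprints 2.x API: copy the traversed subgraph into an in-memory TinkerGraph and write it out with GraphMLWriter. The vertex ids, property names and edge label are made up, and the actual copying from Neo4j is left as a placeholder.

```java
import java.io.FileOutputStream;
import java.io.OutputStream;

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;
import com.tinkerpop.blueprints.util.io.graphml.GraphMLWriter;

public class SubgraphExport {
    public static void exportSubgraph(String fileName) throws Exception {
        // Build a throwaway in-memory graph containing only the user's subgraph.
        Graph subgraph = new TinkerGraph();
        Vertex user = subgraph.addVertex("user1");
        Vertex project = subgraph.addVertex("project1");
        project.setProperty("name", "demo project");
        subgraph.addEdge(null, user, project, "OWNS");
        // ... in practice, copy vertices/edges here while traversing the Neo4j graph ...

        try (OutputStream out = new FileOutputStream(fileName)) {
            GraphMLWriter.outputGraph(subgraph, out);
        }
    }
}
```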

Migrate Data from Neo4j to SQL

Hi, I am using Neo4j in my application and my structure is as follows:
I am using the Embedded Graph API.
I have several databases that I point to using a pool that I maintain in my application, e.g. db1, db2, db3, ..... db100.
When I want to access a particular database I point to it using new EmbeddedGraphDatabase("Path to db(n)").
The problem is that as the connection pool count increases, the RAM consumed by the application keeps increasing, until at some point it breaks down the application.
So I am thinking of migrating from Neo4j to some other database.
Additionally, only a small part of my database is utilizing the graph structure.
One way to migrate is to write a script for it. Is there any better option?
My other question is: which database would be best so that my structure can be maintained?
Another option I am considering is to keep part of my data in Neo4j and move the rest to some other database.
If anything is unclear I can clarify.
Thanks in advance.
An EmbeddedGraphDatabase instance is not the equivalent of a "connection" in SQL. It's designed to run a long time (days, months). Hence starting/stopping is costly.
What is the use case for having hundreds of separate databases in the same JVM?
Lots of small databases will perform poorly, as the graph database is designed to hold the whole data model on a single host.
Do you run a single JVM per database?
You can control the amount of memory used by Neo4j by providing the correct memory-mapping properties; you can also use the GCR cache from neo4j-enterprise and control its size via the cache-size properties.
I think it still makes sense to keep the graph part in Neo4j and only move the non-graphy part.
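For illustration, a sketch of passing such settings to the 1.x-era embedded API; the property names and sizes below are examples, so check the documentation for your exact release:

```java
import java.util.HashMap;
import java.util.Map;

import org.neo4j.kernel.EmbeddedGraphDatabase;

public class TunedEmbeddedDb {
    public static EmbeddedGraphDatabase open(String storeDir) {
        Map<String, String> config = new HashMap<String, String>();
        // Memory-mapped IO for the store files (sizes are placeholders).
        config.put("neostore.nodestore.db.mapped_memory", "100M");
        config.put("neostore.relationshipstore.db.mapped_memory", "500M");
        config.put("neostore.propertystore.db.mapped_memory", "300M");
        // Enterprise-only GC-resistant cache with bounded cache sizes.
        config.put("cache_type", "gcr");
        config.put("node_cache_size", "512M");
        config.put("relationship_cache_size", "512M");
        return new EmbeddedGraphDatabase(storeDir, config);
    }
}
```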

Updating large amounts of data in Rails App

I have a Rails app with a table of about 30 million rows that I build from a text document my data provider gives me quarterly. From there I do some manipulation and comparison with some other tables and create an additional table with more customized data.
My first time doing this, I ran a ruby script through Rails console. This was slow and obviously not the best way.
What is the best way to streamline this process and update it on my production server without any, or at least very limited downtime?
This is the process I'm thinking is best for now:
Create rake tasks for reading in the data. Use the activerecord-import plugin to do batch writes and to turn off ActiveRecord validations. Load this data into brand new, duplicate tables.
Build indexes on newly created tables.
Rename newly created tables to the names the rails app is looking for.
Delete the old.
All of this I'm planning on doing right on the production server.
Is there a better way to do this?
Other notes from comments:
Tables already exist
Old tables and data are disposable
Tables can be locked for select only
Must minimize downtime
Our current server situation is 2 High CPU Amazon EC2 instances. I believe they have 1.7GB of RAM so storing the entire import temporarily is probably not an option.
New data is raw text file, line delimited. I have the script for parsing it already written in Ruby.
1) create "my_table_new" as an empty clone of "my_table"
2) import the file (in batches of x lines) into my_new_table - indexes built as you go.
3) Run: RENAME TABLE my_table TO my_table_old, my_table_new TO my_table;
Doing this as one command makes it instant (close enough) so virtually no downtime. I've done this with large data sets, and as its the rename that's the 'switch' you should retain uptime.
Depending on your logic, I would seriously consider processing the data in the database using SQL. This is close to the data and 30m rows is typically not a thing you want to be pulling out of the database and comparing to other data you have also pulled out of the database.
So think outside of the Ruby on Rails box.
SQL has built-in capabilities to join, compare, insert, and update data; those capabilities can be very powerful and fast, allowing the data to be processed close to where it lives.

RavenDB - Planning for scalability

I have been learning RavenDB recently and would like to put it to use.
I was wondering what advice or suggestions people had around building the system in a way that is ready to scale, specifically sharding the data across servers, but that can start on a single server and only grow as needed.
Is it advisable, or even possible, to create multiple databases on a single instance and implement sharding across them? Then, to scale, it would simply be a matter of spreading these databases across the machines.
My first impression is that this approach would work, but I would be interested to hear the opinions and experiences of others.
Update 1:
I have been thinking more on this topic. My problem with the "sort it out later" approach is that it seems difficult to spread data evenly across servers in that situation. I will not have a string key which I can range on (A-E, F-M, ...); it will be done with numbers.
This leaves two options I can see. Either break it at boundaries, so 1-50000 is on shard 1 and 50001-100000 is on shard 2, but then with a site that ages, like this one, your original shards will be doing a lot less work. Alternatively, a strategy that round-robins the shards and puts the shard id into the key will suffer if you need to move a document to a new shard: it would change the key and break URLs that have used it.
So my new idea, and again I am putting it out there for comment, would be to create a bucketing system from day one (sketched in code below). It works like stuffing the shard id into the key, but you start with a large number of buckets, say 1000, which you distribute evenly across the shards. Then when it comes time to split the load onto a new shard, you can, say, move buckets 501-1000 to the new server and write your shard logic so that 1-500 goes to shard 1 and 501-1000 goes to shard 2. Then when a third server comes online you pick another range of buckets and adjust.
To my eye this gives you the ability to split into as many shards as you originally created buckets, spreading the load evenly both in terms of quantity and age, without having to change keys.
Thoughts?
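To make the bucketing idea above concrete, here is a minimal routing sketch (the bucket count, range map and class names are all invented for illustration):

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative bucket-to-shard router: documents hash into a fixed number of
// buckets from day one, and only the bucket->shard map changes when you split.
public class BucketRouter {
    private static final int BUCKET_COUNT = 1000;

    // Upper bucket bound (inclusive) -> shard id. After the first split:
    // buckets 1-500 stay on shard 1, buckets 501-1000 move to shard 2.
    private final TreeMap<Integer, Integer> bucketRanges = new TreeMap<Integer, Integer>();

    public BucketRouter() {
        bucketRanges.put(500, 1);
        bucketRanges.put(1000, 2);
    }

    // The bucket is derived from the id and baked into the document key,
    // so moving a bucket to another shard never changes any key or URL.
    public int bucketFor(long documentId) {
        return (int) (documentId % BUCKET_COUNT) + 1;   // 1..1000
    }

    public int shardFor(long documentId) {
        Map.Entry<Integer, Integer> range = bucketRanges.ceilingEntry(bucketFor(documentId));
        return range.getValue();
    }
}
```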
It is possible, but really unnecessary. You can start using one instance, and then scale when necessary by setting up sharding later.
Also see:
http://ravendb.net/documentation/docs-sharding
http://ayende.com/blog/4830/ravendb-auto-sharding-bundle-design-early-thoughts
http://ravendb.net/documentation/replication/sharding
I think a good solution is to use virtual shards. You can start with one server and point all virtual shards to a single server. Use modulo on the incremental id to evenly distribute the rows across the virtual shards. With Amazon RDS you have the option to turn a slave into a master, so before you change the sharding configuration (point more virtual shards to the new server), you should make a slave a master, then update your configuration file, and then delete all the records on the new master whose modulo doesn't fall within the shard range that you use for the new instance.
You also need to delete rows from the original server, but by now all the new data with IDs whose modulo falls within the new virtual shard ranges will point to the new server. So you don't actually need to move the data, but instead take advantage of Amazon RDS's server promotion feature.
You can then make a replica off the original server. You create a unique ID as: shard ID + table type ID + incremental number. So when you query the database, you know which shard to go to and fetch the data from.
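As a rough illustration of that composite key, a small sketch; the "shardId-tableTypeId-sequence" string format is an assumption, not anything prescribed by RavenDB or RDS:

```java
// Hypothetical composite key: shard id + table type id + incremental number,
// rendered as a string such as "3-12-457812".
public final class ShardedKey {
    public final int shardId;
    public final int tableTypeId;
    public final long sequence;

    public ShardedKey(int shardId, int tableTypeId, long sequence) {
        this.shardId = shardId;
        this.tableTypeId = tableTypeId;
        this.sequence = sequence;
    }

    @Override
    public String toString() {
        return shardId + "-" + tableTypeId + "-" + sequence;
    }

    // Reading the shard id back out of a key tells you which database to query.
    public static ShardedKey parse(String key) {
        String[] parts = key.split("-");
        return new ShardedKey(Integer.parseInt(parts[0]),
                              Integer.parseInt(parts[1]),
                              Long.parseLong(parts[2]));
    }
}
```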
I don't know how possible it is to do this with RavenDB, but it can work pretty well with Amazon RDS, because Amazon already provides you with replication and a server promotion feature.
I agree that there should be a solution that offers seamless scalability right from the start, instead of telling the developer to sort the problems out when they occur. Furthermore, I've found that many NoSQL solutions that evenly distribute data across shards need to work within a low-latency cluster, so you have to take that into consideration. I've tried using Couchbase with two different EC2 machines (not in a dedicated Amazon cluster) and data balancing was very, very slow. That adds to the overall cost too.
I also want to add that what Pinterest did to solve their scalability issues was to use 4096 virtual shards.
You also need to look into paging issues with many NoSQL databases. With that approach you can page data quite easily, but maybe not in the most efficient way, because you might need to query several databases. Another problem is changing the schema. Pinterest solved this by putting all the data in a JSON blob in MySQL. When you want to add a new column, you create a new table with the new column's data + key, and put an index on that column. If you need to query the data, for example by email, you can create another table with the emails + IDs and put an index on the email column. Counters, I mean atomic counters, are another problem; it's better to take those counters out of the JSON and put them in a column so you can increment the counter value.
There are great solutions out there, but at the end of the day you find out that they can be very expensive. I preferred spending time on building my own sharding solution and sparing myself the headache later on. If you choose the other path, there are plenty of companies waiting for you to get into trouble and then ask for quite a lot of money to solve your problems, because at the moment you need them, they know you will pay anything to make your project work again. That's from my own experience; that's why I am breaking my head to build my own sharding solution using your approach, which will also be much cheaper.
Another option is to use middleware solutions for MySQL like ScaleBase or DBshards. That way you can continue working with MySQL, but when the time comes to scale, they have a well proven solution. And the costs might be much lower than the alternative.
Another tip: when you create your config for shards, put a write_lock attribute that accepts false or true. When it is false, data won't be written to that shard, so when you fetch the list of shards for a specific table type (i.e. users), new data will be written only to the other shards of that same type. This is also good for backups: you can show a friendly error to visitors while you lock all the shards to get point-in-time snapshots of all of them. Although I think with Amazon RDS you can send a global request to snapshot all the databases using point-in-time backup.
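A small sketch of that per-shard write_lock flag (class and field names are invented): only shards whose write_lock is false receive new writes, which is also how every shard can be frozen during a backup.

```java
import java.util.ArrayList;
import java.util.List;

public class ShardConfigExample {
    static class Shard {
        final String name;
        boolean writeLock;          // true = no new data is written to this shard
        Shard(String name, boolean writeLock) { this.name = name; this.writeLock = writeLock; }
    }

    // Pick only the shards that currently accept writes for a given table type.
    static List<Shard> writableShards(List<Shard> shardsForTableType) {
        List<Shard> writable = new ArrayList<Shard>();
        for (Shard s : shardsForTableType) {
            if (!s.writeLock) {
                writable.add(s);
            }
        }
        return writable;
    }
}
```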
The thing is that most companies won't spend time working on a DIY sharding solution; they will prefer paying for ScaleBase. Those solutions come from single developers who can't afford paying for a scalable solution from the start, but want to rest assured that when they reach the point where they need it, they have a solution. Just look at the prices out there and you can figure out that it will cost you A LOT. I will gladly share my code with you once I'm done. You are going down the best path in my opinion; it all depends on your application logic. I model my database to be simple, with no joins and no complicated aggregation queries, and this solves many of my problems. In the future you can use MapReduce to address those big-data query needs.
