graph database revision control - ruby-on-rails

GitHub for Neo4J?
I'm evaluating graph databases as a possible solution for modeling a complex computer network. It occurs to me that something like a revision control system would be useful for planning and testing updates to the database. I had been assuming that we would instantiate a test network graph for such planning and then write a routine to sync the changes.
I see that this question has been asked and answered for relational databases (How do you maintain revision control of your database structure?). But I'm asking for graph databases, probably Neo4J.
In that relational thread someone pitches the Rails approach of making rollback a required element of database development. I like this idea too; I'm not sure how easy it is in graph databases.
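To make the idea concrete, what I have in mind is something like a Rails migration with paired up/down steps expressed as Cypher; everything below (the class name, labels, and the run_cypher stub) is invented purely for illustration:

# Hypothetical "graph migration" in the spirit of Rails migrations.
# run_cypher is a stand-in for whatever Neo4j driver call you actually use.
class AddBackupLinkToCoreRouters
  def up
    run_cypher <<~CYPHER
      MATCH (a:Router {name: 'core-1'}), (b:Router {name: 'core-2'})
      MERGE (a)-[:BACKUP_LINK]->(b)
    CYPHER
  end

  def down
    run_cypher <<~CYPHER
      MATCH (:Router {name: 'core-1'})-[r:BACKUP_LINK]->(:Router {name: 'core-2'})
      DELETE r
    CYPHER
  end

  private

  # Stub: just print the statement so the sketch runs standalone.
  def run_cypher(statement)
    puts statement
  end
end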
How is this handled in the real world?

I found your question while also searching for an answer, so I don't have tested solutions to offer. But I can share that there's some discussion of this at How do I implement revisions with neo4j?, including a specific case at Neo4j / Strategy to keep history of node changes.
There's also a more detailed blog post at http://iansrobinson.com/2014/05/13/time-based-versioned-graphs/, which weighs the read-time / write-time / storage requirements of several alternatives. It also includes a number of diagrams and example queries that helped me wrap my head around what all this would look like.
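To give a rough feel for the shape these approaches take, here is a generic "separate identity from state" sketch in Cypher, wrapped in Ruby strings; this is not necessarily exactly what the post recommends, and the labels, relationship type, properties, and far-future sentinel are all made up:

# Close off the current state and attach a new, time-stamped one.
supersede_state = <<~CYPHER
  MATCH (r:Router {id: $id})-[cur:STATE]->(:RouterState)
  WHERE cur.to = 9999999999999
  SET cur.to = timestamp()
  CREATE (r)-[:STATE {from: timestamp(), to: 9999999999999}]->(s:RouterState)
  SET s = $props
CYPHER

# Read the graph "as of" time $t by filtering on the from/to interval.
as_of = <<~CYPHER
  MATCH (r:Router {id: $id})-[s:STATE]->(state:RouterState)
  WHERE s.from <= $t AND $t < s.to
  RETURN state
CYPHER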
Hope that's still useful, lo these months later, and sorry I can't be of more help! If you've found something that works in the meantime, can you let us know?

Neo4J end user interface

I need to share a Neo4J graph visualization with end users. They should be able to interact with the graph, and perform some very basic querying. For example:
- show me the relationships up to 3 hops away from node named 'Joe'
A first option would be to just give them the standard user interface (usually exposed at port 7474); however, this is too powerful, since they could run arbitrary Cypher.
Is there any way of restricting this interface (so that they cannot trigger expensive queries or even graph updates)? Or maybe other open source / community alternatives?
Thanks
If you are using the Enterprise Edition of Neo4j, you will have access to extensive authentication and authorization capabilities, including the ability to assign a reader role to specific user names.
If you do want to use the standard browser interface, you can apply some settings in the neo4j.conf file that may help you out:
dbms.transaction.timeout=10s
dbms.read_only=true
dbms.transaction.timeout will terminate queries exceeding the timeout, so that can prevent expensive queries.
dbms.read_only makes the entire db instance read-only.
You may also build a custom web UI that calls the REST endpoint (you will need to pass authentication in the request headers; see the sketch below)
or
create an unmanaged extension
https://neo4j.com/docs/java-reference/3.1/#server-unmanaged-extensions
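If you go the custom-UI route, a minimal sketch of calling the transactional HTTP endpoint with basic-auth headers could look like the following (the host, credentials, and the example "up to 3 hops from Joe" query are all placeholders):

require "net/http"
require "json"
require "uri"

# Run a parameterised, read-only Cypher query through Neo4j's
# transactional HTTP endpoint, authenticating via request headers.
uri = URI("http://localhost:7474/db/data/transaction/commit")

request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json",
                                   "Accept"       => "application/json")
request.basic_auth("reader_user", "reader_password")
request.body = {
  statements: [{
    statement:  "MATCH (n {name: $name})-[*1..3]-(m) RETURN DISTINCT m LIMIT 100",
    parameters: { name: "Joe" }
  }]
}.to_json

response = Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
puts JSON.parse(response.body)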
I suggest chapter 8 of the excellent book Learning Neo4j, by Rik Van Bruggen. The book is available for download on the Neo4j web site.
One of the sections of this chapter shows some open source visualization libraries and visualization solutions.
EDIT 1:
Looking a bit more at chapter 8 of the Learning Neo4j book, I believe a promising tool for your use case is the paid solution Linkurio.us (you can run a demo on its site). It has native integration with Neo4j and other graph databases.
EDIT 2:
Alternatively, you can build your own visualization solution with a graph visualization library in JavaScript, for example. Here is a very useful answer from another StackOverflow question that lists some libraries that can help you.

Geodata Querying Optimisations

I am planning to write a Node.js-powered RESTful web service that I will use for a mobile application which provides some sort of location based features. The most basic use case is going to look something like this:
- the user can create a resource by sending a request to the web service containing the resource's name and the user's current location (latitude and longitude)
- the web service will store the metadata about this resource internally in some sort of collection
- the user can query the web service for a list of resources within 5km of his current location
One of the first problems that came to my mind was scalability. Let's suppose that at some point in the future the server will hold metadata for 1 million resources. When a user queries for nearby results, looping through 1 million entries to compute distances will take forever.
There are many services out there that have the same flow, so I thought implementing something like this was not going to take me a lot of time. I might have been wrong.
I am now two days into researching proven methods and algorithms. By now I have read everything I could get my hands on about QuadTrees, Geohashes, databases with spatial indexing support, formulas and so on. However, I still can't get the whole picture of how everything is going to work.
I was hoping that maybe someone who has worked on something similar could share his insight on what approach might be the most suitable considering this use case and the technologies that I am planning to use. Also, a short description of how it can be implemented would help me a lot!
For those who are also looking for more information on this topic out of curiosity, my answer might not provide much clarity. However, some of the answers here might help you understand how you could achieve proximity searches using Geohashes.
My approach, after doing a little research on Redis, will be to not overcomplicate things and just use the tools that are already out there. It has out-of-the-box support for spatial indexing and will most probably meet all my persistence requirements for this project.
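A minimal sketch of what that looks like with Redis's geo commands (GEOADD/GEORADIUS, available from Redis 3.2; shown here in Ruby via the redis gem, with the key and member names made up):

require "redis"   # gem install redis

redis = Redis.new  # assumes a local Redis 3.2+ server

# Index each resource by longitude/latitude under a single geo key.
redis.geoadd("resources", -73.9857, 40.7484, "resource:42")
redis.geoadd("resources", -73.9772, 40.7527, "resource:43")

# "Everything within 5 km of the user's current position."
nearby = redis.georadius("resources", -73.9800, 40.7500, 5, "km")
puts nearby.inspect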
Apparently MongoDB also comes with built-in support for geodata. In fact, even RDBMS like MySQL or SQLite do come with such capabilities.

Document-Oriented or Graph databases

It's a RoR project.
We want to store user activities, like uploaded a photo, voted for somebody, followed somebody, etc. When listing the activities, we need to list the user's friends' activities as well. So, what is better to use in this case: a document-oriented database (CouchDB, MongoDB), a graph database (Neo4j), or maybe some other approach?
Thank you for helping in advance guys :)
Yeah, I think Neo4j is a good choice. The Rails 3 support is excellent (see https://github.com/andreasronge/neo4j), there are social examples with Cypher at http://docs.neo4j.org/chunked/snapshot/data-modeling-examples.html, and for activity streams there are various cool approaches like Graphity, see http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/
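For a rough feel of what that looks like, here is a domain sketch with the Rails 3-era neo4j.rb gem; the class names, properties, and associations are invented, and the gem API is written from memory, so check the gem's README before relying on it:

# Sketch only -- names invented, API from memory of the Rails 3-era
# neo4j.rb gem (github.com/andreasronge/neo4j).
class Person < Neo4j::Rails::Model
  property :name
  has_n :friends
  has_n :activities
end

class Activity < Neo4j::Rails::Model
  property :kind        # "uploaded_photo", "voted", "followed", ...
  property :created_at
end

# Naive friends' feed: walk person -> friends -> activities and sort in Ruby.
# Fine for small fan-outs; for large feeds see the Graphity model linked above.
def feed_for(person, limit = 20)
  person.friends.flat_map { |friend| friend.activities.to_a }
        .sort_by(&:created_at).reverse.first(limit)
end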
Depending on the scale of your application, and volume of activity, I'd recommend a combination of Couchbase (not CouchDB) for actual activity data which is extremely scalable and fast, and Neo4J for the graph discovery (both databases at the same time). I've used the combination very effectively in my application that was both social and real-time.
If you want more info from me, please feel free to contact me directly and I can help with architectural decisions or implementation help.
Take a look at InfiniteGraph. It is scalable, unlike Neo4j. I think they have a free download for 1 million nodes.
Consider using SQLite. It is a single-file database that can be used as an embedded database.

Free data warehousing systems--specifically, for data storage

I am building out some reporting stuff for our website (a decent sized site that gets several million pageviews a day), and am wondering if there are any good free/open source data warehousing systems out there.
Specifically, I am looking for only something to store the data--I plan to build a custom front end/UI to it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for if (and I'm going by your description above here) you need something to store large amounts of logged data, i.e. a visitor log.
Hypertable is based on Google's BigTable project.
See http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks.
You lose the relational capabilities of SQL-based DBs, but you gain a lot in performance. You could easily use Hypertable to store millions of rows per hour (hard drive space permitting).
Hope that helps.
I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball's "The Data Warehouse Toolkit", you will find that all it takes for a basic DW is a plain-vanilla SQL database; in other words, you could build a decent DW with MySQL using MyISAM for the storage engine. The question is only in the desired granularity of information -- what you want to keep and for how long. If your reports are mostly periodic, and you implement a report storage or cache, then you don't need to store pre-calculated aggregations (no need for cubes). In other words, a Kimball star with cached reporting can provide decent performance in many cases.
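Since this thread is Rails-tagged, here is a minimal sketch of what such a Kimball-style star could look like on plain MySQL/MyISAM, written as an old-style Rails migration; every table and column name below is invented:

# Invented star schema: one pageview fact table plus date and page dimensions.
# Old Rails 2.x/3.x migration syntax; adjust to your Rails version.
class CreatePageviewStar < ActiveRecord::Migration
  def self.up
    create_table :dim_dates do |t|
      t.date    :calendar_date
      t.integer :year, :month, :day_of_week
    end

    create_table :dim_pages do |t|
      t.string :url
      t.string :section
    end

    create_table :fact_pageviews, :options => "ENGINE=MyISAM" do |t|
      t.integer :dim_date_id
      t.integer :dim_page_id
      t.integer :view_count,   :default => 0
      t.integer :unique_users, :default => 0
    end

    add_index :fact_pageviews, [:dim_date_id, :dim_page_id]
  end

  def self.down
    drop_table :fact_pageviews
    drop_table :dim_pages
    drop_table :dim_dates
  end
end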
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
- Open source
- Uses standard relational database
- MDX (think pivot table)
- ETL (via Kettle)
I use this.
In addition to Mike's answer about Hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine for reporting, you might find their mapreduce implementation useful as well.
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from the most revolutionary team in database technologies. They just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising it.
Perhaps you should have a look at TPC and see which of their test problem datasets best matches your case, and work from there.
Also consider the need for concurrency; it adds a big overhead for any kind of approach and sometimes is not really required. For example, you can pre-digest some summary or index data and only have that protected for high concurrency. Profiling your data queries is the next step.
About SQL, I don't like it either but I don't think it's smart ruling out an engine just because of the front-end language.
I am facing a similar problem and am thinking of using plain MyISAM with http://www.jitterbit.com/ as the data access layer. Jitterbit (or another similar free tool) seems very nice for this sort of transformation.
Hope this helps a bit.
A lot of people just use MySQL or Postgres :)

Ruby on Rails scalability/performance? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I have used PHP for a while now and have used it well with CodeIgniter, which is a great framework. I am starting on a new personal project, and the last time I was considering what to use (PHP vs ROR) I chose PHP because of the scalability problems I heard ROR had, especially after reading what the Twitter devs had to say about it. Is scalability still an issue in ROR, or have there been improvements to it?
I would like to learn a new language, and ROR seems interesting. PHP gets the job done but as everyone knows its syntax and organization are fugly and it feels like one big hack.
To expand on Ryan Doherty's answer a bit...
I work in a statically typed language for my day job (.NET/C#), as well as Ruby as a side thing. Prior to my current day job, I was the lead programmer for a ruby development firm doing work for the New York Times Syndication service. Before that, I worked in PHP as well (though long, long ago).
I say that simply to say this: I've experienced rails (and more generally ruby) performance problems first hand, as well as a few other alternatives. As Ryan says, you aren't going to have it automatically scale for you. It takes work and immense amounts of patience to find your bottlenecks.
A large majority of the performance issues we saw from others and even ourselves were dealing with slow performing queries in our ORM layer. We went from Rails/ActiveRecord to Rails/DataMapper and finally to Merb/DM, each iteration getting more speed simply because of the underlying frameworks.
Caching does amazing wonders for performance. Unfortunately, we couldn't cache our data. Our cache would effectively be invalidated every five minutes at most. Nearly every single bit of our site was dynamic. So if/when you can't do that, perhaps you can learn from our experience.
We had to end up seriously fine tuning our database indexes, making sure our queries weren't doing very stupid things, making sure we weren't executing more queries than was absolutely necessary, etc. When I say "very stupid things", I mean the 1 + N query problem...
# 1 query
Dog.find(:all).each do |dog|
  # N queries
  dog.owner.siblings.each do |sibling|
    # N queries per above N query!
    sibling.pets.each do |pet|
      # Do something here
    end
  end
end
DataMapper is an excellent way to handle the above problem (there are no 1 + N problems with it), but an even better way is to use your brain and stop doing queries like that. When you need raw performance, most of the ORM layers won't easily handle extremely custom queries, so you might as well hand write them.
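For what it's worth, ActiveRecord can also sidestep the worst of this with per-query eager loading (old Rails 2.x-style syntax; this assumes owner, siblings, and pets associations are declared on the corresponding models):

# A fixed handful of queries instead of 1 + N + N*M; association names
# assumed to be declared on Dog, Owner, and Sibling.
dogs = Dog.find(:all, :include => { :owner => { :siblings => :pets } })

dogs.each do |dog|
  dog.owner.siblings.each do |sibling|
    sibling.pets.each do |pet|
      # Work with pet here -- no extra queries fired inside the loops.
    end
  end
end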
We also did common sense things. We bought a beefy server for our growing database and moved it off onto its own dedicated box. We also had to do TONS of processing and data importing constantly. We moved our processing off onto its own box as well. We also stopped loading our entire freaking stack just for our data import utilities. We tastefully loaded only what we absolutely needed (thus reducing memory overhead!).
If you can't tell already... generally, when it comes to ruby/rails/merb, you have to scale out, throwing hardware at the problem. But in the end, hardware is cheap; though that's no excuse for shoddy code!
And even with these difficulties, I personally would never start projects in another framework if I can help it. I'm in love with the language, and continually learn more about it every day. That's something that I don't get from C#, though C# is faster.
I also enjoy the open source tools, the low cost to start working in the language, the low cost to just get something out there and try to see if it's marketable, all the while working in a language that often times can be elegant and beautiful...
In the end, it's all about what you want to live, breathe, eat, and sleep in day in and day out when it comes to choosing your framework. If you like Microsoft's way of thinking, go .NET. If you want open source but still want structure, try Java. If you want to have a dynamic language and still have a bit more structure than ruby, try python. And if you want elegance, try Ruby (I kid, I kid... there are many other elegant languages that fit the bill. Not trying to start a flame war.)
Hell, try them all! I tend to agree with the answers above that worrying about optimizations early isn't the reason you should or shouldn't pick a framework, but I disagree that this is their only answer.
So in short, yes there are difficulties you have to overcome, but the elegance of the language, imho, far outweighs those shortcomings.
Sorry for the novel, but I've been there and back with performance issues. It can be overcome. So don't let that scare you off.
RoR is being used with lots of huge websites, but as with any language or framework, it takes a good architecture (db scaling, caching, tuning, etc) to scale to large numbers of users.
There's been a few minor changes to RoR to make it easier to scale, but don't expect it to scale magically for you. Every website has different scaling issues, so you'll have to put in some work to make it scale.
Develop in the technology that is going to give your project the best chance of success - quick to develop in, easy debugging, easy deployment, good tools, you know it inside out (unless the point is to learn a new language), etc.
If you get tens of millions of uniques a month you can always hire in a couple of people and rewrite in a different technology if you need to, as ...
... you'll be rake-ing in the cache (sorry - couldn't resist!!)
First of all, it would perhaps make more sense to compare Rails to Symfony, CodeIgniter or CakePHP, since Ruby on Rails is a complete web application framework. Compared to PHP or PHP frameworks, Rails applications offer the advantage of being small, clean, and readable. PHP is perfect for small, personal pages (originally it stood for "Personal Home Page"), while Rails is a full MVC framework which can be used to build large sites.
Ruby on Rails does not have a bigger scalability problem than comparable PHP frameworks. Both Rails and PHP will scale well if you have only a moderate number of users (10,000-100,000) operating on a similar number of objects. For a few thousand users a classic monolithic architecture will be sufficient. With a bit of M&M (Memcached and MySQL) you can also handle millions of objects. The M&M architecture uses a MySQL server to handle writes and Memcached to handle high read loads. The traditional storage pattern, a single SQL server using normalized relational tables (or at best a SQL master/multiple-read-slaves setup), no longer works for very large sites.
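As a concrete illustration of the Memcached half of that pattern, a read-through cache in Rails might look like this (the key name, query, and 5-minute TTL are arbitrary examples):

# Serve hot reads from Memcached, falling back to MySQL on a miss.
# Assumes Rails.cache is configured with a memcached-backed store.
def top_stories
  Rails.cache.fetch("top_stories", :expires_in => 5.minutes) do
    Story.find(:all, :order => "score DESC", :limit => 20)
  end
end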
If you have billions of users like Google, Twitter and Facebook, then probably a distributed architecture will be better. If you really want to scale your application without limit, use some kind of cheap commodity hardware as a foundation, divide your application into a set of services, keep each component or service scalable itself (design every component as a scalable service), and adapt the architecture to your application. Then you will need suitable scalable datastores like NoSQL databases and distributed hash tables (DHTs), you will need sophisticated map-reduce algorithms to work with them, and you will have to deal with SOA, external services, and messaging. Neither PHP nor Rails offers a magic bullet here.
What it breaks down to with RoR is that unless you're in Alexa's top 100, you will not have any scalability problems. You'll have more issues with stability on shared hosting unless you can squeeze Phusion Passenger or Mongrel out of it.
Take a little while to look at the problems the Twitter people had to deal with, then ask yourself if your app is going to need to scale to that level.
Then build it in Rails anyway, because you know it makes sense. If you get to Twitter-level volumes then you'll be in the happy position of considering performance optimisation options. At least you'll be applying them in a nice language!
You can't compare PHP and ROR directly: PHP is a scripting language, like Ruby, and Rails is a framework, like CakePHP.
That said, I strongly suggest Rails, because you will have an application strictly organized in the MVC pattern, and this is a must for your scalability requirement. (Using PHP you would have to take care of the project organization on your own.)
But when it comes to scalability, Rails is not just MVC. For instance, you can start developing your application against one database and change it along the way with little effort (in most cases), so a Rails application is (almost) database independent thanks to its ORM (which lets you avoid writing database queries by hand), and you can do a lot of other stuff. (Take a look at this video: http://www.youtube.com/watch?v=p5EIrSM8dCA )
Just wanted to add some more info to Keith Hanson's smart point about 1 + N problem where he states:
DataMapper is an excellent way to handle the above problem (there are no 1 + N problems with it), but an even better way is to use your brain and stop doing queries like that. When you need raw performance, most of the ORM layers won't easily handle extremely custom queries, so you might as well hand write them.
Doctrine is one of the most popular ORMs for PHP. It addresses this 1 + N complexity problem intrinsic to ORMs by providing a language called Doctrine Query Language (DQL). This allows you to write SQL-like statements that use your existing model relationships, e.g.:
$q = Doctrine_Query::create()
    ->select('m.*, b.*')
    ->from('ModelA m')
    ->leftJoin('m.ModelB b')
    ->execute();
I'm getting the impression from this thread that the scalability issues of ROR come down primarily to the mess that ORMs are in with regard to loading child objects - ie the '1+N' problem mentioned above. In the above example that Ryan gave with dogs and owners:
Dog.find(:all).each do |dog|
  # N queries
  dog.owner.siblings.each do |sibling|
    # N queries per above N query!!
    sibling.pets.each do |pet|
      # Do something here
    end
  end
end
You could actually write a single SQL statement to get all that data, and you could also 'stitch' that data up into the Dog.Owner.Siblings.Pets object hierarchy of your custom-written objects. But could someone write an ORM that did that automatically, so that the above example would incur a single round-trip to the DB and a single SQL statement, instead of potentially hundreds? Totally. Just join those tables into one dataset, then do some logic to stitch it up. It's a bit tricky to make that logic generic so it can handle any set of objects, but it's not the end of the world. In the end, tables and objects only relate to each other in one of three categories (1:1, 1:many, many:many). It's just that no one ever built that ORM.
You need a syntax that tells the system upfront what children you want to load for this particular query. You can sort of do this with the 'eager' loading of LinqToSql (C#), which is not a part of ROR, but even though that results in one round trip to the DB, it's still hundreds of separate SQL statements the way it has currently been set up. It's really more about the history of ORMs. They just got started down the wrong path with that and never really recovered, in my opinion. 'Lazy loading' is the default behavior of most ORMs, i.e. incurring another round trip for every mention of a child object, which is crazy. And 'eager' loading -- loading the children upfront -- is set up statically in everything I am aware of outside of LinqToSql, i.e. which children always load with certain objects, as if you would always need the same children loaded when you loaded a collection of Dogs.
You need some kind of strongly-typed syntax saying that this time I want to load these children and grandchilren. Ie, something like:
Dog.Owners.Include()
Dog.Owners.Siblings.Include()
Dog.Owners.Siblings.Pets.Include()
then you could issue this command:
Dog.find(:all).each do |dog|
The ORM system would know what tables it needs to join, then stitch up the resulting data into the OM hierarchy. It's true that you can throw hardware at the current problem, which I'm generally in favor of, but that's no reason the ORM (i.e. Hibernate, Entity Framework, Ruby ActiveRecord) shouldn't just be better written. Hardware really doesn't bail you out of an 8-round-trip, 100-SQL-statement query that should have been one round trip and one SQL statement.
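To make the "stitch it up" step concrete, here is a toy Ruby sketch that groups flat rows from a single joined query back into that nested shape (all column names are invented):

# Toy stitch-up: flat joined rows -> dog -> owner -> siblings -> pets.
rows = [
  { :dog_id => 1, :owner_id => 10, :sibling_id => 100, :pet_id => 1000 },
  { :dog_id => 1, :owner_id => 10, :sibling_id => 100, :pet_id => 1001 },
  { :dog_id => 1, :owner_id => 10, :sibling_id => 101, :pet_id => 1002 },
]

dogs = rows.group_by { |r| r[:dog_id] }.map do |dog_id, dog_rows|
  siblings = dog_rows.group_by { |r| r[:sibling_id] }.map do |sibling_id, sib_rows|
    { :id => sibling_id, :pet_ids => sib_rows.map { |r| r[:pet_id] }.uniq }
  end
  { :id => dog_id, :owner => { :id => dog_rows.first[:owner_id], :siblings => siblings } }
end

p dogs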
