Sesame scalability

How do I scale up Sesame? I'm planning to store a lot of triples in my Sesame store and I'm wondering what I should do in order to have a scalable solution.
Ideally I would like my (native) store distributed among several Sesame instances, so a first question is: is there a way to "shard" Sesame? If so, could you please point me to some kind of documentation?
If not, should I rely on a relational backend store instead?
In general, other than hardware resources and front-end load balancers, what kind of support does Sesame provide for medium / big data scenarios?

There are several ways to scale up. I won't give you a complete overview of all the possibilities here, but I'll give you a few pointers instead.
A single Sesame native store scales to about 100-150 million triples on typical hardware. Beyond that, you can either use a third-party Sesame-compatible store such as USeekM, Bigdata, CumulusRDF or OWLIM (which scales well into the billions of triples), or you can use Sesame's own Federation SAIL. The federation members can be any combination of Sesame-compatible stores, including native stores running locally or remote stores accessible over HTTP.
The Federation SAIL distributes write operations using a simple size-dependent sharding algorithm, trying to distribute data over all members equally. Queries are of course automatically distributed and results re-integrated.
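For illustration, here is a minimal sketch of wiring up a federation over two local native stores. It assumes the Sesame 2.x API (org.openrdf packages, with the Federation SAIL module on the classpath); the directory paths are placeholders.

```java
import java.io.File;

import org.openrdf.repository.Repository;
import org.openrdf.repository.sail.SailRepository;
import org.openrdf.sail.federation.Federation;
import org.openrdf.sail.nativerdf.NativeStore;

public class FederationSetup {
    public static void main(String[] args) throws Exception {
        // Two local native stores acting as federation members.
        Repository shard1 = new SailRepository(new NativeStore(new File("/data/shard1")));
        Repository shard2 = new SailRepository(new NativeStore(new File("/data/shard2")));

        Federation fed = new Federation();
        fed.addMember(shard1);
        fed.addMember(shard2);

        // Wrap the federation in a repository; queries against it are fanned
        // out to all members and the results are merged transparently.
        Repository repo = new SailRepository(fed);
        repo.initialize();
    }
}
```

Remote stores (e.g. HTTPRepository instances pointing at other Sesame servers) can be added as members in exactly the same way.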

Sesame's relational backend is deprecated now; there is an explanation on their mailing list.
I am not sure, but I don't think Sesame scales well with its native backends. As far as I know, people tend to use, for example, OWLIM instead. You would perhaps need OWLIM-Enterprise (previously BigOWLIM Replication Cluster) if you want a cluster solution.
If Sesame is not a hard requirement, then many people use the clustered edition of Virtuoso to store large amounts of triples.

Related

Graph database in distributed env

I have a question concerning graph databases! Is there a mechanism for using graph databases in a distributed environment? I mean, can you distribute a graph database? Can you even traverse a graph database in a distributed environment?
You can definitely do it.
There are different databases which scale very well nowadays (JanusGraph, OrientDB, ArangoDB, etc.).
Even if you have a very big database which has to be scaled beyond a single datacenter to multiple geo-distributed datacenters, you still have options.
For example, you can use JanusGraph with a Cassandra / ScyllaDB storage backend. That gives you the option to asynchronously replicate all your data across datacenters.
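As a rough sketch of what that looks like in practice with JanusGraph's CQL storage adapter (the hostnames and replication factor below are placeholders, not recommendations):

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class JanusOnCassandra {
    public static void main(String[] args) {
        // Open a JanusGraph instance backed by a Cassandra/ScyllaDB cluster.
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "cassandra1.example.com,cassandra2.example.com")
                .set("storage.cql.replication-factor", 3)
                .open();

        // The graph API is unchanged; replication across datacenters is
        // handled underneath by the storage backend.
        graph.traversal().addV("person").property("name", "alice").iterate();
        graph.tx().commit();
        graph.close();
    }
}
```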
Of course, there are some issues to be solved, like consistency and so on, but with today's tools it's very possible to organize a distributed graph database.
Neo4j Enterprise Edition features clustering; read more at http://neo4j.com/docs/stable/ha.html.
Yes, you can use all sorts of graph databases in distributed environments. Can you distribute a graph database? Definitely yes.
BUT - distributing the same graph database in many different places (to speed up reads) is quite easy, and done all the time. Distributing a ridiculously massive database (so that parts of the graph database are in a bunch of different places) is quite hard.
I recommend this related question which talks about sharding and distributing databases. Pay particular attention to the bit about "sharding is an anti-pattern".

Neo4j sharding aspect

I was looking into the scalability of Neo4j, and read a document written by David Montag in January 2013.
Concerning the sharding aspect, he said the first release of 2014 would come with a first solution.
Does anyone know if it was done, or what its status is if not?
Thanks!
Disclosure: I'm working as VP Product for Neo Technology, the sponsor of the Neo4j open source graph database.
Now that we've just released Neo4j 2.0 (actually 2.0.1 today!) we are embarking on a 2.1 release that is mostly oriented around (even more) performance & scalability. This will increase the upper limits of the graph to an effectively unlimited number of entities, and improve various other things.
Let me set some context first, and then answer your question.
As you probably saw from the paper, Neo4j's current horizontal-scaling architecture allows read scaling, with writes all going to master and fanning out. This gets you effectively unlimited read scaling, and into the tens of thousands of writes per second.
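To illustrate the pattern (this is not our driver API, just a hypothetical application-side sketch, with GraphClient standing in for a connection handle): writes are pinned to the master, and reads are spread across the cluster.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of master-write / replica-read routing.
public class ClusterRouter {
    private final GraphClient master;
    private final List<GraphClient> members;
    private final AtomicInteger next = new AtomicInteger();

    public ClusterRouter(GraphClient master, List<GraphClient> members) {
        this.master = master;
        this.members = members;
    }

    public GraphClient forWrite() {
        // All writes go to the master and fan out to the other members.
        return master;
    }

    public GraphClient forRead() {
        // Reads scale horizontally: any member can serve them.
        return members.get(Math.floorMod(next.getAndIncrement(), members.size()));
    }
}

interface GraphClient {}
```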
Practically speaking, there are production Neo4j customers (including Snap Interactive and Glassdoor) with around a billion people in their social graph... in all cases behind an active and heavily-hit web site, being handled by comparatively quite modest Neo4j clusters (no more than 5 instances). So that's one key feature: the Neo4j of today has incredible computational density, and so we regularly see fairly small clusters handling substantially large production workloads... with very fast response times.
More on the current architecture can be found here: www.neotechnology.com/neo4j-scales-for-the-enterprise/
And a list of customers (which includes companies like Wal-Mart and eBay) can be found here: neotechnology.com/customers/ One of the world's largest parcel delivery carriers uses Neo4j to route all of their packages, in real time, with peaks of 3000 routing operations per second, and zero downtime. (This arguably is the world's largest and most mission-critical use of a graph database and of a NOSQL database; though unfortunately I can't say who it is.)
So in one sense the tl;dr is that if you're not yet as big as Wal-Mart or eBay, then you're probably ok. That oversimplifies it only a bit. There is the 1% of cases where you have sustained transactional write workloads into the 100s of thousands per second. However even in those cases it's often not the right thing to load all of that data into the real-time graph. We usually advise people to do some aggregation or filtering, and bring only the more important things into the graph. Intuit gave a good talk about this. They filter a billion B2B transactions into a much smaller number of aggregate monthly transaction relationships with aggregated counts and currency amounts by direction.
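As a hypothetical sketch of that aggregate-then-load step (the record shapes below are invented for illustration, not Intuit's actual pipeline), the filtering boils down to a group-and-sum before anything touches the graph:

```java
import java.time.YearMonth;
import java.util.HashMap;
import java.util.Map;

public class TransactionRollup {
    record Txn(String buyer, String seller, YearMonth month, long cents) {}
    record Key(String buyer, String seller, YearMonth month) {}
    record Aggregate(long count, long totalCents) {}

    // Collapse raw transactions into one aggregate per (buyer, seller, month);
    // only these aggregates become relationships in the real-time graph.
    static Map<Key, Aggregate> rollUp(Iterable<Txn> txns) {
        Map<Key, Aggregate> out = new HashMap<>();
        for (Txn t : txns) {
            out.merge(new Key(t.buyer(), t.seller(), t.month()),
                      new Aggregate(1, t.cents()),
                      (a, b) -> new Aggregate(a.count() + b.count(),
                                              a.totalCents() + b.totalCents()));
        }
        return out;
    }
}
```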
Enter sharding... Sharding has gained a lot of popularity these days. This is largely thanks to the other three categories of NOSQL, where joins are an anti-pattern: most queries involve reading or writing just a single piece of discrete data. Just as joining is an anti-pattern for key-value stores and document databases, sharding is an anti-pattern for graph databases. What I mean by that is... the very best performance will occur when all of your data is available in memory on a single instance, because hopping back and forth all over the network whenever you're reading and writing will slow things down significantly, unless you've been really really smart about how you distribute your data... and even then. Our approach has been twofold:
Do as many smart things as possible in order to support extremely high read & write volumes without having to resort to sharding. This gets you the best and most predictable latency and efficiency. In other words: if we can be good enough to support your requirement without sharding, that will always be the best approach. The link above describes some of these tricks, including the deployment pattern that lets you shard your data in memory without having to shard it on disk (a trick we call cache-sharding). There are other tricks along similar lines, and more coming down the pike...
Add a secondary architecture pattern into Neo4j that does support sharding. Why do this if sharding is best avoided? As more people find more uses for graphs, and data volumes continue to increase, we think eventually it will be an important and inevitable thing. This would allow you to run all of Facebook for example, in one Neo4j cluster (a pretty huge one)... not just the social part of the graph, which we can handle today. We've already done a lot of work on this, and have an architecture developed that we believe balances the many considerations. This is a multi-year effort, and while we could very easily release a version of Neo4j that shards naively (that would no doubt be really popular), we probably won't do that. We want to do it right, which amounts to rocket science.
TL;DR: With 2018 just days away, Neo4j still does not support sharding as it is typically understood.
Details: Neo4j still requires all data to fit on a single node. The node's contents can be replicated within a cluster, but actual sharding is not part of the picture.
When Neo4j talks of sharding, it is referring to caching portions of the database in memory: different slices are cached on different replicated nodes. That differs from, say, MySQL sharding, in which each node contains only a portion of the total data.
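A toy sketch of what that cache routing amounts to (the URL list and the hash-by-user scheme are assumptions for illustration): every node holds the full store on disk, but consistent routing keeps a different slice of the graph hot in each node's memory.

```java
public class CacheShardRouter {
    private final String[] replicaUrls;

    public CacheShardRouter(String... replicaUrls) {
        this.replicaUrls = replicaUrls;
    }

    public String replicaFor(long userId) {
        // The same user always hits the same replica, keeping that replica's
        // cache warm for that user's neighborhood of the graph.
        return replicaUrls[(int) Math.floorMod(userId, (long) replicaUrls.length)];
    }
}
```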
Here is a summary of their "take" on scalability (their product term is "High Availability"): https://neo4j.com/blog/neo4j-scalability-infographic/
Note that high availability is not the same as scalability, so they do not actually support the latter in the traditional understanding of the term.

In-memory database scalability

I have been exploring MMDB systems lately and I haven't been able to find much information with regards to how an in-memory database is supposed to scale. My quite basic assumption is that a main-memory DB is constrained by the memory available on the DB node, and by the operating system's management of that memory. So how can I expand an in-memory system's size beyond that of the available main memory? I assume the answer is along the lines of a distributed system, but I haven't got it clear in my head how that would work. And of course it's also possible I completely misunderstood the idea of an MMDB and I'm missing something obvious.
A bit of background on the question: I am writing a number of cross-platform mobile apps (even though my background is heavily involved with MySQL and MongoDB), and I don't like native database solutions like SQLite for Android and iOS. So I thought I'd write my own solution (site and github) in JavaScript (I'm working on Cordova/PhoneGap). I realised that I could make this a Node.js module and use it as a DB for a web app (I'm creating a blog powered by it as an experiment and it's working pretty well), but of course I'm now thinking of making it a separate tier, and I started thinking about the obvious limitation of memory size, hence my question.
In-memory databases scale in size the same way as on-disk (aka persistent) databases do: either throw more storage at it (memory, in this case) or distribute it across multiple nodes of a cluster. The latter alternative increases the complexity (both of the DBMS and of your administration of it) relative to an in-memory database on a single system. Consider the difference between vanilla MySQL and MySQL Cluster. And you'll want a really fast network for those times when the DBMS has to perform inter-node operations (e.g. distributing the data, or pulling data from multiple nodes to satisfy a query).
There's nothing particularly special about in-memory databases in this regard. There are some special optimizations in the database engine when you know the storage is memory, but it doesn't change the fundamental principles of database systems.
What you don't want to do is create an in-memory database larger than physical memory. You'll force the OS to swap in-memory database pages in and out of swap space, and the performance will suck. In that case you're better off using a conventional DBMS and giving it as much cache as you have memory available for: the DBMS will use the cache more intelligently than the OS will use the swap space.
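To make the scale-out alternative concrete, here is a toy sketch of client-side partitioning, with in-process maps standing in for the memory of separate cluster nodes (a real system would use network clients and consistent hashing rather than a plain modulus):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PartitionedStore {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public PartitionedStore(int nodeCount) {
        // Each map stands in for the RAM of one cluster node.
        for (int i = 0; i < nodeCount; i++) nodes.add(new ConcurrentHashMap<>());
    }

    private Map<String, String> nodeFor(String key) {
        // Stable key-to-node mapping; capacity becomes the sum of all the
        // nodes' memory rather than what fits on a single machine.
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    public void put(String key, String value) { nodeFor(key).put(key, value); }
    public String get(String key) { return nodeFor(key).get(key); }
}
```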
Current production-ready in-memory databases have mainly focused on scale-up as opposed to scale-out. So far, they have either managed to integrate a main-memory tier into their existing core architecture (IBM, via BLU Acceleration) or have rebuilt the database almost from scratch to leverage main memory as the primary storage layer (SAP HANA), and in both cases their claim to fame is the obvious speedup that DRAM offers in comparison to disk.
However, very few databases presently have a complete offering that scales out the in-memory performance benefit across multiple nodes. Most in-memory databases require the application to manage the distribution of data/objects across nodes (e.g. SAP HANA).
Oracle's DBIM (Database In-Memory) and MemSQL are a few scalable and distributed options, at this time, that implement a distributed in-memory database/tier through collective utilization of memory resources across the cluster (RAC, in the case of Oracle). MemSQL can be deployed on a cluster of commodity compute nodes and claims to scale by utilizing aggregate resources, including memory. Oracle RAC is a shared-cache architecture that overcomes the limitations of traditional shared-nothing and shared-disk approaches to provide highly scalable and available database solutions, including in-memory benefits.

Scalaris vs CouchDB

I have a requirement to use a document store for one of my applications. I am assuming Scalaris and CouchDB are comparable as document stores. Do you have any experiences to share on these two solutions? Do you think one is better than the other?
Scalaris's transactions are attractive to me. With a little Erlang background, I have more trust in solutions built on Erlang. Riak is another one I found interesting. So please share your thoughts or pointers to more information on them.
I think you need to do more research in this area.
The type of key in a key/value store is obviously important; however, you need to be more precise about the rest of your requirements. Things like the distribution strategy shape availability and consistency. How much data do you want to store? Maybe MySQL is still all right? What kind of queries do you want to perform? Write it all down and try to fit each solution to it!
What I can say:
- CouchDB: the most important feature is its offline replication model. It's like having a mirrored DB for free anywhere you want. Fast reads, slow rebalancing after lots of deletes. Pure CouchDB is not distributed and does not guarantee fault tolerance.
- Riak: the Dynamo model, i.e. many replicas distributed in a smart way. Reliable and scalable CPU, storage, and RAM.
- Hibari: distributed, also Erlang. Transactions (?).
All of the above have serious industrial use cases; Scalaris seems to be rather academic.
Depending on the way you will retrieve data, there are lots of original solutions, like graph databases or Redis (let's say a rich k/v store).

Implementing large scale log file analytics

Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, Google, et al. perform the large-scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use MapReduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a timestamp, and that in general the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.
Would a column-oriented DB like Bigtable (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, e.g. a reverse index?
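To make the access pattern concrete, here is roughly the kind of timestamp-sliced scan I have in mind, sketched against the HBase client API under the assumption that row keys lead with a fixed-width timestamp (the table name, key format, and dates are placeholders; I'm aware purely time-ordered keys can hotspot one region on writes):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeScan {
    // Row keys start with zero-padded epoch millis, so rows are stored in
    // time order and a timestamp slice becomes a single range scan.
    static byte[] key(long epochMillis) {
        return Bytes.toBytes(String.format("%013d", epochMillis));
    }

    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("weblogs"))) {
            Scan scan = new Scan()
                    .withStartRow(key(1262304000000L))  // 2010-01-01
                    .withStopRow(key(1264982400000L));  // 2010-02-01
            try (ResultScanner rows = table.getScanner(scan)) {
                for (Result row : rows) {
                    // aggregate hits, bytes, status codes, etc. here
                }
            }
        }
    }
}
```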
Unfortunately there is no one size fits all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigs a day through a staged pipeline inside of AWS.
Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind that tools like HBase and Hypertable are key/value stores, so they don't do ad-hoc queries and joins without the help of a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
In full disclosure, I am a developer on the Cascading project.
http://www.asterdata.com/
http://www.cascading.org/
The book Hadoop: The Definitive Guide (O'Reilly) has a chapter which discusses how Hadoop is used at two real-world companies.
http://my.safaribooksonline.com/9780596521974/ch14
Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. This is a paper on the tool Google uses for log analysis.
