Is Apache Usergrid free or commercial, and how does it scale?

I'd like to know whether Apache Usergrid is free or commercial, and how far it scales: for example, how large can my data set and my user base grow? Thanks.

Apache Usergrid is free to use and open source. See the website here:
http://usergrid.incubator.apache.org/
It is backed by Cassandra and can scale to trillions of entities (technically 250 trillion), provided you have the disk storage to support that. In terms of seek cost:
If you fetch an entity by UUID, it takes one lookup.
If you fetch it by name, it takes two lookups.
If you traverse graph edges, it takes three lookups.
All of these are constant time (O(1) in big-O terms), so storing more entities doesn't change the seek times unless you use complex queries.
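As a rough illustration (not taken from the Usergrid docs), here is a minimal sketch of the two constant-time lookups over Usergrid's REST interface using Python's requests. The base URL, org, app, collection, and token are all hypothetical placeholders.

    import requests

    # Assumed example deployment; substitute your own Usergrid host, org, and app.
    BASE = "https://api.usergrid.com/my-org/my-app"

    def get_by_uuid(collection, uuid, token):
        # Direct lookup by primary key: a single seek.
        r = requests.get(f"{BASE}/{collection}/{uuid}",
                         headers={"Authorization": f"Bearer {token}"})
        r.raise_for_status()
        return r.json()["entities"][0]

    def get_by_name(collection, name, token):
        # Lookup by entity name: the name is resolved first, then the entity fetched.
        r = requests.get(f"{BASE}/{collection}/{name}",
                         headers={"Authorization": f"Bearer {token}"})
        r.raise_for_status()
        return r.json()["entities"][0]

Either way the cost is a fixed number of lookups, independent of how many entities are stored.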

Related

Is there a reason that Cassandra doesn't have Geospatial support?

Cassandra is based on the Dynamo paper (a distributed, self-balancing hash table) plus BigTable, and there are spatial indexes that would fit nicely into that paradigm (quadkey or geohash). Is there a reason geospatial support hasn't been implemented?
You could add a GeoPoint datatype as a tuple with an internal geohash and mark a column family as containing geo data. From there you could choose whether the geo data behaves as a secondary index or as a denormalized super column family. That could lay the groundwork for geospatial development, and you could start with low-hanging fruit such as a .nearby() operation that simply returns columns sharing the same geohash. (I know that wouldn't give you the "nearest"; you'd have to walk the surrounding geohashes, or use a shape and a space-filling curve, for that. Those could come later, but prefix matching is still a useful general operation for finding some nearby columns.)
I know SimpleGeo/Urban Airship built geo support into Cassandra, but it doesn't look like that was ever opened up. Also, let me know if there's a better place to ask this (Quora, mailing lists, etc.).
I think there are two parts to the answer.
The reason it's not there is that nobody who commits code to Cassandra has thought of this feature, or considered it high enough priority to spend significant time on. Most Cassandra development is done by DataStax, and they, being a commercial entity, are attuned to user demands and suggestions and pretty pragmatic about which new features give them the most ROI.
If a good enough third-party developer (or team) with enough time on their hands took this on, it could be done, and conceptually the C* committers would likely have no problem with adding a major feature like this.
The second aspect is that Cassandra supports blobs (byte arrays), which means what you're describing can be implemented in the client app/driver in a relatively straightforward manner: the driver would be responsible for translating geo calls into the appropriate raw byte operations. I also suspect this would be less work than supporting a whole new data primitive, with a relevant set of operators, in the core storage engine.
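To make the client-side idea concrete, here is a small sketch of the standard geohash encoding plus a coarse prefix-based "nearby". This is not Cassandra-specific and not any existing driver's API; the store here is just an in-memory dict standing in for keys you would write to the database.

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

    def geohash(lat, lon, precision=9):
        # Standard geohash: interleave longitude and latitude bits, then
        # encode each group of 5 bits with the geohash base-32 alphabet.
        lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
        bits, use_lon = [], True
        while len(bits) < precision * 5:
            rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
            mid = (rng[0] + rng[1]) / 2
            if val >= mid:
                bits.append(1)
                rng[0] = mid
            else:
                bits.append(0)
                rng[1] = mid
            use_lon = not use_lon
        return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                       for i in range(0, len(bits), 5))

    def nearby(keys, lat, lon, precision=5):
        # Coarse .nearby(): everything whose geohash shares the same prefix.
        prefix = geohash(lat, lon, precision)
        return [k for k in keys if k.startswith(prefix)]

Because nearby points tend to share geohash prefixes, a driver could use the geohash as (part of) the row or column key and answer coarse proximity queries with plain range/prefix reads, exactly the kind of translation into raw byte operations described above.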

Sesame scalability

How do I scale up Sesame? I'm planning to store a lot of triples in my Sesame store and I'm wondering what I should do to end up with a scalable solution.
Ideally I would like my (native) store distributed among several Sesame instances, so a first question is: is there a way to "shard" Sesame? If so, could you please point me to some documentation?
Alternatively, should I rely on a relational backend store?
In general, other than hardware resources and front-end load balancers, what kind of support does Sesame provide for medium/big data scenarios?
There are several ways to scale up. I won't give you a complete overview of all possibilities here but give you a few pointers instead.
A single Sesame native store scales to about 100-150 million triples on typical hardware. Beyond that, you can either use a third-party Sesame-compatible store such as USeekM, Bigdata, CumulusRDF or OWLIM (which scales well into the billions of triples), or you can use Sesame's own Federation SAIL. The federation members can be any combination of Sesame-compatible stores, including native stores running locally or remote stores accessible over HTTP.
The Federation SAIL distributes write operations using a simple size-dependent sharding algorithm, trying to distribute data over all members equally. Queries are of course automatically distributed and results re-integrated.
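To illustrate the idea of size-dependent sharding with federated reads (purely a conceptual sketch, not the actual Sesame Federation SAIL API; all names here are made up):

    class SizeBasedFederation:
        """Illustrative sketch only: write to the smallest member, read from all."""

        def __init__(self, n_members):
            self.members = [[] for _ in range(n_members)]  # each member holds triples

        def add(self, triple):
            # Each write goes to the member currently holding the least data,
            # so data spreads roughly evenly over time.
            target = min(self.members, key=len)
            target.append(triple)

        def triples(self, predicate=None):
            # Federated read: query every member and merge the results.
            for member in self.members:
                for s, p, o in member:
                    if predicate is None or p == predicate:
                        yield (s, p, o)

In the real federation the members would be native or remote Sesame repositories rather than Python lists, but the write-routing and merge-on-read pattern is the same.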
Sesame's relational backend is deprecated now. Explanation on their mailing list.
I am not sure but I think that Sesame wouldn't scale well with its native backends. As far as I know, people tend to use for example OWLIM. You would perhaps need OWLIM-Enterprise (previously BigOWLIM Replication Cluster) if you want a cluster solution.
If Sesame is not a hard requirement, then many people use the clustered edition of Virtuoso to store large amounts of triples.

Riak vs Amazon SimpleDB

I am looking for an eventually consistent key-value data store, and I've decided to choose between Amazon SimpleDB and Riak. Can anyone share their experience comparing the two?
Thanks in advance,
Fedrick
Riak is a key-value store. The data values you store are opaque to the database, so you have no secondary indexes, but you do have the ability to run map-reduce if your data is JSON (or XML, I think). You can run map-reduce over all data or just a subset ("seed keys"). It also has a "link walking" feature where documents can refer to other documents, which can be auto-fetched. It doesn't currently have an incremental map-reduce like CouchDB, which means any secondary (non-key) queries are quite expensive. They have plans to fix this.
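For a feel of the opaque key-value access, here is a small sketch against Riak's classic HTTP interface (objects under /riak/<bucket>/<key> on port 8098; newer releases also expose /buckets/<bucket>/keys/<key>). The host, bucket, and data are assumptions for illustration.

    import json
    import requests

    RIAK = "http://localhost:8098"   # assumed local node; point at your cluster

    def put_doc(bucket, key, doc):
        # Store a JSON value; Riak treats the payload as opaque bytes.
        r = requests.put(f"{RIAK}/riak/{bucket}/{key}",
                         data=json.dumps(doc),
                         headers={"Content-Type": "application/json"})
        r.raise_for_status()

    def get_doc(bucket, key):
        r = requests.get(f"{RIAK}/riak/{bucket}/{key}")
        r.raise_for_status()
        return r.json()

    put_doc("users", "user-1", {"name": "Ana", "city": "Example City"})
    print(get_doc("users", "user-1"))

Anything beyond get/put by key (filtering on "city", say) means a map-reduce job or client-side work, which is the expensive part mentioned above.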
SimpleDB is actually halfway between a docstore and a keystore: Each key->item supports multiple attributes, but it only goes one level deep. You can query on your key or your attribute values.
In production, Riak should be pretty "hands-off". If it's slow or getting full, just spin up a new server and tell it to join the cluster (unlike CouchDB or MongoDB, where you have to futz with multiple config files).
SimpleDB can take a pounding (tens of thousands of requests per second I've heard), but you are responsible for data scaling (i.e. don't violate their domain size limits or it will slow down).
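The flat key-to-attributes model is easy to see with the classic boto 2.x SimpleDB bindings (boto3 has no SimpleDB support). This is a hedged sketch; the region, domain, item, and attribute names are made up.

    import boto.sdb

    conn = boto.sdb.connect_to_region("us-east-1")
    domain = conn.create_domain("users")            # no-op if it already exists

    # Each item key maps to a flat set of attributes: one level deep, no nesting.
    domain.put_attributes("user-1", {"name": "Ana", "tags": ["blue", "green"]})

    item = domain.get_item("user-1")
    print(item["name"], item["tags"])

    # Attribute values are queryable with a SQL-like select.
    for row in domain.select('select * from `users` where name = "Ana"'):
        print(row.name, dict(row))

Note that multi-valued attributes come back as lists, and that staying within the per-domain size limits (as mentioned above) is your responsibility.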
I have used SimpleDB for about six months now, and I am going into production with it. It works well, but I wish it were faster. I perform %like% queries for searching, and I can't seem to get it to churn through more than a few MB per second of values, although non-%like% searches are much faster. I get the feeling it could be sped up if someone at Amazon wrote a few algorithms in good old C rather than Erlang, but then again I am a C coder.
Also the first few queries on a recently opened Domain will take longer, as the system gets it all read in.
Overall it worked for me, but if I want to scale higher I will have to go with something else.
Also, I think that almost all my use of it will be free - there is a generous allocation of space, etc.
Make sure you plan on the fact that SimpleDB currently has no 'read only' access modes, etc. Any user that can use it can edit it.
--Tom

Implementing large scale log file analytics

Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, and Google perform the large-scale (e.g. multi-TB) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use map reduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a time stamp, and that in general the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.
Would a column-oriented DB like BigTable (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, e.g. a reverse index?
Unfortunately there is no one size fits all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigabytes a day through a staged pipeline inside AWS.
Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind that tools like HBase and Hypertable are key/value stores, so they don't do ad-hoc queries and joins without the help of a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
In full disclosure, I am a developer on the Cascading project.
http://www.asterdata.com/
http://www.cascading.org/
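As a minimal illustration of the out-of-band MapReduce aggregation described above, here is a Hadoop Streaming sketch that counts log events per hour between two timestamps. The log format (epoch-seconds, URL, status separated by tabs) and the time window are assumptions.

    #!/usr/bin/env python
    # loghours.py -- Hadoop Streaming job: run as "mapper" or "reducer".
    import sys

    START, END = 1262304000, 1264982400        # example query window (epoch secs)

    def mapper():
        for line in sys.stdin:
            try:
                ts, url, status = line.rstrip("\n").split("\t")
                ts = int(ts)
            except ValueError:
                continue                        # skip malformed lines
            if START <= ts < END:
                print(f"{ts - ts % 3600}\t1")   # bucket by hour

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            key, _, value = line.rstrip("\n").partition("\t")
            if key != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = key, 0
            total += int(value)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "mapper" else reducer()

You would run it with the Hadoop Streaming jar, passing "loghours.py mapper" and "loghours.py reducer" as the mapper and reducer commands; partitioning the input files by date keeps the scan limited to the requested time slice.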
The book Hadoop: The Definitive Guide (O'Reilly) has a chapter that discusses how Hadoop is used at two real-world companies.
http://my.safaribooksonline.com/9780596521974/ch14
Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. This is a paper on the tool Google uses for log analysis.

Suitable data storage backend for Erlang application when data doesn't fit memory

I'm researching possible options for organizing data storage for an Erlang application. The data it is supposed to use is basically a huge collection of binary blobs indexed by short string ids. Each blob is under 10 KB, but there are many of them; I'd expect the total to reach 200 GB, so it obviously cannot fit into memory. The typical operations on this data are reading a blob by its id, updating a blob by its id, or adding a new one. In any given period of the day only a subset of ids is in use, so storage access performance might benefit from an in-memory cache. Performance is quite critical: the target is around 500 reads and 500 updates per second on commodity hardware (say, an EC2 VM).
Any suggestions on what to use here? As I understand it, dets is out of the question, as it is limited to 2 GB (or was it 4 GB?). Mnesia is probably out of the question too; my impression is that it was mainly designed for cases where the data fits in memory. I'm considering trying EDTK's Berkeley DB driver for the task. Would it work in the above scenario? Does anybody have experience using it in production under similar conditions?
tcerl came out of facing the same size limit. I'm not using Erlang these days but it sounds like what you're looking for.
Have you looked at what CouchDB is doing? It might not be quite what you're after as a drop-in product, but there is a lot of Erlang code in there for storing data. There is also some talk of providing a native Erlang interface instead of the REST API.
Is there any reason you can't just use the file system, treating the filename as your string id and the file contents as the binary blob? You can choose a filesystem that fits your performance requirements, and you should get caching basically for free, provided by your OS.
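A minimal sketch of that filesystem approach, with a two-level directory fan-out so no single directory ends up holding millions of files. The root path and the hashing scheme are my own assumptions; the same idea is easy to port to Erlang.

    import hashlib
    import os

    ROOT = "/var/data/blobs"          # assumed storage root

    def _path(blob_id):
        # Fan files out across 256*256 subdirectories based on a hash prefix.
        digest = hashlib.sha1(blob_id.encode("utf-8")).hexdigest()
        return os.path.join(ROOT, digest[:2], digest[2:4], blob_id)

    def put(blob_id, data):
        path = _path(blob_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:    # write-then-rename so updates are atomic
            f.write(data)
        os.replace(tmp, path)

    def get(blob_id):
        with open(_path(blob_id), "rb") as f:
            return f.read()

The OS page cache then keeps the currently hot subset of blobs in memory without any extra caching layer.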
Mnesia can store data on disk just fine. There's also dets (disk-based term storage), which is roughly analogous to Berkeley DB. It's in the standard lib: http://www.erlang.org/doc/apps/stdlib/index.html
I would recommend Apache CouchDB.
It's a great fit for Erlang, and from the sound of it (you mention ID-based blobs and don't mention any relational requirements) you're looking for a document-oriented database.
Since the interface is REST, you can very simply add a commodity HTTP cache in front of it if you need caching.
The documentation for CouchDB is of a very high quality.
It also has built-in Map-Reduce :)
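Because the interface is plain HTTP, storing a blob by id is just a couple of requests. A hedged sketch against CouchDB's REST API, assuming a local server and an already-created database named "blobs":

    import requests

    COUCH = "http://localhost:5984/blobs"     # assumed local CouchDB database

    def put_blob(blob_id, data):
        # Create (or fetch) the document, then attach the binary under a fixed name.
        doc = requests.get(f"{COUCH}/{blob_id}")
        if doc.status_code == 404:
            rev = requests.put(f"{COUCH}/{blob_id}", json={}).json()["rev"]
        else:
            rev = doc.json()["_rev"]
        r = requests.put(f"{COUCH}/{blob_id}/data",
                         params={"rev": rev},
                         data=data,
                         headers={"Content-Type": "application/octet-stream"})
        r.raise_for_status()

    def get_blob(blob_id):
        r = requests.get(f"{COUCH}/{blob_id}/data")
        r.raise_for_status()
        return r.content

Putting an HTTP cache (or nginx) in front of the GET path is how you'd add the in-memory caching for the hot subset of ids mentioned in the question.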
