I am new to Cassandra. Although I can do some stuff in SQL, I am finding it pretty hard to do a simple join in Cassandra. My schema looks like this:
Now I have to find, for each department, how many emails in total were sent by the employees working there. The output should contain the corresponding number of emails per department.
Maybe I am missing something simple, but no matter what I do, I am not even able to retrieve data from two tables.
Cassandra has no join operation. It is designed this way to maximize the performance of basic operations like reading and writing, with the caveat that you write to and read from a single table at a time.
If your model is relational, that is, based on relations between tables, then Cassandra is not the way to go. In that case you should go with an RDBMS (Relational Database Management System) such as PostgreSQL, MySQL, SQL Server, etc.
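If you do stay with Cassandra, the usual workaround is to denormalize and maintain the aggregate at write time instead of joining at read time. A minimal sketch in Python with the DataStax driver, assuming a hypothetical keyspace "company" and a counter table that is not your actual schema:

# Denormalized counter table keyed by department, updated whenever an email is sent.
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("company")  # hypothetical keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS emails_per_department (
        department text PRIMARY KEY,
        email_count counter
    )
""")

# Increment when an employee of that department sends an email.
session.execute(
    "UPDATE emails_per_department SET email_count = email_count + 1 WHERE department = %s",
    ("engineering",),
)

# The per-department report is now a single-table read, no join needed.
for row in session.execute("SELECT department, email_count FROM emails_per_department"):
    print(row.department, row.email_count)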
I'm playing around with neo4j - seeing what I can and can't do with it before suggesting it for something serious. One of the things I'm looking at now is Data Partitioning. By this I mean having a single data store that contains data from many different customers, and knowing which customer the data belongs to.
In the SQL world, we've always done this by having a customer_id field on the tables that are customer specific, and then always including that in the queries and indices. This works perfectly well for us, but in the Graph DB world it feels like we can do better.
The options that I've come up with so far are:
The same as before - including a property on the nodes that is the Customer ID
Storing a Label on each Node that identifies the Customer. However, as far as I can tell you can't bind parameters to labels, so the queries would have to be generated slightly awkwardly.
Storing a Customer Node, and linking all of the other nodes to it.
Option #3 seems to be the "correct" Graph DB way of managing this, but I'm concerned about how it would affect performance. It's perfectly feasible that there will be hundreds of thousands of links from a single Customer Node to the other data nodes, and there will be hundreds of different Customer Nodes (based on the volume of data in the existing SQL database).
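For illustration, a rough sketch of what option #3 could look like with the Python driver; the labels, property names and BELONGS_TO relationship type are hypothetical:

from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_order_for_customer(tx, customer_id, order_id):
    # MERGE keeps a single Customer node per id; each data node is
    # attached to it with a BELONGS_TO relationship.
    tx.run(
        """
        MERGE (c:Customer {customerId: $customerId})
        CREATE (o:Order {orderId: $orderId})-[:BELONGS_TO]->(c)
        """,
        customerId=customer_id,
        orderId=order_id,
    )

with driver.session() as session:
    session.execute_write(add_order_for_customer, "cust-42", "order-1001")
driver.close()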
What's the recommended way of achieving this level of data partitioning whilst maintaining performance?
We are looking to transfer part of our database from PostgreSQL to Elasticsearch: basically, we want to combine three tables - Properties, Listings and Addresses - into a single document.
I couldn't find any standard tools, and since we have a Ruby on Rails application, probably the simplest and most reliable way will be to write a migration that just iterates through the models, composes them into a single document, and saves it to Elasticsearch.
The task doesn't seem complicated, but since this is my first experience with Elasticsearch, I want to check with the community.
Thanks.
The closest thing I'm aware of is the JDBC importer. However, I think writing your own script is probably just as fast.
There is a postgres function, row_to_json, that will convert a resulting row to JSON, which you can then publish into elasticsearch. There's nothing I'm aware of that will automatically do this for you. Assuming it's not billions of rows, I'd stick with your plan of writing a short script to run your query, and HTTP post the results into elasticsearch.
You'll need to decide on two things: Index name, and document type(s).
Some notes:
The consistency models of a relational database like postgres and an eventually consistent document store like elasticsearch are quite different. You should be aware of these differences and their drawbacks.
You will likely want the data in elasticsearch to be de-normalized, as there are no awesome ways of doing joins.
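A minimal sketch of that short script in Python; the table, column and index names below are placeholders rather than your actual schema:

import psycopg2                                   # pip install psycopg2-binary
from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

pg = psycopg2.connect("dbname=app user=app")
es = Elasticsearch("http://localhost:9200")

# Denormalize in SQL and let postgres hand back JSON via row_to_json.
query = """
    SELECT row_to_json(t) FROM (
        SELECT p.id, p.name,
               l.price, l.status,
               a.street, a.city
        FROM properties p
        JOIN listings  l ON l.property_id = p.id
        JOIN addresses a ON a.property_id = p.id
    ) t
"""

def docs():
    with pg.cursor(name="export") as cur:  # server-side cursor streams rows
        cur.execute(query)
        for (row,) in cur:
            yield {"_index": "properties", "_id": row["id"], "_source": row}

# Bulk-index instead of one HTTP request per document.
helpers.bulk(es, docs())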
I have tables which have millions of records. I need to load these records as nodes in neo4j.
Please help me out on how to do it as I'm new to neo4j.
It is quite easy: map the entities that should become nodes into one set of CSV files, and the connections that should become relationships into another set of files.
Then run them with my batch-importer: https://github.com/jexp/batch-import/tree/20#binary-download
import.sh nodes1.csv,nodes2.csv rels1.csv,rels2.csv
Add types and index information to the headers and the batch.properties config file as needed.
You can use the batch-importer for the initial insert but also for subsequent updates (though the database has to be shut down for that).
It is pretty easy to connect to your existing database using its driver, extract the information in the right shape and kind, and insert it into your graph model, either using Cypher statements with parameters or the embedded, transactional Java API for ongoing updates.
See: http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/
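As a hedged sketch of that approach in Python (the table, column and label names are made up for illustration): extract rows from the relational database and insert them in batches with a parameterized Cypher statement.

import psycopg2                  # pip install psycopg2-binary
from neo4j import GraphDatabase  # pip install neo4j

pg = psycopg2.connect("dbname=legacy user=app")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def fetch_people(batch_size=10000):
    with pg.cursor(name="people") as cur:  # server-side cursor for millions of rows
        cur.execute("SELECT id, name, email FROM people")
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            yield [{"id": r[0], "name": r[1], "email": r[2]} for r in rows]

with driver.session() as session:
    for batch in fetch_people():
        # One parameterized statement per batch; UNWIND keeps transactions small.
        session.run(
            "UNWIND $rows AS row "
            "MERGE (p:Person {id: row.id}) "
            "SET p.name = row.name, p.email = row.email",
            rows=batch,
        )
driver.close()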
You can export to CSV and import it into Neo4j (this probably won't work well since you have millions of records).
You can write a program to do it (this is what I am currently working on).
This also depends on what programming languages you know... but the bottom line is, because no two databases are created equally (unless on purpose), it's very difficult to create a catch-all solution for migrating data from SQL to Neo.
The best way that I've discovered so far is to create a program that queries the tables in the database, finds all related tables (i.e. foreign keys), and imports all those table rows into Neo, labeling the nodes using the Table name, then process the foreign keys as relationships.
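A hedged sketch of just the foreign-key step in Python, assuming the rows of both tables were already imported as nodes labeled with the table names (all identifiers here are illustrative):

import psycopg2
from neo4j import GraphDatabase

pg = psycopg2.connect("dbname=legacy user=app")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with pg.cursor() as cur, driver.session() as session:
    # orders.customer_id is a foreign key into customers.id
    cur.execute("SELECT id, customer_id FROM orders")
    for order_id, customer_id in cur:
        session.run(
            "MATCH (o:orders {id: $oid}), (c:customers {id: $cid}) "
            "MERGE (o)-[:CUSTOMER_ID]->(c)",
            oid=order_id, cid=customer_id,
        )
driver.close()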
It's not easy. I've been working on something for my database here for a week or so now... but I'm close!
I'm building an ASP.NET MVC application.
I will store a lot of data per application user, so I was considering having separate DBs for each user's data. Of course all DBs will be 100% the same in terms of structure. They should all be derived from some kind of master DB.
Can someone give me any leads on how to realize this?
A database per user sounds like a terrible idea for a lot of reasons. If the data or read/write volume is huge, you should consider sharding (splitting the database into separate databases based on primary keys, but NOT a separate DB for each primary key).
If you are using SQL Server, it can handle several billion rows with just a bit of tuning; see http://www.sql-server-performance.com/2011/index-maintenance-performance/ for a good reference on this.
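To make the sharding idea concrete, here is a tiny sketch (in Python, purely illustrative; the connection strings and shard count are made up) of routing each user to one of a fixed set of databases by hashing the key, rather than creating a database per user:

import hashlib

SHARDS = [
    "Server=db0;Database=app_shard0;...",
    "Server=db1;Database=app_shard1;...",
    "Server=db2;Database=app_shard2;...",
]

def shard_for_user(user_id):
    # Stable hash so the same user always lands on the same shard.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for_user("user-123"))  # connection string to use for this user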
Unless you are planning to handle millions of users with millions of rows in each table you will not need to use multiple DB's.
MS SQL Server can easily handle a billion rows in a table with good indexing and still keep good query performance.
Using multiple DB's will most likely only cause headaches.
That said, you should design scripts to create the databases, and later update scripts for changing them, since you would have to propagate any changes to all databases.
But my recommendation is to stick with one DB. If you actually manage to outgrow that, you most likely need to redesign the database structure anyway :)
I am building an ad analytics tool which assumes a data structure like this:
Account
Campaign
Keyword
Conversion
I have a lot of information about individual conversion events, which can be tied back to the cost data of each campaign, keyword, ad group, etc. In SQL, you could consider each property a sort of foreign key (text-based) to the campaign, keyword or ad in a particular account, but that's inefficient and slow. It doesn't sound like a great idea to make campaign_id, keyword_id, etc. fields and populate them either, because I want the analytics to be available in near-real time.
What would be a good way to model this with MongoDB?
Assuming a very high volume of conversion events (millions per day or more), a storage engine alone (MongoDB or anything else) won't help you. What you need is the ability to run map-reduce jobs on the data in order to calculate the analytics. You can scale-out your cluster as necessary to achieve near-real time performance.
The free/open-source options that I can suggest are Hadoop (and probably HBase and Hive) or Riak.
There are other options - I'm only suggesting these two because I've personal experience with them in a high scale production environment. We're currently using Hadoop to power an analytics system processing billions of events per day.
If you're not into rolling your own and are able and willing to pay (a lot!) then look at GreenPlum and Vertica.
I'll be happy to share more information on potential solution designs - but I'll need more data on what you're trying to achieve - scale, use cases etc.
I'm not sure that MongoDB is really the right choice for something like this, since MongoDB is really more about storing less well-structured (or more complex) documents rather than hierarchical records like this one. However, if you are going the MongoDB route, then you can just use the account, campaign and keyword tags directly. There is no substantive benefit to abstracting these into meaningless keys in MongoDB. You can index these fields directly in MongoDB.
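A hedged sketch with pymongo of what that looks like (the collection and field names are illustrative only): store the account, campaign and keyword tags directly on each conversion document and put a compound index on them.

from pymongo import MongoClient, ASCENDING, DESCENDING  # pip install pymongo

db = MongoClient("mongodb://localhost:27017")["ads"]
conversions = db["conversions"]

conversions.insert_one({
    "account":  "acme",
    "campaign": "spring-sale",
    "keyword":  "running shoes",
    "value":    12.50,
    "ts":       "2013-06-01T12:34:56Z",
})

# Compound index so per-account/campaign/keyword analytics queries don't scan.
conversions.create_index([
    ("account", ASCENDING), ("campaign", ASCENDING),
    ("keyword", ASCENDING), ("ts", DESCENDING),
])

# e.g. count conversions for one campaign
print(conversions.count_documents({"account": "acme", "campaign": "spring-sale"}))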
I don't know what your volumes are going to be or what other factors are affecting your technology choices. However, assuming that your accounts, campaigns and keywords don't change that frequently, you could do this with a plain old RDBMS (SQL Server, Oracle, etc.) using lookup tables for these determinants, where the foreign keys are meaningless integers. If you're doing live analytics, you could adopt a star schema and keep all of the numeric FKs on the base fact table (Conversion), so that you aren't joining a chain of four tables to get the whole picture; instead you'd be doing three one-hop joins. This would allow you to summarize at any level with only a single join.