I'm playing around with neo4j - seeing what I can and can't do with it before suggesting it for something serious. One of the things I'm looking at now is Data Partitioning. By this I mean having a single data store that contains data from many different customers, and knowing which customer the data belongs to.
In the SQL world, we've always done this by having a customer_id field on the tables that are customer specific, and then always including that in the queries and indices. This works perfectly well for us, but in the Graph DB world it feels like we can do better.
The options that I've come up with so far are:
The same as before - including a property on the nodes that is the Customer ID
Storing a Label on each Node that identifies the Customer. However, as far as I can tell you can't bind parameters to labels, so the queries would have to be generated somewhat awkwardly.
Storing a Customer Node, and linking all of the other nodes to it.
Option 3 seems to be the "correct" graph DB way of managing this, but I'm concerned about its impact on query performance. It's perfectly feasible that there will be hundreds of thousands of links from a single Customer Node to the other data nodes, and there will be hundreds of different Customer Nodes (based on the volume of data in the existing SQL database).
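To make the options concrete, here is roughly what each one looks like as a Cypher query (a sketch only, held in Python strings; the Order label, customer_id property, BELONGS_TO relationship type, and per-customer label are illustrative names, not part of any real schema):

# Sketch only: labels, properties, and relationship types below are illustrative.
option_1_property = """
MATCH (o:Order {customer_id: $customerId})
RETURN o
"""

# Option 2: one label per customer. Labels can't be bound as parameters,
# so the label has to be spliced into the query text.
customer_label = "Customer_42"  # hypothetical, generated per customer
option_2_label = f"MATCH (o:Order:`{customer_label}`) RETURN o"

# Option 3: a Customer node that every customer-specific node links back to.
option_3_customer_node = """
MATCH (c:Customer {id: $customerId})<-[:BELONGS_TO]-(o:Order)
RETURN o
"""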
What's the recommended way of achieving this level of data partitioning whilst maintaining performance?
Related
Neo4j is a great tool for mapping relational data, but I am curious under what conditions it would not be a good tool to use.
In which use cases would using neo4j be a bad idea?
You might want to check out this slide deck and in particular slides 18-22.
Your question could have a lot of details to it, but let me try to focus on the big pieces. Graph databases are naturally indexed by relationships, so they will be good when you need to traverse a lot of relationships. Graphs themselves are very flexible, so they'll be good when the inter-connections between your data need to change from time to time, or when what you need to store about your core objects changes over time. Graphs are a very natural way of modeling some (but not all) data sources: things like peer-to-peer networks, road maps, organizational structures, etc.
Graphs tend not to be good at managing huge lists of things. For example, if you were going to build a customer transaction database with analytics (where you have 1 million customers and 50 million transactions, and all you do is post transactions all day long), then it's probably not a good fit. An RDBMS is great at that; notice that this use case doesn't really exploit relationships.
Make sure to read those two links I provided, they have much more discussion.
For maintenance reasons, any service aggregating data feeds has until now been well advised to keep its sources independent.
If I want to explore relationships between different feeds, this can be done at the application level, using data that tracks (for example) user preferences across the other feeds.
Graph databases are about managing relationship complexity, but this complexity is in many cases a design choice. Putting all your kids in one bathtub is fine until you drop the soap.
I am required to develop a big application and need to understand graph database concepts; the framework I was pointed to is http://sparsity-technologies.com/UserManual/API.html#transactions. I am planning to use Core Data instead of that framework. I want answers to the following questions.
1) What exactly is a graph database? Please explain with a simple, general example of something we cannot do with SQLite.
2) Does Core Data count as a relational database or not? Please explain.
3) Does Core Data count as a graph database? Apple's documentation says Core Data is for object graph management; does object graph management mean it is a graph database? If I want to create relationships and weighted edges between objects, is Core Data suitable?
1) What exactly is a graph database? Please explain with a simple, general example of something we cannot do with SQLite.
Well, since these systems are all Turing complete, you can perform any database operation with any of them; the real question is efficiency.
In conventional "relational" databases the "relationships" are nothing but pointers to entries in other tables. They don't inherently communicate any information other than "A is connected to B". To capture and structure anything more complex than that, you have to build a lot of pseudo-structure.
A1-->B1 // e.g. first-name, last-name
Which is fine, but the relationship doesn't necessarily have a reciprocal, nor does the data in each table cell have to be names. To make the relationship always make sense, you've got to build a lot of logic to put the data into the tables correctly, and ditto for getting it out.
In a GraphDB you have "nodes" and "relationships". Nodes are not entries in a table. They can be arbitrarily complex objects, persisted or not, and persisted in a variety of ways. Nodes generally model some "real-world" object, like a person.
"Relationships" GraphDBs, owing to the previous meaning in SQL et al, really need another term because instead of be simple pointers, they to can be arbitrarily complex objects. In a node of names (way to simple to actually justify it)
Node-Name-A--(comes before)-->Node-Name-B
Node-Name-B--(comes after)-->Node-Name-A
In SQLite, to find the first and last names you query both tables. In a graph, you grab one of the nodes and follow its relationship to the other node.
(Come to think of it, graph theory in math started out as a way to model the bridges of Königsberg connecting the islands that made up the city, so maybe a transportation map is a better example.)
If cities are nodes, the roads are relationships. The road objects/descriptors would not just connect two cities but would also contain their own logic and data, such as direction, length, condition, traffic, susceptibility to weather, and so on.
When you wanted to find the optimum route between two widely separated cities for a particular time, traffic, weather, etc., you'd start with the node representing the start city and follow the relationship/road descriptors outward. In a complex model, any two nearby city nodes might have several roads connecting them, each best in certain circumstances.
All you have to do computationally, though, is compare the relationships between any two nodes. This is called "walking the graph". The huge benefit is that no matter how big the overall DB is, you only have to process the relationships coming out of the current node, say three of them, and can utterly ignore the millions of other nodes and relationships that might be in the DB.
You can't do that in SQLite: the more data and the more "relationships" you have, the more you have to process.
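A tiny, purely illustrative Python sketch of that idea (it is not any particular graph database's API): each relationship is an object in its own right, and walking the graph only ever touches the relationships attached to the node you are currently standing on.

# Illustrative only -- not a real graph-database API.
class Road:                        # a "relationship" that is itself an object
    def __init__(self, target, km, condition):
        self.target = target       # the city node this road leads to
        self.km = km
        self.condition = condition

class City:                        # a "node"
    def __init__(self, name):
        self.name = name
        self.roads = []            # outgoing relationships hang off the node

    def connect(self, other, km, condition="good"):
        self.roads.append(Road(other, km, condition))

koenigsberg = City("Königsberg")
pillau = City("Pillau")
koenigsberg.connect(pillau, km=47, condition="icy")

# Walking the graph: only the roads on this one node are examined,
# no matter how many millions of other cities exist elsewhere in the store.
for road in koenigsberg.roads:
    print(road.target.name, road.km, road.condition)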
2) Does Core Data count as a relational database or not? Please explain.
No, but if you hum a few bars you can fake it. By default, Core Data is an object graph, which means it does connect objects/nodes, but the relationships are not themselves objects; instead they are defined by information contained in the class of each object. E.g. you could have a Core Data model of the usual company, manager, and employee:
CompanyClass
    set_of_manager_objects
    min_managers == 1, max_managers == undefined
    delete_company_object -> delete_all_manager_objects
    reciprocal_relationship_from_manager_is_company

ManagerClass
    one_company_object
    min_companies == 1, max_companies == 1
    delete_manager_object -> nullify (remove from set in company class)
    reciprocal_relationship_from_company_is_manager
So, Core Data is a kind of "missing link" in the evolution of GraphDBs. It has relationships, but they're not objects in themselves; they live inside the object/node. The relationship properties are hard-coded into the classes themselves, and only a few of their values can be changed. Still, Core Data does have the advantage of walking the graph. To find the employees of one manager at one company, you just start at the company object, go through a small set of managers to find the right one, then walk down to the employee set. Even if you had hundreds of companies, thousands of managers, and tens of thousands of employees, you could find one employee out of tens of thousands with a couple of hops.
But you can fake a GraphDB by creating relationship objects and putting them between any two objects/nodes. Because Core Data allows any subclass of a relationship's destination class to live in the same relationship set (e.g. ManagerClass --> LowManager, MidManager, HighManager), you can define a simple relationship in any given class and then populate it with objects of arbitrary complexity, as long as they are subclasses. These are usually termed "linking classes" or "linking relationships".
The normal pattern is to have the linking class hold a relationship to the two or more classes it might have to link (which can be generic as well; I've started class trees with a base class that has nothing but relationship properties, although there is a performance penalty if it gets huge).
If you give each node/object several relationships all defined on separate base linking classes, you can link the same nodes together in multiple ways.
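The "linking class" pattern itself is not Core Data specific. As a rough, language-neutral sketch in Python (all names invented), the link is an object sitting between two nodes and carrying its own data, and subclasses of the base link can be arbitrarily rich:

# Sketch of the "linking class" pattern; class names are invented.
class Link:                          # base linking class: nothing but the two ends
    def __init__(self, source, target):
        self.source, self.target = source, target

class ReportsTo(Link):               # an arbitrarily complex subclass of the link
    def __init__(self, source, target, since):
        super().__init__(source, target)
        self.since = since           # data that belongs to the relationship itself

class Person:                        # a node with one generic relationship set
    def __init__(self, name):
        self.name = name
        self.links = []              # holds any Link subclass

alice, bob = Person("Alice"), Person("Bob")
alice.links.append(ReportsTo(alice, bob, since=2021))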
3) Does Core Data count as a graph database?
No, because the fundamental task of a database is persistence, saving the data. The fundamental task of Core Data is modeling the logic of the data inside the app.
Two different things. For example, when I start building a Core Data model, I start with an in-memory store, usually with test data. The model graph is built from scratch every run, in memory, and never touches the disk. As the app progresses, I shift to an XML store on disk so I can examine it if necessary. The XML and binary stores are written out and read back in their entirety. Only at the end do I change the store to MySQL or something custom.
In a GraphDB, the nodes, relationships, and the general graph are innately tied to the persistence system, AFAIK, and can't be separated from it. When you walk the graph, you walk the persisted data every time (caching aside).
The usual question people ask is when to use Core Data and when to use SQL in the Apple Ecosystem.
The answer is pretty simple:
Core Data handles complexity inside the running app. The more complex the data model interactions, the more you get for free with Core Data.
SQL-derived solutions handle volumes of simple data: use them when the data model inside the app has little or no logic and there's a lot of data.
If your app displays something that would fit on a bunch of index cards (library book records, baseball cards, etc.), then an SQL solution is best, because the logic is just getting particular cards in and out of persistence.
If your app is a complex vector drawing app where every document is different and of arbitrary complexity, or you're modeling a V8 engine, then most of the logic lives in the active data model while the app is running and persistence is trivial, so Core Data is the better choice.
Graph databases are catching on because our data is getting (1) really, really big and (2) increasingly complex. We need to model that complexity in the node-relationship graph in persistence, so we don't have to chew through the entire DB to find the data and then add an additional layer of logic on top.
Core Data is nothing but a data model layer; Core Data is NOT a database, and it is far from being a graph database.
Core Data only helps you to:
Create tables (entities)
Create columns in a table (attributes)
Create relationships (such as primary key, foreign key, one-to-one, one-to-many)
Core Data typically uses SQLite to store data and run queries.
Core Data is used in iOS mobile apps; I believe what you want is a backend database solution.
I've created a graph model for a social network and need some concrete advice on the design with regard to scaling. Pardon the n00bness of these questions, but I'm not finding very many clear examples out there...
NOTE: the status update and activity nodes/relationships are linked lists, with the newest entries constantly being placed at the top of the list.
Linked lists allow for news feed generation, but there could be hundreds of records per user. I presume a LIMIT clause isn't sufficient, even though the data is in descending order by date. Do I have to keep a separate linked list that holds only the most recent 10 status/activity updates (and constantly replace the head of that list) to get better activity-feed generation, or will one properly sorted list do the job (with a LIMIT clause)?
These nodes all have properties (JSON data with content, IDs, etc.). How do "global" indexes come into play here, so that I can find, for example, users that like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index; I'm just wondering if I'm missing a part of the picture here.
Security - logins and passwords. I would presume a graph database could store them, but I'd also presume that's a security risk at this point. Would it be better to keep this in Postgres, etc.?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this.
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write cypher or gremlin queries that do what you want. Remember that you can traverse forwards and backwards on edges. Given a user, it should always be relatively constant time to pull up the last ten things they did.
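For example, with the linked-list model described in the question, pulling the newest ten entries is a bounded traversal from the head of the list, whatever the total list length. A rough sketch (the User, STATUS, and NEXT names are assumptions about your model, and the Cypher is held in a Python string):

# Assumed shape: (u:User)-[:STATUS]->(newest)-[:NEXT]->(older)-[:NEXT]->...
latest_ten_updates = """
MATCH (u:User {id: $userId})-[:STATUS]->(head)
MATCH p = (head)-[:NEXT*0..9]->(update)
RETURN update
ORDER BY length(p)
"""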
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full text search for your respective graph database.
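A rough sketch of that traversal, assuming a Band node with an indexed name property and a LIKES relationship (both names are assumptions; index syntax varies by Neo4j version):

# Index the attribute you look bands up by (syntax shown is Neo4j 4.x+).
create_band_index = "CREATE INDEX FOR (b:Band) ON (b.name)"

# Pull the band node, then traverse outwards to its fans.
fans_of_band = """
MATCH (b:Band {name: $bandName})<-[:LIKES]-(u:User)
RETURN u
"""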
Part 3.
Learn more about security. The only thing you would be storing would be a properly hashed string of the user's password. At that point you would be fine using any graph db and good security practices.
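As a minimal illustration of "store only a properly hashed password" (a sketch using Python's standard library; the iteration count is illustrative, and a dedicated library such as bcrypt or argon2 is usually a better choice):

import hashlib
import hmac
import os

def hash_password(password, salt=None):
    # PBKDF2 with a random per-user salt; store the salt and digest, never the password.
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return hmac.compare_digest(candidate, stored_digest)  # constant-time comparison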
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.
We have to create a rather large Ruby on Rails application based on a large database. This database is updated daily; each table has about 500,000 records (or more), and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for the user to move from version to version; versions are a kind of "snapshot" of the main database at different points in time. In addition, some portions of the data need to be served to other external applications with an API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have its own application, creating a service with an API to interact with the data. This is needed because we don't want multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with a project of this magnitude and we're trying to find the best possible solution. We don't know if this kind of data separation makes any sense. If it does, how do we provide proper communication between the different applications and the individual services, and between the services themselves, as this will also be required?
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot (a rough schema sketch follows the three models below). In this, you have essentially three complementary models which work together to provide an up-to-date, well-performing solution. However, the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. Inserts occur in this model; updates are made only to some metadata used by certain queries, and only when absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes, provided that the integrity of the other models is not compromised. In a financial application this might be things like customer or vendor data, employee data, and the like.
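A minimal DDL sketch of the first two models (table and column names are invented, and the partial-index boundary is illustrative):

# Illustrative PostgreSQL DDL held in Python strings; names are invented.
EVENT_MODEL = """
CREATE TABLE journal_line (            -- append-only: inserts, no updates
    id          bigserial PRIMARY KEY,
    account_id  integer       NOT NULL,
    posted_at   timestamptz   NOT NULL,
    amount      numeric(14,2) NOT NULL
);
"""

AGGREGATE_CLOSING_MODEL = """
CREATE TABLE closing_balance (          -- roll-forward points; deleted only to re-open a period
    account_id  integer       NOT NULL,
    closed_on   date          NOT NULL,
    balance     numeric(14,2) NOT NULL,
    PRIMARY KEY (account_id, closed_on)
);

-- Partial index so current-balance queries only scan lines after the last close.
CREATE INDEX journal_line_open_period
    ON journal_line (account_id, posted_at)
    WHERE posted_at > DATE '2024-01-01';  -- boundary chosen per closing period
"""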
I am building an ad analytics tool which assumes a data structure like this:
Account
Campaign
Keyword
Conversion
I have a lot of information about individual conversion events, which can be tied back to the cost data of each campaign, keyword, ad group, etc. In SQL, you could consider each property a sort of foreign key (text-based) to the campaign, keyword or ad in a particular account, but that's inefficient and slow. It doesn't sound like a great idea to make campaign_id, keyword_id, etc. fields and populate them either, because I want the analytics to be available in near-real time.
What would be a good way to model this with MongoDB?
Assuming a very high volume of conversion events (millions per day or more), a storage engine alone (MongoDB or anything else) won't help you. What you need is the ability to run map-reduce jobs on the data in order to calculate the analytics. You can scale-out your cluster as necessary to achieve near-real time performance.
The free/open-source options that I can suggest are Hadoop (and probably HBase and Hive) or Riak.
There are other options - I'm only suggesting these two because I've personal experience with them in a high scale production environment. We're currently using Hadoop to power an analytics system processing billions of events per day.
If you're not into rolling your own and are able and willing to pay (a lot!) then look at GreenPlum and Vertica.
I'll be happy to share more information on potential solution designs - but I'll need more data on what you're trying to achieve - scale, use cases etc.
I'm not sure that MongoDB is really the right choice for something like this, since MongoDB is better suited to storing less well-structured (or more complex) documents than hierarchical records like these. However, if you are going the MongoDB route, then you can just use the account, campaign, and keyword tags directly. There is no substantive benefit to abstracting these into meaningless keys in MongoDB, and you can index these fields directly.
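A rough pymongo sketch of that approach (database, collection, and field names are illustrative):

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # connection details are illustrative
conversions = client.ad_analytics.conversions

# Store the natural identifiers directly on each conversion document...
conversions.insert_one({
    "account": "acme",
    "campaign": "spring-sale",
    "keyword": "running shoes",
    "converted_at": "2024-05-01T12:00:00Z",
    "value": 12.50,
})

# ...and index the fields you filter on, instead of inventing surrogate keys.
conversions.create_index([("account", ASCENDING), ("campaign", ASCENDING), ("keyword", ASCENDING)])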
I don't know what your volumes are going to be or what other factors are affecting your technology choices. However, assuming that your accounts, campaigns, and keywords don't change that frequently, you could do this with a plain old RDBMS (SQL or Oracle, etc.) using lookup tables for these determinants, where the foreign keys are meaningless integers. If you're doing live analytics you could adopt a star schema and keep all of the numeric FKs on the base fact table (Conversion), so that instead of joining a chain of four tables to get the whole picture you'd be doing three one-hop joins. This would allow you to summarize at any level with only a single join.
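A sketch of that star-schema idea as SQL (held in a Python string; table and column names are illustrative): the fact table carries all the dimension keys, so any summary is a single one-hop join.

STAR_SCHEMA_SKETCH = """
CREATE TABLE dim_account  (account_id  serial PRIMARY KEY, name text);
CREATE TABLE dim_campaign (campaign_id serial PRIMARY KEY, name text);
CREATE TABLE dim_keyword  (keyword_id  serial PRIMARY KEY, keyword text);

CREATE TABLE fact_conversion (
    conversion_id bigserial PRIMARY KEY,
    account_id    integer NOT NULL REFERENCES dim_account,
    campaign_id   integer NOT NULL REFERENCES dim_campaign,
    keyword_id    integer NOT NULL REFERENCES dim_keyword,
    converted_at  timestamptz NOT NULL,
    value         numeric(10,2) NOT NULL
);

-- Summarising by campaign is one hop from the fact table, not a chain of joins.
SELECT c.name, sum(f.value)
FROM fact_conversion f
JOIN dim_campaign c USING (campaign_id)
GROUP BY c.name;
"""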