I've created a graph model for a social network and need some concrete advice on how the design will scale. Pardon the n00bness of these questions, but I'm not finding many clear examples out there...
NOTE: the status update and activity nodes/relationships are linked lists, with the newest entries always placed at the top of the list.
Linked lists allow for news feed generation, but there could be hundreds of records per user. I presume a LIMIT clause isn't sufficient even though the data is in descending order by date. Do I have to keep a separate linked list that holds only the most recent 10 status/activity updates, and constantly replace the head of that list, to get better activity feed generation? Or will one properly sorted list do the job (with a LIMIT clause)?
These nodes all have properties (json data with content, IDs, etc) - how do "global" indexes come into play here so that I can find, for example, users that like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index, just wondering if I'm missing a part of the picture here..
Security - logins and passwords.. I would presume a graph database could store them, but I'd presume it's a security risk at this point - would it be better to keep this in postgres etc?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this..
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write cypher or gremlin queries that do what you want. Remember that you can traverse forwards and backwards on edges. Given a user, it should always be relatively constant time to pull up the last ten things they did.
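To illustrate why the single list is enough, here is a minimal in-memory sketch in Python (not Cypher or Gremlin). The `Activity` and `User` classes are hypothetical stand-ins for the nodes and the "next" relationship. The point is only that reading the newest ten entries costs ten hops from the head, however long the list is.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, Optional


@dataclass
class Activity:
    """One entry in a user's activity list (hypothetical shape)."""
    text: str
    next: Optional["Activity"] = None  # points at the next-older entry


class User:
    def __init__(self, name: str) -> None:
        self.name = name
        self.head: Optional[Activity] = None  # newest entry

    def post(self, text: str) -> None:
        # Insert at the head: constant time, regardless of list length.
        self.head = Activity(text, self.head)

    def walk(self) -> Iterator[Activity]:
        # Follow the "next" pointers from newest to oldest.
        node = self.head
        while node is not None:
            yield node
            node = node.next

    def latest(self, limit: int = 10) -> list:
        # Stop after `limit` hops; older entries are never visited.
        return [a.text for a in islice(self.walk(), limit)]


user = User("alice")
for i in range(500):
    user.post(f"update {i}")

print(user.latest(3))  # ['update 499', 'update 498', 'update 497']
```

A bounded traversal from the head in a graph query should behave the same way, so a second "top 10" list is probably not needed.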
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full text search for your respective graph database.
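As a rough sketch of that "index lookup, then traverse outwards" pattern, here is a plain Python model. The dictionaries stand in for the database's index and for incoming LIKES edges. The names (`index`, `liked_by`, `users_who_like`) are made up for illustration and are not any graph database's API.

```python
# Index: (label, name) -> node id. Stands in for the database's index.
index = {}
# Incoming LIKES edges: band node id -> set of user node ids.
liked_by = {}
# Node id -> display name.
names = {}


def add_node(node_id, label, name):
    names[node_id] = name
    index[(label, name)] = node_id


def add_like(user_id, band_id):
    liked_by.setdefault(band_id, set()).add(user_id)


def users_who_like(band_name):
    # Step 1: one index lookup finds the band node directly.
    band_id = index.get(("Band", band_name))
    if band_id is None:
        return []
    # Step 2: walk only that node's incoming LIKES edges.
    return sorted(names[u] for u in liked_by.get(band_id, ()))


add_node(1, "Band", "Depeche Mode")
add_node(2, "Band", "New Order")
add_node(10, "User", "ann")
add_node(11, "User", "bob")
add_node(12, "User", "cy")
add_like(10, 1)
add_like(11, 1)
add_like(12, 2)

print(users_who_like("Depeche Mode"))  # ['ann', 'bob']
```

The cost depends on how many users like that one band, not on the total number of users in the graph.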
Part 3.
Learn more about security. The only thing you would be storing would be a properly hashed string of the user's password. At that point you would be fine using any graph db and good security practices.
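For example, a minimal sketch of salted password hashing with Python's standard library might look like the following. The iteration count and the function names are assumptions for illustration. The resulting salt and digest are just bytes, so they can be stored as node properties as easily as in a Postgres column.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune it for your hardware


def hash_password(password: str):
    """Return (salt, digest); store both, never the plain password."""
    salt = os.urandom(16)  # fresh random salt per user
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, expected)


salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))  # False
```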
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.
Tag values and series cardinality
InfluxDB creates a new series for every combination of (tag, value) pairs that it sees. An example in the documentation shows this with a tag called email. Series cardinality is a limiting factor on performance, and independent tags have a multiplicative effect on series cardinality.
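To make the multiplicative effect concrete, here is a small back-of-the-envelope calculation in Python. The tag names and counts are hypothetical, and the product is only an upper bound that is reached when every combination of values actually occurs.

```python
import math


def max_series(tag_value_counts):
    """Upper bound on series count if the tags vary independently."""
    return math.prod(tag_value_counts.values())


# Hypothetical tags and distinct-value counts, for illustration only.
tags = {"group": 1_000, "hostname": 50, "stage": 10}
print(max_series(tags))  # 500000

# Growing a single tag by 100x grows the bound by 100x as well.
tags["group"] = 100_000
print(max_series(tags))  # 50000000
```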
My data
I process data that naturally breaks down into something I call groups. Think of it like an advertising network that processes customers' ads, where a customer is a "group". I'd like to track how much time and resources different groups take to process. I currently have about 1000 groups and I'm working on growth planning, so let's suppose I might soon have 10's or 100's of thousands. There are other tags with 10's or 100's of values (e.g., hostname). These things are all important to being able to understand our data.
I currently have half a million series. I don't think I have a lot of data. I'm running InfluxDB 1.2.4; it looks like our Influx version isn't being updated very frequently.
My question
This seems like a relatively ordinary need, but it also seems to be one that is going to get me in trouble with influxdb.
Am I right that I'm heading for pain, or am I confused?
Is there a better way to address this need?
Am I outright using the wrong tool?
My Rails app allows users to set up a data feed (typically a REST API), and pulls in results at specific intervals so the user can later filter/sort/chart/export the data. An example could be pulling a stock price every 15 minutes and saving its value and a timestamp as a row in a table.
Since there could be many users with many feeds set up, I'm trying to determine the best way to handle all of this data in Rails.
I feel like I should stay away from one large mega table with a feed_id on each row since there could be millions and millions of rows very quickly (50 users with 5 feeds running every 15 minutes would be 25,000 rows per day). Will this get unwieldy too quickly or am I underestimating the power of Rails/Postgres? What is the limit?
Another option I came up with was giving each feed its own table – create a table when the feed is added and save the data there. In discussions I've read it seems like dynamic table creation is frowned upon except in special circumstances and I'm wondering if this one fits the mold.
The last option would be adding a second database - potentially NoSQL like MongoDB. I'd rather keep everything in one DB if possible but if that really will yield the best performance and reliability I'd give it a go.
I would love to hear people's experience and opinions on tackling something like this with Rails.
25,000 rows per day makes about 10 million per year. In this case you're well within limits of PostgreSQL for many years. Stock prices are mostly numeric, so, if I were you, I'd have a simple SQL table for all this data. Just avoid extra-long rows (texts) and you should be fine.
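For what it's worth, the arithmetic is easy to check. A quick Python sketch (the function name is made up) gives 24,000 rows per day and just under 9 million per year for the numbers in the question, so the rounded estimates hold.

```python
MINUTES_PER_DAY = 24 * 60


def rows_per_day(users, feeds_per_user, interval_minutes):
    # One row per feed per polling interval.
    polls_per_day = MINUTES_PER_DAY // interval_minutes
    return users * feeds_per_user * polls_per_day


daily = rows_per_day(users=50, feeds_per_user=5, interval_minutes=15)
yearly = daily * 365
print(daily, yearly)  # 24000 8760000
```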
In the future you could extend your solution with partitioning (e.g. monthly or yearly) or move older data to an archive.
I have the following architecture.
You will find a duplication in the HAS relationship. The main one is between Badge and Skill, as I want to be able to aggregate/count the same Skill across different Badges of the same User.
So, the duplicate relationship is between User and Skill. That is because, for instance, if an Organization wants to know all the skills of single or multiple recipients I would follow the following path:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -HAS-> Skill
//Skill nodes for a specific or multiple user represent each skill contained in every Badge the user was awarded.
However, if I do not add the duplicated relationship HAS between User and Skill, I will follow the following path instead:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -IS_AWARDED-> Badges -HAS-> Skill
//Now I have all skills for a specific or multiple User for every badge awarded
The difference between the two paths is obvious. The first one results in fewer queries, but the duplication of the relationship is a concern. The second one removes the duplication problem (is it a problem?) but needs more queries. I am still a newbie to neo4j, so feel free to tell me that both of my approaches seem convoluted and that there is a more optimized way to do what I am trying to do.
Your two models are valid, and you can use both of them.
But like you said, in the first one you duplicate some data. Generally we do that when we have performance issues. Is that the case for you right now?
As a starting point, I recommend model 2 (i.e. without duplication). If you run into issues with it, you can easily change to model 1 (the flexibility of Neo4j is really great for graph refactoring!).
In IT, nothing is free: if you duplicate some data to get better read performance, you will pay for it on writes.
When you write a (badge)-[:HAS]->(skill) relationship, you also need to create a (user)-[:HAS]->(skill) rel (same for update or delete).
So you need to keep the consistency of this data when you update the graph. In fact it's like you are creating a SQL stored view.
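Here is a rough in-memory sketch in Python of what that write-side bookkeeping looks like. The `BadgeGraph` class and its method names are hypothetical. It keeps the duplicated user-to-skill data in sync on every write and compares it with the answer derived by walking through the badges (model 2).

```python
from collections import Counter, defaultdict


class BadgeGraph:
    def __init__(self):
        self.badge_skills = defaultdict(set)    # (badge)-[:HAS]->(skill)
        self.badge_holders = defaultdict(set)   # (badge)-[:IS_AWARDED_TO]->(user)
        # Duplicated (user)-[:HAS]->(skill), with a count per skill.
        self.user_skills = defaultdict(Counter)

    def add_skill_to_badge(self, badge, skill):
        if skill in self.badge_skills[badge]:
            return
        self.badge_skills[badge].add(skill)
        # Extra write cost: every current holder must be updated too.
        for user in self.badge_holders[badge]:
            self.user_skills[user][skill] += 1

    def award(self, badge, user):
        if user in self.badge_holders[badge]:
            return
        self.badge_holders[badge].add(user)
        for skill in self.badge_skills[badge]:
            self.user_skills[user][skill] += 1

    def skills_via_badges(self, user):
        # Model 2: no duplication, derive the answer by walking badges.
        counts = Counter()
        for badge, holders in self.badge_holders.items():
            if user in holders:
                counts.update(self.badge_skills[badge])
        return counts


g = BadgeGraph()
g.add_skill_to_badge("backend", "python")
g.add_skill_to_badge("backend", "sql")
g.add_skill_to_badge("data", "python")
g.award("backend", "ann")
g.award("data", "ann")

print(g.user_skills["ann"] == g.skills_via_badges("ann"))  # True
```

Reads on the duplicated data are a single lookup, but every write path (and any delete path you add) has to touch both places, which is the trade-off described above.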
I'm playing around with neo4j - seeing what I can and can't do with it before suggesting it for something serious. One of the things I'm looking at now is Data Partitioning. By this I mean having a single data store that contains data from many different customers, and knowing which customer the data belongs to.
In the SQL world, we've always done this by having a customer_id field on the tables that are customer specific, and then always including that in the queries and indices. This works perfectly well for us, but in the Graph DB world it feels like we can do better.
The options that I've come up with so far are:
1. The same as before - including a property on the nodes that is the Customer ID
2. Storing a Label on each Node that identifies the Customer. However, as far as I can tell you can't bind parameters to labels so this would mean that the queries are generated slightly awkwardly.
3. Storing a Customer Node, and linking all of the other nodes to it.
Number #3 seems to be the "correct" Graph DB way of managing this, but I'm concerned with the impact of this on the performance of the data. It's perfectly feasible that there will be hundreds of thousands of links from a single Customer Node to the other data nodes, and there will be hundreds of different Customer Nodes. (Based on the volume of data in the existing SQL database)
What's the recommended way of achieving this level of data partitioning whilst maintaining performance?
We have to create a rather large Ruby on Rails application based on a large database. This database is updated daily; each table has about 500,000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for a user to move from version to version, where versions are "snapshots" of the main database at different points in time. In addition, some portions of the data need to be served to other external applications through an API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have its own application, creating a service with an API to interact with the data. This is needed because we don't want multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with a project of this magnitude and we're trying to find the best possible solution. We don't know whether this kind of data separation makes sense. If it does, how should we handle communication between the different applications and the individual services, and between the services themselves? That will also be required.
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In it, you have essentially three complementary models which work together to provide an up-to-date, well-performing solution. However, the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. Rows are inserted, and the only updates are to metadata used by some queries, and only when absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes provided that integrity to the other models is not impinged. In a financial application this might be things like customer or vendor data, employee data, and the like.
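A minimal sketch of the roll-forward idea, in Python rather than SQL: `Entry` and `Closing` are hypothetical stand-ins for rows in the event and closing tables. The balance for a date starts from the latest closing on or before that date and adds only the entries after it.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Entry:          # event model: append-only journal line
    day: date
    amount: Decimal


@dataclass(frozen=True)
class Closing:        # aggregate closing model: balance at end of `day`
    day: date
    balance: Decimal


def balance_on(day, entries, closings):
    # Pick the most recent closing that is not after the requested day.
    usable = [c for c in closings if c.day <= day]
    start = max(usable, key=lambda c: c.day, default=None)
    total = start.balance if start is not None else Decimal("0")
    # Roll forward: add only entries after the closing, up to `day`.
    for e in entries:
        if e.day <= day and (start is None or e.day > start.day):
            total += e.amount
    return total


entries = [
    Entry(date(2024, 1, 5), Decimal("100")),
    Entry(date(2024, 1, 20), Decimal("-30")),
    Entry(date(2024, 2, 3), Decimal("50")),
    Entry(date(2024, 2, 10), Decimal("-20")),
]
closings = [Closing(date(2024, 1, 31), Decimal("70"))]

print(balance_on(date(2024, 2, 10), entries, closings))  # 100
```

In a real database the loop would be a query restricted to rows after the closing date, so the cost depends on the activity since the last closing rather than on the full history.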