RavenDB - Planning for scalability

I have been learning RavenDB recently and would like to put it to use.
I was wondering what advice or suggestions people had around building the system in a way that is ready to scale, specifically sharding the data across servers, but that can start on a single server and only grow as needed.
Is it advisable, or even possible, to create multiple databases on a single instance and implement sharding across them? Scaling would then simply be a matter of spreading these databases across machines.
My first impression is that this approach would work, but I would be interested to hear the opinions and experiences of others.
Update 1:
I have been thinking more on this topic. My problem with the "sort it out later" approach is that it seems difficult to spread data evenly across servers in that situation. I will not have a string key that I can range on (A-E, F-M, ...); the keys will be numeric.
This leaves two options that I can see. One is to break at boundaries, so 1-50000 is on shard 1 and 50001-100000 is on shard 2; but on a site that ages, say like this one, the original shards will end up doing a lot less work. The alternative is a strategy that round-robins across the shards and puts the shard id into the key, but that suffers if you need to move a document to a new shard: the key would change and break any URLs that used it.
So my new idea, and again I am putting it out there for comment, is to create a bucketing system from day one. It works like stuffing the shard id into the key, except that you start with a large number of buckets, say 1000, and distribute documents evenly among them. When it comes time to split the load onto a second shard, you move buckets 501-1000 to the new server and write your shard logic so that buckets 1-500 go to shard 1 and 501-1000 go to shard 2. When a third server comes online, you pick another range of buckets and adjust.
To my eye this gives you the ability to split into as many shards as you originally created buckets, spreading the load evenly in terms of both quantity and age, without having to change keys.
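The bucket idea can be sketched roughly as follows. This is only an illustration under assumptions, not RavenDB API: the names NUM_BUCKETS, bucket_for, make_key and shard_for are hypothetical, document numbers are assumed to be sequential integers, and the shard table is a plain list of bucket upper bounds.

```python
from bisect import bisect_left

NUM_BUCKETS = 1000  # assumed value; chosen on day one and never changed


def bucket_for(doc_number: int) -> int:
    """Spread sequential numbers round-robin over buckets 1..NUM_BUCKETS."""
    return (doc_number - 1) % NUM_BUCKETS + 1


def make_key(collection: str, doc_number: int) -> str:
    """Embed the bucket (not the shard) in the key, so keys survive resharding."""
    return f"{collection}/{bucket_for(doc_number)}/{doc_number}"


def shard_for(bucket: int, upper_bounds: list[int], shards: list[str]) -> str:
    """upper_bounds[i] is the highest bucket served by shards[i], ascending."""
    return shards[bisect_left(upper_bounds, bucket)]


# Day one: a single server owns every bucket.
single = ([1000], ["shard1"])
# Later: buckets 501-1000 move to a second server; keys stay the same.
split = ([500, 1000], ["shard1", "shard2"])

key = make_key("posts", 1501)       # document 1501 lands in bucket 501
bucket = int(key.split("/")[1])
before = shard_for(bucket, *single)
after = shard_for(bucket, *split)
```

Only the small bounds table changes when a server is added; the key of an existing document, and any URL built from it, does not.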
Thoughts?

It is possible, but really unnecessary. You can start using one instance, and then scale when necessary by setting up sharding later.
Also see:
http://ravendb.net/documentation/docs-sharding
http://ayende.com/blog/4830/ravendb-auto-sharding-bundle-design-early-thoughts
http://ravendb.net/documentation/replication/sharding

I think a good solution is to use virtual shards. You can start with one server and point all virtual shards to it. Use modulo on the incremental id to distribute the rows evenly across the virtual shards. With Amazon RDS you have the option to promote a slave to a master, so before you change the sharding configuration (pointing more virtual shards to the new server), you promote a slave to master, then update your configuration file, and then delete from the new master all the records whose modulo falls outside the virtual shard range assigned to the new instance.
You also need to delete the moved rows from the original server, but by then all new data whose IDs fall in the new virtual shard ranges will already go to the new server. So you don't actually need to move the data; you take advantage of the Amazon RDS server promotion feature instead.
You can then make a replica off the original server. You create a unique ID as: Shard ID + Table Type ID + Incremental number. So when you query the database, you know which shard to go to and fetch the data from.
I don't know how it's possible to do this with RavenDB, but it can work pretty well with Amazon RDS, because Amazon already provides you with replication and server promotion features.
I agree that there should be a solution that offers seamless scalability right from the start, rather than telling the developer to sort the problems out when they occur. Furthermore, I've found that many NoSQL solutions that distribute data evenly across shards need to work within a low-latency cluster, so you have to take that into consideration. I tried Couchbase on two different EC2 machines (not in a dedicated Amazon cluster) and data balancing was very, very slow. That adds to the overall cost too.
I also want to mention what Pinterest did to solve their scalability issues: they used 4096 virtual shards.
You should also look into paging issues, which affect many NoSQL databases. With this approach you can page data quite easily, but maybe not in the most efficient way, because you might need to query several databases. Another problem is changing the schema. Pinterest solved this by putting all the data in a JSON blob in MySQL. When you want to add a new column, you create a new table with the new column data + key, and you can put an index on that column. If you need to query the data by email, for example, you can create another table with the emails + ID and put an index on the email column. Counters are another problem, I mean atomic counters. It's better to take those counters out of the JSON and put them in a column so you can increment the counter value.
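A rough sketch of the composite ID (Shard ID + Table Type ID + Incremental number) and the virtual-shard lookup might look like this. The bit widths, the SERVERS table and the host names db-a and db-b are assumptions made for illustration; nothing here is specific to Amazon RDS.

```python
SHARD_BITS, TYPE_BITS, SEQ_BITS = 12, 8, 36  # assumed widths: 4096 virtual shards


def virtual_shard(seq: int) -> int:
    """Modulo on the incremental number spreads rows over the virtual shards."""
    return seq % (1 << SHARD_BITS)


def make_id(shard: int, type_id: int, seq: int) -> int:
    """Pack shard id, table type id and incremental number into one integer."""
    return (shard << (TYPE_BITS + SEQ_BITS)) | (type_id << SEQ_BITS) | seq


def split_id(value: int) -> tuple[int, int, int]:
    """Recover (shard, type_id, seq) from a packed id."""
    seq = value & ((1 << SEQ_BITS) - 1)
    type_id = (value >> SEQ_BITS) & ((1 << TYPE_BITS) - 1)
    shard = value >> (TYPE_BITS + SEQ_BITS)
    return shard, type_id, seq


# Virtual-shard ranges mapped to physical servers (hypothetical hosts).
# Rebalancing means editing only this table.
SERVERS = [(range(0, 2048), "db-a"), (range(2048, 4096), "db-b")]


def server_for(value: int) -> str:
    """Find the physical server that currently owns the id's virtual shard."""
    shard, _, _ = split_id(value)
    for shards, host in SERVERS:
        if shard in shards:
            return host
    raise LookupError(f"no server configured for virtual shard {shard}")


user_id = make_id(virtual_shard(5000), 3, 5000)
```

Because the virtual shard is part of the ID, a lookup never has to ask every server where a row lives.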
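The JSON blob, lookup table and counter column described above can be sketched like this. SQLite from the Python standard library stands in for MySQL so the example is self-contained; the table names, column names and helper functions are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: the document lives in a JSON blob; the counter is a real column.
    CREATE TABLE users (
        id          INTEGER PRIMARY KEY,
        body        TEXT    NOT NULL,
        login_count INTEGER NOT NULL DEFAULT 0
    );
    -- Lookup table added later to query by email without touching the blob.
    CREATE TABLE users_by_email (email TEXT NOT NULL, user_id INTEGER NOT NULL);
    CREATE INDEX idx_users_by_email ON users_by_email (email);
""")


def add_user(user_id, doc):
    """Store the blob and its email lookup row in one transaction."""
    with conn:
        conn.execute("INSERT INTO users (id, body) VALUES (?, ?)",
                     (user_id, json.dumps(doc)))
        conn.execute("INSERT INTO users_by_email (email, user_id) VALUES (?, ?)",
                     (doc["email"], user_id))


def find_by_email(email):
    """Two simple lookups instead of a join: index table first, then the blob."""
    hit = conn.execute("SELECT user_id FROM users_by_email WHERE email = ?",
                       (email,)).fetchone()
    if hit is None:
        return None
    row = conn.execute("SELECT body FROM users WHERE id = ?", hit).fetchone()
    return json.loads(row[0])


def record_login(user_id):
    """Increment in the database, so concurrent writers do not lose updates."""
    with conn:
        conn.execute("UPDATE users SET login_count = login_count + 1 WHERE id = ?",
                     (user_id,))
```

Keeping the lookup as two single-table queries means the index table and the blob table do not have to live on the same shard.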
There are great solutions out there, but at the end of the day you find out that they can be very expensive. I preferred to spend time building my own sharding solution and save myself the headache later on. If you choose the other path, there are plenty of companies waiting for you to get into trouble and then ask for quite a lot of money to solve your problems, because at the moment you need them, they know you will pay anything to make your project work again. That's from my own experience, and that's why I am racking my brain to build my own sharding solution using your approach, which will also be much cheaper.
Another option is to use middleware solutions for MySQL like ScaleBase or DBshards. You can continue working with MySQL, and when you need to scale, they have a well-proven solution. The costs might be much lower than the alternative.
Another tip: when you create your config for shards, add a write_lock attribute that accepts false or true. When it is false, data won't be written to that shard, so when you fetch the list of shards for a specific table type (e.g. users), data will be written only to the other shards for that type. This is also good for backups: you can show a friendly error to visitors while you lock all the shards and back up the data to get a point-in-time snapshot of all of them. Although I think that with Amazon RDS you can send a global request to snapshot all the databases and use point-in-time backup.
The thing is that most companies won't spend time on a DIY sharding solution; they will prefer to pay for ScaleBase. DIY solutions come from solo developers who cannot afford to pay for a scalable solution from the start, but who want to rest assured that when they reach the point where they need it, they have one. Just look at the prices out there and you will see that it costs A LOT. I will gladly share my code with you once I'm done. You are on the best path in my opinion; it all depends on your application logic. I model my database to be simple, with no joins and no complicated aggregation queries, and this solves many of my problems. In the future you can use Map Reduce to handle those big data queries.

Related

How can I view only the context I'm working on?

In Neo4j, I created the database through the various exercises I'm doing.
When I run a query, for example MATCH (n) RETURN (n), even the database that was created in "Christmas of 1914" appears on the screen, making my interface ugly, polluted, and loaded with objects I don't need at that moment.
If I work with Northwind, I want to see only Northwind, if I work with Facebook, I just want to see Social, and so on. I do not want to see all the databases on the planet on my screen each time I run a query like MATCH (n) RETURN (n).
Neo4j doesn't really have a direct equivalent to multiple databases stored within the same server instance. There are three options for achieving this:
1) The closest match would be to run an additional instance of neo4j on the same server. You will need to edit the neo4j.conf file to give the new instance a new port number and a new data directory. This will give you isolation between the data and user accounts in the two databases. The downside is that you will need to divide up the RAM on the box before running, effectively limiting both instances to half the RAM.
2) You can attach labels to your nodes to identify which bucket of data (database in the RDBMS world) each node belongs to. You can operate as if the two are isolated even though they really live in the same database instance. Neo4j won't do a lot to help you enforce this; you will need to do the work at the application level. There is a mechanism for restricting users to interacting with only a subset of your graph, but you have to write custom procedures and restrict the users to using only those. I haven't tried it but it sounds tedious.
https://neo4j.com/docs/operations-manual/current/security/authentication-authorization/subgraph-access-control/
3) If you are running on VMs or the cloud, you might as well just create a new instance for your second database. It achieves the same effect as number one but with better isolation of resources.
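For option 2, the application-level work mostly comes down to always querying with the dataset's label instead of a bare MATCH (n). A small sketch, assuming every node of a dataset carries a label such as Northwind or Social (the helper name scoped_match is hypothetical); the label is interpolated into the query text, so it is validated first:

```python
import re

# Conservative pattern for a label that is safe to place in query text.
_LABEL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def scoped_match(dataset_label: str, limit: int = 100) -> str:
    """Build a Cypher query that returns only nodes carrying the given label."""
    if not _LABEL.fullmatch(dataset_label):
        raise ValueError(f"unsafe label: {dataset_label!r}")
    return f"MATCH (n:{dataset_label}) RETURN n LIMIT {int(limit)}"


query = scoped_match("Northwind")
```

Running the resulting query in the browser shows only the Northwind nodes, provided the label was attached when the data was loaded.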

Ruby on Rails performance on lots of requests and DB updates per second

I'm developing a polling application that will deal with an average of 1000-2000 votes per second coming from different users. In other words, it'll receive 1k to 2k requests per second with each request making a DB insert into the table that stores the voting data.
I'm using RoR 4 with MySQL and planning to push it to Heroku or AWS.
What performance issues related to database and the application itself should I be aware of?
How can I address this amount of inserts per second into the database?
EDIT
I was thinking of not inserting into the DB on each request, but instead writing the insert data to a memory stream. I would then have a scheduled job running every second that reads from this memory stream and generates a bulk insert, so that each vote does not need its own insert statement. But I cannot think of a nice way to implement this.
While you can certainly do what you need to do in AWS, that high level of I/O will probably cost you. RDS can support up to 30,000 IOPS; you can also use multiple EBS volumes in different configurations to support high IO if you want to run the database yourself.
Depending on your planned usage patterns, I would probably look at pushing into an in-memory data store, something like memcached or redis, and then processing the requests from there. You could also look at DynamoDB, which might work depending on how your data is structured.
Are you going to have that level of sustained throughput consistently, or will it be in bursts? Do you absolutely have to preserve every single vote, or do you just need summary data? How much will you need to scale - i.e. will you ever get to 20,000 votes per second? 200,000?
These types of questions will help determine the proper architecture.
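The buffer-then-bulk-insert idea from the question's edit can be sketched as below. This is an in-process illustration only: it is written in Python rather than Rails, SQLite stands in for MySQL, and the names record_vote and flush and the votes table are hypothetical. A buffer held in the web process is lost if the process dies, which is one reason the answer points to an external store such as redis.

```python
import queue
import sqlite3

votes = queue.Queue()  # thread-safe in-memory buffer shared by request handlers


def record_vote(poll_id, option_id):
    """Called once per request: just enqueue, no database round trip."""
    votes.put((poll_id, option_id))


def flush(conn, max_rows=5000):
    """Called by a scheduled job: drain the buffer and write one bulk insert."""
    batch = []
    while len(batch) < max_rows:
        try:
            batch.append(votes.get_nowait())
        except queue.Empty:
            break
    if batch:
        with conn:  # one transaction for the whole batch
            conn.executemany(
                "INSERT INTO votes (poll_id, option_id) VALUES (?, ?)", batch)
    return len(batch)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (poll_id INTEGER, option_id INTEGER)")
for option in (1, 2, 2):
    record_vote(7, option)
written = flush(conn)
```

The max_rows cap bounds the size of a single transaction; anything left in the buffer is picked up by the next run.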

How can one rails app on heroku access many databases

I want to be able to have one app access multiple databases on the HEROKU "system".
Can the connection to the database be changed dynamically?
Why I ask...
I have an app that has a lot of very processor-heavy background jobs. If a given user uploads a product feed of, say, 50,000 products that have to be compared to existing products so that only the deltas are updated, it can take a "few" minutes.
Now to mitigate the delay I spin up multiple workers, each taking small bites out of the lot until there's none. I can get to about 20 workers before the GUI starts to feel sluggish because the DB is being hammered.
I've tuned some of the code and indexed the DB to some extent, and I'm sure there's more I could do, but it will eventually run into the law of diminishing returns.
For one user, I don't much care... if you upload 50k products you need to wait a bit..
But user one's choice to upload impacts user two. (different company so no cross over of data)..
Currently I handle different users by separating their data with schemas in postgresql.
The different users, however, share the same DB connection, and even on the best plan I can see a time when 20 users try to upload 50,000 products at the same time (first of the month/quarter, for example).
User 21 would see a huge slow down on their system because of this..
So the question: Can I assign different users to different databases? User logs in, validates their info against a central DB, and then a different DB takes over?
My current solution is different instances of heroku. It's easy to maintain the code because it's one base and I just script the git push(es). The only issue is the different login URL's; which I suppose I could confront if I can't find an easy DB switch solution.
It sounds like you're able to shard your data by user, or set of users without much concern since you already separate them by schema. If that's the case, and you're using Ruby and ActiveRecord, look at https://github.com/tchandy/octopus. I imagine you're not looking to spin up databases on the fly, rather, you'll have them already built and ready to be used, and can add more as you go.
Granted, it sounds like what you're doing could be done a lot more effectively by using the right tool for that type of intensive processing like one of the Heroku Hadoop add-ons; nonetheless, if that's not an option for whatever reason, check out the gem above. There are a couple other gems like it, and of course you could technically manage your own ActiveRecord connections without this gem, but I think you'll find that will be painful really fast.
Of course, if you aren't using Ruby or ActiveRecord, still shard the data, and look for something like the gem above in your app's language :).
The Postgres databases on Heroku are configured with environment variables. When you run heroku config you should see:
DATABASE_URL: postgres://xxx.compute.amazonaws.com:5432/xxx
You can use these variables to connect to databases on other Heroku instances, or to share a single database between different Heroku apps.
If you try to run this kind of stuff on free Heroku instances, I think it is against their terms of service.
If it's about scalability, I think you will just have to pay for a more expensive database instance...
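Combining the two answers, per-user routing could be driven entirely by such environment variables. A minimal sketch, assuming one variable per shard named SHARD_<NAME>_DATABASE_URL (a hypothetical convention, not something Heroku defines) and a tenant-to-shard mapping loaded from the central DB; it is shown in Python for illustration, and a Rails app would do the equivalent through ActiveRecord or a gem like the one linked above.

```python
import os
from urllib.parse import urlsplit


def shard_env_name(shard: str) -> str:
    """Hypothetical naming convention for per-shard connection URLs."""
    return f"SHARD_{shard.upper()}_DATABASE_URL"


def connection_params(url: str) -> dict:
    """Split a postgres:// URL into the pieces a database driver needs."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


def params_for_tenant(tenant_to_shard: dict, tenant: str, env=os.environ) -> dict:
    """Look up the tenant's shard, then read that shard's URL from the environment."""
    shard = tenant_to_shard[tenant]
    return connection_params(env[shard_env_name(shard)])


# Example values only; real URLs come from `heroku config`.
example_env = {"SHARD_B_DATABASE_URL": "postgres://u:p@example.com:5432/tenants_b"}
params = params_for_tenant({"acme": "b"}, "acme", example_env)
```

Moving a heavy uploader to its own database then only requires changing that tenant's entry in the mapping.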

Data Warehouse: One Database or many?

At my new company, they keep all data associated with the data warehouse, including import, staging, audit, dimension and fact tables, together in the same physical database.
I've been a database developer for a number of years now and this consolidation of function and form seems counter to everything I know.
It seems to make security, backup/restore and performance management issues more manually intensive.
Is this something that is done in the industry? Are there substantial reasons for doing or not doing it?
The platform is Netezza. The size is in terabytes, hundreds of millions of rows.
What I'm looking to get from answers to this question is a solid understanding of how right or wrong this path is. From your experience, what are the issues I should focus on arguing, if this is a path that will cause trouble for us down the road? If it is no big deal, then I'd like to know that as well.
In general I would recommend using separate databases. This is the configuration I have always seen used in production and it really makes a lot of sense since - as you mentioned - both databases have fundamentally different purposes / usage patterns / etc.
Edit
If you're using one physical server, the fewer instances on that server the simpler the management and the more efficient the process.
If you put TWO instances on the same Physical Server you get:
Negatives:
Half the memory to use
Twice the count of database processes
Positives:
You could take the entire staging db down without affecting the DW
So which is more precious to you, outage windows or CPU and Memory?
On the same physical server, multiple instances make performance management issues MUCH more manual to solve. If you look at the health of one of the instances, it might look fine while users are reporting poor performance, so you have to look at the next instance to see if the problem may be coming from there... and so on per instance.
Security is also harder with more than one instance. At best it's just as hard as a single instance but it's never easier. You'll have two admin accounts (SYS or something), Duplicate process accounts, etc.
Tell us why you think it's better to have more than one instance.
ORIGINAL POST
Can we be clear on terms? When you say "in the same Database", do you mean the same instance, or the same physical server? If you did move the staging to a new instance, would it reside on the same physical hardware?
I think people get a little too hung up on instances. If you're going to put two instances on the same piece of hardware, you're only doubling the number of everything to very little advantage. All the server processes will be running twice... all the memory pools will be cut in half.
so let's say you really did mean two separate physical boxes...
Let's say you buy two 12-way boxes (just say). When your staging db server is done for the day, those 12 CPUs are wasting away. When your users pack up and go home, your prod DW CPUs are wasting away. CPU cycles are perishable; you can't get them back. BUT, if you had one 24-way box... then the staging DB COULD use 20 CPUs at night for some excellent Parallel Execution for building summary tables, and your users would have double the capacity for processes during the day.
so let's say you meant the same hardware.
"It seems to make security, backup/restore and performance management issues more manually intensive."
Guaranteed that performance issues are harder to solve the more instances that share the same hardware. Guaranteed.
Security
What security do you do at the instance level?
Backup
What DW are you backing up at the instance level? You're not backing up tablespaces, but rather whole instances? Seems like that pattern will fail at a certain size.
PLATFORM: NETEZZA
Not familiar with the tool specifically. If it's a single instance on a single box, then the division would seem more logical than physical, and therefore the reason the databases exist is management, not performance. You don't increase your CPUs or memory by adding a database, right? So it doesn't seem like there's any performance upside to it. Each DB may be adding separate processes (a performance hit), or it might be completely logical, like schemas in Oracle. If each database is managed by new processes, then data going between them will mean IPC.
Maybe the addition of the Netezza tag will get some traction.
We use a separate database for every segment (INVENTORY, CRM, BILLING...). There are no performance downsides, and maintenance and overview are much better.
Better late than never, but for Netezza:
There are no performance hits when querying across databases. Netezza allows only SELECT operations across databases; no INSERT, UPDATE or DELETE statements are allowed.
This means you cannot do:
THISDB(ADMIN)=>INSERT INTO OTHERDB..TBL SELECT * FROM THISDBTABLE;
but you can do \c OTHERDB then
OTHERDB(ADMIN)=>INSERT INTO TBL SELECT * FROM THISDB..THISDBTABLE;
You are also not able to create a materialized view on a cross-database object, for example:
OTHERDB(ADMIN)=>CREATE MATERIALIZED VIEW BLAH AS SELECT * FROM THISDB..THISDBTABLE;
Administration might be where you will decide (though you probably already did long ago) on what kind of database(s) you'll create. Depending on your infrastructure, you might have a TEST/QA system and a PROD system on the same box, or on separate boxes.
You will gain speed in the load and the output if the tables are in the same schema (database). Obvious...but hey, I said it.
There is more overhead the more tables you put into one schema: backup time, size of backups, ease of use.
Where I am, we have many multi-TB databases within one data warehouse. Our rule of thumb is that a single loading process or a single report query should NOT have to span databases. This keeps "like" tables together but gives some allowances for our backups and contingency processes. It also makes it a bit easier to "find" data.
For those processes that need to break this rule, we will either move data from one database to the other or allow the process to join across schemas.
I'm not as familiar with Netezza, so I'm not 100% sure what your options might be.
A few points for you to consider:
a) If the data in one or more staging, audit, dimension and fact tables has to be joined, you are better off keeping them in one database
b) Typically you will retain dimension tables and fact tables in the same database and distribute on the most frequently joined columns to leverage the "co-located join" functionality of Netezza
c) You should be able to use SQL GRANT permissions to manage access to all objects (DB, tables, views etc)

multiple db connections vs. centralized/redundant db

I have a project to create a dashboard that will connect to existing systems as well as create new features based on combining data from the existing systems. For example, the dashboard will be able to generate "orders" containing data merged from "members" (MS Access DB), "employees" (MySQL DB) and "products" (flat file), and there will also be new attributes particular to "orders."
At first I thought it would be most efficient to have my application connect to each of the systems separately and perform cross-vendor joins between the different databases. But then I thought that creating a centralized/redundant db (built with scripts pushing and pulling data between the systems) might also be useful because it would empower some semi-technical staff to use products like OOBase, which can only make a single connection.
Are there any other advantages to creating a centralized/redundant DB like the one I'm talking about? Or are multiple direct connections the best approach?
Thanks in advance for any tips.
To give you a short answer: yes, you want a central data storage.
You don't want to run complex reports on your live database. As your live database grows, you will want to do some housekeeping and clean it up, but keep the data for analysis.
You will also want the data to be aggregated so you can perform historical analysis.
For the data that comes from different sources, some clean-up will be required. You will probably need to know how to link your data together, and there are quite a lot of things like that which you will have to be aware of to do the job properly.
You might consider reading up on data warehousing (Wikipedia) and business intelligence (Wikipedia).
If you want to have 'new features' added to this system you could also look up orchestration (Wikipedia). It will allow you to link your heterogeneous business processes together.
All of these are quite specialized and complex disciplines on their own so you might want to have a specialist to consult you.
Be very, very careful about copying lots of data around. If you do, here are some important guidelines:
Make sure that one system is defined as the master and no other system may tamper with the data.
Always copy data from the master to the slaves.
When you copy the data, use a checksum of some kind to make sure all data has been copied. Make sure you can handle "yesterday, the copy failed".
If a slave must make a change, push the change to the master and then use the standard "update" path to merge it back to the slave. Avoid "save change on slave and update the master some time in the future".
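The checksum guideline can be sketched like this. It is a minimal illustration, assuming each table can be read as an iterable of row tuples; the function names are hypothetical, and a real job would read the rows from the master and the copy through their own drivers.

```python
import hashlib


def table_checksum(rows):
    """Return (row_count, sha256 hex digest) over the rows, independent of order."""
    digest = hashlib.sha256()
    count = 0
    # Sort the text form of each row so both sides hash in the same order.
    for line in sorted(repr(row) for row in rows):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
        count += 1
    return count, digest.hexdigest()


def verify_copy(master_rows, copy_rows):
    """True only if the copy holds exactly the same rows as the master."""
    return table_checksum(master_rows) == table_checksum(copy_rows)


master = [(1, "alice"), (2, "bob"), (3, None)]
good_copy = [(3, None), (1, "alice"), (2, "bob")]
bad_copy = [(1, "alice"), (2, "bob")]
```

Storing the result of each nightly comparison gives the job a way to notice "yesterday, the copy failed" and to redo the copy from the master instead of carrying on with stale data.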
