Client DB - CoreData (iOS)
Server DB - MySQL
I am trying to achieve data synchronisation between client and server, but the complicated part is that the schema is highly relational. I have been going through a couple of synchronisation patterns already in use, and most of them seem to be based on a NoSQL or schemaless DB. I am wondering whether there are any sync patterns for highly relational data. I have already gone through Couchbase, the Dropbox Sync API, Wasabi sync, etc. My concerns are the following:
1) By highly relational data I mean that there are several tables related to each other, and Creates/Updates happen on all of them. Right now I am planning to issue separate CRUD requests for each table. Is that a good approach? The problem is that there has to be a strict ordering of the requests, because the changes in table-3 cannot be processed before the table-2 data is received. This relationship is what makes the sync hard.
2) Change tracking on the client: what would be the best way to identify the changes in a particular table (CoreData entity)? I am planning a delta approach where only the changes to objects of the same kind are uploaded at a time. Any insights/links on this?
3) Data merging/conflict resolution: I am stuck on this part. One way would be to keep a modified timestamp in each object, but what if the device clocks are out of sync or have been changed manually? (A sketch of the change log I have in mind follows this list.)
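To make (2) and (3) concrete, here is the kind of client-side change log I have in mind, sketched in plain Java rather than the actual CoreData code (ChangeRecord/ChangeLog are names I made up): every local mutation gets a monotonically increasing local sequence number, so the server can order my changes without trusting the device clock.

// Illustrative sketch only, not the real CoreData client code.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

enum Op { CREATE, UPDATE, DELETE }

class ChangeRecord {
    final String entity;     // e.g. "Employee"
    final String objectId;   // client-generated UUID for the object
    final Op op;
    final long localSeq;     // ordering within this device, clock-independent

    ChangeRecord(String entity, String objectId, Op op, long localSeq) {
        this.entity = entity;
        this.objectId = objectId;
        this.op = op;
        this.localSeq = localSeq;
    }
}

class ChangeLog {
    private final AtomicLong seq = new AtomicLong();
    private final List<ChangeRecord> pending = new ArrayList<>();

    // Called whenever a local object is created, updated or deleted.
    synchronized void record(String entity, String objectId, Op op) {
        pending.add(new ChangeRecord(entity, objectId, op, seq.incrementAndGet()));
    }

    // Delta upload: everything recorded since the last successful sync.
    synchronized List<ChangeRecord> drainPending() {
        List<ChangeRecord> batch = new ArrayList<>(pending);
        pending.clear();
        return batch;
    }
}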
I wanted to know the implications/challenges of such a sync pattern with an RDBMS-backed server, or about any alternative approaches.
Problem #1 Explained
Assume there are 10 tables and the APIs expose CRUD requests for these 10 tables. One request will only C/R/U/D a single table. So my question is whether this is a good way to design APIs when it comes to offline syncing of data. For example, consider relational data such as:
Organization->Employee->Department->Project
Assume some objects in these 4 tables got created offline. Now we need to sync the data to the server when the network is back. So it will be: Create/Update Organizations first; once that is over, Create/Update Employees so that they can be linked to Organizations, and so on. Basically, every C/U/D will be issued from the top-level objects down to the bottom-level ones. My question is whether this is a good approach in a sync problem, because if the data were not relational we could have uploaded the changes in all the tables in a single C/U/D API call.
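To illustrate the ordering, this is roughly what the sync loop would look like, sketched in Java with a hypothetical uploadBatch call (the real client would issue HTTP requests from iOS; it reuses the ChangeRecord idea from the sketch above):

import java.util.List;
import java.util.Map;

// Parent tables must be synced before the tables that reference them,
// so the upload order is fixed from top to bottom of the hierarchy.
public class OrderedSync {
    // Dependency order: Organization -> Employee -> Department -> Project
    private static final List<String> SYNC_ORDER =
            List.of("Organization", "Employee", "Department", "Project");

    public void sync(Map<String, List<ChangeRecord>> pendingByEntity) {
        for (String entity : SYNC_ORDER) {
            List<ChangeRecord> batch = pendingByEntity.getOrDefault(entity, List.of());
            if (!batch.isEmpty()) {
                // One C/U/D request per table, issued only after all of the
                // table's parents have been synced.
                uploadBatch(entity, batch);
            }
        }
    }

    private void uploadBatch(String entity, List<ChangeRecord> batch) {
        // placeholder: POST /api/<entity>/batch
    }
}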
It seems that you might not be aware of the typical relational DBMS facilities and protocols that support simultaneous write access by multiple sessions, making them suitable for multi-user, highly concurrent, and OLTP applications.
1) Your API to access MySQL allows you to make your changes atomically (all or nothing) via a transaction. Within that transaction you should update as many tables as possible together, but you can sequence those changes as necessary. By acquiring locks in the same order in every transaction you avoid deadlocks. You can also ask the DBMS to lock only the parts of tables that a transaction could possibly change, so that non-overlapping clients can proceed concurrently. (See the sketch after this list.)
2) Your schema can explicitly record redundant delta information that you get the DBMS to calculate on updates, or it can record enough past changes to calculate deltas on request. Your client can give the DBMS its transaction data and the DBMS can return the relevant info based on it and on the past. You probably do not need to, and should not, keep any persistent state on your client. That is what the server database is for; the client database is a buffer for it plus user info.
3) You can use an explicit client serial transaction id, so that client plus id indicates the order in which the client thinks its transactions were sent, regardless of its clock.
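As a minimal illustration of points 1 and 3, here is a JDBC sketch against made-up table names (organization, employee, client_transaction); the exact SQL and schema are assumptions, not your actual API:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SyncTransaction {
    // Apply one client sync batch atomically. Tables are always written in the
    // same order (parents before children), which both satisfies foreign keys
    // and avoids deadlocks between concurrent clients.
    public void applyBatch(String clientId, long clientSeq) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/sync_demo", "user", "password")) {
            conn.setAutoCommit(false);
            try {
                // 3) Record the client's own serial transaction id so the server
                //    knows the order the client intended, regardless of its clock.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO client_transaction (client_id, client_seq) VALUES (?, ?)")) {
                    ps.setString(1, clientId);
                    ps.setLong(2, clientSeq);
                    ps.executeUpdate();
                }
                // 1) Parent table first ...
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO organization (id, name) VALUES (?, ?) "
                      + "ON DUPLICATE KEY UPDATE name = VALUES(name)")) {
                    ps.setString(1, "org-1");
                    ps.setString(2, "Acme");
                    ps.executeUpdate();
                }
                // ... then the tables that reference it, always in the same order.
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO employee (id, org_id, name) VALUES (?, ?, ?) "
                      + "ON DUPLICATE KEY UPDATE name = VALUES(name)")) {
                    ps.setString(1, "emp-1");
                    ps.setString(2, "org-1");
                    ps.setString(3, "Alice");
                    ps.executeUpdate();
                }
                conn.commit();   // all or nothing
            } catch (SQLException e) {
                conn.rollback();
                throw e;
            }
        }
    }
}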
I wonder how much you have googled.
I am only interested in atomic transactions and strong consistency. Does Firebase Realtime Database support both?
I don't see any transaction locks in the Firebase database anywhere, and a lock on an item is needed to support atomicity. So my first thought is that Firebase is probably not an atomic database.
I also don't know the back-end architecture of the Firebase database. I am not sure whether it always reads from the master node or from slave nodes too, so I can't tell whether it is strongly consistent or eventually consistent.
Realtime Database supports transactions. Clients must all agree on how to cooperate with respect to these transactions. The database doesn't support any sort of operation that locks the entire database in order to serialize access from all clients. You need to understand how RTDB transactions work in order to make effective use of them. Not all writes will require a transaction, and you need to figure out for yourself when and how best to use them in your particular application.
Since Realtime Database is a cloud-hosted database, you don't need to know (or care) about any sort of master/slave configuration. In fact, you can just assume that it works as the documentation suggests. The documentation suggests that it's eventually consistent if the client is offline at the time of a write operation (which will be cached locally and synchronized when it becomes online). It's immediately consistent if the client is already online, and the client is willing to "wait around" for the latest update as it listens to changing data in the database, whenever it becomes available to the client. (There are actually no "replicas" to speak of with Realtime Database, except the local caches that each client may maintain for themselves for data previously read.)
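For reference, a transaction against RTDB looks roughly like this with the Java/Android SDK (the counters/votes path is made up; check the handler signatures against the SDK version you use):

import com.google.firebase.database.DataSnapshot;
import com.google.firebase.database.DatabaseError;
import com.google.firebase.database.DatabaseReference;
import com.google.firebase.database.FirebaseDatabase;
import com.google.firebase.database.MutableData;
import com.google.firebase.database.Transaction;

public class VoteCounter {
    // Atomically increment a counter. RTDB re-runs doTransaction if the value
    // changed underneath us, so there is no explicit lock to take.
    // Assumes FirebaseApp has already been initialised.
    public void incrementVotes() {
        DatabaseReference ref =
                FirebaseDatabase.getInstance().getReference("counters/votes");
        ref.runTransaction(new Transaction.Handler() {
            @Override
            public Transaction.Result doTransaction(MutableData currentData) {
                Long current = currentData.getValue(Long.class);
                currentData.setValue(current == null ? 1L : current + 1L);
                return Transaction.success(currentData);
            }

            @Override
            public void onComplete(DatabaseError error, boolean committed,
                                   DataSnapshot snapshot) {
                // committed == false means the transaction was aborted
            }
        });
    }
}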
I have an MVC web site, where users can search for large recordsets from SQL Server and Oracle databases. Some of these recordsets can be very large, with many thousands of records. Sadly, it is a user requirement that they do not make their searches more specific.
When a user posts their search request to the database, my web page is hanging before often timing out (due to the amount of time taken to query the database).
We are thinking about removing the expensive database calls from the MVC site, and sending the query to a separate process to run in the background. When the query is complete, we can notify the user.
My proposed solution is:
1) When the user completes the search form in the web page, to simply display a message that the results are being generated and will be sent when complete
2) Send the SQL query to a database which can contain a list of SQL queries that need to be processed
3) Create a Windows Service which checks this database every couple of minutes for new queries
4) This Windows Service then queries the database. When the query is completed, it will create a CSV of the results and email it to the user (sketched just below)
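This is roughly what I have in mind for steps 3 and 4, written as a plain Java worker loop rather than an actual Windows Service, with a hypothetical queued_queries table and placeholder CSV/email helpers:

import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Pick up queued searches, run them, dump the results to CSV and notify the
// user. A real service would also mark rows as in-progress and handle failures.
public class QueryWorker {

    private static class Request {
        long id;
        String sql;
        String email;
    }

    public void pollOnce(Connection conn) throws Exception {
        // 1. Collect the pending requests.
        List<Request> pending = new ArrayList<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT id, sql_text, user_email FROM queued_queries WHERE status = 'NEW'")) {
            while (rs.next()) {
                Request r = new Request();
                r.id = rs.getLong("id");
                r.sql = rs.getString("sql_text");
                r.email = rs.getString("user_email");
                pending.add(r);
            }
        }
        // 2. Run each query, write the CSV, email it, mark the request done.
        for (Request r : pending) {
            Path csv = runToCsv(conn, r.sql, r.id);
            sendEmail(r.email, csv);
            try (PreparedStatement done = conn.prepareStatement(
                    "UPDATE queued_queries SET status = 'DONE' WHERE id = ?")) {
                done.setLong(1, r.id);
                done.executeUpdate();
            }
        }
    }

    private Path runToCsv(Connection conn, String sql, long id) throws Exception {
        Path out = Files.createTempFile("results-" + id + "-", ".csv");
        StringBuilder sb = new StringBuilder();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                for (int c = 1; c <= md.getColumnCount(); c++) {
                    sb.append(rs.getString(c));
                    sb.append(c < md.getColumnCount() ? "," : "\n");
                }
            }
        }
        Files.writeString(out, sb.toString());
        return out;
    }

    private void sendEmail(String to, Path attachment) {
        // placeholder: hand the CSV off to the mail system
    }
}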
I am looking for advice and comments on the above approach. What do folks think of this as a way to process expensive database calls in the background?
Generally speaking the requests will be made infrequently, but as mentioned, will be for a great amount of data. There is a chance that two or more requests could be made at the same time, but this will be infrequent.
I will also look at optimising the databases.
Grateful for any tips.
Martin :)
Another option is to supplement the existing code to execute the query on a separate thread, so that periodic keep-alive updates can be sent to the requesting page while you wait for the query results. Similar to the way insurance quote aggregator pages work.
A second option is to make the results available as a hyperlink when they are ready and then communicate that either through the website or by email to the user.
Option three: if these queries are not completely ad hoc, you could profile for the most frequent combinations and pre-compute them periodically, placing the results into new tables (sort of halfway to optimising the current database structure).
The caveat there is that the data won't be as up to date - but given the time the queries are currently taking it probably isn't that important to be up to the second?
Whichever solution you choose, I think it's going to depend on user expectations. Do they know what they want, send one big query, get it, and be happy? Or do they try several queries to find the right combination of parameters? If the latter, waiting for an email delivery of results might not be acceptable to them; but if what they want is a downloadable results document and they know what they want first time, then it may be. The only problem I see here is emails going astray or taking longer than the user thinks they should, causing the request to be resubmitted multiple times and increasing the server workload, so caching queries and results is probably a very good idea.
I would suggest introducing a layer of abstraction such as a message broker. The request goes into a queue, a batch layer consumes it from the queue, and once the heavy work is done the batch layer notifies the web layer again via the broker: the Request-Reply pattern.
In addition, on the database side it is always good to optimize queries.
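For example, with RabbitMQ's Java client the web layer only publishes a request and later gets notified on a reply queue; the queue names and payload below are made up:

import java.nio.charset.StandardCharsets;
import com.rabbitmq.client.AMQP;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

// Request-Reply over a broker: the web layer drops a request on a queue and
// returns immediately; the batch layer runs the heavy query and publishes a
// "ready" notification to the reply queue, correlated by id.
public class ReportBroker {
    private final Connection conn;

    public ReportBroker() throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        conn = factory.newConnection();
    }

    // Web layer: enqueue the search and return straight away.
    public void sendRequest(String searchJson, String correlationId) throws Exception {
        try (Channel ch = conn.createChannel()) {
            ch.queueDeclare("report.requests", true, false, false, null);
            AMQP.BasicProperties props = new AMQP.BasicProperties.Builder()
                    .correlationId(correlationId)
                    .replyTo("report.replies")
                    .build();
            ch.basicPublish("", "report.requests", props,
                    searchJson.getBytes(StandardCharsets.UTF_8));
        }
    }

    // Web layer: long-lived consumer notified when the batch layer is done.
    public void startReplyConsumer() throws Exception {
        Channel ch = conn.createChannel();
        ch.queueDeclare("report.replies", true, false, false, null);
        DeliverCallback onReply = (tag, delivery) -> {
            String correlationId = delivery.getProperties().getCorrelationId();
            // look up the original request by correlationId and notify the user
        };
        ch.basicConsume("report.replies", true, onReply, tag -> { });
    }
}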
This is just a general question, not too technical. We have a use case where we need to load hundreds of thousands of records into an existing Neo4j database. We cannot afford to take the database offline because of the users who are accessing it. I know that Neo4j requires an exclusive lock on the database while it's performing batch updates. Is there a way around this? I don't want to lock my database while doing updates; I still want my users to access it, even if only read-only. Thanks.
Neo4j never requires an exclusive lock on the whole database; it selectively locks the portions of the graph that are affected by mutating operations. So there are some things you can do to achieve your goal. Are you a Neo4j Enterprise customer?
Option 1: If so, you can run your batch insert on the master node and route users to slaves for reading.
Option 2: Alternatively, you could do a "blue-green" style deployment where you:
take a backup (B) of your existing database (A), then mark the A database read-only
apply your batch inserts onto B either by starting a separate instance, or even better, using BatchInserters. That way, you'll insert your hundreds of thousands in a few seconds
start the new database B
flip a switch on a load balancer, so that users start being routed to B instead of A
take A down
(Please let me know if you need some tips how to make a read-only DB.)
Option 3: If you can only afford to run one instance at any one time, then there are techniques you can employ to let your users access the database as usual and still insert large volumes of data. One of them could be using a single-threaded "writer" with a queue that batches write operations. Because one thread only ever writes to the database, you never run into deadlock scenarios and people can happily read from the database. For option 3, I suggest using GraphAware Writer.
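A stripped-down illustration of option 3 (GraphAware Writer does this properly; here the actual Neo4j write is left as a placeholder task):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// One dedicated thread drains a queue of write tasks and applies them one by
// one, so writes never compete for locks and readers are unaffected.
public class SingleThreadedWriter {
    private final BlockingQueue<Runnable> writes = new LinkedBlockingQueue<>();

    public SingleThreadedWriter() {
        Thread writer = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    writes.take().run();   // the only thread that ever writes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "neo4j-writer");
        writer.setDaemon(true);
        writer.start();
    }

    // Called from any request thread; returns immediately.
    public void submit(Runnable writeTask) {
        writes.add(writeTask);
    }
}

Callers would submit something like writer.submit(() -> { /* run a write transaction against the graph */ }); while reads keep going through the normal path.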
I've assumed you are not trying to insert hundreds of thousands of nodes into a running Neo4j database using Cypher. If you are, I would start there and change it to use the Java APIs or the BatchInserter API.
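If you do end up on the BatchInserter route for option 2, a minimal sketch using the Neo4j 3.x-era org.neo4j.unsafe.batchinsert API looks like this; note the inserter needs exclusive access to the store files, which is why it runs against the copy B rather than the live database:

import java.io.File;
import java.util.Map;

import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BulkLoad {
    public static void main(String[] args) throws Exception {
        // Points at the *copy* of the store (database B), not the live one.
        BatchInserter inserter = BatchInserters.inserter(new File("/data/neo4j-b/graph.db"));
        try {
            Label person = Label.label("Person");
            RelationshipType knows = RelationshipType.withName("KNOWS");

            long alice = inserter.createNode(Map.<String, Object>of("name", "Alice"), person);
            long bob   = inserter.createNode(Map.<String, Object>of("name", "Bob"), person);
            inserter.createRelationship(alice, bob, knows, Map.of());
            // ... hundreds of thousands more, with no transactions and no locking
        } finally {
            inserter.shutdown();   // flushes everything to the store files
        }
    }
}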
I'm developing a polling application that will deal with an average of 1000-2000 votes per second coming from different users. In other words, it'll receive 1k to 2k requests per second with each request making a DB insert into the table that stores the voting data.
I'm using RoR 4 with MySQL and planning to push it to Heroku or AWS.
What performance issues related to database and the application itself should I be aware of?
How can I address this amount of inserts per second into the database?
EDIT
I was thinking of not inserting into the DB on each request, but instead writing the insert data to an in-memory stream. Then I would have a scheduled job running every second that reads from this stream and generates a bulk insert, so that the inserts are not issued one at a time. But I cannot think of a nice way to implement this.
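Something like the following is what I mean, sketched in Java rather than Ruby: a concurrent in-memory queue that a scheduled job drains once per second into a single batched INSERT (table and column names are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Each request appends its vote to an in-memory queue; a scheduled job drains
// the queue every second and writes one batched INSERT instead of 1-2k
// individual ones. Votes still buffered in memory are lost if the process dies.
public class VoteBuffer {
    private final ConcurrentLinkedQueue<int[]> pending = new ConcurrentLinkedQueue<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public VoteBuffer() {
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    // Called from the request path: O(1), no database work.
    public void recordVote(int pollId, int optionId) {
        pending.add(new int[] { pollId, optionId });
    }

    private void flush() {
        List<int[]> batch = new ArrayList<>();
        int[] vote;
        while ((vote = pending.poll()) != null) {
            batch.add(vote);
        }
        if (batch.isEmpty()) {
            return;
        }
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/polls", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO votes (poll_id, option_id) VALUES (?, ?)")) {
            for (int[] v : batch) {
                ps.setInt(1, v[0]);
                ps.setInt(2, v[1]);
                ps.addBatch();
            }
            ps.executeBatch();   // one batched statement instead of one per vote
        } catch (Exception e) {
            pending.addAll(batch);   // in a real app: retry or dead-letter the batch
        }
    }
}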
While you can certainly do what you need to do in AWS, that high level of I/O will probably cost you. RDS can support up to 30,000 IOPS; you can also use multiple EBS volumes in different configurations to support high IO if you want to run the database yourself.
Depending on your planned usage patterns, I would probably look at pushing into an in-memory data store, something like memcached or redis, and then processing the requests from there. You could also look at DynamoDB, which might work depending on how your data is structured.
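For example, with Redis the request path becomes a single RPUSH and a separate worker drains the list into the database at its own pace. A sketch with the Jedis client (the key name is made up):

import java.util.List;
import redis.clients.jedis.Jedis;

public class RedisVoteQueue {
    // Request path: push the raw vote onto a Redis list, O(1) and in memory.
    public void recordVote(String voteJson) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.rpush("votes:incoming", voteJson);
        }
    }

    // Worker: pop votes in chunks and bulk-insert them into MySQL/DynamoDB/etc.
    // (lrange + ltrim is not atomic; a real worker would use a Lua script or
    //  a blocking pop to avoid losing items if it crashes mid-drain.)
    public List<String> drainChunk(int maxItems) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            List<String> chunk = jedis.lrange("votes:incoming", 0, maxItems - 1);
            jedis.ltrim("votes:incoming", chunk.size(), -1);
            return chunk;
        }
    }
}

Unlike an in-process buffer, the Redis list survives application restarts and can be shared by several web dynos/instances.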
Are you going to have that level of sustained throughput consistently, or will it be in bursts? Do you absolutely have to preserve every single vote, or do you just need summary data? How much will you need to scale - i.e. will you ever get to 20,000 votes per second? 200,000?
These types of questions will help determine the proper architecture.
My team is currently building a new SaaS application for our company (Amilia.com). We are in "alpha" release and the application was built to be deployed on a web farm.
For our session provider we are using SQL Server mode (in DEV and TEST) and it does not seem to be "scalable", so we are looking for the best solution for handling sessions in ASP.NET (MVC3 in our case). We are currently using SQL Server, but we would like to switch to another system because of the license cost.
We target 20,000 [EDITED, was 100k before] concurrent users. In the session we store a GUID, a string, and a Cart object (we try to keep it as small as possible; this object saves us 3 queries on each request).
Here are the different solutions I've found :
ASP.NET built-in solutions:
No session : impossible in our case (eliminated)
In-Proc Mode : can't be used in a webfarm. (eliminated)
StateServer Mode : can be used in a webfarm but if the server goes down, I lose all my sessions. (eliminated)
StateServer Mode with a PartitionResolver using multiple servers (http://msdn.microsoft.com/en-ca/magazine/cc163730.aspx#S8): if I understand correctly, if one of these servers goes down, only a portion of my users will lose their session.
SqlServer Mode : can be used in a webfarm, if the server goes down, I can recover my sessions but the process is quite slow. Moreover, that database becomes a bottleneck in case of heavy load.
SqlServer Mode with a PartitionResolver using multiple servers (http://www.bulletproofideas.net/2011/01/true-scale-out-model-for-aspnet-session.html): if one of these servers goes down, only a portion of my users will lose their session. If the user was doing nothing during the downtime, he will recover his previous session; otherwise he will be redirected to the sign-in screen.
Custom solutions :
Use MongoDB as session storage (http://www.adathedev.co.uk/2011/05/mongodb-aspnet-session-state-store.html): it seems to be a good trade-off, but my knowledge of NoSQL is quite rudimentary, so I cannot see the cons.
Use Memcached: the problem will be the same as with StateServer mode, and if the memcached server goes down, all my sessions are lost. Furthermore, I think Memcached is not designed to store session state?
Use distributed memcached like ScaleOut (http://highscalability.com/product-scaleout-stateserver-memcached-steroids) : seems to be the best solution but it costs money.
Use repcached and memcached (http://repcached.lab.klab.org/), I've never seen an implementation of that solution.
We could easily go to Ms Azure and use tools provided by it but we have only one application, so if Microsoft doubles the price, we immediately double our infrastructure cost (but that's another subject).
So, what's the best way or at least what's your opinion about this ?
SQL Server session is pretty good. Since you already have a SQL Server database to store your primary data, you can just create another database and store the ASP.NET Session there.
About the scalability, I would say that if you have 100,000 concurrent users, then your user base must be 10 million or more. You should do some practical estimates to see how long it will really take to reach such a concurrent user load. In my previous startup, we had millions of users all around the world, 24x7, but we hardly ever reached 10K concurrent users even though people used our site continuously for hours every day.
If you really have 100,000 concurrent users, license cost would be the least of your worries. With the right business model, having 100K concurrent users means you have at least $10M revenue/year.
I have built myoffice.bt.com, which uses SQL Server session state and keeps all primary data on a single SQL Server instance, split across two databases. Between 8 AM and 10 AM, millions of users hit our site, and we hardly have any performance issues. With a dual-core server and 8 GB RAM, you can happily run a SQL Server instance and support such a load as long as you code it right. It all depends on how you have coded it. If you have followed performance best practices, you can easily scale to millions of users on a single database server.
Take a look at my performance suggestions from:
http://omaralzabir.com/tag/performance/
I have used memcached clusters only to cache frequently used data, never for sessions, and for good reason. There have been several occasions where a memcached server had to be rebooted; if we had used memcached for sessions, we would have lost all the sessions stored on that instance. So I would not recommend storing sessions in memcached. But then again, how important is it for your app to maintain data in session? If you have a shopping cart, then as users add products to the cart, it must get persisted in the database, not in session. Session is usually for short-term storage. For any transactional data, never keep it in session; store it in relational tables directly.
I am always in support of not using Session. Developers abuse session all the time: whenever they want to pass data from one page to another, they just put it in the Session. It results in bad design. If you truly want to scale to a 100K concurrent user base, design your app not to use session at all. Any transactional data must be stored in the database. A Cart is a transactional object and thus not suitable for holding in Session. At some point you will need to know how many carts get started but never get placed, so you will need to store them in the database permanently.
Remember, database-based session state is nothing but database-based serialization. Think very carefully about what you are serializing into the database. You will have to clean it up as well, since Session_End won't fire for database-based sessions, or in fact for most out-of-proc session modes. So essentially you are giving devs the ability to just serialize data into the database and bypass the relational model, which always results in bad coding.
With permanent relational storage, fronted by a high performance cache like memcached, you have much better design to support large user base.
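That pattern is essentially cache-aside: read from memcached first, fall back to the relational database on a miss, and invalidate on writes. A sketch with the spymemcached client (the key naming and the database helpers are placeholders):

import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CartRepository {
    private static final int TTL_SECONDS = 300;
    private final MemcachedClient cache;

    public CartRepository() throws Exception {
        cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));
    }

    // Cache-aside read: memcached first, database on a miss.
    public Cart getCart(String userId) {
        Cart cart = (Cart) cache.get("cart:" + userId);
        if (cart == null) {
            cart = loadCartFromDatabase(userId);            // placeholder
            cache.set("cart:" + userId, TTL_SECONDS, cart);
        }
        return cart;
    }

    // Writes go to the relational store; the cache entry is simply dropped.
    public void saveCart(String userId, Cart cart) {
        saveCartToDatabase(userId, cart);                   // placeholder
        cache.delete("cart:" + userId);
    }

    private Cart loadCartFromDatabase(String userId) { return new Cart(); }
    private void saveCartToDatabase(String userId, Cart cart) { }

    public static class Cart implements java.io.Serializable { /* line items, totals */ }
}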
Hope this helps your concerns.