Moving slow database calls from MVC to a background application - advice please - asp.net-mvc

I have an MVC web site, where users can search for large recordsets from SQL Server and Oracle databases. Some of these recordsets can be very large, with many thousands of records. Sadly, it is a user requirement that they do not make their searches more specific.
When a user posts a search request to the database, the web page hangs and often times out because of how long the query takes to run.
We are thinking about removing the expensive database calls from the MVC site, and sending the query to a separate process to run in the background. When the query is complete, we can notify the user.
My proposed solution is:
1) When the user submits the search form on the web page, simply display a message that the results are being generated and will be sent when complete
2) Write the SQL query to a database table that holds the list of queries waiting to be processed
3) Create a Windows Service which checks this table every couple of minutes for new queries
4) The Windows Service then runs the query against the database. When the query completes, the service creates a CSV of the results and emails it to the user (a rough sketch of this step is below)
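To make step 4) concrete, this is roughly the loop I have in mind for the service. In reality it would be a C# Windows Service; Ruby is used below purely as illustration, and the pending_queries table, its columns, and the email_csv_to helper are made-up names.

```ruby
require "mysql2"
require "csv"

db = Mysql2::Client.new(host: "localhost", username: "app", database: "app")

# Pick up queued searches, run each one, write the results to a CSV,
# email it, and mark the row done so it is not picked up again.
db.query("SELECT id, sql_text, user_email FROM pending_queries WHERE status = 'new'").each do |job|
  db.query("UPDATE pending_queries SET status = 'running' WHERE id = #{job['id'].to_i}")

  results = db.query(job["sql_text"])
  path = "/tmp/results_#{job['id']}.csv"
  CSV.open(path, "w") do |csv|
    csv << results.fields                          # header row
    results.each(as: :array) { |row| csv << row }
  end

  email_csv_to(job["user_email"], path)            # hypothetical mailer helper
  db.query("UPDATE pending_queries SET status = 'done' WHERE id = #{job['id'].to_i}")
end
```

Run on a timer every couple of minutes, this is essentially steps 3) and 4) combined.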
I am looking for advice and comments on the above approach. What do folks think of this as a way to process expensive database calls in the background?
Generally speaking the requests will be made infrequently but, as mentioned, will be for a large amount of data. Two or more requests could occasionally arrive at the same time, but that will be rare.
I will also look at optimising the databases.
Grateful for any tips.
Martin :)

Another option is to supplement the existing code to execute the query on a separate thread, so that periodic keep-alive updates can be sent to the requesting page while you wait for the query results - similar to the way insurance quote aggregator pages work.
A second option is to make the results available as a hyperlink when they are ready and then communicate that either through the website or by email to the user.
Option three: if these queries are not completely ad hoc, you could profile for the most frequent combinations and pre-compute them periodically, placing the results into new tables (sort of halfway to optimising the current database structure).
The caveat there is that the data won't be as up to date - but given the time the queries are currently taking it probably isn't that important to be up to the second?
Whichever solution you choose, I think it's going to depend on user expectations. Do they know what they want, send one big query, get the results and walk away happy? Or do they try several queries to find the right combination of parameters? If the latter, waiting for an email delivery of results might not be acceptable to them; but if what they want is a downloadable results document and they know what they want first time, then it may be. The only problem I see here is emails going astray, or taking longer than the user thinks they should, causing the request to be resubmitted multiple times and increasing the server workload - caching queries and results is probably a very good idea.

I would suggest introducing a layer of abstraction such as a message broker. The request goes into a queue, a batch layer consumes it from the queue, and once the heavy work is done the batch layer notifies the web layer again via the broker - the Request-Reply pattern.
In addition, it is always good to optimize the queries on the database side.
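A minimal sketch of that Request-Reply flow, using RabbitMQ via the bunny gem purely as an example broker; the queue names, the message shape and the run_query_and_write_csv helper are assumptions, not part of the asker's existing code. In practice the two halves live in separate processes (web layer and batch layer).

```ruby
require "bunny"
require "json"
require "securerandom"

conn = Bunny.new
conn.start
ch = conn.create_channel

requests = ch.queue("search.requests", durable: true)
replies  = ch.queue("search.replies",  durable: true)

# Web layer: enqueue the search and return immediately; the page just tells
# the user their export is being prepared.
ch.default_exchange.publish(
  { user_email: "user@example.com", sql: "SELECT * FROM big_table" }.to_json,
  routing_key: requests.name,
  correlation_id: SecureRandom.uuid
)

# Batch layer: consume requests, do the heavy work, then publish a reply that
# the web layer (or a mailer) uses to notify the user.
requests.subscribe(manual_ack: true, block: true) do |delivery, props, payload|
  job = JSON.parse(payload)
  csv_path = run_query_and_write_csv(job["sql"])   # hypothetical helper
  ch.default_exchange.publish(
    { correlation_id: props.correlation_id, user_email: job["user_email"],
      csv_path: csv_path }.to_json,
    routing_key: replies.name
  )
  ch.ack(delivery.delivery_tag)
end
```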

Related

FreeRADIUS accounting altering/updating session start times after days, weeks and in some cases months

This might be a very specific problem or just ignorance on my side, but I can't seem to figure it out.
Within our organization, we have a FreeRadius Accounting system logging sessions from Wi-Fi usage. Our team is responsible for the data analysis of this accounting data.
Recently, we had to dump the Radius Accounting Database and made a freeze frame of it. While doing so we found a weird behavior.
Running the same query before and after the dump (a query that retrieves the total number of sessions for a single day) gave different counts, with a difference of around 5-10%.
Looking a bit deeper we discovered that several updates were being issued that altered the start time of sessions after they had been first registered in the accounting database.
We then found that data we had collected previously showed the same kind of disparity after weeks or even months (with the discrepancy being around 2-10%).
TLDR:
Does FreeRadius adjust the start times of sessions based on some maintenance? Are WiFi controllers allowed to do this? Is it a bug?
Overall, we just want to understand the rationale so we can justify the data and adjust our processing correctly, because currently we cannot trust the values we collect daily or even weekly for these stats!
Any help or insight would be great!!!
FreeRADIUS only updates the database as a result of data in an incoming RADIUS packet, using the SQL queries in the local configuration. The only real way to understand this is to look at your SQL queries, and incoming requests (via radiusd -X) and see what is making changes to the data. It is possible that the NAS is broken and sending invalid or changing data, or possibly re-using session IDs which overwrite existing records.
It is also possible to configure FreeRADIUS to create a "fake" accounting start entry in the database in post-auth, which will then be updated when the real Start packet arrives. If you are doing this then you should check the values that are being written, and also if the session never starts up (or the Start is lost) then bad things might happen.
But in all circumstances the only solution you really have is to look at the debug output and see what is happening and why data is being written in the way that it is. There is nothing in FreeRADIUS that randomly updates the database without being sent that data from the NAS.

Respond with a large number of objects through a Rails API

I currently have an API for one of my projects, and a service that is responsible for generating export files as CSVs, archiving them and storing them somewhere in the cloud.
Since my API is written in Rails and my service in plain Ruby, I use the Her gem in the service to interact with the API. But I find my current implementation is not very performant, since I do a Model.all in my service, which in turn triggers a request that may contain far too many objects in the response.
I am curious on how to improve this whole task. Here's what I've thought of:
implement pagination at API level and call Model.where(page: xxx) from my service;
generate the actual CSV at API level and send the CSV back to the service (this may be done sync or async).
If I were to use the first approach, how many objects should I retrieve per page? How big should a response be?
If I were to use the second approach, this would bring quite an overhead to the request (and I guess API requests shouldn't take that long) and I also wonder whether it's really the API's job to do this.
What approach should I follow? Or, is there something better that I'm missing?
You need to pass a lot of information through a Ruby process, and that's never simple; I don't think you're missing anything here.
If you decide to generate CSVs at the API level, then what do you gain by maintaining the service? You could ditch the service altogether, because an nginx proxy would do the same thing better (if you're just streaming the response from the API host).
If you decide to paginate, there will be a performance reduction for sure, but nobody can tell you exactly how much to paginate. Bigger pages will be faster but consume more memory (reducing throughput because you can run fewer workers); smaller pages will be slower and consume less memory, but demand more workers because of IO wait times. The exact numbers will depend on the IO response times of your API app, your cloud provider and your infrastructure. I'm afraid no one can give you a simple answer without experimentation via a stress test - and once you set up a stress test, you will get a number of your own anyway, better than anybody's estimate.
One suggestion: write a bit more about your problem, the constraints you are working under, and so on, and maybe someone can help you with a more radical solution. For some reason I get the feeling that what you're really looking for is a background processor like Sidekiq or Delayed Job; or maybe connecting your service to the DB directly through a DB view, if you are anxious to decouple your apps; or an nginx proxy for API responses; or nothing at all... but I really can't tell without more information.
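For what it's worth, if you do go with pagination the service-side loop stays small. A rough sketch, assuming your API accepts page/per_page parameters and exposes a Record resource (both names invented here), using Her as you already do:

```ruby
require "her"
require "csv"

Her::API.setup url: "https://api.example.com" do |c|
  c.use Faraday::Request::UrlEncoded
  c.use Her::Middleware::DefaultParseJSON
  c.use Faraday::Adapter::NetHttp
end

class Record
  include Her::Model
end

# Walk the collection page by page instead of Model.all, streaming rows
# straight into the CSV so only one page is held in memory at a time.
CSV.open("export.csv", "w") do |csv|
  csv << %w[id name created_at]          # assumed attributes
  page = 1
  loop do
    batch = Record.where(page: page, per_page: 500).to_a
    break if batch.empty?
    batch.each { |r| csv << [r.id, r.name, r.created_at] }
    page += 1
  end
end
```

The 500 per page is a placeholder - the stress test mentioned above is what gives you the real number.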
I think it really depends on how you want to define 'performance' and what the goal of your API is. If you want to make sure no request to your API takes longer than 20 ms to respond, then adding pagination would be a reasonable approach, especially if the CSV generation is just an edge case and the API is really built for other services. The number of items per page would then be limited by the speed at which you can deliver them. Your service would not be particularly more performant (rather less so), since it needs to call the API multiple times.
Creating an async call (maybe with a webhook as callback) would be worth adding to your API if you think it is a valid use case for services to dump the whole record set.
Having said that, I think strictly speaking it is the job of the API to be quick and responsive, so maybe try to figure out how caching can improve response times, so that paging through all the records is reasonable. On the other hand, it is the job of the service to be mindful of the number of calls to the API, so maybe store old records locally and only poll for updates instead of dumping the whole set of records each time.

Client-Server Data Synchronisation Algorithm

Client DB - CoreData (iOS)
Server DB - MySQL
I am trying to achieve data synchronisation between client and server, and the complicated part is that the schema is highly relational. I have been going through a couple of synchronisation patterns already in use, and most of them seem to be based on a NoSQL or schemaless DB; I am wondering if there are any sync patterns for highly relational data. I have already gone through Couchbase, the Dropbox Sync API, Wasabi Sync, etc. My concerns are the following:
1) By highly relational data I mean that there are several tables which are related to each other, and Create/Update happens on all of them. Right now I am planning to do separate CRUD requests for each table. Is that a good approach? The problem is that there has to be a strict ordering of the requests, because the changes to table 3 cannot be processed before the table 2 data is received. This relationship is what makes the sync hard.
2) Change tracking on the client: what would be the best way to identify the changes in a particular table (Core Data entity)? I am planning a delta approach where only the changes to similar kinds of objects are uploaded at a time. Any insights/links on this?
3) Data merging/conflict resolution - I am stuck on this part. One way would be to have a modified timestamp in each object, but what if the device dates are not in sync or have been changed manually?
I wanted to know the implications/challenges of such a sync pattern with an RDBMS-backed server, or any alternative approaches.
Problem #1 Explained
Assume there are 10 tables and the APIs expose CRUD requests for these 10 tables. One request will only C/R/U/D one table. So my question is whether it is a good approach to design APIs like this when it comes to offline syncing of data. For example, consider relational data such as:
Organization->Employee->Department->Project
Assume some objects in these 4 tables got created offline. Now we need to sync the data to the server when the network is back. So it will be: Create/Update Organisations first; once that is over, Create/Update Employees so that they can be linked to Organisations; and so on. Basically, every C/U/D has to be issued from the top-level objects down to the bottom-level ones. So my question is whether this is a good approach for a sync problem, because if the data was not relational we could have uploaded the changes to all the tables in a single C/U/D API call.
It seems that you might not be aware of typical relational DBMS facilities and protocols that support simultaneous write access by multiple sessions, making them suitable for multi-user, highly concurrent, OLTP applications.
1) Your API to access MySQL allows you to make your changes atomically (all or nothing) via a transaction. Within that transaction you should update as many tables as possible simultaneously, but you can sequence the changes as necessary. By locking tables as you use them, then unlocking them in the reverse order, you avoid deadlocks. You can request that only the parts of tables that a transaction knows it could possibly change are locked, so that non-overlapping clients can proceed concurrently.
2) Your schema can explicitly record redundant delta information that you get the DBMS to calculate on updates, or it can record enough past changes to calculate deltas on request. Your client can give the DBMS its transaction data, and the DBMS can return the relevant info based on it and on the past. You probably do not need to, and should not, keep any persistent state on your client; that is what the server database is for. The client database is a buffer for it plus user info.
3) You can use an explicit client serial transaction id, so that client plus id indicates the order in which the client thinks its transactions were sent, regardless of its clock.
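To illustrate points 1) and 3) together, here is a rough sketch of the server applying one client batch, written in Ruby with the mysql2 gem purely for brevity; the sync_batches table, the column names and the upsert helper are invented for the example, not something your stack already provides.

```ruby
require "mysql2"

# Apply one client sync batch atomically, parent tables before child tables,
# ordered by a client-assigned sequence number rather than the device clock.
def apply_batch(db, client_id, client_seq, changes)
  db.query("START TRANSACTION")

  # Reject stale or duplicate batches: ordering comes from client_seq,
  # so a wrong device clock cannot reorder history.
  last = db.query(
    "SELECT COALESCE(MAX(client_seq), 0) AS seq FROM sync_batches WHERE client_id = #{client_id.to_i}"
  ).first["seq"].to_i
  raise "out-of-order batch" if client_seq.to_i <= last

  # Apply top-down so foreign keys always resolve:
  # Organization -> Employee -> Department -> Project.
  %w[organizations employees departments projects].each do |table|
    Array(changes[table]).each { |attrs| upsert(db, table, attrs) }   # upsert: hypothetical helper
  end

  db.query(
    "INSERT INTO sync_batches (client_id, client_seq) VALUES (#{client_id.to_i}, #{client_seq.to_i})"
  )
  db.query("COMMIT")
rescue StandardError
  db.query("ROLLBACK")
  raise
end
```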
I wonder how much you have googled.

Ruby on Rails performance with lots of requests and DB updates per second

I'm developing a polling application that will deal with an average of 1000-2000 votes per second coming from different users. In other words, it'll receive 1k to 2k requests per second with each request making a DB insert into the table that stores the voting data.
I'm using RoR 4 with MySQL and planning to push it to Heroku or AWS.
What performance issues related to database and the application itself should I be aware of?
How can I address this amount of inserts per second into the database?
EDIT
I was thinking of not inserting into the DB on each request, but instead writing the insert data to an in-memory buffer. A scheduled job running every second would then read from this buffer and generate one bulk insert, so that the inserts are not made one at a time. But I cannot think of a nice way to implement this.
While you can certainly do what you need to do in AWS, that high level of I/O will probably cost you. RDS can support up to 30,000 IOPS; you can also use multiple EBS volumes in different configurations to support high IO if you want to run the database yourself.
Depending on your planned usage patterns, I would probably look at pushing into an in-memory data store, something like memcached or redis, and then processing the requests from there. You could also look at DynamoDB, which might work depending on how your data is structured.
Are you going to have that level of sustained throughput consistently, or will it be in bursts? Do you absolutely have to preserve every single vote, or do you just need summary data? How much will you need to scale - i.e. will you ever get to 20,000 votes per second? 200,000?
These kinds of questions will help determine the proper architecture.
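As a sketch of the in-memory-store idea combined with the bulk insert from your EDIT: the controller only pushes onto a Redis list, and a job run every second drains the list into one multi-row INSERT. The vote_buffer key, the votes columns and the one-second scheduler are assumptions, not a prescription.

```ruby
require "redis"
require "json"

REDIS = Redis.new
BUFFER_KEY = "vote_buffer"   # hypothetical key name

# Hot path (controller): no SQL at all, just push the vote onto a Redis list.
def record_vote(poll_id, option_id, user_id)
  REDIS.rpush(BUFFER_KEY, { poll_id: poll_id, option_id: option_id, user_id: user_id,
                            created_at: Time.now.utc.strftime("%F %T") }.to_json)
end

# Run every second (cron, clockwork, sidekiq-scheduler, ...): drain the list
# atomically and write one multi-row INSERT instead of 1-2k single inserts.
def flush_votes
  rows, _ = REDIS.multi do |m|
    m.lrange(BUFFER_KEY, 0, -1)
    m.del(BUFFER_KEY)
  end
  return if rows.empty?

  values = rows.map do |json|
    v = JSON.parse(json)
    "(#{v['poll_id'].to_i}, #{v['option_id'].to_i}, #{v['user_id'].to_i}, '#{v['created_at']}')"
  end
  ActiveRecord::Base.connection.execute(
    "INSERT INTO votes (poll_id, option_id, user_id, created_at) VALUES #{values.join(',')}"
  )
end
```

The trade-off is durability: if the process dies after draining the list but before the INSERT, up to a second of votes is lost - which comes back to whether you absolutely have to preserve every single vote.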

Is it possible to have a stateless timed function

I'm trying to set a reminder in a system to fire at a certain time.
This is a web based app, so it's not like it will be in memory all the time.
Ideally I'd like to avoid using a service or job on the server (mainly out of curiosity, to see if there is a more efficient way to do it).
For example, imagine how many eBay bids are constantly ending all the time, with the emails being sent out seemingly perfectly on time.
Do people reckon there is just a big loop going over and over, moving items into a queue, etc.? Or is there something lower level helping out (stored procedures, triggers, etc.)?
Thanks everyone.
What you have to realize about eBay - and most large database-backed websites - is that the interactions between humans and the database that come through the web server are only a part (sometimes a very small part) of the functionality of the system.
To use eBay as an example, the emails that go out when auctions expire are not handled by a web server. They are far more likely to have that scripted - in other words, there is another program running on a number of their systems that looks at the database for ended auctions, does some processing on them, sends emails, etc.
If I were doing something similar (albeit on a much smaller scale), I'd have my web services built in the usual way, but have a job that is run automatically every few minutes to do the maintenance work. It would start up, look at the database for work, process anything that was required, then exit.
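A bare-bones sketch of that kind of job (Ruby only for illustration, since the question doesn't name a stack; the reminders table, its columns and the send_reminder_email helper are invented). Scheduled via cron or Windows Task Scheduler every few minutes, it needs no state of its own - the database rows are the state.

```ruby
require "mysql2"

db = Mysql2::Client.new(host: "localhost", username: "app", database: "app")

# Find reminders that are due and not yet sent, send them, mark them sent, exit.
due = db.query(
  "SELECT id, email, message FROM reminders WHERE due_at <= NOW() AND sent_at IS NULL"
)

due.each do |reminder|
  send_reminder_email(reminder["email"], reminder["message"])   # hypothetical mail helper
  db.query("UPDATE reminders SET sent_at = NOW() WHERE id = #{reminder['id'].to_i}")
end
```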
