Download data option for customers - ruby-on-rails

I have a multi-tenant Rails app where each customer's data is separated with a global scope. Now I want to give customers the option to download all of their own data in a single download. What is the best way to achieve this? Is it best to output everything into a CSV file?

Putting it all into one CSV file is likely to cause you headaches.
I agree with Alex that this should be a background job if you go with CSV.
I will walk you through two approaches (CSV and change feeds), and then you can choose what works for you.
CSV
Normally there are many tables that you want to export. If you put them all into one CSV file, it will be a messy file.
Instead, I would set up a nightly process per customer, per table.
These generate CSVs for each customer for each table and stage them.
Finally, for each customer, I would bring those files together into a compressed file and prep it for delivery (web download, FTP, email, etc.).
The real downside is the lack of real-time data.
If you need real time (or if the data set is large), then you have to think about the impact this will have on your production database. It could cause serious performance degradation over time.
One way around this is to use read-only replica databases, which you can deploy and utilize as needed.
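A minimal sketch of the staging-and-compress step, assuming tenant-scoped Rails models and the rubyzip gem (the model list and the customer_id scoping are illustrative, not your actual schema):

```ruby
require "csv"
require "fileutils"
require "zip" # rubyzip gem

# Hypothetical list of tenant-scoped models to export.
EXPORT_MODELS = [Order, Invoice, Contact]

def export_customer(customer)
  dir = Rails.root.join("tmp", "exports")
  FileUtils.mkdir_p(dir)

  # Stage one CSV per table for this customer.
  csv_paths = EXPORT_MODELS.map do |model|
    path = dir.join("#{customer.id}_#{model.table_name}.csv")
    CSV.open(path, "w") do |csv|
      csv << model.column_names # header row
      model.where(customer_id: customer.id).find_each do |record|
        csv << record.attributes.values_at(*model.column_names)
      end
    end
    path
  end

  # Bring the staged files together into one compressed archive.
  zip_path = dir.join("customer_#{customer.id}.zip")
  Zip::File.open(zip_path.to_s, Zip::File::CREATE) do |zip|
    csv_paths.each { |p| zip.add(File.basename(p), p.to_s) }
  end
  zip_path
end
```

Run one of these per customer from your nightly job, then deliver the resulting zip however you like.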
Change Management
Instead of creating these ever-growing files every night, or on each request, you can process data as it changes.
For example, if your customers really need this data, it could be for loading into their own database. I would move away from downloading CSVs or Excel files and offer an API.
When data changes come into your system, you notify interested components of the change. This way they do not have to go to the DB to get the changes. The API can have a pickup location that serves up the changed data whenever it is available.
We have used this mechanism in large-scale, high-volume environments with great success.
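As a sketch of that idea in Rails terms (DataChange is a hypothetical table the pickup API would serve from; hook whichever models matter):

```ruby
class Order < ApplicationRecord
  # Record every committed create/update/destroy as a change event.
  after_commit :record_change

  private

  def record_change
    DataChange.create!(
      customer_id: customer_id,          # tenant scoping assumption
      record_type: self.class.name,
      record_id: id,
      payload: attributes.to_json
    )
  end
end
```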
Push Notifications
Finally, there are webhooks. Basically, when data changes, you post the data to the customer's web server.
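A minimal sketch, assuming each customer record stores a webhook_url (an assumption, not part of the original schema):

```ruby
require "net/http"
require "json"

# Push a change event to the customer's registered endpoint.
def notify_webhook(customer, change)
  uri = URI(customer.webhook_url)
  Net::HTTP.post(uri, change.to_json, "Content-Type" => "application/json")
end
```

In practice you would enqueue these posts in a background job and retry on failure, since the customer's server may be down.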
If you are going to go with the CSV route, I would suggest you look at the long-term read impacts. You may not need to make a change now, but you should have an item in your plan and a solution ready.
Finally, I would break the work into many small tasks rather than one long-running one.

CSV is a commonly used format for this. There is a good RailsCast on how to achieve this: http://railscasts.com/episodes/362-exporting-csv-and-excel
From my experience, I can advise you to implement it as a scheduled background process, because the export could be expensive in resources and take a long time to finish. After the task is finished, you can email the user with the download link, for example.
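A rough sketch with ActiveJob (Rails 4.2+); ExportMailer and generate_csv_for are hypothetical stand-ins for your own export and mail code:

```ruby
class ExportJob < ActiveJob::Base
  queue_as :exports

  def perform(user)
    path = generate_csv_for(user)                 # hypothetical export logic
    ExportMailer.ready(user, path).deliver_later  # hypothetical mailer
  end
end

# In the controller, enqueue and return immediately:
#   ExportJob.perform_later(current_user)
```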

Suggestion for trigger that sends email if threshold is broken

This is quite a broad question, but I'll try to summarise it as best I can.
I have an MVC front end which displays/allows processing of records which are classed as outstanding. I also have a scheduled console app which runs nightly and attempts to resolve each of these records using some logic I wrote.
I have a new requirement, which is to have an email sent every time the total number of outstanding records exceeds a certain amount; this amount needs to be configurable.
The table will contain every record with a flag to say whether it has been resolved or not, so I will need to count the outstanding records, then fire an email to notify if the threshold is broken.
I initially thought about adding a SQL Server trigger on insert, but I soon realised that if no more records were added for a few days while the total stayed above the threshold because nobody resolved them, then no further email would be sent.
I need the email to be sent every day on a schedule, independently of inserts/updates.
So now I'm thinking of a SQL Server job, an SSIS package, or even a service which runs on a schedule, but I'm aware this threshold number needs to be configurable.
So what would be the quickest, simplest solution to my requirements? I'm open to any suggestion as long as it ticks all the boxes.
Given that the OP already has a console app running on a schedule, the most logical choice would be to simply add this check to the console app along with the email sending logic. It will be much easier to send emails that way anyway, especially if you employ something like Postal, which will let you use MVC-style views to create your emails.
An SQL Server scheduled job seems to me to be the simplest way to go.
You can add a table to your database that will hold the threshold number and read its value from there.
In many cases a GeneralParams table is a good thing to have anyway.
The other option you mentioned (a Windows service) is also configurable in many ways: you can use a GeneralParams table, or the App.config file of the service (but you will have to restart it every time you change the App.config), or even a simple text file; anything goes. The downside is that it's outside of your SQL Server, but the upside is that it is probably easier to send emails from.

Read huge file from disk

I am developing an iOS app and I have a text file with one city name per line. There are about 3 million cities in that file. In order to be able to perform searches and operations on it, I am using a B-tree, but this tree takes a long time to be created. It is not good for the user experience to make him wait for this every time he uses the app. All this without using Core Data!
Any tips on how can I speed up this process?
Thank you
My recommendation is that you use SQLite with an index on the fields you want to query (or some other type of permanent, indexed storage), so that the user only has to wait the first time the app is opened; after that you can query the database, which will be much faster. I am also fairly certain that you can install a SQLite database from a pre-generated file, so you might be able to generate this index offline, bundle it with your application, and then the user has no wait time at all. I'm not 100% sure on this option though, so you should investigate.
Either way, there is no magic solution here. If the data you want is on line 2 million of the file, you will have to read 2 million lines of text in order to get to that line. I would recommend finding a way to make the UX of your app acceptable so that the user feels better about waiting for the data to load. If you display some sort of pretty screen with a progress bar while the data indexes, the user will be more forgiving of this wait.
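If you do try the offline pre-generation route mentioned above, a build-time script along these lines can turn the city list into an indexed database you bundle with the app (Ruby here, but any scripting language works; file names are illustrative, and the sqlite3 gem is assumed):

```ruby
require "sqlite3"

db = SQLite3::Database.new("cities.sqlite3")
db.execute("CREATE TABLE IF NOT EXISTS cities (name TEXT)")
db.execute("CREATE INDEX IF NOT EXISTS idx_cities_name ON cities (name)")

# One insert per line, wrapped in a transaction so the bulk load is fast.
db.transaction do
  File.foreach("cities.txt") do |line|
    db.execute("INSERT INTO cities (name) VALUES (?)", [line.strip])
  end
end
db.close
```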
The B-Tree will obviously take some time to be created. If you don't want to use a database but stick with your own B-Tree implementation you could dump the tree data to a separate file and load that when the program starts instead of recreating it every time. However, you will have to update the cached tree every time the source data is modified.
In Python, the pickle module can help you, but most programming languages have a serialisation module.
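In Ruby, for example, Marshal plays the same role; a minimal sketch, where tree stands for whatever structure you build:

```ruby
# Dump the built tree once, then reload it on startup instead of
# rebuilding. Regenerate the cache whenever the source data changes.
File.binwrite("tree.cache", Marshal.dump(tree))
tree = Marshal.load(File.binread("tree.cache"))
```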
Does this file come with the application? If it does, you could process the file into an SQLite database before you ship the app. You can then use SELECT statements to search the data using indexed fields (like the city name).
If the file changes, still ship with a database and just send amendments as a file, which would edit the database to bring it back up to date. You may need to add a command to each line of the file, like REPLACE, NEW, or DELETE.

RnR: Long running process

I have a part of my application that creates an export file. The export process is fairly quick for the vast majority of users; however, there are users that generate 10,000 or more records. This complicates things. First, the tool that imports the files blows up on files larger than about 4,000 records. Secondly, the process for 10,000 records takes about 20 minutes. The users have a tendency to start doing other things, and then, for whatever reason, the process seems to time out and they never get their file. However, if you click the process button and just leave your machine alone, 20 minutes later you will get the file.
I need to make this more user-friendly and robust. Here's my ideas:
1) automatically create separate files of 4,000 a pop
2) provide a status bar for the file generation
3) background the process so a user can click the button and come back say an hour later and download their files
So I have been doing research on the background plugins and gems. Most seem to be fairly out of date, which makes me nervous, and some seem like major overkill for what I need. Spawn seemed simple and straightforward, but I'm unclear on how to do a status bar with that type of product.
Then we have something like delayed_job. This seems like it would work, if a little heavy, and it does provide the hooks to generate some kind of status update. Anyone have an example of this? The README is a little light.
Another issue is the file generation: how do I get these multiple files to download? Is there any way I can store the generated files for the life of the user's session?
Finally, most of the solutions are looking like a major change; this issue is painful, but the current code technically works. The time I am being allotted to solve it is minimal, so I am trying to KISS. Thanks for any help and/or direction you can provide.
If you're looking for background job processing, I suggest you look at Resque. It's super easy, and it runs on Redis, as opposed to delayed_job, which polls your database for changes.
As for gathering progress info, there are a bunch of Resque plugins; here is one that can help you in that quest.
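To make that concrete, a rough sketch with the resque-status plugin (fetch_rows and write_csv_chunk are hypothetical stand-ins for your own query and file-writing code):

```ruby
# Gemfile: gem "resque", gem "resque-status"
class ExportJob
  include Resque::Plugins::Status

  def perform
    rows = fetch_rows(options["user_id"])            # hypothetical query helper
    chunks = rows.each_slice(4_000).to_a             # stay under the importer's 4,000-record limit
    chunks.each_with_index do |chunk, i|
      write_csv_chunk(chunk, i)                      # hypothetical file writer
      at(i + 1, chunks.size, "wrote chunk #{i + 1}") # progress for the status bar
    end
  end
end

# Enqueue from the controller and poll for progress:
#   job_id = ExportJob.create(user_id: current_user.id)
#   Resque::Plugins::Status::Hash.get(job_id).pct_complete
```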
Lastly
Another issue is the file generation: how do I get these multiple files to download? Is there any way I can store the generated files for the life of the user's session?
Not sure what you actually meant, but if you want multiple files to download, zipping them into one can help.

ASP.NET MVC 3 - Web Application - Efficiently Aggregate Data

I am running an ASP.NET MVC 3 web application and would like to gather statistics such as:
How often is a specific product viewed
Which search phrases typically return specific products in their result list
How often (for specific products) does a search result convert to a view
I would like to aggregate this data and break it down:
By product
By product by week
etc.
I'm wondering what are the cleanest and most efficient strategies for aggregating the data. I can think of a couple but I'm sure there are many more:
Insert the data into a staging table, then run a job to aggregate the data and push it into permanent tables.
Use a queuing system (MSMQ/Rhino/etc.) and create a service to aggregate this data before it ever gets pushed to the database.
My concerns are:
I would like to limit the number of moving parts.
I would like to reduce impact on the database. The fewer round trips and less extraneous data stored the better
In certain scenarios (not listed) I would like the data to be somewhat close to real-time (accurate to the hour may be appropriate)
Does anyone have real-world experience with this, and if so, which approach would you suggest, and what are the positives and negatives? If there is a better solution that I am not thinking of, I'd love to hear it...
Thanks
JP
I needed to do something similar in a recent project. We've implemented a full audit system in a secondary database; it tracks changes to every record on the live DB. Essentially every insert, update, and delete actually updates two records: one in the live DB and one in the audit DB.
Since we have this data in real time in the audit DB, we use this second database to fill any reports we might need. One of the tricks I've found when working with a reporting DB is to forget about normalisation. Just create a table for each report you want, and have it carry just the data you want for that report. It's duplicating data, but the performance gains are worth it.
As to filling the actual data in the reports, we use a mixture. Daily reports are generated by a scheduled task at around 3 am; ditto for the weekly and monthly reports, normally over weekends or late at night.
Other reports are generated on demand, using mostly the data since the last daily run, so it's not that many records, once again all from the secondary database.
I agree that you should create a separate database for your statistics; it will reduce the impact on your main database.
You can go with your idea of having "staging" tables and "aggregate" tables; that way, when you want near-real-time data you go to the staging tables, and when you want historical data you go to the aggregates.
Finally, I would recommend you use an asynchronous call to save your statistics; that way your pages will not take a hit in response time.
I suggest that you create a separate database for this. The best way is to use BI techniques; SQL Server has separate services for BI.

Storing Data In Memory: Session vs Cache vs Static

A bit of backstory: I am working on a web application that requires quite a bit of time to prep/crunch data before giving it to the user to edit/manipulate. The data request takes ~15-20 secs to complete and a couple of secs to process. Once there, the user can manipulate values on the fly. Any manipulation of values will require the data to be reprocessed completely.
Update: To avoid confusion, I am only making the data call 1 time (the 15 sec hit) and then wanting to keep the results in memory so that I will not have to call it again until the user is 100% done working with it. So, the first pull will take a while, but, using Ajax, I am going to hit the in-memory data to constantly update and keep the response time to around 2 secs or so (I hope).
In order to make this efficient, I am moving the initial data into memory and using Ajax calls back to the server so that I can reduce the processing time needed to handle the recalculation that occurs with this user's updates.
Here is my question: with performance in mind, what would be the best way to store this data, assuming that only one user will be working with this data at any given moment?
Also, the user could potentially be working in this process for a few hours. When the user is working w/ the data, I will need some kind of failsafe to save the user's current data (either in a db or in a serialized binary file) should their session be interrupted in some way. In other words, I will need a solution that has an appropriate hook to allow me to dump out the memory object's data in the case that the user gets disconnected / distracted for too long.
So far, here are my musings:
Session State - Pros: Locked to one user. Has the Session End event, which will meet my failsafe requirements. Cons: Slowest performance of my current options. The Session End event is sometimes tricky to ensure it fires properly.
Caching - Pros: Good performance. Has access to dependencies, which could be a bonus later down the line but is not really useful in the current scope. Cons: No easy failsafe step other than a write based on time intervals. Global in scope - will have to ensure that users do not collide with each other's work.
Static - Pros: Best performance. Easiest to maintain, as I can directly leverage my current class structures. Cons: No easy failsafe step other than a write based on time intervals. Global in scope - will have to ensure that users do not collide with each other's work.
Does anyone have any suggestions/comments on which option I should choose?
Thanks!
Update: Forgot to mention, I am using VB.Net, Asp.Net, and Sql Server 2005 to perform this task.
I'll vote for secret option #4: use the database for this. If you're talking about a 20+ second turnaround time on the data, you are not going to gain anything by trying to do this in-memory, given the limitations of the options you presented. You might as well set this up in the database (give it a table of its own, or even a separate database if the requirements are that large).
I'd go with the caching method for storing the data across page loads. You can key the cached entry per user to avoid conflicts.
For tracking user-made changes, I'd go with a more old-school approach: append to a text file each time the user makes a change, and then sweep that file at intervals to save the changes back to the DB. If you name the files based on the user/account or some other session-unique indicator, then there's no issue with conflicts, and the app (or some other support app, which might be a better idea in general) can sweep through all such files and update the DB even if the session is over.
The first part of this can be adjusted to stagger the writes more: save changes to the Session, then write those to the file at intervals, then sweep the file at larger intervals. You can tune it for performance and choose what level of possible user-change loss is acceptable.
Use the Session, but don't rely on it.
Simply, let the user "name" the dataset, and make a point of actively persisting it for the user, either automatically, or through something as simple as a "save" button.
You cannot rely on the session simply because it is (typically) tied to the user's browser instance. If they accidentally close the browser (click the X button, their PC crashes, etc.), then they lose all of their work. Which would be nasty.
Once the user has that kind of control over the "persistent" state of the data, you can rely on the Session to keep it in memory and leverage that as a cache.
I think you've pretty much just answered your question with the pros/cons. But if you are looking for some peer validation, my vote is for the Session. Although the performance is slower (do you know how much slower?), your processing is going to take a long time regardless. Do you think the user will know the difference between 15 seconds and 17 seconds? Both are "forever" in web terms, so go with the one that seems easiest to implement.
Perhaps a bit off topic, but I'd recommend putting those long processing calls in asynchronous (not to be confused with AJAX's asynchronous) pages.
Take a look at this article and ping me back if it doesn't make sense.
http://msdn.microsoft.com/en-us/magazine/cc163725.aspx
I suggest to create a copy of the data in a new database table (let's call it EDIT) as you send the initial results to the user. If performance is an issue, do this in a background thread.
As the user edits the data, update the table (also in a background thread if performance becomes an issue). If you have to use threads, you must make sure that the first thread is finished before you start updating the rows.
This allows a user to walk away, come back, even restart the browser and commit whenever she feels satisfied with the result.
One possible alternative to what the others mentioned, is to store the data on the client.
Assuming the dataset is not too large and the code that manipulates it can run client side, you could store the data as an XML data island or a JSON object. This data could then be manipulated/processed and handled entirely client side, with no round trips to the server. If you need to persist this data back to the server, the end resulting data could be posted via AJAX or a standard postback.
If this does not work with your requirements I'd go with just storing it on the SQL server as the other comment suggested.
