RnR: Long running process - ruby-on-rails

I have a part of my application that creates an export file. The export process is fairly quick for the vast majority of users; however, some users generate 10,000 or more records. This complicates things. First, the tool that imports the files blows up on files larger than about 4,000 records. Second, the process for 10,000 records takes about 20 minutes. Users tend to start doing other things, and then, for whatever reason, the process seems to time out and they never get their file. However, if you click the process button and just leave your machine alone, 20 minutes later you will get the file.
I need to make this more user-friendly and robust. Here are my ideas:
1) automatically create separate files of 4,000 a pop
2) provide a status bar for the file generation
3) background the process so a user can click the button and come back say an hour later and download their files
So I have been doing research on the background plugins and gems. Most seem to be fairly out of date, which makes me nervous, and many seem to be major overkill for what I need. Spawn seemed simple and straightforward, but I'm unclear on how to do a status bar with that type of product.
Then we have something like delayed_job. This seems like it would work. It seems a little heavy, but it does provide the hooks to generate some kind of status update. Does anyone have an example of this? The README is a little light.
Another issue is the file generation: how do I get these multiple files to download? Is there any way I can store the generated files for the life of the user session?
Finally, most of the solutions look like a major change, and this issue is painful but technically works. The time I am being allotted to solve it is minimal, so I am trying to KISS. Thanks for any help or direction you can provide.

If you're looking for background job processing, I would look at Resque. It is very easy to set up and runs on Redis, as opposed to delayed_job, which polls your database for jobs.
As for gathering progress info, there are a number of Resque plugins, and one of them may help you with that.
Lastly:
"Another issue is the file generation, how do I get this multiple files to download? Anyway, I can store the generated file for the live of the user session?"
I'm not sure what you actually meant, but if you want multiple files in one download, zipping them into a single archive may help.
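The chunking idea from the question (separate files of at most 4,000 records) can be sketched in plain Ruby. This is only a rough illustration: `export_in_chunks` is a hypothetical helper, not part of Resque or delayed_job, and it assumes the records are already loaded as arrays of field values. Each returned string could be written to its own file from the background job, and the files zipped into one download.

```ruby
require "csv"

# Hypothetical helper: splits rows into CSV documents of at most
# `chunk_size` rows each, repeating the header in every chunk.
def export_in_chunks(header, rows, chunk_size: 4_000)
  rows.each_slice(chunk_size).map do |slice|
    CSV.generate do |csv|
      csv << header
      slice.each { |row| csv << row }
    end
  end
end

# Made-up sample data standing in for the exported records.
rows = (1..10_000).map { |i| [i, "record #{i}"] }
chunks = export_in_chunks(%w[id name], rows)

puts chunks.size            # number of files to generate
puts chunks.last.lines.size # header plus the rows left for the last file
```

Because the job works one slice at a time, the number of slices finished so far is also a natural value to report as progress.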

Related

Download data option for customers

I have a multi tenant Rails app where the data of a customer is separated with a global scope. Now I want to give the customers the option to download all their own data in a single download. What is the best way to achieve this? Is it best to output everything into a CSV file?
Putting it into 1 csv file is going to likely cause you headaches.
I agree with Alex to do it as a background job if you go with CSV.
I will walk you through 2 approaches (CSV and Feed) and then you can choose what works for you.
CSV
Normally there are many tables that you want to export. If you put them all into one CSV file, the result will be messy.
Instead I would set a nightly process for each customer for each table.
These generate CSVs for each customer for each table and stage them.
Finally for each customer I would bring those files together into a compressed file, and prep for delivery (Web download, FTP, Email etc)
The downside really is the lack of real time.
If you need real time (or if the data set is large), then you have to think about the impact this will have on your production database. It could cause serious performance degradation over time.
One option to get around this is to have read-only replicated databases that you can deploy and use as needed.
Change Management
Instead of creating these ever-growing files every night, or on each request you can process data as it changes.
For example, if your customers really need this data, it could be because they load it into their own database. In that case I would move away from downloading CSV or Excel files and offer an API.
When data changes come into your system, you notify interested components of the change. This way they do not have to go to the DB to get the changes. The API can have a pickup location that serves up the changed data whenever it exists.
We have used this mechanism in large scale, high volume environments with great success.
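As a rough sketch of that idea, the following in-memory feed records each change as it happens and lets a consumer pick up everything after the cursor it last saw. The `ChangeFeed` class, its method names, and the sample data are all hypothetical; a real version would persist the changes and sit behind an API endpoint.

```ruby
# Hypothetical in-memory change feed: writers record each change as it
# happens, and consumers pick up everything after the cursor they last saw.
class ChangeFeed
  Change = Struct.new(:sequence, :customer_id, :table, :payload)

  def initialize
    @changes = []
    @next_sequence = 1
  end

  # Called wherever the application already saves a record.
  def record(customer_id, table, payload)
    change = Change.new(@next_sequence, customer_id, table, payload)
    @changes << change
    @next_sequence += 1
    change
  end

  # The "pickup location": changes for one customer after a given cursor.
  def pickup(customer_id, after: 0)
    @changes.select { |c| c.customer_id == customer_id && c.sequence > after }
  end
end

feed = ChangeFeed.new
feed.record(1, "orders", { id: 10, total: 25 })
feed.record(2, "orders", { id: 11, total: 40 })
feed.record(1, "invoices", { id: 7 })

first_batch = feed.pickup(1)
cursor = first_batch.last.sequence # the consumer remembers this value

feed.record(1, "orders", { id: 12, total: 5 })
puts feed.pickup(1, after: cursor).map(&:table).inspect
```

The consumer only ever asks for what is new, so the export never has to re-read whole tables from the production database.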
Push Notifications
Finally, there are webhooks. Basically, when data changes, you post it to the customer's web server.
If you are going to go the CSV route, I would suggest you look at the long-term read impact. You may not need to make a change now, but you should have an item in your plan and a solution ready.
Finally, I would break the work into many small tasks rather than one long-running task.
CSV is a commonly used format for this. There is a good rails cast on how to achieve this: http://railscasts.com/episodes/362-exporting-csv-and-excel
From my experience, I can advise you to implement it as a scheduled background process, because an export can be expensive in resources and take a long time to finish. After the task is finished you can, for example, email the user a download link.

opening and closing streaming clients for specific durations

I'd like to infrequently open a Twitter streaming connection with TweetStream and listen for new statuses for about an hour.
How should I go about opening the connection, keeping it open for an hour, and then closing it gracefully?
Normally for background processes I would use Resque or Sidekiq, but from my understanding those are for completing tasks as quickly as possible, not chilling and keeping a connection open.
I thought about using a global variable like $twitter_client but that wouldn't horizontally scale.
I also thought about building a second application that runs on one box to handle this functionality, but that seems excessive if it can be integrated into the main app somehow.
To clarify, I have no trouble starting a process, capturing tweets, and using them appropriately. I'm just not sure what I should be starting. A new app? A daemon of some sort?
I've never encountered a problem like this, and am completely lost. Any direction would be much appreciated!
Although not a direct fix, this is what I would look at:
Time
You're working with time, so I'd look at what time-centric processes could be used to induce the connection for an hour
Specifically, I'd look at running some sort of job on the server, which you could fire at specific times (programmatically if required), to open and close the connection. I only have experience with Resque, but as you say, it's probably not suited to this job. If I find any better solutions, I'll certainly update the answer
Storage
Once you've connected to TweetStream, you'll want to look at how you can capture the tweets for that time period. It seems a waste to create a data table just for the job, so I'd be inclined to use something like Redis to store the tweets that you need
This can then be used to output the tweets you need, allowing you to simulate storing / capturing them, but then delete them after the hour-window has passed
Delivery
I don't know what context you're using this feature in, so I'll just give you as generic a process idea as possible
To display the tweets, I'd personally create some sort of record in the DB to show the time you're pinging TweetStream that day (if it changes; if it's constant, just set a constant in an initializer), and then just include some logic to try and get the tweets from Redis. If you're able to collect them, show them as you wish, else don't print anything
Hope that gives you a broader spectrum of ideas?
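For the "open it, keep it open for an hour, close it gracefully" part, a minimal sketch could look like the following. It deliberately does not use the TweetStream API: `FakeStream`, `next_status`, and `close` are hypothetical stand-ins for whatever the real client offers, and `listen_for` is an illustrative name. The point is the deadline loop plus `ensure`, which could run inside a dedicated process or a long-lived worker started by a scheduler.

```ruby
# Stand-in for a streaming client (hypothetical; not the TweetStream API).
# `next_status` returns the next status, or nil if nothing is available yet.
class FakeStream
  attr_reader :closed

  def initialize(statuses)
    @statuses = statuses.dup
    @closed = false
  end

  def next_status
    @statuses.shift
  end

  def close
    @closed = true
  end
end

MONOTONIC_CLOCK = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }

# Reads from the stream until `duration` seconds have elapsed, then closes
# the connection, even if reading raised an error.
def listen_for(duration, stream, clock: MONOTONIC_CLOCK, idle_sleep: 0.01)
  deadline = clock.call + duration
  collected = []
  begin
    while clock.call < deadline
      status = stream.next_status
      if status
        collected << status
      else
        sleep(idle_sleep) # avoid spinning while the stream is quiet
      end
    end
  ensure
    stream.close
  end
  collected
end

stream = FakeStream.new(["first tweet", "second tweet"])
tweets = listen_for(0.05, stream) # an hour would be 3600
puts tweets.inspect
puts stream.closed
```

Injecting the clock keeps the loop testable without actually waiting for an hour.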

parse an active log file

I'm looking for a little help getting started on a small project I've had in the back of my mind for a while.
I have log files varying in size from 50 to 500 MB, depending on how often they are cleaned. I'd like to write a program that will monitor a log file while it is actively being written to. When in use, it changes quickly, easily several hundred lines a second. Most if not all of the examples I've seen for reading log or text files simply open the file and read its contents into a variable, which isn't really feasible to do every time the file changes in this situation. I've not settled on a language to write this in, but it's on a Windows box and I can work in the .NET flavors, Java, or PHP (I don't think PHP will fly too well for this), and I can likely muddle through another language if someone has a suggestion for something well built for handling this.
Essentially, I believe what I'm looking for would be better described as a high-speed way of monitoring a text file for changes and seeing what those changes are. Each line written is relatively small (less than 300 characters, so there is not much data on each line).
EDIT: I changed the wording to hopefully better describe what I'm trying to do, which is to write a program that keeps an eye on a log file for a trigger and then matches a following action to that trigger. So my question here pertains to file handling inside a programming language.
I greatly appreciate any thoughts/comments.
If the file is only ever appended to, you can read the whole file the first time you start analyzing logs, then keep the current size as n. The next time you check (maybe a timed action that checks the last-modified date), just skip the first n bytes, read all the new bytes and update the size.
Otherwise you could run tail -f, capture its stdout and use that for your purposes.
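The skip-the-first-n-bytes approach can be sketched as follows. Ruby is used here only for illustration (the same logic carries over to .NET or Java); `LogTailer` and the "TRIGGER" marker are made-up names. Beyond what the answer describes, the sketch assumes two extra behaviors: it resets when the file shrinks (for example after the log is cleaned), and it holds back a trailing partial line until its newline arrives.

```ruby
require "tempfile"

# Remembers how many bytes of the log have been consumed and, on each
# poll, reads only what was appended since then.
class LogTailer
  def initialize(path)
    @path = path
    @offset = 0
  end

  # Returns the complete lines written since the previous call.
  def read_new_lines
    size = File.size(@path)
    @offset = 0 if size < @offset # file was truncated or cleaned
    return [] if size == @offset

    data = File.open(@path, "rb") do |file|
      file.seek(@offset)
      file.read(size - @offset)
    end

    # Hold back a trailing partial line until its newline arrives.
    last_newline = data.rindex("\n")
    return [] if last_newline.nil?

    complete = data[0..last_newline]
    @offset += complete.bytesize
    complete.lines.map(&:chomp)
  end
end

Tempfile.create("app.log") do |log|
  tailer = LogTailer.new(log.path)

  log.write("start\nTRIGGER one\npart")
  log.flush
  tailer.read_new_lines.each { |line| puts "matched: #{line}" if line.include?("TRIGGER") }

  log.write("ial line\nTRIGGER two\n")
  log.flush
  tailer.read_new_lines.each { |line| puts "matched: #{line}" if line.include?("TRIGGER") }
end
```

In a real monitor, `read_new_lines` would be called from a timer or a file-change notification, and the trigger matching would replace the `puts`.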
The 'keep an eye on a log file' part of what you are describing is what tail does.
If you plan to implement it in Java, you can check this question: Java IO implementation of unix/linux "tail -f" and add your trigger logic to lines read.
I suggest not reinventing the wheel.
Try using the Elastic stack from elastic.co.
All of these applications are free and open source, and together they can monitor the log and trigger actions based on its input.
Filebeat - reads the log file line by line (it supports multiline log messages as well) and sends it across to Logstash. There are loads of other shippers you can use.
Logstash - takes the log messages, filters them, adds tags and sends the messages to Elasticsearch.
Elasticsearch - takes the log messages, indexes them, then stores them. It is also capable of running actions based on input.
Kibana - a user-friendly web interface to query and analyze the data, or to simply put it up on a dashboard.
Hope this helps.

How to fail gracefully and get notified if screen scraping fails in ruby on rails

I am working on a Rails 3 project that relies heavily on screen scraping to collect data, mainly using Nokogiri. I'm aggregating essentially the same data, but I'm grabbing it from many different sources, and as time goes on I will be adding more and more. However, I am acutely aware that screen scraping can be notoriously unreliable.
As such I am interested in how other people have handled the problem of verifying the data and then also getting notified if it is failing.
My current plan is as follows.
I am going to have validation on my model for most of the fields. If they fail, I won't get bad data into my system, although logging this failure in a meaningful way is still a problem.
I was thinking of some kind of counter where, after so many failures from a particular source, I somehow turn it off. I'm not sure how to keep track of that. I guess the only way is to have a field on my Source model that counts failures and can be reset.
Logging is the 800-pound gorilla that I'm not sure how to deal with. I could just do standard writing to logs, but if something fails I'd like to store the entire HTML so I can figure it out. I also need to notify myself somehow so I can address the issues. I thought of maybe just creating a model for all this and storing it in the database. If I did this, I'd probably have to store the HTML on S3 or something. I'm running this on Heroku, so that influences what I can do.
Set up begin and rescue blocks around every field. I was trying to figure out a way to code this in a nicer Ruby style so I don't just have a page of them. Some fields are just a straight doc.at_css("#whatever"), but quite a number require various formatting or calculations, so I think it makes sense to rescue those so I can then log what went wrong. The other option is to let the exception bubble up and catch it when I try to create the model.
Anyway I'm sure I'm not even thinking of everything but that is why I'm trying to figure out how other people have handled this problem.
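One possible way to avoid a page of begin/rescue blocks is a small helper that wraps each field in a block and records the failure instead of raising. This is only a sketch: `FieldExtractor` is a hypothetical class, and a plain hash stands in for the parsed page so the example runs without Nokogiri.

```ruby
# Collects values field by field; a failure in one field is recorded
# instead of aborting the whole scrape.
class FieldExtractor
  attr_reader :values, :errors

  def initialize
    @values = {}
    @errors = {}
  end

  # Runs the block for one field and rescues anything it raises.
  def field(name)
    @values[name] = yield
  rescue StandardError => e
    @errors[name] = "#{e.class}: #{e.message}"
  end

  def ok?
    @errors.empty?
  end
end

# A plain hash stands in for the parsed page here; with Nokogiri the blocks
# would contain the CSS lookups and formatting code instead.
page = { "price" => "$1,234.50", "stock" => "n/a" }

extractor = FieldExtractor.new
extractor.field(:price) { Float(page.fetch("price").delete("$,")) }
extractor.field(:stock) { Integer(page.fetch("stock")) }
extractor.field(:title) { page.fetch("title").strip }

puts extractor.values.inspect
puts extractor.errors.keys.inspect
```

The `errors` hash gives one place to log from, and `ok?` can decide whether to save the record or store the raw HTML for later inspection.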
Our team does something similar to this, so here are some ideas:
We use a really high-level begin/rescue around a transaction to make sure we don't get into weird half-loaded states:
begin
  ActiveRecord::Base.transaction do
    # ...try to load a data source...
  end
rescue
  # ...error handling...
end
Email/page yourself when certain errors occur. We use exception_notifier but if you're sitting on Heroku the Exceptional plugin also seems like a good option. I've also heard of people having success w/ hoptoad
Capturing state is VERY important for troubleshooting issues. Something that's worked quite well for us is GMail. Our loaders effectively have two phases:
capture data and send it to our gmail account
log into gmail, download latest data and parse it
The second phase is the complex one, and if it fails a developer can simply log into the gmail account and easily inspect the failed message. This process has some limitations (per email and per mailbox storage limits, two phase pipeline, etc.) and we started out doing it because we had no other option, but it's proven shockingly resilient and convenient. Keep email in mind as a cheap/easy way to store noncritical state. We didn't start out thinking of using it that way and are now really glad we do. Logging into GMail feels better than digging through log files.
Build a dashboard UI. We have a simple dashboard with a grid of sources by day. Each box is colored either red or green based on whether the load for that source on that day succeeded. You can go one step further and set up a monitor on this UI (mon.itor.us or equivalent) that alarms if some error threshold is met.
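The failure counter from the question fits the same monitoring picture. A minimal sketch, assuming a hypothetical `SourceHealth` class in place of the real Source model and an arbitrary threshold, might look like this:

```ruby
# Hypothetical stand-in for the Source model: counts consecutive failures
# and switches itself off once a threshold is reached.
class SourceHealth
  attr_reader :name, :failure_count

  def initialize(name, max_failures: 3)
    @name = name
    @max_failures = max_failures
    @failure_count = 0
    @enabled = true
  end

  def enabled?
    @enabled
  end

  # A successful load clears the streak of failures.
  def record_success
    @failure_count = 0
  end

  def record_failure
    @failure_count += 1
    @enabled = false if @failure_count >= @max_failures
  end

  # Manual reset after the scraper has been fixed.
  def reset!
    @failure_count = 0
    @enabled = true
  end
end

source = SourceHealth.new("example-site", max_failures: 2)
3.times do
  next unless source.enabled?

  begin
    raise "layout changed" # stands in for a scrape that fails
  rescue StandardError
    source.record_failure
  end
end

puts source.enabled?
puts source.failure_count
```

In a Rails app the two counters would be columns on the model, and the point where the source gets disabled is a natural place to send the notification email.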

Letting something happen at a certain time with Rails

As in browser games: a user constructs a building, and a timer is set for a specific date/time to finish the construction and spawn the building.
I imagined having something like a daemon, but how would that work? To me it seems that spinning and polling is not the way to go. I looked at async_observer, but is that a good fit for something like this?
If you only need the event to be visible to the owning player, then the model can report its updated status on demand and we're done, move along, there's nothing to see here.
If, on the other hand, it needs to be visible to anyone from the time of its scheduled creation, then the problem is a little more interesting.
I'd say you need two things. A queue into which you can put timed events (a database table would do nicely) and a background process, either running continuously or restarted frequently, that pulls events scheduled to occur since the last execution (or those that are imminent, I suppose) and actions them.
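A minimal sketch of those two pieces, with an in-memory array standing in for the database table, could look like this. `TimedEventQueue` and its methods are hypothetical names; in a real app the background process would call something like `run_due` on each pass.

```ruby
# In-memory stand-in for the events table: each entry has a due time and
# an action. A real version would be a database table polled by a daemon.
class TimedEventQueue
  Event = Struct.new(:run_at, :action)

  def initialize
    @events = []
  end

  def schedule(run_at, &action)
    @events << Event.new(run_at, action)
  end

  # Runs every event whose time has passed, in due-time order, and
  # removes it from the queue. Returns the number of events run.
  def run_due(now = Time.now)
    due, @events = @events.partition { |event| event.run_at <= now }
    due.sort_by(&:run_at).each { |event| event.action.call }
    due.size
  end

  def pending
    @events.size
  end
end

start = Time.at(0)
queue = TimedEventQueue.new
finished = []

queue.schedule(start + 60) { finished << "barracks" }
queue.schedule(start + 3600) { finished << "castle" }

queue.run_due(start + 120) # one pass of the background process
puts finished.inspect
puts queue.pending
```

Because each pass picks up everything due since the last one, the process can run continuously or be restarted from cron without missing events.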
Looking at the list of options on the Rails wiki, it appears that there is no One True Solution yet. Let's hope that one of them fits the bill.
I just did exactly this thing for a PBBG I'm working on (Big Villain, you can see the work in progress at MadGamesLab.com). Anyway, I went with a commands table where user commands each generated exactly one entry and an events table with one or more entries per command (linking back to the command). A secondary daemon run using script/runner to get it started polls the event table periodically and runs events whose time has passed.
So far it seems to work quite well. Unless I see some problem when I throw a large number of users at it, I'm not planning to change it.
To a certain extent it depends on how much logic is on your front end and how much is in your model. If you know how much time will elapse before something happens, you can keep most of the logic on the front end.
I would use your model to determine the state of things, and on a particular request you can check whether the building is built or not. I don't see why you would need a background worker for this.
I would use AJAX to start a timer (see Periodical Executor) for updating your UI. On the model side, just keep track of the created_at column for your building and only allow it to be used if its construction time has elapsed. That way you don't have to take a trip to your db every few seconds to see if your building is done.
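The model side of this can be sketched without any background worker. The `Building` struct below is a hypothetical stand-in for the ActiveRecord model, and `construction_seconds` is an assumed attribute; completion is simply derived from `created_at`.

```ruby
# Hypothetical stand-in for the building model: completion is derived from
# created_at plus a fixed construction time, so nothing has to run in the
# background to "finish" the building.
Building = Struct.new(:name, :created_at, :construction_seconds) do
  def finished_at
    created_at + construction_seconds
  end

  def built?(now = Time.now)
    now >= finished_at
  end

  # Seconds left, which the front end can use to drive its countdown.
  def seconds_remaining(now = Time.now)
    [finished_at - now, 0].max
  end
end

started = Time.at(0)
barracks = Building.new("barracks", started, 300)

puts barracks.built?(started + 100)
puts barracks.seconds_remaining(started + 100)
puts barracks.built?(started + 300)
```

The page can be rendered with `seconds_remaining` once, and the client-side timer only needs to call back when the countdown reaches zero.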

Resources