Is there a way to stream compressed data to the browser from rails?
I'm implementing an export function that dumps large amounts of text data from a table and lets the user download it. I'm looking for a way to do the following (and only as quickly as the user's browser can handle it to prevent resource spiking):
1. Retrieve an item from the table
2. Format it into a CSV row
3. Gzip the row data
4. Stream it to the user
5. If there are more items, go to 1.
The idea is to keep resource usage down (i.e. don't pull from the database if it's not required, don't keep a whole CSV file/gzip state in memory). If the user aborts the download midway, rails shouldn't keep wasting time fetching the whole data set.
I also considered having Rails just write a temporary file to disk and stream it from there, but this would probably cause the user's browser to time out if the data set is large enough.
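To make that loop concrete, here is a rough sketch of the kind of thing I have in mind (assuming Rails 4+ with ActionController::Live on a threaded server such as Puma; the Item model and its columns are just placeholders):
require 'csv'
require 'zlib'

class ExportsController < ApplicationController
  include ActionController::Live

  def index
    response.headers['Content-Type'] = 'text/csv'
    response.headers['Content-Encoding'] = 'gzip'
    gz = Zlib::GzipWriter.new(response.stream)           # compresses straight onto the response
    Item.find_each(batch_size: 500) do |item|            # 1. retrieve items in batches
      gz.write CSV.generate_line([item.id, item.name])   # 2-3. format a CSV row and gzip it
      gz.flush                                           # 4. push the compressed bytes to the client
    end
  ensure
    gz.close if gz   # writes the gzip trailer and closes response.stream
  end
end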
Any ideas?
Here's an older blog post that shows an example of streaming: http://patshaughnessy.net/2010/10/11/activerecord-with-large-result-sets-part-2-streaming-data
You might also have luck with the new Streaming API and Batches. If I'm reading the documentation correctly, you'd need to do your queries and output formatting in a view template rather than your controller in order to take advantage of the streaming.
As for gzipping, it looks like the most common way to do that in Rails is Rack::Deflater. In older versions of Rails, the Streaming API didn't play well with Rack::Deflater. That might be fixed now, but if not, that SO question has a monkey patch that might help.
Update
Here's some test code that's working for me with JRuby on Torquebox:
# /app/controllers/test_controller.rb
def index
  respond_to do |format|
    format.csv do
      render stream: true, layout: false
    end
  end
end
# /app/views/test/index.csv.erb
<% 100.times do -%>
<%= (1..1000).to_a.shuffle.join(",") %>
<% end -%>
# /config/application.rb
module StreamTest
  class Application < Rails::Application
    config.middleware.use Rack::Deflater
  end
end
Using that as an example, you should be able to replace your view code with something like this to render your CSV:
Name,Created At
<% Model.scope.find_each do |model| -%>
"<%= model.name %>","<%= model.created_at %>"
<% end -%>
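If you want the streamed CSV to download as a file rather than render in the browser, one small addition (a sketch reusing the controller above; the filename is arbitrary) is a Content-Disposition header before the streaming render:
# /app/controllers/test_controller.rb
def index
  respond_to do |format|
    format.csv do
      response.headers['Content-Disposition'] = 'attachment; filename="export.csv"'
      render stream: true, layout: false
    end
  end
end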
As far as I can tell, Rails will continue generating the response even if the user hits stop halfway through. I think this is a limitation of HTTP, but I could be wrong. This should meet the rest of your requirements, though.
Related
So I'm not sure if this is a proper question for Stack Overflow, but with no one else to turn to I figured I'd try here.
Now, the code below works for its intended purpose. However, I would consider myself a novice with Rails, without the experience to foresee any consequences that might arise in the future as my application scales.
So the idea is that when the user clicks a 'Generate PDF' button, Prawn generates the custom PDF, CombinePDF combines it with the PDFs from associated sources, the final PDF is saved to the Rails.root directory (only because I don't know how to pass a save location to CombinePDF and have searched everywhere), Paperclip attaches it to the appropriate model, and then the original PDF generated in Rails.root is deleted to clean up the directory.
Orders show action in Orders Controller
def show
  @orders = Order.find(params[:id])
  @properties = Property.find(@orders.property.id)
  @deeds = Deed.where(property_id: @properties, order_id: @orders).all
  @mortgages = Mortgage.where(property_id: @properties, order_id: @orders).all
  @attached_assets = AttachedAsset.where(property_id: @properties, order_id: @orders).all
  @owners = Owner.where(property_id: @properties, order_id: @orders).all
  respond_to do |format|
    format.html
    format.pdf do
      order_pdf = OrderPdf.new(@orders, @properties, @deeds, @mortgages, @attached_assets).render
      combine_order_pdf = CombinePDF.new
      combine_order_pdf << CombinePDF.parse(order_pdf)
      if @deeds.any?
        @deeds.each do |deed|
          combine_order_pdf << CombinePDF.load(deed.document.path)
        end
      end
      if @mortgages.any?
        @mortgages.each do |mtg|
          combine_order_pdf << CombinePDF.load(mtg.document.path)
        end
      end
      if @attached_assets.any?
        @attached_assets.each do |assets|
          combine_order_pdf << CombinePDF.load(assets.asset.path)
        end
      end
      combine_order_pdf.save "order_#{@orders.order_number}.pdf"
      paperclip_pdf = File.open("order_#{@orders.order_number}.pdf")
      @orders.document = paperclip_pdf
      @orders.save
      File.delete("order_#{@orders.order_number}.pdf")
      redirect_to property_order_path(@properties, @orders)
      flash[:success] = "PDF Generated Successfully. Please scroll down."
    end
  end
end
Generate PDF button in Orders show view
<%= link_to fa_icon("files-o"), order_path(@orders, format: "pdf"), class: "m-xs btn btn-lg btn-primary btn-outline" %>
To be clear: This code does work, but what I'm asking is:
Is all this logic appropriate in a controller function or is there a better place for it?
Will this process bottleneck my application?
Is there a better way to save directly to Paperclip after CombinePDF/Prawn do their thing?
Can I pass a save location to CombinePDF so it's not in Rails.root?
Is storing Paperclip attachments in Rails.root/public the only way to have them able to be accessed/displayed on an intranet Rails app?
Is there anything I'm not seeing about this method that may put my application at risk for either performance, security, or stability?
Again, I understand if these aren't appropriate questions, but I have no one else to ask, so if that's the case let me know and I'll take it down. I guess what would satisfy the 'Answerable' criteria for this question is anyone who could tell me the Rails way of doing this and why, and/or answer the above questions. Thanks in advance for any input!
Is all this logic appropriate in a controller function or is there a better place for it?
Yes, ideally your action should be around 7-8 lines, and the remaining piece can either become another action or go into a module that you can place in the concerns folder. For example, you can take the PDF logic out and write another method inside the concerns folder in a file named orders_related.rb:
# /app/controllers/concerns/orders_related.rb
module OrdersRelated
  def self.parsing_pdf(orders, properties, deeds, mortgages, attached_assets)
    order_pdf = OrderPdf.new(orders, properties, deeds, mortgages, attached_assets).render
    .
    .
    orders.document = paperclip_pdf
    orders.save
    File.delete("order_#{orders.order_number}.pdf")
    redirect_to property_order_path(properties, orders)
  end
end
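For illustration, the format.pdf block in the controller could then shrink to something like this sketch (note that redirect_to and flash belong to the controller, so in practice you may want to keep those two lines in the action rather than in the module):
format.pdf do
  # the concern method handles PDF generation, attachment, and cleanup as sketched above
  OrdersRelated.parsing_pdf(@orders, @properties, @deeds, @mortgages, @attached_assets)
end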
Will this process bottleneck my application?
Yes, this kind of processing should always happen in the background. I won't go into the details of which background framework you should use, as it depends upon your requirements.
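For example, a minimal sketch of pushing the PDF work into a job (assuming Rails 4.2+ with ActiveJob and a backend such as Sidekiq; GenerateOrderPdfJob is a made-up name):
class GenerateOrderPdfJob < ActiveJob::Base
  queue_as :default

  def perform(order_id)
    order = Order.find(order_id)
    # ...build the combined PDF and attach it with Paperclip, exactly as in the show action...
    # order.document = File.open(pdf_path)
    # order.save
  end
end

# In the controller, enqueue and return immediately:
# GenerateOrderPdfJob.perform_later(@orders.id)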
Is storing Paperclip attachments in Rails.root/public the only way to have them able to be accessed/displayed on an intranet Rails app?
No, you should be using a storage service like an S3 bucket to keep your files. There are many advantages to doing so, which again are out of scope for this question.
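For instance, a hedged example of pointing the existing Paperclip attachment at S3 (assuming the paperclip and aws-sdk gems; the bucket and environment variable names are placeholders):
class Order < ActiveRecord::Base
  has_attached_file :document,
                    storage: :s3,
                    s3_credentials: {
                      bucket: ENV['S3_BUCKET'],
                      access_key_id: ENV['AWS_ACCESS_KEY_ID'],
                      secret_access_key: ENV['AWS_SECRET_ACCESS_KEY']
                    }
  validates_attachment_content_type :document, content_type: 'application/pdf'
end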
Is there anything I'm not seeing about this method that may put my application at risk for either performance, security, or stability?
Yes, your method clearly needs a lot of refactoring. I can suggest a few things:
Remove .all from your queries; it is not required.
Add indexes to these columns (property_id, order_id) in all the relevant tables (see the migration sketch below).
Never save important files to your public directory (use a third-party storage service).
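As a sketch of the index suggestion (table names are assumed from the controller code above):
class AddPropertyAndOrderIndexes < ActiveRecord::Migration
  def change
    add_index :deeds, [:property_id, :order_id]
    add_index :mortgages, [:property_id, :order_id]
    add_index :attached_assets, [:property_id, :order_id]
    add_index :owners, [:property_id, :order_id]
  end
end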
P.S.: I intentionally left out two questions, as I don't have much experience with CombinePDF.
I'm using the axlsx_rails Ruby gem in Rails 4.2.5 to generate an Excel file to let users download their data.
I have this in my index.xlsx.axlsx template:
wb = xlsx_package.workbook
wb.add_worksheet(name: 'Transactions') do |sheet|
  sheet.add_row ["Date", "Vendor Name", "Account",
                 "Transaction Category",
                 "Amount Spent", "Description"]
  @transactions.find_each(batch_size: 100) do |transaction|
    sheet.add_row [transaction.transaction_date,
                   transaction.vendor_name,
                   transaction.account.account_name,
                   transaction.transaction_category.name,
                   transaction.amount,
                   transaction.description]
  end
end
The page times out before returning an Excel file if there's enough data. Is there a way to use HTTP streaming to send results back as it's processing, rather than waiting until the entire transactions.find_each loop has completed?
I saw code here using response.stream.write:
response.headers['Content-Type'] = 'text/event-stream'
10.times {
  response.stream.write "This is a test message"
  sleep 1
}
response.stream.close
That approach looks promising, but I couldn't figure out how to integrate response.stream.write into an axlsx_rails template. Is there a way?
This is my first Stack Overflow question- apologies for any faux pas and thank you for any ideas you can offer.
Welcome to SO, Joe.
I asked in comment, but perhaps it's better to answer and explain.
The short answer is yes, you can always stream if you can render (though sometimes with mixed performance results).
It does not, however, work if you're referencing a file directly, i.e. http://someurl.com/reports/mycustomreport.xlsx.
Streaming in Rails just isn't built that way by default. But not to worry, you "should" still be able to tackle your issue, provided the time you wish to save is rendering only.
In your controller (a note for the future: when you're asking about rendering actions, it helps to provide your controller action code) you should be able to do something similar to:
def report
  @transactions = current_user.transactions.all
  respond_to do |format|
    format.html { render xlsx: 'report', stream: true }
  end
end
Might help to do a sanity check on your loading. In your log as part of the 200 response you should get something like:
Completed 200 OK in 506ms (Views: 494.6ms | ActiveRecord: 2.8ms)
If the active record number is too high, or higher than the view number, this solution might not work for your query, and as suggested, this might need to be threaded or sent to a job.
Even if you can stream, I don't think it will be any faster. The problem is Axlsx is not going to generate your spreadsheet until you are done building it. And axlsx_rails just wraps that process, so it won't help either. So there will be no partial spreadsheet to serve in bits, and the delay will be just as long.
You should bite the bullet and try Sidekiq (which is very fast) or some other job scheduler. Then you can respond to the request immediately and generate the spreadsheet in the background. You will have to do some kind of monitoring or notification to get the generated report, or a ping back to another URL using JavaScript that forwards to a new page when a flag is set on render completion. Your call there.
Having a job scheduler is also very convenient when you need to fire off an email in response to a request; the response can return immediately and not wait for the email to complete. Once you have a scheduler you will find more uses for it.
If you choose a job scheduler, axlsx_rails will let you use your template to generate the attachment, or you can create your own view context to generate the file. Or for a really bare bones way of rendering the template, see this test.
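As a rough sketch of that background approach (assuming the sidekiq and axlsx gems used directly, rather than rendering the axlsx_rails template; the worker name and tmp path are placeholders):
class TransactionsReportWorker
  include Sidekiq::Worker

  def perform(user_id)
    user = User.find(user_id)
    package = Axlsx::Package.new
    package.workbook.add_worksheet(name: 'Transactions') do |sheet|
      sheet.add_row ["Date", "Vendor Name", "Account", "Transaction Category", "Amount Spent", "Description"]
      user.transactions.find_each(batch_size: 100) do |t|
        sheet.add_row [t.transaction_date, t.vendor_name, t.account.account_name,
                       t.transaction_category.name, t.amount, t.description]
      end
    end
    package.serialize(Rails.root.join('tmp', "transactions_#{user_id}.xlsx").to_s)
    # ...then notify the user or move the file somewhere they can download it...
  end
end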
API requests take too long and are costing me money in my Rails integration tests and my application.
I would like to save API responses and then use that data for testing. Are there any good ways to make that happen?
Also, how can I make fewer API calls in production/development? What kind of caching can I use?
If I understand correctly, your Rails app is using an external API, like a Google/Facebook/Twitter API, that kind of thing.
Caching the views won't help, because view caching only stores the rendered template so it doesn't waste time rendering it again, and it validates that the cache is warm by hashing the data, so the code will still hit the API to verify that the hashes still match.
For you, the best way is to use a class that does all the API calls and caches them in the Rails cache with an expiry period. You don't want your cache to be too stale, but at the same time you will sacrifice some accuracy for some money (e.g. only make a single call every 5, 15, or 30 minutes, whichever you pick).
Here's a sample of what I have in mind, but you should modify it to match your needs:
module ApiWrapper
  class << self
    def some_method(some_key) # if keys are needed, like an id or something
      Rails.cache.fetch("some_method/#{some_key}", expires_in: 5.minutes) do
        # assuming ApiLibrary is the external library handler
        ApiLibrary.call_external_library(some_key)
      end
    end
  end
end
Then in your code, call that wrapper; it will only contact the external API if the stored value in the cache has expired.
The call will be something like this:
# assuming 5 is the id or value you want to fetch from the api
ApiWrapper.some_method(5)
You can read more about caching methods in the Rails caching guide.
Update:
I just thought of another way: for your testing (e.g. RSpec tests) you could stub the API calls, which saves the whole API call unless you are testing the API itself. Using the same API library I wrote above, we can stub ApiLibrary itself:
allow(ApiLibrary).to receive(:call_external_library).and_return({ data: 'some fake data' })
P.S.: The hash key data is part of the return value; it's the whole hash, not just the string.
There is a great gem for this called VCR. It allows you to make a single request and keep the response cached, so every time you run the test you will use this saved response.
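A minimal sketch of a VCR setup for RSpec (the cassette directory and :webmock hook are the usual choices; adjust to your suite):
# spec/support/vcr.rb
require 'vcr'

VCR.configure do |c|
  c.cassette_library_dir = 'spec/cassettes'
  c.hook_into :webmock
  c.configure_rspec_metadata!
end

# In a spec, the first run records the HTTP interaction and later runs replay it:
# it 'fetches external data', :vcr do
#   expect(ApiWrapper.some_method(5)).to be_present
# end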
I would use http://redis.io/ in conjunction with something like jbuilder. So as an example your view would look like:
json.cache! ["cache", "plans_index"] do
  json.array! @plans do |plan|
    json.partial! plan
  end
end
for this controller:
def index
  @plans = Plan.all
end
If you have something that is a show page you can cache it like this:
json.cache! ["cache", "plan_#{params["id"]}"] do
  json.extract! @plan, :short_description, :long_description
end
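For json.cache! to actually hit Redis, the Rails cache store needs to point at it; a hedged example (assuming the redis-rails gem on an older Rails version; Rails 5.2+ ships a built-in :redis_cache_store):
# config/environments/production.rb
config.cache_store = :redis_store, ENV.fetch('REDIS_URL', 'redis://localhost:6379/0/cache'), { expires_in: 90.minutes }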
I want to pass an XML value from one page to another in a better way.
I am getting this XML value from API:
<hotelist>
  <hotel>
    <hotelId>109</hotelId>
    <hotelName>Hotel Sara</hotelName>
    <location>UK,london</location>
  </hotel>
  <hotel>
    <hotelId>110</hotelId>
    <hotelName>Radha park</hotelName>
    <location>UK,london</location>
  </hotel>
  <hotel>
    <hotelId>111</hotelId>
    <hotelName>Hotel Green park</hotelName>
    <location>chennai,India</location>
  </hotel>
  <hotel>
    <hotelId>112</hotelId>
    <hotelName>Hotel Mauria</hotelName>
    <location>paris,France</location>
  </hotel>
</hotelist>
I want to pass one hotel:
<hotel>
  <hotelId>112</hotelId>
  <hotelName>Hotel Mauria</hotelName>
  <location>paris,France</location>
</hotel>
to next page.
I am using the Nokogiri gem for parsing the XML. For the next API call I have to pass the one hotel to the next page. What is the best method?
Note: This is just a sample. There is a lot of information bound to each hotel, including available rooms, discounts, and so on.
So as far as I'm getting this, you are searching for some hotels through a third-party service and then displaying a list. After the user clicks on an item, you display the detail info for the hotel.
The easiest way would be to have another API endpoint that can provide the detail information for a specific hotel id. I guess you're dealing with some really bad API and that's not the case.
There are a couple of other options (ordered by complexity):
1. There is really not much data and it should fit in a simple GET request, so you can just encode the respective hotel information into the URL parameters for the detail page. Assuming you have set up resourceful routing and have already parsed the XML into an @hotels array of Hotel models/structs or the like:
<% @hotels.each do |hotel| %>
  <%# generates <a href="/hotels/112?hotelName=Hotel+Mauria&location=paris%2C+France">Hotel Mauria</a> %>
  <%= link_to hotel.hotelName, hotel_path(hotel.hotelId, hotelName: hotel.hotelName, location: hotel.location) %>
<% end %>
2. Encode the info into the respective hotel DOM elements as data-attributes:
<div class="hotel" data-id="112" data-hotel-name="Mauria" ... />
Then render the detail page entirely on the client side, without the server, by subscribing to a click event, reading the info stored in the respective data attributes, and replacing the list with the detail div.
If the third party API is public you could even move the search problem entirely to the client.
3. Introduce caching of search requests on the server, then just pick a hotel from the cache by its id. This would save you from making too many third-party requests from your Rails app, which is a weak spot of Rails if deployed on a typical multi-process server.
The simplest way of doing this would be to store the last search result in the user session, but that would probably be too memory heavy. If you can expect the hotel information not to change frequently, you could cache it by the query parameters. You could also use a smart caching store like Redis and index the entire hotel information, then perform the search on the cache and only hit the third-party API on a cache miss. But always remember: caching is easy, expiring is hard.
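A rough sketch of that third option (names hypothetical): cache each hotel from the last search result by its id, then read it back on the detail page without another API call.
# after parsing the search response with Nokogiri:
doc.xpath('//hotel').each do |hotel|
  attrs = Hash.from_xml(hotel.to_xml)['hotel']
  Rails.cache.write("hotel/#{attrs['hotelId']}", attrs, expires_in: 15.minutes)
end

# detail page controller:
# @hotel = Rails.cache.read("hotel/#{params[:id]}") || fetch_hotel_from_api(params[:id])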
"Everyone should be using low level caching in Rails" could be interesting for implementing a cache.
If you don't mind passing all that information in query parameters:
links = doc.xpath('//hotel').map do |hotel|
  hash = Hash.from_xml(hotel.to_xml)
  url_for({controller: 'x', action: 'y'}.merge(hash))
  # or if you have your link as a string
  # "#{link_string}?#{hash.to_param}"
end
If you want to create a link for just one hotel, extract the relevant XML (e.g., using the process described in Uri's answer), and then generate the link as above.
Assuming you have the API XML ready before you render the current page, you could render the relevant hotel data into form fields so that you could post to the next page, something like:
<%= fields_for :hotel do |hf| %>
  <%= hf.hidden_field :hotelId, value: hash['hotel']['hotelId'] %>
  # ...
<% end %>
The optimal way to achieve this is the one suggested by Mark Thomas.
However, if you still need to pass data between pages, you can put all the XML information as a string in a session variable and use it on the next page.
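A minimal sketch of that session approach (variable names hypothetical): store the selected hotel's XML fragment in the session on one request and re-parse it with Nokogiri on the next.
# in the action that has the search results:
session[:selected_hotel_xml] = hotel_node.to_xml

# in the next page's action:
# hotel = Nokogiri::XML(session[:selected_hotel_xml]).at_xpath('//hotel')
# hotel_name = hotel.at_xpath('hotelName').text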
This post is more about a coding approach to a problem rather than a problem itself (for a change!).
I have a number of projects that I'm working on at the moment that require me to gather sales data from a number of disparate sources.
The data from each vendor is almost always accessed and structured in different ways; best case a nice valid JSON response, worst case I screen scrape the data.
As each vendor's source data is so different, I've decided that a dedicated Rails API app that feeds JSON data to a master sales-data app via its API is the way forward.
I did look at using Sinatra for each vendor app, but my knowledge is with Rails so I can get things done much quicker. I feel a dedicated app for each vendor is the right approach, as these can be maintained independently, and should a vendor decide to start feeding their data themselves, I (or another developer) can easily swap things over without having to delve into one massive monolithic sales-data-gathering app. However, do say if you think dedicated apps don't make much sense.
So, as a simple stripped-down example, each vendor app is structured around a class like this. Currently I'm just calling the methods via the console, but I will automate them via rake tasks and background workers eventually.
class VendorA < ActiveRecord::Base
  def self.get_report
    # Uses the Mechanize gem to fetch a report in CSV format
    # returns report
  end

  def self.save_report(report)
    # Takes the report from the get_report method and saves it,
    # currently to the app root but eventually this will be S3
    # returns the local_report
  end

  def self.convert_report_to_json(local_report)
    # Reads the local report, iterates through it grabbing the fields required
    # for the master sales-data app, and constructs a JSON response
    # returns a valid JSON response called json_data
  end

  def self.send_to_master_sales_api(json_data)
    # As you can see here, I take the response from convert_report_to_json and post it to my sales-data app
    require 'rest-client'
    RestClient.post('http://localhost:3000/api/v1/sales/batch', json_data)
  end
end
This works fine with the send_to_master_sales_api method doing what is expected. I haven't yet tested this beyond approx 1000 data objects/lines though.
On the receiving end, at the master sales-data App, things look like this:
module Api
  module V1
    class SalesController < ApplicationController
      before_filter :restrict_access
      skip_before_filter :verify_authenticity_token
      respond_to :json

      def batch
        i = 0
        clientid = params[:clientid]
        objArray = JSON.parse(params[:sales])
        objArray.each do |sale|
          sale = Sale.new(sale)
          sale.clientid = clientid # Required / Not null
          sale.save
          i += 1
        end
        render json: { message: "#{i} Sales lines received" }, status: 202
      end

      private

      def restrict_access
        api_key = ApiKey.find_by_api_key_and_api_secret(params[:api_key], params[:api_secret])
        render json: { message: 'Invalid Key - Access Restricted' }, status: :unauthorized unless api_key
      end
    end
  end
end
So, my main question is really about the volume of JSON data you think this approach can handle. As I mentioned above, I've tested with around 1,000 lines/objects and it worked fine. But should I start having to process data from one of the vendor apps where I'm approaching say 10,000, 100,000 or even 1,000,000 lines/objects per day from each source, what should I be thinking about in terms of the above two apps being able to cope? Is there something like find_in_batches I could be using to ease the load of receiving the data? Currently I plan to write the master sales-data records to a Postgres DB. While I have limited experience with NoSQL, would writing to MongoDB or similar speed things up at the receiving end?
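For scale, one illustrative sketch (assuming the activerecord-import gem, which is not in the code above) of batching the inserts in the batch action instead of saving one row at a time:
def batch
  clientid = params[:clientid]
  sales = JSON.parse(params[:sales]).map do |attrs|
    Sale.new(attrs.merge('clientid' => clientid))
  end
  Sale.import(sales)   # one multi-row INSERT instead of N individual saves
  render json: { message: "#{sales.size} Sales lines received" }, status: 202
end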
I realise this isn't a particularly direct question, but I would really appreciate input and thoughts from those with experience of this kind of thing.
Thanks in advance!