Fast ruby http library for large XML downloads - ruby-on-rails

I am consuming various XML-over-HTTP web services returning large XML files (> 2MB). What would be the fastest ruby http library to reduce the 'downloading' time?
Required features:
both GET and POST requests
gzip/deflate downloads (Accept-Encoding: deflate, gzip) - very important
I am deciding between:
open-uri
Net::HTTP
curb
but you can also come up with other suggestions.
P.S. To parse the response, I am using a pull parser from Nokogiri, so I don't need an integrated solution like rest-client or hpricot.
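For reference, the pull parsing referred to is presumably something along these lines: a minimal Nokogiri::XML::Reader sketch over an already-downloaded payload (the file name is just a placeholder):

require 'nokogiri'

# walk the document node by node instead of building a full DOM
reader = Nokogiri::XML::Reader(File.read('big.xml'))
reader.each do |node|
  next unless node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
  puts node.name
end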

You can use EventMachine and em-http to stream the XML:
require 'rubygems'
require 'eventmachine'
require 'em-http'
require 'nokogiri'

# this is your SAX handler, I'm not very familiar with
# Nokogiri, so I just took an example from the RDoc
class StreamingDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs=[])
    puts "starting: #{name}"
  end

  def end_element(name)
    puts "ending: #{name}"
  end
end

document = StreamingDocument.new
url = 'http://stackoverflow.com/feeds/question/2833829'

# run the EventMachine reactor, this call will block until
# EventMachine.stop is called
EventMachine.run do
  # Nokogiri wants an IO to read from, so create a pipe that it
  # can read from, and we can write to
  io_read, io_write = IO.pipe

  # run the parser in its own thread so that it can block while
  # reading from the pipe
  EventMachine.defer(proc {
    parser = Nokogiri::XML::SAX::Parser.new(document)
    parser.parse_io(io_read)
  })

  # use em-http to stream the XML document, feeding the pipe with
  # each chunk as it becomes available
  http = EventMachine::HttpRequest.new(url).get
  http.stream { |chunk| io_write << chunk }

  # when the HTTP request is done, stop EventMachine
  http.callback { EventMachine.stop }
end
It's a bit low-level perhaps, but probably the most performant option for any document size. Feed it hundreds of megs and it will not fill up your memory, as any non-streaming solution would (as long as you don't keep too much of the document you're loading around, but that's on your side of things).
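For example (an illustrative sketch, not part of the original answer), a handler that accumulates only the data it actually needs stays flat on memory no matter how large the feed is:

# illustrative only: collect just the <title> text instead of the whole tree
class TitleCollector < Nokogiri::XML::SAX::Document
  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  def start_element(name, attrs = [])
    @in_title = (name == 'title')
  end

  def characters(text)
    @titles << text if @in_title
  end

  def end_element(name)
    @in_title = false
  end
end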

I found Theo's answer while looking for a solution to a similar use case. With a small tweak his example works, but as it stands it didn't work for me; it prematurely cut off the parse when http.callback fired. Thanks for the inspiration, Theo!
require 'rubygems'
require 'eventmachine'
require 'em-http'
require 'nokogiri'

# this is your SAX handler, I'm not very familiar with
# Nokogiri, so I just took an example from the RDoc
class StreamingDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs=[])
    puts "starting: #{name}"
  end

  def end_element(name)
    puts "ending: #{name}"
  end

  def end_document
    puts "should now fire"
  end
end

document = StreamingDocument.new
url = 'http://stackoverflow.com/feeds/question/2833829'

# run the EventMachine reactor, this call will block until
# EventMachine.stop is called
EventMachine.run do
  # Nokogiri wants an IO to read from, so create a pipe that it
  # can read from, and we can write to
  io_read, io_write = IO.pipe

  # run the parser in its own thread so that it can block while
  # reading from the pipe; stop the reactor once parsing finishes
  EventMachine.defer(proc {
    parser = Nokogiri::XML::SAX::Parser.new(document)
    parser.parse_io(io_read)
  }, proc { EventMachine.stop })

  # use em-http to stream the XML document, feeding the pipe with
  # each chunk as it becomes available
  http = EventMachine::HttpRequest.new(url).get
  http.stream { |chunk| io_write << chunk }

  # when the HTTP request is done, close the pipe so the parser
  # sees EOF and finishes instead of being cut off
  http.callback { io_write.close }
end

http://github.com/pauldix/typhoeus might be worth checking out. It's designed for large and fast parallel downloads and is based on libcurl, so it is pretty solid.
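A rough usage sketch with the current Typhoeus API (the gem's interface has changed since this answer was written; accept_encoding lets libcurl negotiate and decompress gzip, and the URL is hypothetical):

require 'typhoeus'

response = Typhoeus.get('http://example.com/big.xml', accept_encoding: 'gzip')
puts response.code
puts response.body.bytesize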
That said, test Net::HTTP and see if the performance is acceptable before doing something more complicated.
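A baseline sketch for that comparison (on Ruby 1.9.3+, Net::HTTP asks for gzip/deflate by default and transparently inflates the body; the URL is hypothetical):

require 'net/http'
require 'uri'

uri = URI('http://example.com/big.xml')
response = Net::HTTP.get_response(uri)   # sends an Accept-Encoding for gzip/deflate by default
xml = response.body                      # already decompressed at this point
puts xml.bytesize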

The fastest download is probably a #read on the IO object, which slurps the whole thing into a single String. After that you can apply your processing. Or do you need the file to be processed during the download?
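If slurping is acceptable, the open-uri version of that is roughly as follows (URL is a placeholder; on Rubies before 3.0, Kernel#open handles URLs once open-uri is required):

require 'open-uri'
require 'nokogiri'

xml = open('http://example.com/big.xml').read   # whole response body in one String
doc = Nokogiri::XML(xml)                        # then parse it after the fact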

Related

Ruby memory (allocations) spikes when handling base64 strings

I have a Rails instance which on average uses about 250MB of memory. Lately I'm having issues with some really heavy spikes in memory usage which result in response times of about ~25s. I have an endpoint which takes some relatively simple params plus base64 strings, which are then sent over to AWS.
See the image below for the correlation between memory and response time.
Now, when I look at some extra logs of what's specifically happening during that time, I found something interesting.
First of all, I find the net_http memory allocations extremely high. Secondly, the update operation took about 25 seconds in total. When I look closely at the timeline, I notice some "blank gaps", between ~5 and ~15 seconds. The specific operations being done during those HTTP calls are, from my perspective, nothing special. But I'm a bit confused why those gaps occur; maybe someone could tell me a bit about that?
The code that's handling the requests:
def store_documents
  identity_documents.each do |side, content|
    is_file = content.is_a?(ActionDispatch::Http::UploadedFile)
    file_extension = is_file ? content : content[:name]
    file_name = "#{SecureRandom.uuid}_#{side}#{File.extname(file_extension)}"

    if is_file
      write_to_storage_service(file_name, content.tempfile.path)
    else
      write_file(file_name, content[:uri])
      write_to_storage_service(file_name, file_name)
      delete_file(file_name)
    end

    store_object_key_on_profile(side, file_name)
  end
end
# rubocop:enable Metrics/MethodLength

def write_file(file_name, base_64_string)
  File.open(file_name, 'wb') do |f|
    f.write(
      Base64.decode64(
        get_content(base_64_string)
      )
    )
  end
end

def delete_file(file_name)
  File.delete(file_name)
end

def write_to_storage_service(file_name, path)
  S3_IDENTITY_BUCKET
    .object(file_name)
    .upload_file(path)
rescue Aws::Xml::Parser::ParsingError => e
  log_error(e)
  add_errors(base: e)
end

def get_content(base_64_string)
  base_64_string.sub %r{data:((image|application)/.{3,}),}, ''
end

def store_object_key_on_profile(side, file_name)
  profile.update("#{side}_identity_document_object_key": file_name)
end

def identity_documents
  {
    front: front_identity_document,
    back: back_identity_document
  }
end

def front_identity_document
  @front_identity_document ||= identity_check_params[:front_identity_document]
end

def back_identity_document
  @back_identity_document ||= identity_check_params[:back_identity_document]
end
I lean towards some issue with Ruby GC, or perhaps Ruby doesn't have enough pages available to directly store the base64 string in memory? I know that Ruby 2.6 and Ruby 2.7 had some large improvements regarding memory fragmentation, but that didn't change much either (currently running Ruby 2.7.1).
I have my Heroku resources configured to use Standard-2x dynos (1GB RAM) x3. WEB_CONCURRENCY (workers) is set to 2, and the number of threads is set to 5.
I understand that my questions are rather broad; I'm more interested in some tooling or ideas that could help narrow my scope. Thanks!
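As a starting point on the tooling side, one option is the memory_profiler gem wrapped around the suspect code path. This is only a sketch, assuming the gem is added to the Gemfile, and is meant for a console or throwaway script rather than production:

require 'memory_profiler'

report = MemoryProfiler.report do
  store_documents   # the code path from the question
end
report.pretty_print(to_file: 'memory_report.txt')   # allocations grouped by gem, file and location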

How to send binary file over Web Sockets with Rails

I have a Rails application where users upload audio files. I want to send them to a third-party server, and I need to connect to the external server using WebSockets, so I need my Rails application to be a WebSocket client.
I'm trying to figure out how to properly set that up. I'm not committed to any gem just yet, but the 'faye-websocket' gem looks promising. I even found a similar answer in "Sending large file in websocket before timeout"; however, using that code doesn't work for me.
Here is an example of my code:
@message = Array.new
EM.run {
  ws = Faye::WebSocket::Client.new("wss://example_url.com")

  ws.on :open do |event|
    File.open('path/to/audio_file.wav','rb') do |f|
      ws.send(f.gets)
    end
  end

  ws.on :message do |event|
    @message << [event.data]
  end

  ws.on :close do |event|
    ws = nil
    EM.stop
  end
}
When I use that, I get an error from the recipient server:
No JSON object could be decoded
This makes sense, because I don't believe it's properly formatted for faye-websocket. Their documentation says:
send(message) accepts either a String or an Array of byte-sized
integers and sends a text or binary message over the connection to the
other peer; binary data must be encoded as an Array.
I'm not sure how to accomplish that. How do I load binary into an array of integers with Ruby?
I tried modifying the send command to use the bytes method:
File.open('path/to/audio_file.wav','rb') do |f|
  ws.send(f.gets.bytes)
end
But now I receive this error:
Stream was 19 bytes but needs to be at least 100 bytes
I know my file is 286KB, so something is wrong here. I get confused about when to use File.read vs. File.open vs. File.new.
Also, maybe this gem isn't the best for sending binary data. Does anyone have success sending binary files in Rails with websockets?
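On the File.read vs. File.open vs. File.new confusion above, roughly:

data = File.read('audio.wav')            # slurps the whole file into one String
File.open('audio.wav', 'rb') do |f|      # yields an IO you can read in chunks
  chunk = f.read(64 * 1024)
end
f = File.new('audio.wav', 'rb')          # like open without a block...
f.close                                  # ...but you must close it yourself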
Update: I did find a way to get this working, but it is terrible for memory. For other people who want to load small files, you can simply use File.binread and the unpack method:
ws.on :open do |event|
  f = File.binread 'path/to/audio_file.wav'
  ws.send(f.unpack('C*'))
end
However, if I use that same code on a mere 100MB file, the server runs out of memory. It depletes the entire available 1.5GB on my test server! Does anyone know how to do this in a memory-safe manner?
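One way to keep the faye-websocket approach but cap memory would be to unpack one chunk at a time instead of the whole file. This is an untested sketch; whether the receiving server can reassemble the file from several messages depends on its protocol:

ws.on :open do |event|
  File.open('path/to/audio_file.wav', 'rb') do |f|
    # only one 64 KB chunk's worth of integers is held in memory per send
    ws.send(f.read(64 * 1024).unpack('C*')) until f.eof?
  end
end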
Here's my take on it:
# do only once when initializing Rails:
require 'iodine/client'
Iodine.force_start!

# this sets the callbacks.
# on_message is always required by Iodine.
options = {}

options[:on_message] = Proc.new do |data|
  # this will never get called
  puts "incoming data ignored? for:\n#{data}"
end

options[:on_open] = Proc.new do
  # believe it or not - this variable belongs to the websocket connection.
  @started_upload = true

  # set a task to send the file,
  # so the on_open initialization doesn't block incoming messages.
  Iodine.run do
    # read the file and write to the websocket.
    File.open('filename', 'r') do |f|
      buffer = String.new # recycle the String's allocated memory
      write f.read(65_536, buffer) until f.eof?
      @started_upload = :done
    end
    # close the connection
    close
  end
end

options[:on_close] = Proc.new do |data|
  # can we notify the user that the file was uploaded?
  if @started_upload == :done
    # we did it :-)
  else
    # what happened?
  end
end

# will not wait for a connection:
Iodine::Http.ws_connect "wss://example_url.com", options
# OR
# will wait for a connection, raising errors if failed.
Iodine::Http::WebsocketClient.connect "wss://example_url.com", options
It's only fair to mention that I'm Iodine's author, which I wrote for use in Plezi (a RESTful Websocket real time application framework you can use stand alone or within Rails)... I'm super biased ;-)
I would avoid gets because its chunk size could be anything from a single byte to the whole file, depending on where the next End Of Line (EOL) marker happens to fall... read gives you better control over each chunk's size.

Running code asynchronously inside pollers

In my Ruby script, I am using the celluloid-zmq gem, where I am trying to run evaluate_response asynchronously inside pollers using:
async.evaluate_response(socket.read_multipart)
But if I remove the sleep from the loop, somehow that's not working out; it never reaches the evaluate_response method. If I put a sleep inside the loop, it works perfectly.
require 'celluloid/zmq'

Celluloid::ZMQ.init

module Celluloid
  module ZMQ
    class Socket
      def socket
        @socket
      end
    end
  end
end

class Indefinite
  include Celluloid::ZMQ

  ## Readers
  attr_reader :dealersock, :pullsock, :pollers

  def initialize
    prepare_dealersock and prepare_pullsock and prepare_pollers
  end

  ## prepare DEALER SOCK
  def prepare_dealersock
    @dealersock = DealerSocket.new
    @dealersock.identity = "IDENTITY"
    @dealersock.connect("tcp://localhost:20482")
  end

  ## prepare PULL SOCK
  def prepare_pullsock
    @pullsock = PullSocket.new
    @pullsock.connect("tcp://localhost:20483")
  end

  ## prepare the Pollers
  def prepare_pollers
    @pollers = ZMQ::Poller.new
    @pollers.register_readable(dealersock.socket)
    @pollers.register_readable(pullsock.socket)
  end

  def run!
    loop do
      pollers.poll ## this is a blocking operation, but never mind, we need it
      pollers.readables.each do |socket|
        ## we know socket.read_multipart is a blocking call; this should give Celluloid
        ## the chance to run other work in the meantime.
        async.evaluate_response(socket.read_multipart)
      end
      ## If you remove the sleep, the async evaluate_response is never executed.
      ## sleep 0.2
    end
  end

  def evaluate_response(message)
    ## Hmmm, the code just never reaches here
    puts "got message: #{message}"
    ...
    ...
    ...
    ...
  end
end

## Code is invoked like this
Indefinite.new.run!
Any idea why this is happening?
The question was 100% changed, so my previous answer does not help.
Now, the issues are...
ZMQ::Poller is not part of Celluloid::ZMQ
You are directly using the ffi-rzmq bindings, and not using the Celluloid::ZMQ wrapping, which provides evented & threaded handling of the socket(s).
It would be best to make multiple actors -- one per socket -- or to just use Celluloid::ZMQ directly in one actor, rather than undermining it.
Your actor never gets time to work with the response
This part makes it a duplicate of:
Celluloid async inside ruby blocks does not work
The best answer is to use after or every and not loop ... which is dominating your actor.
You need to either:
Move evaluate_response to another actor.
Move each socket to their own actor.
This code needs to be broken up into several actors to work properly, with a main sleep at the end of the program. But before all that, try using after or every instead of loop.
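A loose, untested sketch of that direction, reusing the sockets, pollers and read_multipart call from the question, and giving the poll a short timeout so each timer tick returns:

class Indefinite
  include Celluloid::ZMQ
  # ... prepare_dealersock, prepare_pullsock, prepare_pollers as before ...

  def run!
    # a recurring timer instead of `loop`: between ticks the actor is free
    # to process its mailbox, so the queued async calls actually execute
    every(0.2) do
      pollers.poll(100)   # timeout in ms, so this block returns promptly
      pollers.readables.each do |socket|
        async.evaluate_response(socket.read_multipart)
      end
    end
  end
end

Indefinite.new.run!
sleep   # keep the main thread alive so the actor keeps working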

Use sidekiq with a running dynamic counter in Rails

I built a website crawler that (later on) uses these links to read out information.
The current rake task goes through all the possible pages one by one and checks whether the request goes through (valid response) or returns a 404/503 error (invalid page). If it's valid, the page's URL gets saved into my database.
Now as you can see the task requests 50,000 pages in total thus requires some time.
I have read about Sidekiq and how it can perform these tasks asynchronously thus making this a lot faster.
My question: as you can see, my task builds up the counter after each loop. I guess this will not work with Sidekiq, as it will just run this script several times independently of itself, am I right?
How would I get around the problem of each instance needing its own counter then?
Hopefully my question makes sense - Thank you very much!
desc "Validate Pages"
task validate_url: :environment do
require 'rubygems'
require 'open-uri'
require 'nokogiri'
counter = 1
base_url = "http://example.net/file"
until counter > 50000 do
begin
url = "#{base_url}_#{counter}/"
open(url)
page = Page.new
page.url = url
page.save!
puts "Saved #{url} !"
counter += 1
rescue OpenURI::HTTPError => ex
logger ||= Logger.new("validations.log")
if ex.io.status[0] == "503"
logger.info "#{ex} # #{counter}"
end
puts "#{ex} # #{counter}"
counter += 1
rescue SocketError => ex
logger ||= Logger.new("validations.log")
logger.info "#{ex} # #{counter}"
puts "#{ex} # #{counter}"
counter += 1
end
end
end
A simple Redis INCR operation will create and/or increment a global counter for your jobs to use. You can use Sidekiq's redis connection to implement a counter trivially:
Sidekiq.redis do |conn|
  conn.incr("my-counter")
end
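Building on that, a rough sketch of a worker where each job claims the next page number from Redis (class and key names are made up, not from the question):

require 'open-uri'

class UrlValidationWorker
  include Sidekiq::Worker

  BASE_URL = "http://example.net/file"

  def perform
    # INCR is atomic, so concurrent workers never get the same number
    counter = Sidekiq.redis { |conn| conn.incr("page-validation-counter") }
    return if counter > 50_000

    url = "#{BASE_URL}_#{counter}/"
    open(url)
    Page.create!(url: url)
  rescue OpenURI::HTTPError, SocketError => ex
    Sidekiq.logger.info "#{ex} # #{counter}"
  end
end

# enqueue one job per page, e.g. from the rake task:
# 50_000.times { UrlValidationWorker.perform_async }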
If you want to run it async, that means you will have many instances of the same job. The fastest approach is to use something like Redis: it gives you a simple and fast way to check and update the counter. But also make sure you take care of the counter: if one of your jobs is using it, lock it for the other jobs so there won't be wrong results, etc.

read and execute ruby script

script.rb:
puts 'hello'
puts 'foo'
main.rb:
puts `jruby script.rb` # receive the expected result
The question:
How can the same be achieved with reading the "script" before execution?
main.rb:
code = File.open('script.rb', 'r').read.gsub('"', '\"')
# puts `jruby -e '#{code}'` # Does not work for relatively big files;
Windows and Unicode are the reasons for this question.
Please note that `jruby script.rb` creates a new process, which is essential.
Store the modified script in a Tempfile and run that instead of passing the whole contents as an eval argument:
require 'tempfile'

code = IO.read('script.rb').gsub('"', '\"')

begin
  file = Tempfile.new 'mytempfile'
  file.write code
  file.close
  puts `jruby '#{file.path}'`
ensure
  file.close
  file.unlink
end
The reason you’re likely getting an error is either a lack of proper escaping or a limit on the maximum argument length in the shell.
Also, beware that in your original implementation you never close the original file. I’ve fixed that by instead using IO.read.
In the command line, using
$ getconf ARG_MAX
will give the upper limit on how many bytes can be used for the command line argument and environment variables.
@Andrew Marshall's answer is better, but suppose you don't want to use a temp file, and assuming we can use fork in JRuby,
require 'ffi'

module Exec
  extend FFI::Library
  ffi_lib FFI::Platform::LIBC
  attach_function :fork, [], :int
end

code = IO.read('script.rb')

pid = Exec.fork
if 0 == pid
  eval code
  exit 0
else
  Process.waitpid pid
end
use require
main.rb:
require "script.rb"
