How to send binary file over Web Sockets with Rails

I have a Rails application where users upload Audio files. I want to send them to a third party server, and I need to connect to the external server using Web sockets, so, I need my Rails application to be a websocket client.
I'm trying to figure out how to properly set that up. I'm not committed to any gem just yet, but the 'faye-websocket' gem looks promising. I even found a similar answer in "Sending large file in websocket before timeout"; however, that code doesn't work for me.
Here is an example of my code:
@message = Array.new
EM.run {
  ws = Faye::WebSocket::Client.new("wss://example_url.com")
  ws.on :open do |event|
    File.open('path/to/audio_file.wav','rb') do |f|
      ws.send(f.gets)
    end
  end
  ws.on :message do |event|
    @message << [event.data]
  end
  ws.on :close do |event|
    ws = nil
    EM.stop
  end
}
When I use that, I get an error from the recipient server:
No JSON object could be decoded
This makes sense, because I don't believe it's properly formatted for faye-websocket. Their documentation says:
send(message) accepts either a String or an Array of byte-sized
integers and sends a text or binary message over the connection to the
other peer; binary data must be encoded as an Array.
I'm not sure how to accomplish that. How do I load binary into an array of integers with Ruby?
I tried modifying the send command to use the bytes method:
File.open('path/to/audio_file.wav','rb') do |f|
  ws.send(f.gets.bytes)
end
But now I receive this error:
Stream was 19 bytes but needs to be at least 100 bytes
I know my file is 286KB, so something is wrong here. I also get confused as to when to use File.read vs. File.open vs. File.new.
Also, maybe this gem isn't the best for sending binary data. Does anyone have success sending binary files in Rails with websockets?
Update: I did find a way to get this working, but it is terrible for memory. For other people who want to load small files, you can simply use File.binread and the unpack method:
ws.on :open do |event|
  f = File.binread 'path/to/audio_file.wav'
  ws.send(f.unpack('C*'))
end
However, if I use that same code on a mere 100MB file, the server runs out of memory. It depletes the entire available 1.5GB on my test server! Does anyone know how to do this in a memory-safe manner?
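For what it's worth, a chunked variant of the code above should keep memory flat; this is only an untested sketch, and it assumes the receiving server accepts the audio split across multiple binary frames:
EM.run {
  ws = Faye::WebSocket::Client.new("wss://example_url.com")
  ws.on :open do |event|
    File.open('path/to/audio_file.wav', 'rb') do |f|
      # stream the file in 64KB chunks so only one chunk is held in memory at a time
      while (chunk = f.read(65_536))
        ws.send(chunk.bytes) # binary frame: an Array of byte-sized integers
      end
    end
  end
  ws.on :close do |event|
    EM.stop
  end
}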

Here's my take on it:
# do only once when initializing Rails:
require 'iodine/client'
Iodine.force_start!

# this sets the callbacks.
# on_message is always required by Iodine.
options = {}
options[:on_message] = Proc.new do |data|
  # this will never get called
  puts "incoming data ignored? for:\n#{data}"
end
options[:on_open] = Proc.new do
  # believe it or not - this variable belongs to the websocket connection.
  @started_upload = true
  # set a task to send the file,
  # so the on_open initialization doesn't block incoming messages.
  Iodine.run do
    # read the file and write to the websocket.
    File.open('filename','r') do |f|
      buffer = String.new # recycle the String's allocated memory
      write f.read(65_536, buffer) until f.eof?
      @started_upload = :done
    end
    # close the connection
    close
  end
end
options[:on_close] = Proc.new do |data|
  # can we notify the user that the file was uploaded?
  if @started_upload == :done
    # we did it :-)
  else
    # what happened?
  end
end

# will not wait for a connection:
Iodine::Http.ws_connect "wss://example_url.com", options
# OR
# will wait for a connection, raising errors if failed.
Iodine::Http::WebsocketClient.connect "wss://example_url.com", options
It's only fair to mention that I'm the author of Iodine, which I wrote for use in Plezi (a RESTful Websocket real-time application framework you can use standalone or within Rails)... I'm super biased ;-)
I would avoid gets, because its result could be anything from a single byte to the whole file, depending on where the next End Of Line (EOL) marker happens to be... read gives you better control over each chunk's size.
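To illustrate the difference (a rough sketch; the 64KB chunk size is arbitrary):
File.open('path/to/audio_file.wav', 'rb') do |f|
  f.gets         # reads up to the next EOL marker - could be 1 byte, could be the whole file
  f.read(65_536) # reads at most 64KB, regardless of the file's content
end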

Related

Ruby memory (allocations) spikes when handling base64 strings

I have a Rails instance which on average uses about 250MB of memory. Lately I'm having issues with some really heavy spikes of memory usage which result in a response time of about ~25s. I have an endpoint which takes some relatively simple params and base64 strings which are being sent over to AWS.
See the image below for the correlation between memory/response time.
Now, when I look at some extra logs what's specifically happening during that time, I found something interesting.
First of all, I find the net_http memory allocations extremely high. Secondly, the update operation took about 25 seconds in total. When I look closely at the timeline, I notice some "blank gaps" of between ~5 and ~15 seconds. The specific operations being done during those HTTP calls are, from my perspective, nothing special. But I'm a bit confused why those gaps occur; maybe someone could tell me a bit about that?
The code that's handling the requests:
def store_documents
  identity_documents.each do |side, content|
    is_file = content.is_a?(ActionDispatch::Http::UploadedFile)
    file_extension = is_file ? content : content[:name]
    file_name = "#{SecureRandom.uuid}_#{side}#{File.extname(file_extension)}"
    if is_file
      write_to_storage_service(file_name, content.tempfile.path)
    else
      write_file(file_name, content[:uri])
      write_to_storage_service(file_name, file_name)
      delete_file(file_name)
    end
    store_object_key_on_profile(side, file_name)
  end
end
# rubocop:enable Metrics/MethodLength

def write_file(file_name, base_64_string)
  File.open(file_name, 'wb') do |f|
    f.write(
      Base64.decode64(
        get_content(base_64_string)
      )
    )
  end
end

def delete_file(file_name)
  File.delete(file_name)
end

def write_to_storage_service(file_name, path)
  S3_IDENTITY_BUCKET
    .object(file_name)
    .upload_file(path)
rescue Aws::Xml::Parser::ParsingError => e
  log_error(e)
  add_errors(base: e)
end

def get_content(base_64_string)
  base_64_string.sub %r{data:((image|application)/.{3,}),}, ''
end

def store_object_key_on_profile(side, file_name)
  profile.update("#{side}_identity_document_object_key": file_name)
end

def identity_documents
  {
    front: front_identity_document,
    back: back_identity_document
  }
end

def front_identity_document
  @front_identity_document ||= identity_check_params[:front_identity_document]
end

def back_identity_document
  @back_identity_document ||= identity_check_params[:back_identity_document]
end
My suspicion tends towards some issue with Ruby's GC, or perhaps Ruby not having enough pages available to store the base64 string in memory directly? I know that Ruby 2.6 and Ruby 2.7 had some large improvements regarding memory fragmentation, but that didn't change much either (currently running Ruby 2.7.1).
I have my Heroku resources configured to use Standard-2X dynos (1GB RAM) x3. WEB_CONCURRENCY (workers) is set to 2, and the thread count is set to 5.
I understand that my questions are rather broad; I'm more interested in some tooling, or ideas that could help narrow my scope. Thanks!
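For context, the kind of instrumentation I have in mind looks roughly like this (only a sketch; it assumes the memory_profiler gem is available and that store_documents is called with the same params as in production):
require 'memory_profiler'

# wrap the suspect code path and dump allocation stats,
# plus a before/after look at the GC's heap pages
before = GC.stat
report = MemoryProfiler.report do
  store_documents
end
report.pretty_print(to_file: 'tmp/store_documents_allocations.txt')
after = GC.stat
puts "heap_allocated_pages: #{before[:heap_allocated_pages]} -> #{after[:heap_allocated_pages]}"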

Running code asynchronously inside pollers

In my Ruby script I am using the celluloid-zmq gem, where I am trying to run evaluate_response asynchronously inside the pollers using:
async.evaluate_response(socket.read_multipart)
But if I remove the sleep from the loop, it somehow doesn't work: execution never reaches the evaluate_response method. If I put the sleep back inside the loop, it works perfectly.
require 'celluloid/zmq'
Celluloid::ZMQ.init

module Celluloid
  module ZMQ
    class Socket
      def socket
        @socket
      end
    end
  end
end

class Indefinite
  include Celluloid::ZMQ

  ## Readers
  attr_reader :dealersock, :pullsock, :pollers

  def initialize
    prepare_dealersock and prepare_pullsock and prepare_pollers
  end

  ## prepare DEALER SOCK
  def prepare_dealersock
    @dealersock = DealerSocket.new
    @dealersock.identity = "IDENTITY"
    @dealersock.connect("tcp://localhost:20482")
  end

  ## prepare PULL SOCK
  def prepare_pullsock
    @pullsock = PullSocket.new
    @pullsock.connect("tcp://localhost:20483")
  end

  ## prepare the Pollers
  def prepare_pollers
    @pollers = ZMQ::Poller.new
    @pollers.register_readable(dealersock.socket)
    @pollers.register_readable(pullsock.socket)
  end

  def run!
    loop do
      pollers.poll ## this is a blocking operation, never mind though, we need it
      pollers.readables.each do |socket|
        ## we know socket.read_multipart is a blocking call; this should give Celluloid the chance to run other tasks in the meantime.
        async.evaluate_response(socket.read_multipart)
      end
      ## If you remove the sleep, the async evaluate_response is never executed.
      ## sleep 0.2
    end
  end

  def evaluate_response(message)
    ## Hmmm, the code just never reaches this point
    puts "got message: #{message}"
    ...
    ...
    ...
    ...
  end
end

## Code is invoked like this
Indefinite.new.run!
Any idea why this is happening?
The question was 100% changed, so my previous answer does not help.
Now, the issues are...
ZMQ::Poller is not part of Celluloid::ZMQ
You are directly using the ffi-rzmq bindings, and not using the Celluloid::ZMQ wrapping, which provides evented & threaded handling of the socket(s).
It would be best to make multiple actors -- one per socket -- or to just use Celluloid::ZMQ directly in one actor, rather than undermining it.
Your actor never gets time to work with the response
This part makes it a duplicate of:
Celluloid async inside ruby blocks does not work
The best answer is to use after or every and not loop ... which is dominating your actor.
You need to either:
Move evaluate_response to another actor.
Move each socket to its own actor.
This code needs to be broken up into several actors to work properly, with a main sleep at the end of the program. But before all that, try using after or every instead of loop.
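A rough sketch of the every-based variant (assuming ffi-rzmq's Poller#poll accepts a timeout in milliseconds, so poll(0) returns immediately instead of blocking):
def run!
  # poll on a timer instead of in a tight loop, so the actor's mailbox
  # gets a chance to deliver the async.evaluate_response calls
  every(0.2) do
    @pollers.poll(0) # non-blocking poll
    @pollers.readables.each do |socket|
      async.evaluate_response(socket.read_multipart)
    end
  end
end
The script then needs something else to keep the process alive (for example a sleep at the end of the program, as mentioned above).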

Does Ruby's 'open_uri' reliably close sockets after read or on fail?

I have been using open_uri to pull down an ftp path as a data source for some time, but suddenly found that I'm getting nearly continual "530 Sorry, the maximum number of allowed clients (95) are already connected."
I am not sure whether my code is faulty or whether someone else is accessing the server, and unfortunately there's no real way for me to know for sure who's at fault.
Essentially I am reading FTP URI's with:
def self.read_uri(uri)
  begin
    uri = open(uri).read
    uri == "Error" ? nil : uri
  rescue OpenURI::HTTPError
    nil
  end
end
I'm guessing that I need to add some additional error handling code in here...
I want to be sure that I take every precaution to close down all connections so that my connections are not the problem in question; however, I thought that open_uri + read would take this precaution, as opposed to using the Net::FTP methods.
The bottom line is I've got to be 100% sure that these connections are being closed and that I don't somehow have a bunch of open connections lying around.
Can someone please advise as to correctly using read_uri to pull in ftp with a guarantee that it's closing the connection? Or should I shift the logic over to Net::FTP which could yield more control over the situation if open_uri is not robust enough?
If I do need to use the Net::FTP methods instead, is there a read method that I should be familiar with vs pulling it down to a tmp location and then reading it (as I'd much prefer to keep it in a buffer vs the fs if possible)?
I suspect you are not closing the handles. OpenURI's docs start with this comment:
It is possible to open http/https/ftp URL as usual like opening a file:
open("http://www.ruby-lang.org/") {|f|
f.each_line {|line| p line}
}
I looked at the source and the open_uri method does close the stream if you pass a block, so, tweaking the above example to fit your code:
uri = ''
open("http://www.ruby-lang.org/") {|f|
  uri = f.read
}
Should get you close to what you want.
Here's one way to handle exceptions:
# The list of URLs to pass in to check if one times out or is refused.
urls = %w[
  http://www.ruby-lang.org/
  http://www2.ruby-lang.org/
]

# the method
def self.read_uri(urls)
  content = ''
  open(urls.shift) { |f| content = f.read }
  content == "Error" ? nil : content
rescue OpenURI::HTTPError
  retry if (urls.any?)
  nil
end
Try using a block:
data = open(uri){|f| f.read}
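And for the last part of the question (keeping the FTP download in a buffer rather than going through the filesystem), a minimal sketch using Net::FTP's block form, which closes the connection for you; the host, path and anonymous login are placeholders:
require 'net/ftp'

def self.read_ftp(host, remote_path)
  buffer = ''
  Net::FTP.open(host) do |ftp|       # block form closes the connection when done
    ftp.login                        # anonymous login; adjust as needed
    ftp.retrbinary("RETR #{remote_path}", 4096) { |chunk| buffer << chunk }
  end
  buffer
end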

Fast ruby http library for large XML downloads

I am consuming various XML-over-HTTP web services returning large XML files (> 2MB). What would be the fastest Ruby HTTP library to reduce the 'downloading' time?
Required features:
both GET and POST requests
gzip/deflate downloads (Accept-Encoding: deflate, gzip) - very important
I am thinking between:
open-uri
Net::HTTP
curb
but you can also come with other suggestions.
P.S. To parse the response, I am using a pull parser from Nokogiri, so I don't need an integrated solution like rest-client or hpricot.
You can use EventMachine and em-http to stream the XML:
require 'rubygems'
require 'eventmachine'
require 'em-http'
require 'nokogiri'

# this is your SAX handler, I'm not very familiar with
# Nokogiri, so I just took an example from the RDoc
class SteamingDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs=[])
    puts "starting: #{name}"
  end

  def end_element(name)
    puts "ending: #{name}"
  end
end

document = SteamingDocument.new
url = 'http://stackoverflow.com/feeds/question/2833829'

# run the EventMachine reactor, this call will block until
# EventMachine.stop is called
EventMachine.run do
  # Nokogiri wants an IO to read from, so create a pipe that it
  # can read from, and we can write to
  io_read, io_write = IO.pipe

  # run the parser in its own thread so that it can block while
  # reading from the pipe
  EventMachine.defer(proc {
    parser = Nokogiri::XML::SAX::Parser.new(document)
    parser.parse_io(io_read)
  })

  # use em-http to stream the XML document, feeding the pipe with
  # each chunk as it becomes available
  http = EventMachine::HttpRequest.new(url).get
  http.stream { |chunk| io_write << chunk }

  # when the HTTP request is done, stop EventMachine
  http.callback { EventMachine.stop }
end
It's a bit low-level perhaps, but probably the most performant option for any document size. Feed it hundreds of megs and it will not fill up your memory, as any non-streaming solution would (as long as you don't keep too much of the document you're loading, but that's on your side of things).
I found Theo's answer while looking for a solution to a similar use case. With a small tweak his example works, but as it stood it didn't work for me; it prematurely cut off the parse when the http.callback fired. But thanks for the inspiration, Theo!
require 'rubygems'
require 'eventmachine'
require 'em-http'
require 'nokogiri'

# this is your SAX handler, I'm not very familiar with
# Nokogiri, so I just took an example from the RDoc
class SteamingDocument < Nokogiri::XML::SAX::Document
  def start_element(name, attrs=[])
    puts "starting: #{name}"
  end

  def end_element(name)
    puts "ending: #{name}"
  end

  def end_document
    puts "should now fire"
  end
end

document = SteamingDocument.new
url = 'http://stackoverflow.com/feeds/question/2833829'

# run the EventMachine reactor, this call will block until
# EventMachine.stop is called
EventMachine.run do
  # Nokogiri wants an IO to read from, so create a pipe that it
  # can read from, and we can write to
  io_read, io_write = IO.pipe

  # run the parser in its own thread so that it can block while
  # reading from the pipe; stop the reactor once parsing finishes
  EventMachine.defer(proc {
    parser = Nokogiri::XML::SAX::Parser.new(document)
    parser.parse_io(io_read)
  }, proc { EventMachine.stop })

  # use em-http to stream the XML document, feeding the pipe with
  # each chunk as it becomes available
  http = EventMachine::HttpRequest.new(url).get
  http.stream { |chunk| io_write << chunk }

  # when the HTTP request is done, close the write end of the pipe
  # so the parser sees EOF and finishes
  http.callback { io_write.close }
end
http://github.com/pauldix/typhoeus
might be worth checking out. It's designed for large and fast parallel downloads and is based on libcurl so it is pretty solid.
That said, test Net::HTTP and see if the performance is acceptable before doing something more complicated.
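For instance, a gzip/deflate-enabled GET with plain Net::HTTP might look roughly like this (a sketch with a placeholder URL; recent Ruby versions can negotiate and inflate gzip automatically, so the manual decoding below may be unnecessary):
require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

uri = URI.parse('http://example.com/feed.xml')
response = Net::HTTP.start(uri.host, uri.port) do |http|
  request = Net::HTTP::Get.new(uri.request_uri, 'Accept-Encoding' => 'gzip, deflate')
  http.request(request)
end

body =
  case response['Content-Encoding']
  when 'gzip'    then Zlib::GzipReader.new(StringIO.new(response.body)).read
  when 'deflate' then Zlib::Inflate.inflate(response.body)
  else                response.body
  end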
The fastest download is probably a #read on the IO object, which slurps the whole thing into a single String. After that you can apply your processing. Or do you need the file processed during the download?

Working with Starling and multiple instances of Mongrel through Mongrel Cluster

Situation:
In a typical cluster setup, I have 5 instances of Mongrel running behind Apache 2.
In one of my initializer files, I schedule a cron task using Rufus::Scheduler which basically sends out a couple of emails.
Problem:
The task runs 5 times, once for each Mongrel instance, and each recipient ends up getting 5 mails (despite the fact that I store logs of each sent mail and check the log before sending). Is it possible that, since all 5 instances run the task at the exact same time, they end up reading the email logs before they are written?
I am looking for a solution that will make the tasks run only once. I also have a Starling daemon up and running which can be utilized.
The rooster rails plugin specifically addresses your issue. It uses rufus-scheduler and ensures the environment is loaded only once.
The way I am doing it right now:
Try to open a file in exclusive locked mode
When lock is acquired, check for messages in Starling
If message exists, other process has already scheduled the job
Set the message again to the queue and exit.
If message is not found, schedule the job, set the message and exit
Here is the code that does it:
starling = MemCache.new("#{Settings[:starling][:host]}:#{Settings[:starling][:port]}")
mutex_filename = "#{RAILS_ROOT}/config/file.lock"
scheduler = Rufus::Scheduler.start_new

# The filelock method, taken from Ruby Cookbook
# This will ensure unblocking of the files
def flock(file, mode)
  success = file.flock(mode)
  if success
    begin
      yield file
    ensure
      file.flock(File::LOCK_UN)
    end
  end
  return success
end

# open_lock method, taken from Ruby Cookbook
# This will create and hold the locks
def open_lock(filename, openmode = "r", lockmode = nil)
  if openmode == 'r' || openmode == 'rb'
    lockmode ||= File::LOCK_SH
  else
    lockmode ||= File::LOCK_EX
  end
  value = nil
  # Kernel's open method, gives an IO object, in our case, a file
  open(filename, openmode) do |f|
    flock(f, lockmode) do
      begin
        value = yield f
      ensure
        f.flock(File::LOCK_UN) # Comment this line out on Windows.
      end
    end
    return value
  end
end

# The actual scheduler
open_lock(mutex_filename, 'r+') do |f|
  puts f.read
  digest_schedule_message = starling.get("digest_scheduler")
  if digest_schedule_message
    puts "Found digest message in Starling. Releasing lock. '#{Time.now}'"
    puts "Message: #{digest_schedule_message.inspect}"
    # Read the message and set it back, so that other processes can read it too
    starling.set "digest_scheduler", digest_schedule_message
  else
    # Schedule job
    puts "Scheduling digest emails now. '#{Time.now}'"
    scheduler.cron("0 9 * * *") do
      puts "Begin sending digests..."
      WeeklyDigest.new.send_digest!
      puts "Done sending digests."
    end
    # Add message in queue
    puts "Done Scheduling. Sending the message to Starling. '#{Time.now}'"
    starling.set "digest_scheduler", :date => Date.today
  end
end

# Sleep will ensure all instances have gone through their wait-acquire lock-schedule (or not) cycle
# This will ensure that on next reboot, starling won't have any stale messages
puts "Waiting to clear digest messages from Starling."
sleep(20)
puts "All digest messages cleared, proceeding with boot."
starling.get("digest_scheduler")
Why don't you use mod_passenger (Phusion)? I moved from Mongrel to Phusion Passenger and it worked perfectly (the switch took less than 5 minutes)!
