Multiple curl requests lead to "Too many open files" error on Windows - ruby-on-rails

I have an external REST API which handles storing data in a "Data Store".
On a file upload, a Ruby library calls this API and passes it the data array, which the external API then stores in the database.
I pass small chunks of the array to the API to limit the POST body length of any single curl call.
The library call looks like this:
def add_data(table_name, table_data)
  url = "#{ExternalAPI::URL}/addData"
  m_curl = Curl::Multi.new
  begin
    chunks = table_data.each_slice(ExternalAPI::BATCH_SIZE).to_a
    chunks.each do |data_chunk|
      data = {
        "tableName" => table_name,
        "data" => data_chunk
      }.to_json
      curl = Curl::Easy.new(url)
      curl.headers = {}
      curl.headers['Content-type'] = 'text/plain'
      curl.timeout = 300
      curl.post_body = data
      m_curl.add(curl)
    end
    m_curl.perform
    true
  rescue Exception => e
    puts "Curl Failed #{e.message}"
    puts "#{e.backtrace}"
    Rails.logger.error "Curl Failed #{e.message}"
    return false
  end
end
This causes a "too many open connections" error in WEBrick in development mode.
I assumed Curl::Multi recycles connections, but I'm not sure whether that happens internally.
I also tried creating a new curl connection inside the loop and closing it at the end of each iteration (I know it's inefficient), but that still led to the same error.
Can anyone shed some light on this?

I think Curl::Multi is going to try to execute all of your connections simultaneously. You probably need to batch them into smaller groups.
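For instance, a minimal sketch of that batching, reusing the constants from the question (the cap of 10 concurrent handles is an arbitrary choice):
require 'curb'
require 'json'
def add_data(table_name, table_data)
  url = "#{ExternalAPI::URL}/addData"
  chunks = table_data.each_slice(ExternalAPI::BATCH_SIZE).to_a
  # Run at most 10 easy handles per multi, so the sockets from one batch
  # are finished and freed before the next batch opens new ones.
  chunks.each_slice(10) do |batch|
    m_curl = Curl::Multi.new
    batch.each do |data_chunk|
      curl = Curl::Easy.new(url)
      curl.headers['Content-type'] = 'text/plain'
      curl.timeout = 300
      curl.post_body = { "tableName" => table_name, "data" => data_chunk }.to_json
      m_curl.add(curl)
    end
    m_curl.perform  # blocks until every request in this batch completes
  end
  true
rescue StandardError => e
  Rails.logger.error "Curl Failed: #{e.message}"
  false
end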

Related

Sending Thousands of Requests at the Same Time with Ruby on Rails?

I need to develop an endpoint in Rails that will send (possibly) hundreds or thousands of requests, process them, then return/render the JSON to the user/client.
I've tried using a thread pool with a size of 5, but it took forever; when I increased the size to the number of requests, it raised ThreadError: can't create Thread: Resource temporarily unavailable.
I don't think I can use a background job/worker for this, because I need to return the result.
So what should I do?
I was thinking I should wrap the process in a 20-second timeout so it doesn't hit Rails' 30-second limit; if it's still not finished after 20 seconds, it returns the unfinished result. It goes like this:
result = Queue.new
begin
  Timeout::timeout(20) do
    elements.each do |element|
      pool.process {
        response = send_request(element)
        result << response
      }
    end
    pool.shutdown
  end
rescue Timeout::Error
  pool.shutdown
end
result = Array.new(elements.size) { result.pop }.flatten
render json: {
  data: result
}
But it's still not working; the process keeps going even after it times out.
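One note on why this can happen: Timeout::timeout only raises in the thread that calls it, so the pool's worker threads keep running after the rescue, and pool.shutdown typically waits for queued work rather than cancelling it. A rough sketch of a deadline-based drain instead, using plain standard-library threads as a stand-in for the pool (elements and send_request are from the question):
deadline = Time.now + 20
results = Queue.new
# Plain threads stand in for the pool; a real pool would cap concurrency.
threads = elements.map do |element|
  Thread.new { results << send_request(element) }
end
collected = []
until collected.size == elements.size || Time.now >= deadline
  begin
    collected << results.pop(true)  # non-blocking pop
  rescue ThreadError                # queue empty: wait briefly and retry
    sleep 0.05
  end
end
threads.each(&:kill)                # stop any stragglers outright
render json: { data: collected.flatten }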

Accessing response status after it is sent

I'm working on a feature to store requests made to specific endpoints in my app.
after_action :record_user_activity
def record_user_activity
  return unless current_user
  return if request.url =~ /assets|packs/
  @session.navigation.create!(
    uri: request.url,
    request_method: request.method,
    response_status: response.code.to_i,
    access_time: Time.now
  )
end
The problem is that, even when we get an error response, response.code at this point (in the after_action) is still a 2xx. I imagine that's because the server hasn't yet hit whatever problem it runs into during the data-access process.
How can I properly store the status code that was actually sent to the user?
The Rails logs already store the requests and responses, so unless there's a specific reason to keep this data in the DB (which would grow quickly with the amount of user activity), you can simply add the user id to your logs. Add this to either config/application.rb or config/environments/production.rb:
MyAppName::Application.configure do
  # ...whatever you already have in configs
  # then add the following line
  config.log_tags = [
    -> request { "user-#{request.cookie_jar.signed[:user_id]}" }
  ]
end
Then you can tail the logs and grep, or write some other process to parse the logs for analytics. There are many tools available for this type of work, but here's a basic example:
tail -f logs/production.log | grep user-101
# this would show all log requests from user with id 101
If you still need this, however, you may want to try prepend_after_action instead; see:
http://api.rubyonrails.org/classes/AbstractController/Callbacks/ClassMethods.html#method-i-prepend_after_action
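A minimal sketch of that change, keeping the callback method from the question unchanged:
# "after" callbacks run in reverse order of registration, so a
# prepended after_action fires after all the others.
prepend_after_action :record_user_activity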

Mechanize - Receiving Errno::EMFILE: Too many open files - socket(2) after a day

I'm running an application that uses mechanize to fetch some data every so often from an RSS feed.
It runs as a Heroku worker, and after a day or so I'm receiving the following error:
Errno::EMFILE: Too many open files - socket(2)
I wasn't able to find a "close" method within Mechanize. Is there anything special I need to do in order to close out my browser sessions?
Here is how I create the browser + read information:
def mechanize_browser
  @mechanize_browser ||= begin
    agent = Mechanize.new
    agent.redirect_ok = true
    agent.request_headers = {
      'Accept-Encoding' => "gzip,deflate,sdch",
      'Accept-Language' => "en-US,en;q=0.8",
    }
    agent
  end
end
And actually fetching information:
response = mechanize_browser.get(url)
And then closing after the response:
def close_mechanize_browser
  @mechanize_browser = nil
end
Thanks in advance!
Since you can't manually close each instance of Mechanize, you can try invoking Mechanize with a block. According to the docs:
After the block executes, the instance is cleaned up. This includes closing all open connections.
So, rather than abstracting Mechanize.new into a custom function, try running Mechanize via the start class method, which should automatically close all your connections upon completion of the request:
Mechanize.start do |m|
  m.get("http://example.com")
end
I ran into this same issue. The Mechanize.start example by @zeantsoi is the answer that I ended up following, but there is also a Mechanize#shutdown method if you want to do this manually without the block.
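If you go the manual route, something like this should work, reusing the memoized @mechanize_browser from the question:
def close_mechanize_browser
  @mechanize_browser.shutdown if @mechanize_browser  # closes persistent connections
  @mechanize_browser = nil
end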
There is also the option of adding a lambda to post_connect_hooks:
Mechanize.new.post_connect_hooks << lambda { |agent, url, response, response_body| agent.shutdown }

First Google Drive API files.list request returning an array of Hashes, after that, subsequent requests returning an array of File Resources. Why?

I'm querying the Google API to list all files in the drive, using the official Google API gem for Ruby. I'm following the example given on the Google developers page: https://developers.google.com/drive/v2/reference/files/list
The first request I make returns, in "items", an array of Ruby Hashes. Subsequent requests return, in "items", an array of either Google::APIClient::Schema::Drive::V2::File or Google::APIClient::Schema::Drive::V2::ParentReference (the reason behind each type also bugs me).
Does anyone know why this happens? The reference page for files.list says nothing about the result type changing.
def self.retrieve_all_files(client)
  drive = client.discovered_api('drive', 'v2')
  result = Array.new
  page_token = nil
  begin
    parameters = {}
    if page_token.to_s != ''
      parameters['pageToken'] = page_token
    end
    api_result = client.execute(
      :api_method => drive.files.list,
      :parameters => parameters)
    if api_result.status == 200
      files = api_result.data
      result.concat(files.items)
      page_token = files.next_page_token
    else
      puts "An error occurred: #{api_result.data['error']['message']}"
      page_token = nil
    end
  end while page_token.to_s != ''
  result
end
EDIT:
I haven't been able to solve the problem yet, but I've managed to understand it better:
When the first request to the API is made, after the user grants authorization, files.list returns an array of Hashes in the "items" attribute of the File resource. Each of these Hashes looks like a File resource, with all of the File's attributes; the difference is just in how they're accessed. For example, the title of the file is accessed as file['title'].
After the first request, all subsequent requests return an array of File resources, accessed as file.title.
FYI, this was a bug in the client lib. Using the latest version should fix it.
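Until you can upgrade, one defensive workaround is a small helper (hypothetical, not part of the client lib) that tolerates both shapes:
# Works whether the item is a plain Hash or a schema object.
def file_title(item)
  item.respond_to?(:title) ? item.title : item['title']
end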

How to test for asynchronous HTTP requests in ruby using EventMachine

I'm getting messages off a RabbitMQ queue, and each message is a URL that I want to make a request to. I'm using the AMQP gem to subscribe to the queue, and since that uses EventMachine, I'm using the em-http-request library to make the HTTP requests. According to the documentation here: https://github.com/igrigorik/em-http-request/wiki/Parallel-Requests
The following will issue asynchronous http-requests:
EventMachine.run {
  http1 = EventMachine::HttpRequest.new('http://google.com/').get
  http2 = EventMachine::HttpRequest.new('http://yahoo.com/').get
  http1.callback { }
  http2.callback { }
}
So when I subscribe to the RabbitMQ queue I have the following code:
x = 0
EventMachine.run do
  connection = AMQP.connect(:host => '127.0.0.1')
  channel = AMQP::Channel.new(connection)
  channel.prefetch(50)
  queue = channel.queue("http.requests")
  exchange = channel.direct("")
  queue.subscribe do |metadata, payload|
    url = payload.inspect
    eval "
      @http#{x} = EventMachine::HttpRequest.new(url).get
      @http#{x}.callback do
        puts \"got a response\"
        puts @http#{x}.response
      end
      x = x + 1
    "
  end
end
This dynamically creates new instance variables and new HTTP requests, similar to the way described in the em-http-request documentation. But is there a way to test whether the requests are actually being made asynchronously? Is it possible to write to the console every time a GET request is fired off, so I can see that they're fired one after the other without waiting for a response?
You can try running tcpdump and analysing the output. If you see the TCP three-way handshakes for the two connections interleaved, then the connections are happening in parallel.
This can't really be part of an automated test, though, if that's what you're aiming for. I would be happy just to verify once that the library does what it says it does, and not make it part of a test suite.
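For example, filtering for TCP SYN packets makes interleaved handshakes easy to spot (adjust the interface and port for your setup):
sudo tcpdump -i any 'tcp[tcpflags] & tcp-syn != 0 and port 80'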
A very simple example, demonstrating exactly what you want:
require 'em-http-request'
EM.run do
  # http://catnap.herokuapp.com/3 delays the HTTP response by 3 seconds.
  http1 = EventMachine::HttpRequest.new('http://catnap.herokuapp.com/3').get
  http1.callback { puts 'callback 1' }
  puts 'fired 1'
  http2 = EventMachine::HttpRequest.new('https://www.google.com/').get
  http2.callback { puts 'callback 2' }
  puts 'fired 2'
end
Output (for me):
fired 1
fired 2
callback 2
callback 1
Depending on your internet connection, Heroku, and Google, the response to the second HTTP request will likely come in first, and you can be sure the requests are indeed being made in parallel.
