I'm trying to set up a Ruby script that opens port 2004 on a server. Calling the server over HTTP at http://<server>:2004/ should then return an HTTP header plus a response body.
The response is read from a file.
This works for small content, but not for something like 50MB.
Somehow it simply breaks.
By the way, I'm testing this script with SoapUI.
Here is the source code; I think it is pretty self-explanatory.
For better reading I marked the response part big.
#!/bin/ruby
require 'socket'
require 'timeout'
require 'date'
server = TCPServer.open 2004
puts "Listening on port 2004"
#file="dump.request"
loop {
Thread.start(server.accept) do |client|
date = Time.now.strftime("%d-%m-%Y_%H-%M-%S")
file = "#{date}_mt_dump.txt"
puts date
puts "Accepting connection"
#client = server.accept
#resp = "OKY|So long and thanks for all the fish!|OKY"
ticket_id = "1235"
partial_data = ""
i = 1024
firstrun = "yes"
fd = File.open(file,'w')
puts "Attempting receive loop"
puts "Ready to transfer contents to the client"
f = File.open("output.txt.gz","r")
puts "Opened file output.txt.gz; size: #{f.size}"
resp = f.read(f.size)
headers = ["HTTP/1.1 200 OK",
"Content-Encoding: gzip",
"Content-Type: text/xml;charset=UTF-8",
"Content-Length: #{f.size}\r\n\r\n"].join("\r\n")
client.puts headers
#puts all_data.join()
fd.close unless fd == nil
puts "Start data transfer"
client.puts resp
client.close
puts "Closed connection"
puts "\n"
end
}
There are a number of issues I see with your code, some conceptual and some technical, but without more information about the error you receive it might be impossible to offer a correct answer.
My initial thought is that the issue is caused by opening the gzipped file without the binary mode flag, so that reading might stop at the first EOF character and newline markers might be converted.
A few technical things to consider:
Your loop is infinite. You should really set up signal traps to allow you to exit the script (catching ^C, for example).
Gzipped files are binary files. You should open the file in binary mode, or use the IO.binread method if you're loading the whole file into memory.
You're loading the whole file into memory before sending it. That's fine for small files, but it isn't the best approach for larger files. Loading 50MB into RAM for each client, while serving 100 clients, means 5GB of RAM...
Considering the first two technical points, I would tweak the code a bit like so:
keep_running = true
# ^C clears the flag and raises SystemExit in the main thread, which is blocked in accept
trap('INT') { keep_running = false; raise ::SystemExit }
begin
  while keep_running
    Thread.start(server.accept) do |client|
      date = Time.now.strftime("%d-%m-%Y_%H-%M-%S")
      file = "#{date}_mt_dump.txt"
      puts date
      puts "Accepting connection"
      #client = server.accept
      #resp = "OKY|So long and thanks for all the fish!|OKY"
      ticket_id = "1235"
      partial_data = ""
      i = 1024
      firstrun = "yes"
      fd = File.open(file, 'wb')
      puts "Attempting receive loop"
      puts "Ready to transfer contents to the client"
      # binary mode: no newline conversion and no early EOF
      f = File.open("output.txt.gz", "rb")
      puts "Opened file output.txt.gz; size: #{f.size}"
      resp = f.read(f.size)
      headers = ["HTTP/1.1 200 OK",
                 "Content-Encoding: gzip",
                 "Content-Type: text/xml;charset=UTF-8",
                 "Content-Length: #{f.size}\r\n\r\n"].join("\r\n")
      f.close
      client.write headers
      #puts all_data.join()
      fd.close unless fd == nil
      puts "Start data transfer"
      # write, not puts: puts may append a newline that Content-Length does not count
      client.write resp
      client.close
      puts "Closed connection"
      puts "\n"
    end
  end
rescue => e
  puts e.message
  puts e.backtrace
rescue SystemExit => e
  puts "exiting... please notice that existing threads will be brutally stopped, as we will not wait for them..."
end
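The tweak above still reads the whole file into memory, which is the third technical point. Here is a minimal sketch of streaming the file instead. The helper name send_gzip_file is hypothetical, and client is assumed to be anything that responds to write, such as the accepted TCPSocket. IO.copy_stream copies the data in chunks, so memory use should stay flat regardless of the file size:

```ruby
# Hypothetical helper (not part of the original script): stream a file to
# the client in chunks instead of reading it all into memory first.
def send_gzip_file(client, path)
  File.open(path, 'rb') do |f|
    headers = ["HTTP/1.1 200 OK",
               "Content-Encoding: gzip",
               "Content-Type: text/xml;charset=UTF-8",
               "Content-Length: #{f.size}",
               "Connection: close"].join("\r\n") + "\r\n\r\n"
    client.write(headers)
    # copy_stream moves the data chunk by chunk and returns the number
    # of bytes copied
    IO.copy_stream(f, client)
  end
end
```

Inside the accept thread this would replace the read/write pair, e.g. `send_gzip_file(client, "output.txt.gz")` followed by `client.close`.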
As to my more general pointers:
Your code is opening a new thread per connection. While this is okay for a small load of concurrent connections, your script might grind to a halt if you have a lot of concurrent connections. The context-switching alone (moving between threads) could potentially create a DoS situation.
I recommend that you use a Reactor pattern, where you have a pool of threads. Another option is to fork a few processes listening to the same TCPSocket.
You don't read the data from the socket and you don't parse the HTTP request - this means that someone could potentially fill up the system buffer, which you never empty, by continuously sending data.
It would be better if you read the information from the socket, or emptied its buffer, as well as disconnected from any malformed or malicious connections.
Also, most browsers aren't too happy when the response comes in before the request...
You don't catch any exceptions nor print any error messages. This means that your script might throw an exception that will break everything apart. For instance, if your 'server' reaches the 'open file limit' for its process, the accept method will throw an exception which will shut down the whole script, including existing connections.
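As a rough illustration of the point about reading the request first, here is a sketch that consumes the request line and headers up to the blank line before any response is written. The helper name read_request_head and the 8KB cap are assumptions, not part of the original script; client can be the accepted socket or any IO-like object that supports gets with a limit:

```ruby
# Hypothetical helper: read the request line and headers up to the blank
# line. Returns the head as a String, or nil if the peer closed early or
# the head grew beyond max_bytes (the caller should then close the socket).
def read_request_head(client, max_bytes: 8 * 1024)
  head = +''
  line_start = true
  # gets with a limit returns at most 1024 bytes per call, so a single
  # endless line cannot exhaust memory
  while (chunk = client.gets("\n", 1024))
    head << chunk
    return nil if head.bytesize > max_bytes
    # a bare newline at the start of a line marks the end of the headers
    return head if line_start && (chunk == "\r\n" || chunk == "\n")
    line_start = chunk.end_with?("\n")
  end
  nil
end
```

In the accept thread this could be used as `head = read_request_head(client)` followed by `client.close` and `next` when it returns nil. It does not read a request body, which is enough for a plain GET.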
I'm not sure why you aren't using one of the many HTTP servers available for Ruby - be it the builtin WEBrick (don't use for production) or one of the native Ruby community gems, such as Iodine.
Here's a short example using Iodine, which has an easy to utilize Http server written in Ruby (no need to compile anything):
require 'iodine/http'
# cache the file, since it's the only response ever sent
file_data = IO.binread "output.txt.gz"
Iodine.on_http do |request, response|
  begin
    # set any headers
    response['content-type'] = 'text/xml;charset=UTF-8'
    response['content-encoding'] = 'gzip'
    response << file_data
    true
  rescue => e
    Iodine.error e
    false
  end
end
#if in irb:
exit
Or, if you insist on writing your own HTTP server, you can at least use one of the available IO reactors, such as Iodine (I wrote it for Plezi), to help you handle the thread pool and IO management (you can also use EventMachine, but I don't like it so much; then again, I'm biased, as I wrote the Iodine library):
require 'iodine'
require 'stringio'
class MiniServer < Iodine::Protocol
  # cache the file, since it's the only data sent,
  # and make it available to all the connections.
  def self.data
    @data ||= IO.binread 'output.txt.gz'
  end
  # The on_open callback is called when a connection is established.
  # We'll use it to initialize the HTTP request's headers Hash.
  def on_open
    @headers = {}
  end
  # the on_message callback is called when data is sent from the client to the socket.
  def on_message input
    input = StringIO.new input
    l = nil
    headers = @headers # easy access
    # loop the lines and parse the HTTP request.
    while (l = input.gets)
      unless l.match(/^[\r]?\n/)
        if l.include? ':'
          l = l.strip.downcase.split(':', 2)
          headers[l[0]] = l[1]
        else
          headers[:method], headers[:query], headers[:version] = l.strip.split(/[\s]+/, 3)
          headers[:request_start] = Time.now
        end
        next
      end
      # keep the connection alive if the HTTP version is 1.1 or if the connection is requested to be kept alive
      keep_alive = (headers['connection'].to_s.match(/keep/i) || headers[:version].match(/1\.1/)) && true
      # refuse any file uploads or forms. make sure the request is a GET request
      return close if headers['content-length'] || headers['content-type'] || headers[:method].to_s.match(/get/i).nil?
      # all is well, send the file.
      write ["HTTP/1.1 200 OK",
             "Connection: #{keep_alive ? 'keep-alive' : 'close'}",
             "Content-Encoding: gzip",
             "Content-Type: text/xml;charset=UTF-8",
             "Content-Length: #{self.class.data.bytesize}\r\n\r\n"].join("\r\n")
      write self.class.data
      return close unless keep_alive
      # reset the headers, in case another request comes in
      headers.clear
    end
  end
end
Iodine.protocol = MiniServer
# # if running within a larger application, consider:
# Iodine.force_start!
# # Server starts automatically when the script ends.
# # on irb, use `exit`:
exit
Good Luck!
I have a working system to produce errors and send them to be used by Active Admin.
For example in Active admin, for a specific page of our CMS, the page might execute :
url_must_be_accessible("http://www.exmaple.com", field_url_partner, "URL for the partner")
And this uses the code below to send different types of error notifications to the Active Admin editor:
module UrlHttpResponseHelper
def url_must_be_accessible(url, target_field, field_name)
if url
url_response_code = get_url_http_response(url).code.to_i
case url_response_code
when -1
# DNS issue; website does not exist;
errors.add(target_field,
"#{field_name}: DNS Problem -> #{url} website does not exist")
when 200
return
when 304
return
else
errors.add(target_field,
"#{field_name}: #{url} sends #{url_response_code} response code")
end
end
end
def get_url_http_response(url)
uri = URI.parse(URI.encode(url))
request = Net::HTTP.get_response(uri)
return request
rescue Errno::ECONNREFUSED, SocketError => e
OpenStruct.new(code: -1)
end
end
This worked great locally! But in production we are on Heroku, and when a request goes beyond 30 seconds (try this link, for instance: "http://www.exmaple.com"), the app crashes with an "H12 error".
I'd like to add two things to the code above:
- timeouts: I think I need both read_timeout and open_timeout, and read_timeout + open_timeout should be less than 30 seconds; to leave a safety margin, better less than 25 seconds. If the request is still "going" after 25 seconds, give up and drop it so that it never reaches 30 seconds.
- catch this "we dropped the request intentionally because of the risk of a timeout" case and send a notification to the user. I'd like to use my current system with something along the lines of:
rescue Errno::ECONNREFUSED, SocketError => e
OpenStruct.new(code: -7) # = some random number
end
case url_response_code
when -7
errors.add(target_field,
"#{field_name}: We tried to reach #{url} but this takes too long and risks crashing the app. please check the url and try again.")
How can I create a code like -1, but a different one, to rescue this "timeout"/"drop the request attempt" that I enforce myself?
I tried, but nothing works. I can't manage to write the code that catches and drops the request once it reaches 25 seconds...
That's not a very beautiful solution (see: https://medium.com/@adamhooper/in-ruby-dont-use-timeout-77d9d4e5a001), but I believe you can still use it here, because only one thing happens inside the block, unlike the example in the link, where multiple actions could cause non-obvious behavior:
def get_url_http_response(url)
uri = URI.parse(URI.encode(url))
request = Timeout.timeout(25) { Net::HTTP.get_response(uri) }
return request
rescue Errno::ECONNREFUSED, SocketError => e
OpenStruct.new(code: -1)
rescue Timeout::Error
# return here anything you want
end
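For comparison, here is a sketch that relies on Net::HTTP's own open_timeout and read_timeout instead of Timeout.timeout. The method name get_url_with_timeouts, the FailedResponse struct (a stand-in for the OpenStruct above) and the 5 + 20 second split are assumptions chosen to stay under the 25 second budget. Note that read_timeout applies to each blocking read, so a server that trickles data slowly can still exceed the total:

```ruby
require 'net/http'
require 'uri'

# Stand-in for the OpenStruct used above: anything responding to #code works.
FailedResponse = Struct.new(:code)

# Hypothetical variant of get_url_http_response with explicit timeouts.
def get_url_with_timeouts(url, open_timeout: 5, read_timeout: 20)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http.open_timeout = open_timeout  # seconds allowed to establish the connection
  http.read_timeout = read_timeout  # seconds allowed for each blocking read
  http.max_retries = 0              # Ruby 2.5+: do not retry a timed-out GET
  http.start { |conn| conn.get(uri.request_uri) }
rescue Errno::ECONNREFUSED, SocketError
  FailedResponse.new(-1)
rescue Net::OpenTimeout, Net::ReadTimeout
  FailedResponse.new(-7)
end
```

With this in place, `when -7` in url_must_be_accessible can add the "takes too long" error. The max_retries line matters because Net::HTTP may otherwise retry an idempotent request once after a read timeout, which would double the wait.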
I have a very large XML from a remote server which I have to parse and get the data.
I have tried to open the file using the open() function but it is taking more than 15 minutes and still no response.
Then I tried Nokogiri::XML(open(URL)) where URL is the link which contains the data to parse.
Also, I have tried using Net::HTTP::Get but again with no fruitful results.
Can anyone suggest which gem and function can be used to parse the data?
As mentioned before, Nokogiri::XML::Reader is your friend here. The example in the documentation works fine if you have the file locally.
It is also possible to parse the data as soon as it comes in, fully streaming. This involves getting the data in chunks (e.g. using Net::HTTP) and connecting it to the Nokogiri::XML::Reader by means of an IO.pipe.
Example (adapted from this gist):
require 'nokogiri'
require 'net/http'
# setup request
uri = URI("http://example.com/articles.xml")
req = Net::HTTP::Get.new(uri.request_uri)
# read response in a separate thread using a pipe to communicate
rd, wr = IO.pipe
reader_thread = Thread.new do
Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
http.request(req) do |response|
response.read_body {|chunk| wr.write(chunk) }
end
wr.close
end
end
# parse the incoming data chunk by chunk
reader = Nokogiri::XML::Reader(rd)
reader.each do |node|
next if node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
next if node.name != "article"
# now that we have the desired fragment, put it to use
doc = Nokogiri::XML(node.outer_xml)
puts("Got #{doc.text}")
end
rd.close
# let the reader thread finish cleanly
reader_thread.join
If you are working with large XML files, you can use the Nokogiri::XML::Reader class. I have successfully opened 1 GB files without any problems. For optimal performance, you could download the file first and then parse it locally on your server with the XML::Reader class.
The usage is something like this (replace XML_FILE with your path):
Nokogiri::XML::Reader(File.open(XML_FILE)).each do |node|
if node.name == 'Node' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
puts node.outer_xml # you can do something like this also Nokogiri::XML(node.outer_xml).at('./Node')
end
end
Here is the documentation: http://www.rubydoc.info/github/sparklemotion/nokogiri/master/Nokogiri/XML/Reader
Hope it helps
@server = TCPServer.new('', 2001)
Thread.new do
  loop do
    puts 'loop iteration'
    Thread.start(@server.accept) do |client|
      puts 'in thread'
      client.puts Time.now
      puts client.read
      client.close #This is the line I don't like
    end
  end
end
How to prevent connection closing and accept data from client without reconnects?
If I comment client.close out the server executes contents in thread only once, immediately after client connects and that's all.
Can you also post your client code?
I suppose this is happening because you read only one message from the client, up to the end, with client.read, and then don't do any other reads on this connection.
If you close it, your client might just be reconnecting, and then a new TCPSocket is created by the TCPServer.
If you are not expecting the client to send any more messages immediately, it's perfectly fine to close the connection.
This is what an accept socket (TCPServer) is for. It allows you to save resources by not having a running Thread polling for incoming messages.
If you do expect the client to send more messages in relatively short periods, you should also have a loop in your Thread which listens on the TCPSocket.
Thread.start(@server.accept) do |client|
  puts 'in thread'
  client.puts Time.now
  loop do
    puts client.gets #or client.read
  end
end
See also ruby-tcp-chat Server#listen_user_messages
But then I guess you will need some timeout mechanism which closes the connection if there have been no messages for some period of time; otherwise you end up with a lot of permanently running threads.
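One possible shape for such a timeout mechanism is sketched below, using IO.select to wait for data with a deadline. The helper name handle_client and the 30 second default are assumptions. The sketch also assumes the client sends newline-terminated messages, since gets will block on a partial line:

```ruby
require 'socket'

# Hypothetical helper: yield each line sent by the client until it
# disconnects (:eof) or stays silent for idle_timeout seconds
# (:idle_timeout). The socket is closed in both cases.
def handle_client(client, idle_timeout: 30)
  loop do
    # IO.select returns nil when nothing became readable in time
    ready = IO.select([client], nil, nil, idle_timeout)
    return :idle_timeout if ready.nil?

    line = client.gets # nil means the peer closed the connection
    return :eof if line.nil?

    yield line.chomp if block_given?
  end
ensure
  client.close unless client.closed?
end
```

In the server above, the thread body could then become `handle_client(client) { |msg| puts msg }`, so idle threads end on their own.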
Background: We've built a chat feature in to one of our existing Rails applications. We're using the new ActionController::Live module and running Puma (with Nginx in production), and subscribing to messages through Redis. We're using EventSource client side to establish the connection asynchronously.
Problem Summary: Threads are never dying when the connection is terminated.
For example, should the user navigate away, close the browser, or even go to a different page within the application, a new thread is spawned (as expected), but the old one continues to live.
The problem as I presently see it is that when any of these situations occur, the server has no way of knowing whether the connection on the browser's end is terminated, until something attempts to write to this broken stream, which would never happen once the browser has moved away from the original page.
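The "only a write reveals the broken connection" behaviour can be reproduced outside Rails. In this sketch a plain IO object stands in for the response stream, and the helper name peer_alive? is hypothetical. On a plain IO the failure surfaces as Errno::EPIPE rather than the IOError raised by the Rails stream, so that is what gets rescued. The probe is an SSE comment line (a line starting with a colon), which EventSource clients ignore, so no client-side handling is needed for it:

```ruby
# Hypothetical probe: returns true if the write went through, false if the
# other end of the connection is gone.
def peer_alive?(io)
  io.write(": ping\n\n") # SSE comment line, ignored by EventSource clients
  io.flush
  true
rescue Errno::EPIPE, Errno::ECONNRESET, IOError
  false
end
```

On a real TCP socket the first write after the peer has closed may still succeed, and it is typically a later write that fails, so a single probe is not proof that the client is still there.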
This problem seems to be documented on github, and similar questions are asked on StackOverflow here (pretty well exact same question) and here (regarding getting number of active threads).
The only solution I've been able to come up with, based on these posts, is to implement a type of thread / connection poker. Attempting to write to a broken connection generates an IOError which I can catch and properly close the connection, allowing the thread to die. This is the controller code for that solution:
def events
response.headers["Content-Type"] = "text/event-stream"
stream_error = false; # used by flusher thread to determine when to stop
redis = Redis.new
# Subscribe to our events
redis.subscribe("message.create", "message.user_list_update") do |on|
on.message do |event, data| # when message is received, write to stream
response.stream.write("messageType: '#{event}', data: #{data}\n\n")
end
# This is the monitor / connection poker thread
# Periodically poke the connection by attempting to write to the stream
flusher_thread = Thread.new do
while !stream_error
$redis.publish "message.create", "flusher_test"
sleep 2.seconds
end
end
end
rescue IOError
logger.info "Stream closed"
stream_error = true;
ensure
logger.info "Events action is quitting redis and closing stream!"
redis.quit
response.stream.close
end
(Note: the events method seems to get blocked on the subscribe method invocation. Everything else (the streaming) works properly so I assume this is normal.)
(Other note: the flusher thread concept makes more sense as a single long-running background process, a bit like a garbage thread collector. The problem with my implementation above is that a new thread is spawned for each connection, which is pointless. Anyone attempting to implement this concept should do it more like a single process, not so much as I've outlined. I'll update this post when I successfully re-implement this as a single background process.)
The downside of this solution is that we've only delayed or lessened the problem, not completely solved it. We still have 2 threads per user, in addition to other requests such as ajax, which seems terrible from a scaling perspective; it seems completely unattainable and impractical for a larger system with many possible concurrent connections.
I feel like I am missing something vital; I find it somewhat difficult to believe that Rails has a feature that is so obviously broken without implementing a custom connection-checker like I have done.
Question: How do we allow the connections / threads to die without implementing something corny such as a 'connection poker', or garbage thread collector?
As always let me know if I've left anything out.
Update
Just to add a bit of extra info: Huetsch over at github posted this comment pointing out that SSE is based on TCP, which normally sends a FIN packet when the connection is closed, letting the other end (server in this case) know that it's safe to close the connection. Huetsch points out that either the browser is not sending that packet (perhaps a bug in the EventSource library?), or Rails is not catching it or doing anything with it (definitely a bug in Rails, if that's the case). The search continues...
Another Update
Using Wireshark, I can indeed see FIN packets being sent. Admittedly, I am not very knowledgeable or experienced with protocol level stuff, however from what I can tell, I definitely detect a FIN packet being sent from the browser when I establish the SSE connection using EventSource from the browser, and NO packet sent if I remove that connection (meaning no SSE). Though I'm not terribly up on my TCP knowledge, this seems to indicate to me that the connection is indeed being properly terminated by the client; perhaps this indicates a bug in Puma or Rails.
Yet another update
@JamesBoutcher / boutcheratwest(github) pointed me to a discussion on the redis website regarding this issue, specifically in regards to the fact that the .(p)subscribe method never shuts down. The poster on that site pointed out the same thing that we've discovered here, that the Rails environment is never notified when the client-side connection is closed, and therefore is unable to execute the .(p)unsubscribe method. He inquires about a timeout for the .(p)subscribe method, which I think would work as well, though I'm not sure which method (the connection poker I've described above, or his timeout suggestion) would be a better solution. Ideally, for the connection poker solution, I'd like to find a way to determine whether the connection is closed on the other end without writing to the stream. As it is right now, as you can see, I have to implement client-side code to handle my "poking" message separately, which I believe is obtrusive and goofy as heck.
A solution I just put together (borrowing a lot from @teeg) which seems to work okay (haven't failure tested it, though):
config/initializers/redis.rb
$redis = Redis.new(:host => "xxxx.com", :port => 6379)
heartbeat_thread = Thread.new do
while true
$redis.publish("heartbeat","thump")
sleep 30.seconds
end
end
at_exit do
# not sure this is needed, but just in case
heartbeat_thread.kill
$redis.quit
end
And then in my controller:
def events
response.headers["Content-Type"] = "text/event-stream"
redis = Redis.new(:host => "xxxxxxx.com", :port => 6379)
logger.info "New stream starting, connecting to redis"
redis.subscribe(['parse.new','heartbeat']) do |on|
on.message do |event, data|
if event == 'parse.new'
response.stream.write("event: parse\ndata: #{data}\n\n")
elsif event == 'heartbeat'
response.stream.write("event: heartbeat\ndata: heartbeat\n\n")
end
end
end
rescue IOError
logger.info "Stream closed"
ensure
logger.info "Stopping stream thread"
redis.quit
response.stream.close
end
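The strings written to the stream above follow the text/event-stream wire format: optional "event:" and "id:" fields, one "data:" line per payload line, and a blank line to end the frame. A small sketch of a formatter, with the helper name sse_frame being an assumption rather than anything Rails provides:

```ruby
# Hypothetical helper: format one Server-Sent Events frame.
def sse_frame(data, event: nil, id: nil)
  lines = []
  lines << "event: #{event}" if event
  lines << "id: #{id}" if id
  # every line of the payload needs its own "data:" prefix
  payload = data.to_s.split("\n", -1)
  payload = [''] if payload.empty?
  payload.each { |l| lines << "data: #{l}" }
  # a blank line terminates the frame
  lines.join("\n") + "\n\n"
end
```

In the controller this could be used as `response.stream.write(sse_frame(data, event: 'parse'))`, which also keeps multi-line payloads from breaking the frame.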
I'm currently making an app that revolves around ActionController::Live, EventSource and Puma. For those who are encountering problems closing streams and such: instead of rescuing an IOError, in Rails 4.2 you need to rescue ClientDisconnected. Example:
def stream
#Begin is not required
twitter_client = Twitter::Streaming::Client.new(config_params) do |obj|
# Do something
end
rescue ClientDisconnected
# Do something when disconnected
ensure
# Do something else to ensure the stream is closed
end
I found this handy tip from this forum post (all the way at the bottom): http://railscasts.com/episodes/401-actioncontroller-live?view=comments
Here's a potentially simpler solution which does not use a heartbeat. After much research and experimentation, here's the code I'm using with sinatra + sinatra sse gem (which should be easily adapted to Rails 4):
class EventServer < Sinatra::Base
include Sinatra::SSE
set :connections, []
.
.
.
get '/channel/:channel' do
.
.
.
sse_stream do |out|
settings.connections << out
out.callback {
puts 'Client disconnected from sse';
settings.connections.delete(out);
}
redis.subscribe(channel) do |on|
on.subscribe do |channel, subscriptions|
puts "Subscribed to redis ##{channel}\n"
end
on.message do |channel, message|
puts "Message from redis ##{channel}: #{message}\n"
message = JSON.parse(message)
.
.
.
if settings.connections.include?(out)
out.push(message)
else
puts 'closing orphaned redis connection'
redis.unsubscribe
end
end
end
end
end
While subscribed, the redis connection blocks in on.message and only accepts (p)subscribe/(p)unsubscribe commands. Once you unsubscribe, the redis connection is no longer blocked and can be released by the web server object which was instantiated by the initial sse request. The cleanup happens automatically the next time a message arrives on redis and the sse connection to the browser is no longer in the connections array.
Building on @James Boutcher's answer, I used the following in clustered Puma with 2 workers, so that only 1 thread is created for the heartbeat from config/initializers/redis.rb:
config/puma.rb
on_worker_boot do |index|
puts "worker nb #{index.to_s} booting"
create_heartbeat if index.to_i==0
end
def create_heartbeat
puts "creating heartbeat"
$redis||=Redis.new
heartbeat = Thread.new do
ActiveRecord::Base.connection_pool.release_connection
begin
while true
hash={event: "heartbeat",data: "heartbeat"}
$redis.publish("heartbeat",hash.to_json)
sleep 20.seconds
end
ensure
#no db connection anyway
end
end
end
Here is a solution with a timeout that exits the blocking Redis (p)subscribe call and kills the unused connection thread.
class Stream::FixedController < StreamController
  def events
    # Rails reserves a db connection from the connection pool for
    # each request; let's put it back into the connection pool.
    ActiveRecord::Base.clear_active_connections!
    # Time of the last (non-heartbeat) activity on the stream:
    # either the last message sent from server to client
    # or the time the connection was set up
    @last_active = Time.zone.now
    # Redis (p)subscribe is a blocking call, so we need a trick
    # to prevent it from freezing the request forever.
    redis.psubscribe("messages:*", 'heartbeat') do |on|
      on.pmessage do |pattern, event, data|
        # capture heartbeat from Redis pub/sub
        if event == 'heartbeat'
          # calculate idle time (in seconds) for this stream connection
          idle_time = (Time.zone.now - @last_active).to_i
          # Now we need to release the Redis (p)subscribe channel
          # so that any exception (like connection closed) can propagate
          if idle_time > 4.minutes
            # unsubscribe from Redis because the idle time was too long
            # that's all - fix in (almost) one line :)
            redis.punsubscribe
          end
        else
          # save time of this (last) activity
          @last_active = Time.zone.now
        end
        # write to stream - even heartbeat - it's sometimes a chance to
        # capture a disconnection error before idle_time
        response.stream.write("event: #{event}\ndata: #{data}\n\n")
      end
    end
    # blocking ends here (no chance to get below this line without unsubscribe)
  rescue IOError
    Logs::Stream.info "Stream closed"
  rescue ClientDisconnected
    Logs::Stream.info "ClientDisconnected"
  rescue ActionController::Live::ClientDisconnected
    Logs::Stream.info "Live::ClientDisconnected"
  ensure
    Logs::Stream.info "Stream ensure close"
    redis.quit
    response.stream.close
  end
end
You have to use redis.(p)unsubscribe to end this blocking call. No exception can break out of it.
My simple app with information about this fix: https://github.com/piotr-kedziak/redis-subscribe-stream-puma-fix
Instead of sending a heartbeat to all the clients, it might be easier to just set a watchdog for each connection. [Thanks to @NeilJewers]
class Stream::FixedController < StreamController
def events
# Rails reserve a db connection from connection pool for
# each request, lets put it back into connection pool.
ActiveRecord::Base.clear_active_connections!
redis = Redis.new
watchdog = Doberman::WatchDog.new(:timeout => 20.seconds)
watchdog.start
# Redis (p)subscribe is blocking request so we need do some trick
# to prevent it freeze request forever.
redis.psubscribe("messages:*") do |on|
on.pmessage do |pattern, event, data|
begin
# write to stream - it's sometimes a chance to capture a disconnection error
response.stream.write("event: #{event}\ndata: #{data}\n\n")
watchdog.ping
rescue Doberman::WatchDog::Timeout => e
raise ClientDisconnected if response.stream.closed?
watchdog.ping
end
end
end
rescue IOError
rescue ClientDisconnected
ensure
response.stream.close
redis.quit
watchdog.stop
end
end
If you can tolerate a small chance of missing a message you can use subscribe_with_timeout:
sse = SSE.new(response.stream)
sse.write("hi", event: "hello")
redis = Redis.new(reconnect_attempts: 0)
loop do
begin
redis.subscribe_with_timeout(5 * 60, 'mycoolchannel') do |on|
on.message do |channel, message|
sse.write(message, event: 'message_posted')
end
end
rescue Redis::TimeoutError
sse.write("ping", event: "ping")
end
end
This code subscribes to a Redis channel, waits for 5 minutes, then closes connection to Redis and subscribes again.
I'm creating an API service which allows people to provide the URL of an image in the API call, and the service downloads the image to process it.
How do I ensure somebody does NOT give me the URL of, like, a 5MB image? Is there a way to limit the request?
This is what I have so far, which basically grabs everything.
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) { |http|
http.request(req)
}
Thanks,
Conrad
cwninja unfortunately gave you an answer that will only work for accidental attacks. An intelligent attacker will have no trouble at all defeating that check. There are two main reasons his method should not be used. First, nothing guarantees that the information in a HEAD response will match the corresponding GET response. A properly behaving server certainly will do this, but a malicious actor does not have to follow the spec. The attacker could simply send a HEAD response that says it has a Content-Length that's less than your threshold, but then hand you a huge file in the GET response. But that doesn't even cover the potential for a server to send back a response with the Transfer-Encoding: chunked header set. A chunked response could quite possibly never end. A few people pointing your server at never-ending responses could carry out a trivial resource-exhaustion attack, even if your HTTP client enforces a timeout.
To do this correctly, you need to use an HTTP library that allows you to count the bytes as they're received, and abort if it crosses the threshold. I would probably recommend Curb for this rather than Net::HTTP. (Can you even do this at all with Net::HTTP?) If you use the on_body and/or on_progress callbacks, you can count the incoming bytes and abort mid-response if you receive a file that's too large. Obviously, as cwninja already pointed out, if you receive a Content-Length header larger than your threshold, you want to abort for that too. Curb is also notably faster than Net::HTTP.
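The byte-counting idea itself does not depend on a particular HTTP client. Here is a minimal sketch over any object that responds to read, such as a socket or a StringIO. The names read_capped and TooLargeError are hypothetical. With an HTTP library the same check would sit inside its per-chunk body callback:

```ruby
# Hypothetical error raised when the body crosses the threshold.
class TooLargeError < StandardError; end

# Read io to the end in chunks, aborting as soon as more than max_bytes
# have arrived, so an oversized body is never fully buffered.
def read_capped(io, max_bytes:, chunk_size: 16 * 1024)
  body = String.new # binary buffer
  while (chunk = io.read(chunk_size)) # read returns nil at EOF
    if body.bytesize + chunk.bytesize > max_bytes
      raise TooLargeError, "body exceeds #{max_bytes} bytes"
    end
    body << chunk
  end
  body
end
```

Because the count is taken on the bytes actually received, a missing or misleading Content-Length header and a chunked response are all handled the same way. A timeout is still needed separately for servers that send data very slowly.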
Try running this first:
Net::HTTP.start(url.host, url.port) { |http|
response = http.request_head(url.path)
raise "File too big." if response['content-length'].to_i > 5*1024*1024
}
You still have a race condition (someone could swap out the file after you do the HEAD request), but in the simple case this asks the server for the headers it would send back from a GET request.
Another way to limit the download size (full code should also check the response status, handle exceptions, etc.; this is just an example):
Net::HTTP.start(uri.host, uri.port) do |http|
request = Net::HTTP::Get.new uri.request_uri
http.request request do |response|
# check response codes here
body=''
response.read_body do |chunk|
body += chunk
break if body.size > MY_SAFE_SIZE_LIMIT
end
break
end
end
Combining the other two answers, I'd like to 1) check the size header, 2) watch the size of chunks, while also 3) supporting https and 4) aggressively enforcing a timeout. Here's a helper I came up with:
require "net/http"
require 'uri'
module FetchUtil
# Fetch a URL, with a given max bytes, and a given timeout
def self.fetch_url url, timeout_sec=5, max_bytes=5*1024*1024
uri = URI.parse(url)
t0 = Time.now.to_f
body = ''
Net::HTTP.start(uri.host, uri.port,
:use_ssl => (uri.scheme == 'https'),
:open_timeout => timeout_sec,
:read_timeout => timeout_sec) { |http|
# First make a HEAD request and check the content-length
check_res = http.request_head(uri.path)
raise "File too big" if check_res['content-length'].to_i > max_bytes
# Then fetch in chunks and bail on either timeout or max_bytes
# (Note: timeout won't work unless bytes are streaming in...)
http.request_get(uri.path) do |res|
res.read_body do |chunk|
raise "Timeout error" if (Time.now().to_f-t0 > timeout_sec)
raise "Filesize exceeded" if (body.length+chunk.length > max_bytes)
body += chunk
end
end
}
return body
end
end