I'm using Net::HTTP in Ruby to crawl a URL.
I don't want to crawl streaming audio such as: http://listen2.openstream.co/334
In fact I only want to crawl HTML content, so no PDFs, video, txt, etc.
Right now, I have both open_timeout and read_timeout set to 10, so even if I do crawl these streaming audio pages they will time out.
url = 'http://listen2.openstream.co/334'
uri = Addressable::URI.parse(url)
path = uri.path

req = Net::HTTP::Get.new(path, {'Accept' => '*/*', 'Content-Type' => 'text/plain; charset=utf-8', 'Connection' => 'keep-alive', 'Accept-Encoding' => 'Identity'})

resp = Net::HTTP.start(uri.host, uri.inferred_port) do |http|
  http.open_timeout = 10
  http.read_timeout = 10
  # how can I read the headers here, before the body is streamed, and bail out because the content type is audio?
  http.request(req)
end
However, is there a way to check the headers BEFORE I read the body of an HTTP response, to see whether it's audio? I want to do so without sending a separate HEAD request.
net/http supports streaming; you can use this to read the headers before the body.
Code example:
url = URI('http://stackoverflow.com/questions/41306082/ruby-nethttp-read-the-header-before-the-body-without-head-request')

Net::HTTP.start(url.host, url.port) do |http|
  request = Net::HTTP::Get.new(url)
  http.request(request) do |response|
    # check headers here; the body has not been read yet
    # then call read_body (or just body) to read the body
    if true # replace with your header check
      response.read_body do |chunk|
        # process body chunks here
      end
    end
  end
end
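For the audio-skipping case in the question, the placeholder check could be filled in roughly like this (a sketch; handle_chunk is a hypothetical handler, not part of the question's code):
content_type = response['Content-Type'].to_s # header lookup is case-insensitive in Net::HTTP
if content_type.start_with?('text/html')
  response.read_body { |chunk| handle_chunk(chunk) }
end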
I will add a Ruby example later tonight, but for a quick answer, there is a simple trick to do this.
You can use the HTTP Range header to indicate which range of bytes you want to receive from the server. Here is an example:
curl -XGET http://www.sample-videos.com/audio/mp3/crowd-cheering.mp3 -v -H "Range: bytes=0-1"
In the above example, the server will return only the data in the 0 to 1 byte range.
See: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
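In Ruby, the same trick might look like this with Net::HTTP (a sketch; whether the range is honored depends on the server):
require 'net/http'
require 'uri'

uri = URI('http://www.sample-videos.com/audio/mp3/crowd-cheering.mp3')

req = Net::HTTP::Get.new(uri)
req['Range'] = 'bytes=0-1' # ask for only the first two bytes

res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
puts res['Content-Type'] # e.g. "audio/mpeg"
puts res.code            # "206" if the server honored the range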
Since I did not find a way to properly do this in Net::HTTP, and I saw that you're using the addressable gem as an external dependency already, here's a solution using the wonderful http gem:
require 'http'
response = HTTP.get('http://listen2.openstream.co/334')
# Here are the headers
puts response.headers
# Everything ok? Start streaming the response
body = response.body
body.stream!
# now just call `readpartial` on the body until it returns `nil`
# or some other break condition is met
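# For example (a sketch; handle_chunk stands in for whatever you do with each chunk):
while (chunk = body.readpartial)
  handle_chunk(chunk)
end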
Sorry if you're required to use Net::HTTP, hopefully someone else will find an answer. A separate HEAD request might indeed be the way to go in that case.
You can do a whole host of network-related things without using a gem. Just use the net/http module from the standard library.
require 'net/http'
url = URI 'http://listen2.openstream.co/334'
Net::HTTP.start(url.host, url.port){ |conn|
  conn.request_get(url){ |resp|
    resp.each{ |k_header, v_header|
      # process headers
      puts "#{k_header}: #{v_header}"
    }
    #
    # resp.read_body{ |body_chunk|
    #   # process body
    # }
  }
}
Note: while processing the headers, just make sure to check the Content-Type header. For audio-related content it would typically contain a value like audio/mpeg.
Hope it helps.
Related
I am using the Trello API to attach an image to a card. The documentation says:
require 'uri'
require 'net/http'
url = URI("https://api.trello.com/1/cards/id/attachments")
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE
request = Net::HTTP::Post.new(url)
response = http.request(request)
puts response.read_body
After putting in my key and my token, I tried to upload a file, and the binary data goes in the URL itself. Not only does that seem ugly, it also doesn't work because the request ends up far too long. I've tried using the multipart and rest-client gems in my code to upload and attach a file to a Trello card, but every time I get errors like bad request or SSL errors. Can anyone please give me a piece of code that really works? Thanks.
Actually I am sending the image data via AJAX (I'm generating it from a Chart.js view), so the data sent is binary; it would be better if the solution uploaded an image from binary data.
Their documentation does indeed encourage you to add the whole encoded file object into the URL, which I also find ugly. I wonder if it will work to add it into the POST body instead? Try this:
request = Net::HTTP::Post.new(url)
request.set_form_data({file: put_encoded_file_contents_here})
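If a plain form field isn't accepted, Net::HTTP can also build a real multipart/form-data body via set_form; here's a sketch (the key/token query parameters and the file path are placeholders, and I haven't verified this against Trello's API):
require 'net/http'
require 'uri'

url = URI('https://api.trello.com/1/cards/id/attachments?key=YOUR_KEY&token=YOUR_TOKEN')

http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true

request = Net::HTTP::Post.new(url)
# set_form streams the file as a proper multipart body instead of stuffing it into the URL
request.set_form([['file', File.open('chart.png')]], 'multipart/form-data')

response = http.request(request)
puts response.body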
I have a very large XML file on a remote server which I have to parse in order to get the data.
I have tried to open the file using the open() function but it is taking more than 15 minutes and still no response.
Then I tried Nokogiri::XML(open(URL)) where URL is the link which contains the data to parse.
Also, I have tried using Net::HTTP::Get but again with no fruitful results.
Can anyone suggest which gem and function can be used to parse the data?
As mentioned before, Nokogiri::XML::Reader is your friend here. The example in the documentation works fine if you have the file locally.
It is also possible to parse the data as soon as it comes in, fully streaming. This involves getting the data in chunks (e.g. using Net::HTTP) and connecting it to the Nokogiri::XML::Reader by means of an IO.pipe.
Example (adapted from this gist):
require 'nokogiri'
require 'net/http'

# setup request
uri = URI("http://example.com/articles.xml")
req = Net::HTTP::Get.new(uri.request_uri)

# read the response in a separate thread, using a pipe to communicate
rd, wr = IO.pipe
reader_thread = Thread.new do
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(req) do |response|
      response.read_body { |chunk| wr.write(chunk) }
    end
    wr.close
  end
end

# parse the incoming data chunk by chunk
reader = Nokogiri::XML::Reader(rd)
reader.each do |node|
  next if node.node_type != Nokogiri::XML::Reader::TYPE_ELEMENT
  next if node.name != "article"

  # now that we have the desired fragment, put it to use
  doc = Nokogiri::XML(node.outer_xml)
  puts("Got #{doc.text}")
end
rd.close

# let the reader thread finish cleanly
reader_thread.join
If you are working with large XML files then you can use the Nokogiri::XML::Reader class. I have successfully opened 1 GB files without any problems. For optimal performance you could download the file first and then parse it with XML::Reader locally on your server.
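That download step might look something like this (a sketch; the URL and local filename are placeholders, and URI.open comes from open-uri):
require 'open-uri'

# stream the remote file to disk, then parse it locally with Nokogiri::XML::Reader
File.open('articles.xml', 'wb') do |local|
  URI.open('http://example.com/articles.xml') { |remote| IO.copy_stream(remote, local) }
end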
The usage is something like this (replace XML_FILE with your path):
Nokogiri::XML::Reader(File.open(XML_FILE)).each do |node|
  if node.name == 'Node' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    puts node.outer_xml # or, for example, Nokogiri::XML(node.outer_xml).at('./Node')
  end
end
Here is the documentation: http://www.rubydoc.info/github/sparklemotion/nokogiri/master/Nokogiri/XML/Reader
Hope it helps.
If I load a section of my website with Net::HTTP in Rails, will this get loaded every time or will it get cached along with the rest of the footer?
EDIT: I mean the rest of the footer is currently cached. Would the Net::HTTP results, which get rendered inside the footer, also become cached? I would like it to reload the results every time.
No, Net::HTTP will not cache anything for you. You will have to implement caching yourself, or use a gem that does it for you. Depending on what you do in Rails, though, Rails can handle it; look into fragment caching.
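For example, in a Rails view the static part of the footer can sit inside a cache block while the part that renders the Net::HTTP results stays outside it (a sketch; the cache key and partial names are placeholders):
<% cache 'footer-static' do %>
  <%= render 'shared/footer_links' %>
<% end %>

<%# rendered on every request, so the Net::HTTP call is made fresh each time %>
<%= render 'shared/footer_remote_status' %>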
Doesn't look like it does, at least not by default as of 2011. There's also a segment in the net/http.rb file in the Ruby source that has the following code commented out:
# The following example performs a conditional GET using the
# If-Modified-Since header. If the file has not been modified since the
# time in the header, a Not Modified response will be returned. See RFC 2616
# section 9.3 for further details.
#
uri = URI('http://example.com/cached_response')
file = File.stat 'cached_response'
req = Net::HTTP::Get.new(uri)
req['If-Modified-Since'] = file.mtime.rfc2822
res = Net::HTTP.start(uri.hostname, uri.port) {|http|
  http.request(req)
}

open 'cached_response', 'w' do |io|
  io.write res.body
end if res.is_a?(Net::HTTPSuccess)
The source file is dated July 9. Hope that helps.
I would like to replace my command line
curl -XPUT 'host:port/url' -d '{"val": "some_json"}'
with a Rails command, and get the result...
Something like this:
response = call('put', 'host:port/url', '{"val" : "some_json"}')
Is there any predefined method to do this in Rails, or some gem?
I know the get command of Net::HTTP, but I need to do a 'PUT' request:
Net::HTTP.get(URI.parse('host:port/url'))
Thanks for your replies
You can use Net::HTTP to send any standard HTTP request.
Here is a way you can connect to any URL (http/https), with any valid HTTP method, with or without parameters:
def universal_connector(api_url, api_parameters={}, method="Get")
  # TODO: raise an error if the URL or the method is invalid
  uri = URI(api_url)
  req = Net::HTTP.const_get(method.capitalize).new(uri)
  req.set_form_data(api_parameters)

  Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
    response = http.request(req)
    return response.body
  end
end
There are many alternatives available as well. Specifically, Faraday. Also, read this before making a choice.
#get is just a simple shortcut for the whole thing (the Net::HTTP Ruby library tends to be very verbose). However, Net::HTTP supports PUT requests perfectly well.
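For instance, a bare Net::HTTP version of the curl command in the question might look like this (a sketch; the host, port and path are placeholders):
require 'net/http'
require 'uri'
require 'json'

uri = URI('http://localhost:9200/url') # placeholder for host:port/url

req = Net::HTTP::Put.new(uri)
req['Content-Type'] = 'application/json'
req.body = { 'val' => 'some_json' }.to_json

res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
puts res.body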
Another alternative is to use an HTTP client as a wrapper. The most common alternatives are HTTParty and Faraday.
HTTParty.put('host:port/url', body: { "val" => "some_json" })
As a side note, please keep in mind that Rails is a framework, not a programming language. Your question is about how to perform an HTTP PUT request in Ruby, not Rails. It's important to understand the difference.
I'm creating an API service which allows people to provide the URL of an image in the API call, and then the service downloads the image to process it.
How do I ensure somebody does NOT give me the URL of, like, a 5MB image? Is there a way to limit the request?
This is what I have so far, which basically grabs everything.
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) { |http|
  http.request(req)
}
Thanks,
Conrad
cwninja unfortunately gave you an answer that will only work for accidental attacks. An intelligent attacker will have no trouble at all defeating that check. There are two main reasons his method should not be used.

First, nothing guarantees that the information in a HEAD response will match the corresponding GET response. A properly behaving server certainly will do this, but a malicious actor does not have to follow the spec. The attacker could simply send a HEAD response that says it has a Content-Length that's less than your threshold, but then hand you a huge file in the GET response.

Second, that doesn't even cover the potential for a server to send back a response with the Transfer-Encoding: chunked header set. A chunked response could quite possibly never end. A few people pointing your server at never-ending responses could carry out a trivial resource-exhaustion attack, even if your HTTP client enforces a timeout.
To do this correctly, you need to use an HTTP library that allows you to count the bytes as they're received, and abort if it crosses the threshold. I would probably recommend Curb for this rather than Net::HTTP. (Can you even do this at all with Net::HTTP?) If you use the on_body and/or on_progress callbacks, you can count the incoming bytes and abort mid-response if you receive a file that's too large. Obviously, as cwninja already pointed out, if you receive a Content-Length header larger than your threshold, you want to abort for that too. Curb is also notably faster than Net::HTTP.
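As a sketch of that counting approach with Curb (the URL is a placeholder; on_body is expected to return the number of bytes handled, and returning anything else makes libcurl abort the transfer):
require 'curb'

MAX_BYTES = 5 * 1024 * 1024
received = 0

curl = Curl::Easy.new('http://example.com/image.jpg')
curl.on_body do |chunk|
  received += chunk.bytesize
  # returning a value other than the chunk size aborts the transfer once the limit is hit
  received > MAX_BYTES ? 0 : chunk.bytesize
end

begin
  curl.perform
rescue Curl::Err::CurlError => e
  # the aborted transfer surfaces here (typically as a write error)
  puts "Aborted: #{e.class}"
end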
Try running this first:
Net::HTTP.start(url.host, url.port) { |http|
  response = http.request_head(url.path)
  raise "File too big." if response['content-length'].to_i > 5*1024*1024
}
You still have a race condition (someone could swap out the file after you do the HEAD request), but in the simple case this asks the server for the headers it would send back from a GET request.
Another way to limit the download size (a full implementation should check response status, handle exceptions, etc.; this is just an example):
Net::HTTP.start(uri.host, uri.port) do |http|
  request = Net::HTTP::Get.new uri.request_uri
  http.request request do |response|
    # check response codes here
    body = ''
    response.read_body do |chunk|
      body += chunk
      break if body.size > MY_SAFE_SIZE_LIMIT
    end
    break
  end
end
Combining the other two answers, I'd like to 1) check the size header, 2) watch the size of chunks, while also 3) supporting https and 4) aggressively enforcing a timeout. Here's a helper I came up with:
require "net/http"
require 'uri'
module FetchUtil
# Fetch a URL, with a given max bytes, and a given timeout
def self.fetch_url url, timeout_sec=5, max_bytes=5*1024*1024
uri = URI.parse(url)
t0 = Time.now.to_f
body = ''
Net::HTTP.start(uri.host, uri.port,
:use_ssl => (uri.scheme == 'https'),
:open_timeout => timeout_sec,
:read_timeout => timeout_sec) { |http|
# First make a HEAD request and check the content-length
check_res = http.request_head(uri.path)
raise "File too big" if check_res['content-length'].to_i > max_bytes
# Then fetch in chunks and bail on either timeout or max_bytes
# (Note: timeout won't work unless bytes are streaming in...)
http.request_get(uri.path) do |res|
res.read_body do |chunk|
raise "Timeout error" if (Time.now().to_f-t0 > timeout_sec)
raise "Filesize exceeded" if (body.length+chunk.length > max_bytes)
body += chunk
end
end
}
return body
end
end
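Example usage (the URL, timeout and size limit are placeholders):
body = FetchUtil.fetch_url('http://example.com/image.jpg', 10, 2 * 1024 * 1024)
puts "Fetched #{body.bytesize} bytes"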