Nokogiri parsing different on server versus localhost - ruby-on-rails
I'm getting some weird differences when running Nokogiri locally versus on my server. On my local machine the entire document seems to parse and be available, but on the server I only seem to get the doctype tag and a couple of comment tags.
To start, I checked that open-uri wasn't the problem: the downloaded files are not identical in size, but both contain the correct markup.
Local:
ruby-1.8.7-p352 :005 > s = open('http://www.pennstateind.com/store/PK2WAY.html')
=> #<File:/var/folders/G8/G8bsAGBk1o82Eyks3ZmFtq-+3Y6/-Tmp-/open-uri20120626-5891-10y2ncr-0>
ruby-1.8.7-p352 :006 > s.length
=> 88408
Server:
irb(main):008:0> s = open('http://www.pennstateind.com/store/PK2WAY.html')
=> #<File:/tmp/open-uri20120626-22167-1td2l72-0>
irb(main):009:0> s.length
=> 98184
When I run this on my local machine I get this:
ruby-1.8.7-p352 :003 > d = Nokogiri::HTML(open('http://www.pennstateind.com/store/PK2WAY.html'))
=> [ OUTPUT OMITTED FOR BREVITY - CAN SUPPLY ON REQUEST ]
ruby-1.8.7-p352 :004 > d.to_s.length
=> 85212
But when I run this on the server I get this:
irb(main):006:0> d = Nokogiri::HTML(open('http://www.pennstateind.com/store/PK2WAY.html'))
=> #<Nokogiri::HTML::Document:0x36620e14b580 name="document" children= [#<Nokogiri::XML::DTD:0x36620e14b1c0 name="html">, #<Nokogiri::XML::Comment:0x36620e14b170 " Open Graph Tags ">, #<Nokogiri::XML::Comment:0x36620e14a98c " Customer_Session_Verified: 0 ">]>
irb(main):007:0> d.to_s.length
=> 172
The only apparent gem difference is the JS compiler. All other gems are the exact same version on local and server:
Local => libv8 (3.3.10.4 x86-darwin-10)
Server => libv8 (3.3.10.4 x86_64-linux)
Any ideas how to figure out what is going on and/or fix this?
Update - to isolate where the problem actually was, I saved the downloaded file from the server and from localhost, then parsed both files on each machine. The results below show that the problem definitely lies in Nokogiri. What the problem is, I am still perplexed by...
Running locally:
# FILE ORIGINALLY PULLED FROM SERVER
ruby-1.8.7-p352 :015 > server_file = File.open("/Users/jmcdonald/Desktop/files/SERVER.txt", "r")
=> #<File:/Users/jmcdonald/Desktop/files/SERVER.txt>
ruby-1.8.7-p352 :016 > server_file.read.length
=> 93071
ruby-1.8.7-p352 :022 > Nokogiri::HTML(server_file).to_s.length
=> 98793
# FILE ORIGINALLY PULLED FROM LOCALHOST
ruby-1.8.7-p352 :017 > local_file = File.open("/Users/jmcdonald/Desktop/files/LOCAL.txt", "r")
=> #<File:/Users/jmcdonald/Desktop/files/LOCAL.txt>
ruby-1.8.7-p352 :018 > local_file.read.length
=> 89622
ruby-1.8.7-p352 :026 > Nokogiri::HTML(local_file).to_html.length
=> 94632
Running on server:
# FILE ORIGINALLY PULLED FROM SERVER
irb(main):001:0> sf = File.open('/home/charlest/public_html/files/nokogiri_issue/SERVER.txt', 'r')
=> #<File:/home/charlest/public_html/files/nokogiri_issue/SERVER.txt>
irb(main):002:0> sf.read.length
=> 93071
irb(main):004:0> Nokogiri::HTML(sf).to_s.length
=> 896 # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< WRONG
# FILE ORIGINALLY PULLED FROM LOCALHOST
irb(main):008:0> lf = File.open('/home/charlest/public_html/files/nokogiri_issue/LOCAL.txt', 'r')
=> #<File:/home/charlest/public_html/files/nokogiri_issue/LOCAL.txt>
irb(main):009:0> lf.read.length
=> 89622
irb(main):011:0> Nokogiri::HTML(lf).to_s.length
=> 896 # <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< WRONG
It looks like your server and local environment are using different versions of libxml2. Older versions are known to have strange parsing bugs, so updating your server to the latest version you possibly can (or at least to the same version you're using for development) should fix you up.
There was also a bug in a shipped version of Nokogiri (I believe it affected 1.5.1) which broke parsing in some limited situations. I would suggest making sure your gems are up to date by running gem update.
Try using File.read rather than File.open, or make sure you're running lf.rewind before you try to parse with Nokogiri. The behavior you're seeing is most likely the result of your lf file handle being at the end of the file after the earlier lf.read call, which means Nokogiri is parsing an empty document.
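To compare the libxml2 versions on the two machines, one option is to print what Nokogiri itself reports. The following is a minimal sketch, assuming the nokogiri gem is installed; the helper name nokogiri_libxml_versions is made up for illustration, and it returns nil when the gem cannot be loaded.

```ruby
# Hypothetical helper: reports the Nokogiri and libxml2 versions in use,
# or nil when the nokogiri gem is not installed on this machine.
def nokogiri_libxml_versions
  require "nokogiri"
  # VERSION_INFO has no "libxml" entry on JRuby, hence the fallback hash.
  libxml = Nokogiri::VERSION_INFO["libxml"] || {}
  {
    "nokogiri" => Nokogiri::VERSION,
    "compiled" => libxml["compiled"], # libxml2 version Nokogiri was built against
    "loaded"   => libxml["loaded"]    # libxml2 version actually loaded at runtime
  }
rescue LoadError
  nil
end

p nokogiri_libxml_versions
```

Running this on both machines and comparing the "loaded" values should show whether the libxml2 versions differ. The nokogiri -v command prints the same information from the shell.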
> remote = File.open('./PK2WAY.html')
# => #<File:./PK2WAY.html>
> remote.read.length
# => 92978
> remote.read.length
# => 0
> Nokogiri::HTML(remote).to_s.length
# => 108
> remote.rewind
# => 0
> Nokogiri::HTML(remote).to_s.length
# => 93847
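The file-position effect can be reproduced without Nokogiri at all. This is a small sketch using only the standard library; the HTML string and the temp file are made-up stand-ins for the saved page.

```ruby
require "tempfile"

# Illustrative stand-in for the saved HTML page (contents are made up).
html = "<html><body><p>hello</p></body></html>"

file = Tempfile.new("page")
file.write(html)
file.flush
file.rewind

first_read  = file.read  # consumes the whole file; position is now at EOF
second_read = file.read  # nothing left to read, so this is an empty string

file.rewind              # move the position back to the start of the file
after_rewind = file.read # the full contents are available again

puts first_read.length
puts second_read.length
puts after_rewind.length

file.close
file.unlink
```

Any parser handed the file object after the first read starts at EOF and sees an empty document. Passing a string such as File.read(path) to Nokogiri::HTML, instead of an already-read file handle, avoids depending on the handle's position.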
Related
Putting ~ after id fetches record
If we have an Active Record model, say User, then User.find(id) works as expected. But so does User.find('id~'), and also User.find('id~gibberish'). Is this a vulnerability or flaw of ActiveRecord? How do I handle such requests appropriately?
This should help clear some things up: it is not ActiveRecord, it's Ruby's to_i method that you're seeing.
2.2.1 :001 > '11'.to_i
=> 11
2.2.1 :002 > '11~'.to_i
=> 11
2.2.1 :003 > '11~gibberish'.to_i
=> 11
This is not a vulnerability nor a flaw. If you're worried about input like this, I'd ask for an example where you think it could cause you harm. Additionally, if you'd like to be super defensive, use Integer():
2.2.1 :004 > Integer('11~gibberish')
ArgumentError: invalid value for Integer(): "11~gibberish"
2.2.1 :005 > Integer('11')
=> 11
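If you do want to reject such ids before they reach find, the Integer() approach can be wrapped in a small helper. This is only a sketch; the name strict_id is hypothetical and not part of Rails or ActiveRecord.

```ruby
# Hypothetical helper: returns an Integer when the parameter is a valid
# base-10 integer string, and nil for anything else (including nil).
def strict_id(param)
  Integer(param.to_s, 10)
rescue ArgumentError
  nil
end

p strict_id("11")            # => 11
p strict_id("11~")           # => nil
p strict_id("11~gibberish")  # => nil
p "11~gibberish".to_i        # => 11 (to_i silently ignores the trailing junk)
```

A controller could then respond with a 404 when strict_id(params[:id]) is nil instead of passing the raw string to User.find.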
Different result for Regexp match with Rails and Rubular?
I am using Rails 4.0.0 with Ruby 2.0.0 p247. I am writing a URL regexp matcher, but I have no idea why it does not work:
2.0.0-p247 :033 > REGEXP = %r{\Ahttps:\/\/#{ Rails.configuration.aws[:bucket] }\.s3(-#{Rails.configuration.aws[:region]}|)\.amazonaws\.com\/(?<path>uploads\/.+\/(?<filename>.+))\?.+\z}.freeze
=> /\Ahttps:\/\/test-gem\.s3(-eu-west-1|)\.amazonaws\.com\/(?<path>uploads\/.+\/(?<filename>.+))\?.+\z/
2.0.0-p247 :034 > url = "https://test-gem.s3.amazonaws.com/uploads/2alrg16mvx6r-29590d114fb3257846c1a03330418da9/3031674-poster-p-1-for-25.jpg"
=> "https://test-gem.s3.amazonaws.com/uploads/2alrg16mvx6r-29590d114fb3257846c1a03330418da9/3031674-poster-p-1-for-25.jpg"
2.0.0-p247 :035 > REGEXP.match(url)
=> nil
But when I try to debug it in tools like Rubular, it does work. Any idea? Thanks!
Remove \?.+ at the end of your regexp. The URL you are testing has no query string, so the mandatory \?.+ part can never match it.
Might be a bug with Ruby 2.0.0. I'm using 2.1.3 and it works like you'd expect.
> r = /\Ahttps:\/\/test-gem\.s3(\A-eu-west-1\z|)\.amazonaws\.com\/(?<path>uploads\/.+\/(?<filename>.+))\z/
=> /\Ahttps:\/\/test-gem\.s3(\A-eu-west-1\z|)\.amazonaws\.com\/(?<path>uploads\/.+\/(?<filename>.+))\z/
> r.match("https://test-gem.s3.amazonaws.com/uploads/2alrg16mvx6r-29590d114fb3257846c1a03330418da9/3031674-poster-p-1-for-25.jpg")
=> #<MatchData "https://test-gem.s3.amazonaws.com/uploads/2alrg16mvx6r-29590d114fb3257846c1a03330418da9/3031674-poster-p-1-for-25.jpg" path:"uploads/2alrg16mvx6r-29590d114fb3257846c1a03330418da9/3031674-poster-p-1-for-25.jpg" filename:"3031674-poster-p-1-for-25.jpg">
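If the pattern has to accept URLs both with and without a query string, one possible variant makes the query part optional instead of removing it. The sketch below hard-codes the bucket and region shown in the question's output in place of Rails.configuration; the constant names are made up, and the filename group is narrowed to [^/?]+ so that it stops before a query string.

```ruby
# Values copied from the regexp printed in the question; in the original code
# they come from Rails.configuration.aws.
BUCKET = "test-gem"
REGION = "eu-west-1"

# Hypothetical constant name. The region suffix and the query string are both optional.
S3_UPLOAD_URL = %r{
  \Ahttps://#{Regexp.escape(BUCKET)}\.s3(?:-#{Regexp.escape(REGION)})?\.amazonaws\.com/
  (?<path>uploads/.+/(?<filename>[^/?]+))
  (?:\?.+)?\z
}x

url = "https://test-gem.s3.amazonaws.com/uploads/2alrg16mvx6r-29590d114fb3257846c1a03330418da9/3031674-poster-p-1-for-25.jpg"

plain  = S3_UPLOAD_URL.match(url)
signed = S3_UPLOAD_URL.match(url + "?token=abc") # made-up query string

puts plain[:filename]   # 3031674-poster-p-1-for-25.jpg
puts signed[:filename]  # 3031674-poster-p-1-for-25.jpg
```

This sketch does not handle a query string that itself contains a slash; such URLs would need a stricter path group.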
Rails n elements before last
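In Rails I often do this:
Model.last(5).first
This retrieves element last-5. Is there a built-in way of doing this?
The more common way is offset():
Model.offset(5).last
Note that this skips five records from the end, so in the session below it returns id 143450. Model.last(5).first would return the record one later (143451 if the ids are contiguous), which corresponds to offset(4).
Edit (for lazy people):
1.8.7 :001 > User.first.id
=> 1
1.8.7 :002 > User.last.id
=> 143455
1.8.7 :003 > User.offset(5).last.id
=> 143450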
Rails / Ruby not following Rubular on regular expression
I have the following expression that I have tested in Rubular and that successfully matches against a snippet of HTML:
Official Website<\/h3>\s*<p><a href="([^"]*)"
However, when I run the expression in Ruby, using the following code, it returns no matches. I've reduced it down to "Official\s*Website" and it matches that, but nothing further. Are there any additional options I need to set, or anything else that I need to do to configure Ruby/Rails to start tracking Rubular?
matches = sidebar.match(/Official\s*Website<\/h3>\s*<p><a href="([^"]*)"/)
if matches.nil?
  puts "no matches"
else
  puts "matches"
end
This is the relevant part of the snippet I'm matching against:
<h3>Official Website</h3><p><a href="http://website.com">website.com</a></p>
Your regular expression is correct. Rubular should be working the same way your code does. I tested it against Ruby 1.8.7 and 1.9.3.
Ruby 1.8.7:
irb(main):006:0> sidebar = ' <h3>Official Website</h3><p><a href="http://website.com">website.com</a></p>'
=> " <h3>Official Website</h3><p><a href=\"http://website.com\">website.com</a></p>"
irb(main):007:0> sidebar.match(/Official\s*Website<\/h3>\s*<p><a href="([^"]*)"/)
=> #<MatchData "Official Website</h3><p><a href=\"http://website.com\"" 1:"http://website.com">
Ruby 1.9.3:
1.9.3p0 :005 > sidebar = ' <h3>Official Website</h3><p><a href="http://website.com">website.com</a></p>'
=> " <h3>Official Website</h3><p><a href=\"http://website.com\">website.com</a></p>"
1.9.3p0 :006 > sidebar.match(/Official\s*Website<\/h3>\s*<p><a href="([^"]*)"/)
=> #<MatchData "Official Website</h3><p><a href=\"http://website.com\"" 1:"http://website.com">
If you want to quickly check why stuff is not working, you should try it in IRB or in your Rails console. Most of the time it's a typo or bad encoding.
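A quick way to narrow this kind of problem down is to run the same pattern against the exact string your program holds and against a known-good sample. In the sketch below both sample strings are illustrative: one contains the link markup the pattern expects, the other does not.

```ruby
PATTERN = /Official\s*Website<\/h3>\s*<p><a href="([^"]*)"/

# Illustrative inputs: the first has the <a href="..."> the pattern requires,
# the second is the same snippet with the link markup missing.
with_link    = '<h3>Official Website</h3><p><a href="http://website.com">website.com</a></p>'
without_link = '<h3>Official Website</h3><p>website.com</p>'

linked   = with_link.match(PATTERN)
unlinked = without_link.match(PATTERN)

puts linked[1]  # http://website.com
p unlinked      # nil
```

If the real sidebar string behaves like the second sample, printing it with p sidebar shows exactly which characters follow "Official Website" and where the input diverges from what was pasted into Rubular.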
HTTP post request via Ruby
I'm very new to Ruby and trying some basic stuff. When I send an HTTP request to the server using:
curl -v -H "Content-Type: application/json" -X GET -d "{"myrequest":"myTest","reqid":"44","data":{"name":"test"}}" localhost:8099
My server sees JSON data as "{myrequest:myTest,reqid:44,data:{name:test}}"
But when I send the request using the following Ruby code:
require 'net/http'
@host = 'localhost'
@port = '8099'
@path = "/posts"
@body = ActiveSupport::JSON.encode({ :bbrequest => "BBTest", :reqid => "44", :data => { :name => "test" } })
request = Net::HTTP::Post.new(@path, initheader = {'Content-Type' =>'application/json'})
request.body = @body
response = Net::HTTP.new(@host, @port).start {|http| http.request(request) }
puts "Response #{response.code} #{response.message}: #{response.body}"
It sees it as "{\"bbrequest\":\"BBTest\",\"reqid\":\"44\",\"data\":{\"name\":\" test\"}}" and the server is unable to parse it. Perhaps there are some extra options I need to set to send the request from Ruby to exclude those extra characters? Can you please help? Thanks in advance.
What you are doing on the shell produces invalid JSON. Your server should not accept it.
$ echo "{"myrequest":"myTest","reqid":"44","data":{"name":"test"}}"
{myrequest:myTest,reqid:44,data:{name:test}}
The shell strips the unescaped inner quotes, leaving JSON with unquoted keys and values, which will NEVER work (check it at http://jsonlint.com/). If your server accepts this "sort of kind of JSON" but does not accept the second one in your example, your server is broken.
You wrote: My server sees JSON data as "{myrequest:myTest,reqid:44,data:{name:test}}"
Your server sees a string. When you try to parse it as JSON it will produce an error or garbage.
You wrote: It sees it as "{\"bbrequest\":\"BBTest\",\"reqid\":\"44\",\"data\":{\"name\":\" test\"}}"
No, this is how it's printed via Ruby's Object#inspect. You are printing the return value of inspect somewhere and then trying to judge whether it's valid JSON. It is not, since the string you've pasted is meant to be pasted into the interactive Ruby console (irb) or into a Ruby script, and it contains built-in escapes. You need to see your JSON string raw: just print the string instead of inspecting it.
I think your server is either broken or not finished yet, your curl example is broken, and your Ruby script is correct and will work once the server is fixed (or finished). Simply because:
irb(main):002:0> JSON.parse("{\"bbrequest\":\"BBTest\",\"reqid\":\"44\",\"data\":{\"name\":\" test\"}}")
# => {"bbrequest"=>"BBTest", "reqid"=>"44", "data"=>{"name"=>" test"}}
Your problem is something other than the existence of escape characters in the string. Those are not put in by the code you show, but by irb or .inspect. If you put a simple puts @body in your code (or in a Rails context, logger.debug @body), you'll see this. Here's an irb session showing the difference:
ruby-1.9.2-p180 :002 > require 'active_support'
=> true
ruby-1.9.2-p180 :003 > json = ActiveSupport::JSON.encode({
ruby-1.9.2-p180 :004 > :bbrequest => "BBTest",
ruby-1.9.2-p180 :005 > :reqid => "44",
ruby-1.9.2-p180 :006 > :data =>
ruby-1.9.2-p180 :007 > {
ruby-1.9.2-p180 :008 > :name => "test"
ruby-1.9.2-p180 :009?> }
ruby-1.9.2-p180 :010?> })
=> "{\"bbrequest\":\"BBTest\",\"reqid\":\"44\",\"data\":{\"name\":\"test\"}}"
ruby-1.9.2-p180 :013 > puts json
{"bbrequest":"BBTest","reqid":"44","data":{"name":"test"}}
=> nil
In any case, the best way to do JSON encoding in Rails is not to call ActiveSupport::JSON.encode directly, but rather to override as_json in your model or use the serializable_hash feature. This will make your code cleaner as well. See the top answers to this stackoverflow question for details.
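To check the Ruby side without a running server, the request can be built and inspected locally. This is a sketch using the standard library json module in place of ActiveSupport::JSON (an assumption made so it runs outside Rails); the path /posts and the payload are taken from the question, and nothing is sent over the network.

```ruby
require "json"
require "net/http"

payload = { "bbrequest" => "BBTest", "reqid" => "44", "data" => { "name" => "test" } }
body = JSON.generate(payload)

# Build the request object only; no connection is opened here.
request = Net::HTTP::Post.new("/posts", "Content-Type" => "application/json")
request.body = body

puts request.body  # {"bbrequest":"BBTest","reqid":"44","data":{"name":"test"}}

# Round trip: the body parses back into the original structure, so it is valid JSON.
parsed = JSON.parse(request.body)
p parsed == payload  # true
```

The raw body contains no backslashes, so if the server still rejects it, the server's parsing is the place to look.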
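The difference between the raw string and its inspect form can also be checked directly. This is a minimal sketch using the standard library json module (an assumption, standing in for ActiveSupport::JSON so it runs outside Rails).

```ruby
require "json"

json = JSON.generate({ "name" => "test" })

raw       = json          # what is actually sent over the wire
inspected = json.inspect  # what irb and p display: quoted, with escapes added

puts raw        # {"name":"test"}
puts inspected  # "{\"name\":\"test\"}"

puts raw.count("\\")        # 0
puts inspected.count("\\")  # 4
```

The backslashes exist only in the inspect output, which is a Ruby string literal for display, not the data itself.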