Different results between Nokogiri and real browser - ruby-on-rails

The target url is: http://courts.delaware.gov/opinions/List.aspx?ag=all+courts
It seems to only retrieve the first 10 links, while a real browser retrieves 50 links.
Here's some sample code to reproduce the error:
require 'open-uri'
require 'nokogiri'
doc=Nokogiri::HTML(open("http://courts.delaware.gov/opinions/List.aspx?ag=all+courts", 'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0'))
p "there are missing links" if doc.css('strong a').size < 50
When opening the file produced by open("http://courts.delaware.gov/opinions/List.aspx?ag=all+courts", 'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0'), I see the full and expected HTML.
The resulting doc returned from Nokogiri is truncated, with closing HTML tags after the 10th link and no additional content.
This leads me to believe something in the page is causing the Nokogiri HTML parser to terminate early.
EDIT: It looks like there's something malformed in the HTML. When I remove the last <tr>...</tr> element, Nokogiri grabs more links. I'm still not sure what exactly the problem is, or how to configure Nokogiri to grab everything.
EDIT2: The problem is that Nokogiri stops parsing when it hits a special character that is invalid UTF-8 (\x002). There's probably some way to sanitize the string or force its encoding before Nokogiri parses it.
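One way to try that, as a rough sketch: it assumes the offending bytes are stray ASCII control characters (which is what \x002 suggests) and uses String#scrub, available from Ruby 2.1; adjust the forced encoding if the page isn't actually UTF-8.
require 'open-uri'
require 'nokogiri'

url  = "http://courts.delaware.gov/opinions/List.aspx?ag=all+courts"
html = open(url, 'User-Agent' => 'Mozilla/5.0').read
# Drop bytes that aren't valid UTF-8, then strip the ASCII control characters
# (other than tab, newline and carriage return) that can stop libxml2 early.
html = html.force_encoding('UTF-8').scrub('')
html = html.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F]/, '')
doc = Nokogiri::HTML(html)
p doc.css('strong a').size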

Related

parse web page encoded text with scrapy

I can't extract the content preview of a book from an online bookstore.
If I'm not wrong, the site prevents copying of book previews by encoding the text. I'm looking at the preview of this book.
In the page inspector it looks like this: every word sits outside a span tag, and inside each span tag is a ten-character code corresponding to that word:
<span style='color:red;display:none;'>pq8BMvE37g</span>ولا <span style='color:red;display:none;'>G9XGnpBjnY</span>قدرة
I tried with Scrapy in Python and failed:
response.xpath("//*[@class='nabza']").extract()
then, to filter the text:
response.xpath("//*[@class='nabza']/text()").extract()
The fastest way might be to use this XPath:
string(//div[@class='nabza'])
Then use a regex ([a-zA-Z0-9]+) to replace the ten-character codes with blank spaces.
Alternatively you could use this XPath:
//div[@class='nabza']//*[not(self::span)]/text()
No more ten-character codes. You will probably have to do some cleanup (check that the 473 parts of text are correctly merged, check the \r\n, ...) and you should obtain something like this:
https://paste2.org/mWhxzxpj
EDIT: R code:
library(RCurl)
library(XML)
page=getURL("https://www.neelwafurat.com/itempage.aspx?id=lbb179878-143056&search=books", httpheader = c('User-Agent' = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0"),.encoding = 'UTF-8')
parse=htmlParse(page,encoding = "UTF-8")
text=xpathSApply(parse,"//div[@class='nabza']//*[not(self::span)]/text()",xmlValue)
result=paste0(text,collapse = "")
writeLines(result,"result.txt",useBytes=T)
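Since this thread started from Nokogiri, here is the same extraction sketched in Ruby as well (an unverified translation of the R snippet above; the URL, class name and XPath are taken from it):
require 'open-uri'
require 'nokogiri'

url = "https://www.neelwafurat.com/itempage.aspx?id=lbb179878-143056&search=books"
doc = Nokogiri::HTML(open(url, 'User-Agent' => 'Mozilla/5.0'))
# Keep every text node under the div except the ones inside the hidden spans.
parts = doc.xpath("//div[@class='nabza']//*[not(self::span)]/text()").map(&:text)
File.write("result.txt", parts.join)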

Ruby Parsing: Error when I trying to put Cyrillic symbols into URL request

I wrote a bot for Telegram where users can receive images for their requests, but there is one problem I could not solve.
Here's an example of the parsing code in Ruby:
json_object = JSON.parse(open("https://api.site.com/search/photos?query=" + message.text + "&per_page=10&client_id=42324d2lkedi234fs342dfse2c038fdfsdfs").read)
message.text is the field containing the user's request.
Everything works fine with Latin characters, but when I send Cyrillic characters (the API also supports the Cyrillic alphabet) I get the error below:
/Users/me/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/rfc3986_parser.rb:21:in
`split': URI must be ascii only
"https://api.site.com/search/photos?query=\u0432\u0430\u0432\u0430&per_page=10&client_id=42324d2lkedi234fs342dfse2c038fdfsdfs"
(URI::InvalidURIError)
I tried encoding with UTF-8 and Windows-1252, but nothing helped. How should this be fixed?
You should URL-encode your Cyrillic string:
URI.encode('http://google.com?1=АБВ') # => "http://google.com?1=%D0%90%D0%91%D0%92"
So use it like this (or encode the whole URL):
URI.encode(message.text)
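On newer Rubies URI.encode is obsolete (it was removed in Ruby 3.0), so another option is to percent-encode just the query value with URI.encode_www_form_component (available since Ruby 1.9.2). A sketch, reusing the URL from the question:
require 'open-uri'
require 'json'

query = URI.encode_www_form_component(message.text) # percent-encodes Cyrillic safely
url = "https://api.site.com/search/photos?query=" + query + "&per_page=10&client_id=42324d2lkedi234fs342dfse2c038fdfsdfs"
json_object = JSON.parse(open(url).read)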
Try with
"anything".parameterize.underscore.humanize.downcase

Turbolinks 3 and render a partial

I'm excited about Turbolinks 3 (it allows you to render only a partial instead of reloading the whole body).
You can read more about it from here: https://github.com/rails/turbolinks/blob/master/CHANGELOG.md
It's amazing, but I have a problem:
In browsers that don't support pushState (e.g. IE8/9), I don't know how to manage the behavior.
It gives me this error on IE8:
Could not set the innerHTML property. Invalid target element for this operation.
My Controller code is:
def create
  @post = Post.find(params[:post_id])
  if @post.comments.create(comment_params)
    render '_comment', change: [:comments, :super_test], layout: false, locals: { comment: @post.comments.last }
  else
    render json: 'error'
  end
end
A 'solution' could be to do:
redirect_to @post, change: [:comments, :super_test]
But then the problem is that it replies with a lot of data I don't need (and the response time is longer), so I really want to find another solution.
How can I resolve this problem?
I've thought about two solutions:
1) Use history.js / Modernizr to polyfill pushState on old browsers.
I've tried this, but I always get the same error (as if I didn't have Modernizr):
Webpage error details
User Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)
Timestamp: Sat, 25 Apr 2015 17:28:52 UTC
Message: Could not set the innerHTML property. Invalid target element for this operation.
Line: 26
Char: 30464
Code: 0
URI: https://stark-forest-5974.herokuapp.com/assets/application-83a3aa4fd4a1ee124da87760bfdca86febd4fc1cb8a13167c892a15ce3caa53d.js
2) Find a way to check whether the request comes from Turbolinks/PJAX or not, and use a conditional render or redirect_to.
But I have no idea how to do that, because Turbolinks doesn't send a specific header the way jquery-pjax does.
Any suggestions? I'd really appreciate it!
PS: Please don't suggest Backbone/Angular/Ember/React; I already know them (Backbone), but I want to try Turbolinks.
Your first instinct is right: with IE8 you'll need Modernizr. The problem here is neither your code nor Turbolinks; it's IE8.
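For your second idea, one possible sketch: turbolinks-classic sets an X-XHR-Referer header on the XHR requests it makes, so you may be able to branch on that in the controller. Verify the header against the exact Turbolinks 3 build you're using; the helper below is made up for illustration.
# Hypothetical helper: treats any request carrying Turbolinks' X-XHR-Referer
# header as a Turbolinks visit.
def turbolinks_request?
  request.headers['X-XHR-Referer'].present?
end

def create
  @post = Post.find(params[:post_id])
  if @post.comments.create(comment_params)
    if turbolinks_request?
      render '_comment', change: [:comments, :super_test], layout: false, locals: { comment: @post.comments.last }
    else
      redirect_to @post, change: [:comments, :super_test]
    end
  else
    render json: 'error'
  end
end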
PS: Turbolinks doesn't actually replace JS frameworks; you can perfectly well use it with one of them if you want. I have used it with React and Angular. Turbolinks just avoids re-loading the same thing several times (which already feels like magic).

Rails 2.3.2/Ruby 1.8.6 Encoding Question - ActionController returning UTF-8?

I have a pretty simple Rails question regarding encoding that I can't find an answer to.
Environment:
Rails 2.3.2/Ruby1.8.6
I am not setting any encoding options within the Rails environment currently, have left everything to defaults.
If I read a String from a text file on disk and send it via Rails' render :text functionality using Apache/Phusion, what encoding should the client expect?
Thank you for any answers,
Since about Rails 1.2, Rails sets Ruby 1.8's $KCODE magic variable to "UTF8". It includes ActiveSupport::CoreExtensions::String::Multibyte to patch around issues with otherwise ambiguous per-character/per-byte operators. Your text file should be UTF-8, Ruby will pass it through and your application layout should specify a META tag declaring the document's charset to be UTF-8 too:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
Then it should all 'just work', but there are some gotchas described below.
If you're on a Mac, running "script/console" in Terminal.app and then pasting unusual character sequences directly into the terminal from e.g. the Character Viewer is a good way to play around and demonstrate this to your own satisfaction, since the whole OS works in UTF-8. I don't know what the equivalent would be for Windows or an arbitrary Linux distribution.
For example, "⇒" - RIGHTWARDS DOUBLE ARROW - is Unicode 21D2, UTF-8 0xE2 (226), 0x87 (135), 0x92 (146). If I paste that into Terminal and ask for the byte values I get the expected result:
>> $KCODE
=> "UTF8"
>> "⇒"
=> "\342\207\222"
>> puts "⇒"
⇒
...but...
>> "⇒"[0]
=> 226
>> "⇒"[1]
=> 135
>> "⇒"[2]
=> 146
>> "⇒"[3]
=> nil
Note how you're still getting byte access with "[]". See the documentation on the Multibyte extensions in the Rails API (for Rails 2.2, e.g. at http://railsapi.com/) if you want to do string operations, otherwise things like "foo.reverse" will do the wrong thing; "foo.mb_chars.reverse" gets it right by using the "mb_chars" proxy.
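To make the gotcha concrete, a short console sketch (Rails 2.3 script/console, so the mb_chars proxy is loaded; output abridged):
>> s = "⇒abc"
=> "\342\207\222abc"
>> puts s.mb_chars.reverse   # character-aware reverse keeps the arrow intact
cba⇒
>> s.reverse                 # plain reverse works byte-by-byte and breaks the arrow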

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
For example, this: סלקום
comes out as gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: OK, so what you're seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here's a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)
