Nokogiri results different from browser inspect - ruby-on-rails

I am trying to scrape a site, but the links Nokogiri returns are different from what I see when I inspect the page in the browser.
In my browser I get normal links, but with Nokogiri every a href becomes javascript:void(0);.
Here is the site:
https://www.ctgoodjobs.hk/jobs/part-time
Here is my code:
url = "https://www.ctgoodjobs.hk/jobs/part-time"
response = open(url) rescue nil
next unless response
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').text

It's not that easy: the URLs are "obscured" using a JS function, which is why you're getting javascript:void(0) when asking for the hrefs. Looking at the HTML, there are some hidden inputs for each link, and there is a preview URL that you can use to build the job preview URL (if that's what you're looking for). So you have this:
<div class="result-list-job current-view">
<input type="hidden" name="job_id" value="04375145">
<input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
<h2 class="job-title">Barista/ Senior Barista 咖 啡 調 配 員</h2>
<h3 class="job-company">PACIFIC COFFEE CO. LTD.</h3>
<div class="job-description">
<ul class="job-desc-list clearfix">
<li class="job-desc-loc job-desc-small-icon">-</li>
<li class="job-desc-work-exp">0-1 yr(s)</li>
<li class="job-desc-salary job-desc-small-icon">-</li>
<li class="job-desc-post-date">09/11/16</li>
</ul>
</div>
<a class="job-save-btn" title="save this job" style="display: inline;"> </a>
<div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
<div class="job-cat job-cat-de"></div>
</div>
Then you can retrieve each job_id from those inputs, like:
inputs = doc.search('//input[@name="job_id"]')
and then build the URLs (I found the base URL in joblist_preview.js):
urls = inputs.map do |input|
"https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
end

Compare the output of a browser with that of a tool like wget, curl or Nokogiri and you will find that the HTML the browser presents can differ drastically from the raw HTML.
Browsers these days process DHTML (they run the page's JavaScript, which can modify the document); Nokogiri doesn't. To see the raw HTML, retrieve it with something that bypasses the browser, like the tools mentioned above, then look at it in a text editor or at what Nokogiri shows you. Don't trust the browser: browsers are known to lie because they want to make you happy.
Here's a quick glimpse into what the raw HTML contains, generated using:
$ nokogiri "https://www.ctgoodjobs.hk/jobs/part-time"
Nokogiri dropped me into IRB:
Your document is stored in @doc...
Welcome to NOKOGIRI. You are using ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin15]. Have fun ;)
Counting the hits found by the selector returns:
>> @doc.search('.job-title > a').size
30
Displaying the text found shows:
>> @doc.search('.job-title > a').map(&:text)
[
[ 0] "嬰 兒 奶 粉 沖 調 機 - 兼 職 產 品 推 廣 員 Part Time Promoter (時 薪 高 達 HK$90, 另 設 銷 售 佣 金 )",
...
[29] "Customer Services Representative (Part-time)"
]
Looking at the actual href:
>> @doc.search('.job-title > a').map{ |n| n['href'] }
[
[ 0] "javascript:void(0);",
...
[29] "javascript:void(0);"
]
You can tell the HTML doesn't contain anything but what Nokogiri is telling you, so the browser is post-processing the HTML: it processes the DHTML and modifies the page, and that modified page is what you see when you inspect it in the browser. So, the short fix is: don't trust the browser if you want to know what the server sends you.
This is why scraping isn't very reliable and you should use an API if at all possible. If you can't, then you're going to have to roll up your sleeves and dig into the JavaScript and manually interpret what it's doing, then retrieve the data and parse it into something useful.
Your code can be cleaned up and simplified. I'd write it much more simply as:
require 'nokogiri'
require 'open-uri'
url = "https://www.ctgoodjobs.hk/jobs/part-time"
doc = Nokogiri::HTML(URI.open(url)) # URI.open needs Ruby 2.5+; older Rubies use open(url)
links = doc.search('.job-title > a').map(&:text)
The use of search(...).text is a big mistake. text, when applied to a NodeSet, will concatenate the text of each contained node, making it extremely difficult to retrieve the individual text. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
The first result foobar would require being split apart to be useful, and unless you have special knowledge of the content, trying to figure out how to do it will be a major pain.
Instead, use map to iterate through the elements and apply &:text to each one, returning an array of each element's text.
See "How to avoid joining all text from Nodes when scraping" and "Taking apart a DHTML page" also.

Related

Carrying style IDs/names from HTML to .docx?

Is it possible to somehow tell pandoc to carry the names of styles from original HTML to .docx?
I understand that in order to tune the actual styles, I should use the reference.docx file generated by pandoc. However, reference.docx is limited to the styles it already has: headings, body text, block text, etc.
I'd like to:
specify "myStyle" style in the input HTML (via a "class" attribute, via any other HTML attribute or even via a filter code written in Lua),
<html>
<body>
<p>Hello</p>
<p class="myStyle">World!</p>
</body>
</html>
add a custom "myStyle" to reference.docx using Word,
run an html->docx conversion and expect pandoc to generate a paragraph element with "myStyle" (instead of BodyText, which I believe it sets by default), so the end result looks like this (contents of word/document.xml inside the resulting output.docx, cut for brevity):
<w:p>
<w:pPr>
<w:pStyle w:val="BodyText" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">Hello</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="myStyle" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">World!</w:t>
</w:r>
</w:p>
There's some evidence styleId can be passed around, but I don't really understand it and am unable to find any documentation about it.
Doc on filtering in Lua states you can access attrs when manipulating a pandoc.div, but it says nothing about whether any of the attrs will be interpreted by pandoc in any meaningful way.
Finally, I found what I needed: custom styles. It's limited, but better than what I had arrived at earlier, and of course much better than nothing at all :)
I'll leave a step-by-step guide here in case anyone stumbles upon a similar question.
First, generate a reference.docx file like this:
pandoc --print-default-data-file reference.docx > styles.docx
Then open the file in MS Word (I was using a macOS version).
Click the "New style..." button on the right, and create a style to your liking. In my case I made the text bold and blue. The style's name is what the custom-style attribute in the HTML refers to.
Since I am converting from HTML to DOCX, here's my input.html:
<html>
<body>
<div>Page 1</div>
<div custom-style="eugene-is-testing">Page 2</div>
<div>Page 3</div>
</body>
</html>
Run:
pandoc --standalone --reference-doc styles.docx --output output.docx input.html
Finally, open output.docx and enjoy the result: the "Page 2" paragraph should be rendered in the custom style.

Kramdown not showing in heroku (Rails app)

I am using the kramdown gem. It works locally, but not in production. Here are the results locally:
content: "~~~ ruby\r\ndef what?\r\n 42\r\nend\r\n~~~",
kramdown_to_html(subthema.content.html_safe)
=> "<div class=\"language-ruby highlighter-coderay\"><div
class=\"CodeRay\">\n <div class=\"code\"><pre><span
style=\"color:#080;font-weight:bold\">def</span> <span
style=\"color:#06B;font-weight:bold\">what?</span>\n <span
style=\"color:#00D\">42</span>\n<span style=\"color:#080;font-
weight:bold\">end</span>\n</pre></div>\n</div>\n</div>\n"
As you can see, the piece of code is translated into the right HTML: every code token is wrapped in a span, and every span carries its own styling.
In production though, here are the results, doing exactly the same:
irb(main):010:0> subthema.content.html_safe
=> "~~~ ruby\r\ndef what?\r\n 42\r\nend\r\n~~~"
irb(main):011:0> kramdown_to_html(subthema.content)
=> "<pre><code class=\"language-ruby\">def what?\n 42\nend\n</code>
</pre>\n"
As you can see, the span tags are not there. On Heroku, kramdown only manages to create the pre and code tags, but not the spans.
Any help in this topic?
Thanks in advance.

MediaWiki Scribunto extension's "Listen" module not generating expected HTML

I have a MediaWiki installation (1.23) with the Scribunto extension and Module:Listen. I try to invoke this module from an article like so:
{{Listen
|filename = Noise.ogg
|title = Noise
|description = Some noise
}}
This generates the little infobox, but the embedded sound player itself does not appear. I looked at the generated HTML, and the module is just making a second ordinary href to the file:
<div class="haudio">
<div style="padding:2px 0;">Noise</div>
<div style="padding-right:4px;">File:Noise.ogg</div>
<div class="description" style="padding:2px 0 0 0;">Some noise</div></div>
Rather than the second href to the file, I'd expect to see an embedded audio player element or similar. Am I missing some template or Lua module?
You need to have TimedMediaHandler installed for the sound player to show up!

Twitter share button not using custom url or text

I have a link like so
= link_to "https://twitter.com/share", class: "twitter-share-button", data: { url: "https://google.com", text: hack.body, via: "GhettoLifeHack_", hashtags: "ghettolifehack" } do
= image_tag "Tweet", alt: "Social Twitter tweet button"
and no matter how much I change the data-url value, the pre-tweet confirmation page always prepopulates the tweet form field with the url of the referring page, not the one I specified. It also ignores my custom data-text as well.
Why is this happening?
I also have this minified script
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
that I got from here https://about.twitter.com/resources/buttons#tweet
Removing that script doesn't seem to change anything.
Edit: after setting the :'data-url' attributes directly, the output HTML is the same.
I am testing hardcoded strings and dynamically generated urls at the same time. The first is the dynamic one.
<a class="twitter-share-button" href="https://twitter.com/share" data-via="GhettoLifeHack_" data-url="http://localhost:3000/hacks/1" data-text="asdf comment body" data-hashtags="ghettolifehack">
<img src="/images/Tweet" alt="Tweet" title=""></img>
</a>
The second is the hard coded strings
<a class="twitter-share-button" href="https://twitter.com/share" data-via="GhettoLifeHack_" data-url="httpL//google.com" data-text="custom text" data-hashtags="ghettolifehack">
<img src="/images/Tweet" alt="Tweet" title=""></img>
</a>
I've tested on development and in production. Both have the same behavior of pre-populating the tweet form with the referring url, rather than the specified url and text.
This works in Chrome for me but not in Firefox 32
The code provided by you is perfectly fine and should work as expected.
Many site issues can be caused by corrupt cookies or cache. Try to clear both cookies and the cache. I would suggest you to look into the following link to see why it is not working in firefox
The issue was specific to the Firefox browser. I'm not sure what add-on or setting is causing the conflict, but it is working perfectly fine in Chrome, including the popup window.

Ruby and Sinatra: where does this extra string come from and how to eliminate it?

Part of the code in "routes.rb",
...
post '/csr' do
text = PkiSupport::display_csr('/etc/pki/subordinate_ca.csr')
erb :download_csr, :locals => { :csr => text }
end
In "PkiSupport.rb"
...
def display_csr(csr_file)
text = `more #{csr_file}`
return text
end
In "download_csr.erb"
...
<form id="csr-form" action="<%= url "/subordinate_ca/csr" %>" method="post">
<h4>csr</h4>
<textarea cols="80" rows="36" name="csr">
<%= csr %>
</textarea>
</form>
The idea is very simple: when the user requests "/csr", the shell command "more ..." is executed and its output is shown in the textarea of a form.
The content does show up, but with an extra leading string (below): everything ahead of "-----BEGIN...". How can I prevent it?
::::::::::::::
/etc/pki/subordinate_ca.csr
::::::::::::::
-----BEGIN CERTIFICATE REQUEST-----
MIIFATCCAukCAQAwaDETMBEGCgmSJomT8ixkARkWA2NvbTETMBEGCgmSJomT8ixk
ARkWA3h5ejEQMA4GA1UECgwHWFlaIEluYzESMBAGA1UECwwJTWFya2V0aW5nMRYw
FAYDVQQDDA0xMC4xMC4xMzAuMTU4MIICIjANBgkqhkiG9w0BAQEFAAOCAg8AMIIC
CgKCAgEAruWYRn7mjZkHeD+PPLpMSBRoYnLKNvYMte9XneFDh1TItLolYhM4bmWX
gewKOO9+kNY21CoVu1jYZ3q9WitgJlS3tMHPhc6IjuY9DfQ58aeJaZHO8+ISE3Op
l6xNcaxOeHXMlVgdeX4ouyzB2ykJVhu1KtE+XTKilUu6nIrH6ETHrxelBs36Hu1q
...
Thanks.
Most likely, your culprit line of code is this one:
text = `more #{csr_file}`
That code forks and runs the standard Unix more program. Some versions of more will detect if they're run from another program, and output things slightly differently. That's what you're seeing here.
The quickest fix would be to change that line to
text = `cat #{csr_file}`
cat doesn't try to be as smart as more and will give you just the contents of the file.
Now, that said, there's no reason why your Ruby program needs to run a separate program just to read the contents of a file - Ruby has support for reading files directly. So the best fix would be to change that line to:
text = File.read(csr_file)
That will be faster and more portable.
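A minimal sketch of the File.read version, runnable anywhere: the temporary file and the shortened CSR body are placeholders standing in for /etc/pki/subordinate_ca.csr, and display_csr is written as a plain method here rather than inside PkiSupport.

```ruby
require 'tempfile'

# Placeholder CSR content; the real file would hold the full base64 body.
csr_body = "-----BEGIN CERTIFICATE REQUEST-----\n" \
           "MIIFATCC...\n" \
           "-----END CERTIFICATE REQUEST-----\n"

# Read the file directly instead of shelling out to `more` or `cat`.
def display_csr(csr_file)
  File.read(csr_file)
end

# Temporary file standing in for /etc/pki/subordinate_ca.csr.
file = Tempfile.new(['subordinate_ca', '.csr'])
file.write(csr_body)
file.close

text = display_csr(file.path)
file.unlink

puts text
```

File.read returns exactly the bytes in the file, so there is no ":::::" header to strip. It raises Errno::ENOENT if the file does not exist, which the route may want to rescue.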

Resources