How do I get the browser to wait with Capybara & Kimurai?

I'm scraping [this page][1] for details of schools. The details are contained within the CSS selectors .box .column, which sit inside a div .schools that is loaded dynamically and takes some time to appear.
I've done this with the watir gem without any problems. Here's the code for reference:
browser = Watir::Browser.new
browser.goto('https://educationdestinationmalaysia.com/schools/pre-university')
js_doc = browser.element(css: '.schools').wait_until(&:present?)
schools_list = Nokogiri::HTML(js_doc.inner_html)
school_cards = schools_list.css('.box .columns .column:nth-child(2)')
I'm now trying to achieve the same with the kimurai gem but I'm not really familiar with Capybara.
What I've Tried
Changing the default max wait time
def parse(response, url:, data: {})
  Capybara.default_max_wait_time = 20
  puts browser.has_css?('div.schools')
end
using_wait_time
browser.using_wait_time(20) do
  puts browser.has_css?('.schools')
end
Passing in a wait argument to has_css?
browser.has_css?('.schools', wait: 20)
Thanks for reading!
[1]: https://educationdestinationmalaysia.com/schools/pre-university

Your Watir code
js_doc = browser.element(css: '.schools').wait_until(&:present?)
returns the element, but in your Capybara code you're calling predicate methods (has_css?, has_xpath?, has_selector?, etc) that just return true or false. Those predicate methods will only wait if Capybara.predicates_wait is true. Is there a specific reason you're using the predicates though? Instead you can just find the element you're interested in, which will wait up to Capybara.default_max_wait_time or you can specify a custom wait option. The "equivalent" to your Watir example of
js_doc = browser.element(css: '.schools').wait_until(&:present?)
schools_list = Nokogiri::HTML(js_doc.inner_html)
school_cards = schools_list.css('.box .columns .column:nth-child(2)')
assuming you had Capybara.default_max_wait_time set to a number high enough for your app and testing setup, would be:
school_cards = browser.find('.schools').all('.box .columns .column:nth-child(2)')
If you do need to extend the wait for one of the finds you could do
school_cards = browser.find('.schools', wait: 10).all('.box .columns .column:nth-child(2)')
to wait up to 10 seconds for the .schools element to appear. This could also just be collapsed into
school_cards = browser.all('.schools .box .columns .column:nth-child(2)')
which will also wait (up to Capybara.default_max_wait_time) for at least one matching element to exist before returning, although depending on your exact HTML
school_cards = browser.all('.schools .column:nth-child(2)')
may be just as good and less fragile.
Note: you do have to be using a Kimurai engine that supports JS (https://github.com/vifreefly/kimuraframework#available-engines), otherwise you won't be able to interact with dynamic websites.
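Putting that together, a complete spider might look something like the following. This is just a sketch: it assumes the :selenium_chrome engine and reuses the selectors from your question, and the class and spider names are placeholders.
require 'kimurai'

class SchoolsSpider < Kimurai::Base
  @name = "schools_spider"
  @engine = :selenium_chrome # any JS-capable engine from the list above
  @start_urls = ["https://educationdestinationmalaysia.com/schools/pre-university"]

  def parse(response, url:, data: {})
    # all() waits (up to Capybara.default_max_wait_time, or the :wait option)
    # for at least one matching element before returning
    school_cards = browser.all('.schools .box .columns .column:nth-child(2)', wait: 20)
    school_cards.each { |card| puts card.text }
  end
end

SchoolsSpider.crawl!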

Related

Jobs update with Dashing and Ruby

I use Dashing to monitor trends and website statistics.
I created jobs to check Google News trends and Twitter trends.
The data displays well; however, it only appears on the first load and is never updated after that. Here is the code for twitter_trends.rb:
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'
data = Nokogiri::HTML(open(url))
list = data.xpath('//ol/li')

tags = list.collect do |tag|
  tag.xpath('a').text
end
tags = tags.take(10)

tag_counts = Hash.new({value: 0})

SCHEDULER.every '10s' do
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
I think I used rufus-scheduler incorrectly to schedule my jobs: https://gist.github.com/pushmatrix/3978821#file-sample_job-rb
How can I make the data update correctly on a regular basis?
Your scheduler looks fine, but it looks like you're making one call to the website:
data = Nokogiri::HTML(open(url))
But never calling it again. Do you intend to check that site only once, along with its initial processing?
I assume you'd really want to wrap more of your logic into the scheduler loop; only things in there will be rerun when the scheduled job fires.
Once you've moved everything into the scheduler, you are only taking one sample every 10 seconds (http://ruby-doc.org/core-2.2.0/Array.html#method-i-sample) and then adding it to tag_counts. This clears the tags each time; the thing to remember about schedulers is that every run starts from a clean slate. I'd recommend looping through tags and adding each one to tag_counts instead of sampling. Sampling is unnecessary anyway, seeing as you already reduce the list to 10 each time the scheduler runs.
If I move the SCHEDULER block like this (just after url at the top), it works, but only one random item appears every 10 seconds.
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'

SCHEDULER.every '10s' do
  data = Nokogiri::HTML(open(url))
  list = data.xpath('//ol/li')
  tags = list.collect do |tag|
    tag.xpath('a').text
  end
  tags = tags.take(10)
  tag_counts = Hash.new({value: 0})
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
How can I display a list of 10 items that updates regularly?
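Looping over all the tags inside the scheduled block, as suggested above, should get you there. Here's a sketch based on your own code (the URL, xpath, and send_event call are unchanged):
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'

SCHEDULER.every '10s' do
  # re-fetch and re-parse the page on every tick so the data stays current
  data = Nokogiri::HTML(open(url))
  tags = data.xpath('//ol/li').collect { |li| li.xpath('a').text }.take(10)
  # build an item for every tag instead of sampling a single one
  send_event('twitter_trends', items: tags.map { |tag| { label: tag } })
end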

How to access elements from infinite scroll with Capybara / Poltergeist and Rails

For a pedagogical project I am trying to count the number of lesson elements on the following page: https://www.edx.org/course/subject/computer-science
I am using Poltergeist as a web driver to access the page, but since the page uses a JavaScript function to add more entries after page load as the user scrolls down, I need to replicate that scrolling with Poltergeist.
I have tried to scroll down using:
evaluate_script("page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };")
or
execute_script("page.scrollPosition = { top: page.scrollPosition + 1000, left: 0 };")
It does not seem to work.
Is there any way for Poltergeist to get to the bottom of the page so that the javascript loads all the elements in the (in)finite loop?
Once they are loaded, they are easy to count.
execute_script is called to execute JavaScript in the "browser". I'm not sure what the page object you're trying to set values on is, but you probably want something more like
execute_script('window.scroll(0,1000);')
As a more complete example
@session.visit 'https://www.edx.org/course/subject/computer-science'
count = @session.all(:css, '.discovery-card', minimum: 1).length
puts "there are #{count} discovery cards"
@session.execute_script('window.scroll(0,1000);')
new_count = @session.all(:css, '.discovery-card', minimum: count + 1, wait: 30).length
puts "there are now #{new_count} discovery cards"

Nokogiri Timeout::Error when scraping own site

Nokogiri works fine for me in the console, but if I put it anywhere in the app (Model, View, or Controller), it times out.
I'd like to use it in one of two ways...
Controller
def show
  @design = Design.find(params[:id])
  doc = Nokogiri::HTML(open(design_url(@design)))
  images = doc.css('.well img') ? doc.css('.well img').map { |i| i['src'] } : []
end
or...
Model
def first_image
  doc = Nokogiri::HTML(open("http://localhost:3000/blog/#{self.id}"))
  image = doc.css('.well img')[0] ? doc.css('.well img')[0]['src'] : nil
  self.update_attribute(:photo_url, image)
end
Both result in a timeout, though they work perfectly in the console.
When you run your Nokogiri code from the console, you're referencing your development server at localhost:3000. Thus, there are two instances running: one making the call (your console) and one answering the call (your server).
When you run it from within your app, you are referencing the app itself, which is causing an infinite loop since there is no available resource to respond to your call (that resource is the one making the call!). So you would need to be running multiple instances with something like Unicorn (or simply another localhost instance at a different port), and you would need at least one of those instances to be free to answer the Nokogiri request.
If you plan to run this in production, just know that this setup will require an available resource to answer the Nokogiri request, so you're essentially tying up 2 instances with each call. So if you have 4 instances and all 4 happen to make the call at the same time, your whole application is screwed. You'll probably experience pretty severe degradation with only 1 or 2 calls at a time as well...
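For example, the simplest version of that workaround in development (a sketch, assuming a second instance on port 3001) would be:
# In a second terminal, boot another instance: rails server -p 3001
# Point Nokogiri at that instance so a free process can answer the call
doc = Nokogiri::HTML(open("http://localhost:3001/blog/#{self.id}"))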
I'm not sure what the default timeout value is, but you can specify one like below.
require 'net/http'
require 'nokogiri'

http = Net::HTTP.new('localhost', 3000)
http.open_timeout = 100
http.read_timeout = 100
Nokogiri.parse(http.get("/blog/#{self.id}").body)
That way you can track down the problem, since you control the timeout value.
So, with Tyler's advice I dug into what I was doing a bit more. Because of the disconnect ckeditor has with the images, due to carrierwave and S3, I can't get any info directly from the uploader (at least it seems that way to me).
Instead, I'm sticking with Nokogiri, and it's working wonderfully. I realized what I was actually doing with the open() command, and it was completely unnecessary. Nokogiri parses HTML, and I can give it HTML in the form of @design.content! Duh, on my part.
So, this is how I'm scraping my own site, to get the images associated with a blog entry:
designs_controller.rb
def create
  params[:design][:photo_url] = Nokogiri::HTML(params[:design][:content]).css('img').map { |i| i['src'] }[0]
  @design = Design.new(params[:design])
  if @design.save
    flash[:success] = "Design created"
    redirect_to designs_url
  else
    render 'designs/new'
  end
end

def show
  @design = Design.find(params[:id])
  @categories = @design.categories
  @tags = @categories.map { |c| c.name }
  @related = Design.joins(:categories).where('categories.name' => @tags).reject { |d| d.id == @design.id }.uniq
  set_meta_tags og: {
    title: @design.name,
    type: 'article',
    url: design_url(@design),
    image: Nokogiri::HTML(@design.content).css('img').map { |i| i['src'] },
    article: {
      published_time: @design.published_at.to_datetime,
      modified_time: @design.updated_at.to_datetime,
      author: 'Alphabetic Design',
      section: 'Designs',
      tag: @tags
    }
  }
end
The Update action has the same code for Nokogiri as the Create action.
Seems kind of obvious now that I'm looking at it, lol. I dwelled on this for longer than I'd like to admit...

Is there a way to send key presses to Webkit using Capybara?

I need to send some key-presses to a web app in an integration test that uses Capybara and WebKit. Using Selenium (WebDriver and Firefox) I can achieve it like this:
find("#element_id").native.send_keys :tab
but WebKit's native element node doesn't have a send_keys method. Actually, native in WebKit returns a string containing a number. Is there another way to send keystrokes to WebKit? Maybe even some workaround using JavaScript/jQuery?
I've been trying to implement Marc's answer without any success, but I found some help in a similar question: capybara: fill in form field value with terminating enter key. And apparently there was a pull request against Capybara that seems to address this issue.
What worked for me was:
before { fill_in "some_field_id", with: "\t" }
My example erases the text in the field and then presses Tab. To fill in a field with 'foobar', replace "\t" with "foobar\t". You can also use "\n" for the Enter key.
For your example, you could use:
find("#element_id").set("\t")
This worked for me with Poltergeist, to trigger the asterisk key:
find("body").native.send_key("*")
I had no luck with the other solutions; not even Syn.
This was to trigger an angular-hotkeys event.
You can do it like this:
keypress_script = "var e = $.Event('keydown', { keyCode: #{keycode} }); $('body').trigger(e);"
page.driver.browser.execute_script(keypress_script)
Since capybara-webkit 1.9.0 you can send key presses like Enter and others using send_keys:
find("textarea#comment").send_keys(:enter)
Source: https://github.com/thoughtbot/capybara-webkit/issues/191#issuecomment-228758761
Capybara API Docs: http://www.rubydoc.info/github/jnicklas/capybara/Capybara%2FNode%2FElement%3Asend_keys
I ended up doing the following:
Capybara.current_driver = Capybara.javascript_driver
keypress_script = "$('input#my_field').val('some string').keydown();"
page.driver.browser.execute_script(keypress_script)
I discovered in Chrome, testing my JavaScript, that actually creating an $.Event with keyCode or charCode and then triggering that on my input field didn't put the characters in the input. I was testing autocompletion which required a few characters be in the input field, and it would start the autocompletion on keydown. So I set the input value manually with val, then trigger keydown to cause the autocompletion script to start.
For simple cases, triggering a keypress event in JS will work:
def press(code)
  page.execute_script("$('#my-input').trigger($.Event('keypress', {keyCode: #{code}}))")
end
For a more general and robust answer, use the Syn library, which goes to the trouble of triggering the right events (i.e. keydown, then keypress, and finally keyup).
def type(string)
  page.execute_script("Syn.click({}, 'my-input').wait().type(#{string.to_json})")
end
A more complex example can be found here
Here is my solution, which works with capybara 2.1.0:
fill_in('token-input-machine_tag_list', :with => 'new tag name')
page.evaluate_script("var e = $.Event('keydown', { keyCode: 13 }); $('#token-input-machine_tag_list').trigger(e);") # Press enter
Please note that in newer Capybara versions you have to use page.evaluate_script.
For Capybara Webkit, this is the solution I used:
def press_enter(input)
  script = "var e = jQuery.Event('keypress');"
  script += "e.which = 13;"
  script += "$('#{input}').trigger(e);"
  page.execute_script(script)
end
Then I use it cleanly in my test like:
press_enter("textarea#comment")

Delaying a method based on page results

I am retrieving results from NCBI's online Blast tool with 'net/http' and 'uri'. To do this I have to search through an HTML page to check whether one of the lines is "Status=WAITING" or "Status=READY". When the Blast tool has finished, the status changes to READY and the results are posted on the HTML page.
I have a working version that checks the status and then retrieves the information I need, but it is inefficient and split across two methods, when I believe there should be some way to combine them into one.
def waitForBlast(rid)
  get = Net::HTTP.post_form(URI.parse('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'), {:RID => "#{rid}", :CMD => 'Get'})
  get.body.each_line { |line| (waitForBlast(rid) if line.strip == "Status=WAITING") if line[/Status=/] }
end
def returnBlast(rid)
  blast_array = Array.new
  get = Net::HTTP.post_form(URI.parse('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?'), {:RID => "#{rid}", :CMD => 'Get'})
  get.body.each_line { |line| blast_array.push(line[/<a href=#\d+>/][/\d+/]) if line[/<a href=#\d+>/] }
  return blast_array
end
The first method checks the status and is my main concern because it is recursive. I believe (and correct me if I'm wrong) that, designed as is, it takes too much computing power when all I need is some way to recheck the results within the same method (adding a time delay would be a bonus). The second method is fine, but I would prefer it be combined with the first somehow. Any help appreciated.
Take a look at this implementation. This is what he does:
res = 'http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get&FORMAT_OBJECT=SearchInfo&RID=' + @rid
while status = open(res).read.scan(/Status=(.*?)$/).to_s == 'WAITING'
  @logger.debug("Status=WAITING")
  sleep(3)
end
I think using the string scanner might be a bit more efficient than iterating over every line of the page, but I haven't looked at its implementation so I may be wrong.
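Combining your two methods with that polling approach, a single method could look something like this. A sketch reusing your request code; the method name is a placeholder and the 3-second delay is arbitrary:
require 'net/http'
require 'uri'

def blast_results(rid)
  uri = URI.parse('http://www.ncbi.nlm.nih.gov/blast/Blast.cgi')
  body = nil
  loop do
    body = Net::HTTP.post_form(uri, :RID => rid, :CMD => 'Get').body
    break unless body =~ /Status=WAITING/
    sleep(3) # wait before polling again instead of recursing immediately
  end
  # once the status is READY, pull the result ids out of the same response
  body.each_line.select { |line| line[/<a href=#\d+>/] }.map { |line| line[/<a href=#\d+>/][/\d+/] }
end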
