Nokogiri and Mechanize help (navigating to pages via div class and scraping) - ruby-on-rails

I need help clicking on some elements via div class, not by text of link, to get to a page to scrape some data.
Starting with the page http://www.salatomatic.com/b/United-States+125, how do I click on each state's name without using the text of the link but by the div class?
After clicking on a state, for example http://www.salatomatic.com/b/Alabama+7, I need to click on a region in the state, again by div class, not text of the link.
Inside a region, for example http://www.salatomatic.com/c/Birmingham+12, I want to loop through, clicking on each of the items (11 mosques in this example).
Inside the item/mosque, I need to scrape the address (at the top under the title of the mosque) and store/create it in my database.
UPDATES:
I have this now:
require 'nokogiri'
require 'open-uri'
require 'mechanize'
agent = Mechanize.new
page = agent.get("http://www.salatomatic.com/b/United-States+125")
# loops through all state links
page.search('.subtitleLink a').map { |a| page.uri.merge a[:href] }.each do |uri|
  page2 = agent.get uri
  # loops through all regions in each state
  page2.search('.subtitleLink a').map { |a| page2.uri.merge a[:href] }.each do |uri|
    page3 = agent.get uri
    # loops through all places in each region
    page3.search('.subtitleLink a').map { |a| page3.uri.merge a[:href] }.each do |uri|
      page4 = agent.get uri
      # I'm able to grab the title of the place but not sure how to get the address b/c there is no div around it.
      puts page4.at('.titleBM')
      # I'm guessing I would use some regex/xpath here to get the address, but how would that work?
      # This is the structure of the title/address in HTML:
      # <td width="100%"><div class="titleBM">BIS Hoover Crescent Islamic Center </div>2524 Hackberry Lane, Hoover, AL 35226</td>
      # This is the listing page: http://www.salatomatic.com/d/Hoover+12446+BIS-Hoover-Crescent-Islamic-Center
    end
  end
end

It's important to make sure the a[:href]'s are converted to absolute urls first though.
Therefore, maybe:
page.search('.subtitleLink a').map { |a| page.uri.merge a[:href] }.each do |uri|
  page2 = agent.get uri
end
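As a rough illustration of what the merge step does: `page.uri` is a standard-library URI object, so `merge` resolves each href against the URL of the page it was found on. The sketch below uses only Ruby's `uri` library; the href values are made up for the example and are not taken from the site.

```ruby
require 'uri'

# The page the links were found on.
base = URI.parse('http://www.salatomatic.com/b/United-States+125')

# Hypothetical href values: root-relative, relative, and already absolute.
hrefs = ['/b/Alabama+7', 'Alaska+8', 'http://www.salatomatic.com/b/Arizona+9']

# URI#merge resolves each href against the base; absolute URLs pass through unchanged.
absolute = hrefs.map { |href| base.merge(href).to_s }

puts absolute
```

Whatever form the site uses for its hrefs, the result can be passed to `agent.get` without depending on the agent's current page.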

For the pages of US and regions you can do:
agent = Mechanize.new
page = agent.get('http://www.salatomatic.com/b/United-States+125')
page.search("#header a").each { |a| ... }
Here inside the block you can find corresponding link and click:
page.link_with(text: a.text).click
or ask mechanize to load the page by href:
region_page = agent.get a[:href]
Inside the region you can do the same, just search like
page.search(".tabTitle a").each ...
for Tabs (Restaurants, Markets, Schools etc.) And like
page.search(".subtitleLink a").each ...
How to find these things? Try some bookmarklets like SelectorGadget or similar, dig in HTML source code and find common parents/classes for links you are interested in.
UPDATED: getting the page by href, as @pguardiario suggested.

Related

Why is Nokogiri not finding this img src?

I want to get the image from this URL:
doc_autobip = Nokogiri::HTML(URI.open('https://www.autobip.com/fr/actualite/sappl_mercedes_benz_livraison_de_282_camions_mercedes_benz/16757'))
The img tag is:
<img src="https://www.autobip.com/storage/photos/articles/16757/sappl_mercedes_benz_livraison_de_282_camions_mercedes_benz_2020-08-12-09-1087474.jpg" class="fotorama__img">
Logically, this should work:
src_img = article.css('img.fotorama__img').map { |link| link['src'] }
But I always get src_img = [].
Any ideas, please?
The HTML class fotorama__img is added to the image dynamically. Although you can see it when you inspect the page, it is not there when you View Source.
Nokogiri gets the source of the page and doesn't wait for the JavaScript on the page to execute.
You can try something like this, which should work:
doc_autobip = Nokogiri::HTML(URI.open('https://www.autobip.com/fr/actualite/sappl_mercedes_benz_livraison_de_282_camions_mercedes_benz/16757'))
# the div wrapping the image has the classes "fotorama mnmd-gallery-slider mnmd-post-media-wide"
doc_autobip.css('.fotorama.mnmd-gallery-slider.mnmd-post-media-wide img').map { |link| link['src'] }
This is just to show it works. You can choose wisely which element & classes to use to make it work.
Update:
Or, if you want the JavaScript-rendered content of the page, you can use Watir:
require 'nokogiri'
require 'watir'
browser = Watir::Browser.new
browser.goto 'https://www.autobip.com/fr/actualite/sappl_mercedes_benz_livraison_de_282_camions_mercedes_benz/16757'
doc = Nokogiri::HTML.parse(browser.html)
doc.css('img.fotorama__img').map { |link| link['src'] }
Note that you'll need to install additional browser drivers to use Watir.

How to link to my root_path plus an anchor using HAML

I have this snippet of code using haml
%li.tab.col.s3.l2
  %a{:href => "#main_kpis"} KPIs
#main_kpis.col.s12.no-padding
  .col.s12.light-sky-grid
    #kpis_wrapper
When I click on the li>tab element, I go directly to the div main_kpis, but I need to send the full URL to the browser.... How can I do that?
%li.tab.col.s3.l2
  = link_to("KPIs", root_url(anchor: 'main_kpis'))
#main_kpis.col.s12.no-padding
  .col.s12.light-sky-grid
    #kpis_wrapper

click on img capybara

Got this page https://ru.aliexpress.com/store/product/Original-xiaomi-Redmi-note3-snapdragon-650-4000mAh-13ML-1080P-3G-32G-5-5-screen-octa-core/1986585_32622877163.html?detailNewVersion=&categoryId=5090301
and I think I need to click on each color modification. The config and the driver work fine, because I can interact with other elements on the page.
Code
page.all(:css, '.item-sku-image img').each_with_index do |mod, i|
  find(:xpath, "//img[@title='#{mod['title']}']").find(:xpath, "..").click
  # with this line I'm checking whether I clicked on the img block, because its parent node changes its class to active
  puts find(:xpath, "//img[@title='#{mod['title']}']").find(:xpath, "..").find(:xpath, "..")['class']
end
I have no idea why I can click on every single item or link on this page except this block of images (using Poltergeist).
There is no need to refind the element, and in fact it should be perfectly fine to click the img element itself.
page.all(:css, '.item-sku-image img').each_with_index do |mod, i|
  mod.click # mod.find(:xpath, "./..").click if you do need to click the parent
  puts mod.find(:xpath, "./ancestor::li").matches_selector?(:css, '.active')
end

Rendering dynamic scss-files with ajax, rails

As the title suggests, my main objective is to render a dynamic scss(.erb) file after an ajax call.
assets/javascripts/header.js
// onChange of a checkbox, a database boolean field should be toggled via AJAX
$( document ).ready(function() {
  $('input[class=collection_cb]').change(function() {
    // get the id of the item
    var collection_id = $(this).parent().attr("data-collection-id");
    // show a loading animation
    $("#coll-loading").removeClass("vhidden");
    // AJAX call
    $.ajax({
      type : 'PUT',
      url : "/collections/" + collection_id + "/toggle",
      success : function() {
        // removal of loading animation, a bit delayed, as it would be too fast otherwise
        setTimeout(function() {
          $("#coll_loading").addClass("vhidden");
        }, 300);
      },
    });
  });
});
controller/collections_controller.rb
def toggle
  # safety measure to check if the user changes his collection
  if current_user.id == Collection.find(params[:id]).user_id
    collection = Collection.find(params[:id])
    # toggle the collection
    collection.toggle! :auto_add_item
  else
    # redirect the user to error page, alert page
  end
  render :nothing => true
end
Everything worked smoothly when I only toggled the database object.
Now I wanted to add some extra spice and change the CSS of my 50+ li's according to the user's currently selected collections.
My desired CSS looks like this: it checks whether the li elements belong to the collections and gives them a border color if so.
ul#list > li[data-collections~='8'][data-collections~='2']
{
border-color: #ff2900;
}
I added this to my controller to generate the []-conditions:
def toggle
  # .
  # .
  # toggle function
  # return the currently selected collection ids in the [data-collections]-format
  @active_collections = ""
  c_ids = current_user.collections.where(:auto_add_item => true).pluck('collections.id')
  if c_ids.size != 0
    c_ids.each { |id| @active_collections += "[data-collections~='#{id}']" }
  end
  # this is what gets retrieved
  # @active_collections => [data-collections~='8'][data-collections~='2']
end
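The same selector string can be built without the size check or the `+=` loop. This is a minimal sketch in plain Ruby; the hard-coded id list is an assumption standing in for the result of the `pluck` call.

```ruby
# Hypothetical ids standing in for the result of the pluck call.
c_ids = [8, 2]

# Build one attribute selector per id and concatenate them.
# An empty id list simply produces an empty string.
active_collections = c_ids.map { |id| "[data-collections~='#{id}']" }.join

puts active_collections
```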
Now I need a way to put those attribute selectors into an SCSS file that gets generated dynamically.
I tried adding:
respond_to do |format|
  format.css
end
to my controller, having the file views/collections/toggle.css.erb
ul#list<%= raw active_collections %> > li<%= raw active_collections %> {
  border-color: #ff2900;
}
It didn't work. Another way was rendering the CSS file from my controller and then passing it to a view, as described by Manuel Meurer.
Did I mess up with the file names? Like using css instead of scss? Do you have any ideas how I should proceed?
Thanks for your help!
Why dynamic CSS? - reasoning
I know that this should normally happen by adding classes via JavaScript. My reasoning for needing dynamic CSS is that when the user decides to change the selected collections, he does so in concentrated bursts: something like 4 calls in 3 seconds, then a 5-minute pause, then 5 calls in 4 seconds. The JavaScript would simply take too long to loop through the 50+ li's after every call.
UPDATE
As it turns out, JavaScript was very fast at handling my "long" list... Thanks y'all for pointing out the errors in my thinking!
In my opinion, the problem you've got isn't to do with CSS; it's to do with how your system works.
CSS is loaded statically (with the HTTP request), which means that once the page is rendered, it will not update if you change the CSS files on the server.
JS is client-side and is designed to interact with rendered HTML elements (through the DOM). This means that JS is dynamic by nature, which is why we can use it with technologies like Ajax to change parts of the page.
Here's where I think your problem comes in...
Your JS call is not reloading the page, which means the CSS stays static. There is currently no way to reload the CSS and have it render without refreshing (sending an HTTP request). This means that any updating you do with JS will have to rely on pre-loaded CSS.
As per the comments on your question, you should really look at updating the classes of your list elements. If you use something like this, it should work instantaneously:
$('li').addClass('new');
Hope this helps?
If I understood your feature correctly, everything you need can be done simply with JavaScript, with no need for any hack.
Let me lay out your feature first:
Given a user visiting the page
When he checks a checkbox
He will see a loading sign, which implies this is an interaction with the server
When the loading sign stops
He will see that the row (or li) he checked has a border, which implies his action has been accepted by the server
Then comes the solution. For readability I will simplify your loading sign code into named functions instead of real code.
$(document).ready(function() {
  $('input[class=collection_cb]').change(function() {
    // Use a variable to store parent of current scope for using later
    var $parent = $(this).parent();
    // get the id of the item
    var collection_id = $parent.attr("data-collection-id");
    show_loading_sign();
    // AJAX call
    $.ajax({
      type : 'PUT',
      url : "/collections/" + collection_id + "/toggle",
      success : function() {
        // This is the effect you need.
        $parent.addClass('green_color_border');
      },
      error: function() {
        $parent.addClass('red_color_border');
      },
      complete: function() {
        close_loading_sign(); /*Close the sign no matter success or error*/
      }
    });
  });
});
Let me know if my understanding of the feature is correct and if this could solve the problem.
What if, when the user toggles a collection selection, you use jQuery to change one class on the ul and then define static styles based on that?
For example, your original markup might be:
ul#list.no_selection
  li.collection8.collection2
  li.collection1
And your css would have, statically:
ul.collection1 li.collection1,
ul.collection2 li.collection2,
...
ul.collection8 li.collection8 {
border-color: #ff2900;
}
So by default, there wouldn't be a border. But if the user selects collection 8, your jquery would do:
$('ul#list').addClass('collection8')
and voila, border around the li that's in collection8-- without looping over all the lis in javascript and without loading a stylesheet dynamically.
What do you think, would this work in your case?
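Since the static rule has one selector per collection, it could be generated once on the server instead of written by hand. This is a small sketch in plain Ruby; the id list is hypothetical and would come from the database in a real app.

```ruby
# Hypothetical list of collection ids; in the app these would come from the database.
collection_ids = [1, 2, 8]

# One "ul.collectionN li.collectionN" selector per collection.
selectors = collection_ids.map { |id| "ul.collection#{id} li.collection#{id}" }

# Join the selectors into a single static rule.
css = "#{selectors.join(",\n")} {\n  border-color: #ff2900;\n}\n"

puts css
```

The generated stylesheet stays static for the lifetime of the page; only the class on the ul changes when the user toggles a collection.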

Mechanize not recognizing anchor tags via CSS selector methods

(Hope this isn't a breach of etiquette: I posted this on RailsForum, but I haven't been getting much response from there recently.)
Has anyone else had problems with Mechanize not recognizing anchor tags via CSS selectors?
The HTML looks like this (snippet with white space removed for clarity):
<td class='calendarCell' align='left'>
10
<p style="margin-bottom:15px; line-height:14px; text-align:left;">
<span class="sidenavHeadType">
Current Events</span><br />
<b><a href="http://www.mysite.org/index.php/site/
Clubs/banks_and_the_fed" class="a2">Banks and the Fed</a></b>
<br />
10:30am- 11:45am
</p>
I'm trying to collect the data from these events. Everything is working except getting the anchor within the <p>. There's clearly an <a> tag inside the <b>, and I'm going to need to follow that link to get further details on this event.
In my rake task, I have:
agent.page.search(".calendarCell,.calendarToday").each do |item|
  day = item.at("a").text
  item.search("p").each do |e|
    anchor = e.at("a")
    puts anchor
    puts e.inner_html
  end
end
What's interesting is that the item.at("a") always returns the anchor. But the e.at("a") returns nil. And when I do inner_html on the p element, it ignores the anchor entirely. Example output:
nil
<span class="sidenavHeadType">
Photo Club</span><br><b>Indexing Slide Collections</b>
<br>
2:00pm- 3:00pm
However, when I run the same scrape directly with Nokogiri:
doc.css(".calendarCell,.calendarToday").each do |item|
  day = item.at_css("a").text
  item.css("p").each do |e|
    link = e.at_css("a")[:href]
    puts e.inner_html
  end
end
It recognizes the <a> inside the <p>, and it will return the href, etc.
<span class="sidenavHeadType">
Bridge Party</span><br><b>Party Bridge</b>
<br>
7:00pm- 9:00pm
Mechanize is supposed to use Nokogiri, so I'm wondering if I have a bad version or if this affects others as well.
Thanks for any leads.
Never mind. False alarm. In my Nokogiri task, I was pointing to a local copy of the page that included the anchors. The live page required a login, so when I browsed to it, I could see the a tags. Adding the login to the rake task solved it.
