How to scrape a dynamic website with Ruby


I want to scrape a React website that has products with names and descriptions. The HTML structure looks like this:
<h6 class="menu-index-page__item-title" data-reactid=".5c2v.$menuItemContent.0">
<span data-reactid=".5c2v.$menuItemContent.0.1">Product name</span>
</h6>
<p class="menu-index-page__item-desc" data-reactid=".5c2v.$menuItemContent.1">
<span data-reactid=".5c2v.$menuItemContent.1.0">
<span data-reactid=".5c2v.$menuItemContent.1.0.0">
<span data-reactid=".5c2v.$menuItemContent.1.0.0.0:$0">Description line 1</span>
<br data-reactid=".5c2v.$menuItemContent.1.0.0.0:$0br">
<span data-reactid=".5c2v.$menuItemContent.1.0.0.$1">
<span data-reactid=".5c2v.$menuItemContent.1.0.0.$1.0">
<span data-reactid=".5c2v.$menuItemContent.1.0.0.$1.0.0">Description line 2</span>
<span data-reactid=".5c2v.$menuItemContent.1.0.0.$1.0.1">…</span>
</span>
</span>
</span>
</p>
If the description has more or fewer lines, the number of span tags changes, which breaks a fixed XPath search.
The only part of the data-reactid that stays the same for each product on each page is:
.$menuItemContent.1.0.0.0:$0
for the first line of the description and
.$menuItemContent.1.0.0.$1.0.0
for the second line of the description.
Could I use a regular expression to grab just this part from the data-reactid attribute?
I am using Nokogiri at the moment.

The prices are most likely loaded dynamically by JavaScript after the page has finished rendering.
To scrape dynamically loaded data, you will need a browser-automation library such as Watir, which can be used from a Rails 5 app.
With Watir, you can wait until the scripts have run and the data has loaded before attempting to scrape the page.
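On the regular expression part of the question: here is a minimal sketch, assuming the data-reactid values always contain the literal .$menuItemContent segment shown above and that the prefix before it (.5c2v here) is the part that varies. STABLE_SUFFIX and stable_suffix are names made up for this example; they are not part of Nokogiri or Watir.

```ruby
# Hypothetical helper: extracts the stable tail of a data-reactid value.
# Assumes the varying prefix (e.g. ".5c2v") always precedes ".$menuItemContent".
STABLE_SUFFIX = /\.\$menuItemContent.*\z/

def stable_suffix(reactid)
  # String#[] with a Regexp returns the matched substring, or nil if no match.
  reactid[STABLE_SUFFIX]
end

first_line  = stable_suffix('.5c2v.$menuItemContent.1.0.0.0:$0')
second_line = stable_suffix('.5c2v.$menuItemContent.1.0.0.$1.0.0')

puts first_line
puts second_line
```

Once the page source has been loaded (for example through Watir and then parsed with Nokogiri), the same helper could be applied to each span's data-reactid attribute to pick out the description lines. Selecting by the menu-index-page__item-desc class may turn out to be simpler if that class is stable.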

Related

Capybara: click on text within <span>

I had a bot that applied to jobs on indeed.com. It would collect jobs and then apply to them one by one. However, Indeed recently made things a lot harder. I used to be able to locate the button by its id, but now the id is dynamic: it changes from one job posting to the next.
Does anyone know how to target the "Apply Now" button (not really a button) given the markup below?
<a class="indeed-apply-button" href="javascript:void(0);" id="indeed-ia-1532137767182-0">
<span class="indeed-apply-button-inner" id="indeed-ia-1532137767182-0inner">
<span class="indeed-apply-button-label" id="indeed-ia-1532137767182-0label">Apply Now</span>
<span class="indeed-apply-button-cm">
<img src="https://d3fw5vlhllyvee.cloudfront.net/indeedapply/s/14096d1/check.png" style="border: 0px;">
</span>
</span>
</a>

There are many ways to click that element; the three simplest are probably:
click_link('Apply Now') # find the link by partial text and click it
click_link(class: 'indeed-apply-button') # find the link by class and click it
find('span', text: 'Apply Now').click # find the span by text and click on it

In my case I had to select an option rendered by the js-chosen library inside an iframe, so it's not a regular select tag. I had to do the following to select it:
within_frame(find("#cci_form")) do
find('#showing_listing_id_chosen').click
find('[data-option-array-index="2"]').click
end
But I had no luck using
find('.active-result[data-option-array-index="2"]').click
or
find('li[data-option-array-index="2"]').click

Multiple html pages using Intel App Framework

I have an app from which I am trying to strip out all of the jQuery Mobile code and use Intel's App Framework instead. I am having trouble figuring out how to integrate multiple HTML pages into the app so that I don't have to keep all my code in a single file. I tried this:
$.ui.loadContent("page2.html");
but that doesn't seem to work. I get a 'loading content' spinner but nothing seems to happen.
How do I link pages together from different files?

OK, so I have figured it out. The documentation can be hard to search, and there is no search box on their website right now. But if you go to the quickstart, then AFUI (on the left), then panel properties, it says:
data-defer="filename.html" - This will load content into the panel
from a remote page/url. This is useful for separating out content into
different files. af.ui.ready is not available until all files are
loaded asynchronously.
So in my index.html file I have something like this:
<div id="afui">
<nav>
<ul class="list">
<li>Post a Lunch</li>
<li>Personal Profile</li>
<li>Select University</li>
</ul>
</nav>
<!--Main View Pages-->
<div class="panel" title="Events" id="event-list_panel" data-defer="event-list.html" data-load="loadMainEventsList"> </div>
<div class="panel" title="Description" id="description_panel" data-defer="description.html" data-load="loadEventDetails"> </div>
<div class="panel" title="Select University" id="select-university_panel" data-defer="select-university.html"> </div>
</div> <!--id="afui"-->
and then I have the details of each page in separate files. In my mind this does a literal copy/paste, and I haven't found any evidence yet that it is anything more than that.
Update:
In AF3, data-defer is now data-include.

jQuery Mobile - Auto Divider

I'm using the jQuery Mobile AutoDivider for a web project of mine and it works great in IE8, but for some reason in Chrome it's not generating the headers for me.
My question is: How exactly does the AutoDivider determine what to make a 'divider'? Is it just the first item within your <li></li>?
Here's my basic HTML structure (it's ultimately placed in an ASP.NET Repeater):
<ul data-role="listview" data-autodividers="true">
<li>
<img src="mySource.jpg" alt="" />
<h3>John Doe</h3>
<p><strong>Company Name Here</strong></p>
<p>User Address</p>
<p class="ui-li-aside">
<strong style="display: none;"><!-- This is what seems to make the headers in IE, placing this right here: -->
Last Name of Employee</strong>
</p>
</li>
</ul>

See the docs: http://jquerymobile.com/demos/1.2.0/docs/lists/docs-lists.html
Autodividers
A listview can be configured to automatically generate dividers for its items. This is
done by adding a data-autodividers="true" attribute to any listview.
By default, the text used to create dividers is the uppercased first letter of the
item's text. Alternatively, you can specify divider text by setting the autodividersSelector option on the listview programmatically.

Using plugins in Jekyll

I have been using Jekyll for a month or two now. I am also new to Ruby, so I'm learning every day.
On my site I want to show the reading time of blog posts, and I have found this Ruby gem which will let me do it:
https://github.com/garethrees/readingtime
I installed it in the normal way into my site's root and added the code needed, and nothing happens. This isn't a shock, because I have no actual link to it in my site's root.
So my site looks like this, HTML-wise:
---
layout: default
---
<div class="twelve columns">
<h3>{{ page.title }}</h3>
<span class="date">Wrote by Josh Hornby</span>
<span class="date">Estimated reading time – <%= #article.body.reading_time %> </span>
<br /> <br />
<%= #article.body %>
{{ content }}
</article>
<div class="twitter_button"> <img src="/images/twitter.png" alt="twitter-logo" width="50" height="50" /> </div>
</div>
<div class="four columns">
<h3>Recent Posts</h3>
<p>Find out what else I've been talking about:</p>
{% for post in site.related_posts limit: 10 %}
<ul class="square">
<li><a class="title" style="text-decoration:none;" href="{{post.url}}"><strong>{{ post.title }}</strong></a>
{% endfor %}
</ul>
</div>
Now I'm not shocked that it's not working, but my question is: how do I install the gem so I can access it in my Jekyll file? Do I need to create a _plugins directory and call it from there? Or won't it work because it's not a Jekyll plugin? In that case I may have a little project writing my own Ruby Jekyll plugin.

As you have surmised, you cannot call arbitrary ruby commands in your html using <%. Instead, you want to write a plugin which defines a Liquid filter. In your html above, you would use the liquid tag to grab the content of the page. You might want to brush up on liquid-for-designers and liquid-for-programmers as well as the Jekyll notes on writing liquid extensions before we dive in, but I'll try to explain.
First, we need to use a Jekyll filter to grab the content of the page, which we will pass to our plugin for analysis. Above you have an #article.body which probably means something to a ruby on rails site, but doesn't mean anything to Jekyll. As you see in the center of your layout file, the content for the page is simply called content. It is pulled in through a Liquid output, indicated by the {{ }}.
In place of the line
<span class="date">Estimated reading time – <%= #article.body.reading_time %> </span>
We want a line that grabs the content and passes it to our plugin:
<span class="date">Estimated reading time – {{ content | readingtime }} </span>
The vertical bar is a filter, meaning pipe content to the function readingtime and include the output. Now we need to write the plugin itself. In the _plugins directory, we create a ruby script following the standard template for a Liquid filter:
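require 'readingtime'

# Liquid filter module; Jekyll loads any .rb file placed in _plugins.
module TextFilter
  # Called as {{ content | readingtime }}; input is the piped string.
  def readingtime(input)
    input.reading_time # reading_time is provided by the readingtime gem
  end
end

# Make the filter available to all Liquid templates.
Liquid::Template.register_filter(TextFilter)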
Save the above as something like readingtime.rb in _plugins. It's fairly self-explanatory: it tells Ruby to load the gem and defines a filter that takes its input and applies the reading_time method to that string.
A little note: content will pull in the HTML version of the content, not a plain text string. I don't know if the readingtime gem needs a plain text string, but you can of course convert between them using a little extra ruby code. If necessary, that's left as an exercise to the reader (though this might help).
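If plain text does turn out to be needed, here is a rough sketch of one way to do the conversion with nothing but the Ruby standard library. Everything in it is an assumption made for illustration: PlainTextReadingTime, strip_tags, plain_reading_minutes and the 200 words-per-minute figure are invented here and are not part of the readingtime gem or of Jekyll, and the regex-based tag stripping is deliberately crude.

```ruby
# Hypothetical helper module; not part of the readingtime gem or Jekyll.
module PlainTextReadingTime
  extend self # also allow calling the methods directly on the module

  WORDS_PER_MINUTE = 200 # assumed average reading speed

  # Crude tag stripper: replaces anything between < and > with a space.
  def strip_tags(html)
    html.gsub(/<[^>]*>/, ' ')
  end

  # Estimated reading time in whole minutes, rounded up.
  def plain_reading_minutes(html)
    words = strip_tags(html).split.size
    (words / WORDS_PER_MINUTE.to_f).ceil
  end
end

# In a Jekyll plugin, the module could be registered the same way as above:
#   Liquid::Template.register_filter(PlainTextReadingTime)

puts PlainTextReadingTime.plain_reading_minutes('<p>' + ('word ' * 400) + '</p>')
```

With the module registered as a filter, the layout line would become {{ content | plain_reading_minutes }}. A real HTML parser would be more robust than the regex if the content contains scripts or unusual markup.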

Many little partials take up lots of time to render. Why, and how can I speed this up?

I have some 'boxes' that use a javascript scrolling library to display content. The box contains 4 visible content nuggets like this:
<div class="item nugget lesson">
<h3>
<a href="/en/dance_genres/22-authentic-jazz" title="Details and Information for 'Authentic Jazz'">
Authentic Jazz
</a>
</h3>
<div class="thumb">
<a href="/en/dance_genres/22-authentic-jazz" title="Details and Information for 'Authentic Jazz'">
<img alt="22" src="http://common-resources.idance.net.s3.amazonaws.com/images/model_resources/dance_genres/thumb/22.jpg">
</a>
</div>
History: Grounded in vintage videos, the modern revival of ...
<br>
<a href="/en/dance_genres/22-authentic-jazz" title="Details and Information for 'Authentic Jazz'">
<img alt="Lesson_view" src="/images/objects/lesson_view.png?1276105734">
</a>
</div>
When I render more than 50 of these partials, the Rails rendering time is noticeably slow (over 2 seconds). I've optimized the sh*% out of my db queries and even added counter_cache fields, so that is not the cause of the slowdown.
I'm not talking about load in the browser, but rails processing time.
Please see load times here: http://pastebin.com/pSrNSSsF
Is this normal?

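This is normal. You can try rendering the partial as a collection (render with the partial and collection options, instead of calling render inside a loop) for a bit of a performance gain. Or cache the rendered output.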
