Can't Identify Proper CSS Selector to Scrape with Mechanize

I have built a web scraper that is successfully pulling almost everything I need out of the web page I'm looking at. The goal is to pull the URL for a particular image associated with all the coffees found at a particular URL.
The rake task I have defined to complete the scraping is as follows:
mechanize = Mechanize.new
mechanize.get(url) do |page|
  page.links_with(:href => /products/).each do |link|
    coffee_page = link.click
    bean = Bean.new
    bean.acidity = coffee_page.css('[data-id="acidity"]').text.strip.gsub("acidity ","")
    bean.elevation = coffee_page.css('[data-id="elevation"]').text.strip.gsub("elevation ","")
    bean.roaster_id = "2"
    bean.harvest_season = coffee_page.css('[data-id="harvest"]').text.strip.gsub("harvest ","")
    bean.price = coffee_page.css('.price-wrap').text.gsub("$","")
    bean.roast_profile = coffee_page.css('[data-id="roast"]').text.strip.gsub("roast ","")
    bean.processing_type = coffee_page.css('[data-id="process"]').text.strip.gsub("process ","")
    bean.cultivar = coffee_page.css('[data-id="cultivar"]').text.strip.gsub("cultivar ","")
    bean.flavor_profiles = coffee_page.css('.price-wrap+ p').text.strip
    bean.country_of_origin = coffee_page.css('#pdp-order h1').text.strip
    bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
    if bean.country_of_origin == "Origin Set" || bean.country_of_origin == "Gift Card (online use only)"
      bean.destroy
    else
      ap bean
    end
  end
end
Now, the information I need is all on the page. I'm looking for the image URL, found like the example below, for each of the individual coffee_pages linked from the source page. The selector needs to be generic enough to pull this picture source but nothing else. I've tried a number of different CSS selectors, but everything returns either nil or blank.
<img src="//cdn.shopify.com/s/files/1/2220/0129/products/ceremony-product-gummy-bears_480x480.jpg?v=1551455589" alt="Burundi Kiryama" data-product-featured-image style="display:none">
The coffee_page I'm on is here: https://shop.ceremonycoffee.com/products/burundi-kiryama

You need to change
bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
to
bean.image_url = coffee_page.css('#mobile-only>img').attr('src')
Wherever possible, use nearby identifiers to locate the element you want to access.
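Note that the markup you posted also carries a data attribute you could target directly. As a minimal sketch (untested against the live page), an attribute selector plus a nil guard avoids blowing up on pages without a featured image:
image = coffee_page.css('img[data-product-featured-image]').first
bean.image_url = image['src'] if image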

Related

Jobs update with Dashing and Ruby

I use Dashing to monitor trends and website statistics.
I created jobs to check Google News trends and Twitter trends.
The data is displayed well; however, it only appears on the first load and is never updated after that. Here is the code for twitter_trends.rb:
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'
data = Nokogiri::HTML(open(url))
list = data.xpath('//ol/li')
tags = list.collect do |tag|
  tag.xpath('a').text
end
tags = tags.take(10)
tag_counts = Hash.new({value: 0})

SCHEDULER.every '10s' do
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
I think I used "rufus-scheduler" badly to schedule my jobs: https://gist.github.com/pushmatrix/3978821#file-sample_job-rb
How can I make the data update correctly on a regular basis?
Your scheduler looks fine, but it seems you're only making one call to the website:
data = Nokogiri::HTML(open(url))
But you never call it again. Is your intent to check that site only once, along with the initial processing of it?
I assume you'd really want to wrap more of your logic into the scheduler loop; only the things inside it will be rerun when the scheduled job fires.
Once you wrap everything in the scheduler, you are still only taking one sample every 10 seconds (http://ruby-doc.org/core-2.2.0/Array.html#method-i-sample) and then adding it to tag_counts, which is recreated each time. The thing to remember about schedulers is that every run is basically a clean slate. I'd recommend looping through tags and adding them all to tag_counts instead of sampling; sampling is unnecessary, seeing as you already reduce the list to 10 each time the scheduler runs. A sketch is below.
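For reference, a minimal sketch of that approach, with the fetch moved inside the loop and all ten tags sent instead of a sample (first_in: 0 is an optional rufus-scheduler argument that also makes the job fire immediately on startup):

require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'

SCHEDULER.every '10s', first_in: 0 do
  # Re-fetch and re-parse the page on every run so the data stays fresh.
  data = Nokogiri::HTML(open(url))
  tags = data.xpath('//ol/li').collect { |li| li.xpath('a').text }.take(10)

  # Send all ten tags instead of sampling one.
  send_event('twitter_trends', items: tags.map { |tag| { label: tag } })
end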
If I move the SCHEDULER block like this (to just after url at the top), it works, but then only one random item appears every 10 seconds.
require 'nokogiri'
require 'open-uri'

url = 'http://trends24.in/france/~cloud'

SCHEDULER.every '10s' do
  data = Nokogiri::HTML(open(url))
  list = data.xpath('//ol/li')
  tags = list.collect do |tag|
    tag.xpath('a').text
  end
  tags = tags.take(10)
  tag_counts = Hash.new({value: 0})
  tag = tags.sample
  tag_counts[tag] = {label: tag}
  send_event('twitter_trends', {items: tag_counts.values})
end
How can I display a list of 10 items that updates regularly?

Filter tweet keywords

I'm trying to filter certain words with Twython before retweeting. I can't figure out a way to get it to work; instead of filtering out certain words, it's adding those words to the ones to retweet. Here is my code:
import time

from twython import TwythonError

# (assumes `twitter` is an authenticated Twython client)
naughty_words = ["", '"Sign up"', "kindle", "read", "book", "amzn", "amazon"]
good_words = ["Giveaway", ""]
filter = "OR".join(good_words)
blacklist = "-".join(naughty_words)
keywords = filter + blacklist
search_results = twitter.search(q="keywords", count=5)
try:
    for tweet in search_results["statuses"]:
        twitter.retweet(id=tweet["id_str"])
        time.sleep(15)
except TwythonError as e:
    print e
Two issues that I see; fix those and see if it solves your problem.
1) keywords isn't coming out as expected. From your code as it stands I get GiveawaySign up -kindle -read -book -amzn -amazon. This is because good_words is effectively a one-element list, so the .join isn't working as expected.
2) The way "Sign up" is written, it will be matched as "Sign" AND "up"; that's more likely the problem.
Try the following:
naughty_words = ["",'"Sign up"', "kindle", "read", "book", "amzn", "amazon"]
good_words = ["Giveaway", ""]
Also, remove the space after OR and keep the one before.
Edit
Change your filter and blacklist to:
filter = "".join(good_words)
blacklist = " -".join(naughty_words)
Since you only have one word in good_words, there's no need for the OR. You should get:
Giveaway -"Sign up" -kindle -read -book -amzn -amazon
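Putting it together, a small sketch of the corrected construction (this assumes twitter is an authenticated Twython client, as in the question; note that the query variable itself must be passed to search, not the quoted string "keywords"):

naughty_words = ["", '"Sign up"', "kindle", "read", "book", "amzn", "amazon"]
good_words = ["Giveaway", ""]

filter = "".join(good_words)          # -> 'Giveaway'
blacklist = " -".join(naughty_words)  # -> ' -"Sign up" -kindle -read -book -amzn -amazon'
keywords = filter + blacklist

# Pass the variable, not the literal string "keywords":
search_results = twitter.search(q=keywords, count=5)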

scrapy plus selenium to process dynamic multipage --can't continue clicking

I am using Scrapy plus Selenium to scrape data from dynamic pages. Here is my spider code:
import os
import time

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.selector import Selector
from selenium import webdriver

class asbaiduSpider(CrawlSpider):
    name = 'apps_v3'
    start_urls = ["http://as.baidu.com/a/software?f=software_1012_1"]
    rules = (Rule(SgmlLinkExtractor(allow=("cid=(50[0-9]|510)&s=1&f=software_1012_1", )),
                  callback='parse_item', follow=True),)

    def __init__(self):
        CrawlSpider.__init__(self)
        chromedriver = "/usr/bin/chromedriver"
        os.environ["webdriver.chrome.driver"] = chromedriver
        self.driver = webdriver.Chrome(chromedriver)

    def __del__(self):
        self.driver.stop()
        CrawlSpider.__del__(self)

    def parse_item(self, response):
        hxs = Selector(response)
        #links = hxs.xpath('//span[@class="tit"]/text()').extract()
        links = hxs.xpath('//a[@class="hover-link"]/@href').extract()
        for link in links:
            #print 'link:\t%s' % link
            time.sleep(2)
            return Request(link, callback=self.parse_page)

    def parse_page(self, response):
        self.driver.get(response.url)
        time.sleep(2.5)
        app_comments = ''
        num = len(self.driver.find_elements_by_xpath("//section[@class='s-index-page devidepage']/a"))
        print 'num:\t%s' % num
        if num == 8:
            print 'num====8 ohohoh'
            while True:
                link = self.driver.find_element_by_link_text('下一页')
                try:
                    link.click()
                except:
                    break
The problem is that each time, after clicking through to page 2, it just quits the current page, but I need to continue crawling page 3, page 4 and so on.
The pages that need to be parsed look like:
http://as.baidu.com/a/item?docid=5302381&pre=web_am_software&pos=software_1012_0&f=software_1012_0 (the site is in Chinese, sorry for the inconvenience)
And I need to page through the comments at the bottom and scrape the comment data.
I have been stuck on this problem for 2 days. I would really appreciate any help.
Thank you...
If I have understood it correctly, here is your case:
Open a page
Find some links on the page and visit them one by one
While visiting each link, extract data.
If my understanding is correct, I think you can proceed with the logic below:
Open the page
Get all the links and save them to an array.
Now open each page separately using the webdriver and do your job.
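A rough sketch of that flow with plain Selenium (the chromedriver path and selectors are reused from the question; the actual comment extraction is left as a placeholder):

import time

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get("http://as.baidu.com/a/software?f=software_1012_1")

# 1) Collect all detail-page links up front, before navigating away.
links = [a.get_attribute("href")
         for a in driver.find_elements_by_xpath('//a[@class="hover-link"]')]

# 2) Visit each link separately and page through its comments.
for link in links:
    driver.get(link)
    time.sleep(2.5)
    while True:
        # ... extract the comment data from the current page here ...
        try:
            driver.find_element_by_link_text(u'下一页').click()  # '下一页' = "next page"
            time.sleep(2)
        except NoSuchElementException:
            break  # no "next page" link left; move on to the next item

driver.quit()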

How do I disable Transformations in TYPO3 RTE Editor?

I created a custom extension for TYPO3 CMS.
It basically does some database queries to get text from database.
As I have seen, the TYPO3 editor transforms data before storing it in the database, so for example a link <a href="....." >Link</a> is stored as <link href>My Link Text</link>, and so on for many tags like this.
When I query data from the DB, I get it as it is stored there (<link href>My Link Text</link>),
so links are not displayed as they should be. They display as normal text.
As far as I know there are two ways to go:
disable RTE transformations (how do I do that?)
use lib.parseFunc_RTE (which I have no idea how to configure properly)
Any ideas?
Thanks.
I guess you're not using Extbase and Fluid? Just as a reference, if you are using Extbase and Fluid for your extension you can render text from the RTE using Fluid:
<f:format.html>{bodytext}</f:format.html>
This uses lib.parseFunc_RTE to render the RTE text as HTML. You can also tell it to use a different TypoScript object for the rendering:
<f:format.html parseFuncTSPath="lib.my_parseFunc">{bodytext}</f:format.html>
Useful documentation:
parseFunc
Fluid format.html
I came across the same problem, but with Extbase the function "pi_RTEcssText" is not available anymore. Well, maybe it is, but I didn't know how to include it.
Anyway, here's my solution using Extbase:
$this->cObj = $this->configurationManager->getContentObject();
$bodytext = $this->cObj->parseFunc($bodyTextFromDb, $GLOBALS['TSFE']->tmpl->setup['lib.']['parseFunc_RTE.']);
This way I get the RTE formatted text.
I have managed to do it by configuring the included TypoScript:
# Creates persistent ParseFunc setup for non-HTML content. This is recommended to use (as a reference!)
lib.parseFunc {
  makelinks = 1
  makelinks.http.keep = {$styles.content.links.keep}
  makelinks.http.extTarget < lib.parseTarget
  makelinks.http.extTarget =
  makelinks.http.extTarget.override = {$styles.content.links.extTarget}
  makelinks.mailto.keep = path
  tags {
    link = TEXT
    link {
      current = 1
      typolink.parameter.data = parameters : allParams
      typolink.extTarget < lib.parseTarget
      typolink.extTarget =
      typolink.extTarget.override = {$styles.content.links.extTarget}
      typolink.target < lib.parseTarget
      typolink.target =
      typolink.target.override = {$styles.content.links.target}
      parseFunc.constants = 1
    }
  }
  allowTags = {$styles.content.links.allowTags}
And denied the link tag:
  denyTags = link
  sword = <span class="csc-sword">|</span>
  constants = 1
  nonTypoTagStdWrap.HTMLparser = 1
  nonTypoTagStdWrap.HTMLparser {
    keepNonMatchedTags = 1
    htmlSpecialChars = 2
  }
}
Well, just in case anyone else runs into this problem:
I found one way to resolve it, by using the pi_RTEcssText() function inside my extension file:
$outputText=$this->pi_RTEcssText( $value['bodytext'] );
where $value['bodytext'] is the string I get from the database-query in my extension.
This function seems to process the data and return the full HTML (links, paragraphs and other tags included).
Note:
If you haven't already, you need to include this file:
require_once(PATH_tslib.'class.tslib_pibase.php');
at the top of your extension file.
That's it basically.

Seeking resources to help do external image preview (scraping) like FB and subsequent full-size image capture

I'm seeking resources or guidance to help generate an image preview from a link, similar to the one used in Facebook's UI, and then subsequently allow a user to display / grab the full-size image behind the preview as well. I'm not necessarily looking to create a bookmarklet a la svpply.com or the like, but I'm interested in figuring out a way whereby a user can enter a link, select the image they want on the page they've linked to, and then have that image (at full, or near to full, size) added to a post or submission on a web page.
Any help, guidance, or anything would be greatly appreciated!!
Thank you!
This might not be a perfect solution, but I am using it at the moment to populate a database entry with some info fetched from an external URL. This only fetches the page contents, without grabbing the image (still working on that). I use the Nokogiri gem to parse the HTML.
class Link < ActiveRecord::Base
  require 'open-uri'
  ....
  def fill_from_url(url_input)
    url_input = url_input.strip
    self.url = url_input
    self.valid?
    existing_link = Link.find_by_url(url_input)
    if self.errors.messages.has_key?(:url)
      return self.errors.messages
    else
      page = open(url_input)
      target_url = page.base_uri.to_s
      input = Nokogiri::HTML.parse(page)
      desc = input.at('head/meta[@name="description"]/@content')
      kws = input.at('head/meta[@name="keywords"]/@content')
      lang = input.at('html/@lang')
      if input.at('head/title')
        self.title = input.at('head/title').content.gsub(/[\n\r\t]/, '').squeeze(" ").strip
      else
        self.title = input.at('title').content.gsub(/[\n\r\t]/, '').squeeze(" ").strip
      end
      self.url = target_url.to_s
      self.website = target_url.split('/')[2]
      self.description = desc.content[0..2000] if desc
      self.keywords = kws.content[0..2000] if kws
      self.language_code = lang.content if lang
    end
  end
  ...
end
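For the image part that the snippet above leaves out, here is a minimal sketch along the same lines: check the page's og:image meta tag first (the tag Facebook's own preview scraper reads), then fall back to collecting <img> sources the user could choose from. The helper name is hypothetical:

require 'nokogiri'
require 'open-uri'
require 'uri'

# Hypothetical helper: returns candidate image URLs for a page.
def image_candidates(url)
  page = Nokogiri::HTML.parse(open(url))
  candidates = []

  # Prefer og:image, which is what Facebook's preview uses.
  og = page.at('head/meta[@property="og:image"]/@content')
  candidates << og.content if og

  # Fall back to every <img> on the page, resolving relative
  # and protocol-relative paths against the page URL.
  page.css('img').each do |img|
    src = img['src']
    next if src.nil? || src.empty?
    candidates << URI.join(url, src).to_s
  end

  candidates.uniq
end

# e.g. image_candidates('http://example.com/some-post').first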
