Scrapy plus Selenium to process a dynamic multi-page site -- can't continue clicking through pages while parsing

I am using Scrapy plus Selenium to scrape data from dynamic pages. Here is my spider code:
class asbaiduSpider(CrawlSpider):
    name = 'apps_v3'
    start_urls = ["http://as.baidu.com/a/software?f=software_1012_1"]

    rules = (Rule(SgmlLinkExtractor(allow=("cid=(50[0-9]|510)&s=1&f=software_1012_1", )), callback='parse_item', follow=True),)

    def __init__(self):
        CrawlSpider.__init__(self)
        chromedriver = "/usr/bin/chromedriver"
        os.environ["webdriver.chrome.driver"] = chromedriver
        self.driver = webdriver.Chrome(chromedriver)

    def __del__(self):
        self.driver.stop()
        CrawlSpider.__del__(self)

    def parse_item(self, response):
        hxs = Selector(response)
        #links = hxs.xpath('//span[@class="tit"]/text()').extract()
        links = hxs.xpath('//a[@class="hover-link"]/@href').extract()
        for link in links:
            #print 'link:\t%s' % link
            time.sleep(2)
            return Request(link, callback=self.parse_page)

    def parse_page(self, response):
        self.driver.get(response.url)
        time.sleep(2.5)
        app_comments = ''
        num = len(self.driver.find_elements_by_xpath("//section[@class='s-index-page devidepage']/a"))
        print 'num:\t%s' % num
        if num == 8:
            print 'num====8 ohohoh'
            while True:
                link = self.driver.find_element_by_link_text('下一页')
                try:
                    link.click()
                except:
                    break
The problem is that every time, after clicking through to page 2, it just quits the current page. But I need to crawl page 3, page 4 and so on.
The pages that need to be parsed look like:
http://as.baidu.com/a/item?docid=5302381&pre=web_am_software&pos=software_1012_0&f=software_1012_0 (the site is in Chinese, sorry for the inconvenience)
And I need to click through the pager at the bottom and scrape the comment data.
I have been stuck on this problem for 2 days. I would really appreciate any help.
Thank you...

If I have understood it correctly, here is your case:
Open a page.
Find some links on the page and visit them one by one.
While visiting each link, extract data.
If my understanding is correct, I think you can proceed with the logic below.
Open the page.
Get all the links and save them to an array.
Now open each page separately using the webdriver and do your job.
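A minimal sketch of that idea in Selenium, reusing the driver setup, XPath and link text from the question (the sleeps and the comment-extraction step are placeholders to adapt):

from selenium import webdriver
from selenium.common.exceptions import WebDriverException
import time

driver = webdriver.Chrome("/usr/bin/chromedriver")
driver.get("http://as.baidu.com/a/software?f=software_1012_1")  # or a category page matched by your rule

# 1. Collect all detail-page links up front, so navigating away
#    does not invalidate the element references.
links = [a.get_attribute("href")
         for a in driver.find_elements_by_xpath('//a[@class="hover-link"]')]

# 2. Visit each detail page separately and page through its comments.
for link in links:
    driver.get(link)
    time.sleep(2.5)
    while True:
        # ... extract the comments on the current comment page here ...
        try:
            driver.find_element_by_link_text('下一页').click()  # "next page"
            time.sleep(2)
        except WebDriverException:
            break  # link missing or not clickable: last comment page reached

driver.quit()

Because the hrefs are plain strings collected before any navigation, clicking through pages 2, 3, 4 of one item cannot interfere with moving on to the next item.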

Related

Scrapy Spider not returning any results

I am trying to build a scraper with Scrapy. My overall goal is to scrape the webpages of a website and return a list of links for all downloadable documents of the different pages.
Somehow my code returns only None. I am not sure what the cause of this could be. Thank you for your help in advance. Please note that robots.txt is not causing this issue.
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner


def processlinks(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link


class ExampleCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ["https://example.com/"]
    rules = (
        Rule(
            LinkExtractor(
                deny=[
                    re.escape('https://www.example.com/offsite'),
                    re.escape('https://www.example.com/whitelist-offsite'),
                ],
            ),
            process_links=processlinks,
            callback='parse_links',
            follow=False
        ),
    )

    def parse_links(self, response):
        html = response.body
        links = scrapy.Selector(text=html).xpath('//@href').extract()
        documents = []
        for link in links:
            absolute_url = urljoin(response.url, link)
            documents.append(absolute_url)
        return documents
I expected to receive a list containing all document download links for all webpages of the website. I only got a None value returned. It seems like the parse_links method does not get called.
There were a few logical and technical issues in the code; I have made changes to it. Below are the details.
Your site redirects to another site, so you need to update the allowed domains and add www.iana.org to them:
allowed_domains = ['www.example.com', 'www.iana.org']
Secondly, in Scrapy you can't return a plain list or string; it should be a Request or an item in the form of a dictionary. See the last line of the code.
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner
from urllib.parse import urljoin
import scrapy


def processlinks(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link


class ExampleCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com', 'www.iana.org']
    start_urls = ["https://example.com/"]
    rules = (
        Rule(
            LinkExtractor(
                deny=[
                    re.escape('https://www.example.com/offsite'),
                    re.escape('https://www.example.com/whitelist-offsite'),
                ],
            ),
            process_links=processlinks,
            callback='parse_links',
            follow=False
        ),
    )

    def parse_links(self, response):
        html = response.body
        links = scrapy.Selector(text=html).xpath('//@href').extract()
        documents = []
        for link in links:
            absolute_url = urljoin(response.url, link)
            documents.append(absolute_url)
        return {"document": documents}

Can't Identify Proper CSS Selector to Scrape with Mechanize

I have built a web scraper that is successfully pulling almost everything I need out of the web page I'm looking at. The goal is to pull the URL for a particular image associated with all the coffees found at a particular URL.
The rake task I have defined to complete the scraping is as follows:
mechanize = Mechanize.new
mechanize.get(url) do |page|
  page.links_with(:href => /products/).each do |link|
    coffee_page = link.click
    bean = Bean.new
    bean.acidity = coffee_page.css('[data-id="acidity"]').text.strip.gsub("acidity ","")
    bean.elevation = coffee_page.css('[data-id="elevation"]').text.strip.gsub("elevation ","")
    bean.roaster_id = "2"
    bean.harvest_season = coffee_page.css('[data-id="harvest"]').text.strip.gsub("harvest ","")
    bean.price = coffee_page.css('.price-wrap').text.gsub("$","")
    bean.roast_profile = coffee_page.css('[data-id="roast"]').text.strip.gsub("roast ","")
    bean.processing_type = coffee_page.css('[data-id="process"]').text.strip.gsub("process ","")
    bean.cultivar = coffee_page.css('[data-id="cultivar"]').text.strip.gsub("cultivar ","")
    bean.flavor_profiles = coffee_page.css('.price-wrap+ p').text.strip
    bean.country_of_origin = coffee_page.css('#pdp-order h1').text.strip
    bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
    if bean.country_of_origin == "Origin Set" || bean.country_of_origin == "Gift Card (online use only)"
      bean.destroy
    else
      ap bean
    end
  end
end
Now the information I need is all on the page, and I'm looking for the image URL found as shown below, but for all of the individual coffee_pages linked from the source page. It needs to be generic enough to pull this picture source but nothing else. I've tried a number of different CSS selectors, but everything returns either nil or blank.
<img src="//cdn.shopify.com/s/files/1/2220/0129/products/ceremony-product-gummy-bears_480x480.jpg?v=1551455589" alt="Burundi Kiryama" data-product-featured-image style="display:none">
The coffee_page I'm on is here: https://shop.ceremonycoffee.com/products/burundi-kiryama
You need to change
bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
to
bean.image_url = coffee_page.css('#mobile-only>img').attr('src')
If you can, always use nearby identifiers to locate the element you want to access.
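As an aside (untested against that page), Nokogiri also understands attribute selectors, so coffee_page.css('img[data-product-featured-image]').attr('src') should be another way to target exactly that image, keyed on the data-product-featured-image attribute shown in the snippet above rather than on the surrounding container.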

Link encryption?

I have been stuck on a problem for a few hours. Nothing online has helped and I'm losing the will to live right now.
The site loads up a question with no hints and asks you to find a secret code.
Here's the brief explanation of it:
'Well done on making it to the secret bonus challenge! Our agents have been struggling to deal with a hacker obsessed with clocks and timing. He set up an elaborate collection of pages with content that changes based on a timer. We've replicated it below, can you figure out how to get the secret code?'
There are many links inside this challenge; when they are clicked, each opens a new page containing what looks like a pseudo-random string, and I don't see much of a pattern. Links below:
https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=BY%2F8lhw%2BtbBgvOMDiHeB5A%3D%3D
(In case the links don't let you through:) each page contains just a bare tag and no other elements, with what seems to be a three-character code that always ends in 'a', for example 'Aja', and a new one is generated every 10 seconds (it is not re-generated client-side).
Does anyone have any suggestions as to whether or not the link is a hint at encryption? I've decoded it once and it came up with:
'https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=BY/8lhw tbBgvOMDiHeB5A==' which isn't much help.
Anyways, anyone have any suggestions?
Thanks :)
It's not impossible. I have the answer here:
import requests
page1 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt1?verify=wMHfxKSix2qSPJtLe6U98w%3D%3D"
page1_content = requests.get(page1)
page1txt = page1_content.text
page2 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt2?verify=wMHfxKSix2qSPJtLe6U98w%3D%3D"
page2_content = requests.get(page2)
page2txt = page2_content.text
page3 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt3?verify=wMHfxKSix2qSPJtLe6U98w%3D%3D"
page3_content = requests.get(page3)
page3txt = page3_content.text
page4 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt4?verify=wMHfxKSix2qSPJtLe6U98w%3D%3D"
page4_content = requests.get(page4)
page4txt = page4_content.text
page5 = "https://assess.joincyberdiscovery.com/challenge-files/clock-pt5?verify=wMHfxKSix2qSPJtLe6U98w%3D%3D"
page5_content = requests.get(page5)
page5txt = page5_content.text
code = (page1txt + page2txt + page3txt + page4txt + page5txt)
page6 = "https://assess.joincyberdiscovery.com/challenge-files/get-flag?verify=wMHfxKSix2qSPJtLe6U98w%3D%3D&string="+code
page6txt = requests.get(page6)
print (page6txt.text)
Replace all of the links with the links you are given.
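The same fetch-and-concatenate idea can be written more compactly as a loop; a short sketch, assuming the verify token is identical for every part as it is in the code above:

import requests

BASE = "https://assess.joincyberdiscovery.com/challenge-files"
VERIFY = "wMHfxKSix2qSPJtLe6U98w%3D%3D"  # replace with the token you are given

# Fetch the five clock parts and join their bodies into the secret code.
code = "".join(
    requests.get("%s/clock-pt%d?verify=%s" % (BASE, i, VERIFY)).text
    for i in range(1, 6)
)

flag = requests.get("%s/get-flag?verify=%s&string=%s" % (BASE, VERIFY, code))
print(flag.text)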

undefined method `click' for "2":String, Rails error when using Mechanize

class ScraperController < ApplicationController
  def show
    mechanize = Mechanize.new
    website = mechanize.get('https://website.com/')

    $max = 2
    $counter = 0
    $link_to_click = 2
    @names = []

    while $counter <= $max do
      @names.push(website.css('.memName').text.strip)
      website.link_with(:text => '2').text.strip.click
      $link_to_click += 1
      $counter += 1
    end
  end
end
I am trying to scrape 20 items off of each page and then click on the link at the bottom (1, 2, 3, 4, 5, etc.). However, I get the error seen in the title, which tells me that I cannot click the string. So it recognizes that the button '2' exists but tells me it cannot be clicked. Ideally, once this is sorted out, I want to use the $link_to_click variable in place of the '2' so that it increments each time, but it always comes back as nil. I have also changed it with .to_s, with the same result.
If I remove the click altogether, it will scrape the same page 3 times instead of moving on to the next page. I have also removed the text.strip part before the .click and it does the same thing. I have tried many variations but have had no luck.
I would really appreciate any advice you could offer.
I ended up reviewing the articles I was referencing to solve this and came to this conclusion.
I changed the line to website = website.link_with(:text => $link_to_click.to_s).click (it only worked as a string) and it printed out the first page, the second, and each one thereafter.
These are the articles that I was referencing to learn how to do this.
http://docs.seattlerb.org/mechanize/GUIDE_rdoc.html
and
https://readysteadycode.com/howto-scrape-websites-with-ruby-and-mechanize

Seeking resources to help do external image preview (scraping) like FB and subsequent full-size image capture

I am seeking resources or guidance to help generate an image preview from a link, similar to the one used in Facebook's UI, and then also allow a user to display/grab the full-size image of that preview. I am not necessarily looking to create a bookmarklet a la svpply.com or the like, but I am interested in figuring out a way whereby a user can enter a link, select the image they want on the page they've linked to, and then have that image (at full or near-full size) added to a post or submission on a web page.
Any help, guidance, or anything would be greatly appreciated!!
Thank you!
This might not be a perfect solution, but I am using it at the moment to populate a database entry with some info fetched from an external URL. This is just to fetch the contents, without grabbing the image (still working on that part). I use the Nokogiri gem to parse the HTML:
class Link < ActiveRecord::Base
  require 'open-uri'
  ....
  def fill_from_url(url_input)
    url_input = url_input.strip
    self.url = url_input
    self.valid?
    existing_link = Link.find_by_url(url_input)
    if self.errors.messages.has_key?(:url)
      return self.errors.messages
    else
      page = open(url_input)
      target_url = page.base_uri.to_s
      input = Nokogiri::HTML.parse(page)
      desc = input.at('head/meta[@name="description"]/@content')
      kws = input.at('head/meta[@name="keywords"]/@content')
      lang = input.at('html/@lang')
      if input.at('head/title')
        self.title = input.at('head/title').content.gsub("\n"||"\r"||"\t",'').squeeze(" ").strip
      else
        self.title = input.at('title').content.gsub("\n"||"\r"||"\t",'').squeeze(" ").strip
      end
      self.url = target_url.to_s
      self.website = target_url.split('/')[2]
      self.description = desc.content[0..2000] if desc
      self.keywords = kws.content[0..2000] if kws
      self.language_code = lang.content if lang
    end
  end
  ...
end
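The snippet above leaves out the image itself. Many sites expose their preview image through an og:image meta tag, which is what Facebook-style link previews typically read, so the same kind of lookup can be added for it. A rough sketch of that idea in Python with requests and BeautifulSoup (just to illustrate; the equivalent Nokogiri lookup would mirror the description/keywords lines above):

import requests
from bs4 import BeautifulSoup

def preview_image_url(url):
    # Fetch the page and look for an Open Graph image tag, which most
    # sites use to declare the image that link previews should show.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"property": "og:image"})
    if tag and tag.get("content"):
        return tag["content"]
    # Fall back to the first <img> on the page, if any.
    img = soup.find("img")
    return img.get("src") if img else None

For the "let the user pick any image on the page" part, collecting all soup.find_all("img") sources and presenting them as choices is a common approach.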
