Scrapy - parsing product info and product reviews

I am creating a crawler to get product info and product reviews from a specific category and export them to CSV files. For example, I need to get all the information from a pants category, so my crawling starts from there.
I can easily extract each product link from the category page. But then I need the crawler to open that link and fetch all the required information for each product. I also need it to fetch all the reviews for the product, but the problem is that the reviews have pagination too.
I start from here:
class SheinSpider(scrapy.Spider):
    name = "shein_spider"
    start_urls = [
        "https://www.shein.com/Men-Pants-c-1978.html?icn=men-pants&ici=www_tab02navbar02menu01dir06&scici=navbar_3~~tab02navbar02menu01dir06~~2_1_6~~real_1978~~~~0~~0"
    ]

    def parse(self, response):
        for item in response.css('.js-good'):
            yield {"product_url": item.css('.category-good-name a::attr(href)').get()}
I do know how to parse the info from the catalog list, but don't know how to make the crawler follow each link from the list.

The way to follow links in Scrapy is to yield a scrapy.Request object with the URL and the callback you want to use to process that link. From the Scrapy documentation tutorial, on Scrapy's mechanism of following links: "when you yield a Request in a callback method, Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes."
I would recommend checking the tutorial in Scrapy documentation here, especially the section called "Following links".
In your specific example, this is the code that will make it work. Be mindful that the product URL needs to be absolute; the href you are extracting may only contain a relative URL, so pass it through response.urljoin() first.
import scrapy

class SheinSpider(scrapy.Spider):
    name = "shein_spider"
    start_urls = [
        "https://www.shein.com/Men-Pants-c-1978.html?icn=men-pants&ici=www_tab02navbar02menu01dir06&scici=navbar_3~~tab02navbar02menu01dir06~~2_1_6~~real_1978~~~~0~~0"
    ]

    def parse(self, response):
        for item in response.css('.js-good'):
            product_url = item.css('.category-good-name a::attr(href)').get()
            # urljoin handles the case where href is a relative URL
            yield scrapy.Request(response.urljoin(product_url), callback=self.parse_item)

    def parse_item(self, response):
        # Do what you want to do to process the product details page
        pass


Gibbon/Mailchimp API request to create interests inside interest-groupings

I'm using Gibbon, version 2.2.1, for Mailchimp, and I'd like to be able to create an interest inside an interest group. For instance, I have users who are subscribed to a class. My interest group is "Lessons", and an interest inside that interest group would be "Foo Lesson".
I'd like to add the ability to create a new class in my site's CMS, which would make an API request on after_create.
class Lesson < ActiveRecord::Base
  after_create :create_class_on_mailchimp

  def create_class_on_mailchimp
    require 'mailchimp_service'
    mailchimp = MailchimpService.new(self)
    response = mailchimp.create_class
    self.interest_id = response.id
    self.save
  end
end

class MailchimpService
  def initialize(lesson)
    @lesson = lesson
    @list_id = ENV['MAILCHIMP_LIST_ID']
  end

  def create_class
    GB.lists(@list_id).interest_categories(ENV['MAILCHIMP_CLASSES_CATEGORY_ID']).interests.create(
      body: {
        name: 'foobar'
      }
    )
  end
end
I keep getting this error:
Gibbon::MailChimpError: the server responded with status 404
  @title="Resource Not Found",
  @detail="The requested resource could not be found.",
  @body={
    "type" => "http://developer.mailchimp.com/documentation/mailchimp/guides/error-glossary/",
    "title" => "Resource Not Found",
    "status" => 404,
    "detail" => "The requested resource could not be found.",
    "instance" => ""
  },
  @raw_body="{\"type\":\"http://developer.mailchimp.com/documentation/mailchimp/guides/error-glossary/\",\"title\":\"Resource Not Found\",\"status\":404,\"detail\":\"The requested resource could not be found.\",\"instance\":\"\"}",
  @status_code=404
What this tells me is that I'm not using the correct resource name. There doesn't seem to be any documentation for this kind of request in Gibbon's limited docs, nor does it seem to be something that Mailchimp covers. Here is a link to Mailchimp's docs that go over the requests for interests inside interest-groupings; however, there doesn't seem to be a create option... just read, edit, and delete. This seems silly to me, as I can imagine people would want to create interests from somewhere other than Mailchimp's dashboard.
I've tried using name, title, and interest_name for the resource name, but none work. I've also tried raw REST API calls, but I receive the same response.
Am I doing something wrong, or is this really something that Mailchimp doesn't offer? It'd be a huge bummer if so, since I'll be creating many classes that I want people to be able to subscribe to, and it would be a major pain to have to do this all manually.
I'm pretty sure POST works to create interests, although it does appear to be missing from the documentation. What is probably happening is that either your list ID or interest category ID is incorrect. You might want to try using the API Playground to track down the exact IDs for both of those entities.
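To see which resource the Gibbon chain is actually hitting, it helps to write out the path it resolves to and check both IDs against that path in the API Playground first. A minimal sketch, assuming placeholder IDs; `interest_create_path` is a hypothetical helper, not part of Gibbon:

```ruby
# Hypothetical helper: build the MailChimp API v3 resource path that the chain
#   gibbon.lists(list_id).interest_categories(category_id).interests
# resolves to, so both IDs can be verified in the API Playground first.
def interest_create_path(list_id, category_id)
  "lists/#{list_id}/interest-categories/#{category_id}/interests"
end

path = interest_create_path("abc123", "def456")
# A POST to this path with body { name: "Foo Lesson" } is what Gibbon's
# .interests.create(body: { name: "Foo Lesson" }) sends.
```

If a GET against the same path 404s in the Playground, one of the two IDs is wrong, which matches the error above.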

Spree Products API for a particular store - spree-multi-domain

I have a Spree application with the spree-multi-domain extension. There are different stores with different products assigned to them in the admin panel,
e.g. for STORE 1 the domain is store1.example.com and for STORE 2 it's store2.example.com.
I have set up wildcard subdomains for the multiple stores:
*.example.com
Now, when I call example.com/api/products.json?token=MY_TOKEN_ID, I get the complete list of products in JSON format.
But I have an issue retrieving the products of Store 1 and Store 2 through the API.
When I call products.json for
Store 1: store1.example.com/api/products.json?token=MY_TOKEN_ID or for
Store 2: store2.example.com/api/products.json?token=MY_TOKEN_ID
I still get the complete list of products, just as with example.com/api/products.json?token=MY_TOKEN_ID.
What I'm expecting is that when I make a GET request for the products of a particular store, I get only the products assigned to that store in the admin panel.
What should I do? Please help.
The spree-multi-domain gem is not 100% stable and still under development.
You need to override the API and use current_store for each request.
A new ControllerHelpers::Store concern provides a current_store helper to fetch the current store based on the request's domain.
Just an example, not related to the API:
Create an /app/controllers/spree/taxons_controller_decorator.rb and extend the TaxonsController. You need to class_eval it, otherwise you override the complete class!!
Spree::TaxonsController.class_eval do
  def show
    @taxon = Spree::Taxon.find_by_store_id_and_permalink!(current_store.id, params[:id])
    return unless @taxon

    @searcher = build_searcher(params.merge(:taxon => @taxon.id))
    @products = @searcher.retrieve_products
    @taxonomies = get_taxonomies
  end
end
So by that, every other method from Spree::TaxonsController stays as it was, and just the show method is overridden.
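The class_eval pattern can be illustrated without Spree at all; reopening a class redefines only the methods you name and leaves the rest intact:

```ruby
# Plain-Ruby illustration of the class_eval decorator pattern (no Spree needed).
class ProductsController
  def index
    "all products"
  end

  def show
    "one product"
  end
end

# Reopen the class and redefine only `index`; `show` is left untouched.
ProductsController.class_eval do
  def index
    "products for current store"
  end
end

controller = ProductsController.new
controller.index  # => "products for current store"
controller.show   # => "one product"
```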
So for your case: this is the original file
https://github.com/spree/spree/blob/master/api/app/controllers/spree/api/v1/products_controller.rb
so you need to go into your Rails app and create an /app/controllers/spree/api/v1/products_controller_decorator.rb where you do (I think that works)
Spree::Api::V1::ProductsController.class_eval do
end
but after reading that, I think the best idea is to override
https://github.com/spree/spree/blob/715d4439f4f02a1d75b8adac74b77dd445b61908/api/app/controllers/spree/api/base_controller.rb#L132
line 132, the product_scope :-)
this should help you - if not, you'd better go Magento :P
cheers

Accessing URI params in the controller

(Learning RoR on my own, so please forgive me if this is an obvious question.)
I have an app to track books stored on shelves in libraries. A request should come in like:
GET books/library_id => shows all books on every shelf in a library
GET books/library_id/shelf_id => shows all books on one shelf in a library
GET books/library_id/shelf_id/book_id => shows a particular book
POST would use similar formats, except I will supply the information (author, pub date, length, etc.) for the book, shelf, or library as JSON in the POST body.
My question is: the params passed in to my controller seem to hold query-string parameters (anything after a ?) and POST body parameters, but not the URL path, which I need to parse to determine what to show. Is there a way to get this out of the parameters? I'm trying to avoid something like GET /books/?library_id/shelf_id
You can set up a route so that params will contain specific URL fragments in addition to the query string and/or post data.
In config/routes.rb:
get 'books/:library_id(/:shelf_id(/:book_id))', to: 'books#show'
In app/controllers/books_controller.rb:
class BooksController < ApplicationController
  def show
    library_id = params[:library_id]
    shelf_id = params[:shelf_id] # may be nil
    book_id = params[:book_id] # may be nil
    # TODO: Do something with library_id, shelf_id (if present),
    # book_id (if present).
  end
end
Alternatively, if you wanted to process the URL with some very custom logic, you could have a wildcard route like get 'books/*sometext', to: 'books#show'. Then, in your controller action you could manually parse params[:sometext]. This would be considered "not the Rails way" but it's there if you need complete flexibility.
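For the wildcard variant, params[:sometext] arrives as the raw remainder of the path; a minimal sketch of the manual parsing (parse_book_path is a hypothetical helper, not a Rails API):

```ruby
# Hypothetical manual parsing for the wildcard route
#   get 'books/*sometext', to: 'books#show'
# where params[:sometext] would be e.g. "12/3/45" for /books/12/3/45.
def parse_book_path(sometext)
  library_id, shelf_id, book_id = sometext.split('/')
  { library_id: library_id, shelf_id: shelf_id, book_id: book_id }
end

parse_book_path("12/3")
# => { library_id: "12", shelf_id: "3", book_id: nil }
```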
Finally, maybe it is worth mentioning that in your controller action you can get information about the request such as request.path, request.fullpath, request.url. But it doesn't sound like you need this in your case.

Analytics UTM Tags - Data not recording

I created a helper module in Rails that builds a hash of UTM parameters to pass to link_to helpers; these parameters are used in emails sent out by the Rails app. The problem is that Google Analytics is not picking up any data in my testing. I realize there is a delay in processing; I'm also using the GA debugger to look at the beacons actually being sent out, and I'm testing all of this on a staging server with a staging Google Analytics property. Does anyone see anything in the approach below that would cause GA to not record any visits under the "campaigns" report? Does the order of the utm tags actually matter? I realize utm_campaign, utm_source, and utm_medium are all required, and I've ensured they are in every link.
For example, this is what one of the links looks like. However, GA is not picking up any data.
http://example.com/?utm_campaign=welcome&utm_medium=email&utm_source=abc
I compared the link that the link_to method created to what the Google UTM Link Builder outputs, and the only difference is the order of the parameters. Does the order matter?
http://example.com/?utm_source=abc&utm_medium=email&utm_campaign=welcome
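A quick sanity check with the Ruby standard library confirms that the two orderings carry identical UTM values:

```ruby
require 'uri'

# The two links above differ only in query-parameter order.
link_a = "http://example.com/?utm_campaign=welcome&utm_medium=email&utm_source=abc"
link_b = "http://example.com/?utm_source=abc&utm_medium=email&utm_campaign=welcome"

# Parse each query string into a hash; hashes compare by content, not order.
params_a = URI.decode_www_form(URI(link_a).query).to_h
params_b = URI.decode_www_form(URI(link_b).query).to_h

params_a == params_b  # same tags, only the order differs
```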
Here is the helper module and an example of its usage with link_to.
# mailer_helper.rb
require 'uri'

module MailerHelper
  def self.tracking_param_hash(utm_source, utm_medium, utm_campaign, params = {})
    { utm_source: utm_source,
      utm_medium: utm_medium,
      utm_campaign: utm_campaign
    }.merge(params)
  end

  def self.email_tracking_params(utm_campaign, options = {})
    tracking_param_hash("abc", "email", utm_campaign, options)
  end

  ...
end

# example usage in email view
email_params = MailerHelper::email_tracking_params("set-password", reset_password_token: @token)
link_to 'Set my password', edit_password_url(@resource, email_params)

# rendered link in email
http://example.com/users/password/edit?reset_password_token=YEt9PJgQsxb3WpJQ7OEXH3YDT8JZMD&utm_campaign=reset-password&utm_medium=email&utm_source=abc
Data wasn't recording for several reasons. The first was that one of the links users were clicking was doing a POST to user_confirmation_path. Because the urchin never loads on the POST, no data is recorded, and users are redirected to the sign-in page. The parameters need to persist through the redirect from the POST, otherwise traffic will be treated as direct. Near the end of this Kissmetrics blog post they outline this problem. To get around it, you can pass the parameters into the redirect URL: something like redirect_to user_signin_path(tracking_parameters), where tracking_parameters are the utm tags from the POST URL.
The second was that the __utmz cookie was persisting, so the visit wasn't counted as a new session. You need to test everything in a fresh incognito window because of how Google Analytics and Chrome treat individual sessions. See this SO post for more details.
Finally, for anyone who reads this, you can confirm that UTM parameters are being set by looking for a cookie called __utmz. This cookie is used by Google Analytics, will look something like 12340657.1412743014.15.1.utmcsr=abc|utmccn=confirm|utmcmd=email, and should contain the utm parameters.
[Edited for clarity and the various scenarios I was testing]
According to Google, the order doesn't matter. AFAICT everything is in order in your code, so the problem is somewhere else. Cookies?

How to create nested loops under unpredictable results

I'm working on a web crawler application. It will list all links of a given domain as part of a categorized site map. I'm using the Nokogiri gem for parsing and searching the HTML. This code works for a single page:
doc = Nokogiri::HTML(open("url"))
links = doc.css("a")

unless links.blank?
  links.each do |t|
    if t["href"].first == "/"
      # link stuff
    end
  end
end
At the commented line, I can do another doc = Nokogiri::HTML(open(t_URL)) and receive the second set of links, and so on. But what about the 3rd, 4th, or 5th steps?
How will I crawl all the other pages of the entire site, including pages linked from previous pages? The number of links per page is not predictable, so I can't use each or times. How can I keep visiting all pages and nested pages and track the links on all of them?
All you need to do is keep track of the absolute URLs in a hash. The value of the hash could be a count, or you may want to record when you last scraped each page with a timestamp. Note that when you scrape, you should select just the elements that have hrefs:

visited = {}
to_visit = { "url" => Time.now }

until to_visit.empty?
  url, _queued_at = to_visit.shift
  next if visited.key?(url)
  visited[url] = Time.now

  doc = Nokogiri::HTML(open(url))
  doc.css("a[href]").each do |link|
    absolute = make_absolute(link)
    # queue this page unless it has already been scraped
    to_visit[absolute] = Time.now unless visited.key?(absolute)
  end
end
Where you'll need to define make_absolute which should create a full URL complete with protocol, host, port, and path.
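A minimal make_absolute sketch using the Ruby standard library; this version takes the href string plus the URL of the page it was found on (the loop above passes the Nokogiri link node, so you would extract link["href"] first):

```ruby
require 'uri'

# One possible make_absolute: join a (possibly relative) href with the URL of
# the page it was found on, yielding a full URL with protocol, host, and path.
def make_absolute(href, base_url)
  URI.join(base_url, href).to_s
end

make_absolute("/pants", "https://example.com/shop/index.html")
# => "https://example.com/pants"
```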
As you mentioned, each or times are for when the number of iterations is fixed in advance. When you don't have a fixed iterator, use constructs like loop, while, or until, and break out when all links have been found.
