Robots.txt flexibility with top level domains - parsing

The only problem I have left with this web crawler is making it so that, when the top-level domain changes, say from imdb to youtube, it switches from following imdb's robots.txt disallow rules to following youtube's. I believe it can all be fixed just by how the variables are declared at the beginning.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re
re.IGNORECASE = True
#SourceUrl
url = "http://www.imdb.com"
urls = [url]
visited =[url]
robotsUrl = url +'/robots.txt'
while len(urls) < 250000:
    try:
        htmltext = urllib.request.urlopen(urls[0]).read()
        robots = urllib.request.urlopen(robotsUrl).read()
        disallowList = re.findall(b'Disallow\:\s*([a-zA-Z0-9\*\-\/\_\?\.\%\:\&]+)', robots)
    except:
        print(urls[0])
    sourceCode = BeautifulSoup(htmltext, "html.parser")
    urls.pop(0)
    print(len(urls))
    for link in sourceCode.findAll('a', href=True):
        if "http://" not in link['href']:
            link['href'] = urllib.parse.urljoin(url, link['href'])
        in_disallow = False
        for i in range(len(disallowList)):
            if (disallowList[i]).upper().decode() in link['href'].upper():
                in_disallow = True
                break
        if not in_disallow:
            if link['href'] not in visited:
                urls.append(link['href'])
                visited.append(link['href'])
                print(visited)

As long as the domain names used inside your robots.txt match the domain the robots.txt was fetched from, everything is fine. In other words, you can replace yoursite.imdb with yoursite.youtube in all URLs and it still works.
Update
Say you have a sitemap declared in your robots.txt; then it should use the same TLD.
http://www.yoursite.imdb/robots.txt
should contain:
Sitemap: http://www.yoursite.imdb/sitemap1.xml (not .youtube)
Otherwise, for directives such as Allow or Disallow, there is no impact, since the TLD does not appear in the paths.
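One way to make the crawler itself respect each host's own rules is to derive the robots.txt URL from every link's host instead of hard-coding it once. Below is a minimal sketch using urllib.robotparser; the robot_parsers cache and the allowed() helper are names introduced here for illustration, not part of the original code:
import urllib.parse
import urllib.robotparser

robot_parsers = {}  # host -> RobotFileParser, so each domain's rules are fetched only once

def allowed(link, user_agent='*'):
    parts = urllib.parse.urlsplit(link)
    host = parts.scheme + '://' + parts.netloc
    if host not in robot_parsers:
        rp = urllib.robotparser.RobotFileParser(host + '/robots.txt')
        try:
            rp.read()
        except Exception:
            rp.allow_all = True  # assumption: treat an unreachable robots.txt as permissive
        robot_parsers[host] = rp
    return robot_parsers[host].can_fetch(user_agent, link)
In the loop above you would then keep only links for which allowed(link['href']) is true, and the single robotsUrl / disallowList pair would no longer be needed.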

Related

Scrapy Spider not returning any results

I am trying to build a scraper with Scrapy. My overall goal is to crawl the pages of a website and return a list of links to all downloadable documents on those pages.
Somehow my code returns only None, and I am not sure what the cause could be. Thank you in advance for your help. Please note that robots.txt is not causing this issue.
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner
def processlinks(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link

class ExampleCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ["https://example.com/"]
    rules = (
        Rule(
            LinkExtractor(
                deny=[
                    re.escape('https://www.example.com/offsite'),
                    re.escape('https://www.example.com/whitelist-offsite'),
                ],
            ),
            process_links=processlinks,
            callback='parse_links',
            follow=False
        ),
    )

    def parse_links(self, response):
        html = response.body
        links = scrapy.Selector(text=html).xpath('//#href').extract()
        documents = []
        for link in links:
            absolute_url = urljoin(response.url, link)
            documents.append(absolute_url)
        return documents
I expected to receive a list containing all document download links for every page of the website, but I only got None back. It seems like the parse_links method does not get called.
There were a few logical and technical issues in the code. I have made changes to it; the details are below.
First, your site redirects to another site, so you need to update the allowed domains and add www.iana.org:
allowed_domains = ['www.example.com', 'www.iana.org']
Secondly, in Scrapy a callback can't return a list or a string; it should return (or yield) a Request or an item such as a dictionary. See the last line of the code below.
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner
from urllib.parse import urljoin
import scrapy
def processlinks(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link

class ExampleCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com', 'www.iana.org']
    start_urls = ["https://example.com/"]
    rules = (
        Rule(
            LinkExtractor(
                deny=[
                    re.escape('https://www.example.com/offsite'),
                    re.escape('https://www.example.com/whitelist-offsite'),
                ],
            ),
            process_links=processlinks,
            callback='parse_links',
            follow=False
        ),
    )

    def parse_links(self, response):
        html = response.text  # Selector(text=...) expects a str, not bytes
        # '//#href' is not valid XPath; '//a/@href' selects every anchor's href attribute
        links = scrapy.Selector(text=html).xpath('//a/@href').extract()
        documents = []
        for link in links:
            absolute_url = urljoin(response.url, link)
            documents.append(absolute_url)
        return {"document": documents}
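If you would rather emit one scraped item per document link instead of a single aggregated dictionary per page, a small variant of parse_links can yield dicts; this is just a sketch and not required for the fix above:
    def parse_links(self, response):
        # Yield one dictionary per absolute document URL found on the page
        for link in response.xpath('//a/@href').getall():
            yield {"document": response.urljoin(link)}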

Web Scraping - BeautifulSoup parsers do not seem to work

I am trying to extract the names of a few items from the URL below. The node and class_ point to the right content, but when I use find_all I do not get back any results. From previous posts it seems this problem might be connected to using the wrong parser. I have used xml, lxml and others, but nothing seems to work.
Is anyone able to extract the content successfully?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import html5lib
import urllib3
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('div', class_ = 'name namorio')
UPDATE
I have managed to find the info I needed, hidden in another section of the same page source that is visible on inspection. I extracted it using this code:
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('li', class_ = False)
names_pb = [c.select("a > p")[0].text for c in con_pb]
prices_pb = [c.select('a > p')[1].text for c in con_pb]
picts_pb = [c.find('img').get('src') for c in con_pb]
df_pb = pd.DataFrame({'(Pull&Bear) Item_Name': names_pb,
                      'Item_Price_EUR': prices_pb,
                      'Link_to_Pict': picts_pb})
It seems that the website uses JavaScript to render its content, meaning you can't simply fetch the homepage and scrape it (requests does not execute JavaScript). That said, all of the data displayed on the site is delivered as a JSON string, so to get all of the item names you could use the following code:
import requests
url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
all_products = requests.get(url).json()["products"]
product_names = [item["bundleProductSummaries"][0]["name"] for item in all_products]
print(product_names)
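If you later want more than the names, it can help to dump a single product record first and see which keys are actually available; a quick sketch (everything beyond the "products" key is simply whatever the endpoint returns):
import json
import requests

url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
first_product = requests.get(url).json()["products"][0]
print(json.dumps(first_product, indent=2)[:1000])  # peek at the structure of one product record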
hope this helps

Scraping webpages for links with a specific class

First post here and I have had a look but can't find the answer I need.
I'm trying to go through a website and find all the links that have a certain class, in this case 'annmt'.
I want the result to show only the link, though, and I'm having trouble getting the format right. Once it is right, I want to append it to an empty list that I can call on later.
My code is:
import requests
from bs4 import BeautifulSoup
import datetime as dt
l = []
def getlinks():
    page = requests.get("http://www.investegate.co.uk/Index.aspx?ftse=1&date=20170609")
    soup = BeautifulSoup(page.content, 'html.parser')
    for links in soup.find_all('a', attrs={'class': 'annmt'}):
        for link in links.find_all('a', href=True):
            link = link['href']
            l.append(link)
    print l
Here is the working code for your reference:
import requests
from bs4 import BeautifulSoup
import datetime as dt
l = []
def getlinks():
    page = requests.get("http://www.investegate.co.uk/Index.aspx?ftse=1&date=20170609")
    soup = BeautifulSoup(page.content, 'html.parser')
    for links in soup.find_all('a', attrs={'class': 'annmt'}):
        link = links.get('href')
        l.append(link)
    print l
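An equivalent and slightly more compact option, sketched against the same page and class, is to let a CSS selector do the filtering so no explicit loop over anchors is needed:
soup = BeautifulSoup(page.content, 'html.parser')
l = [a['href'] for a in soup.select('a.annmt[href]')]  # anchors with class "annmt" that carry an href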

How to compile custom format ini file with redirects?

I'm working with an application that has 3 ini files in a somewhat irritating custom format. I'm trying to compile these into a 'standard' ini file.
I'm hoping for some inspiration in the form of pseudocode to help me code some sort of 'compiler'.
Here's an example of one of these ini files. The less-than/greater-than brackets indicate a redirect to another section of the file. These redirects can be recursive, i.e. one redirect then redirects to another. A redirect can also point to an external file (three values are present in that case). Comments start with a # symbol.
[PrimaryServer]
name = DEMO1
baseUrl = http://demo1.awesome.com
[SecondaryServer]
name = DEMO2
baseUrl = http://demo2.awesome.com
[LoginUrl]
# This is a standard redirect
baseLoginUrl = <PrimaryServer:baseUrl>
# This is a redirect appended with extra information
fullLoginUrl = <PrimaryServer:baseUrl>/login.php
# Here's a redirect that points to another redirect
enableSSL = <SSLConfiguration:enableSSL>
# This is a key that has multiple comma-separated values, some of which are redirects.
serverNames = <PrimaryServer:name>,<SecondaryServer:name>,AdditionalRandomServerName
# This one is particularly nasty. It's a redirect to another file...
authenticationMechanism = <Authentication.ini:Mechanisms:PrimaryMechanism>
[SSLConfiguration]
enableSSL = <SSLCertificates:isCertificateInstalled>
[SSLCertificates]
isCertificateInstalled = true
Here's an example of what I'm trying to achieve. I've removed the comments for readability.
[PrimaryServer]
name = DEMO1
baseUrl = http://demo1.awesome.com
[SecondaryServer]
name = DEMO2
baseUrl = http://demo2.awesome.com
[LoginUrl]
baseLoginUrl = http://demo1.awesome.com
fullLoginUrl = http://demo1.awesome.com/login.php
enableSSL = true
serverNames = DEMO1,DEMO2,AdditionalRandomServerName
authenticationMechanism = valueFromExternalFile
[SSLConfiguration]
enableSSL = <SSLCertificates:isCertificateInstalled>
[SSLCertificates]
isCertificateInstalled = true
I'm looking at using ini4j (Java) to achieve this, but am by no means fixed on using that language.
My main questions are:
1) How can I handle the recursive redirects?
2) How am I best to handle the redirects that have additional strings appended, e.g. serverNames?
3) Bonus points for any suggestions about how to handle the external redirects. No big deal if that part isn't working just yet.
So far, I'm able to parse and tidy up the file, but I'm struggling with these redirects.
Once again, I'm only hoping for pseudocode. Perhaps I need more coffee, but I'm really puzzled by this one.
Thanks in advance for any suggestions.
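Since you're not fixed on Java, here is a rough sketch in Python of one way to approach it, built on configparser plus a regex substitution. The function names, the depth guard and the external-file handling are illustrative assumptions rather than a finished compiler:
import configparser
import re

# Matches <Section:key> and <File.ini:Section:key>
REDIRECT = re.compile(r'<([^<>:]+):([^<>:]+)(?::([^<>]+))?>')

def load(path):
    cp = configparser.ConfigParser(interpolation=None)
    cp.optionxform = str                      # keep key case exactly as written
    cp.read(path)
    return cp

def resolve(cp, value, externals, depth=0):
    """Recursively replace redirect tokens inside a raw value."""
    if depth > 20:                            # crude guard against circular redirects
        raise ValueError('redirect loop while resolving: ' + value)

    def repl(match):
        first, second, third = match.groups()
        if third is None:                     # <Section:key> in the current file
            source, section, key = cp, first, second
        else:                                 # <File.ini:Section:key> in an external file
            source = externals.setdefault(first, load(first))
            section, key = second, third
        return resolve(source, source.get(section, key), externals, depth + 1)

    return REDIRECT.sub(repl, value)          # leaves text outside the brackets untouched

def compile_ini(in_path, out_path):
    cp = load(in_path)
    externals = {}                            # cache of already-loaded external files
    for section in cp.sections():
        for key, raw in cp.items(section):
            cp.set(section, key, resolve(cp, raw, externals))
    with open(out_path, 'w') as handle:
        cp.write(handle)                      # note: configparser drops comments on write
Because the substitution only touches the bracketed tokens, appended strings (fullLoginUrl) and mixed comma-separated values (serverNames) fall out of the same code path.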

Parser returns wrong url

I'm parsing dialect words from http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna.
from urllib import request
from bs4 import BeautifulSoup
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist
url = 'http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna'
dialettando_tokens = []
while url:
    html = request.urlopen(url).read().decode('utf8')
    page = BeautifulSoup(html, 'html.parser')
    a_list = page.find_all('a')
    for a in a_list:
        try:
            a_str = str(a.contents[0])
            if a_str[:3] == '<b>' and a.contents[0].string:
                dialettando_tokens.append(a.contents[0].string.strip())
        except:
            pass
        if a.string == 'Simonelli Editore Srl':
            break
        elif a.string == 'PROSSIMI':
            link = a['href']
            url = 'http://www.dialettando.com/dizionario/' + link
            break
        else:
            url = ''
At the end of each iteration I need to extract the URL of the next page.
HTML:
<a href="hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna">PROSSIMI</a>
And I need to get this link:
'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'
BUT the parser returns:
'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto®ione=Sardegna'
This link doesn't work correctly and I can't understand what's wrong.
An href should have its ampersands escaped as &amp;. The site you are scraping does not escape them, which only works as long as no query parameter accidentally spells out an HTML entity; here one does, since &regione starts with &reg, the entity for ®. So you are parsing buggy HTML with a parser that performs the entity conversion even though the semicolon is missing, turning &regione= into ®ione=.
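One pragmatic workaround, assuming regione is the only parameter being mangled, is to escape that ampersand yourself before handing the page to BeautifulSoup:
html = request.urlopen(url).read().decode('utf8')
html = html.replace('&regione=', '&amp;regione=')  # stop the parser from reading &reg as the ® entity
page = BeautifulSoup(html, 'html.parser')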
