Scraping webpages for links with a specific class - hyperlink

First post here and I have had a look but can't find the answer I need.
I'm trying to go through a website and find all the links that have a certain class, in this case 'annmt'.
I want the result to show only the link, though, and I'm having trouble getting the format right. Once it's right, I want to append each link to an empty list that I can call on later.
My code is:
import requests
from bs4 import BeautifulSoup
import datetime as dt

l = []

def getlinks():
    page = requests.get("http://www.investegate.co.uk/Index.aspx?ftse=1&date=20170609")
    soup = BeautifulSoup(page.content, 'html.parser')
    for links in soup.find_all('a', attrs={'class': 'annmt'}):
        for link in links.find_all('a', href=True):
            link = link['href']
            l.append(link)
    print(l)

Here is the working code for your reference:
import requests
from bs4 import BeautifulSoup
import datetime as dt

l = []

def getlinks():
    page = requests.get("http://www.investegate.co.uk/Index.aspx?ftse=1&date=20170609")
    soup = BeautifulSoup(page.content, 'html.parser')
    for links in soup.find_all('a', attrs={'class': 'annmt'}):
        link = links.get('href')
        l.append(link)
    print(l)
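A quick usage sketch (hedged: the href values on that page may be relative, which is an assumption about the page's markup, so urljoin is used defensively with the site root from the request above):

from urllib.parse import urljoin

getlinks()
# Resolve any relative hrefs against the site root before using them later.
absolute_links = [urljoin("http://www.investegate.co.uk/", href) for href in l]
print(absolute_links)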

Related

Beautiful soup findAll returns empty list on this website?

I'm trying to extract property value history from this website: https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/
But my code returns an empty list instead of the property cost history.
I used the following code:
from selenium import webdriver
import time
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = "https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/"
driver.maximize_window()
driver.get(url)
time.sleep(5)

content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content, "html.parser")
officials = soup.findAll("table", {"id": "property-history"})
for entry in officials:
    print(str(entry))
Which returns an empty list, although this URL does have a property history table. Any help would be appreciated.
Thanks!
officials = soup.findAll("table",{"id":"property-history"})
In the browser, I don't see a table with id="property-history", but there is a div with that id, so maybe you can instead get the data you want through
officials = soup.find_all("div", {"id":"property-history"})
Btw, the only table I could find while inspecting the page was inside the map, and I don't think it holds any useful information for you.
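Since the page is rendered with JavaScript, an explicit wait is usually more robust than a fixed time.sleep(5). A minimal sketch, assuming the div with id="property-history" is eventually added to the DOM:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/")

# Wait up to 15 seconds for the history container to appear,
# instead of sleeping for a fixed interval.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.ID, "property-history"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
history = soup.find("div", {"id": "property-history"})
print(history.get_text(" ", strip=True) if history else "not found")
driver.quit()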

Scrapy Spider not returning any results

I am trying to build a scraper with Scrapy. My overall goal is to scrape the webpages of a website and return a list of links for all downloadable documents of the different pages.
Somehow my code returns only None. I am not sure what the cause could be. Thank you for your help in advance. Please note that robots.txt does not cause this issue.
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner

def processlinks(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link

class ExampleCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com']
    start_urls = ["https://example.com/"]
    rules = (
        Rule(
            LinkExtractor(
                deny=[
                    re.escape('https://www.example.com/offsite'),
                    re.escape('https://www.example.com/whitelist-offsite'),
                ],
            ),
            process_links=processlinks,
            callback='parse_links',
            follow=False
        ),
    )

    def parse_links(self, response):
        html = response.body
        links = scrapy.Selector(text=html).xpath('//#href').extract()
        documents = []
        for link in links:
            absolute_url = urljoin(response.url, link)
            documents.append(absolute_url)
        return documents
I expected to receive a list containing all document download links for all webpages of the website, but I only got a None value returned. It seems like the parse_links method does not get called.
There were a few logical and technical issues in the code. I have made changes to the code; below are the details.
Your site redirects to another site, so you need to update the allowed domains and add www.iana.org:
allowed_domains = ['www.example.com', 'www.iana.org']
Secondly, a Scrapy callback can't return a bare list or string; it should return (or yield) Requests or items, such as dictionaries (see the last line of the code). In addition, the XPath //#href is invalid; //@href is what selects the href attribute values. Finally, the missing import scrapy and from urllib.parse import urljoin imports are added.
import re
from urllib.parse import urljoin

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.url import url_query_cleaner

def processlinks(links):
    for link in links:
        link.url = url_query_cleaner(link.url)
        yield link

class ExampleCrawler(CrawlSpider):
    name = 'example'
    allowed_domains = ['www.example.com', 'www.iana.org']
    start_urls = ["https://example.com/"]
    rules = (
        Rule(
            LinkExtractor(
                deny=[
                    re.escape('https://www.example.com/offsite'),
                    re.escape('https://www.example.com/whitelist-offsite'),
                ],
            ),
            process_links=processlinks,
            callback='parse_links',
            follow=False
        ),
    )

    def parse_links(self, response):
        html = response.body
        # //@href selects every href attribute value on the page
        links = scrapy.Selector(text=html).xpath('//@href').extract()
        documents = []
        for link in links:
            absolute_url = urljoin(response.url, link)
            documents.append(absolute_url)
        # Callbacks must return Requests or items (e.g. dicts), not bare lists
        return {"document": documents}

beautifulsoup returns only partial urls for some websites

from bs4 import BeautifulSoup, SoupStrainer
import requests

def get_url(url):
    page = requests.get(url.format())
    data = page.text
    soup = BeautifulSoup(data)
    for link in soup.find_all('a'):
        print(link.get('href'))
So that's the base code, and when I request:
# get_url("https://www.marie-claire.es/moda")
get_url("http://spanish.xinhuanet.com/")
xinhuanet returns full URLs, but the other website does not return full hyperlinks. I am not sure why I have this issue or how to solve it. Has anyone had a similar issue, or an idea how to solve it?
I suspect that you're looking for urljoin here:
from bs4 import BeautifulSoup, SoupStrainer
import requests
from urllib.parse import urljoin

def get_url(url):
    page = requests.get(url)
    data = page.text
    soup = BeautifulSoup(data, 'html.parser')  # explicit parser avoids a bs4 warning
    for link in soup.find_all('a'):
        # urljoin resolves relative hrefs against the page URL
        # and leaves absolute ones untouched
        print(urljoin(url, link.get('href')))
You might also consider
for link in set(soup.find_all('a')):
to avoid duplicates in your result.
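For reference, a small demonstration of how urljoin behaves (the /moda/vestidos path is just a hypothetical relative href for illustration):

from urllib.parse import urljoin

base = "https://www.marie-claire.es/moda"

# A root-relative href is resolved against the page's origin...
print(urljoin(base, "/moda/vestidos"))
# -> https://www.marie-claire.es/moda/vestidos

# ...while an already-absolute href passes through unchanged.
print(urljoin(base, "http://spanish.xinhuanet.com/"))
# -> http://spanish.xinhuanet.com/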

Web Scraping - BeautifulSoup parsers do not seem to work

I am trying to extract the names of a few items from the URL below. The node and class_ point to the right content, but when I use find_all, I do not get back any results. From previous posts it seems that this problem might be connected to using the wrong parser. I have used xml, lxml, and others, but nothing seems to work.
Is anyone able to extract the content successfully?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import html5lib
import urllib3

url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('div', class_='name namorio')
UPDATE
I have managed to find the info I needed, hidden in another section of the same page that is available to inspection. I extracted it using this code:
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('li', class_=False)
names_pb = [c.select("a > p")[0].text for c in con_pb]
prices_pb = [c.select('a > p')[1].text for c in con_pb]
picts_pb = [c.find('img').get('src') for c in con_pb]
df_pb = pd.DataFrame({'(Pull&Bear) Item_Name': names_pb,
                      'Item_Price_EUR': prices_pb,
                      'Link_to_Pict': picts_pb})
It seems that the website is using JavaScript to display its content, meaning that you can't directly visit the page and scrape the content (requests doesn't support JavaScript-rendered websites). That said, all of the data displayed on the website is sent in the form of a JSON string, so to get all the names of the items you could use the following code:
import requests
url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
all_products = requests.get(url).json()["products"]
product_names = [item["bundleProductSummaries"][0]["name"] for item in all_products]
print(product_names)
Hope this helps.
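If you want to keep the DataFrame workflow from the original attempt, the names can be loaded into pandas directly. A minimal sketch; any fields beyond "name" would need to be checked against the actual JSON payload:

import pandas as pd

# One-column DataFrame built from the product names scraped above.
df_pb = pd.DataFrame({'(Pull&Bear) Item_Name': product_names})
print(df_pb.head())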

autoplay first video in results of youtube using python

I want to make an app where I search by typing a certain keyword, and the program automatically plays the first video in the YouTube search results. How do I get the link for the first video in the search results?
This code prints the link of the first video in the search results for the query you provide to the app. Example: run the app, type hello its me, and then it does its magic.
import urllib.request
import urllib.parse
import re
import webbrowser as wb
query_string = urllib.parse.urlencode({"search_query" : input()})
html_content = urllib.request.urlopen("http://www.youtube.com/results?"+query_string)
search_results = re.findall(r'href=\"\/watch\?v=(.{11})', html_content.read().decode())
print("http://www.youtube.com/watch?v=" + search_results[0])
wb.open_new("http://www.youtube.com/watch?v={}".format(search_results[0]))
I have updated the answer from Ali.M.Kamel.
import urllib.request
import urllib.parse
import re
import webbrowser as wb
query_string = urllib.parse.urlencode({"search_query" : input()})
html_content = urllib.request.urlopen("https://www.youtube.com.hk/results?"+query_string)
search_results = re.findall(r'url\"\:\"\/watch\?v\=(.*?(?=\"))', html_content.read().decode())
if search_results:
    print("http://www.youtube.com/watch?v=" + search_results[0])
    wb.open_new("http://www.youtube.com/watch?v={}".format(search_results[0]))
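As a usage sketch, the same logic can be wrapped in a function so the query can be passed in programmatically instead of read from input(). Hedged: YouTube's markup changes often, so the regex may need adjusting over time.

import urllib.request
import urllib.parse
import re
import webbrowser as wb

def play_first_result(query):
    # Build the search URL, fetch the results page, and pull out the
    # first /watch?v=... path with the same regex as above.
    query_string = urllib.parse.urlencode({"search_query": query})
    html = urllib.request.urlopen(
        "https://www.youtube.com/results?" + query_string).read().decode()
    matches = re.findall(r'url\"\:\"\/watch\?v\=(.*?(?=\"))', html)
    if matches:
        url = "https://www.youtube.com/watch?v=" + matches[0]
        wb.open_new(url)
        return url
    return None

print(play_first_result("hello its me"))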
