BeautifulSoup - All href links don't appear to be extracting

I am trying to extract all href links that are within class ['address']. Each time I run the code, I only get the first 5 and that's it, even though I know there should be 9.
Web-Page:
https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch
I have read through the threads listed below and altered my code countless times, including switching through all the parsers (html.parser, html5lib, lxml, xml, lxml-xml), but nothing seems to work. Any idea what's causing it to stop after the 5th iteration? I am still fairly new to Python, so I apologize if this is a rookie mistake that I'm overlooking. Any help would be appreciated, even the sarcastic answers :)
Beautiful Soup findAll doesn't find them all
Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds
BeautifulSoup fails to parse long view state
Beautifulsoup lost nodes
Missing parts on Beautiful Soup results
Python 64 bit not storing as long of string as 32 bit python
I used pretty similar code on the following web-pages below and did not experience any issues scraping the hrefs:
https://www.walgreens.com/storelistings/storesbystate.jsp?requestType=locator
https://www.walgreens.com/storelistings/storesbycity.jsp?requestType=locator&state=AK
My code below:
import requests
from bs4 import BeautifulSoup
local_rg = requests.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = local_rg.content
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
for link in local_rg_content_src.find_all('div'):
    local_class = str(link.get('class'))
    if str("['address']") in str(local_class):
        local_a = link.find_all('a')
        for a_link in local_a:
            local_href = str(a_link.get('href'))
            print(local_href)
My results (first 5):
/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
But should be 9:
/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681

Try using selenium instead of requests to get the source code of the page. Here is how you do it:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = driver.page_source
driver.close()
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
The rest of the code is the same. Here is the full code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch')
local_rg_content = driver.page_source
driver.close()
local_rg_content_src = BeautifulSoup(local_rg_content, 'lxml')
for link in local_rg_content_src.find_all('div'):
    local_class = str(link.get('class'))
    if str("['address']") in str(local_class):
        local_a = link.find_all('a')
        for a_link in local_a:
            local_href = str(a_link.get('href'))
            print(local_href)
Output:
/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
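Once the rendered HTML is in hand, the class check in the loop above can also be written as a CSS selector. This is only a sketch: the HTML fragment is a made-up stand-in for the store locator markup, and `extract_address_links` is a hypothetical helper name, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the rendered page source.
SAMPLE_HTML = """
<div class="address"><a href="/locator/store-a/id=1">Store A</a></div>
<div class="address extra"><a href="/locator/store-b/id=2">Store B</a></div>
<div class="hours"><a href="/hours">Hours</a></div>
"""


def extract_address_links(html):
    """Return the href of every <a> nested inside a div with class 'address'."""
    soup = BeautifulSoup(html, "html.parser")
    # 'div.address' matches any div whose class list contains 'address'.
    return [a["href"] for a in soup.select("div.address a[href]")]


links = extract_address_links(SAMPLE_HTML)
print(links)
```

Unlike the string comparison against "['address']", the selector also matches divs that carry additional classes, which may or may not be what you want.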

The page uses Ajax to load the store information from an external URL, so the HTML returned by requests contains only part of the list. You can use the requests and json modules to load the data directly:
import re
import json
import requests
url = 'https://www.walgreens.com/storelocator/find.jsp?requestType=locator&state=AK&city=ANCHORAGE&from=localSearch'
ajax_url = 'https://www.walgreens.com/locator/v1/stores/search?requestor=search'
m = re.search(r'"lat":([\d.-]+),"lng":([\d.-]+)', requests.get(url).text)
params = {
    'lat': m.group(1),
    'lng': m.group(2)
}
data = requests.post(ajax_url, json=params).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for result in data['results']:
    print(result['store']['address']['street'])
    print('https://www.walgreens.com' + result['storeSeoUrl'])
    print('-' * 80)
Prints:
1470 W NORTHERN LIGHTS BLVD
https://www.walgreens.com/locator/walgreens-1470+w+northern+lights+blvd-anchorage-ak-99503/id=15092
--------------------------------------------------------------------------------
725 E NORTHERN LIGHTS BLVD
https://www.walgreens.com/locator/walgreens-725+e+northern+lights+blvd-anchorage-ak-99503/id=13656
--------------------------------------------------------------------------------
4353 LAKE OTIS PARKWAY
https://www.walgreens.com/locator/walgreens-4353+lake+otis+parkway-anchorage-ak-99508/id=15653
--------------------------------------------------------------------------------
7600 DEBARR RD
https://www.walgreens.com/locator/walgreens-7600+debarr+rd-anchorage-ak-99504/id=12679
--------------------------------------------------------------------------------
2197 W DIMOND BLVD
https://www.walgreens.com/locator/walgreens-2197+w+dimond+blvd-anchorage-ak-99515/id=12680
--------------------------------------------------------------------------------
2550 E 88TH AVE
https://www.walgreens.com/locator/walgreens-2550+e+88th+ave-anchorage-ak-99507/id=15654
--------------------------------------------------------------------------------
12405 BRANDON ST
https://www.walgreens.com/locator/walgreens-12405+brandon+st-anchorage-ak-99515/id=13449
--------------------------------------------------------------------------------
12051 OLD GLENN HWY
https://www.walgreens.com/locator/walgreens-12051+old+glenn+hwy-eagle+river-ak-99577/id=15362
--------------------------------------------------------------------------------
1721 E PARKS HWY
https://www.walgreens.com/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681
--------------------------------------------------------------------------------
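The coordinate extraction in the answer above can be tried offline. The sketch below runs the same regular expression against an invented snippet (the real page's embedded script may look different, and the coordinates are made up), returns None instead of raising when nothing matches, and resolves a relative store path with `urljoin`. `extract_coordinates` is a hypothetical helper name.

```python
import re
from urllib.parse import urljoin

# Made-up stand-in for the JavaScript embedded in the page; the real markup may differ.
SAMPLE_PAGE_TEXT = 'var search = {"lat":61.2181,"lng":-149.9003,"zoom":10};'

BASE_URL = "https://www.walgreens.com"


def extract_coordinates(page_text):
    """Return {'lat': ..., 'lng': ...} as strings, or None when the pattern is absent."""
    match = re.search(r'"lat":([\d.-]+),"lng":([\d.-]+)', page_text)
    if match is None:
        return None
    return {"lat": match.group(1), "lng": match.group(2)}


params = extract_coordinates(SAMPLE_PAGE_TEXT)
print(params)

# Relative store paths can be resolved against the site root.
store_url = urljoin(BASE_URL, "/locator/walgreens-1721+e+parks+hwy-wasilla-ak-99654/id=12681")
print(store_url)
```

Checking for None before calling `.group()` avoids an AttributeError if the site changes how it embeds the coordinates.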

Related

Weird URL Encoding/Decoding for non English Characters

How and why is a non-English word converted to weird characters, for example پاکستان to Ù¾Ø§Ú©Ø³ØªØ§Ù†, and is there any way to get پاکستان back from Ù¾Ø§Ú©Ø³ØªØ§Ù†? It happens both in code shown by the browser and in requests received at the server.
Background:
I get a lot of requests at my non-English (Urdu) website with URLs like
Ù¾Ø§Ú©Ø³ØªØ§Ù†
I tried to find out what that means, but search engines didn't help. I tried queries like
Decode this 'mystring'
What encoding is this 'mystring'
I thought it might be a corrupted/spam URL, based on this link:
Weird characters in URL
Problem explanation/example
But when I viewed one of my JS files in the browser (a file that works correctly), it showed the same weird characters, even on localhost:
'pakistan': {'eng': 'Pakistan', 'ur': 'Ù¾Ø§Ú©Ø³ØªØ§Ù†'},
//But the actual source code for the line above is the following
'pakistan': {'eng': 'Pakistan', 'ur': 'پاکستان'},
My knowledge
I only know about URL encoding/decoding, which as far as I can tell is unrelated here: encodeURI and decodeURI in JS, quote and unquote in Python, and the equivalents in other languages. All they do for me is convert
`پاکستان` to `%D9%BE%D8%A7%DA%A9%D8%B3%D8%AA%D8%A7%D9%86` and vice versa
Why needed?
I don't want to miss the requests received with those malformed URLs. All browsers (Chrome/Firefox/Edge) show the same characters, so their conversion method and result must be the same, and there should be some technique available to reverse it as well.
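One plausible explanation, assuming the garbling comes from UTF-8 bytes being read as Windows-1252 (cp1252): percent-encoding and the weird characters are two views of the same bytes. The sketch below uses only the standard library; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

original = "پاکستان"

# Percent-encoding is just the UTF-8 bytes written as %XX escapes.
percent_encoded = quote(original)

# Decoding those bytes with the right codec restores the word...
decoded_correctly = unquote(percent_encoded, encoding="utf-8")

# ...while decoding the very same bytes as Windows-1252 yields the garbled form.
decoded_wrongly = unquote(percent_encoded, encoding="cp1252")

print(percent_encoded)
print(decoded_correctly)
print(decoded_wrongly)
```

If this assumption holds, the garbled URLs are not spam but correctly encoded requests that some client or log viewer decoded with the wrong character set.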

Thanks to Giacomo Catenazzi, and I am grateful for the following answer:
How to decode cp1252 string?
A very custom and still imperfect solution to my problem.
The algorithm from that answer needs to be improved. By experiment I found that it does not work for me when the string is long or includes - (hyphens).
So I made changes according to my requirements, and it works well enough that I can guess what the actual string was.
import re, itertools
def specific_my_required_processing(received_string):
    # Leading characters that indicate a segment is still garbled after decoding
    starting_characters_in_encoded_string_in_my_case = ['Ø', 'Ã', 'Ù', 'Ù', 'Ú']
    # Decode each hyphen-separated segment on its own
    arr = received_string.split('-')
    res = []
    missed = []
    for string_item in arr:
        decoded_string = guess_decode_string_without_hyphens(string_item)
        if decoded_string and decoded_string[:1] not in starting_characters_in_encoded_string_in_my_case:
            res.append(decoded_string)
        else:
            missed.append({string_item: decoded_string})
    resulting_urdu_string = '-'.join(res)
    print('\n\nResult', resulting_urdu_string)
    print('\nCould not be decoded', missed)
def guess_decode_string_without_hyphens(s):
    encodings = ['cp1251', 'cp1252', 'utf8']
    # Try every chain of 2, 4, 6 and 8 alternating encode/decode steps
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # str -> bytes via encode, bytes -> str via decode
                    r = r.encode(enc) if isinstance(r, str) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if re.match(r'^[\w\sа-яА-Я]+$', r):
                res = str(r)
                print('Encoding => ', encs, ' Conversion = ' + s + ' => ' + res)
                return res
    # Returns None implicitly when no chain produces a clean string
sample_encoded_string = 'Ø§Ø³Ù„Ø§Ù…-Ø¢Ø¨Ø§Ø¯-ÛØ§Ø¦ÛŒÚ©ÙˆØ±Ù¹-Ø§ÛŒ-ÙˆÛŒ-Ø§ÛŒÙ…-Ù‚Ø§Ù†ÙˆÙ†-Ø³Ø§Ø²ÛŒ-Ú©Ø§Ù„Ø¹Ø¯Ù…-Ù‚Ø±Ø§Ø±-Ø¯ÛŒÙ†Û’-Ú©ÛŒ-Ø¯Ø±Ø®ÙˆØ§Ø³Øª-Ù†Ø§Ù…Ú©Ù…Ù„-Ù‚Ø±Ø§Ø±'
specific_my_required_processing(sample_encoded_string)
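For strings where every byte survived, a single reverse step may be enough and the brute-force search is not needed. The sketch below assumes the garbling is exactly "UTF-8 bytes decoded as cp1252"; `undo_mojibake` and `undo_mojibake_slug` are hypothetical names. Segments that cannot be repaired are returned unchanged, for example when a byte with no cp1252 mapping (0x81, 0x8D, 0x8F, 0x90, 0x9D) was dropped along the way.

```python
def undo_mojibake(text):
    """Re-encode as cp1252 and decode as UTF-8; return None if either step fails."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def undo_mojibake_slug(slug):
    """Repair each hyphen-separated part, keeping parts that cannot be repaired."""
    repaired = []
    for part in slug.split("-"):
        fixed = undo_mojibake(part)
        repaired.append(fixed if fixed is not None else part)
    return "-".join(repaired)


# Build a garbled sample in code so the example does not depend on copy-paste fidelity.
garbled = "اسلام-آباد".encode("utf-8").decode("cp1252")
print(undo_mojibake_slug(garbled))
```

Working per segment means one damaged word does not prevent the rest of the URL slug from being recovered.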

Web Scraping - BeautifulSoup parsers do not seem to work

I am trying to extract the names of a few items from the URL below. The tag and class_ point to the right content, but when I use find_all, I do not get any results back. From previous posts it seems that this problem might be connected to using the wrong parser. I have tried xml, lxml and others, but nothing seems to work.
Is anyone able to extract the content successfully?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import html5lib
import urllib3
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('div', class_ = 'name namorio')

UPDATE
I have managed to find the info I needed, hidden in another section of the same code available to inspection. I have extracted them using this code:
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('li', class_ = False)
names_pb = [c.select("a > p")[0].text for c in con_pb]
prices_pb = [c.select('a > p')[1].text for c in con_pb]
picts_pb = [c.find('img').get('src') for c in con_pb]
df_pb = pd.DataFrame({'(Pull&Bear) Item_Name': names_pb,
                      'Item_Price_EUR': prices_pb,
                      'Link_to_Pict': picts_pb})

It seems that the website uses JavaScript to display its content, which means you can't scrape the content directly from the page HTML (requests does not execute JavaScript). That said, all of the data displayed on the website is sent as a JSON string, so to get the names of all the items you could use the following code:
import requests
url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
all_products = requests.get(url).json()["products"]
product_names = [item["bundleProductSummaries"][0]["name"] for item in all_products]
print(product_names)
hope this helps
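The list comprehension above raises an error if a product has no entries under "bundleProductSummaries". A slightly more defensive variant is sketched below, run against an invented payload that only mimics the nesting used in the answer; the product names are made up and `product_names` is a hypothetical helper.

```python
import json

# Invented payload that only mimics the nesting used in the answer above.
SAMPLE_RESPONSE = json.loads("""
{
  "products": [
    {"bundleProductSummaries": [{"name": "Backpack A"}]},
    {"bundleProductSummaries": []},
    {"bundleProductSummaries": [{"name": "Backpack B"}]}
  ]
}
""")


def product_names(payload):
    """Collect the first summary name of each product, skipping products without one."""
    names = []
    for item in payload.get("products", []):
        summaries = item.get("bundleProductSummaries") or []
        if summaries and "name" in summaries[0]:
            names.append(summaries[0]["name"])
    return names


names = product_names(SAMPLE_RESPONSE)
print(names)
```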

Scraping webpages for links with a specific class

First post here and I have had a look but can't find the answer I need.
I'm trying to go through a website and find all the links that have a certain class, in this case 'annmt'.
I want the result to only show the link though and am having trouble trying to get the format right. Once right I want to append it to an empty list that I can call on later.
My code is:
import requests
from bs4 import BeautifulSoup
import datetime as dt
l = []
def getlinks():
    page = requests.get("http://www.investegate.co.uk/Index.aspx?ftse=1&date=20170609")
    soup = BeautifulSoup(page.content, 'html.parser')
    for links in soup.find_all('a', attrs={'class': 'annmt'}):
        for link in links.find_all('a', href=True):
            link = link['href']
            l.append(link)
    print(l)

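find_all('a', attrs={'class': 'annmt'}) already returns the <a> tags themselves, so calling find_all('a') on each of them searches for anchors nested inside an anchor and finds nothing. Read the href directly from each tag instead. Here is the working code for your reference:
import requests
from bs4 import BeautifulSoup
import datetime as dt
l = []
def getlinks():
    page = requests.get("http://www.investegate.co.uk/Index.aspx?ftse=1&date=20170609")
    soup = BeautifulSoup(page.content, 'html.parser')
    for links in soup.find_all('a', attrs={'class': 'annmt'}):
        # Each match is the <a> tag itself, so take its href directly
        link = links.get('href')
        l.append(link)
    print(l)
getlinks()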

urllib.request.urlopen(url) how to use this function with ip address?

I'm testing page load times in Python 3, so I set up a local Apache server for comparison. The problem is that urllib.request.urlopen(url) doesn't accept my own IP address. Is there a way to fetch a page using only an IP address? Here's the code I'm working on:
start_loadf = time.time()
nf = urllib.request.urlopen(url)  # I want url here to be something like 192.168.1.2
page = nf.read()
end_loadf = time.time()
nf.close()
reading_time = format(end_loadf-start_loadf,'.3f')
print("First read time from source: ", reading_time, "s")

I solved the problem after looking at urllib more closely. What I actually needed was the functionality of urllib2, but since I'm using Python 3.4 I shouldn't import plain urllib, because then Python uses the urllib package rather than urllib2. After importing only urllib.request and writing the URL as http://192.168.1.2 instead of 192.168.1.2, it works fine.
import urllib.request
import time
import socket
nf = urllib.request.urlopen("http://192.168.1.2")
start_loadf = time.time()
page = nf.read()
end_loadf = time.time()
nf.close()
reading_time = format(end_loadf-start_loadf,'.3f')
print("First read time from source: ", reading_time, "s")
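The fix boils down to giving urlopen a URL with a scheme. A small sketch of that idea follows; `with_scheme` and `timed_fetch` are hypothetical helper names, and the default of "http" is an assumption. Note that, unlike the code above, `timed_fetch` times both opening the connection and reading the body.

```python
import time
import urllib.request


def with_scheme(address, default_scheme="http"):
    """Prefix a bare host or IP address with a scheme so urlopen accepts it."""
    if "://" in address:
        return address
    return f"{default_scheme}://{address}"


def timed_fetch(address, timeout=5.0):
    """Return (body, seconds) for opening the URL and reading the full response."""
    url = with_scheme(address)
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        body = response.read()
    return body, time.perf_counter() - start


print(with_scheme("192.168.1.2"))
# Example call (needs a reachable server, so it is left commented out):
# body, seconds = timed_fetch("192.168.1.2")
```

`time.perf_counter()` is used here because it is intended for measuring short durations, and the `with` block closes the response even if reading fails.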

How to programmatically get holdings of an ETF

I am looking for a way to get the list of holdings of an ETF via a web service such as Yahoo Finance. So far, YQL has not yielded the desired results.
As an example, ZUB.TO is an ETF that has holdings; here is a list of the holdings. Querying yahoo.finance.quotes does not return the proper information:
The result.
Is there another table somewhere that would contain the holdings?

Downloading from Yahoo Finance may not work.
Instead, how about using the download endpoints that the ETF providers already offer for Excel or CSV files of the holdings?
Save the "append_df_to_excel" helper in a file you can import, then use the code below to build an Excel file covering all 11 Sector SPDRs provided by SSgA (State Street Global Advisors).
Personally, I use this for breadth analysis.
import pandas as pd
# Assumes the helper from the link below was saved as append_to_excel.py
from append_to_excel import append_df_to_excel
# https://stackoverflow.com/questions/20219254/how-to-write-to-an-existing-excel-file-without-overwriting-data-using-pandas
##############################################################################
# Author: Salil Gangal
# Posted on: 08-JUL-2018
# Forum: Stack Overflow
##############################################################################
# Raw string so the backslashes in the Windows path are not treated as escapes
output_file = r'C:\my_python\SPDR_Holdings.xlsx'
base_url = "http://www.sectorspdr.com/sectorspdr/IDCO.Client.Spdrs.Holdings/Export/ExportExcel?symbol="
data = {
    'Ticker' : [ 'XLC','XLY','XLP','XLE','XLF','XLV','XLI','XLB','XLRE','XLK','XLU' ]
    , 'Name' : [ 'Communication Services','Consumer Discretionary','Consumer Staples','Energy','Financials','Health Care','Industrials','Materials','Real Estate','Technology','Utilities' ]
}
spdr_df = pd.DataFrame(data)
print(spdr_df)
for i, row in spdr_df.iterrows():
    url = base_url + row['Ticker']
    df_url = pd.read_excel(url)
    # The first data row holds the real column names
    header = df_url.iloc[0]
    holdings_df = df_url[1:].copy()
    # Assign the header directly (set_axis no longer accepts inplace in pandas 2.x)
    holdings_df.columns = header
    print("\n\n", row['Ticker'] , "\n")
    print(holdings_df)
    append_df_to_excel(output_file, holdings_df, sheet_name= row['Ticker'], index=False)
Image of Excel file generated for SPDRs
