Beautiful Soup findAll returns empty list on this website? - parsing

I'm trying to extract the property value history from this website: https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/
But my code returns an empty list instead of the property cost history.
I used the following code:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url= "https://www.properly.ca/buy/home/view/ma-tEpHcSzeES-OlhE-V6A/bc/vancouver/1268-w-broadway-%23720/"
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"property-history"})
for entry in officials:
    print(str(entry))
This returns an empty list, although the page does have a property history table. Any help would be appreciated.
Thanks!

officials = soup.findAll("table",{"id":"property-history"})
In the browser, I don't see a table with id="property-history", but there is a div with that id, so maybe you can instead get the data you want through
officials = soup.find_all("div", {"id":"property-history"})
Btw, the only table I could find while inspecting the page was inside the map, and I don't think it holds any useful information for you.
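For example, a minimal sketch building on that suggestion (the inner markup of the div is an assumption, so inspect the rendered page to pick out the exact rows you need):
history = soup.find("div", {"id": "property-history"})
if history is not None:
    # Dump the visible text; refine with further find()/select() calls as needed.
    print(history.get_text(" ", strip=True))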

Related

Web Scraping - BeautifulSoup parsers do not seem to work

I am trying to extract the names of a few items from the URL below. The node and class_ point to the right content, but when I use find_all, I do not get back any results. From previous posts it seems that this problem might be connected to using the wrong parser. I have used xml, lxml and others, but nothing seems to work.
Is anyone able to extract the content successfully?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import html5lib
import urllib3
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('div', class_ = 'name namorio')
UPDATE
I have managed to find the info I needed, hidden in another section of the same page available to inspection. I extracted it using this code:
url_pb = 'https://www.pullandbear.com/it/uomo/accessori/zaini-c1030207088.html'
req_pb = requests.get(url_pb)
pars_pb = BeautifulSoup(req_pb.content, 'html.parser')
con_pb = pars_pb.find_all('li', class_ = False)
names_pb = [c.select("a > p")[0].text for c in con_pb]
prices_pb = [c.select('a > p')[1].text for c in con_pb]
picts_pb = [c.find('img').get('src') for c in con_pb]
df_pb = pd.DataFrame({'(Pull&Bear) Item_Name': names_pb,
                      'Item_Price_EUR': prices_pb,
                      'Link_to_Pict': picts_pb})
It seems that the website is using JavaScript to display its content, meaning that you can't directly request the page and scrape it (requests doesn't support JavaScript-rendered websites). That being said, all of the data displayed on the website is sent in the form of a JSON string, so in order to get all the names of the items you could use the following code:
import requests
url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
all_products = requests.get(url).json()["products"]
product_names = [item["bundleProductSummaries"][0]["name"] for item in all_products]
print(product_names)
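If the endpoint ever returns an error page, a slightly more defensive variant might look like this (the field names are taken from the snippet above; the error handling is an assumption):
import requests

url = "https://www.pullandbear.com:443/itxrest/2/catalog/store/24009405/20309428/category/1030207088/product?languageId=-4&appId=1"
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing junk
all_products = resp.json().get("products", [])
product_names = [item["bundleProductSummaries"][0]["name"]
                 for item in all_products
                 if item.get("bundleProductSummaries")]
print(product_names)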
hope this helps

How to retrieve an XMLTYPE data that contains special characters?

I want to retrieve XMLTYPE data from an Oracle table using cx_oracle.
The data looks like this:
<infos>
    <Comment/>
    <Observation>àéèç</Observation>
    <Level>L3</Level>
    <Duration/>
    <Cause/>
    <Depot> Haren </Depot>
    <Resolution/>
</infos>
Here's my code:
#!/usr/bin/python
from __future__ import print_function
import cx_Oracle

# Connection to RTDIAG
try:
    dsn_test = cx_Oracle.makedsn(host='xxxxx', port='1521', service_name='xxxxx')
    con_test = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=dsn_test)
except cx_Oracle.InterfaceError:
    print("Impossible to connect to the DB!")
    print("***exit script***")
    quit()

ID_record = 1729
cursor = con_test.cursor()
query = """select a.content.getClobVal() from emb_log a
           where ID = :id and uncompleted_record = 1"""
cursor.execute(query, id=ID_record)
xml_retrieved = cursor.fetchone()[0].read()  # string
print(xml_retrieved)
Here's what I get
<infos>
    <Comment/>
    <Observation>aeec</Observation>
    <Level>L3</Level>
    <Duration/>
    <Cause/>
    <Depot> Haren </Depot>
    <Resolution/>
</infos>
The special characters contained within the XML children are not being retrieved properly; they are converted into ASCII-like characters.
Why, and how can I fetch the XML exactly the way it appears in the DB?
Thank you.
Set your NLS environment. You will probably find it easiest to use the encoding option when you connect. For performance, you will want to fetch the CLOB via an output type handler.
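Putting both suggestions together, a minimal sketch (the connection placeholders are from the question; the handler follows the standard cx_Oracle pattern):
import cx_Oracle

# Fetch CLOB columns directly as strings instead of LOB locators (faster).
def output_type_handler(cursor, name, default_type, size, precision, scale):
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

# Passing the encoding explicitly keeps the accented characters intact.
con_test = cx_Oracle.connect(user='xxxx', password='xxxxx', dsn=dsn_test,
                             encoding='UTF-8', nencoding='UTF-8')
con_test.outputtypehandler = output_type_handler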

Dart: getElementsByClassName returns a 0 element list but the data is there

I'm writing a function that will parse certain websites and fetch data from there, which will be used to create instances of a class. I'm able to successfully extract the data when it is retrieved using the getElementById() function, but for some reason, the getElementsByClassName() always returns a node list with 0 elements.
The site I'm currently parsing is here.
If you search for 'datas-nev', you will find exactly one match:
<p class="datas-nev"><b>Kutya neve: </b>Jhonny</p>
And here is the code use for parsing:
import 'package:html/parser.dart' show parse;
...
final response = await http.get(URL);
var document = parse(response.body);
var detailsContainer = document.getElementById('husky_details_container_right');
var dogName = new List<Node>();
dogName = document.getElementsByClassName('datas-nev');
The contents of the detailsContainer can be extracted successfully; for example, this gives me back a string of relevant data I will use later:
var humanBehaviourValue;
try { humanBehaviourValue = detailsContainer.nodes[1].nodes[19].nodes[1].nodes[7].nodes[1].toString(); }
catch (e) { humanBehaviourValue = 'N/A'; }
But when I check the value of dogName in the debug window, I get the following:
dogName = {_growableList} size = 0
I already tried initializing dogName 'properly' with List<Node> dogName = new List<Node>();, but it didn't help. I also tried other datas-* values, but it seems the parser can't find them. I even tried using just datas (because that is a div, while the others are paragraphs), but that didn't help either.
Basically I could just hardwire the name and some data (breed, color, etc) as those never really change, but the location of the shelter can change, and keeping it up-to-date by scraping the data seems better than pushing updates out manually. That means I mostly need the value of datas-helyszin but that isn't parsed either.
As @Günter Zöchbauer pointed out, the code actually works. I was just looking for the value too soon, before it was actually fetched...

TypeError when attempting to parse pubmed EFetch

I'm new to this Python/Biopython stuff, so am struggling to work out why the following code, pretty much lifted straight out of the Biopython Cookbook, isn't doing what I'd expect.
I'd have thought it'd end up with the interpreter displaying two lists containing the same numbers, but all I get is one list and then a message saying TypeError: 'generator' object is not subscriptable.
I'm guessing something is going wrong with the Medline.parse step and the result of the efetch isn't being processed in a way that allows subsequent iteration to extract the PMID values. Or the efetch isn't returning anything.
Any pointers as to what I'm doing wrong?
Thanks
from Bio import Medline
from Bio import Entrez
Entrez.email = 'A.N.Other@example.com'
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print(record['IdList'])
items = record['IdList']
handle2 = Entrez.efetch(db="pubmed", id=items, rettype="medline", retmode="text")
records = Medline.parse(handle2)
for r in records:
    print(records['PMID'])
You're trying to print records['PMID'] which is a generator. I think you meant to do print(r['PMID']) which will print the 'PMID' entry in the current record dictionary object for each iteration. This is confirmed by the example given in the Bio.Medline.parse() documentation.
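In other words, the corrected loop would be:
records = Medline.parse(handle2)
for r in records:
    print(r['PMID'])  # index the current record, not the generator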

Retrieving data via POST request

I am having trouble obtaining data programmatically from a particular webpage.
http://www.uschess.org/msa/thin2.php allows one to search for US Chess ratings by name and state.
Submitting a POST request, I can get to the equivalent of http://www.uschess.org/msa/thin2.php?memln=nakamura&memfn=hikaru, but that still requires one to click the "Search" button to get useful data. What is the best way to get to that results page?
import urllib.request
import urllib.parse
data = {'memfn':'hikaru', 'memln':'nakamura'}
url = r'http://www.uschess.org/msa/thin2.php'
s = urllib.request.urlopen(url, bytes(urllib.parse.urlencode(data), 'UTF-8'))
s.read()
Thanks!
This one works:
#!/usr/bin/env python
# Python 2.x
import urllib

data = {'memfn': 'hikaru', 'memln': 'nakamura', 'mode': 'Search'}
url = r'http://www.uschess.org/msa/thin2.php'
s = urllib.urlopen(url, urllib.urlencode(data))
print s.read()
Basically you need to submit the hidden parameter mode with the value Search to imitate the button press.
Note: I rewrote it for python 2.x, sorry, but I didn't have python3 handy.
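For reference, an untested Python 3 equivalent might look like this (same parameters, just split across urllib.request and urllib.parse):
import urllib.request
import urllib.parse

data = {'memfn': 'hikaru', 'memln': 'nakamura', 'mode': 'Search'}
url = 'http://www.uschess.org/msa/thin2.php'
body = urllib.parse.urlencode(data).encode('utf-8')  # POST body must be bytes
with urllib.request.urlopen(url, body) as s:
    print(s.read().decode('utf-8', errors='replace'))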
