Extracting an element from XML with Python3? - xml-parsing

I am trying to write a Python 3 script that queries a web API and receives an XML response. The response looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<ipinfo>
<ip_address>4.2.2.2</ip_address>
<ip_type>Mapped</ip_type>
<anonymizer_status/>
<Network>
<organization>level 3 communications inc.</organization>
<OrganizationData>
<home>false</home>
<organization_type>Telecommunications</organization_type>
<naics_code>518219</naics_code>
<isic_code>J6311</isic_code>
</OrganizationData>
<carrier>level 3 communications</carrier>
<asn>3356</asn>
<connection_type>tx</connection_type>
<line_speed>high</line_speed>
<ip_routing_type>fixed</ip_routing_type>
<Domain>
<tld>net</tld>
<sld>bbnplanet</sld>
</Domain>
</Network>
<Location>
<continent>north america</continent>
<CountryData>
<country>united states</country>
<country_code>us</country_code>
<country_cf>99</country_cf>
</CountryData>
<region>southwest</region>
<StateData>
<state>california</state>
<state_code>ca</state_code>
<state_cf>88</state_cf>
</StateData>
<dma>803</dma>
<msa>31100</msa>
<CityData>
<city>san juan capistrano</city>
<postal_code>92675</postal_code>
<time_zone>-8</time_zone>
<area_code>949</area_code>
<city_cf>77</city_cf>
</CityData>
<latitude>33.499</latitude>
<longitude>-117.662</longitude>
</Location>
</ipinfo>
This is the code I have so far:
import urllib.request
import urllib.error
import sys
import xml.etree.ElementTree as etree
…
try:
    xml = urllib.request.urlopen(targetURL, data=None)
except urllib.error.HTTPError as e:
    print("HTTP error: " + str(e) + " URL: " + targetURL)
    sys.exit()
tree = etree.parse(xml)
root = tree.getroot()
The API query works, and through the debugger I can see all of the information inside the ‘root’ variable. My issue is that I have not been able to figure out how to extract something like the ASN (<asn></asn>) from the returned XML. I’ve been beating my head against this for a day, trying a wide variety of find, findall and other methods, but have not been able to crack it. I think I have reached the point where I cannot see the wood for the trees, and none of the examples I have found on the internet seem to help. Can someone show me a code snippet that extracts the contents of an XML element from inside the tree structure?
Many thanks
Tim

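I would recommend using Beautiful Soup.
It's very powerful when it comes to extracting data from XML.
Example:
import urllib.request
from bs4 import BeautifulSoup
# Parse the response body, not the URL string; the "xml" parser requires lxml to be installed
soup = BeautifulSoup(urllib.request.urlopen(targetURL), "xml")
soup.find_all('asn')  # Returns a list of all <asn></asn> tags found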
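
For what it's worth, the standard library's ElementTree, which the question already imports, can probably do this without an extra dependency. The following is a minimal sketch, assuming the response has the structure shown in the question. The `sample` string is a trimmed stand-in for the real API response; in the original script, `root` would come from `etree.parse(xml).getroot()` instead.

```python
import xml.etree.ElementTree as etree

# Trimmed stand-in for the API response shown in the question.
sample = """<ipinfo>
  <ip_address>4.2.2.2</ip_address>
  <Network>
    <organization>level 3 communications inc.</organization>
    <asn>3356</asn>
  </Network>
  <Location>
    <CityData>
      <city>san juan capistrano</city>
    </CityData>
  </Location>
</ipinfo>"""

root = etree.fromstring(sample)

# find() takes a path relative to the element it is called on,
# so the root tag itself (<ipinfo>) is not part of the path.
asn = root.find("Network/asn").text

# ".//" searches all descendants, so the full path is not needed.
city = root.findtext(".//city")

# findtext() returns the default instead of raising when nothing matches.
missing = root.findtext("Network/no_such_tag", default="n/a")

print(asn, city, missing)
```

A common stumbling block is including the root tag in the path (for example `ipinfo/Network/asn`), which finds nothing because paths are evaluated relative to `root`.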

Related

Trouble parsing CNN search results using Python 3 lxml

I am trying to parse the response from a search on the CNN site like so:
import requests
from lxml import html
from lxml import etree
r = requests.get('https://www.cnn.com/search?q=climate+change')
doc = etree.HTML(r.content)
for url in doc.xpath('//a[@href]'):
    u = url.get('href')
    print(u)
This gives a bunch of links, primarily to different sections on the site, but it gives no links at all to the actual stories returned by the search. What am I doing wrong?

Using python to parse twitter url

I am using the following code but I am not able to extract any information from the url.
from urllib.parse import urlparse
if __name__ == "__main__":
    z = 5
    url = 'https://twitter.com/isro/status/1170331318132957184'
    df = urlparse(url)
    print(df)
ParseResult(scheme='https', netloc='twitter.com', path='/isro/status/1170331318132957184', params='', query='', fragment='')
I was hoping to extract the tweet message, time of tweet and other information available from the link, but the code above clearly doesn't achieve that. How do I go about it from here?
I think you may be misunderstanding the purpose of the urllib.parse.urlparse function. From the Python documentation:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment
From the result you are seeing in ParseResult, your code is working perfectly - it is breaking your URL up into the component parts.
It sounds as though you actually want to fetch the web content at that URL. In that case, I might take a look at urllib.request.urlopen instead.
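
To illustrate what urlparse can and cannot give you, here is a small sketch that pulls the account name and status ID out of the URL string itself. It assumes the path always has the shape `/<account>/status/<id>`, which is inferred from this one URL only. The tweet text and timestamp are not part of the URL, so they cannot be recovered this way.

```python
from urllib.parse import urlparse

url = "https://twitter.com/isro/status/1170331318132957184"
parts = urlparse(url)

# The path is "/isro/status/1170331318132957184"; split it into segments.
segments = parts.path.strip("/").split("/")

# Assumed layout: first segment is the account, last segment is the status ID.
account, tweet_id = segments[0], segments[-1]

print(parts.netloc)
print(account, tweet_id)
```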

LXML does not parse broken HTML : XMLSyntaxError error finding by XPath

I'm trying to extract a csrf token from a login page.
I'm using the lxml library as the parser.
s = requests.Session()
login_html = etree.fromstring(
    s.get('https://www.uwkotinleuven.be/fr/login').text)
find = etree.XPath('//*[@id="login-form-2"]/input[3]')
print(find(login_html).value )
Here is the error:
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 46, column 24
I'm unsure whether the error is coming from the XPath finder or from broken HTML in the response.
Should I change the parser, or pass it parameters? Is there a conventional way to parse broken HTML?
It looks like a broken HTML problem. See if this works for you:
import requests
import lxml.etree as etree
from io import StringIO
s = requests.Session()
dat = s.get('https://www.uwkotinleuven.be/fr/login')
parser = etree.HTMLParser()
tree = etree.parse(StringIO(dat.text), parser)
find = tree.xpath('//*[@id="login-form-2"]/input[3]')
print(find[0].attrib.values()[2])
Output:
3pKL_AsLLBE07T6S-VY8eXJ4ooK_QH5kMgajPEwKSso
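
If installing lxml is not an option, the standard library's `html.parser` is also tolerant of malformed markup. The sketch below swaps in that module rather than lxml and does not use XPath; it simply collects the `<input>` tags inside a form with a given id. The HTML snippet, the field names, and the token value `abc123` are all hypothetical stand-ins for the real login page.

```python
from html.parser import HTMLParser


class FormInputCollector(HTMLParser):
    """Collect the attributes of every <input> inside the form with a given id."""

    def __init__(self, form_id):
        super().__init__()
        self.form_id = form_id
        self.in_form = False
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("id") == self.form_id:
            self.in_form = True
        elif tag == "input" and self.in_form:
            self.inputs.append(attrs)

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical page: the unclosed <link> is the kind of markup that
# a strict XML parser rejects.
page = """<html><head><link rel="stylesheet" href="a.css"></head>
<body><form id="login-form-2">
<input type="text" name="user">
<input type="password" name="pass">
<input type="hidden" name="token" value="abc123">
</form></body></html>"""

collector = FormInputCollector("login-form-2")
collector.feed(page)

# Third input of the form, matching input[3] in the XPath above.
print(collector.inputs[2]["value"])
```

Looking the attribute up by name (`"value"`) avoids depending on the order in which the attributes happen to appear in the tag.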

Could anyone scrape an element with Jsoup?

I'm trying to scrape this link using Jsoup with Kotlin/Java, and I have a problem scraping the players part (under Current Squad). Could anyone parse it?
You cannot access the information directly using only the response from that link.
You can make a JSON object with the http response from https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_team_squad/2817 and https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_teamplayer_facts/2817/42556.
As an example in Python, you can get the minutes played by each player as follows:
import urllib.request
import json
# Fetch the squad list and the per-player statistics
f = urllib.request.urlopen('https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_team_squad/2817')
f2 = urllib.request.urlopen('https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_teamplayer_facts/2817/42556')
j = json.loads(f.read())
j2 = json.loads(f2.read())
plrs = j['doc'][0]['data']['players']
for plr in plrs:
    print('=========================')
    print(plr['name'])
    try:
        print('minutes played:' + str(j2['doc'][0]['data'][str(plr['_id'])]['stats']['total']['minutes_played']))
    except KeyError:
        # No statistics recorded for this player
        pass

How to read a xml file with python with xml.dom

I have an XML file and I need to fetch the value of each postNumber tag and, for every postNumber, all of its validNumber values. I am using xml.dom to read the XML file in Python. What is the code to fetch these values? Please help.
XML code:
<post>
<postNumber>180</postNumber>
<validList>
<validNumber>208</validNumber>
<validNumber>209</validNumber>
<validNumber>210</validNumber>
</validList>
</post>
<post>
<postNumber>1023832</postNumber>
<validList>
<validNumber>264</validNumber>
<validNumber>401</validNumber>
</validList>
</post>
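
One way this might be done with `xml.dom.minidom` is sketched below. It works on a copy of the XML above with tag names spelled consistently (XML is case-sensitive) and wrapped in a hypothetical `<posts>` root element, since a document needs exactly one root. The helper `text_of` and the dictionary `valid_by_post` are names introduced here for illustration.

```python
from xml.dom import minidom

# Hypothetical <posts> wrapper: an XML document needs a single root element.
xml_text = """<posts>
  <post>
    <postNumber>180</postNumber>
    <validList>
      <validNumber>208</validNumber>
      <validNumber>209</validNumber>
      <validNumber>210</validNumber>
    </validList>
  </post>
  <post>
    <postNumber>1023832</postNumber>
    <validList>
      <validNumber>264</validNumber>
      <validNumber>401</validNumber>
    </validList>
  </post>
</posts>"""


def text_of(node):
    """Return the stripped text of an element's first child text node."""
    return node.firstChild.data.strip()


doc = minidom.parseString(xml_text)

# Map each post number to the list of valid numbers found under that post.
valid_by_post = {}
for post in doc.getElementsByTagName("post"):
    number = text_of(post.getElementsByTagName("postNumber")[0])
    valid_by_post[number] = [
        text_of(n) for n in post.getElementsByTagName("validNumber")
    ]

print(valid_by_post)
```

To read from a file instead of a string, `minidom.parse(path)` can replace `minidom.parseString(xml_text)`.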
