lxml does not parse broken HTML: XMLSyntaxError when finding by XPath

I'm trying to extract a CSRF token from a login page. I'm using the lxml library as the parser.
s = requests.Session()
login_html = etree.fromstring(
    s.get('https://www.uwkotinleuven.be/fr/login').text)
find = etree.XPath('//*[@id="login-form-2"]/input[3]')
print(find(login_html)[0].get('value'))
Here is the error:
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 46, column 24
I'm unsure whether the error comes from the XPath expression or from broken HTML in the response.
Should I change the parser, or pass it parameters? Is there a conventional way to parse broken HTML?

It looks like a broken-HTML problem. See if this works for you:
import requests
import lxml.etree as etree
from io import StringIO

s = requests.Session()
dat = s.get('https://www.uwkotinleuven.be/fr/login')
parser = etree.HTMLParser()  # recovering parser that repairs broken markup
tree = etree.parse(StringIO(dat.text), parser)
find = tree.xpath('//*[@id="login-form-2"]/input[3]')
print(find[0].get('value'))
Output:
3pKL_AsLLBE07T6S-VY8eXJ4ooK_QH5kMgajPEwKSso
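As a side note, `etree.HTML()` is a shorthand that applies the same recovering HTML parser directly to a string, so the `StringIO` detour is optional. A minimal sketch with made-up broken markup (the form id and token value here are illustrative, not from the real site):

```python
import lxml.etree as etree

# etree.HTML() uses the recovering HTML parser, so broken markup
# (mismatched or unclosed tags) is repaired instead of raising.
broken = ('<html><head><link rel="icon"><body>'
          '<form id="login-form-2"><input name="a"><input name="b">'
          '<input name="_token" value="abc123"></form>')
root = etree.HTML(broken)

token = root.xpath('//*[@id="login-form-2"]/input[3]')[0].get('value')
print(token)  # -> abc123
```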

Related

Trouble parsing CNN search results using Python 3 lxml

I am trying to parse the response from a search on the CNN site like so:
import requests
from lxml import html
from lxml import etree
r = requests.get('https://www.cnn.com/search?q=climate+change')
doc = etree.HTML(r.content)
for url in doc.xpath('//a[@href]'):
    u = url.get('href')
    print(u)
This gives a bunch of links, primarily to different sections on the site, but it gives no links at all to the actual stories returned by the search. What am I doing wrong?

How to parse XML data with some non-xml formatted elements at Python

I have the following response from the CUCM API:
<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5"><return><row><pkid>63d1f8a1-0964-caa0-d496-ff91340c236c</pkid><userid>Semenova.LA</userid><firstname/><lastname>Семенова</lastname><snrenabled>t</snrenabled><devicecount>1</devicecount><licensetype>Enhanced </licensetype><licenses>1</licenses></row></return></ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>
I tried to parse this response using the lxml library:
from lxml import etree
root = etree.fromstring(response)
But I received the following error:
File "src\lxml\etree.pyx", line 3237, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1891, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
It looks as though some element of the response is unsupported.
If I cut the response down to
response='''<return><row><pkid>63d1f8a1-0964-caa0-d496-ff91340c236c</pkid><userid>Semenova.LA</userid><firstname/><lastname>Семенова</lastname><snrenabled>t</snrenabled><devicecount>1</devicecount><licensetype>Enhanced </licensetype><licenses>1</licenses></row></return>'''
everything works as expected.
What should I do to fix it?
Should I delete the unwanted wrapper elements, such as:
<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">
How can I do this?
Thanks a lot!
Try changing
root = etree.fromstring(response)
to
root = etree.fromstring(response.encode())
I applied a non-scalable solution - just cutting the beginning and the end of the string by template:
cut_string ='''<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">'''
cut_string2 = '''</ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>'''
s = response.replace(cut_string, "")
ss = s.replace(cut_string2, "")
root = etree.fromstring(ss.encode())
Is there a more clever solution?
For example - getting the string between <return> and </return>?
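Building on the `.encode()` fix, a sketch that parses the full envelope as bytes and then pulls out the <return> payload without any string surgery (the sample XML below is a trimmed copy of the response above):

```python
from lxml import etree

# A trimmed copy of the CUCM response, as a str with an encoding declaration.
response = ("<?xml version='1.0' encoding='UTF-8'?>"
            '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">'
            '<soapenv:Body>'
            '<ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">'
            '<return><row><userid>Semenova.LA</userid><licenses>1</licenses></row></return>'
            '</ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>')

# Encoding to bytes sidesteps the "encoding declaration" ValueError.
root = etree.fromstring(response.encode())

# <return> and its children carry no namespace prefix, so a plain
# descendant search finds them directly.
for row in root.iterfind('.//return/row'):
    print(row.findtext('userid'), row.findtext('licenses'))  # -> Semenova.LA 1
```

Because only the SOAP wrapper elements are namespaced, no namespace map is needed for the row data.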

lines=True parameter for the Json Type Provider and Json.Net library?

I am working on this Kaggle competition. The Jupyter notebooks on Kaggle only support R and Python and I wanted to use F# locally. The problem is that the datasets are .json files and both the F# Json Type Provider and Newtonsoft libraries fail when trying to parse the files.
Here are examples of the code failing in F#:
open FSharp.Data
type Context = JsonProvider<"train.json">
let context = Context.
and
open System
open System.IO
open Newtonsoft.Json
open Newtonsoft.Json.Linq
let object = JObject.Parse(File.ReadAllText("train.json"));
object
This Python example uses these lines of code to parse them correctly:
train = pd.read_json('../input/stanford-covid-vaccine/train.json', lines=True)
test = pd.read_json('../input/stanford-covid-vaccine/test.json', lines=True)
In the notebook, the author says that without the lines=True parameter, the read_json method fails with a trailing-data error.
My question: assuming this is the same error, is there a way to apply the same kind of lines=True behaviour to the .NET libraries to parse the JSON?
I've seen a few datasets where the format was one valid JSON record per line:
{"event":"nothing 1"}
{"event":"nothing 2"}
{"event":"nothing 3"}
This is not valid JSON overall. I think you can either parse it line-by-line or you can turn it into valid JSON. For line-by-line parsing (which may be more efficient as you can do this in a streaming fashion), I would use:
open System.IO
open FSharp.Data

type Log = JsonProvider<"""{"event":"nothing 1"}""">

for line in File.ReadAllLines("some.json") do
    let l = Log.Parse(line)
    printfn "%s" l.Event
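For comparison, pandas' lines=True simply treats the input as newline-delimited JSON (one record per line); the same behaviour is a few lines in plain Python, shown here with the sample records from above:

```python
import json

# Newline-delimited JSON: each line is an independent JSON document,
# and the file as a whole is *not* valid JSON.
ndjson = '{"event":"nothing 1"}\n{"event":"nothing 2"}\n{"event":"nothing 3"}\n'

records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
print(len(records), records[0]["event"])  # -> 3 nothing 1
```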

Using python to parse twitter url

I am using the following code but I am not able to extract any information from the url.
from urllib.parse import urlparse

if __name__ == "__main__":
    z = 5
    url = 'https://twitter.com/isro/status/1170331318132957184'
    df = urlparse(url)
    print(df)
ParseResult(scheme='https', netloc='twitter.com', path='/isro/status/1170331318132957184', params='', query='', fragment='')
I was hoping to extract the tweet message, time of tweet and other information available from the link but the code above clearly doesn't achieve that. How do I go about it from here ?
print(df)
ParseResult(scheme='https', netloc='twitter.com', path='/isro/status/1170331318132957184', params='', query='', fragment='')
I think you may be misunderstanding the purpose of the urllib urlparse function. From the Python documentation:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment
From the result you are seeing in ParseResult, your code is working perfectly - it is breaking your URL up into the component parts.
It sounds as though you actually want to fetch the web content at that URL. In that case, I might take a look at urllib.request.urlopen instead.
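To illustrate the distinction: urlparse only splits the string, so the most you can recover locally from this URL is its components, such as the account name and tweet ID from the path. Fetching the tweet text itself requires an HTTP request or the Twitter API.

```python
from urllib.parse import urlparse

url = 'https://twitter.com/isro/status/1170331318132957184'
parts = urlparse(url)

# parts.path is '/isro/status/1170331318132957184'; splitting on '/'
# yields ['', 'isro', 'status', '1170331318132957184'].
_, account, _, tweet_id = parts.path.split('/')
print(account, tweet_id)  # -> isro 1170331318132957184
```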

Extracting an element from XML with Python3?

I am trying to write a Python 3 script where I am querying a web api and receiving an XML response. The response looks like this –
<?xml version="1.0" encoding="UTF-8"?>
<ipinfo>
<ip_address>4.2.2.2</ip_address>
<ip_type>Mapped</ip_type>
<anonymizer_status/>
<Network>
<organization>level 3 communications inc.</organization>
<OrganizationData>
<home>false</home>
<organization_type>Telecommunications</organization_type>
<naics_code>518219</naics_code>
<isic_code>J6311</isic_code>
</OrganizationData>
<carrier>level 3 communications</carrier>
<asn>3356</asn>
<connection_type>tx</connection_type>
<line_speed>high</line_speed>
<ip_routing_type>fixed</ip_routing_type>
<Domain>
<tld>net</tld>
<sld>bbnplanet</sld>
</Domain>
</Network>
<Location>
<continent>north america</continent>
<CountryData>
<country>united states</country>
<country_code>us</country_code>
<country_cf>99</country_cf>
</CountryData>
<region>southwest</region>
<StateData>
<state>california</state>
<state_code>ca</state_code>
<state_cf>88</state_cf>
</StateData>
<dma>803</dma>
<msa>31100</msa>
<CityData>
<city>san juan capistrano</city>
<postal_code>92675</postal_code>
<time_zone>-8</time_zone>
<area_code>949</area_code>
<city_cf>77</city_cf>
</CityData>
<latitude>33.499</latitude>
<longitude>-117.662</longitude>
</Location>
</ipinfo>
This is the code I have so far –
import urllib.request
import urllib.error
import sys
import xml.etree.ElementTree as etree
…
try:
    xml = urllib.request.urlopen(targetURL, data=None)
except urllib.error.HTTPError as e:
    print("HTTP error: " + str(e) + " URL: " + targetURL)
    sys.exit()

tree = etree.parse(xml)
root = tree.getroot()
The API query works and through the debugger I can see all of the information inside the ‘root’ variable. My issue is that I have not been able to figure out how to extract something like the ASN (<asn></asn>) from the returned XML. I’ve been beating my head against this for a day with a whole wide variety of finds, findalls and all other sorts of methods but not been able to crack this. I think I have reached the point where I cannot see the wood for all the trees and every example I have found on the internet doesn’t seem to help. Can someone show me a code snippet which can extract the contents of a XML element from inside the tree structure?
Many thanks
Tim
I would recommend using Beautiful Soup.
It's very powerful when it comes to extracting data from XML.
Example:
from bs4 import BeautifulSoup
import urllib.request

xml_data = urllib.request.urlopen(targetURL).read()
soup = BeautifulSoup(xml_data, 'xml')  # the 'xml' feature needs lxml installed
soup.find_all('asn')  # Would return all the <asn></asn> tags found!
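Since the question already builds an xml.etree.ElementTree, here is a sketch that stays with the standard library; the XML is a trimmed copy of the response above, inlined for illustration:

```python
import xml.etree.ElementTree as etree

# A fragment of the ipinfo response, inlined so the example is self-contained.
xml_text = '''<?xml version="1.0" encoding="UTF-8"?>
<ipinfo>
  <Network>
    <carrier>level 3 communications</carrier>
    <asn>3356</asn>
  </Network>
</ipinfo>'''

root = etree.fromstring(xml_text)

# './/' searches the whole subtree, so the nesting depth doesn't matter.
asn = root.find('.//asn').text
print(asn)  # -> 3356
```

find() returns the first matching element (or None), so guard against missing tags before reading .text on a real response.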
