How to parse XML data with some non-XML formatted elements in Python

I have the following response from the CUCM API:
<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5"><return><row><pkid>63d1f8a1-0964-caa0-d496-ff91340c236c</pkid><userid>Semenova.LA</userid><firstname/><lastname>Семенова</lastname><snrenabled>t</snrenabled><devicecount>1</devicecount><licensetype>Enhanced </licensetype><licenses>1</licenses></row></return></ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>
I tried to parse this response using the lxml library.
from lxml import etree
root = etree.fromstring(response)
But I received the following error:
File "src\lxml\etree.pyx", line 3237, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1891, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
It looks as if some element in the response is unsupported.
If I cut the response down to
response='''<return><row><pkid>63d1f8a1-0964-caa0-d496-ff91340c236c</pkid><userid>Semenova.LA</userid><firstname/><lastname>Семенова</lastname><snrenabled>t</snrenabled><devicecount>1</devicecount><licensetype>Enhanced </licensetype><licenses>1</licenses></row></return>'''
everything works as expected.
What should I do to fix this?
Should I delete the unwanted elements, such as:
<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">
How can I do this?
Thanks a lot!

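As the error message says, lxml refuses a str that carries an encoding declaration, so pass bytes instead. Try changing
root = etree.fromstring(response)
to
root = etree.fromstring(response.encode())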

I applied a non-scalable solution: I just cut the beginning and end of the string using fixed templates.
cut_string ='''<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">'''
cut_string2 = '''</ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>'''
s = response.replace(cut_string, "")
ss = s.replace(cut_string2, "")
root = etree.fromstring(ss.encode())
Is there a cleverer solution?
For example, getting the string between <return> and </return>?
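One way to avoid cutting the string by hand is to parse the whole envelope from bytes and then search for the <return> element, which carries no namespace prefix in the response above. The sketch below uses the standard library's xml.etree.ElementTree so that it runs without extra packages; lxml's etree.fromstring should accept the same bytes and the same path expression, but that is an assumption here rather than something tested in this thread. The response string is a shortened copy of the one in the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response from the question (assumed to arrive as a str).
response = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soapenv:Body>"
    '<ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">'
    "<return><row><userid>Semenova.LA</userid><lastname>Семенова</lastname>"
    "<licensetype>Enhanced </licensetype></row></return>"
    "</ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>"
)

# Parse bytes rather than str, so the encoding declaration is honoured.
root = ET.fromstring(response.encode("utf-8"))

# <return> and <row> are not namespaced, so a plain path is enough.
rows = [
    {child.tag: (child.text or "").strip() for child in row}
    for row in root.iterfind(".//return/row")
]

print(rows[0]["userid"])
```

Each row becomes a plain dict keyed by column name, so no part of the SOAP envelope has to be hard-coded or stripped.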

Related

LXML does not parse broken HTML : XMLSyntaxError error finding by XPath

I'm trying to extract a csrf token from a login page.
I'm using the lxml library as the parser.
s = requests.Session()
login_html = etree.fromstring(
s.get('https://www.uwkotinleuven.be/fr/login').text)
find = etree.XPath('//*[@id="login-form-2"]/input[3]')
print(find(login_html).value )
Here is the error:
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 46, column 24
I'm unsure whether the error is coming from the XPath finder or from broken HTML being sent.
Should I change the parser, or pass it some parameters? Is there a conventional way to parse broken HTML?

It looks like a broken HTML problem: etree.fromstring expects well-formed XML, while an HTML parser tolerates unclosed tags. See if this works for you:
import requests
import lxml.etree as etree
from io import StringIO
s = requests.Session()
dat = s.get('https://www.uwkotinleuven.be/fr/login')
parser = etree.HTMLParser()
tree = etree.parse(StringIO(dat.text), parser)
find = tree.xpath('//*[@id="login-form-2"]/input[3]')
print(find[0].attrib.values()[2])
Output:
3pKL_AsLLBE07T6S-VY8eXJ4ooK_QH5kMgajPEwKSso
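If lxml is not available, a similar result can be sketched with the standard library's html.parser, which also tolerates unclosed tags. This is a different technique from the lxml answer above, not a replacement for it. The page markup and the field name csrf_token below are hypothetical, made up for illustration; the real login page may use different names.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden":
            self.hidden[attrs.get("name")] = attrs.get("value")


# Hypothetical page: <link> and <input> are never closed, as in real-world HTML.
page = """<html><head><link rel="stylesheet" href="a.css"></head>
<body><form id="login-form-2">
<input type="text" name="user">
<input type="hidden" name="csrf_token" value="abc123">
</form></body></html>"""

collector = HiddenInputCollector()
collector.feed(page)
print(collector.hidden)
```

Looking the token up by attribute rather than by position (input[3]) keeps the code working if the form gains or loses a field.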

Deserialize XML with UTF-16 encoding in ServiceStack.Text

I am trying to use ServiceStack.Text to deserialize some XML.
Code:
var buildEvent = dto.EventXml.FromXml<TfsEventBuildComplete>();
The opening xml line is:
<?xml version="1.0" encoding="UTF-16"?>
ServiceStack fails with the following error:
The encoding in the declaration 'utf-16' does not match the encoding of the document 'utf-8'.
I can see from the source of the Xml Serializer that ServiceStack uses UTF-8.
I am wondering whether ServiceStack.Text can deserialize UTF-16 and if so how? And if not, why not?

I have managed to hack my way around the issue. I'm not proud of it but....
var buildEvent = dto.EventXml.Replace("utf-16", "utf-8").FromXml<TfsEventBuildComplete>();

Extracting an element from XML with Python3?

I am trying to write a Python 3 script where I am querying a web api and receiving an XML response. The response looks like this –
<?xml version="1.0" encoding="UTF-8"?>
<ipinfo>
<ip_address>4.2.2.2</ip_address>
<ip_type>Mapped</ip_type>
<anonymizer_status/>
<Network>
<organization>level 3 communications inc.</organization>
<OrganizationData>
<home>false</home>
<organization_type>Telecommunications</organization_type>
<naics_code>518219</naics_code>
<isic_code>J6311</isic_code>
</OrganizationData>
<carrier>level 3 communications</carrier>
<asn>3356</asn>
<connection_type>tx</connection_type>
<line_speed>high</line_speed>
<ip_routing_type>fixed</ip_routing_type>
<Domain>
<tld>net</tld>
<sld>bbnplanet</sld>
</Domain>
</Network>
<Location>
<continent>north america</continent>
<CountryData>
<country>united states</country>
<country_code>us</country_code>
<country_cf>99</country_cf>
</CountryData>
<region>southwest</region>
<StateData>
<state>california</state>
<state_code>ca</state_code>
<state_cf>88</state_cf>
</StateData>
<dma>803</dma>
<msa>31100</msa>
<CityData>
<city>san juan capistrano</city>
<postal_code>92675</postal_code>
<time_zone>-8</time_zone>
<area_code>949</area_code>
<city_cf>77</city_cf>
</CityData>
<latitude>33.499</latitude>
<longitude>-117.662</longitude>
</Location>
</ipinfo>
This is the code I have so far –
import urllib.request
import urllib.error
import sys
import xml.etree.ElementTree as etree
…
try:
    xml = urllib.request.urlopen(targetURL, data=None)
except urllib.error.HTTPError as e:
    print("HTTP error: " + str(e) + " URL: " + targetURL)
    sys.exit()
tree = etree.parse(xml)
root = tree.getroot()
The API query works and through the debugger I can see all of the information inside the ‘root’ variable. My issue is that I have not been able to figure out how to extract something like the ASN (<asn></asn>) from the returned XML. I’ve been beating my head against this for a day with a whole variety of finds, findalls and all other sorts of methods but have not been able to crack it. I think I have reached the point where I cannot see the wood for all the trees and every example I have found on the internet doesn’t seem to help. Can someone show me a code snippet which can extract the contents of an XML element from inside the tree structure?
Many thanks
Tim

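I would recommend using Beautiful Soup.
It's very powerful when it comes to extracting data from XML.
Example:
import urllib.request
from bs4 import BeautifulSoup
# Pass the downloaded document, not the URL string, to BeautifulSoup.
soup = BeautifulSoup(urllib.request.urlopen(targetURL).read(), "html.parser")
soup.find_all('asn')  # Returns all the <asn></asn> tags found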
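The same lookup also works with xml.etree.ElementTree, which the question already imports. A minimal sketch, using a trimmed copy of the response from the question in place of the live API call: findtext takes a path relative to the element it is called on, so from the <ipinfo> root the ASN sits at Network/asn, and a leading .// searches at any depth.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response shown in the question.
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<ipinfo>
  <ip_address>4.2.2.2</ip_address>
  <Network>
    <carrier>level 3 communications</carrier>
    <asn>3356</asn>
    <Domain><tld>net</tld><sld>bbnplanet</sld></Domain>
  </Network>
  <Location>
    <CityData><city>san juan capistrano</city></CityData>
  </Location>
</ipinfo>"""

root = ET.fromstring(xml_text.encode("utf-8"))

# Path relative to the root element <ipinfo>.
asn = root.findtext("Network/asn")

# ".//" matches the tag at any depth below the root.
city = root.findtext(".//city")

# A default avoids None when the element is absent.
missing = root.findtext("Network/no_such_tag", default="")

print(asn, city)
```

With the code from the question, the same calls can be made on the root returned by tree.getroot(). Note that tag names are case-sensitive, so Network/asn matches but network/asn does not.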

What does "Error parsing XML: not well-formed" mean?

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:orientation=”vertical”
android:layout_width=”fill_parent”
android:layout_height=”fill_parent” >
I get these two errors
error: Error parsing XML: not well-formed (invalid token)
&
Open quote is expected for attribute "android:orientation" associated with an element type "LinearLayout".

Did you copy and paste that from Word? Your quotes look a little funky. Word sometimes substitutes a different character for the expected " double quote. Make sure the quotes are all plain " characters; otherwise, the syntax is invalid.

Looks like you have "smart quotes" (” rather than the plain " double quote) around some attributes in your LinearLayout element.

There are many references that explain the differences between valid and well formed XML documents. A good starting point can be found here. There is also an online XML Validator that you can use to test XML documents.
The validator shows that you have two issues:
Some of your attribute values use an invalid quote character: ” vs. ", and
you need to close the LinearLayout tag with /> instead of just >.
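The effect of the quote characters can be reproduced with a small Python sketch. It is only an illustration: the element is a simplified, self-closing stand-in for the layout above with the android: prefix dropped, and straighten_quotes is a hypothetical helper, not part of any Android tooling.

```python
import xml.etree.ElementTree as ET

# Typographic quotes mapped to their plain ASCII equivalents.
SMART_QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def straighten_quotes(text):
    """Replace typographic quotes with plain ones (hypothetical helper)."""
    for smart, plain in SMART_QUOTES.items():
        text = text.replace(smart, plain)
    return text


# Simplified stand-in for the layout: the attribute is wrapped in U+201D.
broken = "<LinearLayout orientation=\u201dvertical\u201d />"

try:
    ET.fromstring(broken)
    parsed_broken = True
except ET.ParseError:
    # The parser expects " or ' to open an attribute value.
    parsed_broken = False

fixed = straighten_quotes(broken)
orientation = ET.fromstring(fixed).get("orientation")

print(parsed_broken, orientation)
```

A blanket replacement like this would also alter typographic quotes inside text content, so in practice it is safer to retype the quotes in the affected attributes.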

SimpleXML Cyrillic Encoding

This is the type of XML file, which I am using:
<?xml version="1.0" encoding="UTF-8"?>
<ProductCatalog>
<ProductType>Дънни платки</ProductType>
<ProductType>Дънни платки 2</ProductType>
</ProductCatalog>
And when I run the PHP file with the following code:
$pFile = new SimpleXMLElement('test.xml', null, true);
foreach ($pFile->ProductType as $pChild)
{
var_dump($pChild);
}
I get the following results:
object(SimpleXMLElement)#5 (1) { [0]=> string(40) "Ð”ÑŠÐ½Ð½Ð° Ð¿Ð»Ð°Ñ‚ÐºÐ° Ð½Ð°ÑÑ‚Ð¾Ð»Ð½Ð°"
I have tried different encodings in the XML file but it's not working well with Cyrillic symbols.

What happens if you switch the character encoding to UTF-8 in the browser?
This looks like an output issue rather than a parsing issue.
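The garbled output is consistent with correct UTF-8 bytes being displayed as Windows-1252. The PHP script cannot be run here, so the following is a sketch in Python, assuming that is what happened: it reproduces the same kind of garbling from the first ProductType value and then reverses it.

```python
# First <ProductType> value from the XML file in the question.
original = "Дънни платки"

# What a viewer shows when it reads the UTF-8 bytes as Windows-1252.
mojibake = original.encode("utf-8").decode("cp1252")

# Reversing the wrong decoding recovers the text, so no data was lost.
restored = mojibake.encode("cp1252").decode("utf-8")

print(restored == original)
```

If that assumption holds, the XML was parsed correctly and only the display encoding needs to change, for example by having the page declare UTF-8.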
