SimpleXML Cyrillic Encoding - character-encoding

This is the kind of XML file I am using:
<?xml version="1.0" encoding="UTF-8"?>
<ProductCatalog>
<ProductType>Дънни платки</ProductType>
<ProductType>Дънни платки 2</ProductType>
</ProductCatalog>
And when I run the PHP file with the following code:
$pFile = new SimpleXMLElement('test.xml', null, true);
foreach ($pFile->ProductType as $pChild)
{
var_dump($pChild);
}
I get the following results:
object(SimpleXMLElement)#5 (1) { [0]=> string(40) "Дънна платка наÑтолна"
I have tried different encodings in the XML file but it's not working well with Cyrillic symbols.

What happens if you switch the character encoding to UTF-8 in the browser?
This looks like an output issue rather than a parsing problem.

Related

How to parse XML data with some non-XML-formatted elements in Python

I have the following response from the CUCM API:
<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5"><return><row><pkid>63d1f8a1-0964-caa0-d496-ff91340c236c</pkid><userid>Semenova.LA</userid><firstname/><lastname>Семенова</lastname><snrenabled>t</snrenabled><devicecount>1</devicecount><licensetype>Enhanced </licensetype><licenses>1</licenses></row></return></ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>
I tried to parse this response using the lxml library.
from lxml import etree
root = etree.fromstring(response)
But I received the following error:
File "src\lxml\etree.pyx", line 3237, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1891, in lxml.etree._parseMemoryDocument
ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.
It looks as if some element in the response is unsupported.
If I cut the response down to
response='''<return><row><pkid>63d1f8a1-0964-caa0-d496-ff91340c236c</pkid><userid>Semenova.LA</userid><firstname/><lastname>Семенова</lastname><snrenabled>t</snrenabled><devicecount>1</devicecount><licensetype>Enhanced </licensetype><licenses>1</licenses></row></return>'''
everything works as expected.
What should I do to fix it?
Should I delete the unwanted elements, such as:
<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">
How can I do this?
Thanks a lot!
Try changing
root = etree.fromstring(response)
to
root = etree.fromstring(response.encode())
I applied a non-scalable solution: I just cut the beginning and end of the string using a template
cut_string ='''<?xml version='1.0' encoding='UTF-8'?><soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"><soapenv:Body><ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">'''
cut_string2 = '''</ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>'''
s = response.replace(cut_string, "")
ss = s.replace(cut_string2, "")
root = etree.fromstring(ss.encode())
Is there a more clever solution?
For example, getting the string between <return> and </return>?
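One less brittle option than cutting fixed strings is to parse the whole envelope from bytes and then look up the <return> element by name. This is a minimal sketch, assuming the response has the shape shown in the question. It uses the standard-library xml.etree.ElementTree rather than lxml so that it runs without extra packages (lxml's etree offers the same fromstring, find and findall calls). The response literal is a shortened, illustrative copy of the CUCM answer.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the SOAP response from the question.
response = (
    "<?xml version='1.0' encoding='UTF-8'?>"
    '<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soapenv:Body>"
    '<ns:executeSQLQueryResponse xmlns:ns="http://www.cisco.com/AXL/API/12.5">'
    "<return><row><userid>Semenova.LA</userid><lastname>Семенова</lastname>"
    "<devicecount>1</devicecount></row></return>"
    "</ns:executeSQLQueryResponse></soapenv:Body></soapenv:Envelope>"
)

# Pass bytes, not str, so the parser reads the encoding declaration itself.
root = ET.fromstring(response.encode("utf-8"))

# <return> carries no namespace prefix here, so a plain name lookup is enough.
ret = root.find(".//return")

# Turn each <row> into a dict of {tag: text}.
rows = [{child.tag: child.text for child in row} for row in ret.findall("row")]
print(rows)
```

Because the lookup goes through the parsed tree, it keeps working if the envelope attributes or the AXL version in the namespace URL change.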

What encoding is this and how do I turn it into something I can see properly?

I'm writing a script that will operate on the subtitle files of a popular streaming service (Netfl*x).
The subtitle files have strange characters in them and I can't get them to render in a way that my text editors or web browser will display in a readable way. The xml encoding says UTF-8, but some characters are not readable.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tt xmlns:tt="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" ttp:tickRate="10000000" ttp:timeBase="media" xmlns="http://www.w3.org/ns/ttml">
<p>de 15 % la nuit dernière.</span></p>
<p>if youâve got things to doâ¦</span></p>
And in Vim:
This is what it looks like in the browser:
How can I convert this into something I can use?
I'll go out on a limb and say that file is UTF-8 encoded just fine, and you're merely looking at it using the wrong encoding. The character À encoded in UTF-8 is C3 80. C3 in ISO-8859-1 is Ã, which in your screenshot is followed by an 80. So it looks like you're viewing a UTF-8 file using the (wrong) ISO-8859-1 encoding.
Use the correct encoding when opening the file.
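The byte arithmetic above can be reproduced in a few lines. This is a minimal sketch in Python; the helper names misread_as_latin1 and repair_mojibake are made up for illustration, and the repair only works when the text really is UTF-8 that was decoded as ISO-8859-1.

```python
def misread_as_latin1(text: str) -> str:
    """Show what UTF-8 bytes look like when decoded as ISO-8859-1."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Undo the misreading: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


# The UTF-8 encoding of À is the two bytes C3 80.
print("À".encode("utf-8").hex())  # c380

garbled = misread_as_latin1("sonné")
print(garbled)                   # the two-character sequence that replaces é
print(repair_mojibake(garbled))  # back to the original word
```

For files, the equivalent of "use the correct encoding" is to pass encoding="utf-8" explicitly when opening them instead of relying on the platform default.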
My terminal is set to en_US.UTF-8, but was also rendering this supposedly UTF-8 encoded file incorrectly (sonné was displayed as sonnÃ©). I was able to solve this by using iconv to convert the file to ISO8859-1.
iconv original.xml -t ISO8859-1 -o converted.xml
In the new file, the characters were properly rendered, although I don't quite understand why.

Deserialize XML with UTF-16 encoding in ServiceStack.Text

I am trying to use ServiceStack.Text to deserialize some XML.
Code:
var buildEvent = dto.EventXml.FromXml<TfsEventBuildComplete>();
The opening xml line is:
<?xml version="1.0" encoding="UTF-16"?>
ServiceStack fails with the following error:
The encoding in the declaration 'utf-16' does not match the encoding of the document 'utf-8'.
I can see from the source of the Xml Serializer that ServiceStack uses UTF-8.
I am wondering whether ServiceStack.Text can deserialize UTF-16 and if so how? And if not, why not?
I have managed to hack my way around the issue. I'm not proud of it but....
var buildEvent = dto.EventXml.Replace("utf-16", "utf-8").FromXml<TfsEventBuildComplete>();
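The error message reflects a general rule: the bytes handed to an XML parser must be in the encoding that the declaration names. The Replace call makes the label match the bytes; the other direction is to make the bytes match the label. The sketch below illustrates that second direction in Python with the standard-library parser. It is not ServiceStack's API, and the <BuildEvent> payload is a made-up stand-in for the TFS event XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: XML text whose declaration says UTF-16.
xml_text = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    "<BuildEvent><Title>Build OK</Title></BuildEvent>"
)

# Encode the text in the encoding the declaration names, so label and bytes agree.
# Python's "utf-16" codec prepends a byte order mark, which the parser uses
# to detect the byte order.
data = xml_text.encode("utf-16")

root = ET.fromstring(data)
title = root.findtext("Title")
print(title)
```

Compared with a blanket string replacement, this leaves the document text untouched, which matters if "utf-16" could also appear somewhere in the payload itself.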

"Error attempting to parse XML file" when parsing using XInclude

I am trying to create a combined xml document using XInclude to be unmarshalled via JAXB.
Here is my unmarshalling code:
@Override
public T readFromReader(final Reader reader) throws Exception {
    final Unmarshaller unmarshaller = createUnmarshaller();
    // Build a namespace-aware SAX parser with XInclude processing enabled.
    final SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setXIncludeAware(true);
    spf.setNamespaceAware(true);
    //spf.setValidating(true);
    final XMLReader xr = spf.newSAXParser().getXMLReader();
    final SAXSource source = new SAXSource(xr, new InputSource(reader));
    try {
        final T object = (T) unmarshaller.unmarshal(source);
        postReadSetup(object);
        return object;
    } catch (final Exception e) {
        throw new RuntimeException("Cannot parse XML: Additional information is attached. Please ensure your XML is valid.", e);
    }
}
Here is my main xml file:
<?xml version="1.0" encoding="UTF-8" ?>
<tag1 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xi="http://www.w3.org/2001/XInclude"
xsi:schemaLocation="path-to-schema/schema.xsd">
<xi:include href="path-to-xml-files/included.xml"></xi:include>
</tag1>
And included.xml:
<?xml version="1.0" encoding="UTF-8"?>
<tag2> Some text </tag2>
In order to actually unmarshal it, I create a new FileReader with the path to my xml file (path-to-xml-files/main.xml - the path is correct because it can clearly find the main file). When I run it, however, there is something wrong with the included file. I am getting an UnmarshalException with a linked SAXParseException with this error message: Error attempting to parse XML file (href='path-to-xml-files/included.xml').
When I manually merge the content of included.xml into main.xml, it runs with no problems.
I can't tell if it's a JAXB issue or an XInclude issue, though I strongly suspect the latter.
What am I missing?
I fought with this exact same problem for three hours and finally I found this:
xerces.apache.org/xerces2-j/features.html
In short, you need to add the following line:
spf.setFeature("http://apache.org/xml/features/xinclude/fixup-base-uris", false);
I had the exact same issue.
Actually, the href attribute expects a URI, which can be:
Either an HTTP address (which means your included XML must be hosted somewhere),
Or a file on your local machine. In that case, you need to prefix it with "file:..." and provide the absolute path.
With your example:
<?xml version="1.0" encoding="UTF-8" ?>
<tag1 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xi="http://www.w3.org/2001/XInclude"
xsi:schemaLocation="path-to-schema/schema.xsd">
<xi:include href="file:absolute-path-to-xml-files/included.xml"/>
</tag1>
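If the main file is generated by a script, the absolute file: URI does not have to be typed by hand. This is a minimal sketch in Python, assuming the include path from the question; to_file_uri is a hypothetical helper name, and the resulting URI depends on the directory the script runs in.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Return an absolute file: URI for a local path."""
    # resolve() makes the path absolute; as_uri() adds the file: scheme
    # and percent-encodes characters that are not allowed in a URI.
    return Path(path).resolve().as_uri()


href = to_file_uri("path-to-xml-files/included.xml")
print(f'<xi:include href="{href}"/>')
```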

Can we change XML encoding from UTF-8 to UTF-16?

I have written code that generates XML with UTF-8 encoding, and I always validate the XML against an XSD file. In the same code I now need UTF-16 encoding, because one of my XSD files is declared as UTF-16.
My existing code does not accept it and gives the following error:
FAILED: Fatal error: Document labelled UTF-16 but has UTF-8 content at filepath/standard.xsd:1.
Line 1 of the XSD contains this declaration: <?xml version="1.0" encoding="utf-16"?>
How can I validate it with UTF-8 encoding?
Is there any way to change the encoding from UTF-16 to UTF-8?
Thanks in advance.
You can change the encoding from UTF-16 to UTF-8 with iconv:
Call iconv from Ruby 1.8.7 through system to convert a file from utf-16 to utf-8
When you write the new file you can replace the first line with a new header like
<?xml version="1.0" encoding="utf-8" ?>
Ruby - Open file, find and replace multiple lines
If you need the conversion in the other direction, swap the encodings in the function.
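The two steps (re-encode the bytes, then rewrite the declaration) can be combined in one small function. This is a minimal sketch in Python rather than Ruby, and convert_to_utf8 is a hypothetical name. The error text in the question suggests the file may already contain UTF-8 bytes under a UTF-16 label, so the sketch assumes a real UTF-16 file starts with a byte order mark and otherwise treats the content as UTF-8 and only fixes the label.

```python
import codecs
import re
from pathlib import Path

# Matches the encoding pseudo-attribute of an XML declaration at the start of the text.
_DECL_ENCODING = re.compile(r'^(<\?xml[^>]*?encoding\s*=\s*)(["\'])[^"\']*\2')


def convert_to_utf8(src: Path, dst: Path) -> None:
    """Write src to dst as UTF-8 with a matching XML declaration."""
    raw = src.read_bytes()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")      # BOM present: the bytes really are UTF-16
    else:
        text = raw.decode("utf-8-sig")   # assumption: UTF-8 content, only mislabelled
    # Rewrite only the declaration, keeping the original quote style.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)
    dst.write_bytes(text.encode("utf-8"))
```

A call such as convert_to_utf8(Path("standard.xsd"), Path("standard-utf8.xsd")) would leave the original untouched and produce a copy whose label and bytes agree; the second file name is illustrative.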
