XML file parsing with xmllint and carriage return - xml-parsing

I am using xmllint to parse an XML document.
The sample document is this:
<?xml version="1.0"?>
<top>
<c>
<a n="zzz"><i>0</i></a>
</c>
<c>
<a n="zzz"><i>1</i></a>
</c>
</top>
I am doing something like this
xmllint -xpath '//top/c/a[@n="zzz"]/i/text()' a.xml
and I get
<i>0</i><i>165</i>
If I do
xmllint -xpath '/ad/c/a[@n="sid"]/i/text()' f.xml
I get
01
My desired output has a carriage return between the values:
0
1
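
For comparison, here is a minimal sketch of getting one value per line outside of xmllint, assuming Python's standard xml.etree.ElementTree module is an acceptable substitute. The helper name text_values is hypothetical, and the path is written relative to the root element top.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question.
SAMPLE = """<?xml version="1.0"?>
<top>
<c>
<a n="zzz"><i>0</i></a>
</c>
<c>
<a n="zzz"><i>1</i></a>
</c>
</top>"""


def text_values(xml_text, path):
    """Return the text of every element matched by path (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [node.text or "" for node in root.findall(path)]


# ElementTree supports the [@attribute='value'] predicate used in the question.
values = text_values(SAMPLE, "./c/a[@n='zzz']/i")

# Joining with a newline puts each matched value on its own line.
print("\n".join(values))
```

Because the matches come back as a list, the separator is chosen at print time rather than by the query.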

Related

XML Parsing Error - Extra content at end of document

I have been trying to parse this XML response from a service and I keep getting a parsing error. I am working with BODS and it's very sensitive to XML structure.
The XML response is as follows:
<?xml version="1.0" encoding="utf-8" ?>
<GetMapDataResponsexmlns="http://insbridge.net/wsi/Connector/SoftData">
<GetMapDataResult>
<ibdoc gen_date="1/27/2016 11:31 AM" timespan="0.000000" site_location="whlibqa" xmlns="">
<dataresults lob="20" env_def="sr_int">
<program parent_id="675" id="0" ver="1">
<m i="5" r="4" n="Get EL Increased Limits" l="false">
<d p="28">
<v>0</v>
<v>0</v>
<v>100000</v>
<v>100000</v>
<v>500000</v>
<v>1</v>
<v>1</v>
<v>1</v>
<q>1</q>
<q>03/01/2003</q>
<q>12/31/2012</q>
<q>FL</q>
</d>
<d p="29">
<v>0.008</v>
<v>50</v>
<v>500000</v>
<v>500000</v>
<v>500000</v>
<v>1</v>
<v>0</v>
<v>0</v>
<q>2</q>
<q>03/01/2003</q>
<q>12/31/2012</q>
<q>FL</q>
</d>
</m>
</program>
</dataresults>
</ibdoc>
</GetMapDataResult>
</GetMapDataResponse>
Any help would be appreciated.

Nokogiri: create xml from string with `?` in field name

Controller response includes "spec?" field:
r = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<hash type=\"array\">\n <item><spec? type=\"boolean\">false</spec?>\n </item>\n <hash>\n"
When I try to create XML from it with Nokogiri.xml(r), I literally get:
<?xml version="1.0" encoding="UTF-8"?>
<hash type="array">
<item><spec type=" type="boolean">false/spec">
</spec>item>
<hash>
</hash></item></hash>
which is rather strange.
My question is:
Is it possible to create XML from a string using Nokogiri while parsing or removing ? and other characters that are not valid in XML names, at the Nokogiri.XML() stage?
Desired result:
Nokogiri.xml(r) do |config|
config.maybe_some_configs?
end #=>
<?xml version="1.0" encoding="UTF-8"?>
<hash type="array">
<item><spec type="boolean">false</spec></item>
</hash>
The proper way to parse a string into an XML DOM is Nokogiri::XML, Nokogiri.XML, or Nokogiri::XML.parse, not the lowercase Nokogiri.xml.
Also, XML tags can't contain ?. See the spec for more information. You'll have to dig through the "Names and Tokens" section and decode hexadecimal character descriptions to figure out the ranges of characters allowed, but a hint is that ? is character code 0x3f.
Which leads to the fact that the XML in r is invalid:
<?xml version="1.0" encoding="UTF-8"?>
<hash type="array">
<item><spec? type="boolean">false</spec?>
</item>
<hash>
Which, when parsed, results in:
irb(main):012:0> doc = Nokogiri::XML(r)
#<Nokogiri::XML::Document:0x80c8014c name="document" children=[#<Nokogiri::XML::Element:0x80c7399c name="hash" attributes=[#<Nokogiri::XML::Attr:0x80c733e8 name="type" value="array">] children=[#<Nokogiri::XML::Text:0x80c6e26c "\n ">, #<Nokogiri::XML::Element:0x80c6df60 name="item" children=[#<Nokogiri::XML::Element:0x80c6d970 name="spec">, #<Nokogiri::XML::Text:0x80c6d09c "? type=\"boolean\">false">]>, #<Nokogiri::XML::Text:0x80c6ca34 "?>\n ">]>]>
irb(main):013:0> doc.errors
[
[0] #<Nokogiri::XML::SyntaxError: error parsing attribute name>,
[1] #<Nokogiri::XML::SyntaxError: attributes construct error>,
[2] #<Nokogiri::XML::SyntaxError: Couldn't find end of Start Tag spec line 3>,
[3] #<Nokogiri::XML::SyntaxError: expected '>'>,
[4] #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: item line 3 and spec>,
[5] #<Nokogiri::XML::SyntaxError: Opening and ending tag mismatch: hash line 2 and item>,
[6] #<Nokogiri::XML::SyntaxError: Extra content at the end of the document>
]
As a result, Nokogiri is having to do some fixup in the DOM to try to make sense of it. The resulting XML looks like:
irb(main):014:0> puts doc.to_xml
<?xml version="1.0" encoding="UTF-8"?>
<hash type="array">
<item><spec/>? type="boolean">false</item>?>
</hash>
The way to fix it is to give Nokogiri valid XML. Either fix the source of the XML, if you control it, or fix the problems in the string before passing it to Nokogiri.
XML is, by definition, a strict format. Nokogiri honors that and, trying to be friendly, lets you check errors to see whether it's empty?. If it's not, odds are good you shouldn't continue using the source until you've determined and fixed whatever causes the parsing problems. Sometimes the problem is fairly benign and you can ignore it, but either way you should at least be aware of it.
Pre-massaging the data before Nokogiri sees it isn't hard:
doc = Nokogiri::XML(r.gsub('spec?', 'spec'))
irb(main):024:0> puts doc.to_xml
<?xml version="1.0" encoding="UTF-8"?>
<hash type="array">
<item><spec type="boolean">false</spec>
</item>
<hash>
</hash></hash>
nil
irb(main):025:0> doc.errors
[
[0] #<Nokogiri::XML::SyntaxError: Premature end of data in tag hash line 5>,
[1] #<Nokogiri::XML::SyntaxError: Premature end of data in tag hash line 2>
]
That's a start, but not an attempt to fix it for you completely. I'm teaching you to fish, not handing out fish.
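
The same "clean the string, then check for errors" workflow can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration, not Nokogiri: the standard-library parser raises on malformed input instead of recovering. The RAW string is a shortened, properly closed variant of r, which is an assumption made so that the cleaned version parses, and try_parse is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened variant of r in which <hash> is properly closed,
# so the only remaining problem is the '?' in the tag name.
RAW = '<hash type="array"><item><spec? type="boolean">false</spec?></item></hash>'


def try_parse(xml_text):
    """Return (root, None) on success or (None, error message) on failure."""
    try:
        return ET.fromstring(xml_text), None
    except ET.ParseError as exc:
        return None, str(exc)


# The raw string is rejected because '?' is not allowed in an element name.
root, error = try_parse(RAW)

# Pre-massage the data, then parse again and check for an error.
cleaned = RAW.replace("spec?", "spec")
fixed_root, fixed_error = try_parse(cleaned)

spec = fixed_root.find("./item/spec")
print(error)
print(spec.get("type"), spec.text)
```

Checking the returned error before using the tree plays the same role as inspecting doc.errors.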

Extracting data from xml file using xmllint

I have a small XML document from which I need to extract some values using xmllint. I am able to navigate through the XML hierarchy using the xmllint --shell xmlfilename command.
But I am unable to extract the values. I don't want to use grep or any other pattern-matching command, as I have already done that successfully.
I would appreciate any help regarding xmllint.
Here is my document. I want to extract 300$ and 500$ (the value).
<?xml version="1.0" encoding="ISO-8859-1"?>
<adi>
<asset>
<electronics item="Mobile" name="Nokia" value="300$" />
<electronics item="Mobile" name="Sony" value="500$" />
</asset>
</adi>
Another question: are these two snippets different representations of the same XML?
<?xml version="1.0" encoding="ISO-8859-1"?>
<adi>
<asset>
<electronics>
<item> Mobile </item>
<name>Nokia</name>
<value>300$</value>
</electronics>
<electronics>
<item> Mobile </item>
<name>Sony</name>
<value>500$</value>
</electronics>
</asset>
</adi>
With regards to your second question, those two snippets do not represent the same XML content. Attributes and child elements are not equivalent. A child element can be the root element of some arbitrary XML tree, but attributes are atomic.
E.g., I could modify the second snippet like this:
<?xml version="1.0" encoding="ISO-8859-1"?>
<adi>
<asset>
<electronics>
<item>
Mobile
<sub-item>Phone</sub-item>
</item>
<name>Nokia</name>
<value>300$</value>
</electronics>
<electronics>
<item> Mobile </item>
<name>Sony</name>
<value>500$</value>
</electronics>
</asset>
</adi>
where I have added <sub-item>Phone</sub-item> to the first <item> element.
However, there's no equivalent if item is an attribute instead, as in the first snippet.
This is late, but since searches for the xmllint tag still show this question on the first page, here is an answer ;)
Use --xpath instead of --shell, like below:
xmllint --xpath '//electronics/value/text()' second-xml_file.xml
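
To make the attribute versus child-element difference concrete, here is a minimal sketch in Python's standard xml.etree.ElementTree (an assumption, since the thread itself uses xmllint). It reads the same values from both forms of the document. In XPath terms, the attribute form would be addressed with //electronics/@value rather than //electronics/value/text().

```python
import xml.etree.ElementTree as ET

# First snippet from the question: values stored as attributes.
ATTRIBUTE_FORM = """<adi>
<asset>
<electronics item="Mobile" name="Nokia" value="300$" />
<electronics item="Mobile" name="Sony" value="500$" />
</asset>
</adi>"""

# Second snippet from the question: values stored as child elements.
ELEMENT_FORM = """<adi>
<asset>
<electronics><item> Mobile </item><name>Nokia</name><value>300$</value></electronics>
<electronics><item> Mobile </item><name>Sony</name><value>500$</value></electronics>
</asset>
</adi>"""

# Attributes are read with get() on the owning element.
attr_values = [
    e.get("value") for e in ET.fromstring(ATTRIBUTE_FORM).iter("electronics")
]

# Child elements are separate nodes, so their text is looked up by tag name.
elem_values = [
    e.findtext("value") for e in ET.fromstring(ELEMENT_FORM).iter("electronics")
]

print(attr_values)
print(elem_values)
```

The two lookups differ because the data lives in different kinds of nodes, which is why one query cannot serve both documents unchanged.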

CDATA not working on rails

I have the XML below in my code:
XML Parsing Error: not well-formed
Location: http://localhost:3000/api/client?client=test1
Line Number 1, Column 1111:
<?xml version="1.0" encoding="UTF-8"?>
<application>
<name><![CDATA[TESTapp2]]></name>
<application-identifier>wac-8c28afa4-0f6e-11e1-8885-7071bc62c7bc</application-identifier>
<clients>
<pricepoint id="1" name=<![CDATA[TEST-price]]> currency="dollar" locale="la" country="india" price="50" text="this is a TEST" receipt="oi120934" operator-reference="1213w" operator-id="1"></pricepoint></pricepoints><product-image></product-image>
</clients>
</application>
<name><![CDATA[TESTapp2]]></name> works.
<name=\"[CDATA[TESTapp2]]\"> does not work; it throws an encoding error.
AFAIK, using CDATA as an attribute value is forbidden. CDATA sections can only be used for text nodes.
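
A small sketch of that rule, assuming Python's standard library rather than Rails: a CDATA section parses inside element content, is rejected inside an attribute, and an attribute value with special characters can be escaped instead. The sample value TEST & "price" is made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# CDATA inside element content is fine; the parser returns the plain text.
name = ET.fromstring("<name><![CDATA[TESTapp2]]></name>")

# CDATA where an attribute value should be is not well-formed.
try:
    ET.fromstring("<pricepoint name=<![CDATA[TEST-price]]>></pricepoint>")
    attribute_cdata_parses = True
except ET.ParseError:
    attribute_cdata_parses = False

# For attributes, escape the value instead. quoteattr adds the surrounding
# quotes and escapes characters such as '&'. The value is a made-up example.
element_xml = "<pricepoint name=%s/>" % quoteattr('TEST & "price"')
pricepoint = ET.fromstring(element_xml)

print(name.text)
print(attribute_cdata_parses)
print(pricepoint.get("name"))
```

Escaping keeps the attribute well-formed while the parsed value still contains the original characters.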

Parsing with SAX and handling character entities

I am parsing a MathML expression with SAX (although the fact that it's MathML may not be completely relevant). An example input string is
<math xmlns='http://www.w3.org/1998/Math/MathML'>
<mrow>
<mo>&lambda;</mo>
</mrow>
</math>
In order for the SAX parser to accept this string, I expand it a bit:
<?xml version="1.0"?>
<!DOCTYPE doc_type [
<!ENTITY nbsp " ">
<!ENTITY amp "&">
]>
<body>
<math xmlns='http://www.w3.org/1998/Math/MathML'>
<mrow>
<mo>&lambda;</mo>
</mrow>
</math>
</body>
Now, when I run the SAX parser on this, I get an exception:
[Fatal Error] :5:86: The entity "lambda" was referenced, but not declared.
org.xml.sax.SAXParseException: The entity "lambda" was referenced, but not
declared.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
However, I know how to fix that. I simply add this line to the string being parsed:
<!ENTITY lambda "Λ">
This gives me
<?xml version="1.0"?>
<!DOCTYPE doc_type [
<!ENTITY nbsp " ">
<!ENTITY amp "&">
<!ENTITY lambda "Λ">
]>
<body>
<math xmlns='http://www.w3.org/1998/Math/MathML'>
<mrow>
<mo>&lambda;</mo>
</mrow>
</math>
</body>
Now, it parses just fine, thank you.
However, the problem is that I can't add an ENTITY declaration for every possible character entity that might be used in MathML (for example, "part", "notin", and "sum").
How do I rewrite this string so that it can be parsed for any possible character entity that might be included?
Use a DOCTYPE declaration that refers to the MathML DTD:
<!DOCTYPE math
PUBLIC "-//W3C//DTD MathML 3.0//EN"
"http://www.w3.org/Math/DTD/mathml3/mathml3.dtd">
or a local copy of the same.
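
If fetching or bundling the DTD is not practical, one alternative is to generate the entity declarations instead of writing them by hand. The sketch below is in Python rather than Java, and it assumes that Python's html.entities table is an acceptable stand-in. That table holds the HTML 4 names (including lambda, part, notin and sum), not the complete MathML set. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# XML predefines these five names, so they are left out of the generated list.
PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def entity_declarations():
    """Build <!ENTITY ...> lines from the HTML entity table (assumed stand-in)."""
    return "\n".join(
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )


def wrap_with_entities(fragment, root_name="math"):
    """Prefix the fragment with a DOCTYPE whose internal subset declares the entities."""
    return "<!DOCTYPE %s [\n%s\n]>\n%s" % (root_name, entity_declarations(), fragment)


FRAGMENT = (
    "<math xmlns='http://www.w3.org/1998/Math/MathML'>"
    "<mrow><mo>&lambda;</mo><mo>&sum;</mo></mrow></math>"
)

# With the declarations in place, the parser expands the named entities.
root = ET.fromstring(wrap_with_entities(FRAGMENT))
symbols = [node.text for node in root.iter() if node.tag.endswith("}mo")]
print(symbols)
```

Any MathML-only entity name missing from that table would still raise a parse error, which is where the DTD reference remains the more complete option.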

Resources