XML parsing as pagination - ios

I have an XML file, downloaded from a server, that contains 50k elements. I need to display those 50k elements in a tableView.
But it consumes too much memory.
So I wondered: is there an XML parser available in Swift that allows a kind of pagination, e.g. parse elements 1 to 10, then 11 to 20, and so on?

All you need is a SAX XML parser, such as the SAX interface of libxml2. A DOM parser is a poor fit for 50k elements because it loads the entire Document Object Model into memory to construct the tree before you can visit any node, whereas a SAX parser processes the XML in chunks as it reads.
Unfortunately, most of the SAX parsers I am aware of are written in C, so you have to write a wrapper around them to use them in a Swift project. The good news is that there are tutorials explaining how to do so.
Here are a few useful links for integrating libxml2 into a Swift project:
http://redqueencoder.com/wrapping-libxml2-for-swift/
https://www.cocoawithlove.com/2008/10/using-libxml2-for-parsing-and-xpath.html
EDIT:
You can also use NSXMLParser (named XMLParser since Swift 3), an event-driven, SAX-style parser from Foundation written in Objective-C. There are plenty of tutorials on how to use it with Swift:
https://medium.com/#lucascerro/understanding-nsxmlparser-in-swift-xcode-6-3-1-7c96ff6c65bc
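To make the "parse 10, then the next 10" idea concrete, here is a minimal sketch in Python, used only as a compact stand-in since the question has no code of its own. It uses ElementTree's incremental `iterparse` rather than a true SAX callback API, and the element name `item`, the sample feed, and the helpers `iter_items` and `next_page` are all made up for illustration. On iOS the same shape would come from collecting elements inside the parser's delegate callbacks.

```python
import io
import itertools
import xml.etree.ElementTree as ET

def iter_items(source, tag):
    """Lazily yield the text of each <tag> element as the parser reaches it."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem.text
            # Release the element's content once consumed; only an empty
            # shell stays attached to the root.
            elem.clear()

def next_page(items, page_size):
    """Pull the next page_size items from the stream (fewer at the end)."""
    return list(itertools.islice(items, page_size))

# Hypothetical feed: 25 <item> elements under a single root.
xml_bytes = (
    "<items>"
    + "".join(f"<item>row {i}</item>" for i in range(25))
    + "</items>"
).encode()

items = iter_items(io.BytesIO(xml_bytes), "item")
first_page = next_page(items, 10)   # rows 0-9
second_page = next_page(items, 10)  # rows 10-19
third_page = next_page(items, 10)   # rows 20-24
```

Parsing only advances when a page is requested, so the table view can ask for the next page as the user scrolls instead of holding all 50k rows at once.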

Related

Should I use one XML Parser for every XML feed I have, or should I write a parser for each XML feed I have?

So I help write an app for my university and I'm wondering what's the best way to handle multiple XML feeds (like scores for sports, class information, etc.).
Should I have one XML parser that can handle all feeds? Or should I write a parser for each feed? We're having trouble deciding the best way to implement it.
This is iOS and we use a mix of Swift 3 and Objective-C
I think the right strategy is to write a base class that handles common data types like integers, booleans, strings, etc., and then write derived classes for each type of feed. This is the strategy I use in my XML parser, which is based on the data structures and Apple's XML parser as described here:
https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/NSXML_Concepts/NSXML.html
Personally I prefer to use the XPath data models where you can query the XML tree for a specific node using a path-like string.
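A rough sketch of that base-class-plus-derived-class layout, written in Python for brevity rather than Swift or Objective-C. The class names `FeedParser` and `ScoresParser`, the `game` feed format, and the field names are all hypothetical; the path strings use the limited XPath subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

class FeedParser:
    """Base class: shared helpers for converting element text to common types."""

    record_path = ""  # path to each record element; set by subclasses

    def text(self, node, path):
        return (node.findtext(path) or "").strip()

    def integer(self, node, path):
        return int(self.text(node, path))

    def boolean(self, node, path):
        return self.text(node, path).lower() in ("true", "1", "yes")

    def parse_record(self, node):
        raise NotImplementedError

    def parse(self, xml_text):
        root = ET.fromstring(xml_text)
        return [self.parse_record(n) for n in root.findall(self.record_path)]

class ScoresParser(FeedParser):
    """Derived class: knows only the layout of the (hypothetical) scores feed."""

    record_path = "./game"

    def parse_record(self, node):
        return {
            "home": self.text(node, "home"),
            "points": self.integer(node, "points"),
            "final": self.boolean(node, "final"),
        }

scores = ScoresParser().parse(
    "<scores><game><home>Owls</home><points>21</points>"
    "<final>true</final></game></scores>"
)
```

Each new feed then needs only a `record_path` and a `parse_record`, while type conversion stays in one place.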

How to parse dynamic XML with a SAX parser

Scenario:
Large (dynamic) xml files being uploaded by users.
We need to map the xml to our own database structure.
We need to use a SAX parser (or something like it) because of memory issues when parsing large XML files.
We currently use https://github.com/craigambrose/sax_stream for parsing XML files that all have the same structure.
For a new feature, we need to parse XML with unknown contents.
How would one use a SAX parser when the XML nodes are different each time?
I've tried using https://github.com/soulcutter/saxerator. Its at_depth() function in particular could come in handy to collect the elements at a certain depth; after that we could get the elements inside a node by using the for_tag() function. Based on this information we could perhaps create a mapping on the fly.
If a SAX parser isn't an option, are there any alternatives for parsing very large (dynamic) XML files?
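One possible shape for the "collect everything at a given depth and map it on the fly" idea, sketched with Python's standard `xml.sax` rather than the Ruby gems named above. The handler name `RecordsAtDepth`, the callback signature, and the sample upload are assumptions for illustration. It only looks at direct children of each record and ignores attributes and deeper nesting.

```python
import xml.sax

class RecordsAtDepth(xml.sax.ContentHandler):
    """Turn every element at `depth` into a dict of its direct children."""

    def __init__(self, depth, on_record):
        super().__init__()
        self.target_depth = depth
        self.on_record = on_record  # called with (tag_name, fields_dict)
        self.depth = 0
        self.record = None
        self.buffer = []

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == self.target_depth:
            self.record = {}
        self.buffer = []

    def characters(self, content):
        self.buffer.append(content)

    def endElement(self, name):
        if self.record is not None and self.depth == self.target_depth + 1:
            # A direct child of the record: keep its text under its own tag name.
            self.record[name] = "".join(self.buffer).strip()
        elif self.record is not None and self.depth == self.target_depth:
            # The record itself has ended: hand it over and forget it.
            self.on_record(name, self.record)
            self.record = None
        self.buffer = []
        self.depth -= 1

records = []
xml.sax.parseString(
    b"<upload><person><name>Ann</name><city>Oslo</city></person>"
    b"<order><sku>A1</sku></order></upload>",
    RecordsAtDepth(2, lambda tag, fields: records.append((tag, fields))),
)
```

Because the field names are discovered while parsing, the callback can decide per record how to map the keys onto database columns, and only one record is held in memory at a time.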

Parsing XML chunks in a non-XML file

Can anyone share experience with parsing XML chunks embedded in a non-XML file?
I am implementing an Edge-Side-Includes[1] processor. Edge-Side-Includes elements are not necessarily embedded in XML- or well-formed XML files and this poses the question, how to go about finding and then parsing such elements.
Has anyone done something similar?
[1] http://www.w3.org/TR/esi-lang
Seems like the best option is to either embed the XML tokenizing into the overall tokenizer or identify the chunks and hand them to an XML parser individually.
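The second option (identify the chunks, then hand each one to an XML parser) might look roughly like this in Python. This is a sketch under several assumptions: the regular expression only catches self-closing tags such as `<esi:include ... />`, attribute values must not contain `>`, and the namespace URI is the one I believe the ESI 1.0 spec uses, so check it before relying on it. The helper name `find_esi_elements` and the sample page are made up.

```python
import re
import xml.etree.ElementTree as ET

# Assumed ESI 1.0 namespace URI; any URI would satisfy the parser here.
ESI_NS = "http://www.edge-delivery.org/esi/1.0"

# Matches only self-closing tags; paired tags like <esi:try> need more work.
ESI_TAG = re.compile(r"<esi:\w+\b[^>]*/>")

def find_esi_elements(text):
    """Return (local_name, attributes) for each self-closing ESI tag in text."""
    found = []
    for match in ESI_TAG.finditer(text):
        # Wrap the chunk so the esi: prefix is declared and the fragment
        # is well-formed on its own, whatever surrounds it in the page.
        wrapped = f'<wrapper xmlns:esi="{ESI_NS}">{match.group(0)}</wrapper>'
        element = ET.fromstring(wrapped)[0]
        local_name = element.tag.split("}", 1)[1]
        found.append((local_name, dict(element.attrib)))
    return found

# The surrounding markup is deliberately not well-formed (<b> is never closed).
page = (
    'Hello <b>unclosed '
    '<esi:include src="http://example.com/frag.html" onerror="continue"/>'
    ' world'
)
esi_elements = find_esi_elements(page)
```

The surrounding document never reaches the XML parser, so it does not matter that it is not well-formed.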

What is the difference between SAX and DOM?

I read some articles about the XML parsers and came across SAX and DOM.
SAX is event-based and DOM is tree model -- I don't understand the differences between these concepts.
From what I have understood, event-based means some kind of event happens to the node. Like when one clicks a particular node it will give all the sub nodes rather than loading all the nodes at the same time. But in the case of DOM parsing it will load all the nodes and make the tree model.
Is my understanding correct?
Please correct me if I am wrong, or explain event-based and tree model to me in simpler terms.
Well, you are close.
In SAX, events are triggered when the XML is being parsed. When the parser is parsing the XML, and encounters a tag starting (e.g. <something>), then it triggers the tagStarted event (actual name of event might differ). Similarly when the end of the tag is met while parsing (</something>), it triggers tagEnded. Using a SAX parser implies you need to handle these events and make sense of the data returned with each event.
In DOM, there are no events triggered while parsing. The entire XML is parsed and a DOM tree (of the nodes in the XML) is generated and returned. Once parsed, the user can navigate the tree to access the various data previously embedded in the various nodes in the XML.
In general, DOM is easier to use but has an overhead of parsing the entire XML before you can start using it.
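As a small illustration of the DOM side, here is a sketch using Python's standard `xml.dom.minidom` (the `library` document is made up). The whole document is parsed first, and only then can you move around the tree in any direction.

```python
from xml.dom import minidom

# The entire document is parsed into a tree before any node is available.
document = minidom.parseString(
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book>"
    "</library>"
)

books = document.getElementsByTagName("book")
second = books[1]

# Navigate downwards, sideways (backwards) and upwards from the same node.
second_title = second.getElementsByTagName("title")[0].firstChild.data
first_id = second.previousSibling.getAttribute("id")
parent_name = second.parentNode.tagName
```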
In just a few words...
SAX (Simple API for XML): Is a stream-based processor. You only have a tiny part in memory at any time and you "sniff" the XML stream by implementing callback code for events like tagStarted() etc. It uses almost no memory, but you can't do "DOM" stuff, like use xpath or traverse trees.
DOM (Document Object Model): You load the whole thing into memory - it's a massive memory hog. You can blow memory with even medium sized documents. But you can use xpath and traverse the tree etc.
Here in simpler words:
DOM
Tree model parser (Object based) (Tree of nodes).
DOM parses the whole file and holds the resulting tree in memory.
Has memory constraints since it loads the whole XML file before you can use it.
DOM is read and write (can insert or delete nodes).
If the XML content is small, then prefer a DOM parser.
Backward and forward search is possible for finding tags and evaluating the information inside them, which makes navigation easy.
Slower at run time.
SAX
Event based parser (Sequence of events).
SAX parses the file as it reads it, i.e. parses node by node.
Few memory constraints, as it does not keep the XML content in memory.
SAX is read only, i.e. you can't insert or delete nodes.
Use a SAX parser when the XML content is large.
SAX reads the XML file from top to bottom and backward navigation is not possible.
Faster at run time.
You are correct in your understanding of the DOM based model. The XML file will be loaded as a whole and all its contents will be built as an in-memory representation of the tree the document represents. This can be time- and memory-consuming, depending on how large the input file is. The benefit of this approach is that you can easily query any part of the document, and freely manipulate all the nodes in the tree.
The DOM approach is typically used for small XML structures (where small depends on how much horsepower and memory your platform has) that may need to be modified and queried in different ways once they have been loaded.
SAX on the other hand is designed to handle XML input of virtually any size. Instead of the XML framework doing the hard work for you in figuring out the structure of the document and preparing potentially lots of objects for all the nodes, attributes etc., SAX completely leaves that to you.
What it basically does is read the input from the top and invoke callback methods you provide when certain "events" occur. An event might be hitting an opening tag, an attribute in the tag, finding text inside an element or coming across an end-tag.
SAX stubbornly reads the input and tells you what it sees in this fashion. It is up to you to maintain all state-information you require. Usually this means you will build up some sort of state-machine.
While this approach to XML processing is a lot more tedious, it can be very powerful, too. Imagine you want to just extract the titles of news articles from a blog feed. If you read this XML using DOM it would load all the article contents, all the images etc. that are contained in the XML into memory, even though you are not even interested in it.
With SAX you can just check if the element name is (e.g.) "title" whenever your "startTag" event method is called. If so, you know that you need to add whatever the next "elementText" event offers you. When you receive the "endTag" event call, you check again if this is the closing element of the "title". After that, you just ignore all further elements, until either the input ends, or another "startTag" with a name of "title" comes along. And so on...
You could read through megabytes and megabytes of XML this way, just extracting the tiny amount of data you need.
The negative side of this approach is of course, that you need to do a lot more book-keeping yourself, depending on what data you need to extract and how complicated the XML structure is. Furthermore, you naturally cannot modify the structure of the XML tree, because you never have it in hand as a whole.
So in general, SAX is suitable for combing through potentially large amounts of data you receive with a specific "query" in mind, but need not modify, while DOM is more aimed at giving you full flexibility in changing structure and contents, at the expense of higher resource demand.
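The title-extraction walk-through above could be sketched like this with Python's standard `xml.sax`, where the generic "startTag", "elementText" and "endTag" events are called `startElement`, `characters` and `endElement`. The feed contents and the class name `TitleCollector` are made up for illustration.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Tiny state machine: remember only whether we are inside a <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.parts = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect it until the end tag.
        if self.in_title:
            self.parts.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self.parts))
            self.in_title = False

feed = (
    b"<feed>"
    b"<entry><title>First post</title><body>Long text...</body></entry>"
    b"<entry><title>Second post</title><body>More text...</body></entry>"
    b"</feed>"
)
collector = TitleCollector()
xml.sax.parseString(feed, collector)
```

Everything outside `<title>` passes through the callbacks and is dropped, so the article bodies are never accumulated in memory.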
You're comparing apples and pears. SAX is a parser that parses serialized DOM structures. There are many different parsers, and "event-based" refers to the parsing method.
Maybe a small recap is in order:
The document object model (DOM) is an abstract data model that describes a hierarchical, tree-based document structure; a document tree consists of nodes, namely element, attribute and text nodes (and some others). Nodes have parents, siblings and children and can be traversed, etc., all the stuff you're used to from doing JavaScript (which incidentally has nothing to do with the DOM).
A DOM structure may be serialized, i.e. written to a file, using a markup language like HTML or XML. An HTML or XML file thus contains a "written out" or "flattened out" version of an abstract document tree.
For a computer to manipulate, or even display, a DOM tree from a file, it has to deserialize, or parse, the file and reconstruct the abstract tree in memory. This is where parsing comes in.
Now we come to the nature of parsers. One way to parse would be to read in the entire document and recursively build up a tree structure in memory, and finally expose the entire result to the user. (I suppose you could call these parsers "DOM parsers".) That would be very handy for the user (I think that's what PHP's XML parser does), but it suffers from scalability problems and becomes very expensive for large documents.
On the other hand, event-based parsing, as done by SAX, looks at the file linearly and simply makes call-backs to the user whenever it encounters a structural piece of data, like "this element started", "that element ended", "some text here", etc. This has the benefit that it can go on forever without concern for the input file size, but it's a lot more low-level because it requires the user to do all the actual processing work (by providing call-backs). To return to your original question, the term "event-based" refers to those parsing events that the parser raises as it traverses the XML file.
The Wikipedia article has many details on the stages of SAX parsing.
In practice: book.xml
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
</bookstore>
DOM presents the XML document as a tree structure in memory.
DOM is a W3C standard.
A DOM parser works on the Document Object Model.
DOM occupies more memory and is preferred for small XML documents.
DOM is easy to navigate, either forward or backward.
SAX presents the XML document as a sequence of events, like start element: abc, end element: abc.
SAX is not a W3C standard; it was developed by a group of developers.
SAX uses very little memory and is preferred for large XML documents.
Backward navigation is not possible, as it processes the document sequentially.
An event is raised for each node (Latin nodus, 'knot') or element as the parser reaches it, and likewise for each of its sub-nodes.
This XML document, when passed through a SAX parser, will generate a sequence of events like the following:
start element: bookstore
start element: book with an attribute category equal to cooking
start element: title with an attribute lang equal to en
Text node, with data equal to Everyday Italian
....
end element: title
.....
end element: book
end element: bookstore
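To see such an event sequence for yourself, here is a small sketch using Python's standard `xml.sax` on the same book.xml content. The class name `EventRecorder` and the exact wording of the recorded lines are my own choices. Whitespace-only text between tags is skipped, since a real parser reports that as text too.

```python
import xml.sax

BOOK_XML = b"""<bookstore>
<book category="cooking">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>
</bookstore>"""

class EventRecorder(xml.sax.ContentHandler):
    """Record each parsing event as a line of text, in document order."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._text = []

    def _flush_text(self):
        # Text can arrive in pieces; join it and ignore pure whitespace.
        text = "".join(self._text).strip()
        if text:
            self.events.append(f"text: {text}")
        self._text = []

    def startElement(self, name, attrs):
        self._flush_text()
        described = "".join(f" {key}={value}" for key, value in attrs.items())
        self.events.append(f"start element: {name}{described}")

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self._flush_text()
        self.events.append(f"end element: {name}")

recorder = EventRecorder()
xml.sax.parseString(BOOK_XML, recorder)
print("\n".join(recorder.events))
```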
Both SAX and DOM are used to parse XML documents. Both have advantages and disadvantages, and either can be used depending on the situation.
SAX:
Parses node by node
Does not store the XML in memory
We can't insert or delete a node
Top to bottom traversing
DOM
Stores the entire XML document into memory before processing
Occupies more memory
We can insert or delete nodes
Traverse in any direction.
If we need to find a node and do not need to insert or delete anything, we can go with SAX; otherwise use DOM, provided we have enough memory.
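The "insert or delete nodes" part is the thing SAX cannot do. A minimal sketch with Python's standard `xml.dom.minidom`, on a cut-down version of the book example (the cut-down document is my own):

```python
from xml.dom import minidom

document = minidom.parseString(
    "<bookstore><book><title>Everyday Italian</title>"
    "<price>30.00</price></book></bookstore>"
)
book = document.getElementsByTagName("book")[0]
price = book.getElementsByTagName("price")[0]

# Insert a new <year> element in front of <price>.
year = document.createElement("year")
year.appendChild(document.createTextNode("2005"))
book.insertBefore(year, price)

# Delete the <price> element from the tree.
book.removeChild(price)

# Serialize the modified tree back to text.
result = document.documentElement.toxml()
```

This works only because the whole tree is in memory, which is exactly the cost described above.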

Adobe Alchemy returning C data structures

I have a lexer/parser (generated from an ANTLR grammar file) which, for performance reasons, I have compiled to C code that will be included in my ActionScript project using Adobe Alchemy.
The parser will generate an abstract syntax tree (in C) from an input string (passed from ActionScript). I wish to return the C AST back into ActionScript for further processing. How can I convert the tree structure of the AST to a format which I can return to ActionScript?
Thanks,
Unfortunately you can't just send a C data structure across. You've got three options, in increasing order of madness:
Serialize the data on the C side and reconstitute it on the AS3 side.
Pack up the data into Objects and return those.
Pass a pointer and size back to AS3 and pull out the data from Alchemy's ram ByteArray.
I only include #3 for completeness-- I think it would be crazy to try it for any kind of complex data structure. The code would be fragile. Following pointers would be clunky. Bleah.
For #2 you could use dynamic Objects (via AS3_Object) or concrete ones (via AS3_Get, AS3_New). This is fairly complex code also and not so fast. Can be hard to maintain.
For #1, the type of serialization is what matters. You could have your C code render the structures to a binary 'file', return that, and have your AS3 parse the file format via ByteArray. Or you could render it to XML and have AS3's XML class parse it. This has the benefit of being fairly fast (since XML is implemented natively), at least on the de-serialization end. If you have a fast XML renderer on the C side (or, ahem, sprintfs), it's not so bad.
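The shape of option #1 with XML as the wire format, sketched in Python only to show the round trip. In the real setup the rendering half would be C and the parsing half AS3. The tuple layout of the AST, the `value` attribute, and the function names `to_element` and `from_element` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical AST shape: (node_type, value_or_None, [children]).
ast = (
    "add", None, [
        ("num", "1", []),
        ("mul", None, [("num", "2", []), ("num", "3", [])]),
    ],
)

def to_element(node):
    """Producer side: render one AST node (and its subtree) as an XML element."""
    node_type, value, children = node
    element = ET.Element(node_type)
    if value is not None:
        element.set("value", value)
    for child in children:
        element.append(to_element(child))
    return element

def from_element(element):
    """Consumer side: rebuild the same tuple structure from the parsed XML."""
    return (
        element.tag,
        element.get("value"),
        [from_element(child) for child in element],
    )

# The string is the only thing that crosses the boundary.
wire_text = ET.tostring(to_element(ast), encoding="unicode")
round_tripped = from_element(ET.fromstring(wire_text))
```

Only a flat string crosses the boundary, so neither side needs to know the other's memory layout.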

Resources