Parsing XML chunks in a non-XML file - xml-parsing

Can anyone share experience with parsing XML chunks embedded in a non-XML file?
I am implementing an Edge-Side-Includes[1] processor. Edge-Side-Includes elements are not necessarily embedded in XML files, let alone well-formed ones, which raises the question of how to find and then parse such elements.
Has anyone done something similar?
[1] http://www.w3.org/TR/esi-lang

It seems the best option is either to embed the XML tokenizing in the overall tokenizer or to identify the chunks and hand them to an XML parser individually.
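
A minimal sketch of the second option (identify the chunks, then hand each one to an XML parser), written in Python purely for illustration. The regular expression, the helper name find_esi_elements, and the restriction to self-closing tags are assumptions made for this example, and the namespace URI is assumed to be the one given in the ESI 1.0 note. Attribute values that contain "<" or ">" would defeat this simple pattern.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace URI for the esi: prefix (taken from the ESI 1.0 note).
ESI_NS = "http://www.edge-delivery.org/esi/1.0"

# Matches self-closing ESI elements such as <esi:include src="..."/>.
ESI_EMPTY_TAG = re.compile(r"<esi:[A-Za-z]+\b[^<>]*/>")


def find_esi_elements(text):
    """Return (start, end, element) for each self-closing ESI tag in text."""
    results = []
    for match in ESI_EMPTY_TAG.finditer(text):
        # Wrap the chunk so the esi: prefix is declared for the XML parser.
        wrapped = f'<root xmlns:esi="{ESI_NS}">{match.group(0)}</root>'
        try:
            element = ET.fromstring(wrapped)[0]
        except ET.ParseError:
            continue  # chunk is not well-formed XML; leave it untouched
        results.append((match.start(), match.end(), element))
    return results


# The surrounding document does not need to be XML at all.
page = 'junk <b>unclosed & <esi:include src="http://example.com/a.html"/> more'
found = find_esi_elements(page)
```

The start and end offsets let the caller splice the included content back into the surrounding text, which the XML parser never sees.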

Related

XML parsing as pagination

I have an XML file that I download from a server, and it contains 50k elements. I need to display those 50k elements in a tableView.
But this consumes a lot of memory.
So I wondered whether there is an XML parser available in Swift that allows some kind of pagination, e.g. parse elements 1 to 10, then 10 to 20, and so on.
All you need is a SAX XML parser, such as the SAX interface of libxml2. A DOM parser is a poor fit for 50k elements because it loads the entire document into memory to build the tree before you can walk the nodes, whereas a SAX parser processes the XML in chunks.
Unfortunately, most of the SAX parsers I am aware of are written in C, so you have to write a wrapper around them to use them in a Swift project. The good news is that there are tutorials explaining how to do that.
Here are a few useful links on integrating libxml2 into a Swift project:
http://redqueencoder.com/wrapping-libxml2-for-swift/
https://www.cocoawithlove.com/2008/10/using-libxml2-for-parsing-and-xpath.html
EDIT:
You can also use NSXMLParser (exposed as XMLParser in Swift 3 and later), which is a SAX-style parser written in Objective-C. There are plenty of tutorials on how to use it with Swift:
https://medium.com/#lucascerro/understanding-nsxmlparser-in-swift-xcode-6-3-1-7c96ff6c65bc
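
To illustrate the streaming idea independently of Swift, here is a small Python sketch that reads elements one at a time and groups them into pages. It uses the standard library's xml.etree.ElementTree.iterparse rather than libxml2 or NSXMLParser; the function names iter_items and pages, the tag name "item", and the page size are hypothetical choices for this example.

```python
import io
import itertools
import xml.etree.ElementTree as ET


def iter_items(source, tag):
    """Yield (attributes, text) for each element named tag, one at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            item = (dict(elem.attrib), elem.text or "")
            elem.clear()  # drop the element's contents once they are copied
            yield item


def pages(source, tag, page_size):
    """Group the streamed items into lists of at most page_size entries."""
    items = iter_items(source, tag)
    while True:
        page = list(itertools.islice(items, page_size))
        if not page:
            return
        yield page


# Hypothetical sample data standing in for the downloaded file.
sample = "<items>" + "".join(
    f'<item id="{i}">v{i}</item>' for i in range(25)
) + "</items>"
all_pages = list(pages(io.BytesIO(sample.encode("utf-8")), "item", 10))
```

Because pages is a generator, a caller can request the next page only when the table view scrolls near the end, so only a small part of the document is held as data at any time.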

Parsing and pretty printing the same file format in Haskell

I was wondering whether there is a standard, canonical way in Haskell to write not only a parser for a specific file format but also a writer.
In my case, I need to parse a data file for analysis. However, I also simulate data to be analyzed and save it in the same file format. I could now write a parser using Parsec or something equivalent and also write functions that perform the text output in the way that it is needed, but whenever I change my file format, I would have to change two functions in my code. Is there a better way to achieve this goal?
Thank you,
Dominik
The BNFC-meta package (https://hackage.haskell.org/package/BNFC-meta-0.4.0.3) might be what you are looking for:
"Specifically, given a quasi-quoted LBNF grammar (as used by the BNF Converter) it generates (using Template Haskell) a LALR parser and pretty pretty printer for the language."
Update: I found another package that also seems to fulfill the objective (not tested yet): http://hackage.haskell.org/package/syntax
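
The underlying idea, one format description that drives both the reader and the writer, can be sketched without any of these packages. The following is a hand-rolled Python illustration, not the BNFC-meta or syntax approach; the field list, the tab separator, and the function names are all hypothetical.

```python
# Single source of truth for the record layout: (field name, converter).
# Changing the format means editing only this list.
FIELDS = [("time", float), ("channel", int), ("label", str)]
SEP = "\t"


def parse_record(line):
    """Parse one line of the data file into a dict, guided by FIELDS."""
    parts = line.rstrip("\n").split(SEP)
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {name: conv(raw) for (name, conv), raw in zip(FIELDS, parts)}


def format_record(record):
    """Write a dict back out in the same layout, guided by FIELDS."""
    return SEP.join(str(record[name]) for name, _conv in FIELDS)


original = {"time": 0.5, "channel": 3, "label": "spike"}
line = format_record(original)
round_tripped = parse_record(line)
```

A round-trip check like the one at the end is a cheap way to confirm that the parser and the writer still agree after the format changes.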

How to parse dynamic XML with a SAX parser

Scenario:
Large (dynamic) xml files being uploaded by users.
We need to map the xml to our own database structure.
We need to use a SAX parser (or something like it) because of memory issues when parsing large XML files.
We currently use https://github.com/craigambrose/sax_stream for parsing XML files that all have the same structure.
For a new feature, we need to parse XML with unknown contents.
How would one use a SAX parser when the XML nodes are different each time?
I've tried https://github.com/soulcutter/saxerator. Its at_depth() function could come in handy for collecting the elements at a certain depth; after that, we could get the elements inside a node with the for_tag() function. Based on this information we could perhaps create a mapping on the fly.
If a SAX parser isn't an option, are there any alternatives for parsing very large (dynamic) XML files?
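
The "collect elements at a certain depth, then discover their children" idea can be sketched with a plain SAX handler that never needs to know tag names in advance. This example uses Python's standard xml.sax module rather than saxerator or sax_stream; the class name DepthCollector, the callback signature, and the sample document are assumptions for illustration. Text from more deeply nested elements is simply flattened into the enclosing field.

```python
import xml.sax


class DepthCollector(xml.sax.ContentHandler):
    """Turn every element at a given depth into a dict of its child fields."""

    def __init__(self, depth, on_record):
        super().__init__()
        self.depth = depth          # the root element is depth 1
        self.on_record = on_record  # called with (tag name, field dict)
        self.level = 0
        self.record = None
        self.field = None
        self.text = []

    def startElement(self, name, attrs):
        self.level += 1
        if self.level == self.depth:
            self.record = {}
        elif self.level == self.depth + 1 and self.record is not None:
            self.field = name
            self.text = []

    def characters(self, content):
        if self.field is not None:
            self.text.append(content)

    def endElement(self, name):
        if self.level == self.depth + 1 and self.field is not None:
            self.record[self.field] = "".join(self.text).strip()
            self.field = None
        elif self.level == self.depth and self.record is not None:
            self.on_record(name, self.record)
            self.record = None
        self.level -= 1


records = []
handler = DepthCollector(2, lambda tag, fields: records.append((tag, fields)))
document = (
    b"<catalog><book><title>A</title><year>1999</year></book>"
    b"<cd><artist>B</artist></cd></catalog>"
)
xml.sax.parseString(document, handler)
```

Each record arrives as a tag name plus a dict of whatever fields were present, so a mapping to the database structure can be chosen per record while memory use stays bounded by the size of one record.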

XML Serialization VS XML Parsing

What is the difference between XML Serialization and XML Parsing? When should we use each one?
Parsing is, generally speaking, the processing of an input stream into meaningful data structures; in the XML context, parsing is the process of reading a sequence of characters conforming to the grammar and other constraints of the XML spec into whatever internal representation of XML your program uses.
Serialization is the opposite process: processing the internal data structures of a program (in this context, your internal representation of an XML document) and creating a character sequence (typically written to an output stream) that conforms to the angle-bracket syntax of the spec.
Use a parser to read XML from a character stream into data structures; use a serializer to write data structures out into a character stream.
I don't know much about XML, but here's what I know about serialization and parsing.
Parsing: reading data in (parse-in) from storage and writing data out (parse-out) to storage, such as a text file.
Serializing: translating data into a readable format (serialize) and translating that format back into data (de-serialize). For example, you want to translate a struct into readable content, stream that content across a network, and translate it back into data.
Here's a new one:
Marshalling: (marshal and unmarshal) similar to serializing, except that marshalling is used to translate data into a different format. For example, you want to translate a stream of bytes into a 32-bit structure (one byte to four bytes).
in easy terms (for beginners)
TL;DR
XML parsing (or XML deserialization) ==> input: valid XML, output: data structures
XML serialization ==> input: data structures, output: valid XML
XML parsing (a.k.a. XML de-serialization)
You take an .xml file (example.xml) as input and process it with your programming language of choice, so that your program can do something useful with the data in that file. Your program transforms the information from the file into data structures that your programming language can deal with (e.g. lists, arrays, objects).
XML serialization
Your program (in any programming language) transforms information represented as data structures (lists, arrays, objects, etc.) into valid XML output, which can be saved to a file or transmitted to another program.
NOTE: Technically, the input (when we are talking about parsing) and the output (when we are talking about serialization) do not have to be files. As said in the more professional answer above, they can be any input/output stream, too. And files don't have to have the .xml extension; they can have any file extension that represents a valid XML format (e.g. .svg is also an XML-based format). The key to understanding is that when we do XML parsing, we have valid XML on the input side and data structures on the output side, and when we do XML serialization, we have data structures on the input side and valid XML on the output side.
To give an example from the Python world: you can use built-in packages (like xml.etree.ElementTree) or third-party libraries (like lxml (recommended) or xmltodict) to do both: parse (deserialize) and create (serialize) XML data.
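
As a small illustration of both directions with xml.etree.ElementTree, the sketch below parses XML text into a list of dicts and serializes such a list back into XML. The element names (library, book), the sample data, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET


def parse_books(xml_text):
    """Parsing: XML text in, plain Python data structures out."""
    root = ET.fromstring(xml_text)
    return [
        {"id": book.get("id"), "title": book.text}
        for book in root.findall("book")
    ]


def serialize_books(books):
    """Serialization: plain Python data structures in, XML text out."""
    root = ET.Element("library")
    for book in books:
        element = ET.SubElement(root, "book", id=book["id"])
        element.text = book["title"]
    return ET.tostring(root, encoding="unicode")


xml_in = '<library><book id="1">Dune</book><book id="2">Emma</book></library>'
books = parse_books(xml_in)
xml_out = serialize_books(books)
```

Parsing the serialized text again should give back the same list, which is a quick check that the two functions are inverses for this data.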

Parsing binary data

I got interested in parser generators. But I don't have the theoretical background. I just read a few things on the internet.
Currently I'm trying to do something with ANTLR.
I have a special format for my data frames:
The first byte of a frame is a tag that describes the nature of the data
The second byte contains the length (number of bytes) of the data itself
Then follows the data itself
The data can contain dataframes itself, and dataframes can be listed one after the other
I hope my description is clear. My questions:
Can I create a parser with ANTLR that reads the length of the frame and then knows when the frame ends?
In ANTLR, can I load the different tags I use from a generated file?
Thank you!
I'm not 100% sure about this, but:
Parser generators like ANTLR are built around context-free grammars.
Using length fields in your data makes your grammar not context-free (context-sensitive, I think).
It is the latter point I'm not sure about; you may want to research it some more.
You will probably have to write the packet "parser" yourself (it then has to be a parser for your context-sensitive packet grammar).
Alternatively, you could drop the length field and use something like S-expressions, JSON, or XML; these would be parseable by something generated with ANTLR.
I think you would be better off writing a binary parser by hand instead of using ANTLR, because ANTLR is primarily intended to read and make sense of text, not binary data. The lexer is focused on tokenizing text, so making it read binary data would be an uphill battle.
It sounds as if your structure needs some kind of recursive reading of the data, although it could be simpler to keep a tree structure and fill it in as you read the file.
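
A hand-written recursive reader for the tag/length/data layout described in the question might look like the following Python sketch. The question does not say how to tell which frames contain nested frames, so the CONTAINER_TAGS set (and the tag value 0x10 in it) is a hypothetical assumption, as are the function name and the sample bytes.

```python
# Hypothetical: tags whose data is itself a sequence of frames.
CONTAINER_TAGS = {0x10}


def parse_frames(data, start=0, end=None):
    """Parse consecutive frames in data[start:end] into (tag, payload) pairs.

    payload is bytes for a leaf frame and a nested list for a container frame.
    """
    if end is None:
        end = len(data)
    frames = []
    pos = start
    while pos < end:
        if pos + 2 > end:
            raise ValueError("truncated frame header")
        tag, length = data[pos], data[pos + 1]  # first byte tag, second length
        body_start = pos + 2
        body_end = body_start + length
        if body_end > end:
            raise ValueError("frame length exceeds available data")
        if tag in CONTAINER_TAGS:
            payload = parse_frames(data, body_start, body_end)  # recurse
        else:
            payload = bytes(data[body_start:body_end])
        frames.append((tag, payload))
        pos = body_end  # the length field tells us where the next frame starts
    return frames


# A container frame holding two leaf frames, followed by one more leaf frame.
sample = bytes([0x10, 0x05, 0x01, 0x01, 0xAA, 0x02, 0x00, 0x01, 0x02, 0xBB, 0xCC])
tree = parse_frames(sample)
```

The length byte is used directly to find the end of each frame, which is exactly the step that is awkward to express in a generated context-free parser.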
