SAX/StAX Chunk Processing in Java - xml-parsing

The documentation for the StAX and SAX parsers' character-handling methods says they return character data in chunks, but I am getting an OutOfMemoryError. Can anyone explain this, ideally with an example? I am trying to read an XML document that has very large content between a single pair of tags.
<x>......too long data...</x>
I am looking for something I could override, or properties I could set. I found that CDATA_CHUNK_SIZE = "jdk.xml.cdataChunkSize" can be set in the latest versions, but that's not an option for me.
In the StAX documentation: "a processor may return all contiguous character data in a single chunk, or it may split it into several chunks."
In the SAX documentation, for the characters method:
characters(char[] ch, int start, int length)
Receive notification of character data.
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks; however, all of the characters in any single event must come from the same external entity so that the Locator provides useful information.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xerces.internal.util.XMLStringBuffer.append(XMLStringBuffer.java:208)
at com.sun.org.apache.xerces.internal.impl.XMLEntityScanner.scanData(XMLEntityScanner.java:1370)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanCDATASection(XMLDocumentFragmentScannerImpl.java:1654)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:3020)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:117)
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:558)
at com.javacodegeeks.StAXParserDemo.main(StAXParserDemo.java:65)
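For illustration, a minimal sketch of the consumer side, assuming the goal is simply to copy the large element text to a Writer rather than build it up in memory; whether the parser actually delivers the data in several chunks is implementation-dependent, as the quoted documentation says.

import java.io.FileReader;
import java.io.IOException;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ChunkedTextHandler extends DefaultHandler {
    private final Writer out;

    ChunkedTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Handle this chunk immediately and forget it; nothing is accumulated.
            out.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }
}

// Usage (file name and target Writer are placeholders):
// SAXParserFactory.newInstance().newSAXParser()
//     .parse(new InputSource(new FileReader("big.xml")), new ChunkedTextHandler(targetWriter));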

Related

How to parse a binary PDF stream of unknown length?

From the PDF docs: "The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes."
As the contents may be binary, an occurrence of endstream does not necessarily indicate the end of the stream. Now consider this stream:
%PDF-1.4
%307쏢
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
x234+T03203T0^#A(235234˥^_d256220^314^U310^E^#[364^F!endstream
endobj
6 0 obj
30
endobj
The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.
I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect. While it may help PDF writers to output PDFs sequentially, it makes parsing for PDF readers quite difficult. Considering that a PDF file is read more frequently than being written, I don't understand this.
So how can such a stream be parsed correctly?
The Length is an indirect object that follows the stream. Obviously that length can only be read after the stream has been parsed.
This is an understandable conclusion if one assumes that the file is to be read sequentially beginning to end.
This assumption is incorrect, though, because parsing a PDF from the front and determining the PDF objects on the run is not the recommended way of parsing a PDF.
While ISO 32000-1 is a bit vague here and merely says
Conforming readers should read a PDF file from its end.
(ISO 32000-1, section 7.5.5 File Trailer)
ISO 32000-2 clearly specifies:
With the exception of linearized PDF files, all PDF files should be read using the trailer and cross-reference table as described in the following subclauses. Reading a non-linearized file in a serial manner is not reliable because of the way objects are to be processed after an incremental update. (See 6.3.2, "Conformance of PDF processors".)
(ISO 32000-2, section 7.5 File structure)
Thus, in case of your PDF excerpt, a PDF processor trying to read object 5 0
looks up object 5 0 in the cross references and gets its offset in the file,
goes to that offset and starts reading the object, first parsing the stream dictionary,
at the stream keyword recognizes that the object is a stream and retrieves its Length value which happens to be an indirect reference to 6 0,
looks up object 6 0 in the cross references and gets its offset in the file,
goes to that offset and reads the object, the number 30,
reads the stream content of the stream object 5 0 knowing its length is 30.
An approach as yours is explicitly considered "not reliable".
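To make that lookup sequence concrete, here is a hypothetical sketch (not a real PDF parser: the cross-reference map and the dictionary/number parsing are stubbed out and hardcoded to match the excerpt); only the seek / resolve / seek / read order matters.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

class StreamObjectReaderSketch {
    byte[] readStreamObject(RandomAccessFile pdf, Map<String, Long> xref) throws IOException {
        pdf.seek(xref.get("5 0"));                                  // offset of the stream object
        String lengthEntry = parseDictionaryEntry(pdf, "Length");   // e.g. "6 0 R"
        long length;
        if (lengthEntry.endsWith("R")) {                            // indirect reference
            pdf.seek(xref.get("6 0"));                              // jump to object 6 0
            length = parseNumber(pdf);                              // reads the number 30
        } else {
            length = Long.parseLong(lengthEntry);                   // direct /Length value
        }
        skipToStreamData(pdf);                                      // position just after 'stream' + EOL
        byte[] data = new byte[(int) length];
        pdf.readFully(data);                                        // read exactly /Length bytes
        return data;
    }

    // Stubs standing in for a real object parser; hardcoded to match the excerpt above.
    String parseDictionaryEntry(RandomAccessFile pdf, String key) { return "6 0 R"; }
    long parseNumber(RandomAccessFile pdf) { return 30; }
    void skipToStreamData(RandomAccessFile pdf) { }
}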
I think allowing Length to be an indirect object that can only be resolved after the stream is a design defect.
If there were no cross references, you'd be correct. That also is why the FDF format (which does not have mandatory cross references) specifies:
FDF is based on PDF; it uses the same syntax and has essentially the same file structure (7.5, "File structure"). However, it differs from PDF in the following ways:
[...]
The length of a stream shall not be specified by an indirect object.
(ISO 32000-2, section 12.7.8 Forms data format)
Concerning the comments:
So I'm correct that PDF cannot be parsed sequentially,
While the very original design of PDF probably was meant for sequential parsing, it has been further developed with only access via cross references in mind. PDF simply is not meant to be parsed sequentially anymore. And that was already the case when I started dealing with PDFs in the late 90s.
and the only reason is that the required length of binary streams may be defined after the stream.
That's by far not the only reason, there are more situations requiring a cross reference lookup to parse correctly.
As #mkl indicated, a parser has to read somewhere before the end of the PDF file to get startxref, hoping that it does not start parsing in the middle of a binary stream.
That's not correct. The PDF must end with "%%EOF" plus optionally an end-of-line. Before that there must be an end-of-line, before that a number, before that an end-of-line, before that startxref.
This is already expressed clearly in ISO 32000-1:
The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section.
(ISO 32000-1, section 7.5.5 File Trailer)
Thus, no danger of being "in the middle of a binary stream" if the PDF is valid.
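As a minimal sketch of what that means in practice (file name and tail size are assumptions, and error handling is omitted): read only the last kilobyte of the file, find the startxref keyword, and parse the offset on the following line.

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class FindStartXref {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile pdf = new RandomAccessFile("example.pdf", "r")) {
            int tailSize = (int) Math.min(1024, pdf.length());
            byte[] tail = new byte[tailSize];
            pdf.seek(pdf.length() - tailSize);
            pdf.readFully(tail);

            // The trailer layout guarantees: startxref, then the offset, then %%EOF.
            String text = new String(tail, StandardCharsets.ISO_8859_1);
            int at = text.lastIndexOf("startxref");
            String offsetLine = text.substring(at + "startxref".length()).trim().split("\\s+")[0];
            long xrefOffset = Long.parseLong(offsetLine);
            System.out.println("last cross-reference section starts at byte " + xrefOffset);
        }
    }
}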
The other thing I dislike about the format of PDF is this: When developing a parser, you usually create test files with some elements you are working on. This approach seems to work with everything but streams. The absolute file positions of syntax elements and the requirement for multiple random accesses makes this task harder.
You seem to be subject to the misconception that the PDF format is a tagged text format like HTML. This is not the case. Even though numerous syntactical elements are defined using some ASCII keyword and there are "lines", PDF is a binary format, the cross reference tables are not a gimmick but the central access hub to the objects, and optimization for random access is done by design.

How to check whether input is a string in Erlang?

I would like to write a function to check if the input is a string or not like this:
is_string(Input) ->
    case check_if_string(Input) of
        true -> {ok, Input};
        false -> error
    end.
But I found it is tricky to check whether the input is a string in Erlang.
The string definition in Erlang is here: http://erlang.org/doc/man/string.html.
Any suggestions?
Thanks in advance.
In Erlang a string can be actually quite a few things, so there are a few ways to do this depending on exactly what you mean by "a string". It is worth bearing in mind that every sort of string in Erlang is a list of character or lexeme values of some sort.
Encodings are not simple things, particularly when Unicode is involved. Characters can be almost arbitrarily high values, lexemes are globbed together in deep lists of integers, and Erlang iolist()s (which are super useful) are deep lists of mixed integer and binary values that get automatically flattened and converted during certain operations. If you are dealing with anything other than flat lists of printable ASCII values then I strongly recommend you read these:
Unicode module docs
String module docs
IO Library module docs
So... this is not a very simple question.
What to do about all the confusion?
Quick answer that always works: Consider the origin of the data.
You should know what kind of data you are dealing with, whether it is coming over a socket or from a file, or especially if you are generating it yourself. On the edges of your system you may need some help purifying data, though, because network clients send all sorts of random trash from time to time.
Some helper functions for the most common cases live in the io_lib module:
io_lib:char_list/1: Returns true if the input is a list of characters in the unicode range.
io_lib:deep_char_list/1: Returns true if the input is a deep list of legal chars.
io_lib:deep_latin1_char_list/1: Returns true if the input is a deep list of Latin-1 (your basic printable ASCII values from 32 to 126).
io_lib:latin1_char_list/1: Returns true if the input is a flat list of Latin-1 characters (90% of the time this is what you're looking for)
io_lib:printable_latin1_list/1: Returns true if the input is a list of printable Latin-1 (If the above isn't what you wanted, 9% of the time this is the one you want)
io_lib:printable_list/1: Returns true if the input is a flat list of printable chars.
io_lib:printable_unicode_list/1: Returns true if the input is a flat list of printable unicode chars (for that 1% of the time that this is your problem -- except that for some of us, myself included here in Japan, this covers 99% of my input checking cases).
For more particular cases you can either use a regex from the re module or write your own recursive function that zips through a string for those special cases where a regex either doesn't fit, is impossible, or could make you vulnerable to regex attacks.
In Erlang, a string can be represented as a list or as a binary.
If the string is represented as a list, you can use the following function to check it:
is_string([C|T]) when (C >= 0) and (C =< 255) ->
    is_string(T);
is_string([]) ->
    true;
is_string(_) ->
    false.
If the string is represented as a binary in your code, the built-in function is_binary(Term) can be used.

Is it possible to parse big file with ANTLR?

Is it possible to instruct ANTLR not to load the entire file into memory? Can it apply rules one by one and generate the topmost list of nodes sequentially while reading the file? Also, is it perhaps possible to drop already-analyzed nodes somehow?
Yes, you can use:
UnbufferedCharStream for your character stream (passed to lexer)
UnbufferedTokenStream for your token stream (passed to parser)
This token stream implementation doesn't differentiate on token channels, so make sure to use ->skip instead of ->channel(HIDDEN) as the command in lexer rules for tokens that shouldn't be sent to the parser.
Make sure to call setBuildParseTree(false) on your parser or a giant parse tree will be created for the entire file.
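Roughly, the setup could look like this (a sketch; MyGrammar and the entry rule document are hypothetical names):

import java.io.FileReader;
import org.antlr.v4.runtime.*;

public class LargeFileParse {
    public static void main(String[] args) throws Exception {
        // Unbuffered character stream: the file is not loaded into memory at once.
        CharStream input = new UnbufferedCharStream(new FileReader("huge-input.txt"));
        MyGrammarLexer lexer = new MyGrammarLexer(input);
        // Copy token text eagerly, since the unbuffered stream releases already-consumed data.
        lexer.setTokenFactory(new CommonTokenFactory(true));
        TokenStream tokens = new UnbufferedTokenStream(lexer);
        MyGrammarParser parser = new MyGrammarParser(tokens);
        parser.setBuildParseTree(false);   // avoid keeping a parse tree for the whole file
        parser.document();                 // hypothetical entry rule; use a listener or rule actions instead of the tree
    }
}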
Edit with some additional commentary:
I put quite a bit of work into making sure UnbufferedCharStream and UnbufferedTokenStream operate in the most "sane" manner possible, especially in relation to the mark, release, seek, and getText methods. My goal was to preserve as much of the functionality of those methods as possible without compromising the ability of the stream to release unused memory.
ANTLR 4 allows for true unlimited lookahead. If your grammar requires lookahead to EOF to make a decision, then you would not be able to avoid loading the entire input into memory. You'll have to take great care to avoid this situation when writing your grammar.
There is a Wiki page buried somewhere on Antlr.org that speaks to your question; I cannot seem to find it just now.
In substance, the lexer reads data using a standard InputStream interface, specifically ANTLRInputStream.java. The typical implementation is ANTLRFileStream.java that preemptively reads the entire input data file into memory. What you need to do is to write your own buffered version -"ANTLRBufferedFileStream.java"- that reads from the source file as needed. Or, just set a standard BufferedInputStream/FileInputStream as the data source to the AntlrInputStream.
One caveat is that Antlr4 has the potential for doing an unbounded lookahead. Not likely a problem for a reasonably sized buffer in normal operation. More likely when the parser attempts error recovery. Antlr4 allows for tailoring of the error recovery strategy, so the problem is manageable.
Additional detail:
In effect, Antlr implements a pull-parser. When you call the first parser rule, the parser requests tokens from the lexer, which requests character data from the input stream. The parser/lexer interface is implemented by a buffered token stream, nominally BufferedTokenStream.
The parse tree is little more than a tree data structure of tokens. Well, a lot more, but not in terms of data size. Each token is an INT value backed typically by a fragment of the input data stream that matched the token definition. The lexer itself does not require a full copy of the lex'd input character stream to be kept in memory. And, the token text fragments could be zero'd out. The critical memory requirement for the lexer is the input character stream lookahead scan, given a buffered file input stream.
Depending on your needs, the in-memory parse tree can be small even given a 100GB+ input file.
To help further, you need to explain more what it is you are trying to do in Antlr and what defines your minimum critical memory requirement. That will guide which additional strategies can be recommended. For example, if the source data is amenable, you can use multiple lexer/parser runs, each time subselecting in the lexer different portions of the source data to process. Compared to file reads and DB writes, even with fast disks, Antlr execution will likely be barely noticeable.

XML Serialization VS XML Parsing

What is the difference between XML Serialization and XML Parsing? When should we use each one?
Parsing is, generally speaking, the processing of an input stream into meaningful data structures; in the XML context, parsing is the process of reading a sequence of characters conforming to the grammar and other constraints of the XML spec into whatever internal representation of XML your program uses.
Serialization is the opposite process: processing the internal data structures of a program (in this context, your internal representation of an XML document) and creating a character sequence (typically written to an output stream) that conforms to the angle-bracket syntax of the spec.
Use a parser to read XML from a character stream into data structures; use a serializer to write data structures out into a character stream.
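A small illustration with the JDK's own DOM APIs (a sketch; the XML snippet is made up): parsing turns characters into a Document tree, and serialization turns the tree back into characters.

import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ParseAndSerialize {
    public static void main(String[] args) throws Exception {
        // Parsing: character stream -> internal representation (a DOM tree).
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader("<greeting>hello</greeting>")));

        // Serialization: internal representation -> character stream.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        System.out.println(out);
    }
}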
I don't know much about XML, but here's what I know about serialization and parsing.
parsing - reading data (parse-in) from storage, and writing data (parse-out) to storage… "such as a text file"
serializing - (serialize) translating data into a readable format, and (de-serialize) translate that format back to data… "i.e. you want to translate a struct into readable content, stream that content across a network, and translate it back into code."
here's a new one…
marshalling - (marshall and unmarshall) similar to serialize, except marshalling is used to translate data into a different format… "i.e. you want to translate a stream of bytes into a 32-bit structure (one byte to four bytes)"
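For the "one byte to four bytes" remark, a tiny sketch of that kind of transformation (just an illustration, not any particular marshalling library):

import java.nio.ByteBuffer;

public class BytesToInt {
    public static void main(String[] args) {
        byte[] wire = {0x00, 0x00, 0x01, 0x2C};        // four bytes read off a stream
        int value = ByteBuffer.wrap(wire).getInt();    // one 32-bit value (300)
        System.out.println(value);
    }
}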
in easy terms (for beginners)
TL;DR
XML parsing (or XML deserialization) ==> input: valid XML, output: data structures
XML serialization ==> input: data structures, output: valid XML
XML parsing (a.k.a XML de-serialization)
You take a .xml file (example.xml) as input and process it with your programming language of choice, so that your program can do something useful with the data in that file. Your program transforms the information from the file into data structures that your programming language can deal with (i.e. lists, arrays, objects, etc.).
XML serialization
Your program (in any programming language) transforms information represented as data structures (lists, arrays, objects, etc.) into valid XML output, which can be saved to a file or transmitted to another program.
NOTE: Technically, the input (when we are talking about parsing) and the output (when we are talking about serialization) do not have to be files. As said in the more professional answer above, they can be any input/output stream, too. And files don't have to have a .xml extension; they can have any extension that represents a valid XML format (e.g. .svg is also an XML-based format). The key to understanding is that when we do XML parsing we have valid XML on the input side and data structures on the output side, and when we do XML serialization we have data structures on the input side and valid XML on the output side.
To give an example from the Python world: you can use built-in packages (like xml.etree.ElementTree) or third-party libraries (like lxml (recommended) or xmltodict) to do both - parse (deserialize) or create (serialize) XML data.

Erlang: unmarshalling variable length data fields in binary stream

I'm creating an Erlang application that needs to parse a binary TCP stream from a 3rd party program.
One of the types of packets I can receive has data formatted like this:
N_terms *[Flags ( 8 bits ), Type ( 8 bits ), [ optional data ] ].
The problem I have is that the optional data is determined by a permutation of all possible combinations of flags and types. Additionally, depending on the type, there is additional optional data associated with it.
If I were to write a parser in an imperative language, I'd simply read in the 2 fields and then have a series of if( ... ) statements where I would read a value and increment my position in the stream. In Erlang, my initial naive assumption is that I would have 2^N function clauses to match byte syntax on the stream, where N is total number of flags + all types with additional optional data.
As it stands, at a minimum I have 3 flags and 1 type that has optional data that I must implement, which would mean I'd have 16 different function clauses to match on the stream.
There must be a better, idiomatic way to do this in Erlang - what am I missing?
Edit:
I should clarify I do not know the number of terms in advance.
One solution is to take
<<Flag:8/integer, Type:8/integer, Rest/binary>>
and then write a function decode(Flag, Type) which returns a description of what Rest will contain. Now, that description can then be passed to a decoder for Rest which can then use the description given to parse it correctly. A simple solution is to make the description into a list and whenever you decode something off of the stream, you use that description list to check that it is valid. That is, the description acts like your if.. construction.
As for the pointer move, it is easy. If you have a Binary and decode it
<<Take:N/binary, Next/binary>> = Binary,
Next is the moved pointer under the hood you are searching for. So you can just break your binary into pieces like that to obtain the next part to work on.
I would parse it something like:
parse_term(Acc, 0, <<>>) ->
    {ok, Acc};
parse_term(_, 0, _) ->
    {error, garbage};
parse_term(Acc, N, <<Flag:8/integer, Type:8/integer, Rest/binary>>) ->
    %% extract_optional/3 decodes whatever optional data this Flag/Type pair implies
    {Optional, Rest1} = extract_optional(Flag, Type, Rest),
    parse_term([{Flag, Type, Optional} | Acc], N - 1, Rest1).

parse_stream(<<NTerms/integer, Rest/binary>>) ->
    parse_term([], NTerms, Rest).
