So, I'm getting this data. From the network socket, or out of a file. I'm cobbling together code that will interpret the data. Read some bytes, check some flags, and some bytes indicate how much data follows. Read in that much data, rinse, repeat.
This task reminds me a lot of parsing source code. I'm comfy with lex/yacc and ANTLR, but they're not up to this task. You can't specify bits and raw bytes as tokens (well, maybe you could, but I wouldn't know how), and you can't coax them into "read two bytes, make them into an unsigned 16-bit integer, call it n, and then read n bytes".
Then again, when the spec of the protocol/data format is defined in a systematic manner (not all of them are), there should be a systematic way to read in data that is formatted according to the protocol. Right?
There's gotta be a tool that does that.
The Kaitai Struct project emerged recently to solve exactly that task: generating binary parsers from a spec. You describe the serialization of an arbitrary data structure in a YAML/JSON-based format like this:
meta:
  id: my_struct
  endian: le
seq:
  - id: some_int
    type: u4
  - id: some_string
    type: str
    encoding: UTF-8
    size: some_int + 4
  - id: another_int
    type: u4
compile it using ksc (they provide a reference compiler implementation), and, voila, you've got a parser in any supported programming language, for example, in C++:
my_struct_t::my_struct_t(kaitai::kstream *p_io, kaitai::kstruct *p_parent, my_struct_t *p_root) : kaitai::kstruct(p_io) {
    m__parent = p_parent;
    m__root = this;
    m_some_int = m__io->read_u4le();
    m_some_string = m__io->read_str_byte_limit((some_int() + 4), "UTF-8");
    m_another_int = m__io->read_u4le();
}
or in Java:
private void _parse() throws IOException {
    this.someInt = this._io.readU4le();
    this.someString = this._io.readStrByteLimit((someInt() + 4), "UTF-8");
    this.anotherInt = this._io.readU4le();
}
After adding that to your project, you get a very intuitive API like this (the example is in Java, but more languages are supported):
// given file.dat contains 00 00 00 00|41 42 43 44|07 01 00 00
// (some_int = 0, so some_string occupies 0 + 4 = 4 bytes)
MyStruct s = MyStruct.fromFile("path/to/file.dat");
s.someString() // => "ABCD"
s.anotherInt() // => 263 = 0x107
It supports different endianness, conditional structures, substructures, and much more. Fairly complex data structures, such as the PNG image file format or PE executables, can be parsed.
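For comparison, the same three-field layout can also be read by hand. The following is a minimal Python sketch that uses only the standard `struct` module; it is not Kaitai-generated code, and the function name `parse_my_struct` and the sample bytes are assumptions made for illustration.

```python
import struct


def parse_my_struct(buf: bytes) -> dict:
    """Hand-written reader for the my_struct layout (little-endian)."""
    offset = 0
    # some_int: unsigned 32-bit little-endian integer
    (some_int,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    # some_string: UTF-8 text occupying (some_int + 4) bytes
    size = some_int + 4
    raw = buf[offset:offset + size]
    if len(raw) != size:
        raise ValueError("truncated input while reading some_string")
    some_string = raw.decode("utf-8")
    offset += size
    # another_int: unsigned 32-bit little-endian integer
    (another_int,) = struct.unpack_from("<I", buf, offset)
    return {
        "some_int": some_int,
        "some_string": some_string,
        "another_int": another_int,
    }


# some_int = 0, so the string is 0 + 4 = 4 bytes long
sample = bytes.fromhex("00000000" "41424344" "07010000")
print(parse_my_struct(sample))
```

Every new field means another hand-maintained offset calculation, which is the bookkeeping a declarative spec is meant to remove.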
You may try Boost.Spirit (v2), which has recently gained binary parsing tools: endianness-aware native and mixed parsers.
// This is not a complete and useful example, but just illustration that parsing
// of raw binary to real data components is possible
typedef boost::uint8_t byte_t;
byte_t raw[16] = { 0 };
char const* hex = "01010000005839B4C876BEF33F83C0CA";
my_custom_hex_to_bytes(hex, raw, 16);
// parse the first 4 raw bytes into one 32-bit word (repeat for the remaining words)
boost::uint32_t word(0);
byte_t* beg = raw;
boost::spirit::qi::parse(beg, beg + 16, boost::spirit::qi::dword, word);
UPDATE: I found a similar question, where Joel de Guzman confirms in his answer that binary parsers are available: Can Boost Spirit be used to parse byte stream data?
The Construct parser, written in Python, has done some interesting work in this field.
The project has had a number of authors, and periods of stagnation, but as of 2017 it seems to be more active again.
Read up on ASN.1. If you can describe the binary data in its terms, you can then use various available kits. Not for the faint of heart.
https://github.com/lonelybug/ddprotocol/wiki/Getting-Start
hope this is helpful to you.
There is certainly nothing stopping you from writing a recursive descent parser for binary data, the same way you would hand-tool a text parser. If the format you need to read is not too complicated, this is a reasonable way to proceed.
Of course, if your format is very simple, you could take a look at Reading binary file defined by a struct and similar questions.
I don't know of any parser generators for non-text input, though those are also possible.
In the event that you are not familiar with coding parsers by hand, the canonical SO question is Learning to write a compiler. The Crenshaw tutorial (and in PDF) is a fast read.
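A hand-written reader for the "read a length, then read that many bytes" pattern from the question can be quite short. The sketch below is in Python and assumes a made-up record layout (one flags byte, a big-endian unsigned 16-bit length, then the payload); the layout and the names `read_exact` and `read_records` are illustrative assumptions. It also assumes a file-like object whose `read(n)` returns fewer than `n` bytes only at end of input, which is not true of a raw socket.

```python
import io
import struct


def read_exact(stream, n):
    """Read exactly n bytes or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError(f"expected {n} bytes, got {len(data)}")
    return data


def read_records(stream):
    """Yield (flags, payload) tuples until the stream is exhausted."""
    while True:
        header = stream.read(3)
        if not header:
            return  # clean end of input
        if len(header) != 3:
            raise EOFError("truncated record header")
        # 1 byte of flags, then a big-endian unsigned 16-bit length n
        flags, n = struct.unpack(">BH", header)
        yield flags, read_exact(stream, n)


data = bytes([0x01, 0x00, 0x03]) + b"abc" + bytes([0x80, 0x00, 0x00])
for flags, payload in read_records(io.BytesIO(data)):
    print(hex(flags), payload)
```

Nested structures follow the same shape: each substructure gets its own reading function, which is what makes the approach recursive descent.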
See also Google Protocol Buffers.
There is a tool called binpac that does exactly this.
http://www.icsi.berkeley.edu/pubs/networking/binpacIMC06.pdf
I've an app in which I need to send a packet to an external device. This packet has a CRC before the end message. The CRC has to be separated in CRCH and CRCL.
For example: CRC = 0x5B so CRCH should be 0x35 (ASCII representation of 5) and CRCL should be 0x42 (ASCII representation of B).
I searched the internet and found several functions in C and other languages to create a CRC32, but my device needs a CRC8. How can I create a CRC8 in Objective-C? Can you help me find a way to do this?
Surprising how this rather simple question is still not answered.
First, you need to separate the problems in your question. CRCH and CRCL are just a hex conversion, which is easy to do (and there are lots of examples on the internet too). In most cases, you just need to compare the CRC you received to the one you calculated, so you just need to convert them to the same form. E.g. convert the CRC you calculated to text using sprintf with the %02X format (zero-padded, so that 0x05 becomes "05") and compare it with the CRC you received (as text).
The second part is the actual CRC. This is a little bit trickier. Your options are as follows:
1) the easiest is to rename your .m file to .mm and use CRC library from boost C++. It's just a header include, so it won't affect rest of your code in any way and you can even make it in a separate file, so you'll have a C function which will use boost under the hood.
You might need to find parameters for your CRC though. For that, see this excellent resource http://reveng.sourceforge.net/crc-catalogue/
2) You can write your own implementation. Surprisingly, there are plenty of examples for particular algorithms on the internet, but they are often optimized for a particular CRC and are hard to adapt to other algorithms.
So, your best bet is probably starting with "A Painless Guide to CRC Error Detection Algorithms" article by Ross Williams. It also includes examples in C.
Though it could be complicated to get your head around all the technical stuff and explanations there.
So, as a shortcut I'd like to suggest looking at my own implementation in Java here. It's obviously not Objective-C, but I looked through it and you should be able to just copy and paste it into your .m file and compile it, possibly adjusting a few types.
You'll need public static long calculateCRC(Parameters crcParams, byte[] data) and private static long reflect(long in, int count) functions there. And the Parameters class which looks scarier, but should just become a struct in your case:
struct Parameters
{
    int width;        // Width of the CRC expressed in bits
    long polynomial;  // Polynomial used in this CRC calculation
    bool reflectIn;   // Refin indicates whether input bytes should be reflected
    bool reflectOut;  // Refout indicates whether the final CRC value should be reflected
    long init;        // Init is the initial value for the CRC calculation
    long finalXor;    // Xor is the value XORed with the result before it is returned
};
You might also want to adjust the types there to a shorter unsigned type (Java has no unsigned types). But it should work perfectly well as is.
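To make the parameterized approach concrete, here is a minimal bit-by-bit sketch in Python. It is not the Java implementation mentioned above, only an illustration of the same parameter model for widths of 8 bits or more. The CRC-8 parameters used in the example (polynomial 0x07, no reflection, zero init and final XOR) are an assumption; the device's real parameters have to come from its documentation or the CRC catalogue. The helper `to_crch_crcl` is a hypothetical name for the hex split described in the question.

```python
def reflect(value: int, width: int) -> int:
    """Reverse the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def calculate_crc(data: bytes, width: int, polynomial: int, init: int,
                  reflect_in: bool, reflect_out: bool, final_xor: int) -> int:
    """Bit-by-bit CRC for widths of 8 bits or more."""
    mask = (1 << width) - 1
    top_bit = 1 << (width - 1)
    crc = init
    for byte in data:
        if reflect_in:
            byte = reflect(byte, 8)
        # Feed the next byte into the top of the register
        crc ^= byte << (width - 8)
        for _ in range(8):
            if crc & top_bit:
                crc = ((crc << 1) ^ polynomial) & mask
            else:
                crc = (crc << 1) & mask
    if reflect_out:
        crc = reflect(crc, width)
    return crc ^ final_xor


def to_crch_crcl(crc8: int) -> tuple:
    """Split an 8-bit CRC into the ASCII codes of its two hex digits."""
    text = "%02X" % crc8
    return ord(text[0]), ord(text[1])


# Assumed parameters: plain CRC-8 with polynomial 0x07
crc8 = calculate_crc(b"123456789", 8, 0x07, 0x00, False, False, 0x00)
print(hex(crc8))
print([hex(c) for c in to_crch_crcl(0x5B)])  # ['0x35', '0x42']
```

A table-driven version is faster, but the bitwise loop is easier to check against a known test vector before optimizing.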
If I understood this correctly, Common Lisp was standardized at a time when there were many different architectures with different opinions on the size of a byte. To that end, Common Lisp allows us to define the size of a byte.
For example I can create an array of 8bit bytes like this:
(make-array 10 :element-type '(unsigned-byte 8))
This works great and so far this knowledge has been enough for whatever I've been doing.
Today though I've been getting into using binary streams and the read-byte function confuses me.
The CLHS says that read-byte reads and returns one byte from stream.
But what kind of byte is this? The default platform byte? Can I specify this in any way?
Thanks folks
For example, OPEN has an :element-type argument, whose supported values are implementation-defined. Your Common Lisp implementation's documentation has more information about it. As said in the comments, (unsigned-byte 8) describes a stream of octets, which happens to be the size of bytes in most (all?) implementations. Thanks @Xach.
See also flexi-streams which has make-external-format and binary-types for custom binary encodings.
It is whatever the element type of the stream you read from indicates.
I want to write a program for a school java project to parse some CSV I do not know. I do know the datatype of each column - although I do not know the delimiter.
The problem I do not even marginally know how to fix is to parse Date or even DateTime Columns. They can be in one of many formats.
I found many libraries but have no clue which is the best for my needs:
http://opencsv.sourceforge.net/
http://www.csvreader.com/java_csv.php
http://supercsv.sourceforge.net/
http://flatpack.sourceforge.net/
The problem is I am a total Java beginner. I am afraid none of those libraries can do what I need, or that I won't be able to convince them to do it.
I bet there are a lot of people here who have code sample that could get me started in no time for what I need:
automatically split in Columns (delimiter unknown, Columntypes are known)
cast to Columntype (should cope with $, %, etc.)
convert dates to Java Date or Calendar Objects
It would be nice to get as many code samples as possible by email.
Thanks a lot!
AS
You also have the Apache Commons CSV library, maybe it does what you need. See the guide. Updated to Release 1.1 in 2014-11.
Also, for a foolproof version, I think you'll need to code it yourself. With SimpleDateFormat you can define the formats you expect and try each of them in turn; if a value doesn't match any of your predefined formats, it isn't a Date.
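The try-each-format idea is sketched below in Python for brevity; in Java, the same loop would attempt `SimpleDateFormat.parse()` with each pattern and catch the parse exception. The list of candidate formats is an assumption and should be replaced by the formats the data can actually contain. Note that ambiguous inputs such as 03/04/2014 match whichever compatible format comes first in the list.

```python
from datetime import datetime

# Candidate formats are assumptions; list the ones your data may contain.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d.%m.%Y",
    "%m/%d/%Y",
]


def parse_date(text: str):
    """Return a datetime for the first matching format, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # this format does not fit, try the next one
    return None


print(parse_date("2014-11-05"))
print(parse_date("05.11.2014"))
print(parse_date("not a date"))
```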
There is a serious problem with using
String[] strArr=line.split(",");
in order to parse CSV files, and that is because there can be commas within the data values, and in that case you must quote them, and ignore commas between quotes.
There is a very very simple way to parse this:
import java.io.Reader;
import java.util.List;
import java.util.Vector;

/**
 * returns a row of values as a list
 * returns null if you are past the end of the input stream
 */
public static List<String> parseLine(Reader r) throws Exception {
    int ch = r.read();
    while (ch == '\r') {
        // ignore carriage-return chars wherever, particularly just before end of file
        ch = r.read();
    }
    if (ch < 0) {
        return null;
    }
    Vector<String> store = new Vector<String>();
    StringBuffer curVal = new StringBuffer();
    boolean inquotes = false;
    boolean started = false;
    while (ch >= 0) {
        if (inquotes) {
            started = true;
            if (ch == '\"') {
                inquotes = false;
            }
            else {
                curVal.append((char) ch);
            }
        }
        else {
            if (ch == '\"') {
                inquotes = true;
                if (started) {
                    // if this is the second quote in a value, add a quote
                    // this is for the double quote in the middle of a value
                    curVal.append('\"');
                }
            }
            else if (ch == ',') {
                // end of a value: store it and start the next one
                store.add(curVal.toString());
                curVal = new StringBuffer();
                started = false;
            }
            else if (ch == '\r') {
                // ignore CR characters
            }
            else if (ch == '\n') {
                // end of a line, break out
                break;
            }
            else {
                curVal.append((char) ch);
            }
        }
        ch = r.read();
    }
    store.add(curVal.toString());
    return store;
}
There are many advantages to this approach. Note that each character is touched EXACTLY once. There is no reading ahead, pushing back in the buffer, etc. No searching ahead to the end of the line, and then copying the line before parsing. This parser works purely from the stream, and creates each string value once. It works on header lines, and data lines, you just deal with the returned list appropriate to that. You give it a reader, so the underlying stream has been converted to characters using any encoding you choose. The stream can come from any source: a file, a HTTP post, an HTTP get, and you parse the stream directly. This is a static method, so there is no object to create and configure, and when this returns, there is no memory being held.
You can find a full discussion of this code, and why this approach is preferred in my blog post on the subject: The Only Class You Need for CSV Files.
My approach would not be to start by writing your own API. Life's too short, and there are more pressing problems to solve. In this situation, I typically:
Find a library that appears to do what I want. If one doesn't exist, then implement it.
If a library does exist, but I'm not sure it'll be suitable for my needs, write a thin adapter API around it, so I can control how it's called. The adapter API expresses the API I need, and it maps those calls to the underlying API.
If the library doesn't turn out to be suitable, I can swap another one in underneath the adapter API (whether it's another open source one or something I write myself) with a minimum of effort, without affecting the callers.
Start with something someone has already written. Odds are, it'll do what you want. You can always write your own later, if necessary. OpenCSV is as good a starting point as any.
I had to use a CSV parser about 5 years ago. It seems there are at least two CSV standards: http://en.wikipedia.org/wiki/Comma-separated_values and what Microsoft does in Excel.
I found this library, which eats both: http://ostermiller.org/utils/CSV.html, but AFAIK it has no way of inferring what data type the columns were.
You might want to have a look at this specification for CSV. Bear in mind that there is no officially recognized specification.
If you do not know the delimiter, it will not be possible to do this, so you have to find out somehow. If you can do a manual inspection of the file, you should quickly be able to see what it is and hard-code it in your program. If the delimiter can vary, your only hope is to deduce it from the formatting of the known data. When Excel imports CSV files, it lets the user choose the delimiter, and this is a solution you could use as well.
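One simple way to deduce the delimiter from the data is to look for a candidate character that appears the same number of times on every line. The Python sketch below shows that heuristic; the candidate list and the name `guess_delimiter` are assumptions, and the heuristic ignores quoting, so a delimiter that also appears inside quoted values can defeat it.

```python
CANDIDATES = [",", ";", "\t", "|"]


def guess_delimiter(lines):
    """Pick the candidate that occurs equally often (at least once) on every line."""
    best, best_count = None, 0
    for delim in CANDIDATES:
        # Set of per-line occurrence counts, skipping blank lines
        counts = {line.count(delim) for line in lines if line.strip()}
        if len(counts) == 1:
            (count,) = counts
            if count > best_count:
                best, best_count = delim, count
    return best


sample = ["name;price;date", "apple;1,50;2014-11-05", "pear;2,00;2014-11-06"]
print(repr(guess_delimiter(sample)))
```

Since the column types are known in advance, a further check could be to split a few lines with the guessed delimiter and confirm that each column parses as its expected type.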
I agree with @Brian Clapper. I have used SuperCSV as a parser, though I've had mixed results. I enjoy the versatility of it, but there are some situations within my own CSV files that I have not been able to reconcile yet. I have faith in this product and would recommend it overall; no doubt I'm just missing something simple in my own implementation.
SuperCSV can parse the columns into various formats, do edits on the columns, etc. It's worth taking a look-see. It has examples as well, and easy to follow.
The one/only limitation I'm having is catching an 'empty' column and parsing it into an Integer or maybe a blank, etc. I'm getting null-pointer errors, but javadocs suggest each cellProcessor checks for nulls first. So, I'm blaming myself first, for now. :-)
Anyway, take a look at SuperCSV. http://supercsv.sourceforge.net/
At a minimum you are going to need to know the column delimiter.
Basically you will need to read the file line by line.
Then you will need to split each line by the delimiter, say a comma (CSV stands for comma-separated values), with
String[] strArr=line.split(",");
This will turn it into an array of strings which you can then manipulate, for example with
String name = strArr[0];
int yearOfBirth = Integer.parseInt(strArr[1]);
int monthOfBirth = Integer.parseInt(strArr[2]);
int dayOfBirth = Integer.parseInt(strArr[3]);
// GregorianCalendar months are zero-based (0 = January), so subtract 1
GregorianCalendar dob = new GregorianCalendar(yearOfBirth, monthOfBirth - 1, dayOfBirth);
Student student = new Student(name, dob); // let's pretend you are creating instances of Student
You will need to do this for every line so wrap this code into a while loop. (If you don't know the delimiter just open the file in a text editor.)
I would recommend that you start by pulling your task apart into its component parts.
Read string data from a CSV
Convert string data to appropriate format
Once you do that, it should be fairly trivial to use one of the libraries you link to (which most certainly will handle task #1). Then iterate through the returned values, and cast/convert each String value to the value you want.
If the question is how to convert strings to different objects, it's going to depend on what format you are starting with, and what format you want to wind up with.
DateFormat.parse(), for example, will parse dates from strings. See SimpleDateFormat for quickly constructing a DateFormat for a certain string representation.
Integer.parseInt() will parse integers from strings.
Currency, you'll have to decide how you want to capture it. If you want to just capture as a float, then Float.parseFloat() will do the trick (just use String.replace() to remove all $ and commas before you parse it). Or you can parse into a BigDecimal (so you don't have rounding problems). There may be a better class for currency handling (I don't do much of that, so am not familiar with that area of the JDK).
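The strip-then-parse step for currency and percentage values might look like the following Python sketch, which uses `Decimal` for the same reason BigDecimal is suggested above. The function name `parse_amount` is hypothetical, and the sketch assumes US-style numbers (comma as thousands separator, dot as decimal point).

```python
from decimal import Decimal, InvalidOperation


def parse_amount(text: str):
    """Convert strings such as '$1,234.50' or '12.5%' to Decimal, or return None."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    is_percent = cleaned.endswith("%")
    if is_percent:
        cleaned = cleaned[:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None  # not a number after cleaning
    # Percentages are returned as fractions (12.5% -> 0.125)
    return value / 100 if is_percent else value


print(parse_amount("$1,234.50"))
print(parse_amount("12.5%"))
print(parse_amount("n/a"))
```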
Writing your own parser is fun, but likely you should have a look at
Open CSV. It provides numerous ways of accessing the CSV and also allows you to generate CSV. And it does handle escapes properly. As mentioned in another post, there is also a CSV-parsing lib in the Apache Commons, but that one isn't released yet.
There are many open sourced parser implementations available to us in Haskell. Parsec seems to be the standard for text parsing and attoparsec seems to be a popular choice for binary parsing but I don't know much beyond that. Is there a particular decision tree that you follow for choosing a parser implementation? Have you learned anything interesting about the strengths or weaknesses of the libraries?
You have several good options.
For lightweight parsing of String types:
parsec
polyparse
For packed bytestring parsing, e.g. of HTTP headers.
attoparsec
For actual binary data most people use either:
binary -- for lazy binary parsing
cereal -- for strict binary parsing
The main question to ask yourself is what is the underlying string type?
String?
bytestring (strict)?
bytestring (lazy)?
unicode text
That decision largely determines which parser toolset you'll use.
The second question to ask is: do I already have a grammar for the data type? If so, I can just use happy
The Happy parser generator
And obviously for custom data types there are a variety of good existing parsers:
XML
haxml
xml-light
hxt
hexpat
CSV
bytestring-csv
csv
JSON
json
rss/atom
feed
Just to add to Don's post: Personally, I quite like Text.ParserCombinators.ReadP (part of base) for no-nonsense quick and easy stuff. Particularly when Parsec seems like overkill.
There is a bytestringreadp library for the bytestring version, but it doesn't cover Char8 bytestrings, and I suspect attoparsec would be a better choice at this point.
I recently converted some code from Parsec to Attoparsec. Both are quite capable.
Attoparsec wins on performance and memory footprint, but Parsec provides better error reporting and has more complete documentation.
Bryan O’Sullivan’s blog post What’s in a parser? Attoparsec rewired (2/2) includes a nice performance benchmark comparing several implementations along with some comments comparing memory usage.
I am a college student getting my Computer Science degree. A lot of my fellow students really haven't done a lot of programming. They've done their class assignments, but let's be honest here those questions don't really teach you how to program.
I have had several other students ask me questions about how to parse things, and I'm never quite sure how to explain it to them. Is it best to start just going line by line looking for substrings, or just give them the more complicated lecture about using proper lexical analysis, etc. to create tokens, use BNF, and all of that other stuff? They never quite understand it when I try to explain it.
What's the best approach to explain this without confusing them or discouraging them from actually trying?
I'd explain parsing as the process of turning some kind of data into another kind of data.
In practice, for me this is almost always turning a string, or binary data, into a data structure inside my program.
For example, turning
":Nick!User@Host PRIVMSG #channel :Hello!"
into (C)
struct irc_line {
    char *nick;
    char *user;
    char *host;
    char *command;
    char **arguments;   /* NULL-terminated list of arguments */
    char *message;
} sample = { "Nick", "User", "Host", "PRIVMSG", (char *[]){ "#channel", NULL }, "Hello!" };
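The conversion itself can be sketched in a few lines. The Python example below is an illustration only: it assumes the usual `:nick!user@host` prefix form, mirrors the fields of the struct above in a hypothetical `IrcLine` dataclass, and skips most of the edge cases a real IRC parser has to handle.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IrcLine:
    nick: Optional[str] = None
    user: Optional[str] = None
    host: Optional[str] = None
    command: str = ""
    arguments: List[str] = field(default_factory=list)
    message: Optional[str] = None


def parse_irc_line(line: str) -> IrcLine:
    result = IrcLine()
    rest = line.strip()
    # Optional prefix: ":nick!user@host "
    if rest.startswith(":"):
        prefix, _, rest = rest[1:].partition(" ")
        nick_user, _, host = prefix.partition("@")
        nick, _, user = nick_user.partition("!")
        result.nick, result.user, result.host = nick, user or None, host or None
    # Trailing message: everything after the first " :"
    head, sep, trailing = rest.partition(" :")
    if sep:
        result.message = trailing
    # What remains is the command followed by its arguments
    parts = head.split()
    if parts:
        result.command, result.arguments = parts[0], parts[1:]
    return result


print(parse_irc_line(":Nick!User@Host PRIVMSG #channel :Hello!"))
```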
Parsing is the process of analyzing text made of a sequence of tokens to determine its grammatical structure with respect to a given (more or less) formal grammar.
The parser then builds a data structure based on the tokens. This data structure can then be used by a compiler, interpreter or translator to create an executable program or library.
If I gave you an english sentence, and asked you to break down the sentence into its parts of speech (nouns, verbs, etc.), you would be parsing the sentence.
That's the simplest explanation of parsing I can think of.
That said, parsing is a non-trivial computational problem. You have to start with simple examples, and work your way up to the more complex.
What is parsing?
In computer science, parsing is the process of analysing text to determine if it belongs to a specific language or not (i.e. is syntactically valid for that language's grammar). It is an informal name for the syntactic analysis process.
For example, consider the language a^n b^n (some number of A characters followed by the same number of B characters). A parser for that language would accept the input AABB and reject the input AAAB. That is what a parser does.
In addition, during this process a data structure could be created for further processing. In my previous example, it could, for instance, store the AA and BB in two separate stacks.
Anything that happens after it, like giving meaning to AA or BB, or transform it in something else, is not parsing. Giving meaning to parts of an input sequence of tokens is called semantic analysis.
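As a small illustration of the accept/reject view, here is a Python sketch of a recognizer for a^n b^n. The function name `accepts_anbn` is made up for this example, and the sketch assumes the empty input (n = 0) belongs to the language.

```python
def accepts_anbn(text: str) -> bool:
    """Return True if text is n 'A's followed by exactly n 'B's (n >= 0)."""
    i = 0
    # Consume the leading run of A characters
    while i < len(text) and text[i] == "A":
        i += 1
    count_a = i
    # Consume the run of B characters that follows
    while i < len(text) and text[i] == "B":
        i += 1
    count_b = i - count_a
    # Accept only if nothing is left over and the counts match
    return i == len(text) and count_a == count_b


print(accepts_anbn("AABB"))  # True
print(accepts_anbn("AAAB"))  # False
```

The recognizer only answers yes or no; it attaches no meaning to the input, which is the distinction the answer draws between parsing and semantic analysis.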
What isn't parsing?
Parsing is not transforming one thing into another. Transforming A into B is, in essence, what a compiler does. Compiling takes several steps; parsing is only one of them.
Parsing is not extracting meaning from a text. That is semantic analysis, a step of the compiling process.
What is the simplest way to understand it?
I think the best way to understand the concept of parsing is to begin with the simpler concepts. The simplest one in language processing is the finite automaton. It is a formalism for parsing regular languages, such as those described by regular expressions.
It is very simple: you have an input, a set of states and a set of transitions. Consider the following language built over the alphabet { A, B }: L = { w | w starts with 'AA' or 'BB' }. The automaton below represents a possible parser for that language, in which every valid word starts with 'AA' or 'BB'.
      A-->(q1)--A-->(qf)
     /
(q0)
     \
      B-->(q2)--B-->(qf)
It is a very simple parser for that language. You start at (q0), the initial state, then you read a symbol from the input. If it is A, you move to the (q1) state; otherwise (it is a B, remember the alphabet is only A and B) you move to the (q2) state, and so on. If you reach the (qf) state, the input is accepted.
As it is visual, you only need a pencil and a piece of paper to explain what a parser is to anyone, including a child. I think this simplicity is what makes automata the most suitable way to teach language processing concepts such as parsing.
Finally, being a Computer Science student, you will study such concepts in depth in theoretical computer science classes such as Formal Languages and Theory of Computation.
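For students who prefer code to diagrams, the automaton described above can be written as a transition table. This Python sketch is one possible encoding; the loop on (qf), which lets any further A or B keep the word accepted, is an assumption that the diagram leaves implicit.

```python
# (state, symbol) -> next state; missing entries mean "reject"
TRANSITIONS = {
    ("q0", "A"): "q1",
    ("q0", "B"): "q2",
    ("q1", "A"): "qf",
    ("q2", "B"): "qf",
    ("qf", "A"): "qf",  # assumption: once accepted, further symbols stay in qf
    ("qf", "B"): "qf",
}


def accepts(word: str) -> bool:
    """Run the automaton over word and report whether it ends in qf."""
    state = "q0"
    for symbol in word:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition defined: reject
            return False
    return state == "qf"


for word in ["AAB", "BB", "AB", "A"]:
    print(word, accepts(word))
```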
Have them try to write a program that can evaluate arbitrary simple arithmetic expressions. This is a simple problem to understand but as you start getting deeper into it a lot of basic parsing starts to make sense.
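One possible solution to that exercise is a small recursive descent evaluator, sketched here in Python. The grammar in the comments, the class name `Parser` and the function `evaluate` are choices made for this illustration; many other designs work equally well.

```python
import re

# A token is a number or one of the operator/parenthesis characters
TOKEN = re.compile(r"\s*(\d+\.?\d*|[-+*/()])")


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected input near position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):  # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):  # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):  # factor := NUMBER | '(' expr ')' | '-' factor
        token = self.take()
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token == "-":
            return -self.factor()
        if token is None or not token[0].isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return float(token)


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return value


print(evaluate("2 * (3 + 4) - 5"))  # 9.0
```

Each grammar rule becomes one method, so operator precedence falls out of which method calls which.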
Parsing is about READING data in one format, so that you can use it to your needs.
I think you need to teach them to think like this. So, this is the simplest way I can think of to explain parsing for someone new to this concept.
Generally, we try to parse data one line at a time because generally it is easier for humans to think this way, dividing and conquering, and also easier to code.
We call each smallest indivisible piece of data a field. For example, Name is a field, Age is another field, and Surname is another.
In a line, we can have various fields. In order to distinguish them, we can delimit fields by separators or by the maximum length assign to each field.
For example:
By separating fields by comma
Paul,20,Jones
Or by fixed width (Name up to 20 letters, Age up to 3 digits, Surname up to 20 letters)
Paul                020Jones
Each such set of fields is called a record.
To separate delimited records from one another, we need a record delimiter. A dot will be enough (though you could also use CR/LF).
A list could be:
Michael,39,Jordan.Shaquille,40,O'neal.Lebron,24,James.
or with CR/LF
Michael,39,Jordan
Shaquille,40,O'neal
Lebron,24,James
You can ask them to list 10 NBA (or NFL) players they like. Then they should type them according to a format, and write a program to parse the list and display each record. One group can make a list in comma-separated format and a program to parse a list in fixed-size format, and vice versa.
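Both formats from the exercise can be parsed in a few lines. The Python sketch below is one possible solution; the function names and the field widths (20, 3, 20) follow the example above and are otherwise assumptions. Using a dot as the record delimiter also means a dot inside a name would break the split.

```python
def parse_delimited(text, field_sep=",", record_sep="."):
    """Split text into records, then each record into (name, age, surname)."""
    records = []
    for chunk in text.split(record_sep):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the last delimiter
        name, age, surname = chunk.split(field_sep)
        records.append((name, int(age), surname))
    return records


def parse_fixed_width(line, widths=(20, 3, 20)):
    """Cut one line into fields by position: name (20), age (3), surname (20)."""
    fields, start = [], 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    name, age, surname = fields
    return (name, int(age), surname)


print(parse_delimited("Michael,39,Jordan.Shaquille,40,O'neal.Lebron,24,James."))
print(parse_fixed_width("Paul".ljust(20) + "020" + "Jones".ljust(20)))
```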
Parsing to me is breaking down something into meaningful parts... using a definable or predefined known, common set of part "definitions".
For programming languages there would be keyword parts, usable punctuation sequences...
For pumpkin pie it might be something like the crust, filling and toppings.
For written languages there might be what a word is, a sentence, what a verb is...
For spoken languages it might be tone, volume, mood, implication, emotion, context
Syntax analysis (as well as common sense, after all) would tell you whether what you are parsing is a pumpkin pie or a programming language. Does it have a crust? Well, maybe it's pumpkin pudding, or perhaps a spoken language!
One thing to note about parsing stuff is there are usually many ways to break things into parts.
For example you could break up a pumpkin pie by cutting it from the center to the edge or from the bottom to the top or with a scoop to get the filling out or by using a sledge hammer or eating it.
And how you parse things would determine if doing something with those parts will be easy or hard.
In the "computer languages" world, there are common ways to parse text source code. These common methods (algorithms) have titles or names. Search the Internet for common methods/names for ways to parse languages. Wikipedia can help in this regard.
In linguistics, to parse is to divide language into small components that can be analyzed. For example, parsing this sentence would involve dividing it into words and phrases and identifying the type of each component (e.g., verb, adjective, or noun).
Parsing is a very important part of many computer science disciplines. For example, compilers must parse source code to be able to translate it into object code. Likewise, any application that processes complex commands must be able to parse the commands. This includes virtually all end-user applications.
Parsing is often divided into lexical analysis and semantic parsing. Lexical analysis concentrates on dividing strings into components, called tokens, based on punctuation and other keys. Semantic parsing then attempts to determine the meaning of the string.
http://www.webopedia.com/TERM/P/parse.html
Simple explanation: Parsing is breaking a block of data into smaller pieces (tokens) by following a set of rules (using delimiters, for example),
so that this data can be processed piece by piece (managed, analysed, interpreted, transmitted, etc.).
Examples: Many applications (like spreadsheet programs) use the CSV (Comma-Separated Values) file format to import and export data. The CSV format makes it possible for applications to process this data with the help of a special parser.
Web browsers have special parsers for HTML and CSS files. JSON parsers exist. All special file formats must have some parsers designed specifically for them.