Decoding byte stream - parsing

I have a series of messages that are defined by independent structs. These structs share a common header are sent between applications. I am creating a decoder that will take the raw data captures in the messages that were built using these structs and decode/parse them to some plain text.
I have over 1000 different messages that need to be decoded so I am not sure if defining all the struct formats in XML and then using XSL or some translation is the way to go or if there is a better way to do this.
There are times when I will need to decode logs containing over a million messages so performance is a concern.
Any recommendations for techniques/tools/algorithms to go about creating the decoder/parser?
struct:
struct {
dword messageid;
dword datavalue1;
dword datavalue2;
} struct1;
Example raw data:
0101010A0A0A0A0F0F0F0F
Decoded message (desired output):
message id: 0x01010101, datavalue1: 0x0A0A0A0A, datavalue2: 0x0F0F0F0F
I am using C++ to do this development.

Regarding "performance" - if you are using disk IO and possible display IO I doubt your parser/decoder will have much effect unless you use a truly horrible algorithm.
I am also unsure about what the problem is - Given the question right now - you have 3 DWORDs in a struct and you claim that there are over 1000 unique messages based on these values.
Your decoded message does not imply to me that you need any kind of parsing - just straight output seems to work (convert from bytes to ascii representation of a hex value)
If you do have a mapping from a value to a string, then a big switch statement is simple - or alternatively if you want to be able to have these added dynamically or change the display, then I would provide the key/value pairs (mapping) in a config file (text, xml, etc) and then do a lookup as the log file/raw data is read.
map is what I would use in that case.
Perhaps if you provide another specific example of the values and decoded output I can come up with a more appropriate suggestion.

If you have the message definitions already given in the syntax that you've used in your example, you should definitely not try to convert it manually into some other syntax (XML or otherwise).
Instead, you should try to write a compiler that takes these method definitions, and compiles them into a decoder function.
These days, the recommendation is to use ANTLR as the parser generator, using any of the ANTLR languages for the actual compiler (Java, Python, Ruby, C#, C++). That compiler then should output C code, which does the entire decoding and pretty-printing.

You can use yacc or antlr, add appropriate parsing rules, populate some data structure out of it(a tree may be) while parsing, then traverse the data structure and do whatever you like.

I'm going to assume that all you need to do is format the records and output them.
Use a custom code generator. The generated code will look something like this:
typedef struct { word messageid; } Header;
//repeated for each record type
typedef struct {
word messageid;
// <members here>
} Record_##;
//END
void Process(Input inp, Output out) {
char buffer[BIG_ENOUGH];
char *offset;
offset = &buffer[BIG_ENOUGH];
while(notEnd) {
if(&offset[sizeof(LargestStruct)] >= &buffer[BIG_ENOUGH])
// move remaining buffer to start and fill tail from inp
Header *hpt = (Header*)offset;
switch(hpt->messageid)
{
//repeated for each record type
case <recond ID for given type>:
{
Record_##* rpt = (Record_##*)offset;
outp.format("name1: %t, ...\n", rpt->name1, ...);
offset += sizeof(Record_##);
break;
}
//END
}
}
}
Most of that's boiler plate so writing a program to generate it shouldn't be to hard.
If you need more processing, I think this idea could be tweaked some to make that work as well.
Edit: after re-reading the question, it looks like you might have the structs defined already. In that cases you can just #include them and use them directly. However then you end up with the issue of how to parse the structs to generate the input to the formating function. Awk or sed might be handy there.

Related

How to force nom to parse the whole input string?

I am working with nom version 6.1.2 and I am trying to parse Strings like
A 2 1 2.
At the moment I would be happy to at least differentiate between input that fits the requirements and inputs which don't do that. (After that I would like to change the output to a tuple that has the "A" as first value and as second value a vector of the u16 numbers.)
The String always has to start with a capital A and after that there should be at least one space and after that one a number. Furthermore, there can be as much additional spaces and numbers as you want. It is just important to end with a number and not with a space. All numbers will be within the range of u16. I already wrote the following function:
extern crate nom;
use nom::sequence::{preceded, pair};
use nom::character::streaming::{char, space1};
use nom::combinator::recognize;
use nom::multi::many1;
use nom::character::complete::digit1;
pub fn parse_and(line: &str) -> IResult<&str, &str>{
preceded(
char('A'),
recognize(
many1(
pair(
space1,
digit1
)
)
)
)(line)
}
Also I want to mention that there are answers for such a problem which use CompleteStr but that isn't an option anymore because it got removed some time ago.
People explained that the reason for my behavior is that nom doesn't know when the slice of a string ends and therefore I get parse_and: Err(Incomplete(Size(1))) as answer for the provided example as input.
It seems like that one part of the use declarations created that problem. In the documentation (somewhere in some paragraph way to low that I looked at it) it says:
"
Streaming / Complete
Some of nom's modules have streaming or complete submodules. They hold different variants of the same combinators.
A streaming parser assumes that we might not have all of the input data. This can happen with some network protocol or large file parsers, where the input buffer can be full and need to be resized or refilled.
A complete parser assumes that we already have all of the input data. This will be the common case with small files that can be read entirely to memory.
"
Therefore, the solution to my problem is to swap use nom::character::complete::{char, space1}; instead of nom::character::streaming::{char, space1}; (3rd loc without counting empty lines). That worked for me :)

How to implement a json decoder which doesn't convert numerics to double?

When I call json.decode on financial data returned from a server, I would like to either convert my numerics to decimal (pub.dev package) (or even leave them as strings so I can manually do that later). Everything else I would like to be converted as normal
There's a reviver callback which is passed to _parseJson(), but I can't find the implementation of this external function to study.
Update:
Looks like reviver() is too late: basic conversion has already happened by here and doubles get passed in. Is there any alternative callback that can be used?
I'd use the jsontool package (disclaimer: I wrote it, so obviously that's what I'd turn to first).
It's a low-level JSON processor which allows you (and requires you) to take control of the processing.
It has an example where it parses numbers into BigInts. You could probably adapt that to use Decimal instead.

How to prevent Flex from ignoring previous analysis?

I recently started using Lex, as a simple way to explain the problem I encoutered, supposing that I'm trying to realize a lexical analyser with Flex that print all the letters and also all the bigrams in a given text, that seems very easy and simple, but once I implemented it, I've realised that it shows bigrams first and only shows letters when they are single, example: for the following text
QQQZ ,JQR
The result is
Bigram QQ
Bigram QZ
Bigram JQ
Letter R
Done
This is my lex code
%{
%}
letter[A-Za-z]
Separ [ \t\n]
%%
{letter} {
printf(" Letter %c\n",yytext[0]);
}
{letter}{2} {
printf(" Bigram %s\n",yytext);
}
%%
main()
{ yylex();
printf("Done");
}
My question is How can realise the two analysis seperatly, knowing that my actual problem isn't as simple as this example
Lexical analysers divide the source text into separate tokens. If your problem looks like that, then (f)lex is an appropriate tool. If your problem does not look like that, then (f)lex is probably not the correct tool.
Doing two simultaneous analyses of text is not really a use case for (f)lex. One possibility would be to use two separate reentrant lexical analysers, arranging to feed them the same inputs. However, that will be a lot of work for a problem which could easily be solved in a few lines of C.
Since you say that your problem is different from the simple problem in your question, I did not bother to either write the simple C code or the rather more complicated code to generate and run two independent lexical analysers, since it is impossible to know whether either of those solutions is at all relevant.
If your problem really is matching two (or more) different lexemes from the same starting position, you could use one of two strategies, both quite ugly (IMHO):
I'm assuming the existence of handler functions:
void handle_letter(char ch);
void handle_bigram(char* s); /* Expects NUL-terminated string */
void handle_trigram(char* s); /* Expects NUL-terminated string */
For historical reasons, lex implements the REJECT action, which causes the current match to be discarded. The idea was to let you process a match, and then reject it in order to process a shorter (or alternate) match. With flex, the use of REJECT is highly discouraged because it is extremely inefficient and also prevents the lexer from resizing the input buffer, which arbitrarily limits the length of a recognisable token. However, in this particular use case it is quite simple:
[[:alpha:]][[:alpha:]][[:alpha:]] handle_trigram(yytext); REJECT;
[[:alpha:]][[:alpha:]] handle_bigram(yytext); REJECT;
[[:alpha:]] handle_letter(*yytext);
If you want to try this solution, I recommend using flex's debug facility (flex -d ...) in order to see what is going on.
See debugging options and REJECT documentation.
The solution I would actually recommend, although the code is a bit clunkier, is to use yyless() to reprocess part of the recognised token. This is quite a bit more efficient than REJECT; yyless() just changes a single pointer, so it has no impact on speed. Without REJECT, we have to know all the lexeme handlers which will be needed, but that's not very difficult. A complication is the interface for handle_bigram, which requires a NUL-terminated string. If your handler didn't impose this requirement, the code would be simpler.
[[:alpha:]][[:alpha:]][[:alpha:]] { handle_trigram(yytext);
char tmp = yytext[2];
yytext[2] = 0;
handle_bigram(yytext);
yytext[2] = tmp;
handle_letter(yytext[0]);
yyless(1);
}
[[:alpha:]][[:alpha:]] { handle_bigram(yytext);
handle_letter(yytext[0]);
yyless(1);
}
[[:alpha:]] handle_letter(*yytext);
See yyless() documentation

Calculate CRC 8 in objective c

I've an app in which I need to send a packet to an external device. This packet has a CRC before the end message. The CRC has to be separated in CRCH and CRCL.
For example: CRC = 0x5B so CRCH should be 0x35 (ASCII representation of 5) and CRCL should be 0x42 (ASCII representation of B).
I searched on internet and I found several functions in C or in other language to create CRC32, but my device need to use a CRC8. How I can create a CRC8 in Objective-C? Can you help me to find a way to do this?
Surprising how this rather simple question is still not answered.
First, you need to separate problems in your question. CRH and CRL are just hex conversion and that's easy to do (and has lots of examples on internet too). In most cases, you just need to compare crc you received to one you calculated. So, you just need to convert them to the same form. E.g. convert the crc you calculated to text using sprintf and %2X format and compare with CRC you received (as text).
The second part is actually CRC. This is a little bit trickier.Your options are as follows:
1) the easiest is to rename your .m file to .mm and use CRC library from boost C++. It's just a header include, so it won't affect rest of your code in any way and you can even make it in a separate file, so you'll have a C function which will use boost under the hood.
You might need to find parameters for your CRC though. For that, see this excellent resource http://reveng.sourceforge.net/crc-catalogue/
2) You can write your own implementation. Surprisingly there is plenty of examples for particular algorithms in the internet, but they often optimized for particular crc and are hard to adopt for other algorithms.
So, your best bet is probably starting with "A Painless Guide to CRC Error Detection Algorithms" article by Ross Williams. It also includes examples in C.
Though it could be complicated to get your head around all the technical stuff and explanations there.
So, as as short cut I'd like to suggest to look at my own implementation in java here. It's obviously not Objective-C. But I looked through it and you should be able to just copy and paste to your .m file and just compile it. Possibly adjusting few types.
You'll need public static long calculateCRC(Parameters crcParams, byte[] data) and private static long reflect(long in, int count) functions there. And the Parameters class which looks scarier, but should just become a struct in your case:
struct Parameters
{
int width; // Width of the CRC expressed in bits
long polynomial; // Polynomial used in this CRC calculation
bool reflectIn; // Refin indicates whether input bytes should be reflected
bool reflectOut; // Refout indicates whether input bytes should be reflected
long init; // Init is initial value for CRC calculation
long finalXor; // Xor is a value for final xor to be applied before returning result
}
You might also want to also adjust types there to a shorter unsigned type (java has no unsigned). But it should work perfectly well as is.

smartgwt listgrid population from excel file data [duplicate]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I want to write a program for a school java project to parse some CSV I do not know. I do know the datatype of each column - although I do not know the delimiter.
The problem I do not even marginally know how to fix is to parse Date or even DateTime Columns. They can be in one of many formats.
I found many libraries but have no clue which is the best for my needs:
http://opencsv.sourceforge.net/
http://www.csvreader.com/java_csv.php
http://supercsv.sourceforge.net/
http://flatpack.sourceforge.net/
The problem is I am a total java beginner. I am afraid non of those libraries can do what I need or I can't convince them to do it.
I bet there are a lot of people here who have code sample that could get me started in no time for what I need:
automatically split in Columns (delimiter unknown, Columntypes are known)
cast to Columntype (should cope with $, %, etc.)
convert dates to Java Date or Calendar Objects
It would be nice to get as many code samples as possible by email.
Thanks a lot!
AS
You also have the Apache Commons CSV library, maybe it does what you need. See the guide. Updated to Release 1.1 in 2014-11.
Also, for the foolproof edition, I think you'll need to code it yourself...through SimpleDateFormat you can choose your formats, and specify various types, if the Date isn't like any of your pre-thought types, it isn't a Date.
There is a serious problem with using
String[] strArr=line.split(",");
in order to parse CSV files, and that is because there can be commas within the data values, and in that case you must quote them, and ignore commas between quotes.
There is a very very simple way to parse this:
/**
* returns a row of values as a list
* returns null if you are past the end of the input stream
*/
public static List<String> parseLine(Reader r) throws Exception {
int ch = r.read();
while (ch == '\r') {
//ignore linefeed chars wherever, particularly just before end of file
ch = r.read();
}
if (ch<0) {
return null;
}
Vector<String> store = new Vector<String>();
StringBuffer curVal = new StringBuffer();
boolean inquotes = false;
boolean started = false;
while (ch>=0) {
if (inquotes) {
started=true;
if (ch == '\"') {
inquotes = false;
}
else {
curVal.append((char)ch);
}
}
else {
if (ch == '\"') {
inquotes = true;
if (started) {
// if this is the second quote in a value, add a quote
// this is for the double quote in the middle of a value
curVal.append('\"');
}
}
else if (ch == ',') {
store.add(curVal.toString());
curVal = new StringBuffer();
started = false;
}
else if (ch == '\r') {
//ignore LF characters
}
else if (ch == '\n') {
//end of a line, break out
break;
}
else {
curVal.append((char)ch);
}
}
ch = r.read();
}
store.add(curVal.toString());
return store;
}
There are many advantages to this approach. Note that each character is touched EXACTLY once. There is no reading ahead, pushing back in the buffer, etc. No searching ahead to the end of the line, and then copying the line before parsing. This parser works purely from the stream, and creates each string value once. It works on header lines, and data lines, you just deal with the returned list appropriate to that. You give it a reader, so the underlying stream has been converted to characters using any encoding you choose. The stream can come from any source: a file, a HTTP post, an HTTP get, and you parse the stream directly. This is a static method, so there is no object to create and configure, and when this returns, there is no memory being held.
You can find a full discussion of this code, and why this approach is preferred in my blog post on the subject: The Only Class You Need for CSV Files.
My approach would not be to start by writing your own API. Life's too short, and there are more pressing problems to solve. In this situation, I typically:
Find a library that appears to do what I want. If one doesn't exist, then implement it.
If a library does exist, but I'm not sure it'll be suitable for my needs, write a thin adapter API around it, so I can control how it's called. The adapter API expresses the API I need, and it maps those calls to the underlying API.
If the library doesn't turn out to be suitable, I can swap another one in underneath the adapter API (whether it's another open source one or something I write myself) with a minimum of effort, without affecting the callers.
Start with something someone has already written. Odds are, it'll do what you want. You can always write your own later, if necessary. OpenCSV is as good a starting point as any.
i had to use a csv parser about 5 years ago. seems there are at least two csv standards: http://en.wikipedia.org/wiki/Comma-separated_values and what microsoft does in excel.
i found this libaray which eats both: http://ostermiller.org/utils/CSV.html, but afaik, it has no way of inferring what data type the columns were.
You might want to have a look at this specification for CSV. Bear in mind that there is no official recognized specification.
If you do not now the delimiter it will not be possible to do this so you have to find out somehow. If you can do a manual inspection of the file you should quickly be able to see what it is and hard code it in your program. If the delimiter can vary your only hope is to be able to deduce if from the formatting of the known data. When Excel imports CSV files it lets the user choose the delimiter and this is a solution you could use as well.
I agree with #Brian Clapper. I have used SuperCSV as a parser though I've had mixed results. I enjoy the versatility of it, but there are some situations within my own csv files for which I have not been able to reconcile "yet". I have faith in this product and would recommend it overall--I'm just missing something simple, no doubt, that I'm doing in my own implementation.
SuperCSV can parse the columns into various formats, do edits on the columns, etc. It's worth taking a look-see. It has examples as well, and easy to follow.
The one/only limitation I'm having is catching an 'empty' column and parsing it into an Integer or maybe a blank, etc. I'm getting null-pointer errors, but javadocs suggest each cellProcessor checks for nulls first. So, I'm blaming myself first, for now. :-)
Anyway, take a look at SuperCSV. http://supercsv.sourceforge.net/
At a minimum you are going to need to know the column delimiter.
Basically you will need to read the file line by line.
Then you will need to split each line by the delimiter, say a comma (CSV stands for comma-separated values), with
String[] strArr=line.split(",");
This will turn it into an array of strings which you can then manipulate, for example with
String name=strArr[0];
int yearOfBirth = Integer.valueOf(strArr[1]);
int monthOfBirth = Integer.valueOf(strArr[2]);
int dayOfBirth = Integer.valueOf(strArr[3]);
GregorianCalendar dob=new GregorianCalendar(yearOfBirth, monthOfBirth, dayOfBirth);
Student student=new Student(name, dob); //lets pretend you are creating instances of Student
You will need to do this for every line so wrap this code into a while loop. (If you don't know the delimiter just open the file in a text editor.)
I would recommend that you start by pulling your task apart into it's component parts.
Read string data from a CSV
Convert string data to appropriate format
Once you do that, it should be fairly trivial to use one of the libraries you link to (which most certainly will handle task #1). Then iterate through the returned values, and cast/convert each String value to the value you want.
If the question is how to convert strings to different objects, it's going to depend on what format you are starting with, and what format you want to wind up with.
DateFormat.parse(), for example, will parse dates from strings. See SimpleDateFormat for quickly constructing a DateFormat for a certain string representation.
Integer.parseInt() will prase integers from strings.
Currency, you'll have to decide how you want to capture it. If you want to just capture as a float, then Float.parseFloat() will do the trick (just use String.replace() to remove all $ and commas before you parse it). Or you can parse into a BigDecimal (so you don't have rounding problems). There may be a better class for currency handling (I don't do much of that, so am not familiar with that area of the JDK).
Writing your own parser is fun, but likely you should have a look at
Open CSV. It provides numerous ways of accessing the CSV and also allows to generate CSV. And it does handle escapes properly. As mentioned in another post, there is also a CSV-parsing lib in the Apache Commons, but that one isn't released yet.

Resources