Greetings,
I am working on a distributed pub-sub system that is expected to have minimal latency. I now have to choose between using serialization/deserialization and a raw data buffer. I prefer the raw data approach as there's almost no overhead, which keeps the latency low. But my colleague argues that I should use marshaling, as the parser will be less complex and less bug-prone. I stated my concern about latency, but he said it's going to be worth it in the long run and there are even FPGA devices to accelerate it.
What are your opinions on this?
TIA!
Using a 'raw data' approach, if hardcoded in one language, for one platform, has a few problems when you try to write code on another platform in another language (or sometimes even the same language/platform for a different compiler, if your fields don't have natural alignment).
I recommend using an IDL to describe your message formats. If you pick one that is reducible to 'raw data' in your language of choice, then the generated code to access message fields in that language won't do anything but access variables in a structure that is memory-overlaid on your data buffer, but it still represents the metadata in a platform- and language-neutral way, so generators can produce more elaborate parsing code for other platforms.
The downside of picking something that is reducible to a C structure overlay is that it doesn't handle optional fields, doesn't handle variable-length arrays, and may not handle backwards-compatible extensions in the future (unless you just tack them on the end of the structure). I'd suggest you read about Google's Protocol Buffers if you haven't yet.
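To make that concrete, here is a minimal Java sketch (field names, offsets and byte order are all hypothetical) of what hand-rolled 'raw data' accessors end up looking like. Every offset and the endianness are fixed purely by convention, which is exactly the metadata an IDL would capture for you and regenerate on every platform:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical fixed-layout message:
//   bytes 0-3   : int    messageId
//   bytes 4-11  : long   timestampNanos
//   bytes 12-19 : double price
final class TickMessageView {
    private final ByteBuffer buf;

    TickMessageView(byte[] raw) {
        // Byte order has to be pinned explicitly; otherwise two platforms
        // can disagree about what the same bytes mean.
        this.buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
    }

    int messageId()       { return buf.getInt(0); }
    long timestampNanos() { return buf.getLong(4); }
    double price()        { return buf.getDouble(12); }
}
```

The moment you add an optional field or a variable-length array, those offsets stop being constants, which is where a real IDL-generated parser earns its keep.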
I have some experience with Pragmatic-Programmer-type code generation: specifying a data structure in a platform-neutral format and writing templates for a code generator that consume these data structure files and produce code that pulls raw bytes into language-specific data structures, does scaling on the numeric data, prints out the data, etc. The nice pragmatic(TM) ideas are that (a) I can change data structures by modifying my specification file and regenerating the source (which is DRY and all that) and (b) I can add additional functions that can be generated for all of my structures just by modifying my templates.
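To make that concrete, here is roughly the kind of generated code I mean, written in Java with a made-up spec entry and made-up names:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical generator output for a spec entry such as:
//   record SensorReading { uint16 rawTemp scale=0.1; uint32 timestamp; }
final class SensorReading {
    double temperature;   // scaled from the raw field
    long timestamp;

    static SensorReading fromBytes(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN);
        SensorReading r = new SensorReading();
        r.temperature = (buf.getShort(0) & 0xFFFF) * 0.1; // scaling taken from the spec
        r.timestamp   = buf.getInt(2) & 0xFFFFFFFFL;      // unsigned 32-bit value
        return r;
    }

    void print() {
        System.out.printf("temperature=%.1f timestamp=%d%n", temperature, timestamp);
    }
}
```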
What I had been using was a Perl script called Jeeves, which worked, but it's general-purpose, and any functions I wanted for manipulating my data I was writing from the ground up.
Are there any frameworks that are well suited for creating parsers for structured binary data? What I've read of ANTLR suggests that it's overkill. My current target languages of interest are C#, C++, and Java, if it matters.
Thanks as always.
Edit: I'll put a bounty on this question. If there are any areas that I should be looking at (keywords to search on) or other ways of attacking this problem that you've developed yourself, I'd love to hear about them.
You may also look at a relatively new project, Kaitai Struct, which provides a language for that purpose and also has a good IDE:
Kaitai.io
You might find ASN.1 interesting, as it provides an abstract way to describe the data you might be processing. If you use ASN.1 to describe the data abstractly, you need a way to map that abstract data to concrete binary streams, for which ECN (Encoding Control Notation) is likely the right choice.
The New Jersey Machine Toolkit is actually focused on binary data streams corresponding to instruction sets, but I think that's a superset of just binary streams. It has very nice facilities for defining fields in terms of bit strings, and for automatically generating accessors and generators for them. This might be particularly useful if your binary data structures contain pointers to other parts of the data stream.
I'm currently learning F# and I'm exploring using it to analyse financial time-series. Can anyone recommend a good data structure to store time-series data in?
F# offers a rich selection of native types, and I'm looking for some simple combination that would provide an elegant, succinct and efficient solution.
I'm looking to store tick data, which consists of millions of records, each with a time stamp and several (~5-20) fields of numerical and textual data, with possible missing values.
My first thoughts are perhaps a sequence of tuples or records, but I was wondering if someone could kindly suggest something that has worked well in the real world.
EDIT:
A few extra points for clarification:
The common operations that I'm likely to require are:
Time based lookup - i.e. find the most recent data point at a given time
Time based joins
Appends
(Updates and deletes are going to be rare.)
I should make it clear I'm exploring using F# primarily as an interactive tool for research, with the ability to compile as a (really big) added bonus.
ANOTHER EDIT:
I should also have mentioned that my role/use of F# and this data is purely within research, not development. The intention is that once we understand the data (and what we want to do with it) better, we can later specify tools for our developers to build, such as data warehouses etc., at which point we'd start using their data structures.
Although, I am concerned that our models are computationally intensive, use a lot of memory and can't always be coded in a recursive manner. So we may end up having to query out large chunks anyway.
I should also say that I've always used Matlab or R for these sorts of tasks before, but I'm now interested in F# as it offers that interactive, high-level flexibility for research, while the same code can be used in production.
My apologies for not giving this context information at the start (it's my first question); I can see now that it helps people form their answers.
My thanks again to everyone that's taken the time to help me.
It really sounds like your data should be stored and queried in a relational database (where is it currently stored? Loading millions of records with several fields into memory must be an expensive operation, and could leave you with stale data and difficulty persisting changes). You could then use the F# LINQ to SQL implementation (which I believe you can find in the Power Pack) to have F# expressions translated to SQL expressions.
Here's a link from Don Syme about LINQ Support in F# Power Pack: http://blogs.msdn.com/b/dsyme/archive/2009/10/23/a-quick-refresh-on-query-support-in-the-f-power-pack.aspx
The best choice of data structure depends upon what operations you want to do on it.
The simplest would be an array of structs. This has the advantages of fast random lookup, good space efficiency for an uncompressed representation and good locality. If there is sharing between substructures (like the strings) then intern them to make sure they get shared.
Alternatives might be a seq that is loaded from disk on demand, a singly-linked list that allows you to prepend elements quickly, or a balanced binary tree that allows operations like insertion at random locations to be done efficiently.
I have a requirement to parse a huge text file and send parts of this file to be added as separate rows in Content Manager. What is the best way of parsing it and then updating the DB?
I would also need to identify certain tokens within this text file.
Please suggest what language I should use to code this requirement.
Thanks
All widely used programming languages can do that, though scripting languages (especially Perl) may be better suited to the task than others. However, your personal experience is a bigger factor: using the language you're most familiar with would probably be best, unless you have specific reasons not to use it, or to use a different language.
A classic problem when working with large files is just reading them in the first place. A lot of standard libraries tend to want to read the entire file into memory as a single array. However, for really large files this is usually not practical.
For whatever language you end up choosing, look over the file I/O libraries carefully and select a method that will allow you to read the file in chunks. Then run your parsing logic over each chunk and, when you get to the end of a chunk, read in the next one. Be careful with the parsing logic: it can sometimes be tricky to handle a chunk that ends in a place your parser is not expecting.
Additionally a double buffer system sometimes works well. Process one chunk and when you get near the end, you fill the other buffer with the next chunk. If your parsing is CPU intensive, you might even look at filling a buffer on another thread to overlap the file I/O with the parsing. However, I wouldn't do this first. Start with just getting the logic working before any performance optimizations.
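Here is a minimal Java sketch of that chunked approach (the chunk size and the newline-delimited record format are just placeholders, and it ignores the corner case of a multi-byte character being split across chunks): read a fixed-size block, parse only the complete records in it, and carry the trailing partial record over into the next chunk:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ChunkedParser {
    private static final int CHUNK_SIZE = 1 << 20;   // 1 MiB per read, placeholder value

    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[CHUNK_SIZE];
        StringBuilder carry = new StringBuilder();    // partial record left over from the previous chunk

        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                carry.append(new String(chunk, 0, n, StandardCharsets.UTF_8));
                int lastNewline = carry.lastIndexOf("\n");
                if (lastNewline >= 0) {
                    // Only the complete lines are handed to the parser.
                    for (String line : carry.substring(0, lastNewline).split("\n")) {
                        handleRecord(line);
                    }
                    carry.delete(0, lastNewline + 1);  // keep the incomplete tail for the next chunk
                }
            }
            if (carry.length() > 0) handleRecord(carry.toString());  // final record with no trailing newline
        }
    }

    private static void handleRecord(String line) {
        // Placeholder for the real parsing / DB-update logic.
        System.out.println("record: " + line);
    }
}
```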
Without more detailed requirements it's difficult to suggest a particular language. Certainly no language is going to magically solve the problem of parsing such a big file. Depending on the format of the file there might be a parsing library particularly suited to the job, which might guide your choice of language.
If by "Content Manager" you mean Microsoft Content Manager Server I guess one of the Microsoft languages such as C# or VB.Net might be a better choice.
So my answer would be to pick one of the languages you already know, probably the one you know best.
I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. i.e. various programs that produce it use slightly different formats, it may have been printed out and OCR'd back in (yeah, I know) with errors, etc. so I need to use heuristics that guess how it was produced and apply different quirks modes, etc. It's frustrating, because I'm somewhat familiar with the theory and practice of parsing if things are well behaved, and there are nice parsing frameworks etc. out there, but the unreliability of the data has led me to write some very sloppy ad-hoc code. It's OK at the moment but I'm worried that as I expand it to process more variations and more complex data, things will get out of hand. So my question is:
Since there are a fair number of existing commercial products that do related things ("quirks modes" in web browsers, error interpretation in compilers, even natural language processing and data mining, etc.) I'm sure some smart people have put thought into this, and tried to develop a theory, so what are the best sources for background reading on parsing unprincipled data in as principled a manner as possible?
I realize this is somewhat open-ended, but my problem is that I think I need more background to even know what the right questions to ask are.
Given the choice between what you've proposed and fighting a hungry crocodile while covered in raw-beef-flavored marmalade and both hands tied behind my back, I'd choose the ...
Well, OK, on a more serious note: if you have data that doesn't abide by any "sane" structure, you have to study the data, find the frequencies of quirks in it, and correlate the data with the given context (i.e. how it was generated).
Print-then-OCR to get the data in is almost always going to lead to heartbreak. The company I work for employs a veritable army of people who manually read such documents and hand-"code" (i.e. enter by hand) the data for known problematic OCR scenarios, or for documents our customers detect the original OCR failed on.
As for leveraging "Parsing Frameworks" these tend to expect data that will always follow the grammar rules you've laid out. The data you've described has no such guarantees. If you go that route be prepared for unexpected - though not always obvious - failures.
By all means if there is any way possible to get the original data files, do so. Or if you can demand that those providing the data make their data come in a single well defined format, even better. (It might not be "YOUR" format, but at least it's a regular and predictable format you can convert from)
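For what it's worth, here's a rough Java sketch (the detection rules and parser names are invented) of the "correlate with how it was generated" idea: sniff each document for a producer's telltale quirks first, then hand it to a parser written for that producer's quirks, and refuse to guess when nothing matches:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical quirks-mode dispatcher: detection rules and producer names
// are invented for illustration.
final class QuirksDispatcher {
    interface Parser { void parse(String document); }

    // LinkedHashMap keeps registration order, so the most specific rules go first.
    private final Map<Predicate<String>, Parser> rules = new LinkedHashMap<>();

    void register(Predicate<String> looksLike, Parser parser) {
        rules.put(looksLike, parser);
    }

    void parse(String document) {
        for (Map.Entry<Predicate<String>, Parser> e : rules.entrySet()) {
            if (e.getKey().test(document)) {
                e.getValue().parse(document);   // first matching quirks mode wins
                return;
            }
        }
        throw new IllegalArgumentException("no quirks mode matched; needs manual review");
    }
}

// Usage (hypothetical parser classes):
// dispatcher.register(doc -> doc.contains("REPORT V2.3"), new VendorAReportParser());
// dispatcher.register(doc -> doc.matches("(?s).*Page \\d+ of \\d+.*"), new OcrPrintoutParser());
```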
I'm about to write a small utility to organize and tag my mp3s.
What is the best way to store small amounts of data? More importantly, are there databases where I don't need to install a client/server environment, where I just include the library and I'm good?
I could use XML, but I'm afraid that the file size would become large and hard to handle, not to mention keeping the memory footprint small.
Thanks
EDIT: I haven't decided on the language, I wanted to make my decision independent of platform. If I had to choose, most likely .NET, second Java, third C++.
My apologies, this is for a Windows App.
On Windows you can use the built-in ESENT database engine. There is an API you can use from C++:
http://blogs.msdn.com/windowssdk/archive/2008/10/23/esent-extensible-storage-engine-api-in-the-windows-sdk.aspx
There is also a managed interop layer that you can use from C# code:
http://www.codeplex.com/ManagedEsent
Which language/platform are you talking about?
In the Java world I prefer using embedded databases such as HSQLDB, H2 or JavaDB (f.k.a. Derby).
They don't need installing and still provide the simple access you're used to from a "real" DBMS.
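For example, here's a minimal embedded-H2 sketch (it assumes the H2 jar is on the classpath; the database file, table and column names are made up for the mp3-tagging case). HSQLDB and Derby look almost identical apart from the JDBC URL:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Mp3Db {
    public static void main(String[] args) throws SQLException {
        // "jdbc:h2:./mp3library" keeps the whole database in local files;
        // there is no server process to install or run.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./mp3library")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS track (path VARCHAR(500) PRIMARY KEY, tag VARCHAR(100))");

            try (PreparedStatement ins = conn.prepareStatement(
                    "INSERT INTO track (path, tag) VALUES (?, ?)")) {
                ins.setString(1, "/music/song.mp3");
                ins.setString(2, "favourite");
                ins.executeUpdate();
            }

            try (ResultSet rs = conn.createStatement()
                    .executeQuery("SELECT path FROM track WHERE tag = 'favourite'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("path"));
                }
            }
        }
    }
}
```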
In the C/Python/Unixy world SQLite is a hot contender in that area.
Another option is the various forms of the Berkeley database (e.g. db3, db4, SleepyCat).
SQLite if you want the pain of a relational DB without a server install or hassle.
I would use one of the many text-serialization formats. I personally think that YAML 1.1 is the most powerful (built-in support for referential object graphs) and the easiest to read/modify by a human (parsing is a bear; use a library such as PyYAML or JYaml or some .NET library).
Otherwise XML or JSON are adequate file formats.
Whichever format you use, just compress the file if you're concerned about disk usage. If you're worried about in-memory usage, then I don't see how your serialization format matters...
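If you go the JSON route, here's a small sketch using Jackson (the library choice and file name are mine, just for illustration): write a plain map to a file and read it back.

```java
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.File;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class TagStore {
    public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        File store = new File("tags.json");

        // Write: a simple path -> tag map, pretty-printed so it stays human-editable.
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("/music/song.mp3", "favourite");
        mapper.writerWithDefaultPrettyPrinter().writeValue(store, tags);

        // Read it back.
        @SuppressWarnings("unchecked")
        Map<String, String> loaded = mapper.readValue(store, Map.class);
        System.out.println(loaded);
    }
}
```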
Have a look at Prevayler - it's a serialization persistence framework (use XStream etc. if you want to human-read your data), which is really fast, does not require annotations and "just works". Some basic info:
It does impose a more rigorous transaction pattern, as it does not give you automatic rollback (there's a rough code sketch of the pattern at the end of this answer):
Ensure the transaction will succeed (with the current state of the system) - e.g. does it make sense now?
The transaction is added to the queue and stored (to survive a power reset etc.)
The transaction is executed and applied to the object structure.
Writes: thousands of transactions/sec
Reads: hundreds of thousands of transactions/sec
I haven't used it much, but it's sooo much nicer to use for small projects (persisting any serializable object is so nice)
Oh - as for everyone asking "what platform are you running on?": Prevayler (Java) has/had ports to quite a few platforms, but I can't find a decent list :(. I remember there were around 5-7, but I can only remember .NET.
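Here is the rough shape of the code, from memory of the classic (pre-generics) Java API - treat the exact class and method signatures as an assumption and check them against whichever Prevayler version you pick up:

```java
import org.prevayler.Prevayler;
import org.prevayler.PrevaylerFactory;
import org.prevayler.Transaction;

import java.io.Serializable;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// The prevalent system: an ordinary serializable object graph held in memory.
class TagLibrary implements Serializable {
    final List<String> tags = new ArrayList<>();
}

// A transaction object: the "check it makes sense, journal it, apply it"
// pattern from the list above lives in classes like this one.
class AddTag implements Transaction, Serializable {
    private final String tag;
    AddTag(String tag) { this.tag = tag; }

    public void executeOn(Object prevalentSystem, Date executionTime) {
        ((TagLibrary) prevalentSystem).tags.add(tag);
    }
}

public class PrevaylerDemo {
    public static void main(String[] args) throws Exception {
        // "prevalence-base" is the directory where the journal and snapshots are written.
        Prevayler prevayler = PrevaylerFactory.createPrevayler(new TagLibrary(), "prevalence-base");
        prevayler.execute(new AddTag("favourite"));   // journaled to disk, then applied in memory
        TagLibrary library = (TagLibrary) prevayler.prevalentSystem();
        System.out.println(library.tags);
    }
}
```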
If you're planning on storing everything in memory while your program does work on it, then serializing to a file using basic load() and save() functions that you write would be fine, and less pain than a full-on DB.
In Java that can be done using standard Serialization (or can serialize to and from XML to make it somewhat human readable and editable).
It shouldn't affect your memory footprint at all as it is merely saving and restoring your objects. You just won't get transactions and random access and queries and all that good stuff.
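A minimal sketch of that load()/save() pair using standard Java serialization (the class and file names are mine):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// The whole library lives in memory; save() and load() just snapshot it to disk.
class Mp3Library implements Serializable {
    private static final long serialVersionUID = 1L;
    final Map<String, String> tagsByPath = new HashMap<>();

    void save(String fileName) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(fileName))) {
            out.writeObject(this);
        }
    }

    static Mp3Library load(String fileName) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(fileName))) {
            return (Mp3Library) in.readObject();
        }
    }
}
```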
You could even use XML, JSON, an .ini file... even a plain text file.
I would advise an SQL-like database (such as SQLite). Today your requirements might make a full SQL database seem silly. But you never know how much this "little project" will grow over the years. When it does grow to the point where you have to have a SQL engine, you will be glad you didn't just serialize some Java objects or store stuff in JSON format.