Parsing text file (100+ MB) and sending data over network - parsing

I have a requirement to parse a huge text file and send parts of this file to be added as separate rows in Content Manager. What is the best way to parse the file and then update the DB?
I would also need to identify certain tokens within this text file.
Please suggest which language I should use to code this requirement.
Thanks

All widely used programming languages can do that, though scripting languages (especially Perl) may be better suited to the task than others. However, your personal experience is a bigger factor: using the language you're most familiar with would probably be best, unless you have specific reasons not to use it, or to use a different language.

A classic problem when working with large files is just reading them in the first place. A lot of standard libraries tend to want to read the entire file into memory / array. However for really large files this is usually not practical.
Whatever language you end up choosing, look over the file I/O libraries carefully and select a method that will allow you to read the file in chunks. Then run your parsing logic over each chunk and, when you get to the end of a chunk, read in the next. Be careful with the parsing logic: it can be tricky to handle a chunk that ends in a place your parser is not expecting, such as the middle of a record or token.
Additionally a double buffer system sometimes works well. Process one chunk and when you get near the end, you fill the other buffer with the next chunk. If your parsing is CPU intensive, you might even look at filling a buffer on another thread to overlap the file I/O with the parsing. However, I wouldn't do this first. Start with just getting the logic working before any performance optimizations.
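A minimal Python sketch of the chunked approach described above. It assumes the records are newline-delimited, which the question does not actually say; the name `iter_records` and the 1 MiB chunk size are illustrative choices only. The `leftover` buffer is the part that deals with a chunk ending in the middle of a record.

```python
import io
from typing import BinaryIO, Iterator


def iter_records(stream: BinaryIO,
                 chunk_size: int = 1 << 20,
                 sep: bytes = b"\n") -> Iterator[bytes]:
    """Yield sep-delimited records without reading the whole file at once."""
    leftover = b""  # tail of the previous chunk that did not end in sep
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = leftover + chunk
        # Everything before the last separator is complete; the final
        # piece may be a partial record, so carry it into the next round.
        *complete, leftover = buffer.split(sep)
        yield from complete
    if leftover:
        # The file did not end with a separator: emit the last record.
        yield leftover
```

Typical use would be `with open(path, "rb") as f: for record in iter_records(f): ...`, with the token matching and the database insert done per record (or per batch of records) inside the loop.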

Without more detailed requirements it's difficult to suggest a particular language. Certainly no language is going to magically solve the problem of parsing such a big file. Depending on the format of the file, there might be a parsing library particularly suited to the job, which might guide your choice of language.
If by "Content Manager" you mean Microsoft Content Manager Server I guess one of the Microsoft languages such as C# or VB.Net might be a better choice.
So my answer would be: pick one of the languages you already know, probably the one you know best.

Related

reading/parsing common lisp files from lisp without all packages available or loading everything

I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in 2.3. So you can't easily intervene in this.
Secondly there's not portably enough information in any kind of condition the reader might signal to be able to handle them.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of information.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by @Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc with ones which just make a list, which sounds less heroic than the suggestions from @tfb, but I don't have time for that shit.
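For illustration, the re.sub workaround could look roughly like the sketch below. This is not the poster's actual code: the function name `mask_package_markers`, the `===` marker (borrowed from the example in the question) and the simplified set of symbol characters are all assumptions. It is deliberately naive and will also rewrite colons inside strings and comments, and it ignores `|escaped|` symbols.

```python
import re

# Simplified idea of which characters may appear in a symbol name.
# Real Common Lisp symbols allow more than this; it is an assumption.
_SYM = r"[A-Za-z0-9.+\-*/<>=!?$%&_~^]+"

# A package-qualified symbol: package, one or two colons, symbol name.
# Keywords such as :port have nothing before the colon, so they do not match.
_QUALIFIED = re.compile(rf"({_SYM})(::?)({_SYM})")


def mask_package_markers(source: str, marker: str = "===") -> str:
    """Replace each package-marker colon so the reader sees one plain symbol."""
    return _QUALIFIED.sub(
        lambda m: m.group(1) + marker * len(m.group(2)) + m.group(3),
        source,
    )
```

A single colon becomes one marker and a double colon becomes two, so the external/internal distinction can still be recovered after reading.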

Is using TStringList to load huge text file the best way in Delphi?

What is the best way to load huge text file data in delphi? Is there any component that can load text file superfast?
Let's say I have a text file that contains a database stored in a fixed-length format.
It contains 150 fields, each at least 50 characters long.
1. I need to load it into memory
2. I need to parse it and probably store it in a memdataset for processing
My questions:
1. Is it enough if I use TStringList.loadFromFile method?
2. Is there any other better component to manipulate the text file?
3. Should I use low level reading from textfile?
Thank you in advance.
TStringList is never the optimal way of working with lots of text, but it's the simplest. If you've got small files on your hands you can use TStringList without issues. Even if you have large files (not huge files) you might implement a version of your algorithm using TStringList for testing purposes, because it's simple and easy to understand.
If your files are large, as they probably are since you call them "databases", you need to look into alternative technologies that will enable you to read only as much as you need from the database. Look into:
TFileStream
Memory mapped files.
Don't look at the old "file" based APIs still available in Delphi, they're plain old.
I'm not going to go into details on how to access text using those methods because we've recently had two similar questions on SO:
How Can I Efficiently Read The First Few Lines of Many Files in Delphi
and
Fast Search to see if a String Exists in Large Files with Delphi
Since you have a fixed length that you're working with, you can build an access class based on TList with a TWriter and TReader that will take your records into account. You'll have none of the overhead of a TStringList (not that it's a bad thing, but if you don't need it, why have it) and you can build in your own access to records into the class.
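To make the fixed-length access idea concrete, here is a sketch in Python rather than Delphi, since the technique is not language-specific. It combines a memory-mapped file with record-index arithmetic. The class name `FixedRecordFile`, the field widths, the ASCII encoding and the assumption that records are stored back to back with no line terminator are all hypothetical; adjust `newline_len` if each record ends in a line break.

```python
import mmap
from typing import List, Sequence


class FixedRecordFile:
    """Random access to fixed-length records through a memory-mapped file."""

    def __init__(self, path: str, field_widths: Sequence[int],
                 newline_len: int = 0, encoding: str = "ascii"):
        self.field_widths = list(field_widths)
        self.record_size = sum(self.field_widths) + newline_len
        self.encoding = encoding
        self._file = open(path, "rb")
        # Length 0 maps the whole file; pages are loaded by the OS on demand.
        # Note that mmap raises ValueError for an empty file.
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self) -> int:
        return len(self._map) // self.record_size

    def record(self, index: int) -> List[str]:
        """Return the fields of record number index, padding stripped."""
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.record_size
        fields = []
        for width in self.field_widths:
            raw = self._map[start:start + width]
            fields.append(raw.decode(self.encoding).rstrip())
            start += width
        return fields

    def close(self) -> None:
        self._map.close()
        self._file.close()
```

Only the pages that are actually touched get read from disk, so fetching record 1,000,000 does not require parsing the 999,999 records before it.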
Ultimately it depends on what you are trying to accomplish with the data once you have it loaded into memory. While TStringList is easy to use, it isn't as efficient as "rolling your own".
However, efficiency in data manipulation may not be that much of an issue, as you are using text files to hold a database. If you just need to read in and make decisions based on data in the file, the more flexible TList may be overkill.
I recommend sticking with TStringList if you find it convenient for your problem. Optimization is another matter and should be done later.
As for TStringList the optimization is to declare a descendant class that overrides TStrings.LoadFromStream method - you can make it practically as fast as possible, taking into account the structure of your files.
It is not entirely clear from your question why you need to load the entire file into memory, prior to then going on to create an in-memory data set.... are you conflating the two issues? (i.e. because you need to create an in-memory data set you think you first need to load the source data entirely into memory?) Or is there some initial pre-processing of the source file which is only possible with the entire file loaded in memory? (This is unlikely, and even if this is the case, it isn't necessary with a navigable stream object such as a TFileStream.)
But I think the answer you are looking for is right there in the question....
If you are loading this file in order to parse it and populate/initialise a further data structure (the data set) for further processing, then using an existing high level data structure is an unnecessary and potentially costly (in terms of time) step.
Use the lowest level means of access that provides the capabilities you need.
In this case a TFileStream will likely provide the best balance of convenience and ease of use.

Background reading for parsing sloppy / quirky / "almost structured" data?

I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. i.e. various programs that produce it use slightly different formats, it may have been printed out and OCR'd back in (yeah, I know) with errors, etc. so I need to use heuristics that guess how it was produced and apply different quirks modes, etc. It's frustrating, because I'm somewhat familiar with the theory and practice of parsing if things are well behaved, and there are nice parsing frameworks etc. out there, but the unreliability of the data has led me to write some very sloppy ad-hoc code. It's OK at the moment but I'm worried that as I expand it to process more variations and more complex data, things will get out of hand. So my question is:
Since there are a fair number of existing commercial products that do related things ("quirks modes" in web browsers, error interpretation in compilers, even natural language processing and data mining, etc.) I'm sure some smart people have put thought into this, and tried to develop a theory, so what are the best sources for background reading on parsing unprincipled data in as principled a manner as possible?
I realize this is somewhat open-ended, but my problem is that I think I need more background to even know what the right questions to ask are.
Given the choice between what you've proposed and fighting a hungry crocodile while covered in raw-beef-flavored marmalade and both hands tied behind my back, I'd choose the ...
Well, OK, on a more serious note: if you have data that doesn't abide by any "sane" structure, you have to study the data, find the frequencies of quirks in it, and correlate them with the given context (i.e. how the data was generated).
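As a rough sketch of what "finding frequencies of quirks" could mean in practice, the Python below counts how many lines trigger each of a few detectors. The detector names and regular expressions are purely hypothetical examples; the real ones have to come out of studying your own data.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical quirk detectors, for illustration only.
QUIRKS = {
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "us_date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    # A letter that OCR commonly confuses with a digit, inside a number.
    "ocr_letter_in_number": re.compile(r"\b\d+[OlI]\d*\b"),
}


def quirk_frequencies(lines: Iterable[str]) -> Counter:
    """Count, per quirk, the number of lines in which it shows up."""
    counts = Counter()
    for line in lines:
        for name, pattern in QUIRKS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Running this separately over samples from each known producer gives one profile per source, which is a starting point for guessing which quirks mode to apply to an unknown document.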
Print to OCR to get the data in is almost always going to lead to heartbreak. The company I work for employs a veritable army of people who manually read such documents and hand "code" (i.e. enter by hand) the data for known problematic OCR scenarios, or for documents where our customers detect that the original OCR failed.
As for leveraging "Parsing Frameworks" these tend to expect data that will always follow the grammar rules you've laid out. The data you've described has no such guarantees. If you go that route be prepared for unexpected - though not always obvious - failures.
By all means if there is any way possible to get the original data files, do so. Or if you can demand that those providing the data make their data come in a single well defined format, even better. (It might not be "YOUR" format, but at least it's a regular and predictable format you can convert from)

Lightweight Store Mechanisms

I'm about to write a small utility to organize and tag my mp3s.
What is the best way to store small amounts of data? More importantly, are there databases where I don't need to install a client/server environment, where I just include the library and I'm good?
I could use XML, but I'm afraid that the file size would become large and hard to handle, not to mention keeping the memory footprint small.
Thanks
EDIT: I haven't decided on the language, I wanted to make my decision independent of platform. If I had to choose, most likely .NET, second Java, third C++.
My apologies, this is for a Windows App.
On Windows you can use the built-in ESENT database engine. There is an API you can use from C++:
http://blogs.msdn.com/windowssdk/archive/2008/10/23/esent-extensible-storage-engine-api-in-the-windows-sdk.aspx
There is also a managed interop layer that you can use from C# code:
http://www.codeplex.com/ManagedEsent
Which language/platform are you talking about?
In the Java world I prefer using embedded databases such as HSQLDB, H2 or JavaDB (f.k.a. Derby).
They don't need installing and still provide the simple access you're used to from a "real" DBMS.
In the C/Python/Unixy world SQLite is a hot contender in that area.
Another option is the various forms of the Berkeley database (eg, db3, db4, SleepyCat.)
SQLite if you want the pain of a relational DB without a server install or hassle.
I would use one of the many text-serialization formats. I personally think that YAML 1.1 is the most powerful (built-in support for referential object graphs) and easiest to read/modify by a human (parsing is a bear, use a library such as PyYAML or JYaml or some .NET library).
Otherwise XML or JSON are adequate file formats.
Whichever format you use, just compress the file if you're concerned about disk usage. If you're worried about in-memory usage, then I don't see how your serialization format matters...
Have a look at Prevayler - it's a serialization persistence framework (use xstream etc if you want to human-read your data), which is really fast, does not require annotations and "just works". Some basic info:
Writes of 1000's of transactions/sec
Reads of 100,000's of transactions/sec
It does impose a more rigorous transaction pattern, as it does not give you automatic rollback:
1. Ensure transaction will succeed (with current state of system) - e.g. does it make sense now?
2. [transaction is added to queue], and stored (for power reset etc)
3. transaction is executed and applied to the object structure.
I haven't used it much, but it's sooo much nicer to use for small projects (persisting any serializable object is so nice)
Oh - as for every one saying "what platform you running on?", Prevayler (java) has/had ports to quite a few platforms, but I can't find a decent list :(. I remember that there were around 5-7, but can only remember .NET.
If you're planning on storing everything in memory while your program does work on it, then serializing to a file using a basic load() and save() function that you write would be fine, and less pain than a full on DB.
In Java that can be done using standard Serialization (or can serialize to and from XML to make it somewhat human readable and editable).
It shouldn't affect your memory footprint at all as it is merely saving and restoring your objects. You just won't get transactions and random access and queries and all that good stuff.
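The load()/save() idea is language-independent; a minimal Python sketch using JSON might look like the following. The function names simply mirror the ones mentioned above, and the choice of JSON and of a plain dict as the in-memory structure are assumptions made for illustration.

```python
import json
import os
import tempfile


def save(library: dict, path: str) -> None:
    """Write the whole in-memory structure to disk in one go."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then swap it into place, so a crash
    # mid-write does not leave a half-written library file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(library, handle, indent=2)
    os.replace(tmp_path, path)


def load(path: str) -> dict:
    """Read the structure back; a missing file means an empty library."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Everything lives in memory between load() and save(), which is exactly the trade-off described above: simple, but no transactions, random access or queries.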
you could even use xml, json, an .ini file... a text file even
I would advise a SQL-like database (such as SQLite). Today your requirements might make a full SQL database seem silly. But you never know how much this "little project" will grow over the years. When it does grow to the point where you have to have a SQL engine, you will be glad you didn't just serialize some Java objects or store stuff in JSON format.
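For a feel of how little setup an embedded database needs, here is a small sketch using the sqlite3 module from Python's standard library. The schema (a `tracks` table and a `tags` table) and the helper names are hypothetical; the question does not say what the tagging data looks like.

```python
import sqlite3
from typing import List


def open_library(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database file; no server process is involved."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS tracks (
            id   INTEGER PRIMARY KEY,
            path TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tags (
            track_id INTEGER NOT NULL REFERENCES tracks(id),
            tag      TEXT NOT NULL,
            PRIMARY KEY (track_id, tag)
        );
    """)
    return conn


def tag_track(conn: sqlite3.Connection, path: str, tag: str) -> None:
    """Attach a tag to a track, creating the track row if needed."""
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT OR IGNORE INTO tracks (path) VALUES (?)", (path,))
        (track_id,) = conn.execute(
            "SELECT id FROM tracks WHERE path = ?", (path,)).fetchone()
        conn.execute(
            "INSERT OR IGNORE INTO tags (track_id, tag) VALUES (?, ?)",
            (track_id, tag))


def tracks_with_tag(conn: sqlite3.Connection, tag: str) -> List[str]:
    """Return the paths of all tracks carrying the given tag, sorted."""
    rows = conn.execute(
        "SELECT t.path FROM tracks t JOIN tags g ON g.track_id = t.id "
        "WHERE g.tag = ? ORDER BY t.path", (tag,))
    return [row[0] for row in rows]
```

Passing a file name instead of `":memory:"` keeps the data in a single file next to the application, which matches the "just include the library" requirement.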

Resources