Delphi TWriter.WriteListBegin

I have a component which writes different information into a blob using the TWriter class. The problem is that some blobs have been saved incorrectly (or with a different data sequence), and I need to correct those errors somehow. The problem arises when I'm expecting a WriteListBegin or WriteListEnd and instead get an EReadError "Invalid property value". I'm thinking of reading the stream byte by byte to find out where these separators are located. How can I tell that I'm encountering a WriteListBegin or WriteListEnd?
LE: The issue cannot be solved as easily as the comments suggest. I don't know the vendor, so I cannot ask for details. As for what is behind the TWriter mechanism, all I found is the following assembly routine, and I can't tell from it what bytes it writes as a start-of-list marker to the writer object's associated stream:
procedure TWriter.Write(const Buf; Count: Longint); assembler;
Probably I will start writing my own custom TReader in order to fix the bogus streams.

If I understood your question correctly, you have corrupt data which, obviously, cannot be parsed correctly. Specifically, the list begin and end markers are missing, in the wrong order, or in the wrong place.
I can think of four solutions to fix that:
1. See if you can get the data uncorrupted (ask the supplier).
2. Fix the data manually (if it is of reasonable size).
3. Write your own custom parser to fix the markers automatically, and run it beforehand.
4. Using TWriter, for every line: remember the current Position, check the line, then rewrite, substitute or ignore it in case of corruption, and return to the old Position if necessary.
If several data chunks are corrupted in the same manner, a partial manual investigation (2) may lead to a custom parser (3) in no time.

I solved the problem for most of the blobs by reading the data types until the reader.EndOfList separator.
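For anyone hitting the same thing, here is a rough sketch of that scanning loop. As far as I can tell, WriteListBegin emits the vaList value-type byte and WriteListEnd emits vaNull (which is what TReader.EndOfList checks for), so walking the value types reveals where the list markers sit:
uses
  Classes;

// Walk a blob produced by TWriter and report where list markers occur.
procedure DumpValueTypes(Stream: TStream);
var
  Reader: TReader;
begin
  Reader := TReader.Create(Stream, 4096);
  try
    while Reader.Position < Stream.Size do
      case Reader.NextValue of
        vaList:
          begin
            Writeln('list begin at ', Reader.Position);
            Reader.ReadListBegin;   // consumes the vaList marker
          end;
        vaNull:
          begin
            Writeln('list end at ', Reader.Position);
            Reader.ReadListEnd;     // consumes the vaNull marker
          end;
      else
        Reader.SkipValue;           // skip the value without interpreting it
      end;
  finally
    Reader.Free;
  end;
end;
On a bogus stream, SkipValue may still raise EReadError at the exact spot where the data goes wrong, which at least pinpoints the offset that needs fixing.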
Thanks to all, especially for the -1's.

Related

GNURADIO 3.7.8: identify a part of a byte stream

I feel that Stream Tags, Message Passing, and Packet Data Transmission are a bit of overkill, and I have a hard time understanding them.
I have a simple wish: starting from a stream of bytes, I would like to "extract" only a fixed number of bytes starting from a known pattern. For example, from a stream like ...01h 55h XXh YYh ZZh..., it should extract XXh YYh ZZh.
I used Correlate Access Code Tag -- Tagged Stream Align -- Pack K Bits to convert the bit stream into a byte stream and sync to the desired access code (01h 55h), but how do I tell GNU Radio to process only 3 bytes each time the code is found? An OOT block would likely solve it, but is there some combination of standard GRC blocks that does this?
I think that with correlate_access_code_tag_bb you can actually build this, with a bit of brain-twisting, from existing blocks alone. (Note: this does rely on stream tags, because those are the right tool to mark special points in a sample flow.)
However, your simple case might really not be worth it. Simply follow the guided tutorials up to the point where you can write your own Python block.
Use self.set_history(len(preamble)+len_payload) in the constructor of your new block so that you always see the last samples of the previous iteration in your current call to work. Then simply search for the preamble in your sample stream, output only the len_payload bytes that follow it when you find it, and produce nothing when you don't.

Zlib ruby - how to check if data is deflated/compressed before processing it?

I'm in a situation where some data in a database is not compressed, but I want to enable compression for new data coming in, without having to update all the records currently in the database to compress them as well.
So I need to be able to say: if deflated, inflate it and process it; otherwise just process it. But I can't see how to gracefully check whether the data is already compressed before trying to process it, other than with a 'begin ... rescue' block:
begin
  process(Zlib::Inflate.inflate(data))
rescue Zlib::DataError
  process(data)
end
Is there a better way? I've seen references to magic numbers and checking the first couple of bytes of the file, but no good examples of how to achieve these things in ruby. Any help appreciated. Thanks.
What you propose is the best way. It will very quickly determine that the first two bytes are not a zlib header. If by accident the input data happens to look like a zlib header (about a 1 in 1024 chance), then the decompression will detect an invalid deflate stream, for random data, within on the order of 30 bytes almost all the time.
As you suggested, you could either rescue the specific exception or manually validate the file type by reading the magic numbers.
Ruby IO's readpartial can read the specified number of bytes, which you can compare with the magic number.
I personally would stick to rescuing the exception as a lot of the core libraries perform the same magic number check before raising the exception.

TStringList of objects taking up tons of memory in Delphi XE

I'm working on a simulation program.
One of the first things the program does is read in a huge file (28 MB, about 79,000 lines), parse each line (about 150 fields), create an object for it, and add it to a TStringList.
It also reads in another file, which adds more objects during the run. At the end, it ends up being about 85,000 objects.
I was working with Delphi 2007, and the program used a lot of memory, but it ran OK. I upgraded to Delphi XE and migrated the program over, and now it uses a LOT more memory and runs out of memory halfway through the run.
In Delphi 2007 it would end up using 1.4 GB after reading in the initial file, which is obviously a huge amount, but in XE it ends up using almost 1.8 GB, which is really huge and leads to running out of memory and getting the error.
So my questions are:
Why is it using so much memory?
Why is it using so much more memory in XE than 2007?
What can I do about this? I can't change how big or long the file is, and I do need to create an object for each line and store it somewhere.
Thanks
Just one idea which may save memory.
You could leave the data in the original files, and just point to it from in-memory structures.
For instance, that's what we do for browsing big log files almost instantly: we memory-map the log file content, then parse it quickly to create indexes of the useful information in memory, then read the content dynamically. No string is created during the reading, only pointers to each line beginning, with dynamic arrays containing the needed indexes. Calling TStringList.LoadFromFile would be definitely much slower and more memory consuming.
The code is here; see the TSynLogFile class. The trick is to read the file only once, and build all indexes on the fly.
For instance, here is how we retrieve a line of text from the UTF-8 file content:
function TMemoryMapText.GetString(aIndex: integer): string;
begin
  if (self=nil) or (cardinal(aIndex)>=cardinal(fCount)) then
    result := '' else
    result := UTF8DecodeToString(fLines[aIndex],GetLineSize(fLines[aIndex],fMapEnd));
end;
We use exactly the same trick to parse JSON content. Such a mixed approach is also what the fastest XML access libraries use.
To handle your high-level data and query it fast, you may try to use dynamic arrays of records, and our optimized TDynArray and TDynArrayHashed wrappers (in the same unit). Arrays of records will be less memory consuming, will be faster to search in because the data won't be fragmented (even faster if you use ordered indexes or hashes), and you'll be able to have high-level access to the content (you can define custom functions to retrieve the data from the memory-mapped file, for instance). Dynamic arrays aren't suited to fast deletion of items (or you'll have to use lookup tables), but you wrote that you are not deleting much data, so it won't be a problem in your case.
So you won't have any duplicated structure any more, only the logic in RAM and the data in memory-mapped file(s). I added an "s" here because the same logic could perfectly well map several source data files (you'd need some "merge" and "live refresh" logic, AFAIK).
It's hard to say why your 28 MB file is expanding to 1.4 GB worth of objects when you parse it out into objects without seeing the code and the class declarations. Also, you say you're storing it in a TStringList instead of a TList or TObjectList. This sounds like you're using it as some sort of string->object key/value mapping. If so, you might want to look at the TDictionary class in the Generics.Collections unit in XE.
As for why you're using more memory in XE, it's because the string type changed from an ANSI string to a UTF-16 string in Delphi 2009. If you don't need Unicode, you could use a TDictionary keyed on AnsiString to save that space.
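As a sketch of that key/value idea (TDataObject is just a stand-in for whatever class each parsed line becomes, and the key is whatever field you currently use as the TStringList string):
uses
  Generics.Collections;

type
  TDataObject = class
    // the ~150 parsed fields would live here
  end;

procedure Demo;
var
  Items: TObjectDictionary<string, TDataObject>;
  Obj: TDataObject;
begin
  // doOwnsValues makes the dictionary free the objects when it is destroyed
  Items := TObjectDictionary<string, TDataObject>.Create([doOwnsValues]);
  try
    Items.Add('record-key-1', TDataObject.Create);
    if Items.TryGetValue('record-key-1', Obj) then
      Writeln(Obj.ClassName);   // use the object here
  finally
    Items.Free;
  end;
end;
If you decide you don't need Unicode, the key type can be switched to AnsiString.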
Also, to save even more memory, there's another trick you could use if you don't need all 79,000 of the objects right away: lazy loading. The idea goes something like this:
Read the file into a TStringList. (This will use about as much memory as the file size. Maybe twice as much if it gets converted into Unicode strings.) Don't create any data objects.
When you need a specific data object, call a routine that checks the string list and looks up the string key for that object.
Check if that string has an object associated with it. If not, create the object from the string and associate it with the string in the TStringList.
Return the object associated with the string.
This will keep both your memory usage and your load time down, but it's only helpful if you don't need all (or a large percentage) of the objects immediately after loading.
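A minimal sketch of that lookup routine, assuming a hypothetical ParseLine function that builds the data object from one raw line, and a TDataObject class as above:
function GetDataObject(Lines: TStringList; Index: Integer): TDataObject;
begin
  Result := TDataObject(Lines.Objects[Index]);
  if Result = nil then
  begin
    // first access: parse the raw line and cache the object on the list
    Result := ParseLine(Lines[Index]);   // hypothetical parser
    Lines.Objects[Index] := Result;
  end;
end;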
In Delphi 2007 (and earlier), a string is an Ansi string, that is, every character occupies 1 byte of memory.
In Delphi 2009 (and later), a string is a Unicode string, that is, every character occupies 2 bytes of memory.
AFAIK, there is no way to make a Delphi 2009+ TStringList object use Ansi strings. Are you really using any of the features of the TStringList? If not, you could use an array of strings instead.
Then, naturally, you can choose between
type
  TAnsiStringArray = array of AnsiString;
  // or
  TUnicodeStringArray = array of string; // in Delphi 2009+, string = UnicodeString
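If you go that route, here is a rough sketch of filling the AnsiString flavour straight from the file without ever creating UTF-16 strings (LoadAnsiLines is a made-up helper; no error handling, and it assumes an ANSI-encoded file with LF or CRLF line ends):
uses
  Classes, SysUtils;

function LoadAnsiLines(const FileName: string): TAnsiStringArray;
var
  FS: TFileStream;
  Buf: AnsiString;
  Count, P, Start: Integer;

  procedure AppendLine(First, Last: Integer);
  begin
    if (Last >= First) and (Buf[Last] = #13) then
      Dec(Last);                              // drop the CR of a CRLF pair
    if Count = Length(Result) then
      SetLength(Result, (Count + 1) * 2);     // grow geometrically
    Result[Count] := Copy(Buf, First, Last - First + 1);
    Inc(Count);
  end;

begin
  Result := nil;
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Buf, FS.Size);
    if FS.Size > 0 then
      FS.ReadBuffer(Buf[1], FS.Size);
  finally
    FS.Free;
  end;
  Count := 0;
  Start := 1;
  for P := 1 to Length(Buf) do
    if Buf[P] = #10 then
    begin
      AppendLine(Start, P - 1);
      Start := P + 1;
    end;
  if Start <= Length(Buf) then
    AppendLine(Start, Length(Buf));           // last line without a trailing LF
  SetLength(Result, Count);
end;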
Reading through the comments, it sounds like you need to lift the data out of Delphi and into a database.
From there it is easy to match organ donors to receivers*)
SELECT pw.* FROM patients_waiting pw
INNER JOIN organs_available oa ON (pw.bloodtype = oa.bloodtype)
AND (pw.tissuetype = oa.tissuetype)
AND (pw.organ_needed = oa.organ_offered)
WHERE oa.id = '15484'
That is, if you want to see the patients that might match new organ donor 15484.
In memory you only handle the few patients that match.
*) simplified beyond all recognition, but still.
In addition to Andreas' post:
Before Delphi 2009, a string header occupied 8 bytes. Starting with Delphi 2009, a string header takes 12 bytes. So every unique string uses 4 bytes more than before, on top of the fact that each character takes twice the memory.
Also, starting with Delphi 2010 I believe, TObject started using 8 bytes instead of 4. So for each single object created, Delphi now uses 4 more bytes. Those 4 bytes were added to support the TMonitor class, I believe. (For 85,000 objects that is only about 340 KB, so it is the strings, not the object headers, that account for most of the growth.)
If you're in desperate need to save memory, here's a little trick that could help if you have a lot of string values that repeat themselves.
var
  uUniqueStrings: TStringList; // must be created with Sorted := True for Find to work

function ReduceStringMemory(const S: string): string;
var
  idx: Integer;
begin
  if not uUniqueStrings.Find(S, idx) then
    idx := uUniqueStrings.Add(S);
  Result := uUniqueStrings[idx];
end;
Note that this will help ONLY if you have a lot of string values that repeat themselves. For example, this code uses 150 MB less on my system.
var
  sl: TStringList;
  I: Integer;
begin
  sl := TStringList.Create;
  try
    for I := 0 to 5000000 do
      sl.Add(ReduceStringMemory(StringOfChar('A',5)));
  finally
    sl.Free;
  end;
end;
I also read in a lot of strings in my program, which can approach a couple of GB for large files.
Short of waiting for 64-bit XE2, here is one idea that might help you:
I found storing individual strings in a stringlist to be slow and wasteful in terms of memory. I ended up blocking the strings together. My input file has logical records, which may contain between 5 and 100 lines. So instead of storing each line in the stringlist, I store each record. Processing a record to find the line I need adds very little time to my processing, so this works for me.
If you don't have logical records, you might just want to pick a blocking size and store every (say) 10 or 100 strings together as one string (with a delimiter separating them), as sketched below.
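A rough sketch of that blocking approach (the block size and delimiter are arbitrary; pick a delimiter that never occurs in your data):
uses
  Classes, Math;

const
  BlockSize = 100;
  Delim = #1;

// Loads FileName into Blocks, packing BlockSize raw lines into each entry,
// so the per-string overhead is paid once per block instead of once per line.
procedure LoadBlocked(const FileName: string; Blocks: TStrings);
var
  Src: TStringList;
  i, j: Integer;
  S: string;
begin
  Src := TStringList.Create;
  try
    Src.LoadFromFile(FileName);
    i := 0;
    while i < Src.Count do
    begin
      S := '';
      for j := i to Min(i + BlockSize, Src.Count) - 1 do
        S := S + Src[j] + Delim;
      Blocks.Add(S);
      Inc(i, BlockSize);
    end;
  finally
    Src.Free;
  end;
end;
Line k is then recovered by splitting Blocks[k div BlockSize] on Delim and taking element k mod BlockSize. Note that this sketch still loads the whole file into a temporary TStringList first, so the saving only applies to what you keep around afterwards.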
The other alternative is to store them in a fast and efficient on-disk file. The one I'd recommend is the open source Synopse Big Table by Arnaud Bouchez.
May I suggest you try the JEDI Code Library (JCL) class TAnsiStringList, which is like the TStringList from Delphi 2007 in that it is made up of AnsiStrings.
Even then, as others have mentioned, XE will be using more memory than Delphi 2007.
I really don't see the value of loading the full text of a giant flat file into a stringlist. Others have suggested a big-table approach such as Arnaud Bouchez's, or using SQLite, or something like that, and I agree with them.
I think you could also write a simple class that loads the entire file into memory and provides a way to attach line-by-line object links to a giant in-memory AnsiChar buffer.
Starting with Delphi 2009, not only strings but also every TObject has doubled in size. (See Why Has the Size of TObject Doubled In Delphi 2009?). But this would not explain this increase if there are only 85,000 objects. Only if these objects contain many nested objects, their size could be a relevant part of the memory usage.
Are there many duplicate strings in your list? Maybe storing only unique strings will help reduce the memory size. See my question about a string pool for a possible (but maybe too simple) answer.
Are you sure you don't suffer from a case of memory fragmentation?
Be sure to use the latest FastMM (currently 4.97), then take a look at the UsageTrackerDemo demo that contains a memory map form showing the actual usage of the Delphi memory.
Finally take a look at VMMap that shows you how your process memory is used.

Using PARSE on a PORT! value

I tried using PARSE on a PORT! and it does not work:
>> parse open %test-data.r [to end]
** Script error: parse does not allow port! for its input argument
Of course, it works if you read the data in:
>> parse read open %test-data.r [to end]
== true
...but it seems it would be useful to be able to use PARSE on large files without first loading them into memory.
Is there a reason why PARSE couldn't work on a PORT! ... or is it merely not implemented yet?
The easy answer is: no, we can't...
The way PARSE works, it may need to roll back to a prior part of the input string (which might in fact be the head of the complete input) when it meets the last character of the stream.
Ports copy their data to a string buffer as they get their input, so in fact there is never any "prior" string for PARSE to roll back to. It's like quantum physics... just by looking at it, it's not there anymore.
But as you know, in Rebol "no" isn't an answer. ;-)
That being said, there is a way to parse data from a port as it's being grabbed, but it's a bit more work.
What you do is use a buffer, and
APPEND buffer COPY/part connection amount
Depending on your data, amount could be 1 byte or 1 kB; use what makes sense.
Once the new input is added to your buffer, parse it, and add logic to know if you matched part of that buffer.
If something positively matched, you remove/part what matched from the buffer and continue parsing until nothing parses any more.
You then repeat the above until you reach the end of the input.
I've used this in a real-time EDI TCP server which has an "always on" TCP port, in order to break up a (potentially) continuous stream of input data that actually piggybacks messages end to end.
Some details:
The best way to set up this system is to use /no-wait and loop until the port closes (you receive none instead of "").
Also make sure you have a way of checking for data-integrity problems (like a skipped byte or an erroneous message) when you are parsing; otherwise, you will never reach the end.
In my system, when the buffer grew beyond a specific size, I tried an alternate rule which skipped bytes until a pattern might be found further down the stream. If one was found, an error was logged, the partial message was stored, and an alert was raised for the sysadmin to sort out the message.
HTH !
I think Maxim's answer is good enough. At this moment PARSE on a port is not implemented. I don't think it's impossible to implement later, but we must solve other issues first.
Also, as Maxim says, you can do it even now, but it very much depends on what exactly you want to do.
You can certainly parse large files without reading them completely into memory. It's always good to know what you expect to parse. For example, all large files, like music and video files, are divided into chunks, so you can just use copy|seek to get those chunks and parse them.
Or if you want to get just the titles of multiple web pages, you can read, let's say, the first 1024 bytes and look for the title tag there; if that fails, read more bytes and try again...
That's exactly what would have to be done to allow PARSE on a port natively anyway.
And feel free to add a WISH in the CureCode database: http://curecode.org/rebol3/

How to handle billions of objects without "Outofmemory" error

I have an application which may need to process billions of objects. Each object is of the TRange class type. These ranges are created at different parts of an algorithm, depending on certain conditions and other object properties. As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects. If I create all the (billions of) objects and add them to the collection, the system throws an Out of memory error. Now I want to iterate through each object, mainly for two purposes:
To apply an operation to each TRange object (e.g. output certain properties).
To get a cumulative sum of a certain property (e.g. each range has a weight property and I want to retrieve the total weight, that is, the sum of all the range weights).
How do I effectively create an iterator for these objects without raising Out of memory?
I have handled the first case by passing a function pointer to the algorithm function. For example:
procedure createRanges(aProc: TRangeProc);//aProc is a pointer to function that takes a //TRange
var range: TRange;
rangerec: TRangeRec;
begin
range:=TRange.Create;
try
while canCreateRange do begin//certain conditions needed to create a range
rangerec := ReturnRangeRec;
range.Update(rangerec);//don't create new, use the same object.
if Assigned(aProc) then aProc(range);
end;
finally
range.Free;
end;
end;
But the problem with this approach is that, to add new functionality, say to retrieve the total weight I mentioned earlier, I either have to duplicate the algorithm function or pass an optional out parameter. Please suggest some ideas.
Thank you all in advance
Pradeep
For such large amounts of data you need to keep only a portion of the data in memory. The other data should be serialized to the hard drive. I tackled such a problem like this:
I created an extended storage that can keep a custom record either in memory or on the hard drive. This storage has a maximum number of records that can live simultaneously in memory.
Then I derived the record classes from the custom record class. These classes know how to store and load themselves from the hard drive (I use streams).
Every time you need a new or an already existing record, you ask the extended storage for it. If the maximum number of objects is exceeded, the storage streams some of the least-used records back to the hard drive.
This way the records are transparent: you always access them as if they were in memory, but they may get loaded from the hard drive first. It works really well. By the way, RAM works in a very similar way: it only holds a certain subset of all the data on your hard drive. This is your working set.
I did not post any code because it is beyond the scope of the question itself and would only confuse.
Look at TgsStream64. This class can handle huge amounts of data through file mapping.
http://code.google.com/p/gedemin/source/browse/trunk/Gedemin/Common/gsMMFStream.pas
But the problem with this approach is that to add a new functionality, say to retrieve the Total weight I have mentioned earlier, either I have to duplicate the algorithm function or pass an optional out parameter.
It's usually done like this: you write an enumerator function (like you did) which receives a callback function pointer (you did that too) and an untyped pointer ("Data: pointer"). You declare the callback type so that its first parameter is that same untyped pointer:
type
  TRangeProc = procedure(Data: pointer; range: TRange);

procedure enumRanges(aProc: TRangeProc; Data: pointer);
begin
  {for each range}
  aProc(Data, range);
end;
Then if you want to, say, sum all ranges, you do it like this:
type
  TSumRecord = record
    Sum: int64;
  end;
  PSumRecord = ^TSumRecord;

procedure SumProc(SumRecord: PSumRecord; range: TRange);
begin
  SumRecord^.Sum := SumRecord^.Sum + range.Value;
end;

function SumRanges(): int64;
var SumRec: TSumRecord;
begin
  SumRec.Sum := 0;
  enumRanges(TRangeProc(@SumProc), @SumRec);
  Result := SumRec.Sum;
end;
Anyway, if you need to create billions of ANYTHING, you're probably doing it wrong (unless you're a scientist modelling something extremely large-scale and detailed). Even more so if you need to create billions of objects every time you want one of them. This is never good. Try to think of alternative solutions.
"Runner" has a good answer on how to handle this!
But I would like to know if you could do a quick fix: make smaller TRange objects.
Maybe you have a big ancestor class? Can you take a look at the instance size of a TRange object?
Maybe you'd be better off using packed records?
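A quick way to check that is InstanceSize, which reports how many bytes each object body takes (heap data such as strings or dynamic arrays referenced by its fields are counted separately):
// Bytes occupied by each TRange instance proper
Writeln('TRange.InstanceSize = ', TRange.InstanceSize);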
This part:
"As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects."
sounds a bit like calculating Fibonacci numbers. Maybe you can reuse some of the TRange objects instead of creating redundant copies? Here is a C++ article describing this approach; it works by storing already-calculated intermediate results in a hash map.
Handling billions of objects is possible, but you should avoid it as much as possible. Do this only if you absolutely have to...
I once created a system that needed to handle a huge amount of data. To do so, I made my objects "streamable" so I could read/write them to disk. A larger class around it was used to decide when an object would be saved to disk and thus removed from memory. Basically, when I accessed an object, this class would check whether it was loaded or not. If not, it would re-create the object from disk, put it on top of a stack, and then move/write the bottom object of this stack to disk. As a result, my stack had a fixed (maximum) size. And it allowed me to use an unlimited number of objects, with reasonably good performance too.
Unfortunately, I don't have that code available anymore. I wrote it for a previous employer about 7 years ago. I do know that you would need to write a bit of code for the streaming support, plus a bunch more for the stack controller which maintains all those objects. But it technically allows you to create an unlimited number of objects, since you're trading RAM for disk space.
