Delphi: how to dynamically "split" a string into substrings according to a (dynamic) mask

This is my situation: I have a text file containing a lot of equal-length strings representing records to be loaded into an SQL DB table, so I'll have to generate SQL code from these strings.
I then have a table in that DB (let's call it the "formatting table") that tells me how the strings are formatted and where to load them (each record of that table contains the destination table name, the field name, and the data position and length within the strings from the text file).
I have already solved the problem in a way that I think is well known to every Delphi programmer: using the Copy(string, pos, length) function and iterating through each field, based on the information from the "formatting table".
That works well, but it's slow, especially with source text files of a million or more lines, each representing several tens or even hundreds of data fields.
What I'm trying to do now is to "see" the source strings as if they were already split, avoiding the Copy() function, which continuously creates new strings by copying content from the original string, allocating and freeing memory, and so on. What I'd say is "I have the whole string, let's view it in a way that represents each 'piece' (field) of it in a single step, without creating substrings from it".
What could solve my problem would be some way to define a dynamic structure like a dynamic record or a dynamic array (not what Delphi calls a dynamic array, more something like a "dynamic static array") to "superimpose" on the string in order to "watch" it from that point of view... I don't know whether that explanation is sufficiently clear... However, Delphi (to my knowledge) doesn't implement such dynamic structures.
This is a piece of (static) code that does what I want, apart from the lack of dynamism.
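procedure TForm1.FormCreate(Sender: TObject);
type
  PDecodeStr = ^TDecodeStr;
  TDecodeStr = record
    s1: Array[0..3] of AnsiChar;
    s2: Array[0..9] of AnsiChar;
    s3: Array[0..4] of AnsiChar;
    s4: Array[0..7] of AnsiChar;
    s5: Array[0..2] of AnsiChar;
  end;
var
  cWholeStr: AnsiString;
begin
  cWholeStr := '123456789012345678901234567890';
  // Overlay the record on the string's character data and show one field
  // (the memo name Memo1 is assumed here).
  with PDecodeStr(@cWholeStr[1])^ do
    Memo1.Lines.Add(s1);
end;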
Any idea on how to solve this problem?
Thanks in advance.

You can't really avoid creating extra strings. Your example at the end of your question creates strings.
Your call to TStrings.Add() in this code creates a dynamic string implicitly from the parameter you pass and then this string is passed to Add().
The solution with Copy is probably the way to go since I don't see any easy way to avoid the copying of memory if you wish to do anything with the split strings.
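For reference, the Copy-based approach that the question describes and this answer recommends can be sketched as below. This is only a minimal illustration: the TFieldDef record, the SplitLine function and the hard-coded widths are hypothetical stand-ins for whatever is read from the "formatting table", not code from the original program.

```pascal
program SplitByMask;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // Hypothetical in-memory form of one row of the "formatting table".
  TFieldDef = record
    Start: Integer;   // 1-based position within the line
    Len: Integer;     // field length in characters
  end;
  TFieldValues = array of AnsiString;

// Cuts one fixed-width line into its fields with Copy, one field per definition.
function SplitLine(const Line: AnsiString; const Defs: array of TFieldDef): TFieldValues;
var
  i: Integer;
begin
  SetLength(Result, Length(Defs));
  for i := 0 to High(Defs) do
    Result[i] := Copy(Line, Defs[i].Start, Defs[i].Len);
end;

const
  // Same field widths as the static record in the question.
  Widths: array[0..4] of Integer = (4, 10, 5, 8, 3);
var
  Defs: array[0..4] of TFieldDef;
  Fields: TFieldValues;
  i, NextStart: Integer;
begin
  // Build the field definitions once; they are reused for every line.
  NextStart := 1;
  for i := 0 to High(Widths) do
  begin
    Defs[i].Start := NextStart;
    Defs[i].Len := Widths[i];
    Inc(NextStart, Widths[i]);
  end;

  Fields := SplitLine('123456789012345678901234567890', Defs);
  for i := 0 to High(Fields) do
    Writeln(Fields[i]);   // 1234, 5678901234, 56789, 01234567, 890
end.
```

Building the definitions once and reusing them for every line keeps the per-line cost down to the Copy calls themselves.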

I don't think there is a much more efficient way in Delphi than to use Copy.
But another solution is to load all the strings directly into a one-column temporary table and then do the split with an SQL query.
The total time depends on a lot of parameters, so the best way to find out is to test!


Passing open array into an anonymous function

What is the least wasteful way (i.e. avoiding copying if at all possible) to pass the content of an open string array into an anonymous function and from there into another function that expects an open array?
The problem is that open arrays cannot be captured in anonymous functions in Delphi XE2.
This illustrates the problem:
procedure TMyClass.DoSomething(const aStrings: array of string);
function (aItem: string) : Boolean
Result := IndexText(aItem, aStrings) >= 0;
The compiler complains: "Cannot capture symbol 'aStrings'".
An obvious solution is to make a copy of aStrings in a dynamic array and capture that. But I don't want to make a copy. (While my specific problem involves a string array and making a copy would only copy the pointers not the string data itself due to reference counting, it would also be instructive to learn how to solve the problem for an arbitrarily large array of a non-reference counted type.)
So I tried capturing instead a PString pointer to the first string in aStrings and an Integer value of the length. But then I couldn't figure out a way to pass these to IndexText.
One other constraint: I want it to be possible to call DoSomething([a, b, c]).
Is there a way to do this without making a copy of the array, and without writing my own version of IndexText, and without being hideously ugly? If so, how?
For the sake of this question I've used IndexText, but it would be instructive to find a solution for a function that could not be trivially rewritten to accept a pointer and length parameter instead of an open array.
An acceptable answer to this question would be: No, you can't do that (at least not without making a copy or rewriting IndexText) though if so I'd also like to know the fundamental reason why not.
If you don't want to copy the array then you should change the signature of DoSomething to take a TArray<string> instead. You will of course have to change the call sites that pass the elements directly, as in DoSomething([a, b, c]), because only since XE7 can dynamic arrays be passed in the same way.
My advice is not to mess around with some internal pointers and stuff, especially not for an open array.
There's no way to do this without making a copy. Open arrays cannot be captured as you have found, and you cannot get the information into the anonymous method without capture. You must capture, in general, because you need to extend the life of the variable.
So, you cannot do this with an open array and avoid a copy. You could instead:
Switch from open array to a dynamic array, TArray<string>.
Make a copy of the array. You would not be copying the string data, just the array of references to the strings.
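The "make a copy" option could look roughly like the sketch below. It is Delphi-specific (anonymous methods and generics, so XE2 is fine), and the names TStringPredicate and MakeMatcher are made up for illustration; they are not part of the question's code.

```pascal
program CaptureCopy;
{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.StrUtils;

type
  // Hypothetical predicate type used only for this illustration.
  TStringPredicate = reference to function(const aItem: string): Boolean;

// Copies the open array into a dynamic array and returns an anonymous
// function that captures the copy instead of the open array parameter.
function MakeMatcher(const aStrings: array of string): TStringPredicate;
var
  Captured: TArray<string>;
  i: Integer;
begin
  SetLength(Captured, Length(aStrings));
  for i := 0 to High(aStrings) do
    Captured[i] := aStrings[i];   // copies string references, not character data
  Result :=
    function(const aItem: string): Boolean
    begin
      // A dynamic array can be passed where an open array is expected.
      Result := IndexText(aItem, Captured) >= 0;
    end;
end;

var
  IsKnown: TStringPredicate;
begin
  IsKnown := MakeMatcher(['alpha', 'beta', 'gamma']);
  Writeln(IsKnown('BETA'));    // TRUE, because IndexText ignores case
  Writeln(IsKnown('delta'));   // FALSE
end.
```

Because the captured dynamic array is reference-counted, it stays alive for as long as the anonymous function does, which is exactly the lifetime extension an open array cannot provide.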

Displaying the result of mmioRead

After locating the data chunk using mmioDescend, how am I supposed to read the sample data and display it, for example in a memo, in Delphi 7?
I have followed the steps: opening the file, locating the RIFF chunk, locating the fmt chunk, and locating the data chunk.
if (mmioDescend(HMMIO, @ckiData, @ckiRIFF, MMIO_FINDCHUNK) = MMSYSERR_NOERROR) then
begin
  SetLength(buf, ckiData.cksize);
  mmioRead(HMMIO, PAnsiChar(buf), ckiData.cksize);
end;
I use mmioRead too, but I don't know how to display the data. Can anyone give an example of how to use mmioRead and then display the result?
Well, I'd probably read into a buffer that was declared using a more appropriate type.
For example, suppose your data are 16 bit integers, Smallint in Delphi. Then declare a dynamic array of Smallint.
buf: array of Smallint;
Then allocate enough space for the data:
Assert(ckiData.cksize mod SizeOf(buf[0])=0);
SetLength(buf, ckiData.cksize div SizeOf(buf[0]));
And then read the buffer:
mmioRead(HMMIO, PAnsiChar(buf), ckiData.cksize);
Now you can access the elements as Smallint values.
If you have different element types, then you can adjust your array declaration. If you don't know until runtime what the element type is you may be better off with array of Byte and then using pointer arithmetic and casting to access the actual content.
I'd say that the design of the interface to mmioRead is a little weak. The buffer isn't really a string. It's probably best considered as a byte array. But perhaps because C does not have separate byte and character types, the function is declared as taking a pointer to char array. Really the Delphi translation would be better exposing a pointer to byte or even better in my view, a plain untyped Pointer type.
I assumed that you were struggling with interpreting the output of mmioRead since that was the code that you included in the question. But, according to now deleted comments, your question is a GUI question.
You want to add content to a memo. Do it like this:
for i := low(buf) to high(buf) do
  Memo1.Lines.Add(IntToStr(buf[i]));
If you want to convert to floating point then, still assuming 16 bit signed data, do this:
for i := low(buf) to high(buf) do
  Memo1.Lines.Add(FormatFloat('0.00000', buf[i]/32768.0)); //show 5dp
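For the case mentioned above where the element type is only known at run time, a byte buffer plus a cast might be handled along these lines. This is a sketch under assumptions: the data is taken to be 16 bit signed little-endian PCM, the FakeChunk constant merely stands in for bytes that mmioRead would have delivered, and SampleAt is a hypothetical helper name.

```pascal
program RawSamples;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Reads the Index-th 16 bit signed sample from a raw byte buffer.
// Assumes little-endian data, which matches the byte order of x86 targets.
function SampleAt(const Raw: array of Byte; Index: Integer): Smallint;
begin
  Result := PSmallint(@Raw[Index * SizeOf(Smallint)])^;
end;

const
  // Stand-in for the bytes read from the data chunk (made-up values).
  FakeChunk: array[0..7] of Byte = ($00, $40, $00, $C0, $FF, $7F, $00, $80);
var
  i: Integer;
begin
  // Integer sample values: 16384, -16384, 32767, -32768
  for i := 0 to (Length(FakeChunk) div SizeOf(Smallint)) - 1 do
    Writeln(SampleAt(FakeChunk, i), ' -> ',
      FormatFloat('0.00000', SampleAt(FakeChunk, i) / 32768.0));
end.
```

In a form-based program the Writeln call would be replaced by adding the formatted text to the memo's Lines.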

Delphi Unicode String Type Stored Directly at its Address (or "Unicode ShortString")

I want a string type that is Unicode and that stores the string directly at the address of the variable, as is the case with the (Ansi-only) ShortString type.
I mean, if I declare S: ShortString and let S := 'My String', then, at @S, I will find the length of the string (as one byte, so the string cannot contain more than 255 characters) followed by the ANSI-encoded string itself.
What I would like is a Unicode variant of this. That is, I want a string type such that, at @S, I will find an unsigned 32-bit integer (or a single byte would be enough, actually) containing the length of the string in bytes (or in characters, which is half the number of bytes) followed by the Unicode representation of the string. I have tried WideString, UnicodeString, and RawByteString, but they all appear only to store an address at @S, and the actual string somewhere else (I guess this has to do with reference counting and such). Update: The most important reason for this is probably that it would be very problematic if sizeof(string) were variable.
I suspect that there is no built-in type to use, and that I have to come up with my own way of storing text the way I want (which actually is fun). Am I right?
I will, among other things, need to use these strings in packed records. I also need to read/write these strings manually to files/the heap. I could live with fixed-size strings, such as <= 128 characters, and I could redesign the problem so it will work with null-terminated strings. But PChar will not work, for sizeof(PChar) is only the size of a pointer - it's merely an address.
The approach I eventually settled for was to use a static array of bytes. I will post my implementation as a solution later today.
You're right. There is no exact analogue to ShortString that holds Unicode characters. There are lots of things that come close, including WideString, UnicodeString, and arrays of WideChar, but if you're not willing to revisit the way you intend to use the data type (make byte-for-byte copies in memory and in files while still using them in all the contexts where a string would be allowed), then none of Delphi's built-in types will work for you.
WideString fails because you insist that the string's length must exist at the address of the string variable, but WideString is a reference type; the only thing at its address is another address. Its length happens to be at the address held by the variable, minus four. That's subject to change, though, because all operations on that type are supposed to go through the API.
UnicodeString fails for that same reason, as well as because it's a reference-counted type; making a byte-for-byte copy of one breaks the reference counting, so you'll get memory leaks, invalid-pointer-operation exceptions, or more subtle heap corruption.
An array of WideChar can be copied without problems, but it doesn't keep track of its effective length, and it also doesn't act like a string very often. You can assign string literals to it and it will act like you called StrLCopy, but you can't assign string variables to it.
You could define a record that has a field for the length and another field for a character array. That would resolve the length issue, but it would still have all the rest of the shortcomings of an undecorated array.
If I were you, I'd simply use a built-in string type. Then I'd write functions to help transfer it between files, blocks of memory, and native variables. It's not that hard; probably much easier than trying to get operator overloading to work just right with a custom record type. Consider how much code you will write to load and store your data versus how much code you're going to write that uses your data structure like an ordinary string. You're going to write the data-persistence code once, but for the rest of the project's lifetime, you're going to be using those strings, and you're going to want them to look and act just like real strings. So use real strings. "Suffer" the inconvenience of manually producing the on-disk format you want, and gain the advantage of being able to use all the existing string library functions.
PChar should work like this, right? AFAIK, it's an array of chars stored right where you put it. Zero terminated, not sure how that works with Unicode Chars.
You actually have this in some way with the new unicode strings.
s as a pointer points to s[1] and the 4 bytes on the left contains the length.
But why not simply use Length(s)?
And for direct reading of the length from memory:
procedure TForm9.Button1Click(Sender: TObject);
var
  s: string;
begin
  s := 'hlkk ljhk jhto';
  {$POINTERMATH ON}
  Assert(Length(s) = (PInteger(s)-1)^);
  //if you don't want POINTERMATH, replace by PInteger(Cardinal(s)-SizeOf(Integer))^
end;
There's no Unicode version of ShortString. If you want to store unicode data inline inside an object instead of as a reference type, you can allocate a buffer:
buffer = array[0..255] of WideChar;
This has two disadvantages. 1, the size is fixed, and 2, the compiler doesn't recognize it as a string type.
The main problem here is #1: The fixed size. If you're going to declare an array inside of a larger object or record, the compiler needs to know how large it is in order to calculate the size of the object or record itself. For ShortString this wasn't a big problem, since they could only go up to 256 bytes (1/4 of a K) total, which isn't all that much. But if you want to use long strings that are addressed by a 32-bit integer, that makes the max size 4 GB. You can't put that inside of an object!
This, not the reference counting, is why long strings are implemented as reference types, whose inline size is always a constant sizeof(pointer). Then the compiler can put the string data inside a dynamic array and resize it to fit the current needs.
Why do you need to put something like this into a packed array? If I were to guess, I'd say this probably has something to do with serialization. If so, you're better off using a TStream and a normal Unicode string, and writing an integer (size) to the stream, and then the contents of the string. That turns out to be a lot more flexible than trying to stuff everything into a packed array.
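The stream approach described above (write the length, then the characters) might be sketched like this. The routine names WriteUnicodeString and ReadUnicodeString are hypothetical, and a 32-bit character count as the length prefix is just one possible on-disk layout.

```pascal
program StreamString;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Writes a 32-bit character count followed by the UTF-16 character data.
procedure WriteUnicodeString(Stream: TStream; const S: UnicodeString);
var
  Len: Integer;
begin
  Len := Length(S);
  Stream.WriteBuffer(Len, SizeOf(Len));
  if Len > 0 then
    Stream.WriteBuffer(S[1], Len * SizeOf(WideChar));
end;

// Reads back a string stored by WriteUnicodeString.
function ReadUnicodeString(Stream: TStream): UnicodeString;
var
  Len: Integer;
begin
  Stream.ReadBuffer(Len, SizeOf(Len));
  SetLength(Result, Len);
  if Len > 0 then
    Stream.ReadBuffer(Result[1], Len * SizeOf(WideChar));
end;

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteUnicodeString(MS, 'Ident');
    Writeln(MS.Size);                          // 14: 4 length bytes + 5 * 2 data bytes
    MS.Position := 0;
    Writeln(ReadUnicodeString(MS) = 'Ident');  // TRUE
  finally
    MS.Free;
  end;
end.
```

The in-memory variable stays an ordinary string, so all the usual string routines keep working; only the two helper routines know about the stored layout.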
The solution I eventually settled for is this (real-world sample - the string is, of course, the third member called "Ident"):
type
  TASStructMemHeader = packed record
    TotalSize: cardinal;
    MemType: TASStructMemType;
    Ident: packed array[0..63] of WideChar;
    DataSize: cardinal;
    procedure SetIdent(const AIdent: string);
    function ReadIdent: string;
  end;

function TASStructMemHeader.ReadIdent: string;
begin
  result := WideCharLenToString(PWideChar(@(Ident[0])), length(Ident));
end;

procedure TASStructMemHeader.SetIdent(const AIdent: string);
var
  i: Integer;
begin
  if length(AIdent) > 63 then
    raise Exception.Create('Too long structure identifier.');
  FillChar(Ident[0], length(Ident) * sizeof(WideChar), 0);
  Move(AIdent[1], Ident[0], length(AIdent) * sizeof(WideChar));
end;
But then I realized that the compiler really can interpret array[0..63] of WideChar as a string, so I could simply write
MyStr: string;
Ident := 'This is a sample string.';
MyStr := Ident;
Hence, after all, the answer given by Mason Wheeler above is actually the answer.

Delphi; performance of passing const strings versus passing var strings

Quick one; am I right in thinking that passing a string to a method 'as a CONST' involves more overhead than passing a string as a 'VAR'? The compiler will get Delphi to make a copy of the string and then pass the copy, if the string parameter is declared as a CONST, right?
The reason for the question is a bit tedious; we have a legacy Delphi 5 utility whose days are truly numbered (the replacement is under development). It does a large amount of string processing, frequently passing 1-2Kb strings between various functions and procedures. Throughout the code, the 'correct' observation of using CONST or VAR to pass parameters (depending on the job in hand) has been adhered to. We're just looking for a few 'quick wins' that might shave a few microseconds off the execution time, to tide us over until the new version is ready. We thought of changing the memory manager from the default Delphi 5 one to FastMM, and we also wondered if it was worth altering the way the strings are passed around - because the code is working fine with the strings passed as const, we don't see a problem if we changed those declarations to var - the code within that method isn't going to change the string.
But would it really make any difference in real terms? (The program really just does a large amount of processing on these 1kb+ish strings; several hundred strings a minute at peak times). In the re-write these strings are being held in objects/class variables, so they're not really being copied/passed around in the same way at all, but in the legacy code it's very much 'old school' pascal.
Naturally we'll profile an overall run of the program to see what difference we've made but there's no point in actually trying this if we're categorically wrong about how the string-passing works in the first instance!
No, there shouldn't be any performance difference between using const or var in your case. In both cases a pointer to the string is passed as the parameter. If the parameter is const the compiler simply disallows any modifications to it. Note that this does not preclude modifications to the string if you get tricky:
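procedure TForm1.Button1Click(Sender: TObject);
var
  s: string;
begin
  s := 'foo bar baz';
  UniqueString(s); // make sure s has its own writable buffer, not the literal's
  SetConstCaption(s);
  Caption := s;
end;

procedure TForm1.SetConstCaption(const AValue: string);
var
  P: PChar;
begin
  P := PChar(AValue);
  P[3] := '?';
  Caption := AValue;
end;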
This will actually change the local string variable in the calling method, proof that only a pointer to it is passed.
But definitely use FastMM4, it should have a much bigger performance impact.
const for parameters in Delphi essentially means "I'm not going to mutate this, and I also don't care if this is passed by value or by reference - whichever is most efficient is fine by me". The second half of that statement is important, because it is actually observable. Consider this code:
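type TFoo = record
  x: integer;
  //dummy: array[1..10] of integer;
end;
procedure Foo(var x1: TFoo; const x2: TFoo);
begin
  Inc(x1.x); // mutate through the var parameter...
  WriteLn(x1.x, ' ', x2.x); // ...and see whether the const parameter changed too
end;
var x: TFoo;
begin
  Foo(x, x);
end.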
The trick here is that we pass the same variable both as var and as const, so that our function can mutate via one argument, and see if this affects the other. If you try it with code above, you'll see that incrementing x1.x inside Foo doesn't change x2.x, so x2 was passed by value. But try uncommenting the array declaration in TFoo, so that its size becomes larger, and running it again - and you'll see how x2.x now aliases x1.x, so we have pass-by-reference for x2 now!
To sum it up, const is always the most efficient way to pass parameter of any type, but you should not make any assumptions about whether you have a copy of the value that was passed by the caller, or a reference to some (potentially mutated by other code that you may call) location.
This is really a comment, but a long one so bear with me.
About 'so called' string passing by value
Delphi always passes string and ansistring (WideStrings and ShortStrings excluded) by reference, as a pointer.
So strings are never passed by value.
This can be easily tested by passing 100MB strings around.
As long as you don't change them inside the body of the called routine string passing takes O(1) time (and with a small constant at that)
However when passing a string without var or const clause, Delphi does three things.
Increase the reference count of the string.
put an implicit try-finally block around the procedure, so the reference count of the string parameter gets decreased again when the method exits.
When the string gets changed (and only then) Delphi makes a copy of the string, decreases the reference count of the passed string and uses the copy in the rest of the routine.
It fakes a pass by value in doing this.
About passing by reference (pointer)
When the string is passed as a const or var, Delphi also passes a reference (pointer), however:
The reference count of the string does not increase. (tiny, tiny speed increase)
No implicit try/finally is put around the routine, because it is not needed. This is part 1 why const/var string parameters execute faster.
When the string is changed inside the routine, no copy is made; the actual string is changed. For const parameters the compiler prohibits string alterations. This is part 2 of why var/const string parameters work faster.
If however you need to create a local var to assign the string to; Delphi copies the string :-) and places an implicit try/finally block eliminating 99%+ of the speed gain of a const string parameter.
Hope this sheds some light on the issue.
Disclaimer: Most of this info comes from here, here and here
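One way to observe the reference-count difference described above is to print the count inside the called routine. This is a rough sketch, not a benchmark: it relies on StringRefCount, which exists in newer Delphi versions and in Free Pascal but not in Delphi 5, and the routine names RefsByValue and RefsByConst are made up for the example.

```pascal
program ConstVsValue;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Value parameter: the callee holds its own reference while it runs.
function RefsByValue(S: string): Integer;
begin
  Result := StringRefCount(S);
end;

// Const parameter: no extra reference is taken for the call.
function RefsByConst(const S: string): Integer;
begin
  Result := StringRefCount(S);
end;

var
  S: string;
begin
  // Built at run time so that it is heap-allocated and really reference-counted
  // (string literals are handled differently).
  S := StringOfChar('x', 1024);
  Writeln('in caller:  ', StringRefCount(S));
  Writeln('by value:   ', RefsByValue(S));
  Writeln('by const:   ', RefsByConst(S));
end.
```

If the description above holds for the compiler in use, the value-parameter line should report one more reference than the other two; either way, no character data is copied by any of the calls.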
The compiler won't make a copy of the string when using const afaik. Using const saves you the overhead of incrementing/decrementing the refcounter for the string that you use.
You will get a bigger performance boost by upgrading the memory manager to FastMM, and, because you do a lot with strings, consider using the FastCode library.
Const is already the most efficient way of passing parameters to a function. It avoids creating a copy (default, by value) or even passing a pointer (var, by reference).
It is particularly true for strings and was indeed the way to go when computing power was limited and not to be wasted (hence the "old school" label).
IMO, const should have been the default convention, leaving it up to the programmer to change it to by value or by var when really needed. That would have been more in line with the overall safety of Pascal (as in limiting the opportunity of shooting oneself in the foot).
My 2¢...

Are Delphi strings immutable?

As far as I know, strings are immutable in Delphi. I kind of understand that means if you do:
string1 := 'Hello';
string1 := string1 + ' World';
first string is destroyed and you get a reference to a new string "Hello World".
But what happens if you have the same string in different places around your code?
I have a string hash assigned for identifying several variables, so for example a "change" is identified by a hash value of the properties of that change. That way it's easy for me to check two "changes" for equality.
Now, each hash is computed separately (not all the properties are taken into account, so that two separate instances can be equal even if they differ on some values).
The question is, how does Delphi handle those strings? If I compute two separate hashes that result in the same 10 byte length string, what do I get? Two memory blocks of 10 bytes or two references to the same memory block?
Clarification: A change is composed by some properties read from the database and is generated by an individual thread. The TChange class has a GetHash method that computes a hash based on some of the values (but not all) resulting on a string. Now, other threads receive the Change and have to compare it to previously processed changes so that they don't process the same (logical) change. Hence the hash and, as they have separate instances, two different strings are computed. I'm trying to determine if it'd be a real improvement to change from string to something like a 128 bit hash or it'll be just wasting my time.
Edit: Version of Delphi is Delphi 7.0
Delphi strings are copy on write. If you modify a string (without using pointer tricks or similar techniques to fool the compiler), no other references to the same string will be affected.
Delphi strings are not interned. If you create the same string from two separate sections of code, they will not share the same backing store - the same data will be stored twice.
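Both points can be checked by comparing the string variables' internal pointers, as in the short sketch below. The SameBuffer helper is a hypothetical name introduced only for this illustration.

```pascal
program CopyOnWrite;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// True when both variables refer to the very same character buffer.
function SameBuffer(const X, Y: string): Boolean;
begin
  Result := Pointer(X) = Pointer(Y);
end;

var
  A, B, C: string;
begin
  A := StringOfChar('h', 10);   // built at run time, so it lives on the heap
  B := A;                       // assignment shares the buffer
  C := StringOfChar('h', 10);   // same contents, built separately

  Writeln(SameBuffer(A, B));    // TRUE:  one buffer, two references
  Writeln(SameBuffer(A, C));    // FALSE: no interning, the data is stored twice
  Writeln(A = C);               // TRUE:  comparison looks at the contents

  B[2] := 'a';                  // copy-on-write: B gets its own buffer first
  Writeln(SameBuffer(A, B));    // FALSE
  Writeln(A);                   // hhhhhhhhhh (A is unaffected)
end.
```

So two hashes computed separately occupy two memory blocks even when their contents are equal, while comparing them with = still works on the contents.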
Delphi strings are not immutable (try: string1[2] := 'a') but they are reference-counted and copy-on-write.
The consequences for your hashes are not clear, you'll have to detail how they are stored etc.
But a hash should only depend on the contents of a string, not on how it is stored. That makes the whole question moot, unless you can explain it better.
As others have said, Delphi strings are not generally immutable. Here are a few references on strings in Delphi.
The Delphi version may be important to know. The good old Delphi BCL handles strings as copy-on-write, which basically means that a new instance is created when something in the string is changed. So yes, they are more or less immutable.
