I have a TMemo that displays text from a query. I would like to remove all chars between '{' and '}' so this string '{color:black}😊{color}{color:black}{color}' would end up like this 😊.
MemoComments.Lines.Text := StringReplace(MemoComments.Lines.Text, '{'+ * +'}', '', rfReplaceAll);
I know that the * in my code is wrong. It's just a placeholder. How can I do this the right way?
Is this possible, or do I have to create a complicated loop?
This is a case where you can use a regular expression. I trust someone will publish such an answer for you very shortly.
However, just for the sake of completeness, I want to show that a loop-based approach isn't complicated at all, but rather straightforward:
function ExtractContent(const S: string): string;
var
i, c: Integer;
InBracket: Boolean;
begin
SetLength(Result, S.Length);
InBracket := False;
c := 0;
for i := 1 to S.Length do
begin
if S[i] = '{' then
InBracket := True
else if S[i]= '}' then
InBracket := False
else if not InBracket then
begin
Inc(c);
Result[c] := S[i];
end;
end;
SetLength(Result, c);
end;
Notice that I avoid unnecessary heap allocations.
(Personally, I have never been a huge fan of regular expressions. To me, the correctness of the above algorithm is obvious, it can only be interpreted in one way, and it is clearly written in a performant way. A regex, on the other hand, is a bit more like "magic". But I am a bit of a dinosaur, I admit that.)
Looks like you want a sort of regular expression, which Delphi fortunately offers in their RTL.
s := TRegEx.Replace('{color:black}😊{color}{color:black}{color}', '{.*?}', '', []);
or using the memo:
MemoComments.Lines.Text := TRegEx.Replace(MemoComments.Lines.Text, '{.*?}', '', []);
In this expression, {.*?}, .*? means any number (*) of any character (.), but as few as possible to match the rest of the expression (*?). That last bit is very powerful. By default, regexes are 'greedy', which means that .* would just match as many characters as possible, so it would take everything up to the last }, including the smiley and all the other color codes in between.
Pitfalls/cons
Like Andreas, I'm not a huge fan of regular expressions either. The awkward syntax can be hard to decypher, especially if you don't use them a lot.
Also, a seemingly simple regex can be hard to execute making it actually very slow sometimes, especially when working with larger strings. I recently bumped into one that was so magical, it was stuck for minutes on verifying whether a string of about 1000 characters matched a certain pattern.
The used expression is actually an example of that. It will have to look forward after the .*? part, to check whether it can satisfy the rest of the expression already. If not, go back, take another character, and look forward again. For this expression that's not an issue, but if an expression has multiple parts of variable length, this can be a CPU intensive process!
My earlier version, {[^}]*} is, theoretically at least, more efficient, because instead of any character, it just matches all characters that are not a }. Easier to execute, but harder to read. In the answer above I went for readability over performance, but it's always something to keep in mind.
Note that my first version, \{[^\}]*\} looked even more convoluted. I was using \ to escape the brackets, since they also have a special meaning for grouping, but it doesn't seem necessary in this case.
Lastly, there are different regex dialects, which is not helpful either.
That said
Fortunately Delphi wraps the PCRE library, which is open source, highly optimized, well maintained, well documented, and implements the most commonly used dialect.
And for operations like this they can be brief and easy to write, fast enough to use, and if you use them more often, it also becomes easier to read and write them, especially if you use a tool like regex101.com, where you can try out and debug regexes.
Related
This question is to discuss how to code a spell corrector and is not duplicate of
Delphi Spell Checker component.
Two years ago, I found and used the code of spell corrector by Peter Norvig at his website in Python. But the performance seemed not high. Quite interestingly, more languages that implement the same task have been appended in his webpage list recently.
Some lines in Peter's page include syntax like:
[a + c + b for a, b in splits for c in alphabet]
How to translate it into delphi?
I am interested in how Delphi expert at SO will use the same theory and do the same task with some suitable lines and possible mediocre or better performance. This is not to downvote any language but to learn to compare how they implement the task differently.
Thanks so much in advance.
[Edit]
I will quote Marcelo Toledo who contributes C version, as saying "...While the purpose of this article [C version] was to show the algorithms, not to highlight Python...". Though his C version is with the second most lines, according to his article, his version is high performance when the dictionary file is huge. So this question is not highlight any language but to ask for delphi solution and it is not at all intended for competition, though Peter is influential in directing Google Research.
[Update]
I was enlightened by David's suggestion and studied theory and routine of Peter's page. A very rough and inefficient routine was done, slightly different from other languages, mine is GUI's. I am a beginner and learner in Delphi, I dare not post my complete code ( it is poorly written). I will outline my idea of how I did it. Your comment is welcome so that the routine will be improved.
My hardware and software is old. This is enough to my work (my specialization is not in computer or program related)
AMD Athlon Dual Core Processor
2.01 Ghz, 480 Memory
Windows XP SP2
IDE Delphi 7.0
This is the snapshot and record of processing time of 'correct' word.
I tried Gettickcount, Tdatetime, and Queryperformancecounter to track correct time for word, but gettickcount and Tdatetime will output o ms for each check, so I have to use
Queryperformancecounter. Maybe there are other ways to do it more precisely.
The total lines is 72, not including function that records check time. Number of lines may not be yardstick as mentioned above by Marcelo. The post is discuss how to do the task differently. Delphi Experts at SO will of course use minimum lines to do it with best performance.
procedure Tmajorform.FormCreate(Sender: TObject);
begin
loaddict;
end;
procedure Tmajorform.loaddict;
var
fs: TFilestream;
templist: TStringlist;
p1: tperlregex;
w1: string;
begin
//load that big.txt (6.3M, is Adventures of Sherlock Holmes)
//templist.loadfromstream
//Use Tperlregex to tokenize ( I used regular expression by [Jan Goyvaerts][5])
//The load and tokenize time is about 7-8 seconds on my machine, Maybe there are other ways to
//speed up loading and tokenizing.
end;
procedure Tmajorform.edits1(str: string);
var
i: integer;
ch: char;
begin
// This is to simulate Peter's page in order to fast generate all possible combinations.
// I do not know how to use set in delphi. I used array.
// Peter said his routine edits1 would generate 494 elements of 'something'. Mine will
// generate 469. I do not know why. Before duplicate ignore, mine is over 500. After setting
// duplicate ignore, there are 469 unique elements for 'something'.
end;
procedure Tmajorform.correct(str: string);
var
i, j: integer;
begin
//This is a loop and binary search to add candidate word into list.
end;
procedure Tmajorform.Button2Click(Sender: TObject);
var
str: string;
begin
// Trigger correct(str: string);
end;
It seems by Tfilestream it can increase loading by 1-2 second. I tried using CreateFileMapping method but failed and it seemed a little complicated. Maybe there are other ways to load huge file fast. Because this big.txt will not be big considering availability of corpus, there should be more efficient way to load larger and larger file.
Another point is Delphi 7.0 does not have built-in regular expression. I have a look at other languages that do spell check at Perter's page, they are largely Directly calling their built-in regular expression. Of course, real expert does not need any built-in class or library and can build by himself. To beginner, some classes or libraries are convenience.
Your comment is welcome.
[Update]
I continued the research and further included edits2 function (edit distance 2). This will increase about another 12 lines of code. Peter said edit distance 2 would include almost all possibilities. 'something' will have 114,324 possibilities. My function will generate 102,727 UNIQUE possibilities for it. Of course, suggested words will also include more.
If with edits2, reponse time for correction obviously delay as it increases data by about 200 times. But I find some suggested corrections are obviously impossibilities as a typist will not type a error word that will be in the long corrected word list. So, edit distance 1 will be better provided that the big.txt file is sufficiently big to include more correct words.
Below is the snapshot of tracking edits 2 correct time.
This is a Python list comprehension. It forms the Cartesian product of splits and alphabets.
Each item of splits is a tuple which is unpacked into a and b. Each item of alphabet is put into a variable called c. Then the 3 variables are concatenated, assuming that they are strings. The result of the list comprehension expression is a list containing elements of the form a + c + b, one element for each item in the Cartesian product.
In Python it could be written equivalently as
res = []
for a, b in splits:
for c in alphabets:
res.append(a + c + b)
In Delphi it would be
res := TStringList.Create;
for split in splits do
for c in alphabets do
res.Add(split.a + c + split.b);
I suggest you read up on Python list comprehensions to get a better understanding of this very powerful Python feature.
How to get the entire code of a method in memory so I can calculate its hash at runtime?
I need to make a function like this:
type
TProcedureOfObject = procedure of object;
function TForm1.CalculateHashValue (AMethod: TProcedureOfObject): string;
var
MemStream: TMemoryStream;
begin
result:='';
MemStream:=TMemoryStream.Create;
try
//how to get the code of AMethod into TMemoryStream?
result:=MD5(MemStream); //I already have the MD5 function
finally
MemStream.Free;
end;
end;
I use Delphi 7.
Edit:
Thank you to Marcelo Cantos & gabr for pointing out that there is no consistent way to find the procedure size due to compiler optimization. And thank you to Ken Bourassa for reminding me of the risks. The target procedure (the procedure I would like to compute the hash) is my own and I don't call another routines from there, so I could guarantee that it won't change.
After reading the answers and Delphi 7 help file about the $O directive, I have an idea.
I'll make the target procedure like this:
procedure TForm1.TargetProcedure(Sender: TObject);
begin
{$O-}
//do things here
asm
nop;
nop;
nop;
nop;
nop;
end;
{$O+}
end;
The 5 succesive nops at the end of the procedure would act like a bookmark. One could predict the end of the procedure with gabr's trick, and then scan for the 5 nops nearby to find out the hopefully correct size.
Now while this idea sounds worth trying, I...uhm... don't know how to put it into working Delphi code. I have no experience on lower level programming like how to get the entry point and put the entire code of the target procedure into a TMemoryStream while scanning for the 5 nops.
I'd be very grateful if someone could show me some practical examples.
Marcelo has correctly stated that this is not possible in general.
The usual workaround is to use an address of the method that you want to calculate the hash for and an address of the next method. For the time being the compiler lays out methods in the same order as they are defined in the source code and this trick works.
Be aware that substracting two method addresses may give you a slightly too large result - the first method may actually end few bytes before the next method starts.
The only way I can think of, is turning on TD32 debuginfo, and try JCLDebug to see if you can find the length in the debuginfo using it. Relocation shouldn't affect the length, so the length in the binary should be the same as in mem.
Another way would be to scan the code for a ret or ret opcode. That is less safe, but probably would guard at least part of the function, without having to mess with debuginfo.
The potential deal breaker though is short routines that are tail-call optimized (iow they jump instead of ret). But I don't know if Delphi does that.
You might struggle with this. Functions are defined by their entry point, but I don't think that there is any consistent way to find out the size. In fact, optimisers can do screwy things like merge two similar functions into a common shared function with multiple entry points (whether or not Delphi does stuff like this, I don't know).
EDIT: The 5-nop trick isn't guaranteed to work either. In addition to Remy's caveats (see his comment below), The compiler merely has to guarantee that the nops are the last thing to execute, not that they are last thing to appear in the function's binary image. Turning off optimisations is a rather baroque "solution" that still won't fix all the issues that others have raised.
In short, there are simply too many variables here for what you are trying to do. A better approach would be to target compilation units for checksumming (assuming it satisfies whatever overall objective you have).
I achieve this by letting Delphi generate a MAP-file and sorting symbols based on their start address in ascending order. The length of each procedure or method is then the next symbols start address minus this symbols start address. This is most likely as brittle as the other solutions suggested here but I have this code working in production right now and it has worked fine for me so far.
My implementation that reads the map-file and calculate sizes can be found here at line 3615 (TEditorForm.RemoveUnusedCode).
Even if you would achieve it, there is a few things you need to be aware of...
The hash will change many times, even if the function itself didn't change.
For example, the hash will change if your function call another function that changed address since the last build. I think the hash might also change if your function calls itself recursively and your unit (not necessarily your function) changed since the last build.
As for how it could be achieved, gabr's suggestion seems to be the best one... But it's really prone to break over time.
What is the best approach, simple string concatenation or string.format?
For instance, what is the better to use:
s:=v1+' '+v2
or
s:=format('%S %S',[v1,v2])
Depends on your criteria for "best". If all you're doing is concatenating two strings, I'd go with the + operator. It's obvious what you're trying to do and easy to read, and it's a little bit faster because it doesn't have to use variants. (Have you looked at what format actually does under the hood? it's kinda scary!)
The major advantage of format is that it lets you make a single string and store it somewhere, such as in a text file or a resourcestring, and gather other parameters later. This makes it useful for more complex tasks. But if all you need to do is stick two strings together, it's kinda overkill IMO.
Format works with internationalization, making it possible to localize your app. Concatenation does not. Hence, I favor format for any display which may have to be produced in a culture-dependent manner.
Update: The reason format works for internationalization is that not all languages express everything in the same order. A contrived example would be:
resourcestring
sentence = ' is ';
var
subject = 'Craig';
adjective = 'helpful';
begin
WriteLn(subject + sentence + adjective + '!');
This works, and I can customize with a resourcestring, but in Spanish I would write, "¡Qué servicial es Craig!" The resourcestring doesn't help me. Instead I should write:
resourcestring
sentence = '%S is %S!'; // ES: '¡Qué %1:S es %0:S!'
Here's a third option:
s:=Concat(V1,V2);
I use:
s := v1 + ' ' + v2;
It's clearest and easiest to understand.
That is the most important thing.
You may find a construct that is marginally more efficient, e.g. using TStringBuilder in Delphi 2009. If efficiency is of utmost importance, then do what's necessary in the two or three most critical lines. Everywhere else, use code and constructs that are clear and easy to understand.
Code Complete says it is good practice to always use block identifiers, both for clarity and as a defensive measure.
Since reading that book, I've been doing that religiously. Sometimes it seems excessive though, as in the case below.
Is Steve McConnell right to insist on always using block identifiers? Which of these would you use?
//naughty and brief
with myGrid do
for currRow := FixedRows to RowCount - 1 do
if RowChanged(currRow) then
if not(RecordExists(currRow)) then
InsertNewRecord(currRow)
else
UpdateExistingRecord(currRow);
//well behaved and verbose
with myGrid do begin
for currRow := FixedRows to RowCount - 1 do begin
if RowChanged(currRow) then begin
if not(RecordExists(currRow)) then begin
InsertNewRecord(currRow);
end //if it didn't exist, so insert it
else begin
UpdateExistingRecord(currRow);
end; //else it existed, so update it
end; //if any change
end; //for each row in the grid
end; //with myGrid
I have always been following the 'well-behaved and verbose' style, except those unnecessary extra comments at the end blocks.
Somehow it makes more sense to be able to look at code and make sense out of it faster, than having to spend at least couple seconds before deciphering which block ends where.
Tip: Visual studio KB shortcut for C# jump begin and end: Ctrl + ]
If you use Visual Studio, then having curly braces for C# at the beginning and end of block helps also by the fact that you have a KB shortcut to jump to begin and end
Personally, I prefer the first one, as IMHO the "end;" don't tell me much, and once everything is close, I can tell by the identation what happens when.
I believe blocks are more useful when having large statements. You could make a mixed approach, where you insert a few "begin ... end;"s and comment what they are ending (for instance use it for the with and the first if).
IMHO you could also break this into more methods, for example, the part
if not(RecordExists(currRow)) then begin
InsertNewRecord(currRow);
end //if it didn't exist, so insert it
else begin
UpdateExistingRecord(currRow);
end; //else it existed, so update it
could be in a separate method.
I would use whichever my company has set for its coding standards.
That being said, I would prefer to use the second, more verbose, block. It is a lot easier to read. I might, however, leave off the block-ending comments in some cases.
I think it depends somewhat on the situation. Sometimes you simply have a method like this:
void Foo(bool state)
{
if (state)
TakeActionA();
else
TakeActionB();
}
I don't see how making it look like this:
void Foo(bool state)
{
if (state)
{
TakeActionA();
}
else
{
TakeActionB();
}
}
Improves on readability at all.
I'm a Python developer, so I see no need for block identifiers. I'm quite happy without them. Indentation is enough of an indicator for me.
Block identifier are not only easier to read they are much less error prone if you are changing something in the if else logic or simply adding a line and don't recognizing that the line is not in the same logical block then the rest of the code.
I would use the second code block. The first one looks prettier and more familiar but I think this a problem of the language and not the block identifiers
If it is possible I use checkstyle to ensure that brackets are used.
If I remember correctly, CC also gave some advices about comments. Especially about not rewriting what code does in comments, but explaining why it does what it does.
I'd say he's right just for the sake that the code can still be interpreted correctly if the indentation is incorrect. I always like to be able to find the start and end block identifiers for loops when I skim through code, and not rely on proper indentation.
It's never always one way or the other. Because I trust myself, I would use the shorter, more terse style. But if you're in a team environment where not everyone is of the same skill and maintainability is important, you may want to opt for the latter.
My knee-jerk reaction would be the second listing (with the repetitive comments removed from the end of the lines, like everyone's been saying), but after thinking about it more deeply I'd go with the first plus a one or two line useful comment beforehand explaining what's going on (if needed). Obviously in this toy example, even the comment before the concise answer would probably not be needed, but in other examples it might.
Having less (but still readable) and easy to understand code on the screen helps keep your brain space free for future parts of the code IMO.
I'm with those who prefer more concise code.
And it looks like prefering a verbose version to a concise one is more of a personal choice, than of a universal suitableness. (Well, within a company it may become a (mini-)universal rule.)
It's like excessive parentheses: some people prefer it like (F1 and F2) or ((not F2) and F3) or (A - (B * C)) < 0, and not necessarily because they do not know about the precedence rules. It's just more clear to them that way.
I vote for a happy medium. The rule I would use is to use the bracketing keywords any time the content is multiple lines. In action:
// clear and succinct
with myGrid do begin
for currRow := FixedRows to RowCount - 1 do begin
if RowChanged(currRow) then begin
if not(RecordExists(currRow))
InsertNewRecord(currRow);
else
UpdateExistingRecord(currRow);
end; // if RowChanged
end; // next currRow
end; // with myGrid
commenting the end is really usefull for html-like languages so do malformed C code like an infinite succession of if/else/if/else
frequent // comments at the end of code lines (per your Well Behaved and Verbose example) make the code harder to read imho -- when I see it I end up scanning the 'obvious' comments form something special that typically isn't there.
I prefer comments only where the obvious isn't (i.e. overall and / or unique functionality)
Personally I recommend always using block identifiers in languages that support them (but follow your company's coding standards, as #Muad'Dib suggests).
The reason is that, in non-Pythonic languages, whitespace is (generally) not meaningful to the compiler but it is to humans.
So
with myGrid do
for currRow := FixedRows to RowCount - 1 do
if RowChanged(currRow) then
Log(currRow);
if not(RecordExists(currRow)) then
InsertNewRecord(currRow)
else
UpdateExistingRecord(currRow);
appears to do one thing but does something quite different.
I would eliminate the end-of-line comments, though. Use an IDE that highlights blocks. I think Castalia will do that for Delphi. How often do you read code printouts anymore?
Since the age of the dinosaurs, Turbo Pascal and nowadays Delphi have a Write() and WriteLn() procedure that quietly do some neat stuff.
The number of parameters is variable;
Each variable can be of all sorts of types; you can supply integers, doubles, strings, booleans, and mix them all up in any order;
You can provide additional parameters for each argument:
Write('Hello':10,'World!':7); // alignment parameters
It even shows up in a special way in the code-completion drowdown:
Write ([var F:File]; P1; [...,PN] )
WriteLn ([var F:File]; [ P1; [...,PN]] )
Now that I was typing this I've noticed that Write and WriteLn don't have the same brackets in the code completion dropdown. Therefore it looks like this was not automatically generated, but it was hard-coded by someone.
Anyway, am I able to write procedures like these myself, or is all of this some magic hardcoded compiler trickery?
Writeln is what we call a compiler "magic" function. If you look in System.pas, you won't find a Writeln that is declared anything like what you would expect. The compiler literally breaks it all down into individual calls to various special runtime library functions.
In short, there is no way to implement your own version that does all the same things as the built-in writeln without modifying the compiler.
As the Allen said you can't write your own function that does all the same things.
You can, however, write a textfile driver that does something custom and when use standard Write(ln) to write to your textfile driver. We did that in ye old DOS days :)
("Driver" in the context of the previous statement is just a piece of Pascal code that is hooked into the system by switching a pointer in the System unit IIRC. Been a long time since I last used this trick.)
As far as I know, the pascal standards don't include variable arguments.
Having said that, IIRC, GNU Pascal let's you say something like:
Procecdure Foo(a: Integer; b: Integer; ...);
Try searching in your compiler's language docs for "Variable Argument Lists" or "conformant arrays". Here's an example of the later: http://www.gnu-pascal.de/demos/conformantdemo.pas.
As the prev poster said, writeln() is magic. I think the problem has to do with how the stack is assembled in a pascal function, but it's been a real long time since I've thought about where things were on the stack :)
However, unless you're writing the "writeln" function (which is already written), you probably don't need to implement a procedure with a variable arguments. Try iteration or recursion instead :)
It is magic compiler behaviour rather than regular procedure. And no, there is no way to write such subroutines (unfortunately!). Code generation resolves count of actual parameters and their types and translates to appropriate RTL calls (eg. Str()) at compile time. This opposes frequently suggested array of const (single variant array formal parameter, actually) which leads to doing the same at runtime. I'm finding later approach clumsy, it impairs code readability somewhat, and Bugland (Borland/Inprise/Codegear/Embarcadero/name it) broke Code Insight for variant open array constructors (yes, i do care, i use OutputDebugString(PChar(Format('...', [...])))) and code completion does not work properly (or at all) there.
So, closest possible way to simulate magic behaviour is to declare lot of overloaded subroutines (really lot of them, one per specific formal parameter type in the specific position). One could call this a kludge too, but this is the only way to get flexibility of variable parameter list and can be hidden in the separate module.
PS: i left out format specifiers aside intentionally, because syntax doesn't allow to semicolons use where Str(), Write() and Writeln() are accepting them.
Yes, you can do it in Delphi and friends (e.g. free pascal, Kylix, etc.) but not in more "standard" pascals. Look up variant open array parameters, which are used with a syntax something like this:
procedure MyProc(args : array of const);
(it's been a few years and I don't have manuals hand, so check the details before proceeding). This gives you an open array of TVarData (or something like that) that you can extract RTTI from.
One note though: I don't think you'll be able to match the x:y formatting syntax (that is special), and will probably have to go with a slightly more verbose wrapper.
Most is already said, but I like to add a few things.
First you can use the Format function. It is great to convert almost any kind of variable to string and control its size. Although it has its flaws:
myvar := 1;
while myvar<10000 do begin
Memo.Lines.Add(Format('(%3d)', [myVar]));
myvar := myvar * 10;
end;
Produces:
( 1)
( 10)
(100)
(1000)
So the size is the minimal size (just like the :x:y construction).
To get a minimal amount of variable arguments, you can work with default parameters and overloaded functions:
procedure WriteSome(const A1: string; const A2: string = ''; const A3: string = '');
or
procedure WriteSome(const A1: string); overload;
procedure WriteSome(const A1: Integer); overload;
You cannot write your own write/writeln in old Pascal. They are generated by the compiler, formatting, justification, etc. That's why some programmers like C language, even the flexible standard functions e.g. printf, scanf, can be implemented by any competent programmers.
You can even create an identical printf function for C if you are inclined to create something more performant than the one implemented by the C vendor. There's no magic trickery in them, your code just need to "walk" the variable arguments.
P.S.
But as MarkusQ have pointed out, some variants of Pascal(Free Pascal, Kylix, etc) can facilitate variable arguments. I last tinker with Pascal, since DOS days, Turbo Pascal 7.
Writeln is not "array of const" based, but decomposed by the compiler into various calls that convert the arguments to string and then call the primitive writestring. The "LN" is just a function that writes the lineending as a string. (OS dependant). The procedure variables (function pointers) for the primitives are part of the file type (Textrec/filerec), which is why they can be customized. (e.g. AssignCrt in TP)
If {$I+} mode is on, after each element, a call to the iocheck function is made.
The GPC construct made above is afaik the boundless C open array. FPC (and afaik Delphi too) support this too, but with different syntax.
procedure somehting (a:array of const);cdecl;
will be converted to be ABI compatible to C, printf style. This means that the relevant function (somehting in this case) can't get the number of arguments, but must rely on formatstring parsing. So this is something different from array of const, which is safe.
Although not a direct answer to you question, I would like to add the following comment:
I have recently rewritten some code using Writeln(...) syntax into using a StringList, filling the 'lines' with Format(...) and just plain IntToStr(...), FloatToStr(...) functions and the like.
The main reason for this change was speed improvement. Using a StringList and SaveFileTo is much, much more quicker than the WriteLn, Write combination.
If you are writing a program which creates a lot of text files (I was working on a web site creation program), this makes a lot of difference.