StrUtils.SplitString not working as expected

I use StrUtils to split a string into a TStringDynArray, but the output is not what I expected. I will try to explain the issue:
I have a string str: 'a'; 'b'; 'c'
I called StrUtils.SplitString(str, '; ') to split the string and expected an array with three elements: 'a', 'b', 'c'
But what I got is an array with five elements: 'a', '', 'b', '', 'c'.
When I split with just ';' instead of '; ' I get three elements with a leading blank.
So why do I get empty strings in my first solution?

This function is designed not to merge consecutive separators. For instance, consider splitting the following string on commas:
foo,,bar
What would you expect SplitString('foo,,bar', ',') to return? Would you be looking for ('foo', 'bar') or should the answer be ('foo', '', 'bar')? It's not clear a priori which is right, and different use cases might want different output.
In your case, you specified two delimiters, ';' and ' '. This means that
'a'; 'b'
splits at ';' and again at ' '. Between those two delimiters there is nothing, and hence an empty string is returned in between 'a' and 'b'.
The Split method from the string helper introduced in XE3 has a TStringSplitOptions parameter. If you pass ExcludeEmpty for that parameter then consecutive separators are treated as a single separator. This program:
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
var
  S: string;
begin
  for S in '''a''; ''b''; ''c'''.Split([';', ' '], ExcludeEmpty) do begin
    Writeln(S);
  end;
end.
outputs:
'a'
'b'
'c'
But you do not have this available to you in XE2 so I think you are going to have to roll your own split function. Which might look like this:
function IsSeparator(const C: Char; const Separators: string): Boolean;
var
  sep: Char;
begin
  for sep in Separators do begin
    if sep=C then begin
      Result := True;
      exit;
    end;
  end;
  Result := False;
end;
function Split(const Str, Separators: string): TArray<string>;
var
  CharIndex, ItemIndex: Integer;
  len: Integer;
  SeparatorCount: Integer;
  Start: Integer;
begin
  len := Length(Str);
  if len=0 then begin
    Result := nil;
    exit;
  end;
  SeparatorCount := 0;
  for CharIndex := 1 to len do begin
    if IsSeparator(Str[CharIndex], Separators) then begin
      inc(SeparatorCount);
    end;
  end;
  SetLength(Result, SeparatorCount+1); // potentially an over-allocation
  ItemIndex := 0;
  Start := 1;
  for CharIndex := 1 to len do begin
    if IsSeparator(Str[CharIndex], Separators) then begin
      if CharIndex>Start then begin
        Result[ItemIndex] := Copy(Str, Start, CharIndex-Start);
        inc(ItemIndex);
      end;
      Start := CharIndex+1;
    end;
  end;
  // len>=Start so that a trailing item of a single character is not dropped
  if len>=Start then begin
    Result[ItemIndex] := Copy(Str, Start, len-Start+1);
    inc(ItemIndex);
  end;
  SetLength(Result, ItemIndex);
end;
Of course, all of this assumes that you want a space to act as a separator. You've asked for that in the code, but perhaps you actually want just ; to act as a separator. In that case you probably want to pass ';' as the separator, and trim the strings that are returned.
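The "split on ';' and trim" alternative might be sketched as follows. This is only an illustration: SplitAndTrim is a hypothetical helper name, not part of the RTL, and the unit scope names assume XE2 or later.

```pascal
program SplitAndTrimDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Types, System.StrUtils;

// Hypothetical helper: split on ';' only, then strip the surrounding
// whitespace from every piece.
function SplitAndTrim(const Str: string): TStringDynArray;
var
  I: Integer;
begin
  Result := SplitString(Str, ';');
  for I := Low(Result) to High(Result) do
    Result[I] := Trim(Result[I]);
end;

var
  S: string;
begin
  for S in SplitAndTrim('''a''; ''b''; ''c''') do
    Writeln(S);
end.
```

Because the space is no longer a delimiter, no empty strings appear between ';' and ' ', and an item that itself contains a space (for example 'a b') stays in one piece.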

SplitString is defined as
function SplitString(const S, Delimiters: string): TStringDynArray;
One might think that Delimiters denotes a single delimiter string used for splitting, but it actually denotes a set of single characters. Each character in the Delimiters string is used as one of the possible delimiters.
SplitString
Splits a string into different parts delimited by the specified
delimiter characters. SplitString splits a string into different parts
delimited by the specified delimiter characters. S is the string to be
split. Delimiters is a string containing the characters defined as
delimiters.

It is because the second parameter of SplitString is a list of single character delimiters, so '; ' means split at a ';' OR split at a ' '. So the string is split at every ';' and at every space, and between the ';' and the ' ' there is nothing, hence the empty strings.

Related

How to count all the words in a textfile with multiple space characters

I am trying to write a procedure that counts all the words in a text file in Pascal. I want it to handle multiple space characters, but I have no idea how to do it.
I tried adding a boolean function Space to determine whether a character is a space and then do
while not eof(file) do
begin
  read(file,char);
  words:=words+1;
  if Space(char) then
    while Space(char) do
      words:=words;
but that doesn't work, and basically just sums up my (probably bad) idea of how the procedure should look. Any ideas?

Basically, as Tom outlines in his answer, you need a state machine with the two states In_A_Word and Not_In_A_Word and then count whenever your state changes from Not_In_A_Word to In_A_Word.
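Something along the lines of the following sketch (the file name input.txt is a placeholder):
var
  F: Text;
  InWord: Boolean;
  Ch: Char;
  Words: Integer;
begin
  Assign(F, 'input.txt'); // placeholder file name
  Reset(F);
  Words := 0;
  InWord := False;
  while not Eof(F) do begin
    Read(F, Ch);
    if Ch in ['A'..'Z', 'a'..'z'] then begin
      // Count only the transition from Not_In_A_Word to In_A_Word
      if not InWord then begin
        InWord := True;
        Words := Words + 1;
      end;
    end else
      InWord := False;
  end;
  Close(F);
  Writeln(Words);
end.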

Use a boolean variable to indicate whether you are processing a word.
Set it to true (and increment the counter) only on the first non-space character after a space or after the start of the file.
Set it false on a space character.

Another method is to read the whole file into one string and then count the words with the following steps:
{$mode objfpc}
uses sysutils;
var
  fullstr: string = 'this  is   a test    string.  ';
  ch: char;
  count: integer=0;
begin
  {trim string- remove spaces at beginning and end: }
  fullstr:= trim(fullstr);
  {replace all double spaces with single: }
  while pos('  ', fullstr) > 0 do
    fullstr := stringreplace(fullstr, '  ', ' ', [rfReplaceAll, rfIgnoreCase]);
  {count spaces: }
  for ch in fullstr do
    if ch=' ' then
      count += 1;
  {add one to get number of words: }
  writeln('Number of words: ',count+1);
end.
The comments in the code above explain the steps.
Output:
Number of words: 5

Join and add delimiter to start/end

When I join an array (of strings), I will get a delimiter in between every element of the array
Writeln(string.Join('-', ['a','b','c']));
-> 'a-b-c'
However I would like to add delimiters also to the start and end of the string. I know I can do it like this
program Project1;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
function JoinAndAddDelimitersToStartAndEnd(const Delimiter: string; const SArr: TArray<string>): string;
begin
  Result := Delimiter + string.Join(Delimiter, SArr) + Delimiter;
end;
begin
  Writeln(JoinAndAddDelimitersToStartAndEnd('-', ['a','b','c']));
  //-> '-a-b-c-'
  Readln;
end.
Is there a better (built-in?) way to do this?

Another "ugly" solution would be to add empty elements to the beginning and end of the array, which pays off if the array is joined more often than it is written to. Adding one empty element before populating the array and one after it has little overhead, and the benefit is one (relatively expensive) string concatenation instead of three.
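A minimal sketch of that idea, assuming the array can simply be built with an empty string at each end (the literal array here stands in for however the real array is populated):

```pascal
program JoinPaddedDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;
begin
  // The empty first and last elements make Join emit a delimiter
  // at the start and at the end of the result.
  Writeln(string.Join('-', ['', 'a', 'b', 'c', '']));
  //-> '-a-b-c-'
end.
```

The trade-off is that every other consumer of the array has to know about, and skip, the two padding elements.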

How about:
var
  s : string;
  SArr : TArray<string>;
  Delim : string;
begin
  Delim := '-';
  SArr := ['a','b','c'];
  s := format('%s%s%s',[Delim,string.Join(Delim,SArr),Delim]);
end;

Delphi Firebird UDF with UTF8 strings

We are trying to write a UDF in Delphi (10 Seattle) for our Firebird 2.5 database which should remove some characters from the input string.
All our string fields in the database are using character set UTF8 with collation UNICODE_CI_AI.
The function should remove some characters like space, . ; : / \ and others from the string.
Our function works fine for strings containing characters with ascii value <= 127. As soon as there are characters with ascii value bigger than 127, the UDF fails.
We have tried using PChar instead of PAnsiChar parameters but without success. For now we do a check if the character has an ascii value above 127 and if so, we remove that character from the string too.
What we want though, is a UDF that returns the original string without the punctuation characters.
This is our code so far:
unit UDFs;
interface
uses ib_util;
function UDF_RemovePunctuations(InputString: PAnsiChar): PAnsiChar; cdecl;
implementation
uses SysUtils, AnsiStrings, Classes;
//FireBird declaration:
//DECLARE EXTERNAL FUNCTION UDF_REMOVEPUNCTUATIONS
// CSTRING(500)
//RETURNS CSTRING(500) FREE_IT
//ENTRY_POINT 'UDF_RemovePunctuations' MODULE_NAME 'FB_UDF.dll';
function UDF_RemovePunctuations(InputString: PAnsiChar): PAnsiChar;
const
  PunctuationChars = [' ', ',', '.', ';', '/', '\', '''', '"','(', ')'];
var
  I: Integer;
  S, NewS: String;
begin
  S := UTF8ToUnicodeString(InputString);
  For I := 1 to Length(S) do
  begin
    If Not CharInSet(S[I], PunctuationChars)
    then begin
      If S[I] <= #127
      then NewS := NewS + S[I];
    end;
  end;
  Result := ib_util_malloc(Length(NewS) + 1);
  NewS := NewS + #0;
  AnsiStrings.StrPCopy(Result, NewS);
end;
end.
When we remove the check on ascii value <= #127 we can see that NewS contains all characters as it should be (without the punctuation characters of course) but things go wrong when doing the StrPCopy we think.
Any help would be appreciated!

Thanks to LU RD I got this working.
The answer was to declare my string variables as Utf8String instead of String and not converting the inputstring to Unicode.
I have adapted my code like this:
//FireBird declaration:
//DECLARE EXTERNAL FUNCTION UDF_REMOVEPUNCTUATIONS
// CSTRING(500)
//RETURNS CSTRING(500) FREE_IT
//ENTRY_POINT 'UDF_RemovePunctuations' MODULE_NAME 'CarfacPlus_UDF.dll';
function UDF_RemovePunctuations(InputString: PAnsiChar): PAnsiChar;
const
  PunctuationChars = [' ', ',', '.', ';', '/', '\', '''', '"','(', ')', '-',
                      '+', ':', '<', '>', '=', '[', ']', '{', '}'];
var
  I: Integer;
  S: Utf8String;
begin
  S := InputString;
  For I := Length(S) downto 1 do
    If CharInSet(S[I], PunctuationChars)
    then Delete(S, I, 1);
  Result := ib_util_malloc(Length(S) + 1);
  AnsiStrings.StrPCopy(Result, AnsiString(S));
end;

Only allow certain characters in a string

I am trying to validate a string so that it may contain only alphabetical and numerical characters, as well as the underscore ( _ ) symbol.
This is what I tried so far:
var
  S: string;
const
  Allowed = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
begin
  S := 'This_is_my_string_0123456789';
  if Length(S) > 0 then
  begin
    if Pos(Allowed, S) > 0 then
      ShowMessage('Ok')
    else
      ShowMessage('string contains invalid symbols');
  end;
end;
In Lazarus this fails with:
Error: Incompatible type for arg no. 1: Got "Set Of Char", expected "Variant"
Clearly my use of Pos is all wrong and I am not sure if my approach is even the correct way of going about it or not?
Thanks.

You will have to check every single character of the string to see whether it is contained in Allowed,
e.g.:
var
  S: string;
const
  Allowed = ['A' .. 'Z', 'a' .. 'z', '0' .. '9', '_'];
  Function Valid: Boolean;
  var
    i: Integer;
  begin
    Result := Length(s) > 0;
    i := 1;
    while Result and (i <= Length(S)) do
    begin
      Result := Result AND (S[i] in Allowed);
      inc(i);
    end;
    if Length(s) = 0 then Result := true;
  end;
begin
  S := 'This_is_my_string_0123456789';
  if Valid then
    ShowMessage('Ok')
  else
    ShowMessage('string contains invalid symbols');
end;

TYPE TCharSet = SET OF CHAR;
FUNCTION ValidString(CONST S : STRING ; CONST ValidChars : TCharSet) : BOOLEAN;
  VAR
    I : Cardinal;
  BEGIN
    Result:=FALSE;
    FOR I:=1 TO LENGTH(S) DO IF NOT (S[I] IN ValidChars) THEN EXIT;
    Result:=TRUE
  END;
If you are using a Unicode version of Delphi (as you seem to be), beware that a SET OF CHAR cannot contain all valid characters in the Unicode character set. Then perhaps this function will be useful instead:
FUNCTION ValidString(CONST S,ValidChars : STRING) : BOOLEAN;
  VAR
    I : Cardinal;
  BEGIN
    Result:=FALSE;
    FOR I:=1 TO LENGTH(S) DO IF POS(S[I],ValidChars)=0 THEN EXIT;
    Result:=TRUE
  END;
but then again, not all characters (actually code points) in Unicode can be expressed by a single Char, and some characters can be expressed in more than one way (both as a single character and as a multi-character sequence).
But as long as you constrain yourself within these limitations, one of the above functions should be useful. You can even include both, if you add an "OVERLOAD;" directive to the end of each function declaration, as in:
FUNCTION ValidString(CONST S : STRING ; CONST ValidChars : TCharSet) : BOOLEAN; OVERLOAD;
FUNCTION ValidString(CONST S,ValidChars : STRING) : BOOLEAN; OVERLOAD;

Lazarus/Free Pascal doesn't overload Pos for sets, but it has "posset" variants in unit strutils for this purpose:
http://www.freepascal.org/docs-html/rtl/strutils/posset.html
Regarding Andreas' (IMHO correct) remark, you can use isemptystr for that. It was meant to check for strings that contain only whitespace, but it basically checks whether a string contains only characters in a given set.
http://www.freepascal.org/docs-html/rtl/strutils/isemptystr.html
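As a rough sketch of the isemptystr approach in Free Pascal: OnlyAllowedChars is a hypothetical wrapper name, and the explicit empty-string check is an assumption about the desired behaviour, added because IsEmptyStr itself returns True for an empty string.

```pascal
program ValidateWithIsEmptyStr;
{$mode objfpc}{$H+}
uses
  SysUtils, StrUtils;

const
  Allowed = ['A'..'Z', 'a'..'z', '0'..'9', '_'];

// Hypothetical wrapper: True only if S is non-empty and every
// character of S is a member of Allowed.
function OnlyAllowedChars(const S: string): Boolean;
begin
  // IsEmptyStr is True when all characters of S are in the given set,
  // and also for '', so the empty string is rejected explicitly here.
  Result := (S <> '') and IsEmptyStr(S, Allowed);
end;

begin
  Writeln(OnlyAllowedChars('This_is_my_string_0123456789'));
  Writeln(OnlyAllowedChars('not valid!'));
end.
```

The first call should report the string as valid and the second as invalid, since the second string contains a space and an exclamation mark.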

You can use Regular Expressions:
uses System.RegularExpressions;
if not TRegEx.IsMatch(S, '^[_a-zA-Z0-9]+$') then
  ShowMessage('string contains invalid symbols');

How to wash/validate a string to assign it to a componentname?

I have a submenu that lists departments. Behind each department there is an action whose name is set to 'actPlan' + department.name.
Now I realize this was a bad idea, because the department name can contain any strange character in the world, but action.name cannot contain international characters. Obviously the Delphi IDE itself calls some method to check whether a string is a valid component name. Does anyone know more about this?
I have also an idea to use
Action.name := 'actPlan' + department.departmentID;
instead. The advantage is that departmentID has a known format, 'xxxxx-x' (where x is 1-9), so I only have to replace '-' with, for example, an underscore. The problem is that the old action names are already persisted in a per-user text file, so there will be exceptions if I suddenly switch from the department name to the ID.
I could of course swallow the exception the first time and then call a method that does a search and replace on that text file with the right data and reloads it.
So basically I am looking for the most elegant and future-proof way to solve this :)
I use D2007.

Component names are validated using the IsValidIdent function from SysUtils, which simply checks whether the first character is alphabetic or an underscore and whether all subsequent characters are alphanumeric or an underscore.
To create a string that fits those rules, simply remove any characters that don't qualify, and then add a qualifying character if the result starts with a number.
That transformation might yield the same result for similar names. If that's not something you want, then you can add something unique to the end of the string, such as a checksum computed from the input string, or your department ID.
function MakeValidIdent(const s: string): string;
var
  x: Integer;
  c: Char;
begin
  SetLength(Result, Length(s));
  x := 0;
  for c in s do
    if c in ['A'..'Z', 'a'..'z', '0'..'9', '_'] then begin
      Inc(x);
      Result[x] := c;
    end;
  SetLength(Result, x);
  if x = 0 then
    Result := '_'
  else if Result[1] in ['0'..'9'] then
    Result := '_' + Result;
  // Optional uniqueness protection follows. Choose one.
  Result := Result + IntToStr(Checksum(s));
  Result := Result + GetDepartment(s).ID;
end;
In Delphi 2009 and later, replace the second and third in operators (the two set-membership tests) with calls to the CharInSet function. (Unicode characters don't work well with Delphi sets.) In Delphi 8 and earlier, change the first in operator (the for-in loop) to a classic for loop that indexes into s.

I have written a routine
// See SysUtils.IsValidIdent:
function MakeValidIdent(const AText: string): string;
const
  Alpha = ['A'..'Z', 'a'..'z', '_'];
  AlphaNumeric = Alpha + ['0'..'9'];
  function IsValidChar(AIndex: Integer; AChar: Char): Boolean;
  begin
    if AIndex = 1 then
      Result := AChar in Alpha
    else
      Result := AChar in AlphaNumeric;
  end;
var
  i: Integer;
begin
  Result := AText;
  for i := 1 to Length(Result) do
    if not IsValidChar(i, Result[i]) then
      Result[i] := '_';
end;
which makes Pascal identifiers from strings.
You might also want to copy FindUniqueName from Classes.pas and apply that to the result from MakeValidIdent.

Here is my routine:
function MakeValidIdent(const s: string): string;
begin
  Result := 'clm'; //Prefix
  for var c in s do
    if CharInSet(c, ['A'..'Z', 'a'..'z', '0'..'9', '_']) then
      Result := Result + c;
end;
