Delphi Firebird UDF with UTF8 strings

We are trying to write a UDF in Delphi (10 Seattle) for our Firebird 2.5 database which should remove some characters from the input string.
All our string fields in the database are using character set UTF8 with collation UNICODE_CI_AI.
The function should remove some characters like space, . ; : / \ and others from the string.
Our function works fine for strings that contain only ASCII characters (values <= 127). As soon as the string contains a character with a value above 127, the UDF fails.
We have tried using PChar instead of PAnsiChar parameters, but without success. For now we check whether each character has a value above 127 and, if so, remove that character from the string too.
What we want, though, is a UDF that returns the original string without the punctuation characters.
This is our code so far:
unit UDFs;
interface
uses ib_util;
function UDF_RemovePunctuations(InputString: PAnsiChar): PAnsiChar; cdecl;
implementation
uses SysUtils, AnsiStrings, Classes;
//FireBird declaration:
//DECLARE EXTERNAL FUNCTION UDF_REMOVEPUNCTUATIONS
// CSTRING(500)
//RETURNS CSTRING(500) FREE_IT
//ENTRY_POINT 'UDF_RemovePunctuations' MODULE_NAME 'FB_UDF.dll';
function UDF_RemovePunctuations(InputString: PAnsiChar): PAnsiChar;
const
PunctuationChars = [' ', ',', '.', ';', '/', '\', '''', '"','(', ')'];
var
I: Integer;
S, NewS: String;
begin
S := UTF8ToUnicodeString(InputString);
For I := 1 to Length(S) do
begin
If Not CharInSet(S[I], PunctuationChars)
then begin
If S[I] <= #127
then NewS := NewS + S[I];
end;
end;
Result := ib_util_malloc(Length(NewS) + 1);
NewS := NewS + #0;
AnsiStrings.StrPCopy(Result, NewS);
end;
end.
When we remove the check for values <= #127, we can see that NewS contains all the expected characters (without the punctuation characters, of course), but we think things go wrong in the StrPCopy call.
Any help would be appreciated!

Thanks to LU RD I got this working.
The answer was to declare my string variables as Utf8String instead of String, and not to convert the input string to Unicode.
I have adapted my code like this:
//FireBird declaration:
//DECLARE EXTERNAL FUNCTION UDF_REMOVEPUNCTUATIONS
// CSTRING(500)
//RETURNS CSTRING(500) FREE_IT
//ENTRY_POINT 'UDF_RemovePunctuations' MODULE_NAME 'CarfacPlus_UDF.dll';
function UDF_RemovePunctuations(InputString: PAnsiChar): PAnsiChar;
const
PunctuationChars = [' ', ',', '.', ';', '/', '\', '''', '"','(', ')', '-',
'+', ':', '<', '>', '=', '[', ']', '{', '}'];
var
I: Integer;
S: Utf8String;
begin
S := InputString;
For I := Length(S) downto 1 do
If CharInSet(S[I], PunctuationChars)
then Delete(S, I, 1);
Result := ib_util_malloc(Length(S) + 1);
AnsiStrings.StrPCopy(Result, AnsiString(S));
end;
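A note on why the byte-wise loop above does not damage accented characters: in UTF-8, every byte of a multi-byte sequence has a value of 128 or higher, so deleting single bytes that match ASCII punctuation cannot split an encoded character. The console sketch below illustrates this. StripAsciiPunctuation is a hypothetical helper that only mirrors the loop in the UDF, and the sample string is made up for the illustration.

```pascal
program Utf8StripDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper that mirrors the loop used in the UDF.
function StripAsciiPunctuation(const Source: UTF8String): UTF8String;
const
  PunctuationChars = [' ', ',', '.', ';', '/', '\'];
var
  I: Integer;
begin
  Result := Source;
  // Walk backwards so deleting a byte does not shift the bytes still to be checked.
  for I := Length(Result) downto 1 do
    if Result[I] in PunctuationChars then
      Delete(Result, I, 1);
end;

var
  Original: string;
  Cleaned: UTF8String;
begin
  // U+00E9 is stored as the two bytes $C3 $A9 in UTF-8; neither is in the set.
  Original := 'caf' + #$00E9 + '; bar.';
  Cleaned := StripAsciiPunctuation(UTF8String(Original));
  // Convert back to a UnicodeString to compare with the expected text.
  if UTF8ToUnicodeString(Cleaned) = 'caf' + #$00E9 + 'bar' then
    Writeln('multi-byte character preserved')
  else
    Writeln('unexpected result');
end.
```

This only holds because every character in the punctuation set is plain ASCII; removing a non-ASCII character would require comparing whole byte sequences instead.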

Related

Delphi Split(): retrieve the separator into separators list

I want to retrieve, at runtime, the separator that split a string, e.g.:
aStr := 'foo=bar';
aParts := aStr.Split(['=', '!', '<', '>', '^']);
aParts[0] is foo and aParts[1] is bar, but which separator was used? Is there any way to retrieve the character that split the string (here, =)?
I think you need to parse the delimiters separately. For instance, the following code will put your delimiters into a second array named separators.
var
astr: string;
aparts: System.TArray<System.string>;
separators: System.TArray<System.string>;
findindex: integer;
findpos: integer;
begin
astr := 'foo=bar!abc^def';
aparts := astr.Split(['=', '!', '<', '>', '^']);
findindex := 0; // the string helper methods use zero-based indexes
repeat
findpos := astr.IndexOfAny(['=', '!', '<', '>', '^'], findindex);
if findpos >= 0 then
begin
SetLength(separators, length(separators) + 1);
separators[length(separators) - 1] := astr.Substring(findpos, 1);
end;
findindex := findpos + 1;
until findpos = -1;
end;

StrUtils.SplitString not working as expected

I use StrUtils to split a string into a TStringDynArray, but the output is not what I expected. I will try to explain the issue:
I have a string str: 'a'; 'b'; 'c'
Now I called StrUtils.SplitString(str, '; '); to split the string and I expected an array with three elements: 'a', 'b', 'c'
But what I got is an array with five elements: 'a', '', 'b', '', 'c'.
When I split with just ';' instead of '; ' I get three elements with a leading blank.
So why do I get empty strings in my first solution?
This function is designed not to merge consecutive separators. For instance, consider splitting the following string on commas:
foo,,bar
What would you expect SplitString('foo,,bar', ',') to return? Would you be looking for ('foo', 'bar') or should the answer be ('foo', '', 'bar')? It's not clear a priori which is right, and different use cases might want different output.
In your case, you specified two delimiters, ';' and ' '. This means that
'a'; 'b'
splits at ';' and again at ' '. Between those two delimiters there is nothing, and hence an empty string is returned in between 'a' and 'b'.
The Split method from the string helper introduced in XE3 has a TStringSplitOptions parameter. If you pass ExcludeEmpty for that parameter then consecutive separators are treated as a single separator. This program:
{$APPTYPE CONSOLE}
uses
System.SysUtils;
var
S: string;
begin
for S in '''a''; ''b''; ''c'''.Split([';', ' '], ExcludeEmpty) do begin
Writeln(S);
end;
end.
outputs:
'a'
'b'
'c'
But this is not available to you in XE2, so I think you are going to have to roll your own split function, which might look like this:
function IsSeparator(const C: Char; const Separators: string): Boolean;
var
sep: Char;
begin
for sep in Separators do begin
if sep=C then begin
Result := True;
exit;
end;
end;
Result := False;
end;
function Split(const Str, Separators: string): TArray<string>;
var
CharIndex, ItemIndex: Integer;
len: Integer;
SeparatorCount: Integer;
Start: Integer;
begin
len := Length(Str);
if len=0 then begin
Result := nil;
exit;
end;
SeparatorCount := 0;
for CharIndex := 1 to len do begin
if IsSeparator(Str[CharIndex], Separators) then begin
inc(SeparatorCount);
end;
end;
SetLength(Result, SeparatorCount+1); // potentially an over-allocation
ItemIndex := 0;
Start := 1;
for CharIndex := 1 to len do begin
if IsSeparator(Str[CharIndex], Separators) then begin
if CharIndex>Start then begin
Result[ItemIndex] := Copy(Str, Start, CharIndex-Start);
inc(ItemIndex);
end;
Start := CharIndex+1;
end;
end;
if len>=Start then begin // >= so that a one-character final item is not dropped
Result[ItemIndex] := Copy(Str, Start, len-Start+1);
inc(ItemIndex);
end;
SetLength(Result, ItemIndex);
end;
Of course, all of this assumes that you want a space to act as a separator. You've asked for that in the code, but perhaps you actually want just ; to act as a separator. In that case you probably want to pass ';' as the separator, and trim the strings that are returned.
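That last suggestion could look roughly like the following sketch, which uses only SplitString and Trim; the program name and variable names are made up for the illustration.

```pascal
program SplitTrimDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils, System.Types, System.StrUtils;
var
  Parts: TStringDynArray;
  I: Integer;
begin
  // Only ';' is a delimiter here, so the blank after each ';' stays in the item.
  Parts := SplitString('''a''; ''b''; ''c''', ';');
  for I := 0 to High(Parts) do
    Writeln(Trim(Parts[I]));   // Trim removes the leading blank
end.
```

With the input from the question this should yield three items, each still wrapped in its single quotes.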
SplitString is defined as
function SplitString(const S, Delimiters: string): TStringDynArray;
One might think that Delimiters denotes a single delimiter string used for splitting, but it actually denotes a set of single characters. Each character in the Delimiters string is used as one of the possible delimiters.
SplitString
Splits a string into different parts delimited by the specified
delimiter characters. SplitString splits a string into different parts
delimited by the specified delimiter characters. S is the string to be
split. Delimiters is a string containing the characters defined as
delimiters.
It is because the second parameter of SplitString is a list of single character delimiters, so '; ' means split at a ';' OR split at a ' '. So the string is split at every ';' and at every space, and between the ';' and the ' ' there is nothing, hence the empty strings.

How to use TNetEncoding.URL.Encode for encoding spaces in query string parameters?

In Delphi XE7 it is advised to use TNetEncoding.URL.Encode
So far I have been using a custom routine:
class function THttp.UrlEncode(const S: string; const InQueryString: Boolean): string;
var
I: Integer;
begin
Result := EmptyStr;
for i := 1 to Length(S) do
case S[i] of
// The NoConversion set contains characters as specified in RFC 1738 and
// should not be modified unless the standard changes.
'A'..'Z', 'a'..'z', '*', '#', '.', '_', '-', '0'..'9',
'$', '!', '''', '(', ')':
Result := Result + S[i];
'—': Result := Result + '%E2%80%94';
' ' :
if InQueryString then
Result := Result + '+'
else
Result := Result + '%20';
else
Result := Result + '%' + System.SysUtils.IntToHex(Ord(S[i]), 2);
end;
end;
Using the method above, I have been able to specify manually whether the encoded parameter S is part of the Path or part of the Query string.
Spaces should be encoded as + if found in the Path and as %20 if part of the Query parameters.
The function above properly emits
Url := 'http://something/?q=' + THttp.UrlEncode('koko jambo', true);
// Url := http://something/?q=koko%20jambo
but the following is returning different value
Url := 'http://something/?q=' + TNetEncoding.URL.Encode('koko jambo');
// Url := http://something/?q=koko+jambo
How should TNetEncoding.URL.Encode be used so that spaces in query parameters are encoded as %20?
Read the documentation:
System.NetEncoding.TURLEncoding
TURLEncoding only encodes spaces (as plus signs: +) and the following reserved URL encoding characters: ;:&=+,/?%#[].
It is not possible to make TNetEncoding.URL.Encode encode spaces as %20.
Normally, I would suggest Indy's TIdURI class, as it has separate PathEncode() and ParamsEncode() methods, but they both encode spaces as %20 as well, which does not satisfy your "encoded as + if found in the Path" requirement.
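If %20 is what you need for query values, one option is a small hand-written encoder. The sketch below is not part of TNetEncoding or Indy: EncodeQueryValue is a hypothetical name, and the choice to leave only the RFC 3986 unreserved characters untouched is an assumption. It percent-encodes the UTF-8 bytes of everything else, so a space becomes %20.

```pascal
program EncodeDemo;
{$APPTYPE CONSOLE}
uses
  System.SysUtils;

// Hypothetical helper: percent-encodes everything except the
// RFC 3986 unreserved characters (letters, digits, '-', '_', '.', '~').
function EncodeQueryValue(const S: string): string;
var
  B: Byte;
begin
  Result := '';
  // Work on the UTF-8 bytes so a non-ASCII character becomes several %XX groups.
  for B in TEncoding.UTF8.GetBytes(S) do
    if AnsiChar(B) in ['A'..'Z', 'a'..'z', '0'..'9', '-', '_', '.', '~'] then
      Result := Result + Char(B)
    else
      Result := Result + '%' + IntToHex(B, 2);
end;

begin
  // The space (byte $20) is emitted as %20.
  Writeln('http://something/?q=' + EncodeQueryValue('koko jambo'));
end.
```

Unlike the routine in the question, this encodes bytes rather than UTF-16 code units, so characters above #255 do not produce malformed escapes.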

Character to empty in Delphi6

How can I set the character at a given index to an empty character in Delphi 6?
procedure TMainForm.Button1Click(Sender: TObject);
var i: integer;
s_ord_account : String[10];
begin
s_ord_account := '0930002930' ;
i := 1;
REPEAT
IF s_ord_account[i] = '0' THEN
s_ord_account[i] := '';
INC(i);
UNTIL (i=5) OR (s_ord_account[i] <> ' ');
MessageDlg(s_ord_account,mtError, mbOKCancel, 0);
end;
When I try to execute this code I get an error
[Error] Main.pas(30): Incompatible types: 'Char' and 'String'
First of all it would make a lot of sense for you to stop using Turbo Pascal strings and use the native Delphi string type, string.
There is no such thing as an empty character. You can use the Delete function to remove a character from the string. A simpler approach would be to use the StringReplace function. That renders your code entirely needless.
{$APPTYPE CONSOLE}
uses
SysUtils;
var
s: string;
begin
s := StringReplace('0930002930', '0', '', [rfReplaceAll]);
Writeln(s);
end.
Output
93293
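For completeness, the Delete approach mentioned above could be sketched as follows. RemoveChar is a hypothetical helper name; it relies only on the built-in Length and Delete routines, so it should also compile in older Delphi versions.

```pascal
program DeleteDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: returns S with every occurrence of C removed.
function RemoveChar(const S: string; C: Char): string;
var
  I: Integer;
begin
  Result := S;
  // Iterate backwards so removing a character does not shift unvisited indexes.
  for I := Length(Result) downto 1 do
    if Result[I] = C then
      Delete(Result, I, 1);
end;

begin
  Writeln(RemoveChar('0930002930', '0'));
end.
```

For this input the result is the same 93293 that StringReplace produces; StringReplace remains the shorter option.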

Only allow certain characters in a string

I am trying to validate a string so that it may contain only alphabetical and numerical characters, as well as the underscore ( _ ) symbol.
This is what I tried so far:
var
S: string;
const
Allowed = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
begin
S := 'This_is_my_string_0123456789';
if Length(S) > 0 then
begin
if Pos(Allowed, S) > 0 then
ShowMessage('Ok')
else
ShowMessage('string contains invalid symbols');
end;
end;
In Lazarus this errors with:
Error: Incompatible type for arg no. 1: Got "Set Of Char", expected
"Variant"
Clearly my use of Pos is all wrong and I am not sure if my approach is even the correct way of going about it or not?
Thanks.
You will have to check whether every single character of the string is contained in Allowed, e.g.:
var
S: string;
const
Allowed = ['A' .. 'Z', 'a' .. 'z', '0' .. '9', '_'];
Function Valid: Boolean;
var
i: Integer;
begin
Result := Length(s) > 0;
i := 1;
while Result and (i <= Length(S)) do
begin
Result := Result AND (S[i] in Allowed);
inc(i);
end;
if Length(s) = 0 then Result := true;
end;
begin
S := 'This_is_my_string_0123456789';
if Valid then
ShowMessage('Ok')
else
ShowMessage('string contains invalid symbols');
end;
TYPE TCharSet = SET OF CHAR;
FUNCTION ValidString(CONST S : STRING ; CONST ValidChars : TCharSet) : BOOLEAN;
VAR
I : Cardinal;
BEGIN
Result:=FALSE;
FOR I:=1 TO LENGTH(S) DO IF NOT (S[I] IN ValidChars) THEN EXIT;
Result:=TRUE
END;
If you are using a Unicode version of Delphi (as you seem to be), beware that a SET OF CHAR cannot contain all valid characters in the Unicode character set. Then perhaps this function will be useful instead:
FUNCTION ValidString(CONST S,ValidChars : STRING) : BOOLEAN;
VAR
I : Cardinal;
BEGIN
Result:=FALSE;
FOR I:=1 TO LENGTH(S) DO IF POS(S[I],ValidChars)=0 THEN EXIT;
Result:=TRUE
END;
But then again, not every Unicode code point can be represented by a single Char, and some characters can be expressed in more than one way (both as a single code point and as a multi-code-point sequence).
But as long as you constrain yourself within these limitations, one of the above functions should be useful. You can even include both, if you add an "OVERLOAD;" directive to the end of each function declaration, as in:
FUNCTION ValidString(CONST S : STRING ; CONST ValidChars : TCharSet) : BOOLEAN; OVERLOAD;
FUNCTION ValidString(CONST S,ValidChars : STRING) : BOOLEAN; OVERLOAD;
Lazarus/Free Pascal doesn't overload Pos for sets, but it has "posset" variants in unit strutils for that purpose:
http://www.freepascal.org/docs-html/rtl/strutils/posset.html
Regarding Andreas' (IMHO correct) remark, you can use isemptystr for that. It was meant to check for strings that contain only whitespace, but it basically checks whether a string contains only characters from a given set.
http://www.freepascal.org/docs-html/rtl/strutils/isemptystr.html
You can use Regular Expressions:
uses System.RegularExpressions;
if not TRegEx.IsMatch(S, '^[_a-zA-Z0-9]+$') then
ShowMessage('string contains invalid symbols');
