Unicode string can contain surrogate pairs (especially emoticons). Now I need to truncate this string to n chars. How can I do it safely without breaking any emoticons ?
The following code should be able to solve your issue:
FUNCTION IsDiacritical(C : CHAR) : BOOLEAN;
VAR
W : WORD ABSOLUTE C;
BEGIN
Result:=((W>=$1AB0) AND (W<=$1AFF)) OR
((W>=$0300) AND (W<=$036F)) OR
((W>=$1DC0) AND (W<=$1DFF))
END;
FUNCTION GetNextChar(VAR S : STRING) : STRING;
VAR
C : CHAR;
P : Cardinal;
BEGIN
CASE S.Length OF
0 : Result:='';
1 : Result:=S
ELSE // OTHERWISE //
Result:=''; P:=1;
FOR C IN S DO
IF NOT IsDiacritical(C) THEN
BREAK
ELSE BEGIN
Result:=Result+C;
INC(P)
END;
IF (P<LENGTH(S)) AND IsSurrogatePair(S,P) THEN
Result:=Result+COPY(S,P,2)
ELSE
Result:=Result+COPY(S,P,1)
END;
DELETE(S,1,Result.Length)
END;
FUNCTION GetStringByCodePoints(S : STRING ; CodePoints : Cardinal) : STRING;
VAR
I : Cardinal;
BEGIN
Result:='';
FOR I:=1 TO CodePoints DO Result:=Result+GetNextChar(S)
END;
PROCEDURE SetLengthByCodePoints(VAR S : STRING ; CodePoints : Cardinal);
BEGIN
SetLength(S,GetStringByCodePoints(S,CodePoints).Length)
END;
The GetStringByCodePoints is analogous to COPY, and SetLengthByCodePoints is analogous to SetLength. Both, however, takes the number of Code Points ("visible characters" or control characters) instead of characters.
If there are more Combining Diacritical code points, the relevant function can be extended to include these. The three groups I check for are the ones I could find by a simple Google search.
Related
How can I make this code faster? the string can contain characters such as ", .?#" and possibly others.
Const Nums = ['0'..'9'];
function CleanNumber(s: String): Int64;
Var z: Cardinal;
begin
for z := length(s) downto 1 do
if not (s[z] in Nums) then Delete(s,z,1);
if s = '' then
Result := 0 else
Result := StrToInt64(s);
end;
Results (long loop):
CL2,CL3 = HeartWare's
32-bit, "dirty number" / "clean number"
Mine: 270ms, 165ms
CL2: 220ms, 210ms
CL3: 100ms, 110ms
DirtyStrToNum: 215ms, 90ms
64-bit, "dirty number" / "clean number"
Mine: 2280ms, 75ms
CL2: 1320ms, 130ms
CL3: 280ms, 25ms
DirtyStrToNum: 1390ms, 125ms
Here are two examples that for sure are faster than the one you have (deleting a character from a string is relatively slow):
This one works by pre-allocating a string of the maximum possible length and then filling it out with the digits as I come across them in the source string. No delete for every unsupported character, and no expansion of the target string for every supported character.
FUNCTION CleanNumber(CONST S : STRING) : Int64;
VAR
I,J : Cardinal;
C : CHAR;
T : STRING;
BEGIN
SetLength(T,LENGTH(S));
J:=LOW(T);
FOR I:=LOW(S) TO HIGH(S) DO BEGIN
C:=S[I];
IF (C>='0') AND (C<='9') THEN BEGIN
T[J]:=C;
INC(J)
END
END;
IF J=LOW(T) THEN
Result:=0
ELSE BEGIN
SetLength(T,J-LOW(T)); // or T[J]:=#0 [implementation-specific]
Result:=StrToInt64(T)
END
END;
This one works by simple multiplication of the end result by 10 and adding the corresponding digit value.
{$IFOPT Q+}
{$DEFINE OverflowEnabled }
{$ELSE }
{$Q+ If you want overflow checking }
{$ENDIF }
FUNCTION CleanNumber(CONST S : STRING) : Int64;
VAR
I : Cardinal;
C : CHAR;
BEGIN
Result:=0;
FOR I:=LOW(S) TO HIGH(S) DO BEGIN
C:=S[I];
IF (C>='0') AND (C<='9') THEN Result:=Result*10+(ORD(C)-ORD('0'))
END
END;
{$IFNDEF OverflowEnabled } {$Q-} {$ENDIF }
{$UNDEF OverflowEnabled }
Also note that I don't use IN or CharInSet as these are much slower than a simple inline >= and <= comparison.
Another comment I could make is the use of LOW and HIGH on the string variable. This makes it compatible with both 0-based strings (mobile compilers) and 1-based strings (desktop compilers).
Your function is slow mainly because of the Delete approach. Each call to Delete needs to move a lot of characters around.
A faster approach would be like this:
function DirtyStrToNum(const S: string): Int64;
var
tmp: string;
i, j: Integer;
const
DIGITS = ['0'..'9'];
begin
SetLength(tmp, S.Length);
j := 0;
for i := 1 to S.Length do
if CharInSet(S[i], DIGITS) then
begin
Inc(j);
tmp[j] := S[i];
end;
SetLength(tmp, j);
if tmp.IsEmpty then
Result := 0
else
Result := StrToInt64(tmp);
// Or, but not equivalent: Result := StrToInt64Def(tmp, 0);
end;
Notice I make a single allocation for a new string, and then only copy the minimum number of characters to it.
I am trying to validate a string, where by it can contain all alphebetical and numerical characters, aswell as the underline ( _ ) symbol.
This is what I tried so far:
var
S: string;
const
Allowed = ['A'..'Z', 'a'..'z', '0'..'9', '_'];
begin
S := 'This_is_my_string_0123456789';
if Length(S) > 0 then
begin
if (Pos(Allowed, S) > 0 then
ShowMessage('Ok')
else
ShowMessage('string contains invalid symbols');
end;
end;
In Lazarus this errors with:
Error: Incompatible type for arg no. 1: Got "Set Of Char", expected
"Variant"
Clearly my use of Pos is all wrong and I am not sure if my approach is even the correct way of going about it or not?
Thanks.
You will have to check every single character of the string, if it's contained in Allowed
e.g.:
var
S: string;
const
Allowed = ['A' .. 'Z', 'a' .. 'z', '0' .. '9', '_'];
Function Valid: Boolean;
var
i: Integer;
begin
Result := Length(s) > 0;
i := 1;
while Result and (i <= Length(S)) do
begin
Result := Result AND (S[i] in Allowed);
inc(i);
end;
if Length(s) = 0 then Result := true;
end;
begin
S := 'This_is_my_string_0123456789';
if Valid then
ShowMessage('Ok')
else
ShowMessage('string contains invalid symbols');
end;
TYPE TCharSet = SET OF CHAR;
FUNCTION ValidString(CONST S : STRING ; CONST ValidChars : TCharSet) : BOOLEAN;
VAR
I : Cardinal;
BEGIN
Result:=FALSE;
FOR I:=1 TO LENGTH(S) DO IF NOT (S[I] IN ValidChars) THEN EXIT;
Result:=TRUE
END;
If you are using a Unicode version of Delphi (as you seem to be), beware that a SET OF CHAR cannot contain all valid characters in the Unicode character set. Then perhaps this function will be useful instead:
FUNCTION ValidString(CONST S,ValidChars : STRING) : BOOLEAN;
VAR
I : Cardinal;
BEGIN
Result:=FALSE;
FOR I:=1 TO LENGTH(S) DO IF POS(S[I],ValidChars)=0 THEN EXIT;
Result:=TRUE
END;
but then again, not all characters (actually Codepoints) in Unicode can be expressed by a single character, and some characters can be expressed in more than one way (both as a single character and as a multi-character).
But as long as you constrain yourself within these limitations, one of the above functions should be useful. You can even include both, if you add an "OVERLOAD;" directive to the end of each function declaration, as in:
FUNCTION ValidString(CONST S : STRING ; CONST ValidChars : TCharSet) : BOOLEAN; OVERLOAD;
FUNCTION ValidString(CONST S,ValidChars : STRING) : BOOLEAN; OVERLOAD;
Lazarus/Free Pascal doesn't overload pos for that but has "posset" variants in unit strutils for that;
http://www.freepascal.org/docs-html/rtl/strutils/posset.html
Regarding Andreas' (IMHO correct ) remark, you can use isemptystr for that. It was meant to check for strings that only contain whitespace, but it basically checks if a string only contains characters in a set.
http://www.freepascal.org/docs-html/rtl/strutils/isemptystr.html
You can use Regular Expressions:
uses System.RegularExpressions;
if not TRegEx.IsMatch(S, '^[_a-zA-Z0-9]+$') then
ShowMessage('string contains invalid symbols');
I use C# basically. There I can do:
string trimmed = str.Trim('\t');
to trim tabulation from the string str and return the result to trimmed.
In delphi7 I found only Trim, that trims spaces.
How can I achieve the same functionality?
There is string helper TStringHelper.Trim that accepts array of Char as optional parameter.
function Trim(const TrimChars: array of Char): string; overload;
So, you can use
trimmed := str.Trim([#09]);
for your example. #09 here is ASCII code for Tab character.
This function exists since at least Delphi XE3.
Hope it helps.
This is a kind of procedure sometimes easier to create than to find where it lives :)
function TrimChar(const Str: string; Ch: Char): string;
var
S, E: integer;
begin
S:=1;
while (S <= Length(Str)) and (Str[S]=Ch) do Inc(S);
E:=Length(Str);
while (E >= 1) and (Str[E]=Ch) do Dec(E);
SetString(Result, PChar(#Str[S]), E - S + 1);
end;
In Delphi the Trim function does not take parameters but it does trim other characters as well as spaces. Here's the code (from System.SysUtils in XE2, I don't think it has changed):
function Trim(const S: string): string;
var
I, L: Integer;
begin
L := Length(S);
I := 1;
if (L > 0) and (S[I] > ' ') and (S[L] > ' ') then Exit(S);
while (I <= L) and (S[I] <= ' ') do Inc(I);
if I > L then Exit('');
while S[L] <= ' ' do Dec(L);
Result := Copy(S, I, L - I + 1);
end;
It is trimming anything less than ' ' which would eliminate any control characters like tab, carriage return and line feed.
Delphi doesn't provide a function that does what you want. The built-in Trim function always trims the same set of characters (whitespace and control characters) from both ends of the input string. Several answers here show the basic technique for trimming arbitrary characters. As you can see, it doesn't have to be complicated. Here's my version:
function Trim(const s: string; c: Char): string;
var
First, Last: Integer;
begin
First := 1;
Last := Length(s);
while (First <= Last) and (s[First] = c) do
Inc(First);
while (First < Last) and (s[Last] = c) do
Dec(last);
Result := Copy(s, First, Last - First + 1);
end;
To adapt that for trimming multiple characters, all you have to do is change the second conditional term in each loop. What you change it to depends on how you choose to represent the multiple characters. C# uses an array. You could also put all the characters in a string, or you could use Delphi's native set type.
function Trim(const s: string; const c: array of Char): string;
// Replace `s[x] = c` with `CharInArray(s[x], c)`.
function Trim(const s: string; const c: string): string;
// Replace `s[x] = c` with `CharInString(s[x], s)`.
function Trim(const s: string; const c: TSysCharSet): string;
// Replace `s[x] = c` with `s[x] in c`.
The CharInArray and CharInString functions are easy to write:
function CharInArray(c: Char; ar: array of Char): Boolean;
var
i: Integer;
begin
Result := True;
for i := Low(ar) to High(ar) do
if ar[i] = c then
exit;
Result := False;
end;
// CharInString is identical, except for the type of `ar`.
Recall that as of Delphi 2009, Char is an alias for WideChar, meaning it's too big to fit in a set, so you wouldn't be able to use the set version unless you were guaranteed the input would always fit in an AnsiChar. Furthermore, the s[x] in c syntax generates warnings on WideChar arguments, so you'd want to use CharInSet(s[x], c) instead. (Unlike CharInArray and CharInString, the RTL provides CharInSet already, for Delphi versions that need it.)
You can use StringReplace:
var
str:String;
begin
str:='The_aLiEn'+Chr(VK_TAB)+'Delphi';
ShowMessage(str);
str:=StringReplace(str, chr(VK_Tab), '', [rfReplaceAll]);
ShowMessage(str);
end;
This omits all Tab characters from given string. But you can improve it, if you want leading and trailing tabs to be removed then you can use Pos function also.
Edit:
For the comment asking how to do it with Pos, here it is:
var
str:String;
s, e: PChar;
begin
str:=Chr(VK_TAB)+Chr(VK_TAB)+'The_aLiEn'+Chr(VK_TAB)+'Delphi'+Chr(VK_TAB)+Chr(VK_TAB);
s:=PChar(str);
while Pos(Chr(VK_TAB), s)=1 do inc(s);
e:=s;
inc(e, length(s)-1);
while Pos(Chr(VK_TAB), e)=1 do dec(e);
str:=Copy(s, 1, length(s)-length(e)+1);
ShowMessage(str);
end;
It is of course the same approach by Maksee's and a bit more job to do as it is. But if there isn't much time to finish the work and if Pos is what you've thought first, then this is how it can be done. You, the programmer should and have to think about optimizations, not me. And if we're talking constraints of optimization, with a little tweak to replace Pos with char compare, this will run faster than Maksee's code.
Edit for Substr search generalization:
function TrimStr(const Source, SubStr: String): String;
var
s, e: PChar;
l: Integer;
begin
s:=PChar(Source);
l:=Length(SubStr);
while Pos(SubStr, s)=1 do inc(s, l);
e:=s;
inc(e, length(s)-l);
while Pos(SubStr, e)=1 do dec(e, l);
Result:=Copy(s, 1, length(s)-length(e)+l);
end;
The JEDI JCL v2.7 provides these useful functions for what you need:
function StrTrimCharLeft(const S: string; C: Char): string;
function StrTrimCharsLeft(const S: string; const Chars: TCharValidator): string; overload;
function StrTrimCharsLeft(const S: string; const Chars: array of Char): string; overload;
function StrTrimCharRight(const S: string; C: Char): string;
function StrTrimCharsRight(const S: string; const Chars: TCharValidator): string; overload;
function StrTrimCharsRight(const S: string; const Chars: array of Char): string; overload;
function StrTrimQuotes(const S: string): string;
OS: Hungarian Windows (Windows 1250)
Under Delphi 6 Prof there is no WideStringPos, WideStringCopy, WideStringReplace...
But in an XML based project I need to use them.
Because that I tried to write "something like" these functions.
But I'm not sure they are working as I want...
Because Delphi converts the Wide to Ansi and reverse in the background, I cannot be sure that my code is safe from these side effects... :-)
The code is very primitive - I need the solution quickly...
function WideStringCopy(WWhat : WideString; From, HowMany : integer) : WideString;
var
i : integer;
l : integer;
wc : WideChar;
begin
Result := '';
if WWhat = ''
then Exit;
if (HowMany <= 0)
then Exit;
if (From < 1)
then From := 1;
l := From + HowMany - 1;
if l > Length(WWhat)
then l := Length(WWhat);
for i := From to l do begin
wc := WWhat[i];
Result := Result + wc;
end;
end;
function WideStringPos(WWhere, WWhat : WideString) : integer;
var
wscomp : WideString;
i : integer;
begin
Result := 0;
for i := 1 to Length(WWhere) do begin
wscomp := WideStringCopy(WWhere, i, LengtH(WWhat));
if WideSameStr(wscomp, WWhat)
then begin
Result := i;
Exit;
end;
end;
end;
function WideStringReplace(WWhere, WFrom, WTo : WideString) : WideString;
var
actpos : integer;
wcomp : WideString;
wc : WideChar;
begin
Result := '';
actpos := 1;
while actpos <= Length(WWhere) do begin
wcomp := WideStringCopy(WWhere, actpos, Length(WFrom));
if WideSameStr(wcomp, WFrom) then begin
Result := Result + WTo;
inc(actpos, Length(WFrom));
end else begin
wc := WWhere[actpos];
Result := Result + wc;
inc(actpos);
end;
end;
end;
I have two questions about it:
Do you see any piece of code that surely making bad result (converting the Wide to Ansi silently, and causing character loosing)?
Do you know some character with I can test this code?
For example, chr(XXX) what is remaining when my converters are keeping the Wide rules, but loosing if I make wrong code...
Thanks for every info you will write...
Do you know some character with I can test this code?
Any codepage beyond Win1250 - for example Cyrillic Win1251, Greek, Hebrew - almost all letters there would be missed from 1250/1252
You can take Jedi CodeLibrary and use its locale conversion routines: make a string consisting of #128 till #255 in some encoding like aforementioned, convert it to Unicode from that codepage and then convert back from Unicode to Hungarian codepage.
function StringToWideStringEx(const S: AnsiString; CodePage: Word): WideString;
function WideStringToStringEx(const WS: WideString; CodePage: Word): AnsiString;
Or in one call
function TranslateString(const S: AnsiString; CP1, CP2: Word): AnsiString;
Then look which chars failed to translate and turned into ReplacementCharacter.
However in JCL you'd have your Pos function and such ready to use. And XML parser. So why bother ?
We are tring to use in delphi a pas file generated by Free Pascal. The link is at:
http://www.markwatson.com/opensource/FastTag_Pascal.zip
While testing, it prompts InValidPointer. Please look at the following error line in debugger.
interface
procedure ReadLexicon;
type sarray = array[1..12] of string;
type big_sarray = array[1..1000] of string; { used for word lists and tags: limit on size of input text }
type psarray = ^sarray;
{function GetTagList(word: string): psarray;}
procedure TagWordList(wordList : big_sarray; var tags : big_sarray);
implementation
uses SysUtils, Classes;
{ Hash Table Support - copied from FreePascal source: benchmork shootout examples }
type
THashEntryPtr = ^THashEntryRec;
THashEntryRec = record
name : string;
number : psarray;
next : THashEntryPtr;
end;
const
TABLE_SIZE = 100000;
...
...
...
function GetTagList(word: string): psarray;
var
ret : psarray;
ok : boolean;
begin
ok := localHash.fetch(word, ret);
if ok then GetTagList := ret else GetTagList := nil;
end;
procedure TagWordList(wordList : big_sarray; var tags : big_sarray);
var i : integer;
x : real;
psa : psarray;
lastThreeChars : string;
lastTwoChars : string;
lastChar : string;
firstTwoChars : string;
tagFirstTwoChars : string;
begin
for i := 0 to length(wordList) do
begin
**psa := GetTagList(wordList[i]);///// EInvalidPointer ERROR**
if psa <> nil then tags[i] := psa^[1] else tags[i] := '???';
end;
...
...
...
How can we fix it.
Thank you very much in advance.
The original source doesn't set any compiler mode, and so the default TP like mode is active, meaning string=shortstring.
Replace, in the entire source string with shortstring and it will probably work.
There are at least two errors I can find in the TagWordList procedure.
for i := 0 to length(wordList) do, the array is 1-based so the loop has to start with 1.
A bit later there is a check if i > 0 that fails for the same reason.
It could also be a wrong definition of the type big_sarray = array[1..1000] of string;. If you change that to a 0-based array it might work.