How to convert text to UTF-8 in Delphi

I have a function that returns an HTML page from the Internet, but the Cyrillic characters are displayed as other, unknown characters.
How can I convert the text so that the Cyrillic characters display correctly?
I'm using Delphi 2009 with Indy to send the HTTP request and read the response from the server.
(I think I have Indy 9.)
This is how I fetch the HTML page:
http := TIDHttp.Create(nil);
http.HandleRedirects := true;
http.ReadTimeout := 5000;
http.Request.ContentType:='multipart/form-data';
param:=TIdMultiPartFormDataStream.Create;
param.AddFormField('subcat_id','501');
param.AddFormField('reg_id','1');
text:=http.Post('example.com',param);
I don't know whether Indy has any function that retrieves the page as Unicode.

You have not given enough information, but I will try to suggest this: If possible, load the data in a Stream and then create a StringList and load it like this:
var
MS:TMemoryStream;
SL: TStringList;
(...)
begin
MS:=TMemoryStream.Create;
SL:=TStringList.Create;
// Load the raw response bytes into MS here
MS.Position := 0; // LoadFromStream reads from the current position
SL.LoadFromStream(MS, TEncoding.UTF8);
(...)
MS.Free;
SL.Free;
end;
Comment if there is a problem.

Your question title seems to be out of sync with the question body. Assuming you want to decode a UTF-8 encoded HTML page, the function you need is UTF8Decode. The opposite operation is done by UTF8Encode. These functions have been available since at least Delphi 7 (correct me if D6 applies too). Check out the "See Also" section of the article; it lists buffer-handling entry points for more convenience too.

Indy 9 does not support Delphi 2009. Make sure you are using the latest Indy 10 release instead. In Indy 10, the version of TIdHTTP.Post() (and TIdHTTP.Get()) that returns a String will automatically decode the data to Unicode using whatever charset is specified by the server, either in the HTTP Content-Type header, or in a <meta> tag within the HTML itself.
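A minimal sketch of that approach, assuming Indy 10 is installed and reusing the form fields from the question. The URL is a placeholder and the function name FetchPage is made up for illustration. Request.ContentType is not set by hand here, because the multipart overload of Post() supplies its own content type, including the boundary.

```pascal
program FetchPageDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, IdHTTP, IdMultipartFormData;

// Hypothetical helper: posts the two form fields from the question and
// returns the response body as a (Unicode) string decoded by Indy.
function FetchPage(const AURL: string): string;
var
  Http: TIdHTTP;
  Params: TIdMultiPartFormDataStream;
begin
  Http := TIdHTTP.Create(nil);
  try
    Params := TIdMultiPartFormDataStream.Create;
    try
      Http.HandleRedirects := True;
      Http.ReadTimeout := 5000;
      Params.AddFormField('subcat_id', '501');
      Params.AddFormField('reg_id', '1');
      // The String-returning overload decodes the body using the charset
      // reported by the server.
      Result := Http.Post(AURL, Params);
    finally
      Params.Free;
    end;
  finally
    Http.Free;
  end;
end;

begin
  // Placeholder URL, replace with the real endpoint.
  Writeln(FetchPage('http://example.com/'));
end.
```

If the server reports no charset, or the wrong one, the decoded text can still come out garbled. In that case it may be easier to receive the response into a stream and decode it explicitly.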

Related

How to avoid wrong characters reading UTF-8 emails with Indy 10.6 and Delphi 7

I am reading email with Indy 10.6.1.5187 and Delphi 7.
I only have problems with UTF-8 encoded emails, which show up with wrong characters on the customers' computers.
I have read a lot about this problem, but I have not found a solution other than decoding the raw email myself.
I wonder if there is a way to get correct emails when the sender encodes them in UTF-8.
Thanks.
The UTF-8 string is received as if it were an ANSI string. You have to decode it.
You have to receive the message text in a UTF8String (a.k.a. AnsiString, a.k.a. String in Delphi 7) and then convert it from UTF-8 to AnsiString or (preferably) WideString. You can use the UTF8Decode() or Utf8ToAnsi() function to decode the email body.
If you use the UTF8Decode() function, you will still need WideString-aware controls to display the received message.
If you use the Utf8ToAnsi() function, the result will lose any characters that are not part of the user's local code page.
So you will use something like:
var
ustrEmailBody: UTF8String;
wstrDecoded: WideString;
begin
...
// ustrEmailBody now contains the email body
wstrDecoded := UTF8Decode(ustrEmailBody);
SomeUnicodeAwareMemo.Text := wstrDecoded;
or
var
ustrEmailBody: UTF8String;
astrDecoded: AnsiString;
begin
...
// ustrEmailBody now contains the email body
astrDecoded := Utf8ToAnsi(ustrEmailBody);
SomeMemo.Text := astrDecoded; // the memo might display '?' in place of unknown characters
For further information, see the documentation of the UTF8Decode() and Utf8ToAnsi() functions in the Delphi help.

Using Indy httpserver to find keywords in a webpage

I'm trying to use Indy http server to find keywords within a webpage for a proxy filter. I've set up a proxy and the http server, which works with web browsers, but I'm struggling when it comes to finding a keyword within the web page.
I've been trying to convert a memory stream to string and searching for a keyword within it but maybe this is the wrong way to be doing it. I have limited experience with delphi so I'm slightly stuck.
If anyone could give me any pointers, that would be great.
Thanks.
EDIT: Ok I have added a function here where 'Stream' is the memory stream from the http server and 'what' is the keyword I'm searching, it doesn't seem to work though....
function FindInMemStream(Stream: TMemoryStream; What: String):Integer;
var
bufBuffer, bufBuffer2: array[0..254] of Char;
i: Integer;
begin
filter.Form2.ListBox1.Items.Add('finding');
What := 'train';
Result := 0;
i := 0;
FillChar(bufBuffer, 255, #0);
FillChar(bufBuffer2, 255, #0);
StrPCopy(@bufBuffer2, What);
Stream.Position:=0;
while Stream.Position <> Stream.Size do
begin
Stream.Read(bufBuffer[0],Length(What));
if CompareMem(@bufBuffer,@bufBuffer2,Length(What)) then
begin
filter.Form2.ListBox1.Items.Add(IntToStr(Stream.Position-Length(What)));
Result := Stream.Position-Length(What); // not 0 : it's found keyphrase
Exit;
end;
i := i + 1;
// filter.Form2.ListBox1.Items.Add(IntToStr(i));
Stream.Seek(i,0)
end;
end;
There are libraries which can be used for HTML parsing, for example the (commercial) DIHtmlParser.
DIHtmlParser reads, extracts information from, and writes HTML, XHTML, and XML.
From its feature list:
Full Unicode support (UnicodeString or WideString, depending on Delphi version).
Reads and writes over 70 character sets natively (independent of the OS).
Operates on TStreams, memory buffers or strings.
Returns a single piece of HTML to the application at a time.
With such a library, the HTML content (visible text) can be extracted easily from the HTML response, and the remaining task to find the search term would become trivial.
I would not try to write my own HTML parser, but rather use an existing library.
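For a quick check that does not need a parser, the approach described in the question (turn the memory stream into a string, then search it) can be sketched as below. This is only an illustration under stated assumptions: it searches the raw bytes, so it matches markup as well as visible text, it is case-sensitive, and it assumes the keyword is plain ASCII and the page uses a single-byte encoding or UTF-8. The function keeps the name from the question but returns a 1-based offset, with 0 meaning "not found".

```pascal
program FindKeywordDemo;
{$APPTYPE CONSOLE}

uses
  Classes, SysUtils;

// Returns the 1-based byte offset of What inside Stream, or 0 if absent.
function FindInMemStream(Stream: TMemoryStream; const What: AnsiString): Integer;
var
  Content: AnsiString;
begin
  // Copy the stream's bytes into a byte string without any conversion.
  SetString(Content, PAnsiChar(Stream.Memory), Integer(Stream.Size));
  Result := Pos(What, Content);
end;

var
  MS: TMemoryStream;
  Html: AnsiString;
begin
  MS := TMemoryStream.Create;
  try
    Html := '<html><body>the train leaves</body></html>';
    MS.WriteBuffer(Html[1], Length(Html));
    if FindInMemStream(MS, 'train') > 0 then
      Writeln('keyword found')
    else
      Writeln('keyword not found');
  finally
    MS.Free;
  end;
end.
```

Because the whole buffer is searched in one call, there is no need to step through the stream position by position as the original loop does.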

HTTP Post text to SMS service getting %20 in text Delphi 2007

I'm using Indy to POST to an SMS service that then sends the SMS, but the SMS text arrives on my phone with %20 instead of spaces. Here is the code:
url,text:string;
IdHTTP1: TIdHTTP;
IdSSLIOHandlerSocketOpenSSL2: TIdSSLIOHandlerSocketOpenSSL;
begin
IdSSLIOHandlerSocketOpenSSL2 := TIdSSLIOHandlerSocketOpenSSL.Create;
IdHTTP1 := TIdHTTP.Create;
IdSSLIOHandlerSocketOpenSSL2.SSLOptions.Method := sslvSSLv23;
IdHTTP1.IOHandler := IdSSLIOHandlerSocketOpenSSL2;
IdHTTP1.HandleRedirects := true;
IdHTTP1.ReadTimeout := 5000;
param:=TStringList.create;
param.Clear;
param.Add('action=create');
param.Add('token=' + SMSToken);
param.Add('to=' + Phone);
param.Add('msg=' + MessageText);
url:='https://api.tropo.com/1.0/sessions';
try
text:=IdHTTP1.Post(url, param);
thanks
The TStrings version of TIdHTTP.Post() sends an application/x-www-form-urlencoded request to the server. The posted data is url-encoded by default. The server needs to decode the posted data before processing it. It sounds like the server-side code is not doing that correctly. You can remove the hoForceEncodeParams flag from the TIdHTTP.HTTPOptions property to disable the url-encoding of the posted data, but I would advise you to report the bug to Tropo instead so they can fix their server-side code.
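As a sketch of the workaround mentioned above, assuming Indy 10: the option can be removed from the set before calling Post(). The unit and procedure names are made up for illustration.

```pascal
unit SmsPostOptions;

interface

uses
  IdHTTP;

// Hypothetical helper: turns off Indy's url-encoding of TStrings form data.
procedure DisableParamEncoding(AHttp: TIdHTTP);

implementation

procedure DisableParamEncoding(AHttp: TIdHTTP);
begin
  // HTTPOptions is a set, so the flag is removed with set subtraction.
  AHttp.HTTPOptions := AHttp.HTTPOptions - [hoForceEncodeParams];
end;

end.
```

Call DisableParamEncoding(IdHTTP1) before IdHTTP1.Post(url, param). Keep in mind that once encoding is off, a value containing & or = is sent as-is and will corrupt the form body, which is one more reason to prefer a server-side fix.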
TIdHTTP itself does not apply quoted-printable encoding to posted data, so the data being posted has to be quoted-printable encoded beforehand.
In Indy 10, you can use the TIdFormDataField.Charset property to specify how strings are converted to bytes, and then use the TIdFormDataField.ContentTransfer property to specify how the bytes are encoded. For the ContentTransfer, you can specify '7bit', '8bit', 'binary', 'quoted-printable', 'base64', or a blank string (which is equivalent to '7bit', but without stating as much in the MIME header).
Set the TIdFormDataField.CharSet property to a charset that matches what your OS is using, and then set the TIdFormDataField.ContentTransfer property to '8bit'.
Alternatively, use the TStream overloaded version of TIdMultipartFormDataStream.AddFormField() instead of the String overloaded version, then you can store data in your input TStream any way you wish and it will be encoded as-is based on the value of the TIdFormDataField.ContentTransfer property. This should remove the %20 you are getting.
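The multipart variant could look roughly like this. It is a sketch under assumptions: it needs a recent Indy 10 build in which AddFormField() returns the TIdFormDataField, the unit and function names are invented for illustration, and whether the SMS service accepts multipart/form-data at all is not confirmed by anything above.

```pascal
unit SmsMultipartPost;

interface

uses
  IdHTTP, IdMultipartFormData;

// Hypothetical helper: posts the same four fields as the question,
// but as multipart/form-data instead of url-encoded form data.
function PostSmsAsMultipart(AHttp: TIdHTTP; const AURL, AToken, APhone,
  AMessage, ACharset: string): string;

implementation

function PostSmsAsMultipart(AHttp: TIdHTTP; const AURL, AToken, APhone,
  AMessage, ACharset: string): string;
var
  Form: TIdMultiPartFormDataStream;
  MsgField: TIdFormDataField;
begin
  Form := TIdMultiPartFormDataStream.Create;
  try
    Form.AddFormField('action', 'create');
    Form.AddFormField('token', AToken);
    Form.AddFormField('to', APhone);
    MsgField := Form.AddFormField('msg', AMessage);
    MsgField.Charset := ACharset;       // how the string becomes bytes
    MsgField.ContentTransfer := '8bit'; // send those bytes unencoded
    Result := AHttp.Post(AURL, Form);
  finally
    Form.Free;
  end;
end;

end.
```

Multipart fields are not url-encoded, so spaces in the message travel as plain spaces.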

HttpGetText(), autodetect charset, and convert source to UTF8

I'm using HttpGetText with Synapse for Delphi 7 Professional to get the source of a web page - but feel free to recommend any component and code.
The goal is to save some time by 'unifying' non-ASCII characters to a single charset, so I can process it with the same Delphi code.
So I'm looking for something similar to "Select All and Convert To UTF without BOM in Notepad++", if you know what I mean. ANSI instead of UTF8 would also be okay.
Webpages are encoded in 3 charsets: UTF8, "ISO-8859-1=Win 1252=ANSI" and plain HTML4 without a charset spec, i.e. with characters such as Å written as HTML entities in the content.
If I need to code a PHP page that does the conversion, that's fine too. Whatever is the least code / time.
When you retrieve a webpage, its Content-Type header (or sometimes a <meta> tag inside the HTML itself) tells you which charset is being used for the data. You would decode the data to Unicode using that charset, then you can encode the Unicode to whatever you need for your processing.
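A minimal sketch of that decode-then-encode step, written for Delphi 7 (where String, AnsiString and UTF8String share the same storage). It assumes the charset name has already been extracted from the header or <meta> tag, and that any non-UTF-8 page uses the machine's own ANSI code page, e.g. Windows-1252 on a Western European system. BodyToUtf8 is a made-up name. Pages that only use HTML entities are plain ASCII and pass through unchanged; the entities themselves are not resolved here.

```pascal
program NormalizeCharsetDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: Raw is the undecoded response body, Charset the
// value reported by the Content-Type header or <meta> tag.
function BodyToUtf8(const Raw: AnsiString; const Charset: string): UTF8String;
begin
  if SameText(Charset, 'utf-8') then
    Result := Raw              // already UTF-8, pass the bytes through
  else
    Result := AnsiToUtf8(Raw); // ANSI -> Unicode -> UTF-8 via the system code page
end;

var
  Utf8Body: UTF8String;
begin
  // $C5 is the byte for A-ring in ISO-8859-1 and Windows-1252.
  Utf8Body := BodyToUtf8(#$C5, 'iso-8859-1');
  Writeln('UTF-8 length in bytes: ', Length(Utf8Body));
end.
```

After this step every page is UTF-8, so the rest of the processing can use one code path. Going the other way, to a single ANSI charset, would use Utf8ToAnsi() on the UTF-8 pages instead.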
I instead did the reverse conversion directly after retrieving the HTML using GpTextStream. Making the documents conform to ISO-8859-1 made them processable using straight up Delphi, which saved quite a bit of code changes. On output all the data was converted to UTF-8 :)
Here's some code. Perhaps not the prettiest solution but it certainly got the job done in less time. Note that this is for the reverse conversion.
procedure UTF8FileTo88591(fileName: string);
const bufsize=1024*1024;
var
fs1,fs2: TFileStream;
ts1,ts2: TGpTextStream;
buf:PChar;
siz:integer;
procedure LG2(ss:string);
begin
//dont log for now.
end;
begin
fs1 := TFileStream.Create(fileName,fmOpenRead);
fs2 := TFileStream.Create(fileName+'_ISO88591.txt',fmCreate);
//compatible enough for my purposes with default 'Windows/Notepad' CP 1252 ANSI and Swe ANSI codepage, Latin1 etc.
//also works for ASCII sources with htmlencoded accent chars, naturally
try
LG2('Files opened OK.');
GetMem(buf,bufsize);
ts1 := TGpTextStream.Create(fs1,tsaccRead,[],CP_UTF8);
ts2 := TGpTextStream.Create(fs2,tsaccWrite,[],ISO_8859_1);
try
siz:=ts1.Read(buf^,bufsize);
LG2(inttostr(siz)+' bytes read.');
if siz>0 then ts2.Write(buf^,siz);
finally
LG2('Bytes read and written OK.');
FreeAndNil(ts1);
FreeAndNil(ts2);
end;
finally
FreeAndNil(fs1);
FreeAndNil(fs2);
FreeMem(buf);
LG2('Everything freed OK.');
end;
end; // UTF8FileTo88591

Delphi: problem with httpcli (ICS) post method

I am using the HttpCli component from ICS to POST a request. I use an example that comes with the component. It says:
procedure TForm4.Button2Click(Sender: TObject);
var
Data : String;
begin
Data:='status=no';
HttpCli1.SendStream := TMemoryStream.Create;
HttpCli1.SendStream.Write(Data[1], Length(Data));
HttpCli1.SendStream.Seek(0, 0);
HttpCli1.RcvdStream := TMemoryStream.Create;
HttpCli1.URL := Trim('http://server/something');
HttpCli1.PostAsync;
end;
But in fact, it sends not
status=no
but
s.t.a.t.u
I can't understand, where is the problem. Maybe someone can show an example, how to send POST request with the help of HttpCli component?
PS I can't use Indy =)
I suppose you're using Delphi 2009 or later, where the string type holds two-byte-per-character Unicode data. The Length function gives the number of characters, not the number of bytes, so when you put your string into the memory stream, you only copy half the bytes from the string. Even if you'd copied all of them, though, you'd still have a bunch of extra data in the stream since each character has two bytes and the server probably only expects to get one.
Use a different string type, such as AnsiString or UTF8String.
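The effect can be checked without ICS at all. The sketch below, using only the RTL, writes the same POST body into a memory stream from an AnsiString, so that Length() equals the byte count. The program name is made up for illustration.

```pascal
program PostBodyBytesDemo;
{$APPTYPE CONSOLE}

uses
  Classes, SysUtils;

var
  Body: AnsiString; // one byte per character, also in Delphi 2009 and later
  Stream: TMemoryStream;
begin
  Body := 'status=no';
  Stream := TMemoryStream.Create;
  try
    // For an AnsiString, Length(Body) is the number of bytes to send.
    // With a Unicode string it would have to be Length(S) * SizeOf(Char).
    Stream.WriteBuffer(Body[1], Length(Body));
    Stream.Position := 0;
    Writeln('bytes in stream: ', Stream.Size);
  finally
    Stream.Free;
  end;
end.
```

In the original handler, the same change amounts to declaring Data as AnsiString (or UTF8String) and leaving the rest of Button2Click as it is.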
