UTF-8 Encoding and Network Streams

A client and server communicate with each other via TCP, sending each other UTF-8 encoded messages.
When encoding UTF-8, the number of bytes per character is variable: a single character may take one or more bytes to represent.
Let's say I am reading a UTF-8 encoded message from the network stream and it is a huge message; in my case it was about 145k bytes. Creating a buffer of this size to read from the network stream could lead to an OutOfMemoryException, since the byte array needs that much contiguous memory.
It would be better, then, to read from the network stream in a loop until the entire message has been read, reading the pieces into a smaller buffer (probably 4 KB), then decoding each piece to a string and concatenating.
What I am wondering is what happens when the very last byte of the read buffer is actually one of the bytes of a character that is represented by multiple bytes. When I decode the read buffer, that last byte and the beginning bytes of the next read would either be invalid or decode to the wrong character. The quickest way to solve this, to my mind, would be to encode using a non-variable encoding (like UTF-16) and then make the buffer size a multiple of the number of bytes per character (a multiple of 2 for UTF-16, of 4 for UTF-32).
But UTF-8 seems to be a common encoding, which leads me to believe this is a solved problem. Is there another way to address my concern other than changing the encoding? Perhaps a linked-list type object to store the bytes would be the way to handle this, since it would not require contiguous memory.

It is a solved problem. Woot woot!
http://mikehadlow.blogspot.com/2012/07/reading-utf-8-characters-from-infinite.html
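A minimal C sketch of the idea (the post above does the equivalent in .NET): read into a small buffer, decode only up to the last complete UTF-8 sequence, and carry the unfinished tail over to the front of the next read. The socket call and handle_text() are placeholders for whatever your I/O and decode layers actually are.

#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

void handle_text(const unsigned char *utf8, size_t len);   /* your decode-and-append step */

/* If the buffer ends in the middle of a multi-byte sequence, return the
 * offset of that sequence's lead byte so it can be held back; otherwise
 * return len. */
static size_t utf8_boundary(const unsigned char *buf, size_t len)
{
    size_t i = len;
    while (i > 0 && (buf[i - 1] & 0xC0) == 0x80 && len - i < 3)
        i--;                                    /* step back over continuation bytes */
    if (i == 0)
        return len;
    unsigned char lead = buf[i - 1];
    size_t need = (lead & 0x80) == 0x00 ? 1 :   /* expected length of that sequence */
                  (lead & 0xE0) == 0xC0 ? 2 :
                  (lead & 0xF0) == 0xE0 ? 3 :
                  (lead & 0xF8) == 0xF0 ? 4 : 1;
    return (len - (i - 1)) >= need ? len : i - 1;
}

void read_utf8_message(int sock)
{
    unsigned char buf[4096];
    size_t pending = 0;                         /* bytes held over from the last read */

    for (;;) {
        ssize_t n = recv(sock, buf + pending, sizeof buf - pending, 0);
        if (n <= 0)
            break;                              /* connection closed or error */

        size_t have = pending + (size_t)n;
        size_t complete = utf8_boundary(buf, have);

        handle_text(buf, complete);             /* decode/append only whole sequences */

        pending = have - complete;              /* carry the unfinished tail over */
        memmove(buf, buf + complete, pending);
    }
}

Because at most 3 bytes are ever held back, the working memory stays at the size of the small buffer regardless of how large the overall message is.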

Related

How to convert hexadecimal data (stored in a string variable) to an integer value

Edit (abstract)
I tried to interpret Char/String data as Byte, 4 bytes at a time. This was because I could only get TComport/TDatapacket to interpret streamed data as String, not as any other data type. I still don't know how to get the Read method and OnRxBuf event handler to work with TComport.
Problem Summary
I'm trying to get data from a mass spectrometer (MS) using some Delphi code. The instrument is connected with a serial cable and follows the RS-232 protocol. I am able to send commands and process the text-based outputs from the MS without problems, but I am having trouble interpreting the data buffer.
Background
From the user manual of this instrument:
"With the exception of the ion current values, the output of the RGA are ASCII character strings terminated by a linefeed + carriage return terminator. Ion signals are represented as integers in units of 10^-16 Amps, and transmitted directly in hex format (four byte integers, 2's complement format, Least Significant Byte first) for maximum data throughput."
I'm not sure whether (1) hex data can be stored properly in a string variable. I'm also not sure how to (2) implement 2's complement in Delphi, or (3) handle the Least Significant Byte first ordering.
Following David Heffernan's advice, I went and revised my data types. Attempting to harvest binary data from characters doesn't work, because not all values from 0-255 can be properly represented; you lose data along the way, especially if your data is represented 4 bytes at a time.
The solution for me was to use the Async Professional component instead of Denjan's Comport lib. It handles data streams better and has a built-in log that I could use to figure out how to interpret streamed responses from the instrument. It's also better documented. So, if you're new to serial communications (like I am), give that a go instead.
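For the byte-order and sign questions themselves, the decoding is small once the data is handled as raw bytes rather than characters. A language-neutral C sketch (decode_ion_current is just an illustrative name; the Delphi version would do the same shifts on an array of Byte):

#include <stdint.h>

/* Assemble a 4-byte, Least-Significant-Byte-first, two's-complement
 * integer from raw bytes; b[0] is the first byte received. */
int32_t decode_ion_current(const uint8_t b[4])
{
    uint32_t u = (uint32_t)b[0]
               | (uint32_t)b[1] << 8
               | (uint32_t)b[2] << 16
               | (uint32_t)b[3] << 24;
    return (int32_t)u;   /* reinterpreting as signed yields the two's-complement value */
}

The key point is that no character conversion happens anywhere: the four bytes go straight from the receive buffer into the integer.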

Is there some byte combination that can be used as a separator of streams of Int16

I was given the task to specify a file format for internal use inside an application.
One of the intended requirements says:
The data section of the file should be made up of a series of streams of type Int16 values (short integers), delimited by a suitable combination of one or more bytes.
As I understand it, each of the two bytes of an Int16 can hold any value, so I don't know how I could choose a sequence of bytes that is guaranteed not to appear incidentally inside a stream. Is there such a sequence?
(And also, if the answer is "no", what would be a good way to determine the position and size of each stream in the file?)
By "streams," I assume the request indicates that the length is unknown when the writing of the data begins.
Therefore, I'd suggest a "chunked" encoding, where each substream is parcelled out into variable-size pieces, with the length of each piece written at the beginning as a fixed size integer. An empty chunk signals the end of the substream. Normally, there would be a maximum length of a chunk to facilitate allocation of buffers for efficient reading and writing.
This is patterned after HTTP's "chunked" transfer encoding and a similar approach is used in many other formats, such as the indefinite length encoding supported by the basic encoding rules for ASN.1.
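A rough C sketch of that chunked layout, assuming a 16-bit little-endian chunk-length prefix; the constant, the helper names, and the byte order are placeholders for whatever the spec finally fixes:

#include <stdint.h>
#include <stdio.h>

#define MAX_CHUNK 4096   /* maximum number of Int16 values per chunk */

/* Write a 16-bit value as two bytes, low byte first. */
static void write_u16(FILE *f, uint16_t v)
{
    fputc(v & 0xFF, f);
    fputc(v >> 8, f);
}

/* Write one substream as a series of chunks: each chunk is a 16-bit
 * element count followed by that many Int16 values; an empty chunk
 * (count 0) terminates the substream.  Because each chunk carries its
 * own count, the total length never has to be known up front. */
void write_substream(FILE *f, const int16_t *data, size_t count)
{
    while (count > 0) {
        size_t n = count < MAX_CHUNK ? count : MAX_CHUNK;
        write_u16(f, (uint16_t)n);
        for (size_t i = 0; i < n; i++)
            write_u16(f, (uint16_t)data[i]);
        data += n;
        count -= n;
    }
    write_u16(f, 0);   /* empty chunk: end of this substream */
}

A reader does the reverse: read a 16-bit count, read that many values, and stop the substream when the count is zero.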
I would suggest prefixing each stream with a length field, rather than trying to use delimiters, for the reason you've already given (no suitable unique delimiter). E.g.:
<length>
<stream>
<length>
<stream>
<length>
<stream>
...
where <length> is, say, a 4-byte integer which gives the number of 16-bit elements in the following stream.
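A minimal reader for that layout, assuming the 4-byte length is little-endian and counts 16-bit elements; read_stream and the byte order are illustrative choices, not part of the requirement:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Read the next length-prefixed stream: a 4-byte little-endian element
 * count followed by that many little-endian int16 values.  Returns a
 * malloc'd array (caller frees) and stores the count in *count, or
 * returns NULL at end of file / on error. */
int16_t *read_stream(FILE *f, uint32_t *count)
{
    uint8_t hdr[4];
    if (fread(hdr, 1, 4, f) != 4)
        return NULL;                                  /* end of file */

    uint32_t n = (uint32_t)hdr[0] | (uint32_t)hdr[1] << 8
               | (uint32_t)hdr[2] << 16 | (uint32_t)hdr[3] << 24;

    int16_t *data = malloc(n * sizeof *data);
    if (n != 0 && data == NULL)
        return NULL;

    for (uint32_t i = 0; i < n; i++) {
        uint8_t b[2];
        if (fread(b, 1, 2, f) != 2) { free(data); return NULL; }
        data[i] = (int16_t)((uint16_t)b[0] | (uint16_t)b[1] << 8);
    }
    *count = n;
    return data;
}

The writer is symmetric: emit the 4-byte count, then the values. If the count is not known up front, the chunked scheme above avoids having to seek back and patch it in.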

JFIF/JPEG parsing, bytes between streams

I'm parsing a JPEG/JFIF file and I noticed that after the SOI (0xFFD8) I parse the different "streams" starting with 0xFFXX (where XX is a hexadecimal number) until I find the EOI (0xFFD9). The structure of the different chunks is:
APP0 marker 2 Bytes
Length 2 Bytes
Now when I parse a chunk, I read until I reach the length written in the 2-byte length field. After that I thought I would immediately find another marker, followed by a length for the next chunk. According to my parser that is not always true; there might be data between the chunks. I couldn't find out what that data is, or whether it is relevant to the image. Do you have any hints as to what this could be and how to interpret those bytes?
I'm lost and would be happy if somebody could point me in the right direction. Thanks in advance.
I've recently noticed this too. In my case it's an APP2 chunk (the ICC profile) which doesn't contain the length of the chunk.
In fact, so far as I can see, the length of the chunk needn't be in the first 2 bytes (though it usually is).
In JFIF all 0xFF bytes are replaced with 0xFF 0x00 in the data section, so it should just be a matter of calculating the length from that. I just read until I hit another header; however, I've noticed that sometimes (again in the ICC profile) there are byte sequences which don't make sense, such as 0xFF 0x6D, so I may still be missing something.
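For reference, a rough sketch of the marker walk being described: read a marker, and for segments that carry a payload read the 2-byte big-endian length (which includes the length field itself) and skip that many bytes; after SOS, scan the entropy-coded data, where 0xFF 0x00 is a stuffed data byte and any other 0xFF xx is the next marker. This sketches the general JPEG segment layout, not any particular file:

#include <stdio.h>

/* Walk the segments of a JPEG/JFIF file: each marker is 0xFF followed
 * by a marker byte; most segments then carry a 2-byte big-endian length
 * that includes the length field itself.  Inside the entropy-coded scan
 * data (after SOS, 0xFFDA) a literal 0xFF is stuffed as 0xFF 0x00. */
void walk_segments(FILE *f)
{
    int c, marker;
    while ((c = fgetc(f)) != EOF) {
        if (c != 0xFF)
            continue;                                 /* resync on the next 0xFF */
        marker = fgetc(f);
        while (marker == 0xFF)
            marker = fgetc(f);                        /* skip fill bytes */
        if (marker == EOF)
            break;
        if (marker == 0xD8) { printf("SOI\n"); continue; }
        if (marker == 0xD9) { printf("EOI\n"); return; }
        if (marker >= 0xD0 && marker <= 0xD7)
            continue;                                 /* RSTn: no length field */

        int hi = fgetc(f), lo = fgetc(f);
        long len = ((long)hi << 8) | lo;              /* includes these 2 bytes */
        printf("marker FF%02X, length %ld\n", marker, len);
        fseek(f, len - 2, SEEK_CUR);                  /* skip the payload */

        if (marker == 0xDA) {                         /* SOS: scan data follows */
            int prev = 0;
            while ((c = fgetc(f)) != EOF) {
                if (prev == 0xFF && c != 0x00 && !(c >= 0xD0 && c <= 0xD7)) {
                    fseek(f, -2, SEEK_CUR);           /* step back onto the marker */
                    break;
                }
                prev = c;
            }
        }
    }
}

If a parser still sees unexplained bytes between segments with this logic, it is worth checking whether the previous segment's length was read as big-endian and whether fill bytes (runs of 0xFF) are being skipped.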

empty buffer but IdTCPClient.IOHandler.InputBufferIsEmpty is false

I have a problem with the code below, using IdTCPClient to read the buffer from a Telnet server:
procedure TForm2.ReadTimerTimer(Sender: TObject);
var
  S: String;
begin
  if IdTCPClient.IOHandler.InputBufferIsEmpty then
  begin
    IdTCPClient.IOHandler.CheckForDataOnSource(10);
    if IdTCPClient.IOHandler.InputBufferIsEmpty then Exit;
  end;
  S := IdTCPClient.IOHandler.InputBufferAsString(TEncoding.UTF8);
  CheckText(S);
end;
This procedure runs every 1000 milliseconds, and when the buffer has a value, CheckText is called.
The code works, but sometimes it passes an empty string to CheckText.
What's the problem?
Thanks
Your code is attempting to read arbitrary blocks of data from the InputBuffer and expects them to be complete and valid strings. It is doing this without ANY consideration for what kind of data you are receiving. That is a recipe for disaster on multiple levels.
You are connected to a Telnet server, but you are using TIdTCPClient directly instead of using TIdTelnet, so you MUST manually decode any Telnet sequences that are received BEFORE you can then process any remaining string data. Look at the source code for TIdTelnet. There is a lot of decoding logic that takes place before the OnDataAvailable event is fired. All Telnet sequence data is handled internally, then the OnDataAvailable event provides whatever non-Telnet data is left over after decoding.
Once you have Telnet decoding taken care of, another problem you have to watch out for is that TEncoding.UTF8 only handles properly encoded COMPLETE UTF-8 sequences. If it encounters a badly encoded sequence, or more importantly encounters an incomplete sequence, THE ENTIRE DECODE FAILS and it returns a blank string. This has already been reported as a bug (see QC #79042).
CheckForDataOnSource() stores whatever raw bytes are in the socket at that moment into the InputBuffer. InputBufferAsString() extracts whatever raw bytes are in the InputBuffer at that moment and attempts to decode them using the specified encoding. It is very possible and likely that the raw bytes that are in the InputBuffer when you call InputBufferAsString() do not always contain COMPLETE UTF-8 sequences. Chances are that sometimes the last sequence in the InputBuffer is still waiting for bytes to arrive in the socket and they will not be read until the next call to CheckForDataOnSource(). That would explain why your CheckText() function is receiving blank strings when using TEncoding.UTF8.
You should use IndyUTF8Encoding() instead (Indy implements its own UTF-8 encoder/decoder to avoid the decoding bug in TEncoding.UTF8). At the very least, you will not get blank strings anymore, however you can still lose data when a UTF-8 sequence spans multiple CheckForDataOnSource() calls (incomplete UTF-8 sequences will be converted to ? characters). For that reason alone, you should not be using InputBufferAsString() in this situation (even if TEncoding.UTF8 did work properly). To handle this properly, you should either:
1) scan through the InputBuffer manually, calculating how many bytes constitute COMPLETE UTF-8 sequences only, and then pass that count to InputBuffer.Extract() or TIdIOHandler.ReadString() (see the sketch after this list). Any leftover bytes will remain in the InputBuffer for the next time. For that to work, you will have to get rid of the first InputBufferIsEmpty() call and just call CheckForDataOnSource() unconditionally so that you are always checking for more bytes even if you already have some.
2) use TIdIOHandler.ReadChar() instead and get rid of the calls to InputBufferIsEmpty() and CheckForDataOnSource() altogether. The downside is that you will lose data if a UTF-8 sequence decodes into a UTF-16 surrogate pair. ReadChar() can decode surrogates, but it cannot return the second character in the pair (I have started working on new ReadChar() overloads for a future release of Indy that return String instead of Char so full surrogate pairs can be returned).
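A language-neutral sketch of option 1's boundary calculation, written in C for brevity; in Delphi you would run the same scan over the InputBuffer's raw bytes and pass the resulting count to InputBuffer.Extract() or ReadString(). CompleteUtf8Bytes is an illustrative name, not an Indy routine:

#include <stddef.h>

/* Return how many leading bytes of buf form complete, well-formed UTF-8
 * sequences.  A truncated or malformed trailing sequence is excluded
 * from the count, so those bytes can stay in the buffer until more
 * data arrives. */
size_t CompleteUtf8Bytes(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i];
        size_t n = (b < 0x80)           ? 1
                 : ((b & 0xE0) == 0xC0) ? 2
                 : ((b & 0xF0) == 0xE0) ? 3
                 : ((b & 0xF8) == 0xF0) ? 4
                 : 0;                          /* invalid lead byte */
        if (n == 0 || i + n > len)
            return i;                          /* malformed or not complete yet */
        for (size_t k = 1; k < n; k++)         /* continuation bytes must be 10xxxxxx */
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;
        i += n;
    }
    return i;
}

The scan stops at the first truncated or malformed sequence, so those trailing bytes simply stay in the buffer until CheckForDataOnSource() has delivered the rest of them.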
While your code is correct, the problem is most likely that the InputBuffer contains null characters (#0), which would terminate the string.
Try Remy's solution, and check what you get in the RawByteString.
Edit
I didn't read that the OP was reading from a Telnet server.
The OP should use TIdTelnet instead of IdTCPClient.
Edit2
I just read an older post of the OP's which explains why he is not using TIdTelnet.
/Daddy
Telnet servers send a null character (#0) after each carriage return. This is most likely what you are seeing.
A null character encoded as UTF-8 is still a single byte with the value 0. Check whether that's what you are receiving.

Writing a lexer for chunked data

I have an embedded application which communicates with a RESTful server over HTTP. Some services involve sending some data to the client which is interpreted using a very simple lexer I wrote using flex.
Now I'm in the process of adding a gzip compression layer to reduce bandwidth consumption, but I'm not satisfied with the current architecture because of its memory requirements: first I receive the whole payload in a buffer, then I decompress the whole buffer into a new buffer, and then I feed the whole thing to flex.
I can save some memory between the first and second steps by feeding chunked data from the HTTP client to the zlib routines. But I'm wondering whether it's possible to do the same between the zlib chunked output and the flex input.
Currently I use only yy_scan_bytes and yylex to analyze the input. Does flex have any feature to feed multiple chunks of data to yylex? I've read the documentation about multiple input buffers but to no avail.
YY_INPUT seems to be the correct answer:
The nature of how [the scanner] gets its input can be controlled by defining the
YY_INPUT macro. The calling sequence for YY_INPUT() is
YY_INPUT(buf,result,max_size). Its action is to place up to max_size
characters in the character array buf and return in the integer
variable result either the number of characters read or the constant
YY_NULL (0 on Unix systems) to indicate `EOF'.
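A sketch of a YY_INPUT definition for this case, placed in the %{ %} block of the .l file. next_inflated_chunk() is a placeholder for your zlib/HTTP layer: it should copy up to max_size bytes of decompressed data into buf and return how many bytes it wrote, or 0 when the stream is exhausted:

%{
#include <stddef.h>

/* Supplied by the zlib/HTTP layer (hypothetical): fills buf with up to
 * max_size decompressed bytes, returns the number written, 0 at end. */
size_t next_inflated_chunk(char *buf, size_t max_size);

/* Redirect the scanner's input: yylex() calls this whenever its
 * internal buffer runs dry, so only one chunk needs to be in memory. */
#define YY_INPUT(buf, result, max_size)                              \
    do {                                                             \
        size_t n = next_inflated_chunk((buf), (size_t)(max_size));   \
        (result) = (n == 0) ? YY_NULL : (int)n;                      \
    } while (0)
%}

With this in place there is no need for yy_scan_bytes at all: call yylex() and the scanner pulls decompressed chunks on demand.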

Resources