Default TEncoding.UTF8 discarding invalid blocks of data in TStreamReader input - delphi

I'm using a TStreamReader to read data from a file that purports to be utf-8. I have no problem reading the file until it comes to a section containing what appears to me to be a UTF-8 "£" symbol with the preceding xC2 missing - the file only contains the xA3 part of the character. I've traced this through the run-time library until it calls
Result := UnicodeFromLocaleChars(FCodePage, FMBToWCharFlags,
PAnsiChar(Bytes), ByteCount, nil, 0);
which returns 0, indicating that it doesn't like the input. Unfortunately the TStreamReader simply ends up discarding this buffer of input and then continues with the rest of the file without raising an error. This is extremely misleading about what the problem is, but that is just a side issue.
The issue appears to be a "defect" in the UTF-8 TEncoding class in that it simply discards the results of a failed conversion whilst the TStreamReader assumes that this isn't the behaviour of TEncoding.
I can work around this by using
Reader := TStreamReader.Create(FileStream, TMBCSEncoding.Create(CP_UTF8, 0, 0));
instead of
Reader := TStreamReader.Create(FileStream, TEncoding.UTF8);
as this makes it ignore the corrupt UTF-8 and simply include something (I haven't checked what) in my output. However, I would like to combine allowing the data through with reporting it and there doesn't seem to be any obvious way of doing this as the behaviour is hidden deep within the library.
Does anyone know of any standard Delphi library tools for doing this or do I need to resort to a lot of custom code?

Related

Multiple TFileStreams, TStreamWriter writing to same file

I have started using a TFileStream and TStreamWriter to write simple text log files (instead of the old Writeln(T,....)), and I have multiple applications writing to the same log file.
Each application has its own TFileStream, of course, and each one opens the file like this:
FFileStream:=TFileStream.Create(LogName, fmOpenReadWrite+fmShareDenyNone);
FExporter:=TStreamWriter.Create(FFilestream, TEncoding.UTF8);
FExporter.NewLine:=#$0A;
FExporter.AutoFlush:=TRUE;
and write to the file with
FExporter.BaseStream.Seek(0, soFromEnd);
FExporter.Write('['+DateToStr(Now, FDateTimeFormat)+'] ['+TimeToStr(Now, FDateTimeFormat)+'] [#'+Lead0(GetCurrentThreadId, 5)+']: '+EntryText);
FExporter.WriteLine;
The result is somewhat "unsatisfactory": lines are displaced, empty lines appear in between, and overall it does not seem to work.
HOW would I do that correctly?
When several processes write to the file at the same time, their writes can interleave in unexpected ways because they execute in parallel.
You should make sure that each entry is written as one contiguous block, so the line break should be appended to the text passed to Write instead of being issued by a separate WriteLine call.
So the write should look like this:
FExporter.BaseStream.Seek(0, soFromEnd);
FExporter.Write('['+DateToStr(Now, FDateTimeFormat)+'] ['+TimeToStr(Now, FDateTimeFormat)+'] [#'+Lead0(GetCurrentThreadId, 5)+']: '+EntryText + System.slineBreak);
//FExporter.WriteLine;
Update1:
As the link Oliver posted explains, this can still fail if the message to be written is bigger than the OS file sector and, at that very moment, another process also tries to write a message. In that case the resulting content might be mixed.
So doing what I first proposed increases the probability of getting the desired result, but it may not be the solution in 100% of cases.
To be 100% sure of writing a continuous log to a single file from multiple processes, you should create a dedicated logging process that receives messages from the others and is solely responsible for writing the log in a synchronized way.

Trying to read the contents from strstream causes access violation

I am trying to read the contents of an ostrstream using str(). Whenever I do so, I get access violations and my application crashes. Is there a way to read from a strstream without causing stream errors?
I am working on a legacy project built on Borland C++. I am presently using Borland C++ v5.02 for building my project. Since the code is vast and scattered over a large number of files, I am unable to paste the code here. However, I will try to highlight my use case.
ps is the stream which is being used throughout the project to print receipts. I need to get the receipt data from this strstream without breaking the code.
string str = ps.pStr->str ();
ps.Pstr->rdbuf ()->freeze (0);
ps << EndJob;
The last line causes an access violation.
You forgot to null-terminate the buffer.
Before any call to str() that uses the result as a C string, the buffer must be null-terminated, typically with std::ends.

Error loading file with full name containing spaces in directory with delphi

I am using XE8, win 8.1.
When I try to load a file whose directory contains spaces, I get an exception saying that the syntax of the file or directory name is invalid.
If I use the ImageEn dialog to preview the file, no errors are found.
I ran two tests, with the procedures load_file1 and load_file2, and I get the same problem in both.
Is there a workaround to solve it?
function get_file:string;
begin
result:='"C:\Compartilhada\dicomserver versoes\dicomserverx\data\Genesis-1000\1.2.410.200013.1.215.1.200912141600580009_0001_000001_13061821270002.dcm"'
end;
procedure load_file1;
var fStm:Tstream;
p1:string;
begin
p1:=get_file;
fStm := tFileStream.Create( p1, fmOpenRead or fmShareDenyNone ); //->Error Here
try
TBlobField(FieldByName('dicom')).LoadFromStream(fStm);
Post;
finally
fSTm.Free;
end;
end;
procedure load_file2;
var p1:string;
begin
p1:=get_file;
TBlobField(FieldByName('dicom')).LoadFromFile(p1); //-->Error Here
Post;
end;
Remove the double quote marks from your string. It should be:
'C:\Compartilhada\dicomserver versoes\dicomserverx\data\Genesis-1000\1.2.410.200013.1.215.1.200912141600580009_0001_000001_13061821270002.dcm'
You might use " for paths containing spaces in some situations, for instance a command interpreter. But at the API level, it is simply not needed. And indeed it is a mistake as you have discovered. The double quote character " is actually a reserved character in a file name. That is documented on MSDN:
Naming Files, Paths, and Namespaces: Naming Conventions
The following fundamental rules enable applications to create and process valid names for files and directories, regardless of the file system:
...
Use any character in the current code page for a name, including Unicode characters and characters in the extended character set (128–255), except for the following:
The following reserved characters:
< (less than)
> (greater than)
: (colon)
" (double quote)
/ (forward slash)
\ (backslash)
| (vertical bar or pipe)
? (question mark)
* (asterisk)
...
...
In comments below you indicate that the code in the question does not reflect your actual problem. Which makes me wonder how you expect us to help. Your real problem is not the error message produced by the specific code, but that your debugging skills are letting you down. Let me try to explain how to debug a problem like this.
First of all, you are passing a file name to LoadFromFile or TFileStream.Create. These calls fail with an error that indicates that the file name is not valid.
So, when faced with that knowledge, the first step is to check the value of the file name that you are passing. Use debugging techniques to do that. Either the IDE debugger, or logging.
Once you have identified what value you are actually passing to these functions you can try to work out what is invalid about it.
To repeat, your real problem is not with the specifics, but in your debugging skills. You should take this as an opportunity to learn more about debugging. Stack Overflow is not a substitute for debugging. Learn to debug better, and your life as a programmer will become very much easier.

What should the JCA deployment descriptor (ra.xml) character encoding be?

Looking through JCA 1.7 specification I could only find in one of their examples on the Resource Adapter Deployment Descriptor the following (Chapter 13: Message Inflow P 13-50):
This example uses UTF-8 encoding, but nothing says whether that was just a choice made for the illustration or a mandatory restriction on the file's character encoding.
I'm asking this because I'm writing a Java program to read one of these files and FindBugs™ is giving me this message:
DM_DEFAULT_ENCODING: Reliance on default encoding
Found a call to a method which will perform a byte to String (or
String to byte) conversion, and will assume that the default platform
encoding is suitable. This will cause the application behaviour to
vary between platforms. Use an alternative API and specify a charset
name or Charset object explicitly.
Line 4 in this Java code snippet is where character encoding will be specified:
01. byte[] contents = new byte[1024];
02. int bytesRead = 0;
03. while ((bytesRead = bin.read(contents)) != -1)
04. result.append(new String(contents, 0, bytesRead));
So, is it possible to specify the expected encoding of this file in this case or not?
From what I have seen, most people use UTF-8 for their ra.xml. However, there is no restriction against using another encoding, so if your parsing expects UTF-8 only, the result might not be what you expect.
So you either need to account for this in your code when you read the file as plain text, or read it as an XML file and save yourself the headache. I don't think the difference in performance will be an issue, because ra.xml files do not usually grow to gigabytes. At least the ones I've seen so far average a few megabytes.
For the FindBugs issue, you just need to specify the encoding as UTF-8. Otherwise you will be using the JVM default, which is determined during virtual-machine startup and typically depends upon the locale and charset of the underlying operating system. Although relying on the default is not recommended here, if that is what you want then specify the default encoding explicitly. This would get rid of the FindBugs issue.
So your code would look something like this:
01. byte[] contents = new byte[1024];
02. int bytesRead = 0;
03. while ((bytesRead = bin.read(contents)) != -1)
04. result.append(new String(contents, 0, bytesRead, Charset.defaultCharset()));
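As a variation on the snippet above, here is a minimal sketch that names the charset explicitly and decodes through an InputStreamReader. It assumes the file really is UTF-8, which the JCA specification does not guarantee. The class and method names (ExplicitCharsetRead, readAll) are made up for illustration. One reason to prefer a Reader is that decoding each byte chunk separately with new String(...) can split a multi-byte character that straddles a buffer boundary, whereas the reader carries partial sequences over between reads.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any JCA or FindBugs API.
class ExplicitCharsetRead {

    // Decodes the whole stream with the given charset. The reader keeps
    // incomplete multi-byte sequences between read() calls, so a character
    // is never cut in half at a buffer boundary.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder result = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(in, charset)) {
            int charsRead;
            while ((charsRead = reader.read(buffer)) != -1) {
                result.append(buffer, 0, charsRead);
            }
        } catch (IOException e) {
            // Rethrow unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Assumption: the input is UTF-8 encoded.
        byte[] xml = "<display-name>caf\u00e9</display-name>".getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(xml), StandardCharsets.UTF_8));
    }
}
```

Passing the Charset as a parameter keeps the choice visible at the call site, which is what the FindBugs warning asks for.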
FindBugs just warns you that you're relying on default system encoding, so it's possible that if your application will be launched by another user in another country you might get unexpected results. It's better to explicitly specify which encoding you want to use.
In your case the actual encoding should be extracted from XML file. There are several ways to get it. One method is to use XMLStreamReader as described in this answer.
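A minimal sketch of that approach, using the standard StAX API (javax.xml.stream) to read the encoding named in the XML declaration. The class and method names (XmlEncodingProbe, declaredEncoding) are hypothetical, and the sketch only inspects the declaration; it does not parse the rest of ra.xml.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
class XmlEncodingProbe {

    // Returns the encoding named in the XML declaration; per the StAX
    // Javadoc this is null when the declaration does not name one.
    static String declaredEncoding(InputStream in) {
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            try {
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close(); // does not close the underlying stream
            }
        } catch (XMLStreamException e) {
            // Rethrow unchecked to keep the sketch's signature simple.
            throw new IllegalStateException("Could not read the XML declaration", e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><connector/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(doc)));
    }
}
```

If the whole file is going to be processed as XML anyway, handing the raw InputStream to the XML parser is usually simpler, since the parser applies the declared encoding itself.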

TIdMessageParts.CountParts Returns 0

I'm trying to pull multipart emails in MIME format from an IMAP server using Indy 10.5.5 in Delphi 2010. The lines of code that I'm having trouble with are below, where I instantiate the curMessage object, retrieve a message into it, and then call CountParts:
var
curMessage: TIdMessage;
IMAP4: TIdIMAP4;
msgIndex: Integer;
begin
...
curMessage := TIdMessage.Create(nil);
IMAP4.Retrieve(msgIndex, curMessage);
curMessage.MessageParts.CountParts;
//code that checks counts
//and
end;
I then have some code that checks the various count properties of curMessage.MessageParts (i.e. TextPartCount). However, the CountParts procedure isn't counting anything, because the Count property referenced in the procedure block is 0, even though I've verified that the message is retrieved and placed into curMessage.
One thing I've noticed, and haven't gotten to the bottom of yet, is that IsMsgSinglePartMime is coming back as true, even though all the messages on the server have Content-Type: multipart/mixed;.
Any help would be really appreciated.
What am I missing here? I can provide more code if needed.
Without seeing the actual email data, it is difficult to say for sure exactly why the data is not where you expect it to be. But if the TIdMessage.IsMsgSinglePartMime is getting set to True then that means that either:
TIdMessage.Encoding is meMIME but TIdMessage.MIMEBoundary.Count is 0, meaning there was no MIME boundary value detected in the top-level Content-Type header. If the Content-Type is a 'multipart/...' type, a boundary is required. If it is present, it is likely malformed in a way that prevented Indy from parsing it.
TIdMessage.Encoding is mePlainText but TIdMessage.ContentTransferEncoding is either 'base64' or 'quoted-printable'.
In either case, if there is body content present then it would end up in the TIdMessage.Body property if it is textual data; otherwise it would end up in TIdMessage.MessageParts as an attachment instead. Since TIdMessage.MessageParts.Count is 0 in your case, the data is either in TIdMessage.Body or it was discarded.
You may want to consider upgrading to a newer Indy version. The version shipped with D2010 is pretty old, and there have been fixes/changes made to TIdIMAP4 and TIdMessage (and its internal parsers) in recent years.

Resources