We're in the process of standardizing on UTF-8 encoding for all source files, to make it easier for developers using a plethora of tools (notably including IntelliJ IDEA on Windows, Mac and Linux) to handle Git merge conflicts without introducing unwanted encoding changes.
Delphi 11 seems able to handle both UTF-8 and ANSI encoded PAS and DFM files well. It has a configuration setting (under Tools > Options > Editor) called "Default file encoding", which can be changed from its default of ANSI to UTF8 so that all newly created PAS files are saved with UTF-8 encoding. However, this does not seem to affect DFM files.
DFM files seem to always get saved as ANSI. This seems to apply also to DFM files that originally were in UTF-8 encoding: when I edit them in Delphi and re-save, they get changed to ANSI.
Is this a feature or a bug? If it is a feature, could you point to some authoritative documentation stating that?
DFM files use their own proprietary encoding (# followed by the decimal number of the Unicode code point) to store non-ASCII characters in string values.
However, in newer versions of Delphi, DFM files in text form may be automatically stored using UTF-8 if identifiers (class, property or component names) contain non-ASCII characters.
From the documentation for Delphi 11 Alexandria:
Component streaming (Text DFM files):
Are fully backward-compatible.
Stream as UTF-8 only if component type, property, or name contains non-ASCII-7 characters.
String property values are still streamed in “#” escaped format.
May allow values as UTF-8 as well (open issue).
Only change in binary format is potential for UTF-8 data for component name, properties, and type name.
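The "#" escaped format for string property values can be illustrated with a short Python sketch. This is not Delphi's own streaming code: the function name dfm_escape is hypothetical, and the sketch assumes that printable ASCII is written inside single quotes (with an embedded quote doubled) and that every other character is written as # followed by the decimal value of its UTF-16 code unit.

```python
def dfm_escape(text: str) -> str:
    """Approximate the '#'-escaped string form seen in text DFM files.

    Assumption: printable ASCII goes inside single quotes and every other
    UTF-16 code unit is written as '#' plus its decimal value.
    """
    parts = []  # finished pieces of the output
    run = []    # current run of printable ASCII characters

    def flush():
        # Close the current quoted run, if there is one.
        if run:
            parts.append("'" + "".join(run) + "'")
            run.clear()

    data = text.encode("utf-16-le")
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        if 32 <= unit < 127:
            ch = chr(unit)
            run.append("''" if ch == "'" else ch)  # double embedded quotes
        else:
            flush()
            parts.append(f"#{unit}")
    flush()
    return "".join(parts) or "''"


print(dfm_escape("Caf\u00e9"))  # 'Caf'#233
```

Because the escaped form contains only ASCII bytes, a DFM file whose non-ASCII text sits only in string values is byte-for-byte identical in ANSI and in UTF-8 without BOM.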
I implement the org.osgi.service.cm.ManagedService interface to get Karaf configuration. But when I give a property a Chinese value, it is garbled. Initially, the files in the etc folder are encoded in latin1. I have tried setting UTF-8 encoding, but it has no effect. Can anyone help me?
In Karaf, the configuration files (i.e. etc/*.cfg) are handled by the Felix subproject "fileinstall".
fileinstall doesn't yet support specifying a custom character encoding for the configuration. It uses the Properties class and Properties.load(InputStream), whose documentation says:
The load(Reader) / store(Writer, String) methods load and store
properties from and to a character based stream in a simple
line-oriented format specified below. The load(InputStream) /
store(OutputStream, String) methods work the same way as the
load(Reader)/store(Writer, String) pair, except the input/output
stream is encoded in ISO 8859-1 character encoding. Characters that
cannot be directly represented in this encoding can be written using
Unicode escapes as defined in section 3.3 of The Java™ Language
Specification; only a single 'u' character is allowed in an escape
sequence. The native2ascii tool can be used to convert property files
to and from other character encodings.
So you have to encode your file in ISO-8859-1 and write every character that this encoding cannot represent as a Unicode escape (\uXXXX), or use an XML file for your configuration.
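As an illustration of the Unicode escapes described in the quoted documentation, here is a minimal Python sketch that does roughly what native2ascii does for a single string. The function name to_unicode_escapes is hypothetical, and the sketch escapes every non-ASCII character (a conservative choice, since Latin-1 characters could also be stored directly).

```python
def to_unicode_escapes(text: str) -> str:
    """Replace each non-ASCII character with \\uXXXX escapes.

    Characters outside the Basic Multilingual Plane become two escapes
    (a UTF-16 surrogate pair), which is how Java represents them.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            data = ch.encode("utf-16-be")
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], "big")
                out.append("\\u%04x" % unit)
    return "".join(out)


print(to_unicode_escapes("greeting=\u4e2d\u6587"))  # greeting=\u4e2d\u6587
```

The result contains only ASCII characters, so it can be saved as ISO-8859-1 and read back by Properties.load(InputStream) without loss.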
There is a way to change the encoding of cfg files.
The configuration for the fileinstall subproject, which polls the etc/*.cfg files, is written in the config.properties file.
You can add
felix.fileinstall.configEncoding = UTF-8
This solution was checked in Karaf 4.
I have a text file whose size is 1.3 GB. Most text editors (including Notepad++) cannot open it. I need to change its format from ANSI to UTF-8. In what program can I do this?
Try EmEditor. It supports huge files very well. A free version exists.
If you want a free (and open source) command-line tool that runs on Windows and converts huge files from ANSI to UTF-8 (or any other encoding), you can use this tool that I've just created (it runs on Node.js and uses the iconv-lite library):
https://github.com/sorin-postelnicu/convert-file-encoding
You can use it like this:
node bin\convertFileEncoding.js -f latin-1 -t utf-8 -i myinputfile.txt -o myoutputfile.txt
It is fast and supports converting very large files with minimal memory consumption (around 20MB of RAM no matter the size of the input file).
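The same streaming idea can be sketched in Python for readers who do not have Node.js at hand. This is not the tool above; the function name convert_encoding and its parameters are hypothetical, and it assumes that "ANSI" means Windows-1252 (on other systems the ANSI code page differs, so pass the right one).

```python
def convert_encoding(src_path, dst_path, src_encoding="cp1252",
                     dst_encoding="utf-8", chunk_chars=1 << 20):
    """Re-encode a text file chunk by chunk, so memory use stays bounded.

    newline="" switches off newline translation, so CRLF line endings
    are copied through unchanged.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # at most chunk_chars characters
            if not chunk:
                break
            dst.write(chunk)
```

A call such as convert_encoding("myinputfile.txt", "myoutputfile.txt") would then read about one million characters at a time instead of loading the whole 1.3 GB file.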
You can also use the shareware text editor UltraEdit.
First, configure UltraEdit for editing large files according to the power tip "large file text editor".
Then open your file in UltraEdit and use File - Save As. Under Encoding (Windows Vista/7/8/8.1) or Format (Windows 2000/XP), select UTF-8 - NO BOM to save as UTF-8 without a byte order mark, or UTF-8 to save with a byte order mark at the beginning of the file.
What is the encoding of the Windows 'hosts' file? Is it UTF-8? Or ASCII + system codepage? How should IDN (international domain names with umlauts etc.) entries be added, and can they be added at all?
It should be ANSI or UTF-8 without BOM. I just dealt with a server that had the hosts file encoding set to UCS-2 Little Endian, and that led to the file being ignored.
There is a wealth of information here:
https://serverfault.com/questions/452268/hosts-file-ignored-how-to-troubleshoot
The simple answer is
ANSI or UTF-8 WITH BOM.
(UTF-8 without BOM is NOT valid).
Details:
As far as I have tried, the encoding of the hosts file on Windows should be
ANSI or UTF-8 with BOM.
I know this question is many years old, but a colleague made the mistake of looking at this post and the ServerFault post, so I decided to add an answer.
1. Simple case only ASCII
Works.
Without any multi-byte characters, this is equivalent to ANSI and also to UTF-8 without BOM.
2. ANSI (with Japanese ANSI multi-byte characters)
Works.
note: There are Japanese characters, but this is a valid ANSI encoding in Windows.
In Japanese editions of Windows, the code page cp932 is referred to as "ANSI":
https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)
3. UTF-8 with BOM
Works.
note: BOM 付き means with BOM.
4. UTF-8 without BOM
DOES NOT work.
5. Additional test cases
If you use emoji instead of Japanese characters, the result is the same.
Using emoji and saving as UTF-8 without BOM does not work.
(However, other lines that do not include emoji may still work correctly.)
Using emoji and saving as UTF-8 with BOM resolves the host correctly.
note: If you use Notepad to check this yourself, be sure to put double quotes around the file name when you save it, or Notepad will create hosts.txt.
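To check which of the cases above a given hosts file falls into, the first bytes of the file can be inspected. The following Python sketch is only an aid for that check; the function name describe_bom is hypothetical, and it does not distinguish UTF-32 from UTF-16.

```python
import codecs


def describe_bom(data: bytes) -> str:
    """Classify a file's leading bytes by byte order mark (BOM)."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "UTF-8 with BOM"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "UTF-16 little endian"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "UTF-16 big endian"
    try:
        data.decode("ascii")
        return "no BOM, ASCII only"
    except UnicodeDecodeError:
        return "no BOM, non-ASCII bytes (ANSI or UTF-8 without BOM)"


print(describe_bom(b"127.0.0.1 localhost\r\n"))  # no BOM, ASCII only
```

On a typical installation the file to read (in binary mode) is C:\Windows\System32\drivers\etc\hosts.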
Appended:
(Asked in comment)
The hosts file supports inline comments.
My program is written in Delphi 7, and I want to prevent Russian, Chinese or Korean users from using my software, because their file paths contain Unicode chars that my program can't handle yet (not until I port it to a newer Delphi version that supports Unicode).
How do I write a function detecting the "Unicode language" in Delphi 7?
A Delphi 7 program (in its VCL part) can handle Russian, Chinese or Korean characters without any problem.
If the Windows system language is properly set, the charset will match the corresponding encoding, and file names can contain whatever Unicode chars are available in this charset. In fact, the default string=AnsiString is converted into Unicode when the VCL calls Windows APIs (all ....A() calls do the conversion and then call the ....W() version).
You can force the default code page (the one which will select the charset to be used) by calling code like this:
if GetThreadLocale<>LCID then // force locale settings if different
  if SetThreadLocale(LCID) then
    GetFormatSettings; // resets all locale-specific variables
In this case, the TFileName (=AnsiString) in the current system charset will be converted by Windows into the corresponding Unicode characters, and you'll be able to use it in your Delphi 7 application.
What you can't do with the standard VCL AnsiString is use it to directly mix charsets, as you can since Delphi 2009 thanks to the new string = UnicodeString default paradigm.
PS:
Since the CharSet only involves #128..#255 chars (i.e. all with bit 7 set), if you use only #0..#127 chars, your string will be consistent whatever the current charset/codepage setting is. If you use only English chars and numbers, for example, your path will always work, whatever the charset/codepage is. But if you use non-English chars, the path will only work if the charset/codepage is correctly set, which is the case for a path used by an end-user (using a TOpenDialog at runtime for instance).
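The dependence on the code page can be made concrete with a small Python sketch (Python rather than Delphi 7, purely as an illustration). The function name fits_code_page is hypothetical; the code pages used are the Windows ones for Western European (cp1252), Cyrillic (cp1251) and Japanese (cp932) text.

```python
def fits_code_page(path: str, code_page: str) -> bool:
    """Return True if every character of path exists in code_page."""
    try:
        path.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


ascii_name = "report_2024.txt"
russian_name = "\u0418\u0432\u0430\u043d.txt"  # a Cyrillic file name

# Chars in #0..#127 encode to the same bytes in each of these code pages.
print(ascii_name.encode("cp1252") == ascii_name.encode("cp1251"))  # True

# Chars above #127 only work when the code page matches.
print(fits_code_page(russian_name, "cp1251"))  # True
print(fits_code_page(russian_name, "cp1252"))  # False
```

A Russian file name on a Russian system therefore fits in an AnsiString, while the same name on a Western European system does not.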
We have been reading and writing Sticky Notes/Annotations/Comments to PDFs via an ActiveX control in our application for a number of years. We have recently upgraded to Delphi 2009 with Unicode support. The following is causing problems.
When we call
CAcroPDAnnot.GetContents
The results seem to be rather strange and we lose our Unicode Chars. It is not like saving as an ansi string which would usually result in returning ????? instead we get a string such as
‚És‚“ú‚É•—Ž×‚ð‚Ђ¢‚½‚ç
For a string of Japanese characters.
However if I save the comments in the pdf to a datafile via the menu in the pdf itself it is written to file as something like
0kˆL0Oeå0k˜¨ª0’0r0D0_0‰
The latter can be exported and reimported into an Acrobat PDF and will recreate the correct Unicode characters. However, once I call CAcroPDAnnot.GetContents in my code, it comes back as something else.
Is CAcroPDAnnot.GetContents broken?
Is there an encoding scheme I should be aware of?
Is there an alternative I might be able to do?
Thanks
‚És‚“ú‚É•—Ž×‚ð‚Ђ¢‚½‚ç
That's the string:
に行く日に風邪をひいたら
in CP-932 aka Shift-JIS encoding, an awful but lamentably still-popular encoding in Japan.
You're currently interpreting it as CP-1252 (Windows Western European). If your PDF-reading component won't convert it for you automatically, you'll need to find a way to detect what encoding the document is in and convert it manually.
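The misinterpretation can be reproduced, and reversed, in a few lines. The sketch below is in Python rather than Delphi and only illustrates the byte-level mechanics; the function name repair_mojibake is hypothetical. It assumes that no bytes were lost along the way: a few byte values (for example 0x8D) have no character in CP-1252, so a string that has already dropped them cannot be repaired like this.

```python
def repair_mojibake(garbled: str, wrong: str = "cp1252",
                    right: str = "cp932") -> str:
    """Undo a decode with the wrong code page.

    Turn the text back into the bytes it came from, then decode those
    bytes with the code page they were really written in.
    """
    return garbled.encode(wrong).decode(right)


original = "\u65e5\u306b"  # two characters from the Japanese sentence

# Shift-JIS bytes read as CP-1252: the kind of garbage GetContents returns.
garbled = original.encode("cp932").decode("cp1252")
print(repair_mojibake(garbled) == original)  # True

# The exported form in the question looks consistent with UTF-16BE bytes
# read as CP-1252: its first two characters are "0k".
print("\u306b".encode("utf-16-be").decode("cp1252"))  # 0k
```

In Delphi 2009 the equivalent step would be to get the raw bytes of the returned string and decode them with code page 932.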
I don't know what Delphi provides for reading encodings, but have you got the encodings for Shift-JIS installed in Windows, from the Control Panel -> Regional Options -> "Install files for East Asian languages" option? If not, that might explain why it'd be failing to convert automatically, perhaps.
You're not exactly giving us a lot of information to work with.
I take it you're talking about the "Acrobat.CAcroPDAnnot" class' method GetContents here. Which version of Acrobat are you using? Have you perhaps switched versions (or run an update) around the time you started programming with Delphi 2009?
Then: how did you instantiate the object? If using a *_TLB.pas file generated from the DLL, are you certain it still matches it? (Try re-generating it, if uncertain).
Third: how are you calling the method? What type of variable are you assigning the result to?
What might also help, is if you could provide a sample of an annotation (preferably including non-ASCII chars); and for that annotation:
what it should look like (and what it does look like inside Reader)
what it returns when using a pre-2009 version of Delphi*
what it returns when using Delphi 2009*
(* preferably the HEX byte codes of the (ansi/wide)strings; but output from the Ctrl-F7 inspector should do)
Then maybe someone could provide a more meaningful answer.
Ok, one of the main differences between Delphi 2009 and the earlier versions is that the default string type is a Unicode string. That means that if you use the same ActiveX component as in previous versions, you are passing Unicode strings to ASCII strings, and that is usually not a good idea.
There are a couple of solutions for this problem:
Check whether you can upgrade your ActiveX component so that it supports full Unicode strings.
Use AnsiString and not string to communicate with the ActiveX component. In this case, you can still use the old interface, but you are still bound to the same limitations.
Use another control that creates PDFs. There are many to choose from, but be prepared to change a big chunk of your software. (Some controls are XML based and use encoding.)