I have an Ant build that concatenates my JavaScript into one file and then compresses it. The problem is that Visual Studio's default encoding attaches a BOM to every file. How do I configure Ant to strip out BOMs that would otherwise appear in the middle of the resulting concatenated file?
My googling turned up this discussion, which describes the exact problem I'm having but doesn't provide a solution: http://marc.info/?l=ant-user&m=118598847927096
The Unicode byte order mark codepoint is U+FEFF. This concatenation command will strip out all BOM characters when concatenating two files:
<concat encoding="UTF-8" outputencoding="UTF-8" destfile="nobom-concat.txt">
    <filelist dir="." files="bom1.txt,bom2.txt" />
    <filterchain>
        <deletecharacters chars="&#xfeff;" />
    </filterchain>
</concat>
This form of the concat command tells the task to decode the files as UTF-8 character data. I'm assuming UTF-8 as this is usually where Java/BOM issues occur.
In UTF-8, the BOM is encoded as the bytes EF BB BF. If you needed it to appear at the start of the resultant file, you could use a subsequent concatenation to prefix the output file with a BOM again.
Encoded values for U+FEFF in other UTF encodings are listed here.
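If the Ant filter ever isn't an option, the same cleanup can be done in a few lines of Java. This is just a sketch under assumed file names (all.js as the concatenated output, all.nobom.js as the cleaned copy); it removes every U+FEFF and notes where you would re-prefix a single BOM if you wanted one at the start:
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StripBom {
    public static void main(String[] args) throws IOException {
        // Decode the concatenated file as UTF-8 and drop every U+FEFF,
        // including the ones that ended up in the middle of the file.
        byte[] raw = Files.readAllBytes(Paths.get("all.js"));
        String cleaned = new String(raw, StandardCharsets.UTF_8).replace("\uFEFF", "");

        // Write the cleaned text back out. Prepend "\uFEFF" to cleaned here
        // if you want exactly one BOM at the start of the result.
        Files.write(Paths.get("all.nobom.js"), cleaned.getBytes(StandardCharsets.UTF_8));
    }
}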
I have several files which include various strings in different written languages. The files I am working with are in the .inf format which is somewhat similar to .ini files.
I am inputting the text from these files into a parser which considers the [ symbol as the beginning of a 'category'. Therefore, it is important that this character does not accidentally appear in string sequences or parsing will fail because it interprets these as "control characters".
For example, this string contains some Japanese writings:
iANSProtocol_HELP="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X Protocol �̓`�[���������щ��z LAN �Ȃǂ̍��x�#�\�Ɏg�����܂��B"
DISKNAME ="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X CD-ROM �܂��̓t���b�s�[�f�B�X�N"
In my text editor's (Atom) default UTF-8 encoding this gives me garbage text, which would not be an issue by itself; however, the 0x5B byte is interpreted as [, which causes the parser to fail because it assumes this signals the beginning of a new category.
If I change the encoding to Japanese (CP 932), these characters are interpreted correctly as:
iANSProtocol_HELP="インテル(R) アドバンスト・ネットワーク・サービス Protocol はチーム化および仮想 LAN などの高度機能に使われます。"
DISKNAME ="インテル(R) アドバンスト・ネットワーク・サービス CD-ROM またはフロッピーディスク"
Of course I cannot simply decode every file as Japanese, because other files may contain Chinese or other languages that would then be read incorrectly.
What is the best course of action for this situation? Should I edit the code of the parser to escape characters inside string literals? Are there any special types of encoding that would allow me to see all special characters and languages?
Thanks
If the source file is in shift-jis, then you should use a parser that can support it, or convert the file to UTF-8 before you parse it.
I believe that this character set also uses ASCII as its base, but it uses 2 bytes for certain characters, so the 0x5B you are seeing probably appears as the second byte of a double-byte character rather than as a character on its own. (Note: this is conjecture based on how I think Shift-JIS works.)
So yeah, you need to modify your parser to understand Shift-JIS, or you need to convert the file to UTF-8 before parsing. I imagine that converting is the easiest.
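For the conversion route, a small sketch like the one below would do it; the file names are placeholders, and windows-31j is Java's name for code page 932:
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConvertInfToUtf8 {
    public static void main(String[] args) throws IOException {
        // Placeholder paths; point these at your .inf files.
        Path source = Paths.get("strings_ja.inf");
        Path target = Paths.get("strings_ja.utf8.inf");

        // Decode the bytes as code page 932 (Shift-JIS), then re-encode as
        // UTF-8 so the parser no longer sees trail bytes such as 0x5B as '['.
        byte[] raw = Files.readAllBytes(source);
        String text = new String(raw, Charset.forName("windows-31j"));
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}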
I am facing the below pipe delimiter issue in SSIS.
CRLF Pipe delimited text file:
-----------------------------
Col1|Col2 |Col3
1 |A/C No|2015
2 |A|C No|2016
Because of the pipe embedded within a field value, SSIS is failing to read the data.
Bad news: once you have a file with this problem, there is NO standard way for ANY software program to correctly parse the file.
Good news: if you can control (or affect) the way the file is generated to begin with, you would usually address this problem by including what is called a "Text Delimiter" (for example, having field values surrounded by double quotes) in addition to the Field Delimiter (pipe). The Text Delimiter will help because a program (like SSIS) can tell the field values apart from the delimiters, even if the values contain the Field Delimiter (e.g. pipes).
If you can't control how the file is generated, the best you can usually do is GUESS, which is problematic for obvious reasons.
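To illustrate why the text delimiter helps (outside of SSIS), here is a minimal Java sketch of how a parser can keep "A|C No" as a single field once values are wrapped in double quotes; the quoting convention is the assumption here, it is not something SSIS will add to the file for you:
import java.util.ArrayList;
import java.util.List;

public class PipeLineParser {
    // Split one pipe-delimited line, honouring a double-quote text qualifier
    // so that pipes inside quoted values do not start a new field.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;              // enter/leave a quoted value
            } else if (c == '|' && !inQuotes) {
                fields.add(current.toString());    // unquoted pipe ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Prints [2, A|C No, 2016]: the embedded pipe is now unambiguous.
        System.out.println(parseLine("2|\"A|C No\"|2016"));
    }
}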
I typed in
file -I *
to look at the encoding of all the CSV files in an entire directory. A lot of the files are reported as charset=binary. I'm not too familiar with this encoding format.
Does anyone know how to handle this encoding?
Thanks a lot for your time.
"Binary" encoding pretty much means that the encoding is unknown.
Everything is binary data under the hood. In text files each byte, or sequence of bytes, represents a specific character, and which character in particular depends on the encoding the file was encoded with/you're interpreting the file with. Some encodings are unambiguously recognisable, others aren't (e.g. any file is valid in any single-byte encoding, so you can't easily distinguish one single-byte encoding from another).
What file is telling you with charset=binary is that it doesn't have any more specific information than that the file contains bits and bytes (Capt'n Obvious to the rescue). It's up to you to interpret the file in the correct encoding/interpret it as the correct file format.
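If you want to check whether some of those files happen to be valid UTF-8 despite the binary label, a strict decode attempt is one way to find out. A rough sketch (the file name comes from the command line; nothing here is specific to CSV):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));

        // A strict decoder throws instead of substituting replacement
        // characters, so any invalid byte sequence is reported.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(raw));
            System.out.println(args[0] + ": decodes cleanly as UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println(args[0] + ": not valid UTF-8");
        }
    }
}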
I have a file depends.txt containing 2015001, 2015002, 2015003. I created an Ant target that has the following code. I tried searching for how to use the stringtokenizer but the descriptions are vague. I would like to run the target and get
2015001
2015002
2015003
All help is greatly appreciated. Thanks.
<loadfile srcFile="depends.txt" property="depends"/>
<filterchain>
    <tokenfilter>
        <stringtokenizer delims="," />
    </tokenfilter>
</filterchain>
Ant's filterreaders can modify the input read by Ant and the tokenfilter is one of them. The tokenfilter doesn't do anything by itself but rather coordinates two different actors - a tokenizer and a filter.
The filter is the thing that performs the real action and the tokenizer is responsible for feeding the filter with chunks of text it works on. The separation of tokenizer and filters allows the same algorithm - say uniq - to be applied to either words or lines depending on the tokenizer.
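As a rough analogy in plain Java (not Ant's actual implementation, just the shape of the idea): the tokenizer produces chunks, and the same filter function is applied to every chunk, whatever kind of chunk it is.
import java.util.StringTokenizer;
import java.util.function.UnaryOperator;

public class TokenFilterSketch {
    // Tokenize the input on the given delimiters, run the same filter on
    // every token, and emit one filtered token per line.
    static String apply(String input, String delims, UnaryOperator<String> filter) {
        StringBuilder out = new StringBuilder();
        StringTokenizer tokenizer = new StringTokenizer(input, delims);
        while (tokenizer.hasMoreTokens()) {
            out.append(filter.apply(tokenizer.nextToken()))
               .append(System.lineSeparator());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The same trim filter works whether the tokenizer splits on commas,
        // whitespace, or lines.
        System.out.print(apply("2015001, 2015002, 2015003", ",", String::trim));
    }
}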
In your example you only specify a tokenizer but no filter so the output is the same as the input. IIUC you only want to strip the comma characters, in that case
<loadfile srcFile="depends.txt" property="depends">
    <filterchain>
        <deletecharacters chars=","/>
    </filterchain>
</loadfile>
should do the trick.
I am using JAXB to unmarshall an XML message. It seems to replace multiple consecutive spaces by a single space.
<testfield>this is a      test</testfield>
(several spaces between a and test)
upon unmarshalling, the above becomes:
this is a test
How do I keep consecutive spaces as they are in source XML?
From the MSDN page:
Document authors can use the xml:space attribute to identify portions of documents where white space is considered important. Style sheets can also use the xml:space attribute as a hook to preserve white space in presentation. However, because many XML applications do not understand the xml:space attribute, its use is considered advisory.
You can try adding xml:space="preserve" so it doesn't replace the spaces:
<poem xml:space="default">
    <author xml:space="default">
        <givenName xml:space="default">Alix</givenName>
        <familyName xml:space="default">Krakowski</familyName>
    </author>
    <verse xml:space="preserve">
        <line xml:space="preserve">Roses are red,</line>
        <line xml:space="preserve">Violets are blue.</line>
        <signature xml:space="default">-Alix</signature>
    </verse>
</poem>
http://msdn.microsoft.com/en-us/library/ms256097%28v=vs.110%29.aspx
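If you want to experiment with this from the JAXB side, here is a minimal, self-contained sketch. The Message class, element names, and input string are made up for illustration, the package is the older javax.xml.bind (newer stacks use jakarta.xml.bind), and whether the spaces actually survive also depends on how your bound classes treat whitespace (for example, schema-derived xs:token fields collapse it).
import java.io.StringReader;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

// Hypothetical wrapper class for the <testfield> element from the question.
@XmlRootElement(name = "message")
public class Message {
    @XmlElement(name = "testfield")
    public String testfield;

    public static void main(String[] args) throws Exception {
        // xml:space="preserve" marks the element content as significant.
        String xml = "<message><testfield xml:space=\"preserve\">"
                + "this is a      test</testfield></message>";

        Message m = (Message) JAXBContext.newInstance(Message.class)
                .createUnmarshaller()
                .unmarshal(new StringReader(xml));

        // Brackets make it easy to see whether the inner spaces survived.
        System.out.println("[" + m.testfield + "]");
    }
}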