I have a question about EDI documents (either X12 or EDIFACT format). Does anyone know whether they can contain a NUL (0x00) character inside?
I am implementing an EDI parser, and parsing terminates once it encounters a NUL character.
Thanks
Yes, of course it can.
In EDIFACT, there are character sets, e.g. UNOA, UNOB, UNOC.
0x00 is not part of UNOA, but it is part of UNOC (AFAIK).
X12: I am not sure. They have rules, but AFAIK they are not well followed.
BTW, for an open-source EDIFACT/X12 parser, see: http://bots.sourceforge.net
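If your parser chokes on NUL, one pragmatic option is to validate the input against the agreed character set before parsing. Here is a minimal sketch in Python, assuming the interchange declares UNOA; the repertoire below is my reading of ISO 9735 level A, so treat it as an approximation and adjust it to whatever your interchange agreement actually says:

# Hypothetical sketch: reject bytes outside the UNOA repertoire before parsing.
UNOA = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 .,-()/='+:?!\"%&*;<>"
)

def check_unoa(payload: bytes) -> None:
    """Raise if the payload contains a byte outside the UNOA repertoire."""
    for offset, byte in enumerate(payload):
        if chr(byte) not in UNOA:
            raise ValueError(
                f"byte 0x{byte:02X} at offset {offset} is not valid UNOA"
            )

check_unoa(b"UNA:+.? '")   # service string advice: passes
# check_unoa(b"UNB\x00")   # would raise: NUL is not in UNOA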
From my Lua knowledge (and according to what I have read in the Lua manuals), I've always been under the impression that an identifier in Lua is limited to A-Z, a-z, _, and digits (and cannot start with a digit or be a reserved keyword, i.e. local local = 123 is invalid).
Now I have run into an (obfuscated) Lua program which uses all kinds of weird characters in identifiers:
https://i.imgur.com/HPLKMxp.png
-- Most likely, copy+paste won't work. Download the file from https://tknk.io/7HHZ
print(_VERSION .. " " .. (jit and "JIT" or "non-JIT"))
local T = {}
T.math = T.math or {}
T.math.​â®â€‹âŞâ®â€‹ď»żâ€Śâ€âŽ = math.sin
T.math.â¬â€‹ââ¬ââ«â®â€â€¬ = math.cos
for k, v in pairs(T.math) do print(k, v) end
Output:
Lua 5.1 JIT
â¬â€‹ââ¬ââ«â®â€â€¬ function: builtin#45
​â®â€‹âŞâ®â€‹ď»żâ€Śâ€âŽ function: builtin#44
It is unclear to me why this set of characters is allowed in an identifier.
In other words, why is this a completely valid Lua program?
Unlike some languages, Lua is not really defined by a formal specification, one which covers every contingency and entirely explains all of Lua's behavior. Something as simple as "what character set is a Lua file encoded in" isn't really explained in Lua's documentation.
All the docs say about identifiers is:
Names (also called identifiers) in Lua can be any string of letters, digits, and underscores, not beginning with a digit and not being a reserved word.
But nothing ever really says what a "letter" is. There isn't even a definition for what character set Lua uses. As such, it's essentially implementation-dependent. A "letter" is... whatever the implementation wants it to be.
So, let's say you're writing a Lua implementation. And you want users to be able to provide Unicode-encoded strings (that is, strings within the Lua text). Lua 5.3 requires this. But you also don't want them to have to use UTF-16 encoding for their files (also because lua_load gets sequences of bytes, not shorts). So your Lua implementation assumes the byte sequence it gets in lua_load is encoded in UTF-8, so that users can write strings that use Unicode characters.
When it comes to writing the lexer/parser part of this implementation, how do you handle this? The simplest, easiest way to handle UTF-8 is to... not handle UTF-8. Indeed, that's the whole point of that encoding. Since everything that Lua defines with specific symbols is encoded in ASCII, and ASCII text is also UTF-8 text with the same meaning, you can basically treat a UTF-8 string like an ASCII string. For in-Lua strings, you just copy the sequence of bytes between the start and end characters of the string.
So how do you go about lexing identifiers? Well, you could ask the question above. Or you could ask a much simpler question: is the character a space, control character, digit, or symbol? A "letter" is merely something that isn't one of those.
Lua defines what things it considers to be "symbols". ASCII can tell you what is a control character, space, or digit. In such an implementation, any UTF-8 code unit with a value outside of ASCII is a letter. Even if, technically, those code units decode into something Unicode thinks of as a "symbol", your lexer just treats it as a letter.
This simple form of UTF-8 lexing gives you fast performance and low memory overhead. You don't have to decode UTF-8 into Unicode codepoints, and you don't need a giant Unicode table to tell you whether a codepoint is a "symbol" or "space" or whatever. And of course, it's also something that would naturally fall out of many ASCII-based Lua implementations.
So most Lua implementations will do it this way, if only by accident. Doing something more would require deliberate effort.
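Here is a rough sketch of that classification rule in Python. This is my assumption about how such a byte-oriented lexer behaves, not actual Lua or LuaJIT source:

# Any non-ASCII UTF-8 code unit counts as a "letter"; digits may not lead.
def is_ident_byte(b: int, first: bool = False) -> bool:
    if b >= 0x80:                      # outside ASCII: treated as a letter
        return True
    c = chr(b)
    if c.isalpha() or c == "_":        # ASCII letters and underscore
        return True
    return c.isdigit() and not first   # digits allowed after the first byte

def looks_like_identifier(s: bytes) -> bool:
    return bool(s) and all(
        is_ident_byte(b, first=(i == 0)) for i, b in enumerate(s)
    )

print(looks_like_identifier("sin".encode()))      # True
print(looks_like_identifier("\u200b".encode()))   # True: zero-width space!
print(looks_like_identifier(b"9lives"))           # False: starts with a digit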
It also allows a user to use Unicode character sequences as identifiers. That means that someone can easily write code in their native language (outside of keywords).
But it also means that obfuscators have lots of ways to create "identifiers" that are just strings of nonsensical bytes. Indeed, because there are multiple ways in Unicode to "spell" the same apparent Unicode string (unless you examine the bytes directly), obfuscators can rig up identifiers that appear when rendered in a text editor to all be the same text, while actually being different strings.
To clarify, there is only one identifier here: T.
T.math is syntactic sugar for T["math"], and this also extends to the obfuscated strings. It is perfectly valid for a key to contain any characters, or even to start with a number.
Being able to use . rather than [ ] does not work with a string that doesn't conform to the identifier limitations. See Nicol Bolas' answer for a great breakdown of those limitations.
I have several files which include various strings in different written languages. The files I am working with are in the .inf format which is somewhat similar to .ini files.
I am inputting the text from these files into a parser which considers the [ symbol as the beginning of a 'category'. Therefore, it is important that this character does not accidentally appear in string sequences or parsing will fail because it interprets these as "control characters".
For example, these strings contain some Japanese text:
iANSProtocol_HELP="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X Protocol �̓`�[���������щ��z LAN �Ȃǂ̍��x�#�\�Ɏg�����܂��B"
DISKNAME ="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X CD-ROM �܂��̓t���b�s�[�f�B�X�N"
In my text editor's (Atom) default UTF-8 encoding this gives me garbage text, which would not be an issue; however, the 0x5B byte is interpreted as [, which causes the parser to fail because it assumes that this signals the beginning of a new category.
If I change the encoding to Japanese (CP 932), these characters are interpreted correctly as:
iANSProtocol_HELP="インテル(R) アドバンスト・ネットワーク・サービス Protocol はチーム化および仮想 LAN などの高度機能に使われます。"
DISKNAME ="インテル(R) アドバンスト・ネットワーク・サービス CD-ROM またはフロッピーディスク"
Of course, I cannot interpret every file as Japanese, because other files may contain Chinese or other languages which would then be decoded incorrectly.
What is the best course of action for this situation? Should I edit the code of the parser to escape characters inside string literals? Are there any special types of encoding that would allow me to see all special characters and languages?
Thanks
If the source file is in Shift-JIS, then you should use a parser that can support it, or convert the file to UTF-8 before you parse it.
I believe that this character set also uses ASCII as its base, but it uses two bytes for certain characters, so 0x5B probably doesn't appear as the first byte of a multi-byte character. (Note: this is conjecture based on how I think Shift-JIS works.)
So yeah, you need to modify your parser to understand Shift-JIS, or you need to convert the file to UTF-8 before parsing. I imagine that converting is the easiest.
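For example, a minimal transcoding step in Python; the file names here are placeholders, and this assumes the file really is CP932 (Microsoft's Shift-JIS variant):

# Transcode a CP932 .inf file to UTF-8 before handing it to the parser.
with open("oem_ja.inf", "rb") as f:          # hypothetical input file
    data = f.read()

text = data.decode("cp932")                  # raises UnicodeDecodeError if wrong
with open("oem_ja.utf8.inf", "w", encoding="utf-8") as f:
    f.write(text)

After transcoding, every remaining 0x5B byte really is a [, so the category detection becomes safe.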
I am implementing an EDI X12 header parser (only to parse the "ISA" segment).
I notice that several character sets can be used.
My question is: how do I know which one is used in an incoming EDI X12 message, so that I know how to interpret the message?
Actually, there is no such thing as a character set in X12.
This is up to the partners/interchange agreement.
But as X12 is mainly used in the USA, it is US-ASCII (almost always).
(But..... some companies send X12 as EBCDIC ;-)))
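One hedged heuristic (my own sketch, not anything the standard defines): every interchange must begin with the literal "ISA", so you can sniff the encoding from the first three bytes:

# Guess whether an X12 interchange is ASCII or EBCDIC from its ISA tag.
def sniff_x12_encoding(data: bytes) -> str:
    if data[:3] == b"ISA":            # 0x49 0x53 0x41 in ASCII
        return "ascii"
    if data[:3] == b"\xc9\xe2\xc1":   # 'I' 'S' 'A' in EBCDIC (e.g. cp500)
        return "cp500"
    raise ValueError("does not look like an X12 interchange")

print(sniff_x12_encoding(b"ISA*00*..."))   # ascii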
If you're only doing ANSI X12, the ISA segment should be easy for you to parse, as it is a fixed length.
Position 4 will give you the element delimiter (field delimiter).
Position 105 will give you the subelement delimiter.
Position 106 will give you the record terminator.
You probably won't have much use for the subelement delimiter, depending on the document type.
Once you figure out what your field delimiters are and then the record delimiter, it should be a snap.
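A minimal sketch of that extraction in Python; the sample ISA values below are made up, but the positions match the fixed 106-character layout described above:

# Positions are 1-based in the prose, so 0-based indexes are 3, 104 and 105.
def read_isa_delimiters(interchange: str):
    if not interchange.startswith("ISA"):
        raise ValueError("not an X12 interchange")
    element_sep = interchange[3]       # position 4
    subelement_sep = interchange[104]  # position 105
    segment_term = interchange[105]    # position 106
    return element_sep, subelement_sep, segment_term

# 106-character ISA segment with '*', ':' and '~' as delimiters:
isa = ("ISA*00*          *00*          *ZZ*SENDERID       *ZZ*RECEIVERID     "
       "*230101*1253*U*00401*000000001*0*P*:~")
print(read_isa_delimiters(isa))        # ('*', ':', '~')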
(Standard disclaimer: there are many great tools out there in the form of data translators that make this job much simpler than having a programmer reinvent the wheel. Some of these tools are even open source and free. Just sayin'...)
Hope this helps.
I am trying to convert a UTF-8 string into a UCS-2 string.
I need to get a string like "\uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875".
I have been googling for about a month now, but still haven't found a reference on converting UTF-8 to UCS-2.
Please, someone help me.
Thx in advance.
EDIT: okay, maybe my explanation was not good enough. Here is what I am trying to do.
I live in Korea, and I am trying to send an SMS message using CTMessageCenter. I tried to send Simplified Chinese characters through my app, and I got ???? instead of the proper characters. So I tried UTF-8, UTF-16, BE and LE as well, but they all returned ??. Finally I found out that SMS uses UCS-2 and EUC-KR encoding in Korea. Weird, isn't it?
Anyway, I tried to send a string like \u4E3B\u9875 and it worked.
So I need to convert my string into UCS-2 encoding first and get the string literal from those strings.
Wikipedia:
The older UCS-2 (2-byte Universal Character Set) is a similar character encoding that was superseded by UTF-16 in version 2.0 of the Unicode standard in July 1996. It produces a fixed-length format by simply using the code point as the 16-bit code unit and produces exactly the same result as UTF-16 for 96.9% of all the code points in the range 0-0xFFFF, including all characters that had been assigned a value at that time.
IBM:
Since the UCS-2 standard is limited to 65,535 characters, and the data processing industry needs over 94,000 characters, the UCS-2 standard is in the process of being superseded by the Unicode UTF-16 standard. However, because UTF-16 is a superset of the existing UCS-2 standard, you can develop your applications using the system's existing UCS-2 support as long as your applications treat the UCS-2 as if it were UTF-16.
unicode.org:
UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not define a distinct data format, because UTF-16 and UCS-2 are identical for purposes of data exchange. Both are 16-bit, and have exactly the same code unit representation.
So, using the "UTF8toUnicode" transformation in most language libraries will produce UTF-16, which is essentially UCS-2. And simply extracting the 16-bit characters from an Objective-C string will accomplish the same thing.
In other words, the solution has been staring you in the face all along.
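A minimal sketch of that idea in Python (the same logic ports directly to an Objective-C NSString, where -characterAtIndex: already hands you the 16-bit code units). It assumes the text stays within the BMP; anything above U+FFFF would need surrogate pairs and is not representable in real UCS-2:

# Decode UTF-8 and print each UTF-16 code unit as a \uXXXX escape,
# which is the literal form the question asks for.
def to_ucs2_escapes(utf8_bytes: bytes) -> str:
    text = utf8_bytes.decode("utf-8")
    units = text.encode("utf-16-be")             # big-endian 16-bit code units
    return "".join(
        f"\\u{int.from_bytes(units[i:i+2], 'big'):04X}"
        for i in range(0, len(units), 2)
    )

print(to_ucs2_escapes("－－我的上网主页".encode("utf-8")))
# \uFF0D\uFF0D\u6211\u7684\u4E0A\u7F51\u4E3B\u9875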
UCS-2 is not a valid Unicode encoding. UTF-8 is.
It is therefore impossible in general to convert UTF-8 into UCS-2, and indeed also the reverse: UCS-2 simply cannot represent code points beyond U+FFFF.
UCS-2 is dead, ancient history. Let it rot in peace.
I've read XML and CSV before, but I've never seen anything like EDI.
How do I read this file and get the data that I need? I see things like ~, REF, N1, N2, N4 but have no idea what any of this stuff means.
I've seen some things about X12 but don't know if that's what I have or not. How can I tell?
-- update
Thanks guys for the quick responses. Does anyone know of a parser that I can use in .NET? In the long run, I'm going to be converting this EDI file to a CSV file...
EDI messages are defined by the X12 standard.
If you look for X12 parsers, you can find helpful information.
For example, http://code.activestate.com/recipes/299485/
Those are ANSI X12 files; the standard is managed here: http://www.wpc-edi.com/
Brief tutorial on structure
Hierarchy: Loops -> Segments -> Elements -> Sub-elements
Loops are bounded either by control segments or logically, based on the standard.
Segments are separated by the segment terminator, by default ~
Elements are separated by the element separator, by default *
Sub-elements are separated by the sub-element separator, by default :
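A toy sketch of that splitting in Python, using the default separators (the sample segment content below is made up; real code should read the separators from the ISA segment rather than hard-coding them):

def split_x12(interchange: str, seg="~", elem="*", sub=":"):
    segments = [s for s in interchange.split(seg) if s]
    return [
        [element.split(sub) for element in segment.split(elem)]
        for segment in segments
    ]

sample = "ST*837*0001~REF*D9*12345~SE*2*0001~"   # made-up fragment
for segment in split_x12(sample):
    print(segment)
# [['ST'], ['837'], ['0001']]
# [['REF'], ['D9'], ['12345']]
# [['SE'], ['2'], ['0001']]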
EDI is a delimited file format. You have to know both the line delimiter and the column delimiter (for lack of better terms). You might, for example, see an EDI file with the following format (from http://www.slik.co.nz/HTML_help/edi_file_format.htm):
HDR|6||||
DTL|1|ABC|xyz|123|1
DTL|13|ABC|animal|334|1
DTL|11|ABC|sfdk|432|2
DTL|12|ABC|wewdc|3|1
DTL|14|ABC|qwdx|416|4
The first line is the header and tells you there are six records. The other lines are detail lines.
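A quick sketch parsing that sample in Python (assuming the header count includes the header line itself, which is how this particular sample adds up):

raw = """HDR|6||||
DTL|1|ABC|xyz|123|1
DTL|13|ABC|animal|334|1
DTL|11|ABC|sfdk|432|2
DTL|12|ABC|wewdc|3|1
DTL|14|ABC|qwdx|416|4"""

lines = [line.split("|") for line in raw.splitlines()]
header, details = lines[0], lines[1:]
expected = int(header[1])            # "6" records according to the header
print(expected, len(lines))          # 6 6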
X12 is one standard used by EDI. You will see X12 used commonly in healthcare. If you have X12, you can examine the X12 standard to figure out how to parse.
EDI stands for Electronic Data Interchange...
It's not a specific format per se. Generally speaking, it's a flat text file of data that usually has an associated published specification, for example: "Position 23-34 is the original price as a monetary value"
You really won't be able to do anything useful with an EDI file if you don't have the defined specification that goes along with it.
Once you get the specification, I believe how to read the file will be quite clear.
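For instance, here is a tiny sketch of that kind of spec-driven positional extraction; the record layout and values are entirely made up, and positions are assumed to be 1-based and inclusive, as specs usually state them:

from decimal import Decimal

def field(record: str, start: int, end: int) -> str:
    return record[start - 1:end]      # convert 1-based inclusive to a slice

record = "REC0000012345ORDER9876000001999.99XYZ"   # made-up record
price = Decimal(field(record, 23, 34))
print(price)                          # 000001999.99 -> 1999.99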
Generally the process is:
1. Read/Parse the EDI file.
2. Perform any processing/transformation on that data that you need to.
3. Persist it into your local system format (tables, other flat files, whatever).
Sorry, there's not much more we can tell you, unfortunately.
EDI stands for “Electronic Data Interchange.” The practice involves using computer technology to exchange information – or data – electronically between two organizations, called “Trading Partners.” Technically, EDI is a set of standards that define common formats for the information so it can be exchanged in this way.
Read more: http://www.1edisource.com/learn-about-edi/what-is-edi#ixzz2g5E4p2ET
EDI is just a flat file that contains some type of hierarchy. Usually companies buy EDI translator software to parse those files, extract data, and then integrate it with other systems. You can also use some type of service and they will do that for you. You can try Amosoft EDI Services (www.amosoft.com) and they can help you with that.