How do I know what character set is used in x12 document? - edi

I am implementing an EDI-x12 header parser (only to parse "ISA" segment)
I notice that there are several character sets can be used.
My question is that how do I know that which one is used of incoming edi-x12 message so that I know how to interpret the message?

actually there is no such thing as a character set in x12.
this is up to the partners/interchange agreement.
but as X12 is mainly used in USA, it is us-ascii (almost always).
(but .....some companies send x12 as EBCEDIC ;-)))

If you're only doing ANSI X12, the ISA segment should be easy for you to parse, as it is a fixed length.
Position 4 will give you the element delimiter (field delimiter).
Position 106 will give you the record terminator.
Position 105 will give you the subelement delimiter
You probably won't have much use for the subelement delimiter, depending on the document type.
Once you figure out what your field delimiters are and then the record delimiter, it should be a snap.
(Standard disclaimer: there are many great tools out there in the form of data translators that make this job much simpler than having a programmer reinvent the wheel. Some of these tools are even open source and free. Just sayin'...)
Hope this helps.

Related

ASCII Representation of Hexadecimal

I have a string that, by using string.format("%02X", char), I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
local convString = ""
for char in input:gmatch("(..)") do
convString = convString..(string.char("0x"..char))
end
return convString
end
It appeared to work, but didnt think about characters above 127. Rookie mistake. Now I'm unsure how I can get the additional characters up to 256 display their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
print(input)
end
I did a few gsub strings to substitute in other characters and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, it got all forgotten.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
return (input:gsub( "..", function(c)
return string.char( tonumber( c, 16 ) )
end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.

What should I do with emails using charset ansi_x3.110-1983?

My application is parsing incoming emails. I try to parse them as best as possible but every now and then I get one with puzzling content. This time is an email that looks to be in ASCII but the specified charset is: ansi_x3.110-1983.
My application handles it correctly by defaulting to ASCII, but it throws a warning which I'd like to stop receiving, so my question is: what is ansi_x3.110-1983 and what should I do with it?
According to this page on the IANA's site, ANSI_X3.110-1983 is also known as:
iso-ir-99
CSA_T500-1983
NAPLPS
csISO99NAPLPS
Of those, only the name NAPLPS seems interesting or informative. If you can, consider getting in touch with the people sending those mails. If they're really using Prodigy in this day and age, I'd be amazed.
The IANA site also has a pointer to RFC 1345, which contains a description of the bytes and the characters that they map to. Compared to ISO-8859-1, the control characters are the same, as are most of the punctuation, all of the numbers and letters, and most of the remaining characters in the first 7 bits.
You could possibly use the guide in the RFC to write a tool to map the characters over, if someone hasn't written a tool for it already. To be honest, it may be easier to simply ignore the whines about the weird character set given that the character mapping is close enough to what is expected anyway...

What strategies are there for escaping character entities?

We are doing Natural Language Processing on a range of English language documents (mainly scientific) and run into problems in carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although it is the semantics rather than the typography we are concerned with so this is a minor concern).
I appreciate that there is no absolute answer - any escaping can clash in some cases - but I am looking for something allong the lines of XML's <![CDATA[ ...]]> which will survive most non-recursive XML operations. Characters such as [ are bad as they are common in regexes. So I'm wondering if there is a generally adopted approach rather than inventing our own.
A typical example is the "degrees" symbol:
HTML Entity (decimal) °
HTML Entity (hex) °
HTML Entity (named) °
How to type in Microsoft Windows Alt +00B0
Alt 0176
Alt 248
UTF-8 (hex) 0xC2 0xB0 (c2b0)
UTF-8 (binary) 11000010:10110000
UTF-16 (hex) 0x00B0 (00b0)
UTF-16 (decimal) 176
UTF-32 (hex) 0x000000B0 (00b0)
UTF-32 (decimal) 176
C/C++/Java source code "\u00B0"
Python source code u"\u00B0"
We are also likely to encounter TeX
$10\,^{\circ}{\rm C}$
or
\degree
so backslashes, curlies and dollars are a poor idea.
We could for example use markup like:
__deg__
__#176__
and this will probably work but I'd appreciate advice from those who have similar problems.
update I accept #MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.
Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you're really that scared of high bits.
In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't, abandon it - it is most likely very bad quality code in many other ways as well and not worth the hassle using.
Maybe I don't get the problem correctly, but I would create a very unique escape marker which is unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.
Eventually, you can transmit the unique markers and their number along the chain through a separate channel, and check their presence and number at the end.
Example, something like
the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee
your marker is a uuid, and the entity is &deg encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude them because they are fixed length) and you can always recover the content by looking inside the two markers.
Of course, if you have uuids in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the lateral channel, they won't be recognized as such (and in any case, what's inbetween won't validate as a base32 string anyway).
If you need to search for them, then you can keep the uuid subdivision, and then use a proper regexp to spot these occurrences. Example:
>>> re.search("(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
>>>
If you really need a specific "token" to test, you can use a uuid1, with a very defined specification of a node:
>>> uuid.uuid1(node=0x1234567890)
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)
UUID('c4c57a91-e95d-11de-90ca-001234567890')
>>>
You can use anything you prefer as a node, the uuid will be unique, but you can still test for presence (although you can get false positives).

How to understand an EDI file?

I've seen XML before, but I've never seen anything like EDI.
How do I read this file and get the data that I need? I see things like ~, REF, N1, N2, N4 but have no idea what any of this stuff means.
I am looking for Examples and Documentations.
Where can I find them?
Aslo
EDI guide i found says that it is based on " ANSI ASC X12/ ver. 4010".
Should I search form X12 ?
Kindly help.
Several of these other answers are very good. I'll try to fill in some things they haven't mentioned.
EDI is a set of standards, the most common of which are:
ANSI X12 (popular in the states)
EDIFACT (popular in Europe)
Sounds like you're looking at X12 version 4010. That's the most widely used (in my experience, anyway) version. There are lots and lots of different versions.
The file, or properly "interchange," is made up of Segments and Elements (and sometimes subelements). Each segment begins with a two- or three-word identifier (ISA, GS, ST, N1, REF).
The structure for all documents begins and ends with an envelope. The envelope is usually made up of the ISA segment and the GS segments. There can be more than one GS segment per file, but there should only be one ISA segment per file (note the should, not everyone plays by the rules).
The ISA is a special segment. Whereas all the other segments are delimited, and therefore can be of varying lenghts, the ISA segment is of fixed width. This is because it tells you how to read the rest of the file.
Start with the last three characters of the ISA segment. Those will tell you the element delimiter, the sub-element delimiter, and the segment delimiter. Here's an example ISA line.
ISA:00: :00: :01:1515151515 :01:5151515151 :041201:1217:U:00403:000032123:0:P:*~
In this case, the ":" is the element delimiter, "*" is a subelement delimiter, and "~" the segment delimiter. It's much easier if you're just trying to look at a file to put linebreaks after each segment delimiter (~).
The ISA also tells you who the document is from and to, what the version is (00403, which is also known as 4030), and the interchange control number (0000321233). The other stuff is probably not important to you at this stage.
This document is from sender "01:1515151515" and to receiver "01:5151515151". So what's with the "01:"? Well, this introduces an important concept in EDI, the qualifier. Several elements have qualifiers, which tell you what type of data the next element is. In this case, the 01 is supposed to be a Dunn and Bradstreet number. Other qualifiers for the ISA05 and ISA07 elements are 12 for phone number, and ZZ for "user defined". You'll find the concept of qualifiers all over EDI segments. A decent rule of thumb is that if it's two characters, it's a qualifier. In order to know what all the qualifiers mean, you'll need a standards guide (either in hard copy from the EDI standards body, or in some software).
The next line is the GS. This is a functional group (a way to group like documents together within an interchange.) For instance, you can have several purchase orders, and several functional acknowledgements within an ISA. These should be placed in separate functional groups (GS segments). You can figure out what type of documents are in a GS segment by looking at the first GS01 element.
GS:PO:9988776655:1122334455:20041201:1217:128:X:004030
Besides the document type, you can see the from (9988776655) and to (1122334455) again. This time they're using different identifiers, which is legal, because you may be receiving an interchange on behalf of someone else (if you're an intermediary, for instance). You can also see the version number again, this time with the trailing "0" (0004030). Use significant digits logic to strip off the leading zeros. Why is there an extra zero here and not in the ISA? I don't know. Lastly this GS segment also has it's own identifier, 128.
That's it for the beginning of the envelope. After that there will be a loop of documents beginning with ST. In this case they'd all be POs, which have a code (850), so the line would start with ST:850:blablabla
The envelope stuff ends with a GE segment which references the GS identifier (128) so you know which segment is being closed. Then comes an IEA which similarly closes out the ISA.
GE:1:128~
IEA:1:000032123~
That's an overview of the structure and how to read it. To understand it you'll need a reference book or software so you understand the codes, lots and lots of time, and lots and lots of practice. Good luck, and post again if you have more specific questions.
Wow, flashbacks. It's been over sixteen years ...
In principle, each line is a "segment", and the identifiers are the beginning of the line is a segment identifier. Each segment contains "elements" which are essentially positional fields. They are delimited by "element delimiters".
Different segments mean different things, and can indicate looping constructs, repeats, etc.
You need to get a current version of the standard for the basic parsing, and then you need the data dictionary to describe the content of the document you are dealing with, and then you might need an industry profile, implementation guide, or similar to deal with the conventions for the particular document type in your environment.
Examples? Not current, but I'm sure you could find a whole bunch using your search engine of choice. Once you get the basic segment/element parsing done, you're dealing with your application level data, and I don't know how much a general example will help you there.
EDI is a file format for structured text files, used by lots of larger organisations and companies for standard database exchange. It tends to be much shorter than XML which used to be great when data packets had to be small. Many organisations still use it, since many mainframe systems use EDI instead of XML.
With EDI messages, you're dealing with text messages that match a specific format. This would be similar to an XML schema, but EDI doesn't really have a standardized schema language. EDI messages themselves aren't really human-readable while most specifications aren't really machine-readable. This is basically the advantage of XML, where both the XML and it's schema can be read by humans and machines.
Chances are that when you're doing electronic banking through some client-side software (not browser-based) then you might already have several EDI files on your system. Banks still prefer EDI over XML to send over transaction data, although many also use their own custom text-based formats.
To understand EDI, you'll have to understand the data first, plus the EDI standard that you want to follow.
Assuming the data stream starts with “ISA”, towards the beginning there should be a section “~ST*” followed by three numeric digits. If you can post these three digits, I can probably provide you with more information. Also, knowing the industry would be helpful. For example, healthcare uses 270, 271, 276, 277 and a few others.

EDI Format

I've read XML or CSV before, but I've never seen anything like EDI.
How do I read this file and get the data that I need? I see things like ~, REF, N1, N2, N4 but have no idea what any of this stuff means.
I've seen somethings about x12 but don't know if thats what I have or not, how can I tell?
-- update
Thanks guys for the quick responses. Does anyone know of a parser that I can use in .Net? In the long run, I'm going to be converting this EDI file to a CSV file...
EDI messages are defined by the X12 standard.
If you look for X12 parsers, you can find helpful information.
For example, http://code.activestate.com/recipes/299485/
Those are ANSI X12 Files the standard is managed here http://www.wpc-edi.com/
Brief tutorial on structure
Hierarchy = Loops-> Segments -> Elements -> Sub Elements.
Loops are bounded either by control segments or logically based on the standard.
Segments are separated by the segment terminator, by default ~
Elements are separated by the element separator, by default *
Sub Elements are separated by sub element separator, by default :
EDI is a delimited file format. You have to know both the line delimiter and the column delimiter (for lack of a better answer). You might, for example, see an EDI file with the following format (from http://www.slik.co.nz/HTML_help/edi_file_format.htm):
HDR|6||||
DTL|1|ABC|xyz|123|1
DTL|13|ABC|animal|334|1
DTL|11|ABC|sfdk|432|2
DTL|12|ABC|wewdc|3|1
DTL|14|ABC|qwdx|416|4
The first line is the header and tells you there are six records. The other lines are detail lines.
X12 is one standard used by EDI. You will see X12 used commonly in healthcare. If you have X12, you can examine the X12 standard to figure out how to parse.
EDI stands for Electronic Data Interchange...
It's not a specific format per-se. Generally speaking it's a flat text file of data that usually has an associated published specification. For example: "Position 23-34 is the original price as a monetary value"
You really won't be able to do anything useful with an EDI file if you don't have the defined specification that goes along with it.
Once you get the specification, I believe how to read the file will be quite clear.
Generally the process is:
1. Read/Parse the EDI file.
2. Perform any processing/transformation on that data that you need to.
3. Persist it into your local system format (tables, other flat files, whatever).
Sorry there's not much more we could tell you unfortunately.
EDI stands for “Electronic Data Interchange.” The practice involves using computer technology to exchange information – or data – electronically between two organizations, called “Trading Partners.” Technically, EDI is a set of standards that define common formats for the information so it can be exchanged in this way.
Read more: http://www.1edisource.com/learn-about-edi/what-is-edi#ixzz2g5E4p2ET
EDI is just a flat file that contain some type of hierarchy. Usually companies buy EDI translator software to parse those files and extract data and then integrate with other systems. You can also use some type of service and they will do that for you. You can try to use Amosoft EDI Serices (www.amosoft.com) and they can help you with that.

Resources