Correct padding for EDI ISA segment

I have written an EDI document generator, and it currently pads any fields in the ISA segment that are less than the required number of characters with spaces on the left, e.g. ' 1234567890' for a 15 character element. I have a client who wants me to pad with spaces on the right. I can do this, but does ANSI X12 specify how elements in the ISA segment should be padded?

Yes, this is specified.
In X12, alphanumeric fields are left-justified, so the padding spaces go on the right:
'1234567890' should become '1234567890     ' (for a 15-character element)

Padding to the left (right justified) is uncommon (but legal) in an X12 document, at least with retail documents. Here's a link to a healthcare document with the padding you're currently doing: http://www.xtranslator.com/prod/beginguidex12.pdf
The ISA is important because it is the only fixed-length segment in the standard, and as such is probably the most important segment for a parser. The ISA MUST be 106 characters. There is a min/max definition for each element; if you don't have enough data to fill an element, it should be padded with spaces on the right. ISA02 and ISA04 are commonly empty elements, but they still need to be padded to make up the fixed width of the segment. Sender IDs and receiver IDs are commonly shorter than 15 characters (see the snippet below), and therefore must be padded.
ISA snippet:
ISA*00*          *00*          *ZZ*RECEIVERID     *12*SENDERID       *100325*1113*U*00403*000011436*0*T*>~
I suspect you're going to find more partners who want left justified for the sender / receiver elements than right justified.
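A minimal sketch of that right-padding in Python (illustrative only: `pad_isa_element` and the width table are names made up here; the widths shown are the usual 00403 ones):

```python
# Fixed widths for a few ISA elements (X12 00403); illustrative subset.
ISA_WIDTHS = {
    "ISA02": 10,  # authorization information
    "ISA04": 10,  # security information
    "ISA06": 15,  # interchange sender ID
    "ISA08": 15,  # interchange receiver ID
}

def pad_isa_element(value: str, width: int) -> str:
    """Left-justify the value and pad with spaces on the right."""
    if len(value) > width:
        raise ValueError(f"{value!r} exceeds element width {width}")
    return value.ljust(width)

print(repr(pad_isa_element("SENDERID", ISA_WIDTHS["ISA06"])))
# 'SENDERID       '
```

Note that an empty element (ISA02, ISA04) comes out as all spaces, which is exactly what keeps the segment at its fixed 106-character length.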

Related

Can EDI files have ~ in data?

I'm parsing an EDI file and splitting on ~s. I am wondering if it's possible for EDI to have a ~ in the data itself. Is there a rule that says no ~ in the data? This is for 810/850, etc.
The value defined in the 106th character of the ISA segment (or, alternatively – to be a bit less brittle to whitespace issues – the 1st character after the ISA16 element) is the segment delimiter (in official terms: the segment terminator). Most of the time people specify the ~ character, but other choices are certainly valid.
In this example, the 106th character is ~:
ISA*00* *00* *ZZ*AMAZONDS *01*TESTID *070808*1310*U*00401*000000043*1*T*+~
Instead of counting 106 characters (which, again, can be brittle to whitespace issues), you can count 16 elements – that is, 16 asterisks – to find the value for ISA16 (which is +), and then pick the next character (which is ~).
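That counting can be sketched in Python (illustrative only; `find_delimiters` is a made-up helper, and it assumes a well-formed interchange):

```python
def find_delimiters(isa: str):
    """Derive the element separator, the component (sub-element)
    separator, and the segment terminator from a raw interchange."""
    elem_sep = isa[3]              # the character right after "ISA"
    fields = isa.split(elem_sep)   # 16 separators -> 17 fields
    # fields[16] starts with ISA16 (the component separator); the very
    # next character is the segment terminator.
    return elem_sep, fields[16][0], fields[16][1]
```

For the example interchange above, this returns ('*', '+', '~').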
There are two relevant sections in the official X12 specification (bolded for emphasis):
12.5.4.3 Delimiter Specifications
The delimiters consist of three separators and a terminator. The
delimiters are devised for inclusion within the data stream of the transfer. The delimiters are:
segment terminator [note: this is the one we're discussing]
data element separator
component element separator
repetition separator
The delimiters are assigned by the interchange sender. These characters are disjoint from those of the data elements; if a character is selected for the data element separator, the component element separator, the repetition separator or the segment terminator from those available for the data elements, that character is no longer available during this interchange for use in a data element. The instance of the terminator (<tr>) must be different from the instance of the data element separator (<gs>), the component element separator (<us>) and the repetition separator (<rs>). The data element separator, component element separator and repetition separator must not have the same character assignment.
So, according to this part of the spec, if the ~ is used as the segment terminator, then the use of the ~ is disallowed in a data element (that is, the textual body).
Now, let's look at section 12.5.A.5 – Recommendations for the Delimiters:
Delimiter characters must be chosen with care, after consideration of data content, limitations of the transmission protocol(s) used, and applicable industry conventions. In the absence of other guidelines, the following recommendations are offered:
<tr> terminator: ~ | Note: the "~" was chosen for its infrequency of use in textual data.
This section is saying that ~ was chosen as the default because ~ is seldom found in textual data (it would have been a bad idea, for example, to use . as the default, since that's such a common inclusion).
That said, even though including the segment terminator character in the data is technically prohibited, it's still possible for an EDI transmission to inadvertently include ~ in the textual data – in other words, your trading partner may include it by accident. Further, the BIN and BSD (binary data) segments can certainly include ~ (though these may not apply to the transaction sets you're working with).
In our parsing API, we apply a specific set of patterns based on the type of segment we encounter. For us, it's not sufficient to split naively on the segment delimiter alone, because we may encounter binary segments (BIN, BSD), where the segment delimiter character may legitimately appear inside the data.
For a regular segment (i.e. not BIN or BSD), the logic is something like this:
Consume the segment code (i.e. the characters before the first element delimiter).
Consume each element of the segment based on the element delimiter.
Stop if the next character is a segment delimiter or a new line.
As an example, for segment BEG*PO-00001**20210901~, the process would look like:
Consume BEG. Since this is not a special segment (BIN or BSD), consume elements by splitting on *.
Consume PO-00001.
Consume ''.
Consume 20210901.
Stop since next char is ~.
(The pattern for binary segments is different from the pattern we use for regular segments.)
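The steps above can be sketched in Python for a regular segment (a toy sketch under the stated assumptions, not our actual API; `split_segment` is a made-up name, and it does not handle BIN/BSD):

```python
def split_segment(segment: str, elem_sep: str = "*", seg_term: str = "~"):
    """Split a regular (non-binary) segment into (segment code, elements)."""
    body = segment.rstrip("\r\n")
    if body.endswith(seg_term):
        body = body[:-1]           # stop at the segment terminator
    parts = body.split(elem_sep)   # consume elements by the element delimiter
    return parts[0], parts[1:]     # segment code first, then its elements

print(split_segment("BEG*PO-00001**20210901~"))
# ('BEG', ['PO-00001', '', '20210901'])
```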
Here's an example of how our parser "fails" on a ~ in the textual data when the ISA16 segment delimiter is also ~; the JSON representation is particularly helpful for seeing the issue.
Here's an example of our parser succeeding on a ~ in the textual data when the ISA16 segment delimiter is ^.
Lastly, here's an example of our parsing succeeding where the ~ is specified in ISA16, but has been omitted altogether in favor of newlines – which we see occasionally.
Hope this helps.

How do I get rid of trailing and embedded spaces in a string?

I am writing a program that converts national and international account numbers into IBAN numbers. To start, I need to form a string: Bank ID + Branch ID + Account Number + ISO Country Code without the trailing spaces that may be present in these fields. But not every account number has the same length, some account numbers have branch identifiers while others don't, so I will always end up with trailing spaces from these fields.
My working storage looks something like this:
01  Input-IBAN.
    05  BANK-ID       PIC N(10) VALUE "LOYD".
    05  BRANCH-ID     PIC N(10) VALUE " ".
    05  ACCOUNT-NR    PIC N(28) VALUE "012345678912".
    05  COUNTRY-CODE  PIC N(02) VALUE "GB".
01  Output-IBAN       PIC N(34).
I've put some values in there for the example; in reality it would depend on the input. The branch code is optional, hence me leaving it empty in the example.
I basically want to go from this input strung together:
"LOYD                012345678912                GB"
to this:
"LOYD012345678912GB"
Does anyone know a way to do this that does not result in performance issues? I have thought of using the FUNCTION REVERSE and then using an INSPECT for tallying leading spaces. But I've heard that's a slow way to do it. Does anyone have any ideas? And maybe an example on how to use said idea?
EDIT:
I've been informed that the elementary fields may contain embedded spaces.
I see now that you have embedded blanks in the data. Neither answer you have so far works, then. Gilbert's "squeezes out" the embedded blanks, mine would lose any data after the first blank in each field.
However, just to point out, I don't really believe you can have embedded blanks if you are in any way generating an "IBAN". For instance, https://en.wikipedia.org/wiki/International_Bank_Account_Number#Structure,
specifically:
The IBAN should not contain spaces when transmitted electronically.
When printed it is expressed in groups of four characters separated by
a single space, the last group being of variable length
If your source-data has embedded blanks, at the field level, then you need to refer that back up the line for a decision on what to do. Presuming that you receive the correct answer (no embedded blanks at the field level) then both existing answers are back on the table. You amend Gilbert's by (logically) changing LENGTH OF to FUNCTION LENGTH and dealing with any possibility of overflowing the output.
With the STRING you again have to deal with the possibility of overflowing the output.
Original answer based on the assumption of no embedded blanks.
I'll assume you don't have embedded blanks in the elementary items which make up your structure, as they are sourced by standard values which do not contain embedded blanks.
MOVE SPACE TO OUTPUT-IBAN
STRING BANK-ID
       BRANCH-ID
       ACCOUNT-NR
       COUNTRY-CODE
    DELIMITED BY SPACE
    INTO OUTPUT-IBAN
STRING only copies the values until it runs out of data to copy, so it is necessary to clear the OUTPUT-IBAN before the STRING.
Copying of the data from each source field will end when the first SPACE is encountered in each source field. If a field is entirely space, no data will be copied from it.
STRING will almost certainly cause a run-time routine to be executed and there will be some overhead for that. Gilbert LeBlanc's example may be slightly faster, but with STRING the compiler deals automatically with all the lengths of all the fields. Because you have National fields, ensure you use the figurative-constant SPACE (or SPACES, they are identical) not a literal value which you think contains a space " ". It does, but it doesn't contain a National space.
If the result of the STRING is greater than 34 characters, the excess characters will be quietly truncated. If you want to deal with that, STRING has an ON OVERFLOW phrase, where you specify what you want done in that case. If using ON OVERFLOW, or indeed NOT ON OVERFLOW you should use the END-STRING scope-terminator. A full-stop/period will terminate the STRING statement as well, but when used like that it can never, with ON/NOT ON, be used within a conditional statement of any type.
Don't use full-stops/periods to terminate scopes.
COBOL doesn't have "strings". You cannot get rid of trailing spaces in fixed-length fields, unless the data fills the field. Your output IBAN will always contain trailing spaces when the data is short.
If you were to actually have embedded blanks at the field level:
Firstly, if you want to "squeeze out" embedded blanks so that they don't appear in the output, I can't think of a simpler way (using COBOL) than Gilbert's.
Otherwise, if you want to preserve embedded blanks, you have no reasonable choice other than to count the trailing blanks so that you can calculate the length of the actual data in each field.
COBOL implementations do have Language Extensions. It is unclear which COBOL compiler you are using. If it happens to be AcuCOBOL (now from Micro Focus) then INSPECT supports TRAILING, and you can count trailing blanks that way. GnuCOBOL also supports TRAILING on INSPECT and in addition has a useful intrinsic FUNCTION, TRIM, which you could use to do exactly what you want (trimming trailing blanks) in a STRING statement.
move space to your-output-field
string function trim ( your-first-national-source  trailing )
       function trim ( your-second-national-source trailing )
       function trim ( your-third-national-source  trailing )
       ...
    delimited by size
    into your-output-field
Note that other than the PIC N in your definitions, the code is the same as if using alphanumeric fields.
However, for Standard COBOL 85 code...
You mentioned using FUNCTION REVERSE followed by INSPECT. INSPECT can count leading spaces, but not, by Standard, trailing spaces. So you can reverse the bytes in a field, and then count the leading spaces.
You have National data (PIC N). A difference with that is that it is not bytes you need to count, but characters, which are made up of two bytes. Since the compiler knows you are using PIC N fields, there is only one thing to trip you - the Special Register, LENGTH OF, counts bytes, you need FUNCTION LENGTH to count characters.
National data is UTF-16, which happens to mean the two bytes for each character happen to be "ASCII" when one of the bytes represents a displayable character. That doesn't matter either when running on z/OS, an EBCDIC machine, as the compiler will do the necessary conversions automatically for literals or alphanumeric data-items.
MOVE ZERO TO a-count-for-each-field
INSPECT FUNCTION REVERSE ( each-source-field )
    TALLYING a-count-for-each-field
        FOR LEADING SPACE
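For what it's worth, the same reverse-then-count-leading trick reads naturally in other languages too; here it is sketched in Python (illustrative only, `trailing_spaces` is a made-up helper):

```python
def trailing_spaces(field: str) -> int:
    """Count trailing spaces by scanning the reversed field for leading
    spaces: the same idea as INSPECT FUNCTION REVERSE ... LEADING SPACE."""
    count = 0
    for ch in reversed(field):
        if ch != " ":
            break
        count += 1
    return count

field = "012345678912".ljust(28)      # fixed-width field: data, then spaces
data_len = len(field) - trailing_spaces(field)
print(repr(field[:data_len]))
# '012345678912'
```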
After doing one of those for each field, you could use reference-modification.
How to use reference-modification for this?
Firstly, you have to be careful. Secondly you don't.
Secondly first:
MOVE SPACE TO output-field
STRING field-1 ( 1 : length-1 )
       field-2 ( 1 : length-2 )
    DELIMITED BY SIZE
    INTO output-field
Again deal with overflow if possible/necessary.
It is also possible with plain MOVEs and reference-modification, as in this answer, https://stackoverflow.com/a/31941665/1927206, whose question is close to a duplicate of your question.
Why do you have to be careful? Again, from the answer linked previously, theoretically a reference-modification can't have a zero length.
In practice, it will probably work. COBOL programmers generally seem to be so keen on reference-modification that they don't bother to read about it fully, so don't worry about a zero-length not being Standard, and don't notice that it is non-Standard, because it "works". For now. Until the compiler changes.
If you are using Enterprise COBOL V5.2 or above (possibly V5.1 as well, I just haven't checked) then you can be sure, by compiler option, if you want, that a zero-length reference-modification works as expected.
Some other ways to achieve your task, if embedded blanks can exist and can be significant in the output, are covered in that answer. With National, just always watch to use FUNCTION LENGTH (which counts characters), not LENGTH OF (which counts bytes). Usually LENGTH OF and FUNCTION LENGTH give the same answer. For multi-byte characters, they do not.
I have no way to verify this COBOL. Let me know if this works.
77  SUB1 PIC S9(4) COMP.
77  SUB2 PIC S9(4) COMP.

    MOVE 1 TO SUB2
    PERFORM VARYING SUB1 FROM 1 BY 1
        UNTIL SUB1 > LENGTH OF INPUT-IBAN
        IF INPUT-IBAN(SUB1:1) IS NOT EQUAL TO SPACE
            MOVE INPUT-IBAN(SUB1:1) TO OUTPUT-IBAN(SUB2:1)
            ADD +1 TO SUB2
        END-IF
    END-PERFORM.

What is a code point and code space?

I was reading the Wikipedia article on code points, but not sure if I understand correctly.
For example, the character encoding scheme ASCII comprises 128 code
points in the range 0hex to 7Fhex
So is 0hex a code point?
Also could not find anything on code space.
PS. If it's a duplicate please post a link in the comments and I'll remove the question.
A code point is a numerical code that refers to a single element/character in a specific coded character set. That sentence means that ASCII has 128 possible symbols (only some of which are printable characters), and each one of those has a related numerical code by which it can be identified/addressed: the code point.
For an alternative wording, check out this post by Joel and this summary by Oracle, which also introduces the concept of a code unit :)
To give you a real-world example of what code points are, consider the Unicode character SNOWMAN (☃): its code point (in Unicode syntax, U+<code point in hex>) is U+2603.
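Since code points are just integers, Python's `ord`/`chr` make this concrete (a quick illustration, not part of the original answer):

```python
# ord maps a character to its code point; chr goes the other way.
print(hex(ord("☃")))    # 0x2603  (SNOWMAN, U+2603)
print(chr(0x2603))      # ☃
print(hex(ord("A")))    # 0x41    (ASCII code points carry over unchanged)
# The Unicode code space runs from U+0000 through U+10FFFF:
print(hex(ord(chr(0x10FFFF))))   # 0x10ffff  (highest valid code point)
```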
The concepts are slightly more abstract than the traditional, pre-Unicode concepts.
Traditionally, a "code space" was more or less synonymous with "character range". A 7-bit encoding would have a code space from 0 through 127, an 8-bit encoding 0 through 255, a 16-bit encoding 0 through 65535. Unicode has a code space from 0 through 0x10FFFF, though parts of the code space are unpopulated.
Traditionally, a "code point" was more or less synonymous with "character code". Unicode abstracts away from the single "character code" mapping to emphasize that there is a more-complex relationship between a set of glyphs and a set of character codes, and that some code points (such as joining modifiers) do not encode individual glyphs as such. Superficially, U+0020 is still the same character as ASCII SPACE 0x20, but Unicode has a much richer set of well-defined attributes and relationships.
Unicode had to coin new terms for these concepts so as not to overload the traditional terms with extended meanings. A "code space" is a unique, well-defined concept, which is not exactly the same thing as an (implicitly contiguous, possibly fully populated) character range. A "code point" is a unique, well-defined concept, which is not exactly the same thing as a "character code" (which isn't even entirely well-defined in the first place; it has multiple ambiguous interpretations).

How many chars can numeric EDIFACT data elements be long?

In EDIFACT there are numeric data elements, specified e.g. as format n..5 -- we want to store those fields in a database table (with alphanumeric fields, so we can check them). How long must the db-fields be, so we can for sure store every possible valid value? I know it's at least two additional chars (for decimal point (or comma or whatever) and possibly a leading minus sign).
We are building our tables after the UN/EDIFACT standard we use in our message, not the specific guide involved, so we want to be able to store everything matching that standard. But documentation on the numeric data elements isn't really straightforward (or at least I could not find that part).
Thanks for any help
I finally found the information on the UNECE web site, in the documentation on UN/EDIFACT rules, Part 4 (UN/EDIFACT rules), Chapter 2.2, Syntax Rules. They don't say it directly, but when you put all the parts together, you get it. See TOC entry 10: REPRESENTATION OF NUMERIC DATA ELEMENT VALUES.
Here's what it basically says:
10.1: Decimal Mark
The decimal mark must be transmitted (if needed) as specified in UNA (comma or point, but always one character). It shall not be counted as a character of the value when computing the maximum field length of a data element.
10.2: Triad Separator
Triad separators shall not be used in interchange.
10.3: Sign
[...] If a value is to be indicated to be negative, it shall in transmission be immediately preceded by a minus sign e.g. -112. The minus sign shall not be counted as a character of the value when computing the maximum field length of a data element. However, allowance has to be made for the character in transmission and reception.
To put it together:
Other than the digits themselves, only two (optional) characters are allowed in a numeric field: the decimal separator and a minus sign (no blanks are permitted between any of the characters). These two extra characters are not counted against the maximum length of the value in the field.
So the maximum number of characters in a numeric field is the maximum length of the numeric field plus 2. If you want your database to be able to store every syntactically correct value transmitted in a field specified as n..17, your column would have to be 19 characters long (something like varchar(19)). An EDIFACT message that has a value longer than 19 characters in a field specified as n..17 does not need to be stored in the DB for semantic checking, because it is already syntactically wrong and can be rejected.
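The rule of thumb above can be captured in a short Python sketch (illustrative; `db_column_width` is a made-up helper that only understands the simple `n..<max>` notation):

```python
def db_column_width(edifact_format: str) -> int:
    """Character width needed to store any valid value of a numeric
    EDIFACT element such as 'n..17': the maximum digit count plus 2,
    for the decimal mark and the minus sign (neither counts against
    the element's own length)."""
    max_digits = int(edifact_format.lstrip("n."))
    return max_digits + 2

print(db_column_width("n..17"))  # 19 -> varchar(19)
print(db_column_width("n..5"))   # 7
```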
I used EDI Notepad from Liaison to solve a similar challenge. https://liaison.com/products/integrate/edi/edi-notepad
I recommend anyone looking at EDI to at least get their free (express) version of EDI Notepad.
The "high end" version (EDI Notepad Productivity Suite) of their product comes with a "Dictionary Viewer" tool with which you can export the min/max lengths of the elements, as well as their types. You can export the document to HTML from the Viewer tool. It handles ANSI X12 too.

How do I know what character set is used in x12 document?

I am implementing an EDI X12 header parser (only to parse the "ISA" segment).
I notice that there are several character sets that can be used.
My question is: how do I know which one is used in an incoming EDI X12 message, so that I know how to interpret the message?
Actually, there is no such thing as a character set in X12.
This is up to the partners/interchange agreement.
But as X12 is mainly used in the USA, it is US-ASCII (almost always).
(But... some companies send X12 as EBCDIC ;-)))
If you're only doing ANSI X12, the ISA segment should be easy for you to parse, as it is a fixed length.
Position 4 will give you the element delimiter (field delimiter).
Position 106 will give you the record terminator.
Position 105 will give you the subelement delimiter.
You probably won't have much use for the subelement delimiter, depending on the document type.
Once you figure out what your field delimiters are and then the record delimiter, it should be a snap.
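Under that fixed-length assumption, pulling the delimiters off those positions is a one-liner each in Python (a sketch; `isa_delimiters` is a made-up name, and positions are 0-based in code versus 1-based in the description above):

```python
def isa_delimiters(raw: str):
    """Read the three delimiters straight off a fixed-length ISA segment."""
    assert raw.startswith("ISA") and len(raw) >= 106
    element_sep    = raw[3]     # position 4: element (field) delimiter
    subelement_sep = raw[104]   # position 105: subelement delimiter (ISA16)
    segment_term   = raw[105]   # position 106: record terminator
    return element_sep, subelement_sep, segment_term
```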
(Standard disclaimer: there are many great tools out there in the form of data translators that make this job much simpler than having a programmer reinvent the wheel. Some of these tools are even open source and free. Just sayin'...)
Hope this helps.
