How do I find the type of a delimiter? - delimiter

I have a log file that has a delimiter that I haven't seen before. It isn't tab or comma delimited as I would have noticed it already. When I open the log file (a .gz compressed file), the fields are delimited and separated by this black field with the letters SOH.
I've tried copying that SOH into Google and other editors, but only see the same character again. When copied into Google, I see a square with 4 numbers: 0, 0, 0, 1.
How do I figure out what the exact character of the delimiter?

SOH is ASCII character 0x01 or Unicode character U+0001. It is also sometimes known as ^A (Control-A). It stands for "Start of Heading". For more information, see C0 Control Codes.

Related

Where should my brackets be in relation to the text for Arabic languages?

Our application automatically modifies the layout of Arabic text when it is followed by a bracket and I was wondering whether this was the correct behaviour or not?
The application shows items in the following format:
[ID of structure](version)
So version 1.5 of the English structure "stackoverflow" would be displayed as:
stackoverflow(1.5)
Note: the brackets need to be displayed. There is no space between the ID and the first bracket. The brackets simply encompass the version. The brackets could have been any character but it's far too late to switch to a different character now!
This works fine for left to right languages, but for Arabic languages the structures appear in the form:
ستاكوفيرفلوو(1.0)
I am not an Arabic speaker and I need to know if this is actually correct. Is the Arabic format the equivalent of the English format or has something gone horribly wrong?
The text in Arabic should be shown like:
ستاكوفيرفلوو(1.0) ‏
I added the html entity of RLM / Right-to-left Mark ‏ in order to fix the text. You should do so if your application doesn't support Bidi native-ly. You can add the RLM by these ways:
HTML Entity (decimal) ‏
HTML Entity (hex) ‏
HTML Entity (named) ‏
How to type in Microsoft Windows Alt +200F
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)
UTF-8 (binary) 11100010:10000000:10001111
UTF-16 (hex) 0x200F (200f)
UTF-16 (decimal) 8,207
UTF-32 (hex) 0x0000200F (200f)
UTF-32 (decimal) 8,207
C/C++/Java source code "\u200F"
Python source code u"\u200F"
(note: StackOverflow right transliteration is ستاك-أوفرفلو)

How to mask specific elements in HL7?

Currently I am learning how to work with HL7 and how to parse it in python. Now I was wondering what happens if a value in a HL7 segment contains a pipe sign, e.g. '|'. How is this sign handled? If there is no masking, it would lead to a crash of the HL7 parser. Is there a masking possibility?
\F\
You should read the relevant sections of chapter 2 of the version 2 standard about how escaping works in version 2.
The HL7 structure has defined escape sequences for the separators like |.
When you look at a HL7 message, the used five delimiters are right after the MSH:
MSH|^~\&
| is the Field separator F
^ the component separator S
~ is the repetition separator (for the second level elements) R
\ is the escape character E
& is the sub-component separator T
So to escape one of the special characters like |, you have to take the escape character and then add the defined letter (F,S, etc.)
So in above case, to escape the | you would have to put \F\. Or escaping the escape character is \E\.
If you like you can also change the delimiters after the MSH completely, but I don't recommend that.

EDI X12 tool is not able to convert the perpendicular as a segment separator

While processing an EDI 210 X12 inbound file, receiving a following exception as EdiInvoice Service process failed: '', hexadecimal value 0x15, is an invalid character. Line 2, position 37.'. Because X12 Input file having a perpendicular in 106 position of ISA 16 element.
Can you please provide a solution to handle this symbol
It is not uncommon to define a segment separator like "|" is the X12 ISA segment (ISA16, character 106 of the ISA segment). Have a look at a related tutorial.
To my knowledge, characters with ASCII codes below 128 (hex 0x80) are allowed.
If your EdiInvoice service cannot handle partner-specific segment separators, you most likely have to contact the developer of your tool or the provider of your service first.
As suggested by eppye: If the sending partner can switch to an "easier" segment separator, this would also be an option, but there has to be a good reason for the partner to invest time and effort.
If the syntax of the EDI 210 X12 message conforms to the specification, the sending partner has no obligation to change anything.
No, neither CR nor CR+LF are allowed as Segment Terminator.
X12 is based on the idea of "graphical characters", to be independent of character encodings. CR is a non-printable character, not a graphical character.
The allowable chars in the basic char list are
"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
!"&'()*+,-./:;?="
The extended character list is
"abcdefghijklmnopqrstuvwxyz%#[]_{}\|<>~^`#$"
Supporting the extended characters is a Trading Partner agreement item.
To be "X12 legal", a character from the basic set must be used. If the extended set is supported, a char from it may be used.
"X12 legal" is only as important as your trading partner thinks it is (or their software).
Using CR or CR+LF is not uncommon, even if frowned upon by purists.
Having a CR or CR+LF after each Segment Terminator is more common.
some companies indeed use non-printable character as segment terminator. AFAIK this is OK in ANSI X12. Sort of smart, as you are not allowed to use the segment terminator in your data, and data will (almost;-)) never contain hex 15. I have seen hex 07, carriage return etc..
possible solutions:
1. Contact the provider of the service, they should fix it.
2. Ask edi-partner if they can use other segment terminator.
3. pre-process the file and replace that character. Maybe not possible.
Not sure which EDI tool you're using, but another option is to define the element and segment terminators in your partner profile in your tool. I've done this in Sterling Integrator and know others support this as well.
In your case you need to confirm the trading partner of the below rule so that they may not get mistaken ISA16 with Segment Terminator or Suffix .
ISA16 (Sub Element separator)
Char
3a if the type is Hex
Limited to the values in the ASCII character set.
Segment terminator
~ if the type is Char
7e if the type is Hex.
empty
but if you do, you need to designate a suffix.This element is limited to the values in the ASCII character set.
Suffix
either None
CR (carriage return)
LF(line feed)
CRLF (carriage return/line feed).
Various combination for Segment Terminator and Suffix
Segment terminator
Segment terminator + carriage return
Segment terminator + line feed
Segment terminator + carriage return/line feed
Carriage return
Line feed
Carriage return/line feed

RAILS 3 CSV "Illegal quoting" is a lie

I've hit a problem during parsing of a CSV file where I get the following error:
CSV::MalformedCSVError: Illegal quoting on line 3.
RAILS code in question:
csv = CSV.read(args.local_file_path, col_sep: "\t", headers: true)
Line 3 in the CSV file is:
A-067067 VO VIA CE 0 8 8 SWCH Ter 4, Loc Is Here, Mne, Per Fl Auia/Sey IMAC NEK_HW 2011-03-09 09:47:44 2011-03-09 11:50:26 2011-01-13 10:49:17 2011-02-14 14:02:43 2011-02-14 14:02:44 0 0 771 771 46273 "[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC" SOME_TEXT SOME_TEXT N/A Name Here RESOLVED_CLOSED RESOLVED_CLOSED
UPDATE: Tabs don't appear to have come across above. See pastebin RAW TEXT: http://pastebin.com/4gj7iUpP
I've read numerous threads all over StackOverflow and Google about why this is and I understand that. But the CSV row above has perfectly legal quoting does it not?
The CSV is tab delimited and there is only a tab followed by the quote on either side of the column in question. There is 1 quote in that field and it is double quoted to escape it. So what gives? I can't work it out. :(
Assuming I've got something wrong here, I'd like the solution to include a way to work around the issue as I don't have control over how the CSV is constructed.
This part of your CSV is at fault:
46273 "[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC" SOME_TEXT
At least one of these parts has a stray space:
46273 "
" SOME_TEXT
I'd guess that the "3" and the double are supposed to be separated by one or more tabs but there is a space before the quote. Or, there is a space after the quote on the other end when there are only supposed to be tabs between the closing quote and the "S".
CSV escapes double quotes by double them so this:
"[O/H 15/02] B270 W31 ""TEXT TEXT 2 X TEXT SWITC"
is supposed to be a single filed that contains an embedded quote:
[O/H 15/02] B270 W31 "TEXT TEXT 2 X TEXT SWITC
If you have a space before the first quote or after the last quote then, since your fields are tab delimited, you have an unescaped double quote inside a field and that's where your "illegal quoting" error comes from.
Try sending your CSV file through cat -t (which should represent tabs as ^I) to find where the stray space is.

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but is a Unicode en-dash (hex 2013). Your code is recieving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.

Resources