Note:
This question is not asking for advice on which library to use; I'm rolling my own.
I'm reading through the HL7 v2.5.1 spec in order to make a parsing engine for iOS and Windows.
My question is related to the Name Validity Range component in the Patient Name field (PID-5). But I think it applies generally to all DR (Date/Time Range) components.
In Chapter 3: Patient Administration, on page 75, the following information is listed:
Components: {...omitted...} ^ <Name Validity Range (DR)> ^
{...omitted...}
Subcomponents for Name Validity Range (DR):
<Range Start Date/Time (TS)> & <Range End Date/Time (TS)>
Subcomponents for Range Start Date/Time (TS):
<Time (DTM)> & <Degree of Precision (ID)>
Subcomponents for Range End Date/Time (TS):
<Time (DTM)> & <Degree of Precision (ID)>
I understand how the fields, components and subcomponents are structured and how their separators are used... or at least I think I do. However, the above information confuses me as to how the data would be expressed. I have searched, but cannot find a suitable message sample for this kind of data. Based on my understanding of the HL7 data structures, here's how the data would be encoded:
PID|||01234||JONES^SUSIE^Q^^^^^^^199505011201&M&199505011201&M^199505011201&M&199505011201&M
The problem here, of course, is that having subcomponents embedded in subcomponents leaves you unsure exactly how to parse the data and what data goes where.
I did look into Chapter 2: Control, Appendix A and found this text on page 160:
Note: DR cannot be legally expressed when embedded within another data type. Its use is constrained to a segment field.
So, it appears that the standard listed for PID-5 is invalid. I haven't seen any messages from my system that even generate this information, so it may be a moot point for my particular case, but I don't like developing solutions with known holes. Has anybody encountered this "in the wild"?
An item with the DR data type can be subdivided and have a precision subcomponent if the item is a field, e.g. ARQ-11 Requested Start Date/Time Range.
It can be subdivided into range start and range end subcomponents, but not precision subcomponents, if the item with the DR data type is already part of another data type, as in your example PID-5.
Patient name is an XPN data type, which is a composite data type. That basically means it can be a combination of primitives (like ST) and other composites, as shown here.
Now, you are looking at XPN.10, the 10th component, which is of the DR data type. DR is in turn a combination of two primitive DTM values, range start and range end, expressed as two subcomponents. Subcomponents are separated by &.
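To make that concrete, here is a minimal parsing sketch in Python, assuming the standard delimiters (| for fields, ^ for components, & for subcomponents) and a DR value carrying only its two legal subcomponents; the sample segment is illustrative, not from the spec:

# Minimal sketch of HL7 v2 delimiter nesting; assumes standard separators.
segment = ("PID|||01234||JONES^SUSIE^Q^^^^^^^"
           "199505011201&199505011201")

fields = segment.split("|")       # index 0 is the segment ID, 1 is PID-1, ...
pid5 = fields[5]                  # PID-5, Patient Name (XPN)
components = pid5.split("^")      # XPN.1 .. XPN.10

family, given = components[0], components[1]
validity_range = components[9]    # XPN.10, Name Validity Range (DR)

# DR carries exactly two subcomponents: range start and range end.
range_start, range_end = validity_range.split("&")
print(family, given, range_start, range_end)
# -> JONES SUSIE 199505011201 199505011201

Note there is no room left for the TS Degree of Precision subcomponent here; that is exactly why the spec says DR cannot legally be expressed when embedded within another data type.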
An extensive search has produced no answer to the question, "Is there a class or function that parses input for soundness relating to UK Ordnance Survey Grid References".
The UK is mapped by the Ordnance Survey, who produce detailed maps of the United Kingdom with many types of referencing. One of these is the commonly used six-figure grid reference; we are at SO896804.
We already use a postcode (zip) checker to make sure that the information entered into the postcode field is sound, but we can't find the same for the OS Grid Reference.
Does such a Grid Reference function exist, or do we down tools and write one?
Thank you.
Since you tagged this "parsing", rather than using some GIS-like tag, I'd say that a reasonably valid OS grid reference corresponds to the regular expression:
(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV|S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])([0-9]{6})
If you were prepared to accept 4-digit 100ha blocks as well as the six-digit 1ha blocks, you could replace the second parenthesized expression with ([0-9]{4}|[0-9]{2}).
Of course, some of the two-letter blocks are almost completely marine. You could almost certainly ignore OV, for example. NW contains a little bit of Dumfries & Galloway (Portpatrick, NW995545), and a much larger but irrelevant part of Northern Ireland.
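Since the question is about validating input, here is a quick Python sketch wrapping that expression (the function name is mine):

import re

# Validate a six-figure OS grid reference against the pattern above.
OSGB_RE = re.compile(
    r"(H[PTUWXYZ]|N[ABCDFGHJKLMNORQSTUWXYZ]|OV|"
    r"S[CDEHJKMNOPRSTUVWXYZ]|T[AFGLMQRV])([0-9]{6})"
)

def is_valid_grid_ref(ref):
    return OSGB_RE.fullmatch(ref) is not None

print(is_valid_grid_ref("SO896804"))  # True: the reference from the question
print(is_valid_grid_ref("XX123456"))  # False: no such letter pair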
I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer name columns; one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names only exact matches are found, regardless of the underlying match algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode (or UTF-8) interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the encoding.
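For what it's worth, the matching algorithms themselves are script-agnostic. Here is a quick standalone check (plain Python, nothing Talend-specific) showing Levenshtein distance behaving normally on Arabic once the strings are properly decoded Unicode; if similar names come back as wildly distant, suspect the decoding upstream rather than the algorithm:

# Classic dynamic-programming edit distance; works on any Unicode strings.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

print(levenshtein("محمد", "محمود"))  # 1: similar Arabic names score close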
We have a COBOL program in which we populate values into a COBOL internal table and then search this table to find a certain value. Prior to this search, we initialize the table's index variable.
SET PAF-IDX TO 1.
Could anyone clarify if it is allowed in COBOL to initialize an index variable like this?
INITIALIZE PAF-IDX.
No. And why would you want to "INITIALIZE" it?
This is from the IBM Enterprise Cobol manual:
identifier-1
Receiving areas.
identifier-1 must reference one of the following:
v An alphanumeric group item
v A national group item
v An elementary data item of one of the following categories:
Alphabetic
Alphanumeric
Alphanumeric-edited
DBCS
External floating-point
Internal floating-point
National
National-edited
Numeric
Numeric-edited
v A special register that is valid as a receiving operand in a MOVE
statement with identifier-2 or literal-1 as the sending operand. [...]
EDIT:
The OpenCobol Programmer's Guide documents it specifically:
The list of data items eligible to be set to new values by this statement is:
Every elementary item specified as identifier-1 ..., PLUS...
Every elementary item defined subordinate to every group item specified as identifier-1 ..., with the following exceptions:
USAGE INDEX items are excluded.
Items with a REDEFINES as part of their definition are excluded; this means that items subordinate to them are excluded as well.
The Draft Cobol Standard is less explicit, but these items when used in INITIALIZE are processed by generating a SET rather than a MOVE: DATA-POINTER, FUNCTION-POINTER, OBJECT-REFERENCE, or PROGRAM-POINTER.
EDIT:
Seeing that the OpenCobol reference is not as "specific" as I thought: in IBM Cobol, currently, nothing that cannot be manipulated by being the target of a MOVE can be INITIALIZEd. The same is true for current OpenCobol. The Draft Cobol has some exceptions, listed above, but they include neither INDEXED BY items (which are not part of the table itself, but separate items for which the compiler defines storage) nor USAGE INDEX items.
We are in need of developing a back-end application that can parse a full name into
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
Challenge here is that it has to support names of multiple countries and languages. One assumption that we have is we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country/language combination, it may come as first name then last name, or the reverse. A comma will not be part of the full name.
Is it feasible? We are also open to any commercially available software.
I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.
Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name]; apparently with a degree of certainty in regards to resolving some of the name ambiguity issues alluded to in other responses.Note: I have no personal experience nor association with the product, I had merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question, ameliorating some of the previously alluded-to difficulties in accomplishing the task: if I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)", which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize, and match names."
FWIW, I also ran across a "Name Parser" which uses a database of ~140K names, as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm
The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.
Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.
Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a JavaScript library on the Node package manager (npm):
https://npmjs.org/package/name-parser
I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III'
, attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ salutation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }
A basic algorithm could do the following:
First, see if the incoming string starts with a title such as "Mrs", checking against a fixed list of titles, and remove it if it does.
If exactly one space remains, assume the first word is the first name and the second word is the surname (which will be incorrect at times).
To go beyond that would be lots of work; see How to parse full names to identify avenues for improvement, and see these involved IBM docs for further implementation clues.
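A minimal Python sketch of those two steps (the title list is illustrative and deliberately short):

TITLES = {"mr", "mrs", "ms", "dr", "prof"}

def parse_simple_name(full_name):
    # Returns (title, first, last); None where a part can't be determined.
    words = full_name.replace(".", "").split()
    title = None
    if words and words[0].lower() in TITLES:
        title = words.pop(0)
    if len(words) == 2:            # exactly one space left
        return title, words[0], words[1]
    return title, None, " ".join(words) or None

print(parse_simple_name("Dr. Jane Smith"))  # ('Dr', 'Jane', 'Smith')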
"Ashton Jordan" "Jordan Ashton" -- u can't tell which is the surname and which is the give name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names) then maybe you can use that to identify other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...) And if there is ambiguity your name-parser could ask for human input.
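A sketch of that idea in Python, with a tiny stand-in for the huge surname list and None standing in for the ask-a-human fallback:

# Names (claimed to) never appear as given names; a real list would be huge.
SURNAMES_ONLY = {"ashton"}

def guess_order(word1, word2):
    # Returns (given, surname), or None when the pair stays ambiguous.
    w1, w2 = word1.lower(), word2.lower()
    if w1 in SURNAMES_ONLY and w2 not in SURNAMES_ONLY:
        return word2, word1
    if w2 in SURNAMES_ONLY and w1 not in SURNAMES_ONLY:
        return word1, word2
    return None   # ambiguous: ask for human input

print(guess_order("Jordan", "Ashton"))  # ('Jordan', 'Ashton')
print(guess_order("Jordan", "Taylor"))  # None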
As others have explained, the problem is not solvable. The best approach I can think of for storing names is to store the full name, together with the start (and potentially also end) offsets of a "primary collating subfield", which the person entering the name could have indicated by highlighting it or such. For example:
John Robert Miller, Jr.
where "Miller" would be the marked "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course, this approach alone may not be sufficient if you also want to support titles (and ignore them for collation purposes)...
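A rough Python sketch of how such a record and its collating key might look (the field names are mine):

from typing import NamedTuple

class StoredName(NamedTuple):
    full: str
    collate_start: int   # offsets marked by whoever entered the name
    collate_end: int

def collating_key(name):
    # Move the primary collating subfield to the front of the key.
    sub = name.full[name.collate_start:name.collate_end]
    rest = name.full[:name.collate_start] + name.full[name.collate_end:]
    return (sub + " " + rest.strip()).strip()

n = StoredName("John Robert Miller, Jr.", 12, 18)  # marks "Miller"
print(collating_key(n))  # "Miller John Robert , Jr." -- sorts under M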
I've seen XML before, but I've never seen anything like EDI.
How do I read this file and get the data that I need? I see things like ~, REF, N1, N2, N4 but have no idea what any of this stuff means.
I am looking for examples and documentation.
Where can I find them?
Also, the EDI guide I found says that it is based on "ANSI ASC X12, ver. 4010".
Should I search for X12?
Kindly help.
Several of these other answers are very good. I'll try to fill in some things they haven't mentioned.
EDI is a set of standards, the most common of which are:
ANSI X12 (popular in the states)
EDIFACT (popular in Europe)
Sounds like you're looking at X12 version 4010. That's the most widely used (in my experience, anyway) version. There are lots and lots of different versions.
The file, or properly "interchange", is made up of segments and elements (and sometimes subelements). Each segment begins with a two- or three-character identifier (ISA, GS, ST, N1, REF).
The structure for all documents begins and ends with an envelope. The envelope is usually made up of the ISA segment and the GS segments. There can be more than one GS segment per file, but there should only be one ISA segment per file (note the should, not everyone plays by the rules).
The ISA is a special segment. Whereas all the other segments are delimited, and can therefore be of varying lengths, the ISA segment is of fixed width. This is because it tells you how to read the rest of the file.
Start with the last three characters of the ISA segment. Those will tell you the element delimiter, the sub-element delimiter, and the segment delimiter. Here's an example ISA line.
ISA:00:          :00:          :01:1515151515     :01:5151515151     :041201:1217:U:00403:000032123:0:P:*~
In this case, the ":" is the element delimiter, "*" is a subelement delimiter, and "~" the segment delimiter. It's much easier if you're just trying to look at a file to put linebreaks after each segment delimiter (~).
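Here is a hedged Python sketch of that trick: because the ISA segment is fixed-width, the separators can be read from known offsets before any splitting happens (offsets follow the 106-character ISA layout):

# Sample interchange from this answer (ISA elements at their fixed widths).
isa = ("ISA:00:          :00:          :01:1515151515     "
       ":01:5151515151     :041201:1217:U:00403:000032123:0:P:*~")
gs = "GS:PO:9988776655:1122334455:20041201:1217:128:X:004030~"
interchange = isa + gs

element_sep = interchange[3]       # first character after "ISA"   -> ":"
subelement_sep = interchange[104]  # ISA16                         -> "*"
segment_sep = interchange[105]     # terminator right after ISA16  -> "~"

for seg in interchange.split(segment_sep):
    if seg:
        elements = seg.split(element_sep)
        print(elements[0], "->", len(elements) - 1, "elements")
# ISA -> 16 elements
# GS -> 8 elements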
The ISA also tells you who the document is from and to, what the version is (00403, which is also known as 4030), and the interchange control number (000032123). The other stuff is probably not important to you at this stage.
This document is from sender "01:1515151515" and to receiver "01:5151515151". So what's with the "01:"? Well, this introduces an important concept in EDI, the qualifier. Several elements have qualifiers, which tell you what type of data the next element is. In this case, the 01 means the identifier is a Dun & Bradstreet number. Other qualifiers for the ISA05 and ISA07 elements are 12 for phone number, and ZZ for "user defined". You'll find the concept of qualifiers all over EDI segments. A decent rule of thumb is that if it's two characters, it's a qualifier. In order to know what all the qualifiers mean, you'll need a standards guide (either in hard copy from the EDI standards body, or in some software).
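In code, qualifiers tend to end up as small lookup tables; the three codes below are the ones mentioned above:

# Qualifier codes for ISA05/ISA07 mentioned in this answer.
ID_QUALIFIERS = {
    "01": "Dun & Bradstreet (DUNS) number",
    "12": "Telephone number",
    "ZZ": "Mutually defined",
}

qualifier, sender = "01", "1515151515"
print(sender, "is a", ID_QUALIFIERS.get(qualifier, "unknown id type"))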
The next line is the GS. This is a functional group (a way to group like documents together within an interchange.) For instance, you can have several purchase orders, and several functional acknowledgements within an ISA. These should be placed in separate functional groups (GS segments). You can figure out what type of documents are in a GS segment by looking at the first GS01 element.
GS:PO:9988776655:1122334455:20041201:1217:128:X:004030
Besides the document type, you can see the from (9988776655) and to (1122334455) again. This time they're using different identifiers, which is legal, because you may be receiving an interchange on behalf of someone else (if you're an intermediary, for instance). You can also see the version number again, this time with a trailing "0" (004030). Use significant-digits logic to strip off the leading zeros. Why is there an extra zero here and not in the ISA? I don't know. Lastly, this GS segment also has its own identifier, 128.
That's it for the beginning of the envelope. After that there will be a loop of documents beginning with ST. In this case they'd all be POs, which have a code (850), so the line would start with ST:850:blablabla
The envelope stuff ends with a GE segment which references the GS identifier (128) so you know which segment is being closed. Then comes an IEA which similarly closes out the ISA.
GE:1:128~
IEA:1:000032123~
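Once segments are split into elements as above, checking that the trailers close their headers takes only a few lines (element positions follow the X12 layout described in this answer):

def check_envelope(segments):
    # segments: list of element lists, e.g. [["ISA", ...], ["GS", ...]].
    by_id = {seg[0]: seg for seg in segments}
    ok = by_id["IEA"][2] == by_id["ISA"][13]      # IEA02 echoes ISA13
    ok = ok and by_id["GE"][2] == by_id["GS"][6]  # GE02 echoes GS06
    return ok

segs = [
    ["ISA"] + [""] * 12 + ["000032123", "0", "P", "*"],
    ["GS", "PO", "9988776655", "1122334455", "20041201", "1217",
     "128", "X", "004030"],
    ["GE", "1", "128"],
    ["IEA", "1", "000032123"],
]
print(check_envelope(segs))  # True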
That's an overview of the structure and how to read it. To understand it you'll need a reference book or software so you understand the codes, lots and lots of time, and lots and lots of practice. Good luck, and post again if you have more specific questions.
Wow, flashbacks. It's been over sixteen years ...
In principle, each line is a "segment", and the identifier at the beginning of the line is a segment identifier. Each segment contains "elements", which are essentially positional fields. They are delimited by "element delimiters".
Different segments mean different things, and can indicate looping constructs, repeats, etc.
You need to get a current version of the standard for the basic parsing, and then you need the data dictionary to describe the content of the document you are dealing with, and then you might need an industry profile, implementation guide, or similar to deal with the conventions for the particular document type in your environment.
Examples? Not current, but I'm sure you could find a whole bunch using your search engine of choice. Once you get the basic segment/element parsing done, you're dealing with your application level data, and I don't know how much a general example will help you there.
EDI is a file format for structured text files, used by lots of larger organisations and companies for standard database exchange. It tends to be much more compact than XML, which used to be great when data packets had to be small. Many organisations still use it, since many mainframe systems use EDI instead of XML.
With EDI messages, you're dealing with text messages that match a specific format. This would be similar to an XML schema, but EDI doesn't really have a standardized schema language. EDI messages themselves aren't really human-readable, while most specifications aren't really machine-readable. This is basically the advantage of XML, where both the XML and its schema can be read by humans and machines.
Chances are that when you're doing electronic banking through some client-side software (not browser-based) then you might already have several EDI files on your system. Banks still prefer EDI over XML to send over transaction data, although many also use their own custom text-based formats.
To understand EDI, you'll have to understand the data first, plus the EDI standard that you want to follow.
Assuming the data stream starts with "ISA", towards the beginning there should be a section "~ST*" followed by three numeric digits. If you can post these three digits, I can probably provide you with more information. Also, knowing the industry would be helpful. For example, healthcare uses 270, 271, 276, 277 and a few others.