How to: XSLT Transformation


I'd like to know how to extract the author names from the XML sample below and give each one the attribute eq="yes". EQ stands for Equal Contributors.
This is the XML:
<ArticleFootnote Type="Misc">
<Para>John Doe and Jane Doe are equal contributors.</Para>
</ArticleFootnote>
This should be the output, in a different XML form:
<AuthorGroups>
<Authors eq="yes">John Doe</Authors>
<Authors eq="yes">Jane Doe</Authors>
</AuthorGroups>
Assume that John Doe and Jane Doe are already defined in the list of authors; after the transformation, each of their author tags should carry the attribute eq="yes". Please help, as I don't know much about writing XSLT.
Thanks in advance.

There's not really enough information here to give you a clear answer.
If you have a list of authors, you could use fn:matches() on each author's name in turn, maybe after changing each space to \s+ in the pattern.
I normally use Perl to do this sort of thing, though, being careful not to disrupt the tagging structure.
In any case you'll need to process the text a word at a time, probably recursively, to find the longest match in cases where one name is just "John" and another is "John Doe". Watch that you don't add markup to names you've already processed.
If the text really always says exactly what you have there, just with different names, you could have a template that matches ArticleFootnote/Para[contains(., 'are equal contributors')] and use either substring() and substring-before() or the XSLT 2.0 regular expression matching.
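The string handling in that last case can be sketched outside XSLT. The Ruby below is only an illustration of the substring-before idea, not a stylesheet: the method names are hypothetical, it assumes the sentence always ends with " are equal contributors.", and it assumes the names contain no characters that need XML escaping.

```ruby
# Fixed tail of the sentence, as in the sample footnote.
SUFFIX = " are equal contributors."

# Hypothetical helper: returns the names listed before the fixed tail,
# or an empty array when the paragraph has a different shape.
def equal_contributors(para_text)
  text = para_text.strip
  return [] unless text.end_with?(SUFFIX)

  # Same idea as substring-before(., ' are equal contributors')
  names_part = text[0...-SUFFIX.length]

  # Split on commas (optionally followed by "and") or on a bare "and"
  names_part.split(/\s*,\s*(?:and\s+)?|\s+and\s+/).reject(&:empty?)
end

# Hypothetical helper: renders the target elements from the question.
# Names are assumed to be XML-safe; no escaping is done here.
def author_elements(names)
  names.map { |name| %(<Authors eq="yes">#{name}</Authors>) }
end

puts author_elements(equal_contributors("John Doe and Jane Doe are equal contributors."))
```

In a real stylesheet the same split would still have to be checked against the existing author list, as described above.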

Related

AIML Parser PHP

I am trying to develop an artificial chat bot, and I found that AIML can be used to achieve this. I found these points about the AIML parsing done by Program O:
1.) All letters in the input are converted to UPPERCASE
2.) All punctuation is stripped out and replaced with spaces
3.) Extra whitespace characters, including tabs, are removed
From there, Program O performs a search in the database, looking for all potential matches to the input, including wildcards. The returned results are then “scored” for relevancy and the “best match” is selected. Program O then processes the AIML from the selected result, and returns the finished product to the user.
I am just wondering how to define the score and find the relevant answer closest to the user's input.
Any help or ideas will be appreciated.

@user3589042 (rather cumbersome name, don't you think?)
I'm Dave Morton, lead developer for Program O. I'm sorry I missed this at the time you asked the question. It only came to my attention today.
The way that Program O scores the potential matches pulled from the database is this:
1.) Is the response from the aiml_userdefined table? yes=300/no=0
2.) Is the category for this bot, or its parent (if it has one)? this=250/parent=0
3.) Does the pattern have one or more underscore (_) wildcards? yes=100/no=0
4.) Does the current category have a <topic> tag? yes(see below)/no=0
    a. Does the <topic> contain one or more underscore (_) wildcards? yes=80/no=0
    b. Does the <topic> directly match the current topic? yes=50/no=0
    c. Does the <topic> contain a star (*) wildcard? yes=10/no=0
5.) Does the current category contain a <that> tag? yes(see below)/no=0
    a. Does the <that> contain one or more underscore (_) wildcards? yes=45/no=0
    b. Does the <that> directly match the current topic? yes=15/no=0
    c. Does the <that> contain a star (*) wildcard? yes=2/no=0
6.) Is the <pattern> a direct match to the user's input? yes=10/no=0
7.) Does the <pattern> contain one or more star (*) wildcards? yes=1/no=0
8.) Does the <pattern> match the default AIML pattern from the config? yes=5/no=0
The script then adds up the points for all passed tests listed above, and also adds a point for each word in the category's <pattern> that matches a word in the user's input. The AIML category with the highest score is considered to be the "best match". In the event of a tie, the script selects either the "first" highest scoring category, the "last" one, or one at random, depending on the configuration settings. This selected category is then returned to other functions for parsing of the XML.
I hope this answers your question.
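As a rough illustration of how those point values add up, here is a minimal Ruby sketch. It is not Program O's code (Program O is written in PHP): the hash keys, method names, and the value each <topic> or <that> is compared against are all assumptions made for the example, and ties are simply resolved in favour of the first candidate.

```ruby
# Points for a <topic> or <that> value; returns 0 when the tag is absent.
# `current` is whatever the caller wants the tag compared against (an assumption).
def tag_points(value, current, underscore:, direct:, star:)
  return 0 if value.nil? || value.empty?

  points = 0
  points += underscore if value.include?("_")
  points += direct if value == current
  points += star if value.include?("*")
  points
end

# Adds up every passed test from the list, plus one point per pattern word
# that also appears in the user's input.
def score(category, ctx)
  pattern = category[:pattern]
  total = 0
  total += 300 if category[:user_defined]
  total += 250 if category[:bot_id] == ctx[:bot_id]
  total += 100 if pattern.include?("_")
  total += tag_points(category[:topic], ctx[:topic], underscore: 80, direct: 50, star: 10)
  total += tag_points(category[:that], ctx[:that], underscore: 45, direct: 15, star: 2)
  total += 10 if pattern == ctx[:input]
  total += 1 if pattern.include?("*")
  total += 5 if pattern == ctx[:default_pattern]

  input_words = ctx[:input].split
  total + pattern.split.count { |word| input_words.include?(word) }
end

# max_by keeps the first of several equal maxima ("first" tie-breaking only).
def best_match(categories, ctx)
  categories.max_by { |category| score(category, ctx) }
end

ctx = { bot_id: 1, input: "HELLO THERE", topic: "", that: "", default_pattern: "RANDOM PICKUP LINE" }
exact = { pattern: "HELLO THERE", bot_id: 1, user_defined: false }
wild  = { pattern: "HELLO *", bot_id: 1, user_defined: false }
puts best_match([wild, exact], ctx)[:pattern]
```

With these inputs the exact pattern scores 262 (250 + 10 + 2 matching words) and the wildcard pattern 252 (250 + 1 + 1 matching word), so the exact pattern wins.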

Regexp for a name

I need to make sure people enter their first, middle and last names correctly for a form in Rails. So the first thought for a regular expression is:
\A[[:upper:]][[:alpha:]'-]+( [[:upper:]][[:alpha:]'-]*)*\z
That'll make sure every word in the name starts with an uppercase letter followed by a letter or hyphen or apostrophe.
My first question I guess doesn't have much to do with regular expressions, though I'm hoping there's a regular expression I can copy for this. Are letters, hyphens and apostrophes the only characters I should be checking in a name?
My second question is if it's important to make sure each name has at least 1 uppercase letter? So many people enter all lowercase names and I really want to avoid that, but is it sometimes legitimate?
Here's what I have so far that makes sure there's at least 1 uppercase letter somewhere in the name:
\A([[:alpha:]'-]+ )*[[:alpha:]'-]*[[:upper:]][[:alpha:]'-]*( [[:alpha:]'-]+)*\z
Isn't there a [:name:] bracket expression? :)
UPDATE: I added . and , to the characters allowed, surprised I didn't think of them originally. So many people must have to deal with this kind of regular expression! Nobody has any pre-made regular expressions for this sort of thing?

A good start would be to allow letters, marks, punctuation and whitespace. That allows for a given name like "María-Jose" and a last name like "van Rossum" (note the whitespace).
So that boils down to something like:
[\p{Letter}\p{Mark}\p{Punctuation}\p{Separator}]+
If you want to restrict that a bit you could have a look at classes like \p{Lowercase_Letter}, \p{Uppercase_Letter}, \p{Titlecase_Letter}, but there may be scripts that don't have casing. \p{Space_Separator} and \p{Dash_Punctuation} can narrow it down to names that I know. But names I don't...I don't know...
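For comparison, here is a small Ruby sketch of the permissive character class next to the stricter pattern from the question. The constant and method names are made up for the example, the anchors \A and \z are added so the whole string must match, and the short property names (\p{L}, \p{M}, \p{P}, \p{Z}) stand in for Letter, Mark, Punctuation and Separator.

```ruby
# Letters, combining marks, punctuation and separators only.
PERMISSIVE_NAME = /\A[\p{L}\p{M}\p{P}\p{Z}]+\z/

# The stricter pattern from the question: every word starts with an upper-case letter.
CAPITALISED_NAME = /\A[[:upper:]][[:alpha:]'-]+( [[:upper:]][[:alpha:]'-]*)*\z/

def permissive_name?(name)
  PERMISSIVE_NAME.match?(name)
end

def capitalised_name?(name)
  CAPITALISED_NAME.match?(name)
end

["María-Jose", "van Rossum", "John Doe", "R2-D2"].each do |name|
  puts "#{name}: permissive=#{permissive_name?(name)} capitalised=#{capitalised_name?(name)}"
end
```

The permissive pattern accepts "van Rossum", which the capitalised pattern rejects; it still rejects "R2-D2" because digits are in none of the four classes.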
But before you start constructing your regex for "validating" a name, please read this excellent piece on names by W3C. It will shake even your concepts of first, middle and last names.
For example:
In some cultures you are given a name (Björk, Osama) and an indication of who your father (or mother) was (Guðmundsdóttir, bin Mohammed). So the "first name" could be "Björk" but:
Björk wouldn’t normally expect to be called Ms. Guðmundsdóttir. Telephone directories in Iceland are sorted by given name.
But in other cultures, the first name is not given, but a family name. In "Zhāng Mànyù", "Zhāng" is the family name. And how to address her, would depend how well you know her, but again "Ms. Zhāng" would be strange.
The list of examples goes on and ends with some 30+ links to Wikipedia for more examples.
The article does end with suggestions for field design and some pointers on what characters to allow:
Don't forget to allow people to use punctuation such as hyphens, apostrophes, etc. in names. Don't require names to be entered all in upper case – this can be difficult on a mobile device. Allow the user to enter a name with spaces, eg. to support prefixes and suffixes such as de in French, von in German, and Jnr/Jr in American names, and also because some people consider a space-separated sequence of characters to be a single name, eg. Rose Marie.

To answer your question about capital letters: in many areas of the world, names do not necessarily start with a capital letter. In Dutch for instance, you have surnames like "van der Vliet" where words like "van", "de", "den" and "der" are not capitalised. Additionally, you have special cases like "De fauw" and "Van pellicom" where an administrative error never got rectified, and the correct capitalisation is fairly illogical. Please do not make the mistake of rejecting such names.
I also know about town names in South Africa such as eThekwini, where the capital letter is not necessarily the first letter of the word. This could very well appear in surnames or given names as well.

Name fixing / validation?

Frequently, I have found, users enter very poorly formatted names when they register. I get all kinds of crazy formatting from Paypal IPN and other payment gateways, ranging from all lower case to all caps to just flat-out messed up.
One thing I do with this information is to send out emails and offer greetings, however I dislike the poorly formatted names. Has someone thought about this before and figured out a happy middle road solution? For example, I realize it would be poor form to simply correct spellings that are seemingly errors, but it would be wise to at least fix "what is reasonable." At the minimum that would be capitalization. Perhaps simply upcasing the first letters of each distinct "word" in the first and last name strings would be sufficient?
Or is there a better method? Perhaps a database of common name capitalizations for things like "McBerry" and "van Buuren"? A gem or some such tool? Just kind of curious. Perhaps it is foolish to put this much thought into this topic, but I really like to be as courteous and professional as possible in my communications with users, rather than just using a poorly formatted name as-is, which is the usual approach.

The best you can hope to do is capitalize the first letter of their first/last/middle name:
"bob".capitalize => "Bob"
From Ruby:
capitalize → new_str
Returns a copy of str with the first character converted to uppercase and the remainder to lowercase. Note: case conversion is effective only in ASCII region.
"hello".capitalize #=> "Hello"
"HELLO".capitalize #=> "Hello"
"123ABC".capitalize #=> "123abc"
You can also use downcase to level everything out, then capitalize to make it "right".
For instance:
fName = "jIMMY"
lName = "sMITH"
# downcase and capitalize return new strings, so the results must be assigned back
fName = fName.downcase.capitalize
lName = lName.downcase.capitalize
puts fName + " " + lName  # => Jimmy Smith
However, with names like VanBuuren, it might be a little harder.
Here is a link to Ruby strings which has some methods that might help you on your quest.
http://www.ruby-doc.org/core-2.0/String.html
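One cautious way to apply this, sketched below, is to re-capitalize only names that arrive entirely in lower case or entirely in upper case and to leave mixed-case input such as "McBerry" or "van Buuren" exactly as the user typed it. The method name tidy_name is hypothetical, and the rule itself is a design assumption rather than an established convention; an all-lower-case "van buuren" would still come out as "Van Buuren".

```ruby
# Hypothetical helper: fixes only names with no case information at all.
def tidy_name(raw)
  # Trim the ends and collapse runs of spaces
  name = raw.strip.squeeze(" ")

  # Mixed case is assumed to be deliberate and is returned unchanged
  return name unless name == name.downcase || name == name.upcase

  # Capitalize each run of letters, so hyphenated and apostrophe names work too
  name.downcase.gsub(/\p{L}+/) { |word| word.capitalize }
end

puts tidy_name("JOHN SMITH-JONES")  # => John Smith-Jones
puts tidy_name("o'neil")            # => O'Neil
puts tidy_name("McBerry")           # => McBerry
```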

What are recommended patterns to localize a dynamically built phrase?

Given a phrase that is dynamically constructed with portions present or removed based on parameters, what are some possible solutions for supporting localization? For example, consider the following two phrases, in which the appearance, accessory, and activity portions are inserted dynamically:
The dog is spotted, has a doghouse and is chasing a ball.
The dog is white, and is running in circles.
For English, this can be solved by simply concatenating the phrase portions or perhaps having a few token-filled strings in a resource file that can be selected based on parameters. But these solutions won't work or get ugly quickly once you need to localize for other languages or have more parameters. In the example above, assuming that the dog appearance is the only portion always present, a localized resource implementation might consist of the following resource strings:
AppearanceOnly: The dog is %appearance%.
ActivityOnly: The dog is %appearance% and is %activity%.
AccessoryOnly: The dog is %appearance% and has %accessory%.
AccessoryActivity: The dog is %appearance%, has %accessory% and is %activity%.
While this works, the required number of strings grows exponentially depending upon the number of parameters.
Been searching far and wide for best practices that might help me with this challenge. The only solution I have found is to simply reword the phrase—but you lose the natural sentence structure, which I really don't want to do:
Dog: spotted, doghouse, chasing ball
Suggestions, links, thoughts, examples, or "You're crazy, just reword it!" feedback is welcome :) Thanks!

The best approach is probably to divide the sentence into separate sentences, like “The dog is spotted. The dog has a doghouse. The dog is chasing a ball.” This may look boring, but if you replaced all occurrences of “the dog” except the first one with a pronoun, you would have a serious pronoun problem. In many languages, the pronoun to be used would depend on the noun it refers to. (Even in English, it is not quite clear whether a dog is he, she, or it.)
The reason for separation is that different languages have different verb systems. For example, in Russian, you cannot really combine the three sentences into one sentence that has three verbs sharing a subject. (In Russian, you don’t use the verb “to be” in present tense – instead, you would just say the equivalent of “Dog – spotted”, and there is no verb corresponding to “to have” – instead, you use the equivalent of “at dog doghouse”. Finnish is similar with respect to “to have”. Such issues are sometimes handled, in “forced” localizations, by using a word that corresponds to “to possess” or “to own”, but the result is odd-looking, to put it mildly.)
Moreover, languages have different natural orders for subject, verb, and object. Your initial approach implicitly postulates an SVO order. You should not assume that the normal, unmarked word order always starts with the subject. Instead of using sentence patterns like "%subject% %copula% %appearance%" (where %copula% is “is”, “are”, or “am” in English), you would need to call a function with two parameters, subject and appearance, returning a sentence that has a language-dependent copula, or no copula, and that has a word order determined by the rules of the language. Yes, it gets complicated; localization of generated statements gets rather complicated as soon as you deal with anything but structurally very similar languages.
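A minimal Ruby sketch of that "function per language" idea follows. Everything in it is an assumption for illustration: the table name, the describe method, and the premise that translators supply the subject and appearance already inflected for the target language. The Russian entry simply omits the copula, as described above.

```ruby
# One sentence builder per locale; each receives already-translated parts.
DESCRIBE_APPEARANCE = {
  # English: subject, copula, appearance
  en: ->(subject, appearance) { "#{subject} is #{appearance}." },
  # Russian: no present-tense copula, so the builder leaves it out
  ru: ->(subject, appearance) { "#{subject} #{appearance}." }
}.freeze

# Looks up the builder for the locale; raises KeyError for an unknown locale.
def describe(locale, subject, appearance)
  DESCRIBE_APPEARANCE.fetch(locale).call(subject, appearance)
end

puts describe(:en, "The dog", "white")
puts describe(:ru, "Собака", "белая")
```

Because each locale owns its whole sentence, word order and the presence of a copula stay inside the builder instead of leaking into a shared token pattern.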

Parsing a full name into its constituents

We are in need of developing a back end application that can parse a full name into
Prefix (Dr. Mr. Ms. etc)
First Name
Last Name
Middle Name
etc
Challenge here is that it has to support names of multiple countries and languages. One assumption that we have is we will always get a country and language along with the full name as input.
The full name may come in any format. For the same country / language combination, it may come in with first name last name or the reverse. Comma will not be a part of the Full Name.
Is it feasible? We are also open to any commercially available software.

I think this is impossible. Consider Ralph Vaughan Williams. His family name is "Vaughan Williams" and his first name is "Ralph". Contrast this with Charles Villiers Stanford, whose family name is "Stanford", with first name "Charles" and middle name "Villiers".
Both are English-speaking composers from England, so country and language information is not sufficient to establish the correct parsing logic.

Since the OP was open to any commercially available offering...
The "IBM InfoSphere Global Name Analytics" appears to be a commercial solution satisfying the original request for the parsing of a [free-form unstructured] personal name [full name], apparently with a degree of certainty in regard to resolving some of the name ambiguity issues alluded to in other responses. Note: I have no personal experience of nor association with the product; I merely encountered this discussion and the following reference links while re-investigating effectively the same concern as described by the OP. HTH.
A general product documentation link:
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gna_con_gnaoverview.html
Refer to the "Parsing names using NameParser" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_np_con_parsingnamesusingnameparser.html
The NameParser is a component API for the product per
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturecapis.html
Refer to the "Parsing names using IBM NameWorks" at
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_parsingnamesusingnameworks.html
"IBM NameWorks combines the individual IBM InfoSphere Global Name Recognition components into a single, unified, easy-to-use application programming interface (API), and also extends this functionality to Java applications and as a Web service"
http://publib.boulder.ibm.com/infocenter/gnrgna/v4r1m0/topic/com.ibm.gnr.gna.ic.doc/topics/gnr_gnm_con_logicalarchitecturenwapis.html
To clarify why I think this answers the question, ameliorating some of the previously mentioned difficulties in accomplishing the task: if I understood correctly what I read, the APIs use the "NameHunter Server" to search the "IBM InfoSphere Global Name Data Archive (NDA)", which is described as "a collection of nearly one billion names from around the world, along with gender and country of association for each name. This large repository of name information powers the algorithms and rules that IBM InfoSphere Global Name Recognition products use to categorize, classify, parse, genderize, and match names."
FWiW I also ran across a "Name Parser" which uses a database of ~140K names as noted at:
http://www.melissadata.com/dqt/websmart-web-services.htm

The only reasonable approach is to avoid having to do so in the first place. The most obvious (and common) way to do that is to have the user enter the title, first/given name, last/family name, suffix, etc., separately from each other, rather than attempting to parse them out of a single string.

Ask yourself: do you really need the different parts of a name? Parsing names is inherently un-doable, since different cultures use different conventions (e.g. "middle name" is a typical USA-ism) and some small percentage of names will always be treated wrongly.
It is much preferable to treat a name as an "atomic" not-splittable entity.

Here are two free PHP name parsing libraries for those on a budget:
https://code.google.com/p/php-name-parser/
http://jasonpriem.org/human-name-parse/
And here is a JavaScript library in the Node package manager:
https://npmjs.org/package/name-parser

I wrote a simple human name parser in javascript as an npm module:
https://www.npmjs.org/package/humanparser
humanparser
Parse a human name string into salutation, first name, middle name, last name, suffix.
Install
npm install humanparser
Usage
var human = require('humanparser');
var fullName = 'Mr. William R. Jenkins, III'
, attrs = human.parseName(fullName);
console.log(attrs);
//produces the following output
{ saluation: 'Mr.',
firstName: 'William',
suffix: 'III',
lastName: 'Jenkins',
middleName: 'R.',
fullName: 'Mr. William R. Jenkins, III' }

A basic algorithm could do the following:
First see if incoming string starts with a title such as Mrs and remove it if it does, checking against a fixed list of titles.
If there is one space left and one space exactly, assume first word is first name and second word is surname (which will be incorrect at times)
To go beyond that would be a lot of work. See "How to parse full names" to identify avenues for improvement, and see these involved IBM docs for further implementation clues.
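The two steps above can be sketched in Ruby as follows. The title list is illustrative rather than exhaustive, and the method name and result keys are made up for the example; anything that is not exactly two words after the title is returned unparsed instead of being guessed at.

```ruby
# Illustrative fixed list of titles, compared without case or trailing dot.
TITLES = %w[mr mrs ms miss dr prof].freeze

def parse_simple(full_name)
  words = full_name.strip.split(/\s+/)

  # Step 1: remove a leading title if it is in the fixed list
  title = nil
  if words.any? && TITLES.include?(words.first.downcase.delete("."))
    title = words.shift
  end

  # Step 2: exactly two words left means first name then surname
  # (which will be wrong at times, as noted above)
  if words.length == 2
    { title: title, first: words[0], last: words[1] }
  else
    { title: title, unparsed: words.join(" ") }
  end
end

p parse_simple("Mrs Jane Doe")
p parse_simple("Ralph Vaughan Williams")
```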

"Ashton Jordan" "Jordan Ashton" -- you can't tell which is the surname and which is the given name.
Also people in South India apparently don't have a surname. The same with Sherpas in the Himalayas.
But say you have a huge list of all surnames (which are never used as given names) then maybe you can use that to identify other parts of the name (Salutations/Given/Middle/Jr/Sr/I/II/...) And if there is ambiguity your name-parser could ask for human input.

As others have explained, the problem is not solvable. The best approach I can think of to storing names is storing the full name, followed by the start (and potentially also ending) offsets into a "primary collating subfield" which the person entering the name could have indicated by highlighting it or such. For example
John Robert Miller, Jr.
where the boldface is indicating what was marked as the "primary collating subfield". This range would then be moved to the beginning of the string when generating the collating key.
Of course this approach alone may not be sufficient if you also want to support titles (and ignoring them for collation purposes)...
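A minimal Ruby sketch of that storage scheme is below. The struct and its field names are hypothetical, and it assumes for the sake of the example that "Miller" is the range the user marked in the name above.

```ruby
# Full name plus the offset and length of the marked collating subfield.
StoredName = Struct.new(:full, :start, :length) do
  # Moves the marked range to the front of the string to form the sort key.
  def collating_key
    primary = full[start, length]
    before = full[0...start]
    after = full[(start + length)..-1]
    "#{primary} #{before}#{after}".strip
  end
end

names = [
  StoredName.new("John Robert Miller, Jr.", 12, 6),  # "Miller" marked (assumed)
  StoredName.new("Alice Adams", 6, 5)                # "Adams" marked
]

names.sort_by(&:collating_key).each { |n| puts n.full }
```

The full name is never modified; only the derived key changes, so display and collation stay independent.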
