AIML Parser PHP - parsing

I am trying to develop Artificial Bot i found AIML is something that can be used for achieving such goal i found these points regarding AIML parsing which is done by Program-O
1.) All letters in the input are converted to UPPERCASE
2.) All punctuation is stripped out and replaced with spaces
3.) extra whitespace chatacters, including tabs, are removed
From there, Program O performs a search in the database, looking for all potential matches to the input, including wildcards. The returned results are then “scored” for relevancy and the “best match” is selected. Program O then processes the AIML from the selected result, and returns the finished product to the user.
I am just wondering how to define score and find relevant answer closest to user input
Any help or ideas will be appreciated

#user3589042 (rather cumbersome name, don't you think?)
I'm Dave Morton, lead developer for Program O. I'm sorry I missed this at the time you asked the question. It only came to my attention today.
The way that Program O scores the potential matches pulled from the database is this:
Is the response from the aiml_userdefined table? yes=300/no=0
Is the category for this bot, or it's parent (if it has one)? this=250/parent=0
Does the pattern have one or more underscore (_) wildcards? yes=100/no=0
Does the current category have a <topic> tag? yes(see below)/no=0
a. Does the <topic> contain one or more underscore (_) wildcards? yes=80/no=0
b. Does the <topic> directly match the current topic? yes=50/no=0
c. Does the <topic> contain a star (*) wildcard? yes=10/no=0
Does the current category contain a <that> tag? yes(see below)/no=0
a. Does the <that> contain one or more underscore (_) wildcards? yes=45/no=0
b. Does the <that> directly match the current topic? yes=15/no=0
c. Does the <that> contain a star (*) wildcard? yes=2/no=0
Is the <pattern> a direct match to the user's input? yes=10/no=0
Does the <pattern> contain one or more star (*) wildcards? yes=1/no=0
Does the <pattern> match the default AIML pattern from the config? yes=5/no=0
The script then adds up all passed tests listed above, and also adds a point for each word in the category's <pattern> that also matches a word in the user's input. The AIML category with the highest score is considered to be the "best match". In the event of a tie, the script will then select either the "first" highest scoring category, the "last" one, or one at random, depending on the configuration settings. this selected category is then returned to other functions for parsing of the XML.
I hope this answers your question.

Related

Machine learning: which algorithm fits for questions answering

I want to build a ml program who talks to user and get some inputs from the user.
The ml program analyze the input data(keywords) then predict the best solution.
So, you are looking at an AI application which needs some sort of machine intelligence for processing natural language.
Let us say the language of choice here is English. There are many things to be considered before building such a system.
Dependency parsing
Word Sense Disambiguation
Verb Sense Disambiguation
Coreference Resolution
Semantic Role Labelling
Universe of knowledge.
In brief you need to build all the above essential modules before you can generate your response.
You need to decide what kind of problem you are working on? Is it an open domain or closed domain problem, meaning what is the scope of knowledge of this application.
For example: Google now is an open domain problem which can practically take any possible input.
But some applications pertain to a particular task like automating food orders in an app etc where the scope of questions which can be asked is limited.
Once that is decided, you need to parse your input sentence and dependency parsing is the way to go. You can use Stanford core NLP suite to achieve most of the NLP tasks which were mentioned above.
Once the input sentence is parsed and you have the subjects, objects, etc it is time to disambiguate the words in the sentence as a particular word can have different meanings.
Then disambiguate the verb meaning identifying the type of verb (like return could mean going back to a place or giving back something )
Then you need to resolve coreference resolution meaning mapping the nouns and pronouns and other entities in a given context. For example:
My name is John. I work at ABC company.
Here I in the second sentence refers to John.
This helps us in answering questions like where does John work. Since John was only used in the first sentence and his work was mentioned in the second sentence coreference resolution helps us map them together.
The next task at hand is semantic role labelling, which basically means labelling all the arguments in a sentence with respect to each of its verb.
For example: John killed Mary.
Here the verb is kill, John and Mary are the arguments of the verb kill. John takes the role A0 and Mary the role A1. Where the definitions of these roles for each verb are mentioned in a huge frame and argument annotation framework created by the NLP community. Here A0 means the person who killed, A1 means the person who was killed.
Now once you have identified A0 and A1 just look into the definition of the kill frame and return A0 for killer and A1 for the victim.
Another important task at hand is to identify when your system must respond with an answer. For which you need to know if the given sentence is a declarative or assertive sentence or an interrogative sentence. You can just check that by seeing if the input sentence ends with a question mark.
Now to answer your question:
Let us say your input to the application is:
Input 1: John killed Mary.
Clearly this is an assertive sentence so just store it and process it as mentioned above.
Now the next input is:
Input 2: Who killed Mary?
This is an interrogative sentence so you need to come up with a reply or a response.
Now find the semantic role labels of input 1 and input 2 and return the word of input 1 which matches the argument of Who in sentence 2.
Here in this case who would be labeled as A0 and John would be labeled as A0, simply return John.
Most of the NLP modules mentioned can directly be implemented using Stanford core NLP however if you want to implement some algorithms on your own you can go through the recent publications in EMNLP, NIPS, ICML, CONLL etc to understand them better and implement the one which best suits you.
Good luck !

Extracting from .bib file with Raku (previously aka Perl 6)

I have this .bib file for reference management while writing my thesis in LaTeX:
#article{garg2017patch,
title={Patch testing in patients with suspected cosmetic dermatitis: A retrospective study},
author={Garg, Taru and Agarwal, Soumya and Chander, Ram and Singh, Aashim and Yadav, Pravesh},
journal={Journal of Cosmetic Dermatology},
year={2017},
publisher={Wiley Online Library}
}
#article{hauso2008neuroendocrine,
title={Neuroendocrine tumor epidemiology},
author={Hauso, Oyvind and Gustafsson, Bjorn I and Kidd, Mark and Waldum, Helge L and Drozdov, Ignat and Chan, Anthony KC and Modlin, Irvin M},
journal={Cancer},
volume={113},
number={10},
pages={2655--2664},
year={2008},
publisher={Wiley Online Library}
}
#article{siperstein1997laparoscopic,
title={Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases},
author={Siperstein, Allan E and Rogers, Stanley J and Hansen, Paul D and Gitomirsky, Alexis},
journal={Surgery},
volume={122},
number={6},
pages={1147--1155},
year={1997},
publisher={Elsevier}
}
If anyone wants to know what bib file is, you can find it detailed here.
I'd like to parse this with Perl 6 to extract the key along with the title like this:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Can you please help me to do this, maybe in two ways:
Using basic Perl 6
Using a Perl 6 Grammar
TL;DR
A complete and detailed answer that does just exactly as #Suman asks.
An introductory general answer to "I want to parse X. Can anyone help?"
A one-liner in a shell
I'll start with terse code that's perfect for some scenarios[1], and which someone might write if they're familiar with shell and Raku basics and in a hurry:
> raku -e 'for slurp() ~~ m:g / "#article\{" (<-[,]>+) \, \s+
"title=\{" (<-[}]>+) \} / -> $/ { put "$0: $1\n" }' < derm.bib
This produces precisely the output you specified:
garg2017patch: Patch testing in patients with suspected cosmetic dermatitis: A retrospective study
hauso2008neuroendocrine: Neuroendocrine tumor epidemiology
siperstein1997laparoscopic: Laparoscopic thermal ablation of hepatic neuroendocrine tumor metastases
Same single statement, but in a script
Skipping shell escapes and adding:
Whitespace.
Comments.
► use tio.run to run the code below
for slurp() # "slurp" (read all of) stdin and then
~~ m :global # match it "globally" (all matches) against
/ '#article{' (<-[,]>+) ',' \s+ # a "nextgen regex" that uses (`(...)`) to
'title={' (<-[}]>+) '}' / # capture the article id and title and then
-> $/ { put "$0: $1\n" } # for each article, print "article id: title".
Don't worry if the above still seems like pure gobbledygook. Later sections explain the above while also introducing code that's more general, clean, and readable.[2]
Four statements instead of one
my \input = slurp;
my \pattern = rule { '#article{' ( <-[,]>+ ) ','
'title={' ( <-[}]>+ ) }
my \articles = input .match: pattern, :global;
for articles -> $/ { put "$0: $1\n" }
my declares a lexical variable. Raku supports sigils at the start of variable names. But it also allows devs to "slash them out" as I have done.
my \pattern ...
my \pattern = rule { '#article{' ( <-[,]>+ ) ','
'title={' ( <-[}]>+ ) }
I've switched the pattern syntax from / ... / in the original one-liner to rule { ... }. I did this to:
Eliminate the risk of pathological backtracking
Classic regexes risk pathological backtracking. That's fine if you can just kill a program that's gone wild, but click the link to read how bad it can get! 🤪 We don't need backtracking to match the .bib format.
Communicate that the pattern is a rule
If you write a good deal of pattern matching code, you'll frequently want to use rule { ... }. A rule eliminates any risk of the classic regex problem just described (pathological backtracking), and has another superpower. I'll cover both aspects below, after first introducing the adverbs corresponding to those superpowers.
Raku regexes/rules can be (often are) used with "adverbs". These are convenient shortcuts that modify how patterns are applied.
I've already used an adverb in the earlier versions of this code. The "global" adverb (specified using :global or its shorthand alias :g) directs the matching engine to consume all of the input, generating a list of as many matches as it contains, instead of returning just the first match.
While there are shorthand aliases for adverbs, some are used so repeatedly that it's a lot tidier to bundle them up into distinct rule declarators. That's why I've used rule. It bundles up two adverbs appropriate for matching many data formats like .bib files:
:ratchet (alias :r)
:sigspace (alias :s)
Ratcheting (:r / :ratchet) tells the compiler that when an "atom" (a sub-pattern in a rule that is treated as one unit) has matched, there can be no going back on that. If an atom further on in the pattern in the same rule fails, then the whole rule immediately fails.
This eliminates any risk of the "pathological backtracking" discussed earlier.
Significant space handling (:s / :sigspace) tells the compiler that an atom followed by literal spacing that is in the pattern indicates that a "token" boundary pattern, aka ws should be appended to the atom.
Thus this adverb deals with tokenizing. Did you spot that I'd dropped the \s+ from the pattern compared to the original one in the one-liner? That's because :sigspace, which use of rule implies, takes care of that automatically:
say 'x#y x # y' ~~ m:g:s /x\#y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g /x \# y/; # (「x#y」) <-- only one match
say 'x#y x # y' ~~ m:g:s /x \# y/; # (「x#y」 「x # y」) <-- two matches
You might wonder why I've reverted to using / ... / to show these two examples. Turns out that while you can use rule { ... } with the .match method (described in the next section), you can't use rule with m. No problem; I just used :s instead to get the desired effect. (I didn't bother to use :r for ratcheting because it makes no difference for this pattern/input.)
To round out this dive into the difference between classic regexes (which can also be written regex { ... }) and rule rules, let me mention the other main option: token. The token declarator implies the :ratchet adverb, but not the :sigspace one. So it also eliminates the pathological backtracking risk of a regex (or / ... /) but, just like a regex, and unlike a rule, a token ignores whitespace used by a dev in writing out the rule's pattern.
my \articles = input .match: pattern, :global
This line uses the method form (.match) of the m routine used in the one-liner solution.
The result of a match when :global is used is a list of Match objects rather than just one. In this case we'll get three, corresponding to the three articles in the input file.
for articles -> $/ { put "$0: $1\n" }
This for statement successively binds a Match object corresponding to each of the three articles in your sample file to the symbol $/ inside the code block ({ ... }).
Per Raku doc on $/, "$/ is the match variable, so it usually contains objects of type Match.". It also provides some other conveniences; we take advantage of one of these conveniences related to numbered captures:
The pattern that was matched earlier contained two pairs of parentheses;
The overall Match object ($/) provides access to these two Positional captures via Positional subscripting (postfix []), so within the for's block, $/[0] and $/[1] provide access to the two Positional captures for each article;
Raku aliases $0 to $/[0] (and so on) for convenience, so most devs use the shorter syntax.
Interlude
This would be a good time to take a break. Maybe just a cuppa, or return here another day.
The last part of this answer builds up and thoroughly explains a grammar-based approach. Reading it may provide further insight into the solutions above and will show how to extend Raku's parsing to more complex scenarios.
But first...
A "boring" practical approach
I want to parse this with Raku. Can anyone help?
Raku may make writing parsers less tedious than with other tools. But less tedious is still tedious. And Raku parsing is currently slow.
In most cases, the practical answer when you want to parse well known formats and/or really big files is to find and use an existing parser. This might mean not using Raku at all, or using an existing Raku module, or using an existing non-Raku parser in Raku.
A suggested starting point is to search for the file format on modules.raku.org or raku.land. Look for a publicly shared parsing module already specifically packaged for Raku for the given file format. Then do some simple testing to see if you have a good solution.
At the time of writing there are no matches for 'bib'.
Even if you don't know C, there's almost certainly a 'bib' parsing C library already available that you can use. And it's likely to be the fastest solution. It's typically surprisingly easy to use an external library in your own Raku code, even if it's written in another programming language.
Using C libs is done using a feature called NativeCall. The doc I just linked may well be too much or too little, but please feel free to visit the freenode IRC channel #raku and ask for help. (Or post an SO question.) We're friendly folk. :)
If a C lib isn't right for a particular use case, then you can probably still use packages written in some other language such as Perl, Python, Ruby, Lua, etc. via their respective Inline::* language adapters.
The steps are:
Install a package (that's written in Perl, Python or whatever);
Make sure it runs on your system using a compiler of the language it's written for;
Install the appropriate Inline language adapter that lets Raku run packages in that other language;
Use the "foreign" package as if it were a Raku package containing exported Raku functions, classes, objects, values, etc.
(At least, that's the theory. Again, if you need help, please pop on the IRC channel or post an SO question.)
The Perl adapter is the most mature so I'll use that as an example. Let's say you use Perl's Text::BibTex packages and now wish to use Raku with that package. First, setup it up as it's supposed to be per its README. Then, in Raku, write something like:
use Text::BibTeX::BibFormat:from<Perl5>;
...
#blocks = $entry.format;
Explanation of these two lines:
The first line is how you tell Raku that you wish to load a Perl module.
(It won't work unless Inline::Perl5 is already installed and working. But it should be if you're using a popular Raku bundle. And if not, you should at least have the module installer zef so you can run zef install Inline::Perl5.)
The last line is just a mechanical Raku translation of the #blocks = $entry->format; line from the SYNOPSIS of the Perl package Text::BibTeX::BibFormat.
A Raku grammar / parser
OK. Enough "boring" practical advice. Let's now try have some fun creating a grammar based Raku parser good enough for the example from your question.
► use glot.io to run the code below
unit grammar bib;
rule TOP { <article>* }
rule article { '#article{' $<id>=<-[,]>+ ','
<kv-pairs>
'}'
}
rule kv-pairs { <kv-pair>* % ',' }
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
With this grammar in place, we can now write something like:
die "Use CommaIDE?" unless bib .parsefile: 'derm.bib';
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }
to generate exactly the same output as the previous solutions.
When a match or parse fails, by default Raku just returns Nil, which is, well, rather terse feedback.
There are several nice debugging options to figure out what's going on with a regex or grammar, but the best option by far is to use CommaIDE's Grammar-Live-View.
If you haven't already installed and used Comma, you're missing one of the best parts of using Raku. The features built in to the free version of Comma ("Community Edition") include outstanding grammar development / tracing / debugging tools.
Explanation of the 'bib' grammar
unit grammar bib;
The unit declarator is used at the start of a source file to tell Raku that the rest of the file declares a named package of code of a particular type.
The grammar keyword specifies a grammar. A grammar is like a class, but contains named "rules" -- not just named methods, but also named regexs, tokens, and rules. A grammar also inherits a bunch of general purpose rules from a base grammar.
rule TOP {
Unless you specify otherwise, parsing methods (.parse and .parsefile) that are called on a grammar start by calling the grammar's rule named TOP (declared with a rule, token, regex, or method declarator).
As a, er, rule of thumb, if you don't know if you should be using a rule, regex, token, or method for some bit of parsing, use a token. (Unlike regex patterns, tokens don't risk pathological backtracking.)
But I've used a rule. Like token patterns, rules also avoid the pathological backtracking risk. But, in addition rules interpret some whitespace in the pattern to be significant, in a natural manner. (See this SO answer for precise details.)
rules are typically appropriate towards the top of the parse tree. (Tokens are typically appropriate towards the leaves.)
rule TOP { <article>* }
The space at the end of the rule (between the * and pattern closing }) means the grammar will match any amount of whitespace at the end of the input.
<article> invokes another named rule in this grammar.
Because it looks like one should allow for any number of articles per bib file, I added a * (zero or more quantifier) at the end of <article>*.
rule article { '#article{' $<id>=<-[,]>+ ','
<kv-pairs>
'}'
}
If you compare this article pattern with the ones I wrote for the earlier Raku rules based solutions, you'll see various changes:
Rule in original one-liner
Rule in this grammar
Kept pattern as simple as possible.
Introduced <kv-pairs> and closing }
No attempt to echo layout of your input.
Visually echoes your input.
<[...]> is the Raku syntax for a character class, like[...] in traditional regex syntax. It's more powerful, but for now all you need to know is that the - in <-[,]> indicates negation, i.e. the same as the ^ in the [^,] syntax of ye olde regex. So <-[,]>+ attempts a match of one or more characters, none of which are ,.
$<id>=<-[,]>+ tells Raku to attempt to match the quantified "atom" on the right of the = (i.e. the <-[,]>+ bit) and store the results at the key <id> within the current match object. The latter will be hung from a branch of the parse tree; we'll get to precisely where later.
rule kv-pairs { <kv-pair>* % ',' }
This pattern illustrates one of several convenient Raku regex features. It declares you want to match zero or more kv-pairs separated by commas.
(In more detail, the % regex infix operator requires that matches of the quantified atom on its left are separated by the atom on its right.)
rule kv-pair { $<key>=\w* '={' ~ '}' $<value>=<-[}]>* }
The new bit here is '={' ~ '}'. This is another convenient regex feature. The regex Tilde operator parses a delimited structure (in this case one with a ={ opener and } closer) with the bit between the delimiters matching the quantified regex atom on the right of the closer. This confers several benefits but the main one is that error messages can be clearer.
I could have used the ~ approach in the /.../ regex in the one-liner, and vice-versa. But I wanted this grammar solution to continue the progression toward illustrating "better practice" idioms.
Constructing / deconstructing the parse tree
for $<article> { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }`
$<article>, $<id> etc. refer to named match objects that are stored somewhere in the "parse tree". But how did they get there? And exactly where is "there"?
Returning to the top of the grammar:
rule TOP {
If a .parse is successful, a single 'TOP' level match object is returned. (After a parse is complete the variable $/ is also bound to that top match object.) During parsing a tree will have been formed by hanging other match objects off this top match object, and then others hung off those, and so on.
Addition of match objects to a parse tree is done by adding either a single generated match object, or a list of them, to either a Positional (numbered) or Associative (named) capture of a "parent" match object. This process is explained below.
rule TOP { <article>* }
<article> invokes a match of the rule named article. An invocation of the rule <article> has two effects:
Raku tries to match the rule.
If it matches, Raku captures that match by generating a corresponding match object and adding it to the parse tree under the key <article> of the parent match object. (In this case the parent is the top match object.)
If the successfully matched pattern had been specified as just <article>, rather than as <article>*, then only one match would have been attempted, and only one value, a single match object, would have been generated and added under the key <article>.
But the pattern was <article>*, not merely <article>. So Raku attempts to match the article rule as many times as it can. If it matches at all, then a list of one or more match objects is stored as the value of the <article> key. (See my answer to "How do I access the captures within a match?" for a more detailed explanation.)
$<article> is short for $/<article>. It refers to the value stored under the <article> key of the current match object (which is stored in $/). In this case that value is a list of 3 match objects corresponding to the 3 articles in the input.
rule article { '#article{' $<id>=<-[,]>+ ','
Just as the top match object has several match objects hung off of it (the three captures of article matches that are stored under the top match object's <article> key), so too do each of those three article match objects have their own "child" match objects hanging off of them.
To see how that works, let's consider just the first of the three article match objects, the one corresponding to the text that starts "#article{garg2017patch,...". The article rule matches this article. As it's doing that matching, the $<id>=<-[,]>+ part tells Raku to store the match object corresponding to the id part of the article ("garg2017patch") under that article match object's <id> key.
Hopefully this is enough (quite possibly way too much!) and I can at last exhaustively (exhaustingly?) explain the last line of code, which, once again, was:
for $<article> -> $/ { put "$<id>: $<kv-pairs><kv-pair>[0]<value>\n" }`
At the level of the for, the variable $/ refers to the top of the parse tree generated by the parse that just completed. Thus $<article>, which is shorthand for $/<article>, refers to the list of three article match objects.
The for then iterates over that list, binding $/ within the lexical scope of the -> $/ { ... } block to each of those 3 article match objects in turn.
The $<id> bit is shorthand for $/<id>, which inside the block refers to the <id> key within the article match object that $/ has been bound to. In other words, $<id> inside the block is equivalent to $<article><id> outside the block.
The $<kv-pairs><kv-pair>[0]<value> follows the same scheme, albeit with more levels and a positional child (the [0]) in the midst of all the key (named/ associative) children.
(Note that there was no need for the article pattern to include a $<kv-pairs>=<kv-pairs> because Raku just presumes a pattern of the form <foo> should store its results under the key <foo>. If you wish to disable that, write a pattern with a non-alpha character as the first symbol. For example, use <.foo> if you want to have exactly the same matching effect as <foo> but just not store the matched input in the parse tree.)
Phew!
When the automatically generated parse tree isn't what you want
As if all the above were not enough, I need to mention one more thing.
The parse tree strongly reflects the tree structure of the grammar's rules calling one another from the top rule down to leaf rules. But the resulting structure is sometimes inconvenient.
Often one still wants a tree, but a simpler one, or perhaps some non-tree data structure.
The primary mechanism for generating exactly what you want from a parse, when the automatic results aren't suitable, is make. (This can be used in code blocks inside rules or factored out into Action classes that are separate from grammars.)
In turn, the primary use case for make is to generate a sparse tree of nodes hanging off the parse tree, such as an AST.
Footnotes
[1] Basic Raku is good for exploratory programming, spikes, one-offs, PoCs and other scenarios where the emphasis is on quickly producing working code that can be refactored later if need be.
[2] Raku's regexes/rules scale up to arbitrary parsing, as introduced in the latter half of this answer. This contrasts with past generations of regex which could not.[3]
[3] That said, ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ remains a great and relevant read. Not because Raku rules can't parse (X)HTML. In principle they can. But for a task as monumental as correctly handling full arbitrary in-the-wild XHTML I would strongly recommend you use an existing parser written expressly for that purpose. And this applies generally for existing formats; it's best not to reinvent the wheel. But the good news with Raku rules is that if you need to write a full parser, not just a bunch of regexes, you can do so, and it need not involve going insane!

What is parsing? (And differences from search and grep?

What exactly is parsing? I mean, generally. How different is parsing different from searching? On command line, if I use the grep tool/command; is that parsing?
For example, if I have just one string:
"Hello world! How are you doing today?"
and I tried to search (using grep or any other tool) whether the word "you" is within that string; is that parsing?
What if I do a web search; for example in Google? Is that parsing?
Or is parsing the name of the process that is a part of the process known as "Search"?
The verb "parse" is essentially related to the word "part", as in "part of speech". (See, for example, the on-line etymology dictionary.)
To "parse" a sentence has traditionally meant to break the sentence down into its component parts and identify their relationship with each other. For example, given "I asked a question.", we can parse it into a subject ("I"), a transitive verb in past tense ("asked"), and an object phrase consisting of an article ("a") and a noun ("question"). The parse indicates that the subject performed some action on the object; this is not the same statement as *"A question asked I", and not just because the latter is ungrammatical.
With the advent of computer languages and computational theory, the term "parsing" has been generalized to include analysis of strings which are not human languages. Some people would even use it to simply mean "to divide a string into its component parts", such as "parsing" a line in a CSV file into fields.
It's quite a stretch to apply that to merely searching for a string inside another string, although there may be contexts in which that is an acceptable use of the word. Personally, I would only use it for the action of completely deconstructing a structured string.

opennlp does not recognized twiiter input

I have a file that contain twitter post and I am trying to identify the structure of the twitter post per line, like get the noun ,verb and stuff, using opennlp.
it work perfectly until it reach line that contain hashtag and link only
example :
#birthday www.mybirthday/test/mypi.com
and give error com.cybozu.labs.langdetect.LangDetectException: no features in text
when I write a sentence next to the line it just work. any idea how to handle it?? there are more then thousand line that almost like the example.
To use the POS tagger, you need to pass tokens, (in laymen terms say individual words). The link contains multiple words separated by a slash /. The link in itself is not associated with any Part Of Speech. See here the list of tags and how they are assigned to a word. If you want it to identify your link, and give a separate tag to it, say LN either give your own training data, here you will know how to create the training dataor separate the words in the link as separate token (you can separate a link by slash/, question mark?, equal to sign = or ampersand (&)) to get the underlying words and then use the POSTagger to get Part Of Speech (similar case for the hash tag.) For tokenization also, you can use opennlp tokenizer and for your special case, train it. Go through the documentation, it will help you a lot.

How- XLST Transformation

Just wanted to ask on how to get the author names in the given xml sample below and put an attribut of eq="yes". EQ means Equal Contributors.
This is the XML.
<ArticleFootnote Type="Misc">
<Para>John Doe and Jane Doe are equal contributors.</Para>
</ArticleFootnote>
This should be the output in other form of XML.
<AuthorGroups>
<Authors eq="yes">John Doe</Authors>
<Authors eq="yes">Jane Doe</Authors>
</AuthorGroups>
Assuming that JOhn Doe and Jane Doe are already defined in the list of authors but after the transformation, author tag should have the attribute eq="yes". Please help as I don't know much writing in xlst.
Thanks in advance.
There's not really enough information here to give you a clear answer.
If you have a list of authors, you could use fn:match() on each author's name in turn, maybe after changing space to \s+ in the pattern.
I normally use Perl to do this sort of thing, though, being careful not to disrupt the tagging structure.
In any case you'll either need to process the text a word at a time, probably recursively to find the longest match in cases where one name is just "John" and another is "John Doe". Watch that you don't add markup to names you've already processed.
In the case that the text really always says exactly what you have there, but with different names, though, you could have a template to match ArticleFootnote/Para[contains(., 'are equal contrubutors')] and either use substring() and substring-before() or the XSLT 2 pattern matching.

Resources