How do I deal with duplicate lemmas in spaCy?

I'm parsing the following two sentences using the en_core_web_sm model, version 2.0.0: "Rabbits are mammals" and "Rabbits have hair". In the first sentence I get a token with lemma 'rabbit' (id 10130653840019909946), while in the second I get 'rabbits' (id 4224103442939446549). This is surprising. Is it a bug in the model, or am I misunderstanding something? I've tried the same with en_core_web_md, but got exactly the same results.
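For reference, a minimal way to reproduce the comparison with the spaCy 2.x API is shown below; printing token.tag_ alongside the lemma is worth doing, because the English lemmatizer is POS-sensitive, so a different tag on "Rabbits" in the two sentences could account for the different lemmas:

import spacy

nlp = spacy.load("en_core_web_sm")  # the question reports the same with en_core_web_md

for text in ("Rabbits are mammals", "Rabbits have hair"):
    doc = nlp(text)
    token = doc[0]  # "Rabbits" in both sentences
    # lemma_ is the lemma string; lemma is its hash id in the StringStore
    print(token.text, token.tag_, token.lemma_, token.lemma)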


How to predict a textual field on the basis of input features

I'm stuck on a problem of predicting an identifier for a product on the basis of a couple of product features. A sample of the data available to me looks like this:
ABC10L 20.0 34 XYZ G345F FG MKD -> 000000DEF_VYA
Here, ABC10L, 20.0, 34, XYZ, G345F, FG, MKD are the features and 000000DEF_VYA is the unique identifier associated with the product. Initially I tried to formulate this as a regression problem, but I'm not sure how to generate textual output from my model or what my cost function should be. I'm also not sure whether regression is the right tool for the job.
Please suggest the right approach and how I might proceed.
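If the set of identifiers is fixed and recurring, one common framing is multi-class classification rather than regression: each identifier becomes a class label, so the model never has to generate text. Below is a minimal sketch along those lines using scikit-learn; the column names, encoder, and model are illustrative assumptions, not something given in the post:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names; one row per product, identifier as the class label.
X = pd.DataFrame([["ABC10L", 20.0, 34, "XYZ", "G345F", "FG", "MKD"]],
                 columns=["f1", "f2", "f3", "f4", "f5", "f6", "f7"])
y = ["000000DEF_VYA"]

pre = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"),
      ["f1", "f4", "f5", "f6", "f7"])],
    remainder="passthrough")  # numeric features pass through unchanged

clf = Pipeline([("pre", pre), ("model", RandomForestClassifier())])
clf.fit(X, y)  # real data would have many rows per identifier
print(clf.predict(X))

With this framing the cost-function question goes away: the classifier optimizes its own criterion over the label set, and standard metrics such as accuracy or top-k accuracy can be used for evaluation.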

Dafny: corresponding SMT queries

I'm trying to inspect a simple looping program that finds the maximal element in an integer array (permalink here). Everything works fine, but I'm really interested in the resulting SMT file, so I extracted it using:
$ dafny /compile:3 /proverLog:./mySMT.smt myCode.dfy
Then I ran it through z3 as follows:
$ z3 ./mySMT.smt
I got 3 unsat responses, and I was wondering what the corresponding 3 queries are.
I looked at the *.smt file and found 11K of machine-generated SMT.
Any tips on deciphering the SMT file? Thanks!
If you want the resulting SMT file, then that 11K file is your answer. I imagine that looking at it will lead you to the conclusion that you don't actually want to look at the resulting SMT file.
So, I don't know what it is that you want to accomplish. If you want to learn more about your program, then the best way is to work (only) from the Dafny program text. For example, you can add more assert statements to, essentially, ask the verifier if the given condition is provable at the location of the statement.
If you're interested in how Dafny encodes its verification conditions (that is, if you yourself are a tool developer and want to learn how to generate good verification conditions), then I suggest you use the /print switch to generate the Boogie program that Dafny generates. With some understanding of the Boogie intermediate verification language, the Boogie code is readable. For a more tutorial account of how to encode a Dafny-like language into Boogie, I recommend:
"Specification and verification of object-oriented software",
K. Rustan M. Leino.
Lecture note, Marktoberdorf 2008.
Rustan
PS. Unless you insist on particular formatting, you can print your array elements without using a loop if you first convert the array's elements to a sequence:
print "a = ", a[..], "\n";

Stanford CoreNLP merge tokens

I found the powerful RegexNER and its superset TokensRegex in Stanford CoreNLP.
There are some rules that should give me good results, like the pattern for PERSONs with titles:
"g. Meho Mehic" or "gdin. N. Neko" ("g." and "gdin." are Bosnian abbreviations for "mr.").
I'm having some trouble with the existing tokenizer. It splits some strings into two tokens and leaves others as one; for example, the token "g." is kept as one word, <word>g.</word>, while "gdin." is split into two tokens, <word>gdin</word> and <word>.</word>.
That causes trouble with my regex: I have to deal with both one-token and multi-token cases (note the two "maybe-dots"). RegexNER example:
( /g\.?|gdin\.?/ /\./? ([{ word:/[A-Z][a-z]*\.?/ }]+) ) PERSON
This also causes another issue with sentence splitting: some sentences are not recognized correctly, so the regex fails. For example, when a sentence contains "gdin.", the splitter breaks it in two, with the dot ending a (non-existent) sentence. I managed to bypass this with ssplit.isOneSentence = true for now.
Questions:
Do I have to make my own tokenizer to merge tokens like "gdin.", and if so, how?
Are there any settings I missed that could help me with this?
OK, I thought about this for a bit and can actually think of something pretty straightforward for your case. One thing you could do is add "gdin" to the list of titles in the tokenizer.
The tokenizer rules are in edu.stanford.nlp.process.PTBLexer.flex (look at line 741)
I do not understand the tokenizer all that well, but clearly there is a list of job titles in there, so those must be cases where it will not split off the period.
This will of course require you to work with a custom build of Stanford CoreNLP.
You can get the full code from our GitHub: https://github.com/stanfordnlp/CoreNLP
There are instructions on the main page for building a jar with all of the main Stanford CoreNLP classes. I think if you just run the ant build, it will automatically regenerate PTBLexer.java from PTBLexer.flex.

Problem with PMML generation of Random Forest in R

I am trying to generate PMML from a random forest model I obtained using R. I am using the randomForest package, version 4.6-12, and the latest version of the pmml package for R, but every time I try to generate the PMML I get an error. Here is the code:
data_train.rf <- randomForest( TARGET ~ ., data = train, ntree=100, na.action=na.omit, importance=TRUE)
pmml_file = pmml(data_train.rf)
[1] "Now converting tree 1 to PMML"
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I haven't been able to find the origin of the problem. Any thoughts?
Thanks in advance,
Alvaro
It looks like the variable splitNode has not been initialized inside the pmml package. The initialization pathway depends on the data type of the split variable (e.g. numeric, logical, factor). Please see the source code of the /R/pmml.randomForest.R file inside the pmml package.
So, what are the columns in your train data.frame object?
Alternatively, you could try out the r2pmml package as it is much better at handling the randomForest model type.
The pmml code assumes the data type of the variables is numeric, simple logical, or factor. It won't work if the data you use are of some other type, DateTime for example.
It would help if your problem were reproducible; ideally you would provide the dataset you used, or failing that at least a sample or a summary of it.
You should also consider emailing the package maintainers directly.
I may have found the origin of this problem. My dataset has approximately 500,000 events and 30 variables; 10 of these variables are factors, and some of them have weakly populated levels, in some cases with as few as 1 event.
I built several random forest models, each time including an extra variable in the model. I started by adding the numerical variables, which generated a PMML without a problem; the same happened for the categorical variables whose levels were all well populated. When I tried to include categorical variables with weakly populated levels, I got the error:
Error in append.XMLNode(rfNode, splitNode) : object 'splitNode' not found
I suppose the origin of the problem is that, in some situations, when building a tree a weakly populated level yields no split because there is only one case; the randomForest package knows how to handle these cases, but the pmml package does not.
My tests show that this problem appears when the number of levels of a categorical variable goes beyond the maximum allowed by the randomForest function. The split defined in the forest sublist is then no longer a positive integer, which the split definition requires for categorical variables. Reducing the number of levels fixed the problem.

NLP - How would you parse a highly noisy sentence (with an Earley parser)?

I need to parse a sentence. I have an implemented Earley parser and a grammar for it, and everything works just fine when a sentence has no misspellings. The problem is that a lot of the sentences I have to deal with are highly noisy. I wonder if there's an algorithm which combines parsing with error correction? Possible errors are:
typos 'cheker' instead of 'checker'
typos like 'spellchecker' instead of 'spell checker'
contractions like 'Ear par' instead of 'Earley parser'
If you know of an article which can answer my question, I would appreciate a link to it.
I assume you are using a tagger (or lexer) stage that is applied before the Earley parser, i.e. an algorithm that splits the input string into tokens and looks each token up in a dictionary to determine its part-of-speech (POS) tag(s):
John --> PN
loves --> V
a --> DT
woman --> NN
named --> JJ,VPP
Mary --> PN
It should be possible to build some kind of approximate string lookup (aka fuzzy string lookup) into that stage, so that when it is presented with a misspelled token, such as 'lobes' instead of 'loves', it will not only identify the tags found by exact string matching ('lobes' as the plural of the noun 'lobe'), but also the tags of tokens that are similar in shape ('loves' as the third-person singular of the verb 'love').
This will imply that you generally get a larger number of candidate tags for each token, and therefore a larger number of possible parse results during parsing. Whether or not this will produce the desired result depends on how comprehensive the grammar is, and how good the parser is at identifying the correct analysis when presented with many possible parse trees. A probabilistic parser may be better for this, as it assigns every candidate parse tree a probability (or confidence score), which may be used to select the most likely (or best) analysis.
If this is the solution you'd like to try, there are several possible implementation strategies. Firstly, if the tokenization and tagging is performed as a simple dictionary lookup (i.e. in the style of a lexer), you may simply use a data structure for the dictionary that enables approximate string matching. General methods for approximate string comparison are described in Approximate string matching algorithms, while methods for approximate string lookup in larger dictionaries are discussed in Quickly compare a string against a Collection in Java.
If, however, you use an actual tagger, as opposed to a lexer, i.e. something that performs POS disambiguation in addition to mere dictionary lookup, you will have to build the approximate dictionary lookup into that tagger. There must be a dictionary lookup function, which is used to generate candidate tags before disambiguation is applied, somewhere in the tagger. That dictionary lookup will have to be replaced with one that enables approximate string lookup.
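As a rough sketch of that idea (not part of the original answer), a lexer-style lookup can first try an exact match and then fall back to approximate matching over the dictionary keys; here Python's difflib stands in for a real approximate-string index, and the toy lexicon and similarity cutoff are assumptions:

from difflib import get_close_matches

# Toy POS dictionary mirroring the example above; a real system would use a full lexicon.
LEXICON = {
    "John": ["PN"], "loves": ["V"], "a": ["DT"],
    "woman": ["NN"], "named": ["JJ", "VPP"], "Mary": ["PN"],
}

def candidate_tags(token, cutoff=0.7):
    # Exact lookup first; otherwise collect tags of similarly spelled entries.
    if token in LEXICON:
        return LEXICON[token]
    tags = []
    for match in get_close_matches(token, LEXICON.keys(), n=3, cutoff=cutoff):
        tags.extend(LEXICON[match])
    return tags

print(candidate_tags("lobes"))  # -> ['V'], via the near match 'loves'

For larger dictionaries, a structure built for approximate lookup (e.g. a BK-tree or a Levenshtein automaton) would replace the linear scan that get_close_matches performs.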
