How to set whitespace tokenizer on NER Model? - machine-learning

I am creating a custom NER model using CoreNLP 3.6.0.
My props are:
# location of the training file
trainFile = /home/damiano/stanford-ner.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
I build with this command:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop /home/damiano/stanford-ner.prop
The problem appears when I use this model to retrieve the entities inside a text file. The command is:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt
Where file.txt is:
Hello!
my
name
is
John.
The output is:
Hello/O !/O
my/O name/O is/O John/PERSON ./O
As you can see, it splits "Hello!" into two tokens, and the same happens with "John."
I must use a whitespace tokenizer.
How can I set it? And why does CoreNLP split those words into two tokens?

You set your own tokenizer by specifying the class name with the tokenizerFactory flag/property:
tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory
You can specify any class that implements the Tokenizer<T> interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For instance, if you also specify:
tokenizerOptions = tokenizeNLs=true
then the newlines in your input will be preserved (for output options that don't always convert things into a one-token-per-line format).
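For example, at test time you can pass the same properties on the command line (a sketch reusing the file names from the question; note that the $ in the factory class name must be quoted or escaped in the shell):
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -tokenizerFactory 'edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory' -tokenizerOptions tokenizeNLs=true -textFile file.txt
The same two property lines shown above can equally go into the .prop file used for training, so that training and testing use the same tokenization.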
Note: Options like tokenize.whitespace=true apply at the level of CoreNLP. They aren't interpreted (you get a warning saying that the option is ignored) if provided to individual components like CRFClassifier.
As Nikita Astrakhantsev notes, this isn't necessarily a good thing to do. Doing it at test time is only correct if your training data is also whitespace-separated; otherwise it will adversely affect performance. And tokens like the ones you get from whitespace separation are bad for subsequent NLP processing such as parsing.

Upd. If you want to use a whitespace tokenizer here, simply set the tokenizerFactory property as shown in Christopher Manning's answer above.
However, to answer your second question ('why does CoreNLP split those words into two tokens?'), I'd suggest keeping the default tokenizer (which is PTBTokenizer), because it simply gives better results. The usual reason to switch to whitespace tokenization is a high demand for processing speed or, usually in combination with it, a low demand for tokenization quality.
Since you are going to use it for further NER, I doubt this is your case.
Even in your example, if you end up with the token John. after tokenization, it cannot be matched by a gazette or by training examples.
More details and reasons why tokenization isn't that simple can be found here.

Related

Getting sentence embedding from huggingface Feature Extraction Pipeline

How do I get an embedding for the whole sentence from Hugging Face's feature extraction pipeline?
I understand how to get the features for each token (below), but how do I get the overall features for the sentence as a whole?
feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")
To explain more on the comment that I have put under stackoverflowuser2010's answer, I will use "barebone" models, but the behavior is the same with the pipeline component.
BERT and derived models (including DistilRoberta, which is the model you are using in the pipeline) generally indicate the start and end of a sentence with special tokens (mostly denoted as [CLS] for the first token) that are usually the easiest way of making predictions/generating embeddings over the entire sequence. There is a discussion within the community about which method is superior (see also a more detailed answer by stackoverflowuser2010 here); however, if you simply want a "quick" solution, then taking the [CLS] token is certainly a valid strategy.
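For illustration, here is a minimal sketch of taking that first special-token embedding with a barebone model call (assuming PyTorch and the same distilroberta-base checkpoint as in the question):
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
model = AutoModel.from_pretrained('distilroberta-base')

# Tokenize and run the model; special tokens are added automatically
inputs = tokenizer("i am sentence", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, seq_len, hidden_size);
# index [0, 0] is the first special token, i.e. the [CLS]-style embedding
cls_embedding = outputs.last_hidden_state[0, 0]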
Now, while the documentation of the FeatureExtractionPipeline isn't very clear, in your example we can easily compare the outputs, specifically their lengths, with a direct model call:
from transformers import pipeline, AutoTokenizer
# direct encoding of the sample sentence
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
encoded_seq = tokenizer.encode("i am sentence")
# your approach
feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction("i am sentence")
# Compare lengths of outputs
print(len(encoded_seq)) # 5
# Note that the pipeline returns a nested list, so the output needs to be indexed with 0.
print(len(features[0])) # 5
When inspecting the content of encoded_seq, you will notice that the first token has index 0, denoting the beginning-of-sequence token (in our case, the token that plays the [CLS] role). Since the output lengths are the same, you could then simply access a preliminary sentence embedding by doing something like
sentence_embedding = features[0][0]
If you want to get a meaningful embedding of the whole sentence, please use SentenceTransformers. Pooling is well implemented in it, and it also provides various APIs to fine-tune models to produce features/embeddings at sentence/text-chunk level:
pip install sentence-transformers
Once you have installed sentence-transformers, the code below can be used to produce sentence embeddings:
from sentence_transformers import SentenceTransformer
model_st = SentenceTransformer('distilroberta-base')
embeddings = model_st.encode("I am a sentence")
print(embeddings)
Visit official site for more info on sentence transformers.
If you have the embeddings for each token, you can create an overall sentence embedding by pooling (summarizing) over them. Note that if you have D-dimensional token embeddings, you should get a D-dimensional sentence embedding through one of these approaches (see the sketch after this list):
Compute the mean over all token embeddings.
Compute the max of each of the D-dimensions over all the token embeddings.
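A minimal sketch of both pooling strategies, assuming the features list produced by the pipeline call in the question:
import numpy as np

# features[0] is a list of D-dimensional vectors, one per token
token_embeddings = np.array(features[0])

# Mean pooling: average each dimension over all tokens -> one D-dimensional vector
mean_sentence_embedding = token_embeddings.mean(axis=0)

# Max pooling: take the maximum of each dimension over all tokens -> one D-dimensional vector
max_sentence_embedding = token_embeddings.max(axis=0)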

Entities on my gazette are not recognized

I would like to create a custom NER model. This is what I did:
TRAINING DATA (stanford-ner.tsv):
Hello O
! O
My O
name O
is O
Damiano PERSON
. O
PROPERTIES (stanford-ner.prop):
trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true
GAZETTE (gazzetta.txt):
PERSON John
PERSON Andrea
I build the model via command line with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop stanford-ner.prop
And test with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt
I did two tests with the following texts:
>>> TEST 1 <<<
TEXT:
Hello! My name is Damiano and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O
>>> TEST 2 <<<
TEXT:
Hello! My name is John and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O
As you can see, only the "Damiano" entity is found. This entity is in my training data, but "John" (second test) is in the gazette. So the question is:
Why is the "John" entity not recognized?
Thank you so much in advance.
As the Stanford FAQ says:
If a gazette is used, this does not guarantee that words in the
gazette are always used as a member of the intended class, and it does
not guarantee that words outside the gazette will not be chosen. It
simply provides another feature for the CRF to train against. If the
CRF has higher weights for other features, the gazette features may be
overwhelmed.
If you want something that will recognize text as a member of a class
if and only if it is in a list of words, you might prefer either the
regexner or the tokensregex tools included in Stanford CoreNLP. The
CRF NER is not guaranteed to accept all words in the gazette as part
of the expected class, and it may also accept words outside the
gazette as part of the class.
Btw, it is not good practice to test machine learning pipelines in a 'unit-test' way, i.e. with only one or two examples, because they are supposed to work on much greater volumes of data and, more importantly, they are probabilistic by nature.
If you want to check whether your gazette file is actually used, it may be better to take existing examples (see the bottom of the page linked above for the austen.gaz.prop and austen.gaz.txt examples), replace several names with your own, and then check. If it fails, first try to change your test, e.g. add more names, reformulate the text, and so on.
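For illustration, if a hard list match is really what you need, a RegexNER mapping file is just tab-separated entries, one pattern and class per line (a sketch; the file name rules.txt and the exact annotator list are illustrative, not taken from the question):
John	PERSON
Andrea	PERSON
You would then run it through the full CoreNLP pipeline with the regexner annotator enabled, e.g. -annotators tokenize,ssplit,pos,lemma,ner,regexner -regexner.mapping rules.txt.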
The gazette only helps by providing extra features over the training data; if these words never occur in your training data, or have no connection to labeled tokens, your model will not benefit from them. One experiment I would suggest is to add Damiano to your gazette.
Why is the "John" entity not recognized?
It looks to me that your minimal example most probably needs "Damiano" added to the gazetteer under the PERSON category. Currently, the training data allows the model to learn that "Damiano" has the PERSON label, but I think this is not tied to the gazetteer categories (i.e. having PERSON on both sides is not sufficient on its own).

Parse/ tree diagram for phrasal constituents of a specific sentence

I'm trying to do some language analysis on the opening paragraph of The Kite Runner by Khaled Hosseini, specifically looking at phrasal constituents. The first sentence is as follows:
"I became what I am today at the age of twelve, on a frigid overcast day in the winter of 1975."
I've got a pretty good idea of what the phrasal constituents are, but I'm a bit unsure as to how to draw the tree, as it seems like the tree should be split into two distinct branches, splitting at the comma after twelve. I've uploaded an image of my tree so far, but I'm not sure if it's correct or not. Any help would be greatly appreciated.
Thanks in advance :)
There is a library called constituent-treelib that can be used to construct, process and visualize constituent trees. First, we must install the library:
pip install constituent-treelib
Then, we can use it as follows to parse the sentence into a constituent tree, visualize it, and finally export the result to a PDF file:
from constituent_treelib import ConstituentTree
# Define a sentence
sentence = "You must construct additional pylons!"
# Define the language that should be considered
language = ConstituentTree.Language.English
spacy_model_size = ConstituentTree.SpacyModelSize.Medium
# Construct the necessary NLP pipeline by downloading and installing the required models (benepar and spaCy)
nlp = ConstituentTree.create_pipeline(language, spacy_model_size, download_models=True)
# Instantiate a ConstituentTree object and pass it both the sentence and the NLP pipeline
tree = ConstituentTree(sentence, nlp)
# Export the visualization of the tree into a PDF file
tree.export_tree('my_constituent_tree.pdf')
The result...
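Alternatively, if you have already worked out a bracketing by hand, nltk can render it directly. A minimal sketch (the bracketing below is a simplified, hand-written illustration of the first clause only, not a gold-standard parse):
from nltk import Tree

# A simplified, hand-written bracketing (illustrative only)
t = Tree.fromstring(
    "(S (NP (PRP I)) (VP (VBD became) (SBAR (WHNP (WP what)) (S (NP (PRP I)) (VP (VBP am) (NP (NN today)))))))"
)

# Prints an ASCII tree diagram to the console; t.draw() opens a graphical window instead
t.pretty_print()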

Create a user assistant using NLP

I am following a course titled Natural Language Processing on Coursera, and while the course is informative, I wonder if its contents cater to what I am looking for. Basically, I want to implement a textual version of Cortana or Siri as a project, i.e. the user can enter commands for the computer in natural language, and they will be processed and translated into appropriate OS commands. My questions are:
What is the general sequence of steps for the above applications, after processing the speech? Do they tag the text and then parse it, or do they take some other approach?
Which application of NLP does this fall under? Can someone cite some good resources for it? My only doubt is whether what I am following now will serve any important part of my goal or not.
What you want to create can be thought of as a carefully constrained chat-bot, except you are not attempting to hold a general conversation with the user, but to process specific natural language input and map it to specific commands or actions.
In essence, you need a tool that can pattern match various user input, with the extraction or at least recognition of various important topic or subject elements, and then decide what to do with that data.
Rather than get into an abstract discussion of natural language processing, I'm going to make a recommendation instead. Use ChatScript. It is a free, open-source tool for creating chat-bots that recently took first place in the Loebner chat-bot competition, as it has done several times in the past:
http://chatscript.sourceforge.net/
The tool is written in C++, but you don't need to touch the source code to create NLP apps; just use the scripting language provided by the tool. Although initially written for chat-bots, it has expanded into an extremely programmer friendly tool for doing any kind of NLP app.
Most importantly, you are not boxed in by the philosophy of the tool or limited by the framework provided by the tool. It has all the power of most scripting languages, so you won't find yourself going most of the distance towards completing your app, only to hit some crushing limitation during the last mile that defeats your app or at least cripples it severely.
It also includes a large number of ontologies that can significantly jump-start your development efforts, and its built-in pre-processor does part-of-speech parsing, input conformance, and many other tasks crucial to writing scripts that can easily be generalized to handle large variations in user input. It also has a full interface to the WordNet synset database. There are many other important features in ChatScript that make NLP development much easier, too many to list here. It can run on Linux or Windows as a server that can be accessed using a TCP/IP socket connection.
Here's a tiny and overly simplistic example of some ChatScript script code:
# Define the list of available devices in the user's household.
concept: ~available_devices( green_kitchen_lamp stove radio )
#! Turn on the green kitchen lamp.
#! Turn off that damn radio!
u: ( turn _[ on off ] *~2 _~available_devices )
# Save off the desired action found in the user's input. ON or OFF.
$action = _0
# Save off the name of the device the user wants to turn on or off.
$target_device = _1
# Launch the utility that turns devices on and off.
^system( devicemanager $action $target_device )
Above is a typical ChatScript rule. Your app will have many such rules. This rule is looking for commands from the user to turn various devices in the house on and off. The # character indicates a line is a comment. Here's a breakdown of the rule's head:
It consists of the prefix u:. This tells ChatScript that the rule accepts user input in either statement or question format.
It consists of the match pattern, which is the content between the parentheses. This match pattern looks for the word turn anywhere in the sentence. Next it looks for the desired user action. The square brackets tell ChatScript to match the word on or the word off. The underscore preceding the square brackets tells ChatScript to capture the matched text, the same way parentheses do in a regular expression. The ~2 token is a range-restricted wildcard. It tells ChatScript to allow up to 2 intervening words between the word turn and the concept set named ~available_devices.
~available_devices is a concept set. It is defined above the rule and contains the set of known devices the user can turn on and off. The underscore preceding the concept set name tells ChatScript to capture the name of the device the user specified in their input.
If the rule pattern matches the current user input, it "fires" and then the rule's body executes. The contents of this rule's body are fairly obvious, and the comments above each line should help you understand what the rule does if fired. It saves off the desired action and the desired target device, captured from the user's input, to variables. (ChatScript variable names are preceded by a single or double dollar sign.) Then it shells out to the operating system to execute a program named devicemanager that will actually turn the desired device on or off.
I wanted to point out one of ChatScript's many features that make it a robust and industrial strength NLP tool. If you look above the rule you will see two sentences prefixed by a string consisting of the characters #!. These are not comments but are validation sentences instead. You can run ChatScript in verify mode. In verify mode it will find all the validation sentences in your scripts. It will then apply each validation sentence to the rule immediately following it/them. If the rule pattern does not match the validation sentence, an error message will be written to a log file. This makes each validation sentence a tiny, easy to implement unit test. So later when you make changes to your script, you can run ChatScript in verify mode and see if you broke anything.

Documentation of Moses (statistical machine translation) moses.ini file format?

Is there any documentation of the moses.ini format for Moses? Running moses at the command line without arguments returns available feature names but not their available arguments. Additionally, the structure of the .ini file is not specified in the manual that I can see.
The main idea is that the file contains settings that will be used by the translation model. Thus, the documentation of values and options in moses.ini should be looked up in the Moses feature specifications.
Here are some excerpts I found on the Web about moses.ini.
In the Moses Core documentation, we have some details:
7.6.5 moses.ini
All feature functions are specified in the [feature] section. It should be in the format:
Feature-name key1=value1 key2=value2 ...
For example: KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz
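Putting that excerpt together, the [feature] section of a moses.ini might look roughly like this (a sketch; the KENLM line is the one from the excerpt, while the phrase-table line and its path are illustrative only):
[feature]
KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz
PhraseDictionaryMemory input-factor=0 output-factor=0 path=phrase-table.gz num-features=4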
Also, there is a hint on how to print basic statistics about all components mentioned in the moses.ini.
Run the script
analyse_moses_model.pl moses.ini
This can be useful to set the order of mapping steps to avoid explosion of translation options or just to check that the model components are as big/detailed as we expect.
In the Center for Computational Language and EducAtion Research (CLEAR) Wiki, there is a sample file with some documentation:
Parameters
It is recommended to make an .ini file to store all of your settings.
input-factors
- Whether a factored model is used or not
mapping
- Whether to use the LM in memory (T) or read the file from the hard disk directly (G)
ttable-file
- Indicates the number of source factors, the number of target factors, the number of scores, and the path to the translation table file
lmodel-file
- Indicates the type used for the LM (0: SRILM, 1: IRSTLM), the factor number used, the order (n-gram) of the LM, and the path to the language model file
If that is not enough, there is another description on this page; see the "Decoder configuration file" section:
The sections
[ttable-file] and [lmodel-file] contain pointers to the phrase table
file and language model file, respectively. You may disregard the
numbers on those lines. For the time being, it's enough to know that
the last one of the numbers in the language model specification is the
order of the n-gram model.
The configuration file also contains some feature weights. Note that
the [weight-t] section has 5 weights, one for each feature contained
in the phrase table.
The moses.ini file created by the training process will not work with
your decoder without modification because it relies on a language
model library that is not compiled into our decoder. In order to make
it work, open the moses.ini file and find the language model
specification in the line immediately after the [lmodel-file] heading.
The first number on this line will be 0, which stands for SRILM.
Change it into 8 and leave the rest of the line untouched. Then your
configuration should work.
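For instance, the [weight-t] block referred to above might look like this in a generated legacy-style moses.ini (a sketch; the five values are placeholders, one weight per line for each phrase-table feature):
[weight-t]
0.20
0.20
0.20
0.20
0.20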
