ANN - Artificial Neural Network training - OpenCV

I know it is possible to save a trained ANN to a file using CvFileStorage, but I really don't like the way CvFileStorage saves the training. So I was wondering: is it possible to retrieve the information from a training and save it in a custom way?
Thanks in advance.

Just look at the XML structure, it's very simple.
The names of the objects are the same as in the ANN class.
Here is an XOR-solving network:
<?xml version="1.0"?>
<opencv_storage>
<my_nn type_id="opencv-ml-ann-mlp">
<layer_sizes type_id="opencv-matrix">
<rows>1</rows>
<cols>3</cols>
<dt>i</dt>
<data>
2 3 1</data></layer_sizes>
<activation_function>SIGMOID_SYM</activation_function>
<f_param1>1.</f_param1>
<f_param2>1.</f_param2>
<min_val>-9.4999999999999996e-001</min_val>
<max_val>9.4999999999999996e-001</max_val>
<min_val1>-9.7999999999999998e-001</min_val1>
<max_val1>9.7999999999999998e-001</max_val1>
<training_params>
<train_method>RPROP</train_method>
<dw0>1.0000000000000001e-001</dw0>
<dw_plus>1.2000000000000000e+000</dw_plus>
<dw_minus>5.0000000000000000e-001</dw_minus>
<dw_min>1.1920928955078125e-007</dw_min>
<dw_max>50.</dw_max>
<term_criteria><epsilon>9.9999997764825821e-003</epsilon>
<iterations>1000</iterations></term_criteria></training_params>
<input_scale>
2. -1. 2. -1.</input_scale>
<output_scale>
5.2631578947368418e-001 4.9999999999999994e-001</output_scale>
<inv_output_scale>
1.8999999999999999e+000 -9.4999999999999996e-001</inv_output_scale>
<weights>
<_>
-3.8878915951440729e+000 -3.7728173427563569e+000
-1.9587678786875042e+000 3.7898767378369680e+000
3.0354324494246829e+000 1.9757881693499044e+000
-3.5862527376978406e+000 -3.2701446005792296e+000
1.3000011629911392e+000</_>
<_>
3.1017381376627204e+000 1.1052842857439200e+000
-4.6739037571329822e+000 3.2282702769334666e+000</_></weights></my_nn>
</opencv_storage>
You can save the same parameters to your own formatted file. Some of the fields are protected, but you can derive a child class from CvANN_MLP and write your own file saver.
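If subclassing in C++ is inconvenient, another option is to parse the saved XML directly and re-emit it in whatever format you prefer. Here is a minimal Python sketch (the file name mlp.xml and the node name my_nn are assumptions matching the example above):
import xml.etree.ElementTree as ET
# Locate the network node saved by CvFileStorage (names are hypothetical)
nn = ET.parse("mlp.xml").getroot().find("my_nn")
layer_sizes = nn.find("layer_sizes/data").text.split()
weights = [w.text.split() for w in nn.find("weights")]
# Re-save in a custom plain-text format
with open("mlp_custom.txt", "w") as f:
    f.write("layers: " + " ".join(layer_sizes) + "\n")
    for i, layer in enumerate(weights):
        f.write("layer %d: %s\n" % (i, " ".join(layer)))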

Related

How to get vocabulary size of word2vec?

I have a pretrained word2vec model in pyspark and I would like to know how big its vocabulary is (and perhaps get a list of words in the vocabulary).
Is this possible? I would guess it has to be stored somewhere since it can predict for new data, but I couldn't find a clear answer in the documentation.
I tried w2v_model.getVectors().count(), but the result (970) seems too small for my use case. In case it is relevant: I'm using short-text data, my dataset has tens of millions of messages, and each message has roughly 10 to 40 words. I am using min_count=50.
It's not quite clear why you doubt the result of .getVectors().count(); it does give the desired result, as shown in the documentation link you provided yourself.
Here is the example posted there, with a vocabulary of just three (3) tokens - a, b, and c:
from pyspark.ml.feature import Word2Vec
sent = ("a b " * 100 + "a c " * 10).split(" ") # 3-token vocabulary
doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
So, unsurprisingly, it is
model.getVectors().count()
# 3
and asking for the vectors themselves
model.getVectors().show()
gives
+----+--------------------+
|word| vector|
+----+--------------------+
| a|[0.09511678665876...|
| b|[-1.2028766870498...|
| c|[0.30153277516365...|
+----+--------------------+
In your case, with min_count=50, every word that appears fewer than 50 times in your corpus will not be represented; reducing this number will result in more vectors.
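To illustrate (a minimal sketch building on the example above; the threshold 10 is arbitrary), lowering minCount, the pyspark parameter for this cutoff, keeps rarer words in the vocabulary:
# Rebuild the model with a lower frequency cutoff
word2Vec = Word2Vec(vectorSize=5, minCount=10, inputCol="sentence", outputCol="model")
model = word2Vec.fit(doc)
model.getVectors().count()  # one row per word kept in the vocabulary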

How to Export Stanza to ONNX format?

How can I export Stanza to ONNX format?
It seems that simply training the model is not enough.
There is an explanation here: https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html
I created a fork of stanza for this experiment here: https://github.com/vivkvv/stanza. See also my commits: https://github.com/vivkvv/stanza/commits?author=vivkvv.
I used pipeline_demo.py for testing. The main thing I added is code inside models/tokenization/trainer.py, just below line 77:
pred = self.model(units, features)
Following that explanation, I added
torch.onnx.export(
    self.model,
    (units, features),
    onnx_export_file_name,
    opset_version=9,
    export_params=True,
    do_constant_folding=True,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch_size'},
        'output': {0: 'batch_size'}
    }
)
and it works for tokenization. But the same approach does not work for, e.g., the POS tagger or the lemmatizer (see my commit for PartOfSpeech), and I get different errors for different opset_version values.
I opened an issue about this on the Stanza GitHub: https://github.com/stanfordnlp/stanza/issues/893
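As a quick sanity check on a component that does export, you can load the resulting file with onnxruntime and inspect the graph inputs (a hedged sketch; tokenizer.onnx is a hypothetical file name):
import onnxruntime as ort
# List the exported graph's expected inputs (names, shapes, dtypes);
# these must match what the PyTorch model received at export time.
sess = ort.InferenceSession("tokenizer.onnx")
for inp in sess.get_inputs():
    print(inp.name, inp.shape, inp.type)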

How do I use Conll 2003 corpus in python crfsuite

I have downloaded the CoNLL 2003 corpus ("eng.train"). I want to use it to extract entities by training python-crfsuite, but I don't know how to load this file for training.
I found this example, but it is not for English.
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))
Also, in the future I would like to train on new entity types other than POS or location. How can I add those?
Please also suggest how to handle multi-word entities.
You can use ConllCorpusReader.
Here is the general usage:
ConllCorpusReader('file path', 'file name', columntypes=['','',''])
Here is the list of column types you can use: 'WORDS', 'POS', 'TREE', 'CHUNK', 'NE', 'SRL', 'IGNORE'
Example:
from nltk.corpus.reader import ConllCorpusReader
train = ConllCorpusReader('CoNLL-2003', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
test = ConllCorpusReader('CoNLL-2003', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])
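As a follow-up sketch: because the NE column is mapped as 'chunk' above, the reader's iob_sents() method returns (word, pos, iob-ne-tag) triples that python-crfsuite feature functions can consume directly. Note that eng.train contains -DOCSTART- marker lines you may want to filter out.
train_sents = list(train.iob_sents())  # lists of (word, pos, ne-tag) tuples
test_sents = list(test.iob_sents())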

NER model training with IOB encoding fails (Stanford CoreNLP)

I am trying to train an NER model for Stanford CoreNLP, but as soon as the 8th or 9th iteration of the training process is reached, it stalls and nothing else happens.
The corpus is annotated with IOB/BIO encoding like this:
How O
to O
play O
a O
video O
in O
Java B-Fram
Swing I-Fram
? O
My properties file:
trainFile = C:\\Data\\corpora\\train\\train.tsv
serializeTo = C:\\Data\\ner-model.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=2
maxRight=2
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
sloppyGazette=true
gazette=C:\\Data\\gazetteers\\gaz1.txt,C:\\Data\\gazetteers\\gaz2.txt
entitySubclassification=bio
The contents of my gazetteers:
Fram LiteDB
Fram RavenDB
Fram MongoDB
Fram Cassandra
Fram Couchbase
...
The command for the training process:
java -mx8g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop C:\\Data\\ner.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter
Why does the training process suddenly stop? Does this have something to do with incorrect properties? Or do the gazetteers have to use the same labels as the annotated corpus?
At the end I want the entities to be tagged with just "Fram" instead of "B-Fram" or "I-Fram". How is that possible?
Thank you in advance.
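For the last point, one option (a hedged sketch, independent of whatever fixes the stall; output.tsv is a hypothetical token-per-line output file) is to strip the B-/I- prefixes in a post-processing step:
# Collapse "B-Fram" / "I-Fram" into plain "Fram" in a token<TAB>tag file
with open("output.tsv") as fin, open("output_collapsed.tsv", "w") as fout:
    for line in fin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2 and parts[1] != "O":
            parts[1] = parts[1].split("-", 1)[-1]
        fout.write("\t".join(parts) + "\n")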

How to fetch vectors for a word list with Word2Vec?

I want to create a text file that is essentially a dictionary, with each word paired with its vector representation from word2vec. I'm assuming the process would be to first train word2vec, then look up each word from my list, find its representation, and save it to a new text file?
I'm new to word2vec and I don't know how to go about doing this. I've read from several of the main sites, and several of the questions on Stack, and haven't found a good tutorial yet.
The direct access model[word] is deprecated and will be removed in Gensim 4.0.0, in order to separate training from the embedding lookup. The call should simply be replaced with model.wv[word].
Using Gensim in Python, after vocabs are built and the model trained, you can find the word count and sampling information already mapped in model.wv.vocab, where model is the variable name of your Word2Vec object.
Thus, to create a dictionary object, you may:
my_dict = {}
for key in model.wv.vocab:
    my_dict[key] = model.wv[key]
    # Or my_dict[key] = model.wv.get_vector(key)
    # Or my_dict[key] = model.wv.word_vec(key, use_norm=False)
Now that you have your dictionary, you can write it to a file with whatever means you like. For example, you can use the pickle library. Alternatively, if you are using Jupyter Notebook, they have a convenient 'magic command' %store my_dict > filename.txt. Your filename.txt will look like:
{'one': array([-0.06590105, 0.01573388, 0.00682817, 0.53970253, -0.20303348,
-0.24792041, 0.08682659, -0.45504045, 0.89248925, 0.0655603 ,
......
-0.8175681 , 0.27659689, 0.22305458, 0.39095637, 0.43375066,
0.36215973, 0.4040089 , -0.72396156, 0.3385369 , -0.600869 ],
dtype=float32),
'two': array([ 0.04694849, 0.13303463, -0.12208422, 0.02010536, 0.05969441,
-0.04734801, -0.08465996, 0.10344813, 0.03990637, 0.07126121,
......
0.31673026, 0.22282903, -0.18084198, -0.07555179, 0.22873943,
-0.72985399, -0.05103955, -0.10911274, -0.27275378, 0.01439812],
dtype=float32),
'three': array([-0.21048863, 0.4945509 , -0.15050395, -0.29089224, -0.29454648,
0.3420335 , -0.3419629 , 0.87303966, 0.21656844, -0.07530259,
......
-0.80034876, 0.02006451, 0.5299498 , -0.6286509 , -0.6182588 ,
-1.0569025 , 0.4557548 , 0.4697938 , 0.8928275 , -0.7877308 ],
dtype=float32),
'four': ......
}
You may also wish to look into the native save / load methods of Gensim's word2vec.
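For instance (a small sketch of the two options just mentioned; the file names are arbitrary):
import pickle
# Pickle the word -> vector dictionary built above
with open("word_vectors.pkl", "wb") as f:
    pickle.dump(my_dict, f)
# Or use Gensim's native format for the whole model
model.save("word2vec.model")
# model = Word2Vec.load("word2vec.model")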
The Gensim tutorial explains it very clearly.
First, you should create a word2vec model, either by training it on text, e.g.
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
or by loading a pre-trained model (you can find them here, for example).
Then iterate over all your words and look up their vectors in the model:
for word in words:
    vector = model[word]
Having that, just write word and vector formatted as you want.
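For example (a minimal sketch; vectors.txt is an arbitrary name, and model.wv[word] is used per the deprecation note above), one word per line followed by its vector components:
with open("vectors.txt", "w") as f:
    for word in words:
        vector = model.wv[word]
        f.write(word + " " + " ".join(str(x) for x in vector) + "\n")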
You can directly get the vectors through
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.wv.vectors
and the words through
model.wv.vocab.keys()
Hope it helps!
If you are willing to use python with gensim package, then building upon this answer and Gensim Word2Vec Documentation you could do something like this
from gensim.models import Word2Vec
# Take some sample sentences
tokenized_sentences = [["here","is","one"],["and","here","is","another"]]
# Initialise model, for more information, please check the Gensim Word2vec documentation
model = Word2Vec(tokenized_sentences, size=100, window=2, min_count=0)
# Get the ordered list of words in the vocabulary
words = model.wv.vocab.keys()
# Make a dictionary
we_dict = {word:model.wv[word] for word in words}
Gensim 4.0 update: the vocab attribute is deprecated, and the way to look up a word's vector has changed.
# Get the ordered list of words in the vocabulary
words = list(model.wv.index_to_key)
# Get the vector for 'also'
print(model.wv['also'])
Using basic Python:
all_vectors = []
for index, vector in enumerate(model.wv.vectors):
    vector_object = {}
    vector_object[list(model.wv.vocab.keys())[index]] = vector
    all_vectors.append(vector_object)
For Gensim 4.0:
my_dict = {}
for word in word_list:
    my_dict[word] = model.wv.get_vector(word, norm=True)
I would suggest this; there you may find anything you need, including Word2Vec, FastText, Doc2Vec, KeyedVectors, and so on...
