Does calling `ner.update()` multiple times makes any difference? - named-entity-recognition

I am trying to understand how to add entity classes to the Named Entity Recognizer. An example code has a structure that looks like:
ner = EntityRecognizer(nlp.vocab, entity_types=[... ENTITIES ...])
for itn in range(NUMBER_OF_ITERATIONS):
for raw_text, entities in training_examples:
... some data handling ...
ner.update(doc, gold)
, but then the next example (for BILUO tags) calls ner.update() only once (i.e., no for-loops that cause update() to see the training data multiple times).
I have read this question, whose answers seem to tell me I should call update() more than once for each training example; but then I also thought they could be just following the examples.
Because of the following sentence (from the end of the documentation page)...
The costs are then used to calculate the gradient of the loss, to train the model.
... I am guessing that the answer to my question is "yes, I should train it by iterating 'several' times through the training data"; but if that is the case, then does anyone have suggestion on how many times is "enough"? (the example code uses 5, but if I think it is too less, can I end up iterating "too many times"? I.e., does it "overfit"?)

Related

Build vocab in doc2vec

I have a list of abstracts and articles approx 500 in csv each paragraph contains approx 800 to 1000 words whenever I build vocab and print with words giving none and how I can improve results?
lst_doc = doc.translate(str.maketrans('', '', string.punctuation))
target_data = word_tokenize(lst_doc)
train_data = list(read_data())
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
train_vocab = model.build_vocab(train_data)
print(train_vocab)
{train = model.train(train_data, total_examples=model.corpus_count,
epochs=model.epochs) }
Output:
None
A call to build_vocab() only builds the vocabulary inside the model, for further usage. That function call doesn't return anything, so your train_vocab variable will be Python None.
So, the behavior you're seeing is as expected, and you should say more about what your ultimate aims are, and what you'd want to see as steps towards those aims, if you're stuck.
If you want to see reporting of the progress of your calls to build_vocab() or train(), you can set the logging level to INFO. This is always a usually a good idea working to learn a new library: even if initially the copious info shown is hard to understand, by reviewing it you'll start to see the various internal steps, and internal counts/timings/etc, that hint whehter things are doing well or poorly.
You can also examine the state of the model and its various internal properties after the code has run.
For example, the model.wv property contains, after build_vocab(), a Gensim KeyedVectors structure holding all the untrained ready-for-training vectors. You can ask for its length (len(model.wv) or examine the discovered active list of words (model.wv.index_to_key).
Other comments:
It's not clear your 1st two lines – assigning into lst_doc and target_data – affect anything further, since it's unclear what read_data() might be doing to fill the train_corpus.
Often low min_count values worsen results, by including more words that have so few usage examples that they're little more than noise during training.
only 500 documents is rather small compared to most published work showing impressive results with this algorithm, which uses tens-of-thousands of documents (if not millions). So, keep in mind that results on such a small dataset may be unrepresentative of what's possible with a larger corpus - in terms of quality, optimal parameters, etc.

Find the importance of each column to the model

I have a ML.net project and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so every rotation I can determine its state. If I just evenly space the samples down to 38 my model degrades by a lot. I know I am not feeding the model the features it thinks are most important but just making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as weight for each column and how would I do that?
I have actually only been working with ML.net for a couple weeks now so I apologize if the question is naive, I assure you I have googled this as many ways as I can think to. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the conditions as one of five text strings. This is the condition I am trying to predict.
The first time I tried it threw an error "expecting single but got string". No problem I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values and it threw the error expected Single, got Key UInt32. any ideas on how to push that into this function?
At any rate thank you for the reply but I guess my upvotes don't count yet sorry. hopefully I can upvote it later or someone else here can upvote it. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();
//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);
// 1. Get the column name of input features.
string[] featureColumnNames =
data.Schema
.Select(column => column.Name)
.Where(columnName => columnName != "Label").ToArray();
// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
mlContext.Transforms.Concatenate("Features", featureColumnNames)
.Append(mlContext.Transforms.NormalizeMinMax("Features"))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));
// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);//error here
// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);
// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
mlContext
.Regression
.PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);
// Order features by importance
var featureImportanceMetrics =
permutationFeatureImportance
.Select((metric, index) => new { index, metric.RSquared })
.OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));
Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance. This will tell you which features are most important by changing each feature in isolation, and then measuring how much that change affected the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use an open-source set of packages, they are much more sophisticated than what is found in ML.NET. I have an example on my GitHub how-to use R with advanced explainer packages to explain ML.NET models. You can get local instance as well as global model breakdown/details/diagnostics/feature interactions etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

How does Machine Learning algorithm retain learning from previous execution?

I am reading Hands on Machine Learning book and author talks about random seed during train and test split, and at one point of time, the author says over the period Machine will see your whole dataset.
Author is using following function for dividing Tran and Test split,
def split_train_test(data, test_ratio):
shuffled_indices = np.random.permutation(len(data))
test_set_size = int(len(data) * test_ratio)
test_indices = shuffled_indices[:test_set_size]
train_indices = shuffled_indices[test_set_size:]
return data.iloc[train_indices], data.iloc[test_indices]
Usage of the function like this:
>>>train_set, test_set = split_train_test(housing, 0.2)
>>> len(train_set)
16512
>>> len(test_set)
4128
Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.
Sachin Rastogi: Why and how will this impact my model performance? I understand that my model accuracy will vary on each run as Train set will always be different. How my model will see the whole dataset over a time ?
The author is also providing a few solutions,
One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.
But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance’s identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier).
Sachin Rastogi: Will it be a good train/test division? I think No, Train and Test should contain elements from across dataset to avoid any bias from the Train set.
The author is giving an example,
You could compute a hash of each instance’s identifier and put that instance in the test set if the hash is lower or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.
The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Sachin Rastogi: I am not able to understand this solution. Could you please help?
For me, these are the answers:
The point here is that you should better put aside part of your data (which will constitute your test set) before training the model. Indeed, what you want to achieve is to be able to generalize well on unseen examples. By running the code that you have shown, you'll get different test sets through time; in other words, you'll always train your model on different subsets of your data (and possibly on data that you've previously marked as test data). This in turn will affect training and - going to the limit - there will be nothing to generalize to.
This will be indeed a solution satisfying the previous requirement (of having a stable test set) provided that new data are not added.
As said in the comments to your question, by hashing each instance's identifier you can be sure that old instances always get assigned to the same subsets.
Instances that were put in the training set before the update of the dataset will remain there (as their hash value won't change - and so their left-most bit - and it will remain higher than 0.2*max_hash_value);
Instances that were put in the test set before the update of the dataset will remain there (as their hash value won't change and it will remain lower than 0.2*max_hash_value).
The updated test set will contain 20% of the new instances and all of the instances associated to the old test set, letting it remain stable.
I would also suggest to see here for an explanation from the author: https://github.com/ageron/handson-ml/issues/71.

Modelling a time series consisting mainly of structural breaks only

I am given a financial time series that is characterized by a bunch of structural breaks, i.e. the series isn't moving (literally at all), but at some points in time the series jumps up or down. Then it stays at this level for a while until the series jumps again. So the time series basically looks like a step function.
My assumption is that these breaks come from some particular exogenous variables that are in the form of dummies. So if a particular exogenous variable takes on the value 1, (I assume) it is very likely that the series jumps.
My question is how I could model this particular time series (in a uni- or multivariate sense). I guess that standard AR(MA)-models are inappropriate. I was thinking about creating two binary variables that take on the value 1 if there's an upward (downward) break and 0 otherwise. Then I would run a dynamic probit model to test the probabilities that the exogenous variables trigger a break. What do you think about this idea? Or would you have other suggestions? Please note that I don't wanna test for structural breaks but rather formulate a time series model.
Did you try ARIMAX, TAR, or STAR models?
You said that you have time series data and you think this series is influanced by some exogeneous shocks. I think you need to include exogeneous variable in your time series analysis thats where ARIMAX comes. This modela allows you to include exogeneous variable in ARIMA model.
You also said that there are(is) structural breaks. Try Treshold AutoRegressive or Smoothed Treshold AutoRegressive. I hope this helps to find more materials about that models. Here is one click here

Does it help to duplicate original data in order to make more data for building model?

I just got an interview question.
"Assume you want to build a statistical or machine learning model, but you have very limited data on hand. Your boss told you can duplicate original data several times, to make more data for building the model" Does it help?
Intuitively, it does not help, because duplicating original data doesn't create more "information" to feed the model.
But is there anyone can explain it more statistically? Thanks
Consider e.g. variance. The data set with the duplicated data will have the exact same variance - you don't have a more precise estimate of the distrbution afterwards.
There are, however, some exceptions. For example bootstrap validation helps when evaluating your model, but you have very little data.
Well, it depends on exactly what one means by "duplicating the data".
If one is exactly duplicating the whole data set a number of times, then methods based on maximum likelihood (as with many models in common use) must find exactly the same result since the log likelihood function of the duplicated data is exactly a multiple of the unduplicated data's log likelihood, and therefore has the same maxima. (This argument doesn't apply to methods which aren't based on the likelihood function; I believe that CART and other tree models, and SVM's, are such models. In that case you'll have to work out a different argument.)
However, if by duplicating, one means duplicating the positive examples in a classification problem (which is common enough, since there are often many more negative examples than positive), then that does make a difference, since the likelihood function is modified.
Also if one means bootstrapping, then that, too, makes a difference.
PS. Probably you'll get more interest in this question on stats.stackexchange.com.

Resources