I'm trying to find entities in website text, so I've copied the text of 20 sites into text files (each on one line) and tagged the entities manually, following this tutorial: https://www.machinelearningplus.com/nlp/training-custom-ner-model-in-spacy/
Usually, the text files contain 5000+ characters, and I've tagged two entities in each file.
I've got spaCy 3.2.1 and I'm using nlp = spacy.load("en_core_web_sm").
The printed losses:
Losses {'ner': 438.16809472077654}
Losses {'ner': 448.97231569240785}
Losses {'ner': 470.66727516808453}
Losses {'ner': 477.0379003697807}
Losses {'ner': 8.354419636216686}
Losses {'ner': 86.56611033267922}
...
Losses {'ner': 0.026654468748418734}
Losses {'ner': 0.026672022747312552}
Losses {'ner': 0.026672030436685996}
Losses {'ner': 0.027033395785247216}
Losses {'ner': 0.027033751640550253}
Losses {'ner': 0.02703389312135557}
Losses {'ner': 0.029515604829788607}
When I use this model to find an entity in another text, it finds nothing, even though I can see the entities in the text.
doc = nlp(text)
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
displacy.serve(doc, style="ent")
This code just displays:
Entities []
/home/python3.8/site-packages/spacy/displacy/__init__.py:200: UserWarning: [W006] No entities to visualize found in Doc object. If this is surprising to you, make sure the Doc was processed using a model that supports named entity recognition, and check the `doc.ents` property manually if necessary.
warnings.warn(Warnings.W006)
Using the 'ent' visualizer
Serving on http://0.0.0.0:5000
All the examples that I found online use a different text to entity ratio:
1 short sentence with 1 - 2 entities inside.
I'm using 1 long text with 1 - 2 entities inside.
Could that be the issue?
And if so, what do I need to do? Prepare more training data?
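If the long-text format does turn out to be part of the problem, one thing that might be worth trying is splitting each long annotated text into sentence-level examples and remapping the entity offsets. The following is only a minimal sketch, assuming annotations in the (text, {"entities": [(start, end, label)]}) form with character offsets. The helper split_example, the regex-based sentence splitter, and the sample text are all hypothetical, and entities that straddle a sentence boundary are simply dropped.

```python
import re

def split_example(text, entities):
    """Split one long annotated text into sentence-level training examples.

    `entities` is a list of (start, end, label) character offsets into `text`.
    Entities that cross a sentence boundary are dropped.
    """
    examples = []
    start = 0
    # Naive splitter (an assumption): a sentence ends at ., ! or ? plus whitespace.
    boundaries = [m.end() for m in re.finditer(r"[.!?]\s+", text)] + [len(text)]
    for end in boundaries:
        sentence = text[start:end]
        if sentence.strip():
            # Keep entities fully inside this sentence, shifted to local offsets.
            local = [
                (s - start, e - start, label)
                for s, e, label in entities
                if s >= start and e <= end
            ]
            examples.append((sentence, {"entities": local}))
        start = end
    return examples

# Hypothetical sample data, not taken from the question.
text = "Welcome to our shop. Acme Corp sells anvils. Contact us today."
entities = [(21, 30, "ORG")]  # text[21:30] == "Acme Corp"
examples = split_example(text, entities)
for sentence, annotations in examples:
    print(repr(sentence), annotations)
```

Sentences without any entity are kept on purpose, since they can serve as negative examples during training.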
I want to make predictions with an LSTM model, but the dataset isn't a single file; it's composed of multiple files (for example, 3 Excel files).
I've already checked that for a time series forecasting problem you have to prepare your data as (number of samples, number of time steps, number of features), and it works well when I do this for a single Excel file.
The problem is training with the three Excel files at the same time. In this case the input tensor for the LSTM layer has the shape (n_files, n_samples, n_timesteps, n_features), with dim = 4. This doesn't work because LSTM layers only accept input tensors with dim = 3.
My files have the same amount of data. It's collected from a device, with 1 value for each minute over the duration of the experiment. All the experiments have the same duration too.
I've tried concatenating the files into one and choosing batch_size as the number of samples in one Excel file (because I can't mix the different experiments), but it doesn't produce a good result (at least not as good as the results from predicting with 1 experiment).
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

def build_model():
    model = keras.Sequential([
        # Skip time steps whose value is 0.0; input is (timesteps=1, features=1)
        layers.Masking(mask_value=0.0, input_shape=[1, 1]),
        layers.LSTM(50, return_sequences=True),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.Adam(0.001)
    model.compile(loss='mse',
                  optimizer=optimizer,
                  metrics=['mae', 'RootMeanSquaredError'])
    return model

model_pred = build_model()
model_pred.fit(Xchopped_train, ychopped_train, batch_size=252,
               epochs=500, verbose=1)
Where Xchopped_train and ychopped_train are the concatenated data from the 3 Excel files.
Another thing I've tried is training the model in a loop, changing the Excel file on each iteration:
for i in range(len(Xtrain)):
    model_pred.fit(Xtrain[i], Ytrain[i], epochs=167, verbose=1)
Where Xtrain has shape (3, 252, 1, 1) and the first index refers to the Excel file.
So far this is my best approach, but it doesn't feel like a good solution, since I don't know what's happening with the NN weights or which loss function is being minimized...
Is there a more efficient way to do this? Thanks!
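One possible way to get a 3-dimensional input, sketched here under assumptions rather than offered as a definitive fix, is to treat each file as one sample whose 252 readings are the time steps, giving the shape (n_files, n_timesteps, n_features) = (3, 252, 1). The data below is a hypothetical stand-in for Xtrain, and shape and files_as_sequences are illustrative helpers written in plain Python; with NumPy arrays the same step would be a reshape.

```python
# Hypothetical stand-in for Xtrain with the shape described in the question:
# (n_files, n_samples, n_timesteps, n_features) = (3, 252, 1, 1).
n_files, n_samples = 3, 252
Xtrain = [[[[float(f * 1000 + t)]] for t in range(n_samples)]
          for f in range(n_files)]

def shape(nested):
    """Return the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

def files_as_sequences(X4d):
    """Drop the length-1 timestep axis so each file becomes one sequence."""
    return [[sample[0] for sample in file_data] for file_data in X4d]

X3d = files_as_sequences(Xtrain)
print(shape(Xtrain), "->", shape(X3d))  # (3, 252, 1, 1) -> (3, 252, 1)
```

With this layout the experiments are never mixed, because each one is its own sequence. The targets would need the same treatment, and the input_shape of the first layer would presumably become [252, 1] instead of [1, 1].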
I am following this tutorial:
https://pjreddie.com/darknet/yolo/
First, I tried to use my own dataset. It has annotations in .xml files, as the training requires. I also have the labels in .txt files. The training goes fine, but when I test on one of the images used in training, just to check whether the detector works, it detects nothing.
Then I tried to follow the examples on the website https://pjreddie.com/darknet/yolo/ for the VOC dataset. The training again goes well, but, again, nothing is detected.
My command to train:
./darknet detector train cfg/voc.data cfg/yolov3-voc.cfg darknet53.conv.74
My command to test:
./darknet detect cfg/yolov3-voc.cfg backup/yolov3-voc_final.weights VOCdevkit/VOC2012/JPEGImages/2007_000033.jpg
My cfg/voc.data
classes= 20
train = /home/server/Desktop/dataset_others/darknet/train.txt
valid = /home/server/Desktop/dataset_others/darknet/2007_test.txt
names = /home/server/Desktop/dataset_others/darknet/data/voc.names
backup = /home/server/Desktop/dataset_others/darknet/backup
My data/voc.names
aeroplane
bicycle
bird
boat
bottle
bus
car
cat
chair
cow
diningtable
dog
horse
motorbike
person
pottedplant
sheep
sofa
train
tvmonitor
One thing I noticed is the large number of nan values printed during training.
What did I miss when training the network for the VOC dataset?
I want to know whether there is any way to partially save a scikit-learn machine learning model and reload it later to continue training from the point where it was saved.
For scikit-learn models applied to sentiment analysis, I suspect you need to save two important things: 1) your model, 2) your vectorizer.
Remember that after training your model, each text is represented by a vector of length N, where N is determined by the total number of words in your vocabulary.
Below is a piece from my test model and test vectorizer, saved in order to be used later.
SAVING THE MODEL
import pickle
with open("model5vectorizer.pickle", "wb") as f:
    pickle.dump(vectorizer, f)
with open("model5.pickle", "wb") as f:
    pickle.dump(classifier_fitted, f)
LOADING THE MODEL IN A NEW SCRIPT (.py)
import pickle
with open("model5.pickle", "rb") as f:
    model = pickle.load(f)
with open("model5vectorizer.pickle", "rb") as f:
    vectorizer = pickle.load(f)
TEST YOUR MODEL
sentence_test = ["Results by Andutta et al (2013), were completely wrong and unrealistic."]
USING THE VECTORIZER (model5vectorizer.pickle) !!
sentence_test_data = vectorizer.transform(sentence_test)
print("### sentence_test ###")
print(sentence_test)
print("### sentence_test_data ###")
print(sentence_test_data)
# OBS-1: VECTOR HERE WILL HAVE SAME LENGTH AS BEFORE :)
# OBS-2: If you load the default vectorizer or a different one, then you may see the following problems
# sklearn.exceptions.NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.
# ValueError: X has 8 features per sample; expecting 11
result1 = model.predict(sentence_test_data) # using saved vectorizer from calibrated model
print("### RESULT ###")
print(result1)
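To illustrate why the saved vectorizer matters, here is a toy sketch. It is not scikit-learn's implementation: fit_vocabulary and transform are hypothetical helpers, and the sentences are made up. It shows that the vector length equals the size of the fitted vocabulary, so a vectorizer fitted on different text yields vectors of a different length, which is the kind of mismatch reported in the ValueError above.

```python
import pickle

def fit_vocabulary(sentences):
    """Map every distinct lowercased word to a column index."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}

def transform(vocabulary, sentences):
    """Turn sentences into count vectors of length len(vocabulary)."""
    rows = []
    for s in sentences:
        row = [0] * len(vocabulary)
        for w in s.lower().split():
            if w in vocabulary:  # words unseen during fitting are ignored
                row[vocabulary[w]] += 1
        rows.append(row)
    return rows

# Vocabulary fitted on the (hypothetical) training sentences: 5 distinct words.
vocabulary = fit_vocabulary(["the results were wrong", "the results were great"])

# Round-trip through pickle, standing in for saving and reloading the vectorizer.
restored = pickle.loads(pickle.dumps(vocabulary))

# A vocabulary fitted on other text has a different size: 2 distinct words.
other = fit_vocabulary(["completely unrealistic"])

sentence_test = ["results were completely wrong"]
restored_vec = transform(restored, sentence_test)[0]
other_vec = transform(other, sentence_test)[0]
print(len(restored_vec), len(other_vec))  # 5 2
```

A model trained on 5-feature vectors can only score the first of these two vectors, which is why the model and its vectorizer are saved together.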
Hope that helps.
Regards,
Andutta
When a data set is fitted to a scikit-learn machine learning model, the model is trained and supposedly ready to be used for prediction. If you train a model with, let's say, 100 samples, use it, and then go back and fit another 50 samples to it, you will not make it better; you will rebuild it.
If your purpose is to build a model that becomes more powerful as it interacts with more samples, you are thinking of a real-time setting, such as a mobile robot mapping an environment with a Kalman filter.
I have a dataset with documents separated into different years. My objective is to train an embedding model for each year's data while making sure that the same word appearing in different years has similar vector representations. For example, for the word 'compute', its vector in year 1 is
[0.22, 0.33, 0.20]
and in year 2 it's something around:
[0.20, 0.35, 0.18]
Is there a way to accomplish this? For example, train the model of year 2 with both initial values (if the word is trained already in year 1, modify its vector) and randomness (if this is a new word for the corpus).
I think the easiest solution is to save the embeddings after training on the first data set, then load the trained model and continue training for the second data set. This way you shouldn't expect the embeddings to drift away from the saved state much (unless your data sets are very different).
It would also make sense to create a single vocabulary from all documents: vocabulary words that aren't present in a particular document will get some random representation, but still it will be a working word2vec model.
Example from the documentation:
>>> model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
>>> model.save(fname)
>>> model = Word2Vec.load(fname) # continue training with the loaded model
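The initialisation idea from the question (reuse the year 1 vector when the word is already known, otherwise start from random values) can be sketched in plain Python. This is only an illustration of that one step, not gensim's API and not a training loop; init_vectors, the seed, the random range, and the extra words are assumptions made for the example.

```python
import random

def init_vectors(vocab, previous, dim, seed=0):
    """Build starting vectors for a new year.

    Words already present in `previous` keep (a copy of) their old vector;
    new words get small random values.
    """
    rng = random.Random(seed)
    vectors = {}
    for word in vocab:
        if word in previous:
            vectors[word] = list(previous[word])  # copy, so year 1 stays intact
        else:
            vectors[word] = [rng.uniform(-0.5 / dim, 0.5 / dim)
                             for _ in range(dim)]
    return vectors

# Year 1 vectors; 'compute' uses the values from the question, 'data' is made up.
year1 = {"compute": [0.22, 0.33, 0.20], "data": [0.10, -0.05, 0.40]}
year2_vocab = ["compute", "data", "cloud"]  # 'cloud' is new in year 2

year2_init = init_vectors(year2_vocab, year1, dim=3)
print(year2_init["compute"])
print(year2_init["cloud"])
```

Training on the year 2 documents would then start from year2_init, so vectors of shared words begin at their year 1 positions and move only as far as the new data pushes them.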
How can we build a working classifier for sentiment analysis, given that we need to train the classifier on huge data sets?
I have a huge data set to train on, but the classifier object (here in Python) gives a memory error when using 3000 words, and I need to train on more than 100K words.
My idea was to divide the huge data set into smaller parts, build a classifier object for each part, store each one in a pickle file, and use all of them. But using all the classifier objects for testing doesn't seem possible, as only one of the objects is used during testing.
The solutions that come to mind are either to combine all the saved classifier objects stored in the pickle files (which just isn't working) or to keep appending new training sets to the same object (but again, it gets overwritten rather than appended).
I don't know why, but I could not find any solution to this problem, even though it is basic to machine learning. Every machine learning project needs to be trained on a huge data set, and the object size for training on such data sets will always give a memory error.
So, how can I solve this problem? I am open to any solution, but would like to hear what people who do real-time machine learning projects actually do.
Code Snippet :
import nltk
from nltk.corpus import movie_reviews

# Each document is a (list of words, category) pair
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Frequency distribution over all lowercased words in the corpus
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    # One boolean feature per word in word_features: is it in the document?
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

# 90% of the documents for training, the rest for testing
numtrain = int(len(documents) * 90 / 100)
training_set = featuresets[:numtrain]
testing_set = featuresets[numtrain:]
classifier = nltk.NaiveBayesClassifier.train(training_set)
PS: I am using the NLTK toolkit with Naive Bayes. My training dataset is opened and stored in documents.
There are two things you seem to be missing:
Datasets for text are usually extremely sparse, and you should store them as sparse matrices. With such a representation, you should be able to store millions of documents in memory with a vocabulary of 100,000 words.
Many modern learning methods are trained in a mini-batch scenario, meaning that you never need the whole dataset in memory. Instead, you feed the model random subsets of the data while still training a single model. This way your dataset can be arbitrarily large, memory consumption is constant (fixed by the mini-batch size), and only training time scales with the number of samples.
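Both points can be illustrated with a small sketch. Note that this is a hand-written, count-based multinomial Naive Bayes, not NLTK's NaiveBayesClassifier, and IncrementalNaiveBayes, minibatches, and the tiny review set are all hypothetical. Documents are kept sparse (only the words they contain), and a single model is updated one mini-batch at a time, so only the current batch has to be in memory. For simplicity the batches here are sequential chunks rather than random subsets.

```python
import math
from collections import Counter, defaultdict

class IncrementalNaiveBayes:
    """Multinomial Naive Bayes whose counts are updated batch by batch."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # per-class word counts
        self.vocabulary = set()

    def partial_fit(self, batch):
        for words, label in batch:
            self.class_docs[label] += 1
            self.word_counts[label].update(words)  # sparse: only seen words
            self.vocabulary.update(words)

    def predict(self, words):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def minibatches(stream, size):
    """Yield lists of at most `size` items from any iterable."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Made-up reviews; in practice the stream would read from disk lazily.
reviews = [
    (["great", "fun", "movie"], "pos"),
    (["boring", "bad", "plot"], "neg"),
    (["fun", "great", "cast"], "pos"),
    (["bad", "boring", "acting"], "neg"),
]
model = IncrementalNaiveBayes()
for batch in minibatches(iter(reviews), size=2):
    model.partial_fit(batch)  # same model object, updated in place

print(model.predict(["great", "fun"]), model.predict(["boring", "plot"]))  # pos neg
```

Because the model only stores counts, feeding it a new batch adds to what it already learned instead of overwriting it, which is the behaviour the question was looking for.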