SparkNLP's NerCrfApproach with custom labels

I am trying to train a SparkNLP NerCrfApproach model on a dataset in CoNLL format that uses custom labels for product entities (e.g., B-Prod, I-Prod). However, when I use the trained model to make predictions, every token is assigned the label "O". When I train the same pipeline on the CoNLL data from the SparkNLP workshop example, classification works fine.
(cf. https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training/english/crf-ner)
So, the question is: Does NerCrfApproach rely on the standard tag set for NER labels used by the CoNLL data? Or can I use it for any custom labels and, if yes, do I need to specify these somehow? My assumption was that the labels are inferred from the training data.
Cheers,
Martin
Update: The issue might not be related to the labels after all. I tried replacing my custom labels with the standard CoNLL labels, and I am still not getting the expected classification results.

As it turns out, the issue was not caused by the labels but by the size of the dataset. I was using a rather small dataset for development purposes. Not only was it quite small, it was also heavily imbalanced, with far more "O" labels than anything else. After switching to a dataset about 10x the original size (in terms of sentences), I get meaningful results, even for my custom labels.
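For reference, here is a minimal sketch of the training setup, following the workshop notebook; the file path, embeddings model, and epoch count are placeholders you would adapt:

    import sparknlp
    from sparknlp.training import CoNLL
    from sparknlp.annotator import NerCrfApproach, WordEmbeddingsModel

    spark = sparknlp.start()

    # readDataset() parses the CoNLL file into document, sentence, token,
    # pos and label columns; the label set (B-Prod, I-Prod, O, ...) is
    # taken from the file itself, not from any fixed tag list.
    training_data = CoNLL().readDataset(spark, "path/to/prod_train.conll")

    embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    ner = NerCrfApproach() \
        .setInputCols(["sentence", "token", "pos", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setMaxEpochs(20)

    model = ner.fit(embeddings.transform(training_data))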

I want to create custom labels alongside the standard CoNLL labels for my project. I could use some help on how to proceed, and any materials you could point me to.

Related

Recognizing multiple objects in an image with convolutional neural networks

I've seen quite a few CNN code examples for identifying images, but they generally assume a 1-to-1 input-to-target relationship (like the MNIST handwritten digits set), and most seem to use similar image dimensions (in pixels) for the input image and the training images.
So: what is the usual approach for identifying multiple objects in one image (like several people, or any other relatively complex scene)? I've seen it done often enough, but haven't seen the design approaches discussed. Does this require some kind of preprocessing, or can a CNN handle it directly?
I would say the best-known family of techniques for retrieving multiple objects from an image is the Detection family.
With Detection, the basic idea is to generate one or more Proposal windows of different sizes and aspect ratios within the image, either algorithmically or at random.
For each Proposal window, the Classification algorithm is then run to reveal what that specific area of the image represents.
The next step is usually a Merge process that combines all neighbouring areas into a single classification output.
Note: A None class is often also used to represent an area with no specific class found.
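To make the idea concrete, here is a toy sliding-window sketch in Python; `classify` stands in for any trained classifier mapping a crop to a (label, score) pair, and the window sizes and stride are arbitrary:

    import numpy as np

    def proposals(image, sizes=((64, 64), (128, 128)), stride=32):
        # Generate proposal windows of several sizes and positions.
        h, w = image.shape[:2]
        for win_h, win_w in sizes:
            for top in range(0, h - win_h + 1, stride):
                for left in range(0, w - win_w + 1, stride):
                    yield top, left, win_h, win_w

    def detect(image, classify, none_class="none"):
        # Classify every proposal window and keep the non-"none" hits.
        hits = []
        for top, left, win_h, win_w in proposals(image):
            label, score = classify(image[top:top + win_h, left:left + win_w])
            if label != none_class:
                hits.append((label, score, (top, left, win_h, win_w)))
        # A real pipeline would now merge overlapping hits (e.g. with
        # non-maximum suppression) into one output per object.
        return hits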

Recognition of images with additional data

Good morning everyone. First, I would like to make it clear that I took my first steps in machine learning only yesterday.
I've read most of the basic material and attended some presentations.
In a few months I will take part in a project where this technology will be applied.
As a beginner I would like to ask a question that I think is silly, but I could not find an answer to it.
In presentations and articles, I have seen classifiers that can classify images or tabular data sets, but never both at the same time.
For example, take the Iris flower data set, which is often used as an example. In this data set we have the characteristics of the flowers, such as petal width, but no visual representation of them. Is it possible to combine both and, for example, estimate the petal width from a given image?
I imagine this is a very basic question, but I could not find anything suitable for a beginner.
I would be very grateful.
Machine learning models always work on abstract data items: vectors, points in multidimensional spaces, etc. For simplicity, let us assume for a moment that ML algorithms work on vectors. Classification is then the task of assigning a label y to a vector x of n features.
With a tabular data set, converting the values in a row into a vector is relatively easy; you have to somehow convert text to numbers (or vice versa), but that is a standard procedure.
With images it is different. You now have to build an ML-suitable representation of the image. In other words, you need to create features (e.g., numerical ones) describing the image that you can later use as inputs to your ML algorithm.
Examples of such features are colour histograms, average brightness, number of edges, various convolutions, etc. There can also be more complicated, semantic features, like the presence of a human in the picture; calculating those, however, is much more difficult.
Summing up: you can build a classifier on both the images and the tabular data, but it basically means transforming both into a set of features.
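As a toy illustration of such features, here is a NumPy-only sketch; the bin count and the brightness scaling are arbitrary choices:

    import numpy as np

    def image_features(image):
        # image: H x W x 3 uint8 RGB array.
        # Features: a 16-bin histogram per colour channel + average brightness.
        features = []
        for channel in range(3):
            hist, _ = np.histogram(image[:, :, channel],
                                   bins=16, range=(0, 256), density=True)
            features.extend(hist)
        features.append(image.mean() / 255.0)
        return np.array(features)

    dummy = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
    print(image_features(dummy).shape)  # (49,): 3 channels x 16 bins + brightness

The resulting vector can then be concatenated with any tabular features and fed to an ordinary classifier.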

Mood classification using libsvm

I want to apply an SVM to an audio data set. I am extracting different features from the speech signal. After reducing the dimension of this matrix, I still get features in matrix form. Can anyone help me with the data formatting?
Should I convert the feature matrix into a row vector? Can I assign the same label to each row of one feature matrix and another label to the rows of the other matrix?
The question is a little ambiguous, but let me try to address your problem. For feature selection, you can use filter methods, wrapper methods, etc. One popular method for dimensionality reduction is principal component analysis. Once you have selected your features, you can feed them directly to the classifier. In your case, I guess you are getting a lower-dimensional representation of your training data (for example, if you have used SVD). If so, that's fine; you can now use it for SVM classification.
What do you mean by adding a label to the feature matrix? You add labels to the training instances, not to the features. I guess you are talking about a separate matrix for each of the class labels. If that is the case, yes, you can do that, but remember it depends on the model design.
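Here is a small sketch of that workflow, using scikit-learn rather than the libsvm command-line tools; the shapes and labels are made up. Per-class feature matrices are stacked row-wise, each row gets its class's label, PCA reduces the dimensionality, and the SVM trains on the result:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    # Hypothetical per-class feature matrices: one row per utterance.
    X_happy = np.random.randn(100, 120)
    X_sad = np.random.randn(100, 120)

    # Stack row-wise; every row of a matrix gets that class's label.
    X = np.vstack([X_happy, X_sad])
    y = np.concatenate([np.zeros(100), np.ones(100)])

    X_reduced = PCA(n_components=20).fit_transform(X)  # dimensionality reduction

    clf = SVC(kernel="rbf")
    clf.fit(X_reduced, y)   # labels attach to instances (rows), not features
    print(clf.predict(X_reduced[:5]))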

Designing a classifier with minimal image data

I want to train a 3-class classifier with tissue images, but only have around 50 labelled images in total. I can't take patches from the images and train on them, so I am looking for another way to deal with this problem.
Can anyone suggest an approach to this? Thank you in advance.
The question is very broad but here are some recommendations:
It could make sense to generate variations of your input images. Things like modifying contrast, brightness or color, rotating the image, adding noise. But which of these operations, if any, make sense really depends on the type of classification problem.
Generally, the less data you have, the fewer parameters (weights etc.) your model should have. Otherwise it will overfit, meaning that your classifier will classify the training data correctly but nothing else.
You should check for overfitting. A simple method is to split your training data into a training set and a control set. Once you have confirmed that classification is correct for the control set as well, you can do additional training that includes the control set.
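A minimal NumPy sketch of both ideas (the augmentation operations and split sizes are just examples); note that the control set is split off before augmenting, so variants of the same tissue image never land on both sides:

    import numpy as np

    def augment(image, rng):
        # Simple variations: brightness shift, 90-degree rotation, noise.
        variants = [image]
        variants.append(np.clip(image * rng.uniform(0.7, 1.3), 0, 255))
        variants.append(np.rot90(image, k=rng.integers(1, 4)))
        variants.append(np.clip(image + rng.normal(0, 10, size=image.shape), 0, 255))
        return variants

    rng = np.random.default_rng(0)
    images = [rng.integers(0, 256, (64, 64, 3)).astype(float) for _ in range(50)]

    # Hold out the control set first, then augment only the training part.
    train, control = images[:40], images[40:]
    train_augmented = [v for img in train for v in augment(img, rng)]
    print(len(train_augmented))  # 160 training images from 40 originals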

Class labels in data partitions

Suppose one partitions the data into training/validation/test sets for a classification algorithm, and it happens that the training set does not contain all the class labels present in the complete dataset: say, records with label "x" appear only in the validation set and not in the training set.
Is this a valid partitioning? It can have many consequences: the confusion matrix would no longer be square, and any error evaluated during training would be affected by labels unseen in the training set.
The second question is: is it common for partitioning algorithms to take care of this issue and partition the data so that the training set contains all existing labels?
This is what stratified sampling is supposed to solve.
https://en.wikipedia.org/wiki/Stratified_sampling
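With scikit-learn, for instance, a stratified split is one argument away (toy data below; the rare label "x" is then guaranteed to appear in both partitions):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(20).reshape(-1, 1)
    y = np.array(["x"] * 4 + ["y"] * 16)   # "x" is the rare class

    # stratify=y preserves the class proportions in both partitions,
    # so "x" cannot end up only in the validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)
    print(sorted(set(y_train)), sorted(set(y_val)))  # both contain "x" and "y"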
