How to include information about labels in a multilabel classification task - machine-learning

Currently, I'm working on a multilabel classification problem for a shared task in NLP. I have quite a few labels, and for each label I have a short paragraph defining it. I was wondering if there is some way I can include that label information in a multilabel classification pipeline.
Up until now, I've tried prompt-learning, designing a prompt that includes that paragraph, but I haven't obtained good results. My best shot so far has been a fine-tuned RoBERTa model, and I thought that if I could somehow include those label definitions in the pipeline, I could obtain better results, since the underlying language model could extract more information from them.
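Concretely, one direction I have in mind is to pair each input text with each label's definition and let the model score every pair independently, roughly like the sketch below. This is only a rough sketch: the model name, the threshold, and the label definitions are placeholders, and the model would still need to be fine-tuned on such (text, definition) pairs.

```python
# Rough sketch: score each (text, label definition) pair with a RoBERTa
# cross-encoder and treat every label as an independent binary decision.
# Model name, threshold, and definitions are placeholders for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

label_definitions = {
    "label_a": "A short paragraph defining label A ...",
    "label_b": "A short paragraph defining label B ...",
}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2  # relevant / not relevant for this label
)
# In practice the model would first be fine-tuned on (text, definition) pairs.

def predict_labels(text, threshold=0.5):
    predicted = []
    for label, definition in label_definitions.items():
        # The text and the label definition are encoded together as a
        # sentence pair, separated by the model's pair separator token.
        inputs = tokenizer(text, definition, truncation=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        prob = torch.softmax(logits, dim=-1)[0, 1].item()
        if prob >= threshold:
            predicted.append(label)
    return predicted
```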
Thanks in advance! Cheers.

Related

I'm building a neural network for post-processing OCR text. Are convolutional layers a good choice?

From typified documents such as receipts and invoices, relevant information is extracted with OCR and templates. Afterwards, a person has to visually validate that the information was correctly identified and manually adjust it where needed. My task is to build a model that does this validation. I'm thinking of convolutional and pooling layers, with the input being the images, the coordinates of the bounding boxes where the extracted text was found, the extracted text, and the correct text. The goal is to train the network to automatically make the corrections where needed, based on correctly labelled training material. The project is in the design phase. I'm interested in insights regarding the input data or the layers. Thank you in advance and have a nice day.
Your intention of building a model to do validation after the OCR processes the typified documents seems a bit convoluted, since the OCR should already be doing what you are hoping to accomplish with your model.
Also, it seems unlikely that a single model could both correct the OCR-extracted text and validate other factors like the bounding boxes.
Perhaps you are looking to train individual models for each of these use cases.
I would advise you to simplify your objective for this new model to something like correcting wrongly interpreted text after the OCR converts the image to text.
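If you narrow the scope that way, one common framing is sequence-to-sequence correction over (OCR output, ground-truth text) pairs. The sketch below illustrates that framing with a small pretrained T5 model from the transformers library; the model choice, the "correct:" prefix, and the toy example are assumptions for illustration only, not something the question prescribes.

```python
# Illustrative sketch: OCR post-correction framed as seq2seq.
# Pairs of (noisy OCR text, corrected text) are used to fine-tune a small
# pretrained encoder-decoder model. Model name and hyperparameters are
# placeholders chosen for the example.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Toy training pair: noisy OCR output -> ground truth from the validated data.
noisy = "Inv0ice t0tal: 1O5.00 EUR"
correct = "Invoice total: 105.00 EUR"

inputs = tokenizer("correct: " + noisy, return_tensors="pt")
labels = tokenizer(correct, return_tensors="pt").input_ids

# One training step (in practice: a DataLoader, many epochs, validation, ...).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
optimizer.zero_grad()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# After training, corrections are generated like this:
model.eval()
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```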

Is there any model/classifier that works best for NLP based projects like this?

I've written a program to analyze a given piece of text from a website and make conclusory classifications as to its validity. The code basically vectorizes the description (taken from the HTML of a given webpage in real-time) and takes in a few inputs from that as features to make its decisions. There are some more features like the domain of the website and some keywords I've explicitly counted.
The highest accuracy I've been able to achieve is with a RandomForestClassifier (>90%). I'm not sure what I can do to make this accuracy better other than incorporating a more sophisticated model. I tried using an MLP, but no set of hyperparameters seems to exceed the previous accuracy. I have around 2000 data points available for training.
Is there any classifier that works best for such projects? Does anyone have any suggestions as to how I can bring about improvements? (If anything needs to be elaborated, I'll do so.)
Any suggestions on how I can improve on this project in general? Should I include the text on a webpage as well? How should I do so? I tried going through a few sites, but the text doesn't seem to be contained in any specific element, whereas the description is easy to obtain from the HTML. Any help?
What else can I take as features? If anyone could suggest any creative ideas, I'd really appreciate it.
You can start by searching for the keyword NLP. The task you are facing is a hot topic among those who study deep learning, and it is called natural language processing.
RandomForest is a classical machine learning algorithm and probably works quite well. Using other classical algorithms might improve your accuracy, or it might not; it's fine to try out other lightweight algorithms if you want.
Deep learning will most likely outperform your current model. Starting from the keyword NLP, you'll find many models, such as Word2Vec, BERT, and so on. You can find code for most of them on GitHub.
One tip: think carefully about whether you can actually train the model. Trying to train BERT from scratch is a crazy thing to do for a beginner, and a lot of work even for an expert. Instead, take a pretrained model and fine-tune it, or just use the pretrained word vectors.
I hope that this works out.
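For example, a light way to get the benefit of pretraining without training anything large yourself is to plug pretrained sentence embeddings into the RandomForest you already have. The sketch below assumes the sentence-transformers library and a particular model name; both are just one possible choice, and the toy data is made up for the example.

```python
# Sketch: replace hand-built vectorization with pretrained sentence
# embeddings, then keep the RandomForest on top. Model name and toy data
# are placeholders for the example.
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier

texts = [
    "Official store of a well-known electronics brand.",
    "Win a free phone now, just enter your card details!",
]
labels = [1, 0]  # 1 = looks valid, 0 = looks dubious

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts)  # one dense vector per description

# Extra hand-crafted features (domain, keyword counts, ...) can simply be
# concatenated to the embeddings before training.
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(embeddings, labels)
print(clf.predict(encoder.encode(["Exclusive deal, limited time, act fast!"])))
```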

CoreML Multiple Input/Multiple Classifier output

After searching questions on SO and reddit, I can't figure out how to train a multiple input, multiple output classifier on an ML Text Classifier. I can train a single input, single output text classifier, but that doesn't fit my use case.
Any help would be appreciated. I understand that there's no code to post, and that this is sort of a "show me how" question, but this information seems not readily available via searching and elsewhere, and would be beneficial to the community.
The classifier objects provided by Core ML (and Create ML) are for very specific use cases. If you try to do anything more advanced than that, you'll have to create a custom model, such as your own neural network.
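To make "custom model" concrete, here is a rough sketch (an assumption for illustration, not Apple's API) of a two-input, two-output classifier defined in PyTorch; such a model could then be traced and converted with coremltools. The vocabulary size, layer sizes, and number of classes per output are placeholders.

```python
# Sketch of a custom multi-input, multi-output classifier that could later
# be converted for Core ML (e.g. by tracing it and passing it to coremltools).
# Embedding sizes, hidden sizes, and class counts are placeholders.
import torch
import torch.nn as nn

class TwoInputTwoOutputClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=64, hidden=128,
                 n_classes_a=4, n_classes_b=3):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, embed_dim)
        self.shared = nn.Sequential(nn.Linear(2 * embed_dim, hidden), nn.ReLU())
        self.head_a = nn.Linear(hidden, n_classes_a)  # first label set
        self.head_b = nn.Linear(hidden, n_classes_b)  # second label set

    def forward(self, tokens_a, tokens_b):
        # Each input is a batch of token-id sequences; EmbeddingBag averages them.
        x = torch.cat([self.embed(tokens_a), self.embed(tokens_b)], dim=-1)
        h = self.shared(x)
        return self.head_a(h), self.head_b(h)

model = TwoInputTwoOutputClassifier()
a = torch.randint(0, 10000, (2, 12))   # batch of 2, 12 token ids each
b = torch.randint(0, 10000, (2, 8))
logits_a, logits_b = model(a, b)
print(logits_a.shape, logits_b.shape)  # (2, 4) and (2, 3)
```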

Automating the rumour identification process

Currently, what we do is check user discussions on social media based on some keywords. If those keywords are detected, we flag the conversation as a possible rumour.
Approaches to automate the process:
Keyword-based: checking the conversation for 1- to 2-gram keywords. If a keyword is present, mark it as a suspected conversation.
Classifier-based: training a classifier on some pre-labelled suspected conversations. Whatever is classified with >50% probability is marked as suspected.
For the second approach I am thinking of a Naive Bayes classifier, and evaluating the results with precision, recall, and F-measure using scikit-learn.
Is there a better approach to this? Or some model that combines both approaches?
There's no reason that the two approaches would be mutually exclusive. If you are going to be identifying keywords anyway, then you could easily extract a feature for machine-learning. And if you are doing machine-learning, you might as well include features that capture what you know about the keywords you have identified.
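For instance, here is a rough sketch of such a combination in scikit-learn: a hand-picked keyword vocabulary used as extra count features next to an ordinary bag of words, feeding a single Naive Bayes classifier and evaluated with precision, recall, and F-measure. The keyword list and the toy data are made up for the example.

```python
# Sketch: keyword features + bag-of-words features feeding one classifier,
# evaluated with precision/recall/F-measure. Keyword list and toy data are
# placeholders for the example.
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

SUSPECT_KEYWORDS = ["hoax", "conspiracy", "share before deleted"]

features = FeatureUnion([
    # Standard 1-2 gram bag of words over the whole conversation.
    ("bow", CountVectorizer(ngram_range=(1, 2))),
    # Counts restricted to the hand-picked rumour keywords (up to 3-grams).
    ("keywords", CountVectorizer(vocabulary=SUSPECT_KEYWORDS, ngram_range=(1, 3))),
])

model = Pipeline([("features", features), ("nb", MultinomialNB())])

texts = [
    "breaking hoax share before deleted",
    "this conspiracy is what they hide from you",
    "meeting moved to 3pm, see agenda",
    "happy birthday, have a great day",
] * 5  # repeat the toy data so the split has enough examples
labels = [1, 1, 0, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```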
Is there a reason that you have chosen a Naive Bayes model? You may want to try a number of models to compare their performance. Your statement about 'identifying the result with precision, recall, F-measure' makes it seem like you don't understand how you make predictions with a machine-learning model. Those three metrics are the result of comparing a model's predictions with 'gold-standard' labels on a number of texts. I would recommend reading through an introduction to machine-learning. If you have already decided that you want to use scikit-learn, then perhaps you could work through their tutorial here. Another python library worth looking into is nltk, which has a free companion book here.
If python is not your preferred language, then there are lots of other options, too. For example, weka is a well-known tool written in java. It has a very user-friendly graphical interface for the basic functions, but it is not difficult to use from the command line as well.
Good luck!

Naive Bayes Classifier Biased Output?

I'm using Emgu CV to implement a machine learning technique in C# to classify the pixels of my image into 3 different categories.
Everything works perfectly so far, but the problem is that it is fully automatic. I want to make it semi-automatic, which means the user can "give weight" to each of those 3 outcomes. This is to give the user the ability to fine-tune the outcome.
Any idea how?
The first thing I can think of is to actually modify the input so that it is biased toward one of the outputs (for example, make it more red by modifying the red channel). But I thought maybe there is a generic way of doing this that I'm not aware of.
Thanks.
Usually you'd do that by adapting the prior probabilities in the classification rule (what you get from the Gaussian distributions is the likelihood), but it seems that the implementation in Emgu CV does not allow you to do that.
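To illustrate the idea in code (using scikit-learn rather than Emgu CV, purely as an assumed stand-in): since the posterior is proportional to likelihood times prior, user-supplied weights can be applied either as class priors at training time or by re-weighting the predicted probabilities afterwards. The data and weights below are made up for the example.

```python
# Illustration in scikit-learn (not Emgu CV): biasing a Naive Bayes
# classifier toward certain classes via priors or post-hoc weights.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(50, 3)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 50)

# Option 1: user weights baked in as class priors (must sum to 1).
user_weights = np.array([0.2, 0.5, 0.3])
biased = GaussianNB(priors=user_weights / user_weights.sum()).fit(X, y)

# Option 2: train normally, then re-weight the posteriors at prediction time.
plain = GaussianNB().fit(X, y)
proba = plain.predict_proba(X) * user_weights
proba /= proba.sum(axis=1, keepdims=True)   # renormalise
reweighted_pred = proba.argmax(axis=1)

print(biased.predict(X[:5]), reweighted_pred[:5])
```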
