In ML, when using an RNN for an NLP project, is data redundancy necessary? - machine-learning

Is it necessary to repeat similar template data? That is, the meaning and context are the same, but the smaller details vary. If I remove these redundancies, the dataset is very small (size in the hundreds), but if data like this is included, it easily crosses into the thousands. Which is the right approach?
SAMPLE DATA

This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about how the emails (or whatever your data is) will look in real-life usage: do you want to detect any kind of spam, or only spam similar to what your sample data shows? In the first case, your dataset is simply not suited to the problem, since it does not contain enough varied samples. When you think about it, the sentences are all effectively the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So the information in each sample is almost identical. And since every input sample runs through the network multiple times anyway (once per epoch), having almost the same sample multiple times doesn't really help.
So you shouldn't have one kind of almost identical sample dominating your dataset.
But as I said: if you primarily want to filter out "Dear customer, we wish you a ..." you can try it with this dataset, although you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should look for a new dataset, since ~100 unique samples are not enough. I hope that was helpful!
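To see how much of the dataset is really the same template, one option is to collapse near-duplicates before training. The following is only a minimal sketch: the sample sentences, the helper names and the 0.8 similarity threshold are assumptions for illustration, not taken from the asker's data.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def drop_near_duplicates(samples, threshold=0.8):
    """Keep a sample only if it is not too similar to any sample already kept."""
    kept = []
    for sample in samples:
        candidate = normalize(sample)
        is_new = all(
            SequenceMatcher(None, candidate, normalize(other)).ratio() < threshold
            for other in kept
        )
        if is_new:
            kept.append(sample)
    return kept


# Hypothetical samples: three fillings of one template plus one different message.
samples = [
    "Dear customer, we wish you a happy new year. Regards, Acme Corp",
    "Dear customer, we wish you a happy new year. Regards, Globex Inc",
    "Dear customer, we wish you a happy new year. Regards, Initech",
    "Your invoice for March is attached. Please pay within 30 days.",
]
unique = drop_near_duplicates(samples)
print(len(samples), "->", len(unique))
```

The pairwise comparison is quadratic in the number of samples, which is fine for a few thousand short texts but would need a different approach for much larger sets.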

Related

Practical advice on dealing with very long inputs using LSTM model?

I built a character-level LSTM model on text data, but ultimately I'm looking to apply this model to very long text documents (such as a novel), where it's important to understand contextual information, such as where in the novel a given passage occurs.
For these large-scale NLP tasks, is the data usually cut into smaller pieces and concatenated with metadata - such as position within the document, detected topic, etc. - to be fed into the model? Or are there more elegant techniques?
Personally, I have not used LSTMs at the level of depth you are trying to attain, but I do have some suggestions.
One solution, which you mentioned above, is simply to split your document into smaller pieces and analyze each piece separately. You'll probably have to be creative.
Another solution that I think might be of interest to you is to use a Tree LSTM model to get that level of depth. Here's the link to the paper. Using the tree model, you could feed in individual characters or words at the lowest level and then feed the results upward to higher levels of abstraction. Again, I am not completely familiar with the model, so don't take my word for it, but it could be a possible solution.
Adding a few more ideas to the answer by bhaskar, all of which are used to handle this problem.
You can use an attention mechanism, which is used to deal with long-term dependencies. For a long sequence, the LSTM will certainly forget some information, or its next prediction may not depend on all of the sequence information held in its cell. An attention mechanism helps find reasonable weights for the characters the prediction depends on. For more info you can check this link.
There is potentially a lot of research on this problem. This is a very recent paper on it.
You can also break up the sequence and use a seq2seq model, in which the encoder maps the features into a low-dimensional space and the decoder then extracts them. This is a short article on this.
My personal advice is to break up the sequence and then train on the pieces, because a sliding window over the complete sequence is pretty much able to find the correlations between the pieces.
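As a rough illustration of the "break the sequence" idea combined with the position metadata mentioned in the question, here is a minimal sketch of an overlapping sliding window. The function name, the window and stride sizes, and the choice of relative start position as the metadata are all assumptions; how that position is then fed to the model is left open.

```python
def sliding_windows(text, window=200, stride=100):
    """Yield (chunk, relative_position) pairs that together cover the whole text.

    relative_position is the chunk's start offset divided by the text length,
    so it lies in [0, 1) and can be used as a simple "where in the document" feature.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not text:
        return
    last_start = max(len(text) - window, 0)
    start = 0
    while True:
        yield text[start:start + window], start / len(text)
        if start >= last_start:
            break
        # Clamp the final step so the last window ends exactly at the end of the text.
        start = min(start + stride, last_start)


# Tiny stand-in for a long document.
novel = "abcdefghij"
pieces = list(sliding_windows(novel, window=4, stride=3))
for chunk, position in pieces:
    print(f"{position:.1f} {chunk}")
```

A stride smaller than the window makes consecutive chunks overlap, so context that straddles a chunk boundary appears intact in at least one chunk.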

What is the amount of training data needed for additional Named Entity Recognition with spaCy?

I'm using the spaCy module to find named entities in input text. I am training the model to predict medical terms. I currently have access to 2 million medical notes, and I wrote a program that annotates them.
I cross reference the medical notes against a pre-defined list of ~90 thousand terms, which is used for the annotation task. At the current pace of annotation, it takes about an hour and a half to annotate 10,000 notes. The way that annotation currently works, I end up with about 90% of the notes having no annotations (I'm currently working on getting a better list of cross-reference terms), so I take the ~1000 annotated notes and train the model on these.
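The asker's annotation program is not shown, so the following is only a sketch of what dictionary-based annotation of this kind could look like: one compiled pattern built from the term list, producing character-offset spans of the form (start, end, label). The term list, the label name and the helper names are hypothetical, and the word-boundary matching assumes every term starts and ends with a letter or digit. Whether this would be faster than the existing program is not known.

```python
import re


def build_matcher(terms):
    """Compile one case-insensitive pattern for all terms.

    Longer terms come first so that "atrial fibrillation" wins over "fibrillation".
    """
    ordered = sorted(set(terms), key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def annotate(text, matcher, label="MEDICAL_TERM"):
    """Return (start, end, label) character spans for every term found in text."""
    return [(m.start(), m.end(), label) for m in matcher.finditer(text)]


# Hypothetical stand-ins for the ~90 thousand terms and one medical note.
terms = ["tachycardia", "atrial fibrillation", "fibrillation"]
matcher = build_matcher(terms)
note = "Patient presents with tachycardia and atrial fibrillation."
spans = annotate(note, matcher)
print(spans)
```

Notes for which `annotate` returns an empty list correspond to the roughly 90% of notes that end up without annotations.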
I have checked, and the model sort of responds to annotated terms that it has seen (for example, the term tachycardia has been seen before in the annotations, and the model will sometimes pick it up when the term shows up in the text).
This background might not be too relevant to my particular question, but I thought I would give a small bit of background to my current position.
I was wondering if anyone who has successfully trained a new entity in spaCy could give me some insight into their personal experience in the amount of training that was necessary to have at least somewhat reliable entity recognition.
Thanks!
I trained the Named Entity Recognizer for the Greek language from scratch because no data was available, so I will try to give you a summary of the things I noticed in my case.
I trained the NER with Prodigy annotation tool.
The answer to your question from my personal experience depends on the following things:
The number of labels you want your recognizer to be able to predict. It makes sense that when the number of labels (possible outputs) increases, it gets more difficult for your neural network to distinguish them, so the amount of data you need increases.
How different the labels are. For example, the GPE and LOC tags are quite close and are often used in the same context, so the neural network confused them a lot at the beginning. It is advisable to provide more data for labels that are close to each other.
The way of training. Pretty much there are two possibilities here:
Fully annotated sentences. This means that you tell your neural network that there are no missing tags in your annotations.
Partially annotated sentences. This means that you tell your neural network that your annotations are correct, but that some tags are probably missing. This makes it harder for the network to rely on your data, and for this reason more data needs to be provided.
Hyper-parameters. It is really important to fine tune your network in order to get the maximum out of your dataset.
The quality of the dataset. That means that if the dataset is representative of the things that you are going to ask your network to predict less data is required. However, if you are building a more general neural network (that would answer correctly in different contexts), more data is needed for that.
For the Greek model, I tried to predict among 6 labels that were distinct enough, I provided around 2000 fully annotated sentences and I spent a great amount of time fine-tuning.
Results: 70% F-measure, which is quite good for the complexity of the task.
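For reference, the F-measure is the harmonic mean of precision and recall. The answer does not say how it was computed, so the sketch below assumes the common entity-level convention where a predicted span counts as correct only if its start, end and label all match a gold span exactly. The spans are made up for illustration.

```python
def span_prf(gold, predicted):
    """Entity-level precision, recall and F1 using exact (start, end, label) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted spans for one sentence.
gold = {(0, 6, "PERSON"), (10, 16, "GPE"), (20, 27, "ORG"), (30, 34, "LOC")}
predicted = {(0, 6, "PERSON"), (10, 16, "LOC"), (20, 27, "ORG")}
precision, recall, f1 = span_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Note that the span predicted as LOC instead of GPE counts against both precision and recall, which is one reason closely related labels hurt the score.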
Hope it helps!

What do you do when you have an ML model that works, but does not have good results?

Sorry if this has been asked before, I have tried looking online but maybe I don't know the proper terminology because I mostly find results that try to address overfitting by splitting the data set.
So when my models get stuck at around 30% accuracy on the validation data and refuse to improve, my strategies tend to be changing the number of nodes per layer, the batch size, or the number of epochs. Sometimes this is helpful, but other times it doesn't seem to do much at all.
What do people usually do in this situation?
I'd like to help with your question. You probably are working on a classification task. Could you please specify the following properties of your dataset: number of samples, number of features, types of features (numerical, categorical, etc).

Twitter data topical classification

So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done into it, I need a training data set to be able to show RapidMiner how I want things classified. So I need a training data set, though given the categories I wanted to classify things into, I might have to create my own.
So my questions are these:
1) Is there a publicly available training data set for Twitter data that focuses on the topic of the tweet rather than on sentiment analysis?
2) If there isn't one publicly available, how can I create my own? My idea to do it was to go through the tweets themselves and associate the tokens and n-grams with the categories I want. Some concerns I have with that are that I won't be able to manually classify enough tweets to create a training data set comprehensive enough so that I can get a good accuracy rate for the automatic classifier.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topic or categories are to identify it. The fact that this is Twitter may be relevant, but the data source is likely to be much less relevant to the classification accuracy you will achieve than the topic is. So if you take the infamous 20 newsgroups data set this is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs dogs or Android vs iPhone you need to find a data set for that.
In most cases you will have to create initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler like a keyword search to create subsets of your tweets for which you know they deal with a particular category. Then you create the model on top of that and hope that it generalizes to identify the same categories even though the original keywords do not occur.
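The keyword workaround can be sketched outside RapidMiner as well. In the minimal example below, the categories, the keyword sets and the function name are hypothetical. A tweet gets a label only when exactly one category's keywords match, so ambiguous tweets stay unlabeled rather than adding noise to the initial training set.

```python
import re

# Hypothetical seed keywords per category.
KEYWORDS = {
    "sports": {"match", "league", "goal"},
    "politics": {"election", "senate", "vote"},
}


def weak_label(tweet, keywords=KEYWORDS):
    """Return the single matching category, or None if zero or several categories match."""
    tokens = set(re.findall(r"[a-z']+", tweet.lower()))
    hits = [category for category, words in keywords.items() if tokens & words]
    return hits[0] if len(hits) == 1 else None


tweets = [
    "Late goal decides the league title",
    "Senate vote delayed until Monday",
    "Vote for your goal of the season",
    "Sunny weather expected this weekend",
]
labeled = [(t, weak_label(t)) for t in tweets]
training_set = [(t, label) for t, label in labeled if label is not None]
print(training_set)
```

A model trained on such data should ideally not see the seed keywords as its only signal, otherwise it just relearns the keyword search instead of generalizing beyond it.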
Alternatively, depending on your application (and if you actually want to build an application), you may as well start with only a small data set and accept that your classification will be poor at first. Then you generate classifications, show them to the users of your app, and collect some form of explicit or implicit feedback on the classification (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify the topics, you may want to try clustering as opposed to classification. Just create a few clusters and look at the top words for each cluster. They may well be quite dissimilar and describe what the respective clusters are about.
I believe your third question may be a bit broad for Stack Overflow and is probably better answered by a textbook.

Parsing nonuniform data

I am trying to parse a collection of data that has two (or one) useful pieces, but may be organized in many different ways:
V01C01
Vol 1 Chapter 1
Chapter 1 Volume 1 - Alt title
V1.1
etc.
I don't want to use a massive collection of regexs, because there is no way to predict all of the combinations of how things will be organized (also some will have extraneous text). I feel like there is a branch of machine learning that may be perfect for this, but I'm not experienced in it enough to know.
Well that is an interesting problem for sure and there are a couple of things you could try.
Assuming that you don't have labels on your data, the first thing I would try is to check the connections between instances using a clustering algorithm like k-means (http://en.wikipedia.org/wiki/K-means_clustering). Keep in mind that this won't solve your problem, but it will help you explore your data and hopefully find a set of features with which to train a supervised classifier.
If you do have labels on your data, or you can manually tag your set, then you are facing a more manageable problem. At first glance, it looks a lot like a text or document classification problem (like classifying emails as Spam/NoSpam), in which case a naive Bayes classifier could be a good first attempt, since it is an easy algorithm to implement and can give reasonably good results.
About the naive Bayes classifier: https://www.bionicspirit.com/blog/2012/02/09/howto-build-naive-bayes-classifier.html
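To give an idea of how little code a naive Bayes classifier needs, here is a minimal multinomial version with add-one smoothing. Everything specific in it is an assumption made for illustration: the two labels, the hand-tagged examples, and the choice of character bigrams (with digits mapped to "0") as features. With only a handful of examples it demonstrates the mechanics, not a usable model.

```python
import math
from collections import Counter, defaultdict


def features(text):
    """Character bigrams of the lowercased string, with every digit mapped to '0'."""
    s = "".join("0" if c.isdigit() else c for c in text.lower())
    return [s[i:i + 2] for i in range(len(s) - 1)]


class NaiveBayes:
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(features(text))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(count / total)  # log prior
            for f in features(text):
                score += math.log((counts[f] + 1) / denominator)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-tagged examples and labels.
texts = ["V01C01", "Vol 1 Chapter 1", "Chapter 1 Volume 1 - Alt title", "V1.1",
         "Chapter 12", "Ch. 3", "ch 7 - Epilogue"]
labels = ["volume_and_chapter"] * 4 + ["chapter_only"] * 3
model = NaiveBayes().fit(texts, labels)
print(model.predict("Vol 2 Chapter 5"), model.predict("Ch. 4"))
```

A classifier like this only says which kind of string it is looking at; extracting the actual volume and chapter numbers would still need a separate step.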
I made some assumptions here, and I might be wrong because of them. If you clarify some points (such as whether you are able to manually tag the data), we will be able to help you further.

Resources