After going through many tutorials on YouTube, I could not find an answer.
I have two arff files, one with the actual test results, class is numeric 0-48,
and the other with a '?' as the class.
I've used REPTree with 10-fold cross-validation and got a reasonably low error.
My problem is that I don't understand how to use Weka to apply the model trained on this set to the "unpredicted" data that I have.
The training set consists of users that answered an online survey, and the other file is users who did not answer the survey.
Here is a screenshot of the actual set up I have.
Thank you very much!!
Is it necessary to repeat similar template data, where the meaning and context are the same but the smaller details vary? If I remove these redundancies, the dataset is very small (in the hundreds), but if data like this is included, it easily runs into the thousands. Which is the right approach?
SAMPLE DATA
This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about how the emails (or whatever your data is) will look in real-life usage: do you want to detect any kind of spam, or just spam similar to what your sample data shows? If the former, your dataset is not suited to the problem because the samples are not varied enough. When you think about it, all of the sentences are essentially the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So the information is almost the same. And since every input sample runs through the network multiple times (once per epoch), it doesn't really help to have almost the same sample multiple times.
So you shouldn't have one kind of almost identical data sample dominating your dataset.
But as I said: if you primarily want to filter out "Dear customer, we wish you a ..." you can try it with this dataset, but you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should look for a new dataset, since ~100 unique samples are not enough. I hope that was helpful!
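To get a feel for how much of the dataset is really the same template, one option is to collapse near-duplicates before training. The following is only a minimal sketch: it assumes whitespace tokenization and a Jaccard-overlap threshold of 0.8, both of which are arbitrary choices, and the function names and example messages are made up for illustration.

```python
def token_set(text):
    """Lowercased set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard overlap between two token sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(samples, threshold=0.8):
    """Keep a sample only if it is not too similar to one already kept.

    The threshold is an assumption; tune it on your own data.
    """
    kept = []
    kept_tokens = []
    for sample in samples:
        tokens = token_set(sample)
        if all(jaccard(tokens, seen) < threshold for seen in kept_tokens):
            kept.append(sample)
            kept_tokens.append(tokens)
    return kept


# Hypothetical messages: the first two differ only in the company name.
messages = [
    "Dear customer, we wish you a happy new year from Acme",
    "Dear customer, we wish you a happy new year from Globex",
    "Your invoice is attached",
]
unique_messages = drop_near_duplicates(messages)
```

If the list shrinks to a handful of entries, that is a sign the dataset mostly repeats one template and more varied samples are needed.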
After searching questions on SO and reddit, I can't figure out how to train a multiple-input, multiple-output classifier with an ML Text Classifier. I can train a single-input, single-output text classifier, but that doesn't fit my use case.
Any help would be appreciated. I understand that there's no code to post, and that this is sort of a "show me how" question, but this information seems not readily available via searching and elsewhere, and would be beneficial to the community.
The classifier objects provided by Core ML (and Create ML) are for very specific use cases. If you try to do anything more advanced than that, you'll have to create a custom model, such as your own neural network.
I have to create a system that generates all possible question-answer pairs from unstructured text in a specific domain. Many questions may have the same answer, but the system should generate all possible types of questions that an answer can have. The questions formed should be meaningful and grammatically correct.
For this purpose, I used nltk and trained an NER model, creating entities according to my domain, and then I created some rules to identify the question word using the combination of NER-identified entities and POS-tagged words. But this approach isn't working well, as I am not able to create meaningful questions from the text. Moreover, some question words are wrongly identified and some are missed. I also read research papers on using RNNs for this purpose, but I don't have much training data since the domain is pretty small. Can anyone suggest a better approach?
I am working on what is to me a very new domain in data science and would like to know if anyone can suggest any existing academic literature that has relevant approaches that address my problem.
The problem setting is as follows:
I have a set of named topics (about 100 topics). We have a document tagging engine that tags documents (news articles in our case) based on their text with up to 5 of these 100 topics.
All this is done using fairly rudimentary similarity metrics (each topic is a text vector and so is each document and we do a similarity between these vectors and assign the 5 most similar topics to each document).
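For concreteness, the similarity-based tagging described above can be sketched roughly as follows. This is an illustration under assumptions, not our actual engine: it uses raw term counts and cosine similarity, and the topic names and texts in `TOPICS` are invented.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term counts over lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def tag_document(doc_text, topic_texts, k=5):
    """Return up to k topic names, most similar first, skipping zero scores."""
    doc_vec = bag_of_words(doc_text)
    scored = [
        (cosine(doc_vec, bag_of_words(text)), name)
        for name, text in topic_texts.items()
    ]
    # Sort by descending score, breaking ties alphabetically by topic name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]


# Hypothetical topics, each represented by a short descriptive text.
TOPICS = {
    "economy": "inflation interest rates central bank markets",
    "sports": "football match league goal team",
    "technology": "software chip startup smartphone ai",
}

tags = tag_document("The central bank raised interest rates", TOPICS, k=2)
```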
We are looking to improve the quality of this process but the constraint is we have to maintain the set of 100 named topics which are vital for other purposes so unsupervised topic models like LDA are out because:
1. They don't provide named topics
2. Even if we are able to somehow map distributions of topics output by LDA to existing topics, these distributions will not remain constant and vary with the underlying corpus.
So could anyone point me towards papers that have worked with document tagging using a finite set of named topics?
There are 2 challenges here:
1. Given a finite set of named topics, how do we tag new documents with them? (this is the bigger, more obvious challenge)
2. How do we keep the topics updated with the changing document universe?
Any work that addresses one or both of these challenges would be a great help.
P.S. I've also asked this question on Quora if anyone else is looking for answers and would like to read both posts. I'm duplicating this question as I feel it is interesting and I'd like to get as many people talking about this problem as possible and as many literature suggestions as possible.
Same Question on Quora
Have you tried classification?
Train a classifier for each topic.
Tag with the 5 most likely classes.
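A minimal sketch of that idea, assuming you have some documents already labeled with their topics. To keep it dependency-free it uses a tiny hand-written binary naive Bayes per topic; in practice you would likely swap in a classifier from an ML library. The class and function names and the toy data are hypothetical.

```python
import math
from collections import Counter


class TopicClassifier:
    """Binary naive Bayes: does a document belong to one topic or not?"""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, docs, labels):
        for text, is_topic in zip(docs, labels):
            self.doc_counts[is_topic] += 1
            self.word_counts[is_topic].update(text.lower().split())
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_odds(self, text):
        """log P(topic | text) - log P(not topic | text), Laplace-smoothed."""
        total_docs = self.doc_counts[True] + self.doc_counts[False]
        score = 0.0
        for label, sign in ((True, 1.0), (False, -1.0)):
            total_words = sum(self.word_counts[label].values())
            log_prob = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in text.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[label][word]
                    log_prob += math.log(
                        (count + 1) / (total_words + len(self.vocab))
                    )
            score += sign * log_prob
        return score


def train_per_topic(docs, doc_topics, topic_names):
    """Train one binary classifier per topic (one-vs-rest)."""
    return {
        name: TopicClassifier().fit(docs, [name in topics for topics in doc_topics])
        for name in topic_names
    }


def tag(text, classifiers, k=5):
    """Return the k topics whose classifiers score the text highest."""
    ranked = sorted(
        classifiers, key=lambda name: classifiers[name].log_odds(text), reverse=True
    )
    return ranked[:k]


# Toy training data; each document may carry several topics.
docs = [
    "bank raises interest rates",
    "team wins football match",
    "new smartphone chip released",
    "bank funds chip startup",
]
doc_topics = [{"economy"}, {"sports"}, {"technology"}, {"economy", "technology"}]
classifiers = train_per_topic(docs, doc_topics, ["economy", "sports", "technology"])
```

With 100 topics you would train 100 such classifiers and call `tag(text, classifiers, k=5)`. Because the topic set is fixed by the labels, the names stay stable, and retraining on fresh labeled documents is one way to keep up with a changing corpus.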
So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done into it, I need a training data set to be able to show RapidMiner how I want things classified. So I need a training data set, though given the categories I wanted to classify things into, I might have to create my own.
So my questions are these:
1) Is there a training data set for twitter data that focuses more on the topic of the tweet as opposed to a sentiment analysis publicly available?
2) If there isn't one publicly available, how can I create my own? My idea to do it was to go through the tweets themselves and associate the tokens and n-grams with the categories I want. Some concerns I have with that are that I won't be able to manually classify enough tweets to create a training data set comprehensive enough so that I can get a good accuracy rate for the automatic classifier.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topics or categories are in order to identify one. The fact that this is Twitter may be relevant, but the data source is likely to matter much less for the classification accuracy you achieve than the topics do. So if you take the well-known 20 newsgroups data set, it is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs dogs or Android vs iPhone, you need to find a data set for that.
In most cases you will have to create initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler like a keyword search to create subsets of your tweets for which you know they deal with a particular category. Then you create the model on top of that and hope that it generalizes to identify the same categories even though the original keywords do not occur.
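The keyword workaround could look roughly like this. It is only a sketch: the categories and seed keywords are invented, and tweets matching zero or several categories are simply left unlabeled, which is one possible design choice among others.

```python
import re

# Hypothetical seed keywords per category; replace with your own.
SEED_KEYWORDS = {
    "sports": {"football", "nba", "olympics"},
    "politics": {"election", "senate", "parliament"},
}


def weak_label(tweet, seed_keywords=SEED_KEYWORDS):
    """Return a category if exactly one category's keywords match, else None."""
    tokens = set(re.findall(r"[a-z0-9']+", tweet.lower()))
    matches = [
        category for category, words in seed_keywords.items() if tokens & words
    ]
    return matches[0] if len(matches) == 1 else None


def build_training_set(tweets):
    """Keep only the tweets that received an unambiguous keyword label."""
    labeled = []
    for tweet in tweets:
        label = weak_label(tweet)
        if label is not None:
            labeled.append((tweet, label))
    return labeled


tweets = [
    "Senate votes on new bill",
    "Football fans protest at parliament",  # matches two categories, skipped
    "Lovely weather today",                 # matches nothing, skipped
    "NBA finals start tonight",
]
training_set = build_training_set(tweets)
```

When training on the result, it may help to remove the seed keywords from the features, so the model has to rely on the surrounding words rather than just relearning the keyword rule.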
Alternatively, depending on your application (and if you actually want to build an application), you may as well start with only a small data set and accept that you have poor classification. Then you generate classifications, show them to the users of your app, and collect some form of explicit or implicit feedback on the classification (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify the topics, you may want to try clustering as opposed to classification. Just create a few clusters and look at the top words for each cluster. They may well be quite dissimilar and describe what the respective clusters are about.
I believe your third question may be a bit broad for Stack Overflow and is probably better answered by a textbook.