Estimating NLP neural net training data set size - machine-learning

I'm working on a simple question/answer solution where answers are returned given a question. I don't have a training set of question-answer pairs. I plan to take a subset of the data, approximately 10 sentences, and manually create 100 question-answer pairs. How can I estimate how many question-answer pairs will be required in order to answer questions from this data?

Related

Apply different transformations on features in machine learning [closed]

I have features like Amount (20-100K USD), Percent1 (i.e. 0-1), and Percent2 (i.e. 0-1). The Amount values are between 20 and 100,000 US dollars, and the percent columns hold decimals between 0 and 1. These features are positively skewed, so I applied a log transformation to Amount and a Yeo-Johnson transformation (using PowerTransformer) to the Percent1 and Percent2 columns.
Is it right to apply different transformations to different columns? Will it have an effect on model performance, or should I apply the same transformation to all columns?
There are some things that need to be known before we can answer the question.
The answer depends on the model you are using. For some models it is better if the ranges of the different inputs are the same; some models are agnostic to that. And of course, sometimes one might also be interested in assigning different priorities to the inputs.
To get back to your question: depending on the model, there might be absolutely no harm in applying different transformations, or there could be performance differences.
For example, linear regression models would be greatly affected by the feature transformation, whereas supervised neural networks most likely wouldn't.
You might want to check this Cross Validated question about the benefits of transformation: https://stats.stackexchange.com/questions/397227/why-feature-transformation-is-needed-in-machine-learning-statistics-doesnt-i
To understand the benefit, consider an equation like f(x1, x2) = w1*x1 + w2*x2, where x1 is on the order of 100,000 (like the amount) and x2 is on the order of 1.0 (like the percent). When you update the weights with gradient descent, the updates look like:

w1 = w1 - lr * x1
w2 = w2 - lr * x2

where lr is the learning rate. Since x1 is about 100,000 times larger than x2, w1 is updated about 100,000 times faster than w2; in effect you are saying the amount feature is far more important than the percent feature. That is why one usually transforms the features onto a comparable scale: so that no feature dominates another simply because of its units.
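If it helps, per-column transformations are straightforward to express with scikit-learn's ColumnTransformer. A minimal sketch, assuming a pandas DataFrame with the question's three columns and a made-up binary target `y`:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the question's data: skewed dollar amounts
# and two skewed percentage columns.
df = pd.DataFrame({
    "Amount":   np.random.uniform(20, 100_000, 500),
    "Percent1": np.random.beta(2, 5, 500),
    "Percent2": np.random.beta(2, 5, 500),
    "y":        np.random.randint(0, 2, 500),
})

# A different transformation per column: log1p for the dollar amount,
# Yeo-Johnson for the two percentage columns.
preprocess = ColumnTransformer([
    ("log_amount", FunctionTransformer(np.log1p), ["Amount"]),
    ("yeo_pcts",   PowerTransformer(method="yeo-johnson"),
                   ["Percent1", "Percent2"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df[["Amount", "Percent1", "Percent2"]], df["y"])
```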

Handling high cardinality features with supervised ratio and weight of evidence [closed]

Say a data set has a categorical feature with high cardinality, say zip codes or cities. Encoding this feature would give hundreds of feature columns. Different approaches such as supervised ratio and weight of evidence (WOE) seem to give better performance.
The question is: the supervised ratio and WOE are to be calculated on the training set, right? So I take the training set, calculate the SR and WOE values, update the training set with them, and keep the calculated values to be used on the test set as well. But what happens if the test set has zip codes which were not in the training set, so that there is no SR or WOE value to use? (Practically this is possible if the training data set does not cover all the possible zip codes, or if there are only one or two records for certain zip codes, which might fall into either the training set or the test set.)
(The same will happen with the plain encoding approach as well.)
I am more interested in the question: are SR and/or WOE the recommended way to handle a feature with high cardinality? If so, what do we do when there are values in the test set which were not in the training set?
If not, what are the recommended ways of handling high cardinality features, and which algorithms are more robust to them? Thank you.
This is a well-known problem when applying value-wise transformations to a categorical feature. The most common workaround is to have a set of rules to translate unseen values into values known by your training set.
This can be just a single 'NA' value (or 'others', as another answer suggests), or something more elaborate (e.g. in your example, you could map unseen zip codes to the closest known one in the training set).
Another possible solution in some scenarios is to have the model refuse to make a prediction in those cases and just return an error.
For your second question, there is not really a recommended way of encoding high cardinality features (there are many methods, and some may work better than others depending on the other features, the target variable, etc.); what we can recommend is to implement a few and test which one is most effective for your problem. You can consider the preprocessing method as just another hyperparameter of your learning algorithm.
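A minimal sketch of that rule-based translation, using a single fallback token; the column values and the token name are just for illustration:

```python
import pandas as pd

FALLBACK = "__OTHER__"

def fit_known_categories(train: pd.Series) -> set:
    """Remember the categories present in the training set."""
    return set(train.unique())

def map_unseen(values: pd.Series, known: set) -> pd.Series:
    """Replace categories never seen in training with the fallback token."""
    return values.where(values.isin(known), FALLBACK)

train_zip = pd.Series(["10001", "10002", "10003"])
test_zip  = pd.Series(["10002", "99999"])        # 99999 was never seen

known = fit_known_categories(train_zip)
print(map_unseen(test_zip, known).tolist())      # ['10002', '__OTHER__']
```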
That's a great question, thanks for asking!
When approaching this kind of problem of handling a feature with high cardinality, like zip codes, I keep only the most frequent values in my training set and put all the others into a new category, 'others'; then I calculate their WOE or any other metric.
If an unseen zip code is found in the test set, it falls into the 'others' category. In general, this approach works well in practice.
I hope this naive solution can help you!
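A sketch of that approach with weight of evidence, assuming a binary 0/1 target; the frequency threshold and the 0.5 smoothing are illustrative choices, not a fixed recipe:

```python
import numpy as np
import pandas as pd

def bucket_rare(values: pd.Series, keep: set) -> pd.Series:
    """Map anything outside the kept (frequent) categories to 'others'."""
    return values.where(values.isin(keep), "others")

def fit_woe(cat: pd.Series, y: pd.Series, min_count: int = 20) -> dict:
    """Learn WOE values on the training set, bucketing rare categories."""
    counts = cat.value_counts()
    keep = set(counts[counts >= min_count].index)
    cat = bucket_rare(cat, keep)
    events = y.groupby(cat).sum() + 0.5              # smoothed event counts
    non_events = (1 - y).groupby(cat).sum() + 0.5    # smoothed non-events
    woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
    return {"keep": keep, "woe": woe.to_dict()}

def apply_woe(cat: pd.Series, enc: dict) -> pd.Series:
    """Unseen test categories fall into 'others'; 0.0 = neutral evidence."""
    return bucket_rare(cat, enc["keep"]).map(enc["woe"]).fillna(0.0)
```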

Training and testing data set for text file classification

Suppose we have 10,000 text files and we would like to classify them as political, health, weather, sports, science, education, ...
I need a training data set for classification of text documents, and I am using the Naive Bayes classification algorithm. Can anyone help me find data sets?
OR
Is there another way to get the classification done? I am new at machine learning, so please explain your answer completely.
Example:
**Sentence** -> **Output**
1) Obama won election. -> political
2) India won by 10 wickets -> sports
3) Tobacco is more dangerous -> health
4) Newton's laws of motion can be applied to a car -> science
Is there any way to classify these sentences into their respective categories?
Have you tried to Google it? There are tons and tons of datasets for text categorization. The classical one is Reuters-21578 (https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection); another famous one, mentioned in almost every ML book, is 20 Newsgroups: http://web.ist.utl.pt/acardoso/datasets/
But there are lots of others, one Google query away from you. Just load them, adjust slightly if needed, and train your classifier on those datasets.
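For reference, a Naive Bayes text classifier takes only a few lines with scikit-learn; here the four example sentences from the question stand in for a real training set (a data set like 20 Newsgroups would be plugged in the same way, just with more documents):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Obama won election.",
    "India won by 10 wickets",
    "Tobacco is more dangerous",
    "Newtons laws of motion can be applied to car",
]
train_labels = ["political", "sports", "health", "science"]

# Bag-of-words counts feeding a multinomial Naive Bayes model.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["the election results are in"]))  # likely ['political']
```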

Imbalanced Data for Random ferns [closed]

For a multiclass problem, should the data be balanced for machine learning algorithms such as random forests and random ferns, or is it OK for it to be imbalanced to a certain extent?
The issue with imbalanced classes arises when the disproportion alters the separability of the class instances. But this does not happen in every imbalanced dataset: sometimes the more data you have from one class, the better you can differentiate the scarce data from it, since the extra data makes it easier to find which features are meaningful for creating a discriminating plane (even if you are not using discriminant analysis, the point is still to classify/separate the instances according to their classes).
For example, I remember the KDD Cup 2004 protein classification task, in which one class had 99.1% of the instances in the training set, yet if you tried to use undersampling methods to alleviate the imbalance you would only get worse results. In other words, the large amount of data from the first class helped define the data in the smaller one.
Concerning random forests, and decision trees in general, they work by selecting, at each step, the most promising feature, the one that can partition the set into two (or more) class-meaningful subsets. Having inherently more data about one class does not bias this partitioning by default (= always), but only when the imbalance is not representative of the classes' real distributions.
So I suggest that you first run a multivariate analysis to gauge the extent of imbalance among the classes in your dataset, and then run a series of experiments with different undersampling ratios if you are still in doubt, as in the sketch below.
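To make that last suggestion concrete, here is a sketch of such an undersampling-ratio experiment on synthetic data; all numbers (class ratio, sweep values, metric) are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic 95/5 imbalanced problem standing in for real data.
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y,
                                            random_state=0)

rng = np.random.default_rng(0)
maj = np.where(y_tr == 0)[0]   # majority-class row indices
mino = np.where(y_tr == 1)[0]  # minority-class row indices

for ratio in (1, 2, 5, 10):    # majority:minority ratio after undersampling
    keep = rng.choice(maj, size=min(len(maj), ratio * len(mino)),
                      replace=False)
    idx = np.concatenate([keep, mino])
    clf = RandomForestClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    print(ratio, round(f1_score(y_val, clf.predict(X_val)), 3))
```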
I have used random forests in my tasks before. Although the data don't need to be balanced, if the positive samples are too few, the pattern of the data may be drowned out by the noise. Most classification methods (even random forests and AdaBoost) have this flaw to some extent. Oversampling may be a good way to deal with this problem, as sketched below.
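A minimal oversampling sketch in that spirit, for the binary case; `X_train` and `y_train` in the usage comment are placeholders:

```python
import numpy as np
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier

def oversample(X, y, random_state=0):
    """Replicate minority rows (with replacement) until classes are even."""
    maj_label = np.bincount(y).argmax()
    X_maj, y_maj = X[y == maj_label], y[y == maj_label]
    X_min, y_min = X[y != maj_label], y[y != maj_label]
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=len(y_maj), random_state=random_state)
    return np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up])

# Hypothetical usage:
# X_bal, y_bal = oversample(X_train, y_train)
# RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
```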
Perhaps the paper Logistic Regression in Rare Events Data is useful for this sort of problem, although its topic is logistic regression.

NLP: Calculating probability a document belongs to a topic (with a bag of words)? [closed]

Given a topic, how can I calculate the probability that a document "belongs" to that topic (i.e. sports)?
This is what I have to work with:
1) I know the common words in documents associated with that topic (eliminating all stop words), and the % of documents that have each word.
For instance, if the topic is sports, I know:
75% of sports documents have the word "play"
70% have the word "stadium"
40% have the word "contract"
30% have the word "baseball"
2) Given this, and a document with a bunch of words, how can I calculate the probability that this document belongs to the topic?
This is a fuzzy classification problem with topics as classes and words as features. Normally you don't have a bag of words for each topic, but rather a set of documents and their associated topics, so I will describe this case first.
The most natural way to find the probability (in the same sense it is used in probability theory) is to use a naive Bayes classifier. This algorithm has been described many times, so I'm not going to cover it here. You can find quite a good explanation in this synopsis or in the associated Coursera NLP lectures.
There are also many other algorithms you can use. For example, your description naturally fits tf*idf based classifiers. tf*idf (term frequency * inverse document frequency) is a statistic used in modern search engines to calculate the importance of a word in a document. For classification, you may calculate an "average document" for each topic and then find how close a new document is to each topic using cosine similarity, as sketched below.
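A toy sketch of that "average document" approach; the corpus and topics here are invented purely for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["play stadium baseball", "stadium contract play",
        "election vote senate", "senate election campaign"]
topics = ["sports", "sports", "politics", "politics"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# "Average document" per topic = mean tf*idf vector of its documents.
centroids = {t: X[[i for i, lab in enumerate(topics) if lab == t]].mean(axis=0)
             for t in set(topics)}

new_doc = vec.transform(["baseball play tonight"]).toarray()
for topic, centroid in centroids.items():
    sim = cosine_similarity(new_doc, centroid.reshape(1, -1))[0, 0]
    print(topic, round(sim, 3))
```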
If you have the case exactly as you've described, with only topics and associated words, just consider each bag of words as a single document, possibly with frequent words duplicated; a worked example follows.
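Here is a worked sketch of that naive Bayes scoring for the question's exact case. The P(word | sports) values come from the question; the P(word | other) values and the 0.5 prior are invented purely for illustration:

```python
import math

p_word_given_sports = {"play": 0.75, "stadium": 0.70,
                       "contract": 0.40, "baseball": 0.30}
p_word_given_other  = {"play": 0.20, "stadium": 0.05,
                       "contract": 0.30, "baseball": 0.01}  # assumed

def log_score(doc_words, p_word, prior):
    # Simplified NB score: sum log P(word | class) over the words
    # actually present in the document (absent words are ignored).
    score = math.log(prior)
    for w in doc_words:
        score += math.log(p_word.get(w, 1e-6))  # tiny floor for unknown words
    return score

doc = {"play", "stadium", "baseball"}
s_sports = log_score(doc, p_word_given_sports, prior=0.5)
s_other  = log_score(doc, p_word_given_other,  prior=0.5)

# Convert the two log scores into P(sports | doc).
p_sports = 1 / (1 + math.exp(s_other - s_sports))
print(round(p_sports, 4))
```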
Check out topic modeling (https://en.wikipedia.org/wiki/Topic_model), and if you are coding in Python, check out Radim's implementation, gensim (http://radimrehurek.com/gensim/tut1.html). Otherwise there are many other implementations at http://www.cs.princeton.edu/~blei/topicmodeling.html
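A minimal gensim sketch along the lines of the linked tutorial; the tokenized toy corpus is invented:

```python
from gensim import corpora, models

texts = [["play", "stadium", "baseball"],
         ["contract", "play", "team"],
         ["election", "vote", "senate"]]

# Map tokens to ids and convert each document to a bag of words.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)

# Topic mixture for a new document: a list of (topic_id, probability).
new_bow = dictionary.doc2bow(["play", "baseball"])
print(lda.get_document_topics(new_bow))
```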
There are many approaches to solving this kind of problem. I suggest starting with simple logistic regression and looking at the results. If you already have predefined ontology sets, you can add them as features at a later stage to improve accuracy.
