ML: Dataset Sizes: Small, Medium, or Large [closed] - machine-learning

What is the range for small, medium, or large dataset sizes in machine learning problems? This was asked in an interview and I could not answer it. How would we know whether our dataset falls in the small, medium, or large category?
Thanks.

Generally, the size of the data affects issues such as generalization, data imbalance, and the difficulty of approaching the global optimum.
However, it also depends on the application itself, on the quality of the data, and on the questions you want to answer based on that data.
Generally, the goal is to minimize bias and variance. One effective way to achieve this is to train with more data. Less data can make predictive models very sensitive, although for some applications a small dataset can still reveal significant patterns.
Another way to judge whether your data is small or big: imagine your data consists of 20 columns and 10 rows, i.e. 200 cells. A dataset with 10 columns and 20 rows would be considered larger even though the total number of cells is still 200, because the number of samples is greater.
Another point of view comes from classification problems. Imagine you have a big, imbalanced dataset where the dependent variable is "yes" 99% of the time and "no" 1% of the time. On the other hand, you have a smaller dataset with an approximately 50-50 distribution of the dependent variable. The latter could again be considered a more effective dataset for training.
Keep in mind that there is a variety of techniques you can use to deal with small datasets.
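As a rough illustration of the shape and class-balance checks described above, here is a minimal sketch in Python (the pandas DataFrame and the column name "label" are hypothetical, not from the question):

import pandas as pd

# Hypothetical example data: "label" is the dependent variable.
df = pd.DataFrame({
    "feature_a": range(10),
    "feature_b": range(10, 20),
    "label": ["yes"] * 9 + ["no"],
})

n_rows, n_cols = df.shape
print(f"{n_rows} samples x {n_cols} columns = {df.size} cells")

# Class balance of the dependent variable: a 90/10 split here,
# far less balanced than a 50/50 split would be.
print(df["label"].value_counts(normalize=True))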

Related

How can I normalize data for Reinforcement Learning when outliers are present? [closed]

I have to train a reinforcement learning agent (represented by a neural network) whose environment has a dataset where outliers are present.
How should I handle normalizing the data, given that I want to normalize it to the range [-1, 1]?
I need to keep the outliers in the dataset because they are critical; they can be genuinely significant in some circumstances despite being outside the normal range.
So the option to completely delete rows is excluded.
Currently, I'm trying to normalize the dataset by using the IQR method.
I fear that with outliers still present, the agent will take some actions only when it encounters them.
I have already observed that a trained agent always took the same actions, excluding the others.
What does your experience suggest?
After some tests, I took this road:
I applied a z-score normalization with the "robust" option; this way, I have mean = 0 and sd = 1.
I calculated (min_range(feature) + max_range(feature)) / 2.
I divided all the feature data by the value calculated in step 2.
The agent learned pretty well.
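For reference, here is a minimal sketch of one way to combine a robust (median/IQR) standardization with a squashing step so the result stays within [-1, 1] even when the outliers are kept. The tanh squashing and scikit-learn's RobustScaler are my own assumptions, not the exact procedure described above:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical feature with a large outlier in the last row.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Robust scaling: centers on the median and scales by the IQR,
# so the outlier does not dominate the scale of the bulk of the data.
scaled = RobustScaler().fit_transform(X)

# tanh squashes any real value into (-1, 1) while keeping outliers
# distinguishable from the normal range instead of deleting or clipping them.
normalized = np.tanh(scaled)
print(normalized.ravel())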

When should I train my own models and when should I use pretrained models? [closed]

Is it recommended to train my own models for things like sentiment analysis, despite having only a very small dataset (5000 reviews), or is it better to use pretrained models that were trained on much larger datasets but aren't "specialized" for my data?
Also, how could I train my model on my data and then later use it on that same data? I was thinking of an iterative approach where the training data would be a randomly selected subset of my total data for each learning epoch.
I would go like this:
Try the pre-trained model and see how it goes
If the results are not satisfactory, you can fine-tune it (see this tutorial). Basically, you use your own examples to adjust the weights of the pre-trained model. This should improve the results, but it depends on what your data looks like and how many examples you can provide. The more you have, the better it should be (I would try to use at least 10-20k)
Also, how could I train my model on my data and then later use it on it too?
Be careful to distinguish between pre-training and fine-tuning.
For pre-training you need a huge amount of text (billions of characters); it is very resource-demanding, and you typically don't want to do it unless you have a very good reason (for example, a model for your target language does not exist).
Fine-tuning requires far fewer examples (some tens of thousands), typically takes less than a day on a single GPU, and allows you to exploit a pre-trained model created by someone else.
From what you write, I would go with fine-tuning.
Of course you can save the model for later, as you can see in the tutorial I linked above:
model.save_pretrained("my_imdb_model")
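For concreteness, here is a minimal fine-tuning sketch assuming the Hugging Face transformers and datasets libraries (which the linked tutorial is presumably based on); the model name, toy data, and hyperparameters are illustrative assumptions, not values from this answer:

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Hypothetical in-memory dataset: "text" holds reviews, "label" is 0/1 sentiment.
data = Dataset.from_dict({"text": ["great product", "terrible service"],
                          "label": [1, 0]})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                        padding="max_length"), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1),
    train_dataset=data,
)
trainer.train()

# As in the answer above, the fine-tuned model can be saved for later use.
model.save_pretrained("my_imdb_model")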

How come a small dataset has a high variance? [closed]

Why does a small dataset have a high variance? Our professor mentioned this once, and I did not understand it. Any help would be greatly appreciated.
Thanks in advance.
If your dataset is small and you train your model to fit it, it is easy to run into overfitting problems. If your dataset is big enough, a little overfitting may not be a big problem, but in a small dataset it is.
Every single one of us, by the time we are entering our professional careers, has been exposed to a larger visual dataset than the largest dataset available to AI researchers. On top of this, we have sound, smell, touch, and taste data all coming in from our external senses. In summary, humans have a lot of context on the human world. We have a general common-sense understanding of human situations. When analyzing a dataset, we combine the data itself with our past knowledge in order to come up with an analysis.
The typical machine learning algorithm has none of that: it has only the data you show to it, and that data must be in a standardized format. If a pattern isn't present in the data, there is no way for the algorithm to learn it. That's why, when given a small dataset, it is more prone to error.
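To make the small-sample-high-variance point concrete, here is a small simulation sketch (the distribution, sample sizes, and number of trials are arbitrary choices of mine): it repeatedly draws small and large samples from the same source and compares how much the estimated mean fluctuates.

import numpy as np

rng = np.random.default_rng(0)

def spread_of_sample_means(n, trials=1000):
    # Draw `trials` samples of size n from the same distribution and
    # measure how much the estimated mean varies from sample to sample.
    means = [rng.normal(loc=0.0, scale=1.0, size=n).mean() for _ in range(trials)]
    return np.std(means)

print("std of sample means, n=10:  ", spread_of_sample_means(10))
print("std of sample means, n=1000:", spread_of_sample_means(1000))
# The small-sample estimates fluctuate far more, which is the same effect
# that makes models fitted on small datasets vary a lot between datasets.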

Deep Convolutional Networks [closed]

I would like to do some object detection where I have two restrictions.
The first is that I don't have a large number of images for training (at the moment around 550 images).
Second, most likely I will not be able to see the whole object; only some part of the object that I am trying to detect will be visible.
My question: is it a good idea to try Deep Convolutional Networks via Bayesian Optimization and Structured Prediction for this kind of situation?
I have this paper as a reference:
Deep Convolutional Networks via Bayesian Optimization and Structured Prediction.
You need to offer us more details. The answers to "What CNN should I use?" and "Do I have enough images for that?" depend on several factors:
1- How many objects are in the 550 images? Each object is a class; if you have 550 images of 2 different objects that might be enough, but if you have 550 objects that's only 1 image per object, which is definitely not enough.
2- What is the size of your images? Does it vary among them? Do the 550 images contain parts of the object or the whole object?
After knowing the answer to these questions you can select your CNNs architecture and your data augmentation strategy.
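As a rough sketch of what a data augmentation strategy could look like for a small image set, here is an example using torchvision transforms; the specific transforms and parameters are my own assumptions, not recommendations from this answer:

from torchvision import transforms

# Random crops and flips simulate seeing only parts of the object and
# different viewpoints, which also multiplies the effective training set.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Typically applied when loading the images, e.g.:
# dataset = torchvision.datasets.ImageFolder("path/to/images", transform=train_transforms)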
Structured receptive fields have shown better results on small datasets than standard CNNs. Here's a paper on it: https://arxiv.org/abs/1605.02971

Survey to determine satisfaction: how to find the questions that mattered? [closed]

If a survey is given to determine overall customer satisfaction, and there are 20 general questions and a final summary question: "What's your overall satisfaction 1-10", how could it be determined which questions are most significantly related to the summary question's answer?
In short, which questions actually mattered and which ones were just wasting space on the survey...
Information about the relevance of certain features is given by linear classification and regression weights associated with these features.
For your specific application, you could try training an L1 or L0 regularized regressor (http://en.wikipedia.org/wiki/Least-angle_regression, http://en.wikipedia.org/wiki/Matching_pursuit). These regularizers force many of the regression weights to zero, which means that the features associated with these weights can be effectively ignored.
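For instance, here is a minimal sketch of the L1-regularized idea using scikit-learn's Lasso; the synthetic survey data and the alpha value are illustrative assumptions:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical survey data: 200 respondents, 20 question scores (1-10),
# with the overall satisfaction score as the target.
X = rng.integers(1, 11, size=(200, 20)).astype(float)
y = 0.6 * X[:, 2] + 0.3 * X[:, 7] + rng.normal(scale=1.0, size=200)

# L1 regularization drives most weights to exactly zero; the surviving
# nonzero weights point to the questions that matter for overall satisfaction.
model = Lasso(alpha=1.0).fit(X, y)
print("questions with nonzero weight:", np.flatnonzero(model.coef_))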
There are many different approaches for answering this question and at varying levels of sophistication. I would start by calculating the correlation matrix for all pair-wise combinations of answers, thereby indicating which individual questions are most (or most negatively) correlated with the overall satisfaction score. This is pretty straightforward in Excel with the Analysis ToolPak.
Next, I would look into clustering techniques, starting simple and moving up in sophistication only if necessary. Not knowing anything about the domain to which this survey data applies, it is hard to say which algorithm would be the most effective, but for starters I would look at k-means and variants if your clusters are likely to all be similarly sized. However, if a vast majority of the responses are very similar, I would look into expectation-maximization-based algorithms. A good open-source toolkit for exploring data and testing the efficacy of various algorithms is called Weka.
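If the data is in a pandas DataFrame rather than Excel, the same pairwise-correlation starting point looks roughly like this (the column names q1, q2, and overall are hypothetical):

import pandas as pd

# Hypothetical responses: q1..q20 would be the individual questions and
# "overall" the 1-10 summary score; only a few columns are shown here.
df = pd.DataFrame({
    "q1": [7, 8, 6, 9, 5],
    "q2": [3, 9, 4, 8, 2],
    "overall": [4, 9, 5, 8, 3],
})

# Correlation of each question with the overall satisfaction score,
# sorted so the most strongly related questions come first.
corr_with_overall = df.corr()["overall"].drop("overall").sort_values(ascending=False)
print(corr_with_overall)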
