How come a small dataset has a high variance? [closed] - machine-learning

Why does a small dataset have high variance? Our professor said this once, but I did not understand it. Any help would be greatly appreciated.
Thanks in advance.

If your dataset is small and you train your model to fit it closely, it is easy to run into overfitting problems. If your dataset is big enough, a little overfitting may not be a big problem, but with a small dataset it is.
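To make this concrete, here is a minimal scikit-learn sketch that fits the same flexible model to a small and a large sample and compares training and test error; the sample sizes and the degree-9 polynomial are arbitrary choices for illustration:

```python
# Sketch: the same flexible model overfits badly on a small sample (illustrative sizes).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=n)  # noisy sine wave
    return x, y

x_test, y_test = make_data(1000)  # shared held-out set

for n_train in (10, 1000):
    x_train, y_train = make_data(n_train)
    model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    print(f"n={n_train:>4}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
# Typically the n=10 run has a tiny training error but a much larger test error,
# while the two errors stay close for n=1000.
```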

Every single one of us, by the time we enter our professional careers, has been exposed to a larger visual dataset than the largest dataset available to AI researchers. On top of this, we have sound, smell, touch, and taste data all coming in from our external senses. In short, humans have a lot of context on the human world. We have a general common-sense understanding of human situations. When analyzing a dataset, we combine the data itself with our past knowledge in order to come up with an analysis.
The typical machine learning algorithm has none of that: it has only the data you show it, and that data must be in a standardized format. If a pattern isn't present in the data, there is no way for the algorithm to learn it. That's why it is more prone to error when given a small dataset.
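To connect this back to the word "variance": even without a model, any estimate computed from a small sample fluctuates far more from one sample to the next than the same estimate computed from a large sample. A minimal numpy sketch (the sample sizes and the distribution are arbitrary choices for illustration):

```python
# Sketch: an estimate from a small sample fluctuates far more between samples.
import numpy as np

rng = np.random.default_rng(0)

for n in (10, 10_000):  # illustrative sample sizes
    # Draw many independent samples of size n and record each sample mean.
    means = [rng.normal(loc=5.0, scale=2.0, size=n).mean() for _ in range(500)]
    print(f"n={n:>6}: std of the sample means = {np.std(means):.4f}")
# The spread of the estimates shrinks roughly like 1/sqrt(n), which is what
# "a small dataset has high variance" refers to.
```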

Related

What are the good practices to building your own custom facial recognition? [closed]

I am working on building a custom facial recognition for our office.
I am planning to use Google FaceNet. Now, my question: you can find or create your own version of the FaceNet model in Keras or PyTorch, there is no issue in that. But regarding creating the dataset, I want to know the best practices for capturing photos of a person when I don't have any prior photo of that person; all I have is a camera and the person. Should I create variance by changing the lighting conditions, orientation, or face size?
A properly trained FaceNet model should already be somewhat invariant to lighting conditions, pose, and other features that should not be part of identifying a face. At least that is what is claimed in a draft of the FaceNet paper. If you only intend to compare feature vectors generated from the network, and intend to recognize a small group of people, your own dataset likely does not have to be particularly large.
Personally I have done something quite similar to what you are trying to achieve, for a group of around 100 people. The dataset consisted of 1 image per person, and I used a 1-NN classifier on the generated feature vectors. While I do not remember the exact results, it did work quite well. The pretrained network's architecture was different from FaceNet's, but the overall idea was the same.
The only way to truly answer your question though would be to experiment and see how well things work out in practice.
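For reference, here is a minimal sketch of the 1-NN-over-embeddings approach described above; the embed function is a placeholder for whatever FaceNet-style model you use, not a real API:

```python
# Sketch: nearest-neighbour identification over face embeddings.
# `embed` is a placeholder for your FaceNet-style model (e.g. a network that
# maps an aligned face image to a 128-d vector); it is not a real API.
import numpy as np

def embed(face_image: np.ndarray) -> np.ndarray:
    raise NotImplementedError("replace with your embedding model")

gallery = {}  # name -> reference embedding (one enrolled image per person)

def enroll(name: str, face_image: np.ndarray) -> None:
    gallery[name] = embed(face_image)

def identify(face_image: np.ndarray, threshold: float = 1.0) -> str | None:
    """Return the closest enrolled name, or None if nothing is close enough."""
    query = embed(face_image)
    name, dist = min(
        ((n, np.linalg.norm(query - v)) for n, v in gallery.items()),
        key=lambda t: t[1],
    )
    return name if dist <= threshold else None
```

The distance threshold is something you would have to tune on your own data; the experiment suggested in the answer above is the only way to pick it reliably.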

Is it a good idea to train a neural network on continuously randomly generated training data? [closed]

Hello everyone, I'm building a license plate detection model in TensorFlow. I built a function that chooses a license plate at random from a collection of ~5000 plates, puts it in a random place on a random background, and saves the coordinates. At first I thought to generate about 40K images this way and train the network on the generated data. But wouldn't it be a good idea to just continuously keep generating new data to feed to the network and basically eliminate any chance of it getting overfitted?
This is an excellent way to train it to spot the discontinuities around a superimposed yellow/white/blue rectangle, but maybe not such a great way of teaching it to spot a real license plate. If you've got a good way of procedurally generating images then great, but be warned: it might spot the wrong pattern.
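If you do go the continuous-generation route, one hedged sketch of the plumbing is to wrap your compositing function in a Python generator and feed it to tf.data, so that every batch is freshly generated; compose_plate_on_background below is a placeholder for your own code, not a real API:

```python
# Sketch: stream freshly generated training images instead of a fixed 40K set.
# `compose_plate_on_background` is a placeholder for your own compositing code.
import numpy as np
import tensorflow as tf

IMG_SHAPE = (256, 256, 3)  # illustrative image size

def compose_plate_on_background():
    """Return (image, [x, y, w, h]) with a random plate pasted on a random background."""
    raise NotImplementedError("replace with your existing generation function")

def sample_generator():
    while True:  # endless stream, so the model rarely sees the same image twice
        image, box = compose_plate_on_background()
        yield image.astype(np.float32), np.asarray(box, dtype=np.float32)

dataset = (
    tf.data.Dataset.from_generator(
        sample_generator,
        output_signature=(
            tf.TensorSpec(shape=IMG_SHAPE, dtype=tf.float32),
            tf.TensorSpec(shape=(4,), dtype=tf.float32),
        ),
    )
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(dataset, steps_per_epoch=..., epochs=...) would then train on a
# never-repeating stream; the caveat above about unrealistic composites still applies.
```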

What is the difference between feature engineering and feature extraction? [closed]

I am struggling to find the difference between the two concepts. From what I understand, both refer to turning raw data into more comprehensive features that describe the problem at hand. Are they the same thing? If not, could anyone please provide examples of both?
Feature extraction is usually used when the original data was in a very different form, in particular when you could not have used the raw data directly.
E.g. the original data were images. You extract the redness value, or a description of the shape of an object in the image. It's lossy, but at least you get some result now.
Feature engineering is the careful preprocessing into more meaningful features, even if you could have used the old data.
E.g. instead of using variables x, y, z you decide to use log(x)-sqrt(y)*z instead, because your engineering knowledge tells you that this derived quantity is more meaningful to solve your problem. You get better results than without.
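Both examples can be made concrete in a few lines; a hedged sketch in numpy/pandas, where the column names and the redness measure are made up for illustration:

```python
# Sketch: feature extraction vs. feature engineering (illustrative names).
import numpy as np
import pandas as pd

# Feature extraction: the raw data (an RGB image array) is unusable as-is,
# so we extract a scalar "redness" value from it.
image = np.random.rand(64, 64, 3)   # stand-in for a real image
redness = image[:, :, 0].mean()     # mean of the red channel

# Feature engineering: the columns x, y, z were already usable, but domain
# knowledge says log(x) - sqrt(y) * z is a more meaningful quantity.
df = pd.DataFrame({"x": [1.0, 2.0, 4.0],
                   "y": [4.0, 9.0, 16.0],
                   "z": [0.5, 1.0, 2.0]})
df["engineered"] = np.log(df["x"]) - np.sqrt(df["y"]) * df["z"]

print(redness)
print(df)
```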
Feature engineering is transforming raw data into features/attributes that better represent the underlying structure of your data; it is usually done by domain experts.
Feature extraction is transforming raw data into the desired form.

Learning approach in machine learning [closed]

(homework problem)
Which of the following problems are best suited for the learning approach?
Classifying numbers into primes and non-primes.
Detecting potential fraud in credit card charges.
Determining the time it would take a falling object to hit the ground.
Determining the optimal cycle for traffic lights in a busy intersection.
I'm trying to answer your question without doing your homework.
Basically you can think of machine learning as a way to extract patterns from data where all other approaches fail.
So first clue here: If there is an analytic way to solve the problem then don't use machine learning! The analytic algorithm will likely be faster, more efficient, and 100% correct.
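For example, the first item on your list (classifying numbers into primes and non-primes) has an exact algorithm, so a learner has nothing to add; a few lines of ordinary code settle it:

```python
# Sketch: prime classification has an exact analytic solution,
# so machine learning adds nothing here.
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

print([n for n in range(2, 20) if is_prime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```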
Second clue is: there has to be a pattern in the data. If you as a human see a pattern, machine learning can find it too. If lots of smart humans who are experts in the respective domain don't see a pattern, then machine learning will most likely fail. Chaos cannot be learned, i.e. classified/predicted.
That should answer your question. Make sure to also read the summary on Wikipedia to get an idea of whether a problem can be solved using supervised, unsupervised, or reinforcement learning.

A basic query about data mining [closed]

Using data mining, we are able to find useful patterns in a large set of data using techniques like correlation, and there must be some open-source tools for this (what are some examples?).
Is this pull-based or push-based? I mean, do we provide the dataset as well as specific queries as input to the data mining engine and it gives us answers (as in SQL), or do we only supply a large dataset as input and the engine finds patterns on its own (patterns we never knew existed and/or could not have formulated queries for), so that we don't really pull specific answers from it; it pushes the patterns to us?
A quick read of the Wikipedia article doesn't clear up my doubts.
For open source, have a look at Weka.
In regards to the push-pull thing, well, it's a bit of both. But it's not quite that simple. You must be looking for something. E.g. if you are looking for clusters, there are unsupervised algorithms which will give you an answer with minimal guidance.
In practice things are more meaningful if you know about the data you analyse and you are looking at regularities and patterns that make sense.
Playing with Weka will give you a better idea of the range of possibilities.
Python and R are other great open-source tools that are very popular in the data mining area.
A great tool that I used recently is scikit-learn.
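As an illustration of the "minimal guidance" point above, a few lines of scikit-learn will find clusters in data without being told what to look for; a minimal sketch on synthetic data:

```python
# Sketch: unsupervised clustering with scikit-learn on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy dataset
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:20])  # cluster assignments found without any labelled examples
```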
