What is the difference between feature engineering and feature extraction? [closed] - machine-learning

I am struggling to find the difference between the two concepts. From what I understand, both refer to turning raw data into more comprehensive features that describe the problem at hand. Are they the same thing? If not, could anyone please provide examples of both?

Feature extraction is usually used when the original data is in a very different form, in particular when you could not have used the raw data directly.
E.g. the original data were images. You extract the redness value, or a description of the shape of an object in the image. It's lossy, but at least you get some result to work with.
Feature engineering is the careful preprocessing into more meaningful features, even if you could have used the old data.
E.g. instead of using variables x, y, z you decide to use log(x)-sqrt(y)*z instead, because your engineering knowledge tells you that this derived quantity is more meaningful to solve your problem. You get better results than without.
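For concreteness, here is a minimal Python sketch of the two examples above (the image data and the function names are made up for illustration):

```python
import numpy as np

# Feature extraction: the raw data is an image; we reduce it to a scalar
# "redness" value because the raw pixels can't be used directly.
def extract_redness(image):
    # image: H x W x 3 RGB array -> mean of the red channel
    return image[:, :, 0].mean()

# Feature engineering: x, y, z are already usable, but domain knowledge says
# a derived quantity is more meaningful for the problem.
def engineered_feature(x, y, z):
    return np.log(x) - np.sqrt(y) * z

image = np.random.randint(0, 256, size=(64, 64, 3))
print(extract_redness(image))           # extracted from raw image data
print(engineered_feature(4.0, 9.0, 2))  # engineered from existing variables
```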

Feature engineering - transforming raw data into features/attributes that better represent the underlying structure of your data, usually done by domain experts.
Feature extraction - transforming raw data into the desired form.

Related

ML or rule based [closed]

I already have 85% accuracy with my sklearn text classifier. What are the advantages and disadvantages of building a rule-based system instead? Could it save me from doing double the work? Maybe you can provide me with sources and evidence for each side, so that I can make the decision based on my circumstances. Again, I want to know when a rule-based approach is favorable versus when an ML-based approach is favorable. Thanks!
Here is an idea:
Instead of going one way or the other, you can set up a hybrid model. Look at typical errors your machine learning classifier makes, and see if you can come up with a set of rules that capture those errors. Then run these rules on your input; if they apply, finish there; if not, pass the input on to the classifier.
In the past I did this with a probabilistic part-of-speech tagger. It's difficult to tune a probabilistic model, but it's easy to add a few pre- or post-processing rules to capture some consistent errors.
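A rough sketch of that control flow in Python (the rule format, the example rules, and the classifier objects here are assumptions, not any specific library's API):

```python
# Hybrid approach: hand-written rules run first; the trained ML classifier is
# the fallback when no rule fires.
def hybrid_predict(text, rules, clf, vectorizer):
    for pattern, label in rules:      # rules capture known classifier errors
        if pattern in text:
            return label              # a rule applied, so finish here
    # no rule matched: fall back to the sklearn-style classifier
    return clf.predict(vectorizer.transform([text]))[0]

# Hypothetical rules; in practice you would derive them from your error analysis.
rules = [("refund", "billing"), ("password reset", "account")]
# prediction = hybrid_predict(some_text, rules, trained_clf, fitted_vectorizer)
```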
https://www.linkedin.com/feed/update/urn:li:activity:6674229787218776064?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A6674229787218776064%2C6674239716663156736%29
Yoel Krupnik (CTO & co-founder | smrt - AI For Accounting) writes:
I think it really depends on the specific problem. Some problems can be completely solved with rule based logic, some require machine learning (often in combination with rule based logic before or after).
Advantages of the rule-based approach are that it doesn't require labeled training data, it might quickly provide decent results to use as a benchmark, and it helps you better understand the problem for the future labeling / text manipulation required by the ML algorithm.

How come a small dataset has a high variance? [closed]

Why does a small dataset have a high variance? Our professor once said it. I just did not understand it. Any help would be greatly appreciated.
Thanks in advance.
If your dataset is small and you train your model to fit it, it is easy to run into overfitting problems. If your dataset is big enough, a little overfitting may not be a big problem, but in a small dataset it is.
Every single one of us, by the time we enter our professional careers, has been exposed to a larger visual dataset than the largest dataset available to AI researchers. On top of this, we have sound, smell, touch, and taste data all coming in from our external senses. In summary, humans have a lot of context on the human world. We have a general common-sense understanding of human situations. When analyzing a dataset, we combine the data itself with our past knowledge in order to come up with an analysis.
The typical machine learning algorithm has none of that — it has only the data you show to it, and that data must be in a standardized format. If a pattern isn’t present in the data, there is no way for the algorithm to learn it. That's why when given a small dataset it is more prone to error.
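As a toy illustration of the statistical point (my own example, not from the answers above): the same quantity estimated from small samples scatters far more across repetitions than when estimated from large samples, which is one concrete sense in which a small dataset has high variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def spread_of_sample_means(n, trials=1000):
    # Draw `trials` datasets of size n from the same distribution and measure
    # how much the sample mean varies from dataset to dataset.
    means = [rng.normal(loc=0.0, scale=1.0, size=n).mean() for _ in range(trials)]
    return np.std(means)

print(spread_of_sample_means(10))    # small datasets: estimates vary a lot
print(spread_of_sample_means(1000))  # large datasets: estimates are much more stable
```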

What type of neural net to use to distinguish between real and fake images? [closed]

I want to implement a network that can distinguish between real and fake images.
I don't want to use a GAN because it would be overkill (training a generator and a discriminator, when I already have the images).
What is the preferred framework to do this?
Is a binary classifier what I need?
Yes, binary classification sounds like a reasonable way to frame your problem.
GANs would be more suitable if you wanted to generate new images. In that case you could train a generator and a discriminator, and then use the former and discard the latter.
As I understand it, discriminator networks typically don't get used on their own (which appears to have been your line of thinking). The reason is that they become tightly coupled to the generator they've been trained with, and don't necessarily generalise beyond that.
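If you do frame it as binary classification, a minimal convolutional classifier in Keras might look like the sketch below; the input size and layer sizes are placeholder choices on my part, not a recommendation from the answer above.

```python
from tensorflow.keras import layers, models

# Binary real-vs-fake image classifier, assuming 128x128 RGB inputs.
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # single probability: fake vs real
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels), epochs=10)
```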

Tensorflow Count Objects in Image [closed]

New to machine learning, so I'm looking for some direction on how to get started. The end goal is to be able to train a model to count the number of objects in an image using Tensorflow. My initial focus will be to train the model to count one specific type of object. So let's say I take coins: I will only train the model to count coins, and I'm not worried about creating a generic counter for all different types of objects. I've only done Google's example of image classification of flowers, and I understand the basics of that. So I'm looking for clues on how to get started. Is this an image classification problem where I can use the same logic as the flowers example?
Probably the best-performing solution for the coin problem would be to use regression. Annotate 5k images with the number of objects in the scene and train your model on them. Then your model just outputs the correct number. (Hopefully.)
Another way is to use a sliding-window approach like this one: https://arxiv.org/pdf/1312.6229.pdf and classify for each window whether it shows a coin. Then you count the found regions. This approach is easier to annotate and learn, and it is more extensible. But you have the problem of choosing good windows and combining the results of those windows in a sensible way.
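A hedged sketch of the regression framing in Keras (the input size, architecture, and loss are placeholder choices of mine, not part of the original answer):

```python
from tensorflow.keras import layers, models

# Count regression: the label for each image is simply the number of coins in it.
model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1),  # predicted count; round to the nearest integer at inference time
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
# model.fit(images, counts, epochs=10)  # counts are plain integers per image
```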

A basic query about data mining [closed]

Using data mining, we are able to find useful patterns in a large set of data using techniques like correlation, etc., and there must exist some open source tools for this (what are some examples?).
Is this pull-based or push-based? I mean, do we provide the data set as well as specific queries as input to the data mining engine, which then provides us with answers (as in SQL)? Or do we only supply a large data set to the engine, and it finds patterns on its own (patterns we never knew existed and/or couldn't formulate queries for), so that we don't really pull specific queries from it, it pushes the patterns to us?
A quick read of the Wikipedia article doesn't clear up my doubts.
For open source, have a look at Weka.
In regards to the push-pull thing, well, it's a bit of both, but it's not quite that simple. You must be looking for something. E.g. if you are looking for clusters, there are unsupervised algorithms which will give you an answer with minimal guidance.
In practice, things are more meaningful if you know about the data you analyse and you are looking for regularities and patterns that make sense.
Playing with Weka will give you a better idea of the range of possibilities.
Python and R are other great open source tools that are very popular in the data mining area.
A great tool that I used recently is scikit-learn.
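For example, here is a tiny scikit-learn run illustrating the "looking for clusters" case mentioned above; the toy dataset and the choice of k=3 are made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unsupervised clustering: no queries, no labels, just data in and groupings out.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.labels_[:10])      # cluster assignment for the first few points
print(kmeans.cluster_centers_)  # the discovered cluster centres
```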

Resources