I'm going through Andrew Ng's Machine Learning course on Coursera, and in the intro to his section on SVM's he says: "SVMs are considered by many to be the most powerful 'black box' learning algorithm"
He doesn't elaborate on that. Why would SVM's be considered a black box? Is it because their behavior is less straightforward than that of, say, logistic regression?
In machine learning, some algorithms are referred to as black box processes because the mechanism that transforms the input into the output is obfuscated by an imaginary box, without interference from the audience.
In general, the fundamental problem that SVMs try to solve is binary classification. One of the classical examples is to distinguish spam emails vs non-spam emails.
The main idea behind a SVM is that it's trying to find a hyperplane separating those points in space, with the positive examples on one side, and the negative examples on the other.
This is where the idea of "machine" comes into the name. Imagine attaching a spring to each of the points that's the closest to the separating hyperplane, and the other end of the spring to the hyperplane. Those points are called "support vectors" and they act like a "machine", moving the hyperplane to the orientation that maximizes the separation between the positive and negative examples in space.
Related
I want to build an RL agent which can justify if a handwritten word is written by the legitimate user or not. The plan is as follow:
Let's say I have written any word 10 times and extracted some geometrical properties for all of them to use as features. Then I have trained an RL agent to learn to take the decision on the basis of the differences between geometrical properties of new and the old 10 handwritten texts. Reward is assigned for correct identification and nothing or negative for incorrect one.
Am I going in the right direction or I am missing anything which is vital? Is it possible to train the agent with only 10 samples? Actally as a new student of RL, I am confused about use case of RL; if it is best fit for game solving and robotic problems or it is also suitable for predicting on the basis of training.
Reinforcement learning would be used over time. If you were following the stroke of the pen, over time, to find out which way it was going that would be more reinforcement learning's wheelhouse. The time dimension (or over a series of states) is why it's used in games like Starcraft II.
You are talking about taking a picture of the text that was written and eventually classifying it into a boolean (Good or Not). You are looking for more Convolutional neural networks to solve your problem (those types of algos are good for pictures).
Eventually you won't be able to tell. There are techniques with GAN's (Generative Adversarial Networks) that can train with your discriminator and finally figure out the pattern it's looking for and fool it. But this sounds good as a homework problem.
I have some questions about SVM :
1- Why using SVM? or in other words, what causes it to appear?
2- The state Of art (2017)
3- What improvements have they made?
SVM works very well. In many applications, they are still among the best performing algorithms.
We've seen some progress in particular on linear SVMs, that can be trained much faster than kernel SVMs.
Read more literature. Don't expect an exhaustive answer in this QA format. Show more effort on your behalf.
SVM's are most commonly used for classification problems where labeled data is available (supervised learning) and are useful for modeling with limited data. For problems with unlabeled data (unsupervised learning), then support vector clustering is an algorithm commonly employed. SVM tends to perform better on binary classification problems since the decision boundaries will not overlap. Your 2nd and 3rd questions are very ambiguous (and need lots of work!), but I'll suffice it to say that SVM's have found wide range applicability to medical data science. Here's a link to explore more about this: Applications of Support Vector Machine (SVM) Learning in Cancer Genomics
I have designed a 3 layer neural network whose inputs are the concatenated features from a CNN and RNN. The weights learned by network take very small values. What is the reasonable explanation for this? and how to interpret the weight histograms and distributions in Tensorflow? Any good resource for it?
This is the weight distribution of the first hidden layer of a 3 layer neural network visualized using tensorboard. How to interpret this? all the weights are taking up zero value?
This is the weight distribution of the second hidden layer of a 3 layer neural:
how to interpret the weight histograms and distributions in Tensorflow?
Well, you probably didn't realize it, but you have just asked the 1 million dollar question in ML & AI...
Model interpretability is a hyper-active and hyper-hot area of current research (think of holy grail, or something), which has been brought forward lately not least due to the (often tremendous) success of deep learning models in various tasks; these models are currently only black boxes, and we naturally feel uncomfortable about it...
Any good resource for it?
Probably not exactly the kind of resources you were thinking of, and we are well off a SO-appropriate topic here, but since you asked...:
A recent (July 2017) article in Science provides a nice overview of the current status & research: How AI detectives are cracking open the black box of deep learning (no in-text links, but googling names & terms will pay off)
DARPA itself is currently running a program on Explainable Artificial Intelligence (XAI)
There was a workshop in NIPS 2016 on Interpretable Machine Learning for Complex Systems
On a more practical level:
The Layer-wise Relevance Propagation (LRP) toolbox for neural networks (paper, project page, code, TF Slim wrapper)
FairML: Auditing Black-Box Predictive Models, by Fast Forward Labs (blog post, paper, code)
A very recent (November 2017) paper by Geoff Hinton, Distilling a Neural Network Into a Soft Decision Tree, with an independent PyTorch implementation
SHAP: A Unified Approach to Interpreting Model Predictions (paper, authors' code)
These should be enough for starters, and to give you a general idea of the subject about which you asked...
UPDATE (Oct 2018): I have put up a much more detailed list of practical resources in my answer to the question Predictive Analytics - “Why” factor?
The weights learned by network take very small values. What is the reasonable explanation for this? How to interpret this? all the weights are taking up zero value?
Not all weights are zero, but many are. One reason is regularization (in combination with a large, i.e. wide layers, network) Regularization makes weights small (both L1 and L2). If your network is large, most weights are not needed, i.e., they can be set to zero and the model still performs well.
How to interpret the weight histograms and distributions in Tensorflow? Any good resource for it?
I am not so sure about weight distributions. There is some work that analysis them, but I am not aware of a general interpretation, e.g., for CNNs it is known that center weights of a filter/feature usually have larger magnitude than those in corners, see [Locality-Promoting Representation Learning, 2021, ICPR, https://arxiv.org/abs/1905.10661]
For CNNs you can also visualize weights directly, if you have large filters. For example, for (simpl)e networks you can see that weights first converge towards some kind of class average before overfitting starts. This is shown in Figure 2 of [The learning phases in NN: From Fitting the Majority to Fitting a Few, 2022, http://arxiv.org/abs/2202.08299]
Rather than going for weights, you can also look at what samples trigger the strongest activations for specific features. If you don't want to look at single features, there is also the possibility to visualize what the network actually remembers on the input, e.g., see [Explaining Neural Networks by Decoding Layer Activations, https://arxiv.org/abs/2005.13630].
These are just a few examples (Disclaimer I authored these works) - there is thousands of other works on explainability out there.
According to this Wikipedia article Feature Extraction examples for Low-Level algorithms are Edge Detection, Corner Detection etc.
But what are High-Level algorithms?
I only found this quote from the Wikipedia article Feature Detection (computer vision):
Occasionally, when feature detection is computationally expensive and there are time constraints, a higher level algorithm may be used to guide the feature detection stage, so that only certain parts of the image are searched for features.
Could you give an example of one of these higher level algorithms?
There isn't a clear cut definition out there, but my understanding of "high-level" algorithms are more in tune with how we classify objects in real life. For low-level feature detection algorithms, these are mostly concerned with finding corresponding points between images, or finding those things that classify as something even remotely interesting at the lowest possible level you can think of - things like finding edges or lines in an image (in addition to finding interesting points of course). In addition, anything dealing with pixel intensities or colours directly is what I would consider low-level too.
High-level algorithms are mostly in the machine learning domain. These algorithms are concerned with the interpretation or classification of a scene as a whole. Things like body pose classification, face detection, classification of human actions, object detection and recognition and so on. These algorithms are concerned with training a system to recognize or classify something, then you provide it some unknown input that it has never seen before and its job is to either determine what is happening in the scene, or locate a region of interest where it detects an action that the system is trained to look for. This latter fact is probably what the Wikipedia article is referring to. You would have some sort of pre-processing stage where you have some high-level system that determines salient areas in the scene where something important is happening. You would then apply low-level feature detection algorithms in this localized area.
There is a great high-level computer vision workshop that talks about all of this, and you can find the slides and code examples here: https://www.mpi-inf.mpg.de/departments/computer-vision-and-machine-learning/teaching/courses/ss-2019-high-level-computer-vision/
Good luck!
High-level features are something that we can directly see and recognize, like object classification, recognition, segmentation and so on. These are usually the goal of CV research, which is always based on 'low-level' features and algorithms.
Two of them are used in machine specially x-ray machine
Concerned Scene as a whole and edges of lines to help soft ware of machine to take good decision.
I think we should not confuse with high-level features and high-level inference. To me, high-level features mean shape, size, or a combination of low-level features etc. are the high-level features. While classification is the decision made based on the high-level features.
Many machine learning competitions are held in Kaggle where a training set and a set of features and a test set is given whose output label is to be decided based by utilizing a training set.
It is pretty clear that here supervised learning algorithms like decision tree, SVM etc. are applicable. My question is, how should I start to approach such problems, I mean whether to start with decision tree or SVM or some other algorithm or is there is any other approach i.e. how will I decide?
So, I had never heard of Kaggle until reading your post--thank you so much, it looks awesome. Upon exploring their site, I found a portion that will guide you well. On the competitions page (click all competitions), you see Digit Recognizer and Facial Keypoints Detection, both of which are competitions, but are there for educational purposes, tutorials are provided (tutorial isn't available for the facial keypoints detection yet, as the competition is in its infancy. In addition to the general forums, competitions have forums also, which I imagine is very helpful.
If you're interesting in the mathematical foundations of machine learning, and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheatsheet http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html is a good starting point. In my experience using several algorithms at the same time can often give better results, eg logistic regression and svm where the results of each one have a predefined weight. And test, test, test ;)
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.