I used the three algorithms with the same training and test sets. However, I'm getting exactly the same mean accuracy for all of them. Is there a good explanation for why this happens? I read that it may have something to do with the classes being easy to tell apart.

It's a bit weird that this happens, but I have to make some assumptions about your question:
1. The MLP architecture consists of n inputs, one hidden neuron, and m outputs.
2. You use a linear kernel for the SVM.
3. Your test and training data are linearly separable.
If 1 is true, your network is equivalent to logistic regression, so your data is separated by a line, which leads to 3 and also to 2, because you won't need a kernel function to separate the data.
So the weird part is that the SVM turns the computation of the decision boundary into a convex optimization problem that solves for the maximum-margin boundary. Neither logistic regression nor an MLP optimizes for the margin in this way. Hence, your test data must be really easy to separate and must lie farther from the decision boundary than your training data. In that case an optimal margin between the classes is not necessary, and any boundary that separates the classes without error is sufficient.

They can all give identical performance if your problem is simple enough. There's nothing stopping them from giving you identical results.
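To make that concrete, here is a minimal sketch in plain Python. The points and the three boundaries are made up for illustration; the boundaries only stand in for what three different learners might produce. When the test data sits far from every boundary, all of them score the same.

```python
# Toy 2-D test set (made-up values): class 0 near (0, 0), class 1 near (5, 5).
test_points = [(0.0, 0.5), (1.0, 0.0), (0.5, 1.0), (5.0, 4.5), (4.0, 5.0), (5.5, 5.5)]
test_labels = [0, 0, 0, 1, 1, 1]


def predict(weights, bias, point):
    """Linear classifier: class 1 if w.x + b > 0, otherwise class 0."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else 0


def accuracy(weights, bias):
    """Fraction of test points the boundary classifies correctly."""
    hits = sum(predict(weights, bias, p) == y
               for p, y in zip(test_points, test_labels))
    return hits / len(test_labels)


# Three different (hypothetical) boundaries, one per imagined learner.
boundaries = {
    "diagonal": ((1.0, 1.0), -5.0),  # x + y = 5
    "vertical": ((1.0, 0.0), -2.5),  # x = 2.5
    "tilted": ((1.0, 3.0), -9.0),    # x + 3y = 9
}

accuracies = {name: accuracy(w, b) for name, (w, b) in boundaries.items()}
print(accuracies)
```

The boundaries differ a lot, but none of them cuts through either cluster, so the accuracy cannot tell them apart.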


I am working on an MLP model for a CEA classification dataset (binary classification). Each sample contains 4 different features, such as resistance and other values, each in its own range (resistance in the hundreds, another in micro-units, etc.). I am still new to machine learning and this is the first real model I am building. How can I deal with such data? I have tried feeding each sample to the neural network with a sigmoid activation function, but I am not getting accurate results. My assumption is that I need to scale the data. If so, what are some useful resources to look at, since I do not quite understand when scaling is required?
Scaling your data can be an important step in building a machine-learning model, especially when working with neural networks. Scaling can help to ensure that all of the features in your dataset are on a similar scale, which can make it easier for the model to learn.
There are a few different ways to scale your data, such as normalization and standardization. Normalization is the process of scaling the data so that it has a minimum value of 0 and a maximum value of 1. Standardization is the process of scaling the data so that it has a mean of 0 and a standard deviation of 1.
When working with your CEA classification dataset, it might be helpful to try both normalization and standardization to see which one works better for your data. You can use scikit-learn's preprocessing classes MinMaxScaler and StandardScaler for normalization and standardization respectively.
Additionally, it might be helpful to try different activation functions, such as ReLU or LeakyReLU, to see if they lead to more accurate results. Also, you can try adding more layers and neurons in your neural network to see if it improves the performance.
It's also important to remember that feature engineering, which includes the process of selecting the most important features, can be more important than scaling.
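As a rough illustration of the two scalings, here is a dependency-free sketch. The resistance values are invented, and the helper names are my own, not scikit-learn's API. The point to notice is that the statistics are estimated on the training data only and then reused on the test data.

```python
import statistics


def fit_standardizer(column):
    """Estimate mean and (population) standard deviation from training data."""
    return statistics.fmean(column), statistics.pstdev(column)


def standardize(column, mean, std):
    """Shift to mean 0 and scale to standard deviation 1."""
    return [(v - mean) / std for v in column]


def fit_minmax(column):
    """Estimate the minimum and maximum from training data."""
    return min(column), max(column)


def normalize(column, lo, hi):
    """Map the training range [lo, hi] onto [0, 1]."""
    return [(v - lo) / (hi - lo) for v in column]


# Invented values for one feature (e.g. a resistance in the hundreds).
resistance_train = [100.0, 200.0, 300.0, 400.0, 500.0]
resistance_test = [250.0, 600.0]

mean, std = fit_standardizer(resistance_train)
train_std = standardize(resistance_train, mean, std)
test_std = standardize(resistance_test, mean, std)  # reuse training statistics

lo, hi = fit_minmax(resistance_train)
train_norm = normalize(resistance_train, lo, hi)
test_norm = normalize(resistance_test, lo, hi)      # reuse training range

print(train_norm)  # spans 0..1
print(test_norm)   # a test value above the training max lands above 1
```

Each feature gets its own statistics, so a feature in the hundreds and a feature in micro-units end up on comparable scales.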

I am currently looking through Michael Nielsen's ebook Neural Networks and Deep Learning and have run the code found at the end of chapter 1 which trains a neural network to recognize hand-written digits (with a slight modification to make the backpropagation algorithm over a mini-batch matrix-based).
However, having run this code and achieving a classification accuracy of just under 94%, I decided to remove the use of biases from the network. After re-training the modified network, I found no difference in classification accuracy!
NB: The output layer of this network contains ten neurons; if the ith of these neurons has the highest activation then the input is classified as being the digit i.
This got me wondering why it is necessary to use biases in a neural network, rather than just weights, and what differentiates between a task where biases will improve the performance of a network and a task where they will not?
My code can be found here:
Biases are used to account for the fact that your underlying data might not be centered. It is clearer to see in the case of a linear regression.
If you do a regression without an intercept (or bias), you are forcing the underlying model to pass through the origin, which will result in a poor model if the underlying data is not centered (for example, if the true generating process is Y=3000). If, on the other hand, your data is centered or close to centered, then eliminating the bias is good, since you won't introduce a term that is, in fact, independent of your predictive variable (it's like selecting a simpler model, which will tend to generalize better PROVIDED that it actually reflects the underlying data).
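Here is a small sketch of the Y=3000 case using ordinary least squares in plain Python. The x values are arbitrary and the function names are my own.

```python
def fit_through_origin(xs, ys):
    """Least-squares slope for the model y = a*x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_intercept(xs, ys):
    """Least-squares slope and intercept for the model y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]  # arbitrary inputs
ys = [3000.0] * 5               # the Y = 3000 process: constant, far from the origin

slope_no_bias = fit_through_origin(xs, ys)
slope, intercept = fit_with_intercept(xs, ys)

mse_no_bias = mean_squared_error([slope_no_bias * x for x in xs], ys)
mse_bias = mean_squared_error([slope * x + intercept for x in xs], ys)

print(slope_no_bias, mse_no_bias)  # forced through the origin: large error
print(slope, intercept, mse_bias)  # intercept absorbs the offset
```

Without the intercept, the slope has to do a job it cannot do; with it, the constant offset is captured directly.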

I have a binary classification problem where I have a few great features that have the power to predict almost 100% of the test data because the problem is relatively simple.
However, the nature of the problem means I cannot afford to make mistakes. So instead of giving a prediction I am not sure of, I would rather have the output as a probability, set a threshold, and be able to say: "if I am less than 95% sure, I will call this 'NOT SURE' and act accordingly". Saying "I don't know" is better than making a mistake.
So far so good.
For this purpose, I tried a Gaussian Bayes Classifier (I have a continuous feature) and Logistic Regression, both of which provide a probability as well as the predicted class.
Coming to my Problem:
GBC has around 99% success rate while Logistic Regression has lower, around 96% success rate. So I naturally would prefer to use GBC.
However, as successful as GBC is, it is also very sure of itself. The probabilities I get are either 1 or very close to 1, such as 0.9999997, which makes things tough for me, because in practice GBC does not give me usable probabilities.
Logistic Regression performs worse, but at least gives better and more 'sensible' probabilities.
By the nature of my problem, the cost of misclassification grows as a power of 2, so if I misclassify 4 of the products, I lose 2^4 more (it's unit-less but gives an idea anyway).
In the end, I would like to classify with a higher success rate than Logistic Regression, but also have more graded probabilities so I can set a threshold and point out the cases I am not sure of.
Any suggestions?
Thanks in advance.
If you have enough data, you can simply retune the probabilities. For example, given the "predicted probability" output of your Gaussian classifier, you can go back through a held-out dataset and, at different prediction values, estimate the actual probability of the positive class.
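One simple way to do that retuning is to bin the predicted values and count positives in each bin. The sketch below assumes equal-width bins and uses invented held-out scores. With scores piled up near 0 and 1 you would probably want finer or unevenly spaced bins, so treat the binning as a placeholder.

```python
def calibrate_by_bins(scores, labels, n_bins=5):
    """For each score bin, return the observed fraction of positives
    on held-out data (None for bins that received no samples)."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for s, y in zip(scores, labels):
        b = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        counts[b] += 1
        positives[b] += y
    return [positives[b] / counts[b] if counts[b] else None
            for b in range(n_bins)]


# Invented held-out predictions from an overconfident classifier.
holdout_scores = [0.01, 0.02, 0.03, 0.04, 0.5, 0.97, 0.98, 0.99, 0.995, 0.999]
holdout_labels = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

calibrated = calibrate_by_bins(holdout_scores, holdout_labels)
print(calibrated)
```

In this made-up example the classifier reports values of 0.97 and above in the top bin, but only 4 out of 5 of those samples are actually positive, so the retuned probability for that bin is 0.8.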
Further, you can simply set up an optimization on your holdout set to determine the best threshold (without actually estimating the probability). Since it's one-dimensional, you shouldn't even need to do anything fancy for the optimization: test something like 500 different thresholds and pick the one that minimizes the costs associated with misclassifications.
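A sketch of that threshold sweep, with a "NOT SURE" option included. The cost values (10 per error, 1 per abstention) and the holdout data are assumptions made up for the example; you would plug in the costs from your own problem.

```python
def total_cost(probs, labels, threshold, cost_error=10.0, cost_abstain=1.0):
    """Cost on a holdout set when predictions with confidence below
    `threshold` are reported as NOT SURE instead of as a class."""
    cost = 0.0
    for p, y in zip(probs, labels):
        confidence = max(p, 1.0 - p)
        if confidence < threshold:
            cost += cost_abstain       # said "I don't know"
        elif (p >= 0.5) != bool(y):
            cost += cost_error         # committed to the wrong class
    return cost


def best_threshold(probs, labels, n_thresholds=500):
    """Try evenly spaced thresholds in [0.5, 1.0] and keep the cheapest."""
    candidates = [0.5 + 0.5 * i / (n_thresholds - 1) for i in range(n_thresholds)]
    return min(candidates, key=lambda t: total_cost(probs, labels, t))


# Invented holdout predictions (probability of the positive class) and labels.
holdout_probs = [0.05, 0.10, 0.40, 0.45, 0.60, 0.65, 0.90, 0.95]
holdout_labels = [0, 0, 1, 0, 0, 1, 1, 1]

threshold = best_threshold(holdout_probs, holdout_labels)
print(threshold, total_cost(holdout_probs, holdout_labels, threshold))
```

Because only the ranking of confidences matters here, this works even when the raw outputs are not well-calibrated probabilities.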

The motivating idea behind neural nets seems to be that they learn the "right" features to apply logistic regression to. Is there a similar approach for linear regression? (or just regression problems in general?)
Would doing the obvious thing of removing the application of a sigmoid function for all neurons (ie, including the hidden layers) make sense/work? (ie, each neuron is performing linear regression instead of logistic regression).
Alternatively, would doing the (maybe even more obvious) thing of just scaling output values to [0,1] work? (intuitively I would think not, as the sigmoid function seems like it would cause the net to arbitrarily favor extreme values) (edit: though I was just searching around some more, and saw that one technique is to scale based on mean and variance, which seems like it might deal with this issue -- so maybe this is more viable than I thought).
Or is there some other technique for doing "feature learning" for regression problems?
Check out this applet. Try to learn different functions. When you set linear activation functions at both the hidden and output layers, the network fails to learn even a quadratic function. At least one layer needs to use a sigmoid (nonlinear) activation, see figures below.
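The reason an all-linear network cannot learn a quadratic is that stacked linear layers collapse into a single linear map: W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2). A small check in plain Python, with weights I picked arbitrarily:

```python
def linear_layer(weights, biases, inputs):
    """One dense layer with identity activation: out = W x + b."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Arbitrary weights for a 1-input, 2-hidden, 1-output network.
W1, b1 = [[2.0], [-1.0]], [0.5, 1.0]
W2, b2 = [[3.0, 4.0]], [-2.0]


def two_layer_linear_net(x):
    hidden = linear_layer(W1, b1, [x])
    return linear_layer(W2, b2, hidden)[0]


# Collapse the two layers into one line: y = slope * x + offset.
slope = sum(w2 * row[0] for w2, row in zip(W2[0], W1))    # W2 W1
offset = sum(w2 * b for w2, b in zip(W2[0], b1)) + b2[0]  # W2 b1 + b2

for x in [-2.0, 0.0, 1.0, 3.0]:
    # The "deep" linear network and the single line agree everywhere.
    assert abs(two_layer_linear_net(x) - (slope * x + offset)) < 1e-12

print(slope, offset)
```

However many linear layers you stack, the result is still a straight line (or hyperplane), so the hidden layers add no expressive power until a nonlinearity is introduced.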
There are different kinds of scaling. Standard scaling, as you mentioned, removes the effect of the mean and standard deviation of the training sample and is the one most often used in machine learning. Just make sure you apply the same mean and standard deviation from the training sample to the test sample.
Scaling is required because the output of the sigmoid function lies in (0,1). I haven't tried it, but I think it is better to scale the output even if you select a linear function at the output layer. Otherwise a large input to a hidden layer (with sigmoid) won't produce a correspondingly large change in output: the sigmoid is approximately linear only when its input is in a small range, and outside that range the output changes much more slowly. You can try this yourself on your own data.
Besides, if you have several features, feature normalization that puts the different features on the same scale is also recommended. Scaling speeds up gradient descent by avoiding the many extra iterations that are required when one or more features take on much larger values than the rest.
As #Ray mentioned, deep learning, which involves many levels of features, can help you with feature learning; it's not all linear combinations though.

I was learning about different techniques for classification, like probabilistic classifiers, and stumbled upon this question: why can't we implement a binary classifier as a regression function of all the attributes and classify on the basis of the function's output? Say, if the output is less than a certain value the sample belongs to class A, otherwise to class B. Is there any limitation to this method compared to the probabilistic approach?
You can do this and it is often done in practice, for example in Logistic Regression. It is not even limited to binary classes. There is no inherent limitation compared to a probabilistic approach, although you should keep in mind that both are fundamentally different approaches and hard to compare.
I think you have a misunderstanding about classification. No matter what kind of classifier you are using (SVM or logistic regression), you can always view the output model as
f(x) > b ===> positive
f(x) < b ===> negative
This applies to both probabilistic and non-probabilistic models. In fact, this is related to risk minimization, which leads to the cut-off rule naturally.
Yes, this is possible. For example, a perceptron does exactly that.
However, on its own it is limited to linearly separable problems. But multiple perceptrons can be combined to solve arbitrarily complex problems, as in general neural networks.
Another machine learning technique, the SVM, works in a similar way. It first transforms the input data into some high-dimensional space and then separates it with a linear function.
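To illustrate the perceptron point, here is a minimal sketch of the classic perceptron learning rule in plain Python (the epoch count and learning rate are arbitrary choices). It thresholds a linear function of the attributes, learns the linearly separable AND problem, and cannot fully fit XOR, which is not linearly separable.

```python
def train_perceptron(samples, labels, epochs=50, lr=1.0):
    """Classic perceptron rule: nudge weights toward misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            pred = 1 if score > 0 else 0
            error = y - pred  # 0 if correct, +1 or -1 if wrong
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def accuracy(weights, bias, samples, labels):
    """Fraction of samples on the correct side of the learned threshold."""
    hits = 0
    for x, y in zip(samples, labels):
        score = sum(w * xi for w, xi in zip(weights, x)) + bias
        hits += (1 if score > 0 else 0) == y
    return hits / len(labels)


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]  # linearly separable
xor_labels = [0, 1, 1, 0]  # not linearly separable

w_and, b_and = train_perceptron(inputs, and_labels)
w_xor, b_xor = train_perceptron(inputs, xor_labels)

and_acc = accuracy(w_and, b_and, inputs, and_labels)
xor_acc = accuracy(w_xor, b_xor, inputs, xor_labels)
print(and_acc, xor_acc)
```

No single line can separate the XOR classes, so a lone perceptron always misclassifies at least one of the four points, whatever the training schedule.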
