How To Understand Tensor Flow Weights and Gradient Histogram - histogram

I am experimenting with conv nets for sentiment analysis.
I used a basic convent architecture.
I am trying to analyze the learning, I have a bad feeling about weights, Although I am early in the learning process, but the weights for fully connected layer essentially suggest no updates/ update on only very few component , because histogram essentially remains the same.
Any hints on what I could be missing?
Regards,
Vaibhav Singh

Related

Why regression is considered as part of machine learning?

If by Machine Learning (ML) we mean any program that learns from data, then, yes, regression can be said to be part of ML. But there are several other aspects to Machine Learning such as : solution is improved iteratively based on some performance measure. Whereas for linear regression there is a closed form solution in the form of a direct formula using which all the parameters can be determined and it does not involve iterations. But there is other version of parameter estimation for regression that makes use of gradient descent and it involves several iterations. Does it mean that this iterative version of parameter estimation for regression is done forcefully to bring regression under machine learning umbrella? Or the iterative version has some advantages that the direct formula does not offer?
I won't comment on whether regression is part of ML or not (I don't really see where your definitions came from). But regarding the advantage of an iterative approach, please note that the closed-form solution for linear regression is as follows:
Where X is your design matrix.
Please note that inverting a matrix is an O(n^3) operation, which is infeasible for large n. This is the obvious advantage of the iterative approach using GD.

How to interpret weight distributions of neural net layers

I have designed a 3 layer neural network whose inputs are the concatenated features from a CNN and RNN. The weights learned by network take very small values. What is the reasonable explanation for this? and how to interpret the weight histograms and distributions in Tensorflow? Any good resource for it?
This is the weight distribution of the first hidden layer of a 3 layer neural network visualized using tensorboard. How to interpret this? all the weights are taking up zero value?
This is the weight distribution of the second hidden layer of a 3 layer neural:
how to interpret the weight histograms and distributions in Tensorflow?
Well, you probably didn't realize it, but you have just asked the 1 million dollar question in ML & AI...
Model interpretability is a hyper-active and hyper-hot area of current research (think of holy grail, or something), which has been brought forward lately not least due to the (often tremendous) success of deep learning models in various tasks; these models are currently only black boxes, and we naturally feel uncomfortable about it...
Any good resource for it?
Probably not exactly the kind of resources you were thinking of, and we are well off a SO-appropriate topic here, but since you asked...:
A recent (July 2017) article in Science provides a nice overview of the current status & research: How AI detectives are cracking open the black box of deep learning (no in-text links, but googling names & terms will pay off)
DARPA itself is currently running a program on Explainable Artificial Intelligence (XAI)
There was a workshop in NIPS 2016 on Interpretable Machine Learning for Complex Systems
On a more practical level:
The Layer-wise Relevance Propagation (LRP) toolbox for neural networks (paper, project page, code, TF Slim wrapper)
FairML: Auditing Black-Box Predictive Models, by Fast Forward Labs (blog post, paper, code)
A very recent (November 2017) paper by Geoff Hinton, Distilling a Neural Network Into a Soft Decision Tree, with an independent PyTorch implementation
SHAP: A Unified Approach to Interpreting Model Predictions (paper, authors' code)
These should be enough for starters, and to give you a general idea of the subject about which you asked...
UPDATE (Oct 2018): I have put up a much more detailed list of practical resources in my answer to the question Predictive Analytics - “Why” factor?
The weights learned by network take very small values. What is the reasonable explanation for this? How to interpret this? all the weights are taking up zero value?
Not all weights are zero, but many are. One reason is regularization (in combination with a large, i.e. wide layers, network) Regularization makes weights small (both L1 and L2). If your network is large, most weights are not needed, i.e., they can be set to zero and the model still performs well.
How to interpret the weight histograms and distributions in Tensorflow? Any good resource for it?
I am not so sure about weight distributions. There is some work that analysis them, but I am not aware of a general interpretation, e.g., for CNNs it is known that center weights of a filter/feature usually have larger magnitude than those in corners, see [Locality-Promoting Representation Learning, 2021, ICPR, https://arxiv.org/abs/1905.10661]
For CNNs you can also visualize weights directly, if you have large filters. For example, for (simpl)e networks you can see that weights first converge towards some kind of class average before overfitting starts. This is shown in Figure 2 of [The learning phases in NN: From Fitting the Majority to Fitting a Few, 2022, http://arxiv.org/abs/2202.08299]
Rather than going for weights, you can also look at what samples trigger the strongest activations for specific features. If you don't want to look at single features, there is also the possibility to visualize what the network actually remembers on the input, e.g., see [Explaining Neural Networks by Decoding Layer Activations, https://arxiv.org/abs/2005.13630].
These are just a few examples (Disclaimer I authored these works) - there is thousands of other works on explainability out there.

Machine learning algorithm for few samples and features

I am intended to do a yes/no classifier. The problem is that the data does not come from me, so I have to work with what I have been given. I have around 150 samples, each sample contains 3 features, these features are continuous numeric variables. I know the dataset is quite small. I would like to make you two questions:
A) What would be the best machine learning algorithm for this? SVM? a neural network? All that I have read seems to require a big dataset.
B)I could make the dataset a little bit bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case, is this possible with every machine learning algorithm? (I have seen them in SVM)
Thanks a lot for your help!!!
My recommendation is to use a simple and straightforward algorithm, like decision tree or logistic regression, although, the ones you refer to should work equally well.
The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.
Naive Bayes is a good choice for a situation when there are few training examples. When compared to logistic regression, it was shown by Ng and Jordan that Naive Bayes converges towards its optimum performance faster with fewer training examples. (See section 4 of this book chapter.) Informally speaking, Naive Bayes models a joint probability distribution that performs better in this situation.
Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.

Binary Classification with Neural Networks?

I have a dataset of the order of MxN. I want to perform a binary classifcation on this dataset using neural networks. I was looking into Recurrent Neural Networks. Although, LSTM's can be used for AutoEncoders, I am not sure if they can be used for classification (I am trying to do a binary classification). I am very new to neural networks and deep learning models and i am not really sure if there is a way of achieving binary classification with neural networks. I tried Bernouli RBM on my dataset. I am not sure how to use this model to perform classification. I also found out Pipeline(). Again, I am not sure how to achieve my goal.
Any help would be greatly appreciated.
Ok, something doesn't stack up. If you have unlabelled data and you want to classify it you must take a look at K-Means (http://scikit-learn.org/stable/modules/clustering.html#k-means).
Regarding LSTMs classification: You run your input through the RNN layers and take the last output and feed it into some Conv / Fully-connected layers to take care of classification as you know it.

Image classification using Convolutional neural network

I'm trying to classify hotel image data using Convolutional neural network..
Below are some highlights:
Image preprocessing:
converting to gray-scale
resizing all images to same resolution
normalizing image data
finding pca components
Convolutional neural network:
Input- 32*32
convolution- 16 filters, 3*3 filter size
pooling- 2*2 filter size
dropout- dropping with 0.5 probability
fully connected- 256 units
dropout- dropping with 0.5 probability
output- 8 classes
Libraries used:
Lasagne
nolearn
But, I'm getting less accuracy on test data which is around 28% only.
Any possible reason for such less accuracy? Any suggested improvement?
Thanks in advance.
There are several possible reasons for low accuracy on test data, so without more information and a healthy amount of experimentation, it will be impossible to provide a concrete answer. Having said that, there are a few points worth mentioning:
As #lejlot mentioned in the comments, the PCA pre-processing step is suspicious. The fundamental CNN architecture is designed to require minimal pre-processing, and it's crucial that the basic structure of the image remains intact. This is because CNNs need to be able to find useful, spatially-local features.
For detecting complex objects from image data, it's likely that you'll benefit from more convolutional layers. Chances are, given the simple architecture you've described, that it simply doesn't possess the necessary expressiveness to handle the classification task.
Also, you mention you apply dropout after the convolutional layer. In general, the research I've seen indicates that dropout is not particularly effective on convolutional layers. I personally would recommend removing it to see if it has any impact. If you do wind up needing regularization on your convolutional layers, (which in my experience is often unnecessary since the shared kernels often already act as a powerful regularizer), you might consider stochastic pooling.
Among the most important tips I can give is to build a solid mechanism for measuring the quality of the model and then experiment. Try modifying the architecture and then tuning hyper-parameters to see what yields the best results. In particular, make sure to monitor training loss vs. validation loss so that you can identify when the model begins overfitting.
After 2012 Imagenet, all convolutional neural networks which performs good(state of the art) are adding more convolutional neural network, they even use zero padding to increase the convolutional neural network.
Increase the number of convolutional neural network.
Some says that dropout is not that effective on CNN, however it is not bad to use, but
You should lower the dropout value, you should try it(May be 0.2).
Data should be analysed. If it is low,
You should use data augmentation techniques.
If you have more data in one of the labels,
You are stuck with the imbalanced data problem. But you should not consider it for now.
You can
Fine-Tune from VGG-Net or some other CNN's should be considered.
Also, don't convert to grayscale, after image-to-array transformation, you should just divide 225.
I think that you learned CNN from some tutorial(MNIST) and you think that you should turn it to grayscale.

Resources