I'm currently using a Java library to do some naive experimentation with offline handwriting recognition. I give my program an image of a pre-written English sentence, segment it into individual characters, and feed those characters to a very naively constructed neural network.
I'm new to neural nets, so my question is where to start with optimising this network's hyperparameters. Currently it's a simple feed-forward network trained with resilient propagation, so the only parameters I can optimise are the number of hidden layers and the number of neurons in each hidden layer. I could of course do an exhaustive search through a large but finite number of combinations, but that would be very time-consuming, and I'm sure someone out there who is more informed in this art can point me in the right direction.
I found a post somewhere on here stating that a good starting point for any network in general is a single hidden layer with a number of neurons equal to the mean of the input and output layer sizes, so that's what I'm doing at the moment.
I'm getting about 40-60% accuracy with this model, depending on the character.
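For illustration of the kind of search this would involve (not the Java library in question, just a hedged sketch): a small random search over depth and width with scikit-learn's MLPClassifier, where `X` and `y` stand in for the flattened character images and their labels and are assumed to already exist.

```python
import random
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# X: flattened character images, y: character labels (assumed to exist).
best_score, best_cfg = 0.0, None
for _ in range(20):                                    # 20 random configurations
    n_layers = random.choice([1, 2, 3])
    width = random.choice([32, 64, 128, 256])
    clf = MLPClassifier(hidden_layer_sizes=(width,) * n_layers, max_iter=300)
    score = cross_val_score(clf, X, y, cv=3).mean()    # 3-fold cross-validation
    if score > best_score:
        best_score, best_cfg = score, (n_layers, width)
print("best (layers, width):", best_cfg, "accuracy:", best_score)
```

Random search like this usually finds a reasonable configuration far faster than an exhaustive grid.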
Related
I have designed a 3-layer neural network whose inputs are the concatenated features from a CNN and an RNN. The weights learned by the network take very small values. What is a reasonable explanation for this? And how do I interpret the weight histograms and distributions in TensorFlow? Any good resource for this?
This is the weight distribution of the first hidden layer of a 3-layer neural network, visualized using TensorBoard. How should I interpret this? Are all the weights taking a value of zero?
This is the weight distribution of the second hidden layer of the 3-layer neural network:
How to interpret the weight histograms and distributions in TensorFlow?
Well, you probably didn't realize it, but you have just asked the million-dollar question in ML & AI...
Model interpretability is a hyper-active and hyper-hot area of current research (think holy grail, or something like it), which has been brought to the fore lately not least due to the (often tremendous) success of deep learning models in various tasks; these models are currently only black boxes, and we naturally feel uncomfortable about that...
Any good resource for it?
Probably not exactly the kind of resources you were thinking of, and we are well off an SO-appropriate topic here, but since you asked...:
A recent (July 2017) article in Science provides a nice overview of the current status & research: How AI detectives are cracking open the black box of deep learning (no in-text links, but googling names & terms will pay off)
DARPA itself is currently running a program on Explainable Artificial Intelligence (XAI)
There was a workshop at NIPS 2016 on Interpretable Machine Learning for Complex Systems
On a more practical level:
The Layer-wise Relevance Propagation (LRP) toolbox for neural networks (paper, project page, code, TF Slim wrapper)
FairML: Auditing Black-Box Predictive Models, by Fast Forward Labs (blog post, paper, code)
A very recent (November 2017) paper by Geoff Hinton, Distilling a Neural Network Into a Soft Decision Tree, with an independent PyTorch implementation
SHAP: A Unified Approach to Interpreting Model Predictions (paper, authors' code)
These should be enough for starters, and to give you a general idea of the subject about which you asked...
UPDATE (Oct 2018): I have put up a much more detailed list of practical resources in my answer to the question Predictive Analytics - “Why” factor?
The weights learned by the network take very small values. What is a reasonable explanation for this? How should I interpret this? Are all the weights taking a value of zero?
Not all weights are zero, but many are. One reason is regularization (in combination with a large network, i.e. wide layers). Regularization, both L1 and L2, makes weights small. If your network is large, most weights are not needed, i.e. they can be set to zero and the model still performs well.
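A minimal sketch of that effect, assuming TensorFlow/Keras and an arbitrary toy task (the layer width and regularization strength are made up for illustration):

```python
import numpy as np
import tensorflow as tf

# Toy data: only the first of 20 features matters, so most weights are redundant.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20)).astype("float32")
y = (x[:, 0] > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(512, activation="relu", name="wide",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-2)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=50, verbose=0)

w = model.get_layer("wide").get_weights()[0]   # kernel of the wide hidden layer
print("fraction of near-zero weights:", np.mean(np.abs(w) < 1e-2))
```

With L2 (or L1) regularization and far more units than the task needs, the printed fraction should be large, which is exactly a weight histogram piled up around zero.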
How to interpret the weight histograms and distributions in TensorFlow? Any good resource for it?
I am not so sure about weight distributions. There is some work that analyzes them, but I am not aware of a general interpretation. For example, for CNNs it is known that the center weights of a filter/feature usually have larger magnitude than those at the corners; see [Locality-Promoting Representation Learning, ICPR 2021, https://arxiv.org/abs/1905.10661].
For CNNs you can also visualize the weights directly, if you have large filters. For example, for simple networks you can see that the weights first converge towards some kind of class average before overfitting starts. This is shown in Figure 2 of [The Learning Phases in NN: From Fitting the Majority to Fitting a Few, 2022, http://arxiv.org/abs/2202.08299].
Rather than going for weights, you can also look at which samples trigger the strongest activations for specific features. If you don't want to look at single features, there is also the possibility of visualizing what the network actually remembers of the input; see, e.g., [Explaining Neural Networks by Decoding Layer Activations, https://arxiv.org/abs/2005.13630].
These are just a few examples (disclaimer: I authored these works) - there are thousands of other works on explainability out there.
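As a concrete illustration of the activation-based inspection mentioned above, here is a minimal Keras sketch; `model`, `x`, the layer name "dense_1", and the unit index are all assumptions you would replace with your own.

```python
import numpy as np
import tensorflow as tf

# Assumes a trained tf.keras `model` and a batch of inputs `x`.
feature_model = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer("dense_1").output)
activations = feature_model.predict(x)         # shape: (n_samples, n_units)
unit = 7                                       # placeholder feature of interest
top5 = np.argsort(activations[:, unit])[-5:]   # samples exciting this unit most
print("strongest-activating sample indices:", top5[::-1])
```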
Given two layers of a neural network that have a 2D representation, i.e. fields of activation, I'd like to connect each neuron of the lower layer to the nearby neurons of the upper layer, say within a certain radius. Is this possible with TensorFlow?
This is similar to a convolution, but the weight kernels should not be tied. I'm trying to avoid connecting both layers fully and then masking out most of the parameters, in order to keep the number of parameters low.
I don't see a simple way to do this efficiently with existing TensorFlow ops, but there might be some tricks with sparse operations. However, ops for efficient locally connected, non-convolutional neural-net layers would be very useful, so you might want to file a feature request as a GitHub issue.
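For what it's worth, later versions of tf.keras shipped a LocallyConnected2D layer that implements exactly this untied-weights local connectivity. A minimal sketch, assuming a TF 2.x version where the layer is still available (it has been deprecated/removed in the newest Keras releases):

```python
import tensorflow as tf

# LocallyConnected2D works like Conv2D but with untied weights: every output
# position gets its own kernel, i.e. local connectivity without weight sharing.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.LocallyConnected2D(filters=8, kernel_size=(5, 5),
                                       activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()   # note the larger parameter count compared to a Conv2D
```

The parameter count is much higher than for a shared-weight convolution, but still far lower than a fully connected layer with most weights masked out.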
Most neural networks achieve high accuracy with only one hidden layer, so what is the purpose of multiple hidden layers?
To answer your question, you first need to understand why the term 'deep learning' was coined about a decade ago. Deep learning is nothing but a neural network with several hidden layers. The term 'deep' roughly refers to the way our brain passes sensory inputs (especially the eyes and the visual cortex) through different layers of neurons to do inference. However, until about a decade ago researchers were not able to train neural networks with more than one or two hidden layers, due to issues such as vanishing and exploding gradients, getting stuck in local minima, and less effective optimization techniques (compared to what is being used nowadays), among others. In 2006 and 2007, several researchers [1], [2] showed new techniques enabling better training of neural networks with more hidden layers, and since then the era of deep learning has started.
In deep neural networks the goal is to mimic what the brain does (hopefully). Before describing this further, I should point out that, from an abstract point of view, the problem in any learning algorithm is to approximate a function given some inputs X and outputs Y. This is also the case for neural networks, and it has been theoretically proven that a neural network with only one hidden layer using a bounded, continuous activation function as its units can approximate any function. This result is known as the universal approximation theorem (a toy sketch appears after the list below). However, this raises the question of why current neural networks with one hidden layer cannot approximate any function with very high accuracy (say >99%). This could be due to many reasons:
The current learning algorithms are not as effective as they should be
For a specific problem, how should one choose the exact number of hidden units so that the desired function is learned and the underlying manifold is approximated well?
The number of training examples required could be exponential in the number of hidden units. So how many training examples should one train a model with? This could turn into a chicken-and-egg problem!
What is the right bounded, continuous activation function, and is the universal approximation theorem generalizable to activation functions other than the sigmoid?
There are other questions that need to be answered as well, but I think the most important ones are those I mentioned.
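To make the universal approximation point above concrete, here is a toy sketch (TensorFlow/Keras; the target function, unit count, and training budget are arbitrary choices): a single hidden layer of tanh units fitting sin(x) on an interval.

```python
import numpy as np
import tensorflow as tf

x = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1).astype("float32")
y = np.sin(x)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(64, activation="tanh"),   # the single hidden layer
    tf.keras.layers.Dense(1),                       # linear output
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=500, verbose=0)
print("final MSE:", model.evaluate(x, y, verbose=0))   # should be close to zero
```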
Before provable answers to the above questions existed (either theoretical or empirical), researchers started using more than one hidden layer with a limited number of hidden units. Empirically this has shown a great advantage. Although adding more hidden layers increases the computational cost, it has been empirically shown that more hidden layers learn hierarchical representations of the input data and can generalize better to unseen data as well. By looking at the pictures below you can see how a deep neural network learns hierarchies of features and combines them successively as we go from the first hidden layer to the last:
Image taken from here
As you can see, the first hidden layer (shown at the bottom) learns some edges; combining those seemingly useless representations then yields parts of objects, and combining those parts yields things like faces, cars, elephants, chairs, and so on. Note that these results would not have been achievable without new optimization techniques and new activation functions.
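Purely as an architectural sketch of the idea (not the network behind the figure): stacking a few convolutional layers in Keras gives the kind of hierarchy described above, where early layers respond to low-level patterns and later layers to more abstract ones.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # low-level edges/colours
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),   # combinations of edges (parts)
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),   # object-level features
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```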
I was looking for an automatic way to decide how many layers I should apply to my network, depending on the data and my computer configuration. I searched the web, but I could not find anything. Maybe my keywords or search strategy were wrong.
Do you have any idea?
The number of layers, or depth, of a neural network is one of its hyperparameters.
This means that it is a quantity that cannot be learned from the data; you should choose it before trying to fit your dataset. According to Bengio,
"We define a hyper-parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself."
There are three main approaches to finding the optimal value for a hyperparameter. The first two are well explained in the paper I linked.
Manual search. Using well-known black magic, the researcher chooses the optimal value through trial and error.
Automatic search. The researcher relies on an automated routine in order to speed up the search.
Bayesian optimization.
More specifically, adding more layers to a deep neural network is likely to improve the performance (reduce generalization error), up to a certain depth, after which it overfits the training data.
So, in practice, you should train your ConvNet with, say, 4 layers, then try adding one hidden layer and training again, repeating until you see some overfitting. Of course, strong regularization techniques (such as dropout) are required.
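A hedged sketch of that grow-until-overfitting loop in Keras (the layer width, dropout rate, and the `x_train`/`y_train`/`x_val`/`y_val` arrays are assumptions you would replace with your own):

```python
import tensorflow as tf

def build_mlp(depth, width=128, input_dim=784, n_classes=10):
    """Build an MLP with `depth` hidden layers plus dropout (illustrative only)."""
    layers = [tf.keras.layers.Input(shape=(input_dim,))]
    for _ in range(depth):
        layers += [tf.keras.layers.Dense(width, activation="relu"),
                   tf.keras.layers.Dropout(0.5)]
    layers.append(tf.keras.layers.Dense(n_classes, activation="softmax"))
    return tf.keras.Sequential(layers)

# Add depth until the validation loss stops improving (overfitting sets in).
best_loss, best_depth = float("inf"), None
for depth in range(1, 8):
    model = build_mlp(depth)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              epochs=20, verbose=0)
    val_loss = model.evaluate(x_val, y_val, verbose=0)
    if val_loss < best_loss:
        best_loss, best_depth = val_loss, depth
print("best depth:", best_depth)
```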
I am a newbie in machine learning and also in neural networks. Currently I'm taking a course at coursera.org about neural networks, but I don't understand everything. I have a little problem with my thesis. I should use a neural network, but I don't know how to choose the right neural network architecture for my problem.
I have a lot of data from web portals (typically online editions of newspapers and magazines). There is information about each article, for example its name, the text of the article, and its release date. There are also large amounts of sequence data that capture the behavior of users.
My goal is to predict the popularity of an article (the number of readers or clicks on the article by unique users). I want to build vectors from this data and feed my neural network with these vectors.
I have two questions:
1. How do I create the right vector?
2. Which neural network architecture is best suited for this problem?
Those are very broad questions. You'll need to identify smaller issues if you want more exact answers.
How do I create the right vector?
For text data, you usually use the vector space model. Best results are often obtained using tf-idf weighting.
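A minimal sketch of the tf-idf step with scikit-learn (the `texts` list is a placeholder for your article bodies):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["first article text ...", "second article text ..."]  # your articles
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X = vectorizer.fit_transform(texts)   # sparse matrix: (n_articles, n_terms)
```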
Which neural network architecture is suitable for this problem?
This is very hard to say. I would start with a network with k input neurons, where k is the size of your vectors after applying tf-idf. (You might also want to do some sort of feature selection to reduce the number of features; a good feature selection method is the chi-squared test.)
Then, a standard network layout is a single hidden layer with a number of neurons equal to the average of the number of input and output neurons. It looks like you only need a single output neuron that will output how popular the article is going to be (this can be a linear neuron or a sigmoid neuron); a sketch of this layout appears at the end of this answer.
For the neurons in your hidden layer, you can also experiment with linear and sigmoid neurons.
There are many other things you can try as well: weight decay, the momentum technique, networks with multiple layers, recurrent networks and so on. It's impossible to say what would work best for your given problem without a lot of experimentation.
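Putting the baseline layout described above into code, a hedged Keras sketch (it assumes the tf-idf matrix `X` from the earlier snippet and a popularity target `y`; the hidden size follows the "average of input and output neurons" rule of thumb):

```python
import tensorflow as tf

k = X.shape[1]            # number of tf-idf features
hidden = (k + 1) // 2     # mean of input (k) and output (1) layer sizes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(k,)),
    tf.keras.layers.Dense(hidden, activation="sigmoid"),
    tf.keras.layers.Dense(1),            # linear output: predicted popularity
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X.toarray(), y, validation_split=0.1, epochs=50)
```

From there you can swap activations, add weight decay or momentum, or try the deeper or recurrent variants suggested above.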