Interpreting internal features of Multilayer Perceptrons - machine-learning

I've trained a multilayer perceptron on a medical imaging classification task (classifying whether an ultrasound scan image belongs to the healthy or the disease condition). The network consists of 2 fully connected hidden layers and 1 output unit. I now want to examine the weights to see which features in the images (e.g., clusters of pixels) are most important for the network to distinguish between the classes. Since my network has two layers of hidden weights, how do I use these weights to quantify the importance of each image pixel? Could someone experienced with this point me to the right literature? Thanks.

"Several methods for finding saliency have been described by other authors. Among them are sensitivity
based approaches [4, 5, 6], deconvolution based ones [7, 8], or more complex ones like
layer-wise relevance propagation (LRP) [9]."
source : https://arxiv.org/pdf/1704.07911.pdf
They are doing what you want, but with a CNN. Maybe you should move from an MLP to a CNN; that would seem appropriate for medical imaging classification.
Or maybe this paper would fit better:
Randomization approach for understanding variable contributions in artificial neural networks
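If you want a concrete starting point before diving into the literature, here is a minimal sketch of the simplest sensitivity-based idea: take the gradient of the network's output with respect to the input pixels and use its magnitude as a per-pixel importance score. The model, image size, and data below are placeholders (tf.keras assumed), not the asker's actual network.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

H, W = 64, 64  # hypothetical image size

# Placeholder MLP: two hidden layers, one sigmoid output (healthy vs. disease)
model = models.Sequential([
    layers.Input(shape=(H * W,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

image = np.random.rand(1, H * W).astype("float32")  # stand-in for one flattened ultrasound image
x = tf.convert_to_tensor(image)

with tf.GradientTape() as tape:
    tape.watch(x)
    score = model(x)  # predicted probability of the "disease" class

# |d score / d pixel|: how sensitive the output is to each input pixel
saliency = tf.abs(tape.gradient(score, x)).numpy().reshape(H, W)
```

Pixels with large saliency values are the ones the trained network's prediction is most sensitive to; the deconvolution-based methods and LRP from the quoted survey are alternatives that address some shortcomings of plain gradients.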

Related

Using embeddings for non-language concepts?

Does it make sense to use an embedding instead of large one-hot encoded vectors representing, say, car makes and models? Also, what would the embedding represent conceptually? How similar a Ford F-150 is to a Toyota Tacoma, for example?
Yes, it makes sense.
You can think of embeddings as a representation of your input in a different space. Sometimes you want to perform dimensionality reduction, hence your embedding has lower dimensionality than your input. Other times, you simply want your embedding to be very descriptive of your input, so that your model, say a neural network, can easily distinguish it from all other inputs (this is especially useful in classification tasks).
As you can see, an embedding is just a vector that describes your input better than the input itself. In this context, we generally refer to embeddings with the word features.
But, maybe, what you're asking is a bit different. You want to know if an embedding can express similarity between cars. Theoretically, yes. Suppose you have the following embeddings:
Car A: [0 1]
Car B: [1 0]
The first element of the embedding is the maker. 0 stands for Toyota and 1 stands for Ferrari. The second element is the model. 0 stands for F-150 and 1 stands for 458 Italia. How can you compute similarity between these two cars?
Cosine similarity
Basically, you compute cosine of the angle between these two vectors in the embedding space. Here the embeddings are 2-dimensional, hence we are in a plane. Moreover, the two embeddings are orthogonal, thus the angle between them is 90° and the cosine 0. So their similarity is 0: they are not similar at all!
Suppose you have:
Car A: [1 0]
Car B: [1 1]
In this case the maker is the same. Although the model is different, you might expect these two cars to be more similar than the previous two. If you compute the cosine of the angle between their embeddings, you get around 0.707 which is greater than 0. These two cars are indeed more similar.
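If you want to check these numbers yourself, here is a quick sketch of the computation in Python (NumPy assumed):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(angle) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([0, 1]), np.array([1, 0])))  # 0.0    -> not similar at all
print(cosine_similarity(np.array([1, 0]), np.array([1, 1])))  # ~0.707 -> more similar
```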
Obviously, it's not so easy. It all depends on how you design your model and how the embeddings are learned, i.e. which data you provide as input to your system.
TLDR: Yes it makes sense. No it's not the same as the famous Word2Vec embedding.
When people talk about embedding data in vector representation, they really mean factorization of the design matrix they explicitly/implicitly construct.
Take Word2Vec as an example. The design matrix represents an artificially constructed prediction problem, where the central word is used to predict the words in its surrounding context (SkipGram). It is equivalent to factorizing a cross-tabbed matrix of context and central words filled with positive pointwise mutual information [1].
Now, let's say we would like to answer the question: how similar is a Ford F-150 to a Toyota Tacoma?
First, we have to decide if our data allows us to use supervised methods. If yes, then there are a few algorithms we can use, like the traditional feed-forward neural network and the factorization machine. You can use these algorithms to define similarity of features in one-hot space by using prediction labels, like clicks on detail pages at a car-rental website. Then car models with similar vectors are those whose detail pages people click on in the same session. That is, the behavior of the response models the similarity of the features.
If your dataset is not labeled, you can still try to predict co-occurrence of features. This is the novelty of Word2Vec, namely cleverly defining prediction problems using unlabeled sentences of co-occurring tokens in context windows. In this case, the vectors merely represent co-occurrence of the features. They can be useful as a dimensionality reduction technique to extract dense features for another prediction problem down the pipeline.
If you want to save some brain power, and your features happen to all be factors (categorical variables), you can apply existing algorithms from packages, things like LDA, NMF, or SVD, with a loss function for binary classification, such as hinge loss. Most programming languages have libraries whose APIs need only a few lines of code, as in the sketch below.
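As a concrete illustration of that last point, here is a minimal sketch using scikit-learn's TruncatedSVD on a one-hot design matrix; the car data and the choice of 2 components are made up for the example.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD

# Hypothetical catalogue of cars described by categorical features
cars = pd.DataFrame({
    "make":  ["Ford", "Ford", "Toyota", "Toyota"],
    "model": ["F-150", "Mustang", "Tacoma", "Corolla"],
})

# One-hot design matrix: one column per (feature, level) pair
X = pd.get_dummies(cars)

# Factorize into a low-rank representation; each row is now a dense 2-d embedding
svd = TruncatedSVD(n_components=2)
embeddings = svd.fit_transform(X)
print(embeddings)
```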
All the methods above are matrix factorization. There are also deeper, more complex tensor factorization methods, but I will let you research those on your own.
Reference
[1] O. Levy and Y. Goldberg, "Neural Word Embedding as Implicit Matrix Factorization." http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.pdf

How to classify a sequence of images with Keras deep learning

I want to make a classification model for a sequence of CT images with Keras. My dataset comes from 50 patients and each patient has 1000 images. For a patient, each image has a meaningful relationship with the previous image. I want to use these meaningful relationships, but I don't know how to build a model for such a problem. Can you please give me an idea or examples?
Your problem is in the context of sequence classification. You need to classify sequences of images. In this case, a model is needed to learn two aspects:
Features of the images
Features of the sequence (temporal or time-related features)
This might sound similar to video classification in which a video is a sequence of several frames. See here.
For extracting features from images:
Most real-world cases use Convolutional Neural Networks. They use layers like Max Pooling and Convolution. They are excellent at extracting features from a 3D input like an image. You can learn more from here.
For handling temporal data:
Here's where you will require an RNN (Recurrent Neural Network). LSTM (Long Short-Term Memory) cells are a popular type of RNN, as they can hold a longer memory than traditional RNN cells.
RNNs preserve the hidden layer activations and use them in processing each and every term in a sequence. Hence, while processing the 2nd image in a sequence, the RNN has knowledge or activations of the 1st image in that same sequence.
You can know more from here.
Finally, we require a fusion of both the above networks:
A CNN-LSTM network uses both convolutional as well as LSTM cells to classify the image sequences.
You can refer here and here
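Since the question mentions Keras, here is a minimal sketch of such a CNN-LSTM using TimeDistributed; the sequence length, image size, and layer sizes are illustrative assumptions, not tuned values.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, IMG_SIZE, N_CLASSES = 20, 128, 2  # hypothetical values

# Per-frame CNN feature extractor
cnn = models.Sequential([
    layers.Input(shape=(IMG_SIZE, IMG_SIZE, 1)),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, IMG_SIZE, IMG_SIZE, 1)),
    # TimeDistributed applies the CNN independently to each image in the sequence
    layers.TimeDistributed(cnn),
    # The LSTM then aggregates the per-image features across the sequence
    layers.LSTM(64),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Expected input shape is (batch, SEQ_LEN, IMG_SIZE, IMG_SIZE, 1), i.e. one labelled sequence of images per sample.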
Hope that helps you. :-)

Is it better to make a neural network have hierarchical output?

I'm quite new to neural networks and I recently built a neural network for number classification in vehicle license plates. It has 3 layers: 1 input layer for the 16*24 (384 neurons) number image at 150 dpi, 1 hidden layer (199 neurons) with a sigmoid activation function, and 1 softmax output layer (10 neurons), one for each number 0 to 9.
I'm trying to expand my neural network to also classify letters in license plates. But I'm worried that if I simply add more classes to the output, for example adding 10 letters for a total of 20 classes, it would be hard for the neural network to separate the features of each class. I also think it might cause a problem when the input is a number and the neural network wrongly classifies it as the letter with the biggest probability, even though the sum of the probabilities of all the number outputs exceeds it.
So I wonder if it is possible to build a hierarchical neural network in the following manner:
There are 3 neural networks: 'Item', 'Number', 'Letter'.
The 'Item' network classifies whether the input is a number or a letter.
If the 'Item' network classifies the input as a number (letter), the input then goes through the 'Number' ('Letter') network.
Return the final output from the 'Number' ('Letter') network.
And the learning mechanism for each network is as follows:
The 'Item' network learns from all images of numbers and letters, so it has 2 outputs.
The 'Number' ('Letter') network learns from images of numbers (letters) only.
Which method should I pick to get better classification: simply add 10 more classes, or build hierarchical neural networks as described above?
I'd strongly recommend training only a single neural network with outputs for all the kinds of images you want to be able to detect (so one output node per letter you want to be able to recognize, and one output node for every digit you want to be able to recognize).
The main reason for this is that recognizing digits and recognizing letters is really kind of exactly the same task. Intuitively, you can understand a trained neural network with multiple layers as performing the recognition in multiple steps. In the hidden layer it may learn to detect various kinds of simple, primitive shapes (e.g. the hidden layer may learn to detect vertical lines, horizontal lines, diagonal lines, certain kinds of simple curved shapes, etc.). Then, in the weights between hidden and output layers, it may learn how to recognize combinations of multiple of these primitive shapes as a specific output class (e.g. a vertical and a horizontal line in roughly the correct locations may be recognized as a capital letter L).
Those "things" it learns in the hidden layer will be perfectly relevant for digits as well as letters (that vertical line which may indicate an L may also indicate a 1 when combined with other shapes). So, there are useful things to learn that are relevant for both ''tasks'', and it will probably be able to learn these things more easily if it can learn them all in the same network.
See also this answer I gave to a related question in the past.
I'm trying to expand my neural network to also classify letters in license plates. But I'm worried that if I simply add more classes to the output, for example adding 10 letters for a total of 20 classes, it would be hard for the neural network to separate the features of each class.
You're far from where it becomes problematic. ImageNet has 1000 classes and is commonly handled by a single network; see the AlexNet paper. If you want to learn more about CNNs, have a look at chapter 2 of "Analysis and Optimization of Convolutional Neural Network Architectures", and while you're at it, see chapter 4 for hierarchical classification. You can read the summary for ... well, a summary of it.
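For concreteness, here is a minimal sketch of the single-network option in Keras, with one softmax output node per digit and per letter. The input and hidden sizes follow the question; the class list and the rest are illustrative assumptions.

```python
import string
import tensorflow as tf
from tensorflow.keras import layers, models

# 10 digits + 26 uppercase letters = 36 classes in one output layer
classes = list(string.digits) + list(string.ascii_uppercase)

model = models.Sequential([
    layers.Input(shape=(16 * 24,)),                    # flattened 16x24 character image
    layers.Dense(199, activation="sigmoid"),           # hidden layer size as in the question
    layers.Dense(len(classes), activation="softmax"),  # one output node per character
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```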

Why do we have multiple layers and multiple nodes per layer in a neural network?

I just started to learn about neural networks, and so far my knowledge of machine learning is simply linear and logistic regression. From my understanding of the latter algorithms, given multiple inputs, the job of the learning algorithm is to come up with appropriate weights for each input so that eventually I have a polynomial that either describes the data, as in the case of linear regression, or separates it, as in the case of logistic regression.
If I were to represent the same mechanism as a neural network, according to my understanding, it would look something like this:
multiple nodes at the input layer and a single node in the output layer, where I can backpropagate the error proportionally to each input, so that eventually I arrive at a polynomial X1W1 + X2W2 + ... + XnWn that describes the data. To me, having multiple nodes per layer, aside from the input layer, seems to make the learning process parallel, so that I can arrive at the result faster; it's almost like running multiple learning algorithms, each with a different starting point, to see which one converges faster. As for the multiple layers, I'm at a loss as to what mechanism and advantage they have for the learning outcome.
why do we have multiple layers and multiple nodes per layer in a neural network?
We need at least one hidden layer with a non-linear activation to be able to learn non-linear functions. Usually, one thinks of each layer as an abstraction level. For computer vision, the input layer contains the image and the output layer contains one node for each class. The first hidden layer detects edges, the second hidden layer might detect circles / rectangles, and then come more complex patterns.
There is a theoretical result (the universal approximation theorem) which says that an MLP with only one hidden layer can fit every function of interest up to an arbitrarily low error margin if this hidden layer has enough neurons. However, the number of parameters might be MUCH larger than if you add more layers.
Basically, by adding more hidden layers / more neurons per layer you add more parameters to the model. Hence you allow the model to fit more complex functions. However, to my knowledge there is no quantitative understanding of what exactly adding a single further layer / node changes.
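As a tiny worked example of the first point, here is a hand-constructed (not trained) network with two ReLU hidden units that computes XOR, a function no single linear layer can represent:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
W1 = np.array([[1, 1],
               [1, 1]])
b1 = np.array([0, -1])

# Output layer: y = h1 - 2*h2
W2 = np.array([[1], [-2]])

H = relu(X @ W1 + b1)
y = H @ W2
print(y.ravel())  # [0 1 1 0] -> XOR of the two inputs
```

Remove the non-linearity (the ReLU) and the whole thing collapses into a single linear map, which cannot produce this output pattern.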
It seems to me that you might want a general introduction to neural networks. I recommend chapters 4.3 and 4.4 of [Tho14a] (my bachelor's thesis) as well as [LBH15].
[Tho14a] M. Thoma, "On-line recognition of handwritten mathematical symbols," Karlsruhe, Germany, Nov. 2014. [Online]. Available: https://arxiv.org/abs/1511.09030
[LBH15] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015. [Online]. Available: http://www.nature.com/nature/journal/v521/n7553/abs/nature14539.html

Why do we normally have more than one fully connected layer in the late stages of CNNs?

As I noticed, in many popular architectures of convolutional neural networks (e.g. AlexNet), people use more than one fully connected layer, with almost the same dimension, to gather the responses to features detected in the earlier layers.
Why don't we use just one FC layer for that? Why is this hierarchical arrangement of fully connected layers possibly more useful?
Because there are some functions, such as XOR, that can't be modeled by a single layer. In this type of architecture the convolutional layers compute local features and the fully-connected output layer(s) then combine these local features to derive the final outputs. So you can consider the fully-connected layers as a semi-independent mapping of features to outputs, and if this is a complex mapping then you may need the expressive power of multiple layers.
Actually, it's no longer popular/normal. 2015+ networks (such as ResNet and Inception-v4) use global average pooling (GAP) as the last layer, followed by softmax, which gives the same performance with a much smaller model. The last 2 fully connected layers in VGG16 account for about 80% of all parameters in the network. But to answer your question, it's common to use a 2-layer MLP for classification and consider the rest of the network to be feature generation. 1 layer would be ordinary logistic regression, with a global minimum and simple properties; 2 layers give you the usefulness of non-linearity together with training by SGD.
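To see the size difference, here is a rough sketch comparing a VGG-style fully connected head with a GAP head on the same convolutional backbone (tf.keras assumed; weights=None so nothing is downloaded, and the printed counts are only meant to show how much the FC head dominates):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Shared convolutional backbone with randomly initialized weights
backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_shape=(224, 224, 3))

# Classic VGG-style head: Flatten + two big FC layers + softmax
x = layers.Flatten()(backbone.output)
x = layers.Dense(4096, activation="relu")(x)
x = layers.Dense(4096, activation="relu")(x)
fc_out = layers.Dense(1000, activation="softmax")(x)
fc_model = models.Model(backbone.input, fc_out)

# Modern head: global average pooling + softmax
g = layers.GlobalAveragePooling2D()(backbone.output)
gap_out = layers.Dense(1000, activation="softmax")(g)
gap_model = models.Model(backbone.input, gap_out)

print("Params with FC head: ", fc_model.count_params())
print("Params with GAP head:", gap_model.count_params())
```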
