I want to use a neural network to learn the mapping from an input vector to an output vector. The physics of the problem imposes constraints such that certain input nodes only influence certain output nodes. I want to use that constraint in the training.
If I formulate the NN as a directed graph, I imagine the paths from certain input nodes to certain output nodes are 'blocked', and the error signal should not back-propagate through such paths. For example, in the figure below, I show a NN with 2 input and 2 output nodes. Input node 1 should not have any influence on output 4, so any path from node 1 to 4 (shown as dashed lines) should carry no back-prop.
I cannot simply set some edge/weight to zero to satisfy the constraints, because the constraints are on paths, not on a single edge/weight.
I would appreciate anyone sharing thoughts and experience on this issue. Maybe this is a well-studied problem, but I haven't found anything despite searching hard.
Interesting case. I'm afraid neural networks don't work like this. Layers are considered independent: forward and back passes flow through all available connections and each layer doesn't know how the current tensor was accumulated.
In terms of architecture, your choice is to block individual connections, something like DropConnect but without the randomness, if that's possible.
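One way to realize "DropConnect without randomness", sketched here in NumPy under assumed sizes: give the hidden layer a block structure and apply fixed binary masks to the weight matrices so that no path exists from the forbidden input to the forbidden output. The group sizes and masks below are illustrative choices, not a prescription.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs, 4 hidden units, 2 outputs. Hidden units 2-3 may see both
# inputs but feed only output 0; hidden units 0-1 see only input 1 and
# feed both outputs. Net effect: no path from input 0 to output 1.
m1 = np.array([[0, 0, 1, 1],      # input 0 reaches only hidden 2-3
               [1, 1, 1, 1]])     # input 1 reaches everything
m2 = np.array([[1, 1],
               [1, 1],
               [1, 0],            # hidden 2-3 blocked from output 1
               [1, 0]])

W1 = rng.normal(size=(2, 4)) * m1
W2 = rng.normal(size=(4, 2)) * m2

def forward(x):
    h = np.tanh(x @ W1)
    return h @ W2

a = forward(np.array([0.0, 1.0]))
b = forward(np.array([5.0, 1.0]))   # change input 0 only
# a[1] == b[1]: output 1 is unaffected by input 0.
# During training, multiply the weight gradients by the same masks (or
# simply re-apply the masks after each update) so the blocks stay zero.
```

Because the masks are applied to the weights themselves, the blocked paths carry neither forward activations nor backprop signal, which is exactly the path constraint asked for.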
You can also consider separate networks for each output, e.g. one network in which (1, 2) predicts 3, and another in which 2 alone predicts 4. This way you enforce your constraints, but you lose weight sharing between the networks, which is not ideal.
Another option: you could augment the dataset so that the network actually learns that certain inputs do not affect certain outputs. Depending on your actual problem this may be time-consuming, but at least in theory it may work: for a given input/output pair (1, 2) -> (3, 4), you can add several additional pairs (1*, 2) -> (3*, 4) to show that changing 1 affects the first output (3*) but not the second (4).
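A sketch of that augmentation, assuming you have access to some labelling function `f` (a simulator or the physics model; the function name, perturbation scale, and toy physics below are all made up for illustration):

```python
import numpy as np

def augment(x, f, free_idx=0, n=3, scale=0.1, seed=0):
    """Create extra training pairs by perturbing only x[free_idx].

    f is an assumed oracle (e.g. a physics simulator) that labels the
    perturbed inputs. If the constraint holds, the output components
    that x[free_idx] cannot influence stay identical across the pairs,
    which is exactly the signal the network should pick up.
    """
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n):
        x_new = x.copy()
        x_new[free_idx] += rng.normal(scale=scale)
        pairs.append((x_new, f(x_new)))
    return pairs

# toy physics: output 0 depends on both inputs, output 1 only on input 1
def f(x):
    return np.array([x[0] + x[1], x[1] ** 2])

pairs = augment(np.array([1.0, 2.0]), f)
# every pair shares the same second output, while the first output varies
```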
I am new to deep learning and trying to understand the concept behind hidden layers, but I am not clear on the following:
Suppose there are 3 hidden layers. When we take the output from all the nodes of the 2nd layer as input to all the nodes of the 3rd layer, what difference does it make to the outputs of the 3rd-layer nodes, given that they receive the same input and the same parameter initialization? (From what I have read, I assumed that all the nodes of one layer get the same random initial weights.)
Please correct me if I am thinking in the wrong direction.
The simple answer is because of random initialization.
If you started with the same weights throughout the neural network (NN), then all nodes would produce the same output.
This is because, when using the backprop algorithm, the error is spread out based on the activation strength of each node. If the nodes start out the same, the error spreads equally, and hence the nodes in the NN cannot learn different features.
So basic random initialization makes sure that each node specializes. Hence after learning, the nodes in the hidden layers will produce different outputs even when the input is the same.
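The symmetry argument can be checked by hand. In the tiny NumPy example below (the sizes and the 0.5 starting value are arbitrary), two hidden units that start with identical weights produce identical activations and receive identical gradients, so no amount of training can make them diverge:

```python
import numpy as np

x = np.array([0.3, -0.7])
W1 = np.full((2, 2), 0.5)   # both hidden units start with the same weights
w2 = np.full(2, 0.5)

h = np.tanh(x @ W1)         # identical activations: h[0] == h[1]
y = h @ w2

# one backprop step for the squared error 0.5 * (y - t)**2
t = 1.0
g_y = y - t
g_w2 = g_y * h                              # same gradient for both weights
g_W1 = np.outer(x, g_y * w2 * (1 - h**2))   # the two columns are identical
```

After the update, the two hidden units still share identical weights, so the symmetry persists forever; random initialization is what breaks it.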
Hope this helps.
I'm trying to use the approach described in this paper https://arxiv.org/abs/1712.01815 to make the algorithm learn a new game.
There is only one problem that does not directly fit into this approach. The game I am trying to learn has no fixed board size. So currently the input tensor has dimensions m*n*11, where m and n are the dimensions of the game board and can vary each time the game is played. So first of all I need a neural network able to make use of such varying input sizes.
The size of the output is also a function of the board size, as it has a vector with entries for every possible move on the board, and so the output vector will be bigger if the board size increases.
I have read about recurrent and recursive neural networks but they all seem to relate to NLP, and I'm not sure on how to translate that to my problem.
Any ideas on NN architectures able to handle my case would be welcome.
What you need are Pointer Networks (https://arxiv.org/abs/1506.03134).
Here is an introductory quote from a post about them:
Pointer networks are a new neural architecture that learns pointers to positions in an input sequence. This is new because existing techniques need to have a fixed number of target classes, which isn't generally applicable— consider the Travelling Salesman Problem, in which the number of classes is equal to the number of inputs. An additional example would be sorting a variably sized sequence.
- https://finbarr.ca/pointer-networks/
It's an attention-based model.
Essentially a pointer network is used to predict pointers back to the input, meaning your output layer isn't actually fixed, but variable.
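The core mechanism, as a minimal sketch: additive attention scores every encoder position against the current decoder state, and the softmax over those scores *is* the output distribution, so its size tracks the input length automatically. The dimensions and random matrices below are placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
enc = rng.normal(size=(5, d))   # encoder states, one per input position
dec = rng.normal(size=d)        # current decoder state

# additive (Bahdanau-style) scoring, as used in the pointer network paper
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
scores = np.tanh(enc @ W1 + dec @ W2) @ v   # one score per input position

probs = np.exp(scores - scores.max())
probs /= probs.sum()            # distribution over the 5 input positions
pointer = int(np.argmax(probs)) # "points back" into the input
```

With a longer input, `enc` simply has more rows and `probs` more entries; no layer has a size fixed to the number of target classes.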
A use case where I have used them is for translating raw text into SQL queries.
Input: "HOW MANY CARS WERE SOLD IN US IN 1983"
Output: SELECT COUNT(Car_id) FROM Car_table WHERE (Country='US' AND Year='1983')
The issue with raw text like this is that it only makes sense w.r.t. a specific table (in this case a car table with a set of variables around car sales, similar to your different boards for board games). This means the question can't be the only input. So the input that actually goes into the pointer network is a combination of:
Input:

- Query
- Metadata of the table (column names)
- Token vocabulary for all categorical columns
- Keywords from SQL syntax (SELECT, WHERE etc.)

All of these are appended together.
The output layer then simply points back to specific indexes of the input. It points to Country and Year (from column names in metadata), it points to US and 1983 (from tokens in vocabulary of categorical columns), it points to SELECT, WHERE etc from the SQL syntax component of the input.
The sequence of these indexes into the appended input is then used as the output of your computation graph, and optimized using a training dataset such as the WikiSQL dataset.
Your case is quite similar: you pass the inputs, the metadata of the game, and whatever you need as part of your output, all appended together. Then the pointer network simply makes selections from that input (points to them).
You need to go back to a fixed input / output problem.
A common way to fix this issue with images, time series, etc. is to use sliding windows to downsize. Perhaps this can be applied to your game.
A fully convolutional neural network can do that. The parameters of conv layers are convolutional kernels, and a convolutional kernel does not care much about the input size (yes, there are certain limitations related to stride, padding, input and kernel size).
A typical use case is some conv layers followed by max pooling, repeated again and again up to the point where the filters are flattened and connected to a dense layer. The dense layer is the problem, because it expects input of a fixed size. If there is another conv layer instead, your output will simply be another feature map of the appropriate size.
An example of such a network is YOLOv3. If you feed it an image of, say, 416x416x3, one output can be 13x13x(number of filters) (I know YOLOv3 has more output layers, but I will discuss only one for simplicity). If you feed YOLOv3 an image of 256x256x3, the corresponding output will be a feature map of 8x8x(number of filters), since that output has a stride of 32.
So the network doesn't crash and produces results. Will the results be good? I don't know; maybe yes, maybe no. I have never used it in this manner; I always resize the image to the recommended size or retrain the network.
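The size bookkeeping can be checked with a toy strided convolution in NumPy (single channel, 3x3 kernel, stride 2, padding 1; these settings are illustrative and not YOLOv3's actual configuration):

```python
import numpy as np

def conv2d(x, k, stride=2, pad=1):
    """Naive single-channel convolution; the output side length is
    (n + 2*pad - kernel) // stride + 1, whatever n happens to be."""
    x = np.pad(x, pad)
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i*stride:i*stride+kh,
                                 j*stride:j*stride+kw] * k)
    return out

k = np.ones((3, 3))
a = conv2d(conv2d(np.zeros((416, 416)), k), k)   # -> (104, 104)
b = conv2d(conv2d(np.zeros((256, 256)), k), k)   # -> (64, 64)
# same kernel both times: only the feature-map size changes with the input
```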
I am using neural networks for reinforcement learning. I have a neural network with 4 output nodes, and I mapped every output node to a different action. Hidden and output nodes use the sigmoid activation function.
A problem I face is that on some inputs, a few output nodes have the same value (i.e. they have an output value of 1). I am not sure what to do in this situation. Is there some way I could fix this, so that no two output nodes have the same value? Or should I just choose randomly between the actions assigned to the output nodes with the highest value?
Given your setup (i.e. one output per action), some ties will be unavoidable. That said, there are a few things you can do to alleviate the problem.
You can reduce the chance of ties by choosing an activation function for your output nodes that does not saturate like a sigmoid. For example, you may try a Rectified Linear activation function (https://en.wikipedia.org/wiki/Rectifier_(neural_networks)).
You can make your sigmoid less steep by multiplying the input to the sigmoid by a constant smaller than 1.
You can try to apply a softmax function to your output layer (https://en.wikipedia.org/wiki/Softmax_function). Note that if you choose to do so, you should not apply your sigmoid activation to the output layer, because the softmax function only exaggerates existing differences.
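To see why saturation causes the ties, consider a quick NumPy check (the logit values are chosen for illustration): two clearly different logits both saturate to roughly 1.0 under a sigmoid, while a softmax over the raw logits keeps the ordering visible.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

logits = np.array([8.0, 12.0, -1.0, 0.5])

sig = sigmoid(logits)      # sig[0] and sig[1] are both ~1.0 -> a tie
soft = np.exp(logits - logits.max())
soft /= soft.sum()         # soft[1] is ~e^4, about 55x larger than soft[0]
```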
When you do have a tie, you can either pick an action at random (as you suggested yourself), or you can simply pick the first action in your list. This last option does bias whatever learning algorithm you have towards trying the first action, but if your learning algorithm is effective enough, it should eventually learn in which cases this is the wrong action to take, and thus learn to put less emphasis on these actions.
Alternatively, you can try a completely different way of processing your outputs (this depends on your problem domain though). For example, if your four actions are "move north", "move east", "move south", and "move west", you can instead give the network two outputs, one determining whether to move horizontal or vertical (thus splitting the actions into "move north or south" and "move east or west"), and the second output providing the tiebreaker between the remaining alternatives.
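That two-output scheme could be decoded like this (the threshold-at-0.5 convention and the action names are assumptions for illustration):

```python
def decode(axis_out, dir_out):
    """Map two network outputs in [0, 1] to one of four moves:
    the first output picks the axis, the second breaks the tie."""
    if axis_out < 0.5:                      # vertical axis
        return "north" if dir_out < 0.5 else "south"
    return "east" if dir_out < 0.5 else "west"

# decode(0.2, 0.9) -> "south"; decode(0.7, 0.1) -> "east"
```

A four-way tie on a single output layer becomes two independent binary decisions, each of which is much less likely to land exactly on the threshold.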
I am clustering undirected graphs using mcl. To do so, I have chosen a threshold under which nodes are connected, a similarity measure for each edge, and the inflation parameter to tune the granularity of my graph. I have been playing around with these parameters, but so far the clusters I get seem to be too large (my visualizations suggest that the largest clusters should be cut into 2 or more clusters). I was therefore wondering which other parameters I could play with to improve my clustering. (I am currently working with the scheme parameter of mcl to see whether increasing the accuracy helps, but if there are other, more specific parameters that could help to get smaller clusters, please let me know.)
There are mainly two things to consider. The first and most important is outside mcl (http://micans.org/mcl/) itself, namely how the network is constructed. I've written about it elsewhere, but I'll repeat it here because it is important.
If you have a weighted similarity, choose an edge-weight (similarity) cutoff
such that the topology of the network becomes informative; i.e. too many edges
or too few edges yield little discriminative information in the
absence/presence structure of edges. Choose it such that no edges connect
things you consider very dissimilar, and that edges connect things you consider
somewhat similar to quite similar. In the case of mcl, the dynamic range in
edge weight between 'a bit similar' and 'very similar' should be, as a rule of
thumb, one order of magnitude, i.e. two-fold or five-fold or ten-fold, as
opposed to varying from 0.9 to 1.0. Of course, it is possible to give simple
networks to mcl and it will just utilise the absence/presence of edges. Make sure
the network does not become very dense - a very rough rule of thumb could be to aim
for a total number of edges that is in the order of V * sqrt(V) if the number of nodes (vertices) is V, that is, each node has, on average, in the order of sqrt(V) neighbours.
The above, network construction, is really crucial, and it is advisable
to try different approaches. Now, given a network,
there is really only one mcl parameter to vary: the inflation parameter (the -I option).
A good set of values to test with is 1.4, 2, 3, 4, 6.
In summary, if you are exploring, try different ways of network construction,
using your knowledge of the data to make the network a meaningful representation,
and combine this with trying different mcl inflation values.
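The edge-budget rule of thumb above (a total of roughly V * sqrt(V) edges) is easy to turn into a concrete target when pruning the network; the function name here is mine:

```python
import math

def edge_budget(num_nodes):
    """Rule-of-thumb target edge count: V * sqrt(V), i.e. each node
    keeps on the order of sqrt(V) neighbours on average."""
    return int(num_nodes * math.sqrt(num_nodes))

# for a 10,000-node network, aim for about 1,000,000 edges
```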
My goal is to solve the XOR problem using a neural network. I've read countless articles on the theory, proofs, and mathematics behind multi-layered neural networks. The theory makes sense (the math... not so much), but I have a few simple questions regarding the evaluation and topology of a neural network.
I feel I am very close to solving this problem, but I am beginning to question my topology and evaluation techniques. The complexities of back propagation aside, I just want to know if my approach to evaluation is correct. With that in mind, here are my questions:
Assuming we have multiple inputs, does each respective input get its own node? Do we ever input both values into a single node? Does the order in which we enter this information matter?
While evaluating the graph output, does each node fire as soon as it gets a value? Or do we instead collect all the values from the above layer and then fire off once we’ve consumed all the input?
Does the order of evaluation matter? For example, if a given node in layer “b” is ready to fire – but other nodes in that same layer are still awaiting input – should the ready node fire anyway? Or should all nodes in the layer be loaded up before firing?
Should each layer be connected to all nodes in the following layer?
I’ve attached a picture which should help explain (some of) my questions.
Thank you for your time!
1) Yes, each input gets its own node, and that node is always the node for that input type. The order doesn't matter - you just need to keep it consistent. After all, a neural net with hidden layers can learn any consistent mapping from inputs to outputs, so there can't be a particular order you need to put the nodes in for it to work.
2 and 3) You need to collect all the values from a single layer before any node in the next layer fires. This is important if you're using any activation function other than a stepwise one, because the sum of the inputs will affect the value that is propagated forward. Thus, you need to know what that sum is before you propagate anything.
4) Which nodes to connect to which other nodes is up to you. Since your net won't be excessively large and XOR is a fairly straightforward problem, it will probably be simplest for you to connect all nodes in one layer to all nodes in the next layer (i.e. a fully-connected neural net). There might be specialized cases in other problems where it would be better to not use this topology, but there isn't an easy way to figure it out (most people either use trial and error or a genetic algorithm, as in NEAT), and you don't need to worry about it for the purposes of this problem.
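Putting the four answers together, here is a minimal fully connected 2-4-1 sigmoid network trained on XOR with plain batch backprop. The hidden size, learning rate, and iteration count are arbitrary choices that happen to work, not requirements:

```python
import numpy as np

rng = np.random.default_rng(0)

# one input node per input value (answer 1); fully connected 2 -> 4 -> 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

lr = 0.5
for _ in range(10000):
    # forward: each layer fires only after consuming the whole previous
    # layer's output (answers 2 and 3)
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward: binary cross-entropy; its gradient at the output is out - y
    g_out = out - y
    g_h = g_out @ W2.T * h * (1 - h)
    W2 -= lr * h.T @ g_out; b2 -= lr * g_out.sum(0)
    W1 -= lr * X.T @ g_h;   b1 -= lr * g_h.sum(0)

preds = (out > 0.5).astype(int)
```

Everything-to-everything connectivity between adjacent layers (answer 4) is what `X @ W1` and `h @ W2` implement: every node in one layer feeds every node in the next.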