Neural Network / Machine Learning memory storage

I am currently trying to set up a Neural Network for information extraction, and I am pretty fluent with the (basic) concepts of Neural Networks, except for one which seems to puzzle me. It is probably pretty obvious, but I can't seem to find information about it.
Where/how do Neural Networks (/ Machine Learning models) store their memory?
There is quite a bit of information available online about Neural Networks and Machine Learning, but it all seems to skip over memory storage. For example, after restarting the program, where does it find its memory to continue learning/predicting? Many examples online don't seem to 'retain' memory, but I can't imagine that being 'safe' for real/large-scale deployment.
I have a difficult time wording my question, so please let me know if I need to elaborate a bit more.
Thanks,
EDIT: To follow up on the answers below:
Every Neural Network will have edge weights associated with it. These edge weights are adjusted during the training session of a Neural Network.
This is exactly where I am struggling: how should I envision this secondary memory?
Is it like RAM? That doesn't seem logical. The reason I ask is that I haven't encountered an example online that defines or specifies this secondary memory (for example, in something more concrete such as an XML file, or maybe even a huge array).

Memory storage is implementation-specific and not part of the algorithm per se. It is probably more useful to think about what you need to store rather than how to store it.
Consider a 3-layer multi-layer perceptron (fully connected) that has 3, 8, and 5 nodes in the input, hidden, and output layers, respectively (for this discussion, we can ignore bias inputs). Then a reasonable (and efficient) way to represent the needed weights is by two matrices: a 3x8 matrix for weights between the input and hidden layers and an 8x5 matrix for the weights between the hidden and output layers.
For this example, you need to store the weights and the network shape (number of nodes per layer). There are many ways you could store this information. It could be in an XML file or a user-defined binary file. If you were using Python, you could save both matrices to a binary .npz file (or to two .npy files) and encode the network shape in the file name. If you implemented the algorithm yourself, it is up to you how to store the persistent data. If, on the other hand, you are using an existing machine learning software package, it probably has its own I/O functions for storing and loading a trained network.
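For concreteness, here is a minimal sketch of that idea using NumPy; the file and variable names are purely illustrative, and the weights are random stand-ins for a trained network:

    import numpy as np

    # Stand-in weights for the 3-8-5 network described above
    w_input_hidden = np.random.randn(3, 8)   # input -> hidden weights
    w_hidden_output = np.random.randn(8, 5)  # hidden -> output weights

    # Persist both matrices in one compressed archive; the shapes travel with
    # the arrays, so the topology can be recovered from the file itself.
    np.savez("mlp_3_8_5.npz",
             w_input_hidden=w_input_hidden,
             w_hidden_output=w_hidden_output)

    # Later (e.g. after restarting the program), restore the weights
    saved = np.load("mlp_3_8_5.npz")
    w_input_hidden = saved["w_input_hidden"]
    w_hidden_output = saved["w_hidden_output"]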

Every Neural Network will have edge weights associated with it. These edge weights are adjusted during the training session of a Neural Network. I suppose your doubt is about storing these edge weights. Well, these values are stored separately in secondary storage so that they can be retained for future use in the Neural Network.

I would expect the discussion of the design of the model (the neural network) to be kept separate from the discussion of the implementation, where data requirements like durability are addressed.
A particular library or framework might have a specific answer about durable storage, but if you're rolling your own from scratch, then it's up to you.
For example, why not just write the trained weights and topology to a file? Something like YAML or XML could serve as a format.
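As a rough sketch of that idea, assuming PyYAML and a plain list-of-lists representation of the weights (the file name and the dummy values are purely illustrative):

    import yaml  # PyYAML

    # Illustrative network description: topology plus trained weights
    network = {
        "layers": [3, 8, 5],                    # nodes per layer
        "weights": [
            [[0.1] * 8 for _ in range(3)],      # input -> hidden (3x8), dummy values
            [[0.2] * 5 for _ in range(8)],      # hidden -> output (8x5), dummy values
        ],
    }

    # Write the trained model to disk ...
    with open("network.yaml", "w") as f:
        yaml.safe_dump(network, f)

    # ... and read it back later to rebuild the network
    with open("network.yaml") as f:
        restored = yaml.safe_load(f)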
Also, while we're talking about state/storage and neural networks, you might be interested in investigating associative memory.

This may be answered in two steps:
What is "memory" in a Neural Network (referred to as NN)?
As a neural network (NN) is trained, it builds a mathematical model that tells the NN what to give as output for a particular input. Think of what happens when you train someone to speak a new language. The human brain creates a model of the language. Similarly, a NN creates a mathematical model of what you are trying to teach it. It represents the mapping from input to output as a series of functions. This math model is the memory. This math model is made up of the weights of the different edges in the network. Often, a NN is trained and these weights/connections are written to the hard disk (XML, YAML, CSV, etc.). Whenever the NN needs to be used, these values are read back and the network is recreated.
How can you make a network forget its memory?
Think of someone who has been taught two languages. Let us say the individual never speaks one of these languages for 15-20 years, but uses the other one every day. It is very likely that several new words will be learnt each day and many words of the less frequently used language forgotten. The critical part here is that a human being is "learning" every day. In a NN, a similar phenomenon can be observed by training the network on new data. If the old data were not included in the new training samples, then the underlying math model will change so much that the old training data will no longer be represented in the model. It is possible to prevent a NN from "forgetting" the old model by changing the training process. However, this has the side effect that such a NN cannot learn completely new data samples.
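A rough sketch of this "forgetting" effect, using scikit-learn's MLPClassifier purely as an illustration (the answer itself is framework-agnostic, and the two "tasks" below are made up): train on old data, then keep updating on new data only, and accuracy on the old data tends to drop.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)

    # "Old" task: the class is determined by the first feature
    X_old = rng.normal(size=(500, 2))
    y_old = (X_old[:, 0] > 0).astype(int)

    # "New" task: the class is determined by the second feature instead
    X_new = rng.normal(size=(500, 2))
    y_new = (X_new[:, 1] > 0).astype(int)

    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
    net.fit(X_old, y_old)
    print("old-task accuracy after initial training:", net.score(X_old, y_old))

    # Keep training on the new task only; the weights drift away from the old model
    for _ in range(50):
        net.partial_fit(X_new, y_new)
    print("old-task accuracy after retraining on new data only:", net.score(X_old, y_old))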

I would say your approach is wrong. Neural Networks are not memory dumps as we see on a computer. There are no addresses where a particular chunk of memory resides. All the neurons together make sure that a given input leads to a particular output.
Let's compare it with your brain. When you taste sugar, your tongue's taste buds are the input nodes that read chemical signals and transmit electric signals to the brain. The brain then determines the taste from the various combinations of electric signals.
There are no lookup tables. There is no primary and secondary memory, only short-term and long-term memory.

Related

Do neural networks become slow to adapt after a lot of training?

I am a beginner in the neural network field and I want to understand a certain statement. A friend said that a neural network gets slower after you have fed it a lot of data.
Right now, I have just done the Coursera ML course from Andrew Ng. There, I implemented backpropagation. I thought it just adapts the model toward the expected output using different types of calculations. It was not as if the history was used to adapt the model; just the current state of the neurons was checked and their weights were adapted backwards, in combination with regularisation.
Is my assumption correct or am I wrong? Are there some libraries that use historical data, which could result in a slowly adapting model after a certain amount of training?
I want to use a simple neural network for reinforcement learning, and I want to get an idea of whether I need to reset my model if the target environment changes for some reason. Otherwise my model would become slower and slower to adapt over time.
Thanks for any links and explanations in advance!
As you have said, neural networks adapt by modifying their weights during the backpropagation step. Modifying these weights will not become slower as training goes on, since the number of steps needed to modify them always remains the same. The number of steps needed to run an example through your model also remains the same, so your network does not slow down according to the number of examples you fed it during training.
However, you can decide to change your learning rate during training (generally decreasing it as the epochs go on). Depending on how the learning rate of your model evolves, the weights will be modified in a different manner, generally resulting in smaller changes each epoch.
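For illustration, here is a tiny sketch of one common such schedule (exponential decay; the constants are arbitrary choices, not something the answer prescribes). Note that the amount of work per update stays constant; only the step size shrinks:

    # Exponential learning-rate decay: the step size, not the cost per step, shrinks
    initial_lr = 0.1
    decay = 0.95  # illustrative per-epoch decay factor

    for epoch in range(10):
        lr = initial_lr * decay ** epoch
        # weights -= lr * gradient   # same amount of work every epoch, smaller steps
        print(f"epoch {epoch}: learning rate = {lr:.4f}")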

Simple machine learning for website classification

I am trying to generate a Python program that determines if a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences for each word.
The result for harmful websites is a key-value dictionary like:
{ word : [ # occurrences in harmful websites, # of websites that contain the word ] }
Now I want my program to analyze the words from any website to check whether the website is safe or not, but I don't know which methods will suit my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of the websites' text (as documents) and their labels (harmful or safe).
You can certainly use an RNN, but there are also other natural language processing techniques, and much faster ones.
Typically, you should use a proper vectorizer on your training data (think of each site's pages as text documents), for example tf-idf (there are other possibilities too); if you use Python, I would strongly suggest scikit-learn, which provides lots of useful machine learning techniques, and the mentioned TfidfVectorizer (sklearn.feature_extraction.text.TfidfVectorizer) is already included. The point is to vectorize your text documents in an informative way. Imagine, for example, the English word "the": how many times does it typically occur in a text? You need to account for biases such as these.
Once your training data is vectorized, you can use, for example, a stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology, test data simply means taking some new data examples and checking what your ML program outputs for them).
In either case you will need to experiment with the above options. There are many nuances, and you need to test on your data and see where you achieve the best results (depending on the ML algorithm's settings, the type of vectorizer, the ML technique itself, and so on). For example, Support Vector Machines are also a great choice when it comes to binary classifiers. You may want to play with them too and see whether they perform better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best-fitting classifier. On your way to finding the best one, you may also want to use cross-validation to determine how well your classifier behaves. Again, this is already included in scikit-learn.
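A minimal sketch of that vectorizer + classifier + cross-validation combination, assuming scikit-learn; the example texts and labels below are made up purely for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Toy training data: page text plus a harmful/safe label (made up for illustration)
    texts = ["explicit adult content ...", "family cooking recipes ...",
             "graphic violent material ...", "school science projects ...",
             "adult content gallery ...", "gardening tips for beginners ..."]
    labels = ["harmful", "safe", "harmful", "safe", "harmful", "safe"]

    # TF-IDF vectorizer feeding a linear classifier trained with SGD
    model = make_pipeline(TfidfVectorizer(stop_words="english"),
                          SGDClassifier(random_state=0))

    # Cross-validation gives a rough idea of how well the classifier generalizes
    print(cross_val_score(model, texts, labels, cv=3))

    # Train on everything and classify a new page
    model.fit(texts, labels)
    print(model.predict(["free recipes and gardening ideas"]))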
N.B. Don't forget about edge cases. For example, there may be a completely safe online magazine that only mentions a harmful topic in some article; that doesn't mean the website itself is harmful.
Edit: Come to think of it, if you don't have any experience with ML at all, it could be useful to take an online course, because beyond knowing the API and libraries you will still need to know what they do and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification and is usually done with recurrent neural networks (RNNs) or long short-term memory networks (LSTMs). This is not an easy topic to start machine learning with. If you are new, you should first have a look at linear/logistic regression, SVMs and basic neural networks (MLPs). Otherwise it will be hard to understand what is going on.
That said: there are many libraries out there for constructing neural networks. Probably the easiest to use is Keras. While this library simplifies a lot of things immensely, it isn't just a magic box that makes gold from trash. You need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically, determining whether a movie review is positive or not) with Keras.
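Since the example itself is not reproduced above, here is a rough sketch of what such a Keras IMDB script typically looks like; the layer sizes, sequence length and number of epochs are arbitrary choices, not the original author's:

    from tensorflow import keras

    # IMDB reviews, already encoded as sequences of integer word indices
    (x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=10000)

    # Pad/truncate the reviews to a fixed length so they can be batched
    x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=200)
    x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=200)

    # Small embedding + LSTM model with a sigmoid output for positive/negative
    model = keras.Sequential([
        keras.layers.Embedding(10000, 32),
        keras.layers.LSTM(32),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.2)
    print(model.evaluate(x_test, y_test))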
For people who have no experience in NLP or ML, I recommend using a TF-IDF vectorizer instead of deep learning libraries. In short, it converts documents to vectors, mapping each word in the vocabulary to one dimension (whose magnitude reflects its occurrence).
Then, you can calculate the cosine similarity between the resulting vectors.
To improve performance, use the stemming / lemmatizing / stopword support provided in the NLTK library.
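A small sketch of this approach, assuming scikit-learn for the TF-IDF step and NLTK's PorterStemmer for stemming; the reference documents and query are made up:

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    stemmer = PorterStemmer()

    def stem_tokens(text):
        # Tokenize on whitespace and reduce each word to its stem
        return [stemmer.stem(token) for token in text.lower().split()]

    # Made-up reference documents: one known-harmful, one known-safe
    documents = ["explicit adult images and videos",
                 "science experiments for kids at school"]
    query = "adult video gallery"

    vectorizer = TfidfVectorizer(tokenizer=stem_tokens)
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])

    # Cosine similarity of the query page against each reference document
    print(cosine_similarity(query_vector, doc_vectors))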

Why is the lift for the neural network so stable in the SAS Viya demo?

I'm looking at the SAS Viya machine learning demo. It races some machine learning algorithms against each other on a given dataset. All models produce almost equally good "lift", as shown in the lift diagrams in the output.
If you tweak the learning to run on a smaller subset of the data, only 0.002% of the total data set (proc partition data=&casdata partition samppct=0.002;), most algorithms run into problems producing lift.
But the neural network is still performing very well. Feature or bug? I could imagine that the script does not re-initialize the network, but it is hard to guess from the calls alone.
I got good answers over at the SAS Community, posted by BrettWujek and Xinmin:
Mats - the short answer, without running some studies of my own, is that neural networks are highly adaptive and can train very accurate models with far fewer observations than many other techniques. The tree-based models are going to be quite unstable with very few observations. In this case you sampled all the way down to around 20 observations...even that might be sufficient for a neural network if the space is not overly nonlinear.
As for your last comment - it seems you are referring to what is known as warm start, where a previously trained model can be used as a starting point and refined by providing new observations. That is NOT what is happening here, as that capability is only coming available in our upcoming release which is just over a month away.
Brett
And I've got some detail on this from Xinmin:
Mats, PROC NNET initializes the weights randomly; if you specify a seed in the train statement, the initial weights are repeatable. NNET training is powered by a sophisticated nonlinear optimization solver; if the log shows a "converged" status, it means the model is fit very well.

Should the neurons in a neural network be asynchronous?

I am designing a neural network and am trying to determine if I should write it in such a way that each neuron is its own 'process' in Erlang, or if I should just go with C++ and run a network in one thread (I would still use all my cores by running an instance of each network in its own thread).
Is there a good reason to give up the speed of C++ for the asynchronous neurons that Erlang offers?
I'm not sure I understand what you're trying to do. An artificial neural network is essentially represented by the weights of the connections between nodes. The nodes themselves don't exist in isolation; their values are only calculated (at least in feed-forward networks) through the forward-propagation algorithm, when the network is given input.
The backpropagation algorithm for updating weights is definitely parallelizable, but that doesn't seem to be what you're describing.
The usefulness of having neurons in a Neural Network (NN) is to have a multi-dimensional matrix whose coefficients you want to handle (to train them, to change them, to adapt them little by little, so that they fit the problem you want to solve well). On this matrix you can apply numerical methods (proven and efficient) so as to find an acceptable solution in an acceptable time.
IMHO, with NNs (namely with the back-propagation training method), the goal is to have a matrix that is efficient both at run/predict time and at training time.
I don't grasp the point of having asynchronous neurons. What would it offer? What issue would it solve?
Maybe you could explain clearly what problem you would solve by making them asynchronous?
I am, in effect, inverting your question: what do you want to gain from asynchronicity compared with traditional NN techniques?
It would depend upon your use case: the neural network computational model and your execution environment. Here is a recent paper (2014) by Plotnikova et al, that uses "Erlang and platform Erlang/OTP with predefined base implementation of actor model functions" and a new model developed by the authors that they describe as “one neuron—one process” using "Gravitation Search Algorithm" for training:
http://link.springer.com/chapter/10.1007%2F978-3-319-06764-3_52
To briefly cite their abstract, "The paper develops asynchronous distributed modification of this algorithm and presents the results of experiments. The proposed architecture shows the performance increase for distributed systems with different environment parameters (high-performance cluster and local network with a slow interconnection bus)."
Also, most other answers here reference a computational model that uses matrix operations as the basis of training and simulation, which the authors of this paper contrast by saying, "this case neural network model [ie matrix operations based] becomes fully mathematical and its original nature (from neural networks biological prototypes) gets lost".
The tests were run on three types of systems:
An IBM cluster represented as 15 virtual machines.
A distributed system deployed on the local network, represented as 15 physical machines.
A hybrid system based on system 2, but where each physical machine has four processor cores.
They provide the following concrete results, "The presented results evidence a good distribution ability of gravitation search, especially for large networks (801 and more neurons). Acceleration depends on the node count almost linearly. If we use 15 nodes we can get about eight times acceleration of the training process."
Finally, they conclude regarding their model, "The model includes three abstraction levels: NNET, MLP and NEURON. Such architecture allows encapsulating some general features on general levels and some specific for the considered neural networks features on special levels. Asynchronous message passing between levels allow to differentiate synchronous and asynchronous parts of training and simulation algorithms and, as a result, to improve the use of resources."
It depends what you are after.
2nd Generation of Neural Networks are synchronous. They perform computations on an input-output basis without a delay, and can be trained either through reinforcement or back-propagation. This is the prevailing type of ANN at the moment and the easiest to get started with if you are trying to solve a problem via machine learning, lots of literature and examples available.
3rd Generation of Neural Networks (so-called "Spiking Neural Networks") are asynchronous. Signals propagate internally through the network as a chain-reaction of spiking events, and can create interesting patterns and oscillations depending on the shape of the network. While they model biological brains more closely they are also harder to make use of in a practical setting.
I think that async computation for NNs might prove beneficial for the (recognition) performance. In fact, the result might be similar (maybe less pronounced) to using dropout.
But a straightforward implementation of async NNs would be much slower, because for synchronous NNs you can use linear algebra libraries, which make good use of vectorization or GPUs.
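To illustrate that point, here is a small sketch comparing the same hidden-layer computation written as one matrix-vector product versus one neuron at a time (the sizes are arbitrary; the per-neuron loop roughly mirrors what a one-neuron-per-process design has to do):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)          # one input vector
    W = rng.normal(size=(50, 100))    # weights of a 100 -> 50 layer
    b = rng.normal(size=50)

    # Vectorized: a single matrix-vector product handled by optimized BLAS code
    h_vectorized = np.tanh(W @ x + b)

    # "One neuron at a time": same result, but each neuron is computed separately
    h_per_neuron = np.array([np.tanh(np.dot(W[i], x) + b[i]) for i in range(50)])

    print(np.allclose(h_vectorized, h_per_neuron))  # True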

How to dynamically find the depth of a network in a Convolutional Neural Network

I was looking for an automatic way to decide how many layers I should apply to my network, depending on the data and computer configuration. I searched the web, but I could not find anything. Maybe my keywords or my way of looking are wrong.
Do you have any idea?
The number of layers, or depth, of a neural network is one of its hyperparameters.
This means that it is a quantity that cannot be learned from the data; you should choose it before trying to fit your dataset. According to Bengio,
We define a hyper-parameter for a learning algorithm A as a variable to be set prior to the actual application of A to the data, one that is not directly selected by the learning algorithm itself.
There are three main approaches to finding the optimal value for a hyperparameter. The first two are well explained in the paper I linked.
Manual search. Using well-known black magic, the researcher chooses the optimal value through trial and error.
Automatic search. The researcher relies on an automated routine in order to speed up the search.
Bayesian optimization.
More specifically, adding more layers to a deep neural network is likely to improve performance (reduce generalization error), up to a certain number of layers at which it overfits the training data.
So, in practice, you could train your ConvNet with, say, 4 layers, try adding one hidden layer and train again, and repeat until you see some overfitting. Of course, some strong regularization technique (such as dropout) is required.
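A rough sketch of that procedure, using scikit-learn's MLPClassifier and a toy dataset purely as stand-ins (the answer applies equally to ConvNets; the depths and layer width are arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Toy dataset standing in for the real problem
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Add one hidden layer at a time and watch the train/validation gap;
    # a growing gap with no validation improvement signals overfitting.
    for depth in range(1, 6):
        net = MLPClassifier(hidden_layer_sizes=(64,) * depth,
                            max_iter=300, random_state=0)
        net.fit(X_train, y_train)
        print(f"{depth} hidden layer(s): "
              f"train accuracy={net.score(X_train, y_train):.3f}, "
              f"validation accuracy={net.score(X_val, y_val):.3f}")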
