Should I normalize my features before throwing them into RNN? - machine-learning

I am playing with some demos of recurrent neural networks.
I noticed that the scale of my data differs a lot from column to column, so I am considering doing some preprocessing before I feed data batches into my RNN. The close column is the target I want to predict in the future.
    open   high    low     volume  price_change  p_change     ma5    ma10  \
0  20.64  20.64  20.37  163623.62         -0.08     -0.39  20.772  20.721
1  20.92  20.92  20.60  218505.95         -0.30     -1.43  20.780  20.718
2  21.00  21.15  20.72  269101.41         -0.08     -0.38  20.812  20.755
3  20.70  21.57  20.70  645855.38          0.32      1.55  20.782  20.788
4  20.60  20.70  20.20  458860.16          0.10      0.48  20.694  20.806

     ma20      v_ma5     v_ma10     v_ma20  close
0  20.954  351189.30  388345.91  394078.37  20.56
1  20.990  373384.46  403747.59  411728.38  20.64
2  21.022  392464.55  405000.55  426124.42  20.94
3  21.054  445386.85  403945.59  473166.37  21.02
4  21.038  486615.13  378825.52  461835.35  20.70
My question is: is preprocessing the data with, say, StandardScaler from sklearn necessary in my case? And why?
(You are welcome to edit my question)

It will be beneficial to normalize your training data. Feeding features with widely different scales to your model causes the network to weight the features unequally. This can falsely prioritise some features over others in the representation.
Although the whole discussion of data preprocessing is controversial, both regarding when exactly it is necessary and how to correctly normalize the data for a given model and application domain, there is a general consensus in machine learning that mean subtraction followed by a general normalization step is helpful.
In mean subtraction, the mean of every individual feature is subtracted from the data, which can be interpreted geometrically as centering the data around the origin. This holds in any number of dimensions.
Normalizing the data after mean subtraction brings every data dimension to approximately the same scale. Note that the features lose any prioritization over each other after this step, as mentioned above. If you have good reasons to think that the different scales of your features carry important information that the network needs to truly understand the underlying patterns in your dataset, then normalization will be harmful. A standard approach is to scale the inputs to have a mean of 0 and a variance of 1.
Further preprocessing operations, such as PCA or whitening, may be helpful in specific cases. See the CS231n notes (Setting up the data and the model) for further reference on these topics and for a more detailed explanation of the topics above.
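As a rough illustration of the mean-subtraction and scaling steps described above, here is a minimal sketch in plain Python. The helper names fit_scaler and transform are made up for this example, and only three of the question's columns (open, volume, price_change) are used. It assumes the population standard deviation and treats constant columns as having a standard deviation of 1.

```python
from statistics import fmean, pstdev

def fit_scaler(rows):
    """Return per-column means and standard deviations from the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns
    # so that the division below is always defined (an assumption of this sketch).
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def transform(rows, means, stds):
    """Subtract the column mean and divide by the column standard deviation."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Columns taken from the question: open, volume, price_change.
train = [
    [20.64, 163623.62, -0.08],
    [20.92, 218505.95, -0.30],
    [21.00, 269101.41, -0.08],
    [20.70, 645855.38, 0.32],
    [20.60, 458860.16, 0.10],
]
means, stds = fit_scaler(train)
scaled = transform(train, means, stds)
for row in scaled:
    print([round(x, 3) for x in row])
```

The statistics should be computed on the training data only and then reused unchanged for validation and test data. StandardScaler in sklearn follows the same pattern with its fit and transform methods.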

Definitely yes. Most neural networks work best with data between 0 and 1 or between -1 and 1 (depending on the output function). Also, when some inputs are larger than others, the network will "think" they are more important. This can make learning very slow, because the network must first lower the weights on these inputs.
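A small sketch of rescaling one feature into the 0 to 1 or -1 to 1 range mentioned above. The function name min_max_scale is hypothetical, and mapping a constant column to the lower bound is an arbitrary choice made for this example.

```python
def min_max_scale(column, low=0.0, high=1.0):
    """Linearly map the values of one feature column into [low, high]."""
    col_min, col_max = min(column), max(column)
    span = col_max - col_min
    if span == 0:
        # Constant column: every value maps to the lower bound.
        return [low for _ in column]
    return [low + (x - col_min) / span * (high - low) for x in column]

# The volume column from the question.
volume = [163623.62, 218505.95, 269101.41, 645855.38, 458860.16]
unit_range = min_max_scale(volume)               # values in [0, 1]
signed_range = min_max_scale(volume, -1.0, 1.0)  # values in [-1, 1]
print(unit_range)
print(signed_range)
```

As with standardization, the minimum and maximum would normally come from the training data and be reused for any later data.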

I found this https://arxiv.org/abs/1510.01378
Normalizing may improve convergence, so you will get shorter training times.

Related

How do I decide or count number of hidden/tunable parameters in my design?

For my deep learning assignment I need to design an image classification network. There is a constraint in the assignment: I can have at most 500,000 hidden/tunable parameters in this design.
How can I count or observe the number of these hidden parameters, especially if I am using this TensorFlow tutorial as initial code/design?
Thanks in advance
Instead of doing the work for you, I'll show you how to count free parameters.
Glancing at it quickly, the cifar10 code appears to use max pooling, convolution, bias and fully connected layers. Let's review how many free parameters each of these layers adds to your architecture.
max pooling : FREE! That's right, there are no "free parameters" from max pooling.
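conv : Convolution weights are defined by a shape like [3,3,1,1]. In TensorFlow's tf.nn.conv2d the filter shape is [filter_height, filter_width, in_channels, out_channels]; the batch size is not part of it. (Shapes like [1,3,3,1] appear as the ksize and strides arguments of pooling, where they describe a window rather than weights.) Multiply all the dimension sizes together to find the total number of free parameters. In the case of [3,3,1,1], the total is 3x3x1x1 = 9.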
bias : A Bias is similar to convolutions in that it is defined by a shape like [10] or [1,342,342,3]. Same thing, just multiply all dimension sizes together to get the total free parameters. Sometimes a bias is just a single number, which means a size of 1.
fully connected : A fully connected layer usually has a 2d shape like [1024,32]. This means that it is a 2d matrix, and you calculate the total free parameters just like the convolution. In this example [1024,32] has 1024x32 = 32,768 free parameters.
Finally you add up all the free parameters from all the layers and that is your total number of free parameters.
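The counting rule above can be sketched in a few lines of Python. The layer names and shapes below are hypothetical and are not taken from the cifar10 tutorial. Convolution shapes are assumed to be in [filter_height, filter_width, in_channels, out_channels] order, and pooling layers are left out because they add no parameters.

```python
from math import prod

def count_parameters(layer_shapes):
    """Sum the product of the dimensions of every weight and bias tensor."""
    return sum(prod(shape) for shape in layer_shapes.values())

# Hypothetical architecture, for illustration only.
layers = {
    "conv1/weights": [5, 5, 3, 64],   # 5*5*3*64 = 4800
    "conv1/biases": [64],             # 64
    "fc/weights": [1024, 32],         # 1024*32 = 32768
    "fc/biases": [32],                # 32
}

total = count_parameters(layers)
print(total)             # 37664
print(total <= 500_000)  # True: within the assignment's budget
```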
500,000 parameters? Do you use the R, G and B values of each pixel as inputs? If yes, there are some problems:
1. Too much data (long computation time).
2. In image classification, companies typically apply some other image analysis technique (preprocessing) before feeding the data into a neural network. Suppose you have two identical images, and the second one is shifted by one pixel. To the network they can look very different.
Imagine another neural network that uses two parameters, say weight and height. What happens if you swap these parameters?
Yes, training your image network can reduce this effect, but when I experimented with 5x5 binary images it was very hard for the network. I started using 4 layers, but this helped only a little.
The images used for learning can be classified well, even after distortion, but shift them by one pixel and you have a problem.
If not, run experiments or use a genetic algorithm to find it.
After learning, you should use some algorithm to find the inputs that the network treats as "not important" (a big difference between the weights of this input and the rest; if an input's weights are too close to 0, the network "thinks" it is not important).

TensorFlow seq2seq model with low number of target_vocab_size

I am experimenting with the tensorflow seq2seq_model.py model.
The target vocab size I have is around 200.
The documentation says:
For vocabularies smaller than 512, it might be a better idea to just use a standard softmax loss.
The source-code also has the check:
if num_samples > 0 and num_samples < self.target_vocab_size:
Running the model with only 200 target output vocabulary does not invoke the if statement.
Do I need to write a "standard" softmax loss function to ensure good training, or can I just let the model run as it comes?
Thanks for the help!
I am doing the same thing. To get my feet wet with different kinds of structures in the training data, I am working in an artificial test world with just 117 words in the (source and) target vocabulary.
I asked myself the same question and decided to not go through that hassle. My models train well even though I didn't touch the loss, thus still using the sampled_softmax_loss.
Further experiences with those small vocab sizes:
- batchsize 32 is best in my case (smaller ones make it really unstable and I run into nan-issues quickly)
- I am using AdaGrad as the optimizer and it works like magic
- I am working with the model_with_buckets (addressed through translate.py) and having size 512 with num_layers 2 produces the desired outcomes in many cases.
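To see when the check quoted in the question fires, here is a tiny stand-alone mirror of that condition in Python. The function name uses_sampled_softmax is made up for this sketch, and 512 is assumed as the value of num_samples because the documentation quote mentions that number; your copy of the model may use a different value.

```python
def uses_sampled_softmax(num_samples, target_vocab_size):
    """Mirror of the quoted check: sample only if fewer samples than vocabulary."""
    return num_samples > 0 and num_samples < target_vocab_size

# Assumed num_samples of 512 with the vocabulary sizes from this thread.
for vocab_size in (117, 200, 40000):
    print(vocab_size, uses_sampled_softmax(512, vocab_size))

# A smaller num_samples would enter the branch even for a 200-word vocabulary.
print(uses_sampled_softmax(64, 200))
```

Under that assumption the branch is skipped for vocabularies of 117 or 200 words, so it is worth checking in your copy of seq2seq_model.py which loss is actually used when the condition is false.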

Does the Izhikevich neuron model use weights?

I've been working a bit with neural networks and I'm interested in implementing a spiking neuron model.
I've read a fair number of tutorials, but most of them seem to be about generating pulses, and I haven't found any that apply the model to a given input train.
Say for example I got input train:
Input[0] = [0,0,0,1,0,0,1,1]
When it enters the Izhikevich neuron, is the input multiplied by a weight, or does the neuron only make use of the parameters a, b, c and d?
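Izhikevich equations (written here as Euler updates with a 1 ms time step) are:
v[n+1] = v[n] + 0.04*v[n]^2 + 5*v[n] + 140 - u[n] + I
u[n+1] = u[n] + a*(b*v[n] - u[n])
with the reset v = c, u = u + d whenever v reaches 30 mV, where v[n] is the membrane voltage and u[n] is a general recovery variable.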
Are there any texts on implementations of Izhikevich or similar spiking neuron models on a practical problem? I'm trying to understand how information is encoded in these models, but it looks different from what's done with standard second-generation neurons. The only tutorial I've found that deals with a spike train and a set of weights is [1], but I haven't seen the same with Izhikevich.
[1] https://msdn.microsoft.com/en-us/magazine/mt422587.aspx
The plain Izhikevich model by itself does not include weights.
The two equations you mentioned model the membrane potential (v[]) of a point neuron over time. To use weights, you could connect two or more such cells with synapses.
Each synapse could include some sort of spike detection mechanism on the source (pre-synaptic) cell, and a synaptic current mechanism on the target (post-synaptic) cell side. That synaptic current could then be multiplied by a weight term, and then become part of the I term (in the 1st equation above) for the target cell.
As a very simple example of a two cell network, at every time step, you could check if pre- cell v is above (say) 0 mV. If so, inject (say) 0.01 pA * weightPrePost into the post- cell. weightPrePost would range from 0 to 1, and could be modified in response to things like firing rate, or Hebbian-like spike synchrony like in STDP.
With multiple synaptic currents going into a cell, you could devise various schemes how to sum them. The simplest one would be just a simple sum, more complicated ones could include things like distance and dendrite diameters (e.g. simulated neural morphology).
This chapter is a nice introduction to other ways to model synapses: Modelling Synaptic Transmission
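The two-cell idea described above could be sketched roughly as follows. This is only an illustration under several assumptions: a plain 1 ms Euler step, the regular-spiking parameter values a=0.02, b=0.2, c=-65, d=8, a spike detected when v reaches the 30 mV peak (instead of the 0 mV check suggested above), a one-step synaptic delay, and made-up amplitudes input_amp and syn_amp that are not calibrated to any physical unit.

```python
def izhikevich_step(v, u, current, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Advance one cell by one 1 ms Euler step; return (v, u, fired)."""
    v_new = v + 0.04 * v * v + 5.0 * v + 140.0 - u + current
    u_new = u + a * (b * v - u)
    if v_new >= 30.0:
        # Spike: reset the membrane potential and bump the recovery variable.
        return c, u_new + d, True
    return v_new, u_new, False

def simulate_pair(input_train, weight_pre_post, input_amp=40.0, syn_amp=40.0):
    """Drive a pre-synaptic cell with a 0/1 train; feed its spikes to a post cell."""
    v_pre, u_pre = -70.0, -14.0    # resting state for the parameters above
    v_post, u_post = -70.0, -14.0
    pre_spikes, post_spikes = [], []
    pre_fired = False
    for x in input_train:
        # Weighted synaptic current caused by a pre-synaptic spike one step earlier.
        syn_current = syn_amp * weight_pre_post if pre_fired else 0.0
        v_post, u_post, post_fired = izhikevich_step(v_post, u_post, syn_current)
        v_pre, u_pre, pre_fired = izhikevich_step(v_pre, u_pre, input_amp * x)
        pre_spikes.append(int(pre_fired))
        post_spikes.append(int(post_fired))
    return pre_spikes, post_spikes

# The input train from the question, padded with zeros so spikes have time to propagate.
train = [0, 0, 0, 1, 0, 0, 1, 1] + [0] * 10
pre_spikes, post_spikes = simulate_pair(train, weight_pre_post=1.0)
print(pre_spikes)
print(post_spikes)
```

The weight only enters through syn_current, which becomes the I term of the post-synaptic cell. With weight_pre_post set to 0 the post-synaptic cell receives no current and stays at rest.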

Why doesn't normalizing feature values change the training output much?

I have 3113 training examples, over a dense feature vector of size 78. The magnitude of features is different: some around 20, some 200K. For example, here is one of the training examples, in vowpal-wabbit input format.
0.050000 1 '2006-07-10_00:00:00_0.050000| F0:9.670000 F1:0.130000 F2:0.320000 F3:0.570000 F4:9.837000 F5:9.593000 F6:9.238150 F7:9.646667 F8:9.631333 F9:8.338904 F10:9.748000 F11:10.227667 F12:10.253667 F13:9.800000 F14:0.010000 F15:0.030000 F16:-0.270000 F17:10.015000 F18:9.726000 F19:9.367100 F20:9.800000 F21:9.792667 F22:8.457452 F23:9.972000 F24:10.394833 F25:10.412667 F26:9.600000 F27:0.090000 F28:0.230000 F29:0.370000 F30:9.733000 F31:9.413000 F32:9.095150 F33:9.586667 F34:9.466000 F35:8.216658 F36:9.682000 F37:10.048333 F38:10.072000 F39:9.780000 F40:0.020000 F41:-0.060000 F42:-0.560000 F43:9.898000 F44:9.537500 F45:9.213700 F46:9.740000 F47:9.628000 F48:8.327233 F49:9.924000 F50:10.216333 F51:10.226667 F52:127925000.000000 F53:-15198000.000000 F54:-72286000.000000 F55:-196161000.000000 F56:143342800.000000 F57:148948500.000000 F58:118894335.000000 F59:119027666.666667 F60:181170133.333333 F61:89209167.123288 F62:141400600.000000 F63:241658716.666667 F64:199031688.888889 F65:132549.000000 F66:-16597.000000 F67:-77416.000000 F68:-205999.000000 F69:144690.000000 F70:155022.850000 F71:122618.450000 F72:123340.666667 F73:187013.300000 F74:99751.769863 F75:144013.200000 F76:237918.433333 F77:195173.377778
The training result was not good, so I thought I would normalize the features to bring them to the same magnitude. I calculated the mean and standard deviation of each feature across all examples, then computed newValue = (oldValue - mean) / stddev, so that every feature has mean 0 and stddev 1. For the same example, here are the feature values after normalization:
0.050000 1 '2006-07-10_00:00:00_0.050000| F0:-0.660690 F1:0.226462 F2:0.383638 F3:0.398393 F4:-0.644898 F5:-0.670712 F6:-0.758233 F7:-0.663447 F8:-0.667865 F9:-0.960165 F10:-0.653406 F11:-0.610559 F12:-0.612965 F13:-0.659234 F14:0.027834 F15:0.038049 F16:-0.201668 F17:-0.638971 F18:-0.668556 F19:-0.754856 F20:-0.659535 F21:-0.663001 F22:-0.953793 F23:-0.642736 F24:-0.606725 F25:-0.609946 F26:-0.657141 F27:0.173106 F28:0.310076 F29:0.295814 F30:-0.644357 F31:-0.678860 F32:-0.764422 F33:-0.658869 F34:-0.674367 F35:-0.968679 F36:-0.649145 F37:-0.616868 F38:-0.619564 F39:-0.649498 F40:0.041261 F41:-0.066987 F42:-0.355693 F43:-0.638604 F44:-0.676379 F45:-0.761250 F46:-0.653962 F47:-0.668194 F48:-0.962591 F49:-0.635441 F50:-0.611600 F51:-0.615670 F52:-0.593324 F53:-0.030322 F54:-0.095290 F55:-0.139602 F56:-0.652741 F57:-0.675629 F58:-0.851058 F59:-0.642028 F60:-0.648002 F61:-0.952896 F62:-0.629172 F63:-0.592340 F64:-0.682273 F65:-0.470121 F66:-0.045396 F67:-0.128265 F68:-0.185295 F69:-0.510251 F70:-0.515335 F71:-0.687727 F72:-0.512749 F73:-0.471032 F74:-0.789335 F75:-0.491188 F76:-0.400105 F77:-0.505242
However, this yields basically the same testing result (if not exactly the same, since I shuffle the examples before each training).
Wondering why there is no change in the result?
Here are my training and testing commands:
rm -f cache
cat input.feat | vw -f model --passes 20 --cache_file cache
cat input.feat | vw -i model -t -p predictions --invert_hash readable_model
(Yes, I'm testing on the training data right now since I have only very few data examples to train on.)
More context:
Some of the features are "tier 2": they were derived by manipulating or taking cross products of "tier 1" features (e.g. moving averages, derivatives of order 1 to 3, etc.). If I normalize the tier 1 features before calculating the tier 2 features, it actually improves the model significantly.
So I'm puzzled as to why normalizing tier 1 features (before generating tier 2 features) helps a lot, while normalizing all features (after generating tier 2 features) doesn't help at all.
BTW, since I'm training a regressor, I'm using SSE as the metric to judge the quality of the model.
vw normalizes feature values for scale as it goes, by default.
This is part of the online algorithm. It is done gradually during runtime.
In fact it does more than that: vw's enhanced SGD algorithm also keeps separate per-feature learning rates, so the learning rates of rarer features don't decay as fast as those of common ones (--adaptive). Finally, there's an importance-aware update, controlled by a third option (--invariant).
The 3 separate SGD enhancement options (which are all turned on by default) are:
--adaptive
--invariant
--normalized
The last option is the one that adjusts values for scale (discounting large values relative to small ones). You may disable all these SGD enhancements by using the option --sgd. You may also partially enable any subset by explicitly specifying it.
All in all you have 2^3 = 8 SGD option combinations you can use.
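To compare the combinations side by side, one could generate the eight command lines with a short script. This is only a sketch: the base command is copied from the question, the script just builds strings and does not run vw, and it relies on the behaviour described above (--sgd turns all three enhancements off, and naming a subset enables just that subset).

```python
from itertools import combinations

FLAGS = ["--adaptive", "--invariant", "--normalized"]
BASE = "vw -f model --passes 20 --cache_file cache"

def sgd_variants():
    """Return one vw command line per subset of the three SGD enhancements."""
    variants = []
    for k in range(len(FLAGS) + 1):
        for subset in combinations(FLAGS, k):
            # The empty subset corresponds to plain SGD with everything off.
            extra = " ".join(subset) if subset else "--sgd"
            variants.append(f"{BASE} {extra}")
    return variants

for command in sgd_variants():
    print(command)
```

The last variant, with all three flags, matches the default behaviour. Remove the cache file between runs, as in the question's rm -f cache step.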
A possible reason is that the training algorithm you used to get the result already did the normalization for you. In fact, many algorithms normalize the data before working on it. Hope it helps :)

Kohonen Self Organizing Maps: Determining the number of neurons and grid size

I have a large dataset I am trying to do cluster analysis on using SOM. The dataset is HUGE (~ billions of records) and I am not sure what should be the number of neurons and the SOM grid size to start with. Any pointers to some material that talks about estimating the number of neurons and grid size would be greatly appreciated.
Thanks!
Quoting from the som_make function documentation of the SOM Toolbox:
It uses a heuristic formula of 'munits = 5*dlen^0.54321'. The
'mapsize' argument influences the final number of map units: a 'big'
map has x4 the default number of map units and a 'small' map has
x0.25 the default number of map units.
dlen is the number of records in your dataset
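As a quick way to try the quoted heuristic for different dataset sizes, here is a small Python sketch. The function name heuristic_map_units and the label "normal" for the default size are assumptions of this example, and rounding up to a whole number of units is a choice made here rather than something stated in the quote.

```python
from math import ceil

def heuristic_map_units(dlen, mapsize="normal"):
    """Number of map units from munits = 5*dlen^0.54321, scaled by map size."""
    factors = {"small": 0.25, "normal": 1.0, "big": 4.0}
    return ceil(factors[mapsize] * 5 * dlen ** 0.54321)

for dlen in (1_000, 1_000_000, 1_000_000_000):
    print(dlen, heuristic_map_units(dlen), heuristic_map_units(dlen, "big"))
```

For a billion records this gives a map with several hundred thousand units, which may be a reason to train on a sample of the data first.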
You can also read about the classic WEBSOM which addresses the issue of large datasets
http://www.cs.indiana.edu/~bmarkine/oral/self-organization-of-a.pdf
http://websom.hut.fi/websom/doc/ps/Lagus04Infosci.pdf
Keep in mind that the map size is also an application-specific parameter. Namely, it depends on what you want to do with the generated clusters. Large maps produce a large number of small but "compact" clusters (the records assigned to each cluster are quite similar). Small maps produce fewer but more generalized clusters. A "right number of clusters" doesn't exist, especially in real-world datasets. It all depends on the level of detail at which you want to examine your dataset.
I have written a function that, with the data set as input, returns the grid size. I rewrote it from the som_topol_struct() function of Matlab's Self Organizing Maps Toolbox into an R function.
topology <- function(data)
{
  # For a hexagonal lattice, determine the number of neurons (munits)
  # and their arrangement (msize).
  D <- data
  # munits: number of hexagons (map units)
  # dlen: number of records (subjects)
  dlen <- dim(data)[1]
  n_vars <- dim(data)[2]
  munits <- ceiling(5 * dlen^0.5)  # heuristic formula from the Matlab toolbox
  #munits=100
  #size=c(round(sqrt(munits)),round(munits/(round(sqrt(munits)))))
  A <- matrix(Inf, nrow = n_vars, ncol = n_vars)
  # Center each variable, ignoring non-finite values
  for (i in 1:n_vars) {
    D[, i] <- D[, i] - mean(D[is.finite(D[, i]), i])
  }
  # Covariance-like matrix built from the finite pairwise products
  for (i in 1:n_vars) {
    for (j in i:n_vars) {
      prod_ij <- D[, i] * D[, j]
      prod_ij <- prod_ij[is.finite(prod_ij)]
      A[i, j] <- sum(prod_ij) / length(prod_ij)
      A[j, i] <- A[i, j]
    }
  }
  VS <- eigen(A)
  eigval <- sort(VS$values)
  n_eig <- length(eigval)
  # Ratio of the map side lengths from the two largest eigenvalues
  if (eigval[n_eig] == 0 || eigval[n_eig - 1] * munits < eigval[n_eig]) {
    ratio <- 1
  } else {
    ratio <- sqrt(eigval[n_eig] / eigval[n_eig - 1])
  }
  size1 <- min(munits, round(sqrt(munits / ratio * sqrt(0.75))))
  size2 <- round(munits / size1)
  return(list(munits = munits, msize = sort(c(size1, size2), decreasing = TRUE)))
}
hope it helps...
Iván Vallés-Pérez
I don't have a reference for it, but I would suggest starting off by using approximately 10 SOM neurons per expected class in your dataset. For example, if you think your dataset consists of 8 separate components, go for a map with 9x9 neurons. This is completely just a ballpark heuristic though.
If you'd like the data to drive the topology of your SOM a bit more directly, try one of the SOM variants that change topology during training:
Growing SOM
Growing Neural Gas
Unfortunately these algorithms involve even more parameter tuning than plain SOM, but they might work for your application.
Kohonen has written on the issue of selecting parameters and map size for SOM in his book "MATLAB Implementations and Applications of the Self-Organizing Map". In some cases, he suggests that the initial values can be arrived at by testing several sizes of the SOM to check that the cluster structures are shown with sufficient resolution and statistical accuracy.
My suggestion would be the following.
SOM is distantly related to correspondence analysis. In statistics, 5*r^2 is used as a rule of thumb, where r is the number of rows/columns in a square setup.
Usually, one should use some criterion that is based on the data itself, meaning that you need a criterion for estimating the homogeneity. If a certain threshold is violated, you need more nodes. To check the homogeneity you need some records per node. Again, from statistics you can learn that for simple tests (a small number of variables) you need around 20 records, and for more advanced tests on some variables at least 8 records.
Remember that the SOM represents a predictive model, so validation is key and absolutely mandatory. Yet validation of predictive models (see the type I / type II error entry on Wikipedia) is a subject of its own, and the acceptable risk as well as the risk structure depend fully on your purpose.
You may test how the error rate of the model changes as you reduce its size more and more, then take the smallest model with an acceptable error.
It is a strength of the SOM that it allows empty nodes. Yet there should not be too many of them; let me say, less than 5%.
Taken all together, from experience, I would recommend the following criterion: a minimum absolute number of 8..10 records, but those should not be more than 5% of all clusters.
The 5% rule is of course a heuristic, which can however be justified by the general use of confidence levels in statistical tests. You may choose any percentage from 1% to 5%.
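One possible reading of the criterion above, sketched in Python: count the nodes that received fewer than a minimum number of records and require that they make up less than 5% of the map. The function names and this interpretation are assumptions of the sketch, and hits_per_node stands for the number of records assigned to each node of a trained map.

```python
def sparse_node_fraction(hits_per_node, min_records=8):
    """Fraction of map nodes that received fewer than min_records records."""
    sparse = sum(1 for hits in hits_per_node if hits < min_records)
    return sparse / len(hits_per_node)

def map_is_acceptable(hits_per_node, min_records=8, max_fraction=0.05):
    """True if fewer than max_fraction of the nodes are sparsely populated."""
    return sparse_node_fraction(hits_per_node, min_records) < max_fraction

# Hypothetical hit counts for a 40-node map with one empty node.
hits = [12] * 39 + [0]
print(sparse_node_fraction(hits))                 # 0.025
print(sparse_node_fraction(hits, min_records=1))  # fraction of empty nodes only
print(map_is_acceptable(hits))                    # True
```

Setting min_records to 1 checks only for empty nodes. When shrinking the map step by step, the same check can be repeated for each candidate size.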

Resources