Accumulated normalized fitness - normalization

I'm building a genetic algorithm and I stumbled on this:
Accumulated normalized fitness values are computed (the accumulated
fitness value of an individual is the sum of its own fitness value
plus the fitness values of all the previous individuals). The
accumulated fitness of the last individual should be 1 (otherwise
something went wrong in the normalization step).
from Wikipedia
Could anyone explain why I should do it? What do I gain from this kind of normalization?
I already have a normalized fitness score (I'm using gene_score/total_scores), which gives me a sum of all the scores equal to 1. With that I can sort the genes from best to worst and do any kind of recombination / crossover.

The accumulated normalized fitness is the cumulative sum of the normalized fitnesses of each individual.
Example: you have individuals with fitnesses 2, 1, and 7. Their normalized fitnesses are 0.2, 0.1 and 0.7. Their accumulated fitnesses are then 0.2, 0.3 and 1. The order may differ but it has no effect on the selection procedure.
The accumulated fitness allows you to do the selection (of one individual) with a single randomly generated number.
If your fitness is already normalized, further normalization will, of course, do nothing and you can safely skip this step and proceed to the accumulation.
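As a minimal sketch of that roulette-wheel (fitness-proportionate) selection, assuming plain Python lists of individuals and raw fitness values (the select_one helper is just illustrative):

    import random

    def select_one(population, fitnesses):
        """Roulette-wheel selection using accumulated normalized fitness."""
        total = sum(fitnesses)
        accumulated = []
        running = 0.0
        for f in fitnesses:
            running += f / total          # normalize, then accumulate
            accumulated.append(running)   # last entry ends up at ~1.0

        r = random.random()               # a single random number in [0, 1)
        for individual, acc in zip(population, accumulated):
            if r <= acc:
                return individual
        return population[-1]             # guard against floating-point round-off

    # Example from above: fitnesses 2, 1, 7 -> accumulated 0.2, 0.3, 1.0
    print(select_one(["a", "b", "c"], [2, 1, 7]))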
There are also other selection techniques, like ranking selection or tournament selection, where no normalization or accumulation is needed.

Time Series Analysis, ARIMA model

I'm currently performing a time series analysis using the ARIMA model. I cannot read the PACF and ACF graphs in order to determine P & Q values. Any help would be appreciated.
Thanks
Autocorrelation shows the correlation of past observations (lags) with the time series, which is the correlation of the time series with itself. If you have a time series y(t), then you calculate the correlation of y(t) and y(t-1), y(t) and y(t-2), and so on.
The problem with the autocorrelation is that so-called intermediary effects/indirect correlations are also included: if y(t) and y(t-1) correlate, and y(t-1) and y(t-2) also correlate, then the correlation of y(t) and y(t-2) is influenced indirectly. You can find a more detailed explanation here:
https://otexts.com/fpp2/non-seasonal-arima.html
Partial autocorrelation also shows the correlation of a time series and its lags, but intermediary effects are removed. That means in the PACF you can only see how y(t) is influenced directly by y(t-1), y(t-2), and so on. Maybe also have a look here:
https://towardsdatascience.com/time-series-from-scratch-autocorrelation-and-partial-autocorrelation-explained-1dd641e3076f
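If it helps, here is a minimal sketch of producing the two plots with statsmodels (the series y below is just a random-walk placeholder for your own data):

    import numpy as np
    import matplotlib.pyplot as plt
    from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

    # Placeholder series; replace with your own time series.
    rng = np.random.default_rng(42)
    y = np.cumsum(rng.standard_normal(200))

    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
    plot_acf(y, lags=20, ax=ax1)    # autocorrelation, indirect effects included
    plot_pacf(y, lags=20, ax=ax2)   # partial autocorrelation, indirect effects removed
    plt.tight_layout()
    plt.show()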
Now, how do you read ACF and PACF plots? The plot shows you the correlation for each lag. A correlation coefficient usually ranges from -1, meaning a perfect negative relationship, to +1, meaning a perfect positive relationship; 0 means no relationship at all. In your case, y(t) and y(t-1) -> lag 1 correlate with a coefficient of around -0.55, meaning a moderately strong negative relationship. y(t) and y(t-8) -> lag 8 correlate with a coefficient of +0.3, meaning a weak positive relationship. The confidence limit shows you whether a correlation is statistically significant: basically, every bar that crosses the line is a "true" correlation rather than a random one, and you can use those correlations. On the other hand, y(t) and y(t-2) -> lag 2 have a very weak correlation that seems to be more or less random, so you cannot use that relationship.
In general, strong, significant spikes in the PACF indicate an AR component, so you would use an ARIMA(p, d, 0) model. I would recommend using the first, third and fourth lags, maybe also the fifth, since these have at least moderately strong, significant correlations. This means an ARIMA([1, 3, 4, 5], d, 0) model.
But you also need the ACF plot to find the best ARIMA order, especially the q value.
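A sketch of fitting such a model with statsmodels (assuming d = 1 and using a list of specific AR lags, which statsmodels accepts in place of a single integer p; the series y is a placeholder):

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    # Placeholder series; replace with your own data.
    rng = np.random.default_rng(42)
    y = np.cumsum(rng.standard_normal(200))

    # A list of specific AR lags can be passed instead of a single integer p.
    model = ARIMA(y, order=([1, 3, 4, 5], 1, 0))
    result = model.fit()
    print(result.summary())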
First perform one differencing step, and then read the number of significant lags from the PACF; for the current plot, p is 8, which is quite high.

Continuous or categorical data in data science

I am building an automated cleaning process that cleans null values from the dataset. I discovered a few functions like mode, median, and mean which could be used to fill NaN values in given data. But which one should I select? If the data is categorical, it has to be either mode or median, while for continuous data it has to be mean or median. So, to determine whether data is categorical or continuous, I decided to build a machine learning classification model.
I took a few features, like:
1) standard deviation of data
2) Number of unique values in data
3) total number of rows of data
4) ratio of unique values to total number of rows
5) minimum value of data
6) maximum value of data
7) number of data between median and 75th percentile
8) number of data between median and 25th percentile
9) number of data between 75th percentile and upper whiskers
10) number of data between 25th percentile and lower whiskers
11) number of data above upper whisker
12) number of data below lower whisker
First, with these 12 features and around 55 training examples, I used a logistic regression model on the normalized features to predict label 1 (continuous) or 0 (categorical).
The fun part is: it worked!
But did I do it the right way? Is this a correct method to predict the nature of the data? Please advise me on how I could improve it further.
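For reference, a simplified sketch of the idea (the column_features helper is illustrative, not my actual pipeline, and assumes numerically encoded columns):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    def column_features(col: pd.Series) -> list:
        """A few of the per-column summary features listed above."""
        col = col.dropna()
        q1, med, q3 = col.quantile([0.25, 0.5, 0.75])
        iqr = q3 - q1
        upper, lower = q3 + 1.5 * iqr, q1 - 1.5 * iqr
        return [
            col.std(),                            # 1) standard deviation
            col.nunique(),                        # 2) number of unique values
            len(col),                             # 3) total number of rows
            col.nunique() / len(col),             # 4) ratio of unique values to rows
            col.min(),                            # 5) minimum
            col.max(),                            # 6) maximum
            ((col > med) & (col <= q3)).sum(),    # 7) between median and 75th percentile
            ((col >= q1) & (col < med)).sum(),    # 8) between 25th percentile and median
            ((col > q3) & (col <= upper)).sum(),  # 9) between 75th percentile and upper whisker
            ((col >= lower) & (col < q1)).sum(),  # 10) between lower whisker and 25th percentile
            (col > upper).sum(),                  # 11) above upper whisker
            (col < lower).sum(),                  # 12) below lower whisker
        ]

    # X: one feature row per column, y: 1 = continuous, 0 = categorical (hand-labelled)
    # clf = LogisticRegression().fit(StandardScaler().fit_transform(X), y)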
The data analysis seems awesome. For the part
But which one should I select?
the mean has always been the winner as far as I have tested. For every dataset, I test all the cases and compare accuracy.
There is a better approach, but it is a bit time consuming. If you want to take this system forward, it can help.
For each missing value, find the nearest neighbour of its row and replace the missing value with that neighbour's value. Suppose you have N columns excluding the target: for each column with missing data, treat it as the dependent variable and the rest of the N-1 columns as independent variables, find the nearest neighbour of the incomplete row, and its output (the dependent variable) is the desired value for the missing attribute.
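A rough sketch of this idea using scikit-learn's KNNImputer, which, with n_neighbors=1, fills each missing value from the single nearest row measured over the columns that are present (the small DataFrame is just a placeholder):

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    # Placeholder numeric data with missing values; replace with your own frame.
    df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                       "b": [10.0, 20.0, 30.0, np.nan],
                       "c": [0.1, 0.2, 0.3, 0.4]})

    # n_neighbors=1: copy the value from the nearest complete neighbour.
    imputer = KNNImputer(n_neighbors=1)
    filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    print(filled)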
But which one should I select? If the data is categorical it has to be either mode or median, while for continuous it has to be mean or median.
Usually, the mode is used for categorical data and the mean for continuous data. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns with NaN values, you can include columns with mean replacement, median replacement, and also a boolean 'value is NaN' column. But it is better not to use linear models in this case, since you can run into correlated features.
Besides, there are many other methods to replace NaN values, for example the MICE algorithm.
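A sketch of the simple replacements with an 'is NaN' indicator, plus a MICE-style imputation via scikit-learn's IterativeImputer (still experimental, so it has to be enabled explicitly):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                       "b": [np.nan, 2.0, 2.0, 8.0]})

    # Mean replacement plus boolean "was NaN" indicator columns.
    mean_filled = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(df)

    # MICE-style iterative imputation: each column is modelled from the others.
    mice_filled = IterativeImputer(random_state=0).fit_transform(df)

    print(mean_filled)
    print(mice_filled)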
Regarding the features you use: they are OK, but I'd advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to a Gaussian distribution (and other distributions)
the number of 1D Gaussians you need to fit your column (via a GMM; this won't perform well for 55 rows)
All these items you can compute on the original data as well as on transformed data (log, exp).
To explain: you can have a column with many categories inside, and with the old approach it may simply look like a numerical column even though it is not numerical. A distribution-matching algorithm may help here.
You can also use different normalization. RobustScaler from sklearn may work well (it may help in cases where categories have levels very similar to outlier values).
And one last piece of advice: you can use a random forest model for this and get the important columns. This list may give some direction for feature engineering/generation.
And, of course, taking a look at the misclassification (confusion) matrix, and at which samples the errors happen for, is also a good thing!
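A sketch of those last two suggestions, feature importances from a random forest plus a cross-validated confusion matrix (X, y and feature_names below are random placeholders for the real 12-feature table and labels):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import cross_val_predict

    # Placeholder data: replace with the real 12-feature table and labels.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((55, 12))
    y = rng.integers(0, 2, 55)                  # 1 = continuous, 0 = categorical
    feature_names = [f"feature_{i}" for i in range(12)]

    clf = RandomForestClassifier(n_estimators=200, random_state=0)

    # Cross-validated predictions keep the confusion matrix honest on ~55 rows.
    pred = cross_val_predict(clf, X, y, cv=5)
    print(confusion_matrix(y, pred))

    # Feature importances hint at which summary statistics matter most.
    clf.fit(X, y)
    for name, imp in sorted(zip(feature_names, clf.feature_importances_), key=lambda t: -t[1]):
        print(f"{name}: {imp:.3f}")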

Reward function with a neural network approximated Q-function

In Q-learning, how should I represent my Reward function if my Q-function is approximated by a normal Feed-Forward Neural Network?
Should I represent it as discrete values like "near", "very near" to the goal, etc.? All I'm concerned about is this: now that I have moved to a neural network approximation of the Q-function, Q(s, a, θ), and am no longer using a lookup table, would I still be obliged to build a reward table as well?
There is no such thing as a "reward table"; you are supposed to define a "reward signal", which is produced for a given agent-world state at a given timestep. This reward should be a scalar (a number). In general you could consider more complex rewards, but in the typical setting of Q-learning the reward is just a number, since the goal of the algorithm is to find a policy that maximizes the expected sum of discounted rewards. Obviously you need an object that can be added, multiplied and, finally, compared, and effectively such objects are only numbers (or things that can be directly transformed to numbers). Having said that, for your particular case: if you know the distance to the goal, you can give a reward that is inversely proportional to the distance; it can even be -distance, or 1/distance (as this will guarantee better scaling).
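For instance, a minimal sketch of such a distance-based reward signal (position and goal are placeholder vectors for whatever your state representation is):

    import numpy as np

    def reward(position: np.ndarray, goal: np.ndarray) -> float:
        """Scalar reward signal: the closer to the goal, the higher the reward."""
        distance = float(np.linalg.norm(goal - position))
        return -distance          # or 1.0 / (distance + 1e-6)

    # An agent two units away from the goal receives a reward of -2.0.
    print(reward(np.array([0.0, 0.0]), np.array([0.0, 2.0])))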

finding maximum depth of random forest given the number of features

How do we find the maximum depth of a random forest if we know the number of features?
This is needed for regularizing a random forest classifier.
I have not thought about this before. In general the trees are non-deterministic. Instead of asking what the maximum depth is, you may want to know what the average depth would be, or what the chance is that a tree reaches depth 20... Anyway, it is possible to calculate some bounds on the maximum depth: a branch stops growing when a node runs out of either (a) in-bag samples or (b) possible splits.
(a) If the number of in-bag samples (N) is the limiting part, one could imagine a classification tree where, at every split, all samples except one are forwarded left. Then the maximum depth is N-1. This outcome is highly unlikely, but possible. In the minimal-depth tree, where all child nodes are equally big, the depth would be ~log2(N), e.g. 16, 8, 4, 2, 1. In practice the tree depth will be somewhere between the minimum and the maximum. Settings controlling the minimal node size would reduce the depth.
(b) To check whether features are limiting tree depth, and if you know the training set beforehand, count how many training samples are unique. Unique samples (U) cannot be split further. Due to bootstrapping, only ~0.63 of the samples will be selected for every tree, so N ~ U * 0.63; then use the rules from section (a). All unique samples could be selected during bootstrapping, but that is unlikely too.
If you do not know your training set, try to estimate how many levels (L[i]) could possibly be found in each feature i out of the d features. For categorical features the answer may be given. For numeric features drawn from a real distribution, there would be as many levels as there are samples. The number of possible unique samples would be U = L[1] * L[2] * L[3] * ... * L[d].
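A tiny sketch of these back-of-the-envelope bounds (the 0.632 in-bag fraction and the example size are illustrative):

    import math

    def depth_bounds(n_train: int, inbag_fraction: float = 0.632) -> tuple:
        """Rough bounds on tree depth from the number of in-bag samples."""
        n_inbag = int(n_train * inbag_fraction)     # ~63.2% unique samples per bootstrap
        max_depth = n_inbag - 1                     # degenerate one-sample-per-split chain
        min_depth = math.ceil(math.log2(n_inbag))   # perfectly balanced tree
        return min_depth, max_depth

    # Example: 1000 training rows -> real trees end up somewhere in between.
    print(depth_bounds(1000))   # (10, 631)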

FFT coefficients question

I'm a software engineer working on DSP for the first time.
I'm successfully using an FFT library that produces frequency spectra. I also understand how the FFT works in terms of its inputs and outputs, in particular the contents of the two output arrays (the real and imaginary parts).
Now, my problem is that I'm reading some new research reports that suggest that I extract: "the energy, variance, and sum of FFT coefficients".
What are the 'FFT coefficients'? Are those the values of the Real and Imaginary arrays shown above, which (from my understanding) correspond to the amplitudes of the constituent cosine and sine waves?
What is the 'energy' of the FFT coefficients? Is that terminology from statistics or from DSP?
You are correct. FFT coefficients are the signal values in the frequency domain.
"Energy" is the square modulus of the coefficients. The total energy (sum of square modulus of all values) in either time or frequency domain is the same (see Parseval's Theorem).
The real and imaginary arrays, when put together, represent a complex array. Every element of that complex array in the frequency domain can be considered a frequency coefficient, and has a magnitude sqrt(R*R + I*I). Parseval's theorem says that the sum of the squared magnitudes of all the frequency-domain complex values is equal to the energy of the time-domain signal (possibly up to a scaling factor involving the FFT length, depending on your particular DFT/FFT library implementation).
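A quick numerical check of this with NumPy's unnormalized FFT convention, where the frequency-domain sum has to be divided by the length N (a sketch with a random placeholder signal):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)                   # placeholder time-domain signal

    X = np.fft.fft(x)                               # complex FFT coefficients
    time_energy = np.sum(np.abs(x) ** 2)            # energy in the time domain
    freq_energy = np.sum(np.abs(X) ** 2) / len(x)   # divide by N for numpy's convention

    print(time_energy, freq_energy)                 # agree to numerical precision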
One example of a time-domain signal is the voltage on a wire: volts times amps (or volts squared divided by ohms) represents power, and over time, energy. The word "energy" in the strictly numerical case is probably derived from historical usage in physics and engineering, where the numbers meant something that could burn your fingers.
