BigQuery Linear Regression params - machine-learning

What is the actual meaning of these two points in BigQuery? I get that in the 2nd, maybe, "total cardinality" actually means the number of features. What about point 1?
If total cardinalities of training features are more than 10,000, batch_gradient_descent strategy is used.
If there is over-fitting issue, that is, the number of training examples is less than 10x, where x is total cardinality, batch_gradient_descent strategy is used.

The cardinality is the number of possible values for a feature. The total cardinality is the sum of the cardinalities of all the features.
For #2, that means you must provide at least 10 times more training examples than the sum of the cardinalities of all the features. That ensures you have enough examples per feature value and therefore prevents over-fitting.
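As a rough illustration (this is not BigQuery's internal computation, just a pandas sketch of what "total cardinality" and the 10x rule mean for categorical features):

```python
# Rough pandas sketch of "total cardinality" and the 10x rule of thumb.
import pandas as pd

# Hypothetical training table with two categorical features.
df = pd.DataFrame({
    "country": ["US", "DE", "FR", "US", "DE"],
    "device":  ["mobile", "desktop", "mobile", "tablet", "desktop"],
})

cardinalities = df.nunique()             # country: 3, device: 3
total_cardinality = cardinalities.sum()  # 6
n_examples = len(df)                     # 5

print(f"total cardinality: {total_cardinality}, training examples: {n_examples}")

# Point 2: with fewer than 10x as many examples as the total cardinality,
# the model is considered prone to over-fitting.
if n_examples < 10 * total_cardinality:
    print("fewer than 10x examples -> over-fitting risk, batch_gradient_descent is used")
```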

Related

Optimal number of folds for k-fold cross-validation

Can anyone recommend a more formal method of establishing the optimal number of folds, less than the maximum possible one and not requiring time-consuming simulations (which would predictably find the top of the range of tested k values to be the best)?
More info
From theory and simulations we know that model metrics tend to generally increase (with some variance) as a function of the number of folds (k). It is therefore suboptimal to use anything less than the maximum number of folds still feasible given data size and time constraints.
So using the standard default values of 5 or 10 folds is in fact an example of hyperparameter optimization too, but one performed collectively, so the value need not be re-optimized for each problem, only switched according to the time constraints of model training. As a special case, in time-consuming training setups such as deep learning there is no time for multiple folds, so only a single validation set is normally used.
One imperfect solution can be borrowed from PCA scree plots - the so-called elbow point - but it needs formalization, and it requires exactly those simulations over fold numbers that we wanted to avoid.
For example, according to my simulation over hundreds of models (sklearn breast cancer data classification), the optimal elbow point would be around 3-5 folds.
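A minimal sketch of that kind of simulation (the model and metric below are my own illustrative choices, not necessarily the ones used for the results quoted above):

```python
# Score the same model with an increasing number of folds on the sklearn
# breast cancer data and look for the elbow in the mean score.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

for k in range(2, 21):
    scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
    print(f"k={k:2d}  mean accuracy={scores.mean():.4f}  std={scores.std():.4f}")
```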

Understanding Precision@K, AP@K, MAP@K

I'm currently evaluating a recommender system based on implicit feedback. I've been a bit confused with regard to the evaluation metrics for ranking tasks. Specifically, I am looking to evaluate by both precision and recall.
Precision@k has the advantage of not requiring any estimate of the size of the set of relevant documents, but the disadvantages that it is the least stable of the commonly used evaluation measures and that it does not average well, since the total number of relevant documents for a query has a strong influence on precision at k.
I have noticed myself that it tends to be quite volatile and as such, I would like to average the results from multiple evaluation logs.
I was wondering: say I run an evaluation function which returns the following array:
a NumPy array containing the precision@k scores for each user.
Now I have an array of all of the precision@3 scores across my dataset.
If I take the mean of this array and average across, say, 20 different scores: is this equivalent to Mean Average Precision@K (MAP@K), or am I understanding this a little too literally?
I am writing a dissertation with an evaluation section so the accuracy of the definitions is quite important to me.
There are two averages involved, which makes the concepts somewhat obscure, but they are pretty straightforward - at least in the recsys context. Let me clarify them:
P@K
How many relevant items are present in the top-k recommendations of your system.
For example, to calculate P@3: take the top 3 recommendations for a given user and check how many of them are good ones. That number divided by 3 gives you P@3.
AP@K
The mean of P@i for i = 1, ..., K.
For example, to calculate AP@3: sum P@1, P@2 and P@3 and divide that value by 3.
AP@K is typically calculated for one user.
MAP@K
The mean of AP@K for all the users.
For example, to calculate MAP@3: sum AP@3 for all the users and divide that value by the number of users.
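If it helps to see this in code, here is a minimal sketch of the three definitions exactly as given above (note that standard library implementations use a slightly different AP@K formula, counting precision only at the positions of relevant items):

```python
def precision_at_k(recommended, relevant, k):
    # P@k: fraction of the top-k recommendations that are relevant.
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def average_precision_at_k(recommended, relevant, k):
    # AP@k (simplified, as defined above): mean of P@1, ..., P@k for one user.
    return sum(precision_at_k(recommended, relevant, i) for i in range(1, k + 1)) / k

def mean_average_precision_at_k(recommended_per_user, relevant_per_user, k):
    # MAP@k: mean of AP@k over all users.
    ap_scores = [
        average_precision_at_k(rec, rel, k)
        for rec, rel in zip(recommended_per_user, relevant_per_user)
    ]
    return sum(ap_scores) / len(ap_scores)

# Example: two users with their top-3 recommendations and relevant item sets.
recs = [["a", "b", "c"], ["x", "y", "z"]]
rels = [{"a", "c"}, {"y"}]
print(mean_average_precision_at_k(recs, rels, k=3))
```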
If you are a programmer, you can check this code, which is the implementation of the functions apk and mapk of ml_metrics, a library maintained by the CTO of Kaggle.
Hope it helped!

Continuous or categorical data in data science

I am building an automated cleaning process that cleans null values from a dataset. I discovered a few functions, like mode, median and mean, which can be used to fill NaN values in the given data. But which one should I select? If the data is categorical it has to be either mode or median, while for continuous data it has to be mean or median. So, to determine whether data is categorical or continuous, I decided to build a machine learning classification model.
I took a few features, such as:
1) standard deviation of the data
2) number of unique values in the data
3) total number of rows of data
4) ratio of unique values to total rows
5) minimum value of the data
6) maximum value of the data
7) number of data points between the median and the 75th percentile
8) number of data points between the median and the 25th percentile
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker
With these 12 features and around 55 training examples, I used a logistic regression model on the normalized data to predict label 1 (continuous) or 0 (categorical).
The fun part is that it worked!!
But did I do it the right way? Is it a correct method to predict the nature of the data? Please advise me if I could improve it further.
The data analysis seems awesome. For the part
But which one should I select?
Mean has always been the winner as far as I have tested; for every dataset I try, I test all the cases and compare accuracy.
There is a better approach, but it is a bit time-consuming. If you want to take this system forward, it can help.
For each column with missing data, find the nearest neighbor and replace the missing value with that neighbor's value. Suppose you have N columns excluding the target: for each column with missing values, treat it as the dependent variable and the remaining N-1 columns as independent variables, find the row's nearest neighbor, and use that neighbor's value of the dependent variable as the replacement for the missing attribute.
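A practical approximation of that nearest-neighbor idea is sklearn's KNNImputer, which fills each missing value from the rows that are closest on the remaining (non-missing) columns; this is a minimal sketch, not the exact per-column regression setup described above:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Fill each NaN from the single nearest row (by distance on the observed columns).
imputer = KNNImputer(n_neighbors=1)
print(imputer.fit_transform(X))
```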
But which one should I select? If the data is categorical it has to be either mode or median, while for continuous data it has to be mean or median.
Usually, for categorical data the mode is used; for continuous data, the mean. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns with NaNs, you can include columns with mean replacement, median replacement and also a boolean column 'value is NaN'. But it is better not to use linear models in this case - you can run into correlation issues.
Besides, there are many other methods to replace NaNs, for example the MICE algorithm.
Regarding the features you use: they are OK, but I would advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to Gaussian Distribution (and other distributions)
the number of 1D Gaussians needed to fit your column (a GMM; this won't perform well with only 55 rows)
You can compute all of these items on the original data as well as on transformed data (log, exp).
To explain: a column can have many categories inside, and with the old approach it may simply look like a numerical column even though it is not. A distribution-matching algorithm may help here.
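As a rough sketch of those distribution features for a single column (using scipy; the normality test stands in for "similarity to a Gaussian distribution"):

```python
import numpy as np
from scipy import stats

# A synthetic, skewed numerical column as a stand-in for one of your columns.
col = np.random.default_rng(0).lognormal(size=200)

features = {
    "skewness": stats.skew(col),
    "kurtosis": stats.kurtosis(col),
    "normaltest_pvalue": stats.normaltest(col).pvalue,  # low p-value -> not Gaussian-like
    "log_skewness": stats.skew(np.log(col)),            # same feature on the log-transformed data
}
print(features)
```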
You can also try different normalizations. RobustScaler from sklearn will probably work well (it may help in cases where categories have levels very similar to 'outlier' values).
And a last piece of advice: you can use a random forest model for this task and get the important columns. That list may give some direction for feature engineering/generation.
And, of course, taking a look at the misclassification matrix, and at which features the errors happen on, is also a good thing!
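A rough sketch of those last two suggestions; X and y below are placeholders standing in for the 12 meta-features and the continuous/categorical labels from the question:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(55, 12))      # placeholder for the 12 meta-features
y = rng.integers(0, 2, size=55)    # placeholder labels: 1 = continuous, 0 = categorical

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("feature importances:", forest.feature_importances_)

# Out-of-fold predictions give an honest misclassification matrix on only 55 rows.
y_pred = cross_val_predict(forest, X, y, cv=5)
print(confusion_matrix(y, y_pred))
```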

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and used a two-way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom were reported as well. Is it normal to report those values too? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result and when not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to get statistical evidence of whether the measured groups have the same mean or not. So we state a null hypothesis and an alternative hypothesis, and then we use a test statistic on the data. You can use ANOVA if the groups have the same variance (the squared standard deviation).
You need to test this first. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make the decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume that the data follows the normal distribution.) For the ANOVA itself, the null hypothesis is that the groups have equal means, and the alternative hypothesis is that at least one group has a different mean. Again you make your decision from the Sig. value; as I said before, if the value is higher than 0.05 we accept the null hypothesis. The critical F-value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper critical F-values, and if the F-value lies in that interval you accept the null hypothesis, but I only used this method in statistics class. You don't need the F-value and the degrees of freedom in the report, because they don't explain anything on their own.
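The same decision logic can be sketched outside SPSS; for example, with scipy (a plain one-way ANOVA rather than the two-way repeated-measures design, but the Sig./p-value reasoning is the same):

```python
import numpy as np
from scipy import stats

# Three synthetic groups standing in for the measured conditions.
rng = np.random.default_rng(0)
group_a = rng.normal(10.0, 2.0, size=30)
group_b = rng.normal(10.5, 2.0, size=30)
group_c = rng.normal(12.0, 2.0, size=30)

# Step 1: test equality of variances (Levene's test).
lev_stat, lev_p = stats.levene(group_a, group_b, group_c)
print(f"Levene p = {lev_p:.3f}  (p > 0.05 -> variances can be treated as equal)")

# Step 2: the ANOVA itself; the F-value and the p-value ("Sig." in SPSS) come from here.
f_value, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_value:.2f}, p = {p_value:.4f}  (p < 0.05 -> reject equal means)")
```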

Classification - What can we use when the number of dimensions is greater than the number of samples?

I've read on the scikit-learn website that SVM is a good choice when the number of dimensions is greater than the number of samples.
I would like to know what you (as experienced users) think is the most efficient approach in these cases, where the class to predict is binary.
And especially what to do when the number of labeled samples is about 50.
Algorithms that should work? Things to care about?
If you have dense data, and n < d, you have a high risk of overfitting: each dimension or differences of dimensions could uniquely identify your training records.
That doesn't mean it won't work, only that you have to be extra careful when evaluating your method, because of the high risk of overfitting.
Methods that only use one dimension at a time - such as decision trees - could be less affected, in particular if you apply pruning techniques to your tree.
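As a sketch of how careful the evaluation needs to be at this scale (the dataset, model and settings here are just illustrative assumptions for roughly 50 labeled samples with many more dimensions):

```python
# Heavily regularized linear SVM on 50 samples x 500 features, scored with
# repeated stratified cross-validation so the variance of the estimate is visible.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=50, n_features=500, n_informative=10,
                           random_state=0)

model = make_pipeline(StandardScaler(), LinearSVC(C=0.1, max_iter=10000))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```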
