CatBoost: what are reasonable values for l2_leaf_reg? - machine-learning

Running catboost on a large-ish dataset (~1M rows, 500 columns), I get:
Training has stopped (degenerate solution on iteration 0, probably too small l2-regularization, try to increase it).
How do I guess what the l2 regularization value should be? Is it related to the mean values of y, number of variables, tree depth?
Thanks!

I don't think you will find an exact answer to your question, because every dataset is different.
However, in my experience, values in the range of 2 to 30 are a good starting point.
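A minimal sketch of how you might search that range on a held-out split, assuming the Python catboost package; X_train, y_train, X_val, y_val are placeholders for your own data:

from catboost import CatBoostRegressor
import numpy as np

best_rmse, best_reg = float("inf"), None
for reg in [2, 5, 10, 20, 30]:  # the 2-30 range suggested above
    model = CatBoostRegressor(l2_leaf_reg=reg, iterations=500, verbose=False)
    model.fit(X_train, y_train)
    rmse = np.sqrt(np.mean((model.predict(X_val) - y_val) ** 2))
    if rmse < best_rmse:
        best_rmse, best_reg = rmse, reg
print(best_reg, best_rmse)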

Related

Why do I get fewer clusters than the given k?

I compute the maximal number of clusters and get k=5. Then I generate the clusters using k-means++ with k=5, but I get only 4 clusters as a result. Is that correct?
Does it go from 0 to 4? That's 5 clusters. It's zero-based.
How are you currently computing the value of k?
What's likely happening is that one of your clusters is empty, causing it to disappear. This usually means that the value of k is too high for your dataset; if you lower k, you should stop losing clusters.
I have found that some implementations of k-means++ will always give nearly identical initial centroid placement. If you need k=5 for some reason, you could try plain k-means (random centroid initialisation) and see if that helps.
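As a quick check, you can count how many distinct labels actually come back. A minimal sketch, assuming scikit-learn with a placeholder feature matrix X (note that implementations differ here: scikit-learn re-seeds empty clusters, while other libraries may silently drop them):

import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
print(len(np.unique(km.labels_)))  # fewer than 5 means a cluster came back empty

# same k with random initialisation, for comparison
km_rand = KMeans(n_clusters=5, init="random", n_init=10, random_state=0).fit(X)
print(len(np.unique(km_rand.labels_)))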

How to Address Noise Resulting from Inverse-Scaling for a Machine Learning Task

I'm still a little unsure whether questions like these belong on Stack Overflow. Is this website only for questions with explicit code? The "How to Format" guide just tells me it should be a programming question, which it is. I will remove my question if the community thinks otherwise.
I have created a neural network and am predicting reasonable values for most of my data (the task is multi-variate time series forecasting).
I scale my data before inputting it using scikit-learn's MinMaxScaler(0,1) or MinMaxScaler(-1,1) (the two primary scalings I am using).
The model learns and predicts, and I invert the scaling using MinMaxScaler()'s inverse_transform method to visually see how close my predictions were to the actual values. However, I notice that the inverse-scaled values for a particular part of the predicted vector have become very noisy. Here is what I mean (inverse-scaled prediction):
Left end: noisy; right end: not-so noisy.
I initially thought that perhaps my network didn't learn that part of the vector well, so it was just outputting ~random values. BUT, I notice that the predicted values before the inverse scaling match the actual values very well, and that these values are typically near 0 or -1 (the lower limit of the feature scale) because these values have a very large spread (unscaled mean ≈ 1E-1, max ≈ 1E+1 [not an outlier]). Example (scaled prediction):
So, when inverse-transforming these values (again, often near -1 or 0), the transformed values exhibit strong noise, as shown in the images.
Questions:
1.) Should I be using a different scaler, or scale differently, perhaps one that scales exponentially/nonlinearly? MinMaxScaler() scales each column. Simply dropping the high-magnitude data isn't an option, since they are real, meaningful values.
2.) What other solutions can help with this?
Please let me know if you'd like anything else clarified.
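For question 1, one option is a nonlinear transform that compresses the spread before (or instead of) min-max scaling. A minimal sketch, assuming NumPy and scikit-learn, with a hypothetical skewed column values:

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# hypothetical column with a very large spread (mean ~1E-1, max ~1E+1)
values = np.concatenate([np.random.normal(0.1, 0.05, 1000),
                         np.random.normal(10.0, 1.0, 20)]).reshape(-1, 1)

# Option A: compress the spread with a signed log before min-max scaling
log_scaled = np.sign(values) * np.log1p(np.abs(values))

# Option B: map to a uniform distribution; invertible like MinMaxScaler
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
qt_scaled = qt.fit_transform(values)
restored = qt.inverse_transform(qt_scaled)

The idea is that small values then occupy a larger portion of the scaled range, so small prediction errors near the lower limit no longer blow up after the inverse transform.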

Machine learning: why do we need to weight data?

This may sound like a very naive question. I checked Google and many YouTube videos for beginners, and pretty much all of them treat data weighting as something completely obvious. I still do not understand why data is weighted.
Let's assume I have four features:
a b c d
1 2 1 4
If I pass each value to a sigmoid function, I already receive a value between -1 and 1.
I really don't understand why the data needs to be, or is recommended to be, weighted first. If you could explain this to me in a very simple manner, I would appreciate it a lot.
I think you are not talking about weighting data but weighting features.
A feature is a column in your table, whereas by data I would understand rows.
The confusion comes from the fact that weighting rows is also sometimes sensible, e.g., if you want to penalize misclassification of the positive class more heavily.
Why do we need to weight features?
I assume you are talking about a model like
prediction = sigmoid(sum_i weight_i * feature_i) > threshold
Let's assume you want to predict whether a person is overweight based on body weight, height, and age.
In R we can generate a sample dataset as
height = rnorm(100, 1.80, 0.1)      # normally distributed: mean 1.8, sd 0.1
weight = rnorm(100, 70, 10)         # normally distributed: mean 70, sd 10
age    = runif(100, 0, 100)         # uniformly distributed between 0 and 100
ow     = weight / (height**2) > 25  # overweight if BMI > 25
data   = data.frame(height, weight, age, ow)
If we now plot the data, you can see that at least my sample can be separated with a straight line in the weight/height plane. Age, however, does not provide any value. If we weight each feature prior to the sum/sigmoid, we can put all factors into relation.
Furthermore, as you can see from the plot, weight and height have very different domains, so they need to be put into relation for the separating line to have the right slope: the values of weight are an order of magnitude larger than those of height.
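For comparison, here is a minimal sketch of the same idea in Python, assuming NumPy and scikit-learn (the variable names mirror the R example above): the logistic model learns the feature weights itself, and the fitted weights put the three features into relation.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
height = rng.normal(1.80, 0.1, 100)
weight = rng.normal(70, 10, 100)
age = rng.uniform(0, 100, 100)
ow = (weight / height**2 > 25).astype(int)  # overweight if BMI > 25

X = np.column_stack([height, weight, age])
clf = LogisticRegression(C=1e6, max_iter=5000).fit(X, ow)
print(clf.coef_)  # expect age's weight near 0; height and weight carry the signal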

How to decide the numClasses parameter to pass to the Random Forest algorithm in Spark MLlib with PySpark

I am working on classification using the Random Forest algorithm in Spark. I have a sample dataset that looks like this:
Level1,Male,New York,New York,352.888890
Level1,Male,San Fransisco,California,495.8001345
Level2,Male,New York,New York,-495.8001345
Level1,Male,Columbus,Ohio,165.22352099
Level3,Male,New York,New York,495.8
Level4,Male,Columbus,Ohio,652.8
Level5,Female,Stamford,Connecticut,495.8
Level1,Female,San Fransisco,California,495.8001345
Level3,Male,Stamford,Connecticut,-552.8234
Level6,Female,Columbus,Ohio,7000
Here the last value in each row serves as the label and the rest serve as features. But I want to treat the label as a category, not a number. So 165.22352099 will denote a category, and so will -552.8234. For this I have encoded my features as well as my label into categorical data. What I am having difficulty with now is deciding what I should pass for the numClasses parameter of the Random Forest algorithm in Spark MLlib. Should it be equal to the number of unique values in my label? My label has around 10,000 unique values, so if I pass 10000 as numClasses, wouldn't that decrease performance dramatically?
Here is the typical signature of building a model for Random Forest in MlLib:
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)
The confusion comes from the fact that you are doing something you should not do. Your problem is clearly regression/ranking, not classification. Why would you think about it as classification? Try to answer these two questions:
Do you have at least 100 samples per value (10,000 * 100 = 1,000,000)?
Is there really no structure in the classes, i.e., are objects with value "200" no more similar to those with value "100" or "300" than to those with value "-1000" or "+2300"?
If at least one answer is no, then you should not treat this as a classification problem.
If for some strange reason you answered yes twice, then the answer is: "yes, you should encode each distinct value as a different class", leading to 10,000 unique classes, which in turn leads to:
extremely imbalanced classification (RF, without a balancing meta-learner, will nearly always fail in such a scenario)
an extreme number of classes (no model is able to solve it; RF certainly will not)
extremely low dimensionality of the problem: given how small your number of features is, I would be surprised if you could even solve a binary classification with them. And you can see how irregular these values are: you have 3 points that differ only in the first value yet get completely different results:
Level1,Male,New York,New York,352.888890
Level2,Male,New York,New York,-495.8001345
Level3,Male,New York,New York,495.8
So to sum up, with nearly 100% certainty this is not a classification problem. You should either:
regress on the last value (keyword: regression; see the sketch below)
build a ranking (keyword: learning to rank)
bucket your values into at most 10 different values and then classify (keywords: imbalanced classification, sparse binary representation)
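A minimal sketch of the first option, assuming the same MLlib RDD API as the snippet above, where trainingData is an RDD of LabeledPoint objects carrying the raw numeric value as the label (variable names are placeholders):

from pyspark.mllib.tree import RandomForest

# regress on the numeric value instead of classifying 10,000 "categories";
# fill categoricalFeaturesInfo with {featureIndex: arity} for encoded columns
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity="variance", maxDepth=4, maxBins=32)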

Predict future values using Highcharts/Highstock

I need to predict future values based on a given set of data. In the following link I found a method of obtaining a moving-average trend line:
http://www.highcharts.com/plugin-registry/single/16/technical-indicators
jsfiddle is here http://jsfiddle.net/laff/WaEBc/
But my requirement is to predict the future values based on this moving average.
I searched a lot, but couldn't find anything. Please help.
Thanks!
That's how it should work: if you need to predict, you need to calculate those points yourself. It's not built-in.
To find the equation to produce a trend line, search for Linear Regression.
You will need to calculate the slope and intercept using the linear regression calculations, and you build your trend line using those two values, combined with x values for the start and end points, defined by the min and max x values of the data set.
(i.e., your first point is {x: min x value, y: intercept + (slope * min x value)} and your second point is {x: max x value, y: intercept + (slope * max x value)})
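The math is the same in any language; here is a minimal sketch in Python (xs and ys are placeholders for your series values):

def trend_line(xs, ys):
    # least-squares slope and intercept
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # the two chart endpoints described above
    return [(min(xs), intercept + slope * min(xs)),
            (max(xs), intercept + slope * max(xs))]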
Much more importantly:
Trend lines do NOT predict future values that fall outside of the existing range of the independent variable in the data.
Using regression to plot a line in this way will help you build a predictive model of what your dependent variable may be when given a known independent variable.
It will absolutely not give you a reliable prediction of what will happen to Y as X increases beyond the scope of the known data, especially when X is a time value.
Building an actual predictive model of values over time is much more involved, and there isn't one single way to do it. It depends on what factors affect those values, and what data you have to demonstrate those effects.
Some references:
Predictive modelling
