I have used SFS from mlxtend with the r2 scorer. The thing about R² is that, in regression models, adding new features will always increase R² on the training data. So is it ever useful to use R² as a scorer, or should a custom scorer based on adjusted R² always be used instead?
While I have seen R² used in practice, I have also seen plots where R² drops as more features are added. For example, in this notebook https://www.kaggle.com/code/jorijnsmit/linear-regression-by-sequential-feature-selection the validation section has an SFS plot where R² drops as new features are added.
SFS Plot
Can someone please help me understand how this is possible?
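For reference, a minimal sketch of what such a custom adjusted-R² scorer could look like (my own example, not from the question; mlxtend's SequentialFeatureSelector accepts an sklearn-style scorer(estimator, X, y) callable for its scoring parameter, and the names below are illustrative):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

def adjusted_r2_scorer(estimator, X, y):
    # Adjusted R² penalizes plain R² for the number of features p.
    n, p = X.shape
    r2 = r2_score(y, estimator.predict(X))
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical usage with some feature matrix X and target y:
# sfs = SFS(LinearRegression(), k_features="best", forward=True,
#           scoring=adjusted_r2_scorer,   # instead of scoring="r2"
#           cv=5)
# sfs = sfs.fit(X, y)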
Generally, in statistics, the R² score is between 0 and 1. But it can be negative in BigQuery ML training results when using XGBoost (model type = BOOSTED_TREE_REGRESSOR).
So, what is the coefficient of determination R² in the evaluation of models in BigQuery ML?
The R² score can be negative. R² is not actually the square of anything, so a negative value does not violate any rules of math. R² is negative when the chosen model fits the data worse than a horizontal line at the mean of the target, i.e. when the model does not follow the trend of the data.
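As a quick numeric illustration (my own example, not tied to BigQuery ML): R² drops below zero as soon as the predictions are worse than simply predicting the mean of the target.

from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [4.0, 3.0, 2.0, 1.0]    # anti-correlated predictions

print(r2_score(y_true, y_pred))  # -3.0, far worse than the mean baseline (R² = 0)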
I'm using a Faster RCNN network to perform object/symbol detection and I'm facing 2 major issues.
The bounding box of the detected symbols is not tight enough. For example, in many cases, only 50%-70% of the entire symbol is being identified (Example: Resistor R1 in the image below). What can I do to make my bounding box more accurate?
In the below example we have 3 resistors, R1, R2, R3. The trained network identifies R1 with only partial IoU and R2 properly, but it has missed R3 completely, even though R3 is present on the same page and is the same symbol as R1 and R2. Why does this happen and how can I overcome it? (I tried a correlation-based approach, but there are too many variations to consider in my use case.)
How can I fix the above issues? Thanks in advance.
[Question's image: schematic with resistors R1, R2, R3 and the detected bounding boxes]
Question 2
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, kJ/mole is the unit measuring the amount of energy released.
You would like to use linear regression (h_a(x) = a0 + a1*x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0=−1780.0, a1=−530.9 B) a0=−569.6, a1=−530.9
C) a0=−1780.0, a1=530.9 D) a0=−569.6, a1=530.9
Since all the a0 options are negative but two of the a1 options are positive, let's figure out a1 first.
As you can see, as the number of carbon atoms increases, the energy released becomes more and more negative, so the relationship cannot be positive, which rules out options C and D.
Then, for the intercept, the value that produces the least error against the table is the correct one. For x = 1 and x = 10 (easier to calculate), option A predicts about -2300 and -7000, while option B predicts about -1100 and -5900; comparing these with the table values, one would prefer B over A.
PS: You might be thinking there should be obvious values for a0 and a1 you could read off the data; there aren't. The intention of the question is to give you a general understanding of what a best fit looks like. Also, this way of solving it is kind of machine learning in itself.
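If you did want to verify this numerically, a small sketch like the one below (my addition; it assumes you have the table's x and y columns available as Python sequences) computes the squared error of each option and picks the smallest:

def best_option(x, y):
    # x: numbers of carbon atoms from the table, y: energy released (kJ/mole).
    # Returns the option whose line a0 + a1*x has the smallest squared error.
    candidates = {"A": (-1780.0, -530.9), "B": (-569.6, -530.9),
                  "C": (-1780.0, 530.9), "D": (-569.6, 530.9)}
    sse = {name: sum((yi - (a0 + a1 * xi)) ** 2 for xi, yi in zip(x, y))
           for name, (a0, a1) in candidates.items()}
    return min(sse, key=sse.get)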
Suppose we have a set of inputs (named x1, x2, ..., xn) that give us the output y. The goal is to predict y from values of x1, ..., xn that have not been seen yet. It's clear to me that this can be modelled as a regression problem in the realm of machine learning.
However, let's say data keeps coming in. I'm able to predict y from x1, ..., xn, and I'm able to check afterwards whether or not that prediction was a good one. If it was, everything is fine. On the other hand, I would like to update my model when the prediction deviates a lot from the real y. The only way I can see to do this is to insert the new data into my training set and train the regression algorithm again. Two problems arise from that. First, it may cost more than I can afford to recompute my model from scratch from time to time. Second, I may already have so much data in my training set that newly arriving data is negligible, yet the new data might be more important than the older data due to the nature of my problem.
It seems that a good solution would be a kind of continuous regression that gives more weight to new data than to old data. I have searched for such an approach but have not found anything relevant. Perhaps I'm looking in the wrong direction. Does anyone have a clue on how to do this?
If you want to treat the newer data as more important, you have to use weights. In scikit-learn (if you use that library) this is usually the sample_weight argument of an estimator's fit() method.
The weights can be defined, for example, as 1 / (time elapsed since that observation).
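A minimal sketch of that idea with scikit-learn's LinearRegression (synthetic data and illustrative names; the ages array is assumed to hold how long ago each observation arrived):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # features x1..x3
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
ages = np.arange(100, 0, -1)                   # 100 = oldest observation, 1 = newest

weights = 1.0 / ages                           # newer observations count more
model = LinearRegression()
model.fit(X, y, sample_weight=weights)         # weighted least squares
print(model.coef_)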
Now about the second problem. If the recalculation takes too much time, you can cut down your observations and use only the latest ones. Fit your model on the whole dataset and then on the fresh data plus some part of the old data, and check how much the coefficients change. I suspect that if there really is a dependence between {x_i} and y, you don't need the whole dataset.
Otherwise you can use weights again, but this time you weight the models' coefficients:
model for old data: w1*x1 + w2*x2 + ...
model for new data: ~w1*x1 + ~w2*x2 + ...
common model: (w1*a1_1 + ~w1*a1_2)*x1 + (w2*a2_1 + ~w2*a2_2)*x2 + ...
Here a1_1, a2_1 are the mixing weights for the old model and a1_2, a2_2 those for the new one; w1, w2 are the coefficients of the old model and ~w1, ~w2 those of the new one.
The parameters {a} can be estimated as in the first part (by hand), but you can also build another linear model to estimate them. My advice, though: don't use non-linear regression for {a}; you will overfit.
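A rough sketch of this "common model" idea (synthetic data, illustrative names, and a single hand-picked mixing pair a_old/a_new instead of one pair per feature):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_old = rng.normal(size=(500, 2))
X_new = rng.normal(size=(50, 2))
y_old = X_old @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=500)
y_new = X_new @ np.array([2.5, -0.5]) + rng.normal(scale=0.1, size=50)   # drifted relation

old_model = LinearRegression().fit(X_old, y_old)   # coefficients w1, w2
new_model = LinearRegression().fit(X_new, y_new)   # coefficients ~w1, ~w2

a_old, a_new = 0.3, 0.7                            # mixing parameters {a}, set by hand
coef = a_old * old_model.coef_ + a_new * new_model.coef_
intercept = a_old * old_model.intercept_ + a_new * new_model.intercept_

def common_predict(X):
    # Prediction of the blended "common model".
    return X @ coef + intercept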
Suppose that X is our dataset (still not centered) and X_cent is our centered dataset (X_cent = X - mean(X)).
If we do the PCA projection as Z_cent = F*X_cent, where F is the matrix of principal components, it is pretty obvious that we need to add mean(X) back after reconstructing from Z_cent.
But what if we do the PCA projection as Z = F*X? In this case we don't need to add mean(X) back after reconstruction, but it gives us a different result.
I think something is wrong with this procedure (projection and reconstruction) when it is applied to the non-centered data (X in our case). Can anyone explain how it works? Why can't we do the projection/reconstruction phase without this subtracting/adding of the mean?
Thank you in advance.
If you retain all Principal Components, then reconstruction of the centered and non-centered vectors as described in your question would be identical. The problem (as indicated in your comments) is that you are only retaining K principal components. When you drop PCs, you lose information so the reconstruction will contain errors. Since you don't have to reconstruct the mean in one of the reconstructions, you don't introduce errors w.r.t. the mean there so the reconstruction errors of the two versions will be different.
Reconstruction with fewer than all PCs isn't quite as simple as multiplying by the transpose of the eigenvectors (F') because you need to pad your transformed data with zeros but to keep things simple, I'll ignore that here. Your two reconstructions look like this:
R1 = F'*F*X
R2 = F'*F*X_cent + X_mean
= F'*F*(X - X_mean) + X_mean
= F'*F*X - F'*F*X_mean + X_mean
Since the reconstruction is lossy, in general F'*F*Y != Y for a matrix Y. If you retained all PCs, you would have R1 - R2 = 0. But since you are only retaining a subset of the PCs, your two reconstructions will differ by
R2 - R1 = X_mean - F'*F*X_mean
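A quick numeric check of this (my own sketch, written in the row-vector convention so F'*F*X becomes X @ F.T @ F, with random non-centered data and K = 2 of D = 4 components retained):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(200, 4))        # non-centered data, D = 4
X_mean = X.mean(axis=0)
X_cent = X - X_mean

# Principal components of the centered data; keep only the top K = 2.
_, _, Vt = np.linalg.svd(X_cent, full_matrices=False)
F = Vt[:2]                                    # rows are principal components

R1 = X @ F.T @ F                              # reconstruct X directly
R2 = X_cent @ F.T @ F + X_mean                # reconstruct X_cent, then add the mean back

# The difference matches X_mean - F'*F*X_mean, exactly as derived above.
print(np.allclose(R2 - R1, X_mean - X_mean @ F.T @ F))   # True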
Your follow-up question in the comments, regarding why it's better to reconstruct X_cent instead of X, is a bit more nuanced and really depends on why you are doing PCA in the first place. The most fundamental reason is that the PCs are defined with respect to the mean, so if you don't center the data before transforming/rotating, you aren't really decorrelating the features. Another reason is that the numeric values of the transformed data are often orders of magnitude smaller when you center the data first.