Can we calculate the range of a data set if we only know the mean, SD and N?

Can we calculate the range of a data set if we only know mean=42.2, SD=9.61 and N=43?
Thank you,
Mitra

Generally speaking, in order to estimate the range of your data, you need to know the distribution of the data, the mean and the standard deviation.
There are some particular/special cases where, knowing only the mean, SD and N, you can find the range, but I do not think that this applies to your case unless I am missing something. Note that you have the following equations:
(X1+X2+...+X43)/43 = 42.2
((X1-mean)^2+...+(X43-mean)^2)/(43-1) = 9.61^2
and you need to find the min and max Xi.
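As an illustration (the sketch below and its random data are my own addition, not part of the question), here is a minimal Python example that builds two samples with exactly the same mean, SD and N but different ranges, which shows that the range cannot be recovered from these three numbers alone:

import numpy as np

def force_mean_sd(x, mean, sd):
    # Rescale a sample so it has exactly the requested mean and sample SD.
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)   # standardize using the sample SD
    return mean + sd * z                 # shift/scale to the target moments

rng = np.random.default_rng(0)
a = force_mean_sd(rng.normal(size=43), 42.2, 9.61)   # bell-shaped sample
b = force_mean_sd(rng.uniform(size=43), 42.2, 9.61)  # flat-shaped sample

for name, x in [("normal-shaped", a), ("uniform-shaped", b)]:
    print(name, "mean=%.2f sd=%.2f range=%.2f"
          % (x.mean(), x.std(ddof=1), x.max() - x.min()))

Both samples report mean 42.2 and SD 9.61, yet their ranges differ, so the shape of the distribution really is the missing piece of information.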

Related

Search for the optimal value of x for a given y

Please help me find an approach to the following problem. Let X be a matrix X_mxn = (x1, …, xn), where each xi is a time series, and let Y_mx1 be a vector. To predict the values of Y_mx1, we train some model, say a linear regression, and obtain Y = f(X). Now we need to find X for some given value of Y. The most naive approach is brute force, but what are the proper ways to solve such problems? Perhaps the scipy.optimize package can be used here; please enlighten me.
I would also appreciate an explanation or material to read to understand this.
Most scipy.optimize algorithms use gradient methods. For this kind of optimization problem, we can apply them to re-engineering the data (e.g. finding the best date to invest in the stock market).
If you want to optimize the result, you should choose a good step size and a suitable optimization method.
However, we should not frame the problem as "predicting" xi, because what we are actually doing is finding a local/global maximum/minimum.
For example, with Newton-CG your data/equations should already contain all the information needed (a simulation); the method itself makes no prediction.
If you want to make a prediction over time, you could categorize the time data into year, month, etc. and then use unsupervised learning to group the data. If a trend is found, you can re-engineer the result to recover the time.
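As a hedged illustration of the "find X for a given Y" idea (the data, coefficients and target value below are made up for the example), one can fit a model and then minimize the squared difference between its prediction and the target with scipy.optimize:

import numpy as np
from scipy.optimize import minimize

# Toy data: 3 features with a linear relationship plus noise (purely illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=200)

# Fit an ordinary least-squares linear regression: y ~ X @ coef + intercept.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def f(x):
    # The trained model Y = f(X) evaluated at a single input vector x.
    return x @ coef[:-1] + coef[-1]

y_target = 3.0  # the given value of Y we want to invert

def objective(x):
    # Squared error between the model's prediction and the target value.
    return (f(x) - y_target) ** 2

x0 = X.mean(axis=0)  # start the search at the mean input
res = minimize(objective, x0, method="BFGS")
print("candidate X:", res.x, " predicted Y:", f(res.x))

Note that with a linear model this inverse problem is underdetermined: many different X give the same Y, so the optimizer simply returns one of them.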

Machine learning: why do we need to weight data?

This may sound like a very naive question. I checked on Google and watched many YouTube videos for beginners, and pretty much all of them treat data weighting as something obvious. I still do not understand why data is being weighted.
Let's assume I have four features:
a b c d
1 2 1 4
If I pass each value to the sigmoid function, I already receive a value between -1 and 1.
I really don't understand why the data needs to be weighted first, or why it is recommended. If you could explain this to me in a very simple manner, I would appreciate it a lot.
I think you are not talking about weighting data but weighting features.
A feature is a column in your table, and by data I would understand rows.
The confusion comes from the fact that weighting rows is also sometimes sensible, e.g., if you want to punish misclassification of the positive class more heavily.
Why do we need to weight features?
I assume you are talking about a model like
prediction = sigmoid(sum_i weight_i * feature_i) > base
Let's assume you want to predict whether a person is overweight based on Bodyweight, height, and age.
In R we can generate a sample dataset as
height = rnorm(100, 1.80, 0.1) # normally distributed, mean 1.8, sd 0.1
weight = rnorm(100,70,10)
age = runif(100,0,100)
ow = weight / (height**2)>25 #overweight if BMI > 25
data = data.frame(height, weight, age, ow)
If we now plot the data, you can see that at least my sample can be separated with a straight line in the weight/height plane. However, age does not provide any value. If we weight the features prior to the sum/sigmoid, we can put all factors into relation.
Furthermore, as you can see from such a plot, weight and height have very different domains. Hence, they need to be put into relation so that the separating line has the right slope, since the values of weight are an order of magnitude larger than those of height.
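To make the "same magnitude" point concrete, here is a small Python sketch (the numbers are illustrative and mirror the R example above, not taken from any real data) showing how z-score standardization puts weight and height on a comparable scale before they enter a weighted sum:

import numpy as np

# Illustrative raw features on very different scales.
rng = np.random.default_rng(2)
height = rng.normal(1.80, 0.10, size=100)  # metres, around 1.8
weight = rng.normal(70.0, 10.0, size=100)  # kilograms, around 70

raw = np.column_stack([height, weight])
print("raw means:", raw.mean(axis=0))  # roughly [1.8, 70]
print("raw stds: ", raw.std(axis=0))   # roughly [0.1, 10]

# z-score standardization: each column gets mean 0 and std 1, so no single
# feature dominates sum_i weight_i * feature_i by sheer magnitude.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)
print("scaled means:", scaled.mean(axis=0).round(3))
print("scaled stds: ", scaled.std(axis=0).round(3))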

Missing data & single imputation

I have an ozone data set which contains a few missing values. I would like to use SPSS to do single imputation to fill them in.
Before I start imputing, I would like to randomly simulate missing data patterns with 5%, 10%, 15%, 25% and 40% of the data missing, in order to evaluate the accuracy of the imputation methods.
Can someone please show me how to create such random missing data patterns in SPSS?
Besides that, can someone please tell me how to obtain performance indicators such as the mean absolute error, coefficient of determination and root mean square error, in order to check which method estimates the missing values best?
Unfortunately, my current SPSS installation does not support missing data analysis, so I can only give some general advice.
First, for your missing data pattern: simply go to Data -> Select Cases -> Random Sample, delete the desired share of cases, and then run the imputation.
The values you mentioned should be provided by SPSS if you use its imputation module. There is a manual:
ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/de/client/Manuals/IBM_SPSS_Missing_Values.pdf
As to your first question: assume your study variable is y and you want to simulate missingness in y. This is example syntax that computes an extra variable y_miss according to your missing data pattern (here with about 5% missing):
do if uniform(1) < .05.
comp y_miss = $SYSMIS.
else.
comp y_miss = y.
end if.
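The question also asks for MAE, R² and RMSE. Once you know which cells you deliberately blanked out, the indicators are simple to compute from the true and imputed values; as a minimal sketch outside SPSS (the arrays below are made-up placeholder numbers):

import numpy as np

# Hypothetical values: the original (true) values of the cells you set to
# missing, and the values the imputation method filled in for them.
y_true = np.array([31.0, 45.2, 38.7, 52.1, 40.3])
y_imputed = np.array([30.1, 47.0, 37.5, 50.8, 41.9])

err = y_imputed - y_true
mae = np.mean(np.abs(err))            # mean absolute error
rmse = np.sqrt(np.mean(err ** 2))     # root mean square error
ss_res = np.sum(err ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot              # coefficient of determination

print("MAE=%.3f RMSE=%.3f R^2=%.3f" % (mae, rmse, r2))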

knnMatch requires k>1 to get good results?

I am using SURF and I am trying both
FlannBasedMatcher
and
BruteForceMatcher
I saw that to get good matches I need to set
matcher.knnMatch(,,2); // with k=2 (At least)
If I set k = 1, shouldn't I still get the closest (least distant) match for each keypoint?
Why do I need k > 1 to get good results?
knnMatch returns, for each query descriptor, its k nearest matches in the other image. If k = 1, you only get the single best match per descriptor.
With only one candidate there is no second-best distance to compare against, so you cannot tell whether the best match is clearly better than the runner-up (the usual ratio test). That is why k = 2 is normally used.
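For reference, a minimal Python/OpenCV sketch of the ratio test that motivates k=2 (the file names are placeholders, and SIFT is used here because SURF sits in OpenCV's non-free module in recent builds):

import cv2

# Placeholder image paths; replace with your own files.
img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("train.png", cv2.IMREAD_GRAYSCALE)

detector = cv2.SIFT_create()
kp1, des1 = detector.detectAndCompute(img1, None)
kp2, des2 = detector.detectAndCompute(img2, None)

matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)  # two nearest neighbours per descriptor

good = []
for m, n in matches:
    # Lowe's ratio test: keep the best match only if it is clearly better
    # than the second-best one. With k=1 there would be no n to compare to.
    if m.distance < 0.75 * n.distance:
        good.append(m)

print(len(good), "good matches out of", len(matches))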

Kohonen SOM Maps: Normalizing the input with unknown range

According to "Introduction to Neural Networks with Java By Jeff Heaton", the input to the Kohonen neural network must be the values between -1 and 1.
It is possible to normalize inputs where the range is known beforehand:
For instance RGB (125, 125, 125), where the range is known to be between 0 and 255:
1. Divide by 255: (125/255) = 0.5 >> (0.5,0.5,0.5)
2. Multiply by two and subtract one: ((0.5*2)-1)=0 >> (0,0,0)
The question is how we can normalize an input whose range is unknown, such as our height or weight.
Also, some other papers mention that the input must be normalized to values between 0 and 1. Which is the proper way, "-1 and 1" or "0 and 1"?
You can always use a squashing function to map an infinite interval to a finite interval. E.g. you can use tanh.
You might want to use tanh(x * l) with a manually chosen l, though, in order not to put too many objects in the same region. So if you have a good guess that the maximal values of your data are +/- 500, you might want to use tanh(x / 1000) as a mapping, where x is the value of your object. It might even make sense to subtract your guess of the mean from x, yielding tanh((x - mean) / max).
From what I know about Kohonen SOMs, the specific normalization does not really matter.
Well, it might matter through specific choices of the learning algorithm's parameter values, but the most important thing is that the different dimensions of your input points are of the same magnitude.
Imagine that each data point is not a pixel with the three RGB components but a vector with statistical data for a country, e.g. area, population, ....
It is important for the convergence of the learning part that all these numbers are of the same magnitude.
Therefore, it does not really matter if you don't know the exact range; you just have to know approximately the characteristic amplitude of your data.
For weight and size, I'm sure that if you divide them respectively by 200 kg and 3 meters, all your data points will fall in the ]0, 1] interval. You could even use 50 kg and 1 meter; the important thing is that all coordinates are of order 1.
Finally, you could also consider running a linear analysis tool such as POD on the data, which would automatically give you a way to normalize your data and a subspace for the initialization of your map.
Hope this helps.
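Putting the two answers together, here is a minimal Python sketch (the centre and scale guesses are just illustrative, not prescribed values): subtract a rough guess of the mean, divide by a rough guess of the characteristic amplitude, then squash with tanh so every value lands in (-1, 1):

import numpy as np

def normalize(x, center_guess, scale_guess):
    # Map values with an unknown range into (-1, 1): centre by a rough guess
    # of the mean, scale by a rough guess of the amplitude, then squash.
    return np.tanh((np.asarray(x, dtype=float) - center_guess) / scale_guess)

weights_kg = np.array([55.0, 72.0, 96.0, 120.0])
heights_m = np.array([1.55, 1.72, 1.88, 2.05])

print(normalize(weights_kg, center_guess=75.0, scale_guess=50.0))
print(normalize(heights_m, center_guess=1.70, scale_guess=0.5))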
