Dividing data sets into testing and training data - machine-learning

I have a dataset with k examples and I want to partition it into m sets.
How can I do that programmatically?
For example, if k = 5 and m = 2, then 5 / 2 = 2.5.
How do I partition it into sets of 2 and 3, and not 2, 2 and 1?
Similarly, if k = 10 and m = 3, I want it partitioned into 3, 3 and 4, not 3, 3, 3 and 1.

Usually, this sort of functionality is built into tools. But, assuming that your observations are independent, just set up a random number generator and do something like:
for i = 1 to k do;
set r = rand();
if r < 0.5 then data[i].which = 'set1'
else data[i].which = 'set2'
You can extend this for any number of sets and probabilities.
With k = 5, you could actually get all the rows in a single set (roughly 3% of the time for each set). However, the point of splitting data is to deal with larger amounts of data. If you only have 5 or 10 rows, then splitting your observations into different partitions is probably not the way to go.
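The question doesn't name a language, so here is a hedged Python sketch of the deterministic variant: with base = k div m and extra = k mod m, the first extra groups get one additional element, so group sizes never differ by more than one.

```python
import random

def partition(items, m):
    """Shuffle, then split into m groups whose sizes differ by at most one."""
    items = list(items)
    random.shuffle(items)                 # drop this line for a deterministic split
    base, extra = divmod(len(items), m)   # the first 'extra' groups get one more element
    groups, start = [], 0
    for i in range(m):
        size = base + (1 if i < extra else 0)
        groups.append(items[start:start + size])
        start += size
    return groups

print([len(g) for g in partition(range(5), 2)])   # sizes 3 and 2
print([len(g) for g in partition(range(10), 3)])  # sizes 4, 3 and 3
```

For train/test splits specifically, library routines (e.g. scikit-learn's train_test_split, an assumption about your toolchain) already do this kind of thing.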

Related

Vectorization of FOR loop

Is there a way to vectorize this FOR loop? I know about gallery("circul", y) thanks to user carandraug, but that only shifts each cell over to the next adjacent cell (I also tried toeplitz, but that didn't work).
I'm trying to make the shift adjustable, which is done in the example code with circshift and the variable shift_over.
The variable y_new is the output I'm trying to get, but without using a FOR loop (can this FOR loop be vectorized?).
Please note: the numbers used here are just an example; the real array will be 30-60 second voice/audio signals (so the y_new array could be large) and won't be sequential numbers like 1, 2, 3, 4, 5.
tic
y=[1:5];
[rw col]= size(y); %get size to create zero'd array
y_new= zeros(max(rw,col),max(rw,col)); %zero fill new array for speed
shift_over=-2; %cell amount to shift over
for aa=1:length(y)
if aa==1
y_new(aa,:)=y; %starts with original array
else
y_new(aa,:)=circshift(y,[1,(aa-1)*shift_over]); %
endif
end
y_new
fprintf('\nfinally Done-elapsed time -%4.4fsec- or -%4.4fmins- or -%4.4fhours-\n',toc,toc/60,toc/3600);
y_new =
1 2 3 4 5
3 4 5 1 2
5 1 2 3 4
2 3 4 5 1
4 5 1 2 3
PS: I'm using Octave 4.2.2 on Ubuntu 18.04, 64-bit.
I'm pretty sure this is a classic XY problem: you want to calculate something and think it's a good idea to build a redundant n x n matrix, where n is the length of your audio file in samples. Perhaps you want to play with autocorrelation. The key point is that I doubt building the requested matrix is a good idea, but here you go:
Your code:
y = rand (1, 3e3);
shift_over = -2;
clear -x y shift_over
tic
[rw col]= size(y); %get size to create zero'd array
y_new= zeros(max(rw,col),max(rw,col)); %zero fill new array for speed
for aa=1:length(y)
if aa==1
y_new(aa,:)=y; %starts with original array
else
y_new(aa,:)=circshift(y,[1,(aa-1)*shift_over]); %
endif
end
toc
My code:
clear -x y shift_over
tic
n = numel (y);
y2 = y (mod ((0:n-1) - shift_over * (0:n-1).', n) + 1);
toc
gives on my system:
Elapsed time is 1.00379 seconds.
Elapsed time is 0.155854 seconds.
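For what it's worth, the same index trick translates almost directly to NumPy (a Python sketch, offered as an aside since the question itself is about Octave): broadcasting a column of row offsets against a row of column indices builds the whole index matrix in one step.

```python
import numpy as np

y = np.arange(1, 6)      # stand-in for the audio signal
shift_over = -2
n = y.size

rows = np.arange(n).reshape(-1, 1)          # one row per successive shift
cols = np.arange(n)                         # column positions within a row
y_new = y[(cols - shift_over * rows) % n]   # same mod arithmetic as the Octave one-liner

print(y_new)
```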

How to get solution report from lp_select gem (lpsolve)

Thank you for your time.
I couldn't find how to get the variables' values after solving.
Make a three-row, five-column equation
@lp = LPSolve::make_lp(3, 5)
Set some column names
LPSolve::set_col_name(@lp, 1, "fred")
LPSolve::set_col_name(@lp, 2, "bob")
Add a constraint and a row name; the API expects a 1-indexed array
constraint_vars = [0, 0, 1]
FFI::MemoryPointer.new(:double, constraint_vars.size) do |p|
p.write_array_of_double(constraint_vars)
LPSolve::add_constraint(@lp, p, LPSelect::EQ, 1.0.to_f)
end
LPSolve::set_row_name(@lp, 1, "onlyBob")
Set the objective function and minimize it
constraint_vars = [0, 1.0, 3.0]
FFI::MemoryPointer.new(:double, constraint_vars.size) do |p|
p.write_array_of_double(constraint_vars)
LPSolve::set_obj_fn(@lp, p)
end
LPSolve::set_minim(@lp)
Solve it and retrieve the result
LPSolve::solve(@lp)
@objective = LPSolve::get_objective(@lp)
Output
Model name: '' - run #1
Objective: Minimize(R0)
SUBMITTED
Model size: 4 constraints, 5 variables, 1 non-zeros.
Sets: 0 GUB, 0 SOS.
Using DUAL simplex for phase 1 and PRIMAL simplex for phase 2.
The primal and dual simplex pricing strategy set to 'Devex'.
Optimal solution 3 after 1 iter.
Excellent numeric accuracy ||*|| = 0
MEMO: lp_solve version 5.5.0.15 for 64 bit OS, with 64 bit REAL
variables.
In the total iteration count 1, 0 (0.0%) were bound flips.
There were 0 refactorizations, 0 triggered by time and 0 by density.
... on average 1.0 major pivots per refactorization.
The largest [LUSOL v2.2.1.0] fact(B) had 5 NZ entries, 1.0x largest basis.
The constraint matrix inf-norm is 1, with a dynamic range of 1.
Time to load data was 0.031 seconds, presolve used 0.000 seconds,
... 0.000 seconds in simplex solver, in total 0.031 seconds. => 3.0
retvals = []
FFI::MemoryPointer.new(:double, 2) do |p|
err = LPSolve::get_variables(@lp, p)
retvals = p.get_array_of_double(0,2)
end
retvals[0]
retvals[1]
gives the solution.

Poisson Distribution

I have a Poisson distribution that looks similar to the one below:
https://i0.wp.com/www.real-statistics.com/wp-content/uploads/2012/11/poisson-distribution-chart.png
I've been asked to find the mean, and then the three logical groups above and below the mean for a total of seven groups.
Were this a normal distribution where the min was 0, max was 12 and mean was 6, the logical groups might be:
-3: 1
-2: 2.666
-1: 4.333
0: 6
1: 7.666
2: 9.333
3: 11
But with a Poisson distribution (such as the image above), I would expect it to be more like:
-3: 0.625
-2: 1.25
-1: 1.875
0: 2.5
1: 4.25
2: 6.5
3: 10
Is there a faster way of looking for where these points would be than eyeballing it? I need to do this with more than a hundred histograms...
I apologize if I have the language wrong; this is my first time doing something like this.
Imagine that you need 7 bins that store the values you need.
For Poisson Distribution, the mean is the lambda itself, which in your case is 3. So bin[3] = 3
Consider the formula, applied to one side of the mean at a time (j is the offset within that side):
bins = []
for j = 0 to groups:
bins[side_start + j] = min + range * j / groups
Now you need 2 different ranges:
below the mean: min = 0, max = 3, range = 3 - 0 = 3, groups = 3, giving bins 0, 1, 2, 3
above the mean: min = 3, max = 12, range = 12 - 3 = 9, groups = 3, giving bins 3, 6, 9, 12
Apply those values in the formula to get your bins; the mean, 3, is shared by both sides, for seven values in total.
HTH. My memory is a little out of practice, but I think the general idea is correct.
Edit: This might not work for a Poisson distribution. Poisson is a discrete distribution, while my solution works only for continuous distributions. I will leave my answer here anyway.
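The two-range idea can be sketched in Python (assuming, as above, mean = lambda = 3, min = 0 and max = 12; the function name is made up):

```python
def seven_groups(mean, lo, hi, steps=3):
    """Linear bins: `steps` equal intervals below the mean and `steps` above."""
    below = [lo + (mean - lo) * j / steps for j in range(steps)]           # groups -3..-1
    above = [mean + (hi - mean) * j / steps for j in range(1, steps + 1)]  # groups +1..+3
    return below + [mean] + above

print(seven_groups(3.0, 0, 12))   # [0.0, 1.0, 2.0, 3.0, 6.0, 9.0, 12.0]
```

For the discrete case the edit worries about, quantiles of the fitted distribution (e.g. scipy.stats.poisson.ppf) would be a more principled way to place the boundaries.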

Compute similarity between n entities

I am trying to compute the similarity between n entities that are being described by entity_id, type_of_order, total_value.
An example of the data might look like:
NR entity_id type_of_order total_value
1 1 A 10
2 1 B 90
3 1 C 70
4 2 B 20
5 2 C 40
6 3 A 10
7 3 B 50
8 3 C 20
9 4 B 50
10 4 C 80
My question is: what is a good way of measuring the similarity between, say, entity_id 1 and entity_id 2, with regard to the type_of_order and the total_value for that type of order?
Would a simple KNN give satisfactory results or should I consider other algorithms?
Any suggestion would be much appreciated.
The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.
You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...
relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else -- even more than 1+c in some applications.
Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.
relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
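To make those two questions concrete, here is a toy Python sketch; every letter distance and weight in it is invented, which is exactly the point of the answer: only domain knowledge can supply them.

```python
# Hypothetical per-pair distances between order types -- these are made up.
letter_d = {frozenset("AB"): 1.0, frozenset("BC"): 1.0, frozenset("AC"): 2.0}

def distance(a, b, w_type=1.0, w_value=0.1):
    """a, b are (type_of_order, total_value) pairs; the weights are arbitrary here."""
    t = 0.0 if a[0] == b[0] else letter_d[frozenset((a[0], b[0]))]
    v = abs(a[1] - b[1])   # assumes value differences combine linearly, which may be wrong
    return w_type * t + w_value * v

print(distance(("A", 10), ("B", 20)))   # 1.0 * 1.0 + 0.1 * 10 = 2.0
```

Changing w_type, w_value, or the entries of letter_d changes which rows count as "similar", which is why no one else can pick them for you.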

How to predict users' preferences using item similarity?

I am wondering whether I can predict if a user will like an item or not, given the similarities between items and the user's ratings on items.
I know the equation for item-based collaborative filtering: the predicted rating is determined by the overall rating and the similarities between items.
The equation is:
r_{u,i} = \bar{r}_i + \frac{\sum_j S_{i,j} (r_{u,j} - \bar{r}_j)}{\sum_j S_{i,j}}
My question is,
If I got the similarities using other approaches (e.g. content-based approach), can I still use this equation?
Besides, for each user I only have a list of the user's favourite items, not actual rating values.
In this case, the rating of user u for item j and the average rating of item j are missing. Are there better ways or equations to handle this?
Another problem: I wrote some Python code to test the above equation:
import numpy
from scipy import spatial

mat = numpy.array([[0, 5, 5, 5, 0], [5, 0, 5, 0, 5], [5, 0, 5, 5, 0], [5, 5, 0, 5, 0]])
print(mat)

def prediction(u, i):
    r = numpy.mean(mat[:, i])    # baseline: average rating of item i
    a = 0.0
    b = 0.0
    for j in range(5):
        if j != i:
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b

for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print(" ".join(lst))
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the second one holds the predicted values. They don't look close; is anything wrong here?
Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
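A minimal sketch of the rule-mining idea on like-lists (the baskets and item names below are invented; a real analysis would also apply support and lift thresholds):

```python
from itertools import combinations
from collections import Counter

# Hypothetical data: the set of items each user liked.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]

item_count = Counter()
pair_count = Counter()
for basket in baskets:
    item_count.update(basket)
    pair_count.update(combinations(sorted(basket), 2))

def confidence(x, y):
    """Confidence of the rule 'liked x -> will like y'."""
    return pair_count[tuple(sorted((x, y)))] / item_count[x]

print(confidence("a", "b"))   # 2 of the 3 users who liked "a" also liked "b"
```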
What you are describing is a typical implicit-ratings situation (users do not give explicit ratings to items; say you just have likes and dislikes).
As for approaches, you can use neighbourhood models or latent factor models.
I suggest you read this paper, which proposes a well-known machine-learning-based solution to the problem.