I am using Spark ML's LinearSVC in a binary classification model. The transform method creates two columns, prediction and rawPrediction. Spark's docs don't provide any way of interpreting the rawPrediction column for this particular classifier. This question has been asked and answered for other classifiers, but not specifically for LinearSVC.
The relevant column from my predictions dataframe:
+------------------------------------------+
|rawPrediction |
+------------------------------------------+
|[0.8553257800650063,-0.8553257800650063] |
|[0.4230977574196645,-0.4230977574196645] |
|[0.49814263303537865,-0.49814263303537865]|
|[0.9506355050332026,-0.9506355050332026] |
|[0.5826887000450813,-0.5826887000450813] |
|[1.057222808292026,-1.057222808292026] |
|[0.5744214192446275,-0.5744214192446275] |
|[0.8738081933835614,-0.8738081933835614] |
|[1.418173816502859,-1.418173816502859] |
|[1.0854125533426737,-1.0854125533426737] |
+------------------------------------------+
Clearly this isn't simply the probability of belonging to each class. What is it?
Edit: Since the input code has been requested, here's a model built on a subset of features in the original dataset. Fitting any data with Spark's LinearSVC will produce this column.
var df = sqlContext
  .read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/full_frame_20180716.csv")

var assembler = new VectorAssembler()
  .setInputCols(Array("oy_length", "ah_length", "ey_length", "vay_length", "oh_length",
                      "longest_word_length", "total_words", "repeated_exact_words",
                      "repeated_bigrams", "repeated_lemmatized_words",
                      "repeated_lemma_bigrams"))
  .setOutputCol("features")

df = assembler.transform(df)

var Array(train, test) = df.randomSplit(Array(.8, .2), 42)

var supvec = new LinearSVC()
  .setLabelCol("written_before_2004")
  .setMaxIter(10)
  .setRegParam(0.001)

var supvecModel = supvec.fit(train)
var predictions = supvecModel.transform(test)
predictions.select("rawPrediction").show(20, false)
Output:
+----------------------------------------+
|rawPrediction |
+----------------------------------------+
|[1.1502868455791242,-1.1502868455791242]|
|[0.853488887006264,-0.853488887006264] |
|[0.8064994501574174,-0.8064994501574174]|
|[0.7919862003563363,-0.7919862003563363]|
|[0.847418035176922,-0.847418035176922] |
|[0.9157433788236442,-0.9157433788236442]|
|[1.6290888181913814,-1.6290888181913814]|
|[0.9402461917731906,-0.9402461917731906]|
|[0.9744052798627367,-0.9744052798627367]|
|[0.787542624053347,-0.787542624053347] |
|[0.8750602657901001,-0.8750602657901001]|
|[0.7949414037722276,-0.7949414037722276]|
|[0.9163545832998052,-0.9163545832998052]|
|[0.9875454213431247,-0.9875454213431247]|
|[0.9193015302646135,-0.9193015302646135]|
|[0.9828623328048487,-0.9828623328048487]|
|[0.9175976004208621,-0.9175976004208621]|
|[0.9608750388820302,-0.9608750388820302]|
|[1.029326217566756,-1.029326217566756] |
|[1.0190290910146256,-1.0190290910146256]|
+----------------------------------------+
only showing top 20 rows
It is (-margin, margin).
override protected def predictRaw(features: Vector): Vector = {
  val m = margin(features)
  Vectors.dense(-m, m)
}
As mentioned by arpad, it is the margin.
And the margin is:
margin = coefficients * features + intercept
or
y = w * x + b
If you divide the margin by the norm of the coefficients, you will get the distance to the hyperplane for each data point.
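As a quick numeric illustration, here is a minimal sketch of that computation, assuming a fitted PySpark LinearSVCModel named model and a single feature vector x as a NumPy array (the thread above uses Scala, but the arithmetic is identical):

import numpy as np

# Assumes `model` is a fitted pyspark.ml.classification.LinearSVCModel
# and `x` is a NumPy array holding one feature vector.
w = model.coefficients.toArray()  # the hyperplane's normal vector
b = model.intercept

margin = w.dot(x) + b                  # this is rawPrediction[1]
distance = margin / np.linalg.norm(w)  # signed distance to the separating hyperplane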
I'm using the gtsummary package.
I need to merge several univariate logistic regressions, and for a clean presentation I want to hide the p-values and bold or star the significant ORs (p < 0.05).
Can anyone help me?
Maybe it's easier to use another presentation type like kable or huxtable, I don't know?
Thank you for your help.
Have a nice day
There is a function called add_significance_stars() that hides the p-value and adds stars to the estimate indicating various levels of statistical significance. I've also added code to bold the estimate if significant with modify_table_styling().
library(gtsummary)
#> #BlackLivesMatter
packageVersion("gtsummary")
#> [1] '1.4.0'
tbl <-
  trial %>%
  select(death, age, grade) %>%
  tbl_uvregression(
    y = death,
    method = glm,
    method.args = list(family = binomial),
    exponentiate = TRUE
  ) %>%
  # add significance stars to significant estimates
  add_significance_stars() %>%
  # additionally bold significant estimates
  modify_table_styling(
    columns = estimate,
    rows = p.value < 0.05,
    text_format = "bold"
  )
Created on 2021-04-14 by the reprex package (v2.0.0)
Here's a quick huxtable version:
l1 <- glm(I(cyl==8) ~ gear, data = mtcars, family = binomial)
l2 <- glm(I(cyl==8) ~ carb, data = mtcars, family = binomial)
huxtable::huxreg(l1, l2, statistics = "nobs", bold_signif = 0.05)
────────────────────────────────────────────────────
(1) (2)
───────────────────────────────────
(Intercept) 5.999 * -1.880 *
(2.465) (0.902)
gear -1.736 *
(0.693)
carb 0.579 *
(0.293)
───────────────────────────────────
nobs 32 32
────────────────────────────────────────────────────
*** p < 0.001; ** p < 0.01; * p < 0.05.
Column names: names, model1, model2
It doesn't show it here, but the significant coefficients are bold on screen (and in any other kind of output).
I am currently using a generative RNN to classify indices in a sequence (sort of saying whether something is noise or not noise).
My input is continuous (i.e. a real value between 0 and 1) and my output is binary (0 or 1).
For example, if the model marks a 1 for numbers greater than 0.5 and 0 otherwise,
[.21, .35, .78, .56, ..., .21] => [0, 0, 1, 1, ..., 0]:
0 0 1 1 0
^ ^ ^ ^ ^
| | | | |
o->L1 ->L2 ->L3 ->L4 ->... ->L10
^ ^ ^ ^ ^
| | | | |
.21 .35 .78 .56 ... .21
Using
n_steps = 10
n_inputs = 1
n_neurons = 7
n_outputs = 2  # n_outputs is used below but was not defined in the original snippet; 2 classes assumed
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
rnn_outputs becomes a (?, 10, 7) shaped tensor, presumably 7 outputs for each of the 10 time steps.
Previously, I ran the following snippet on output-projection-wrapped rnn_outputs to get a single classification label per sequence.
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,logits=logits)
loss = tf.reduce_mean(xentropy)
How would I run something similar on rnn_outputs to get a sequence?
Specifically,
1. Can I get the rnn_output from each step and feed it into a softmax?
curr_state = rnn_outputs[:, i, :]
logits = tf.layers.dense(curr_state, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
2. What loss function should I use, and should it be applied across every value of every sequence? (For sequence i and step j: loss = y_{ij}(true) - y_{ij}(predicted)?)
Should my loss be loss = tf.reduce_mean(np.sum(xentropy))?
EDIT
It seems I am trying to implement something similar to https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/ in TensorFlow.
In Keras, there's a TimeDistributed function:
You can then use TimeDistributed to apply a Dense layer to each of the
10 timesteps, independently
How would I go about implementing something similar in Tensorflow?
First up, it looks like you're doing seq-to-seq modelling. For this kind of problem it's usually a good idea to go with an encoder-decoder architecture rather than predicting the sequence from the same RNN. TensorFlow has a big tutorial about it under the name "Neural Machine Translation (seq2seq) Tutorial", which I'd recommend you check out.
However, the architecture that you're asking about is also possible, provided that n_steps is known statically (despite using dynamic_rnn). In this case, it's possible to compute the cross-entropy of each cell's output and then sum up all the losses. It's also possible if the RNN length is dynamic, but it would be hairier. Here's the code:
n_steps = 2
n_inputs = 3
n_neurons = 5
X = tf.placeholder(dtype=tf.float32, shape=[None, n_steps, n_inputs], name='x')
y = tf.placeholder(dtype=tf.int32, shape=[None, n_steps], name='y')
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
# Transpose so that `time` becomes the 0-axis
time_based_outputs = tf.transpose(outputs, [1, 0, 2])
time_based_labels = tf.transpose(y, [1, 0])

losses = []
for i in range(n_steps):
    cell_output = time_based_outputs[i]  # output at step i; further dense layers can be applied here if needed
    labels = time_based_labels[i]        # the (sparse) labels at step i
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=cell_output)
    losses.append(loss)                  # collect all losses
total_loss = tf.reduce_sum(losses)       # compute the total loss
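Since n_steps is static here, you can also reproduce Keras' TimeDistributed behaviour without the Python loop: in TF 1.x, tf.layers.dense applied to a 3-D tensor acts on the last axis independently per timestep, and tf.nn.sparse_softmax_cross_entropy_with_logits accepts [batch, time] labels directly. A minimal sketch, assuming the same placeholders as above and 2 output classes (the name n_outputs is an assumption):

n_outputs = 2  # assumed number of classes per timestep

# Dense on a [batch, time, features] tensor is applied per timestep,
# which is what Keras' TimeDistributed(Dense(...)) does.
logits = tf.layers.dense(outputs, n_outputs)  # shape: [batch, n_steps, n_outputs]

# Labels of shape [batch, n_steps] line up with the logits' first two axes.
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
total_loss = tf.reduce_mean(xentropy)  # average over batch and time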
I'm building a web app to help students with learning Maths.
The app needs to display Maths content that comes from LaTeX files.
These LaTeX files render (beautifully) to PDF, which I can convert cleanly to SVG thanks to pdf2svg.
The (svg or png or whatever image format) image looks something like this:
_______________________________________
| |
| 1. Word1 word2 word3 word4 |
| a. Word5 word6 word7 |
| |
| ///////////Graph1/////////// |
| |
| b. Word8 word9 word10 |
| |
| 2. Word11 word12 word13 word14 |
| |
|_______________________________________|
The web app's intent is to manipulate and add content to this, leading to something like this:
_______________________________________
| |
| 1. Word1 word2 | <-- New line break
|_______________________________________|
| |
| -> NewContent1 |
|_______________________________________|
| |
| word3 word4 |
|_______________________________________|
| |
| -> NewContent2 |
|_______________________________________|
| |
| a. Word5 word6 word7 |
|_______________________________________|
| |
| ///////////Graph1/////////// |
|_______________________________________|
| |
| -> NewContent3 |
|_______________________________________|
| |
| b. Word8 word9 word10 |
|_______________________________________|
| |
| 2. Word11 word12 word13 word14 |
|_______________________________________|
A single large image doesn't give me the flexibility to do this kind of manipulation.
But if the image were broken down into smaller files holding single words and single graphs, I could.
What I think I need to do is detect whitespace in the image, and slice the image into multiple sub-images, looking something like this:
_______________________________________
| | | | |
| 1. Word1 | word2 | word3 | word4 |
|__________|_______|_______|____________|
| | | |
| a. Word5 | word6 | word7 |
|_____________|_______|_________________|
| |
| ///////////Graph1/////////// |
|_______________________________________|
| | | |
| b. Word8 | word9 | word10 |
|_____________|_______|_________________|
| | | | |
| 2. Word11 | word12 | word13 | word14 |
|___________|________|________|_________|
I'm looking for a way to do this.
What do you think is the way to go?
Thank you for your help!
I would use horizontal and vertical projection to first segment the image into lines, and then each line into smaller slices (e.g. words).
Start by converting the image to grayscale, and then invert it, so that gaps contain zeros and any text/graphics are non-zero.
img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray
Calculate horizontal projection -- mean intensity per row, using cv2.reduce, and flatten it to a linear array.
row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
Now find the row ranges for all the contiguous gaps. You can use the function provided in this answer.
row_gaps = zero_runs(row_means)
Finally, calculate the midpoints of the gaps, which we will use to cut the image up (integer division keeps them valid as array indices):
row_cutpoints = (row_gaps[:,0] + row_gaps[:,1] - 1) // 2
You end up with something like this situation (gaps are pink, cutpoints red):
The next step is to process each identified line.
bounding_boxes = []
for n, (start, end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]
Calculate the vertical projection (average intensity per column), find the gaps and cutpoints. Additionally, calculate gap sizes, to allow filtering out the small gaps between individual letters.
column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
column_gaps = zero_runs(column_means)
column_gap_sizes = column_gaps[:,1] - column_gaps[:,0]
column_cutpoints = (column_gaps[:,0] + column_gaps[:,1] - 1) // 2
Filter the cutpoints.
filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]
And create a list of bounding boxes for each segment.
for xstart, xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
    bounding_boxes.append(((xstart, start), (xend, end)))
Now you end up with something like this (again gaps are pink, cutpoints red):
Now you can cut up the image. I'll just visualize the bounding boxes found:
The full script:
import cv2
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
def plot_horizontal_projection(file_name, img, projection):
    fig = plt.figure(1, figsize=(12, 16))
    gs = gridspec.GridSpec(1, 2, width_ratios=[3, 1])

    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)

    ax = plt.subplot(gs[1])
    ax.plot(projection, np.arange(img.shape[0]), 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([0.0, 255.0])
    plt.ylim([-0.5, img.shape[0] - 0.5])
    ax.invert_yaxis()

    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])

    fig.set_dpi(200)
    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf()

def plot_vertical_projection(file_name, img, projection):
    fig = plt.figure(2, figsize=(12, 4))
    gs = gridspec.GridSpec(2, 1, height_ratios=[1, 5])

    ax = plt.subplot(gs[0])
    im = ax.imshow(img, interpolation='nearest', aspect='auto')
    ax.grid(which='major', alpha=0.5)

    ax = plt.subplot(gs[1])
    ax.plot(np.arange(img.shape[1]), projection, 'm')
    ax.grid(which='major', alpha=0.5)
    plt.xlim([-0.5, img.shape[1] - 0.5])
    plt.ylim([0.0, 255.0])

    fig.suptitle("FOO", fontsize=16)
    gs.tight_layout(fig, rect=[0, 0.03, 1, 0.97])

    fig.set_dpi(200)
    fig.savefig(file_name, bbox_inches='tight', dpi=fig.dpi)
    plt.clf()

def visualize_hp(file_name, img, row_means, row_cutpoints):
    row_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    row_highlight[row_means == 0, :, :] = [255, 191, 191]
    row_highlight[row_cutpoints, :, :] = [255, 0, 0]
    plot_horizontal_projection(file_name, row_highlight, row_means)

def visualize_vp(file_name, img, column_means, column_cutpoints):
    col_highlight = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    col_highlight[:, column_means == 0, :] = [255, 191, 191]
    col_highlight[:, column_cutpoints, :] = [255, 0, 0]
    plot_vertical_projection(file_name, col_highlight, column_means)

# From https://stackoverflow.com/a/24892274/3962537
def zero_runs(a):
    # Create an array that is 1 where a is 0, and pad each end with an extra 0.
    iszero = np.concatenate(([0], np.equal(a, 0).view(np.int8), [0]))
    absdiff = np.abs(np.diff(iszero))
    # Runs start and end where absdiff is 1.
    ranges = np.where(absdiff == 1)[0].reshape(-1, 2)
    return ranges

img = cv2.imread('article.png', cv2.IMREAD_COLOR)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
img_gray_inverted = 255 - img_gray

row_means = cv2.reduce(img_gray_inverted, 1, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
row_gaps = zero_runs(row_means)
row_cutpoints = (row_gaps[:, 0] + row_gaps[:, 1] - 1) // 2  # integer division keeps indices valid

visualize_hp("article_hp.png", img, row_means, row_cutpoints)

bounding_boxes = []
for n, (start, end) in enumerate(zip(row_cutpoints, row_cutpoints[1:])):
    line = img[start:end]
    line_gray_inverted = img_gray_inverted[start:end]

    column_means = cv2.reduce(line_gray_inverted, 0, cv2.REDUCE_AVG, dtype=cv2.CV_32F).flatten()
    column_gaps = zero_runs(column_means)
    column_gap_sizes = column_gaps[:, 1] - column_gaps[:, 0]
    column_cutpoints = (column_gaps[:, 0] + column_gaps[:, 1] - 1) // 2

    filtered_cutpoints = column_cutpoints[column_gap_sizes > 5]

    for xstart, xend in zip(filtered_cutpoints, filtered_cutpoints[1:]):
        bounding_boxes.append(((xstart, start), (xend, end)))

    visualize_vp("article_vp_%02d.png" % n, line, column_means, filtered_cutpoints)

result = img.copy()
for bounding_box in bounding_boxes:
    cv2.rectangle(result, bounding_box[0], bounding_box[1], (255, 0, 0), 2)
cv2.imwrite("article_boxes.png", result)
The image is top quality: perfectly clean, not skewed, with well-separated characters. A dream!
First perform binarization and blob detection (standard in OpenCV).
Then cluster the characters by grouping those with an overlap in the ordinates (i.e. facing each other in a row). This will naturally isolate the individual lines.
Now in every row, sort the blobs left-to-right and cluster by proximity to isolate the words. This will be a delicate step, because the spacing of characters within a word is close to the spacing between distinct words. Don't expect perfect results. This should work better than a projection.
The situation is worse with italics, as the horizontal spacing is even narrower. You may also have to look at the "slanted distance", i.e. find the lines tangent to the characters in the direction of the italics. This can be achieved by applying a reverse shear transform.
Thanks to the grid, the graphs will appear as big blobs.
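A rough sketch of this blob-based pipeline, assuming 'article.png' is the rendered page (the overlap and gap thresholds are illustrative guesses that would need tuning):

import cv2

# Binarize (Otsu) so text/graphics become white blobs on black.
img = cv2.imread('article.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# One bounding box per character-like blob (label 0 is the background).
n_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary)
boxes = [tuple(int(v) for v in stats[i][:4]) for i in range(1, n_labels)]  # (x, y, w, h)

# Group blobs into lines: a blob joins the current line if it starts above
# the bottom of the line built so far (i.e. the y-ranges overlap).
lines = []
for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
    if lines and y <= lines[-1]["bottom"]:
        lines[-1]["boxes"].append((x, y, w, h))
        lines[-1]["bottom"] = max(lines[-1]["bottom"], y + h)
    else:
        lines.append({"boxes": [(x, y, w, h)], "bottom": y + h})

# Within each line, sort left-to-right and split into words at large gaps.
for line in lines:
    blobs = sorted(line["boxes"])
    words, word = [], [blobs[0]]
    for prev, cur in zip(blobs, blobs[1:]):
        if cur[0] - (prev[0] + prev[2]) > 5:  # inter-word gap threshold
            words.append(word)
            word = [cur]
        else:
            word.append(cur)
    words.append(word)
    line["words"] = words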
I have a data frame representing a time series, for example:
timestamp: 1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28...
value: 0|0|3|6|3|3|6|3|3|6 |3 |0 |0 |0 |1 |3 |7 |0 |0 |1 |3 |7 |1 |3 |7 |3 |6 |3 ...
The goal is to classify different patterns (which can be at random positions) and label the values.
This means to find the patterns:
3-6-3
1-3-7
0
and to extend the data frame to
timestamp: 1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28...
value: 0|0|3|6|3|3|6|3|3|6 |3 |0 |0 |0 |1 |3 |7 |0 |0 |1 |3 |7 |1 |3 |7 |3 |6 |3 ...
label: c|c|a|a|a|a|a|a|a|a |a |c |c |c |b |b |b |c |c |b |b |b |b |b |b |a |a |a ...
Note that the patterns do not all have the same length.
The question is what kind of algorithms can be used for this unsupervised learning problem and maybe additionally what libraries/frameworks could be useful to implement such a task.
Thanks in advance!
My answer deals with the question of pattern matching. After a successful match, the following algorithm outputs the start and end positions of the matched sequence. You can then use this information for labelling as you described in your question.
I recommend the SPRING algorithm as introduced in the paper:
Stream Monitoring under the Time Warping Distance (Sakurai, Faloutsos, Yamamuro)
http://www.cs.cmu.edu/~christos/PUBLICATIONS/ICDE07-spring.pdf
The algorithm's core is the DTW (dynamic time warping) distance, and the resulting distance is the DTW distance. The only difference to plain DTW is an optimization: every position of the "stream" (the sequence in which you are looking for a match) is treated as a possible starting point for a match, as opposed to DTW, which computes the full distance matrix for each starting point separately.
You provide a template, a threshold and a stream (the concept is developed for matching against a data stream, yet you can apply the algorithm to your data frame by simply looping through it).
Note:
Choice of threshold is not trivial and could pose a significant challenge - but this would be another question.
(threshold = 0 if you only want to match exactly identical sequences; threshold > 0 if you want to match similar sequences as well)
If you are looking for several templates/target patterns, then you have to create several instances of the SPRING class, one per target pattern.
The following code is my (messy) implementation, which lacks a lot of things (such as a class definition and so on). Yet it works and should help guide you to your answer.
I implemented it as follows:
# HERE DEFINE
# 1) template consisting of numerical data points
# 2) stream consisting of numerical data points
template = [1, 2, 0, 1, 2]
stream = [1, 1, 0, 1, 2, 3, 1, 0, 1, 2, 1, 1, 1, 2, 7, 4, 5]

# The threshold for the matching process has to be chosen by the user - yet in
# reality the choice of threshold is a non-trivial problem regarding the
# quality of the matching process.
epsilon = float(input("Please define epsilon: "))

# SPRING
# 1. Requirements
n = len(template)
D_recent = [float("inf")] * n
D_now = [0] * n
S_recent = [0] * n
S_now = [0] * n
d_rep = float("inf")
J_s = float("inf")
J_e = float("inf")
check = 0

# check/output
matches = []

# calculation of accumulated distance for each incoming value
def accdist_calc(incoming_value, template, Distance_new, Distance_recent):
    for i in range(len(template)):
        if i == 0:
            Distance_new[i] = abs(incoming_value - template[i])
        else:
            Distance_new[i] = abs(incoming_value - template[i]) + min(
                Distance_new[i - 1], Distance_recent[i], Distance_recent[i - 1])
    return Distance_new

# deduce starting point for each incoming value
# (note: relies on the loop variable j from the main loop below)
def startingpoint_calc(template_length, starting_point_recent, starting_point_new,
                       Distance_new, Distance_recent):
    for i in range(template_length):
        if i == 0:
            # here j+1 instead of j, because the program counts from 0 instead of 1
            starting_point_new[i] = j + 1
        else:
            if Distance_new[i - 1] == min(Distance_new[i - 1], Distance_recent[i], Distance_recent[i - 1]):
                starting_point_new[i] = starting_point_new[i - 1]
            elif Distance_recent[i] == min(Distance_new[i - 1], Distance_recent[i], Distance_recent[i - 1]):
                starting_point_new[i] = starting_point_recent[i]
            elif Distance_recent[i - 1] == min(Distance_new[i - 1], Distance_recent[i], Distance_recent[i - 1]):
                starting_point_new[i] = starting_point_recent[i - 1]
    return starting_point_new

# 2. Calculation for each incoming point x.t - simulated here by simply
# iterating over the given static list
for j in range(len(stream)):
    x = stream[j]
    accdist_calc(x, template, D_now, D_recent)
    startingpoint_calc(n, S_recent, S_now, D_now, D_recent)

    # Report any matching subsequence
    if D_now[n - 1] <= epsilon:
        if D_now[n - 1] <= d_rep:
            d_rep = D_now[n - 1]
            J_s = S_now[n - 1]
            J_e = j + 1
            print("REPORT: Distance " + str(d_rep) + " with a starting point of "
                  + str(J_s) + " and ending at " + str(J_e))

    # Identify optimal subsequence
    for i in range(n):
        if D_now[i] >= d_rep or S_now[i] > J_e:
            check = check + 1
    if check == n:
        print("MATCH: Distance " + str(d_rep) + " with a starting point of "
              + str(J_s) + " and ending at " + str(J_e))
        matches.append(str(d_rep) + "," + str(J_s) + "," + str(J_e))
        d_rep = float("inf")
        J_s = float("inf")
        J_e = float("inf")
        check = 0
    else:
        check = 0

    # define the recently calculated distance vector as the "old" distance
    for i in range(n):
        D_recent[i] = D_now[i]
        S_recent[i] = S_now[i]
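To connect this back to the question's data, you could run one instance per target pattern with epsilon = 0 (exact matches only). A hypothetical sketch, assuming the loop above has been wrapped into a function spring_match(template, stream, epsilon) that returns the collected matches:

# First values from the question's series
stream = [0, 0, 3, 6, 3, 3, 6, 3, 3, 6, 3, 0, 0, 0, 1, 3, 7, 0, 0, 1, 3, 7]

# One SPRING instance per target pattern; epsilon = 0 keeps only exact matches.
for label, template in [("a", [3, 6, 3]), ("b", [1, 3, 7]), ("c", [0])]:
    matches = spring_match(template, stream, epsilon=0)
    print(label, matches)  # the start/end positions can then drive the labelling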
Here's a question about entity framework that has been bugging me for a bit.
I have a table called prizes that has different prizes. Some with higher and some with lower monetary values. A simple representation of it would be as such:
+----+---------+--------+
| id | name | weight |
+----+---------+--------+
| 1 | Prize 1 | 80 |
| 2 | Prize 2 | 15 |
| 3 | Prize 3 | 5 |
+----+---------+--------+
Weight in this case is the likelihood that I would like this item to be randomly selected.
I select one random prize at a time like so:
var prize = db.Prizes.OrderBy(r => Guid.NewGuid()).Take(1).First();
What I would like to do is use the weight to determine the likelihood of a random item being returned, so Prize 1 would return 80% of the time, Prize 2 15% and so on.
I thought that one way of doing this would be to store each prize in the database as many times as its weight. That way, having Prize 1 in there 80 times would give it a higher likelihood of being returned than Prize 3, but this is not necessarily exact.
There has to be a better way of doing this, so I was wondering if you could help me out.
Thanks in advance
Normally I would not do this in the database, but rather solve the problem in code.
In your case, I would generate a random number between 1 and 100. If the number is between 1 and 80, the first prize wins; if it's between 81 and 95, the second one wins; and if it's between 96 and 100, the last one wins.
Because the random number can be any number from 1 to 100, each number has a 1% chance of being hit, so you can control the winning chance via the range that the random number falls into.
Hope this helps.
Henry
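A minimal sketch of this fixed-range idea (the three thresholds mirror the 80/15/5 weights from the question; Python used here for illustration, the same logic ports directly to C#):

import random

roll = random.randint(1, 100)  # uniform over 1..100, so each value has a 1% chance
if roll <= 80:
    winner = "Prize 1"
elif roll <= 95:
    winner = "Prize 2"
else:
    winner = "Prize 3"
print(winner)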
This can be done by creating bins for the three (generally, n) items and then choosing a random number to be dropped into one of those bins.
There might be a statistical library that could do this for you, i.e. proportionally select a bin from n bins.
A solution that does not limit you to three prizes/weights could be implemented like below:
// All Prizes in the Database
var allRows = db.Prizes.ToList();

// All Weight Values
var weights = db.Prizes.Select(p => new { p.Weight });

// Sum of Weights
var weightsSum = weights.AsEnumerable().Sum(w => w.Weight);

// Construct Bins e.g. [0, 80, 95, 100]
// Three Bins are: (0-80], (80-95], (95-100]
int[] bins = new int[weights.Count() + 1];
int start = 0;
bins[start] = 0;
foreach (var weight in weights)
{
    start++;
    bins[start] = bins[start - 1] + weight.Weight;
}

// Generate a random number between 1 and weightsSum (inclusive)
Random rnd = new Random();
int selection = rnd.Next(1, weightsSum + 1);

// Assign the random number to a bin
// (the upper bound guards the bins[chosenBin + 1] access)
int chosenBin = 0;
for (chosenBin = 0; chosenBin < bins.Length - 1; chosenBin++)
{
    if (bins[chosenBin] < selection && selection <= bins[chosenBin + 1])
    {
        break;
    }
}

// Announce the Prize (you may want to print a specific property, e.g. its name)
Console.WriteLine("You have won: " + allRows.ElementAt(chosenBin));