I'm writing a report and need a way to test the scalability of a mind-map database software design idea. I wanted to use the USL equation to get a quantifiable metric for scalability, but I have no idea what range is considered good for the USL. Any help would be appreciated :)
USL Eq'n:
C(N) = N/ (1 + α (N − 1) + β N (N − 1))
The three terms in the denominator of the equation are associated respectively with the three Cs: the level of concurrency, a contention penalty (with strength α), and a coherency penalty (with strength β). The parameter values are defined in the range 0 ≤ α, β < 1. The independent variable N can represent either the number of concurrent users/processes (software scalability) or the number of processors (hardware scalability).
Do you mean the number of measurements by "what range"? If so, you cannot know the required number of measured data points beforehand. You have to keep adding data points until the predicted maximum concurrency no longer changes when more points are included.
The estimated parameters, and the predictions derived from them, are not reliable if you use the MS Excel spreadsheet method explained in the book "Guerrilla Capacity Planning". Check out the paper "Mythbuster for the Guerrillas" to understand why, and how to get reliable results. It might also be worth reading the paper "Better Prediction Using the Super-serial Scalability Law Explained by the Least Square Error Principle and the Machine Repairman Model".
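For what it's worth, fitting α and β yourself outside of a spreadsheet is only a few lines of Python. Here is a minimal sketch using SciPy's curve_fit; the throughput numbers below are made up purely for illustration, and in practice you would fit your own measured relative capacity C(N) = X(N)/X(1):

import numpy as np
from scipy.optimize import curve_fit

def usl(N, alpha, beta):
    # C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1))
    return N / (1.0 + alpha * (N - 1) + beta * N * (N - 1))

# Hypothetical measurements: load level N and relative capacity C(N)
N = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
C = np.array([1.0, 1.9, 3.6, 6.2, 9.5, 11.2, 10.1])

(alpha, beta), _ = curve_fit(usl, N, C, p0=[0.05, 0.001], bounds=([0, 0], [1, 1]))
print("alpha =", alpha, "beta =", beta)
# The fitted model predicts peak concurrency at N* = sqrt((1 - alpha) / beta)
print("predicted peak at N ~", np.sqrt((1 - alpha) / beta))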
This may show my naiveté, but it is my understanding that quantum computing's obstacle is stabilizing the qubits. I also understand that standard computers use binary (on/off); but it seems like it may be easier with today's tech to read electrical states between 0 and 9. Binary was the answer because it was very hard to read varying amounts of electricity, components degrade over time, and maybe maintaining a clean electrical "signal" was challenging.
But wouldn't it be easier to try to solve the problem of reading varying levels of electricity so we can go from 2 states to 10, thereby increasing the smallest unit of storage and exponentially increasing the number of paths through the logic gates?
I know I am missing quite a bit (sorry the puns were painful) so I would love to hear why or why not.
Thank you
"Exponentially increasing the number of paths through the logic gates" is exactly the problem. More possible states for each n-ary digit means more transistors, larger gates and more complex CPUs. That's not to say no one is working on ternary and similar systems, but the reason binary is ubiquitous is its simplicity. For storage, more possible states also means we need more sensitive electronics for reading and writing, and a much higher error frequency during these operations. There's a lot of hype around using DNA (base-4) for storage, but this is more on account of the density and durability of the substrate.
You're correct, though, that your question is missing quite a bit: qubits are entirely different from classical information, whether we use bits or higher-base digits. Classical bits and trits correspond respectively to vectors like
Binary: |0> = [1,0]; |1> = [0,1];
Ternary: |0> = [1,0,0]; |1> = [0,1,0]; |2> = [0,0,1];
A qubit, on the other hand, can be a linear combination of classical states
Qubit: |Ψ> = α |0> + β |1>
where α and β are arbitrary complex numbers such that |α|² + |β|² = 1.
This is called a superposition, meaning even a single qubit can be in one of an infinite number of states. Moreover, unless you prepared the qubit yourself or received some classical information about α and β, there is no way to determine the values of α and β. If you want to extract information from the qubit you must perform a measurement, which collapses the superposition and returns |0> with probability |α|² and |1> with probability |β|².
We can extend the idea to qutrits (though, just like trits, these are even more difficult to effectively realize than qubits):
Qutrit: |Ψ> = α |0> + β |1> + γ |2>
These requirements mean that qubits are much more difficult to realize than classical bits of any base.
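To make the measurement rule concrete, here is a tiny Python sketch (my own illustration, with example amplitudes) that samples measurement outcomes for a single qubit:

import numpy as np

# Example amplitudes with |alpha|^2 + |beta|^2 = 1
alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)
assert np.isclose(abs(alpha) ** 2 + abs(beta) ** 2, 1.0)

# Each measurement collapses the superposition:
# outcome |0> with probability |alpha|^2, outcome |1> with probability |beta|^2
rng = np.random.default_rng(0)
outcomes = rng.choice([0, 1], size=10000, p=[abs(alpha) ** 2, abs(beta) ** 2])
print("fraction of |1> outcomes:", outcomes.mean())  # about 0.5 for these amplitudes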
I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar, their words should overlap substantially. However, a cosine-theta score of 0.54 does not mean "related" or "not related". I want to end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I would have to find the optimal parameters to give to such functions, and I do not know whether these functions are satisfactory solutions. I was thinking that, since I have the cosine-theta score, I could also calculate the percentage of overlap between the two articles (e.g. the number of overlapping words divided by the number of words in the article) and maybe some other interesting features. Then, with that data, I could maybe write a function (what type of function I do not know, and that is part of the question!), after which I could minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs (0/1) in order to train a network. Is this worth the effort?
# Count the words of the two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Intersection of the words that occur in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Ratio between the overlap and the article length (since 1 overlapping word
# on 2 words is more important than 4 overlapping words on articles of
# 492 words). Guard against an empty intersection to avoid division by zero.
p = len(v3) / min(len(v1), len(v2)) if v3 else 0.0
# Cosine similarity: dot product divided by the product of the norms.
numerator = sum(v1[w] * v2[w] for w in v3)
w1 = sum(v1[w] ** 2 for w in v1)
w2 = sum(v2[w] ** 2 for w in v2)
denominator = math.sqrt(w1) * math.sqrt(w2)  # needs `import math` at the top of the module
if not denominator:
    return 0.0
return float(numerator) / denominator
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
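For instance, assuming you hand-label some article pairs and compute features such as your cosine score and overlap ratio p, a minimal scikit-learn sketch (the feature values and labels below are invented) could look like this:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Each row: [cosine_score, overlap_ratio_p]; label 1 = related, 0 = not related
X = np.array([[0.81, 0.40], [0.12, 0.05], [0.54, 0.22],
              [0.07, 0.01], [0.66, 0.31], [0.20, 0.08]])
y = np.array([1, 0, 1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    stratify=y, random_state=0)
clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict(X_test))            # hard 0/1 labels
print(clf.decision_function(X_test))  # signed distance to the boundary, a rough confidence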
Even the optimisation, however, is a subjective measure. For example, suppose you have a model which, out of 100 samples, only makes 1 prediction (giving 99 unknowns). Technically, if that one prediction is correct, the model has 100% precision, but it has very low recall. Generally in machine learning you will find a trade-off between precision and recall.
Some people like to go for metrics which combine the two (the most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about precision: I only want to target consumers who are likely to buy my product. If, however, we are testing for a deadly disease or for markers of bank fraud, then it is feasible for that test to be precise only 10% of the time, as long as its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut-off value which you believe indicates a good match. This would then be more analogous to a binary clustering problem, and you could use some more abstract measure, such as distance to a centroid, to test which cluster (either the "related" or the "unrelated" cluster) a point belongs to. Note, however, that your features feel like they would be incredibly hard to define here.
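As a rough illustration of that unlabelled route (the scores below are made up), you could cluster the one-dimensional cosine scores into two groups and treat the cluster with the higher centre as "related":

import numpy as np
from sklearn.cluster import KMeans

scores = np.array([0.81, 0.12, 0.54, 0.07, 0.66, 0.20]).reshape(-1, 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
related = km.cluster_centers_.argmax()      # cluster with the higher mean score
print((km.labels_ == related).astype(int))  # 1 = "related", 0 = "not related"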
I have a few distance functions which return the distance between two images. I want to combine these distances into a single distance using a weighted score, e.g. a·x1 + b·x2 + c·x3 + d·x4, and I want to learn these weights automatically such that my test error is minimised.
For this purpose I have a labelled dataset consisting of triplets of images (a, b, c) such that a has a smaller distance to b than it has to c,
i.e. d(a,b) < d(a,c).
I want to learn weights so that this ordering of triplets is preserved as accurately as possible (i.e. the weighted linear score is smaller for a and b than it is for a and c).
What sort of machine learning algorithm can be used for this task, and how can it be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than classical machine learning (the algorithms of which are often implemented via constrained optimization, see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{ d(I_p,I_r) - d(I_p,I_q), 0 } <= e_j
for the jth triplet (p,q,r) in T, i.e. one labelled such that I_p should be closer to I_r than to I_q; here I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
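As a sketch of what this looks like in code (the per-metric distance differences below are invented; in practice you would compute them from your images and your distance functions), one can minimise the same objective with SciPy:

import numpy as np
from scipy.optimize import minimize

# Hypothetical: row j holds d_i(I_p,I_r) - d_i(I_p,I_q) for each metric i,
# for the jth labelled triplet where I_p should be closer to I_r than to I_q.
D = np.array([[-0.3, -0.1, -0.5,  0.2],
              [-0.1,  0.4, -0.2, -0.3],
              [-0.6, -0.2,  0.1, -0.1]])
lam = 0.1  # regularisation strength (the lambda above)

def objective(w):
    e = np.maximum(D @ w, 0.0)  # e_j = amount by which triplet j is violated
    return np.linalg.norm(e) + lam * np.linalg.norm(w)

w0 = np.full(D.shape[1], 1.0 / D.shape[1])
res = minimize(objective, w0, bounds=[(0, None)] * D.shape[1])  # keep weights non-negative
print("learned weights:", res.x)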
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.
I want to classify documents (composed of words) into 3 classes (Positive, Negative, Unknown/Neutral). A subset of the document words become the features.
So far, I have programmed a Naive Bayes classifier using information gain and chi-square statistics as feature selectors. Now, I would like to see what happens if I use the odds ratio as a feature selector.
My problem is that I don't know how to implement the odds ratio. Should I:
1) Calculate Odds Ratio for every word w, every class:
E.g. for w:
Prob of word as positive Pw,p = #positive docs with w/#docs
Prob of word as negative Pw,n = #negative docs with w/#docs
Prob of word as unknown Pw,u = #unknown docs with w/#docs
OR(Wi,P) = log( Pw,p*(1-Pw,p) / (Pw,n + Pw,u)*(1-(Pw,n + Pw,u)) )
OR(Wi,N) ...
OR(Wi,U) ...
2) How should I decide whether or not to choose the word as a feature?
Thanks in advance...
Since it took me a while to independently wrap my head around all this, let me explain my findings here for the benefit of humanity.
Using the (log) odds ratio is a standard technique for filtering features prior to text classification. It is a 'one-sided metric' [Zheng et al., 2004] in the sense that it only discovers features which are positively correlated with a particular class. As a log-odds-ratio for the probability of seeing a feature 't' given the class 'c', it is defined as:
LOR(t,c) = log { [Pr(t|c) / (1 - Pr(t|c))] / [Pr(t|!c) / (1 - Pr(t|!c))] }
         = log { [Pr(t|c) (1 - Pr(t|!c))] / [Pr(t|!c) (1 - Pr(t|c))] }
Here I use '!c' to mean a document where the class is not c.
But how do you actually calculate Pr(t|c) and Pr(t|!c)?
One subtlety to note is that feature selection probabilities, in general, are usually defined over a document event model [McCallum & Nigam 1998, Manning et al. 2008], i.e., Pr(t|c) is the probability of seeing term t one or more times in the document given the class of the document is c (in other words, the presence of t given the class c). The maximum likelihood estimate (MLE) of this probability would be the proportion of documents of class c that contain t at least once. [Technically, this is known as a Multivariate Bernoulli event model, and is distinct from a Multinomial event model over words, which would calculate Pr(t|c) using integer word counts - see the McCallum paper or the Manning IR textbook for more details, specifically on how this applies to a Naive Bayes text classifier.]
One key to using LOR effectively is to smooth these conditional probability estimates, since, as #yura noted, rare events are problematic here (e.g., the MLE of Pr(t|!c) could be zero, leading to an infinite LOR). But how do we smooth?
In the literature, Forman reports smoothing the LOR by "adding one to any zero count in the denominator" (Forman, 2003), while Zheng et al (2004) use "ELE [Expected Likelihood Estimation] smoothing" which usually amounts to adding 0.5 to each count.
To smooth in a way that is consistent with probability theory, I follow standard practices in text classification with a Multivariate Bernoulli event model. Essentially, we assume that we have seen each presence count AND each absence count B extra times. So our estimate for Pr(t|c) can be written in terms of #(t,c), the number of documents of class c in which we've seen t, and #(!t,c), the number of documents of class c in which we haven't, as follows:
Pr(t|c) = [#(t,c) + B] / [#(t,c) + #(!t,c) + 2B]
        = [#(t,c) + B] / [#(c) + 2B]
If B = 0, we have the MLE. If B = 0.5, we have ELE. If B = 1, we have the Laplacian prior. Note that this looks different from smoothing for the Multinomial event model, where the Laplacian prior leads you to add |V| to the denominator [McCallum & Nigam, 1998].
You can choose 0.5 or 1 as your smoothing value, depending on which prior work most inspires you, and plug this into the equation for LOR(t,c) above, and score all the features.
Typically, you then decide on how many features you want to use, say N, and then choose the N highest-ranked features based on the score.
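Putting the smoothed estimates and the LOR formula together, a small Python sketch (the counts are hypothetical) looks like this:

import numpy as np

def lor(n_t_c, n_c, n_t_notc, n_notc, B=0.5):
    # Smoothed Multivariate Bernoulli estimates: B = 0.5 is ELE, B = 1 the Laplacian prior
    p_t_c = (n_t_c + B) / (n_c + 2 * B)           # Pr(t|c)
    p_t_notc = (n_t_notc + B) / (n_notc + 2 * B)  # Pr(t|!c)
    return np.log(p_t_c * (1 - p_t_notc) / (p_t_notc * (1 - p_t_c)))

# Hypothetical counts: t appears in 30 of 100 docs of class c and in 5 of 200 other docs
print(lor(n_t_c=30, n_c=100, n_t_notc=5, n_notc=200))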
In a multi-class setting, people have often used 1 vs All classifiers and thus did feature selection independently for each classifier and thus each positive class with the 1-sided metrics (Forman, 2003). However, if you want to find a unique reduced set of features that works in a multiclass setting, there are some advanced approaches in the literature (e.g. Chapelle & Keerthi, 2008).
References:
Zheng, Wu, Srihari, 2004
McCallum & Nigam 1998
Manning, Raghavan & Schütze, 2008
Forman, 2003
Chapelle & Keerthi, 2008
The odds ratio is not a good measure for feature selection, because it only shows what happens when the feature is present, and says nothing about when it is absent. So it does not work for rare features, and since almost all features are rare, it does not work for almost all features. For example, a feature that indicates the positive class with 100% confidence but is present in only 0.01% of documents is useless for classification. Therefore, if you still want to use the odds ratio, add a threshold on the frequency of the feature, e.g. require the feature to be present in at least 5% of cases. But I would recommend a better approach: use chi-square or information gain metrics, which automatically handle those problems.
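If you go that route, scikit-learn already ships a chi-square scorer, so a rough sketch (with toy documents and labels of my own invention) would be:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["great plot great actors", "boring plot bad actors", "no opinion on this one"]
labels = [1, 0, 2]  # positive / negative / unknown

X = CountVectorizer(binary=True).fit_transform(docs)  # presence/absence features
selector = SelectKBest(chi2, k=3).fit(X, labels)      # keep the 3 highest-scoring words
print(selector.scores_)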
I have a scenario where I have several thousand instances of data. The data itself is represented as a single integer value. I want to be able to detect when an instance is an extreme outlier.
For example, with the following example data:
a = 10
b = 14
c = 25
d = 467
e = 12
d is clearly an anomaly, and I would want to perform a specific action based on this.
I was tempted to just try to use my knowledge of the particular domain to detect anomalies. For instance, figure out a distance from the mean value that is useful, and check for that, based on heuristics. However, I think it's probably better if I investigate more general, robust anomaly detection techniques, which have some theory behind them.
Since my working knowledge of mathematics is limited, I'm hoping to find a technique which is simple, such as using standard deviation. Hopefully the single-dimensioned nature of the data will make this quite a common problem, but if more information for the scenario is required please leave a comment and I will give more info.
Edit: thought I'd add more information about the data and what I've tried in case it makes one answer more correct than another.
The values are all positive and non-zero. I expect that the values will form a normal distribution. This expectation is based on an intuition of the domain rather than on analysis; if this is a bad thing to assume, please let me know. In terms of clustering, unless there are also standard algorithms to choose a k-value, I would find it hard to provide this value to a k-means algorithm.
The action I want to take for an outlier/anomaly is to present it to the user, and recommend that the data point is basically removed from the data set (I won't get in to how they would do that, but it makes sense for my domain), thus it will not be used as input to another function.
So far I have tried three-sigma, and the IQR outlier test on my limited data set. IQR flags values which are not extreme enough, three-sigma points out instances which better fit with my intuition of the domain.
Information on algorithms, techniques or links to resources to learn about this specific scenario are valid and welcome answers.
What is a recommended anomaly detection technique for simple, one-dimensional data?
Check out the three-sigma rule:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
An alternative method is the IQR outlier test:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x < Q25 - 1.5*IQR) OR (Q75 + 1.5*IQR < x) THEN x is a mild outlier
IF (x < Q25 - 3.0*IQR) OR (Q75 + 3.0*IQR < x) THEN x is an extreme outlier
This test is what box plots usually employ (the outlier fences are indicated by the whiskers).
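Both rules are only a couple of lines with NumPy; here is a sketch using the example values from the question:

import numpy as np

x = np.array([10, 14, 25, 467, 12])

# Three-sigma rule
mu, std = x.mean(), x.std()
print(x[np.abs(x - mu) > 3 * std])

# IQR test
q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
print(x[(x < q25 - 1.5 * iqr) | (x > q75 + 1.5 * iqr)])  # mild outliers
print(x[(x < q25 - 3.0 * iqr) | (x > q75 + 3.0 * iqr)])  # extreme outliers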
EDIT:
For your case (simple 1D univariate data), I think my first answer is well suited.
That, however, isn't applicable to multivariate data.
#smaclell suggested using K-means to find the outliers. Besides the fact that it is mainly a clustering algorithm (not really an outlier detection technique), the problem with k-means is that it requires knowing in advance a good value for the number of clusters K.
A better-suited technique is DBSCAN: a density-based clustering algorithm. Basically, it grows regions with sufficiently high density into clusters, which will be maximal sets of density-connected points.
DBSCAN requires two parameters: epsilon and minPoints. It starts with an arbitrary point that has not been visited. It then finds all the neighbor points within distance epsilon of the starting point.
If the number of neighbors is greater than or equal to minPoints, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbors recursively.
If the number of neighbors is less than minPoints, the point is marked as noise.
If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points until they are depleted.
Finally the set of all points marked as noise are considered outliers.
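For completeness, a minimal sketch with scikit-learn's DBSCAN on the question's 1-D example (eps and min_samples are guesses that you would need to tune for real data):

import numpy as np
from sklearn.cluster import DBSCAN

x = np.array([10, 14, 25, 467, 12], dtype=float).reshape(-1, 1)  # DBSCAN expects 2-D input
labels = DBSCAN(eps=15, min_samples=2).fit_predict(x)
print(x[labels == -1].ravel())  # points labelled -1 are noise, i.e. outliers (467 here)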
There are a variety of clustering techniques you could use to try to identify central tendencies within your data. One such algorithm we used heavily in my pattern recognition course was K-means. This would allow you to identify whether there is more than one related set of data, such as a bimodal distribution. It does require you to have some knowledge of how many clusters to expect, but it is fairly efficient and easy to implement.
After you have the means you could then try to find out if any point is far from any of the means. You can define 'far' however you want but I would recommend the suggestions by #Amro as a good starting point.
For a more in-depth discussion of clustering algorithms refer to the wikipedia entry on clustering.
This is an old topic, but it still lacks some information.
Evidently, this can be seen as a case of univariate outlier detection. The approaches presented above have several pros and cons. Here are some weak spots:
Detection of outliers with the mean and sigma has the obvious disadvantage that the mean and sigma themselves depend on the outliers.
The case of the small-sample limit (see the question for an example) is not adequately covered by 3-sigma, k-means, IQR, etc.
And I could go on... However, the statistical literature offers a simple metric: the median absolute deviation (MAD). (Medians are insensitive to outliers.)
Details can be found here: https://www.sciencedirect.com/book/9780128047330/introduction-to-robust-estimation-and-hypothesis-testing
I think this problem can be solved in a few lines of python code like this:
import numpy as np
import scipy.stats as sts
x = np.array([10, 14, 25, 467, 12]) # your values
np.abs(x - np.median(x))/(sts.median_abs_deviation(x)/0.6745) #MAD criterion
Subsequently, you reject values above a certain threshold (the 97.5th percentile of the distribution of the data); for an assumed normal distribution the threshold is 2.24. Here that translates to:
array([ 0.6745 , 0. , 1.854875, 76.387125, 0.33725 ])
or the 467 entry being rejected.
Of course, one could argue that the MAD (as presented) also assumes a normal distribution. So why does argument 2 above (small sample) not apply here? The answer is that the MAD has a very high breakdown point. It is easy to choose different thresholds from different distributions and come to the same conclusion: 467 is the outlier.
Both the three-sigma rule and the IQR test are often used; they are a couple of simple algorithms for detecting anomalies.
The three-sigma rule is correct:
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
The IQR test should be:
Q25 = 25th_percentile
Q75 = 75th_percentile
IQR = Q75 - Q25 // inter-quartile range
IF (x > Q75 + 1.5*IQR) OR (x < Q25 - 1.5*IQR) THEN x is a mild outlier
IF (x > Q75 + 3.0*IQR) OR (x < Q25 - 3.0*IQR) THEN x is an extreme outlier