While searching for AI/ML and non-AI/ML solutions to the "Near duplicate detection" problem (text, image, audio), I found a similar, possibly identical problem, "Nearest neighbor search", which also seems to be handled in exactly the same way as "Near duplicate detection". I am wondering whether there are any differences at all between these two problems or their solutions.
The two problem names seem semantically the same from an English perspective.
In a nearest neighbor search you have a set of elements and, given a reference element, you want to search for an element in the set that is the closest to the reference with respect to a given metric.
In near duplicate detection you have a set of elements and, given a reference element, you want to search for an element in the set that is the closest to being a duplicate of the reference with respect to a given metric.
Having said that, in the literature I see people usually using the latter name when the elements in the set are textual documents. In this case, one example algorithm consists of taking the set of windows of size k of each textual document (its k-shingles) and comparing two documents with the Jaccard metric between their sets of k-shingles (the number of shingles the two documents have in common divided by the number of distinct shingles across both). To avoid calculating the Jaccard metric explicitly, there is a theorem: if you hash all the k-shingles to 64-bit integers (for example) and consider a random permutation of the 64-bit integers, then, applying the permutation to the set of hashed k-shingles of each document, the probability that the smallest elements of the two sets of permuted values are equal is exactly the Jaccard metric between the two documents.
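To make the shingling/MinHash idea concrete, here is a minimal sketch (the character shingles, the salted-hash stand-in for random permutations, and the signature length of 128 are illustrative assumptions, not a production scheme):

    import random

    def shingles(text, k=5):
        # Set of all character k-grams (k-shingles) of a document.
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def jaccard(a, b):
        # Exact Jaccard similarity: shared shingles over distinct shingles.
        return len(a & b) / len(a | b)

    def minhash_signature(shingle_set, salts):
        # For each "permutation" (approximated here by a salted hash),
        # keep only the minimum hashed value over the document's shingles.
        return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of positions where the two minima agree estimates Jaccard.
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

    random.seed(0)
    salts = [random.getrandbits(64) for _ in range(128)]

    d1 = shingles("the quick brown fox jumps over the lazy dog")
    d2 = shingles("the quick brown fox jumped over a lazy dog")

    print(jaccard(d1, d2))
    print(estimated_jaccard(minhash_signature(d1, salts), minhash_signature(d2, salts)))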
On the other hand, I see people usually using the former name when the set of elements is a subset of R^n (for example). In this case many techniques exist; useful data structures include octrees and k-d trees.
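For example, a nearest-neighbor query over points in R^3 with a k-d tree might look like this (a sketch using SciPy's cKDTree; the data are random placeholders):

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    points = rng.random((10_000, 3))     # the set of elements, a subset of R^3
    query = rng.random(3)                # the reference element

    tree = cKDTree(points)               # build the k-d tree once
    distance, index = tree.query(query)  # nearest neighbor under the Euclidean metric
    print(index, distance)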
Having said that, people also use vectorization techniques (signal2vec, word2vec, etc.) to convert other kinds of element sets into subsets of R^n first.
I want to implement a boosted decision tree for my analysis. But the entries in my array are of varying length, so the array is not directly convertible to NumPy or pandas.
Is there any way to use existing ML libraries with awkward array?
Your ML library might assume that the arrays are NumPy arrays and not recognize an ak.Array. That problem, in itself, is easily solved: call ak.to_numpy on it (or equivalently, cast it with np.asarray) to put it in a form the ML library expects. Incidentally, there's also ak.to_pandas to make a DataFrame in which variable-length nested lists are represented by a MultiIndex (with limitations: there has to be only one nested list, since a DataFrame has only one index).
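For example (a sketch with toy data; the exact function names can vary between Awkward Array versions):

    import awkward as ak
    import numpy as np

    # A rectangular (non-jagged) Awkward Array: every inner list has length 3.
    array = ak.Array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])

    as_numpy = ak.to_numpy(array)    # explicit conversion to a NumPy array
    also_numpy = np.asarray(array)   # equivalent cast; most ML libraries accept this
    print(type(as_numpy), as_numpy.shape)

    # One level of jaggedness can go to a pandas DataFrame with a MultiIndex.
    jagged = ak.Array([[1.1, 2.2], [3.3, 4.4, 5.5]])
    print(ak.to_pandas(jagged))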
The above is what I'd call a "branding" issue: the ML library just doesn't recognize the ak.Array "brand" of array, so we relabel it. But there's a more fundamental issue: does the ML algorithm in question intrinsically require rectilinear data? For instance, a feedforward neural network maps N-dimensional inputs to M-dimensional outputs; N and M can't be different for each input. This is a problem even if you're not using Awkward Array. In HEP, the old solution was to run variable-length data through a recurrent neural network (thus ignoring the boundaries between lists and imposing an irrelevant order on them) and the new solution seems to be graph neural networks (which is a more theoretically correct thing to do).
I've noticed that some ML libraries are introducing their own "jagged arrays," which are the minimum structure that Awkward Array provides: TensorFlow has RaggedTensors and PyTorch is getting NestedTensors. I don't know to what degree these data types have been integrated into the ML algorithms, though. If they have been, then Awkward Array ought to get an ak.to_tensorflow and ak.to_pytorch to complement ak.to_numpy and ak.to_pandas, as a way to preserve jaggedness when sending data to these libraries. Hopefully, they'll be able to use that jaggedness in their ML algorithms! (Otherwise, what's the point? But I haven't been following these developments closely.)
You're interested in boosted decision trees (BDTs). I can't think of how a decision tree model, boosted or not, could be adapted to different length inputs... Or maybe I can: the nodes of a decision tree choose which subtree to pass the data down to based on the value of one index in the N-dimensional input. That doesn't imply there's a maximum index value N, though a particular tree would have a set of indexes that it splits on, and there would be some maximum of that set (because the tree is finite!). Applying a tree that wants to split on index k to an input with n < k elements would require a contingency for how to split anyway, but there are already methods for applying decision trees to datasets with missing values. An input datum with n elements could be treated as an input for which indexes greater than n are considered missing values. To train such a BDT, you'd have to give it inputs with missing values beyond each list's last element.
In Awkward Array, the function for that is ak.pad_none. If you know the length of the longest list in your sample (ak.num and ak.max), you can pad the whole array so that all lists have the same length, with missing values at the end. If you set clip=True, the resulting array type is "regular": it no longer considers the possibility that a list can have a length different from the chosen length. If you pass such an array to ak.to_numpy (rather than np.asarray), it becomes a NumPy masked array, which a BDT algorithm that expects missing values should be able to recognize.
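A small sketch of that padding step (toy data; assumes Awkward Array 1.x-style function names):

    import awkward as ak

    jagged = ak.Array([[1.1, 2.2, 3.3], [4.4], [5.5, 6.6]])

    max_len = int(ak.max(ak.num(jagged)))             # length of the longest list (3 here)
    padded = ak.pad_none(jagged, max_len, clip=True)  # regular type: every list has length 3

    masked = ak.to_numpy(padded)  # NumPy masked array; the pads are the masked entries
    print(masked)
    print(masked.mask)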
The only problem with this plan is that padding every list to the length of the longest list uses more memory. If the BDT algorithm were aware of jaggedness (the way TensorFlow already is and PyTorch soon will be), it should be able to make these trees and apply them to data without the memory-padding step. I don't know if there are any such BDT implementations out there, but if someone wants to write a "BDT with missing values that accepts jagged arrays," I'd be happy to help them get it set up with Awkward Arrays!
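As a hedged end-to-end illustration of the padding idea, here is a sketch that feeds the padded, masked data to scikit-learn's HistGradientBoostingClassifier, which treats NaN as a missing value (a stand-in for "a BDT algorithm that expects missing values", not a jaggedness-aware BDT; toy data, scikit-learn >= 1.0 assumed):

    import awkward as ak
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingClassifier

    # Toy jagged features and per-event labels (made up).
    events = ak.Array([[1.0, 2.0, 3.0], [4.0], [5.5, 6.5], [7.0, 8.0, 9.0]])
    labels = np.array([0, 1, 0, 1])

    max_len = int(ak.max(ak.num(events)))
    padded = ak.pad_none(events, max_len, clip=True)

    # Masked entries become NaN, which this gradient-boosted tree model treats as missing.
    X = ak.to_numpy(padded).filled(np.nan)

    clf = HistGradientBoostingClassifier().fit(X, labels)
    print(clf.predict(X))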
Does anyone know how SPSS builds the interaction terms and calculates the significance of predictors behind the scenes in a GLM? From my understanding, it dummy codes the variables and treats the level that comes alphabetically last as the reference group.
The reason I'm asking is that I have a GLM with three continuous predictors and two categorical predictors (dummy coded). When I build all the 2-way and 3-way interactions with syntax, i.e.:
Age_Centred Age_Centred*Dx Age_Centred*gender Age_Centred*Dx*gender BMI_Centred BMI_Centred*Dx BMI_Centred*gender BMI_Centred*Dx*gender BPS_Centred BPS_Centred*Dx BPS_Centred*gender BPS_Centred*Dx*gender Dx Dx*gender Dx*ICV_Centred Dx*ICV_Centred*gender gender ICV_Centred ICV_Centred*gender.
vs. manually creating all the variables by hand, i.e.:
Age_Centred Age_Centred_Dx Age_Centred_gender Age_Centred_gender_Dx BMI_Centred BMI_Centred_Dx BMI_Centred_gender BMI_Centred_gender_Dx BPS_Centred BPS_Centred_Dx BPS_Centred_gender BPS_Centred_gender_Dx Dx gender_Dx ICV_Dx ICV_Centred_Dx_gender gender ICV_Centred ICV_gender.
I end up with a model that has the same intercept, overall significance, and R squared; however, the individual significance of the predictors changes (refer to the output below). To troubleshoot, I've tried flipping the reference groups when manually creating the variables, but it still does not replicate the results. I've had another statistician try the same thing, and they reached the same point I did. Does it have to do with some of the parameters being redundant?
Building the terms via syntax:
Physically creating the variables by multiplying them together
All the details one might reasonably want about how GLM (and UNIANOVA, which is the same underlying code) parameterizes models, estimates parameters, and conducts hypothesis tests are available in the IBM SPSS Statistics Algorithms manual, available for download as a pdf at ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/26.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf. (Note that this is a large file, about 78 MB; clicking on the link starts a download.) In addition to the information in the GLM chapter, appendices F (Indicator Method) and H (Sums of Squares) are relevant, respectively, for building the design matrix and specifying linear combinations of model parameters for computing sums of squares for testing hypotheses.
In building the design matrix, categorical predictors (factors) are indeed represented by sets of indicator (0-1) variables. For a factor with k levels, k indicator variables are created, one for each observed level of the factor. The procedure does not explicitly treat the last category (sorted in ascending order, alphabetical for strings) as a reference category, though in simpler models the effect is essentially the same. If there is an intercept in the model, then the kth indicator will be redundant (linearly dependent on the intercept and the preceding k-1 indicators). The estimation algorithm used in GLM/UNIANOVA sets the row and column of the cross-product matrix that correspond to the redundant design-matrix column to 0s and aliases the corresponding parameter estimate to 0. The results are similar to a reparameterization approach that treats the last category as a reference category, except that you have to remember the aliased parameter is there if you want to specify a linear combination of the parameters to estimate.
If you suppress the intercept, then for the first factor entered into the model the kth indicator would not be redundant (unless the factor is preceded by an unusual covariate or set of covariates). Any subsequent factors included in the model would involve redundant parameters, as would any interactions among factors, whether or not an intercept is included. Interactions among factors are created by multiplying the 0/1 indicators for each level of one factor by those for each level of the other factor. So an interaction of two two-level factors generates four columns, of which typically the last three are redundant.
Covariates are entered simply by copying the values of the variables into the design matrix. Interactions involving covariates and other covariates multiply values for the columns involved within each row, and interactions involving covariates and factors multiply covariates (or products of them) by the indicator variables for the factor(s). Usually covariate-by-covariate terms do not involve redundancies, but factor-by-covariate terms do.
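A small numerical sketch of that indicator coding (made-up data; this only mimics the parameterization described above, it is not SPSS syntax):

    import numpy as np

    # Two two-level factors observed on six cases (made-up data).
    A = np.array([0, 0, 1, 1, 0, 1])
    B = np.array([0, 1, 0, 1, 1, 0])

    def indicators(factor):
        # One 0/1 column per observed level; no level is dropped.
        levels = np.unique(factor)
        return np.column_stack([(factor == lev).astype(float) for lev in levels])

    intercept = np.ones((len(A), 1))
    IA = indicators(A)   # 2 columns, 1 redundant given the intercept
    IB = indicators(B)   # 2 columns, 1 redundant given the intercept
    # Interaction: every A indicator times every B indicator -> 4 columns, 3 redundant.
    IAB = np.column_stack([IA[:, i] * IB[:, j] for i in range(2) for j in range(2)])

    X = np.hstack([intercept, IA, IB, IAB])             # 9 columns in total
    print(X.shape, "rank =", np.linalg.matrix_rank(X))  # rank 4: redundant columns get aliased-to-zero estimates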
To get to the specifics of what's going on with your data, I can't replicate your exact results without your data, but I am able to replicate the patterns shown if I assume you've used the binary Dx variable as a covariate and the binary gender variable as a factor in each analysis. (There seem to actually be four continuous predictors in your model rather than three, but that doesn't affect anything of importance for understanding what's going on.)
There are two aspects of the situation to consider. One is the parameterization: how the two ways of entering the variables into the model treat them, and whether or not they produce the same parameter estimates. The second is how the model specification results in the Type III tests shown in the ANOVA tables.
If I'm understanding things correctly based on what you've posted here, you should find if you compare parameter estimates for the two analyses that the parameter estimates for the intercepts and the non-redundant estimates for gender ([gender=0]) are the same, and have the same standard errors. For the terms involving just covariates or products of covariates, I expect that you will find the parameter estimates to differ between the two analyses and produce different t statistics. For interactions involving gender and covariates (which is all the other variables or products created outside the procedure), I expect the estimates will be the same in magnitude and opposite in sign, with the same standard errors.
None of the estimates or tests here are wrong. The models fitted involve interaction effects. An interaction means that the effect of one variable varies with the levels of the other variable(s) in the interaction, and in order to estimate the same simple effects you have to parameterize the model in the same way, at least as far as the non-redundant parameters are concerned. However, to get the Type III tests for all terms to be identical, it's not always enough to have the same parameter estimates and standard errors. Type III tests involve a concept called containment that must also be considered.
For two effects in a model, effect A is contained in effect B if:
A and B contain the same covariate terms, if any.
B contains all factor effects in A, and at least one more (with the intercept being contained in all factor-only effects).
In your original model, the intercept is contained in the gender effect, gender is not contained in any effects, and all the covariate main effects and two-way interactions among covariates are contained within the interactions between those terms and gender, while the three-way interactions (which include gender) are not contained within any other effects.
Type III sums of squares (not invented by SPSS, but by our friends at SAS) are based on linear combinations of parameters in which a given effect is adjusted for any effects that do not contain it and made orthogonal to any effects that contain it. The practical application of these rules is complicated (see Appendix H of the algorithms manual).
If you recode the gender variable to swap the 0 and 1 values, specify it as a covariate along with all the other variables, and fit the same model, you should be able to match all the non-redundant parameter estimates from the original model, along with their standard errors and t statistics. However, because the containment relationships in the original model are no longer there, the Type III tests for the terms not involving gender (which were previously contained in terms involving gender) will not match up.
The bottom line is that all the results are translatable and all correct for what's being done, and that to make much sense of individual terms you have to focus carefully on what's being estimated in a given parameterization, as well as on the containment relationships. The difficult part gets simpler when you take seriously the fact that when variable X is involved in interaction terms, there is no single estimate of the effect of X. Any estimate is conditional on where you fix the value(s) of the terms with which X interacts.
I am working on analyzing grade data. As a new way to look at the data I am using a decision tree, for the first time. I believe I have the code right and now I am trying to interpret it. The features are grades gotten for a series of quizzes, and the classification is the final grade the student received. I have a few questions:
If my understanding is correct, each node has a test, with the left branch representing the test being true and the other branch false. And when the tree has asked enough questions, it says what the "class" is. If that is the case, how come there's a class= on boxes well before the leaves? I would have thought that only leaves have a class=.
How do I "tune" the overall tree? It seems to have too many boxes. Is this an example of "overfitting"? How can I tune that better?
For example, the choice of FINAL_GRADE_PA01 seems arbitrary, as if it were based on the ordering of the data. Is that true, or did the analysis actually conclude that this feature was the best discriminator?
If I'm not mistaken, those class values indicate what the model would have predicted had it stopped branching at that node. It still stores those values, but it doesn't use them if the node branches further.
About the number of nodes, as you see in the docs:
"The default values for the parameters controlling the size of the trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and unpruned trees which can potentially be very large on some data sets. To reduce memory consumption, the complexity and size of the trees should be controlled by setting those parameter values."
There are several parameters which you can use to reduce the complexity of your model. The following two parameters are just an example:
max_leaf_nodes : int or None, optional (default=None)
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None, then unlimited number of leaf nodes.

min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
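For instance, a hedged scikit-learn sketch using those two parameters (the iris data stands in for your quiz-grade features):

    from sklearn.datasets import load_iris            # stand-in for your grade data
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)

    clf = DecisionTreeClassifier(
        max_leaf_nodes=8,            # cap the number of leaves (best-first growth)
        min_impurity_decrease=0.01,  # only split when impurity drops at least this much
        random_state=0,
    )
    clf.fit(X, y)

    # Inspect the pruned tree; it should have far fewer boxes than an unconstrained one.
    print(export_text(clf))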
I am interested in using DBpedia Spotlight. However, we need to supply values for its two parameters, confidence and support. What do these two parameters really mean?
I want to identify the significant, prominent n-grams in the text. In that case, what is the usual recommendation for confidence and support parameters (rule of thumb)?
When you ask DBpedia Spotlight to annotate text (finding entities/topics), it searches for n-grams that have URIs on DBpedia (n-grams that are Wikipedia titles). Those n-grams are called DBpedia resources.
Support: this is the Resource Prominence parameter; it helps you ignore unimportant or uninformative resources. When you set it to a value X, resources with fewer than X Wikipedia in-links will be ignored and not returned to you.
Confidence: this is the Disambiguation Confidence parameter, a threshold between 0 and 1. When you set it high, you get more trustworthy annotations, but you risk losing some correct ones.
Choosing values of those (or any other) parameters depends on your use case.
Examples:
If you have a test set or gold standard for the type of n-grams you are interested in, you can tune the values until the results are good enough according to your gold standard.
If you only care about retrieving the top-N n-grams to infer the topic of the text, you can choose high values to get a few (mostly correct) n-grams and sort them by confidence.
If you want to get as many n-grams as possible and your task won't get affected or biased by mistakes, you can set low values.
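As a hedged illustration, a call to the public Spotlight web service might look like this (the endpoint URL, parameter names, and response fields below are assumptions based on the Spotlight REST documentation; check the current docs before relying on them):

    import requests

    url = "https://api.dbpedia-spotlight.org/en/annotate"   # assumed public endpoint
    params = {
        "text": "Berlin is the capital of Germany.",
        "confidence": 0.5,   # disambiguation confidence threshold, between 0 and 1
        "support": 20,       # minimum number of Wikipedia in-links for a resource
    }
    response = requests.get(url, params=params, headers={"Accept": "application/json"})
    for resource in response.json().get("Resources", []):
        print(resource["@surfaceForm"], resource["@URI"], resource["@similarityScore"])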
I have an algorithm that can group data into a hierarchical cluster tree. The algorithm is the one described in Toby Segaran's Programming Collective Intelligence. The output is a binary tree with a "distance" value at each node that tells you how far apart the two child nodes are.
I can then display this as a dendrogram, which makes it fairly easy for a human to spot which values are grouped together. However, I'm having difficulty coming up with an algorithm that automatically decides what the groups should be. I'd like to be able to determine automatically:
The number of groups
Which points should be placed in each group
Is there a standard algorithm for this?
I think there is no default way to do this. Simple 'manual' methods (both are illustrated in the sketch after this list) would be to either:
specify the number of clusters you want/expect
set a threshold for the maximum distance between two nodes; any nodes with a larger distance belong to another cluster
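Here is a sketch of both manual methods using SciPy (assuming you rebuild the tree with scipy.cluster.hierarchy.linkage rather than the book's own tree structure; the data are random placeholders):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

    Z = linkage(X, method="average")   # hierarchical cluster tree with merge distances

    # Method 1: ask for a fixed number of clusters.
    labels_k = fcluster(Z, t=2, criterion="maxclust")

    # Method 2: cut wherever the merge distance exceeds a threshold.
    labels_d = fcluster(Z, t=1.0, criterion="distance")

    print(np.unique(labels_k), np.unique(labels_d))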
There are some automatic methods to determine the number of clusters. R has the Dynamic Tree Cut package, which deals with this problem automatically; pvclust could also be used. Two more methods for this problem are described in Salvador (2002) and Daniels (2006).
I have found that the Calinski-Harabasz index (also known as the Variance Ratio Criterion) works well with dendrograms produced by hierarchical clustering. You can find more information (and a comparative study) in this paper.
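A hedged sketch of using that index to pick the number of clusters automatically (again with SciPy's linkage and random placeholder data; calinski_harabasz_score is from scikit-learn):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0, 3, 6)])

    Z = linkage(X, method="average")
    scores = {}
    for k in range(2, 10):
        labels = fcluster(Z, t=k, criterion="maxclust")
        scores[k] = calinski_harabasz_score(X, labels)

    best_k = max(scores, key=scores.get)
    print(best_k, scores[best_k])   # the k with the highest variance-ratio score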