How to efficiently do large matrix multiplications on Google Cloud Dataflow? - google-cloud-dataflow

We need to multiply a large matrix with a one-dimensional vector. The large matrix is sparse. In a second scenario, we need to multiply two large matrices, both of which are sparse. And in the third scenario, we need to multiply two large matrices both of which are dense.
Apache Spark seems to provide a built-in data type for matrices (including a specialized one for sparse matrices) as well as a very rich set of libraries for matrix linear algebra (multiplication, addition, transposition, etc.).
How can one efficiently do matrix multiplications (or other linear algebra operations on matrices) on Google Cloud Dataflow for the three scenarios described above?

Dataflow currently doesn't support matrix operations natively. That said, it should be possible to implement these operations similarly to Spark.
For sparse matrices, it should be possible to key each non-zero entry by its (x, y) coordinate, and then do a GroupByKey.
For dense matrices, you can divide the matrix into blocks, use a GroupByKey to group the blocks, and then use a native library (such as BLAS) to implement the multiplication on the blocks.
See BlockMatrix for more information on how the block operations are implemented in Spark.
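For illustration, here is a hypothetical sketch of the blocked dense approach in the Beam Python SDK; the (k, (index, block)) encoding and the toy 2x2 blocks are assumptions made for the example, not Dataflow built-ins:

import apache_beam as beam
import numpy as np

def multiply_blocks(kv):
    # Blocks of A and B that share the inner index k meet here.
    _, groups = kv
    for i, a in groups['A']:          # a is the (i, k) block of A
        for j, b in groups['B']:      # b is the (k, j) block of B
            yield (i, j), np.dot(a, b)  # BLAS-backed block product

with beam.Pipeline() as p:
    # Each element: (k, (i, block)) for A, and (k, (j, block)) for B.
    a_blocks = p | 'A' >> beam.Create([(0, (0, np.eye(2))), (1, (0, np.eye(2)))])
    b_blocks = p | 'B' >> beam.Create([(0, (0, np.ones((2, 2)))), (1, (0, np.ones((2, 2))))])
    result = ({'A': a_blocks, 'B': b_blocks}
              | beam.CoGroupByKey()
              | beam.FlatMap(multiply_blocks)
              | beam.CombinePerKey(sum))  # sum partial products per output block (i, j)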

For the sparse matrix-by-vector case, the following sketch should work on Dataflow (assuming the matrix arrives as (row, col, value) triples and the vector is small enough to broadcast as a side input):
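import apache_beam as beam

# Assumed encoding: the sparse matrix A as (row, col, value) triples, and
# the vector x as (index, value) pairs, broadcast to all workers as a dict
# side input (so x must fit in memory).
with beam.Pipeline() as p:
    matrix = p | 'A' >> beam.Create([(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0)])
    vector = p | 'x' >> beam.Create([(0, 1.0), (1, 2.0), (2, 4.0)])
    result = (matrix
              | 'Scale' >> beam.Map(lambda e, x: (e[0], e[2] * x[e[1]]),
                                    x=beam.pvalue.AsDict(vector))
              | 'SumPerRow' >> beam.CombinePerKey(sum)  # y[row] = sum of partial products
              | beam.Map(print))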


Dict support in PyTorch

Does PyTorch support dict-like objects, through which we can backpropagate gradients, like Tensors in PyTorch?
My goal is to compute gradients with respect to a few (1%) of the elements of a large matrix. But if I use PyTorch's standard Tensors to store the matrix, I need to keep the whole matrix in GPU memory, which causes problems given the limited GPU memory available during training. So I was thinking whether I could store the matrix as a dict instead, indexing only the relevant elements of the matrix, and computing and backpropagating gradients w.r.t. those selected elements only.
So far, I have tried using Tensors only, but it's causing memory issues for the above reasons. So I searched extensively for alternate options like dicts in PyTorch but couldn't find any such information on Google.
It sounds like you want your parameter to be a torch.sparse tensor.
This interface allows you to have tensors that are mostly zeros, with only a few non-zero elements in known locations. Sparse tensors should allow you to significantly reduce the memory footprint of your model.
Note that this interface is still "under construction": not all operations are supported for sparse tensors yet. However, it is constantly improving.
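For instance, a minimal sketch (following the torch.sparse.mm documentation pattern; the toy 4x4 shape and entries are arbitrary):

import torch

# Coordinates (2 x nnz) and values of the few entries we care about.
indices = torch.tensor([[0, 2], [1, 3]])
values = torch.tensor([0.5, -1.2])
sparse = torch.sparse_coo_tensor(indices, values, size=(4, 4)).requires_grad_(True)

# Use the sparse matrix in a computation; the gradient is itself sparse,
# with entries only at the stored locations.
dense = torch.randn(4, 2)
out = torch.sparse.mm(sparse, dense).sum()
out.backward()
print(sparse.grad)  # non-zero only at the two stored entries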

Custom kernels for SVM, when to apply them?

I am new to the machine learning field and am currently trying to get a grasp of how the most common learning algorithms work and to understand when to apply each of them. At the moment I am learning how Support Vector Machines work and have a question on custom kernel functions.
There is plenty of information on the web on more standard (linear, RBF, polynomial) kernels for SVMs. I, however, would like to understand when it is reasonable to go for a custom kernel function. My questions are:
1) What are other possible kernels for SVMs?
2) In which situations would one apply custom kernels?
3) Can a custom kernel substantially improve the prediction quality of an SVM?
1) What are other possible kernels for SVMs?
There are infinitely many of these; see, for example, the list of kernels implemented in pykernels (which is far from exhaustive):
https://github.com/gmum/pykernels
Linear
Polynomial
RBF
Cosine similarity
Exponential
Laplacian
Rational quadratic
Inverse multiquadratic
Cauchy
T-Student
ANOVA
Additive Chi^2
Chi^2
MinMax
Min/Histogram intersection
Generalized histogram intersection
Spline
Sorensen
Tanimoto
Wavelet
Fourier
Log (CPD)
Power (CPD)
2) In which situations would one apply custom kernels?
Basically in two cases:
"simple" ones give very bad results
data is specific in some sense, so that applying a traditional kernel would require degenerating it. For example, if your data comes in graph format, you cannot apply an RBF kernel, as a graph is not a constant-size vector; you need a graph kernel to work with such objects without some information-losing projection. Also, sometimes you have insight into the data: you know about some underlying structure that might help the classifier. One such example is periodicity: if you know there is a recurring effect in your data, it might be worth looking for a specific kernel.
3) Can a custom kernel substantially improve the prediction quality of an SVM?
Yes, in particular there always exists a (hypothetical) Bayes-optimal kernel, defined as:
K(x, y) = 1 iff arg max_l P(l|x) == arg max_l P(l|y)
in other words, if one has the true probability P(l|x) of label l being assigned to a point x, then we can create a kernel which essentially maps your data points onto one-hot encodings of their most probable labels, thus leading to Bayes-optimal classification (as it attains the Bayes risk).
In practice it is of course impossible to get such a kernel, as it would mean that you had already solved your problem. However, it shows that there is a notion of an "optimal kernel", and obviously none of the classical ones is of this type (unless your data comes from a very simple distribution). Furthermore, each kernel is a kind of prior over decision functions - the closer your induced family of functions is to the actual one, the more likely you are to get a reasonable classifier with an SVM.
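As a concrete illustration, scikit-learn's SVC accepts a callable as its kernel, so plugging in a custom one, e.g. the histogram intersection kernel from the list above, is straightforward (a minimal sketch with toy data):

import numpy as np
from sklearn.svm import SVC

def histogram_intersection(X, Y):
    # Gram matrix: K[i, j] = sum_k min(X[i, k], Y[j, k])
    return np.array([[np.minimum(x, y).sum() for y in Y] for x in X])

X = np.abs(np.random.RandomState(0).randn(20, 5))  # toy non-negative features
y = np.array([0, 1] * 10)

clf = SVC(kernel=histogram_intersection).fit(X, y)
print(clf.predict(X[:3]))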

Clustering Method Selection in High-Dimension?

If the data to cluster are literally points (either 2D (x, y) or 3D (x, y, z)), choosing a clustering method is quite intuitive. Because we can draw and visualize the points, we have a better sense of which clustering method is more suitable.
e.g. 1: If my 2D data set has the formation shown in the top right corner, I would know that K-means may not be a wise choice here, whereas DBSCAN seems like a better idea.
However, just as the scikit-learn website states:
While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data.
AFAIK, in most practical problems we don't have such simple data. Most probably, we have high-dimensional tuples, which cannot be visualized like that.
e.g. 2: I wish to cluster a data set where each data point is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I CANNOT visualize it in a coordinate system and observe its distribution like before. So I will NOT be able to say DBSCAN is superior to K-means in this case.
So my question:
How does one choose the suitable clustering method for such an "invisualizable" high-dimensional case?
"High-dimensional" in clustering probably starts at some 10-20 dimensions in dense data, and 1000+ dimensions in sparse data (e.g. text).
4 dimensions are not much of a problem, and can still be visualized; for example by using multiple 2d projections (or even 3d, using rotation); or using parallel coordinates. Here's a visualization of the 4-dimensional "iris" data set using a scatter plot matrix.
However, the first thing you still should do is spend a lot of time on preprocessing, and finding an appropriate distance function.
If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, e.g.
Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.
The authors of that survey also publish a software framework which includes a lot of these advanced clustering methods (not just k-means, but e.g. CASH, FourC, ERiC): ELKI
There are at least two common, generic approaches:
One can use some dimensionality reduction technique to actually visualize the high-dimensional data; there are dozens of popular solutions including (but not limited to):
PCA - principal component analysis
SOM - self-organizing maps
Sammon's mapping
Autoencoder Neural Networks
KPCA - kernel principal component analysis
Isomap
After this, one goes back to the original space and uses techniques that seem reasonable based on observations in the reduced space, or performs clustering in the reduced space itself. The first approach uses all available information, but can be invalid due to differences induced by the reduction process; the second ensures that your observations and choice are valid (as you reduce your problem to a nice 2d/3d one), but it loses a lot of information due to the transformation used.
One tries many different algorithms and chooses the one with the best metrics (many clustering evaluation metrics have been proposed). This is a computationally expensive approach, but it has a lower bias (as reducing the dimensionality introduces information changes arising from the transformation used); see the sketch below.
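A minimal sketch of both approaches, using iris as a stand-in for a 4-D data set (the parameter choices are illustrative only):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

X = load_iris().data                        # 4-D data, as in the question
X2 = PCA(n_components=2).fit_transform(X)   # approach 1: reduce, then inspect/cluster

# Approach 2: try several algorithms and compare an evaluation metric.
for name, labels in [('k-means', KMeans(n_clusters=3, n_init=10).fit_predict(X2)),
                     ('DBSCAN', DBSCAN(eps=0.5).fit_predict(X2))]:
    if len(set(labels)) > 1:                # silhouette needs at least 2 clusters
        print(name, silhouette_score(X2, labels))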
It is true that high-dimensional data cannot be easily visualized in Euclidean space, but it is not true that there are no visualization techniques for it.
In addition to this claim, I will add that with just 4 features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try multivariate data analysis taking two features at a time (so 6 pairs in total) to try to figure out which relations occur between each pair (correlation and dependency, generally). Or you can even use a 3d space for three at a time.
Then, how does one get information from these visualizations? Well, it is not as easy as in a Euclidean space, but the point is to spot visually whether the data clusters in some groups (e.g. near some values on an axis in a parallel coordinates diagram) and to think about whether the data is somehow separable (e.g. whether it forms regions like circles, or is linearly separable, in the scatter plots).
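With pandas, for example, the parallel coordinates view is a one-liner (a minimal sketch, using iris as a stand-in for your 4-feature tuples):

import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
from sklearn.datasets import load_iris

df = load_iris(as_frame=True).frame           # 4 features plus a 'target' column
parallel_coordinates(df, 'target', colormap='viridis')
plt.show()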
A little digression: the diagram you posted is not indicative of the power or capabilities of each algorithm given particular data distributions; it simply highlights the nature of some algorithms. For instance, k-means is able to separate only convex and ellipsoidal areas (and keep in mind that convexity and ellipsoids exist in N dimensions as well). What I mean is that there is no rule that says: given the distributions depicted in this diagram, you have to choose the correct clustering algorithm accordingly.
I suggest using a data mining toolbox that lets you explore and visualize the data (and easily transform it, since you can change its topology with transformations, projections, and reductions; check the other answer by lejlot for that), like Weka (plus you do not have to implement all the algorithms yourself).
In the end, I will point you to this resource for different cluster goodness and fitness measures, so you can compare the results from different algorithms.
I would also suggest soft subspace clustering, a pretty common approach nowadays, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with Euclidean distance, for example.

In scikit-learn, can DBSCAN use sparse matrix?

I got a MemoryError when I was running the DBSCAN algorithm from scikit-learn.
My data is about 20000 x 10000; it's a binary matrix.
(Maybe it's not suitable to use DBSCAN with such a matrix. I'm a beginner in machine learning. I just want to find a clustering method which doesn't need the number of clusters specified in advance.)
Anyway, I found sparse matrices and feature extraction in scikit-learn and SciPy:
http://scikit-learn.org/dev/modules/feature_extraction.html
http://docs.scipy.org/doc/scipy/reference/sparse.html
But I still have no idea how to use them. In DBSCAN's documentation, there is no mention of using sparse matrices. Is it not allowed?
If anyone knows how to use sparse matrices with DBSCAN, please tell me.
Or you can tell me a more suitable clustering method.
The scikit-learn implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.
As of now, it will apparently insist on computing a complete distance matrix, which wastes a lot of memory.
May I suggest that you just reimplement DBSCAN yourself? It's fairly easy; there is good pseudocode, e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take advantage of your data representation. E.g. if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold); a minimal sketch follows.
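A minimal sketch of that idea, assuming the eps-neighborhood graph is already available as a SciPy CSR matrix (row i's non-zeros are treated as point i's neighbors; the min_pts counting convention may differ by one from other implementations):

import numpy as np

def dbscan_from_graph(graph, min_pts):
    n = graph.shape[0]
    labels = np.full(n, -1)              # -1 = noise / not yet assigned
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or graph[i].indices.size < min_pts:
            continue                     # already assigned, or not a core point
        labels[i] = cluster
        seeds = list(graph[i].indices)
        while seeds:                     # expand the cluster from its core points
            j = seeds.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            neighbors = graph[j].indices
            if neighbors.size >= min_pts:
                seeds.extend(neighbors)  # j is also a core point: keep expanding
        cluster += 1
    return labels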
Here is an issue on the scikit-learn GitHub where they talk about improving the implementation. A user reports that his version using the ball tree is 50x faster (which doesn't surprise me; I've seen similar speedups with indexes before, and it will likely become more pronounced when further increasing the data set size).
Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.
You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import euclidean_distances

D = euclidean_distances(X, X)
db = DBSCAN(metric="precomputed").fit(D)
However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.
(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune. It's mostly applicable in settings where the samples are points in space and you know how close you want those points to be to be in the same cluster, or when you have a black box distance metric that scikit-learn doesn't support.)
Yes, since version 0.16.1.
Here's a commit for a test:
https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879
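For example, a minimal sketch (the matrix shape here is scaled down from the question's 20000 x 10000, and eps/min_samples are arbitrary placeholders):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
# A small random binary matrix standing in for the question's data.
X = csr_matrix(rng.binomial(1, 0.01, size=(1000, 500)).astype(float))

labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(X)  # accepts CSR input directly
print(np.unique(labels))  # -1 marks noise points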
Sklearn's DBSCAN algorithm didn't take sparse arrays in older versions (as noted above, support arrived in 0.16.1). However, KMeans and spectral clustering accept them; you can try those. More on sklearn's clustering methods: http://scikit-learn.org/stable/modules/clustering.html

Doing SparseMat (sparse matrix) operations in openCV

I need to do matrix operations (mainly multiply and inverse) of a sparse matrix SparseMat in OpenCV.
I noticed that you can only iterate over and insert values into a SparseMat.
Is there an external code I can use? (or am I missing something?)
It's just that sparse matrices are not really suited to inversion or matrix-matrix multiplication, so it's quite reasonable that there is no built-in function for those. They're actually used more for matrix-vector multiplication (usually when solving iterative linear systems).
What you can do is solve N linear systems (with the columns of the identity matrix as right-hand sides) to get the inverse matrix. But then you need N*N storage for the inverse matrix anyway, so using a dense matrix with a standard decomposition algorithm would be a better way to do it, as the performance gain won't be that high when doing N iterative solves. Or maybe a sparse direct solver like SuperLU or TAUCS may help, but I doubt that OpenCV has such functionality.
You should also consider whether you really need the inverse matrix. Often such problems are also solvable by just solving a linear system, which can be done with a sparse matrix quite easily and quickly via e.g. CG or BiCGStab; a minimal sketch follows.
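A minimal sketch in SciPy (as a stand-in, since OpenCV doesn't ship an iterative sparse solver; the SPD tridiagonal matrix is an arbitrary test case):

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 1000
# A symmetric positive-definite tridiagonal matrix as a toy example.
A = diags([2.0, -1.0, -1.0], [0, -1, 1], shape=(n, n), format='csr')
b = np.ones(n)

x, info = cg(A, b)                         # solve A x = b; info == 0 means converged
print(info, np.linalg.norm(A @ x - b))     # residual norm instead of forming A^{-1}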
You can convert a SparseMat to a Mat, do what operations you need and then convert back.
You can use the Eigen library directly; Eigen works together with OpenCV very well.
