What's the complexity of going from one graph representation to another?

There are different ways one can represent a simple undirected graph on a computer:
Adjacency lists: Vertices are stored as records or objects, and every vertex stores a list of adjacent vertices. This data structure allows the storage of additional data on the vertices.
Incidence list: Vertices and edges are stored as records or objects. Each vertex stores its incident edges, and each edge stores its incident vertices. This data structure allows the storage of additional data on vertices and edges.
Adjacency matrix: A two-dimensional matrix, in which the rows represent source vertices and columns represent destination vertices. Data on edges and vertices must be stored externally. Only the cost for one edge can be stored between each pair of vertices.
Incidence matrix: A two-dimensional Boolean matrix, in which the rows represent the vertices and columns represent the edges. The entries indicate whether the vertex at a row is incident to the edge at a column.
Two questions:
What are the efficient algorithms to go from one representation to another?
What are the complexities in going from one representation to another?
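For concreteness, here is a minimal sketch (in Python, with vertices assumed to be labelled 0..V-1) of one such conversion, adjacency list to adjacency matrix; initialising the V x V matrix already costs O(V^2), which dominates the O(E) edge pass:

```python
def adjacency_list_to_matrix(adj_list):
    """Convert an adjacency list {vertex: [neighbours, ...]} into a V x V matrix.

    Cost: O(V^2) to allocate and initialise the matrix plus O(E) to set the edges,
    i.e. O(V^2) overall. Vertices are assumed to be labelled 0 .. V-1.
    """
    n = len(adj_list)
    matrix = [[0] * n for _ in range(n)]
    for u, neighbours in adj_list.items():
        for v in neighbours:
            matrix[u][v] = 1
    return matrix


# Example: the triangle graph on vertices 0, 1, 2
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(adjacency_list_to_matrix(adj))
# [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```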

Related

Is there a way to find the geo coordinates of all the buildings in a city?

I am working with the Uber H3 library. Using the polyfill function, I have populated an area with H3 indexes at a specific resolution, but I don't need all of them. I want to identify and remove the indexes that fall on isolated areas like jungles, lakes, ponds, etc.
Any thoughts on how that can be achieved?
I thought that if I could map all the buildings in a city to their respective indexes, I could easily identify the indexes in which no buildings are mapped.
I'd maintain a HashMap with the H3 index as the key and a list of coordinates that lie in that index as the value.
In order to address this, you'll need some other dataset(s). Where to find this data depends largely on the city you're looking at, but a simple Google search for footprint data should provide some options.
Once you have footprint data, there are several options depending on the resolution of the grid that you're using and your performance requirements.
You could polyfill each footprint and keep the resulting hexagons.
For coarser data, just using geoToH3 to get the hexagon for each vertex in each building polygon would be faster
If the footprints are significantly smaller than your hex size, you could probably just take a single coordinate from each building.
Once you have the hexagons for each building, you can simply do a set intersection with your polygon hexes and your building hexes to get the "good" set. But it may be easier in many cases to remove bad hexagons rather than including good ones - in this case you'd need a dataset of non-building features, e.g. water and park features, and do the reverse: polyfill the undesired features, and subtract these hexagons from your set.
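For what it's worth, here is a rough sketch of that intersection approach with the h3-py Python bindings (v3 API assumed, i.e. h3.polyfill and h3.geo_to_h3; the polygon, points, and resolution below are made up):

```python
import h3  # h3-py, v3 API assumed (h3.polyfill / h3.geo_to_h3)

RES = 9  # hypothetical resolution; use whatever you polyfilled your area with

# Hexes covering the whole area of interest (GeoJSON polygon, [lng, lat] order)
area_geojson = {
    "type": "Polygon",
    "coordinates": [[
        [77.58, 12.96], [77.62, 12.96], [77.62, 12.99],
        [77.58, 12.99], [77.58, 12.96],
    ]],
}
area_hexes = h3.polyfill(area_geojson, RES, geo_json_conformant=True)

# One representative (lat, lng) point per building footprint -- the cheap option
# when footprints are much smaller than a hexagon
building_points = [(12.9719, 77.5946), (12.9751, 77.6050)]  # made-up sample points
building_hexes = {h3.geo_to_h3(lat, lng, RES) for lat, lng in building_points}

# Keep only the hexes that actually contain a building
good_hexes = area_hexes & building_hexes
print(len(area_hexes), len(good_hexes))
```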

What do the values of latent feature models for the user and item matrices in collaborative filtering represent?

When decomposing a rating matrix for a recommender system, the rating matrix can be written as P * t(Q), where P represents the user factor matrix and Q represents the item factor matrix. The dimensions of Q are rank * number of items. I am wondering if the values in the Q matrix actually represent anything, such as the weight of the item? And also, is there any way to find out some hidden patterns in the Q matrix?
Think of features as the important directions of variance in multidimensional data. Imagine a 3-D chart plotting which of 3 items each user bought. It would be an amorphous blob, but the actual orientation of the blob is probably not along the x, y, z axes. The vectors that it does orient along are the features, in vector form. Take this to huge-dimensional data (many users, many items): this high-dimensional data can very often be spanned by a small number of vectors, and most variance not along these new axes is very small and may even be noise. So an algorithm like ALS finds these few vectors that represent most of the span of the data. Therefore "features" can be thought of as the primary modes of variance in the data or, put another way, the archetypes for describing how one item differs from another.
Note that PQ factorization in recommenders relies on dropping insignificant features to achieve potentially huge compression of the data. These insignificant features (ones that account for very little variance in the user/item input) can be dropped because they are often interpreted as noise, and in practice discarding them yields better results.
Can you find hidden patterns? Sure. The new, smaller but dense item and user vectors can be treated with techniques like clustering, KNN, etc. They are just vectors in a new "space" defined by the new basis vectors, the new axes. When you want to interpret the result of such operations you will need to transform them back into item and user space.
The essence of ALS (PQ matrix factorization) is to transform the user's feature vector into item space and rank by the item weights. The highest-ranked items are recommended.
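As a toy illustration of P, Q, and what you can do with the item vectors, here is a small numpy sketch; truncated SVD stands in for ALS purely to keep the example short, and the rating matrix is made up:

```python
import numpy as np

# Toy rating matrix R (users x items); in a real system this would be sparse
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

rank = 2  # number of latent features to keep

# Truncated SVD in place of ALS: keep only the `rank` strongest directions of
# variance, drop the rest as noise
U, s, Vt = np.linalg.svd(R, full_matrices=False)
P = U[:, :rank] * s[:rank]   # user factor matrix  (users x rank)
Q = Vt[:rank, :]             # item factor matrix  (rank x items)

# Recommending: transform user 0 into item space and rank by the item weights
scores = P[0] @ Q            # predicted affinity of user 0 for every item
print(np.round(scores, 2))

# Hidden patterns: e.g. cosine similarity between item vectors (columns of Q),
# which you could also feed into clustering or KNN
Qn = Q / np.linalg.norm(Q, axis=0)
print(np.round(Qn.T @ Qn, 2))  # item-item similarity in latent space
```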

How does the SOM work?

I know the basic workings of self-organizing maps, but I am having a hard time visualizing them.
Let's say I have a 2*2 grid and I have mapped a dataset of 200*1000 onto it.
Can I access the 200 data points in my training set again using my grid? If so, how is that possible? If the answer is no, then what is the use of these maps, since my original data cannot be retrieved from the compressed data?
I'm not sure what you mean; are you saying that you have mapped 200 high-dimensional data points onto a 2D grid? If so, it should only be a matter of finding the closest 2D coordinate for each data point and then mapping it to this position. In other words, each coordinate on the grid has a weight of the same dimension as the input data, and if trained correctly you can loop through the grid and find the weight which has the lowest Euclidean distance from each sample in your input data, called the BMU (best matching unit). The corresponding 2D coordinate is then mapped to the given input; from there you can plot it or whatever you like.
SOM is mostly used for visualisation and exploration of high-dimensional data; your original data is not 'retrievable' from it, but it can give you some intuition of how the data is distributed.
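A minimal numpy sketch of that BMU lookup, using the grid and data sizes from the question and random (untrained) weights just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mirror the question: a 2x2 grid, data of 200 samples x 1000 features.
# The weights here are random, i.e. the map is untrained.
grid_h, grid_w, dim = 2, 2, 1000
weights = rng.normal(size=(grid_h, grid_w, dim))
data = rng.normal(size=(200, dim))

def bmu(sample, weights):
    """Return the grid coordinate of the best matching unit for one sample."""
    dists = np.linalg.norm(weights - sample, axis=-1)       # shape (grid_h, grid_w)
    return np.unravel_index(np.argmin(dists), dists.shape)  # (row, col)

# Map every training sample back onto the grid
positions = [bmu(x, weights) for x in data]
print(positions[:5])  # which grid cell each of the first five samples lands in
```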

Using PCA trained on a large data set for a smaller data set

Can I use a PCA subspace trained on, say, eight features and one thousand time points to evaluate a single reading? That is, if I keep, say, the top six components, my transformation matrix will be 8x6, and using this to transform test data that is the same size as the training data would give me a 6x1000 matrix.
But what if I want to look for anomalies at each time point independently? That is, rather than use an 8x1000 test set, can I apply 1000 separate transformations to 8x1 test vectors and get the same result? Each such vector gets transformed to exactly the same spot as if it were the first row in a much larger data matrix, but the distance of that one vector from the principal axis doesn't appear to be meaningful. When I perform this same procedure on the truncated reference data, this distance isn't zero either; only the sum of all distances over the entire reference data set is zero. So if I can't show that the reference data is not "anomalous", how can I use this on test data?
Is it the case that the size of the data "object" used to train PCA is the size of object that can be evaluated with it?
Thanks for any help you can give.
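Not a full answer, but a quick scikit-learn sketch (with random stand-in data) showing the mechanics being asked about: projecting a single 8-dimensional reading gives exactly the same coordinates as projecting it as part of a larger matrix, and a per-reading distance from the retained subspace can be computed as a reconstruction error:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 8))  # 1000 time points x 8 features (sklearn layout)

pca = PCA(n_components=6).fit(X_train)

# Transforming one reading alone matches transforming it as a row of a matrix
single = X_train[:1]                  # one 8-dimensional reading
print(np.allclose(pca.transform(X_train)[0], pca.transform(single)[0]))  # True

# A per-reading "distance from the subspace" = reconstruction error of that reading;
# note it is nonzero even for individual training rows
recon = pca.inverse_transform(pca.transform(single))
print(np.linalg.norm(single - recon))
```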

Once more: triangle strips vs triangle lists

I decided to build my engine on triangle lists after reading (a while ago) that indexed triangle lists perform better due to fewer draw calls being needed. Today I stumbled upon 0xffffffff, which in DX is considered a strip-cut index, so you can draw multiple strips in one call. Does this mean that triangle lists no longer hold superior performance?
It is possible to draw multiple triangle strips in a single draw call using degenerate triangles, which have an area of zero. A strip cut is made by simply repeating the last vertex of the previous strip and the first vertex of the next strip, adding two elements per strip break (two zero-area triangles).
New in Direct3D 10 are the strip-cut index (for indexed geometry) and the RestartStrip HLSL function. Both can be used to replace the degenerate triangles method, effectively cutting down the bandwidth cost. (Instead of two indices per cut, only one is needed.)
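For concreteness, a small illustration (index buffers written as plain Python lists, not real GPU code) of joining two triangle sub-strips, first with duplicated vertices and then with the Direct3D 10 strip-cut index:

```python
# Two triangle sub-strips over vertices 0..3 and 4..7
strip_a = [0, 1, 2, 3]
strip_b = [4, 5, 6, 7]

# Pre-D3D10 style: repeat the last index of strip A and the first index of strip B.
# The two extra indices show up as zero-area (degenerate) triangles at the seam.
degenerate_join = strip_a + [strip_a[-1], strip_b[0]] + strip_b
print(degenerate_join)  # [0, 1, 2, 3, 3, 4, 4, 5, 6, 7]

# D3D10 style: a single strip-cut index (0xffffffff for 32-bit indices) restarts
# the strip, so only one extra index is needed per cut.
STRIP_CUT = 0xFFFFFFFF
cut_join = strip_a + [STRIP_CUT] + strip_b
print(cut_join)         # [0, 1, 2, 3, 4294967295, 4, 5, 6, 7]
```

(Illustrative only; in a real mesh the duplicated indices may also need to account for winding order.)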
Expressiveness
Can any primitive list be converted to an equivalent strip and vice versa? Strip-to-list conversion is of course trivial. For list-to-strip conversion we have to assume that we can cut the strip. Then we can map each primitive in the list to a one-primitive sub-strip, though this would not be useful.
So, at least for triangle primitives, strips and lists have always had the same expressiveness. Before Direct3D 10, strip cuts in line strips were not possible, so they actually were not equally expressive.
Memory and Bandwidth
How much data needs to be sent to the GPU? In order to compare the methods we need to be able to calculate the number of elements needed for a certain topology.
Primitive List Formula
N ... total number of elements (vertices or indices)
P ... total number of primitives
n ... elements per primitive (point => 1, line => 2, triangle => 3)
N = Pn
Primitive Strip Formula
N, P, n ... same as above
S ... total number of sub-strips
o ... primitive overlap
c ... strip cut penalty
N = P(n-o) + So + c(S-1)
primitive overlap describes the number of elements shared by adjacent primitives. In a classical triangle strip a triangle uses two vertices from the previous primitive, so the overlap is 2. In a line strip only one vertex is shared between lines, so the overlap is 1. A triangle strip using an overlap of 1 is of course theoretically possible but has no representation in Direct3D.
strip cut penalty is the number of elements needed to start a new sub-strip. It depends on the method used. Using strip-cut indices the penalty would be 1, since one index is used to separate two strips. Using degenerate triangles the penalty would be two, since we need two zero-area triangles for a strip cut.
From these formulas we can deduce that it depends on the geometry which method needs the least space.
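A quick sanity check of the two formulas in Python, plugging in triangles (n = 3, overlap o = 2) and a strip-cut penalty of c = 1:

```python
def list_elements(P, n=3):
    """Primitive list: N = P * n."""
    return P * n

def strip_elements(P, S, n=3, o=2, c=1):
    """Primitive strip: N = P(n - o) + S*o + c(S - 1)."""
    return P * (n - o) + S * o + c * (S - 1)

# 100 triangles as a plain list vs. as one single strip
print(list_elements(100))        # 300
print(strip_elements(100, S=1))  # 102  (the classic "triangle count + 2" strip size)

# The same 100 triangles split into 50 two-triangle sub-strips: the advantage shrinks
print(strip_elements(100, S=50)) # 249
```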
Caching
One important property of strips is the high temporal locality of the data. When a new primitive is assembled, each vertex needs to be fetched from GPU memory. For a triangle this has to be done three times. Accessing memory is usually slow, which is why processors use multiple levels of caches. In the best case, the data needed is already stored in the cache, reducing memory access time. For triangle strips, the last two vertices of the previous primitive are reused, almost guaranteeing that two of three vertices are already present in the cache.
Ease of Use
As stated above, converting a list to a strip is very simple. The problem is converting a list to an efficient primitive strip by reducing the number of sub-strips. For simple procedurally generated geometry (e.g. heightfield terrains) this is usually achievable. Writing a converter for existing meshes might be more difficult.
Conclusion
The introduction of Direct3D 10 does not have much impact on the strip vs. list question. There is now equal expressiveness for line strips and a slight data reduction. In any case, when using strips you gain the most by reducing the number of sub-strips.
On modern hardware with pre- and post-transform vertex caches, tri-stripping is not a win over indexed triangle lists. The only time you would really use tri-stripping is for non-indexed primitives generated by something where the strips are trivial to compute, such as a terrain system.
Instead, you should do vertex cache optimization of indexed triangle lists for best performance. The Hoppe algorithm is implemented in DirectXMesh, or you can look at Tom Forsyth's alternative algorithm.
