Iterative programming using PCollectionViews - google-cloud-dataflow

I wish to create a PCollection of say one hundred thousand objects (maybe even a million) such that I apply an operation on it a million times in a for-loop on the same data, but with DIFFERENT values for the PCollectionView calculated on each iteration of the loop. Is this a use-case that df can handle reasonably well? Is there a better way to achieve this? My concerns is that PCollectionView has too much overhead, but it could be that that used to be a problem a year ago but now this a use-case that DF can support well. In my case, I can hardcode the number of iterations of the for-loop (as I believe that DF can't handle the situation in which the number of iterations is dynamically determined at run-time.) Here's some pseudocode:
PCollection<KV<Integer,RowVector>> rowVectors = ...
PCollectionView<Map<Integer, Float>> vectorX;
for (int i=0; i < 1000000; i++) {
PCollection<KV<Integer,Float>> dotProducts =
rowVectors.apply(ParDo.of(new DoDotProduct().withSideInputs(vectorX));
vectorX = dotProducts.apply(View.asMap());
}

Unfortunately we only support up to 1000 transformations / stages. This would require 1000000 (or whatever your forloop iterates over) stages.
Also you are correct in that we don't allow changes to the graph after the pipeline begins running.
If you want to do less than 1000 iterations, then using a map side input can work but you have to limit the number of map lookups you do per RowVector. You can do this by ensuring that each lookup has the whole column instead of walking the map for each RowVector. In this case you'd represent your matrix as a PCollectionView of a Map<ColumnIndex, Iterable<RowIndex, RowValue>>

Related

How do I speedup adding two big vectors of tuples?

Recently, I am implementing an algorithm from a paper that I will be using in my master's work, but I've come across some problems regarding the time it is taking to perform some operations.
Before I get into details, I just want to add that my data set comprehends roughly 4kk entries of data points.
I have two lists of tuples that I've get from a framework (annoy) that calculates cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two of that lists with the same names (first value of the tuple) in it, but two different cosine similarities. What I have to do is to sum the cosines from both lists, and then order the array and get the top-N highest cosine values.
My issue is: is taking too long. My actual code for this implementation is as following:
def topN(self, user, session):
upref = self.m2vTN.get_user_preference(user)
spref = self.sm2vTN.get_user_preference(session)
# list of tuples 1
most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
# list of tuples 2
most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
# concat both lists and add into a dict
d = defaultdict(int)
for l, v in (most_ss + most_su):
d[l] += v
# convert the dict into a list, and then sort it
_list = list(d.items())
_list.sort(key=lambda x: x[1], reverse=True)
return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads but I'm not sure if it will make it faster. Getting the lists is not the problem here, but the concatenation and sorting is.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine-similarity with all vectors in the model, the annoy-indexer isn't helping any. (Its purpose is to get a small subset of nearest-neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.
Further, if you're going to combine all of the distances from two such calculation, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector-models, you can supply a False-ish topn value to just get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)

Converting a apriori object to a list taking more time even for small number of data

I am working on a data set of more than 22,000 records, and when I tried it with the apriori model, it's taking way too much time even for small number of records like 20. Is there a problem in my code or Is there a faster way to convert the asscocians into a list quickly? The code I used is below.
for i in range(0, 20):
transactions.append([str(dataset.values[i,j]) for j in range(0, 543)])
from apyori import apriori
associations = apriori(transactions, min_support=0.004, min_confidence=0.3, min_lift=3, min_length=2)
result = list(associations)
It's difficult to assess without your data, but the complexity of Apriori is based on a number of factors, including your support threshold, number of transactions, number of items, average/max transaction length, etc.
In cases where even a small number of transactions is taking a long time to run it's often a matter of too low of a minimum support. When support is very low (near 0) the algorithm is effectively still brute forcing, since it has to look at all possible combinations of items, of every length. This is the equivalent of a mathematical power set, which is exponential. For just 41 items you're actually trying 2^41 -1 possible combinations, which is just over 1.1 TRILLION possibilities.
I recommend starting with a "high" min_support at first (e.g. 0.20) and then working your way down slowly. It's easier to test things that take seconds at first than ones that'll take a long time.
Other important note: There is no min_length parameter in Apyori. I'm not sure where everyone's getting that from (you're not alone in thinking there is one), unless it's this one random blog post I found. The parameters are as follows (straight from the code):
Keyword arguments:
min_support -- The minimum support of relations (float).
min_confidence -- The minimum confidence of relations (float).
min_lift -- The minimum lift of relations (float).
max_length -- The maximum length of the relation (integer).
For what it's worth, I wrote unofficial docs for Apyori that can be found here.

Influx index and high cardinality

I have a high throughput system. I found out that since many events has the same timestamp, influx had overwritten many events.
Therefore I tried moving from milliseconds to nanoseconds, but since I am using JAVA, I couldn't get the real clock based nanoseconds.
I came up with this solution:
I created a new tag called "descriptor" which for each event I insert a random number between 1-1000. These values are fixed and the probability for the same timestamp with the same random descriptor value is very low. This fixes my problem and I can see all the events.
My question is wether it is OK to use these 1000 values - since this is a tag and I understand it can mess up my index and my performance?
Regards, Ido
As the random "descriptors" are completely uncorrelated to other event tags, in the worst case this could increase your series cardinality by 3 orders of magnitude. This is because each existing series (s) will potentially split into up to 1000 unique series (s,1),(s,2),...,(s,1000).
How much of a problem this is will depend on your existing series cardinality. Increasing from 10 to 10,000 is probably no big deal. Increasing from 100,000 to 100,000,000 is more likely to be an issue. You would need to experiment and profile to see.
An alternative approach might be to encode the "descriptor" in the microsecond and/or nanosecond component(s) of the timestamp (as you're not using them anyway) to make them unique.

How to CUDA-ize code when all cores require global memory access

Without CUDA, my code is just two for loops that calculate the distance between all pairs of coordinates in a system and sort those distances into bins.
The problem with my CUDA version is that apparently threads can't write to the same global memory locations at the same time (race conditions?). The values I end up getting for each bin are incorrect because only one of the threads ended up writing to each bin.
__global__ void computePcf(
double const * const atoms,
double * bins,
int numParticles,
double dr) {
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numParticles - 1) {
for (int j = i + 1; j < numParticles; j++) {
double r = distance(&atoms[3*i + 0], &atoms[3*j + 0]);
int binNumber = floor(r/dr);
// Problem line right here.
// This memory address is modified by multiple threads
bins[binNumber] += 2.0;
}
}
}
So... I have no clue what to do. I've been Googling and reading about shared memory, but the problem is that I don't know what memory area I'm going to be accessing until I do my distance computation!
I know this is possible, because a program called VMD uses the GPU to speed up this computation. Any help (or even ideas) would be greatly appreciated. I don't need this optimized, just functional.
How many bins[] are there?
Is there some reason that bins[] need to be of type double? It's not obvious from your code. What you have is essentially a histogram operation, and you may want to look at fast parallel histogram techniques. Thrust may be of interest.
There are several possible avenues to consider with your code:
See if there is a way to restructure your algorithm to arrange computations in such a way that a given group of threads (or bin computations) are not stepping on each other. This might be accomplished based on sorting distances, perhaps.
Use atomics This should solve your problem, but will likely be costly in terms of execution time (but since it's so simple you might want to give it a try.) In place of this:
bins[binNumber] += 2.0;
Something like this:
int * bins,
...
atomicAdd(bins+binNumber, 2);
You can still do this if bins are of type double, it's just a bit more complicated. Refer to the documentation for the example of how to do atomicAdd on a double.
If the number of bins is small (maybe a few thousand, or less) then you could create a few sets of bins that are updated by multiple threadblocks, and then use a reduction operation (adding the sets of bins together, element by element) at the end of the processing sequence. In this case, you might want to consider using a smaller number of threads or threadblocks, each of which processes multiple elements, by putting an additional loop in your kernel code, so that after each particle processing is complete, the loop jumps to the next particle by adding gridDim.x*blockDim.x to the i variable, and repeating the process. Since each thread or threadblock has it's own local copy of the bins, it can do this without stepping on other threads accesses.
For example, suppose I only needed 1000 bins of type int. I could create 1000 sets of bins, which would only take up about 4 megabytes. I could then give each of 1000 threads it's own bin set, and then each of the 1000 threads would have it's own bin set to update, and would not require atomics, since it could not interfere with any other thread. By having each thread loop through multiple particles, I can still effectively keep the machine busy this way. When all the particle-binning is done, I then have to add my 1000 bin-sets together, perhaps with a separate kernel call.

how to remove duplicates from an array without sorting

i have an array which might contain duplicate objects.
I wonder if it's possible to find and remove the duplicates in the array:
- without sorting (strict requirement)
- without using a temporary secondary array
- possibily in O(N), with N the nb of the elements in the array
In my case the array is a Lua array, which contains tables:
t={
{a,1},
{a,2},
{b,1},
{b,3},
{a,2}
}
In my case, t[5] is a duplicate of t[2], while t[1] is not.
To summarize, you have the following options:
time: O(n^2), no extra memory - for each element in the array look for an equal one linearly
time: O(n*log n), no extra memory - sort first, walk over the array linearly after
time: O(n), memory: O(n) - use a lookup table (edit: that's probably not an option as tables cannot be keys in other tables as far as I remember)
Pick one. There's no way to do what you want in O(n) time with no extra memory.
Can't be done in O(n) but ...
what you can do is
Iterate thru the array
For each member search forward for repetitions, remove those.
Worst case scenario complexity is O(n^2)
Iterate the array, stick every value in a hash, checking if the it exists first. If it does remove from original array (or don't write to the new one). Not very memory efficient, but only 0(n) since you are only iterating the array once.
Yes, depending on how you look at it.
You can override the object insertion to prevent insertion of duplicate items. This is O(n) per object insertion and may feel faster for smaller arrays.
If you provide sorted object insertion and deletion then it is O(log n). Essentially you always keep the list sorted as you insert and delete so that finding elements is a binary search. The cost here is that element retrieval is now O(log n) instead of O(1).
This can be also be implemented efficiently using things like red-black tree's and multitree's but at the cost of additional memory. Such implementations offer several benefits for certain problems. For example, we can have O(log n) type of behavior even very very large tables with small a small memory footprint by using nested tree's. The top level tree provides a sort of paired down overview of the dataset while subtree's provide more refined access when needed.
For example, to see this suppose we have N elements. We could partition that into n1 groups. Each of those groups could then further be partitions into n2 more groups and those groups into n2 groups. Hence we have a depth of N/n1n2...
As you can see, the product of n's can become quite huge very quickly even for small n's. If N = 1 Trillion elements and n1 = 1000, n2 = 1000, n3 = 1000 it takes only 1000 + 1000 + 1000 + 1000 s = 4000 per access time. Also, we only have 10^9 times per node memory footprint.
Compare this to the average 500 billion access time's required for a direct linear search. It is over 100 million times faster and 1000 times less memory than a binary tree but about 100 times slower! (of course there is some overhead for keeping the tree's consistent but even that can be reduced).
If we were to use a binary tree then it would have a depth of about 40. The problem is there are about 1 trillion nodes so that is a huge amount of additional memory. By storing multiple values per node(and in the above case each node actually partial values and other tree's) we can significantly reduce the memory footprint but still have impressive performance.
Essentially linear access prevails at lower numbers and tree's prevail at high numbers. Tree's. Tree's consume more memory. By using multitree's we can combine the best of both worlds by using linear access over smaller numbers and having a larger number of elements per node(compared to binary tree's).
Such tree's are not trivial to create but essentially follow the same algorithmic nature of standard binary tree's, red-black tree's, AVL tree's, etc...
So if you are dealing with large datasets it is not a huge issue for performance and memory. Essentially, as you probably know, as one goes up the other goes down. Multitree's, sort of find the optimal medium. (assuming you chose your node sizes correctly)
The depth of the multitree is N/product(n_k,k=1..m). The memory footprint is the number of nodes which is product(n_k,k=1..m) (which can generally be reduced by an order of magnitude or possibly n_m)

Resources