DataLoader - shuffle implicit pairs - image-processing

Is there a way to handle the DataLoader as a list ? The idea is that I want to shuffle implicit pairs of images, without setting the shuffling into True
Basically, I have for example 10 scenes, each containing let's say 100 sequences, so they are represented inside the directory as
'1_1.png', '1_2.png', '1_3.png', '....., '2_1.png', '2_2.png', '2_3.png', ...., '3_1.png', '3_2.png', '3_3.png', ..., ...., '10_1.png', '10_2.png', '10_3.png', ...
I don't want complete shuffling of data, what I want simply is to shuffle but keeping pairs, so they are represented in the data loader as
[ '1_3.png', '1_4.png', '2_2.png', '2_3.png', '10_1.png', '10_2.png', '1_2.png', '1_3.png', ...]
and so on
Please have a look at this question which I have already asked on Stack Overflow concerning shuffling array of implicit pairs, where you can understand what I mean
As an example:
if this is a list
L = [['1_1'],['1_2'],['1_3'],['1_4'],['1_5'],['1_6'],['2_1'],['2_2'],['2_3'],['2_4'],['2_5'],['2_6'],['3_1'],['3_2'],['3_3'],['3_4'],['3_5'],['3_6']]
then this is the output
[['1_2'], ['1_3'], ['2_1'], ['2_2'], ['2_4'], ['2_5'],
['2_2'], ['2_3'], ['1_3'], ['1_4'], ['3_4'], ['3_5'],
['3_3'], ['3_4'], ['3_2'], ['3_3'], ['1_6'], ['2_1'],
['2_5'], ['2_6'], ['2_6'], ['3_1'], ['1_4'], ['1_5'],
['1_1'], ['1_2'], ['2_3'], ['2_4'], ['1_5'], ['1_6'],
['3_1'], ['3_2'], ['3_5'], ['3_6']]
I want to achieve the same for a DataLoader
The main idea, is that I want to train my network on sequential frames, but it doesn't have to be the complete sequence, but at least I need each step, two sequences are there

I think you are looking for data.Sampler: instead of the completely radom default shuffle of data.DataLoader, you can provide your own "sampler" that sample examples from your Dataset.
Looking at the input parameters of data.DataLoader:
sampler (Sampler, optional) – defines the strategy to draw samples
from the dataset. If specified, shuffle must be False.
I think a good starting point for is too look at the code of data.SubsetRandomSampler.

Related

Gensim doc2vec produce more vectors than given documents, when I pass unique integer id as tags

I'm trying to make documents vectors of gensim example using doc2vec.
I passed TaggedDocument which contains 9 docs and 9 tags.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0,1,2,3,4,5,6,7,100]
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
and it produces 101 vectors like this image.
gensim doc2vec produced 101 vectors
and what I want to know is
How can I be sure that the tag I passed is attached to the right vector?
How did the vectors with the tags which I didn't pass (8~99 in my case) come out? Were they computed as a blank?
If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.
This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.
Any such doc-vectors allocated but never subject to any traiing will be randomly-initialized the same as others - but never adjusted by training.
If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you only use all contiguous int IDs from 0 to your max ID, with none ununused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory.
(Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)

How do I speedup adding two big vectors of tuples?

Recently, I am implementing an algorithm from a paper that I will be using in my master's work, but I've come across some problems regarding the time it is taking to perform some operations.
Before I get into details, I just want to add that my data set comprehends roughly 4kk entries of data points.
I have two lists of tuples that I've get from a framework (annoy) that calculates cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two of that lists with the same names (first value of the tuple) in it, but two different cosine similarities. What I have to do is to sum the cosines from both lists, and then order the array and get the top-N highest cosine values.
My issue is: is taking too long. My actual code for this implementation is as following:
def topN(self, user, session):
upref = self.m2vTN.get_user_preference(user)
spref = self.sm2vTN.get_user_preference(session)
# list of tuples 1
most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
# list of tuples 2
most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
# concat both lists and add into a dict
d = defaultdict(int)
for l, v in (most_ss + most_su):
d[l] += v
# convert the dict into a list, and then sort it
_list = list(d.items())
_list.sort(key=lambda x: x[1], reverse=True)
return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads but I'm not sure if it will make it faster. Getting the lists is not the problem here, but the concatenation and sorting is.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine-similarity with all vectors in the model, the annoy-indexer isn't helping any. (Its purpose is to get a small subset of nearest-neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.
Further, if you're going to combine all of the distances from two such calculation, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector-models, you can supply a False-ish topn value to just get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)

Computing in-place with dask

Short version
I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the original backup numpy arrays, making the whole thing an in-place operation?
If you're thinking "you're using dask wrong" then see the long version below for why I feel the need to do this.
Long version
I'm using dask for an application where the original data is sourced from in-memory numpy arrays that contain data collected from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with the original data, which means that there isn't enough to make an in-memory copy. That makes it semantically a bit like an out-of-core problem, in that any derived value can only be realised in memory in chunks rather than all at once.
Dask is well-suited to this, except for one wrinkle. I'm simplifying a lot, but on most of the data (call it X), we need to apply an element-wise operation f, compute some summary statistics s(f(X)), and use that to compute another result over the data, say t(s(f(X)), f(X)). While all the functions are dask-friendly (can be done on a per-chunk basis), trying to simply run this dask graph would cause f(X) to all be held in memory at once because the chunks are all needed for the second pass. An alternative is to explicitly compute s before asking for t (as suggested by https://github.com/dask/dask/issues/874), and thus pay to compute f(X) twice, but it's a somewhat expensive operation so I'd like to avoid that.
However, once f has been applied, the original data are no longer needed. So I'd like to run da.store(f(X)) and have it store the results in the original backing numpy arrays. Technically I think I know how to set that up, and as long as I can be sure that each piece of data is fully consumed before it is overwritten then there are no race conditions, but I'm worried that I may be breaking an API contract by changing back data underneath dask and that it might go wrong in some way. Is there any way to guarantee that it is safe?
One way I can immediately see this going wrong is if several of the input arrays have the same contents and hence get given the same name in dask, causing them to the unified in the graph. I'm using name=False in da.from_array though, so that shouldn't be an issue.

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. Tensorflow even provides a placeholder_with_default that does not need feeding — but again, the intent of such a node is clear. For all purposes, a placeholder_with_default does the same thing as a constant — you can indeed feed the constant to change its value, but is the intent clear, would that not be confusing? I doubt so.
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
Simple example is to define two placeholders a,b and then an operation on them like below .
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a,b are not initialized and contains no data Because they were defined as placeholders.
Other approach to do same is to define variables tf.Variable and in this case you have to provide an initial value when you declare it.
like :
tf.global_variables_initializer()
or
tf.initialize_all_variables()
And this solution has two drawbacks
Performance wise that you need to do one extra step with calling
initializer however these variables are updatable .
in some cases you do not know the initial values for these variables
so you have to define it as a placeholder
Conclusion :
use tf.Variable for trainable variables such as weights (W) and biases (B) for your model or when Initial values are required in
general.
tf.placeholder allows you to create operations and build computation graph, without needing the data. In TensorFlow
terminology, we then feed data into the graph through these
placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of Tensorflow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows Tensorflow do all sorts of tricks and optimizations, like distributed, platform independent calculations, graph interoperability, GPU computations etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. And to indicate where inside the graph this data is supposed to go you use placeholders. This way, as Ahmed said, placeholder is a sort of a promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's
# define graph to do matrix muliplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now lets supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
# generate some data for our graph
data_x = np.random.randint(0, 10, size=[5, 5])
data_y = np.random.randint(0, 10, size=[5, 5])
# do the work
result = session.run(model, feed_dict={x: data_x, y: data_y}
There are other ways of supplying data into the graph, but arguably, placeholders and feed_dict is the most comprehensible way and it provides most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data

looping on flattened fortran matrices

I have a bit of code that looks like this:
DO I=0,500
arg1((I*54+1):(I*54+54)) = premultz*sinphi(I+1)
ENDDO
In short, I have an array premultz of dimension 54. I have an array sinphi of dimension 501. I want to take the first value of of sinphi times all the entries of premultz and store it in the first 54 entries of arg1, then the second value of of sinphi times all the entries of premultz and store it in the second54 entries of arg1, and so on.
These are flattened matrices. I have flattened them in the interest of speed, as one of the primary goals of this project is very fast code.
My question is this: is there a more efficient way of coding this sort of calculation in Fortran90? I know that Fortran has a lot of nifty array operations that can be done that I'm not fully aware of.
Thanks in advance.
This expression, if I've got things right, ought to create arg1 in one statement
arg1 = reshape(spread(premultz,dim=2,ncopies=501)*&
&spread(sinphi,dim=1,ncopies=54),[1,54*501])
I've hardwired the dimensions here, that may or may not suit your purposes. The inner expression generates the outer product of premultz and sinphi, which is then reshaped into a vector. You may find you need to reshape the transpose of the outer product, I haven't checked things very carefully.
However, based on my experience with this sort of clever use of Fortran's array intrinsics I doubt that this, or most other clever uses of Fortran's array intrinsics, will outperform the straightforward loop implementation you already have. For many of these operations the compiler is going to generate copies of arrays, and copying data is relatively expensive. Of course, this is an assertion you may want to test.
I'll leave it to you to decide if the one-liner is more comprehensible than the loops. Sometimes the expressivity of the array syntax comes at an acceptable cost in performance, sometimes it doesn't.

Resources