In Caffe, as we can see in blob.hpp, there are 6 member variables in each blob object:
data_
diff_
shape_data_
shape_
count_
capacity_
data_ contains the normal data that we pass along
diff_ is the gradient computed by the network
Since the source code has no comments and there is no official documentation, I would like to know: what is the exact meaning of the others?
thanks,
shape_data_ and shape_ represent the same thing; only their types differ. shape_ is a vector of integers with the dimensions of the data, whereas shape_data_ is a shared pointer.
count_ is the total number of elements in data_, so it is the product of all the dimensions in shape_.
capacity_ is the maximum number of elements that the memory currently allocated for data_ can hold.
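The relationship between shape_, count_ and capacity_ can be sketched in a few lines of Python. This is only a toy model of the bookkeeping, not Caffe's actual class: the name BlobSketch and its reshape method are made up for illustration, and the rule that capacity_ only grows when count_ exceeds it is an assumption about how reshaping behaves.

```python
from math import prod


class BlobSketch:
    """Toy model of the bookkeeping fields only (hypothetical, not Caffe code)."""

    def __init__(self):
        self.shape_ = []     # dimensions of the data
        self.count_ = 0      # number of elements implied by shape_
        self.capacity_ = 0   # number of elements the allocated storage can hold

    def reshape(self, shape):
        self.shape_ = list(shape)
        # count_ is the product of all dimensions in shape_
        self.count_ = prod(self.shape_)
        # Assumption: storage is only reallocated when more room is needed
        if self.count_ > self.capacity_:
            self.capacity_ = self.count_


blob = BlobSketch()
blob.reshape([2, 3, 4, 5])
first = (blob.count_, blob.capacity_)

blob.reshape([1, 3, 4, 5])  # shrinking changes count_ but not capacity_
second = (blob.count_, blob.capacity_)
```

Under these assumptions, shrinking a blob lowers count_ while capacity_ stays at the largest size seen so far.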
References:
http://blog.luoyetx.com/2015/10/reading-caffe-2/
http://imbinwang.github.io/blog/inside-caffe-code-blob
I'm trying to build document vectors for the gensim example using Doc2Vec.
I passed TaggedDocument objects containing 9 docs and 9 tags.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0,1,2,3,4,5,6,7,100]
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
and it produces 101 vectors like this image.
[Image: gensim doc2vec produced 101 vectors]
and what I want to know is
How can I be sure that the tag I passed is attached to the right vector?
How did the vectors with the tags which I didn't pass (8~99 in my case) come out? Were they computed as a blank?
If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.
This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.
Any such doc-vectors that are allocated but never trained will be randomly initialized the same as the others, but never adjusted by training.
If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you use only contiguous int IDs from 0 to your max ID, with none unused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory.
(Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)
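The allocation rule described above can be illustrated without gensim. The following is a simplified sketch, not gensim's implementation: allocate_slots is a hypothetical helper, and it assumes tags are either all plain ints or all strings.

```python
def allocate_slots(tags):
    """Hypothetical helper: return (number of vectors, tag -> slot dict or None)."""
    if all(isinstance(t, int) for t in tags):
        # Plain ints are used directly as slot indexes, so every index
        # from 0 up to the highest tag must exist, used or not.
        return max(tags) + 1, None
    # String tags get a compact {tag -> slot} dictionary instead.
    lookup = {}
    for t in tags:
        lookup.setdefault(t, len(lookup))
    return len(lookup), lookup


int_tags = [0, 1, 2, 3, 4, 5, 6, 7, 100]
n_int, _ = allocate_slots(int_tags)

str_tags = [f"doc_{i}" for i in int_tags]
n_str, lookup = allocate_slots(str_tags)
```

With the int tags from the question this yields 101 slots, while the equivalent string tags need only 9, at the cost of keeping the lookup dictionary.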
Is there a way to define a cube of vectors? For example, if I define cube(1,2,4), can every entry of the cube be a vector of float entries (fcolvec)?
Reading the Armadillo documentation always helps before posting questions.
To have vectors in a cube-like layout, use the field class:
field<vec> X(2,3,4);
Each of the elements in the field is then an instance of the vec class. You will still need to set the size of each vector and manipulate its contents. For example:
X(1,2,3).set_size(10);
X(1,2,3).fill(456);
If on the other hand you want to access the columns of a slice in a cube, use:
cube C(4,3,2, fill::zeros);
C.slice(1).col(2).fill(456);
I tried to follow this tutorial on using ELKI with pre-computed distances for clustering.
http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances
I used the following set of command line options:
-dbc.filter FixedDBIDsFilter -dbc.startid 0 -algorithm clustering.OPTICS
-algorithm.distancefunction external.FileBasedDoubleDistanceFunction
-distance.matrix /path/to/matrix -optics.minpts 5 -resulthandler ResultWriter
ELKI fails with a configuration error saying that a dbc.in file is needed for the computation.
The following configuration errors prevented execution:
No value given for parameter "dbc.in":
Expected: The name of the input file to be parsed.
No value given for parameter "parser.distancefunction":
Expected: Distance function used for parsing values.
My question is: what is the dbc.in file? Why should I provide it in addition to the distance matrix file, since the pairwise distance matrix completely specifies all the information about the point cloud? (Also, I don't have access to any information other than the pairwise distances.)
What should I do about dbc.in? Should I override it, or specify some dummy information? Kindly help me understand.
thank you.
This is documented in the ELKI HowTos:
http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances
Using without primary data
-dbc DBIDRangeDatabaseConnection -idgen.count 100
However, there is a bug (a patch is on the HowTo page, and it will be fixed in the next release), so right now you can't fully use this; as a workaround you can use a text file that enumerates the objects.
The reason for this is that ELKI is designed to work on multi-relational data, not just to process matrices. Some algorithms may need, for example, a geographic representation of an object, some measurements for this object, and a label for evaluation. That is three relations.
What the DBIDRange data source essentially does is create a single "fake" relation that is just the DBIDs 0 to 99. On algorithms that don't need actual data, but only distances (e.g. LOF or DBSCAN or OPTICS), it is sufficient to have object IDs and a distance matrix.
I see the function CGPathEqualToPath, which I successfully used to compare data from two UIBezierPaths (technically, I compared a path to itself).
Is there any way to modify this function to find out how similar two paths are, and perhaps set a threshold to say that these paths are close enough to be considered the same?
(I'm using iOS.)
Also, unrelated: I have a mutable array of Bezier paths. What is the notation for accessing a particular element of the array? I'm new to this. Thanks.
You might be able to accomplish the comparison by drawing each path into a separate bitmap and then counting how many bits they have in common. You could then use the ratio of the total number of set bits across both bitmaps to the number of bits set in both as a degree of similarity. 2:1 would mean identical (two bitmaps completely overlapping), and 2:0 would mean nothing in common.
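As a language-neutral sketch of that idea, the Python below compares two small bitmaps given as nested lists of 0/1 values. Note that it swaps in the Dice coefficient (twice the shared bits divided by the total set bits), which is the inverse of the ratio described above, so identical bitmaps score 1.0 and disjoint ones 0.0. The function name and the sample bitmaps are made up for illustration; rendering the paths into bitmaps is not shown.

```python
def overlap_similarity(bitmap_a, bitmap_b):
    """Dice coefficient of two equally sized 0/1 bitmaps (hypothetical helper)."""
    set_a = sum(sum(row) for row in bitmap_a)
    set_b = sum(sum(row) for row in bitmap_b)
    # Count pixels that are set in both bitmaps
    shared = sum(
        a & b
        for row_a, row_b in zip(bitmap_a, bitmap_b)
        for a, b in zip(row_a, row_b)
    )
    total = set_a + set_b
    # Two empty bitmaps are treated as identical
    return 2 * shared / total if total else 1.0


square = [[0, 1, 1, 0],
          [0, 1, 1, 0]]
shifted = [[0, 0, 1, 1],
           [0, 0, 1, 1]]
elsewhere = [[1, 0, 0, 0],
             [1, 0, 0, 0]]

same = overlap_similarity(square, square)
partial = overlap_similarity(square, shifted)
none = overlap_similarity(square, elsewhere)
```

A threshold on this score (the value would need tuning for the use case) could then decide whether two paths count as the same.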
I don't think you can create a likeness function as you don't have access to the underlying structure or functions that provide access to those values. If you can elaborate the use case, maybe there is an alternate solution.
As for accessing an object at a particular index in an array, you can do it using:
id myObject = [array objectAtIndex:particularIndex];
I am currently bringing large (tens of GB) data files into Matlab using memmapfile. The file I'm reading in is structured with several fields describing the data that follows it. Here's an example of how my format might look:
m.format = { 'uint8' [1 1024] 'metadata'; ...
'uint8' [1 500000] 'mydata' };
m.repeat = 10000;
So, I end up with a structure m where one sample of the data is addressed like this:
single_element = m.data(745).mydata(26);
I want to think of this data as a matrix of, from the example, 10,000 x 500,000. Indexing individual items in this way is not difficult though somewhat cumbersome. My real problem arises when I want to access e.g. the 4th column of every row. MATLAB will not allow the following:
single_column = m.data(:).mydata(4);
I could write a loop to slowly piece this whole thing into an actual matrix (I don't care about the metadata by the way), but for data this large it's hard to overemphasize how prohibitively slow that will be... not to mention the fact that it will double the memory required. Any ideas?
Simply map it to a matrix:
m.format = { 'uint8' [1024 500000] 'x' };
m.Data(1).x will then be your data matrix.