Binary features and Locality Sensitive Hashing (LSH)

Binary features and Locality Sensitive Hashing (LSH) - machine-learning

I am studying FLANN, a library for approximate nearest neighbors search.
For the LSH method they represent an object (point in search space), as
an array of unsigned int. I am not sure why they do this, and not
represent a point simply as a double array (which would represent a point
in multi-dimensional vector space). Maybe because LSH is used for binary
features? Can someone share more about the possible use of unsigned int in
this case? Why unsigned int if you only need a 0 and 1 for each feature?
Thanks

Please note that I will refer to the latest FLANN release, i.e. flann-1.8.3 at the time of writing.
For the LSH method they represent an object (point in search space),
as an array of unsigned int
No: this is wrong. The LshIndex class includes a buildIndexImpl method that implements the LSH indexing. Since the LSH is basically a collection of hash tables, the effective indexing occurs on the LshTable class.
The elementary indexing method, i.e. the method that indexes one feature vector (aka descriptor, or point) at a time is:
/** Add a feature to the table
* #param value the value to store for that feature
* #param feature the feature itself
*/
void add(unsigned int value, const ElementType* feature) {...}
Note: the buildIndexImpl method uses the alternative version that simply iterates over the features, and call the above method on each.
As you can see this method has 2 arguments which is a pair (ID, descriptor):
value which is unsigned int represents the feature vector unique numerical identifier (aka feature index)
feature represents the feature vector itself
If you look at the implementation you can see that the first step consists in hashing the descriptor value to obtain the related bucket key (= the identifier of the slot pointing to the bucket in which this descriptor ID will be stored):
BucketKey key = getKey(feature);
In practice the getKey hashing function is only implemented for binary descriptors, i.e. descriptors that can be represented as an array of unsigned char:
// Specialization for unsigned char
template<>
inline size_t LshTable<unsigned char>::getKey(const unsigned char* feature) const {...}
Maybe because LSH is used for binary features?
Yes: as stated above, the FLANN LSH implementation works in the Hamming space for binary descriptors.
If you were to use descriptors with real values (in R**d) you should refer to the original paper that includes details about how to convert the feature vectors into binary strings so as to use the Hamming space and hash functions.
Can someone share more about the possible use of unsigned int in this
case? Why unsigned int if you only need a 0 and 1 for each feature?
See above: the unsigned int value is only used to store the related ID of each feature vector.

Related

Gensim doc2vec produce more vectors than given documents, when I pass unique integer id as tags

I'm trying to make documents vectors of gensim example using doc2vec.
I passed TaggedDocument which contains 9 docs and 9 tags.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0,1,2,3,4,5,6,7,100]
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
and it produces 101 vectors like this image.
gensim doc2vec produced 101 vectors
and what I want to know is
How can I be sure that the tag I passed is attached to the right vector?
How did the vectors with the tags which I didn't pass (8~99 in my case) come out? Were they computed as a blank?

If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.
This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.
Any such doc-vectors allocated but never subject to any traiing will be randomly-initialized the same as others - but never adjusted by training.
If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you only use all contiguous int IDs from 0 to your max ID, with none ununused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory.
(Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)

Accessing field on original C struct in Go

I'm trying to use OpenCV from Go. OpenCV defines a struct CvMat that has a data field:
typedef struct CvMat
{
...
union
{
uchar* ptr;
short* s;
} data;
}
I'm using the go bindings for opencv found here. This has a type alias for CvMat:
type Mat C.CvMat
Now I have a Mat object and I want to access the data field on it. How can I do this? If I try to access _data, it doesn't work. I printed out the fields on the Mat object with the reflect package and got this:
...
{data github.com/lazywei/go-opencv/opencv [8]uint8 24 [5] false}
...
So there is a data field on it, but it's not even the same type. It's an array of 8 uint8s! I'm looking for a uchar* that is much longer than 8 characters. How do I get to this uchar?

The short answer is that you can't do this without modifying go-opencv. There are a few impediments here:
When you import a package, you can only use identifiers that have been exported. In this case, data does not start with an upper case letter, so is not exported.
Even if it was an exported identifier, you would have trouble because Go does not support unions. So instead the field has been represented by a byte array that matches the size of the underlying C union (8 bytes in this case, which matches the size of a 64-bit pointer).
Lastly, it is strongly recommended not to expose cgo types from packages. So even in cases like this where it may be possible to directly access the underlying C structure, I would recommend against it.
Ideally go-opencv would provide an accessor for the information you are after (presumably one that could check which branch of the union is in use, rather than silently returning bad data. I would suggest you either file a bug report on the package (possibly with a patch), or create a private copy with the required modifications if you need the feature right away.

How are hashtables (maps) stored in memory?

This question is specifically for hashtables, but might also cover other data structures such as linked lists or trees.
For instance, if you have a struct as follows:
struct Data
{
int value1;
int value2;
int value3;
}
And each integer is 4-byte aligned and stored in memory sequentially, are the key and value of a hash table stored sequentially as well? If you consider the following:
std::map<int, string> list;
list[0] = "first";
Is that first element represented like this?
struct ListNode
{
int key;
string value;
}
And if the key and value are 4-byte aligned and stored sequentially, does it matter where the next pair is stored?
What about a node in a linked list?
Just trying to visualize this conceptually, and also see if the same guidelines for memory storage also apply for open-addressing hashing (the load is under 1) vs. chained hashing (load doesn't matter).

It's highly implementation-specific. And by that I am not only referring to the compiler, CPU architecture and ABI, but also the implementation of the hash table.
Some hash tables use a struct that contains a key and a value next to each other, much like you have guessed. Others have one array of keys and one array of values, so that values[i] is the associated value for the key at keys[i]. This is independent of the "open addressing vs. separate chaining" question.

A hash is a data structure itself. Here's your visualizing:
http://en.wikipedia.org/wiki/Hash_table
http://en.wikipedia.org/wiki/Hash_function
Using a hash function (langauge-specific), the keys are turned into places, and the values are placed there (in an array.)
Linked-lists i'm not as sure about, but i would be they are stored sequentially if they are created sequentially. Obviously, if what the nodes hold increases in size, they'd need to be moved and the pointer redefined to that point.

Usually when the value is not that big (int) it's best to group it together with the key (which by default shouldn't be too big), otherwise only a pointer to it is kept.

The simplest representation of a hash table is an array (the table).
A hash function generates a number between 0 and the size of the array. That number is the index for the item.
There is more to it than this, bit that's the general concept and explains why lookups are so fast.

cv::bitwise_not on cv::Mat matrix

I tried to cv::bitwise_not to a cv::Mat matrix of double values. I applied like
cv::bitwise_not(img, imgtemp);
img is CV_64F data of 0 and 1. But imgtemp has all nonsense data inside.
I am expecting 0 in img to be 1 at imgtemp and 1 in img to be 0 at imgtemp. How to apply bitwise_not to a double Mat matrix?
Thanks

I cannot get the sense of doing a bitwise not of a double (floating point) value: you will be doing bitwise operations also on the exponent (see here). All bits will be inverted, from 0 to 1 and viceversa.
There is also a note on this aspect in the function documentation.
In case of a floating-point input array, its machine-specific bit
representation (usually IEEE754-compliant) is used for the operation.
If you want zeros to become ones and viceversa, as you suggested, you could do:
cv::threshold(warpmask, warpmaskTemp,0.5,1.0,THRESH_BINARY_INV)
(see documentation) (and yes, you can use same matrix for input and destination).

I think you are either getting the method signature wrong or wrongly named the parameters for the bitwise_not method.
According to [OpenCV 2.4.6 Documentation on bitwise_not() method] (http://docs.opencv.org/modules/core/doc/operations_on_arrays.html#void bitwise_not(InputArray src, OutputArray dst, InputArray mask))
void bitwise_not(InputArray src, OutputArray dst, InputArray mask=noArray())
If you are going to use any mask, it needs to be the last argument as mask is an optional for 'bitwise_not' method.
Additionally, all the data types need to be the same in order to avoid confusion. What I am trying to imply is that your source and destination data formats and any interim ones such as the method parameters must be in the same format. You cannot have on ein CV_64F and others in different. If I am not loosing my marbles here, bitwise operation would possibly require you to have all the data in unsigned or signed integer format for the sake of simplicity. Nevertheless, you should have all the types same.
About the garbage that you got, I think it is a general and good programming practice that you initialise your variables with some reasonable values. This helps when you are debugging step by step and ascertain the details where it failed.
Give it a try.

To follow on from Antonio's answer, you should use the right tool for the job. double is not an appropriate storage medium for boolean data.
In open cv you can type a boolean as an unsigned char (8bits). Although in typing your own true value you can pick any non-zero value, in open cv it is more natural to have 0/255; that way fitting in with open cv's bitwise operations and comparison operators. E.g. a bitwise not could be achieved by result = (input == 0) which can take any type. threshold in Antonio's answer maintains the same type (useful in some circumstances). For bitwise_not you should have it in the boolean format first.
Unfortunately opencv makes it very difficult to work with black and white bitwise data.

CV_8U opencv's matrixes on no 8 bit systems

I've read that the signed char and unsigned char types are not guaranteed to be 8 bits on every platform, but sometimes they have more than 8 bits.
If so, using OpenCv how can we be sure that CV_8U is always 8bit?
I've written a short function which takes a 8 bit Mat and happens to convert, if needed, CV_8SC1 Mat elements into uchars and CV_8UC1 into schar.
Now I'm afraid it is not platform independent an I should fix the code in some way (but don't know how).
P.S.: Similarly, how can CV_32S always be int, also on machine with no 32bit ints?

Can you give a reference of this (I've never heard of that)? Probably you mean the padding that may be added at the end of a row in a cv::Mat. That is of no problem, since the padding is usually not used, and especially no problem if you use the interface functions, e.g. the iterators (c.f.). If you would post some code, we could see, if your implementation actually had such problems.
// template methods for iteration over matrix elements.
// the iterators take care of skipping gaps in the end of rows (if any)
template<typename _Tp> MatIterator_<_Tp> begin();
template<typename _Tp> MatIterator_<_Tp> end();
the CV_32S will be always 32-bit integer because they use types like those defined in inttypes.h (e.g. int32_t, uint32_t) and not the platform specific int, long, whatever.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Binary features and Locality Sensitive Hashing (LSH) - machine-learning

Related

Gensim doc2vec produce more vectors than given documents, when I pass unique integer id as tags

Accessing field on original C struct in Go

How are hashtables (maps) stored in memory?

cv::bitwise_not on cv::Mat matrix

CV_8U opencv's matrixes on no 8 bit systems

Categories

Resources