How can I get the output of a sentencepiece function as an int array? - vectorization

I am converting words to vectors and need each vector as an array of ints, but I am getting an object-typed array.
Can anyone help me with a solution?
import numpy as np

def word2idx(statement):
    # here I am using SentencePieceProcessor as sp
    id1 = np.asarray(sp.encode_as_ids(statement)).astype(np.int32)
    return id1

sentence = 'the world', 'hello cherry', 'make me proud'
id2 = [word2idx(s) for s in sentence]
print(id2)
Actual output:
[[array([ 34, 1867]), array([ 83, 184, 63, 50, 47, 71, 41]), array([328, 69, 7, 303, 649])]]
Expected output:
[[ 34, 1867], [ 83, 184, 63, 50, 47, 71, 41], [328, 69, 7, 303, 649]]

The problem is that the arrays have different lengths, so NumPy cannot make a tensor out of them.
If you are happy with a list of lists and don't need a NumPy array, you can call .tolist() on each array:
id2 = [np.array([34, 1867]), np.array([83, 184, 63, 50, 47, 71, 41])]
id2 = [a.tolist() for a in id2]
and get: [[34, 1867], [83, 184, 63, 50, 47, 71, 41]].
If you need a dense NumPy array, you need to pad all sequences to the same length. You can do something like:
id2 = [np.array([34, 1867]), np.array([83, 184, 63, 50, 47, 71, 41])]
idx = np.zeros((len(id2), max(len(s) for s in id2)))
for i, sent_ids in enumerate(id2):
    idx[i, :len(sent_ids)] = sent_ids
In this case you will get:
array([[  34., 1867.,    0.,    0.,    0.,    0.,    0.],
       [  83.,  184.,   63.,   50.,   47.,   71.,   41.]])
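Putting this together with the original word2idx example, here is a minimal sketch. It assumes sp is a loaded sentencepiece.SentencePieceProcessor (the model path below is only a placeholder); encode_as_ids already returns a plain Python list of ints, so the np.asarray wrapper can simply be dropped:
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('m.model')  # placeholder path to a trained SentencePiece model

sentences = ['the world', 'hello cherry', 'make me proud']

# encode_as_ids returns a list of ints per sentence -> a plain list of lists
id2 = [sp.encode_as_ids(s) for s in sentences]
print(id2)  # e.g. [[34, 1867], [83, 184, 63, 50, 47, 71, 41], [328, 69, 7, 303, 649]]

# If a dense array is needed, zero-pad to the longest sentence
idx = np.zeros((len(id2), max(len(s) for s in id2)), dtype=np.int32)
for i, ids in enumerate(id2):
    idx[i, :len(ids)] = ids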

Related

Dart Bech32 and Hex encoding and decoding

I'm trying to decode this Bech32 address into hex.
When given cosmos1qpjrq625nglf3xx9chdkq953nhrd3nygte44rt, it breaks the address down into its head, which is 'cosmos', and the remainder, which is represented as a list of 8-bit unsigned integers (Uint8List).
When this is encoded to hexadecimal (HEX.encode), I get a value of 00011203001a0a1413081f091106060518170d16000514111317030d11130408.
However, it is meant to be giving me 00643069549a3e9898c5c5db6016919dc6d8cc88 instead.
You can check this at https://slowli.github.io/bech32-buffer/ -> decoding cosmos1qpjrq625nglf3xx9chdkq953nhrd3nygte44rt there gives 00643069549a3e9898c5c5db6016919dc6d8cc88.
I can't figure out the issue. Is the formatting wrong, different bases? Or am I doing this completely wrong?
Thanks, and I appreciate any replies.
Here is a snippet of the code:
import 'package:bech32/bech32.dart';
import 'package:hex/hex.dart';
Bech32Codec bech32codec = Bech32Codec();
// target address : 00643069549a3e9898c5c5db6016919dc6d8cc88 -> to get to this address
String address = 'cosmos1qpjrq625nglf3xx9chdkq953nhrd3nygte44rt';
Bech32 bech32 = bech32codec.decode(address);
print(bech32.data);
// this returns [0, 1, 18, 3, 0, 26, 10, 20, 19, 8, 31, 9, 17, 6, 6, 5, 24, 23, 13, 22, 0, 5, 20, 17, 19, 23, 3, 13, 17, 19, 4, 8]
print(bech32.hrp);
print(bech32codec.encode(Bech32("cosmos", bech32.data)));
var answer2 = HEX.encode(bech32.data);
print(answer2);
var decode = HEX.decode('00643069549a3e9898c5c5db6016919dc6d8cc88');
print(decode);
// this returns [0, 100, 48, 105, 84, 154, 62, 152, 152, 197, 197, 219, 96, 22, 145, 157, 198, 216, 204, 136]
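The values in bech32.data are the 5-bit groups of the Bech32 data part (note they are all below 32), not bytes, which is why hex-encoding them directly yields a 32-byte string instead of the expected 20 bytes; they first need to be regrouped from 5-bit to 8-bit values (32 x 5 = 160 bits = 20 bytes). A minimal sketch of that regrouping, written in Python purely for illustration (the convert_bits helper here is hypothetical and not part of the Dart bech32 package):
def convert_bits(data, from_bits=5, to_bits=8):
    # Regroup a list of from_bits-wide values into to_bits-wide values.
    acc, bits, out = 0, 0, []
    for value in data:
        acc = (acc << from_bits) | value
        bits += from_bits
        while bits >= to_bits:
            bits -= to_bits
            out.append((acc >> bits) & ((1 << to_bits) - 1))
    return out

data = [0, 1, 18, 3, 0, 26, 10, 20, 19, 8, 31, 9, 17, 6, 6, 5,
        24, 23, 13, 22, 0, 5, 20, 17, 19, 23, 3, 13, 17, 19, 4, 8]
print(bytes(convert_bits(data)).hex())
# prints 00643069549a3e9898c5c5db6016919dc6d8cc88
The same regrouping can be expressed in Dart before calling HEX.encode.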

Fitting a Support Vector Classifier in scikit-learn with image data produces error

I'm trying to train an SVC classifier for image data. Yet, when I run this code:
classifier = svm.SVC(gamma=0.001)
classifier.fit(train_set, train_set_labels)
I get this error:
ValueError: setting an array element with a sequence.
I loaded the images into arrays with Matplotlib: plt.imread(image).
The error suggests something isn't an array, yet when I check the types of the data and the labels, they are both lists (I build the label list manually):
print(type(train_set))
print(type(train_set_labels))
<class 'list'>
<class 'list'>
If I do a plt.imshow(items[0]) then the image shows correctly in the output.
I also called train_test_split from scikit-learn:
train_set, test_set = train_test_split(items, test_size=0.2, random_state=42)
Example input:
train_set[0]
array([[[212, 134, 34],
[221, 140, 48],
[240, 154, 71],
...,
[245, 182, 51],
[235, 175, 43],
[242, 182, 50]],
[[230, 152, 51],
[222, 139, 47],
[236, 147, 65],
...,
[246, 184, 49],
[238, 179, 43],
[245, 186, 50]],
[[229, 150, 47],
[205, 122, 28],
[220, 129, 46],
...,
[232, 171, 28],
[237, 179, 35],
[244, 188, 43]],
...,
[[115, 112, 103],
[112, 109, 102],
[ 80, 77, 72],
...,
[ 34, 25, 28],
[ 55, 46, 49],
[ 80, 71, 74]],
[[ 59, 56, 47],
[ 66, 63, 56],
[ 48, 45, 40],
...,
[ 32, 23, 26],
[ 56, 47, 50],
[ 82, 73, 76]],
[[ 29, 26, 17],
[ 41, 38, 31],
[ 32, 29, 24],
...,
[ 56, 47, 50],
[ 59, 50, 53],
[ 84, 75, 78]]], dtype=uint8)
Example label:
train_set_labels[0]
'Picasso'
I'm not sure what step I'm missing to get the data in the form that the classifier needs in order to train it. Can anyone see what may be needed?
The error message you are receiving:
ValueError: setting an array element with a sequence,
normally results when you are trying to put a list somewhere that a single value is required. This would suggest to me that your train_set is made up of a list of multidimensional elements, although you do state that your inputs are lists. Would you be able to post an example of your inputs and labels?
UPDATE
Yes, it's as I thought. The first element of your training data, train_set[0], corresponds to a long list (I can't tell how long), each element of which consists of a list of 3 elements. You are therefore calling the classifier on a list of lists of lists, when the classifier requires a list of lists (m rows corresponding to the number of training examples with each row made up of a list of n features). What else is in your train_set array? Is the full data set in train_set[0]? If so, you would need to create a new array with each element corresponding to each of the subelements of train_set[0], and then I believe your code should run, although I am not too familiar with that classifier. Alternatively you could try running the classifier with train_set[0].
UPDATE 2
I don't have experience with scikit-learn's SVC, so I can't tell you the best way of preprocessing the data to make it acceptable to the algorithm, but one method would be to do as I said previously: for each element of train_set, which is composed of lists of lists, recurse through and place all the elements of each sublist into the list above. For example:
new_train_set = []
for i in range(len(train_set)):
    for j in range(len(train_set[i])):
        new_train_set.append(train_set[i][j])
I would then train with new_train_set and the training labels.
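For image data specifically, a common way to satisfy SVC's expected input shape of (n_samples, n_features) is to flatten each image into a single feature vector and split the data and labels together so they stay aligned. A minimal sketch, assuming items is the list of equally-sized images from plt.imread and train_set_labels is the matching list of label strings:
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Flatten each (H, W, 3) image into one row of features
X = np.array([np.asarray(img, dtype=np.float64).ravel() for img in items])
y = np.array(train_set_labels)

# Split features and labels together so rows stay aligned with their labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifier = svm.SVC(gamma=0.001)
classifier.fit(X_train, y_train)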

How can I get the values of the 8 neighbors of an image as the third dimension in NumPy?

Given 2D image data, for every pixel P1, how can I get the following 3D array out of it?
P9 P2 P3
P8 P1 P4
P7 P6 P5
img[x,y,:] = [P2, P3, P4, P5, P6, P7, P8, P9, P2]
without using a for loop, just NumPy operations (because of performance issues)
Here's one approach with zero-padding for boundary elements, using scikit-image's built-in view_as_windows (which is based on NumPy strides) for efficient sliding-window extraction -
import numpy as np
from skimage.util import view_as_windows as viewW

def patches(a, patch_shape):
    side_size = patch_shape
    ext_size = (side_size[0]-1)//2, (side_size[1]-1)//2
    img = np.pad(a, ([ext_size[0]], [ext_size[1]]), 'constant', constant_values=(0))
    return viewW(img, patch_shape)
Sample run -
In [98]: a = np.random.randint(0,255,(5,6))
In [99]: a
Out[99]:
array([[139, 176, 141, 172, 192,  81],
       [163, 115,   7, 234,  72, 156],
       [ 75,  60,   9,  81, 132,  12],
       [106, 202, 158, 199, 128, 238],
       [161,  33, 211, 233, 151,  52]])
In [100]: out = patches(a, [3,3]) # window size = [3,3]
In [101]: out.shape
Out[101]: (5, 6, 3, 3)
In [102]: out[0,0]
Out[102]:
array([[  0,   0,   0],
       [  0, 139, 176],
       [  0, 163, 115]])
In [103]: out[0,1]
Out[103]:
array([[  0,   0,   0],
       [139, 176, 141],
       [163, 115,   7]])
In [104]: out[-1,-1]
Out[104]:
array([[128, 238,   0],
       [151,  52,   0],
       [  0,   0,   0]])
If you want a 3D array, you could add a reshape at the end, like so -
out.reshape(a.shape + (9,))
But be mindful that this would create a copy instead of the efficient strided views we get from the function itself.
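For instance, continuing the sample run above (values taken from the outputs printed earlier):
out3d = out.reshape(a.shape + (9,))
print(out3d.shape)   # (5, 6, 9)
print(out3d[0, 1])   # flattened 3x3 neighborhood of a[0, 1], with zero padding:
                     # [  0   0   0 139 176 141 163 115   7]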

Finding hamming distance between ORB feature descriptors

I am trying to write a function to match ORB features. I am not using the default matchers (BFMatcher, FLANN matcher) because I just want to match specific features in one image with features in another image.
I saw that the ORB descriptor is a binary array.
My question is how to match 2 features, i.e. how to find the Hamming distance between 2 descriptors?
ORB descriptors:
descriptor1 =[34, 200, 96, 158, 75, 208, 158, 230, 151, 85, 192, 131, 40, 142, 54, 64, 75, 251, 147, 195, 78, 11, 62, 245, 49, 32, 154, 59, 21, 28, 52, 222]
descriptor2 =[128, 129, 2, 129, 196, 2, 168, 101, 60, 35, 83, 18, 12, 10, 104, 73, 122, 13, 2, 176, 114, 188, 1, 198, 12, 0, 154, 68, 5, 8, 177, 128]
Thanks.
ORB descriptors are just 32-byte uchar Mats.
The brute-force and FLANN matchers do some more work than just comparing descriptors, but if that's all you want for now, it would be a straight norm:
Mat descriptor1, descriptor2;
double dist = norm( descriptor1, descriptor2, NORM_HAMMING);
// NORM_HAMMING2 or even NORM_L1 would make sense, too.
// dist is a double, but ofc. you'd only get integer values in this case.
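If you are working from Python (the descriptors in the question look like plain Python lists), the same distance can be computed with OpenCV's norm or with plain NumPy bit counting. A small sketch, using the two descriptors from the question:
import numpy as np
import cv2

descriptor1 = np.array([34, 200, 96, 158, 75, 208, 158, 230, 151, 85, 192, 131,
                        40, 142, 54, 64, 75, 251, 147, 195, 78, 11, 62, 245,
                        49, 32, 154, 59, 21, 28, 52, 222], dtype=np.uint8)
descriptor2 = np.array([128, 129, 2, 129, 196, 2, 168, 101, 60, 35, 83, 18,
                        12, 10, 104, 73, 122, 13, 2, 176, 114, 188, 1, 198,
                        12, 0, 154, 68, 5, 8, 177, 128], dtype=np.uint8)

# OpenCV: Hamming norm over the raw bytes
dist = cv2.norm(descriptor1, descriptor2, cv2.NORM_HAMMING)

# Pure NumPy: XOR the bytes and count the set bits
dist_np = int(np.unpackbits(descriptor1 ^ descriptor2).sum())

print(dist, dist_np)  # both give the same integer Hamming distance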

Haskell: How to use attoparsec in order to read a nested list from a ByteString

I have a text file (~ 300 MB large) with a nested list, similar to this one:
[[4, 9, 11, 28, 30, 45, 55, 58, 61, 62, 63, 69, 74, 76, 77, 82, 87, 92, 93, 94, 95], [4, 9, 11, 28, 30, 45, 55, 58, 61, 62, 63, 69, 74, 76, 77, 82, 87, 92, 93, 94],[4, 9, 11, 28, 30, 45, 55, 58, 61, 62, 63, 69, 74, 76, 77, 82, 85, 87, 92, 93, 94, 95]]
Here is my program to read the file into a Haskell list of integers:
import qualified Data.ByteString as ByteStr

main :: IO ()
-- How can I do the same thing, but using ByteStr.readFile for file access?
main = do
    fContents <- readFile filePath
    let numList = readNums fContents
    putStrLn (show numList)
This works for small text files, but I want to use ByteString to read the file quickly. I found out that there is no read function for ByteString; instead you should write your own parser with attoparsec, since it supports parsing ByteStrings.
How can I use attoparsec to parse the nested list?
The data seems to be in JSON format, so you can use Data.Aeson's decode function, which works on lazy ByteStrings:
import qualified Data.ByteString.Lazy as BL
import Data.Aeson
import Data.Maybe

main = do
    fContents <- BL.readFile filePath
    let numList = decode fContents :: Maybe [[Int]]
    putStrLn (show $ fromJust numList)
