Representing an array as a feature in ML training - machine-learning

I have a set of features x1,x2,x3,x4 where x1,x2,x3 are floats and x4 is an array of floats.
To give an example, say I am trying to predict the price of a house. I could use the size of the house as an array (e.g. length, width, and height) along with other features like number of bedrooms, age of the house, number of bathrooms, etc.
This sounds simple, but I am struggling with how to represent it.
Here is a similar sample based on heart attack prediction https://colab.research.google.com/drive/1CQX2d0vkjlZKjX6wbG4ga6tRcI-zOMNA
I tried to add the array feature as a column at the end with np.c_:
##################################-Check-########################
print("Before", X_s[:1])
X_s = np.c_[X_s, np.random.rand(303, 2)]  # add a numpy array here as a feature
print("After", X_s[:1])
print("shape of X_s", X_s.shape)
print(X_s[:1])
dataset = tf.data.Dataset.from_tensor_slices((X_s, y_s))
But the problem is that the array is added as two extra columns at the end:
shape of X_s (303, 13)
shape of X_s (303, 15)
So if I have a feature array of, say, 303*300, the above approach will add 300 columns at the end, which is not what I want.
I am aware of CNNs, and one option is to model the whole problem as a CNN; that is, pad the other features into arrays as well, create an n-dimensional tensor, and use a CNN.
Is there something simpler and better than these approaches?
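Not from the thread, but as a hedged sketch of one common alternative: keep the array as its own named model input instead of flattening it into extra columns, and let a small sub-network summarize it before merging with the scalar features. The shapes, layer sizes, random placeholder data, and the input names "scalars"/"array_feat" below are illustrative assumptions, not part of the linked notebook.

import numpy as np
import tensorflow as tf

# Placeholder data: 303 rows, 3 scalar features, one array feature of length 300.
n_rows, n_array = 303, 300
X_scalar = np.random.rand(n_rows, 3).astype("float32")
X_array = np.random.rand(n_rows, n_array).astype("float32")
y = np.random.rand(n_rows).astype("float32")

# Keep the array feature as a separate named input instead of concatenating columns.
dataset = tf.data.Dataset.from_tensor_slices(
    ({"scalars": X_scalar, "array_feat": X_array}, y)
).batch(32)

scalar_in = tf.keras.Input(shape=(3,), name="scalars")
array_in = tf.keras.Input(shape=(n_array,), name="array_feat")

# Summarize the array feature with its own small sub-network, then merge.
a = tf.keras.layers.Dense(16, activation="relu")(array_in)
merged = tf.keras.layers.Concatenate()([scalar_in, a])
out = tf.keras.layers.Dense(1)(merged)

model = tf.keras.Model(inputs=[scalar_in, array_in], outputs=out)
model.compile(optimizer="adam", loss="mse")
model.fit(dataset, epochs=1)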

Related

Creating a ML algorithm where the train data does not have same number of columns in all records

So I have the following train data (no header, explanation below):
[1.3264,1.3264,1.3263,1.32632]
[2.32598,2.3256,2.3257,2.326,2.3256,2.3257,2.32566]
[10.3215,10.3215,10.3214,10.3214,10.3214,10.32124]
It does not have a header because all elements, with the exception of the last one in each array, are inputs, and the last one is the result/output.
So taking the first example: 1.3264, 1.3264, 1.3263 are the inputs/feed data that I want to give to the algorithm, and 1.32632 is the outcome/result.
All of these are historical values that should lead to a pattern being recognized.
I would like to give some test data to the algorithm, and it would give me the outcome/result based on the pattern it identified.
From all the examples I looked into with ML and sklearn, I have never seen one where (for the same type of data) the records have a varying number of entries. They all seem to have the same number of columns and different types of inputs, whereas mine always has the same type of input.
You can try two different approaches:
Extract features from your variable-length data so that every record gets a fixed-size feature vector. After that you can use any algorithm from sklearn or other packages. Feature extraction is a highly domain-specific process that requires context about what the data actually is. For example, you could try features like these:
import numpy as np

def extract_features_one_row(arr):
    arr = np.array(arr)
    y = arr[-1]      # the last element is the target
    arr = arr[:-1]   # the remaining elements are the inputs
    features = [
        np.mean(arr),
        np.sum(arr),
        np.median(arr),
        np.std(arr),
        np.percentile(arr, 5),
        np.percentile(arr, 95),
        np.percentile(arr, 25),
        np.percentile(arr, 75),
        (arr[1:] > arr[:-1]).sum(),  # number of increasing pairs
        (arr > arr.mean()).sum(),    # number of elements > mean value
        # extract trends, number of modes, etc.
    ]
    return features, y
data = [
    [1.3264, 1.3264, 1.3263, 1.32632],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
    [10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]
X, y = zip(*[extract_features_one_row(row) for row in data])
X = np.array(X)  # (3, 10)
print(X.shape, y)
So now X has the same number of columns for every record.
Use an ML algorithm that supports variable-length data: recurrent neural networks, transformers, or convolutional networks with padding (see the sketch below).
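A minimal sketch of that second approach, assuming a tf.keras setup; padding with 0.0 plus a Masking layer in front of an LSTM is one standard way to handle the variable lengths, and the layer sizes here are arbitrary.

import numpy as np
import tensorflow as tf

# Variable-length rows; the last value of each row is the target.
data = [
    [1.3264, 1.3264, 1.3263, 1.32632],
    [2.32598, 2.3256, 2.3257, 2.326, 2.3256, 2.3257, 2.32566],
    [10.3215, 10.3215, 10.3214, 10.3214, 10.3214, 10.32124],
]
X = [row[:-1] for row in data]
y = np.array([row[-1] for row in data], dtype="float32")

# Pad the inputs to a common length; the 0.0 padding is masked out below.
X_pad = tf.keras.preprocessing.sequence.pad_sequences(X, padding="post", dtype="float32")
X_pad = X_pad[..., np.newaxis]  # shape (batch, timesteps, 1 feature)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 1)),
    tf.keras.layers.Masking(mask_value=0.0),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_pad, y, epochs=10, verbose=0)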

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values within the A-->B range that were not used in training the model.
However, the input X vectors for these samples are in no way more similar to each other than any random selection of X vectors across the whole dataset.
Is there an available method to transform the input X vectors such that those with more similar y-values are "closer" (I'm not particular about the methodology, but it could be something like cosine similarity), and those with dissimilar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. Something that might accomplish this could be as simple as:
import umap
print(df.shape)
>> (23312, 2149)
print(len(target))
>> 23312
embedding = umap.UMAP().fit_transform(df, y=target)
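A hedged follow-up sketch of how the supervised embedding could feed a downstream model. The train/test split, n_components=10, target_metric="l2" (to treat y as a continuous target), and the k-NN regressor are all illustrative choices, not part of the answer above.

import umap
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# df: (n_samples, n_features) bit vectors; target: the continuous y-values.
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

# Supervised UMAP: target_metric="l2" treats y as a continuous target.
reducer = umap.UMAP(n_components=10, target_metric="l2")
reducer.fit(X_train, y=y_train)

emb_train = reducer.transform(X_train)
emb_test = reducer.transform(X_test)

# Regress in the embedded space, where similar-y samples should now sit closer together.
model = KNeighborsRegressor().fit(emb_train, y_train)
print(model.score(emb_test, y_test))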

ML/DL Train Model on Single Column Feature DataSet

I am trying to build a model that can predict insurance names based on insurance Id.
Before putting the question to this forum, I tried KNN and Decision Tree, but the accuracy does not exceed 60%.
In my data frame, I have one column as a feature and the other as a label.
I can also extract other features from this data, like whether the id is numeric, its length, etc.
I have 2.8M rows of data in this shape.
insurance_id    insurance_name
XOH830990804    Medicare
XOH01179276     Medicare
H55575577       Medicare
H71096147       WELLMED
IBPW01981926    BCBS
MT25110S        Aetna
WXQQ07123       Aetna
6WU7NSSGY63     Oxford
MX7ZZ35T        Oxford
DU00079Z        Welcare
PB95800M        UHC
Please guide me on which approach or model can help me to achieve an accuracy of more than 80%.
You can try to diversify your inputs.
As an example, you can pass additional features to the network, such as:
Length of the insurance_id
Quantity of numbers in the insurance_id
Quantity of letters in the insurance_id
Sum of all numbers in the insurance_id
And any other transform you might think of.
As the output layer of your network, you might want to use Dense(n_of_different_insurance_names, activation='softmax')
and a categorical_crossentropy loss function when compiling the model; a sketch follows below.
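A rough sketch of the idea, assuming a pandas DataFrame shaped like the table in the question; the hand-crafted features mirror the list above, while the tiny placeholder DataFrame and the layer sizes are illustrative assumptions.

import numpy as np
import pandas as pd
import tensorflow as tf

# A few rows in the shape of the table above, just for illustration.
df = pd.DataFrame({
    "insurance_id":   ["XOH830990804", "H71096147", "MT25110S", "6WU7NSSGY63"],
    "insurance_name": ["Medicare", "WELLMED", "Aetna", "Oxford"],
})

def id_features(s):
    digits = [int(c) for c in s if c.isdigit()]
    return [
        len(s),                        # length of the insurance_id
        len(digits),                   # quantity of numbers
        sum(c.isalpha() for c in s),   # quantity of letters
        sum(digits),                   # sum of all numbers
    ]

X = np.array([id_features(s) for s in df["insurance_id"]], dtype="float32")
names, y = np.unique(df["insurance_name"], return_inverse=True)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(len(names), activation="softmax"),
])
# sparse_categorical_crossentropy because the labels are integer-encoded;
# one-hot labels with categorical_crossentropy (as in the answer) are equivalent.
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)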

Image classification with Sift features and Knn?

Can you help me with image classification using SIFT features?
I want to classify images based on SIFT features:
Given a training set of images, extract SIFT descriptors from them.
Compute k-means over the entire set of SIFT descriptors extracted from the training set. The "K" parameter (the number of clusters) depends on the number of SIFT descriptors that you have for training, but is usually around 500->8000 (the higher, the better).
Now you have obtained K cluster centers.
You can compute the descriptor of an image by assigning each SIFT descriptor of the image to one of the K clusters. In this way you obtain a histogram of length K.
I have 130 images in the training set, so my training set is 130*K dimensional.
I want to classify my test images; I have 1 image, so my sample is 1*K dimensional. I wrote this code: knnclassify(sample, trainingset, group).
I want to classify into 7 groups, so I call knnclassify(sample (1*10), trainingset (130*10), group (7*1)).
The error is: "The length of GROUP must equal the number of rows in TRAINING." What can I do?
Straight from the docs:
CLASS = knnclassify(SAMPLE,TRAINING,GROUP) classifies each row of the data in SAMPLE into one of the groups in TRAINING using the nearest-neighbor method. SAMPLE and TRAINING must be matrices with the same number of columns. GROUP is a grouping variable for TRAINING. Its unique values define groups, and each element defines the group to which the corresponding row of TRAINING belongs. GROUP can be a numeric vector, a string array, or a cell array of strings. TRAINING and GROUP must have the same number of rows.
What this means is that group should be 130x1 and should indicate which group each of the training samples belongs to. unique(group) should return 7 values in your case - the seven categories represented in your training set.
If you don't already have a group vector which specifies which category each image falls into, you could use kmeans to split your training set into 7 groups:
group = kmeans(trainingset,7);
knnclassify(sample, trainingset, group);
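For reference, a rough scikit-learn equivalent of those MATLAB calls (not part of the original answer), assuming the 130*K histogram matrix and the 130-element group vector have already been built; the random placeholders below just stand in for them.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

K = 10                                      # number of visual-word clusters
trainingset = np.random.rand(130, K)        # 130 bag-of-visual-words histograms
group = np.random.randint(0, 7, size=130)   # one of 7 category labels per image
sample = np.random.rand(1, K)               # histogram of the single test image

clf = KNeighborsClassifier(n_neighbors=1).fit(trainingset, group)
print(clf.predict(sample))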

How are binary classifiers generalised to classify data into arbitrarily large sets?

How can algorithms which partition a space into halves, such as Support Vector Machines, be generalised to label data with labels from sets such as the integers?
For example, a support vector machine operates by constructing a hyperplane and then things 'above' the hyperplane take one label, and things below it take the other label.
How does this get generalised so that the labels are, for example, integers, or some other arbitrarily large set?
One option is the 'one-vs-all' approach, in which you create one classifier for each set you want to partition into, and select the set with the highest probability.
For example, say you want to classify objects with a label from {1,2,3}. Then you can create three binary classifiers:
C1 = 1 or (not 1)
C2 = 2 or (not 2)
C3 = 3 or (not 3)
If you run these classifiers on a new piece of data X, then they might return:
C1(X) = 31.6% chance of being in 1
C2(X) = 63.3% chance of being in 2
C3(X) = 89.3% chance of being in 3
Based on these outputs, you could classify X as most likely being from class 3. (The probabilities don't add up to 1 - that's because the classifiers don't know about each other).
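A minimal scikit-learn sketch of the one-vs-all idea (not from the original answer); the random toy data and the SVC base classifier are illustrative assumptions.

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Toy data with labels from {1, 2, 3}.
X = np.random.rand(60, 4)
y = np.random.choice([1, 2, 3], size=60)

# One binary SVM per class; prediction picks the class whose classifier scores highest.
clf = OneVsRestClassifier(SVC()).fit(X, y)

x_new = np.random.rand(1, 4)
print(clf.decision_function(x_new))  # one score per binary classifier
print(clf.predict(x_new))            # class with the highest score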
Another option applies if your output labels are ordered (with some kind of meaningful, rather than arbitrary, ordering). For example, in finance you might want to classify stocks into {BUY, SELL, HOLD}. Although you can't legitimately perform a regression on these (the data is ordinal rather than ratio data), you can assign the values -1, 0, and 1 to SELL, HOLD, and BUY and then pretend that you have ratio data. Sometimes this can give good results even though it's not theoretically justified. A tiny sketch of this mapping follows.
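A tiny sketch of that label-to-number trick; LinearRegression and the round-and-clip rule for mapping predictions back to labels are illustrative choices, not prescribed by the answer.

import numpy as np
from sklearn.linear_model import LinearRegression

# Map the ordered labels onto numbers and regress as if they were ratio data.
mapping = {"SELL": -1, "HOLD": 0, "BUY": 1}
labels = ["SELL", "HOLD", "BUY", "HOLD", "BUY", "SELL"]
X = np.random.rand(6, 3)
y = np.array([mapping[l] for l in labels])

reg = LinearRegression().fit(X, y)
pred = reg.predict(np.random.rand(1, 3))

# Round the continuous prediction back to the nearest ordered label.
inverse = {v: k for k, v in mapping.items()}
print(inverse[int(np.clip(round(float(pred[0])), -1, 1))])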
Another approach is the Crammer-Singer method ("On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines").
Svmlight implements it here: http://svmlight.joachims.org/svm_multiclass.html.
Classification into an arbitrarily large ordered set (such as the integers) is called ordinal regression. Usually this is done by mapping a range of continuous values onto an element of the set (see http://mlg.eng.cam.ac.uk/zoubin/papers/chu05a.pdf, Figure 1a).
