Clustering algorithms with multiple unrelated features - machine-learning

I have been watching tutorials on clustering. I understand the concept on small data sets like iris, etc. What I am having trouble with is clustering a data set with, say, 20 unrelated features. For example, how can I handle a situation where 5 or 6 of those features are binary and the rest are numerical? Let's say that feature 1 is 1200, feature 2 is 10, feature 3 is 1, feature 4 is 1, etc. How does an algorithm such as k-means work in that scenario? Does it cluster all the feature-1 values together and all the feature-2 values together, or will it cluster all the binary features together?
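As a minimal sketch of how k-means would see such a data set (assuming scikit-learn; the toy values roughly mirror the question), note that each row is a single point in feature space, so k-means clusters rows rather than individual features, and a large-valued numerical column will dominate the distance unless the columns are rescaled first:

```python
# Minimal sketch, assuming scikit-learn. k-means clusters rows (samples), not
# columns (features); without scaling, a column on the scale of 1200 would
# dominate the Euclidean distance over the binary columns.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# 4 samples x 4 features: two numerical columns, two binary columns
X = np.array([[1200, 10, 1, 1],
              [1150, 12, 0, 1],
              [  30,  9, 1, 0],
              [  25, 11, 0, 0]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)  # put all columns on a comparable scale
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # one cluster label per row, not per feature
```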

Related

How do engineered features help when they are not present in the test data

I am trying to classify drones versus birds using machine learning. I have a large number of feature-vector samples from a radar, each generally consisting of position (x, y, z), velocity (vx, vy, vz), acceleration (ax, ay, az), noise, SNR, etc., plus some more features. The actual classes are known for the samples. However, these basic features are not able to distinguish between drones and birds for new (out-of-bag) samples, so I am going to try feature engineering to generate new features, such as the standard deviation of speed: compute the mean speed, take the differences between the mean speed and the speeds of the individual samples of the same track, and average those differences to obtain the standard deviation. Similarly, I generate other new features using formulas based on the sum, difference, or deviation from the average of different samples from the same track.
After obtaining these features, we will use them to train a model that will be used for classification.
However, while I can apply all this feature engineering to the training dataset, the same engineered features will not be present in the test data obtained in the operational scenario, where we receive one sample after another. Also, in the operational scenario we do not know how many samples we will get for a track.
So, how can these engineered features be obtained in order to build a matching test feature vector in the actual operational scenario?
If they cannot be obtained at test time, then how will the engineered features used for model training be able to solve the classification problem when they are not present in the test data?
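For illustration only, here is a hedged sketch of one way the track-level speed statistics described above could be computed incrementally as samples arrive, which is the constraint of the operational scenario; the class name and the example velocities are made up:

```python
# Hedged sketch: update the per-track mean and standard deviation of speed one
# sample at a time (Welford's online algorithm), so the same engineered feature
# can be produced in a setting where samples arrive sequentially.
# The class name and the example velocities are hypothetical.
import math

class RunningSpeedStats:
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, vx, vy, vz):
        speed = math.sqrt(vx * vx + vy * vy + vz * vz)
        self.n += 1
        delta = speed - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (speed - self.mean)

    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0

track = RunningSpeedStats()
for vx, vy, vz in [(3.0, 4.0, 0.0), (3.5, 4.2, 0.1), (2.8, 3.9, 0.2)]:
    track.update(vx, vy, vz)
print(track.mean, track.std())  # track-level features available after each sample
```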

How to analyze the relationship between multiple inputs and multiple outputs through big data or machine learning

I have a lot of data; each record has 3 inputs and 6 outputs. How can I analyze the relationship in these data through big data analysis or machine learning, so that when 3 new inputs are provided, the 6 new outputs are automatically produced?
input:1,2,3 output:4,5,6,7,8,9
input:4,5,6 output:???
Your question is about the capacity of a learning machine. In supervised learning, the machine learns from a lot of examples. In your case, if you have a lot of labeled data, you may want to build a multi-layer perceptron to learn from the samples. In this architecture, you would want 3 input neurons, 6 output neurons, and multiple layers between them. On the other hand, if you believe there is a generating pattern in your training data, you may want to use a statistical model. In either case you need a lot of data to train a machine. Your example would confuse a machine just as it would confuse a human, since it has too few samples and too many possibilities.
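As a rough illustration of the architecture described above (not a recommendation of specific settings), a multi-layer perceptron with 3 inputs and 6 outputs could be sketched with scikit-learn; the hidden-layer sizes and toy rows are assumptions:

```python
# Minimal sketch, assuming scikit-learn: an MLP mapping 3 inputs to 6 outputs.
# The toy rows below are far too few to learn anything useful; in practice you
# would need many labeled examples, as the answer notes.
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5]])
y = np.array([[4, 5, 6, 7, 8, 9],
              [5, 6, 7, 8, 9, 10],
              [6, 7, 8, 9, 10, 11]])

model = MLPRegressor(hidden_layer_sizes=(16, 16), max_iter=5000, random_state=0)
model.fit(X, y)
print(model.predict([[4, 5, 6]]))  # the 6 predicted outputs for 3 new inputs
```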

Why is a feature good for distinguishing a cluster?

Let us suppose that, in a clustering task, we are trying to rank the importance of each feature of the dataset for each given cluster. Which characteristics of a feature should we measure to consider it good at characterizing a given cluster?
I am looking for a more analytical characterization of these features. For example, if a feature f has a high standard deviation in the whole dataset but a small standard deviation within a cluster c, does this mean that this feature is important for distinguishing the cluster c?
There are two approaches you could use here:
A feature selection approach would be to remove the feature in question, redo the clustering, and see whether that had a strong effect; if not, you could say the feature is unnecessary for the clustering task. The downside of this approach is the time it would take to run the clustering process for each subset of features in the dataset.
A statistical approach would be to split the data into two groups: the samples from the cluster and the rest of the samples. Then you ask how different the feature values are when comparing the two populations. Depending on the distribution of this feature, you could pick a test such as the KS test, t-test, chi-squared test, or any other test for comparing the distributions of two samples.
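A hedged sketch of the statistical approach, assuming scipy and precomputed cluster labels (the function name and the random data are made up for illustration):

```python
# Sketch of the statistical approach: compare a feature's distribution inside
# one cluster against the rest of the samples with a two-sample KS test.
# Assumes numpy/scipy; the labels here are random placeholders.
import numpy as np
from scipy.stats import ks_2samp

def feature_vs_cluster(X, labels, cluster_id, feature_idx):
    in_cluster = X[labels == cluster_id, feature_idx]
    rest = X[labels != cluster_id, feature_idx]
    return ks_2samp(in_cluster, rest)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
labels = rng.integers(0, 3, size=200)     # stand-in for real cluster assignments
stat, p = feature_vs_cluster(X, labels, cluster_id=0, feature_idx=2)
print(stat, p)  # a small p-value would suggest the feature separates cluster 0
```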

Feature weightage from Azure Machine Learning Deployed Web Service

I am trying to predict from my past data, which has around 20 attribute columns and a label. Out of those 20, only 4 are significant for prediction. But I also want to know, when a row falls into one of the classified categories, which other important correlated columns matter apart from those 4 and what their weights are. I want to get that result from my deployed web service on Azure.
You can use the Permutation Feature Importance module, but that will give the importance of the features across the sample set. Retrieving the weights on a per-call basis is not available in Azure ML.
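For reference, permutation feature importance can be reproduced locally; the sketch below uses scikit-learn rather than the Azure ML module, and the synthetic data and model choice are assumptions:

```python
# Illustrative sketch (not the Azure ML module): permutation feature importance
# with scikit-learn: shuffle one column at a time and measure the score drop.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```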

How to do machine learning when the inputs are of different sizes?

In standard cookbook machine learning, we operate on a rectangular matrix; that is, all of our data points have the same number of features. How do we cope with situations in which our data points have different numbers of features? For example, if we want to do visual classification but all of our pictures are of different dimensions, or if we want to do sentiment analysis but all of our sentences have different numbers of words, or if we want to do stellar classification but all of the stars have been observed a different number of times, etc.
I think the normal way would be to extract features of regular size from these irregularly sized data. But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves. But how do we use e.g. a neural network if the input layer is not of a fixed size?
Since you are asking about deep learning, I assume you are more interested in end-to-end systems rather than in feature design. Neural networks that can handle variable-size inputs include:
1) Convolutional neural networks with pooling layers. They are usually used in the image recognition context, but have recently been applied to modeling sentences as well. (I think they should also be good at classifying stars.)
2) Recurrent neural networks. (Good for sequential data, like time series and sequence labeling tasks; also good for machine translation.)
3) Tree-based autoencoders (also called recursive autoencoders) for data arranged in tree-like structures (they can be applied to sentence parse trees).
Plenty of papers describing example applications can readily be found by searching.
For less common tasks you can select one of these based on the structure of your data, or you can design variants and combinations of these systems.
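As an illustration of item 1 above, here is a hedged sketch (assuming PyTorch; the layer sizes are arbitrary choices) of why convolution plus global pooling accepts inputs of any length:

```python
# Sketch, assuming PyTorch: a 1-D convolution followed by global max pooling
# maps sequences of any length to a fixed-size vector, so the same network
# accepts variable-size inputs. Layer sizes are arbitrary choices.
import torch
import torch.nn as nn

class ConvPoolClassifier(nn.Module):
    def __init__(self, in_channels=8, hidden=16, n_classes=3):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, channels, length), any length
        h = torch.relu(self.conv(x))   # (batch, hidden, length)
        h = h.max(dim=2).values        # global max pool -> (batch, hidden)
        return self.fc(h)              # fixed-size output regardless of length

model = ConvPoolClassifier()
print(model(torch.randn(1, 8, 50)).shape)   # works for length 50 ...
print(model(torch.randn(1, 8, 200)).shape)  # ... and for length 200
```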
You can usually make the number of features the same for all instances quite easily:
if we want to do visual classification but all of our pictures are of different dimensions
Resize them all to a certain dimension / number of pixels.
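For instance, a tiny sketch assuming Pillow (the filename and target size are placeholders):

```python
# Sketch, assuming Pillow: resize every image to the same pixel dimensions so
# each instance yields the same number of features. Filename is hypothetical.
from PIL import Image

img = Image.open("example.jpg")
img_fixed = img.resize((224, 224))   # arbitrary but consistent target size
pixels = list(img_fixed.getdata())   # now every image gives 224*224 values
```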
if we want to do sentiment analysis but all of our sentences have different numbers of words
Keep a dictionary of k words drawn from your text data. Each instance will consist of a boolean vector of size k where the i-th entry is true if word i from the dictionary appears in that instance (this is not the best representation, but many are based on it). See the bag of words model.
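A minimal sketch of that boolean bag-of-words representation, assuming scikit-learn (the example sentences are made up):

```python
# Sketch, assuming scikit-learn: every sentence becomes a fixed-size boolean
# vector over a shared vocabulary of k words, regardless of sentence length.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the movie was great",
             "the movie was terrible",
             "great acting, terrible plot"]

vectorizer = CountVectorizer(binary=True)   # 1 if the word appears, else 0
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())   # the shared dictionary (size k)
print(X.toarray())                          # one fixed-length row per sentence
```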
if we want to do stellar classification but all of the stars have been observed a different number of times
Take the features that have been observed for all the stars.
But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves.
I think the speaker probably referred to higher level features. For example, you shouldn't manually extract the feature "contains a nose" if you want to detect faces in an image. You should feed it the raw pixels, and the deep learner will learn the "contains a nose" feature somewhere in the deeper layers.
