Let's say I have a dataset D which can be described by a feature vector V.
After doing some analysis, I realize that the samples in this dataset converge to 3 different sets of features. I am able to create 3 feature vectors V1, V2, V3 which are (almost) subsets of V, but which describe D more precisely.
In other words, D can be divided into D1, D2, D3 - each of these sub-datasets can be expressed really well by V1, V2, V3 respectively.
My question: is it a usual/acceptable approach to use the 3 feature vectors to train on D1, D2, D3 and create 3 classifiers, instead of using only 1 feature vector V to build 1 classifier?
This kind of "convergence" is also common for samples outside my dataset, so I want to use multiple classifiers to generalize to them.
Thank you!
Given n samples with d features of stock A, we can build a (d+1)-dimensional linear model to predict the profit. However, in some books, I found that if we have m different stocks, each with n samples and d features, then they merge these data to get m*n samples with d features and build a single (d+1)-dimensional linear model to predict the profit.
My confusion is that different stocks usually have little connection with each other, and their profits are influenced by different factors and environments, so why can they be merged to build a single model?
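If I understand those books correctly, the merged setup looks roughly like this (random data, only to pin down the shapes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

m, n, d = 5, 200, 3                      # m stocks, n samples each, d features
rng = np.random.default_rng(0)

# Per-stock data: list of (n, d) feature matrices and (n,) profit vectors
X_per_stock = [rng.normal(size=(n, d)) for _ in range(m)]
y_per_stock = [x @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)
               for x in X_per_stock]

# "Merging": stack everything into one (m*n, d) design matrix
X_all = np.vstack(X_per_stock)           # shape (m*n, d)
y_all = np.concatenate(y_per_stock)      # shape (m*n,)

# A single (d+1)-dimensional linear model: d coefficients plus 1 intercept
model = LinearRegression().fit(X_all, y_all)
print(model.coef_, model.intercept_)
```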
If you are using R as your tool of choice, you might like the time series embedding howto and its appendix -- the mathematics behind it is Takens' theorem:
[Takens's theorem gives] conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system.
It looks to me as if the statements you quote relate to exactly this theorem: for d features (we are lucky if we know that number -- we usually don't), we need d+1 dimensions.
If more time series are to be predicted, we can use the same embedding space as long as the features are the same. The dimensions d are usually simple variables (e.g. temperature for different energy commodity stocks) -- this example helped me to grasp the idea intuitively.
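If it helps intuition, here is a minimal Python sketch of a time-delay embedding (a made-up toy series, not the R howto itself): a scalar series becomes a cloud of points in a (d+1)-dimensional space by stacking lagged copies of itself.

```python
import numpy as np

def delay_embed(series, dim, lag=1):
    """Return the (len(series) - (dim-1)*lag, dim) matrix of lagged copies of a scalar series."""
    n = len(series) - (dim - 1) * lag
    return np.column_stack([series[i * lag : i * lag + n] for i in range(dim)])

series = np.sin(np.linspace(0, 20, 500))   # toy series
points = delay_embed(series, dim=3)        # points in a 3-dimensional embedding space
print(points.shape)                        # (498, 3)
```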
Further reading
Forecasting with Embeddings
I have ~12 features and not much data. I would like to train a machine learning model, but instruct it that some features are more important than others based on prior information I have. Is there a way to do that? One way I came up with was to generate a lot of data based on the pre-existing data with small changes, keeping the same labels, thus covering more of the search space. I would like the relative feature importance matrix to have some weight on the final feature importance (as generated by a classification tree, for example).
Ideally it would be something like:
Relative feature importance matrix:
N    F1    F2    F3
F1   1     2     N
F2   0.5   1     1
F3   N     1     1
If I understand the question, you want some features to be more important than others. To do this, you can assign weights to the individual features based on how heavily you want each one to be taken into account.
This question is rather broad, so I hope this can be of help.
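For example, with a distance-based model such as k-nearest neighbours, one simple (if crude) way to do this is to scale each feature column by its weight before fitting. The weights below are made up, and a toy 3-feature setup is assumed:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical weights: the first feature should count twice as much as the others
feature_weights = np.array([2.0, 1.0, 1.0])

def fit_weighted_knn(X, y):
    """Standardize the features, then stretch each column by its weight so it
    contributes proportionally more to the distance computation."""
    scaler = StandardScaler().fit(X)
    Xw = scaler.transform(X) * feature_weights
    return scaler, KNeighborsClassifier().fit(Xw, y)

def predict_weighted_knn(scaler, model, X):
    """Apply the same scaling and weighting before predicting."""
    return model.predict(scaler.transform(X) * feature_weights)
```

Note that tree-based models largely ignore feature scale, so this trick mainly helps distance- or margin-based learners.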
I have the following Azure Machine Learning question:
You need to identify which columns are more predictive by using a
statistical method. Which module should you use?
A. Filter Based Feature Selection
B. Principal Component Analysis
I chose A, but the answer is B. Can someone explain why it is B?
PCA is the optimal approximation of a random vector (in N-dimensional space) by a linear combination of M (M < N) vectors. Notice that we obtain these vectors by calculating the M eigenvectors with the largest eigenvalues. Thus these vectors (features) can be (and usually are) combinations of the original features.
Filter Based Feature Selection chooses the best features as they are (not combining them in any way) based on various scores and criteria.
So, as you can see, PCA can result in better features since it creates a new set of features, while Filter Based Feature Selection merely finds the best subset of the existing ones.
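To illustrate the contrast in scikit-learn terms (just an analogy, not the Azure modules themselves):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)   # 30 original features

# PCA: each of the 5 new features is a linear combination of all 30 originals
X_pca = PCA(n_components=5).fit_transform(X)

# Filter-based selection: keep the 5 original columns with the best F-score
selector = SelectKBest(f_classif, k=5).fit(X, y)
X_sel = selector.transform(X)

print(X_pca.shape, X_sel.shape)              # (569, 5) (569, 5)
print(selector.get_support(indices=True))    # indices of the kept original columns
```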
hope that helps ;)
I am studying principal component analysis, and I have just learnt that before applying PCA to the data samples, we have to apply two preprocessing steps: mean normalization and feature scaling. However, I have no idea what mean normalization is or how it can be implemented.
At first I searched for it; however, I could not find an instructive explanation. Can anyone explain what mean normalization is and how it can be implemented?
Assume there is a dataset with d features (columns) and n observations (rows). For simplicity's sake, let's consider d = 2 and n = 100, which means your dataset now has 2 features and 100 observations.
In other words, your dataset is now a 2-dimensional array with 100 rows and 2 columns (100 x 2).
Initially, when you visualize it, you can see that the points are scattered in 2 dimensions.
When you standardize the dataset and visualize it again, you can see that all the points have shifted towards the origin. In other words, every feature now has a mean of 0 and a standard deviation of 1. This process is called standardization.
How do you standardize?
It's pretty simple; the formula is straightforward:
z = (X - u) / s
Where,
X - an observation in the feature column
u - mean of the feature column
s - standard deviation of the feature column
Note: You have to apply standardization to every feature in the dataset.
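Here is a minimal sketch of that formula applied column by column, together with the StandardScaler from the references below (the data is random and only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.5], size=(100, 2))  # 100 rows, 2 features

# z = (X - u) / s, computed per feature column
Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Same result using scikit-learn
Z_sklearn = StandardScaler().fit_transform(X)

print(Z_manual.mean(axis=0).round(6), Z_manual.std(axis=0))  # ~[0, 0] and [1, 1]
print(np.allclose(Z_manual, Z_sklearn))                      # True
```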
Reference:
https://machinelearningmastery.com/normalize-standardize-machine-learning-data-weka/
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
The data set I am trying to cluster is made of multiple heterogeneous dimensions.
For example
<A, B, C, D>
where A, B are latitude and longitude.
C is a number.
D is a binary value.
What is the best way to approach a clustering problem in this case?
Should I normalise the data to make it homogeneous, or should I run a separate clustering problem for each homogeneous set of dimensions?
k-means is not a good choice, as it will not handle the 180° wrap-around, and distances anywhere but at the equator will be distorted. IIRC, in the northern USA and most parts of Europe the distortion is already over 20%.
Similarly, it does not make sense to use k-means on binary data - the mean is not meaningful there, to be precise.
Use an algorithm that can work with arbitrary distances, and construct a combined distance function that is designed for solving your problem, on your particular data set.
Then use e.g. PAM, DBSCAN, hierarchical linkage clustering, or any other algorithm that works with arbitrary distance functions.
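For example, a rough sketch of what such a combined distance could look like in Python, fed to DBSCAN as a precomputed distance matrix (the rescaling, the weights, and eps are arbitrary choices you would need to tune for your data; X is assumed to be your (n, 4) array with columns A, B, C, D):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

def combined_distances(X, w_geo=1.0, w_num=1.0, w_bin=1.0):
    """Weighted sum of a geographic, a numeric, and a binary distance component."""
    latlon = np.radians(X[:, :2])
    d_geo = haversine_distances(latlon) * 6371.0          # great-circle distance in km
    d_num = np.abs(X[:, 2:3] - X[:, 2:3].T)               # absolute difference on C
    d_bin = (X[:, 3:4] != X[:, 3:4].T).astype(float)      # 0/1 mismatch on D
    # Rescale each component so no single dimension dominates, then combine
    parts = [d / d.max() if d.max() > 0 else d for d in (d_geo, d_num, d_bin)]
    return w_geo * parts[0] + w_num * parts[1] + w_bin * parts[2]

D = combined_distances(X)
labels = DBSCAN(eps=0.3, metric="precomputed").fit_predict(D)  # eps is arbitrary here
```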
The mean of a binary feature can be seen as the frequency of that feature. There are cases in which one can standardise a binary feature $v$ by $v - \bar{v}$.
However, in your case it seems to me that you have three features in three different feature spaces. I'd approach this problem by creating three distances $d_v$, one appropriate for each feature $v \in V$. The distance between two entities, say $x$ and $y$, would be given by $d(x,y) = \sum_{v \in V} w_v \, d_v(x_v, y_v)$. You could play with the weights $w_v$, but I'd probably constrain them to $\sum_{v \in V} w_v = 1$ and $w_v \geq 0$ for all $v \in V$.
The above are just some quick thoughts on it, good luck!
PS: Sorry for the text, I'm new here and I don't know how to put latex text here