Machine learning model with relative feature importance

I have ~12 features and not much data. I would like to train a machine learning model but tell it up front that some features are more important than others. Is there a way to do that? One idea I came up with was to generate a lot of extra data by applying small changes to the pre-existing data while keeping the same labels, thus covering more of the search space. I would also like the relative feature importance matrix to carry some weight in the final feature importances (as produced by a classification tree, for example).
Ideally it would look like this:
Relative feature importance matrix:
     F1   F2   F3
F1   1    2    N
F2   .5   1    1
F3   N    1    1
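To make the augmentation idea concrete, here is a rough sketch of what I mean by perturbing existing rows (assuming numeric features; the function name and noise scale are just placeholders):

import numpy as np

def augment_with_jitter(X, y, copies=5, noise_scale=0.01, seed=0):
    """Create `copies` jittered duplicates of each row, keeping its label.
    noise_scale is relative to each feature's standard deviation (assumes numeric features)."""
    rng = np.random.default_rng(seed)
    stds = X.std(axis=0, keepdims=True)
    X_aug, y_aug = [X], [y]
    for _ in range(copies):
        X_aug.append(X + rng.normal(0.0, noise_scale, size=X.shape) * stds)
        y_aug.append(y)
    return np.vstack(X_aug), np.concatenate(y_aug)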

If I understand the question correctly, you want some features to count more than others. To do this, you can assign weights to the individual features themselves, according to how heavily you want each one to be taken into account.
This question is rather broad so I hope this can be of help.
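As a rough illustration of that idea (a sketch, not a full solution): a plain decision tree is largely insensitive to per-column scaling, so the example below uses a distance-based model (kNN), where scaling a column really does change how much it matters. The weights are made up:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical per-feature weights, e.g. derived from the relative-importance matrix above.
feature_weights = np.array([2.0, 1.0, 0.5])

def fit_weighted_knn(X, y, weights, n_neighbors=5):
    """Multiply each column by its weight so important features dominate the distance metric."""
    clf = KNeighborsClassifier(n_neighbors=n_neighbors)
    clf.fit(X * weights, y)
    return clf

# Predictions must apply the same weights: clf.predict(X_new * feature_weights)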

Related

Why can different stocks be merged together to build a single prediction model?

Given n samples with d features of stock A, we can build a (d+1)-dimensional linear model to predict the profit. However, in some books I found that if we have m different stocks with n samples and d features each, the data are merged into m*n samples with d features to build a single (d+1)-dimensional linear model to predict the profit.
My confusion is that different stocks usually have little connection with each other, and their profits are influenced by different factors and environments, so why can they be merged to build a single model?
If you are using R as your tool of choice, you might like the time series embedding howto and its appendix -- the mathematics behind it is Takens' theorem:
[Takens's theorem gives] conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system.
It looks to me as if the statements you quote relate to exactly this theorem: for d features (we are lucky if we know that number - we usually don't), we need d+1 dimensions.
If more time series are to be predicted, we can use the same embedding space as long as the features are the same. The d dimensions are usually simple variables (e.g. temperature for different energy commodity stocks) - this example helped me grasp the idea intuitively.
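For intuition, here is a minimal Python sketch of a time-delay embedding (the howto above uses R; names and parameters here are just illustrative):

import numpy as np

def delay_embed(series, dim, lag=1):
    """Stack lagged copies of a 1-D series into rows of an embedding of dimension `dim`."""
    series = np.asarray(series)
    n = len(series) - (dim - 1) * lag
    return np.column_stack([series[i * lag : i * lag + n] for i in range(dim)])

# Example: embed a sine wave in 3 dimensions.
t = np.linspace(0, 20, 500)
embedded = delay_embed(np.sin(t), dim=3, lag=5)   # shape (490, 3)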
Further reading
Forecasting with Embeddings

Feature Selection or PCA?

I have the following Azure Machine Learning question:
You need to identify which columns are more predictive by using a statistical method. Which module should you use?
A. Filter Based Feature Selection
B. Principal Component Analysis
I chose A, but the answer is B. Can someone explain why it is B?
PCA gives the optimal approximation of a random vector (in N-dimensional space) by a linear combination of M (M < N) vectors. Notice that we obtain these vectors by computing the M eigenvectors with the largest eigenvalues. Thus these vectors (features) can be (and usually are) combinations of the original features.
Filter Based Feature Selection chooses the best features as they are (not combining them in any way), based on various scores and criteria.
So, as you can see, PCA results in better features because it constructs a new feature set, while FBFS merely finds the best subset of the existing ones.
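To see the difference in code, here is a rough scikit-learn sketch (not the Azure modules themselves; the dataset and k=5 are chosen just for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# PCA: builds 5 new features, each a linear combination of all original columns.
X_pca = PCA(n_components=5).fit_transform(X)

# Filter-based selection: keeps the 5 original columns with the highest ANOVA F-score.
X_sel = SelectKBest(f_classif, k=5).fit_transform(X, y)

print(X_pca.shape, X_sel.shape)   # (569, 5) (569, 5)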
hope that helps ;)

Handle mismatch in number of features in Training Data and Prediction Data

I have 6 text features (say f1, f2, ..., f6) available for the data on which I have trained a model. But when the model is deployed and a new data point arrives for which I have to make a prediction, it has only 2 features (f1 and f2). So there is a feature mismatch problem. How can I tackle it?
I have a few thoughts, but they are not very effective.
Use only two features for training (f1 and f2) and discard the other features (f3, ..., f6). But this loses information, and my test set accuracy decreases.
Learn some relation between (f3, ..., f6) and (f1, f2), so that even though (f3, ..., f6) are missing from the new data point, their information can be recovered from f1 and f2 alone.
The best way is, of course, to train a new model using f1, f2 and any new data you may have.
Don't want to do that? If you don't have f3...f6, you shouldn't magically expect the model to work as intended.
Now, think about what f3...f6 actually are. Are they related to the information you do have? If they are, you may be able to approximate them. We can't tell you what to do because we have no clue what they are. Interpolation? Regression? A rough approximation?
My suggestion: you are missing most of the predictors for your model. Your old model is meaningless. Please just train a new one.
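That said, if you do try the approximation route mentioned above, one rough sketch (assuming the features are already numerically encoded; this is just an illustration) is to regress the missing columns on the available ones:

import numpy as np
from sklearn.linear_model import LinearRegression

# X_train has all six encoded features: columns 0-1 are f1, f2; columns 2-5 are f3..f6.
def fit_feature_imputer(X_train):
    """Learn to predict f3..f6 from f1, f2 on the training data."""
    imputer = LinearRegression()
    imputer.fit(X_train[:, :2], X_train[:, 2:])
    return imputer

def complete_features(imputer, X_new_f1_f2):
    """Fill in estimates of f3..f6 for new points that only carry f1 and f2."""
    return np.hstack([X_new_f1_f2, imputer.predict(X_new_f1_f2)])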
Perhaps you could fill in f3 to f6 with placeholder values equal to the average of each feature over all data that includes it. That way the values for f3 through f6 won't stand out too much and won't lean your classifier one way or the other, so the classifier would be more likely to rely on the features that are actually provided, f1 and f2.
When calculating this, make sure the averages are computed for each class first and then averaged. That way, if your data set has a large amount of one class, it won't skew the result.
Of course, this might be an oversimplification and would work best with binary classification; it depends on the data set and the classification task.
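A rough sketch of that class-balanced fill (assuming numeric features; just an illustration):

import numpy as np

def class_balanced_means(X_train, y_train, cols):
    """Mean of each column in `cols`, computed per class first and then averaged across classes."""
    per_class = np.array([X_train[y_train == c][:, cols].mean(axis=0) for c in np.unique(y_train)])
    return per_class.mean(axis=0)

# Fill f3..f6 (columns 2-5) for new points that only carry f1, f2:
# filler = class_balanced_means(X_train, y_train, cols=[2, 3, 4, 5])
# X_new_full = np.hstack([X_new_f1_f2, np.tile(filler, (len(X_new_f1_f2), 1))])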
Hope this helps :)

Principal Component Analysis

I am studying principal component analysis, and I have just learnt that before applying PCA to the data samples, we have to apply two preprocessing steps: mean normalization and feature scaling. However, I have no idea what mean normalization is or how it can be implemented.
I searched for it first; however, I could not find an instructive explanation. Can anyone explain what mean normalization is and how it can be implemented?
Assume there is a dataset with d features (columns) and n observations (rows). For simplicity's sake, let's consider d=2 and n=100, which means your dataset has 2 features and 100 observations.
In other words, your dataset is a 2-dimensional array with 100 rows and 2 columns (100x2).
Initially, when you visualize it, you can see the points scattered in 2 dimensions.
When you standardize the dataset and visualize it again, you can see that all the points have shifted towards the origin. In other words, each feature column now has a mean of 0 and a standard deviation of 1. This process is called standardization.
How do you standardize?
It's pretty simple; the formula is straightforward:
z = (X - u) / s
Where,
X - an observation in the feature column
u - mean of the feature column
s - standard deviation of the feature column
Note: you have to apply standardization to every feature in the dataset.
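A minimal sketch using the StandardScaler referenced below, which applies exactly this formula column by column:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 2)                # 100 observations, 2 features

# Equivalent to z = (X - u) / s applied per column.
X_std = StandardScaler().fit_transform(X)

print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))   # ~[0, 0] and [1, 1]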
Reference:
https://machinelearningmastery.com/normalize-standardize-machine-learning-data-weka/
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Identifying machine learning data to make predictions

As a learning exercise, I plan to implement a machine learning algorithm (probably a neural network) to predict what users earn trading stocks, based on shares bought, shares sold and transaction times. The datasets below are test data I've formulated.
Acronyms:
tab=millisecond time apple bought
asb=apple shares bought
tas=millisecond time apple sold
ass=apple shares sold
tgb=millisecond time google bought
gsb=google shares bought
tgs=millisecond time google sold
gss=google shares sold
Training data:
username,tab,asb,tas,ass,tgb,gsb,tgs,gss
a,234234,212,456789,412,234894,42,459289,0
b,234634,24,426789,2,234274,3,458189,22
c,239234,12,156489,67,271274,782,459120,3
d,234334,32,346789,90,234254,2,454919,2
Classifications:
a earned $45
b earned $60
c earned ?
d earned ?
Aim: predict the earnings of users c and d based on the training data.
Are there any data points I should add to this data set? Should I perhaps use alternative data? As this is just a learning exercise of my own creation, I can add any feature that may be useful.
This data will need to be normalised; are there any other concepts I should be aware of?
Perhaps I should not use time as a feature, since shares can bounce up and down depending on the time.
You might want to solve your problem in the following order:
1) Prediction of an individual stock's future value based on all stocks' historical data.
2) Prediction of a combination of stocks' total future value based on a portfolio and all stocks' historical data.
3) A short-term buy-sell strategy for managing a portfolio (when and how much to buy/sell of which stock(s)).
If you can do 1) well for a particular stock, it's probably a good starting point for 2). 3) might be your goal, but I put it last because it's even more complicated.
I will make some assumptions below and focus on how to solve 1). :)
I assume at each timestamp, you have a vector of all possible features, e.g.:
stock price of company A (this is the target value)
stock price of other companies B, C, ..., Z (other companies might affect company A directly or indirectly)
52 week lowest price of A, B, C, ..., Z (long-term features begin)
52 week highest price of A, B, C, ..., Z
monthly highest/lowest price of A, B, C, ..., Z
weekly highest/lowest price of A, B, C, ..., Z (short-term features begin)
daily highest/lowest price of A, B, C, ..., Z
is revenue report day of A, B, C, ..., Z (really important features begin)
change of revenue of A, B, C, ..., Z
change of profit of A, B, C, ..., Z
semantic score of company profile from social networks of A, ..., Z
... (imagination helps here)
And I assume you have almost all of the above features at every fixed time interval.
I think an LSTM-like neural network is very relevant here.
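As a rough sketch of that idea (window length, feature count and layer sizes are made up for illustration), a minimal Keras LSTM regressor over sliding windows of the feature vectors could look like:

from tensorflow import keras

window, n_features = 30, 20     # hypothetical: 30 past timesteps, 20 features per step

model = keras.Sequential([
    keras.layers.Input(shape=(window, n_features)),
    keras.layers.LSTM(64),
    keras.layers.Dense(1),      # next-step price of company A
])
model.compile(optimizer="adam", loss="mse")

# X: (n_windows, window, n_features) sliding windows; y: (n_windows,) next price of A.
# model.fit(X, y, epochs=10, batch_size=32)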
Don't use the username along with the training data - the network might make associations between the username and the $ earned. Including it would factor the user into the output decision, while excluding it ensures the network will be able to predict the $ earned for an arbitrary user.
With the parameters you are suggesting, it seems to me impossible to predict earnings.
The main reason is that the input parameters don't correlate with the output value.
Your input values contradict themselves - consider this case: is it possible that for the same input you would expect different output values? If so, you won't be able to predict any output for such an input.
Let's go further: a trader's earnings depend not only on the number of shares bought/sold, but also on the price of each of them. This brings us to the problem of providing the neural network with two equal inputs while desiring different outputs.
How do you define 'good' parameters to predict the desired output in such a case?
I suggest first of all looking for people who do such estimations, then trying to define a list of the parameters they take into account.
If you succeed, you will end up with a huge list of variables.
Then you can try to build a model, for example using a neural network.
Besides normalisation you'll also need scaling. Another question I have for you concerns the classification of stocks. In your example you use Google and Apple, which are considered blue-chip stocks. I want to clarify: do you want to predict earnings only for Google and Apple, or for any combination of two stocks?
If you want to make predictions only for Google and Apple with the data you have, then you can apply just normalization and scaling together with some kind of recurrent neural network. Recurrent NNs are better at prediction tasks than a simple feedforward model trained with backpropagation.
But if you want to apply your training algorithm to more than just Google and Apple, I recommend splitting your training data into groups by some criterion. One example is dividing by stock capitalization; in that case you could make, say, five groups. And if you decide to make five groups of stocks, you can also apply equilateral encoding in order to decrease the number of dimensions for NN learning.
Another kind of grouping you could consider is the stock's area of operation, for example agricultural, technological, medical, hi-end, and tourist groups.
Let's say you decide to use this grouping (agricultural, technological, medical, hi-end, tourist). Then the five groups will give you five entries into the NN input layer (so-called thermometer encoding).
And let's say you want to feed in an agricultural stock.
Then the input will look like this:
1,0,0,0,0, x1, x2, ...., xn
Where x1, x2, ...., xn - are other entries.
Or, if you apply equilateral encoding, you'll have one dimension fewer (I'm too lazy to describe what it would look like).
Yet another idea for converting entries for the neural network is thermometer encoding.
And one more thing to keep in mind: people usually lose on trading stocks, so your data set will be biased. If you randomly choose only 10 traders, they could all be losers, and your data set would not be representative. So, in order to avoid data bias, you should have a big enough data set of traders.
And one more detail: you don't need to pass the user id into the NN, because the NN would then learn the trading style of a particular user and use it for prediction.
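A rough sketch of the group encoding described above (the groups and numeric feature values are just illustrative):

import numpy as np

GROUPS = ["agricultural", "technological", "medical", "hi-end", "tourist"]

def encode_input(group, numeric_features):
    """Prepend a one-of-five group code (as in the 1,0,0,0,0 example) to the numeric entries."""
    code = np.zeros(len(GROUPS))
    code[GROUPS.index(group)] = 1.0
    return np.concatenate([code, numeric_features])

# Example: an agricultural stock with entries x1..xn
x = encode_input("agricultural", np.array([0.12, 0.55, 0.33]))
# -> [1, 0, 0, 0, 0, 0.12, 0.55, 0.33]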
It seems to me that you have more dimensions than data points. However, it might be the case that your observations lie in a linear subspace; you just need to compute the kernel (null space) of the matrix shown above.
If the kernel has a larger dimension than the number of data points, then you do not need to add more data points.
Now there is another thing to look at: you should check your classifier's VC dimension; you don't want to add too many points to the dataset. But anyway, that is mostly theoretical in this example, and I'm just joking.
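If you want to check the kernel idea numerically on the table above, a rough sketch (using SciPy) is:

import numpy as np
from scipy.linalg import null_space

# Rows are the 4 users, columns the 8 numeric features from the training data above.
A = np.array([
    [234234, 212, 456789, 412, 234894, 42, 459289, 0],
    [234634, 24, 426789, 2, 234274, 3, 458189, 22],
    [239234, 12, 156489, 67, 271274, 782, 459120, 3],
    [234334, 32, 346789, 90, 234254, 2, 454919, 2],
], dtype=float)

K = null_space(A)    # orthonormal basis for the kernel of A
print(K.shape[1])    # kernel dimension = 8 - rank(A), i.e. 4 if A has full row rank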
