Identifying machine learning data to make predictions

Identifying machine learning data to make predictions - machine-learning

As a learning exercise I plan to implement a machine learning algorithm (probably neural network) to predict what users earn trading stocks based on shares bought , sold and transaction times. Below datasets are test data I've formulated.
acronym's :
tab=millisecond time apple bought
asb=apple shares bought
tas=millisecond apple sold
ass=apple shares sold
tgb=millisecond time google bought
gsb=google shares bought
tgs=millisecond google sold
gss=google shares sold
training data :
username,tab,asb,tas,ass,tgb,gsb,tgs,gss
a,234234,212,456789,412,234894,42,459289,0
b,234634,24,426789,2,234274,3,458189,22
c,239234,12,156489,67,271274,782,459120,3
d,234334,32,346789,90,234254,2,454919,2
classifications :
a earned $45
b earned $60
c earned ?
d earned ?
Aim : predict earnings of users c & d based on training data
Is there any data points I should add to this data set? I should use alternative data perhaps ? As this is just a learning exercise of my own creation can add any feature that may be useful.
This data will need to be normalised, is there any other concept I should be aware of ?
Perhaps should not use time as a feature parameter as shares can bounce up and down depending on time.

You might want to solve your problem in below order:
Prediction for an individual stock's future value based on all stock's historical data.
Prediction for a combination of stocks' total future value based on a portfolio and all stocks' historical data.
A buy-sell short-term strategy for managing a portfolio. (when and what amount to buy/sell on which stock(s) )
If you can do 1) well for a particular stock, probably it's a good starting point for 2). 3) might be your goal but I put it in the last because it's even more complicated.
I would make some assumptions below and focus on how to solve 1) hopefully. :)
I assume at each timestamp, you have a vector of all possible features, e.g.:
stock price of company A (this is the target value)
stock price of other companies B, C, ..., Z (other companies might affect company A directly or indirectly)
52 week lowest price of A, B, C, ..., Z (long-term features begin)
52 week highest price of A, B, C, ..., Z
monthly highest/lowest price of A, B, C, ..., Z
weekly highest/lowest price of A, B, C, ..., Z (short-term features begin)
daily highest/lowest price of A, B, C, ..., Z
is revenue report day of A, B, C, ..., Z (really important features begin)
change of revenue of A, B, C, ..., Z
change of profit of of A, B, C, ..., Z
semantic score of company profile from social networks of A, ..., Z
... (imagination helps here)
And I assume you have almost all above features at every fixed time interval.
I think a lstm-like neural network is very relevant here.

Don't use the username along with the training data - the network might make associations between the username and the $ earned. Including it would factor in the user to the output decision, while excluding it ensures the network will be able to predict the $ earned for an arbitrary user.

Using parameter that you are suggesting seems me impossible to predict earnings.
The main reason is that input parameters don't correlate with output value.
You input values contradicts itself - consider such case is it possible that for the same input you will expect different output values? If so you won't be able predict any output for such input.
Let's go further, earnings of trader depend not only from a share of bought/sold stocks, but also from price of each one of them. This will bring us to the problem when we provide to neural network two equals input but desire different outputs.
How to define 'good' parameters to predict desired output in such case?
I suggest first of all to look for people who do such estimations then try to define a list of parameters they take into account.
If you will succeed you will get a huge list of variables.
Then you can try to build some model for example, using neural network.

Besides normalisation you'll also need scaling. Another question, which I have for you is classification of stocks. In your example you provide google and apple which are considered as blue-chipped stocks. I want to clarify, you want to make prediction of earning only for google and apple or prediction for any combination of two stocks?
If you want to make prediction only for google and apple and provide data which you have, then you can apply only normalization and scaling with some kind of recurrent neural network. Recurrent NN are better in prediction tasks then simple model of feedforward with backpropagation training.
But in case if you want to apply your training algorithm to more then just google and apple, I recommend you to split your training data into some groups by some criteria. One example of dividing can be according to capitalization of stocks. And if you want to make capitalization dividing, you can make five groups ( as example ). And if you decide to make five groups of stocks, you can also apply equilateral encoding in order to decrease number of dimensions for NN learning.
Another kind of grouping which you can think of can be area of operation of stock. For example agricultural, technological, medical, hi-end, tourist groups.
Let's say you decided to give this grouping as mentioned ( I mean agricultural, technological, medical, hi-end, tourist). Then five groups will give you five entries into NN to input layer ( so called thermometer encoding ).
And let's say you want to feed agricultural stock.
Then input will look like this:
1,0,0,0,0, x1, x2, ...., xn
Where x1, x2, ...., xn - are other entries.
Or if you apply equilateral encoding, then you'll have one dimension less ( I'm to lazy to describe how it will look like ).
Yet one more idea for converting entries for neural network can be thermometer encoding.
And one more idea to keep in your mind, as usually people loose on trading stocks, so your data set will be biased. I mean if you randomly choose only 10 traders, they all can be losers, and your data set will not be completely representative. So in order to avoid data bias, you should have big enough data set of traders.
And one more detail, you don't need to pass into NN user id, because NN then learn trading style of particular user, and use it for prediction.

Seems to me dimensions are more than data points. However, it might be the case that your observations are in a linear sub space, you just need to compute the kernel of the matrix shown above.
If the kernel has a larger dimension than the number of data points then you do not need add more data points.
Now there is another thing to look at. You should check out your classifier's VC dimension, don't want to add too many points to the dataset. But anyway that is mostly theoretical in this example, and I'm just joking.

Related

Why different stocks can be mergerd together to build a single prediction models?

Given n samples with d features of stock A, we can build a (d+1) dimensional linear model to predict the profit. However, in some books, I found that if we have m different stocks with n samples and d features for each, then they merge these data to get m*n samples with d features to build a single (d+1) dimensional linear model to predict the profit.
My confusion is that, different stocks usually have little connection with each other, and their profit are influenced by different factors and environment, so why they can be merged to build a single model?

If you are using R as tool of choice, you might like the time series embedding howto and its appendix -- the mathematics behind that is Taken's theorem:
[Takens's theorem gives] conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system.
It looks to me as the statement's you quote seem to relate to exactly this theorem: For d features (we are lucky, if we know that number - we usually don't), we need d+1 dimensions.
If more time series should be predicted, we can use the same embedding space if the features are the same. The dimensions d are usually simple variables (like e.g. temperature for different energy commodity stocks) - this example helped me to intuitively grasp the idea.
Further reading
Forecasting with Embeddings

Model that predict both categorical and numerical output

I am building a RNN for a time series model, which have a categorical output.
For example, if precious 3 pattern is "A","B","A","B" model predict next is "A".
there's also a numerical level associated with each category.
For example A is 100, B is 50,
so A(100), B(50), A(100), B(50),
I have the model framework to predict next is "A", it would be nice to predict the (100) at the same time.
For real life examples, you have national weather data.
You are predicting the next few days weather type(Sunny, windy, raining ect...) at the same time, it would be nice model will also predict the temperature.
Or for Amazon, analysis customer's trxns pattern.
Customer A shopped category
electronic($100), household($10), ... ...
predict what next trxn category that this customer is likely to shop and predict at the same time what would be the amount of that trxns.
Researched a bit, have not found any relevant research on similar topics.

What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
Your will need to normalise your output data though. Categories should be normalised with one-hot encoding and numerical values should be normalised by dividing by some maximal value.
Researched a bit, have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.

5 input and 3 output features for machine learning

Need some advise here.
I am trying to build a model where it can predict the 3 different output features when 5 input features are given.
for example,
5 input features: size of the house, house floor, house condition, number of rooms, parking.
3 output features: price for selling, price for buying, price for renting
What I am confusing right now is that, is that possible that the trained model are able to predict the 3 outputs? What I found from others' example/tutorial is that they mostly trying to do one thing only on their model.
Sorry if my explanations are bad, I am new to tensorflow and machine learning.

Neural network definitely can predict/approximate more outputs. I have experience with neuron regulator and there net produce control signal for two motors.
So I don't have experience with tensorflow. But this framework is from Google and is quite popular, so I'm almost sure, there is multioutput functionality.
There is nice example of such thing.

In common practice, we build a model to predict only one output, that is because in surpervised learning, we should input some certain kinds of variables, and find a relation between them with a wanted output. Because this relation generally cannot work between the input and another wanted output.
But we can have a special technique to fit your problem:
If we have four input variables : I1, I2, I3, I4, and we want three output lables (generally discrete): O1, O2, O3. so we can created a new lable O4 after mergering the original three outputs. For example, if O1, O2, O3 can onlt be 0 or 1, the O4 have 2^3, in total, 8 possible values. So, we can build a prediction model between four input variables and the output O4. And once value of O4 is known, O1-O3 are all known as well.
Howover, if the output variable are not all discrete, especially regression technique is used, the technique above wonnot work. So, to predict three output, we normally do training three times and make three models.

How to build multivariate ranking system?

I have data of various sellers on ecommerce platform. I am trying to compute seller ranking score based on various features, such as
1] Order fulfillment rates [numeric]
2] Order cancel rate [numeric]
3] User rating [1-5] { 1-2 : Worst, 3: Average , 5: Good} [categorical]
4] Time taken to confirm the order. (shorter the time taken better is the seller) [numeric]
My first instinct was to normalize all the features, then multiply parameters/feature by some weight . Add them together for each seller score. Finally, find relative ranking of sellers based on this score.
My Seller score equation looks like
Seller score = w1* Order fulfillment rates - w2*Order cancel rate + w3 * User rating + w4 * Time taken to confirm order
where, w1,w2,w3,w4 are weights.
My question is three fold
Are there better algorithms/approaches to solve this problem? i.e I linearly added the various features, I want to know better approach to build the ranking system?
How to come with the values for the weights?
Apart from using above features, few more that I can think of are ratio of positive to negative reviews, rate of damaged goods etc. How will these fit into my Score equation?
How to incorporate numeric and categorical variables in finding seller ranking score? (I have few categorical variables)
Is there an accepted way to weight multivariate systems like this ?

I would suggest the following approach:
First of all, keep in a matrix all features that you have available, whether you consider them useful or not.
(Hint: categorical variables are converted to numerical by simple encoding. Thus you can easily incorporate them (in the exact way you encoded user rating))
Then, you have to apply a dimensionality reduction algorithm, such as Singular Value Decomposition (SVD), in order to keep the most significant variables. Applying SVD may surprise you as to which features may be significant and which aren't.
After applying SVD, choosing the right weights for the n-most important features you decided to keep, is really up to you because it is purely qualitative and domain-dependent (as far as which features are more important).
The only way you could possibly calculate weights in a formalistic way is if the features were directly connected to something, e.g., revenue. Since this very rarely occurs, I suggest manually defining the weights; but for the sake of normalization, set:
w1 + w2 + ... + wn = 1
That is, split the "total importance" among the features you selected in a manner that sums to 1.

Using decision tree in Recommender Systems

I have a decision tree that is trained on the columns (Age, Sex, Time, Day, Views,Clicks) which gets classified into two classes - Yes or No - which represents buying decision for an item X.
Using these values,
I'm trying to predict the probability of 1000 samples(customers) which look like ('12','Male','9:30','Monday','10','3'),
('50','Female','10:40','Sunday','50','6')
........
I want to get the individual probability or a score which will help me recognize which customers are most likely to buy the item X. So i want to be able to sort the predictions and show a particular item to only 5 customers who will want to buy the item X.
How can I achieve this ?
Will a decision tree serve the purpose?
Is there any other method?
I'm new to ML so please forgive me for any vocabulary errors.

Using decision tree with a small sample set, you will definitely run into overfitting problem. Specially at the lower levels of the decision, where tree you will have exponentially less data to train your decision boundaries. Your data set should have a lot more samples than the number of categories, and enough samples for each categories.
Speaking of decision boundaries, make sure you understand how you are handling data type for each dimension. For example, 'sex' is a categorical data, where 'age', 'time of day', etc. are real valued inputs (discrete/continuous). So, different part of your tree will need different formulation. Otherwise, your model might end up handling 9:30, 9:31, 9:32... as separate classes.
Try some other algorithms, starting with simple ones like k-nearest neighbour (KNN). Have a validation set to test each algorithm. Use Matlab (or similar software) where you can use libraries to quickly try different methods and see which one works best. There is not enough information here to recommend you something very specific. Plus,
I suggest you try KNN too. KNN is able to capture affinity in data. Say, a product X is bought by people around age 20, during evenings, after about 5 clicks on the product page. KNN will be able to tell you how close each new customer is to the customers who bought the item. Based on this you can just pick the top 5. Very easy to implement and works great as a benchmark for more complex methods.
(Assuming views and clicks means the number of clicks and views by each customer for product X)

A decision tree is a classifier, and in general it is not suitable as a basis for a recommender system. But, given that you are only predicting the likelihood of buying one item, not tens of thousands, it kind of makes sense to use a classifier.
You simply score all of your customers and retain the 5 whose probability of buying X is highest, yes. Is there any more to the question?

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart