What GARCH model do I use for relative spread series? - time-series

I have tried multiple GARCH model variations to remove the usual financial time series characteristics from my dataset, mainly ARMA(1,1) + sGARCH models with a normal distribution. My standardized residuals and squared standardized residuals no longer show serial correlation, which is good. However, the goodness-of-fit test values are always 0 or very small, which I think indicates that the model choice is not appropriate. What GARCH specification should I use?
(My dataset is a financial time series of daily relative spreads, i.e. the spread divided by the mean of the ask and bid prices of that day.)
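If you are working in Python, here is a minimal sketch of one common variation to experiment with (the file name "spreads.csv" and column "rel_spread" are hypothetical; the arch package's built-in mean models are autoregressive, so this uses an AR(1) mean rather than ARMA(1,1)): an AR(1)-GARCH(1,1) with a Student-t distribution as a heavier-tailed alternative to the normal specification described above.

# Minimal sketch with the "arch" package; file and column names are placeholders.
import pandas as pd
from arch import arch_model

rel_spread = pd.read_csv("spreads.csv", index_col=0, parse_dates=True)["rel_spread"]

# Rescale so the optimizer is well conditioned (relative spreads are tiny numbers).
y = 100 * rel_spread

model = arch_model(y, mean="AR", lags=1, vol="GARCH", p=1, q=1, dist="t")
res = model.fit(disp="off")
print(res.summary())

# Standardized residuals for the usual diagnostics (Ljung-Box, goodness of fit).
std_resid = res.resid / res.conditional_volatility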

Related

How can I use AutoML in GCP to predict rare event?

As the title says, I tried to use AutoML in Google Cloud Platform to predict some rare outcomes.
For example, suppose I have 5 types of independent variables: age, living area, income, family size, and gender. I want to predict a rare event called "purchase".
Purchases are very rare: out of 10,000 data points I only get 3-4 purchases. Fortunately, I have far more than 10,000 data points (about 100 million in total).
I have tried to use AutoML to find the best model, but because the outcome is so rare, the model simply predicts zero purchases for every combination of these 5 variables. How can I solve this problem in AutoML?
In Cloud AutoML, the model predictions and the model evaluation metrics depend on the confidence threshold that is set. By default the confidence threshold is 0.5; this value can be changed in the “Evaluate” tab of the “Models” section. To evaluate your model, vary the confidence threshold and see how precision and recall are affected. The best confidence threshold depends on your use case, and the documentation provides example scenarios showing how the evaluation metrics can be used. In your case, recall should be maximized (which results in fewer false negatives) so that purchases are correctly predicted.
Also, the training data should contain a comparable number of examples from each class of the target variable so that the model can predict values with higher confidence. Since your training data is highly skewed, preprocessing such as resampling should be performed to handle the imbalance.
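For illustration, here is a minimal sketch of those two ideas outside of AutoML, using scikit-learn and a hypothetical DataFrame df with the five features (assumed already numerically encoded) and a 0/1 "purchase" column: up-weight the rare class during training and tune the decision threshold instead of fixing it at 0.5.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# df is a hypothetical, numerically encoded DataFrame.
X = df[["age", "living_area", "income", "family_size", "gender"]]
y = df["purchase"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# class_weight="balanced" up-weights the rare positive class during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

# Inspect precision/recall over all thresholds instead of fixing 0.5.
probs = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)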

Why are data not split in training and testing for unsupervised learning algorithms?

We know that for prediction and classification problems we can split the data according to a training ratio (generally 70-30 or 80-20), fit a model on the training data, and test its output against the test data.
Let's say I have a dataset with 2 columns:
First column: Employee Age
Second column: Employee Salary Type
With 100 records similar to this:
Employee Age    Employee Salary Type
25              low
35              medium
26              low
37              medium
44              high
45              high
If the data is split in the ratio 70:30, let the target variable be Employee Salary Type and the predictor variable be Employee Age.
The model is trained on 70 records and tested against the remaining 30 records, with their target values hidden.
Let's say 25 out of the 30 records are predicted accurately.
Accuracy of the model = (25/30)*100 = 83.33%
Which means the model is good.
Let's apply the same thing to an unsupervised method like clustering.
Here there is no target variable; only the clustering variables are present.
Let's consider both Employee Age and Employee Salary as clustering variables.
Then the data will be automatically clustered according to:
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
If the training ratio is applied here, we can cluster 70 random records and use the remaining 30 records for testing/validating the above model instead of testing with some other data (and their records).
Here we would fit the model on the 70% of records, fit it again on the remaining 30%, and then compare the characteristics of cluster 1 in the 70% data with the characteristics of cluster 1 in the 30% data. If the characteristics are similar, we can conclude that the clustering model was good.
Hence accuracy can be measured accurately here.
So why don't people prefer a train/test split for unsupervised analysis like clustering, association rules, forecasting, etc.?
I believe you have a few misconceptions, so here is a quick review:
Review
Unsupervised learning
This is when you have data inputs but no labels, and you learn something about the inputs.
Semi-supervised learning
This is when you have data inputs and some labels, and you learn something about the inputs and their relationship to the labels.
Supervised learning
This is when you have data inputs and labels, and you learn which input maps to which label.
Questions
Now, a few things you mention don't seem right:
Then data will be automatically clustered according to
Employees with low age and low salary
Employees with medium age and medium salary
Employees with high age and high salary
This is only guaranteed if your features represent employees using age and salary. Since you are using a clustering algorithm, you need to define a distance metric under which similar ages and salaries end up close to one another.
You also mention:
If the training ratio is applied here, we can cluster 70 random records and use the remaining 30 records for testing/validating the above model instead of testing with some other data (and their records).
Hence accuracy can be measured accurately here.
How do you know the labels? If you are clustering, you do not know what each cluster means; clusters are assigned only by your distance metric, and a cluster usually only signifies points being closer to or farther from each other.
You can never know what the correct label is unless you know that a cluster represents a certain label, and the features you cluster on and measure distance with cannot also be used for validation.
If they were, you would always get 100% accuracy, since the feature would also be the label.
A semi-supervised example
I think your misconception comes from confusing the learning types, so let's make an example using some fake data.
Let's say you have a table of Employee entries with the following fields:
Name
Age
Salary
University degree
University graduation date
Address
Now let's say some employees don't want to give their age, since it is not mandatory, but some do. Then you can use a semi-supervised learning approach to cluster employees and get information about their age.
Since we want to get the age, we can approximate it by clustering.
Let's make features that represent the Employee age to help us cluster them together:
employee_vector = [salary, graduation, address]
With our input, we are making the claim that age can be determined by salary, graduation date and address, which might be true.
Let's say we have represented all these values numerically, then we can cluster items together.
What would these clusters mean with a standard distance metric such as Euclidean distance?
People whose salaries, graduation dates and addresses are close together would be clustered together.
Then we could look at the clusters they are in and look at information about the ages we do know.
# `clusters` is an iterable of (cluster_id, employees) pairs; get_known_ages is a placeholder helper.
for cluster_id, employees in clusters:
    ages = get_known_ages(employees)
Now we could use the known ages in many ways to guess the missing employee ages, for example by fitting a normal distribution or just reporting a min/max range.
We can never know the exact age, since the clustering does not know it.
We also can never test for age, since it is not always known and is not used in the feature vectors for the employees.
This is why you could not use purely unsupervised approaches since you have no labels.
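As a concrete toy illustration of this semi-supervised idea, here is a minimal sketch with made-up numbers: cluster employees on [salary, graduation, address] and use the known ages inside each cluster to bound the unknown ones.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical numeric employee vectors: [salary, graduation year, address code].
# (In practice you would scale these features first.)
employee_vectors = np.array([
    [52000, 2015, 3], [48000, 2016, 3], [90000, 2001, 7],
    [95000, 1999, 7], [51000, 2014, 2], [88000, 2003, 8],
])
# Known ages, with np.nan where the employee declined to answer.
known_ages = np.array([29, np.nan, 45, 47, np.nan, 44])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(employee_vectors)

for cluster_id in np.unique(labels):
    ages = known_ages[(labels == cluster_id) & ~np.isnan(known_ages)]
    if len(ages) > 0:
        # Report a min/max range (or fit a distribution) as a guess for the unknown ages.
        print(cluster_id, ages.min(), ages.max())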
I do not know whom you are referring to with "why don't people prefer ...", but usually if you are doing an unsupervised analysis you do not have label data, and therefore you cannot measure accuracy. In that case you can use methods like the silhouette score or the L-curve to estimate the performance of the model.
On the other hand, if you have a supervised task with label data (as in this example), you can compute accuracy with cross-validation (train/test splits).
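For instance, here is a minimal sketch (synthetic data) of the silhouette idea: with no labels, an internal index like the silhouette score stands in for accuracy.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # two synthetic blobs

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # closer to 1 means better-separated clusters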
Because most unsupervised algorithms are not optimization based. (K-means is an exception!)
Examples: Apriori, DBSCAN, Local Outlier Factor.
And if you do not optimize, how are you going to overfit? (And if you do not use labels, you certainly cannot overfit to those labels.)

Classification of Stock Prices Based on Probabilities

I'm trying to build a classifier to predict stock prices. I generated extra features using some well-known technical indicators and fed these values, as well as values at past points, to the machine learning algorithm. I have about 45k samples, each representing an hour of OHLCV data.
The problem is actually a 3-class classification problem, with buy, sell and hold signals. I've built these 3 classes as my targets based on the (%) change at each time point: I've classified only the largest positive (%) changes as buy signals, the opposite for sell signals, and the rest as hold signals.
However, presenting this 3-class target to the algorithm resulted in poor accuracy for the buy & sell classes. To improve this, I chose to assign classes manually based on the probabilities of each sample. That is, I set the targets to 1 or 0 based on whether there was a price increase or decrease.
The algorithm then returns a probability between 0 and 1 (usually between 0.45 and 0.55) for its confidence in which class each sample belongs to. I then select a probability bound for each class within that range. For example: I classify p > 0.53 as a buy signal, p < 0.48 as a sell signal, and anything in between as a hold signal.
This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method to select these probability bounds without a large validation set. I've tried finding the best probability values within a validation set of 3000 samples, and this improved the classification accuracy, yet as the validation set grows, the prediction accuracy on the test set clearly decreases.
So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!
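To make the set-up concrete, here is a minimal sketch (with a hypothetical array of predicted probabilities) of the thresholding scheme described above; the two cut-offs are exactly the bounds the question wants to choose in a principled way.

import numpy as np

probs = np.array([0.46, 0.49, 0.52, 0.55, 0.51, 0.44])  # hypothetical classifier outputs

buy_cutoff, sell_cutoff = 0.53, 0.48
signals = np.where(probs > buy_cutoff, "buy",
                   np.where(probs < sell_cutoff, "sell", "hold"))
print(signals)  # ['sell' 'hold' 'hold' 'buy' 'hold' 'sell']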
What you are experiencing is called a non-stationary process: the market's movement depends on the time of the event.
One way I deal with it is to build the model on data from different time chunks.
For example, use data from day 1 to day 10 for training and day 11 for testing/validation, then move up one day: day 2 to day 11 for training and day 12 for testing/validation.
You can collect all your test results together to compute an overall score for your model. This way you have lots of test data and a model that adapts over time.
You also get 3 more parameters to tune: (1) how much data to use for training, (2) how much data to use for testing, and (3) how often (in days/hours/data points) you retrain your model.
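A minimal sketch of this rolling scheme, using a placeholder DataFrame of hourly rows in time order (the model-fitting lines are left as comments since they depend on your algorithm and feature names):

import pandas as pd

df = pd.DataFrame({"close": range(2000)})  # placeholder for the hourly OHLCV data

train_len = 10 * 24   # 1: how much data to train on (10 days of hourly bars)
test_len = 1 * 24     # 2: how much data to test on (1 day)
step = 1 * 24         # 3: how often to slide forward and retrain (1 day)

results = []
start = 0
while start + train_len + test_len <= len(df):
    train = df.iloc[start : start + train_len]
    test = df.iloc[start + train_len : start + train_len + test_len]
    # model.fit(train[features], train[target])
    # results.append(score(test[target], model.predict(test[features])))
    start += step

# Aggregating `results` afterwards gives one overall score built from many small test sets.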

Incorporating prior knowledge to machine learning models

Say I have a data set of students with features such as income level, gender, parents' education levels, school, etc., and the target variable is, say, passing or failing a national exam. We can train a machine learning model to predict, given these values, whether a student is likely to pass or fail (in sklearn, using predict_proba we can get the probability of passing).
Now say I have a different set of information, unrelated to the previous data set, which contains the schools and the percentage of students from each school who passed that national exam last year and in the years before, say schoolA: 10%, schoolB: 15%, etc.
How can I use this additional knowledge to improve my model? Surely this data is valuable (students from certain schools have a higher chance of passing the exam due to their educational facilities, qualified staff, etc.).
Do I somehow add this information as a new feature to the data set? If so, what is the recommended way? Or do I use this information after the model prediction and somehow combine the two to get a final probability? Obviously an average or a weighted average doesn't work, because the second data set has probabilities below 20%, which drags the combined probability very low. How do data scientists usually incorporate this kind of prior knowledge? Thank you.
You can try different ways to add this data and see whether your model is able to learn from it. More likely you'll see right away that this additional data just confuses the model, mostly because you're already providing more precise data on each student of the school and the model has more freedom in how to use that information.
But training artificial neural networks is all about continuous trial and error, so you should definitely try training with all the data you can imagine, to see whether it reaches a decent error in the end.
Using the average pass percentage of each student's school as a new feature for that student is worth a try.
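For example, here is a minimal sketch (hypothetical column names and values) of the "add it as a feature" route: join last year's pass percentage onto each student row, keyed by school.

import pandas as pd

students = pd.DataFrame({
    "income_level": [2, 3, 1],
    "gender": ["F", "M", "F"],
    "school": ["schoolA", "schoolB", "schoolA"],
})
school_rates = pd.DataFrame({
    "school": ["schoolA", "schoolB"],
    "past_pass_rate": [0.10, 0.15],
})

# past_pass_rate becomes just another column the model can learn from.
students = students.merge(school_rates, on="school", how="left")
print(students)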

What does dimensionality reduction mean?

What does dimensionality reduction mean exactly?
I searched for its meaning and only found that it means transforming raw data into a more useful form. So what is the benefit of having data in a more useful form, and how can I use it in a practical application?
Dimensionality reduction is about converting data of very high dimensionality into data of much lower dimensionality, such that each of the remaining dimensions conveys much more information.
This is typically done while solving machine learning problems to get better features for a classification or regression task.
Here's a contrived example. Suppose you have a list of 100 movies and 1000 people, and for each person you know whether they like or dislike each of the 100 movies. So for each instance (in this case, each person) you have a binary vector of length 100 [position i is 0 if that person dislikes the i-th movie, 1 otherwise].
You can perform your machine learning task on these vectors directly, but instead you could decide upon 5 genres of movies and, using the data you already have, figure out whether each person likes or dislikes an entire genre. In this way you reduce your data from a vector of size 100 to a vector of size 5 [position i is 1 if the person likes genre i].
The vector of length 5 can be thought of as a good representative of the vector of length 100, because most people might like movies only in their preferred genres.
However, it is not an exact representative, because there might be cases where a person hates all movies of a genre except one.
The point is that the reduced vector conveys most of the information in the larger one while consuming a lot less space and being faster to compute with.
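A minimal sketch of this movie example, with synthetic likes and a made-up movie-to-genre assignment:

import numpy as np

rng = np.random.default_rng(0)
likes = rng.integers(0, 2, size=(1000, 100))      # 1000 people x 100 movies, 1 = like
genre_of_movie = rng.integers(0, 5, size=100)     # each movie assigned to one of 5 genres

genre_pref = np.zeros((1000, 5), dtype=int)
for g in range(5):
    # 1 if the person likes most movies of genre g, else 0.
    genre_pref[:, g] = (likes[:, genre_of_movie == g].mean(axis=1) > 0.5).astype(int)

print(genre_pref.shape)  # (1000, 5): the reduced representation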
Your question is a little vague, but there's an interesting statistical technique that may be what you're thinking of, called Principal Component Analysis, which does something similar (and, incidentally, plotting its results was my first real-world programming task).
It's a neat but clever technique which is remarkably widely applicable. I applied it to similarities between protein amino acid sequences, but I've seen it used to analyse everything from relationships between bacteria to malt whisky.
Consider a graph of some attributes of a collection of things. With two independent variables you obviously plot in two dimensions and may see a scatter of points; with three variables you can use a 3D graph, but after that you start to run out of dimensions.
In PCA one might have dozens or even a hundred or more independent factors, all of which would need to be plotted on perpendicular axes. PCA does this, then analyses the resulting multidimensional graph to find the two or three axes within it that contain the largest amount of information. For example, the first principal coordinate is a composite axis (i.e. at some angle through n-dimensional space) that carries the most information when the points are plotted along it. The second axis is perpendicular to the first (remember this is n-dimensional space, so there are a lot of perpendiculars) and contains the second largest amount of information, and so on.
Plotting the resulting graph in 2D or 3D will typically give you a visualization that contains a significant amount of the information in the original dataset. The technique is usually considered valid if the representation captures around 70% of the information in the original data, enough to visualize relationships with some confidence that would otherwise not be apparent in the raw statistics. Note that the technique requires all factors to have the same weight, but given that, it's an extremely widely applicable method that deserves to be more widely known and is available in most statistical packages (I did my work on an ICL 2700 in 1980, which is about as powerful as an iPhone).
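Here is a minimal sketch of the PCA idea on synthetic data: project many-dimensional points onto the two composite axes that retain the most information.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                    # 200 samples, 30 features
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)    # make two features strongly correlated

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)             # coordinates along the first two principal axes
print(pca.explained_variance_ratio_)    # fraction of the variance each axis keeps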
http://en.wikipedia.org/wiki/Dimension_reduction
Maybe you have heard of PCA (principal component analysis), which is a dimension reduction algorithm.
Others include LDA, matrix-factorization-based methods, etc.
Here's a simple example. You have a lot of text files, each consisting of some words, and the files can be classified into two categories. You want to visualize each file as a point in a 2D/3D space so that you can see the distribution clearly. So you need to perform dimension reduction to transform a file containing many words into only 2 or 3 dimensions.
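A minimal sketch of that text-file example, using a tiny toy corpus: each file becomes a bag-of-words vector, which is then reduced to 2 dimensions for plotting.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cats purr and sleep", "kittens purr softly",
        "dogs bark and fetch", "puppies bark loudly"]

X = TfidfVectorizer().fit_transform(docs)             # one high-dimensional vector per file
X_2d = TruncatedSVD(n_components=2).fit_transform(X)  # reduce to 2 dimensions
print(X_2d)                                           # 2-D coordinates you could scatter-plot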
The dimensionality of a measurement of something is the number of numbers required to describe it. So, for example, the number of numbers needed to describe the location of a point in space is 3 (x, y and z).
Now let's consider the location of a train along a long but winding track through the mountains. At first glance this may appear to be a 3-dimensional problem, requiring a longitude, latitude and height measurement to specify. But these 3 dimensions can be reduced to one if you just take the distance travelled along the track from the start instead.
If you were given the task of using a neural network or some statistical technique to predict how far a train could get on a certain quantity of fuel, it would be far easier to work with the 1-dimensional data than the 3-dimensional version.
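A minimal sketch of the train example with made-up track coordinates: replace each 3-D position with a single number, the cumulative distance travelled from the start.

import numpy as np

track = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.5, 0.1],
                  [2.0, 1.5, 0.3],
                  [2.5, 3.0, 0.6]])   # (x, y, z) samples along the winding track

segment_lengths = np.linalg.norm(np.diff(track, axis=0), axis=1)
distance_along_track = np.concatenate([[0.0], np.cumsum(segment_lengths)])
print(distance_along_track)           # a 1-D replacement for the 3-D coordinates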
It's a technique of data mining. Its main benefit is that it allows you to produce a visual representation of many-dimensional data. The human brain is peerless at spotting and analyzing patterns in visual data, but it can process a maximum of three dimensions (four if you count time, i.e. animated displays), so any data with more than 3 dimensions needs to be somehow compressed down to 3 (or 2, since plotting data in 3D can often be technically difficult).
By the way, a very simple form of dimensionality reduction is the use of color to represent an additional dimension, for example in heat maps.
Suppose you're building a database of information about a large collection of adult human beings, and it's going to be quite detailed. So we could say that the database is going to have large dimensions.
As a matter of fact, each database record will include a measure of the person's IQ and shoe size. Now let's pretend that these two characteristics are quite highly correlated. Compared to IQs, shoe sizes are easy to measure, and we want to populate the database with useful data as quickly as possible. One thing we could do would be to forge ahead and record shoe sizes for new database records, postponing the task of collecting IQ data until later. We would still be able to estimate IQs from shoe sizes because the two measures are correlated.
We would be using a very simple form of practical dimension reduction by leaving IQ out of the records initially. Principal components analysis, various forms of factor analysis and other methods are extensions of this simple idea.
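A minimal sketch of that idea with deliberately silly synthetic numbers: if shoe size and IQ were highly correlated, a simple fitted line would let one stand in for the other.

import numpy as np

rng = np.random.default_rng(0)
shoe_size = rng.normal(42, 2, size=500)
iq = 100 + 3 * (shoe_size - 42) + rng.normal(0, 5, size=500)   # pretend correlation

# Fit a least-squares line and use it to estimate a missing IQ from a shoe size.
slope, intercept = np.polyfit(shoe_size, iq, 1)
print(round(slope * 44 + intercept, 1))   # estimated IQ for shoe size 44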
