Dynamic clustering for panel data - time-series

I have panel data consisting of time series for 120 months, 45 institutions and approximately 8 variables for each one. I want to do a cluster analysis in order to detect stressed institutions based on dynamic clustering analysis. For instance, check if a stressed institution does move from one cluster to another, or if its behavior changes so much that it is no longer part of its own cluster.
The idea would be to use the information up to time t to cluster the institutions and get the clusters for each institution so it can evolve with new information and use all the information available up to that point from all the banks, with time varying clusters.
My first idea was to use statistical control techniques and anomaly detection for time series such as the ones in the package anomaly, but this procedure does not use all the information from the other banks, just its own. It might be that the whole system is stressed, so detecting an anomaly in one bank might be because of the system and not because of the particular bank.
I also tried using clustering in each period through hierarchical clustering, and did a decent job on classifying the institutions based on my knowledge of them. However, this procedure only uses data at each point in time, not all the data available up to that point.
I had the idea of using clustering methods for panel data at each point in time, using the data up to that point, and cycling through each month to get dynamic clusters using the whole dataset. However, I don't know if this approach makes sense, or if there are better methods to do this kind of analysis.
Thank you very much!

Related

Supervised learning by creating own labels

Scenario - I have data that does not have labels but I can create a function to label the data based on behavior and deploy the model so I don't have to keep labeling the data. Is this considered machine learning?
Objective: classify accounts with Volume spikes based on high medium low labels to deploy on big data (trillions of lines of data)
Data: the data I have includes the following attributes:
Account, Time, Date, Volume amount.
Method:
Create a new feature column called "spike" and create a pandas function to ID a spike greater than 5. Is this feature engineering?
Next I create my label column and classify it as low medium or high spike.
Next I Train a machine learning classifier and deploy it to label future accounts with similar patterns in big data.
Thoughts on this process? Is this approach correct for Machine learning?
1st question:
If your algorithm takes the decision, that is, put a label in a sample, based on the set of samples that you have, I'd say it's a machine learning algorithm. But if you design a code that takes into account your experience regarding the data, I think it's not an ML method. In brief, ML look at the data to get patterns and insights from them. I don't know why you're doing that, but is it need to be an ML algorithm? Sometimes you can solve the problem in a very simple way, without using ML.
2nd question: I'm afraid not. Select your data attributes (ex: Account, Time, Date, Volume amount), checking their correlations, try to figure out if you have a dominant one, etc. This process is pre ML. The feature engineering will select what are the best features to present to our algorithm in order to perform the classification (in your case)
3rd question: I think it's fair enough to start playing with some ML algorithms, such as KNN, SVM, NNs, Decision Tree, etc.

How to solve the "cold start" problem in computer vision based deep learning models?

By “Cold Start” I mean that often computer vision models for object detection or semantic segmentation require about 5000 images per class. So if an idea if floated within the company for e.g. we want to use object detection to count the number of wood logs when the truck is dispatched and then use the same app to count the number that is received.
So now the challenge is that you have only a few images of woods logs on a truck but to train any model you need thousands, so what do practitioners typically do for these prototypes?
Because at this stage it is not clear what model to try? It is also not very feasible to ask business to invest in collecting thousands of images of logs and label them?
That is why I am calling this “Cold Start”. How do you start?
What I have looked into is Conditional GANs, Pix-2-Pix but I am trying to understand the recommended method on how to start when you have very few images per object class.
I expect that when I drop a few images in a folder and call this library I end up getting a lot more images per class so I can then start my prototyping.
Note that asking for software libraries is specifically off-topic here.
No, there is no magic solution: if your data set doesn't have enough information in its images to train a hand-crafted model, no amount of software will change that fact. However, the first approach is to challenge that "fact": how do you know that you don't have enough images? What happened when you used what you have to train a model? You will train for more epochs before the model converges, but you should be able to achieve far better than random accuracy by training a comparable quantity of iterations.
I seriously doubt that you'll need to collect and label thousands of images: you have a very restricted paradigm, photos of log trucks taken from an vantage point you control. Training a model to count non-overlapping near-circles will take much less differentiation than, say, distinguishing motor vehicles from postal boxes.
Experiment with the basic models you have at hand -- you already have much more of the solution than you realize. If your data set is too small, go out the yard with a digital camera and get twice as many, three times, whatever you need. Flip the images left-right to get more input.
Does that get you moving?
Transfer learning solves the problem you are describing as "Cold Start". Basically you can import the weights obtained after training using a big and open dataset and just fine-tune them using the smaller dataset you already have. Data augmentation, freezing some of the layers, etc may help improving the results of a fine-tuned model.

Developing a web bot crawler system after clustering the bots

I am trying to identifying high hitting IP's over a duration of time.
I have performed clustering on certain features, got a 12 cluster output, out of which 8 were bots and 4 were humans, as per the centroid values of the cluster.
Now What technique can I use to analyze the data within the cluster, so as to know that the data points within the cluster are in the right clusters.
In other words, are there any statistical methods to check the quality of the clusters.?
What I can think of is , if I take a data point which is at the boundary of the cluster, If I measure the distance of this point from the other Centroids and from its own Centroid, then can I get to know how close the two clusters are to my point and may be how well are my data divided in cluster ??
Kindly guide how to measure the quality of my clusters, with respect to data points and what are the standard technique to do so.
Thanks in Advance.!!
Cheers.!
With k-means, the chances are that you already have a big heap of garbage. Because it is an incredibly crude heuristic, and unless you were extremely careful at designing your features (at which point you would already know how to check the quality of a cluster assignment) the result is barely better than choosing a few centroids at random. In particular with k-means, which is very sensitive to the scale of your features. The results are very unreliable if you have features of different types and scale (e.g. height, shoe size, body mass, BMI: k-means on such variables is statistical nonsense).
Do not dump your data into a clustering algorithm and expect to get something useful. Clustering follows the GIGO principle: garbage-in-garbage-out. Instead, you need to proceed as follows:
identify what is a good cluster in your domain. This is very data and problem dependant.
choose a clustering algorithm with a very simialar objective.
find a data transformation, distance function or modification of the clustering algorithm to align with your objective
carefully double-check the result for trivial, unwanted, biased and random solutions.
For example, if you blindly threw customer data into clustering algorithm, the chances are it will decide the best answer to be 2 clusters, corresponding to the attributes "gender=m" and "gender=f"simply because this is the most extreme factor in your data. But because this is a know attribute, this result is entirely useless.

Information retrieval (IR) vs data mining vs Machine Learning (ML)

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
From people with experience in these fields, what exactly draws the line between these?
This is just the view of one person (formally trained in ML); others might see things quite differently.
Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.
Of the terms you mentioned, "Machine Learning" is the one most used by Academic Departments to describe their Curricula, their academic departments, and their research programs, as well as the term most used in academic journals and conferences proceedings. ML is clearly the least context-dependent of the terms you mentioned.
Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications, often are, but that's not a formal requirement. In addition, the term Data Mining seems usually to refer to application of some process flow on big data (i.e, > 2BG) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.
So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an Infrastructure-Algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval. But it's only one source of tools. But IR doesn't depend on ML--for instance, a particular IR project might be storage and rapid retrieval of the fully-indexed data responsive to a user's search query IR, the crux of which is optimizing performance of the data flow, i.e., the round-trip from query to delivering the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps) which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc. on the variables (columns).
Lastly consider the Netflix Prize. This competition was directed solely to Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: accuracy of the predictions returned by the algorithm. Imagine if the 'Netflix Prize' were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately access the algorithm's performance in the actual commercial setting--so for instance overall execution speed (how quickly are the recommendations delivered to the user) would probably be considered along with accuracy.
The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
I'd try to draw the line as follows:
Information retrieval is about finding something that already is part of your data, as fast as possible.
Machine learning are techniques to generalize existing knowledge to new data, as accurate as possible.
Data mining is primarly about discovering something hidden in your data, that you did not know before, as "new" as possible.
They intersect and often use techniques of one another. DM and IR both use index structures to accelerate processes. DM uses a lot of ML techniques, for example a pattern in the data set that is useful for generalization might be a new knowledge.
They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.
I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, ML is somewhere in between.
Data mining is about discovering hidden patterns or unknown knowledge, which can be used
for decision making by people.
Machine learning is about learning a model to classify new objects.

Using Artificial Intelligence (AI) to predict Stock Prices

Given a set of data very similar to the Motley Fool CAPS system, where individual users enter BUY and SELL recommendations on various equities. What I would like to do is show each recommendation and I guess some how rate (1-5) as to whether it was good predictor<5> (ie. correlation coefficient = 1) of the future stock price (or eps or whatever) or a horrible predictor (ie. correlation coefficient = -1) or somewhere in between.
Each recommendation is tagged to a particular user, so that can be tracked over time. I can also track market direction (bullish / bearish) based off of something like sp500 price. The components I think that would make sense in the model would be:
user
direction (long/short)
market direction
sector of stock
The thought is that some users are better in bull markets than bear (and vice versa), and some are better at shorts than longs- and then a combination the above. I can automatically tag the market direction and sector (based off the market at the time and the equity being recommended).
The thought is that I could present a series of screens and allow me to rank each individual recommendation by displaying available data absolute, market and sector out performance for a specific time period out. I would follow a detailed list for ranking the stocks so that the ranking is as objective as possible. My assumption is that a single user is right no more than 57% of the time - but who knows.
I could load the system and say "Lets rank the recommendation as a predictor of stock value 90 days forward"; and that would represent a very explicit set of rankings.
NOW here is the crux - I want to create some sort of machine learning algorithm that can identify patterns over a series of time so that as recommendations stream into the application we maintain a ranking of that stock (ie. similar to correlation coefficient) as to the likelihood of that recommendation (in addition to the past series of recommendations ) will affect the price.
Now here is the super crux. I have never taken an AI class / read an AI book / never mind specific to machine learning. So I cam looking for guidance - sample or description of a similar system I could adapt. Place to look for info or any general help. Or even push me in the right direction to get started...
My hope is to implement this with F# and be able to impress my friends with a new skill set in F# with an implementation of machine learning and potentially something (application / source) I can include in a tech portfolio or blog space;
Thank you for any advice in advance.
I have an MBA, and teach data mining at a top grad school.
The term project this year was to predict stock price movements automatically from news reports. One team had 70% accuracy, on a reasonably small sample, which ain't bad.
Regarding your question, a lot of companies have made a lot of money on pair trading (find a pair of assets that normally correlate, and buy/sell pair when they diverge). See the writings of Ed Thorpe, of Beat the Dealer. He's accessible and kinda funny, if not curmudgeonly. He ran a good hedge fund for a long time.
There is probably some room in using data mining to predict companies that will default (be unable to make debt payments) and shorting† them, and use the proceeds to buy shares in companies less likely to default. Look into survival analysis. Search Google Scholar for "predict distress" etc in finance journals.
Also, predicting companies that will lose value after an IPO (and shorting them. edit: Facebook!). There are known biases, in academic literature, that can be exploited.
Also, look into capital structure arbitrage. This is when the value of the stocks in a company suggest one valuation, but the value of the bonds or options suggest another value. Buy the cheap asset, short the expensive one.
Techniques include survival analysis, sequence analysis (Hidden Markov Models, Conditional Random Fields, Sequential Association Rules), and classification/regression.
And for the love of God, please read Fooled By Randomness by Taleb.
† shorting a stock usually involves calling your broker (that you have a good relationship with) and borrowing some shares of a company. Then you sell them to some poor bastard. Wait a while, hopefully the price has gone down, you buy some more of the shares and give them back to your broker.
My Advice to You:
There are several Machine Learning/Artificial Intelligence (ML/AI) branches out there:
http://www-formal.stanford.edu/jmc/whatisai/node2.html
I have only tried genetic programming, but in the "learning from experience" branch you will find neural nets. GP/GA and neural nets seem to be the most commonly explored methodologies for the purpose of stock market predictions, but if you do some data mining on Predict Wall Street, you might be able to utilize a Naive Bayes classifier to do what you're interested in doing.
Spend some time learning about the various ML/AI techniques, get a small data set and try to implement some of those algorithms. Each one will have its strengths and weaknesses, so I would recommend that you try to combine them using Naive Bays classifier (or something similar).
My Experience:
I'm working on the problem for my Masters Thesis so I'll pitch my results using Genetic Programming: www.twitter.com/darwins_finches
I started live trading with real money in 09/09/09.. yes, it was a magical day! I post the GP's predictions before the market opens (i.e. the timestamps on twitter) and I also place the orders before the market opens. The profit for this period has been around 25%, we've consistently beat the Buy & Hold strategy and we're also outperforming the S&P 500 with stocks that are under-performing it.
Some Resources:
Here are some resources that you might want to look into:
Max Dama's blog: http://www.maxdama.com/search/label/Artificial%20Intelligence
My blog: http://mlai-lirik.blogspot.com/
AI Stock Market Forum: http://www.ai-stockmarketforum.com/
Weka is a data mining tool with a collection of ML/AI algorithms: http://www.cs.waikato.ac.nz/ml/weka/
The Chatter:
The general consensus amongst "financial people" is that Artificial Intelligence is a voodoo science, you can't make a computer predict stock prices and you're sure to loose your money if you try doing it. None-the-less, the same people will tell you that just about the only way to make money on the stock market is to build and improve on your own trading strategy and follow it closely.
The idea of AI algorithms is not to build Chip and let him trade for you, but to automate the process of creating strategies.
Fun Facts:
RE: monkeys can pick better than most experts
Apparently rats are pretty good too!
I understand monkeys can pick better than most experts, so why not an AI? Just make it random and call it an "advanced simian Mersenne twister AI" or something.
Much more money is made by the sellers of "money-making" systems then by the users of those systems.
Instead of trying to predict the performance of companies over which you have no control, form a company yourself and fill some need by offering a product or service (yes, your product might be a stock-predicting program, but something a little less theoretical is probably a better idea). Work hard, and your company's own value will rise much quicker than any gambling you'd do on stocks. You'll also have plenty of opportunities to apply programming skills to the myriad of internal requirements your own company will have.
If you want to go down this long, dark, lonesome road of trying to pick stocks you may want to look into data mining techniques using advanced data mining software such as SPSS or SAS or one of the dozen others.
You'll probably want to use a combination or technical indicators and fundamental data. The data will more than likely be highly correlated so a feature reduction technique such as PCA will be needed to reduce the number of features.
Also keep in mind your data will constantly have to be updated, trimmed, shuffled around because market conditions will constantly be changing.
I've done research with this for a grad level class and basically I was somewhat successful at picking whether a stock would go up or down the next day but the number of stocks in my data set was fairly small (200) and it was over a very short time frame with consistent market conditions.
What I'm trying to say is what you want to code has been done in very advanced ways in software that already exists. You should be able to input your data into one of these programs and using either regression, or decision trees or clustering be able to do what you want to do.
I have been thinking of this for a few months.
I am thinking about Random Matrix Theory/Wigner's distribution.
I am also thinking of Kohonen self-learning maps.
These comments on speculation and past performance apply to you as well.
I recently completed my masters thesis on deep learning and stock price forecasting. Basically, the current approach seems to be LSTM and other deep learning models. There are also 10-12 technical indicators (TIs) based on moving average that have been shown to be highly predictive for stock prices, especially indexes such as SP500, NASDAQ, DJI, etc. In fact, there are libraries such as pandas_ta for computing various TIs.
I represent a group of academics that are trying to predict stocks in a general form that can also be applied to anything, even the rating of content.
Our algorithm, which we describe as truth seeking, works as follows.
Basically each participant has their own credence rating. This means that the higher your credence or credibility, then the more their vote counts. Credence is worked out by how close to the weighted credence each vote is. It's like you get a better credence value the closer you get to the average vote that has already been adjusted for credence.
For example, let's say that everyone is predicting that a stock's value will be at value X in 30 day's time (a future's option). People who predict on the average get a better credence. The key here is that the individual doesn't know what the average is, only the system. The system is tweaked further by weighting the guesses so that the target spot that generates the best credence is those votes that are already endowed with more credence. So the smartest people (historically accurate) project the sweet spot that will be used for further defining who gets more credence.
The system can be improved too to adjust over time. For example, when you find out the actual value, those people who guessed it can be rewarded with a higher credence. In cases where you can't know the future outcome, you can still account if the average weighted credence changes in the future. People can be rewarded even more if they spotted the trend early. The point is we don't need to even know the outcome in the future, just the fact that the weighted rating changed in the future is enough to reward people who betted early on the sweet spot.
Such a system can be used to rate anything from stock prices, currency exchange rates or even content itself.
One such implementation asks people to vote with two parameters. One is their actual vote and the other is an assurity percentage, which basically means how much a particular participant is assured or confident of their vote. In this way, a person with a high credence does not need to risk downgrading their credence when they are not sure of their bet, but at the same time, the bet can be incorporated, it just won't sway the sweet spot as much if a low assurity is used. In the same vein, if the guess is directly on the sweet spot, with a low assurity, they won't gain the benefits as they would have if they had used a high assurity.

Resources