Repeated measures for 3+ groups comparing percentages - SPSS

I'm new to SPSS. I have data on skin cancer diagnoses for the years 2004-2018. I want to compare how the distribution of new cases across body sites changes between years. I've managed to create a crosstab and a grouped bar graph that show the percentages, but I would like to run a statistical analysis to see whether the changes in distribution over time are significant. The categories I have are face, trunk, arm, leg, and not specified. The number of cases per year varies greatly, which is why I want to compare the proportions (percentages) between the body sites. The only explanations I've found all refer to repeated observations of the same subject, which is not the case here (a person is only included at their first diagnosis, so they can only appear in one of the years).
The analysis would be similar to comparing the percentage shares of 3+ parties in an election and how that distribution changes over the years, but I haven't found any tutorials for that. Please help!

The CTABLES or Custom Tables procedure, if you have access to it, will let you create a crosstabulation like the one you mention, and then test both for overall changes in the distribution of types and for differences between each pair of columns within each row.
More generally, problems like this are usually handled with loglinear or logit models.
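For readers without Custom Tables, the overall test of whether the site distribution changes across years is a chi-square test of independence on the year-by-site count table. A minimal sketch in Python (not SPSS), with made-up counts:

    # Chi-square test of independence on the year-by-body-site count table:
    # "does the distribution of sites change across years?"
    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows = years, columns = body sites (face, trunk, arm, leg, not specified).
    # These counts are invented purely for illustration.
    counts = np.array([
        [120,  90, 40, 35, 15],   # 2004
        [150, 110, 55, 40, 20],   # 2005
        [200, 160, 80, 60, 25],   # 2006
    ])

    chi2, p, dof, expected = chi2_contingency(counts)
    print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
    # A small p-value indicates the site distribution differs between years.
    # Pairwise column comparisons (as CTABLES offers) would need follow-up
    # tests with a multiplicity correction.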

Related

Does adding a Date column lead to overfitting?

I'm working on a dataset and want to predict whether it will rain or not; should I include the date column? I haven't built the model yet, but I think it will lead to overfitting.
I don't think the datetime itself is a vital feature. A useful derived feature could be the season, although seasonal patterns are nowadays shifting rapidly due to climate change.
In any case, since this is a time-series problem, the result depends much more on the conditions of the prior days, though there are of course subtle changes that make it harder to predict.
There are some existing works you can find below:
https://pdfs.semanticscholar.org/2761/8afb77c5081d942640333528943149a66edd.pdf
(used 2 prior days info as features)
https://stackabuse.com/using-machine-learning-to-predict-the-weather-part-1/ (3 prior days info as features)
I think these are good starting points.
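As a rough sketch of the approach in those links (building features from the prior days rather than feeding the raw date), here is how lag features could be built with pandas; the file and column names are made up:

    # Sketch: build "prior days" lag features for a daily weather table.
    # Column names (date, temp, humidity, rained) are hypothetical.
    import pandas as pd

    df = pd.read_csv("weather.csv", parse_dates=["date"]).sort_values("date")

    for col in ["temp", "humidity", "rained"]:
        for lag in (1, 2, 3):                      # info from the 1-3 prior days
            df[f"{col}_lag{lag}"] = df[col].shift(lag)

    df = df.dropna()                               # first rows lack a full history
    X = df.drop(columns=["date", "rained"])        # the raw date itself is dropped
    y = df["rained"]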

RapidMiner - Time Series Segmentation

I am fairly new to RapidMiner. I have a historical financial data set (with attributes Date, Open, Close, High, Low, Volume Traded) from Yahoo Finance and I am trying to find a way to segment it as in the image below:
I am also planning to perform this segmentation on more than one such data set and then compare the segmentations (i.e. Segment 1 of Data Set A against Segment 1 of Data Set B), so I would preferably need the same number of segments in each.
I am aware that certain extensions are available within the RapidMiner Marketplace, however I do not believe that any of them have what I am looking for. Your assistance is much appreciated.
Edit: I am currently trying to replicate the Voting-Based Outlier Mining for Multiple Time Series (V-BOMM) with multiple data sets. So far, I am able to perform the operation by recording and comparing common dates against each other.
However, I would like to enhance the process to compare Segments rather than simply dates. I have gone through the existing functionalities of RapidMiner, and thus far I don't believe any fit my requirements.
I have also considered Dynamic Time Warping, but I can't seem to find an available functionality in RapidMiner.
Ultimate question: Can someone point me to functionality that can help replicate the segmentation in the attached image so that the segments can be compared between historical data sets in RapidMiner? Also, can someone guide me on how to implement Dynamic Time Warping in RapidMiner?
I would use the new version of the Time Series extension and its windowing features to segment the time series into whatever parts you want. There is a nice explanation of the new tools in the blog section of the community.
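For the cross-data-set comparison, the essential idea is to cut every series into the same fixed number of windows. Outside RapidMiner, a minimal Python sketch of that idea (with random stand-in data) could look like this:

    # Not RapidMiner: a sketch of the windowing idea, splitting each price
    # series into the same fixed number of segments so that segment k of
    # data set A can be compared with segment k of data set B.
    import numpy as np

    def segment(series, n_segments):
        """Split a 1-D array into n_segments roughly equal, ordered chunks."""
        return np.array_split(np.asarray(series), n_segments)

    close_a = np.random.rand(250)   # stand-ins for two Close-price series
    close_b = np.random.rand(250)

    segs_a = segment(close_a, 10)
    segs_b = segment(close_b, 10)

    # Compare corresponding segments, e.g. by mean level or any distance measure.
    for k, (sa, sb) in enumerate(zip(segs_a, segs_b)):
        print(k, abs(sa.mean() - sb.mean()))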

Classification of industry based on tags

I have a dataset (1M entries) on companies where all companies are tagged based on what they do.
For example, Amazon might be tagged with "Retail;E-Commerce;SaaS;Cloud Computing" and Google would have tags like "Search Engine;Advertising;Cloud Computing".
Now I want to analyze a cluster of companies, e.g. all online marketplaces like Amazon, eBay, Etsy, and the like. There is no single tag I can look for, so I have to use a set of tags to quantify the likelihood that a company is a marketplace.
For example tags like "Retail", "Shopping", "E-Commerce" are good tags, but then there might be some small consulting agencies or software development firms that consult / build software for online marketplaces and have tags like "consulting;retail;e-commerce" or "software development;e-commerce;e-commerce tools", which I want to exclude as they are not online marketplaces.
I'm wondering what the best way is to identify all online marketplaces in my dataset. Which machine learning algorithm is suited to selecting the maximum number of companies that are in the industry I'm looking for while excluding the ones that are obviously not part of it?
I thought about supervised learning, but I'm not sure because of a few issues:
Labelling is needed, which means I would have to go through thousands of companies and flag them for multiple industries (marketplace, finance, fashion, ...), as I'm interested in 20-30 industries overall.
There are more than 1,000 tags associated with the companies. How would I define my features? One feature per tag would lead to massive dimensionality.
Are there any best practices for such cases?
UPDATE:
It should be possible to assign companies to multiple clusters, e.g. Amazon should be identified as "Marketplace", but also as "Cloud Computing" or "Online Streaming".
I used tf-idf and k-means to identify tags that form clusters, but I don't know how to assign likelihoods / scores to companies that indicate how well a company fits into a cluster based on its tags.
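One possible way to get a per-company score, sketched below with toy data: build tf-idf vectors over the semicolon-separated tags, run k-means, and use cosine similarity to the cluster centroid as the "fit" score. This is only an illustration of the idea, not a claim that it solves the assignment problem:

    # Sketch: tf-idf over the tag strings, k-means clusters, and cosine
    # similarity to the centroid as a per-company "fit" score. Data is made up.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    tags = [
        "Retail;E-Commerce;SaaS;Cloud Computing",
        "Search Engine;Advertising;Cloud Computing",
        "consulting;retail;e-commerce",
        "software development;e-commerce;e-commerce tools",
    ]

    vec = TfidfVectorizer(tokenizer=lambda s: s.lower().split(";"))
    X = vec.fit_transform(tags)

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # Higher score = the company's tag vector sits nearer the core of the cluster.
    scores = cosine_similarity(X, km.cluster_centers_)
    for company_tags, label in zip(tags, km.labels_):
        print(label, company_tags)
    print(scores.round(2))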
UPDATE:
While tf-idf in combination with k-means delivered pretty neat clusters (meaning the companies within a cluster were actually similar), I also tried to calculate probabilities of belonging to a cluster with Gaussian Mixture Models (GMMs), which led to completely messed up results where the companies within a cluster were more or less random or came from a handful of different industries.
No idea why this happened though...
UPDATE:
Found the error. I applied a PCA before the GMM to reduce dimensionality, and this apparently caused the random results. Removing the PCA improved the results significantly.
However, the resulting posterior probabilities of my GMM are exactly 0 or 1 about 99.9% of the time. Is there a parameter (I'm using sklearn's BayesianGaussianMixture) that needs to be adjusted to get more useful probabilities that are a little more centered? Right now everything < 1.0 is effectively not part of a cluster anymore, while a few outliers get a posterior of 1.0 and are thus assigned to an industry. For example, a company with "Baby;Consumer" gets assigned to the "Consumer Electronics" cluster, even though only 1 out of 2 tags may suggest this. I'd like such a case to get a probability < 1 so that I can define a threshold based on some cross-validation.
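For what it's worth, the usual knobs in sklearn's BayesianGaussianMixture to experiment with when the posteriors collapse to hard 0/1 values are the covariance type, the covariance regularization floor, and the weight concentration prior. A hedged sketch on random stand-in data (no guarantee it softens the posteriors for your tag vectors):

    # Sketch: BayesianGaussianMixture with the parameters that most directly
    # affect how soft the posterior responsibilities are. Whether they help
    # depends on the data, so treat this as an experiment, not a fix.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    X = np.random.rand(500, 20)          # stand-in for the company vectors

    gmm = BayesianGaussianMixture(
        n_components=10,
        covariance_type="diag",          # fewer parameters than "full" in high dims
        reg_covar=1e-3,                  # larger floor -> wider, softer components
        weight_concentration_prior=0.1,  # smaller -> fewer active components
        max_iter=500,
        random_state=0,
    ).fit(X)

    proba = gmm.predict_proba(X)         # posterior responsibilities per company
    print(proba.max(axis=1)[:10])        # check whether these are still all ~1.0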

Interdependence of parameters in Machine Learning problems

Let's say we have 10-D data from a class of students. The data involves parameters like Name, Grades, Courses, No. of hours of lectures, etc. for all the students in the class. Now we want to analyze the impact of No. of hours of lectures on Grades.
If we look closely at our parameters, the Name of the student has nothing to do with Grades, but the Courses taken by a student "might" have an impact on Grades.
So there could be parameters which depend on each other while others are totally independent. My question is: how do we decide which parameters have an impact on our classification/regression problem and which don't?
PS: I am not looking for exact solutions. If someone can just show me the right direction or keywords for google search, that should be sufficient.
Thanks.
The technique you're looking for is called dimension reduction. The Stanford machine learning class goes over one method (principal component analysis).
This is the problem of independent component analysis. ICA is a family of methods for finding the statistically independent components of data sets. This is a difficult problem, and there exists a large variety of algorithms for finding good solutions. A popular algorithm is FastICA.
There are also related concepts of whitening and decorrelation.
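As a toy illustration of those ideas, here is PCA whitening and FastICA with sklearn on a synthetic mixture of independent sources:

    # Toy sketch: PCA for decorrelation/whitening and FastICA for recovering
    # statistically independent components, both via sklearn.
    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    S = rng.laplace(size=(1000, 3))          # 3 independent source signals
    A = rng.normal(size=(3, 3))              # mixing matrix
    X = S @ A.T                              # observed, mixed features

    X_white = PCA(whiten=True).fit_transform(X)                       # decorrelated, unit variance
    S_est = FastICA(n_components=3, random_state=0).fit_transform(X)  # estimates of the sources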

Using Artificial Intelligence (AI) to predict Stock Prices

Given a set of data very similar to the Motley Fool CAPS system, where individual users enter BUY and SELL recommendations on various equities, what I would like to do is take each recommendation and somehow rate it (1-5) as to whether it was a good predictor (i.e. correlation coefficient = 1) of the future stock price (or EPS or whatever), a horrible predictor (i.e. correlation coefficient = -1), or somewhere in between.
Each recommendation is tagged to a particular user, so that can be tracked over time. I can also track market direction (bullish / bearish) based on something like the S&P 500 price. The components I think would make sense in the model are:
user
direction (long/short)
market direction
sector of stock
The thought is that some users are better in bull markets than bear markets (and vice versa), and some are better at shorts than longs - and then a combination of the above. I can automatically tag the market direction and sector (based on the market at the time and the equity being recommended).
The thought is that I could present a series of screens and rank each individual recommendation by displaying the available data: absolute, market, and sector outperformance over a specific time period. I would follow a detailed checklist for ranking the stocks so that the ranking is as objective as possible. My assumption is that a single user is right no more than 57% of the time - but who knows.
I could load the system and say "Let's rank the recommendation as a predictor of stock value 90 days forward", and that would represent a very explicit set of rankings.
NOW here is the crux - I want to create some sort of machine learning algorithm that can identify patterns over time so that, as recommendations stream into the application, we maintain a rating for that stock (i.e. similar to a correlation coefficient) as to the likelihood that the recommendation (in addition to the past series of recommendations) will affect the price.
Now here is the super crux. I have never taken an AI class or read an AI book, never mind anything specific to machine learning. So I am looking for guidance: a sample or description of a similar system I could adapt, places to look for info, or any general help. Or even a push in the right direction to get started...
My hope is to implement this in F#, impress my friends with a new skill set by implementing machine learning, and potentially produce something (application / source) I can include in a tech portfolio or blog.
Thank you for any advice in advance.
I have an MBA, and teach data mining at a top grad school.
The term project this year was to predict stock price movements automatically from news reports. One team had 70% accuracy, on a reasonably small sample, which ain't bad.
Regarding your question, a lot of companies have made a lot of money on pair trading (find a pair of assets that normally correlate, and buy/sell the pair when they diverge). See the writings of Ed Thorp, of Beat the Dealer fame. He's accessible and kinda funny, if not curmudgeonly. He ran a good hedge fund for a long time.
There is probably some room for using data mining to predict companies that will default (be unable to make debt payments), shorting† them, and using the proceeds to buy shares in companies less likely to default. Look into survival analysis. Search Google Scholar for "predict distress" etc. in finance journals.
Also, predicting companies that will lose value after an IPO (and shorting them - edit: Facebook!). There are known biases in the academic literature that can be exploited.
Also, look into capital structure arbitrage. This is when the value of the stocks in a company suggest one valuation, but the value of the bonds or options suggest another value. Buy the cheap asset, short the expensive one.
Techniques include survival analysis, sequence analysis (Hidden Markov Models, Conditional Random Fields, Sequential Association Rules), and classification/regression.
And for the love of God, please read Fooled By Randomness by Taleb.
† Shorting a stock usually involves calling your broker (one you have a good relationship with) and borrowing some shares of a company. Then you sell them to some poor bastard. Wait a while; hopefully the price has gone down, then you buy the shares back and return them to your broker.
My Advice to You:
There are several Machine Learning/Artificial Intelligence (ML/AI) branches out there:
http://www-formal.stanford.edu/jmc/whatisai/node2.html
I have only tried genetic programming, but in the "learning from experience" branch you will find neural nets. GP/GA and neural nets seem to be the most commonly explored methodologies for the purpose of stock market predictions, but if you do some data mining on Predict Wall Street, you might be able to utilize a Naive Bayes classifier to do what you're interested in doing.
Spend some time learning about the various ML/AI techniques, get a small data set, and try to implement some of those algorithms. Each one will have its strengths and weaknesses, so I would recommend that you try to combine them using a Naive Bayes classifier (or something similar).
My Experience:
I'm working on the problem for my master's thesis, so I'll pitch my results using Genetic Programming: www.twitter.com/darwins_finches
I started live trading with real money on 09/09/09... yes, it was a magical day! I post the GP's predictions before the market opens (i.e. the timestamps on Twitter) and I also place the orders before the market opens. The profit for this period has been around 25%; we've consistently beaten the Buy & Hold strategy and we're also outperforming the S&P 500 with stocks that are under-performing it.
Some Resources:
Here are some resources that you might want to look into:
Max Dama's blog: http://www.maxdama.com/search/label/Artificial%20Intelligence
My blog: http://mlai-lirik.blogspot.com/
AI Stock Market Forum: http://www.ai-stockmarketforum.com/
Weka is a data mining tool with a collection of ML/AI algorithms: http://www.cs.waikato.ac.nz/ml/weka/
The Chatter:
The general consensus amongst "financial people" is that Artificial Intelligence is a voodoo science: you can't make a computer predict stock prices and you're sure to lose your money if you try. Nonetheless, the same people will tell you that just about the only way to make money on the stock market is to build and improve your own trading strategy and follow it closely.
The idea of AI algorithms is not to build Chip and let him trade for you, but to automate the process of creating strategies.
Fun Facts:
RE: monkeys can pick better than most experts
Apparently rats are pretty good too!
I understand monkeys can pick better than most experts, so why not an AI? Just make it random and call it an "advanced simian Mersenne twister AI" or something.
Much more money is made by the sellers of "money-making" systems than by the users of those systems.
Instead of trying to predict the performance of companies over which you have no control, form a company yourself and fill some need by offering a product or service (yes, your product might be a stock-predicting program, but something a little less theoretical is probably a better idea). Work hard, and your company's own value will rise much quicker than any gambling you'd do on stocks. You'll also have plenty of opportunities to apply programming skills to the myriad of internal requirements your own company will have.
If you want to go down this long, dark, lonesome road of trying to pick stocks you may want to look into data mining techniques using advanced data mining software such as SPSS or SAS or one of the dozen others.
You'll probably want to use a combination of technical indicators and fundamental data. The data will more than likely be highly correlated, so a feature reduction technique such as PCA will be needed to reduce the number of features.
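As a sketch of that feature-reduction step (the indicator matrix here is a random stand-in), PCA can be asked to keep just enough components to explain, say, 95% of the variance:

    # Sketch: standardize the correlated indicator matrix, then keep the
    # principal components that explain ~95% of the variance.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    indicators = np.random.rand(1000, 40)             # rows = days, cols = features

    scaled = StandardScaler().fit_transform(indicators)
    pca = PCA(n_components=0.95).fit(scaled)          # keep 95% of the variance
    reduced = pca.transform(scaled)
    print(indicators.shape[1], "->", reduced.shape[1], "features")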
Also keep in mind that your data will constantly have to be updated, trimmed, and shuffled around, because market conditions will constantly be changing.
I've done research on this for a grad-level class, and I was somewhat successful at predicting whether a stock would go up or down the next day, but the number of stocks in my data set was fairly small (200) and it covered a very short time frame with consistent market conditions.
What I'm trying to say is that what you want to code has been done in very advanced ways in software that already exists. You should be able to input your data into one of these programs and, using regression, decision trees, or clustering, do what you want to do.
I have been thinking of this for a few months.
I am thinking about Random Matrix Theory/Wigner's distribution.
I am also thinking of Kohonen self-organizing maps.
These comments on speculation and past performance apply to you as well.
I recently completed my master's thesis on deep learning and stock price forecasting. Basically, the current approach seems to be LSTMs and other deep learning models. There are also 10-12 technical indicators (TIs) based on moving averages that have been shown to be highly predictive for stock prices, especially for indexes such as the S&P 500, NASDAQ, DJI, etc. In fact, there are libraries such as pandas_ta for computing various TIs.
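The answer names pandas_ta; purely as an illustration of what such moving-average indicators are, here is a plain-pandas sketch computing a simple moving average, an exponential moving average, and a MACD-style difference on a made-up closing-price series:

    # Plain-pandas sketch of a few moving-average style indicators.
    # "close" is a hypothetical daily closing-price series.
    import pandas as pd
    import numpy as np

    close = pd.Series(np.random.rand(300).cumsum() + 100)

    sma_10 = close.rolling(window=10).mean()            # simple moving average
    ema_10 = close.ewm(span=10, adjust=False).mean()    # exponential moving average
    macd = close.ewm(span=12, adjust=False).mean() - close.ewm(span=26, adjust=False).mean()

    features = pd.DataFrame({"sma_10": sma_10, "ema_10": ema_10, "macd": macd})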
I represent a group of academics that are trying to predict stocks in a general form that can also be applied to anything, even the rating of content.
Our algorithm, which we describe as truth seeking, works as follows.
Basically, each participant has their own credence rating. The higher your credence or credibility, the more your vote counts. Credence is worked out by how close each vote is to the credence-weighted average: you get a better credence value the closer you get to the average vote that has already been adjusted for credence.
For example, let's say that everyone is predicting that a stock's value will be at value X in 30 days' time (a futures option). People who predict close to the average get a better credence. The key here is that the individual doesn't know what the average is; only the system does. The system is tweaked further by weighting the guesses so that the target spot that generates the best credence is set by the votes that already carry more credence. So the smartest people (historically accurate) define the sweet spot that will be used for further deciding who gets more credence.
The system can also be improved to adjust over time. For example, when you find out the actual value, the people who guessed it can be rewarded with a higher credence. In cases where you can't know the future outcome, you can still account for whether the average weighted credence changes in the future. People can be rewarded even more if they spotted the trend early. The point is that we don't even need to know the outcome; the fact that the weighted rating changed later is enough to reward the people who bet early on the sweet spot.
Such a system can be used to rate anything from stock prices to currency exchange rates or even content itself.
One such implementation asks people to vote with two parameters. One is their actual vote and the other is an assurity percentage, which basically means how confident a particular participant is in their vote. This way, a person with a high credence does not need to risk downgrading their credence when they are not sure of their bet; the bet can still be incorporated, it just won't sway the sweet spot as much if a low assurity is used. In the same vein, if a guess lands directly on the sweet spot with a low assurity, they won't gain as much benefit as they would have with a high assurity.
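Purely as an illustration of my reading of this description (not the group's actual algorithm), a toy credence-weighted consensus and update could look like this:

    # Toy illustration of a credence-weighted vote and a credence update
    # based on closeness to the (hidden) weighted consensus.
    import numpy as np

    def weighted_consensus(votes, credence):
        """Credence-weighted average of the votes (the hidden 'sweet spot')."""
        return float(np.average(votes, weights=np.asarray(credence, dtype=float)))

    def update_credence(votes, credence, consensus, rate=0.1):
        """Raise credence for votes close to the consensus, lower it for distant ones."""
        votes = np.asarray(votes, dtype=float)
        closeness = 1.0 / (1.0 + np.abs(votes - consensus))   # 1.0 = spot on
        return np.asarray(credence) * (1 - rate) + rate * closeness

    votes = [105.0, 98.0, 150.0, 101.0]       # predicted price in 30 days
    credence = [1.0, 1.0, 1.0, 1.0]

    consensus = weighted_consensus(votes, credence)
    credence = update_credence(votes, credence, consensus)
    print(consensus, np.round(credence, 3))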
