How to calculate TP, FP, TN, FN using only N and 4 relationships between the cells? - analysis

Is there an easy way to calculate TP, FP, TN, FN using only the total number of participants and the relationships Sensitivity, Specificity, PPV, and NPV?
I'm doing a meta-analysis on the diagnostic accuracy of various cancer screening tests and I need the MADA counts for a program I wrote to easily compare the treatments and display the ROC curves.
I'd like to find the counts but the author only gives the relations and the total number of participants.
I could simply guess and check the cells until I get a close answer,
or I could just work through the algebra, because I think it's solvable.
I'd like to know if there is a package in R or some other language that solves this problem for me.
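For what it's worth, the algebra does close in a few lines: sensitivity, specificity and PPV pin down the prevalence, and the four cells then follow from N, with NPV serving as a consistency check. A minimal R sketch of that closed-form solution (not an existing package; the function name, the 0.02 tolerance and the rounding to whole counts are my own choices):

# Recover the 2x2 cells from N, sensitivity (Se), specificity (Sp) and PPV;
# the reported NPV is only used as a consistency check.
# Solving PPV = Se*p / (Se*p + (1-Sp)*(1-p)) for the prevalence p gives
#   p = PPV*(1-Sp) / (Se*(1-PPV) + PPV*(1-Sp))
cells_from_accuracy <- function(N, Se, Sp, PPV, NPV = NULL) {
  p  <- PPV * (1 - Sp) / (Se * (1 - PPV) + PPV * (1 - Sp))
  TP <- N * p * Se
  FN <- N * p * (1 - Se)
  TN <- N * (1 - p) * Sp
  FP <- N * (1 - p) * (1 - Sp)
  if (!is.null(NPV)) stopifnot(abs(TN / (TN + FN) - NPV) < 0.02)
  round(c(TP = TP, FP = FP, TN = TN, FN = FN))  # rounded, so counts may be off by one or two
}

cells_from_accuracy(N = 1000, Se = 0.80, Sp = 0.90, PPV = 0.774)
# TP = 240, FP = 70, TN = 630, FN = 60 for this toy input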

Related

Why can different stocks be merged together to build a single prediction model?

Given n samples with d features of stock A, we can build a (d+1) dimensional linear model to predict the profit. However, in some books, I found that if we have m different stocks with n samples and d features for each, then they merge these data to get m*n samples with d features to build a single (d+1) dimensional linear model to predict the profit.
My confusion is that different stocks usually have little connection with each other, and their profits are influenced by different factors and environments, so why can they be merged to build a single model?
If you are using R as your tool of choice, you might like the time series embedding howto and its appendix -- the mathematics behind that is Takens' theorem:
[Takens's theorem gives] conditions under which a chaotic dynamical system can be reconstructed from a sequence of observations of the state of a dynamical system.
It looks to me as though the statements you quote relate to exactly this theorem: for d features (we are lucky if we know that number -- we usually don't), we need d+1 dimensions.
If more time series are to be predicted, we can use the same embedding space as long as the features are the same. The dimensions d are usually simple variables (e.g. temperature for different energy commodity stocks) -- this example helped me to intuitively grasp the idea.
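As a toy illustration of the shared embedding space (a sketch only; the simulated series and the lag order d = 2 are made up): each series is turned into the same (d+1)-column lag matrix with base R's embed(), and the rows are stacked -- the m*n samples from the question -- to fit a single linear model.

# Two simulated series standing in for two stocks with the same feature structure.
set.seed(1)
series_a <- cumsum(rnorm(200))
series_b <- cumsum(rnorm(200))

d <- 2  # assumed embedding dimension (number of lags used as features)

# embed() returns a matrix whose first column is y_t and whose remaining columns are the lags.
make_lagged <- function(x, d) {
  m <- embed(x, d + 1)
  data.frame(y = m[, 1], m[, -1])
}

# Stack both series into one data set and fit one (d+1)-parameter linear model.
stacked <- rbind(make_lagged(series_a, d), make_lagged(series_b, d))
fit <- lm(y ~ ., data = stacked)
coef(fit)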
Further reading
Forecasting with Embeddings

Genetic algorithm - shortest path in weighted graph

I want to make a genetic algorithm that solves a shortest path problem in a weighted, connected graph. It is similar to the travelling salesman problem, but instead of a fully connected graph, the graph is just connected.
My idea is to randomly generate a path consisting of n-1 nodes for each chromosome in binary form, where the numbers indicate nodes in the path. Then I will choose the best depending on the sum of weights (if you can't go from A to B, I would give it a penalty) and crossover/mutate bits in it. Will it work? It feels a little like a smaller version of brute force. Is there a better way?
Thanks!
A genetic algorithm is pretty much a "smaller version of brute force". It is just a metaheuristic, not an optimization method with decent convergence guarantees. It basically depends on randomness to provide new solutions, so it is a "slightly better random search".
So "will it work"? Yes, it will do something; as long as you have enough randomness in mutation it will even (eventually) converge to the optimum. Will it work better than a random search? Hard to say -- this depends on dozens of factors, not only your encoding but also all the hyperparameters used, etc. In general, genetic algorithms are about trial and error. In particular, a chromosome representation that does not lose any information (yours does not) does not matter much, meaning that everything depends on a clever implementation of crossover and mutation (as long as chromosomes do not lose information, such representations are all equivalent).
You can use permutation coding GA. In permutation coding, you should give the start and end points. GA searches for the best chromosome with your fitness function. Candidate solutions (chromosomes) will be like 2-5-4-3-1 or 2-3-1-4-5 or 1-2-5-4-3 etc. So your solution depends on your fitness function. (Look at GA package for R to apply permutation GA easily.)
Connections are constraints for your problem. My best advice is to create a constraint matrix like this:
FirstPoint  SecondPoint  Connected
A           B            true
A           C            true
A           E            false
...         ...          ...
In standard TSP, only distances are considered. In your fitness function, you have to consider this matrix and add a penalty to the return value for each false.
Example chromosome: A-B-E-D-C
A-B: 1
B-E: 1
E-D: 4
D-C: 3
Fitness value: 9
Example chromosome: A-E-B-C-D
A-E: penalty
E-B: 1
B-C: 6
C-D: 3
Fitness value: 10 + penalty value.
Because your constraint is a hard constraint, you can use the maximum integer value as the penalty. GA will find the best solution. :)
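For illustration, a minimal sketch of that penalised fitness with the GA package mentioned above (the five-node weight matrix and the penalty value are invented; a recent version of GA with lower/upper arguments is assumed):

library(GA)

# Symmetric weight matrix for 5 nodes; NA means the two nodes are not connected.
W <- matrix(c( 0,  1, NA, NA,  2,
               1,  0,  6, NA,  1,
              NA,  6,  0,  3, NA,
              NA, NA,  3,  0,  4,
               2,  1, NA,  4,  0), nrow = 5, byrow = TRUE)

penalty <- 1e6  # large penalty for every missing edge (the hard constraint)

# ga() maximises, so return the negative of the penalised path length.
path_fitness <- function(perm) {
  w <- W[cbind(perm[-length(perm)], perm[-1])]  # weights of consecutive edges in the path
  -sum(ifelse(is.na(w), penalty, w))
}

res <- ga(type = "permutation", fitness = path_fitness,
          lower = 1, upper = nrow(W), popSize = 50, maxiter = 200, seed = 42)
res@solution  # best visiting order found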

How to decide p of ACF and q of PACF in AR, MA, ARMA and ARIMA?

I am confused about how to calculate the p of the ACF and the q of the PACF in AR, MA, ARMA and ARIMA models. For example, in R, we use acf or pacf to get the best p and q.
However, based on what I have read, p is the order of the AR part and q is the order of the MA part. Let's say p = 2; then AR(2) is supposed to be y_t = a*y_{t-1} + b*y_{t-2} + c. We can compute the ACF (in R) at lag = 1, 2, 3, ... to find which lag gives the largest ACF value. The same applies to the MA part for deciding q. But does this mean that p and q have already been set?
I guess these are the steps, but I am not sure if I am right.
So, for R's functions acf and pacf, is this the real process:
1. For p=1, set lag=1,2,3,...max to see which lag has the biggest autocorrelation value.
2. For p=2,3,4..., do the same thing to find the lags.
3. Compare those values with each other. Say the largest autocorrelation value occurs at p=2 and lag=4; do we then say the order of AR, i.e. p, is 2?
Could anyone please give me an example showing exactly how to estimate p and q?
This isn't a good Stack Overflow question; you want to be on the Math site for this. To answer your question, though, there isn't one single generally accepted method for finding the optimal p and q.
Generally, what most people tend to do is eyeball it using pacf visualizations (in which case, as you observe, you can't distinguish whether to put a lag into p or q) and set p == q.
An alternative way to do it would be to estimate your time series with different values of p and q in a grid search, and pick the combination that optimizes some criterion such as the log likelihood or out-of-sample error -- whatever makes sense for your dataset.
If I might suggest, however, you probably want to start by looking at the rather extensive body of research on ARIMA models and see how others have done this -- that really should be your first step for questions like this.
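For illustration, a small sketch of that grid search in base R (the simulated series, the maximum order of 3, and AIC as the criterion are my own choices; forecast::auto.arima automates much the same idea):

# Pick (p, q) for an ARMA(p, q) fit by minimising AIC over a small grid.
set.seed(1)
x <- arima.sim(model = list(ar = c(0.5, -0.3)), n = 300)  # toy series with true AR order 2

grid <- expand.grid(p = 0:3, q = 0:3)
grid$aic <- apply(grid, 1, function(g) {
  fit <- try(arima(x, order = c(g["p"], 0, g["q"])), silent = TRUE)
  if (inherits(fit, "try-error")) NA else AIC(fit)
})
grid[which.min(grid$aic), ]  # the (p, q) pair with the lowest AIC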
Use the PACF plot to choose the optimal p in the AR(p) model, and the ACF plot to choose the optimal q in the MA(q) model.

Identifying machine learning data to make predictions

As a learning exercise I plan to implement a machine learning algorithm (probably a neural network) to predict what users earn trading stocks, based on shares bought, shares sold and transaction times. The datasets below are test data I've formulated.
Acronyms:
tab = millisecond time Apple bought
asb = Apple shares bought
tas = millisecond time Apple sold
ass = Apple shares sold
tgb = millisecond time Google bought
gsb = Google shares bought
tgs = millisecond time Google sold
gss = Google shares sold
Training data:
username,tab,asb,tas,ass,tgb,gsb,tgs,gss
a,234234,212,456789,412,234894,42,459289,0
b,234634,24,426789,2,234274,3,458189,22
c,239234,12,156489,67,271274,782,459120,3
d,234334,32,346789,90,234254,2,454919,2
Classifications:
a earned $45
b earned $60
c earned ?
d earned ?
Aim: predict the earnings of users c and d based on the training data.
Are there any data points I should add to this data set? Perhaps I should use alternative data? As this is just a learning exercise of my own creation, I can add any feature that may be useful.
This data will need to be normalised; are there any other concepts I should be aware of?
Perhaps I should not use time as a feature, as shares can bounce up and down over time.
You might want to solve your problem in the following order:
1. Prediction of an individual stock's future value based on all stocks' historical data.
2. Prediction of a combination of stocks' total future value based on a portfolio and all stocks' historical data.
3. A short-term buy/sell strategy for managing a portfolio (when, and in what amount, to buy/sell which stock(s)).
If you can do 1) well for a particular stock, it's probably a good starting point for 2). 3) might be your goal, but I put it last because it's even more complicated.
I will make some assumptions below and focus on how to solve 1). :)
I assume at each timestamp, you have a vector of all possible features, e.g.:
stock price of company A (this is the target value)
stock price of other companies B, C, ..., Z (other companies might affect company A directly or indirectly)
52 week lowest price of A, B, C, ..., Z (long-term features begin)
52 week highest price of A, B, C, ..., Z
monthly highest/lowest price of A, B, C, ..., Z
weekly highest/lowest price of A, B, C, ..., Z (short-term features begin)
daily highest/lowest price of A, B, C, ..., Z
is revenue report day of A, B, C, ..., Z (really important features begin)
change of revenue of A, B, C, ..., Z
change of profit of A, B, C, ..., Z
semantic score of company profile from social networks of A, ..., Z
... (imagination helps here)
And I assume you have almost all of the above features at every fixed time interval.
I think an LSTM-like neural network is very relevant here.
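For what an LSTM-like model might look like, a minimal sketch with the keras R package (the window length, feature count and layer sizes are placeholders; x and y would be built from the feature vectors listed above as a 3-D array and a target vector):

library(keras)  # assumes the keras R package with a working TensorFlow backend

timesteps  <- 30   # look-back window (assumed)
n_features <- 50   # number of features per timestamp (assumed)
# x: array of shape (samples, timesteps, n_features); y: next-period price of company A

model <- keras_model_sequential() %>%
  layer_lstm(units = 32, input_shape = c(timesteps, n_features)) %>%
  layer_dense(units = 1)

model %>% compile(loss = "mse", optimizer = "adam")
# model %>% fit(x, y, epochs = 50, batch_size = 32, validation_split = 0.2)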
Don't use the username along with the training data - the network might make associations between the username and the $ earned. Including it would factor the user into the output decision, while excluding it ensures the network will be able to predict the $ earned for an arbitrary user.
Using the parameters that you are suggesting, it seems to me impossible to predict earnings.
The main reason is that the input parameters don't correlate with the output value.
Your input values contradict themselves - consider this case: is it possible that for the same input you would expect different output values? If so, you won't be able to predict any output for such an input.
Let's go further: a trader's earnings depend not only on the number of shares bought/sold, but also on the price of each of them. This brings us to the problem of giving the neural network two equal inputs while desiring different outputs.
How do we define 'good' parameters to predict the desired output in such a case?
I suggest first of all looking at people who make such estimations and then trying to define the list of parameters they take into account.
If you succeed, you will get a huge list of variables.
Then you can try to build some model, for example using a neural network.
Besides normalisation you'll also need scaling. Another question I have for you concerns the classification of stocks. In your example you use Google and Apple, which are considered blue-chip stocks. I want to clarify: do you want to predict earnings only for Google and Apple, or for any combination of two stocks?
If you want to make predictions only for Google and Apple with the data you have, then you can apply just normalization and scaling with some kind of recurrent neural network. Recurrent NNs are better at prediction tasks than a simple feedforward model with backpropagation training.
But if you want to apply your training algorithm to more than just Google and Apple, I recommend splitting your training data into groups by some criterion. One example is dividing by stock capitalization; you could make, say, five groups. If you decide to make five groups of stocks, you can also apply equilateral encoding in order to decrease the number of dimensions for NN learning.
Another kind of grouping you could consider is the stock's area of operation, for example agricultural, technological, medical, high-end and tourist groups.
Let's say you decide to use this grouping (agricultural, technological, medical, high-end, tourist). Then five groups will give you five entries in the NN's input layer (one-of-N encoding).
And let's say you want to feed in an agricultural stock.
Then the input will look like this:
1,0,0,0,0, x1, x2, ..., xn
where x1, x2, ..., xn are the other entries.
Or, if you apply equilateral encoding, you'll have one dimension fewer (I'm too lazy to describe what it would look like).
Yet another idea for encoding entries for the neural network is thermometer encoding.
One more thing to keep in mind: people usually lose on trading stocks, so your data set will be biased. If you randomly choose only 10 traders, they could all be losers, and your data set would not be representative. So in order to avoid data bias, you need a big enough data set of traders.
And one more detail: you don't need to pass the user id into the NN, because the NN would then learn the trading style of a particular user and use it for prediction.
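As a small sketch of the group encoding plus scaling described above (the toy table, group names and column names are invented; model.matrix does the one-of-N expansion, and equilateral encoding would use one column fewer):

# Toy trader table: a sector group plus two numeric features.
df <- data.frame(group   = c("agricultural", "technological", "medical"),
                 shares  = c(212, 24, 12),
                 hold_ms = c(222555, 192155, 82745))

# One indicator column per group (the 1,0,0,0,0-style input described above).
group_cols <- model.matrix(~ group - 1, data = df)

# Min-max scaling of the numeric features to [0, 1].
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
numeric_cols <- sapply(df[, c("shares", "hold_ms")], min_max)

nn_input <- cbind(group_cols, numeric_cols)  # rows are now ready to feed into the network
nn_input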
It seems to me that you have more dimensions than data points. However, it might be the case that your observations lie in a linear subspace; you just need to compute the kernel of the matrix shown above.
If the kernel has a larger dimension than the number of data points, then you do not need to add more data points.
Now there is another thing to look at: you should check your classifier's VC dimension, since you don't want to add too many points to the dataset. But anyway, that is mostly theoretical in this example, and I'm just joking.
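If you want to check that numerically, a small base-R sketch (with random data standing in for the 4-user table above): the number of near-zero singular values of the centred feature matrix gives the dimension of its kernel.

# 4 users x 8 features, a stand-in for the training table above.
set.seed(1)
X <- matrix(rnorm(4 * 8), nrow = 4)

s <- svd(scale(X, center = TRUE, scale = FALSE))$d  # singular values of the centred matrix
rank_X     <- sum(s > 1e-8 * max(s))                # numerical rank
kernel_dim <- ncol(X) - rank_X                      # dimension of the kernel (null space)
kernel_dim                                          # here 5: far more free directions than data points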

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
The data is highly skewed: out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys. Since building a decision tree would overfit the data, I chose to use kNN - k-nearest neighbors.
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use Euclidean distance as a measure, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to merge the two distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value of k? I understand that this depends a lot on the use-case and varies per problem. But if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but these are not specifically helpful -
a) Metric for nearest neighbor, which says that coming up with your own distance measure is equivalent to 'kernelizing', but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize distances so that they have the same weight. By normalize I mean force to be between 0 and 1. To normalize something, subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value of K with the highest accuracy. If a "good" value of K is fine, then you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e. 50, 51, 52, ...), but this may not be optimal.
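For the distance part of the question, a sketch of one way to give every feature equal weight (the matrix layout and function names are invented; cluster::daisy with metric = "gower" is a ready-made alternative): min-max-scaled numeric columns are compared by absolute difference, categorical columns by a simple mismatch (Hamming) term, and the two are averaged.

# train_num: numeric matrix whose columns were scaled with min_max; train_cat: character matrix.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

mixed_dist <- function(num_a, num_b, cat_a, cat_b) {
  d_num <- abs(num_a - num_b)          # numeric part, already in [0, 1]
  d_cat <- as.numeric(cat_a != cat_b)  # 0 if the category matches, 1 otherwise
  mean(c(d_num, d_cat))                # equal weight for every feature
}

# Majority-vote kNN prediction for a single test row.
knn_predict <- function(train_num, train_cat, labels, test_num, test_cat, k = 5) {
  d <- sapply(seq_len(nrow(train_num)), function(i)
    mixed_dist(train_num[i, ], test_num, train_cat[i, ], test_cat))
  votes <- labels[order(d)[1:k]]
  names(sort(table(votes), decreasing = TRUE))[1]
}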
I'm looking at the exact same problem.
Regarding the choice of k, it's recommended that it be an odd value to avoid "tie votes".
I hope to expand this answer in the future.
