I have a dataset that contains input features and output labels. For most of the data the output labels are certain: some records belong to type A, some to type B, some to type C, and some to type D. But there is some special data for which we only have fuzzy information: we only know that it does not belong to type A or type B; in other words, it belongs to either type C or type D.
So for this kind of dataset, how can we use a machine learning method such as XGBoost to train a classification model? Is there any mature method for dealing with this?
The simplest way is to put all of that "special data" into a new category named "E" (ideally the new category should be coded as -1).
The ML model then learns to treat those records as ones that cannot be classified into your known categories based on your training data.
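A minimal sketch of that approach (assuming a pandas DataFrame read from a hypothetical data.csv with numeric feature columns and a 'label' column left empty for the fuzzy rows):

    import pandas as pd
    from xgboost import XGBClassifier

    df = pd.read_csv("data.csv")            # hypothetical file: numeric features plus a 'label' column (A/B/C/D or empty)
    df["label"] = df["label"].fillna("E")   # the fuzzy rows become their own class "E"

    codes = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}   # XGBoost expects integer class codes 0..k-1
    y = df["label"].map(codes)
    X = df.drop(columns=["label"])

    model = XGBClassifier(objective="multi:softprob")
    model.fit(X, y)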
A more complex way is to have four separate target values:
Probability of A: P(A)
Probability of B: P(B)
Probability of C: P(C)
Probability of D: P(D)
So for the "special data" where you know that it's not A or B, the values of these fields will be:
P(A):0, P(B):0, P(C):0.5, P(D):0.5
If you have some sort of probability estimate in your training data, use those values instead of equal probabilities.
Then predict the probabilities for all four targets using regression, and you have more granular outputs.
Outputs with highly skewed probabilities can be assigned to definite classes, while outputs with a more even distribution indicate that the record is another case of "special data".
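Here is a rough sketch of that soft-label idea, wrapping XGBRegressor in scikit-learn's MultiOutputRegressor (the file name, column names, and the 0.8 threshold are all illustrative):

    import numpy as np
    import pandas as pd
    from sklearn.multioutput import MultiOutputRegressor
    from xgboost import XGBRegressor

    df = pd.read_csv("data.csv")   # hypothetical file: numeric features plus a 'label' column (A/B/C/D or empty)

    # Build the four soft targets P(A)..P(D); certain rows get a single 1, fuzzy rows get 0, 0, 0.5, 0.5.
    targets = pd.DataFrame(0.0, index=df.index, columns=["P_A", "P_B", "P_C", "P_D"])
    for cls, col in zip("ABCD", targets.columns):
        targets.loc[df["label"] == cls, col] = 1.0
    targets.loc[df["label"].isna(), ["P_C", "P_D"]] = 0.5

    X = df.drop(columns=["label"])
    model = MultiOutputRegressor(XGBRegressor()).fit(X, targets)

    pred = np.clip(model.predict(X), 0, None)          # raw scores, one column per class
    pred /= pred.sum(axis=1, keepdims=True) + 1e-9     # renormalise so each row roughly sums to 1
    definite = pred.max(axis=1) > 0.8                  # skewed rows -> definite class; the rest look like "special data"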
Is there an easy way to calculate TP, FP, TN, and FN using only the total number of participants and the reported Sensitivity, Specificity, PPV, and NPV?
I'm doing a meta-analysis on the diagnostic accuracy of various cancer screening tests and I need the MADA counts for a program I wrote to easily compare the treatments and display the ROC curves.
I'd like to find the counts, but the authors only report these measures and the total number of participants.
I could simply guess and check the cells until I get a close answer, or I could do the algebra exhaustively, because I think the system is solvable.
I'd just like to know whether there is a package in R or some other language that solves this problem for me.
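I don't know of a dedicated package, but the algebra only needs Sensitivity, Specificity, PPV, and N (NPV then serves as a consistency check). A minimal sketch in Python, with a hypothetical function name:

    def counts_from_summary(n, sensitivity, specificity, ppv):
        """Recover TP, FP, TN, FN from total N, Sensitivity, Specificity and PPV.

        With P = TP + FN (diseased participants):
          TP = Se*P, FN = (1-Se)*P, TN = Sp*(N-P), FP = (1-Sp)*(N-P),
        and PPV = TP / (TP + FP) rearranges to
          P = PPV*(1-Sp)*N / (Se*(1-PPV) + PPV*(1-Sp)).
        """
        p = ppv * (1 - specificity) * n / (sensitivity * (1 - ppv) + ppv * (1 - specificity))
        tp = sensitivity * p
        fn = (1 - sensitivity) * p
        tn = specificity * (n - p)
        fp = (1 - specificity) * (n - p)
        return tp, fp, tn, fn

    # Example: the solution is generally non-integer, so round it and use NPV as a consistency check.
    tp, fp, tn, fn = counts_from_summary(n=1000, sensitivity=0.9, specificity=0.8, ppv=0.6)
    npv_check = tn / (tn + fn)   # should be close to the reported NPV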
As a learning exercise I plan to implement a machine learning algorithm (probably a neural network) to predict what users earn trading stocks, based on shares bought, shares sold, and transaction times. The datasets below are test data I've formulated.
acronyms:
tab=millisecond time apple bought
asb=apple shares bought
tas=millisecond apple sold
ass=apple shares sold
tgb=millisecond time google bought
gsb=google shares bought
tgs=millisecond google sold
gss=google shares sold
training data:
username,tab,asb,tas,ass,tgb,gsb,tgs,gss
a,234234,212,456789,412,234894,42,459289,0
b,234634,24,426789,2,234274,3,458189,22
c,239234,12,156489,67,271274,782,459120,3
d,234334,32,346789,90,234254,2,454919,2
classifications:
a earned $45
b earned $60
c earned ?
d earned ?
Aim: predict the earnings of users c and d based on the training data
Are there any data points I should add to this dataset? Should I perhaps use alternative data? As this is just a learning exercise of my own creation, I can add any feature that may be useful.
This data will need to be normalised; are there any other concepts I should be aware of?
Perhaps I should not use time as a feature, since share prices can bounce up and down over time.
You might want to solve your problem in the following order:
Prediction for an individual stock's future value based on all stock's historical data.
Prediction for a combination of stocks' total future value based on a portfolio and all stocks' historical data.
A short-term buy/sell strategy for managing a portfolio (when, in what amount, and which stock(s) to buy or sell).
If you can do 1) well for a particular stock, that is probably a good starting point for 2). 3) might be your goal, but I put it last because it's even more complicated.
I will make some assumptions below and focus on how to solve 1). :)
I assume at each timestamp, you have a vector of all possible features, e.g.:
stock price of company A (this is the target value)
stock price of other companies B, C, ..., Z (other companies might affect company A directly or indirectly)
52 week lowest price of A, B, C, ..., Z (long-term features begin)
52 week highest price of A, B, C, ..., Z
monthly highest/lowest price of A, B, C, ..., Z
weekly highest/lowest price of A, B, C, ..., Z (short-term features begin)
daily highest/lowest price of A, B, C, ..., Z
is revenue report day of A, B, C, ..., Z (really important features begin)
change of revenue of A, B, C, ..., Z
change of profit of A, B, C, ..., Z
semantic score of company profile from social networks of A, ..., Z
... (imagination helps here)
And I assume you have almost all of the above features at every fixed time interval.
I think an LSTM-like neural network is very relevant here.
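To make that concrete, here is a minimal Keras sketch of an LSTM regressing the next price of company A from a window of past feature vectors; all shapes, hyperparameters, and the random stand-in data are purely illustrative:

    import numpy as np
    from tensorflow import keras

    # Illustrative shapes: 1000 windows of 30 timesteps, each timestep a vector of 64 features
    # (prices of A..Z, 52-week highs/lows, revenue-report flags, sentiment scores, ...).
    n_windows, window_len, n_features = 1000, 30, 64
    X = np.random.rand(n_windows, window_len, n_features).astype("float32")  # stand-in data
    y = np.random.rand(n_windows).astype("float32")                          # next-step price of company A

    model = keras.Sequential([
        keras.layers.Input(shape=(window_len, n_features)),
        keras.layers.LSTM(64),
        keras.layers.Dense(1),   # regression output: next price of A
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)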
Don't use the username as part of the training data: the network might make associations between the username and the $ earned. Including it would factor the user into the output decision, while excluding it ensures the network will be able to predict the $ earned for an arbitrary user.
Using the parameters you suggest, it seems to me impossible to predict earnings.
The main reason is that the input parameters don't correlate with the output value.
Your input values contradict themselves: consider whether it is possible that the same input is expected to produce different output values. If so, you won't be able to predict any output for such an input.
Going further, a trader's earnings depend not only on the number of shares bought and sold, but also on the price of each trade. This brings us back to the problem of giving the neural network two equal inputs while expecting different outputs.
How do you define 'good' parameters to predict the desired output in such a case?
I suggest first of all looking for people who make such estimations and trying to define the list of parameters they take into account.
If you succeed, you will end up with a huge list of variables.
Then you can try to build a model, for example using a neural network.
Besides normalisation you'll also need scaling. Another question I have for you concerns the classification of stocks. In your example you use Google and Apple, which are considered blue-chip stocks. I want to clarify: do you want to predict earnings only for Google and Apple, or for any combination of two stocks?
If you want to make predictions only for Google and Apple with the data you have, then you can apply just normalisation and scaling with some kind of recurrent neural network. Recurrent NNs are better at prediction tasks than a simple feedforward model trained with backpropagation.
But if you want to apply your training algorithm to more than just Google and Apple, I recommend splitting your training data into groups by some criterion. One example is dividing by the capitalisation of the stocks; if you go that route, you could make five groups, for example. And if you decide to make five groups of stocks, you can also apply equilateral encoding to reduce the number of input dimensions for NN learning.
Another kind of grouping you could consider is the stock's area of operation, for example agricultural, technological, medical, hi-end, and tourist groups.
Let's say you decide on the grouping just mentioned (agricultural, technological, medical, hi-end, tourist). The five groups then give you five entries in the NN's input layer (so-called one-hot encoding).
And let's say you want to feed in an agricultural stock.
Then the input will look like this:
1,0,0,0,0, x1, x2, ...., xn
where x1, x2, ..., xn are the other entries.
Or, if you apply equilateral encoding, you'll have one dimension fewer (I'm too lazy to describe what it would look like).
Yet another idea for encoding entries for the neural network is thermometer encoding.
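To illustrate the difference between the two encodings for the five groups (a rough sketch; the trailing x1, x2 values are just placeholders for the other, already scaled entries):

    import numpy as np

    groups = ["agricultural", "technological", "medical", "hi-end", "tourist"]

    def one_hot(group):
        """One entry per group: 1 for the stock's group, 0 elsewhere (e.g. 1,0,0,0,0)."""
        vec = np.zeros(len(groups))
        vec[groups.index(group)] = 1.0
        return vec

    def thermometer(group):
        """All entries up to and including the stock's group are 1 (e.g. 1,1,0,0,0)."""
        vec = np.zeros(len(groups))
        vec[: groups.index(group) + 1] = 1.0
        return vec

    # An agricultural stock, followed by the other entries x1, x2, ...
    features = np.concatenate([one_hot("agricultural"), [0.42, 0.17]])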
And one more thing to keep in mind: people usually lose money trading stocks, so your dataset may be biased. If you randomly choose only 10 traders, they could all be losers, and your dataset would not be representative. So, to avoid data bias, you should have a big enough dataset of traders.
And one more detail: you don't need to pass the user id into the NN, because the NN would then learn the trading style of a particular user and use it for prediction.
It seems to me that there are more dimensions than data points. However, it might be the case that your observations lie in a linear subspace; you just need to compute the kernel of the data matrix shown above.
If the kernel has a larger dimension than the number of data points, then you do not need to add more data points.
Now there is another thing to look at: you should check your classifier's VC dimension, as you don't want to add too many points to the dataset. But anyway, that is mostly theoretical in this example, and I'm just joking.
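If you do want to check how large that subspace is, here is a quick sketch using scipy's null_space on the data matrix from the question:

    import numpy as np
    from scipy.linalg import null_space

    # Rows = observations, columns = features (tab, asb, tas, ass, tgb, gsb, tgs, gss).
    X = np.array([
        [234234, 212, 456789, 412, 234894,  42, 459289,  0],
        [234634,  24, 426789,   2, 234274,   3, 458189, 22],
        [239234,  12, 156489,  67, 271274, 782, 459120,  3],
        [234334,  32, 346789,  90, 234254,   2, 454919,  2],
    ], dtype=float)

    K = null_space(X)                        # basis of the kernel; with 4 rows it has at least 8 - 4 = 4 columns
    effective_dim = X.shape[1] - K.shape[1]  # rank of X, i.e. the dimension of the subspace the observations span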
There are many tools and papers available that perform this task using basic sentence separators.
Such tools are:
http://nlp.stanford.edu/software/tokenizer.shtml
OpenNLP
NLTK
and there might be others. They mainly rely on rules such as:
(a) If it's a period, it ends a sentence.
(b) If the preceding token is on my hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.
There are a few papers that suggest techniques for SBD (sentence boundary detection) in ASR text:
http://pdf.aminer.org/000/041/703/experiments_on_sentence_boundary_detection.pdf
http://www.icsd.aegean.gr/lecturers/kavallieratou/publications_files/icpr_2000.pdf
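For a quick start, a minimal example with NLTK's Punkt sentence tokenizer (one of the tools listed above):

    import nltk
    nltk.download("punkt")   # one-time download of the Punkt sentence tokenizer models
    from nltk.tokenize import sent_tokenize

    text = "This is the first sentence. Here is the second one. And a third."
    print(sent_tokenize(text))
    # ['This is the first sentence.', 'Here is the second one.', 'And a third.']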
Are there any tools that can perform sentence detection on ambiguous sentences like:
John is actor and his father Mr Smith was top city doctor in NW (2 sentences)
Where is statue of liberty, what is it's height and what is the history behind? (3 sentences)
What you are seeking to do is to identify the independent clauses in a compound sentence. A compound sentence is a sentence with at least two independent clauses joined by a coordinating conjunction. There is no readily available tool for this, but you can identify compound sentences with a high degree of precision by using constituency parse trees.
Be wary, though. Slight grammatical mistakes can yield a very wrong parse tree! For example, if you use the Berkeley parser (demo page: http://tomato.banatao.berkeley.edu:8080/parser/parser.html) on your first example, the parse tree is not what you would expect, but correct it to "John is an actor and his father ...", and you can see the parse tree neatly divided into the structure S CC S.
Now, you simply take each sentence-label S as an independent clause!
Questions are not handled well, I am afraid, as you can check with your second example.
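As a rough sketch of the S CC S idea: feed the parser's bracketed output into nltk's Tree and collect the top-level S children (the bracketed string below is an illustrative, simplified parse, not actual parser output):

    from nltk.tree import Tree

    # Illustrative bracketed parse of "John is an actor and his father Mr Smith was a top city doctor";
    # in practice this string comes from the constituency parser.
    parse = Tree.fromstring(
        "(ROOT (S (S (NP (NNP John)) (VP (VBZ is) (NP (DT an) (NN actor)))) "
        "(CC and) "
        "(S (NP (PRP$ his) (NN father) (NNP Mr) (NNP Smith)) "
        "(VP (VBD was) (NP (DT a) (JJ top) (NN city) (NN doctor))))))"
    )

    top_s = parse[0]   # the outer S holding the S CC S pattern
    clauses = [" ".join(child.leaves())
               for child in top_s
               if isinstance(child, Tree) and child.label() == "S"]
    # ['John is an actor', 'his father Mr Smith was a top city doctor']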
I have about 5000 terms in a table and I want to group them into categories that make sense.
For example some terms are:
Nissan
Ford
Arrested
Jeep
Court
The result should be that Nissan, Ford, and Jeep get grouped into one category and that Arrested and Court are in another. I looked at the Stanford NLP Classifier. Am I right to assume that it is the right tool to do this for me?
I would suggest using NLTK if there weren't so many proper nouns. You can use the semantic similarity from WordNet as features and try to cluster the words. Here's a discussion about how to do that.
To use the Stanford Classifier, you need to know how many buckets (classes) of words you want. Besides, I think it is meant for documents rather than individual words.
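A rough sketch of the WordNet route with NLTK and scikit-learn (the terms here are common nouns, since brand names like Nissan are usually missing from WordNet; the choice of path similarity and average-linkage clustering is mine):

    import nltk
    nltk.download("wordnet")   # one-time download
    import numpy as np
    from nltk.corpus import wordnet as wn
    from sklearn.cluster import AgglomerativeClustering

    terms = ["car", "truck", "jeep", "arrest", "court"]   # WordNet covers common nouns better than brand names

    def similarity(a, b):
        """Best path similarity over the noun senses of the two terms (0 if none found)."""
        scores = [s1.path_similarity(s2) or 0.0
                  for s1 in wn.synsets(a, pos=wn.NOUN)
                  for s2 in wn.synsets(b, pos=wn.NOUN)]
        return max(scores, default=0.0)

    sim = np.array([[similarity(a, b) for b in terms] for a in terms])
    # Cluster on the distance matrix 1 - sim (scikit-learn >= 1.2; older versions call this parameter 'affinity').
    labels = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average").fit_predict(1 - sim)
    print(dict(zip(terms, labels)))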
That's an interesting problem that the word2vec model that Google released may help with.
In a nutshell, a word is represented by an N-dimensional vector generated by a model. Google provides a great model that returns a 300-dimensional vector from a model trained on over 100 billion words from their news division.
The interesting thing is that there are semantics encoded in these vectors. Suppose you have the vectors for the words King, Man, and Woman. A simple expression (King - Man) + Woman will yield a vector that is exceedingly close to the vector for Queen.
This is done via a distance calculation (cosine distance is their default, but you can use your own on the vectors) to determine similarity between words.
For your example, the distance between Jeep and Ford would be much smaller than between Jeep and Arrested. Through this you could group terms 'logically'.
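A minimal sketch with gensim and the pre-trained Google News vectors (the .bin file is a separate ~3.4 GB download, and the vocabulary is case-sensitive, so check both cases of your terms):

    from gensim.models import KeyedVectors

    # The pre-trained Google News model (GoogleNews-vectors-negative300.bin) must be downloaded separately.
    wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))  # ~ 'queen'
    print(wv.similarity("jeep", "ford"))       # relatively high: both relate to vehicles
    print(wv.similarity("jeep", "arrested"))   # much lower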