Data Partition in Supervised Machine Learning

I have trained a Hidden Markov Model (HMM) tagger to extract some user-defined
entities. I am also running a classifier to extract various relationships and resolve
ambiguity in the extracted entities.
For both of these supervised algorithms I have kept 80% of the data for training and 20% for testing.
I am not comparing model performance, so I am not keeping any data for validation or cross-validation. Am I fine?
I tried to read some material, such as
a Stack Exchange post, Previous post 1, Previous post 2, and a Wikipedia article.
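For concreteness, here is a minimal sketch of the kind of 80/20 split described above, with an optional cross-validation check for the case where models do need to be compared. The use of scikit-learn, synthetic data, and a logistic-regression placeholder classifier are assumptions for illustration only; the original post does not name a library or model.

```python
# Sketch of an 80/20 train/test split (assumed: scikit-learn, synthetic data,
# and a placeholder classifier; the post does not specify any of these).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 80% training, 20% held-out test data, as in the question.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))

# If you later do want to tune hyper-parameters or compare models,
# cross-validation on the training portion keeps the test set untouched.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print("5-fold CV accuracy on the training split:", cv_scores.mean())
```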

Related

When is AdaBoost better than XGBoost on some data combinations?

My name is Eslam, a master's student in Egypt. My thesis is in the field of educational data mining. I used the AdaBoost and XGBoost techniques in my predictive model to predict students' success rate based on the Open Learning Analytics dataset (OLAD).
The idea behind the analysis is trying various techniques (including ensemble and non-ensemble techniques) on different combinations of features, and interesting results showed up.
Results: (see the attached picture)
The question is: why do some techniques perform better than others on specific feature combinations, especially Random Forest, XGB, and ADA?
An ML model can achieve different results depending on what kind of space and what kind of function you want to approximate. You can expect an SVM to achieve the highest score on data that is naturally embedded in a Hilbert space. On the other hand, if the data does not fit that kind of space (i.e. many categorical, unordered features), you can expect boosted-tree methods to outperform the SVM.
However, if I understood correctly that 'Decision Tree Accuracy' in the picture refers to a single decision tree, I believe your tests were run on small data sets or your boosting and RF models were incorrectly parametrized.
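As a rough illustration of the kind of comparison described in the question, the sketch below cross-validates Random Forest, AdaBoost, and gradient boosting on two different feature subsets. The synthetic data, the feature-subset choices, and the use of scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost are assumptions made for this example only.

```python
# Sketch: compare ensemble classifiers on different feature combinations.
# Assumptions: synthetic data and scikit-learn estimators; swap in
# xgboost.XGBClassifier for GradientBoostingClassifier if you use XGBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=12,
                           n_informative=6, random_state=0)

feature_subsets = {
    "first 6 features": list(range(6)),
    "all 12 features": list(range(12)),
}
models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

for subset_name, cols in feature_subsets.items():
    for model_name, model in models.items():
        scores = cross_val_score(model, X[:, cols], y, cv=5)
        print(f"{model_name:18s} on {subset_name}: {scores.mean():.3f}")
```

Differences between the scores printed for each subset are exactly the kind of "feature combination" effect the question asks about.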

Machine Learning - training data vs 'has to be classified' data

I have a general question about data pre-processing for machine learning.
I know that it is almost a must to center the data around 0 (mean subtraction) and to normalize the data (scale to unit variance); there are other possible techniques as well. This has to be applied to the training and validation data sets.
I have encountered the following problem: my neural network, trained to classify specific shapes in images, fails to do so if I do not apply these pre-processing techniques to the images that have to be classified. These 'to classify' images are of course not contained in the training set or validation set. Thus my question:
Is it normal to apply normalization to the data that has to be classified, or does the bad performance of my network without these techniques mean that my model is bad, in the sense that it has failed to generalize and has overfitted?
P.S. With normalization applied to the 'to classify' images, my model performs quite well (about 90% accuracy); without it, below 30%.
Additional info: the model is a convolutional neural network built with Keras and TensorFlow.
It goes without saying (although admittedly it is seldom mentioned explicitly in introductory tutorials, hence the frequent frustration of beginners) that new data fed to the model for classification have to undergo the very same pre-processing steps followed for the training (and test) data.
Some common sense is certainly expected here: in all kinds of ML modeling, new input data are expected to have the same "general form" as the original data used for training & testing. The opposite case (i.e. what you have been trying to do), if you stop for a moment to think about it, does not make much sense.
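A minimal sketch of that idea, assuming image arrays and per-channel mean/std statistics computed on the training set only (the shapes and random placeholder data are assumptions, not taken from the original post):

```python
# Sketch: compute normalization statistics on the training data only,
# then reuse them for validation/test data AND for any new images
# that need to be classified. (Shapes and values are placeholders.)
import numpy as np

train_images = np.random.rand(100, 64, 64, 3)   # stand-in for the real training set
new_images = np.random.rand(5, 64, 64, 3)       # stand-in for 'to classify' images

# Statistics come from the training set only.
mean = train_images.mean(axis=(0, 1, 2))
std = train_images.std(axis=(0, 1, 2)) + 1e-8

def preprocess(images, mean, std):
    """Apply the exact same centering/scaling used during training."""
    return (images - mean) / std

train_ready = preprocess(train_images, mean, std)
new_ready = preprocess(new_images, mean, std)    # same transform, same statistics
# model.predict(new_ready)  # feed the normalized images to the trained network
```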
The following answers may help you clarify the idea, illustrating also the case of inverse transforming the predictions whenever necessary:
How to predict a function/table using Keras?
Getting very bad prediction with KerasRegressor

Creating supervised model in machine learning

I have recently learned how supervised learning works: it learns from a labeled dataset and predicts unlabeled data.
But I have a question: is it fine to train the created model with the predicted data and then predict unlabeled data again, and repeat that process?
For example, model M is created from a labeled dataset D of 10 examples, then model M predicts datum A. Then A is added to dataset D and model M is created again. The process is repeated for the remaining unpredicted data.
What you are describing here is a well-known technique known as (among other names) "self-training" or "semi-supervised self-training". See for example these slides: https://www.cs.utah.edu/~piyush/teaching/8-11-print.pdf. There are hundreds of modifications around this idea. Unfortunately, in general it is hard to prove that it should help, so while it will help on some datasets it will hurt on others. The main criterion here is the quality of the very first model, since self-training is based on the assumption that your original model is really good, and thus you can trust it enough to label new examples. It might help with slow concept drift when the model is strong, but it will fail miserably with weak models.
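For illustration, a minimal self-training loop might look like the sketch below. The confidence threshold, the base classifier, and the synthetic data are assumptions for this example; scikit-learn also ships a ready-made sklearn.semi_supervised.SelfTrainingClassifier that wraps the same idea.

```python
# Sketch of the self-training loop described above (placeholder data,
# classifier, and confidence threshold; one of many possible variants).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                      # pretend only 50 points start labeled

X_lab, y_lab = X[labeled], y[labeled]
X_unlab = X[~labeled]

threshold = 0.95                         # only trust very confident predictions
for _ in range(10):                      # a few self-training rounds
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    proba = model.predict_proba(X_unlab)
    confident = proba.max(axis=1) >= threshold
    if not confident.any():
        break
    # Add the confidently pseudo-labeled points to the labeled pool.
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]
```

Note how the threshold encodes the caveat above: the loop only grows the labeled set with predictions the initial model is very sure about.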
What you describe is also called online machine learning, incremental supervised learning, or updateable classification. There is a bunch of algorithms that accomplish this behavior; see for example the Weka toolbox's Updateable Classifiers.
I suggest looking at the following ones:
HoeffdingTree
IBk
NaiveBayesUpdateable
SGD
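The Weka classes above are one option; as a hedged illustration in Python, scikit-learn exposes the same incremental idea through partial_fit. The SGDClassifier choice and the synthetic batches are assumptions for this sketch.

```python
# Sketch: incremental (online) supervised learning with partial_fit.
# Placeholder data; SGDClassifier is just one estimator supporting this API.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
classes = np.unique(y)                    # must be declared on the first call

clf = SGDClassifier(random_state=0)
batch_size = 500
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    # The model is updated with each new batch instead of being retrained.
    clf.partial_fit(X_batch, y_batch, classes=classes)

print("Accuracy on all seen data:", clf.score(X, y))
```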

What's the difference between collective classification and semi-supervised learning

I ran into the question in the title.
The definition of collective classification is "Collective classification is the area in machine learning in which unknown nodes in a network are classified based on the classes assigned to the known nodes and the network structure only."
Semi-supervised learning is to infer the correct labels for the given unlabeled data (Wikipedia).
Thus, is the only difference between them that collective classification involves classification while semi-supervised learning doesn't? Is that correct?
Semi-supervised learning is more general - it does not specify or stipulate the structure of the input data. It can be summarized as "learning from a combination of labeled and unlabeled data points". The approach to performing the inference is also unspecified.
"Collective classification", as you have quoted above, does specify the way in which the unlabeled points are inferred:
based on the classes assigned to the known nodes and the network
structure only.
So there is an additional expectation on the data:
- they are represented in a graph structure
- their correlation can be used to compute their relative similarity and hence their class
A summary of collective classification from this paper https://www.cs.uic.edu/~xkong/sdm11_icml.pdf helps to illustrate the (higher) expectations on the data structure and semantics:
Collective classification in relational data has become an important and
active research topic in the last decade, where class labels for a
group of linked instances are correlated and need to be predicted
simultaneously.
The note about the types of problems it applies to is also revealing - notice they are graph-oriented data analysis tasks:
Collective classification has a wide variety of real-world
applications, e.g. hyperlinked document classification,
social network analysis and collaboration network analysis
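As a rough illustration of the graph-flavoured end of this spectrum, scikit-learn's LabelSpreading propagates labels over a similarity graph built from the features. This is a semi-supervised method rather than collective classification proper, but the "labels flow along connections" intuition is close to what collective classification formalizes. The data and parameters below are placeholders.

```python
# Sketch: graph-based label propagation (semi-supervised), where labels
# spread along a kNN similarity graph. Placeholder data and parameters.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=300, noise=0.1, random_state=0)
y = np.full_like(y_true, -1)       # -1 marks unlabeled points
y[:10] = y_true[:10]               # only a handful of nodes start labeled

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, y)                    # labels propagate over the kNN graph

print("Accuracy on the unlabeled points:",
      (model.transduction_[10:] == y_true[10:]).mean())
```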

Predicting from a highly skewed dataset

I would like to find the factors that contribute to a particular event happening. However, that event occurs only about 1% of the time. So if I have a class attribute called event_happened, 99% of the time the value is 0, and it is 1 only 1% of the time. Traditional data mining prediction techniques (decision trees, naive Bayes, etc.) don't seem to be working in this case. Any suggestions as to how I should go about mining this dataset? Thanks.
This is the typical description of an anomaly detection task.
It defines its own group of algorithms:
In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.
And a statement about the possible approaches:
Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as "normal" and "abnormal" and involves training a classifier (the key difference to many other statistical classification problems is the inherent unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then testing the likelihood of a test instance to be generated by the learned model.
Which one you choose is a question of personal taste.
These approaches will help you "learn" to find the outlier events; the model that "predicts" them will then reveal the factors that you are interested in.
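For instance, an unsupervised sketch along those lines might use an Isolation Forest, with the contamination rate set to the ~1% event rate mentioned in the question. The data itself is a placeholder.

```python
# Sketch: unsupervised anomaly detection with an Isolation Forest.
# Placeholder data; contamination=0.01 mirrors the ~1% event rate above.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
normal = rng.normal(0, 1, size=(990, 5))          # ~99% "normal" rows
rare = rng.normal(4, 1, size=(10, 5))             # ~1% rare-event rows
X = np.vstack([normal, rare])

detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(X)                  # -1 = flagged as anomalous

print("Rows flagged as anomalies:", np.where(labels == -1)[0])
```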
Let's say my attributes are hour_of_the_day, day_of_the_week, state, customer_age, customer_gender, etc., and I want to find out which of these factors contribute to my event occurring.
Based on this answer, I believe you need classification, but your result will be the model itself.
So you perform, say, logistic regression, but your features are the data attributes themselves (some literature doesn't even distinguish features from attributes).
You have to somehow normalize this data, which can be tricky. I would go for boolean features (say hour_of_event==00, hour_of_event==01, hour_of_event==02, ...).
Then you apply a classification model, and you end up with weights for each of the attributes. The attributes with the highest weights will be the factors that you need.
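A minimal sketch of that recipe, assuming a small synthetic table and scikit-learn: the column names mirror the attributes mentioned above, and class_weight='balanced' is added as one common way to cope with the 1% event rate; none of this is taken from the original data.

```python
# Sketch: one-hot encode categorical attributes, fit a (class-weighted)
# logistic regression, and read factor importance off the coefficients.
# Placeholder data; column names mirror the attributes mentioned above.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n = 5000
df = pd.DataFrame({
    "hour_of_the_day": rng.randint(0, 24, n).astype(str),
    "day_of_the_week": rng.randint(0, 7, n).astype(str),
    "customer_gender": rng.choice(["M", "F"], n),
})
# Rare event (~1%), loosely tied to one late-night hour for illustration.
event = (rng.rand(n) < 0.005) | ((df["hour_of_the_day"] == "2") & (rng.rand(n) < 0.2))

X = pd.get_dummies(df)                     # boolean features as suggested above
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, event)

# Largest-magnitude weights point at the attributes driving the event.
weights = pd.Series(clf.coef_[0], index=X.columns).sort_values(key=np.abs, ascending=False)
print(weights.head(10))
```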
This is an unbalanced classification problem.
I'm pretty sure I have seen some surveys and overview articles on methods that can handle unbalanced data well. You should research this term ("skew" is a bit broad, and may not get you the results you are looking for).
