I am using InfoGainAttributeEval in Weka and got this result:
The result is still not clear to me, so I tried to build a tree based on my theoretical understanding.
Did I build the tree correctly? I ask because all the right children in the tree are empty.
The dataset looks like this:
This may be a stupid question, so I apologise in advance. I'm currently using Weka to classify performance using a number of attributes. However, after running the J48 model on a 70/30 split and visualising the decision tree, only one attribute is being used rather than all of them. Where have I gone wrong, and how do I fix it? This is what I've got:
I am working with Keras and experimenting with AI and Machine Learning. I have a few projects made already and now I'm looking to replicate a dataset. What direction do I go to learn this? What should I be looking up to begin learning about this model? I just need an expert to point me in the right direction.
To clarify: by replicating a dataset, I mean that I want to take a series of numbers with an easily distinguishable pattern and have the AI generate new data that is similar.
There are several ways to generate new data similar to a current dataset, but the most prominent way nowadays is to use a Generative Adversarial Network (GAN). This works by pitting two models against one another. The generator model attempts to generate data, and the discriminator model attempts to tell the difference between real data and generated data. There are plenty of tutorials out there on how to do this, though most of them are probably based on image data.
If you want to generate labels as well, make a conditional GAN.
The only other common method for generating data is a Variational Autoencoder (VAE), but the generated data tend to be lower-quality than what a GAN can generate. I don't know if that holds true for non-image data, though.
You can also use a Conditional Variational Autoencoder (CVAE), which generates new data together with labels.
I have a vast set of multi-dimensional, time-based data which I suspect contains patterns. I simplified the dataset to create a custom visualization.
Humans see patterns in the visualization, but the result of a pattern cannot be explained by the visualization. This is because the simplification step hides data which is important.
I cannot put all my data in the visualization, because then humans can no longer see the possible patterns: too much data and too many dimensions are shown at once.
Is there a technique that can detect hidden, unknown patterns in a data set (without using visualization, and without me teaching the technique the patterns)?
One optional extra would be that the technique should somehow be able to "explain the patterns" to me so that I can check whether they make sense.
[edit] I can give the technique a collection of small datasets (extracted from the big dataset; still very multi-dimensional) that I know contain patterns (from using my visualization). The technique then needs to analyze under what conditions a pattern produces result a or result b.
First off, how did you "simplify" the data? If you did it without any heuristics, you might go ahead and perform PCA (principal component analysis). The very idea of PCA is to address your problem: reducing the number of dimensions while keeping the directions along which the data varies the most, so that as little "important" information as possible is lost. You can visualize your principal components so that patterns can be detected by the human eye as well as by algorithms.
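As a rough illustration of what PCA computes, here is a minimal sketch in plain Python that finds the first principal component by power iteration on the covariance matrix. The function name `first_principal_component` and the toy data are assumptions made for this example; in practice you would more likely use an existing library implementation.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, explained-variance ratio) of the first component.

    rows is a list of equal-length numeric lists (one list per observation).
    """
    n = len(rows)
    d = len(rows[0])
    # Center every column so the covariance is measured around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize. This converges to the dominant eigenvector unless the
    # starting vector happens to be orthogonal to it.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total_variance = sum(cov[i][i] for i in range(d))
    ratio = eigenvalue / total_variance if total_variance else 0.0
    return v, ratio

# Toy data (made up): the second column is exactly twice the first,
# so a single direction explains all of the variance.
toy_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, explained = first_principal_component(toy_rows)
```

The explained-variance ratio indicates how much of the spread survives if you keep only that one direction, which is one way to check how much a simplification step is hiding.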
To your second question: yes, there are techniques that can detect hidden, unknown patterns in data. However, this is a huge field (machine learning), and which algorithm you'd use depends on your problem structure, so it's impossible to give a specific model name at this point. From what you specified, neural networks in general seem fit to do the job. After you have trained a network, you can visualize the activations or weights (Hinton diagram) to analyze which input data is treated "similarly".
I have constructed a decision tree model for a binary classification problem. What bothers me is this: when I have a new test instance, how can I get the probability or score with which it belongs to each class (rather than just the predicted class)?
A simple way is to use the class frequencies attached to the leaves. However, these raw frequencies are unreliable when a leaf contains only a few training instances, so you can smooth the estimates in various ways.
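A minimal sketch of that idea, assuming each leaf stores a count of training instances per class. The function name `leaf_probability` and the example counts are made up for illustration; the Laplace correction shown here is just one of the possible smoothing schemes.

```python
def leaf_probability(class_counts, target, smoothing=1.0):
    """Estimate P(target | leaf) from the class counts stored at a leaf.

    class_counts must list every class, including classes with a count of 0.
    smoothing=0 gives the raw frequency; smoothing=1 is the Laplace correction.
    """
    total = sum(class_counts.values())
    num_classes = len(class_counts)
    denominator = total + smoothing * num_classes
    if denominator == 0:
        # Empty leaf and no smoothing: fall back to a uniform guess.
        return 1.0 / num_classes
    return (class_counts.get(target, 0) + smoothing) / denominator

# Hypothetical leaf reached by a test instance: 3 "yes" and 0 "no"
# training instances ended up here.
leaf = {"yes": 3, "no": 0}
raw = leaf_probability(leaf, "yes", smoothing=0.0)  # 3 / 3
smoothed = leaf_probability(leaf, "yes")            # (3 + 1) / (3 + 2)
```

With only three instances in the leaf, the raw frequency claims certainty, while the smoothed estimate is pulled towards 0.5 and moves closer to the raw value as the leaf gets larger.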
Also, have a look at this question about C4.5.
I've been working with Weka for a couple of months now.
Currently, I'm working on my machine learning course here in Ostfold University College.
I need a better way to construct a decision tree based on separate training and test sets.
Any good ideas would be a great relief.
Thanks in advance.
-Neo
You might be asking for something more specific, but in general:
You build the decision tree with the training set, and you evaluate the performance of that tree using the test set. In other words, on the test data, you call a function usually named something like classify, passing in the newly built tree and a data point (from your test set) that you wish to classify.
This function returns the leaf (terminal) node of your tree to which that data point belongs. Assuming that the contents of that leaf are homogeneous (populated with data from a single class, not a mixture), you have in essence assigned a class label to that data point. When you compare the class label assigned by the tree to the data point's actual class label, and repeat this for all instances in your test set, you have a metric for evaluating the performance of your tree.
A rule of thumb: shuffle your data, then assign 90% to the training set and the other 10% to a test set.
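The workflow above can be sketched as follows. This is only an illustration in plain Python: `train_test_split`, `fit_stump`, `classify` and `accuracy` are hypothetical names, the data is made up, and the one-threshold "stump" is a stand-in for a real decision tree learner such as J48.

```python
import random

def train_test_split(instances, test_fraction=0.1, seed=0):
    """Shuffle a copy of the data, then split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_stump(train_set):
    """Stand-in for tree building: learn one threshold on the first feature."""
    lows = [f[0] for f, label in train_set if label == "low"]
    highs = [f[0] for f, label in train_set if label == "high"]
    return {"threshold": (max(lows) + min(highs)) / 2.0}

def classify(model, features):
    """Return the class label the model assigns to one data point."""
    return "high" if features[0] >= model["threshold"] else "low"

def accuracy(model, test_set):
    """Fraction of test instances whose predicted label matches the true one."""
    correct = sum(1 for features, label in test_set
                  if classify(model, features) == label)
    return correct / len(test_set)

# Made-up labelled data: (features, label) pairs.
data = [((x,), "high" if x >= 50 else "low") for x in range(100)]

train_set, test_set = train_test_split(data, test_fraction=0.1, seed=42)
model = fit_stump(train_set)          # built from the training set only
score = accuracy(model, test_set)     # evaluated on the held-out test set
```

The important point is that the test instances never influence `fit_stump`, so `score` estimates how the model behaves on data it has not seen.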
Actually, I was looking for something like this: http://weka.wikispaces.com/Saving+and+loading+models
It explains how to save a model, load it, and use it in the training set.
This is exactly what I was searching for. I hope it will be useful for anyone who has a similar problem.
cheers
-Neo182