Can information gain be negative? - machine-learning

Something strange is happening with my quiz: the split {(200+/300-) -> ((50+/50-), (150+/250-))} gives a negative IG after my calculation. Can someone explain why?

After researching some other questions, I think I have an idea. Some people say information gain cannot be negative, but it actually can be, depending on which feature you choose to split on. I found a paper, http://compbio.umbc.edu/Documents/Negative_information.pdf, which discusses how some proteins can carry negative information about genomic composition, since protein data is so noisy. If we build our tree on a noisy feature like that, the information gain can come out negative. Intuitively: in the real world, some features make us hesitate when we want to make a decision, and using that kind of feature makes the result more uncertain.
The example I gave actually has positive information gain... my mistake, sorry about that.
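For what it's worth, here is a minimal sketch in plain Python (standard library only) that computes the entropy-based information gain for the split in the question; it comes out small but positive, consistent with the correction above.

```python
import math

def entropy(pos, neg):
    """Shannon entropy (base 2) of a binary class distribution."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(parent, children):
    """IG = parent entropy minus the size-weighted child entropies."""
    parent_total = sum(parent)
    h_parent = entropy(*parent)
    h_children = sum(
        (pos + neg) / parent_total * entropy(pos, neg)
        for pos, neg in children
    )
    return h_parent - h_children

# The split from the question: (200+/300-) -> (50+/50-), (150+/250-)
ig = information_gain((200, 300), [(50, 50), (150, 250)])
print(f"IG = {ig:.4f}")  # ~0.0074, small but positive
```

If a hand calculation comes out negative, the usual slip is in the weighting of the child entropies, so that is worth re-checking first.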

Related

polynomial regression for payroll examination

I serve as an internal auditor for a few clients. One of my clients has thousands of employees across different locations, most of them in the head office, and the client is looking for a corporate control for salary monitoring.
Does it make sense to use regression to find outliers?
Potential parameters could be years of experience, gender, level/rank, etc.
I plan to go over all the monthly payroll and look for significant outliers. Because of the differences between the global locations, it might be a good idea to focus only on the head office.
The idea is to train the model on the previous months' averages and test it on the current month.
What do you think: is it too much effort or too theoretical, or does it have a good chance of bringing value?
Thank you.
This answers your question regarding the regression method to use. It makes sense to only use data from the head office, as adding data from different geographies will require you to add more data around general demographics, which you can avoid for a proof of concept.
Coming to the problem itself, you'll need to provide a better explanation of how you're defining outliers. Are you looking for mistakes in payroll? Or are you looking for people who make significantly more/less than their peers? You'll only be able to decide on a modelling framework once you get clarity on the basic definitions.
Also, you might want to consider statistical significance tests like Grubbs' test (more information on tests here) first, before moving to machine learning approaches. They're easier to set up and explain to non-practitioners.
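For a flavour of how little setup that takes, here is a minimal sketch of Grubbs' test in Python, assuming salaries within a peer group are roughly normally distributed; the numbers and the peer group are made up for illustration.

```python
import numpy as np
from scipy import stats

def grubbs_test(values, alpha=0.05):
    """Two-sided Grubbs' test for a single outlier in roughly normal data.

    Returns (is_outlier, index) for the most extreme observation.
    """
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    idx = int(np.argmax(np.abs(x - mean)))       # most extreme point
    g = abs(x[idx] - mean) / sd                  # Grubbs' statistic
    # Critical value derived from the t distribution at level alpha
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g > g_crit, idx

# Hypothetical head-office salaries for one peer group; one is suspect.
salaries = [4800, 5100, 4950, 5200, 5050, 4900, 9800]
flagged, i = grubbs_test(salaries)
print(f"Outlier at index {i}: {flagged}")  # True -> index 6 (9800)
```

In practice you would run this per peer group (same rank, location, etc.), and if you suspect more than one outlier, remove the flagged point and re-test.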

How to give a logical reason for choosing a model

I used machine learning to classify depression-related sentences, and it was LinearSVC that performed best. In addition to LinearSVC, I experimented with MultinomialNB and LogisticRegression, and I chose the model with the highest accuracy among the three. What I want, though, is to be able to reason in advance about which model will fit, like the ml_map provided by scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything more detailed than the observation that SVMs are suitable for text classification. How do I study to get prior knowledge like this ml_map?
How do I study to get prior knowledge like this ml_map?
Try to work with different example datasets, on different data types, using different algorithms. There are hundreds to be explored. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
And here are my thoughts; I used to ask such questions too, and I hope this helps if you are struggling. The more you work with different machine learning models on a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum it up, in order to achieve good out-of-sample performance (i.e., good generalization) while solving a problem, it is essential to examine the training/testing process with different setting combinations and to be mindful of your data (for example, answer this question: does it cover most samples in terms of the distribution in the wild, or just a portion of it?). A side-by-side comparison like the sketch below is usually more informative than any roadmap.
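Here is a minimal sketch of that kind of comparison with scikit-learn, using a handful of made-up sentences in place of the real depression dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled sentences (1 = depression-related, 0 = not).
texts = ["i feel hopeless every day", "great game last night",
         "i can't get out of bed anymore", "the weather is lovely today",
         "nothing matters to me now", "we shipped the new release"]
labels = [1, 0, 1, 0, 1, 0]

# Same features, same folds for every candidate, so the scores are comparable.
for clf in (LinearSVC(), MultinomialNB(), LogisticRegression(max_iter=1000)):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(f"{clf.__class__.__name__}: mean accuracy {scores.mean():.2f}")
```

With so few samples the scores themselves are meaningless; the point is the harness: an identical pipeline and identical cross-validation folds for every candidate, so accuracy differences reflect the model rather than the setup.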

How do I represent the chromosome in a genetic algorithm?

My task is to compute clashes between an alert schedule and the user's calendar, and to generate an alert schedule with as few clashes as possible.
How should I represent the chromosome for this problem?
How should I represent the time slots? (Binary or numeric?)
Thank you.
(Please consider that I'm a beginner in genetic algorithms.)
Questions would be: what have you tried so far, and how good are your results? Your problem is also stated quite unspecifically, so here is what I can give:
The chromosome should probably encode the start times of the alerts in your schedule (if I understood your problem correctly).
Just as important is to think about how you want to evaluate and calculate the fitness of your individuals; here that would be clashes (e.g. the number of overlaps, or the total time overlap between appointments), but you may well find better heuristics that yield better solutions or faster convergence.
Binary or continuous numbers might both work. I usually go for numbers whenever there is no strong reason not to, since they are easier to interpret, debug, etc.; binary does come with some nice opportunities with respect to mutation and recombination.
I strongly recommend playing around with and reading about these things. It might look like a lot of extra work to implement, but you should come to see them as hyperparameters that need to be tuned to get the best outcome.
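To make the representation concrete, here is a minimal sketch under details I am assuming, since they are not in the question (15-minute alerts, times as minutes since midnight, a fixed list of busy calendar intervals): the chromosome is just a list of start times and the fitness counts clash minutes. It uses a toy mutate-and-keep-if-better loop rather than a full GA with crossover.

```python
import random

# Assumed setup: each alert lasts 15 minutes; times are minutes since
# midnight. The chromosome is simply the list of alert start times.
ALERT_MINUTES = 15
BUSY = [(540, 600), (720, 780), (900, 960)]  # user's calendar (start, end)

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two intervals, 0 if disjoint."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))

def fitness(chromosome):
    """Total clash minutes with the calendar; lower is better."""
    return sum(
        overlap(start, start + ALERT_MINUTES, b0, b1)
        for start in chromosome
        for b0, b1 in BUSY
    )

def mutate(chromosome, sigma=30):
    """Nudge one random alert time by up to +/- sigma minutes."""
    child = chromosome[:]
    i = random.randrange(len(child))
    shifted = child[i] + random.randint(-sigma, sigma)
    child[i] = min(1440 - ALERT_MINUTES, max(0, shifted))
    return child

# Toy evolutionary loop: keep the mutant only if it clashes no more.
best = [550, 730, 910]  # three alerts, all initially inside busy slots
for _ in range(2000):
    cand = mutate(best)
    if fitness(cand) <= fitness(best):
        best = cand
print(best, "clash minutes:", fitness(best))
```

A real GA adds a population, selection, and crossover on top of this, but the representation and the fitness function carry over unchanged.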

Classifying URLs into categories - Machine Learning

[I'm approaching this as an outsider to machine learning. It just seems like a classification problem that I should be able to solve with fairly good accuracy using machine learning.]
Training Dataset:
I have millions of URLs, each tagged with a particular category. There is a limited number of categories (50-100).
Now, given a fresh URL, I want to categorize it into one of those categories. The category can be determined from the URL using conventional methods, but that would require a huge, unmanageable mess of pattern matching.
So I want to build a box where INPUT is URL, OUTPUT is Category. How do I build this box driven by ML?
As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now I'm much more focused on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking for. I suppose machine learning is at a point where you can approach reasonably straightforward problems in that manner.
If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.
I'm building this inside an AWS ecosystem so I'm open to using Amazon ML if it makes things quicker and simpler.
I suppose machine learning is at a point where you can approach reasonably straightforward problems in that manner.
It is not. Building an effective ML solution requires an understanding of both the methods and the problem scope/constraints (in your case: new categories over time? Runtime requirements? Execution frequency? Latency requirements? Cost of errors? And more!). These constraints will then shape what types of feature engineering/processing you may look at, and what types of models you will look at. Your particular problem may also have issues with non-i.i.d. data, while i.i.d. data is an assumption of most ML methods. This would impact how you evaluate the accuracy of your model.
If you want to learn enough ML to do this problem, you might want to start by looking at work done on malicious URL classification, an example of which can be found here. While you could "hack" your way to something without learning more about ML, I would not personally trust any solution built in that manner.
If you feel I'm wrong and I need to understand the foundations deeply in order to get value out of ML, do let me know.
Okay, I'll bite.
There are really two schools of thought currently related to prediction: "machine learners" versus statisticians. The former group focuses almost entirely on practical and applied prediction, using techniques like k-fold cross-validation, bagging, etc., while the latter group is focused more on statistical theory and research methods. You seem to fall into the machine-learning camp, which is fine, but then you say this:
As much as I would love to understand the basic fundamentals of how this would work out mathematically, right now I'm much more focused on getting it done, so a conceptual understanding of the systems and processes involved is what I'm looking for.
While a "conceptual understanding of the systems and processes involved" is a prerequisite for doing advanced analytics, it isn't sufficient if you're the one conducting the analysis (it would be sufficient for a manager, who's not as close to the modeling).
With just a general idea of what's going on, say, in a logistic regression model, you would likely throw all statistical assumptions (which are important) to the wind. Do you know whether certain features or groups shouldn't be included because there aren't enough observations in that group for the test statistic to be valid? What can happen to your predictions and hypotheses when you have high variance-inflation factors?
These are important considerations when doing statistics, and oftentimes people see how easy it is to do from sklearn.svm import SVC or something like that and run wild. That's how you get caught with your pants around your ankles.
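To make the VIF point concrete, here is a quick sketch on synthetic data, using statsmodels, of how two nearly collinear features blow up the variance-inflation factors:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic design matrix with two nearly collinear features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),  # almost a copy of x1
    "x3": rng.normal(size=200),
})
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 1))
# x1 and x2 get enormous VIFs; x3 sits near 1.
# Rule of thumb: VIF > 5-10 signals trouble for coefficient inference.
```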
How do I build this box driven by ML?
You don't seem to have even a rudimentary understanding of how to approach machine/statistical learning problems. I would highly recommend that you take an "Introduction to Statistical Learning"- or "Intro to Regression Modeling"-type course in order to think about how you translate the URLs you have into meaningful features that have significant power predicting URL class. Think about how you can decompose a URL into individual pieces that might give some information as to which class a certain URL pertains. If you're classifying espn.com domains by sport, it'd be pretty important to parse nba out of http://www.espn.com/nba/team/roster/_/name/cle, don't you think?
Good luck with your project.
Edit:
To nudge you along, though: every ML problem boils down to some function mapping input to output. Your outputs are URL classes. Your inputs are URLs. However, machines only understand numbers, right? URLs aren't numbers (AFAIK). So you'll need to find a way to translate information contained in the URLs to what we call "features" or "variables." One place to start, there, would be one-hot encoding different parts of each URL. Think of why I mentioned the ESPN example above, and why I extracted info like nba from the URL. I did that because, if I'm trying to predict to which sport a given URL pertains, nba is a dead giveaway (i.e. it would very likely be highly predictive of sport).
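To sketch what that might look like end to end (all names, URLs, and labels here are illustrative, not a recommended production design): split each URL into its domain and path segments, let a bag-of-words encoder one-hot those tokens, and fit any linear classifier on top.

```python
from urllib.parse import urlparse

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def url_to_tokens(url):
    """Split a URL into domain and path segments, e.g.
    'http://www.espn.com/nba/team/roster' -> 'www.espn.com nba team roster'."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    return " ".join([parts.netloc] + segments)

# Hypothetical training data: URLs tagged with a sport.
urls = ["http://www.espn.com/nba/team/roster/_/name/cle",
        "http://www.espn.com/nba/scores",
        "http://www.espn.com/nfl/standings",
        "http://www.espn.com/nfl/draft"]
labels = ["basketball", "basketball", "football", "football"]

# CountVectorizer one-hot/bag-of-words encodes the URL tokens for us.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit([url_to_tokens(u) for u in urls], labels)
print(model.predict([url_to_tokens("http://www.espn.com/nba/news")]))
```

From there, the hard part is exactly what the answer above stresses: deciding which pieces of the URL actually carry signal for your categories, and validating on URLs whose patterns the model has never seen.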

Evaluation metrics for algorithms to calculate topic-hotness

How do you evaluate algorithms that calculate the hotness of a post? That is, how would you know which performs better: exponential decay or Reddit's algorithm? I understand the question may be a bit naive, but I am looking for performance metrics or cost functions to help with this.
As with the evaluation of any piece of software, you first have to set out problems for it to solve and from those derive goals you want to achieve. Once you have those, you can start to determine which metrics will provide a useful approximation of progress towards the goals.
Perhaps you want your site to be great at breaking news. You probably will derive goals from that like "given sufficient votes, a new post should be able to make it to the top 30 listings in the first 10 minutes after it's posted". Then you can build out some test cases and see if you meet them.
Or perhaps you want to be the place with the "best" stuff from across the web. Your goals will weight more heavily towards user approval than newness.
You have to evaluate your own situation to come up with reasonable performance metrics.
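For instance, here is a minimal sketch (plain Python, with a made-up half-life and vote counts) that turns the breaking-news goal above into an executable check against a simple exponential-decay scorer; Reddit's algorithm or any other scorer could be dropped in and run against the same cases.

```python
import math

def hotness(votes, age_hours, half_life_hours=6.0):
    """Exponential-decay hotness: a post's votes halve in weight
    every half_life_hours."""
    return votes * math.exp(-math.log(2) * age_hours / half_life_hours)

# Toy test case in the spirit of the goal above: can a fresh post with
# enough votes outrank an older, heavily-voted one within 10 minutes?
fresh = hotness(votes=50, age_hours=10 / 60)
older = hotness(votes=300, age_hours=24)
assert fresh > older, "fresh post should outrank the day-old post"
print(f"fresh={fresh:.1f} older={older:.1f}")
```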
