Potential threat analysis classification (IP address) - machine-learning

I want to classify an IP address as either bad, neutral or good. I have 3 parameters:
Is the IP address from a cloud provider like DigitalOcean (true or false only)? If so, we penalize it.
Is the IP a known VPN/proxy (true or false only)? If so, we again penalize it.
Is the IP originating from a shady subnet (a trust score expressed as a percentage)? If so, we penalize it.
At first I wanted to use a credit-score-style weighting approach, i.e. each of the 3 conditions carries 5 points. Each individual parameter would have a percentage rating, so if it's a cloud provider we give it 10/100, otherwise we give it 100/100.
The problem is that this approach would produce false negatives, and tuning the overall score ranges for each class would also be a problem.
Differences in score ranges would be an issue too: the cloud-provider check can only be true or false, whereas the shady-subnet check yields an individual score based on existing data.
What would be a saner approach to tackling this? Would a decision tree be good enough, or should I opt for KNN?
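For context, the decision-tree option I'm considering would look roughly like this in scikit-learn; the feature order, example rows and labels below are all made up for illustration, and real labels would have to come from existing threat data.

```python
from sklearn.tree import DecisionTreeClassifier

# Feature order (assumed): [is_cloud_provider, is_vpn_or_proxy, subnet_trust_score]
# Rows and labels are illustrative only, not real threat data.
X = [
    [1, 1, 0.10],   # cloud provider, known VPN, very low subnet trust
    [1, 0, 0.40],
    [0, 0, 0.95],   # residential IP, no VPN, trusted subnet
    [0, 1, 0.30],
    [0, 0, 0.85],
]
y = ["bad", "neutral", "good", "bad", "good"]

clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X, y)

# Classify a new IP from its three signals.
print(clf.predict([[1, 0, 0.20]]))
```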

Related

Neural Networks for Large Repetitive Sets of Inputs

Suppose we want to make a neural network to predict the outcome of a race between some number of participants.
Each participant in the race has various statistics: Engine Power, Max Speed, Driver Experience, etc.
Now imagine we have been asked to build a system which can handle any number of participants, from 2 to 400 (just to pick a concrete number).
From what I have learned about "traditional" Neural Nets so far, our choices are:
Build a separate neural net for each number of participants: n = 2, 3, 4, 5, ..., 400.
Train one neural network taking input from 400 participants. When a piece of data refers to a race with fewer than 400 participants (this will be a large percentage of the data), just set all remaining statistic inputs to 0.
Assuming this would work, is there any reason to expect one method to perform better than the other?
The former is more specialized, but you have much less training data per net, so my guess is that it would work out roughly the same?
Is there a standard way to approach problems similar to this?
We could imagine (simplistically) that the neural network first classifies the strength of each participant, and therefore, each time a new participant is added, it needs to apply this same analysis to these new inputs, potentially hinting that there might be a "smart" way to reduce the total amount of work required.
Is this just screaming for a convolutional neural network?
Between your two options, option 1 would involve repeating a lot of effort to train for different sizes, and would probably be very slow to train as a result.
Option 2 is a bit more workable, but the network would need extra training on different sized inputs.
Another option, which I think would be the most likely to work, would be to only train a neural net to choose a winner between two participants, and use this to create a ranking via many comparisons between pairs. Such an approach is described here.
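As a rough illustration of that pairwise idea (not code from the linked post), here is a sketch where a hypothetical predict_winner(a, b) stands in for the trained two-participant network, and participants are ranked by how many of their pairwise match-ups they are predicted to win:

```python
from itertools import combinations

def rank_participants(participants, predict_winner):
    # `predict_winner(a, b)` stands in for the trained two-participant net:
    # it returns True if `a` is predicted to beat `b`.
    wins = [0] * len(participants)
    for i, j in combinations(range(len(participants)), 2):
        if predict_winner(participants[i], participants[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # Highest number of predicted pairwise wins first.
    return sorted(range(len(participants)), key=lambda k: wins[k], reverse=True)

# Toy usage: stats are (engine_power, max_speed, driver_experience) tuples and
# the fake predictor just compares engine power.
cars = [(120, 180, 3), (150, 200, 5), (100, 160, 10)]
print(rank_participants(cars, lambda a, b: a[0] > b[0]))   # -> [1, 0, 2]
```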
We could imagine (simplistically) that the neural network first classifies the strength of each participant, and therefore, each time a new participant is added, it needs to apply this same analysis to these new inputs, potentially hinting that there might be a "smart" way to reduce the total amount of work required.
I think you've got the key idea here. Since we want to perform exactly the same analysis on each participant (assuming it makes no difference whether they're participant 1 or participant 400), this is an ideal problem for weight sharing. This means that the weights on the neurons doing the initial analysis of a participant are identical for every participant. When these weights change for one participant, they change for all participants.
While CNNs do use weight sharing, we don't need to use a CNN to use this technique. The details of how you'd go about doing this would depend on your framework.
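For concreteness, here is a minimal sketch of the weight-sharing idea in PyTorch (my choice of framework, not one named in the question): one small sub-network is applied to every participant's statistics, so adding participants adds no new weights.

```python
import torch
import torch.nn as nn

class SharedParticipantScorer(nn.Module):
    """Scores every participant with the same (shared-weight) sub-network."""

    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        # This sub-network is applied to each participant, so its weights
        # are shared across all of them (the weight-sharing idea above).
        self.per_participant = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, participants: torch.Tensor) -> torch.Tensor:
        # participants: (batch, n_participants, n_features); n_participants can vary.
        scores = self.per_participant(participants).squeeze(-1)
        # Softmax over participants gives a per-participant "probability of winning".
        return torch.softmax(scores, dim=-1)

# Example: a batch of 8 races, each with 5 participants described by 3 stats
# (e.g. Engine Power, Max Speed, Driver Experience).
model = SharedParticipantScorer(n_features=3)
win_probs = model(torch.randn(8, 5, 3))   # shape (8, 5); each row sums to 1
```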

How do UnknownCategoricalLevels affect the confidence values of H2O model predictions

I am using a DRF model generated with H2O Flow. When running fresh input data against this model (using its MOJO in a Java program with the EasyPredictModelWrapper), there are a large number of UnknownCategoricalLevels (checked with the getUnknownCategoricalLevelsSeen() and getUnknownCategoricalLevelsSeenPerColumn() methods).
My workaround for this was to only use those predictions that had a prediction confidence above a certain threshold (say 0.90), i.e. the classProbability selected by the model must be greater than the threshold to be used.
My questions are:
Is this solution wrong-headed (i.e. it does not actually address or work around the problem, e.g. because unknown levels don't actually affect the class probability values), or is it a valid workaround?
Is there a better way to address this issue?
Thanks.
The unknown categorical level is treated as an NA for that column.
Without knowing the details of your data (including the cost implications of false positives and false negatives), I wouldn't say that you need to threshold rows that have NAs any differently than for rows that do not. (The NA is already handled quite well by DRF.)
Note that the built-in threshold is the max-F1 threshold (not 0.5), so if you are changing the threshold for rows with unknown values, it's relative to max-F1. Using your own threshold is certainly a valid approach.
If you want to visualize your trees to more easily see how the NAs behave, you can do so following the instructions here:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#viewing-a-mojo
There are also other strategies for dealing with it, like target-encoding your categorical input column and treating an NA as the average target value. (This effectively turns a categorical variable into a numeric one, but requires you to preprocess the data.)
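A rough pandas sketch of that target-encoding idea (the column names and rows are made up), where levels never seen in training fall back to the global average target value:

```python
import pandas as pd

# Hypothetical training frame: "category" is the troublesome categorical column
# and "target" is the 0/1 response.
train = pd.DataFrame({
    "category": ["a", "a", "b", "b", "c"],
    "target":   [1, 0, 1, 1, 0],
})

# Average target value per level seen in training, plus the global mean as a fallback.
level_means = train.groupby("category")["target"].mean()
global_mean = train["target"].mean()

def target_encode(values: pd.Series) -> pd.Series:
    # Levels never seen in training (the "unknown categorical levels") map to NaN,
    # which we fill with the overall average target value, as suggested above.
    return values.map(level_means).fillna(global_mean)

print(target_encode(pd.Series(["a", "d"])))   # "d" is an unknown level -> global mean
```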

Machine Learning: How to detect the independent variables that are generating a dependent boolean value

I'm trying to use machine learning in my job, but I can't find a way to adapt it to what I need, and I don't know whether this is already a known problem or something that doesn't have a known solution yet.
Let's say that I have a lot of independent variables, encoded as one-hot, and a dependent variable with only two states: True (the result had an error) and False (the result was successful).
My independent variables are the parameters I use for a query to an API, and the dependent variable is what the API returned.
My objective is to detect, in a dataset covering a timeframe of a few hours, the failing parameters, so I can avoid querying the API when I'm fairly certain it would fail.
(I'm working with millions of queries per day, and this mechanism is critical for a good user experience)
I'll try to make an example so you can understand what I need.
Suppose that I have a delivery company with 3 trucks and 3 different routes I could take.
So my dummy variables would be T1, T2, T3, R1, R2 and R3 (I could drop T3 and R3, since they are implied by the omission of the other two).
Then, I have a big dataset of the times that the delivery was delayed. So: Delayed=1 or Delayed=0
With this, I would have a set like this:
T1 | T2 | T3 | R1 | R2 | R3 | Delayed
---|----|----|----|----|----|--------
 1 |  0 |  0 |  1 |  0 |  0 |    0
 1 |  0 |  0 |  0 |  1 |  0 |    1
 0 |  1 |  0 |  1 |  0 |  0 |    0
 1 |  0 |  0 |  0 |  1 |  0 |    1
 1 |  0 |  0 |  1 |  0 |  0 |    0
Not only do I want to say "in most cases, truck 1 arrives late; it could have a problem, I shouldn't send it out any more" (that is a valid result too), but I also want to detect things like: "in most cases, truck 1 arrives late when it goes on route 1; probably this type of truck has a problem on this specific route".
This dataset is an example; the real one is huge, with thousands of independent variables, so it could easily have more than one problem in the same dataset.
Example 1: truck 1 has problems on route 1, and truck 3 has problems on route 1.
Example 2: truck 1 has problems on route 1, and truck 3 has problems on any route.
So, I would make a blacklist like:
Example 1: block if (truck=1 AND route=1) OR (truck=3 AND route=1)
Example 2: block if (truck=1 AND route=1) OR truck=3
I'm actually doing this without machine learning, with some ugly code that computes a massive Cartesian product of the independent columns and counts the number of "delayed" rows. Then I pick the worst delayed/total proportion, blacklist it, and iterate again with new values.
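Roughly, that counting step looks like this in pandas (with the dummy columns collapsed back to truck/route labels for readability; the rows are the example table above):

```python
import pandas as pd

# The five example rows from the table above, as truck/route labels.
df = pd.DataFrame({
    "truck":   ["T1", "T1", "T2", "T1", "T1"],
    "route":   ["R1", "R2", "R1", "R2", "R1"],
    "delayed": [0, 1, 0, 1, 0],
})

# Delayed/total proportion per (truck, route) combination -- roughly what the
# Cartesian-product counting code computes.
rates = df.groupby(["truck", "route"])["delayed"].agg(["mean", "count"])
print(rates.sort_values("mean", ascending=False))
```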
These errors are usually temporary, so I would feed in a new dataset every few hours; I don't need a lifetime analysis, as long as the algorithm accounts for these temporary issues.
Does anyone have a clue about what I can use, or where I can read up on this?
Don't hesitate to ask for more info if you need it.
Thanks in advance!
Regards
You should check out the scikit-learn package for machine learning classifiers (Random Forest is an industry standard). For this problem, you could feed a portion of the data (training set, say 80% of the data) to the model and it would learn how to predict the outcome variable (delayed/not delayed).
You can then test the accuracy of your model by 'testing' on the remaining 20% of your data (the test set), to see if your model is any good at predicting the correct outcome. This will give you a % accuracy. Higher is better generally, unless you have severely imbalanced classes, in which case your classifier will just always predict the more common class for easy high accuracy.
Finally, if the accuracy is satisfactory, you can find out which predictor variables your model considered most important to achieve that level of prediction, i.e. variable importance. I think this is what you're after. Running this every few hours would then tell you exactly which features (columns) in your set best predict whether a truck is late.
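Putting those steps together, a minimal scikit-learn sketch (using the tiny truck/route example from the question; real data would come from your own query logs) could look like this:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# The five example rows from the question; in practice this would be the
# one-hot encoded query parameters for the last few hours.
df = pd.DataFrame(
    [[1, 0, 0, 1, 0, 0, 0],
     [1, 0, 0, 0, 1, 0, 1],
     [0, 1, 0, 1, 0, 0, 0],
     [1, 0, 0, 0, 1, 0, 1],
     [1, 0, 0, 1, 0, 0, 0]],
    columns=["T1", "T2", "T3", "R1", "R2", "R3", "Delayed"],
)

X, y = df.drop(columns=["Delayed"]), df["Delayed"]

# 80/20 train/test split as described above (a real dataset would be far larger).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Variable importance: which columns the forest relied on most.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```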
Obviously, this is all easier said than done and often you will have to perform significant cleaning of your data, sometimes normalisation (not in the case of random forests though), sometimes weighting your classifications, sometimes engineering new features... there is a reason this is a dedicated profession.
Essentially what you're asking is "how do I do Data Science?". Hopefully this will get you started, the rest (i.e. learning) is on you.

How to check a trained neural network

I am writing a little bit about Google's DeepDream. It's possible to inspect what a trained network has learned with DeepDream; see the example with the dumbbells on the Google Research blog.
In that example a network is trained to recognize a dumbbell. Then they use DeepDream to see what the network has learned, and the result shows the network was trained badly, because it recognizes a dumbbell plus an arm as a dumbbell.
My question is: how are networks checked in practice? With DeepDream, or with which other methods?
Best greetings
Generally in machine learning you validate your learned network on a dataset you did not use in the training process (a test set). So in this case, you would have a set of examples with and without dumbbells that was used to train the model, as well as a set (also with and without dumbbells) that was not seen during the training procedure.
When you have your model, you let it predict the labels of the withheld set. You then compare these predicted labels to the actual ones:
Every time you predict a dumbbell correctly, you increment the number of True Positives;
when the model correctly predicts the absence of a dumbbell, you increment the number of True Negatives;
when it predicts a dumbbell but there isn't one, you increment the number of False Positives;
finally, if it predicts no dumbbell but there is one, you increment the number of False Negatives.
Based on these four counts, you can then compute measures such as the F1 score or accuracy to evaluate the performance of the model. (Have a look at the following wiki: https://en.wikipedia.org/wiki/F1_score )
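As a small illustration (with made-up labels), scikit-learn can compute these counts and scores directly:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical labels for the held-out test set: 1 = "dumbbell", 0 = "no dumbbell".
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

# Rows are the actual class, columns the predicted class: [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
```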

What is the appropriate Machine Learning Algorithm for this scenario?

I am working on a Machine Learning problem which looks like this:
Input Variables
  Categorical: a, b, c, d
  Continuous: e
Output Variables
  Discrete (integers): v, x, y
  Continuous: z
The major issue I am facing is that the output variables are not totally independent of each other, yet there is no explicit relation that can be established between them. That is, there is a dependence, but not a deterministic one: one value being high doesn't imply that the other will be high too, but it does make a higher value more likely.
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an Ad to be clicked, it has to first appear on a search, so Click is somewhat dependent on Impression.
Again, for an Ad to be Converted, it has to be first clicked, so again Conversion is somewhat dependent on Click.
So running 4 separate instances of the problem, predicting each output variable on its own, doesn't make sense to me. In fact there should be some way to predict all 4 together, taking care of their implicit dependencies.
But as you can see, there won't be a direct relation; in fact there is a probability involved, which can't be worked out manually.
Plus, the output variables are not categorical but are in fact discrete and continuous.
Any input on how to go about solving this problem? Also, please point me to existing implementations and a toolkit I could use to implement a solution quickly.
Just a random guess: I think this problem could be targeted by Bayesian networks. What do you think?
Bayesian Networks will do fine in your case. Your network won't be that huge either so you can live with exact inference algorithms like graph elimination or junction tree. If you decide to use BNs, then you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example, look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance, you say that given that today is cloudy, there is a 0.8 probability of rain. You define all the probability distributions, where the graph encodes the causality relations (i.e. if cloudy then rain, etc.). Then as a query you ask your inference algorithm questions like: given that the grass was wet, was it cloudy, was it raining, was the sprinkler on, and so on.
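For a concrete feel, here is a minimal sketch of that sprinkler network using the Python pgmpy library (my choice, not the Kevin Murphy toolbox or BUGS mentioned above; older pgmpy versions call the model class BayesianModel):

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Classic sprinkler network: Cloudy -> Sprinkler, Cloudy -> Rain, both -> WetGrass.
model = BayesianNetwork([
    ("Cloudy", "Sprinkler"), ("Cloudy", "Rain"),
    ("Sprinkler", "WetGrass"), ("Rain", "WetGrass"),
])

# State 0 = false, state 1 = true. P(Rain=1 | Cloudy=1) = 0.8, as in the text above.
cpd_cloudy = TabularCPD("Cloudy", 2, [[0.5], [0.5]])
cpd_sprinkler = TabularCPD("Sprinkler", 2, [[0.5, 0.9], [0.5, 0.1]],
                           evidence=["Cloudy"], evidence_card=[2])
cpd_rain = TabularCPD("Rain", 2, [[0.8, 0.2], [0.2, 0.8]],
                      evidence=["Cloudy"], evidence_card=[2])
cpd_wet = TabularCPD("WetGrass", 2,
                     [[1.0, 0.1, 0.1, 0.01],    # P(WetGrass=0 | Sprinkler, Rain)
                      [0.0, 0.9, 0.9, 0.99]],   # P(WetGrass=1 | Sprinkler, Rain)
                     evidence=["Sprinkler", "Rain"], evidence_card=[2, 2])

model.add_cpds(cpd_cloudy, cpd_sprinkler, cpd_rain, cpd_wet)
model.check_model()

# Query: given that the grass is wet, was it raining?
inference = VariableElimination(model)
print(inference.query(["Rain"], evidence={"WetGrass": 1}))
```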
To use BNs one needs a system model described in terms of causality relations (a directed acyclic graph) and probability transitions. If you want to learn your system parameters, there are techniques like the EM algorithm. However, learning the graph structure is a really hard task, and supervised machine learning approaches will do better in that case.
