Classification with custom utility function - machine-learning

I have a problem which involves optimization of actions over time:
Let's assume I have a set of input variables X, where each X_i has a value X_i_t at each point in time t = 0, ..., T.
For each point in time, I would like to choose an action a_t from a set of actions A, such that a utility function U(a_0, ..., a_T) is maximized.
Note that the utility function does not have a closed-form solution, and its value depends on the entire sequence of actions a_0, ..., a_T.
How would I implement something like this? I am perfectly happy with a keyword I can use to look up relevant literature; I do not need a full solution. Though if somebody can point me to a Python sklearn function which does this, I would definitely not say no...
My first intuition was "logistic regression" but there is no way to assign "correct labels" to an action a_t at time t, since the utility depends on the actions taken earlier and later in the time series.

If you plan to use neural networks with TensorFlow or PyTorch, it will be easy. As long as you can express the function U within the framework and the utility function is reasonably close to being continuous, you can back-propagate the utility through the network. You just ask the optimizer to maximize the utility and that's it.
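As a minimal sketch of that idea (my own illustration, not from the question; it assumes PyTorch, made-up dimensions, and a differentiable stand-in utility U over soft action scores):

import torch
import torch.nn as nn

T, num_features, num_actions = 100, 8, 4
X = torch.randn(T, num_features)                    # hypothetical inputs X_t

# maps each x_t to a soft (continuous) action vector over the action set A
policy = nn.Sequential(nn.Linear(num_features, num_actions), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def U(actions):
    # stand-in differentiable utility over the whole sequence; replace with your own
    return actions[:, 0].sum()

for step in range(500):
    actions = policy(X)                             # soft actions for all t at once
    loss = -U(actions)                              # maximizing U == minimizing -U
    optimizer.zero_grad()
    loss.backward()                                 # back-propagate the utility
    optimizer.step()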
If the utility function is discrete, it gets tricky, but there are several tricks you might try. One of them is the REINFORCE algorithm (Monte-Carlo policy gradient). Another trick that is getting quite popular is the Gumbel softmax, which allows sampling discrete actions while still propagating the error to the network.
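A rough REINFORCE sketch under the same assumptions as above (the utility here is a hypothetical black box and need not be differentiable):

import torch
import torch.nn as nn

T, num_features, num_actions = 100, 8, 4
X = torch.randn(T, num_features)                     # hypothetical inputs, as before
policy = nn.Sequential(nn.Linear(num_features, num_actions), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def U_blackbox(actions):
    # stand-in utility over the whole action sequence; it may be discrete
    # and need not be differentiable - replace with your own U(a_0, ..., a_T)
    return (actions == 0).float().mean().item()

for step in range(500):
    probs = policy(X)                                # action probabilities for every t
    dist = torch.distributions.Categorical(probs)
    actions = dist.sample()                          # one discrete action per time step
    reward = U_blackbox(actions)                     # plain scalar, no gradient required
    loss = -dist.log_prob(actions).sum() * reward    # score-function (REINFORCE) estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()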
If you plan to use different classifiers (like decision forests or whatever), you might try something based on imitation learning, like the SEARN algorithm.

Related

Why shouldn't we use multiple activation functions in the same layer?

I'm a newbie when it comes to ML and neural nets; I have been studying mainly online through Coursera videos and a bit of Kaggle/GitHub. All the examples or cases where I've seen neural networks being applied have one thing in common - they use a specific type of activation function in all the nodes belonging to a specific layer.
From what I understand, each node uses non-linear activation functions to learn about a particular pattern in the data. If it were so, why not use multiple types of activation functions?
I did find one link, which basically says that it's easier to manage a network if we use just one activation function per layer. Any other benefits?
The purpose of an activation function is to introduce non-linearity into a neural network. See this answer for more insight on why our deep neural networks would not actually be deep without non-linearity.
Activation functions do their job by controlling the outputs of the neurons. Sometimes they provide a simple threshold, like ReLU does, which can be coded as follows:
def relu(x):
    # pass positive inputs through unchanged, clamp everything else to zero
    if x > 0:
        return x
    else:
        return 0
And at other times they behave in more complicated ways, such as tanh(x) or sigmoid(x). See this answer for more on different sorts of activations.
I also would like to add that I agree with @Joe: an activation function does not learn a particular pattern; it affects the way that a neural network learns multiple patterns. Each activation function has its own kind of effect on the output.
Thus, one benefit of not using multiple activation functions in a single layer is the predictability of their effect. We know what ReLU or Sigmoid does to the output of a convolutional filter, for example. But do we know the effect of their cascaded use? And in which order: does ReLU come first, or is it better to use Sigmoid first? Does it matter?
If we want to benefit from the combination of activation functions, all of these questions (and maybe many more) need to be answered with scientific evidence. Tedious experiments and evaluations would have to be done to get meaningful results. Only then would we know what it means to use them together, and after that, maybe a new type of activation function would arise and there would be a new name for it.

why we use kmeans.fit function in kmeans clustering method?

I am following a k-means clustering tutorial from a video, but I do not understand why we use the .fit method in k-means clustering.
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(X)  # why do we use this fit method here?
kmeans is your defined model.
To train our model, we call kmeans.fit().
The argument to kmeans.fit() is the data set that needs to be clustered.
After calling the fit() function, our model is ready.
We then get the labels for the clusters using
data_labels = kmeans.labels_
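Putting it together, a minimal runnable sketch (with made-up data in place of your X):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                # hypothetical data: 100 points, 2 features
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(X)                             # positions the 5 cluster centroids
data_labels = kmeans.labels_              # cluster index (0..4) for each point
centers = kmeans.cluster_centers_         # coordinates of the learned centroids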
Because the sklearn people early on decided that everything should have fit(X, y) and predict(X) functions. And it likely is not going to change, because of backwards compatibility...
It does not make a whole lot of sense for clustering, which does not use y (which defaults to None as it is ignored). And there is no real use case where you would want to drop-in replace a classifier with a clustering, either.
Nevertheless, you'll at some point need to run the algorithm. It is an anti-pattern to do this in a constructor (so KMeans(n_clusters=5, data=X) is a no-no), so you will have to invoke some method. You may as well call it fit then, which fits at least for optimization based methods such as k-means.
You could, however, simply use the method k_means(X, n_clusters=5) instead of using the class. Then it would be a single line (see the source code of fit for an example).
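For instance (a sketch, with a made-up X standing in for your data array):

import numpy as np
from sklearn.cluster import k_means

X = np.random.rand(100, 2)                # hypothetical data
centers, labels, inertia = k_means(X, n_clusters=5, random_state=0)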

Supervised learning linear regression

I am confused about how linear regression works in supervised learning. I want to generate an evaluation function for a board game using linear regression, so I need both input data and output data. The input data is my board condition, and I need the corresponding value for this condition, right? But how can I get this expected value? Do I need to write an evaluation function first by myself? But I thought I was supposed to generate the evaluation function by using linear regression, so I'm a little confused about this.
It's supervised-learning after all, meaning: you will need input and output.
Now the question is: how to obtain these? And this is not trivial!
Candidates are:
historical-data (e.g. online-play history)
some form of self-play / reinforcement-learning (more complex)
But then a new problem arises: which output is available, and what kind of input will you use?
If there were some a-priori implemented AI, you could just take its scores. But with historical data, for example, you only get -1, 0, 1 (A wins, draw, B wins), which makes learning harder (and this touches the credit-assignment problem: there might be one move which made someone lose, and it's hard to understand which of 30 moves led to the result of 1).
This is also related to the input. Take chess, for example, and take a random position from some online game: there is a chance that this position is unique over 10 million games (or at least does not occur often), which conflicts with the expected performance of your approach. I assumed here that the input is the full board position. This changes for other inputs, e.g. chess material, where the input is just a histogram of pieces (3 of these, 2 of those). Then there are far fewer unique inputs and learning will be easier.
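As a toy sketch of that last idea (my own illustration with made-up material features and outcomes, not a method from the answer):

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical features: material differences (pawns, knights, bishops, rooks, queens)
# for a handful of positions; the target is the final game outcome in {-1, 0, 1}
X = np.array([[ 1, 0,  0, 0, 0],
              [ 0, 1, -1, 0, 0],
              [-2, 0,  0, 1, 0],
              [ 0, 0,  0, 0, 1],
              [-1, 0,  0, 0, 0]])
y = np.array([0, 0, 1, 1, -1])

model = LinearRegression().fit(X, y)
print(model.coef_)   # the learned weights act as "piece values" in a linear evaluation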
Long story short: it's a complex task with a lot of different approaches and most of this is somewhat bound by your exact task! A linear evaluation-function is not super-uncommon in reinforcement-learning approaches. You might want to read some literature on these (this function is a core-component: e.g. table-lookup vs. linear-regression vs. neural-network to approximate the value- or policy-function).
I might add that your task points toward the self-learning approach to AI, which is very hard, and it's a topic which gained additional popularity in recent years (there was success before: see backgammon AI). But all of these approaches are highly complex, and a good understanding of RL and of mathematical basics like Markov Decision Processes is important there.
For more classic hand-made evaluation-function based AIs, a lot of people used an additional regressor for tuning / weighting already implemented components. Some overview is at the chessprogramming wiki. (The chess-material example from above might be a good one: the assumption is that more pieces are better than fewer, but it's hard to assign them values.)

What actually is the purpose of the sigmoid derivative?

Honestly, I'm learning about neural networks and I have a question about the activation part.
I know that the question is general and there are a lot of explanations around the internet, but I still don't understand it clearly.
Why do we need to differentiate the sigmoid function? Why don't we just use it?
It would be good if you could give a clear explanation. Thank you.
I've seen many videos on YouTube and read many articles about it, but I still don't get it.
Thanks for your help.
Your question is not entirely clear, but I assume you are asking: "Why don't we just use the Sigmoid function without having to calculate its derivative?".
Your question is also very broad, so my answer is very broad and wordy; you will need to read more to understand all the details, for which I'll try to provide links.
Activation function: as the name suggests, we want to know whether a given node is "on" or "off", and the sigmoid function provides an easy way to map a continuous variable (x) into the range (0, 1).
Use cases can vary and this function has certain properties, and so that is why there are many alternative "activation" functions, like tanh, ReLU, etc. Read more here: https://en.wikipedia.org/wiki/Sigmoid_function
Differentiate (derivative): in most models, we want to find the best-fit parameters for all our activation functions. To do this, we typically want to minimise a "cost" function that describes how good our model is at predicting observed data. One way to solve this optimisation problem is gradient descent. Each step of gradient descent updates the parameters by following the slope of the multi-dimensional cost-function space, and to do this it needs the gradient of the activation function. This is important for back-propagation, which uses gradient descent to optimise the network; it requires that the activation functions you use be (in most cases) differentiable. For the sigmoid, this derivative has a convenient closed form: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)).
Read more here: https://en.wikipedia.org/wiki/Gradient_descent
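A minimal sketch of the sigmoid and its derivative (my own illustration in NumPy):

import numpy as np

def sigmoid(x):
    # squashes any real input into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # closed form used during back-propagation: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)
print(sigmoid_derivative(x))   # largest at x = 0, vanishing in the tails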
I suggest if you have a deeper question that you take it to one of the machine learning stackexchange sites.

What is the appropriate Machine Learning Algorithm for this scenario?

I am working on a Machine Learning problem which looks like this:
Input Variables
Categorical: a, b, c, d
Continuous: e
Output Variables
Discrete (integers): v, x, y
Continuous: z
The major issue that I am facing is that the output variables are not totally independent of each other, and yet no fixed relation can be established between them. That is, there is a dependence, but not a deterministic one: one value being high doesn't imply that the other will be high too, but it improves the chances of the other being higher.
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an Ad to be clicked, it has to first appear on a search, so Click is somewhat dependent on Impression.
Again, for an Ad to be Converted, it has to be first clicked, so again Conversion is somewhat dependent on Click.
So running 4 instances of the problem, predicting each of the output variables separately, doesn't make sense to me. In fact, there should be some way to predict all 4 together, taking care of their implicit dependencies.
But as you can see, there won't be a direct relation; in fact there is a probability involved, which can't be worked out manually.
Plus the output variables are not Categorical but are in fact Discrete and Continuous.
Any inputs on how to go about solving this problem? Also, please guide me to existing implementations of the same, and to a toolkit I could use to quickly implement a solution.
Just a random guess - I think this problem can be targeted by Bayesian Networks. What do you think?
Bayesian Networks will do fine in your case. Your network won't be that huge either, so you can live with exact inference algorithms like graph elimination or the junction tree algorithm. If you decide to use BNs, then you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example, look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance, you say that given that today is cloudy, there is a 0.8 probability of rain. You define all probability distributions, where the graph shows the causality relations (i.e. cloud causes rain, etc.). Then as a query you ask your inference algorithm questions like: given that the grass was wet, was it cloudy, was it raining, was the sprinkler on, and so on.
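As a sketch of that same sprinkler network in Python (my own illustration using the pgmpy library rather than the MATLAB toolbox named above; the numbers are the classic textbook values):

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# causality relations as a directed acyclic graph
model = BayesianNetwork([('Cloudy', 'Rain'), ('Cloudy', 'Sprinkler'),
                         ('Rain', 'WetGrass'), ('Sprinkler', 'WetGrass')])

# P(Cloudy)
cpd_c = TabularCPD('Cloudy', 2, [[0.5], [0.5]])
# P(Rain | Cloudy): given that it is cloudy, 0.8 probability of rain
cpd_r = TabularCPD('Rain', 2, [[0.8, 0.2], [0.2, 0.8]],
                   evidence=['Cloudy'], evidence_card=[2])
# P(Sprinkler | Cloudy)
cpd_s = TabularCPD('Sprinkler', 2, [[0.5, 0.9], [0.5, 0.1]],
                   evidence=['Cloudy'], evidence_card=[2])
# P(WetGrass | Sprinkler, Rain)
cpd_w = TabularCPD('WetGrass', 2,
                   [[1.0, 0.1, 0.1, 0.01],
                    [0.0, 0.9, 0.9, 0.99]],
                   evidence=['Sprinkler', 'Rain'], evidence_card=[2, 2])

model.add_cpds(cpd_c, cpd_r, cpd_s, cpd_w)
assert model.check_model()

# query: given that the grass was wet, was it raining?
infer = VariableElimination(model)
print(infer.query(['Rain'], evidence={'WetGrass': 1}))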
To use BNs, one needs a system model described in terms of causality relations (a Directed Acyclic Graph) and probability transitions. If you want to learn your system's parameters, there are techniques like the EM algorithm. However, learning the graph structure is a really hard task, and supervised machine learning approaches will do better in that case.
