I am new to machine learning, and I have several questions about my data.
Let's say I have X samples with Y features each, and I also have the connection between any two samples x1 and x2 (e.g. their interaction count).
Most machine learning tutorials start with labels attached directly to the individual samples...
So how should I build the model? I want a model that, given two specific samples, predicts how high their interaction count would be.
A direction or some keywords to learn from would be good enough, thanks!
I have received a suggestion for the approach:
Formulate the problem as z = f(x1, x2), i.e. the label depends on a tuple of samples. If a dataset of ((x1, x2) => z) pairs is prepared, it can then be used to train regression models, decision trees, or neural networks.
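A minimal sketch of that formulation, assuming scikit-learn and per-sample feature vectors in a NumPy array; the pairing scheme, the toy data, and the RandomForestRegressor choice are illustrative assumptions, not part of the original suggestion:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # X: (n_samples, n_features) matrix of per-sample features.
    # pairs: (i, j) index pairs with a known interaction count z.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                 # toy feature matrix
    pairs = [(i, j) for i in range(100) for j in range(i + 1, 100)][:500]
    z = rng.poisson(3, size=len(pairs))           # toy interaction counts

    # One training row per pair: concatenate both samples' features.
    XP = np.array([np.concatenate([X[i], X[j]]) for i, j in pairs])

    Xtr, Xte, ztr, zte = train_test_split(XP, z, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(Xtr, ztr)
    print(model.score(Xte, zte))                  # R^2 on held-out pairs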
I’m very new to machine learning.
I have a dataset of data given to me by an F1 racing game. A user plays the game and the game produces this dataset.
Using machine learning, I have to work with this data so that when a user (I know there are 10 of them) plays the game, I can recognize who is playing.
The data consists of datagram packets arriving at a frequency of 10 per second; each packet contains the following fields: time, lap time, lap distance, total distance, speed, car position, traction control, last lap time, fuel, gear, ..
I've thought about using k-means in a supervised way.
Which algorithm would be better?
The task must be multiclass classification. The very first step in any machine learning activity is to define a scoring metric (https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/). That allows you to compare models against each other and decide which is better. Then build a baseline model with random forest and/or logistic regression, as suggested in another answer - they perform well out of the box. Then try to play with the features and understand which of them are more informative. And don't forget about visualizations - they give many hints for data wrangling, etc.
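A minimal baseline along those lines, assuming scikit-learn, a feature matrix X with one row per packet (or per aggregated window), and driver labels y; the toy data and the macro-F1 scoring metric are just stand-ins for whatever you choose:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))         # toy stand-in for packet features
    y = rng.integers(0, 10, size=1000)      # 10 drivers

    models = {
        "random forest": RandomForestClassifier(random_state=0),
        # Scaling matters for logistic regression, not for the forest.
        "logistic regression": make_pipeline(StandardScaler(),
                                             LogisticRegression(max_iter=1000)),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")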
This is a somewhat broad question, so I'll try my best.
k-means is an unsupervised algorithm, meaning it finds the classes itself; it is best used when you know there are multiple classes but you don't know exactly what they are. Using it with labeled data just means you compute the distance of a new vector v to each vector in the dataset and pick the one (or ones, by majority vote) with the minimum distance; this is not really considered machine learning.
In this case, when you do have the labels, a supervised approach will yield much better results.
I suggest trying random forest and logistic regression first; those are the most basic and common algorithms, and they give pretty good results.
If you haven't achieved the desired accuracy, you can use deep learning and build a neural network with an input layer as big as your packet's list of values and an output layer with one node per class; in between you can use one or more hidden layers with various numbers of nodes (see the sketch below). But this is an advanced approach, and you had better pick up some experience in the machine learning field before pursuing it.
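A minimal sketch of such a network, assuming scikit-learn's MLPClassifier; the single 32-node hidden layer and the 10-field packet layout are illustrative assumptions:

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))     # input layer: one node per packet field
    y = rng.integers(0, 10, size=1000)  # output layer: one node per driver

    # One hidden layer of 32 nodes between the input and output layers.
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    )
    clf.fit(X, y)
    print(clf.predict(X[:5]))           # predicted driver for the first packets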
Note: the data is a time series, meaning that every driver has their own driving behaviour, so the data should be considered as chunks of points. With this view you can apply pattern-matching techniques; there are also several neural network architectures built exactly for this kind of data (like RNNs), but these are far more advanced and much more difficult to implement.
I need a learning model that, when we test it with a data sample, tells us which training data caused the answer.
Is there anything that does this?
(I already know KNN will do this)
thanks
Look for generative models.
"It asks the question: based on generation assumptions, which category is most likely to generate this signal?"
This is not a very well worded question:
Which training data cause the answer? I already know KNN will do this
KNN will tell you what the K nearest neighbors are, but it's not just those K training samples that cause the answer; all the other training samples contribute too, by being farther away.
The objective of machine learning is to generalize from the whole of the training dataset, so all samples in the training dataset (after outlier filtering and dataset reduction steps) cause the answer.
If your question is 'Which class of machine learning algorithms makes a decision by comparing a new instance to instances seen in the training data, and can list the training examples which most strongly informed the decision?', the answer is: Instance based learning https://en.wikipedia.org/wiki/Instance-based_learning
(e.g. KNN, kernel machines, RBF networks)
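A minimal sketch of that instance-based behaviour, assuming scikit-learn (the toy data is mine): KNeighborsClassifier's kneighbors method lists exactly which training examples most strongly informed a prediction.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 4))
    y_train = rng.integers(0, 2, size=100)

    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

    x_new = rng.normal(size=(1, 4))
    print(knn.predict(x_new))  # the decision

    # Indices (and distances) of the training samples that informed it most.
    distances, indices = knn.kneighbors(x_new)
    print(indices[0], distances[0])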
I have a set of items that are each described by 10 precise numbers n1, .., n10. I would like to learn the coefficients k1, .., k10 that should be associated with those numbers to rank the items according to my criteria.
For that purpose, I created a web application (in PHP) that shows me two items and asks me which one should be ranked first (this provides the supervision for the machine learning).
My question: I can't find a way to learn the ten coefficients at the same time for each case. Do you have any idea what algorithm I could use? (A neural network with all 10 numbers as inputs seems like a good option because it would learn all the coefficients, but I don't know what the output of this network would be, since I would like to train it by comparing the items two by two.)
A neural network is fine for this. The 10 coefficients you want would be the network's learned weights. Comparing items "two by two" does not influence the net architecture; the standard neural net training procedure takes care of "comparing the items" (if you want to call it that) itself.
Finally, determine whether your data is linear (a single-layer perceptron is enough) or non-linear (use a multilayer perceptron).
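A minimal sketch of the linear case under stated assumptions: a single-layer perceptron (here, logistic regression on feature differences, a standard pairwise-ranking trick) whose learned weights are the coefficients k1, .., k10. The pair data below is a toy stand-in for the PHP app's answers:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_items = 50
    items = rng.normal(size=(n_items, 10))      # each item: n1..n10
    true_k = rng.normal(size=10)                # hidden "true" coefficients (toy)

    # Simulated pairwise answers: for items a, b, label 1 if a ranks first.
    pairs = [(rng.integers(n_items), rng.integers(n_items)) for _ in range(500)]
    X = np.array([items[a] - items[b] for a, b in pairs])  # difference features
    y = (X @ true_k > 0).astype(int)

    # A linear model on differences amounts to scoring each item with k . n.
    model = LogisticRegression(fit_intercept=False).fit(X, y)
    k_learned = model.coef_[0]                  # the learned k1..k10
    print(np.corrcoef(k_learned, true_k)[0, 1]) # should be close to 1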
I would appreciate some insight into the workings of PyBrain's neural networks. I have a dataset of different household features that correspond to a certain household income. The task is to create a regression based on neural networks to be able to predict the income for given features.
I've tried the simple constructor

    pybrain.tools.shortcuts.buildNetwork(feature_count, 12, 1, recurrent=False)

and it kind of works. But if I change the hidden layer to use GaussianLayer or LinearLayer, I get NaNs as output during the training phase.
Is there maybe something else that needs to be taken care of when using these layers (I am guessing maybe feature selection, in case features correlate)?
Thanks
I solved a neural network regression problem using pybrain where I had to forecast the load on a power station using weather features. This appears to be the same problem as yours, except for the application domain. I followed the guide here: http://fastml.com/pybrain-a-simple-neural-networks-library-in-python/ which brought me 90% of the way towards the final solution. I had 8 inputs and one output.
One "gotcha" I found was that I had to normalise my input values to the 0 -> 1 range; otherwise the MSE value would not decrease on each epoch. Also, if any of my input values were NaN, I got continuous NaN values out.
I hope this helps.
I am working on a Machine Learning problem which looks like this:
Input Variables
- Categorical: a, b, c, d
- Continuous: e

Output Variables
- Discrete (Integers): v, x, y
- Continuous: z
The major issue I am facing is that the output variables are not totally independent of each other, yet no exact relation can be established between them. That is, there is a dependence, but not strict causality (one value being high doesn't imply that the other will be high too, but it does improve the chances of the other being higher).
An Example would be:
v - Number of Ad Impressions
x - Number of Ad Clicks
y - Number of Conversions
z - Revenue
Now, for an Ad to be clicked, it has to first appear on a search, so Click is somewhat dependent on Impression.
Again, for an Ad to be Converted, it has to be first clicked, so again Conversion is somewhat dependent on Click.
So running 4 separate instances of the problem, each predicting one output variable, doesn't make sense to me. In fact, there should be some way to predict all 4 together while taking care of their implicit dependencies.
But as you can see, there is no direct relation; in fact, there is a probabilistic relationship involved that can't be worked out manually.
Plus, the output variables are not categorical; they are in fact discrete and continuous.
Any input on how to go about solving this problem? Also, please point me to existing implementations and to a toolkit I could use to quickly implement a solution.
Just a guess - I think this problem could be targeted with Bayesian networks. What do you think?
Bayesian networks will do fine in your case. Your network won't be that huge either, so you can live with exact inference algorithms like graph elimination or the junction tree algorithm. If you decide to use BNs, you can use Kevin Murphy's BN toolbox. Here is a link to that. For a more general toolbox that uses Gibbs sampling for approximate Monte Carlo inference, you can use BUGS.
Edit:
As an example, look at the famous sprinkler example here. For totally discrete variables, you define the conditional probability tables as in the link. For instance, you say that given that today is cloudy, there is a 0.8 probability of rain. You define all the probability distributions, where the graph encodes the causality relations (i.e. cloud causes rain, etc.). Then, as a query, you ask your inference algorithm questions like: given that the grass was wet, was it cloudy? Was it raining? Was the sprinkler on? And so on.
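If you prefer Python, here is a hedged sketch of that sprinkler network using pgmpy (a different toolbox than the two named above; class names may vary between pgmpy versions). The 0.8 rain-given-cloudy value is the one quoted above; the remaining numbers are the ones commonly used in the classic example.

    from pgmpy.models import BayesianNetwork
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.inference import VariableElimination

    # Causality relations as a Directed Acyclic Graph.
    model = BayesianNetwork([("Cloudy", "Rain"), ("Cloudy", "Sprinkler"),
                             ("Rain", "WetGrass"), ("Sprinkler", "WetGrass")])

    # Conditional probability tables; columns are parent configurations.
    cpd_cloudy = TabularCPD("Cloudy", 2, [[0.5], [0.5]])
    cpd_rain = TabularCPD("Rain", 2,
                          [[0.8, 0.2],   # P(no rain | Cloudy = F, T)
                           [0.2, 0.8]],  # P(rain | cloudy) = 0.8, as above
                          evidence=["Cloudy"], evidence_card=[2])
    cpd_sprinkler = TabularCPD("Sprinkler", 2,
                               [[0.5, 0.9],
                                [0.5, 0.1]],  # sprinkler rarely on if cloudy
                               evidence=["Cloudy"], evidence_card=[2])
    cpd_wet = TabularCPD("WetGrass", 2,
                         [[1.0, 0.1, 0.1, 0.01],   # P(dry | S,R in FF,FT,TF,TT)
                          [0.0, 0.9, 0.9, 0.99]],  # P(wet | ...)
                         evidence=["Sprinkler", "Rain"], evidence_card=[2, 2])
    model.add_cpds(cpd_cloudy, cpd_rain, cpd_sprinkler, cpd_wet)

    # Query: given that the grass is wet, was it raining?
    infer = VariableElimination(model)
    print(infer.query(["Rain"], evidence={"WetGrass": 1}))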
To use BNs, one needs a system model described in terms of causality relations (a Directed Acyclic Graph) and probability transitions. If you want to learn your system's parameters, there are techniques like the EM algorithm. However, learning the graph structure is a really hard task, and supervised machine learning approaches will do better in that case.