I'm new to machine learning, and I'd like to ask a question about model generalization. In my case, I'm going to produce some mechanical parts, and I'm interested in controlling the input parameters to obtain certain properties in the final part.
More specifically, I'm interested in 8 parameters (say, P1, P2, ..., P8). To keep the number of pieces I need to produce manageable while still exploring as many parameter combinations as possible, I've divided the problem into 2 sets. For the first set of pieces, I'll vary the first 4 parameters (P1 ... P4) while the others are held constant. For the second set, I'll do the opposite (vary P5 ... P8 and hold P1 ... P4 constant).
So I'd like to know if it's possible to build a single model that takes the eight parameters as inputs to predict the properties of the final part. I ask because, since I'm not varying all 8 variables at once, I thought I might have to build one model per set of parameters, and then the predictions of the two different models couldn't be related to each other.
Thanks in advance.
In most cases, having two different models will give better accuracy than one big model. The reason is that each local model only looks at 4 features and can identify patterns among them to make its predictions.
But this particular approach will almost certainly fail to scale. Right now you only have two sets of data, but what if that grows and you end up with 20 sets? It will not be practical to create and maintain 20 ML models in production.
What works best for your case will need some experimentation. Take a random sample from your data and train both options: one big model and two local models. Then evaluate their performance, not just accuracy but also F1 score, AUC-PR and the ROC curve, to find out what works best for you. If you do not see a major performance drop, one big model for the entire dataset will be the better option. If you know that your data will always be divided into these two sets and you don't care about scalability, then go with the two local models.
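A rough sketch of that experiment with scikit-learn, assuming a pandas DataFrame with placeholder column names (P1..P8, a binary target, and a batch column marking which experimental set each row came from), might look like this:

    # Sketch only: compare one big model against two local models with
    # cross-validated F1. All file and column names are placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("parts.csv")            # hypothetical measurements file
    set1 = ["P1", "P2", "P3", "P4"]          # varied in the first batch of pieces
    set2 = ["P5", "P6", "P7", "P8"]          # varied in the second batch
    y = df["target"]                         # property of interest, binarized

    # One big model over all eight parameters.
    big_f1 = cross_val_score(RandomForestClassifier(), df[set1 + set2], y,
                             cv=5, scoring="f1").mean()

    # Two local models, each trained only on the rows of its own batch.
    local_f1 = []
    for batch_id, cols in [(1, set1), (2, set2)]:
        rows = df["batch"] == batch_id
        local_f1.append(cross_val_score(RandomForestClassifier(),
                                        df.loc[rows, cols], y[rows],
                                        cv=5, scoring="f1").mean())

    print("big model F1:", big_f1)
    print("local model F1s:", local_f1)

The same loop works with scoring="roc_auc" or scoring="average_precision" if you want the ROC and AUC-PR comparisons as well.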
The whole point of using an SVM is that the algorithm will be able to decide whether an input belongs to one class or the other.
I am trying to use an SVM for predictive maintenance to predict how likely a system is to overheat.
For my example, the range is 0-102°C and if the temperature reaches 80°C or above it's classed as a failure.
My inputs are arrays of 30 doubles (the last 30 readings).
I am making some sample inputs to train the SVM, and I was wondering if it is good practice to pass in very specific data to train it, e.g. passing in arrays of 80°C, 81°C ... 102°C so that the model will automatically associate these values with failure. You could do an array of 30 x 79°C as well and label that as a pass.
This seems like a complete way of doing it, although if you input arrays like that, wouldn't it be the same as hardcoding a switch statement that triggers when the temperature reads 80-102°C?
Would it be a good idea to pass in these "hardcoded" style arrays or should I stick to more random inputs?
If there is a finite set of possibilities, I would really recommend using Naïve Bayes, as that method would fit this problem perfectly. However, if you are forced to use an SVM, I would say that would be rather difficult. For starters, the main idea with an SVM is to use it for classification, and the number of scenarios does not really matter; the input is seldom discrete, though, so there are usually infinitely many scenarios. A normally implemented SVM would only give you a classification: unless you have 100 classes, one for 1%, another for 2%, and so on, it wouldn't really solve the problem of predicting how likely a failure is.
The conclusion is that this could work, but it would not be considered "best practice". You can imagine your 30-dimensional vector space divided into 100 small subspaces; each data point, a 30x1 vector, is a point in that space, and the predicted probability is decided by which of the 100 subsets it falls in. However, having 100 classes and noisy or insufficient data will lead to very poorly performing models.
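If you do go the plain-classification route (failure vs. no failure on 30-reading windows), a minimal sketch with scikit-learn could look like the following; the synthetic data and the 80°C labelling rule are only stand-ins for real sensor logs:

    # Sketch only: an SVM classifying 30-reading temperature windows as
    # failure / no-failure. The synthetic data stands in for real sensor logs.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    base = rng.uniform(40, 95, size=(1000, 1))       # each window hovers around a base temperature
    X = base + rng.normal(0, 3.0, size=(1000, 30))   # 30 noisy readings per window
    y = (X.max(axis=1) >= 80).astype(int)            # 1 = failure (some reading reached 80°C)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = SVC(kernel="rbf", probability=True).fit(X_train, y_train)

    print("accuracy:", clf.score(X_test, y_test))
    print("P(failure) for one window:", clf.predict_proba(X_test[:1])[0, 1])

Note that because these labels are generated by the 80°C rule itself, the model can only re-learn that rule, which is exactly the concern in the question; labels taken from real failure events rather than the threshold are what would make the model more than a glorified switch statement.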
Cheers :)
I'm trying to use machine learning in my job, but I can't find a way to adapt it to what I need, and I don't know if this is already a known problem or if I'm working on something that doesn't have a known solution yet.
Let's say that I have a lot of independent variables, encoded as one-hot, and a dependent variable with only two states: True (the result had an error) and False (the result was successful).
My independent variables are the parameters I use for a query to an API, and the dependent variable is the result the API returned.
My objective is to detect a pattern, within a dataset covering a timeframe of a few hours, of which parameter combinations are failing, so I can avoid querying the API when I'm fairly certain it would fail.
(I'm working with millions of queries per day, and this mechanism is critical for a good user experience)
I'll try to give an example so you can understand what I need.
Suppose that I have a delivery company with 3 trucks and 3 different routes I could take.
So, my dummy variables would be T1, T2, T3, R1, R2 and R3 (I could drop T3 and R3, since they are implied by the omission of the other two).
Then, I have a big dataset recording the times the delivery was delayed. So: Delayed=1 or Delayed=0.
With this, I would have a set like this:
T1 | T2 | T3 | R1 | R2 | R3 || Delayed
--------------------------------------
 1 |  0 |  0 |  1 |  0 |  0 ||    0
 1 |  0 |  0 |  0 |  1 |  0 ||    1
 0 |  1 |  0 |  1 |  0 |  0 ||    0
 1 |  0 |  0 |  0 |  1 |  0 ||    1
 1 |  0 |  0 |  1 |  0 |  0 ||    0
Not only do I want to say "in most cases, truck 1 arrives late, it could have a problem, I shouldn't send it anymore" (that is a valid result too), but I also want to detect things like: "in most cases, truck 1 arrives late when it takes route 1, so this truck probably has a problem on this specific route".
This dataset is just an example; the real one is huge, with thousands of independent variables, so there could be more than one problem in the same dataset.
Example 1: truck 1 has problems on route 1, and truck 3 has problems on route 1.
Example 2: truck 1 has problems on route 1, and truck 3 has problems on any route.
So, I would make a blacklist like:
Example 1: Block if (truck=1 AND route=1) OR (truck=3 AND route=1)
Example 2: Block if (truck=1 AND route=1) OR truck=3
I'm actually doing this without machine learning, with some ugly code that takes a massive Cartesian product of the independent columns and counts the number of "delayed" rows for each combination. Then I choose the combination with the worst delayed/total proportion, blacklist it, and iterate again with new values.
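Roughly, the counting step of my current code looks like this (a simplified sketch with made-up column names, working on the original categorical columns rather than the one-hot ones):

    # Simplified sketch of what my code does today, on made-up column names.
    import pandas as pd

    df = pd.DataFrame({
        "truck":   [1, 1, 2, 1, 1, 3, 3],
        "route":   [1, 2, 1, 2, 1, 1, 2],
        "delayed": [0, 1, 0, 1, 0, 1, 1],
    })

    stats = (df.groupby(["truck", "route"])["delayed"]
               .agg(total="count", n_delayed="sum"))
    stats["ratio"] = stats["n_delayed"] / stats["total"]

    # Combinations with the worst delay ratio are the blacklist candidates.
    print(stats.sort_values("ratio", ascending=False))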
These errors are usually temporary, so I would feed in a new dataset every few hours; I don't need a lifetime analysis, as long as the algorithm accounts for these temporary issues.
Does anyone have a clue about what I could use, or where I could read more about this?
Don't hesitate to ask for more info if you need it.
Thanks in advance!
Regards
You should check out the scikit-learn package for machine learning classifiers (Random Forest is an industry standard). For this problem, you could feed a portion of the data (training set, say 80% of the data) to the model and it would learn how to predict the outcome variable (delayed/not delayed).
You can then test the accuracy of your model by 'testing' on the remaining 20% of your data (the test set), to see if your model is any good at predicting the correct outcome. This will give you a % accuracy. Higher is better generally, unless you have severely imbalanced classes, in which case your classifier will just always predict the more common class for easy high accuracy.
Finally, if the accuracy is satisfactory, you can find out which predictor variables your model considered most important to achieve that level of prediction, i.e. Variable Importance. I think this is what you're after. So running this every few hours would tell you exactly which features (columns) in your set are best at predicting if a truck is late.
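A minimal sketch of that workflow, assuming your one-hot columns and the Delayed flag live in a CSV export (all names here are placeholders):

    # Sketch: train/test split, accuracy check, and feature importances.
    # Assumes a DataFrame with one-hot columns (T1..R3, etc.) and a Delayed
    # column; the file name and column names are placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("deliveries_last_few_hours.csv")    # hypothetical export
    X, y = df.drop(columns="Delayed"), df["Delayed"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))

    # Which columns drive the prediction of Delayed?
    importances = pd.Series(model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False).head(10))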
Obviously, this is all easier said than done and often you will have to perform significant cleaning of your data, sometimes normalisation (not in the case of random forests though), sometimes weighting your classifications, sometimes engineering new features... there is a reason this is a dedicated profession.
Essentially what you're asking is "how do I do Data Science?". Hopefully this will get you started, the rest (i.e. learning) is on you.
I'm new to and investigating machine learning. I have a use case and data, but I am unsure of a few things, mainly how my model will run and what model to start with. Details of the use case and my questions are below. Any advice is appreciated.
My Main question is:
When basing a result on scores that are accumulated over time, is it possible to design a model to run on a continuous basis so it gives a best guess at all times, be it run on day one or 3 months into the semester?
What model should I start with? I was thinking a classifier, but ranking might be interesting also.
Use Case Details
Apprentices take a semesterized course, 4 semesters long, each 6 months in duration. Over the course of a semester, apprentices perform various operations and processes and are scored on how well they do. After each semester, the apprentices either have a sufficient score to move on to the next semester, or they fail.
We are investigating building a model that will help identify apprentices who are in danger of failing, with enough time for them to receive help.
Each procedure is assigned a complexity code of simple, intermediate or advanced, and scores are weighted by complexity.
Regarding features, we have the following:
Initial interview scores
Entry Exam Scores
Total number of simple procedures each apprentice performed
Total number of intermediate procedures each apprentice performed
Total number of advanced procedures each apprentice performed
Average score for each complexity level
Demographic information (nationality, age, gender)
What I am unsure of is how the model will work and when we should run it, i.e. if we run it on day one of the semester, I assume everyone will be predicted to fail, as everyone has procedure scores of 0.
The current plan is to run the model 2-3 months into each semester, so there is enough score data and also enough time to help any apprentices who are in danger of failing.
This definitely looks like a classification model problem:
y = f(x[0],x[1], ..., x[N-1])
where y (boolean output) = {pass, fail} and x[i] are different features.
There is a plethora of ML classification models, like Naive Bayes, neural networks, decision trees, etc., which can be used depending upon the type of data. If you are looking for an answer that suggests a particular ML model, I would need more information about your data. However, in general, a model-selection flow-chart can be helpful here. You can also read about model selection in the 5th lecture of Andrew Ng's CS229.
Now, coming back to the basic methodology: some of these features, like the initial interview scores and entry exam scores, you already know in advance, whereas others, like the performance in procedures, only become known over the semester.
So, there is no harm in saying that the model will always predict better towards the end of each semester.
However, I can make a few suggestions to make it even better:
Instead of taking the initial procedure scores as 0, take them as the mean/median of the apprentice's past performance in other procedures.
You can even build a sub-model to analyze the relation between procedure scores and interview scores, as they are not completely independent (I will explain this later in the answer).
However, if this is the apprentice's very first semester, then you won't have such data for that apprentice yet. In that case, you might need to consider the average performance of other apprentices with profiles similar to the subject apprentice's. If the dataset is not very large, a K-Nearest-Neighbors approach can be quite useful here (see the sketch at the end of this answer); for large datasets, however, KNN suffers from the curse of dimensionality.
Also, plot a graph between y and different variables x[i], so as to see the independent variation of y with respect to each variable.
Most probably (although it's just a hypothesis), y will depend more on the initial variables than on the variables obtained later, the reason being that the later variables are not completely independent of the former ones.
My point is, if a model can be created to predict the outcome of a semester, then a similar model can be created to predict just the outcome of the first procedure test.
In the end, as the model might be heavily based on demographic factors and other such things, it might not be a very successful model. For the same reason, we cannot accurately predict election results, soccer match results, etc., as they are heavily dependent upon real-time, dynamic data.
For dynamic predictions based on the different procedure performances, time series analysis can be a bit helpful. But in any case, the final result will depend heavily on the apprentice's continued motivation and performance, which only becomes clear towards the end of each semester.
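To illustrate the KNN suggestion above, here is a minimal sketch using scikit-learn's KNNImputer; the feature names and values are invented for illustration:

    # Sketch: fill an apprentice's missing early-semester procedure scores from
    # apprentices with similar profiles, using scikit-learn's KNNImputer.
    # Feature names and values are invented for illustration.
    import numpy as np
    from sklearn.impute import KNNImputer

    # columns: interview, entry_exam, avg_simple, avg_intermediate, avg_advanced
    records = np.array([
        [72, 80, 8.1, 7.4, 6.9],
        [65, 70, 7.2, 6.8, 6.0],
        [90, 88, 9.0, 8.6, 8.1],
        [70, 78, np.nan, np.nan, np.nan],   # new apprentice, no procedure scores yet
    ])

    imputer = KNNImputer(n_neighbors=2)
    filled = imputer.fit_transform(records)
    print(filled[-1])   # procedure scores borrowed from the 2 most similar apprentices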
I'm building a machine learning model where some columns are physical addresses (which I can translate into X/Y coordinates), but I'm a little bit confused about how this will be handled by the ML algorithm.
Is there a particular way to translate a geo location into columns for use in ML (classification and/or regression)?
Thanks in advance !
The choice of features would, in general, depend on what kind of relationship you anticipate between the features and the target variable. You are right in saying that post code number itself does not bear any relation to the target. Here the postcode is simply a string, or a category. What kind of model are you planning to use? Linear regression and Decision tree are two examples. These models capture relationships in different ways. As an example for a feature, you could compute the straight line distance between the source and destination, and use that in the model, since intuitively, the farther they are, the higher the transit time is likely to be. What else does the transit time depend on? See if you can relate the factors influencing the travel time to the information that you have, i.e., the postcodes / XY co-ordinates, in some way.
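For instance, here is a small sketch of turning a source/destination coordinate pair into a single distance feature with the haversine (great-circle) formula; the example coordinates are arbitrary:

    # Sketch: derive a single "distance" feature from source/destination
    # coordinates with the haversine (great-circle) formula.
    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points, in kilometres."""
        r = 6371.0  # mean Earth radius in km
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    # One row's source and destination become one numeric feature:
    print(haversine_km(51.5074, -0.1278, 48.8566, 2.3522))  # London -> Paris, ~344 km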
This summarizes the answer we ended up with in the comments of the question:
This transformation from ZIP codes to geo-coordinates should not be seen as a "split" but only as a way to represent your data in a multidimensional way (in this case the dimension will be 2).
Machine learning algorithms exist for both unidimensional and multidimensional data. The two dimensions can be correlated or uncorrelated, depending on how you define the parameters of the model you choose afterwards.
Moreover, the correlation does not have to be set explicitly in most cases. Only an initial value may be useful, and many algorithms rely on random initialization or other simple methods that estimate it from a subset of your data. So, for clarity's sake, if you model your data with a Gaussian, for example, then when estimating the parameters of that Gaussian, the covariance matrix will have non-zero off-diagonal terms, which represent the correlation in the data. You just need to avoid assuming that the two dimensions are uncorrelated!
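As a small illustration (with synthetic coordinates standing in for geocoded ZIP codes), here is a sketch of fitting a 2-D Gaussian with numpy; the off-diagonal terms of the estimated covariance pick up whatever correlation exists between the two dimensions, without it ever being set explicitly:

    # Sketch: estimate a 2-D Gaussian from (x, y) coordinates; the off-diagonal
    # covariance terms capture the correlation between the two dimensions.
    # The coordinates below are synthetic stand-ins for geocoded ZIP codes.
    import numpy as np

    rng = np.random.default_rng(0)
    true_cov = np.array([[1.0, 0.6],
                         [0.6, 2.0]])          # the two dimensions are correlated
    coords = rng.multivariate_normal(mean=[40.0, -3.7], cov=true_cov, size=500)

    mu = coords.mean(axis=0)
    sigma = np.cov(coords, rowvar=False)       # full 2x2 covariance estimate

    print("estimated mean:", mu)
    print("estimated covariance:\n", sigma)    # off-diagonal ~0.6, never forced to 0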
I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree, so I'd like to find out how to combine the scores into an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing, where a pool of candidates is interviewed by a few people who might have differing opinions but in general each tends to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML problems, and more specifically classification problems, need training data from which your network can learn any existing patterns and use them to assign a particular class to an input vector.
If you really want to use classification, then I think your problem fits into the category of one-vs-all classification. You will need a network (or just a single output layer) with a number of cells/sigmoid units equal to your number of candidates (each cell representing one candidate). Note that here your number of candidates has to be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding, i.e. 00100 if candidate no. 3 was the actual correct candidate, and 00000 if there is no correct candidate.
For this to work, you will need a big dataset containing your candidate vectors and the corresponding actual correct candidate. To label this data you will either need a function (again, like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data the same way you did. This way it will maximize the number of correct outputs, but the definition of 'correct' here will be however you labelled the training data.
You can also use a different type of output, where each cell of the output layer corresponds to one of your scoring functions, and 00001 means that the candidate your 5th scoring function selected was the right one. This way your number of candidates will not have to be fixed. But again, you will have to set the outputs of the training data manually for your network to learn from.
One-vs-all is a classification technique where there are multiple cells in the output layer and each performs binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest are assigned 0.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
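A minimal sketch of the fixed-candidate version with scikit-learn (the data below is random filler, just to show the shapes; in practice you would use your labelled trials):

    # Sketch of the fixed-candidate, one-vs-all idea: concatenate the score
    # vectors of all candidates into one input and predict which slot (if any)
    # holds the correct candidate. The data here is random filler.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    n_trials, n_candidates, n_scorers = 500, 5, 3
    rng = np.random.default_rng(0)

    # Each trial: 5 candidates x 3 scoring functions, flattened to 15 inputs.
    X = rng.random((n_trials, n_candidates * n_scorers))
    # Label: index 0..4 of the correct candidate, or 5 meaning "none".
    y = rng.integers(0, n_candidates + 1, size=n_trials)

    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

    probs = clf.predict_proba(X[:1])[0]        # one probability per slot, plus "none"
    print("predicted slot:", probs.argmax(), "probabilities:", probs.round(2))

One practical detail: a standard one-vs-all or softmax classifier cannot output an all-zero vector, so the "no correct candidate" case is handled here by adding an explicit extra "none" class instead of the 00000 output.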
I hope my answer was able to help you.:)