I want to predict machine failure.
My data frame contains two columns: the first is a date and the second is a logical label (0 for working, 1 for failure). For example:
Date label
12/5/2015 0
13/5/2015 1
14/5/2015 0
15/5/2015 0
Based on the data frame above, I want to predict the failure date one day in advance. Please let me know which model I should use to predict the failure date.
Unfortunately, you will not be able to make a meaningful prediction from this type of data alone. Consider moving to a richer data set with additional features such as 'operational hours', 'days since last service', 'average operating temperature', etc.
Normally, for these types of predictions you would want to have several features that you believe are relevant to what you want to predict.
Related
This is actually part of my thesis research, where I have to run a time series analysis on pollution and economic growth of a single country.
I have data spanning 144 years for the two variables, with each value representing a single year. I imported the dataset, set the values as numeric, attached it through the console and ran:
ts_gdp <- ts(data = `GDP per capita`, start = 1871, end = 2014, frequency = 1, names = "gdp")
I can see all the values for the first variable, but when I follow up with stl() I get the error below. Any clues why this shows up, although I have set frequency=1, which is the number of observations per unit of time, in this case a year? Thank you in advance!
Error in stl(GDP, s.window = "periodic") :
series is not periodic or has less than two periods
We have a machine that sends its status value (0, 1 or 2) at irregular intervals, and we're storing it in Graphite. The status values mean:
0 - stopped
1 - working
2 - stopped by anomaly
The requested KPIs are the classical ones: how much time was spent in status 0, 1 or 2 in a day or a week? Before reinventing the wheel, we're looking for the best way to compute those KPIs, and asking whether Graphite (or possibly another time-series solution) already has functions that sum the time during which the data point value meets a condition. Note that the time intervals to sum are not stored; each interval is the time elapsed between one data point and the next.
Or should the data be pre-processed to compute the time intervals, and then stored as three data sets like status.working, status.stopped and status.alarm, each recording when the specific "event" started and how long it lasted?
There are other KPIs, for example the number of alarms in a day. Receiving two status data points in a row both indicating status "2" is actually a single alarm condition and must count as 1.
So, is there a best way to store such data without pre-processing it? It sounds like a common pattern but (shame on us?) we have not found this topic well explored.
Thanks.
Graphite has a number of functions that could help you here. One that stands out is summarize(), to which you can pass an aggregation method (in this case sum) and an interval (minutes, hours, days, weeks, etc.); see the Graphite functions documentation.
isNonNull is another useful function: it can be used to determine the existence of a datapoint regardless of the value.
When you say that the machine reports a value 0 to indicate it has stopped, does it actually send that value or does it report nothing? This is an important detail and will have some bearing on the end result of your solution.
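As a rough illustration of the pre-processing route described in the question (this is plain Python, not a Graphite function), the sketch below sums the time spent in each status as the time elapsed between one data point and the next, and counts consecutive "2" points as a single alarm. The names status_durations and count_alarms, and the (timestamp, status) tuple layout, are assumptions made for this example only.

```python
from collections import defaultdict


def status_durations(points, end_time):
    """Sum the seconds spent in each status.

    points: list of (timestamp_seconds, status), sorted by timestamp.
    end_time: end of the reporting window; the last status is assumed
              to hold until then (an assumption of this sketch).
    """
    totals = defaultdict(float)
    # Each status lasts from its own timestamp until the next data point.
    for (t, status), (t_next, _) in zip(points, points[1:]):
        totals[status] += t_next - t
    if points:
        last_t, last_status = points[-1]
        totals[last_status] += end_time - last_t
    return dict(totals)


def count_alarms(points, alarm_status=2):
    """Count alarm episodes; repeated alarm points in a row count once."""
    count = 0
    previous = None
    for _, status in points:
        if status == alarm_status and previous != alarm_status:
            count += 1
        previous = status
    return count


# Hypothetical sample: timestamps in seconds, irregular spacing.
sample = [(0, 1), (60, 2), (90, 2), (120, 0), (300, 1)]
durations = status_durations(sample, end_time=360)
alarms = count_alarms(sample)
```

Whether the last status should be carried forward to the end of the window depends on the answer to the question above, i.e. whether a stopped machine keeps sending 0 or sends nothing.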
I am given a financial time series that is characterized by a bunch of structural breaks, i.e. the series isn't moving (literally at all), but at some points in time the series jumps up or down. Then it stays at this level for a while until the series jumps again. So the time series basically looks like a step function.
My assumption is that these breaks come from some particular exogenous variables that are in the form of dummies. So if a particular exogenous variable takes on the value 1, (I assume) it is very likely that the series jumps.
My question is how I could model this particular time series (in a uni- or multivariate sense). I guess that standard AR(MA) models are inappropriate. I was thinking about creating two binary variables that take on the value 1 if there's an upward (downward) break and 0 otherwise. Then I would run a dynamic probit model to test the probabilities that the exogenous variables trigger a break. What do you think about this idea? Or would you have other suggestions? Please note that I don't want to test for structural breaks but rather to formulate a time series model.
Did you try ARIMAX, TAR, or STAR models?
You said that you have time series data and you think the series is influenced by some exogenous shocks. I think you need to include exogenous variables in your time series analysis, and that's where ARIMAX comes in. This model allows you to include exogenous variables in an ARIMA model.
You also said that there are structural breaks. Try Threshold AutoRegressive (TAR) or Smooth Transition AutoRegressive (STAR) models. I hope these names help you find more material about those models.
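To make the exogenous-dummy idea concrete, here is a minimal sketch of the simplest special case only: no AR or MA terms, the series differenced once, and the first differences regressed (without intercept) on a single 0/1 dummy. It is not a full ARIMAX fit. The function name jump_size_per_dummy and the sample numbers are made up for illustration.

```python
def jump_size_per_dummy(series, dummy):
    """Least-squares slope (no intercept) of first differences on a 0/1 dummy.

    series: observed values y_0 .. y_T of the step-like series.
    dummy:  0/1 values d_0 .. d_T; d_t = 1 is assumed to explain y_t - y_{t-1}.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    d = dummy[1:]  # align the dummy with the differences
    denom = sum(x * x for x in d)
    if denom == 0:
        raise ValueError("dummy is never 1 after the first observation")
    return sum(x * dy for x, dy in zip(d, diffs)) / denom


# Hypothetical step-like series that jumps whenever the dummy is 1.
series = [10, 10, 10, 12, 12, 12, 12, 14, 14]
dummy = [0, 0, 0, 1, 0, 0, 0, 1, 0]
estimated_jump = jump_size_per_dummy(series, dummy)
```

With several dummies, or with AR and MA terms, you would use a proper ARIMAX routine from a statistics package instead of this hand-rolled estimate.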
Suppose there is a statistical data set with a number of input columns and one output column. The predictors characterize a particular process that is repeated, so each data row corresponds to one occurrence of that process. For these process characteristics, the order and duration are important. Some of them might be absent altogether; some are repeated, but with a different speed or other parameter.
Let's say that our process is named P and it can have many child parts that together form the process. Let's say the process once had N sub processes:
Sub process 1, with: speed = SpdA, duration = DurA, depth = DepA
Right after sub process 1, the next sub process happened:
Sub process 2, with: speed = SpdB, duration = DurB, depth = DepB
...
Sub process N.
So there might be from 1 to N child processes in each process, that is, in each data row, and the number of child processes may vary from one row to another. That covers the input data.
As for the output - the output here in the simplest case is binary - either success or failure, but in reality it will be a positive number starting from 0 to positive infinity. This number represents the time by which the process has finished successfully. If the value for the output is a positive infinity - it means that the process failed to succeed.
A very important note: in the simplest case, where the output is binary, most rows in the data set will have failure as the output. The goal is to find the hypothetical values that the predictors should take to make the process succeed.
For example, after learning we should be able to tell which concrete, universal input parameters will most likely lead to process success. That was the simplest, binary-output case.
However, in real life the output represents the time by which the process finished successfully, or +infinity on failure. The goal is the same: make the process succeed, or come as close to success as possible. We want to generate test inputs that we might use in the future to prevent an output of +infinity.
The most ambitious goal is, given a target time, to find the exact input values that will make the process finish successfully as close to the given time as possible. Here we should expect the list of child processes, their order and the values for each child process to be predicted.
Here in this problem, I guess, the output will play the role of the input and the input will play the role of the output.
What is the approach to solving these problems? How should I handle the variable number of characteristics, and the order, which might vary in each data row?
I am a novice in machine learning and would appreciate concrete suggestions or examples of similar problems solved.
Any help and advice welcome!
I'd like to write a spam filter program with SVM, and I chose libsvm as the tool.
I got 1000 good mails and 1000 spam mails, then split them into:
700 good_train mails 700 spam_train mails
300 good_test mails 300 spam_test mails
Then I wrote a program to count how many times each word occurs in each file, and got results like:
good_train_1.txt:
today 3
hello 7
help 5
...
I learned that libsvm needs format like:
1 1:3 2:1 3:0
2 1:3 2:3 3:1
1 1:7 3:9
as its input. I know that 1, 2, 1 is the label, but what does 1:3 mean?
How could I transfer what I've got to this format?
Likely, the format is
classLabel attribute1:count1 ... attributeN:countN
N is the total number of different words in your text corpus. You will have to check the documentation for the tool you are using (or its sources) to see whether you can use a sparser format that omits the attributes with count 0.
How could I transfer what I've got to this format?
Here's how I would do this. I would use the script you've got to compute the word counts for each mail in the training set. Then, use another script to convert that data into the LIBSVM format that you've shown earlier. (This can be done in a variety of ways, but it should be reasonable to write in a language with easy input/output, like Python.) I would batch all "good-mail" data into one file, and label that class as "1". Then, I would do the same with the "spam-mail" data and label that class "-1". As nologin said, LIBSVM requires the class label to precede the features, but the feature indices themselves can be any numbers as long as they are in ascending order, e.g. 2:5 3:6 5:9 is allowed, but not 3:23 1:3 7:343.
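The conversion step could look roughly like the Python sketch below. It assumes the word counts are already loaded into dictionaries, that each distinct word gets a fixed integer index, and that zero counts may simply be left out (as the line "1 1:7 3:9" in the question suggests). The helper names build_vocabulary and to_libsvm_line are invented for this example and are not part of LIBSVM.

```python
def build_vocabulary(count_dicts):
    """Map every distinct word to a 1-based feature index."""
    words = sorted({word for counts in count_dicts for word in counts})
    return {word: index for index, word in enumerate(words, start=1)}


def to_libsvm_line(label, counts, vocabulary):
    """Format one mail as 'label index:count ...' with ascending indices."""
    pairs = sorted(
        (vocabulary[word], count)
        for word, count in counts.items()
        if word in vocabulary and count != 0  # skip unknown words and zeros
    )
    features = " ".join(f"{index}:{count}" for index, count in pairs)
    return f"{label} {features}".rstrip()


# Hypothetical word counts for one good mail and one spam mail.
good_mail = {"today": 3, "hello": 7, "help": 5}
spam_mail = {"hello": 1, "free": 9}

vocabulary = build_vocabulary([good_mail, spam_mail])
good_line = to_libsvm_line(1, good_mail, vocabulary)
spam_line = to_libsvm_line(-1, spam_mail, vocabulary)
```

Build the vocabulary from the training mails only and reuse it for the testing set, so that the same word always maps to the same index.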
If you're concerned that your data is not in the correct format, use their script
checkdata.py
before training and it should report any possible errors.
Once you have two separate files with data in the correct format, you can call
cat file_good file_spam > file_training
and generate a training file that contains data on both good and spam mail. Then, do the same with the testing set. One advantage of forming the data this way is that you know the first 700 (or 300) mails in the training (or testing) set are good mail and the rest are spam. This makes it easier to write other scripts that act on the data, such as a precision/recall script.
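A precision/recall script of the kind mentioned above might look like this minimal sketch. It assumes the labels used earlier (1 for good mail, -1 for spam), treats spam as the positive class, and expects the true and predicted labels as two plain lists in the same order; the function name precision_recall is just an illustrative choice.

```python
def precision_recall(true_labels, predicted_labels, positive=-1):
    """Return (precision, recall) for the class given by `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when a class never appears.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical test set: good mails first, then spam, as in the layout above.
true_labels = [1, 1, 1, -1, -1, -1]
predicted_labels = [1, 1, -1, -1, -1, 1]
precision, recall = precision_recall(true_labels, predicted_labels)
```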
If you have other questions, the FAQ at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html should be able to answer a few, as well as the various README files that come with installation. (I personally found the READMEs in the "Tools" and "Python" directories to be a great boon.) Sadly, the FAQ does not touch much on what nologin said, about data being in a sparse format.
On a final note, I doubt that you need to keep counts of every possible word that could appear in mail. I would recommend counting only the most common words you would suspect to appear in spam mail. Other potential features include total word count, average word length, average sentence length, and other possible data that you feel may be helpful.
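For the extra features mentioned here, a rough sketch could be the following. The tokenization rules (letters and apostrophes form a word; ".", "!" and "?" end a sentence) are simplifying assumptions, and the function name summary_features is made up for this example.

```python
import re


def summary_features(text):
    """Compute total word count, average word length and average sentence length."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total_words = len(words)
    avg_word_length = sum(len(w) for w in words) / total_words if total_words else 0.0
    avg_sentence_length = total_words / len(sentences) if sentences else 0.0
    return {
        "total_words": total_words,
        "avg_word_length": avg_word_length,
        "avg_sentence_length": avg_sentence_length,  # measured in words
    }


# Hypothetical mail body.
features = summary_features("Buy now. Free money today!")
```

Each of these values can be appended to a mail's feature list under its own index, after the word-count features.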