Forecasting.ForecastBySsa with Multiple variables as input - machine-learning

I have this code to forecast a time series. I want the forecast to be based on a time series of prices plus a correlated indicator.
So along with the value to forecast, I want to pass a side value, but I cannot tell whether it is taken into account, because the prediction does not change with or without it. How do I tell the algorithm to consider these extra parameters?
public static TimeSeriesForecast PerformTimeSeriesProductForecasting(List<TimeSeriesData> listToForecast)
{
var mlContext = new MLContext(seed: 1); //Seed set to any number so you have a deterministic environment
var productModelPath = $"product_month_timeSeriesSSA.zip";
if (File.Exists(productModelPath))
{
File.Delete(productModelPath);
}
IDataView productDataView = mlContext.Data.LoadFromEnumerable<TimeSeriesData>(listToForecast);
var singleProductDataSeries = mlContext.Data.CreateEnumerable<TimeSeriesData>(productDataView, false).OrderBy(p => p.Date);
TimeSeriesData lastMonthProductData = singleProductDataSeries.Last();
const int numSeriesDataPoints = 2500; // Number of data points in the series; used below for both seriesLength and trainSize
// Create and add the forecast estimator to the pipeline.
IEstimator<ITransformer> forecastEstimator = mlContext.Forecasting.ForecastBySsa(
outputColumnName: nameof(TimeSeriesForecast.NextClose),
inputColumnName: nameof(TimeSeriesData.Close), // This is the column being forecasted.
windowSize: 22, // Length of the window over the series used to build the SSA trajectory matrix.
seriesLength: numSeriesDataPoints, // This parameter specifies the number of data points that are used when performing a forecast.
trainSize: numSeriesDataPoints, // This parameter specifies the total number of data points in the input time series, starting from the beginning.
horizon: 5, // Number of values to forecast; 5 means the next 5 values of the series are forecast.
confidenceLevel: 0.98f, // Indicates the likelihood the real observed value will fall within the specified interval bounds.
confidenceLowerBoundColumn: nameof(TimeSeriesForecast.ConfidenceLowerBound), //This is the name of the column that will be used to store the lower interval bound for each forecasted value.
confidenceUpperBoundColumn: nameof(TimeSeriesForecast.ConfidenceUpperBound)); //This is the name of the column that will be used to store the upper interval bound for each forecasted value.
// Fit the forecasting model to the specified product's data series.
ITransformer forecastTransformer = forecastEstimator.Fit(productDataView);
// Create the forecast engine used for creating predictions.
TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine = forecastTransformer.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);
// Save the forecasting model so that it can be loaded within an end-user app.
forecastEngine.CheckPoint(mlContext, productModelPath);
ITransformer forecaster;
using (var file = File.OpenRead(productModelPath))
{
forecaster = mlContext.Model.Load(file, out DataViewSchema schema);
}
// We must create a new prediction engine from the persisted model.
TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine2 = forecaster.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);
// Get the prediction; it contains the next 5 forecasted values, since that is the `horizon` specified when the forecast estimator was created.
TimeSeriesForecast prediction = forecastEngine2.Predict();
return prediction;
}
TimeSeriesData has multiple attributes, not only the value of the series that I want to forecast. I wonder whether they are taken into account when forecasting or not.
Is there a better method to forecast this type of series, such as LSTM? Is such a method available in ML.NET?

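ForecastBySsa is univariate: it models only the column named in inputColumnName, so the other attributes of TimeSeriesData are ignored. There is an enhancement ticket requesting multivariate time series forecasting in ML.NET.
See ticket: github.com/dotnet/machinelearning/issues/5638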

Related

how to select SpatRaster layers from their names?

I've got a SpatRaster of (150 x 150 x 1377) that shows the temporal evolution of precipitation. Each layer is a given hour in a 2-month interval, but some hours are missing, so the dataset isn't continuous. The layer names are strings of the form "YYYYMMDDhhmm".
I need the mean value for every three-hour interval, both for complete intervals and for intervals with missing data. For complete intervals I want to average the three layers; for intervals with missing data I would like to average the two available layers or, if two are missing, to use the single remaining layer as the average.
How can I use the layer names to decide which case applies?
I've already tried the code below, but it averages three consecutive layers by index, not by hour. How can I convert the names to date-time values (for example with the tidyverse) so that I can use rollapply() to check whether the layer two steps back has the date-time I expect? Is there any other way to check this?
HSAF <- rast(c(
  paste0(resfolder, "HSAF_final1_5.tif"), paste0(resfolder, "HSAF_final6_10.tif"), paste0(resfolder, "HSAF_final11_15.tif"),
  paste0(resfolder, "HSAF_final16_20.tif"), paste0(resfolder, "HSAF_final21_25.tif"), paste0(resfolder, "HSAF_final26_30.tif"),
  paste0(resfolder, "HSAF_final31_N04.tif"), paste0(resfolder, "HSAF_finalN05_N08.tif"), paste0(resfolder, "HSAF_finalN09_N13.tif"),
  paste0(resfolder, "HSAF_finalN14_N18.tif"), paste0(resfolder, "HSAF_finalN19_N23.tif"), paste0(resfolder, "HSAF_finalN24_N28.tif"),
  paste0(resfolder, "HSAF_finalN29_N30.tif")
))
index <- names(HSAF)
j <- 2
# First group of three layers initialises the output raster
for (i in seq(1, 3, by = 3)) {
  third_el <- HSAF[index[i + j]]
  second_el <- HSAF[index[i + j - 1]]
  first_el <- HSAF[index[i + j - 2]]
  newraster <- c(first_el, second_el, third_el)
  newraster <- mean(newraster, filename = paste0(tempfile(), ".tif"))
  names(newraster) <- paste0(index[i + j - 2], index[i + j - 1], index[i + j])
}
# Remaining groups of three consecutive layers are averaged and appended
for (i in seq(4, 1374, by = 3)) {
  third_el <- HSAF[index[i + j]]
  second_el <- HSAF[index[i + j - 1]]
  first_el <- HSAF[index[i + j - 2]]
  subraster <- c(first_el, second_el, third_el)
  subraster <- mean(subraster, filename = paste0(tempfile(), ".tif"))
  names(subraster) <- paste0(index[i + j - 2], index[i + j - 1], index[i + j])
  add(newraster) <- subraster
}
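One way to drive the selection from the layer names rather than from positions is to parse each name as a timestamp and assign it to a three-hour window. The sketch below shows only that grouping logic, written in Python rather than R. The function name `three_hour_groups` and the sample names are made up for illustration, and it assumes windows aligned to 00:00, 03:00, 06:00 and so on, which may not match the windows intended in the question.

```python
from collections import defaultdict
from datetime import datetime


def three_hour_groups(layer_names):
    """Map each 3-hour window start to the positions of the layers inside it."""
    groups = defaultdict(list)
    for position, name in enumerate(layer_names):
        stamp = datetime.strptime(name, "%Y%m%d%H%M")
        # Round the hour down to a multiple of 3 to get the window start
        window_start = stamp.replace(hour=stamp.hour - stamp.hour % 3, minute=0)
        groups[window_start].append(position)
    return dict(sorted(groups.items()))


# Illustrative names only: hours 04 and 06 are missing on purpose
names = ["202001010000", "202001010100", "202001010200",
         "202001010300", "202001010500", "202001010700"]
groups = three_hour_groups(names)
for window_start, positions in groups.items():
    print(window_start, positions)
```

Each group lists the layers to average, so a window with missing hours simply averages fewer layers. In terra, a per-layer grouping index of this kind should be usable with tapp() and mean.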

looping program for MLP Keras prediction

I am a beginner experimenting with Keras on a time series application: I created a regression model and saved it so that it can be run from a different Python script.
The data is hourly, and I use the saved Keras model to predict a value for each hour in the data set (the data is a CSV file read into pandas). A year's worth of data means 8760 predictions (the hours in a year), and at the end I sum the predicted values.
The code below does not show how the model architecture gets recreated (a Keras requirement for a saved model). The code works, it is just extremely slow. It seems fine for under 200 predictions, but for 8760 it bogs down far too much to ever finish.
I don't have any experience with databases, but would one be a better option than storing 8760 Keras predictions in a Python list? Thanks for any tips, I am still riding the learning curve.
# set initial loop params & empty list to store modeled data
row_num = 0
total_estKwh = []
for i, row in data.iterrows():
    params = row.values
    if (params.ndim == 1):
        params = np.array([params])
    estimatedKwh = load_trained_model(weights_path).predict(params)
    print('Analyzing row number:', row_num)
    total_estKwh.append(estimatedKwh)
    row_num += 1
df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
It seems you are making your life very difficult for no obvious reason...
For starters, you don't need to load your model for every row; that is overkill! You should definitely move load_trained_model(weights_path) out of the for loop, with something like
model = load_trained_model(weights_path) # load ONCE
and replace the respective line in the loop with
estimatedKwh = model.predict(params)
Second, calling the model row by row is also inefficient; it is preferable to first collect your params into a single array and then feed that to the model to get batch predictions. Drop the print statement, too.
All in all, try this:
params_array = []
for i, row in data.iterrows():
    # keep each row 1-D so the stacked array has shape (n_rows, n_features)
    params_array.append(row.values)
params_array = np.asarray(params_array, dtype=np.float32)
# load the model ONCE and predict on the whole batch in a single call
total_estKwh = load_trained_model(weights_path).predict(params_array)
df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
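To illustrate why a single batched call is equivalent to predicting row by row, here is a minimal, self-contained sketch. `TinyModel` is a hypothetical stand-in for the trained Keras model (it only mimics a `predict` method that accepts a 2-D array), and the random features stand in for the CSV data, so this shows the pattern and is not the asker's actual model.

```python
import numpy as np


class TinyModel:
    """Hypothetical stand-in for a trained regressor with a Keras-like predict()."""

    def __init__(self, n_features):
        self.weights = np.arange(1, n_features + 1, dtype=np.float32)

    def predict(self, batch):
        batch = np.asarray(batch, dtype=np.float32)
        # One output value per input row: shape (n_rows, 1)
        return batch @ self.weights.reshape(-1, 1)


rng = np.random.default_rng(0)
features = rng.random((8760, 4)).astype(np.float32)  # one row per hour of the year

model = TinyModel(n_features=4)       # created once, outside any loop
batched = model.predict(features)     # a single call for all 8760 rows

# Row-by-row predictions for the first 100 rows, for comparison only
row_by_row = np.vstack([model.predict(row[np.newaxis, :]) for row in features[:100]])

total = float(batched.sum())
```

The batched result matches the row-by-row result, but it is produced by one call instead of 8760, which is where the time goes in the original loop (on top of reloading the model on every iteration).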

Storing Vehicle Id in Anomaly Detection

I tested anomaly detection using Deeplearning4j. Everything works fine, except that I am not able to preserve the VehicleID while training. What is the best approach in such a scenario?
Please look at the following snippet: SparkTransformExecutor returns an RDD, while InMemorySequenceRecordReader takes a list. When I collect the list from the RDD, the ordering is not guaranteed.
val records:JavaRDD[util.List[util.List[Writable]]] = SparkTransformExecutor
.executeToSequence(.....)
val split = records.randomSplit(Array[Double](0.7,0.3))
val testSequences = split(1)
//in memory sequence reader
val testRR = new InMemorySequenceRecordReader(testSequences.collect().toList)
val testIter = new RecordReaderMultiDataSetIterator.Builder(batchSize)
.addSequenceReader("records", testRR)
.addInput("records")
.build()
Typically you track training examples by their index in the dataset. Keep track of which index each vehicle has in the dataset alongside training. There are a number of ways to do that.
In dl4j, we typically keep the data raw and use record readers plus transform processes for the training data. One option is to use a record reader on the raw data (pick one for your dataset; it could be CSV or even video) together with a RecordReaderDataSetIterator, like here:
```java
RecordReader recordReader = new CSVRecordReader(0, ',');
recordReader.initialize(new FileSplit(new ClassPathResource("iris.txt").getFile()));
int labelIndex = 4;
int numClasses = 3;
int batchSize = 150;
RecordReaderDataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
iterator.setCollectMetaData(true); //Instruct the iterator to collect metadata, and store it in the DataSet objects
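DataSet allData = iterator.next();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of the data for training
DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();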
```
(Complete code here):
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/dataexamples/CSVExampleEvaluationMetaData.java
Alongside this you use TransformProcess:
```
//Let's define the schema of the data that we want to import
//The order in which columns are defined here should match the
//order in which they appear in the input data
Schema inputDataSchema = new Schema.Builder()
//We can define a single column
.addColumnString("DateTimeString")
....
.build();
//At each step, we identify column by the name we gave them in the
//input data schema, above
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
//your transforms go here
.build();
```
Complete example below:
https://github.com/deeplearning4j/dl4j-examples/blob/6967b2ec2d51b0d19b5d6437763a2936ca922a0a/datavec-examples/src/main/java/org/datavec/transform/basic/BasicDataVecExampleLocal.java
If you use these pieces, you can customize the processing and keep the data as is, while still having a complete data pipeline. There are many ways to do it; just keep in mind that you start with the vehicle id, and it doesn't have to disappear.
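The idea of tracking examples by index can be sketched independently of dl4j. The snippet below is a plain Python illustration, not the dl4j API: `split_with_ids` is a hypothetical helper, and the ids and sequences are made-up data. It splits a shuffled list of indices instead of the data itself, so each sequence keeps its vehicle id regardless of ordering.

```python
import random


def split_with_ids(vehicle_ids, sequences, test_fraction=0.3, seed=42):
    """Split sequences into train/test while keeping each vehicle id attached."""
    if len(vehicle_ids) != len(sequences):
        raise ValueError("one vehicle id is required per sequence")
    indices = list(range(len(sequences)))
    random.Random(seed).shuffle(indices)  # shuffle positions, not the data
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [(vehicle_ids[i], sequences[i]) for i in idx]

    return pick(train_idx), pick(test_idx)


# Made-up data: ten vehicles, one short sequence each
ids = ["V%d" % n for n in range(1, 11)]
seqs = [[n, n + 1, n + 2] for n in range(10)]
train, test = split_with_ids(ids, seqs)
```

Because the id travels with the sequence as a pair, a prediction made on any test sequence can be reported against the right vehicle.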

Weighted Graph DijkstraShortestPath : getpath() does not return path with least cost

Thanks for the prompt responses, chessofnerd and Joshua. I am sorry for the unclear logs and the unclear question. Let me rephrase it.
Joshua:
I am storing my weights in a DB and retrieving them from the DB in the transformer.
I have 4 devices connected in my topology. Between some devices there are multiple connections, and between 2 of the devices there is only a single connection, as shown below.
I am using an undirected weighted graph.
Initially all links are assigned a weight of 0. When I request a path between D1 and D4, I increase the weight of each link on that path by 1.
When a second request comes in for another path, I feed all the weights through the Transformer.
On that second request, I am correctly feeding a weight of 1 for links L1, L2, L3 and 0 for the other links.
Since the weight of (L4,L5,L3) or (L6,L7,L3) or (L8,L9,L3) is less than the weight of (L1,L2,L3), I expect to get one of these paths: (L4,L5,L3) or (L6,L7,L3) or (L8,L9,L3). But I get (L1,L2,L3) again.
D1---L1-->D2---L2--->D3--L3--->D4
D1---L4-->D2---L5--->D3--L3--->D4
D1---L6-->D2---L7--->D3--L3--->D4
D1---L8-->D2---L9--->D3--L3---->D4
The transformer simply returns the weight previously stored for the link.
Graph topology = new UndirectedSparseMultigraph();
DijkstraShortestPath pathCalculator = new DijkstraShortestPath(topology, wtTransformer);
List path = pathCalculator.getPath(node1, node2);
// Returns the weight currently stored for the given link
private final Transformer<Link, Integer> wtTransformer = new Transformer<Link, Integer>() {
public Integer transform(Link link) {
int weight = getWeightForLink(link, true);
return weight;
}
};
You're creating the DijkstraShortestPath so that it caches results. Pass false as the third constructor argument (cached) to change this behavior.
http://jung.sourceforge.net/doc/api/edu/uci/ics/jung/algorithms/shortestpath/DijkstraShortestPath.html
(And no, the cache is not invalidated automatically when you change an edge weight; if you do that, it's your responsibility to create a new DSP instance, or not to use caching in the first place.)
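The effect of a result cache on changing weights can be reproduced with a toy example. The class below is a hypothetical, simplified Dijkstra written for illustration; it is not JUNG and does not mirror its internals. It only assumes, as described above, that a cached instance returns the stored path without consulting the weights again. The edge list is a reduced version of the topology in the question.

```python
import heapq


class ShortestPath:
    """Toy Dijkstra over labelled undirected edges with an optional result cache."""

    def __init__(self, edges, weight_of, cached=True):
        self.edges = edges          # list of (label, node_a, node_b)
        self.weight_of = weight_of  # callable: label -> current weight
        self.cached = cached
        self._cache = {}

    def get_path(self, source, target):
        key = (source, target)
        if self.cached and key in self._cache:
            return self._cache[key]  # weights are NOT consulted again
        best, via, heap = {source: 0}, {}, [(0, source)]
        while heap:
            dist, node = heapq.heappop(heap)
            if dist > best.get(node, float("inf")):
                continue  # stale heap entry
            for label, a, b in self.edges:
                if node not in (a, b):
                    continue
                other = b if node == a else a
                cand = dist + self.weight_of(label)
                if cand < best.get(other, float("inf")):
                    best[other], via[other] = cand, (label, node)
                    heapq.heappush(heap, (cand, other))
        path, node = [], target
        while node != source:  # walk back from target to source
            label, node = via[node]
            path.append(label)
        path.reverse()
        if self.cached:
            self._cache[key] = path
        return path


edges = [("L1", "D1", "D2"), ("L2", "D2", "D3"), ("L3", "D3", "D4"),
         ("L4", "D1", "D2"), ("L5", "D2", "D3")]
weights = {label: 0 for label, _, _ in edges}

cached = ShortestPath(edges, weights.get, cached=True)
uncached = ShortestPath(edges, weights.get, cached=False)
first_cached = cached.get_path("D1", "D4")
first_uncached = uncached.get_path("D1", "D4")

for label in first_cached:  # the used links get heavier
    weights[label] += 1

second_cached = cached.get_path("D1", "D4")      # stale: same path as before
second_uncached = uncached.get_path("D1", "D4")  # recomputed with new weights
```

After the weight update, the cached instance still reports the first path, while the uncached one switches to the lighter links, which is the behavior described in the question.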

Updating the data-set when classifying new nominal instances

I'm using J48 to classify instances composed of both numeric and nominal values.
My problem is that I don't know in advance which nominal values I'll come across while my program runs.
Therefore I need to update the model's nominal attribute data "on the fly".
For instance, say I have only 2 attributes, occupation and age, and the run is as follows:
OccuptaionAttribute = {}.
input: [Piano teacher, 22].
OccuptaionAttribute = {Piano teacher}.
input: [school teacher, 30]
OccuptaionAttribute = {Piano teacher, school teacher}.
input: [Piano teacher, 40]
OccuptaionAttribute = {Piano teacher, school teacher}.
etc.
I've tried to do this manually by copying the previous attributes, adding the new attribute value and then updating the model's data.
That works fine when training the model.
But!
When I want to classify a new instance, say [SW engineer, 52], OccuptaionAttribute gets updated:
OccuptaionAttribute = {Piano teacher, school teacher, SW engineer}, but the tree itself never "met" "SW engineer" before, so the classification cannot be carried out and an Exception is thrown.
Can you advise how to handle the above situation?
Does Weka have any mechanism that supports this?
Thanks!
When training, add a placeholder value such as __other__ to your nominal attributes.
Before trying to classify an instance, first check whether the value of the nominal attribute has been seen before; if it has not, use the placeholder value:
Attribute attribute = instances.attribute("OccuptaionAttribute");
String s = "SW engineer";
int index = attribute.indexOfValue(s);
if (index == -1) {
index = attribute.indexOfValue("__other__");
}
When you have enough data, train again with the new values.
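The same fallback can be sketched outside Weka. The snippet below is a plain Python illustration, not the Weka API: `encode_nominal` is a hypothetical helper and the list of occupations is made up. It maps any value not seen at training time to the index of the `__other__` placeholder.

```python
OTHER = "__other__"


def encode_nominal(value, known_values):
    """Return the index of value, or the placeholder's index if it was never seen."""
    if value in known_values:
        return known_values.index(value)
    return known_values.index(OTHER)


# Values known at training time, with the placeholder included up front
occupations = ["Piano teacher", "school teacher", OTHER]

seen = encode_nominal("school teacher", occupations)
unseen = encode_nominal("SW engineer", occupations)  # falls back to the placeholder
```

For the fallback to be meaningful, the training data should contain at least some instances labelled with the placeholder, so the model has learned something about it.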
