Looping program for MLP Keras prediction

I am experimenting (as something of a beginner) with Keras on a time series application: I created a regression model, saved it, and now run it from a different Python script.
The time series I am dealing with is hourly data, and I am using the saved Keras model to predict a value for each hour in the data set (the data is a CSV file read into pandas). A year's worth of data means 8760 predictions (hours in a year), and at the end I am attempting to sum the predicted values.
In the code below I am not showing how the model architecture gets recreated (a Keras requirement for a saved model). The code works, it's just extremely slow: this method seems fine for under 200 predictions, but at 8760 it bogs down far too much to ever finish.
I don't have any experience with databases, but would one be a better approach than storing 8760 Keras predictions in a Python list? Thanks for any tips, I am still riding the learning curve..
# set initial loop params & empty list to store modeled data
row_num = 0
total_estKwh = []
for i, row in data.iterrows():
    params = row.values
    if params.ndim == 1:
        params = np.array([params])
    estimatedKwh = load_trained_model(weights_path).predict(params)
    print('Analyzing row number:', row_num)
    total_estKwh.append(estimatedKwh)
    row_num += 1

df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()

Seems you are making your life very difficult without obvious reason...
For starters, you don't need to load your model for every row - this is overkill! You should definitely move load_trained_model(weights_path) out of the for loop, with something like
model = load_trained_model(weights_path)  # load ONCE
and replace the respective line in the loop with
estimatedKwh = model.predict(params)
Second, it is again not efficient to call the model for prediction row by row; it is preferable to first prepare your params as an array, and then feed that array to the model to get batch predictions. Drop the print statement, too.
All in all, try this:
params_array = []
for i, row in data.iterrows():
    params = row.values
    if params.ndim == 1:
        params = np.array([params])  # is this if really necessary??
    params_array.append(params)

params_array = np.asarray(params_array, dtype=np.float32)
total_estKwh = load_trained_model(weights_path).predict(params_array)

df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
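If every column of data is numeric, the iterrows loop can usually be dropped entirely and the whole frame sent to the model as one batch. Here is a minimal sketch of that idea; StubModel and its predict are stand-ins for the real loaded Keras model (anything exposing a .predict(batch) method), not Keras API:

```python
import numpy as np
import pandas as pd

class StubModel:
    """Stand-in for the loaded Keras model: any object with a .predict(batch) method."""
    def predict(self, batch):
        # Pretend prediction: sum the features of each row.
        return batch.sum(axis=1, keepdims=True)

data = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

model = StubModel()  # real code: model = load_trained_model(weights_path), called ONCE
params_array = data.values.astype(np.float32)  # one (n_rows, n_features) batch, no loop
total_estKwh = model.predict(params_array)

print(total_estKwh.ravel().tolist())  # [11.0, 22.0, 33.0]
```

With the real model, model.predict(params_array) would return all 8760 predictions in a single call, and total_estKwh.sum() gives the yearly total.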

Python vectorizing a dataframe lookup table

I have two dataframes. One is a lookup table consisting of key/value pairs. The other is my main dataframe. The main dataframe has many more records than the lookup table. I need to construct a 'key' from existing columns in my main dataframe and then lookup a value matching that key in my lookup table. Here they are:
lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)
date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today, date_today, date_today],
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})
This is how I get a value:
df['constructed'] = "key" + df['month'].astype('str')
def getKeyValue(lk, k):
    return lk.loc[k, 'value']
print(getKeyValue(lk, df['constructed']))
Here are my issues:
1) I don't want to use iteration or apply methods. My actual data is over 2 million rows and 200 columns, and it was really slow (over 2 minutes) with apply. So I opted for an inner join, hence the need to create the new 'constructed' column, which I drop after the join. The join brought execution down to 48 seconds, but there has to be a faster way (I am hoping).
2) How do I vectorize this? I don't know how to even approach it. Is it even possible? I tried this but just got an error:
df['keyed_values'] = getKeyValue(lk, df['constructed'])
Any help or pointers are much appreciated.
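One vectorized approach that avoids both apply and a join is to build the key column as a whole Series and pass it through Series.map, with the lookup table keyed by its index. This is a sketch against the toy frames from the question, not benchmarked on the 2-million-row data:

```python
from datetime import datetime

import pandas as pd

lk = pd.DataFrame({'key': ['key10', 'key9'], 'value': [100, 90]})
lk.set_index('key', inplace=True)

date_today = datetime.now()
df = pd.DataFrame({'date1': [date_today] * 3,
                   'year': [1999, 2001, 2003],
                   'month': [10, 9, 10],
                   'code': [10, 4, 5],
                   'date2': [None, date_today, None],
                   'keyed_value': [0, 0, 0]})

# Build all keys in one vectorized string concatenation, then map them
# through the lookup Series - a hash lookup per element, no Python loop.
df['keyed_value'] = ('key' + df['month'].astype(str)).map(lk['value'])

print(df['keyed_value'].tolist())  # [100, 90, 100]
```

Keys missing from lk come back as NaN rather than raising, which may or may not be what you want for the 200-column case.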

Forecasting.ForecastBySsa with Multiple variables as input

I've got this code to predict a time series. I want a prediction based on a time series of prices together with a correlated indicator.
So along with the value to forecast, I want to pass a side value, but I cannot tell whether it is taken into account, because the prediction doesn't change with or without it. How do I tell the algorithm to consider these parameters?
public static TimeSeriesForecast PerformTimeSeriesProductForecasting(List<TimeSeriesData> listToForecast)
{
    var mlContext = new MLContext(seed: 1); // Seed set to any number so you have a deterministic environment

    var productModelPath = $"product_month_timeSeriesSSA.zip";
    if (File.Exists(productModelPath))
    {
        File.Delete(productModelPath);
    }

    IDataView productDataView = mlContext.Data.LoadFromEnumerable<TimeSeriesData>(listToForecast);
    var singleProductDataSeries = mlContext.Data.CreateEnumerable<TimeSeriesData>(productDataView, false).OrderBy(p => p.Date);
    TimeSeriesData lastMonthProductData = singleProductDataSeries.Last();

    const int numSeriesDataPoints = 2500; // The underlying data has a total of 34 months worth of data for each product

    // Create and add the forecast estimator to the pipeline.
    IEstimator<ITransformer> forecastEstimator = mlContext.Forecasting.ForecastBySsa(
        outputColumnName: nameof(TimeSeriesForecast.NextClose),
        inputColumnName: nameof(TimeSeriesData.Close), // This is the column being forecasted.
        windowSize: 22, // The length of the window over the series used to build the SSA model.
        seriesLength: numSeriesDataPoints, // The number of data points used when performing a forecast.
        trainSize: numSeriesDataPoints, // The total number of data points in the input series, from the beginning.
        horizon: 5, // The number of values to forecast; 5 means the next 5 periods are forecasted.
        confidenceLevel: 0.98f, // The likelihood that the real observed value falls within the interval bounds.
        confidenceLowerBoundColumn: nameof(TimeSeriesForecast.ConfidenceLowerBound), // Column storing the lower interval bound for each forecasted value.
        confidenceUpperBoundColumn: nameof(TimeSeriesForecast.ConfidenceUpperBound)); // Column storing the upper interval bound for each forecasted value.

    // Fit the forecasting model to the specified product's data series.
    ITransformer forecastTransformer = forecastEstimator.Fit(productDataView);

    // Create the forecast engine used for creating predictions.
    TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine = forecastTransformer.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);

    // Save the forecasting model so that it can be loaded within an end-user app.
    forecastEngine.CheckPoint(mlContext, productModelPath);

    ITransformer forecaster;
    using (var file = File.OpenRead(productModelPath))
    {
        forecaster = mlContext.Model.Load(file, out DataViewSchema schema);
    }

    // We must create a new prediction engine from the persisted model.
    TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine2 = forecaster.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);

    // Get the prediction; it will contain as many forecasted values as specified by `horizon`.
    var prediction = forecastEngine.Predict();
    return prediction;
}
TimeSeriesData has multiple attributes, not only the value of the series that I want to forecast. I just wonder whether they are taken into account when forecasting or not.
Is there a better method to forecast this type of series, like LSTM? Is such a method available in ML.NET?
ForecastBySsa is univariate: only the column passed as inputColumnName is used, so the other attributes of TimeSeriesData are ignored. Multivariate time-based series forecasting is tracked as a new enhancement ticket for ML.NET.
See ticket: github.com/dotnet/machinelearning/issues/5638

Storing Vehicle Id in Anomaly Detection

I tested anomaly detection using Deeplearning4j; everything works fine except that I am not able to preserve the VehicleID while training. What is the best approach in such a scenario?
Please look at the following snippet of code. SparkTransformExecutor returns an RDD, and InMemorySequenceRecordReader takes a list; when I collect the list from the RDD, ordering is not guaranteed.
val records: JavaRDD[util.List[util.List[Writable]]] = SparkTransformExecutor
  .executeToSequence(.....)
val split = records.randomSplit(Array[Double](0.7, 0.3))
val testSequences = split(1)

// in-memory sequence reader
val testRR = new InMemorySequenceRecordReader(testSequences.collect().toList)
val testIter = new RecordReaderMultiDataSetIterator.Builder(batchSize)
  .addSequenceReader("records", testRR) // was trainRR; the reader built above is testRR
  .addInput("records")
  .build()
Typically you track training examples by index in a dataset: record which index each vehicle occupies in the dataset, alongside training. There are a number of ways to do that.
In dl4j, we typically keep the data raw and use record readers + transform processes for the training data. Use a record reader on the raw data (pick one for your dataset; it could be CSV or even video) together with a RecordReaderDataSetIterator, like here:
```java
RecordReader recordReader = new CSVRecordReader(0, ',');
recordReader.initialize(new FileSplit(new ClassPathResource("iris.txt").getFile()));
int labelIndex = 4;
int numClasses = 3;
int batchSize = 150;

RecordReaderDataSetIterator iterator = new RecordReaderDataSetIterator(recordReader, batchSize, labelIndex, numClasses);
iterator.setCollectMetaData(true); // Instruct the iterator to collect metadata, and store it in the DataSet objects
DataSet allData = iterator.next();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); // this split was missing from the original snippet
DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();
```
(Complete code here):
https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/dataexamples/CSVExampleEvaluationMetaData.java
Alongside this you use TransformProcess:
```java
// Let's define the schema of the data that we want to import.
// The order in which columns are defined here should match the
// order in which they appear in the input data.
Schema inputDataSchema = new Schema.Builder()
    // We can define a single column
    .addColumnString("DateTimeString")
    ....
    .build();

// At each step, we identify columns by the name we gave them in the
// input data schema, above.
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    // your transforms go here
    .build();
```
Complete example below:
https://github.com/deeplearning4j/dl4j-examples/blob/6967b2ec2d51b0d19b5d6437763a2936ca922a0a/datavec-examples/src/main/java/org/datavec/transform/basic/BasicDataVecExampleLocal.java
If you use these tools, you can keep the data as is and still have a complete data pipeline. There are a lot of ways to do it; just keep in mind that you start with the vehicle id, so it doesn't have to disappear.

Fastest way to remove values from one array based on another array

I am working on some data processing in a Rails app and am trying to deal with a performance pain-point. I have 2 arrays, x_data and y_data, that each look as follows (with different values, of course):
[
  { 'timestamp_value' => '2017-01-01 12:00', 'value' => '432' },
  { 'timestamp_value' => '2017-01-01 12:01', 'value' => '421' },
  ...
]
Each array has up to perhaps 25k items. I need to prepare this data for further x-y regression analysis.
Now, some values in x_data or y_data can be nil. I need to remove values from both arrays if either x_data or y_data has a nil value at that timestamp. I then need to return the values only for both arrays.
In my current approach, I am first extracting the timestamps from both arrays where the values are not nil, then performing a set intersection on the timestamps to produce a final timestamps array. I then select values using that final array of timestamps. Here's the code:
def values_for_regression(x_data, y_data)
  x_timestamps = timestamps_for(x_data)
  y_timestamps = timestamps_for(y_data)

  # Get final timestamps as the intersection of the two
  timestamps = x_timestamps.intersection(y_timestamps)

  x_values = values_for(x_data, timestamps)
  y_values = values_for(y_data, timestamps)

  [x_values, y_values]
end

def timestamps_for(data)
  Set.new data.reject { |row| row['value'].nil? }.
           map { |row| row['timestamp_value'] }
end

def values_for(data, timestamps)
  data.select { |row| timestamps.include?(row['timestamp_value']) }.
       map { |row| row['value'] }
end
This approach isn't terribly performant, and I need to do this on several sets of data in quick succession. The overhead of the multiple loops adds up. There must be a way to at least reduce the number of loops necessary.
Any ideas or suggestions will be appreciated.
You're doing a lot of redundant iterating and creating a lot of intermediate arrays of data.
Your timestamps_for and values_for both perform a select followed by a map. The select creates an intermediate array; since your arrays are up to 25,000 items, each of these is potentially a throw-away array of the same size, and you do it four times: once each for the x and y timestamps, and once each for the x and y values. You produce another intermediate array by taking the intersection of the two sets of timestamps. You also do a complete scan of both arrays for nils twice: once to find timestamps with non-nil values, and again to map the timestamps you just extracted to their values.
While it's definitely more readable to functionally transform the input arrays, you can dramatically reduce memory usage and execution time by combining the various iterations and transformations.
All the iterations can be combined into a single loop over one data set (along with setup time for producing a timestamp->value lookup hash for the second set). Any timestamps not present in the first set will make a timestamp in the second set ignored anyways, so there is no reason to find all the timestamps in both sets, only to then find their intersection.
def values_for_regression(x_data, y_data)
  x_values = []
  y_values = []

  # Note the key is 'timestamp_value', matching the data in the question.
  y_map = y_data.each_with_object({}) { |data, hash| hash[data['timestamp_value']] = data['value'] }

  x_data.each do |data|
    next unless (x_value = data['value'])
    next unless (y_value = y_map[data['timestamp_value']])

    x_values << x_value
    y_values << y_value
  end

  [x_values, y_values]
end
I think this is functionally identical, and a quick benchmark shows a ~70% reduction in runtime:
            user     system      total        real
yours   9.640000   0.150000   9.790000  (  9.858914)
mine    2.780000   0.060000   2.840000  (  2.845621)

Ruby Rails Average two attributes in a query returning multiple objects

I've got two attributes I'm trying to average, but the query below only averages the second field. Is there a way to do this?
e = TiEntry.where('ext_trlid = ? AND mat_pidtc = ?', a.trlid, a.pidtc).average(:mat_mppss_rprcp && :mat_fppss_rprcp)
e = TiEntry.where('ext_trlid = ? AND mat_pidtc = ?', a.trlid, a.pidtc).select("AVG(mat_mppss_rprcp) AS avg1, AVG(mat_fppss_rprcp) AS avg2").map { |i| [i.avg1, i.avg2] }
Is this working for you? It behaves like the average method does, but you can support as many values as you want.
The advantage of this over the other answers here is that it uses only one simple SQL query. The others fetch everything in your table with SQL (which can take some time if the table is big) and then compute the average in Ruby.
I am sure that you have already looked at http://api.rubyonrails.org/classes/ActiveRecord/Calculations.html#method-i-average
But you can't get the average of 2 things with it.
What you can do, so as not to repeat your where clause, is:
entries = TiEntry.where('ext_trlid = ? AND mat_pidtc = ?', a.trlid, a.pidtc)
average_mppss = entries.average(:mat_mppss_rprcp)
average_fppss = entries.average(:mat_fppss_rprcp)
This way you only build the relation one time.
I hope that this works for you.