Storing Vehicle Id in Anomaly Detection - deeplearning4j

I tested anomaly detection using Deeplearning4j. Everything works fine, except that I am not able to preserve the VehicleID while training. What is the best approach in such a scenario?
Please look at the following snippet of code. SparkTransformExecutor returns an RDD, while InMemorySequenceRecordReader takes a list, and when I collect a list from the RDD the ordering is not guaranteed.
val records:JavaRDD[util.List[util.List[Writable]]] = SparkTransformExecutor
val split = records.randomSplit(Array[Double](0.7,0.3))
val testSequences = split(1)
//in memory sequence reader
val testRR = new InMemorySequenceRecordReader(testSequences.collect().toList)
val testIter = new RecordReaderMultiDataSetIterator.Builder(batchSize)
.addSequenceReader("records", trainRR)

Typically you track training examples by their index in the dataset: record which index each vehicle occupies and keep that mapping alongside training. There are a number of ways to do that.
In DL4J, we typically keep the data raw and use record readers plus transform processes for the training data. Use a record reader on the raw data (pick one for your dataset; it could be CSV or even video) together with a RecordReaderDataSetIterator, like here:
RecordReader recordReader = new CSVRecordReader(0, ',');
recordReader.initialize(new FileSplit(new ClassPathResource("iris.txt").getFile()));
int labelIndex = 4;
int numClasses = 3;
int batchSize = 150;
RecordReaderDataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
iterator.setCollectMetaData(true); //Instruct the iterator to collect metadata, and store it in the DataSet objects
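DataSet allData = iterator.next();
allData.shuffle(123);
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.65); //Use 65% of the data for training (ratio shown is illustrative)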
DataSet trainingData = testAndTrain.getTrain();
DataSet testData = testAndTrain.getTest();
(Complete code here):
Alongside this you use TransformProcess:
//Let's define the schema of the data that we want to import
//The order in which columns are defined here should match the
//order in which they appear in the input data
Schema inputDataSchema = new Schema.Builder()
//We can define a single column
//At each step, we identify columns by the name we gave them in the input data schema, above
TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
//your transforms go here
Complete example below:
If you use these components, you keep the data as is but still have a complete data pipeline. There are a lot of ways to do it; just keep in mind that you start with the vehicle id, and it doesn't have to disappear.
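The index-tracking idea can be sketched in plain Python, independent of DL4J. This is only an illustration under assumptions: the records, the `anomaly_score` function, and the mean-based scoring are all hypothetical stand-ins, not part of the DL4J API. The point is that the ID list and the feature list stay in the same order, so position i in the scores maps back to vehicle i.

```python
# Hypothetical records: (vehicle_id, feature_1, feature_2). Values are made up.
records = [
    ("V-001", 0.10, 0.20),
    ("V-002", 0.11, 0.19),
    ("V-003", 5.00, 4.80),
]

# Split each record into an ID and a feature vector, keeping both in the same order.
vehicle_ids = [r[0] for r in records]
features = [list(r[1:]) for r in records]


def anomaly_score(row, centre):
    """Stand-in scorer: squared distance of a row from the column means."""
    return sum((x - c) ** 2 for x, c in zip(row, centre))


n = len(features)
centre = [sum(col) / n for col in zip(*features)]
scores = [anomaly_score(row, centre) for row in features]

# Index i in `scores` still refers to index i in `vehicle_ids`.
scored = sorted(zip(vehicle_ids, scores), key=lambda pair: pair[1], reverse=True)
most_anomalous_id = scored[0][0]
```

The model only ever sees `features`; the IDs travel next to it and are re-attached by position once the scores come back.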


Forecasting.ForecastBySsa with Multiple variables as input

I've got this code to predict a time series. I want the prediction to be based on a time series of prices and a correlated indicator.
So, together with the value to forecast, I want to pass a side value, but I cannot tell whether it is taken into account, because the prediction doesn't change with or without it. How do I tell the algorithm to consider these parameters?
public static TimeSeriesForecast PerformTimeSeriesProductForecasting(List<TimeSeriesData> listToForecast)
var mlContext = new MLContext(seed: 1); //Seed set to any number so you have a deterministic environment
var productModelPath = $"";
if (File.Exists(productModelPath))
IDataView productDataView = mlContext.Data.LoadFromEnumerable<TimeSeriesData>(listToForecast);
var singleProductDataSeries = mlContext.Data.CreateEnumerable<TimeSeriesData>(productDataView, false).OrderBy(p => p.Date);
TimeSeriesData lastMonthProductData = singleProductDataSeries.Last();
const int numSeriesDataPoints = 2500; //Number of data points from the series used to train and forecast
// Create and add the forecast estimator to the pipeline.
IEstimator<ITransformer> forecastEstimator = mlContext.Forecasting.ForecastBySsa(
outputColumnName: nameof(TimeSeriesForecast.NextClose),
inputColumnName: nameof(TimeSeriesData.Close), // This is the column being forecasted.
windowSize: 22, // Length of the window on the series used to build the trajectory matrix; set here to 22 data points.
seriesLength: numSeriesDataPoints, // This parameter specifies the number of data points that are used when performing a forecast.
trainSize: numSeriesDataPoints, // This parameter specifies the total number of data points in the input time series, starting from the beginning.
horizon: 5, // Indicates the number of values to forecast; 5 means the next 5 values of the series are forecasted.
confidenceLevel: 0.98f, // Indicates the likelihood the real observed value will fall within the specified interval bounds.
confidenceLowerBoundColumn: nameof(TimeSeriesForecast.ConfidenceLowerBound), //This is the name of the column that will be used to store the lower interval bound for each forecasted value.
confidenceUpperBoundColumn: nameof(TimeSeriesForecast.ConfidenceUpperBound)); //This is the name of the column that will be used to store the upper interval bound for each forecasted value.
// Fit the forecasting model to the specified product's data series.
ITransformer forecastTransformer = forecastEstimator.Fit(productDataView);
// Create the forecast engine used for creating predictions.
TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine = forecastTransformer.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);
// Save the forecasting model so that it can be loaded within an end-user app.
forecastEngine.CheckPoint(mlContext, productModelPath);
ITransformer forecaster;
using (var file = File.OpenRead(productModelPath))
forecaster = mlContext.Model.Load(file, out DataViewSchema schema);
// We must create a new prediction engine from the persisted model.
TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine2 = forecaster.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);
// Get the prediction; this will include the forecasted values for the next 5 data points, since this is the number specified in the `horizon` parameter when the forecast estimator was originally created.
TimeSeriesForecast prediction = forecastEngine.Predict();
return prediction;
TimeSeriesData has multiple attributes, not only the value of the series that I want to forecast. I just wonder whether they are taken into account when forecasting or not.
Is there a better method to forecast this type of series, like LSTM? Is this method available in ML.NET?
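ForecastBySsa is a univariate method: it uses only the column named in inputColumnName, so the other attributes of TimeSeriesData are not taken into account. There is a new ticket for this enhancement: Multivariate Time based series forecasting to ML.Net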
See ticket:

looping program for MLP Keras prediction

I am a beginner experimenting with Keras on a time-series application: I created a regression model and saved it so that I can run it from a different Python script.
The time series is hourly data, and I am using the saved Keras model to predict a value for each hour in the data set (the data is a CSV file read into pandas). With a year's worth of data there are 8,760 predictions (the number of hours in a year), and at the end I am attempting to sum the predicted values.
In the code below I am not showing how the model architecture gets recreated (a Keras requirement for a saved model). The code works, it's just extremely slow. This method seems fine for under 200 predictions, but for 8,760 the code bogs down far too much to ever finish.
I don't have any experience with databases, but would a database be a better method than storing 8,760 Keras predictions in a Python list? Thanks for any tips; I am still riding the learning curve.
#set initial loop params & empty list to store modeled data
row_num = 0
total_estKwh = []
for i, row in data.iterrows():
    params = row.values
    if (params.ndim == 1):
        params = np.array([params])
    estimatedKwh = load_trained_model(weights_path).predict(params)
    print('Analyzing row number:', row_num)
    total_estKwh.append(estimatedKwh[0])  # store this row's prediction
    row_num += 1
df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
It seems you are making your life very difficult for no obvious reason...
For starters, you don't need to load your model for every row; this is overkill! You should definitely move load_trained_model(weights_path) out of the for loop, with something like
model = load_trained_model(weights_path) # load ONCE
and replace the respective line in the loop with
estimatedKwh = model.predict(params)
Second, calling the model for prediction row by row is also inefficient; it is preferable to first prepare your params as an array and then feed this to the model to get batch predictions. Forget the print statement, too.
All in all, try this:
params_array = []
for i, row in data.iterrows():
    params = row.values
    if (params.ndim == 1):
        params = np.array([params]) # is this if really necessary??
    params_array.append(params)  # collect every row before predicting
# stack the (1, n_features) rows into one (n_rows, n_features) array
params_array = np.vstack(params_array).astype(np.float32)
total_estKwh = load_trained_model(weights_path).predict(params_array)
df = pd.DataFrame.from_records(total_estKwh)
total = df.sum()
totalStd = np.std(df.values)
totalMean = df.mean()
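To make the cost difference concrete, here is a small self-contained sketch that does not need Keras. Everything in it is a hypothetical stand-in: `StandInModel`, this version of `load_trained_model`, the `"weights.h5"` path, and the made-up feature rows. It only counts how often the model is loaded in each approach and checks that both give the same predictions.

```python
class StandInModel:
    """Hypothetical stand-in for a loaded Keras model; 'predicts' the sum of each row."""

    def predict(self, rows):
        return [[sum(row)] for row in rows]


load_calls = 0


def load_trained_model(weights_path):
    """Hypothetical loader that counts how often it is called."""
    global load_calls
    load_calls += 1
    return StandInModel()


# One made-up feature row per hour of a year.
rows = [[float(h), 1.0] for h in range(8760)]

# Row by row: one model load and one predict call per row.
load_calls = 0
slow = [load_trained_model("weights.h5").predict([row])[0] for row in rows]
slow_loads = load_calls

# Batch: a single load and a single predict call for all rows.
load_calls = 0
model = load_trained_model("weights.h5")
fast = model.predict(rows)
fast_loads = load_calls

total = sum(p[0] for p in fast)
```

With a real model, each load rebuilds the network and reads the weights from disk, so doing it 8,760 times dominates the runtime; the batch version pays that cost once.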

Dataflow: How to create a pipeline from an already existing PCollection spewed by another pipeline

I am trying to split my pipeline into many smaller pipelines so they execute faster. I am partitioning a PCollection of Google Cloud Storage blobs (PCollection<Blob>) so that I get a
PCollectionList<Blob> collectionList
From there I would love to be able to do something like:
Pipeline p2 = Pipeline.create(collectionList.get(0));
Pipeline p3 = Pipeline.create(collectionList.get(1));
But I haven't found any documentation about creating an initial PCollection from an already existing PCollection. I'd be very grateful if anyone could point me in the right direction.
You should look into the Partition transform to split a PCollection into N smaller ones. You can provide a PartitionFn to define how the split is done. You can find below an example from the Beam programming guide:
// Provide an int value with the desired number of result partitions, and a PartitionFn that represents the partitioning function.
// In this example, we define the PartitionFn in-line.
// Returns a PCollectionList containing each of the resulting partitions as individual PCollection objects.
PCollection<Student> students = ...;
// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
        public int partitionFor(Student student, int numPartitions) {
            return student.getPercentile()  // 0..99
                 * numPartitions / 100;
        }}));
// You can extract each partition from the PCollectionList using the get method, as follows:
PCollection<Student> fortiethPercentile = studentsByPercentile.get(4);
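The arithmetic inside the partitioning function can be checked with a plain-Python sketch. This is not Beam code: `partition_for`, `partition`, and the sample students are hypothetical names used only to show how a percentile in 0..99 maps to one of 10 partitions.

```python
def partition_for(percentile, num_partitions):
    """Map a percentile in 0..99 to a partition index in 0..num_partitions-1."""
    return percentile * num_partitions // 100


def partition(items, num_partitions, key):
    """Split items into num_partitions lists using partition_for on key(item)."""
    parts = [[] for _ in range(num_partitions)]
    for item in items:
        parts[partition_for(key(item), num_partitions)].append(item)
    return parts


# Made-up (name, percentile) pairs.
students = [("ann", 5), ("bob", 42), ("cy", 47), ("dee", 99)]
by_percentile = partition(students, 10, key=lambda s: s[1])
fortieth = by_percentile[4]  # percentiles 40..49
```

Integer division keeps the result inside the valid index range as long as the percentile stays below 100.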

Spark dataframe reduceByKey

I am using Spark 1.5/1.6 and want to do a reduceByKey operation on a DataFrame without converting the df to an RDD.
Each row looks like the following, and I have multiple rows for id1:
id1, id2, score, time
I want to have something like:
id1, [ (id21, score21, time21), (id22, score22, time22), (id23, score23, time23) ]
So, for each "id1", I want all records in a list.
By the way, the reason I don't want to convert the df to an RDD is that I have to join this (reduced) DataFrame to another DataFrame, and I am re-partitioning on the join key, which makes the join faster. I guess the same cannot be done with an RDD.
Any help will be appreciated.
To simply preserve the partitioning already achieved, re-use the parent RDD's partitioner in the reduceByKey invocation:
val rdd = df.rdd // df.rdd yields RDD[Row]; map it to (key, value) pairs before reduceByKey
val parentRdd = rdd.dependencies(0).rdd // Assuming first parent has the
// desired partitioning: adjust as needed
val parentPartitioner = parentRdd.partitioner.get // partitioner is an Option; .get assumes one is defined
val optimizedReducedRdd = rdd.reduceByKey(parentPartitioner, reduceFn)
If you do not specify the partitioner, as follows:
df.rdd.reduceByKey(reduceFn) // This is non-optimized: uses full shuffle
then the behavior you noted occurs, i.e. a full shuffle. That is because the default HashPartitioner would be used instead.
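Why a matching partitioner avoids the shuffle can be illustrated outside Spark. The sketch below is plain Python under stated assumptions: `partition_of` is a hypothetical stand-in for a hash partitioner (using CRC32 so the result is deterministic), and the records are made up. When records are already laid out by key hash, every occurrence of a key sits in one partition, so the reduction can run inside each partition without moving data.

```python
from zlib import crc32


def partition_of(key, num_partitions):
    """Deterministic hash partitioner (hypothetical stand-in, not Spark's implementation)."""
    return crc32(key.encode("utf-8")) % num_partitions


def reduce_within_partitions(partitions, reduce_fn):
    """Reduce by key inside each partition; no record changes partition."""
    result = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = reduce_fn(acc[key], value) if key in acc else value
        result.append(acc)
    return result


num_partitions = 4
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# Lay the records out by key hash, as an earlier repartition on the key would.
partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[partition_of(key, num_partitions)].append((key, value))

reduced = reduce_within_partitions(partitions, lambda x, y: x + y)
merged = {k: v for part in reduced for k, v in part.items()}
```

If the reduce step used a different partitioner than the existing layout, keys would have to be moved to their new partitions first, which is the full shuffle described above.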

SPARK - Joining two data streams - maintenance of cache

It is evident that the out-of-the-box join capability in Spark Streaming does not cover a lot of real-life use cases, because it joins only the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams, enrich each object in stream1 with its corresponding object in stream2 in Spark, and save it to HBase.
The implementation would:
maintain a dataset in memory of objects from stream2, adding or replacing objects as and when they are received
for every element in stream1, access the cache to find a matching object from stream2, save to HBase if a match is found, or put it back on the Kafka stream if not.
This question explores Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs, not just the ones in that micro-batch. Basically you keep a "running total" RDD that you fill with something like:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
  globalRDD2 = globalRDD2.union(rdd)
  globalRDD1.join(globalRDD2).foreach(...) // etc, etc
})
A good start would be to look into mapWithState, a more efficient replacement for updateStateByKey. Both are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
  value.foreach(state.update(_)) // store the newest object for this key
  (key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
stream2State is now a stream where each batch contains the (K, V) pairs with the latest value for each key updated in that batch; stream2State.stateSnapshots() gives the latest value for every key seen so far. You can do a join on this stream and stream1 to perform the further logic for your second point.
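The two steps from the question (keep the latest stream2 object per key, then match each stream1 element against that cache) can be sketched in plain Python. This is not Spark code; `enrich`, the batches, and the keys are hypothetical, and the returned lists stand in for "save to HBase" and "put back on the Kafka stream".

```python
def enrich(stream1_batch, stream2_batch, cache):
    """Update the cache from stream2, then match each stream1 element against it."""
    for key, obj in stream2_batch:
        cache[key] = obj  # add or replace the latest object for this key

    matched, unmatched = [], []
    for key, obj in stream1_batch:
        if key in cache:
            matched.append((key, obj, cache[key]))  # would be saved to HBase
        else:
            unmatched.append((key, obj))  # would be put back on the stream
    return matched, unmatched


cache = {}

# First micro-batch: only k1 has a stream2 counterpart so far.
m1, u1 = enrich([("k1", "order-1"), ("k2", "order-2")], [("k1", "user-A")], cache)

# Second micro-batch: the unmatched element is retried after k2 arrives on stream2.
m2, u2 = enrich(u1, [("k2", "user-B")], cache)
```

In the mapWithState version, the `cache` dictionary corresponds to the per-key state that Spark maintains across batches.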
