I read a post about forecasting time series with LSTM using CNTK. It was very helpful for getting a better understanding of how to apply this method to other problems. Implementing an LSTM network in CNTK is very simple, taking only a couple of lines of code.
model = Sequential (
RecurrentLSTMLayer {$stateDim$, usePeepholes = true} : # first LSTM
DenseLayer {$labelDim$, bias=false} # followed by an adaptor layer (from LSTM output size to the output/label size)
)
I used the data preprocessing program included in the post to generate the training and validation data. I tried to prepare my own domain data by studying the training data file of the example: each row consists of 15 features (the input window) and 12 labels (the output window). The first two rows of data are shown below.
1|i -0.117767881389987 -0.136789685972378 -0.157142990666484 -0.110810379516591 -0.0514608885500003 -0.0519359184751851 -0.093395333203464 -0.0466859579796335 -0.027053633818924 -0.0228974319964887 -0.0226106294034727 -0.0771583282775792 0.0326521764808296 0.0382623225371779 0.0179878482650109 |o -0.0419931078602005 -0.00707823233794613 -0.0326514790959216 0.107345877141872 0.0500879860433807 -0.0227185182952923 0.0354644105738453 0.0276734047763592 0.0830922226488839 0.0670409830200276 0.0983389666100694 0.101450282120106 |
1|i -0.142277570256967 -0.162630874951073 -0.11629826380118 -0.0569487728345894 -0.0574238027597742 -0.0988832174880532 -0.0521738422642226 -0.0325415181035131 -0.0283853162810779 -0.0280985136880618 -0.0826462125621683 0.0271642921962405 0.0327744382525887 0.0124999639804217 -0.0474809921447896 |o -0.0125661166225353 -0.0381393633805107 0.101857992857283 0.0446001017587916 -0.0282064025798814 0.0299765262892562 0.02218552049177 0.0776043383642948 0.0615530987354385 0.0928510823254802 0.0959623978355166 0.0630698500493061 |
As mentioned in the post, the input (15 values) and output (12 values) windows move forward one step at a time (see picture below). Therefore, the data in consecutive rows should be shifted by just one value, but that does not appear to be the case: there is no overlap of values between any two rows.
Input and output windows for the time series
My question is how should I prepare the training and validation data for time series prediction using CNTK LSTM?
I looked into the data preprocessing script (in R) and found out that the values in each input and output window are the sum of the smoothed trend and the random component (noise), as part of the data normalization.
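For what it's worth, here is a minimal Python sketch of how sliding windows with a one-step shift could be written out in the text format shown above. The synthetic series, window sizes and file name are my own assumptions for illustration; this is not code from the post.

import numpy as np

# placeholder series; the real values would come from the preprocessing script
series = np.sin(np.linspace(0, 20, 500))
input_len, output_len = 15, 12  # input window and output window sizes

with open("train.ctf", "w") as f:
    for start in range(len(series) - input_len - output_len + 1):
        x = series[start:start + input_len]                            # input window
        y = series[start + input_len:start + input_len + output_len]   # output window
        f.write("1|i " + " ".join("%.6f" % v for v in x) +
                " |o " + " ".join("%.6f" % v for v in y) + " |\n")

With windows generated this way, consecutive rows overlap in all but one value, which is the behaviour described in the post.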
Description of the problem
My goal is quite basic: to plot time series in an interactive plot. After some research I decided to give Altair a try.
There are already QGIS plugins for time-series visualisation, but as far as I'm aware, none for plotting time series at the vector level, interactively clicking on a map and selecting a polygon. That's why I decided to go for a self-made solution using Altair, maybe combining it with Folium later on to add functionality.
I'm totally new to the Altair library (as well as Vega and Vega-Lite), and quite new to data science and data visualisation as well... so apologies in advance for my ignorance!
There are already well-explained tutorials on how to plot time series with Altair (for example here, or on the official website). However, my study case has some particularities that, as far as I've seen, have not yet been addressed all together.
The data is produced using the Python API for Google Earth Engine and preprocessed with Python and the pandas/geopandas libraries:
In Google Earth Engine, a vegetation index (NDVI in the current case) is computed at pixel level for a certain region of interest (ROI). Then the function image.reduceRegions() is mapped across the ImageCollection to compute the mean NDVI in every polygon of a FeatureCollection, which represents the agricultural parcels. The resulting vector file is exported.
Under a JupyterLab environment, the data is loaded into a geopandas GeoDataFrame and preprocessed, transposing the DataFrame and creating a datetime column, among other steps, in order to have the data well shaped for time-series representation with Altair.
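For context, here is a minimal sketch of what that first step might look like with the Earth Engine Python API; the asset ID, image collection, band names and scale are placeholders I chose for illustration, not the asker's actual script:

import ee
ee.Initialize()

parcels = ee.FeatureCollection('users/example/parcels')  # hypothetical asset
s2 = ee.ImageCollection('COPERNICUS/S2_SR').filterBounds(parcels)

def mean_ndvi_per_parcel(image):
    # NDVI from Sentinel-2 NIR (B8) and red (B4), then mean per polygon
    ndvi = image.normalizedDifference(['B8', 'B4']).rename('NDVI')
    return ndvi.reduceRegions(collection=parcels,
                              reducer=ee.Reducer.mean(),
                              scale=10)

per_parcel = ee.FeatureCollection(s2.map(mean_ndvi_per_parcel)).flatten()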
Data overview after preprocessing:
My "final" goal would be to show, in the same graphic, an interactive line plot with a set of lines representing each one an agricultural parcel, with parcels categorized by crop types in different colours, e.g. corn in green, wheat in yellow, peer trees in brown... (the information containing the crop type of each parcel can be added to the DataFrame making a join with another DataFrame).
I am thinking of something looking more or less like the following example, with the legend's years replaced by the parcels coloured by crop type:
But so far I haven't managed to make my data look this way... at all.
As you can see there are many nulls in the data (this is due to the application of a cloud-masking function and to the fact that several Sentinel-2 orbits intersect the ROI). I would like to simply omit the null values for each column/parcel, but I don't know whether this data configuration can pose problems (any advice on that?).
So far, this is what I've got:
The generation of the preceding graphic, for a single parcel, already takes around 23 seconds, which is something that maybe should/could be improved (how?).
And more importantly, the expected line representing the item/polygon/parcel's values (NDVI) is not even shown in the plot (note that I chose a parcel containing rather few non-null values).
For sure I am doing many things wrong. It would be great to get some advice to solve (some of) them.
Sample of the data and code to reproduce the issue
Here's a text sample of the data in JSON format, and the code used to reproduce the issue is the following:
import pandas as pd
import geopandas as gpd
import altair as alt

df = pd.read_json(r"path\to\json\file.json")
df['date'] = pd.to_datetime(df['date'])
print(df.dtypes)
df
Output:
lines = alt.Chart(df).mark_line().encode(
    x='date:O',
    y='17811:Q',
    color=alt.Color(
        '17811:Q', scale=alt.Scale(scheme='redyellowgreen', domain=(-1, 1)))
)
lines.properties(width=700, height=600).interactive()
Output:
Thanks in advance for your help!
If I understand correctly, it is mostly the format of your dataframe that needs to be changed from wide to long, which you can do either via .melt in pandas or .transform_fold in Altair. With melt, the default names for the melted columns are 'variable' (the previous column names) and 'value' (the value of each column):
alt.Chart(df.melt(id_vars='date'), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
The gaps come from the NaNs; if you want Altair to interpolate over the missing values, you can drop the NaNs:
alt.Chart(df.melt(id_vars='date').dropna(), width=500).mark_line().encode(
    x='date:T',
    y='value',
    color=alt.Color('variable')
)
If you want to do it all in Altair, the following is equivalent to the last pandas example above (the transform uses 'key' instead of 'variable' as the name for the former columns). I also use an ordinal instead of a nominal type for the color encoding, to show how to make the colors more similar to your example:
alt.Chart(df, width=500).mark_line().encode(
    x='date:T',
    y='value:Q',
    color=alt.Color('key:O')
).transform_fold(
    df.drop(columns='date').columns.tolist()
).transform_filter(
    'isValid(datum.value)'
)
I'm training a linear regressor in BigQuery. I'm training it with ~20,000 rows of data and a target output column with values in {0, 1}. I'm setting the EARLY_STOP option to False and MAX_ITERATIONS to 10. Training always stops after 1 iteration with an MAE of ~0.2. Here is the CREATE MODEL query:
CREATE OR REPLACE MODEL `sketching.linear_regressor_08-13--11-59-03.811249`
OPTIONS (model_type=@model_type,
         input_label_cols=@input_label_cols,
         max_iterations=@max_iterations,
         data_split_method=@data_split_method,
         early_stop=@early_stop) AS
SELECT * FROM `ds.training_table`
These are the parameters passed with the query (Python object print):
[ScalarQueryParameter('model_type', 'STRING', 'linear_reg'),
ArrayQueryParameter('input_label_cols', 'STRING', ['target_output']),
ScalarQueryParameter('max_iterations', 'INT64', 10),
ScalarQueryParameter('data_split_method', 'STRING', 'RANDOM'),
ScalarQueryParameter('early_stop', 'BOOL', False)]
PS: I've inspected bytes_processed on the QueryJob, and it checks out (meaning the whole table is actually being processed).
UPDATE
It looks like BigQuery is ignoring most of the model options that are supplied. This is a screenshot of my model status in the BigQuery web UI:
As you can see in the training options section, none of the options that I provided are shown, and the options that are displayed are ones I did not actually provide. Changing the data split method option did, however, effect a change here.
UPDATE 2
I provided the L1_REG option (0.1) and it magically fixed the problem: training went up to 10 iterations (the MAX_ITERATIONS I provided).
If I run the model without any of the optional options (or with just the early_stop option), it stops at 1 iteration.
Your model is probably stopping because the least-squares solution is being computed directly. This is because the default setting is OPTIMIZE_STRATEGY='AUTO_STRATEGY', and your query doesn't fulfill the requirements for batch gradient descent to be selected over the direct least-squares solution.[1]
If you don't want this to occur, set OPTIMIZE_STRATEGY='BATCH_GRADIENT_DESCENT'.
[1] https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm#optimize_strategy
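As an illustration (not the asker's exact script), the option could be added to the OPTIONS clause like this; the model name below is a placeholder:

from google.cloud import bigquery

client = bigquery.Client()
query = """
CREATE OR REPLACE MODEL `sketching.linear_regressor_example`
OPTIONS (model_type='linear_reg',
         input_label_cols=['target_output'],
         max_iterations=10,
         data_split_method='RANDOM',
         early_stop=FALSE,
         optimize_strategy='BATCH_GRADIENT_DESCENT') AS
SELECT * FROM `ds.training_table`
"""
client.query(query).result()  # wait for the training job to finish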
I have an ML.NET project and so far everything has gone great. I have a motor that collects a power reading 256 times per rotation, and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values at a time, so I have been spending several rotations collecting the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine the motor's state every rotation. If I just evenly space the samples down to 38, my model degrades a lot. I know I am not feeding the model the features it thinks are most important, but just making a guess and randomly selecting data for it.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this, and I found the statement below about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as weight for each column and how would I do that?
I have only been working with ML.NET for a couple of weeks now, so I apologize if the question is naive; I assure you I have googled this as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting Single but got String". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later or someone else here can. Below is the code example.
// Create MLContext
MLContext mlContext = new MLContext();

// Load data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column names of the input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label").ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features are most important by permuting each feature in isolation and then measuring how much that change affects the model's performance metrics. You can use this to see which features matter most to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
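As a language-agnostic illustration of the same idea (this sketch uses scikit-learn in Python, not ML.NET, and synthetic data rather than your motor readings):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 10))                  # 10 candidate features
y = (X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)   # only features 3 and 7 matter

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)

# features whose shuffling hurts the score most are the most important
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f}")

The ML.NET API described in the linked doc follows the same pattern: permute one feature at a time and rank features by the drop in the model's metric.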
You can also use an open-source set of packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local-instance as well as global model breakdowns, details, diagnostics, feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX
I want to train a ComputationGraph which has two outputs (this model), and in my script I have INDArrays (1 input and 2 outputs) ready to be sent to the neural network. It seems that I should use a MultiDataSetIterator to be able to set up the batch size before calling the model.fit() function. I have been looking for a way to implement that for a long time, and the answers I find always use CSV files, which is not what I want: while running the simulations of the game I am building a dataset of INDArrays that are stored in memory, and I am not loading any kind of CSV file.
Any ideas on how to create my MultiDataSetIterator to feed my fit() function?
You don't have to use a MultiDataSetIterator. You can fit with a MultiDataSet (here) or you can fit with arrays of INDArrays (here), using your INDArrays in memory.
If you compare two sets of data (such as two files), the differences between these sets can be displayed in two columns, or two panes, as WinMerge does.
But are there any visual paradigms to display the differences between multiple data sets?
Update
The starting point of my question was the assumption that displaying differences between 2 files is relatively easy (as I mentioned, WinMerge does it), whereas comparing 3 or more text files turns out to be more complicated, as there will be more and more differences between, say, different versions of a document that have been created over time.
How would you highlight parts of the file that are the same in 2 versions, but different from other versions?
The data sets I have in mind are objects (A, B, C, ...) which may or may not exist and have properties (a, b, c, ...) which may be set or not set.
Example:
Set 1: A(a, b, c), B(b, c), C(c)
Set 2: A(a, b, c), B(b), C(c)
Set 3: A(a, b), B(b)
If you compare 2 sets, e.g. 1 and 2, the difference would be in B(c). Comparing sets 2 and 3 results in the difference A(c) and C().
If you compare all 3 sets, you end up with 3 pairwise comparisons (n * (n - 1) / 2).
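For concreteness, here is a small sketch (my own notation, not part of the original question) that computes those pairwise differences for the three example sets:

from itertools import combinations

sets = {
    1: {'A': {'a', 'b', 'c'}, 'B': {'b', 'c'}, 'C': {'c'}},
    2: {'A': {'a', 'b', 'c'}, 'B': {'b'},      'C': {'c'}},
    3: {'A': {'a', 'b'},      'B': {'b'}},
}

for i, j in combinations(sets, 2):          # n * (n - 1) / 2 pairs
    diff = {}
    for obj in sets[i].keys() | sets[j].keys():
        if obj not in sets[i] or obj not in sets[j]:
            diff[obj] = 'missing in set %d' % (j if obj not in sets[j] else i)
        else:
            delta = sets[i][obj] ^ sets[j][obj]  # properties in only one set
            if delta:
                diff[obj] = delta
    print('sets %d vs %d: %s' % (i, j, diff))

# Expected content (key order may vary):
# sets 1 vs 2: {'B': {'c'}}
# sets 1 vs 3: {'A': {'c'}, 'B': {'c'}, 'C': 'missing in set 3'}
# sets 2 vs 3: {'A': {'c'}, 'C': 'missing in set 3'}

This matches the differences stated above: B(c) for sets 1 and 2, and A(c) plus the missing C for sets 2 and 3.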
I have a different view from some of those who provided answers--i.e., from the view that you need to further specify the problem. The abstraction level is about right: further specification would make the problem easier, but the solution less useful.
A couple of years ago, I saw a graphic on ProgrammableWeb--it compared the results of a search on Yahoo with the results of the same search on Google. There's a lot of information to convey: some results are in both sets, some in just one, and the common results have different positions in the respective engines' result lists, which somehow has to be shown.
I liked the graphic and reimplemented it in Matplotlib (a Python scientific plotting library). Below is an example using some random points, as well as the Python code I used to generate it:
import numpy as NP
from matplotlib import pyplot as PLT

# each pair is an item's position on the upper line and on the lower line
xvals = NP.array([(2, 3), (5, 7), (8, 6), (1.5, 1.8), (3.0, 3.8), (5.3, 5.2),
                  (3.7, 4.1), (2.9, 3.7), (8.4, 6.1), (7.1, 6.4)])
yvals = NP.tile(NP.array([5, 3]), [10, 1])  # upper line at y=5, lower at y=3
# the two horizontal lines; endpoints assumed here so they span the data range
x, y, y2 = NP.array([0, 10]), NP.array([5, 5]), NP.array([3, 3])

fig = PLT.figure()
ax1 = fig.add_subplot(111)
ax1.plot(x, y, "-", lw=3, color='b')
ax1.plot(x, y2, "-", lw=3, color='b')
# connect each item's position on one line to its position on the other
for a, b in zip(xvals, yvals):
    ax1.plot(a, b, '-o', ms=8, mfc='orange', color='g')
PLT.axis("off")
PLT.show()
This model has some interesting features: (i) it actually deals with 'similarity' on a per-item basis (the vertically-oriented line connecting the dots) rather than aggregate similarity; (ii) the degree of similarity between two data points is proportional to the angle of the line connecting them--90 degrees if they are equal, with a decreasing angle as the difference increases; this is very intuitive; (iii) cases in which a point in one data set is not present in the second data set are easy to show--a point will appear on one of the two lines but without a line connecting it to a point on the other line.
This model works well for comparing search results because each search result has a 'score' (its index, or order, in the results list). For other types of data, you might have to assign a score to each data point--a similarity metric might work, I suppose (in a sense, that's actually what the search-result order is: a distance from the top of the list).
Since there has been so much work put into displaying a diff of two files, you might start by expressing your 'multiple data sets' in an appropriate text format, then using whatever tool you like to show a diff between those text representations.
But you should tell us more about your data sets!
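As a rough sketch of that text-diff approach (my own illustration, not part of the answer), the example sets from the question could be serialized to one line per object and compared with Python's standard difflib:

import difflib

def render(data_set):
    # one line per object, e.g. "A: a b c"
    return ['%s: %s' % (name, ' '.join(sorted(props)))
            for name, props in sorted(data_set.items())]

set1 = {'A': {'a', 'b', 'c'}, 'B': {'b', 'c'}, 'C': {'c'}}
set2 = {'A': {'a', 'b', 'c'}, 'B': {'b'},      'C': {'c'}}

for line in difflib.unified_diff(render(set1), render(set2),
                                 fromfile='set1', tofile='set2', lineterm=''):
    print(line)

For more than two sets you would still be limited to pairwise diffs, which is exactly the n * (n - 1) / 2 problem raised in the question.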
I experimented a bit, and implemented two displays:
Matrix
Timeline
I agree with Peter: you should specify what type of data you have and what you wish to bring out in the comparison.
Depending on the nature of the data/comparison, you can consider different visualisations. Is your data ordered or unordered? How many things are you comparing, i.e. is this a fine-grained or a coarse-grained comparison?
Examples:
Visualizing a comparison of unordered data could be as simple as plotting the histograms of your two sets (i.e. their distributions):
image source
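A minimal sketch of that overlaid-histogram idea (the sample data below is made up):

import numpy as np
import matplotlib.pyplot as plt

set_a = np.random.normal(loc=0.0, scale=1.0, size=1000)
set_b = np.random.normal(loc=0.5, scale=1.2, size=1000)

bins = np.linspace(-4, 5, 40)
plt.hist(set_a, bins=bins, alpha=0.5, label='set A')
plt.hist(set_b, bins=bins, alpha=0.5, label='set B')
plt.legend()
plt.show()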
On the other hand, comparing a huge ordered dataset like DNA can be done innovatively.
Also, check out Visual Complexity; it's a great resource for interesting visualizations.