DeepLearning4j and DataVec read csv file with label - deeplearning4j

I have built a DL4j project. Everything is fine if I use MNIST dataset as follows:
DataSetIterator mnistTrain = new MnistDataSetIterator(batchSize, true, rngSeed);
DataSetIterator mnistTest = new MnistDataSetIterator(batchSize, false, rngSeed);
However, I want to switch to my own csv file with the following format:
A | B | C | X | Y
-------------------------
1 | 100 | 5 | 15 | 6
...
X and Y are the outcomes (or labels). As I plan to perform regression analysis, so both X and Y are real numbers. So I read the csv file using the following code:
RecordReader recordReaderTrain = new CSVRecordReader(1, ",");
recordReaderTrain.initialize(new FileSplit(new File("src/main/resources/data/Data.csv")));
DataSetIterator dataIterTrain = new RecordReaderDataSetIterator(recordReaderTrain, batchSize, 3, 2);
3 in the code means index of the labels and 2 means number of possible labels. There no much explanation about these two parameters. I guess they mean the labels start from the 4th column and has 2 labels.
When I run the code, it shows the following exception:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 14
I think it is because dl4j does not recognize 15 as label.
So my question is: how can I properly read the csv file for a regression analysis?
Many thanks.

Right so we have examples for regression:
https://github.com/deeplearning4j/dl4j-examples/tree/cc383de91bdf4e28e36859aa2e8749100cd63177/dl4j-examples/src/main/java/org/deeplearning4j/examples/feedforward/regression
You need to pass regression true (it's an extra part of the constructor) to the RecordReaderDataSetIterator.

Related

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset and have originally 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = _train) %>%
step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
step_impute_knn(Gr_Liv_Area) %>%
step_integer(Overall_Qual) %>%
step_normalize(all_numeric_predictors()) %>%
step_other(Neighborhood, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE)
prepare <- prep(blueprint, data = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
splitrule = "variance",
min.node.size = 2)
rf_cv <- train(blueprint,
data = ames_train,
method = "ranger",
trControl = cv_specs,
tuneGrid = param_grid_rf,
metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says on section 3.8.3,
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error,
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid'. Although the code works fine when the grid of 'mtry' values is specified to be from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35 which corresponds to the 'baked_train' data, although specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

Estimate a numerical value through Spark MLlib Regression

I'm training a Spark MLlib linear regressor but I believe I didn't understand part of the libraries hands-on usage.
I have 1 feature (NameItem) and one output (Accumulator).
The first one is categorical (Speed, Temp, etc), the second is numerical in double type.
Training set is made of several milions of entries and they are not linearly correlated (I checked with heatmap and correlation indexes).
Issue: I'd like to estimate the Accumulator value given the NameItem value through linear regression, but I think it is not what I'm actually doing.
Question: How can I do It?
I first divided the dataset in training set and data set:
(trainDF, testDF) = df.randomSplit((0.80, 0.20), seed=42)
After that I tried a pipeline approach, as most tutorials show:
1) I indexed NameItem
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
2) Then I encoded it
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["EncodedItem"]
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
3) And also assembled it
assemblerInput = encoderOutput
assembler = VectorAssembler(inputCols=assemblerInput, outputCol="features")
After that I continued with the effective training:
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
lrModel = pipeline.fit(trainDF)
That's what I obtain when I apply the prediction on the test set:
predictions = lrModel.transform(testDF).show(5, False)
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
|NameItem |Accumulator |CategorizedItem|EncodedItem |features |prediction |
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
|Speed |44000.00000000 |265.0 |(688,[265],[1.0])|(689,[265,688],[1.0,44000.0]) |44000.100892495786|
|Speed |245000.00000000 |265.0 |(688,[265],[1.0])|(689,[265,688],[1.0,245000.0]) |245000.09963708033|
|Temp |4473860.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,4473860.0]) |4473859.874261986 |
|Temp |6065.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,6065.0]) |6065.097757082314 |
|Temp |10140.00000000 |66.0 |(688,[66],[1.0]) |(689,[66,688],[1.0,10140.0]) |10140.097731630483|
+--------------+-----------------+---------------+-----------------+-------------------------------+------------------+
only showing top 5 rows
How can it be possible that for the same categorical feature (for example Temp) I get 3 different predictions?
Even though they are very close to the expected value, I feel there's something wrong.
How can it be possible that for the same categorical feature (for example Temp) I get 3 different predictions?
It's because somehow your output Accumulator has found its way into features (which of course should not be the case), so the model just "predicts" (essentially copies) this part of the input; that's why the predictions are so "accurate"...
Seems like the VectorAssembler messes things up. Thing is, you don't really need a VectorAssembler here, since in fact you only have a "single" feature (the one-hot encoded sparse vector in EncodedItem). This might be the reason why VectorAssembler behaves like that here (it is asked to "assemble" a single feature), but in any case this would be a bug.
So what I suggest is to get rid of the VectorAssembler, and rename the EncodedItem directly as features, i.e.:
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["features"] # 1st change
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, lr]) # 2nd change
lrModel = pipeline.fit(trainDF)
UPDATE (after feedback in the comments)
My Spark version Is 1.4.4
Unfortunately I cannot reproduce the issue, simply because I have not access to Spark 1.4.4, which you are using. But I have confirmed that it works OK in the most recent version of Spark 2.4.4, making me even more inclined to believe that there was indeed some bug back in v1.4, which however has subsequently been resolved.
Here is a reproduction in Spark 2.4.4, using some dummy data resembling yours:
spark.version
# '2.4.4'
from pyspark.ml.feature import VectorAssembler, OneHotEncoderEstimator, StringIndexer
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
# dummy data resembling yours:
df = spark.createDataFrame([['Speed', 44000],
['Temp', 23000],
['Temp', 5000],
['Speed', 75000],
['Weight', 5300],
['Height', 34500],
['Weight', 6500]],
['NameItem', 'Accumulator'])
df.show()
# result:
+--------+-----------+
|NameItem|Accumulator|
+--------+-----------+
| Speed| 44000|
| Temp| 23000|
| Temp| 5000|
| Speed| 75000|
| Weight| 5300|
| Height| 34500|
| Weight| 6500|
+--------+-----------+
indexer = StringIndexer(inputCol="NameItem", outputCol="CategorizedItem", handleInvalid = "keep")
encoderInput = [indexer.getOutputCol()]
encoderOutput = ["EncodedItem"]
encoder = OneHotEncoderEstimator(inputCols=encoderInput, outputCols=encoderOutput)
assemblerInput = encoderOutput
assembler = VectorAssembler(inputCols=assemblerInput, outputCol="features")
lr = LinearRegression(labelCol="Accumulator")
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
lrModel = pipeline.fit(df)
lrModel.transform(df).show() # predicting on the same df, for simplicity
The result of the last transform is
+--------+-----------+---------------+-------------+-------------+------------------+
|NameItem|Accumulator|CategorizedItem| EncodedItem| features| prediction|
+--------+-----------+---------------+-------------+-------------+------------------+
| Speed| 44000| 2.0|(4,[2],[1.0])|(4,[2],[1.0])| 59500.0|
| Temp| 23000| 1.0|(4,[1],[1.0])|(4,[1],[1.0])|14000.000000000004|
| Temp| 5000| 1.0|(4,[1],[1.0])|(4,[1],[1.0])|14000.000000000004|
| Speed| 75000| 2.0|(4,[2],[1.0])|(4,[2],[1.0])| 59500.0|
| Weight| 5300| 0.0|(4,[0],[1.0])|(4,[0],[1.0])| 5900.000000000004|
| Height| 34500| 3.0|(4,[3],[1.0])|(4,[3],[1.0])| 34500.0|
| Weight| 6500| 0.0|(4,[0],[1.0])|(4,[0],[1.0])| 5900.000000000004|
+--------+-----------+---------------+-------------+-------------+------------------+
from where you can see that:
The features now do not include the values of the output variable Accumulator, as it should be indeed; in fact, as I had argued above, features is now identical with EncodedItem, making the VectorAssembler redundant, exactly as we should expect since we only have one single feature.
The prediction values are now identical for the same values of NameItem, again as we would expect them to be, plus that they are less accurate and thus more realistic.
So, most certainly, your issue has to do with the vastly outdated Spark version 1.4.4 you are using. Spark has made leaps since v1.4, and you should seriously consider updating...

KMeans clustering in PySpark

I have a spark dataframe 'mydataframe' with many columns. I am trying to run kmeans on only two columns: lat and long (latitude & longitude) using them as simple values). I want to extract 7 clusters based on just those 2 columns and then I want to attach the cluster asignment to my original dataframe. I've tried:
from numpy import array
from math import sqrt
from pyspark.mllib.clustering import KMeans, KMeansModel
# Prepare a data frame with just 2 columns:
data = mydataframe.select('lat', 'long')
data_rdd = data.rdd # needs to be an RDD
data_rdd.cache()
# Build the model (cluster the data)
clusters = KMeans.train(data_rdd, 7, maxIterations=15, initializationMode="random")
But I am getting an error after a while:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5191.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5191.0 (TID 260738, 10.19.211.69, executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last)
I've tried to detach and re-attach the cluster. Same result. What am I doing wrong?
Since, based on another recent question of yours, I guess you are in your very first steps with Spark clustering (you are even importing sqrt & array, without ever using them, probably because it is like that in the docs example), let me offer advice in a more general level rather than in the specific question you are asking here (hopefully also saving you from subsequently opening 3-4 more questions, trying to get your cluster assignments back into your dataframe)...
Since
you have your data already in a dataframe
you want to attach the cluster membership back into your initial
dataframe
you have no reason to revert to an RDD and use the (soon to be deprecated) MLlib package; you will do your job much more easily, elegantly, and efficiently using the (now recommended) ML package, which works directly with dataframes.
Step 0 - make some toy data resembling yours:
spark.version
# u'2.2.0'
df = spark.createDataFrame([[0, 33.3, -17.5],
[1, 40.4, -20.5],
[2, 28., -23.9],
[3, 29.5, -19.0],
[4, 32.8, -18.84]
],
["other","lat", "long"])
df.show()
# +-----+----+------+
# |other| lat| long|
# +-----+----+------+
# | 0|33.3| -17.5|
# | 1|40.4| -20.5|
# | 2|28.0| -23.9|
# | 3|29.5| -19.0|
# | 4|32.8|-18.84|
# +-----+----+------+
Step 1 - assemble your features
In contrast to most ML packages out there, Spark ML requires your input features to be gathered in a single column of your dataframe, usually named features; and it provides a specific method for doing this, VectorAssembler:
from pyspark.ml.feature import VectorAssembler
vecAssembler = VectorAssembler(inputCols=["lat", "long"], outputCol="features")
new_df = vecAssembler.transform(df)
new_df.show()
# +-----+----+------+-------------+
# |other| lat| long| features|
# +-----+----+------+-------------+
# | 0|33.3| -17.5| [33.3,-17.5]|
# | 1|40.4| -20.5| [40.4,-20.5]|
# | 2|28.0| -23.9| [28.0,-23.9]|
# | 3|29.5| -19.0| [29.5,-19.0]|
# | 4|32.8|-18.84|[32.8,-18.84]|
# +-----+----+------+-------------+
As perhaps already guessed, the argument inputCols serves to tell VectoeAssembler which particular columns in our dataframe are to be used as features.
Step 2 - fit your KMeans model
from pyspark.ml.clustering import KMeans
kmeans = KMeans(k=2, seed=1) # 2 clusters here
model = kmeans.fit(new_df.select('features'))
select('features') here serves to tell the algorithm which column of the dataframe to use for clustering - remember that, after Step 1 above, your original lat & long features are no more directly used.
Step 3 - transform your initial dataframe to include cluster assignments
transformed = model.transform(new_df)
transformed.show()
# +-----+----+------+-------------+----------+
# |other| lat| long| features|prediction|
# +-----+----+------+-------------+----------+
# | 0|33.3| -17.5| [33.3,-17.5]| 0|
# | 1|40.4| -20.5| [40.4,-20.5]| 1|
# | 2|28.0| -23.9| [28.0,-23.9]| 0|
# | 3|29.5| -19.0| [29.5,-19.0]| 0|
# | 4|32.8|-18.84|[32.8,-18.84]| 0|
# +-----+----+------+-------------+----------+
The last column of the transformed dataframe, prediction, shows the cluster assignment - in my toy case, I have ended up with 4 records in cluster #0 and 1 record in cluster #1.
You can further manipulate the transformed dataframe with select statements, or even drop the features column (which has now fulfilled its function and may be no longer necessary)...
Hopefully you are much closer now to what you actually wanted to achieve in the first place. For extracting cluster statistics etc., another recent answer of mine might be helpful...
Despite my other general answer, and in case you, for whatever reason, must stick with MLlib & RDDs, here is what causes your error using the same toy df.
When you select columns from a dataframe to convert to RDD, as you do, the result is an RDD of Rows:
df.select('lat', 'long').rdd.collect()
# [Row(lat=33.3, long=-17.5), Row(lat=40.4, long=-20.5), Row(lat=28.0, long=-23.9), Row(lat=29.5, long=-19.0), Row(lat=32.8, long=-18.84)]
which is not suitable as an input to MLlib KMeans. You'll need a map operation for this to work:
df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1])).collect()
# [(33.3, -17.5), (40.4, -20.5), (28.0, -23.9), (29.5, -19.0), (32.8, -18.84)]
So, your code should be like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
rdd = df.select('lat', 'long').rdd.map(lambda x: (x[0], x[1]))
clusters = KMeans.train(rdd, 2, maxIterations=10, initializationMode="random") # works OK
clusters.centers
# [array([ 40.4, -20.5]), array([ 30.9 , -19.81])]

if (freq) x$counts else x$density length > 1 and only the first element will be used

for my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employee at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what is the problem and write me the right command?
Thank you!
It is worth pointing out that the message displayed is a Warning message, and should not prevent the results being plotted. However, it does indicate there are some issues with the data.
Without the full dataset, it is not 100% obvious what may be the problem. I believe it is caused by the data not being in the correct format, with two potential issues. Firstly, some values have a value of 0, and these won't be plotted on the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two datasets:
A dataframe which has been aggregated grouped into consistently sized bins.
A list of values X which in the data
I prefer the second technique. As originally shown here The expandRows() function in the package splitstackshape can be used to repeat the number of rows in the dataframe by the number of observations:
set.seed(123)
dataset1 <- data.frame(X = runif(900, 0, 1), Y = runif(900, 0, 1000))
library(splitstackshape)
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim=c(0,1))
dataset1$bins <- cut(dataset1$X, breaks = seq(0,1,0.01), labels = FALSE)

torch/nn - Joining arrays of Tensors element wise

The subject of this question is joining tensors for neural networks with torch/nn and torch/nngraph libraries for Lua. I started coding in Lua a few weeks ago so my experience is very minimal. In the text below, I refer to lua tables as arrays.
Context
I am working with a recurrent neural network for speech recognition.
At some point in the network there are N number of arrays of m Tensors.
a = {a1, a2, ..., aM},
b = {b1, b2, ..., bM},
... N times
Where ai and bi are tensors and {} represents an array.
What needs to be done is join all those arrays element-wise so that output is an array of M Tensors where output[i] is the result of joining every ith Tensors from the N arrays over the second dimension.
output = {z1, z2, ..., zM}
Example
|| used to represent Tensors
x = {|1 1|, |2 2|}
|1 1| |2 2|
Tensors of size 2x2
y = {|3 3 3|, |4 4 4|}
|3 3 3| |4 4 4|
Tensors of size 2x3
|
| Join{x,y}
\/
z = {|1 1 3 3 3|, |2 2 4 4 4|}
|1 1 3 3 3| |2 2 4 4 4|
Tensors of size 2x5
So the first Tensor of x of size 2x2 was joined with the first Tensor of y of size 2x3 over the second dimension and same thing for second Tensor of each array resulting in z an array of Tensors 2x5.
Problem
Now this is a basic concatenation, but I can't seem to find a module in the torch/nn library that would allow me to do that. I could write my own module of course, but if an already existing module does it then I would rather go with that.
The only existing module I know that joins table is (obviously) JoinTable. It takes an array of Tensors and joins them together. I want to join arrayS of tensors element-wise.
Also, as we are feeding input to our network, the number of Tensors in the N arrays varies, so m from the context above is not constant.
Idea
What I thought I could do in order to use the module JoinTable is convert my arrays into Tensors instead and then JoinTable on the converted N Tensors. But then again I would need a module that does such a conversion and and another one to convert back to an array in order to feed it to the next layers of the network.
Last resort
Write a new module that iterates over all given arrays and concatenates element-wise. Of course it's do-able, but the whole purpose of this post is to find a way to avoid writing smelly modules. It seems weird to me that such a module doesn't already exist.
Conclusion
I finally decided to do as I wrote in Last resort. I wrote a new module that iterates over all given arrays and concatenates element-wise.
Though, the answer given by #fmguler does the same without having to write a new module.
You can do it with nn.SelectTable and nn.JoinTable like this;
require 'nn'
x = {torch.Tensor{{1,1},{1,1}}, torch.Tensor{{2,2},{2,2}}}
y = {torch.Tensor{{3,3,3},{3,3,3}}, torch.Tensor{{4,4,4},{4,4,4}}}
res = {}
res[1] = nn.JoinTable(2):forward({nn.SelectTable(1):forward(x),nn.SelectTable(1):forward(y)})
res[2] = nn.JoinTable(2):forward({nn.SelectTable(2):forward(x),nn.SelectTable(2):forward(y)})
print(res[1])
print(res[2])
If you want this to be done in a module, wrap it in nnGraph;
require 'nngraph'
x = {torch.Tensor{{1,1},{1,1}}, torch.Tensor{{2,2},{2,2}}}
y = {torch.Tensor{{3,3,3},{3,3,3}}, torch.Tensor{{4,4,4},{4,4,4}}}
xi = nn.Identity()()
yi = nn.Identity()()
res = {}
--you can loop over columns here>>
res[1] = nn.JoinTable(2)({nn.SelectTable(1)(xi),nn.SelectTable(1)(yi)})
res[2] = nn.JoinTable(2)({nn.SelectTable(2)(xi),nn.SelectTable(2)(yi)})
module = nn.gModule({xi,yi},res)
--test like this
result = module:forward({x,y})
print(result)
print(result[1])
print(result[2])
--gives the result
th> print(result)
{
1 : DoubleTensor - size: 2x5
2 : DoubleTensor - size: 2x5
}
th> print(result[1])
1 1 3 3 3
1 1 3 3 3
[torch.DoubleTensor of size 2x5]
th> print(result[2])
2 2 4 4 4
2 2 4 4 4
[torch.DoubleTensor of size 2x5]

Resources