I am trying to create a dataset for ML using Scilab, and I need to save it incrementally during data generation because it is too big for Scilab's maximum stack size.
Here is a toy example I made to find out what goes wrong, but I have not been able to figure it out:
datas=[];
labels=[];
for i =1:10
for j=1:100
if j==1
disp(i)
end
data = sin(-%pi:0.01:%pi);
label = rand();
datas = [datas, data];
labels = [labels, label];
end
save(chemin+'\test.h5','-append','datas','labels')
datas = [];
labels = [];
end
I expect the shape of data to be [1000,629] at the end, but I get [62900,0].
Do you have any idea why?
Here is an example of how to incrementally save a big matrix without any memory pressure:
// create a new HDF5 file
a = h5open(TMPDIR + "/test.h5", "w")
// create the dataset
N = 3; // number of chunks
nrows = 5; // rows of a single chunk
ncols = 10; // cols of a single chunk
chsize = [nrows, ncols];
maxrows = N*nrows; // final number of rows of concatenated matrix
maxcols = ncols; // final number of cols of concatenated matrix
for k=1:N
// warning, x is viewed as a C-matrix (row-major), transpose if applicable
x = rand(nrows,ncols);
h5dataset(a, "My_Dataset", ...
[chsize ;1 1 ;1 1 ;chsize ;chsize],...
x, ...
[k*nrows ncols; maxrows maxcols; 1+(k-1)*nrows 1 ;1 1 ;chsize; chsize])
h5dump(a, "My_Dataset");
end
disp(a.root.My_Dataset.data)
h5close(a)
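The same idea, keeping only one chunk in memory and writing it at its row offset, can be sketched outside Scilab. The Python sketch below is only an illustration: it swaps HDF5 for a flat binary file of doubles, the file name is hypothetical, and the sizes simply mirror N, nrows and ncols from the Scilab example.

```python
import array
import os
import random
import tempfile

N, nrows, ncols = 3, 5, 10  # same illustrative sizes as the Scilab example

# Hypothetical output file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "chunks.bin")

with open(path, "wb") as f:
    for k in range(N):
        # Only this one chunk lives in memory at any time.
        chunk = array.array("d", (random.random() for _ in range(nrows * ncols)))
        # Row-major layout: chunk k starts at row k * nrows.
        f.seek(k * nrows * ncols * chunk.itemsize)
        chunk.tofile(f)

# Number of rows stored on disk after all chunks have been written.
total_rows = os.path.getsize(path) // (ncols * array.array("d").itemsize)
```

The file ends up holding N*nrows rows even though the program never holds more than nrows rows at once.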
You have to concatenate vertically (semicolon) instead of horizontally (comma):
datas = [datas; data];
labels = [labels; label];
BTW, this won't solve your memory problem, as the matrices still grow in Scilab's workspace, and using "-append" just overwrites the objects in the HDF5 file (you are using the same names).
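To make the difference in shape concrete, here is a small Python sketch (not Scilab) that mimics both kinds of concatenation with plain lists. It assumes 100 rows of 629 values, which matches one pass of the inner loop in the question, since -%pi:0.01:%pi has 629 elements.

```python
row = [0.0] * 629  # stand-in for sin(-%pi:0.01:%pi), which has 629 elements

horizontal = []  # mimics datas = [datas, data]
vertical = []    # mimics datas = [datas; data]
for _ in range(100):
    horizontal.extend(row)      # appended to the end of one long row
    vertical.append(list(row))  # appended as a new row

shape_horizontal = (1, len(horizontal))
shape_vertical = (len(vertical), len(vertical[0]))
```

Horizontal concatenation yields a single row of 62900 values, while vertical concatenation yields the 100 x 629 layout the question expects for each batch.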
I use Deeplearning4j to classify equipment names. I marked ~ 50,000 items with 495 classes, and I use this data to train the neural network.
That is, as input, I provide a set of vectors (50,000) consisting of 0 and 1, and the expected class for each vector (0 to 494).
I use the IrisClassifier example as a basis for the code.
I saved the trained model to a file, and now I can use it to predict the class of equipment.
As an example, I tried to use for prediction the same data (50,000 items) that I used for training, and compare the prediction with my markup of this data.
The result turned out to be very good, the error of the neural network is ~ 1%.
After that, I ran the prediction on only the first 100 vectors from these 50,000 records, removing the remaining 49,900.
For these 100 vectors, the prediction differs from the prediction for the same 100 vectors when they are part of the 50,000.
That is, the less data we provide to the trained model, the greater the prediction error.
Even for exactly the same vectors.
Why does this happen?
My code.
Training:
//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File(args[0])));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 3331;
int numClasses = 495;
int batchSize = 4000;
// DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build();
List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();
while (iterator.hasNext()) {
DataSet allData = iterator.next();
allData.shuffle();
SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.8); //Use 80% of data for training
trainingData.add(testAndTrain.getTrain());
testData.add(testAndTrain.getTest());
}
DataSet allTrainingData = DataSet.merge(trainingData);
DataSet allTestData = DataSet.merge(testData);
//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(allTrainingData); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(allTrainingData); //Apply normalization to the training data
normalizer.transform(allTestData); //Apply normalization to the test data. This is using statistics calculated from the *training* set
long seed = 6;
int firstHiddenLayerSize = labelIndex/6;
int secondHiddenLayerSize = firstHiddenLayerSize/4;
//log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.activation(Activation.TANH)
.weightInit(WeightInit.XAVIER)
.updater(new Sgd(0.1))
.l2(1e-4)
.list()
.layer(new DenseLayer.Builder().nIn(labelIndex).nOut(firstHiddenLayerSize)
.build())
.layer(new DenseLayer.Builder().nIn(firstHiddenLayerSize).nOut(secondHiddenLayerSize)
.build())
.layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
.nIn(secondHiddenLayerSize).nOut(numClasses).build())
.build();
//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));
for(int i=0; i<5000; i++ ) {
model.fit(allTrainingData);
}
//evaluate the model on the test set
Evaluation eval = new Evaluation(numClasses);
INDArray output = model.output(allTestData.getFeatures());
eval.eval(allTestData.getLabels(), output);
log.info(eval.stats());
// Save the Model
File locationToSave = new File(args[1]);
model.save(locationToSave, false);
Prediction:
// Open the network file
File locationToLoad = new File(args[0]);
MultiLayerNetwork model = MultiLayerNetwork.load(locationToLoad, false);
model.init();
// First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
// Data to predict
CSVRecordReader recordReader = new CSVRecordReader(numLinesToSkip, delimiter); //skip no lines at the top - i.e. no header
recordReader.initialize(new FileSplit(new File(args[1])));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int batchSize = 4000;
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).build();
List<DataSet> dataSetList = new ArrayList<>();
while (iterator.hasNext()) {
DataSet allData = iterator.next();
dataSetList.add(allData);
}
DataSet dataSet = DataSet.merge(dataSetList);
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(dataSet);
normalizer.transform(dataSet);
// Now use it to classify some data
INDArray output = model.output(dataSet.getFeatures());
// Save result
BufferedWriter writer = new BufferedWriter(new FileWriter(args[2], true));
for (int i=0; i<output.rows(); i++) {
writer
.append(output.getRow(i).argMax().toString())
.append(" ")
.append(String.valueOf(i))
.append(" ")
.append(output.getRow(i).toString())
.append('\n');
}
writer.close();
Make sure you save the normalizer alongside the model, as follows:
import org.nd4j.linalg.dataset.api.preprocessor.serializer.NormalizerSerializer;
NormalizerSerializer SUT = NormalizerSerializer.getDefault();
SUT.write(normalizer,new File("outputFile.bin"));
NormalizerStandardize restored = SUT.restore(new File("outputFile.bin"));
You need to use the same normalizer statistics for both training and prediction; otherwise your data is transformed with the wrong statistics.
The way you are currently doing it, the normalizer is re-fitted on the prediction data, so the transformed data looks very different from the training data. That is why you get such a different result.
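The effect of re-fitting can be seen with a few lines of arithmetic. The Python sketch below is not DL4J code: fit and transform are hypothetical helpers that standardize a single made-up feature, and the population standard deviation is an assumption chosen for simplicity.

```python
import statistics

def fit(values):
    """Collect the statistics (mean, standard deviation) of one feature."""
    return statistics.mean(values), statistics.pstdev(values)

def transform(values, stats):
    """Standardize values with previously collected statistics."""
    mean, std = stats
    return [(v - mean) / std for v in values]

training = [float(v) for v in range(1, 101)]  # made-up training feature
subset = training[:10]                        # the "first few records" case

train_stats = fit(training)                   # what should be saved with the model
with_saved = transform(subset, train_stats)   # prediction using saved statistics
with_refit = transform(subset, fit(subset))   # prediction after re-fitting on the subset
```

With the saved statistics, the subset is transformed exactly as it was during training. After re-fitting on the subset, the same raw values map to different inputs, so the network sees data it was never trained on.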
I am trying to horizontally join multiple dataframes (with the same number of records) in pyspark using monotonically_increasing_id(). However, the result has an inflated number of records:
for i in range(len(lst)+1):
if i==0:
df[i] = cust_mod.select('key')
df[i+1] = df[i].withColumn("idx", monotonically_increasing_id())
else:
df_tmp = o[i-1].select(col("value").alias(obj_names[i-1]))
df_tmp = df_tmp.withColumn("idx", monotonically_increasing_id())
df[i+1] = df[i].join(df_tmp, "idx", "outer")
Expected number of records in df[i+1]: ~60m. Got: ~88m. It seems monotonically_increasing_id() does not generate the same numbers every time. How can I resolve this problem?
Other details:
cust_mod > dataframe, count- ~60m
o[i] - another set of dataframes, with length equal to cust_mod
lst - a list that has 49 components, so 49 loops in total
I tried using zipWithIndex():
for i in range(len(lst)+1):
if i==0:
df[i] = cust_mod.select('key')
df[i+1] = df[i].rdd.zipWithIndex().toDF()
else:
df_tmp = o[i-1].select("value").rdd.zipWithIndex().toDF()
df_tmp1 = df_tmp.select(col("_1").alias(obj_names[i-1]),col("_2"))
df[i+1] = df[i].join(df_tmp1, "_2", "inner").drop(df_tmp1._2)
But it's way slower, roughly 50 times slower.
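On the observation that monotonically_increasing_id() does not give the same numbers for every dataframe: Spark documents that the generated ID puts the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. The plain-Python sketch below (no Spark involved) imitates that layout. The helper name and the partition sizes are made up for illustration, and whether differing partition layouts are the cause in this particular job is an assumption.

```python
def monotonic_ids(partition_sizes):
    """Imitate the documented ID layout: partition index << 33, plus row number."""
    ids = []
    for partition_index, size in enumerate(partition_sizes):
        base = partition_index << 33
        ids.extend(base + row for row in range(size))
    return ids

# Two datasets with 6 rows each, but partitioned differently (made-up sizes).
ids_a = monotonic_ids([3, 3])
ids_b = monotonic_ids([2, 4])

matched = set(ids_a) & set(ids_b)     # rows an inner join on the ID would pair up
outer_rows = set(ids_a) | set(ids_b)  # rows an outer join on the ID would produce
```

Both inputs have 6 rows, yet only 5 IDs match, so an outer join on the ID produces 7 rows. This is the same kind of inflation described above, at toy scale.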
I'm a novice with Lua/Torch. I have an existing model that includes a max pooling layer. I want to take the input into that layer and split it into chunks, feeding each chunk into a new max pooling layer.
I have written a stand-alone Lua script that can split a tensor into two chunks and forward the two chunks into a network with two max-pooling layers.
But trying to integrate that back into the existing model I can't figure out how to amend the data "mid-flow", as it were, to do the tensor split. I've read the docs and can't see any function or example of architecture that somewhere along the line splits a tensor into two and forwards each part separately.
Any ideas? Thanks!
You need to define a layer yourself.
If the layer's input is one-dimensional, the layer will look like this:
CSplit, parent = torch.class('nn.CSplit', 'nn.Module')
function CSplit:__init(firstCount)
self.firstCount = firstCount
parent.__init(self)
end
function CSplit:updateOutput(input)
local inputSize = input:size()[1]
local firstCount = self.firstCount
local secondCount = inputSize - firstCount
local first = torch.Tensor(self.firstCount)
local second = torch.Tensor(secondCount)
for i=1, inputSize do
if i <= firstCount then
first[i] = input[i]
else
second[i - firstCount] = input[i]
end
end
self.output = {first, second}
return self.output
end
function CSplit:updateGradInput(input, gradOutput)
local inputSize = input:size()[1]
self.gradInput = torch.Tensor(inputSize) -- allocate a fresh tensor; torch.Tensor(input) would share storage with input
for i=1, inputSize do
if i <= self.firstCount then
self.gradInput[i] = gradOutput[1][i]
else
self.gradInput[i] = gradOutput[2][i-self.firstCount]
end
end
return self.gradInput
end
How do you use it? You need to specify the size of the first chunk, as in the code below.
testNet = nn.CSplit(4)
input = torch.randn(10)
output = testNet:forward(input)
print(input)
print(output[1])
print(output[2])
testNet:backward(input, {torch.randn(4), torch.randn(6)})
You can see runnable iTorch notebook code here
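The logic of the layer is easier to see without the Torch boilerplate. This Python sketch uses plain lists and hypothetical function names; it is not Torch code. The forward pass cuts the input into two chunks, and the backward pass joins the two chunk gradients back into one gradient of the input's length.

```python
def split_forward(values, first_count):
    """Forward pass: split a 1-D input into two chunks."""
    return values[:first_count], values[first_count:]

def split_backward(grad_first, grad_second):
    """Backward pass: the input gradient is the two chunk gradients joined."""
    return list(grad_first) + list(grad_second)

inputs = [float(i) for i in range(10)]
first, second = split_forward(inputs, 4)

# Made-up gradients flowing back from the two downstream branches.
grad_input = split_backward([0.1] * 4, [0.2] * 6)
```

Each input element only feeds one of the two chunks, so its gradient is just the gradient of the corresponding chunk element. That is what the loop in updateGradInput computes.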
Image of my dataset:
I am using HDF5DotNet with C#, and I can only read the full dataset shown in the attached image. The HDF5 file is too big, nearly 10 GB, and loading the whole array runs out of memory.
I would like to read all the data from rows 5 and 7 in the attached image. Is there any way to read only these 2 rows into memory at a time, without having to load all the data into memory first?
private static void OpenH5File()
{
var h5FileId = H5F.open(@"D:\Sandbox\Flood Modeller\Document\xmdf_results\FMA_T1_10ft_001.xmdf", H5F.OpenMode.ACC_RDONLY);
string dataSetName = "/FMA_T1_10ft_001/Temporal/Depth/Values";
var dataset = H5D.open(h5FileId, dataSetName);
var space = H5D.getSpace(dataset);
var dataType = H5D.getType(dataset);
long[] offset = new long[2];
long[] count = new long[2];
long[] stride = new long[2];
long[] block = new long[2];
offset[0] = 1; // start at row 5
offset[1] = 2; // start at column 0
count[0] = 2; // read 2 rows
count[0] = 165701; // read all columns
stride[0] = 0; // don't skip anything
stride[1] = 0;
block[0] = 1; // blocks are single elements
block[1] = 1;
// Dataspace associated with the dataset in the file
// Select a hyperslab from the file dataspace
H5S.selectHyperslab(space, H5S.SelectOperator.SET, offset, count, block);
// Dimensions of the file dataspace
var dims = H5S.getSimpleExtentDims(space);
// We also need a memory dataspace which is the same size as the file dataspace
var memspace = H5S.create_simple(2, dims);
double[,] dataArray = new double[1, dims[1]]; // just get one array
var wrapArray = new H5Array<double>(dataArray);
// Now we can read the hyperslab
H5D.read(dataset, dataType, memspace, space,
new H5PropertyListId(H5P.Template.DEFAULT), wrapArray);
}
You need to select a hyperslab which has the correct offset, count, stride, and block for the subset of the dataset that you wish to read. These are all arrays which have the same number of dimensions as your dataset.
The block is the size of the element block in each dimension to read, i.e. 1 is a single element.
The offset is the number of blocks from the start of the dataset to start reading, and count is the number of blocks to read.
You can select non-contiguous regions by using stride, which again counts in blocks.
I'm afraid I don't know C#, so the following is in C. In your example, you would have:
hsize_t offset[2], count[2], stride[2], block[2];
offset[0] = 5; // start at row 5
offset[1] = 0; // start at column 0
count[0] = 2; // read 2 rows
count[1] = 165702; // read all columns
stride[0] = 1; // don't skip anything
stride[1] = 1;
block[0] = 1; // blocks are single elements
block[1] = 1;
// This assumes you already have an open dataspace with ID dataspace_id
H5Sselect_hyperslab(dataspace_id, H5S_SELECT_SET, offset, stride, count, block);
You can find more information on reading/writing hyperslabs in the HDF5 tutorial.
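To check which rows a given selection covers, the arithmetic along one dimension can be written out in a few lines. This is a Python sketch with a hypothetical helper, not an HDF5 call. It assumes the usual HDF5 semantics, where the start offset and stride are measured in elements and indices are zero-based; with a block size of 1, as here, that coincides with the description above.

```python
def hyperslab_indices(start, stride, count, block):
    """Indices selected along one dimension by (start, stride, count, block)."""
    return [start + i * stride + j for i in range(count) for j in range(block)]

# Stride 1 with count 2 selects two adjacent rows.
adjacent_rows = hyperslab_indices(start=5, stride=1, count=2, block=1)

# Stride 2 with count 2 skips one row between the two selected rows.
every_other_row = hyperslab_indices(start=5, stride=2, count=2, block=1)
```

With a stride of 1 the selection covers rows 5 and 6. To pick rows 5 and 7 in a single selection, the stride in the row dimension would need to be 2.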
It seems there are two forms of H5D.read in C#, you want the second form:
H5D.read(Type) Method (H5DataSetId, H5DataTypeId, H5DataSpaceId,
H5DataSpaceId, H5PropertyListId, H5Array(Type))
This allows you specify the memory and file dataspaces. Essentially, you need one dataspace which has information about the size, stride, offset, etc. of the variable in memory that you want to read into; and one dataspace for the dataset in the file that you want to read from. This lets you do things like read from a non-contiguous region in a file to a contiguous region in an array in memory.
You want something like
// Dataspace associated with the dataset in the file
var dataspace = H5D.getSpace(dataset);
// Select a hyperslab from the file dataspace
H5S.selectHyperslab(dataspace, H5S.SelectOperator.SET, offset, count);
// Dimensions of the file dataspace
var dims = H5S.getSimpleExtentDims(dataspace);
// We also need a memory dataspace which is the same size as the file dataspace
var memspace = H5S.create_simple(rank, dims);
// Now we can read the hyperslab
H5D.read(dataset, datatype, memspace, dataspace,
new H5PropertyListId(H5P.Template.DEFAULT), wrapArray);
From your posted code, I think I've spotted the problem. First you do this:
var space = H5D.getSpace(dataset);
then you do
var dataspace = H5D.getSpace(dataset);
These two calls do the same thing, but create two different variables.
You call H5S.selectHyperslab with space, but H5D.read uses dataspace.
You need to make sure you are using the correct variables consistently. If you remove the second call to H5D.getSpace, and change dataspace -> space, it should work.
Maybe you want to have a look at HDFql, as it abstracts away the low-level details of HDF5. Using HDFql in C#, you can read rows #5 and #7 of dataset Values using a hyperslab selection like this:
float [,]data = new float[2, 165702];
HDFql.Execute("SELECT FROM Values(5:2:2:1) INTO MEMORY " + HDFql.VariableTransientRegister(data));
Afterwards, you can access these rows through variable data. Example:
for(int x = 0; x < 2; x++)
{
for(int y = 0; y < 165702; y++)
{
System.Console.WriteLine(data[x, y]);
}
}
I have implemented an FFT on an at32ucb series microcontroller using the kiss fft library, and I am currently struggling with the output of the FFT.
My intention is to analyse sound coming from a piezo speaker.
Currently, the frequency of the sounder is 420Hz, which I successfully got from the FFT output (cross-checked with an oscilloscope). However, the output frequency is just half of what I expect if I feed a function generator waveform into the system.
I suspect it's the frequency bin calculation formula that I got wrong; I am currently using fft_peak_magnitude_index * sampling frequency / fft_size.
My input is real and I am doing a real FFT (output samples = N/2).
I am also doing IIR filtering and windowing before the FFT.
Any suggestion would be a great help!
// IIR filter calculation, n = 256 fft points
for (ctr=0; ctr<n; ctr++)
{
// filter calculation
y[ctr] = num_coef[0]*x[ctr];
y[ctr] += (num_coef[1]*x[ctr-1]) - (den_coef[1]*y[ctr-1]);
y[ctr] += (num_coef[2]*x[ctr-2]) - (den_coef[2]*y[ctr-2]);
y1[ctr] = y[ctr] - 510; //eliminate dc offset
// hamming window
hamming[ctr] = (0.54-((0.46) * cos(2*M_PI*ctr/n)));
window[ctr] = hamming[ctr]*y1[ctr];
fft_input[ctr].r = window[ctr];
fft_input[ctr].i = 0;
fft_output[ctr].r = 0;
fft_output[ctr].i = 0;
}
kiss_fftr_cfg fftConfig = kiss_fftr_alloc(n,0,NULL,NULL);
kiss_fftr(fftConfig, (kiss_fft_scalar * )fft_input, fft_output);
peak = 0;
freq_bin = 0;
for (ctr=0; ctr<n1; ctr++)
{
fft_mag[ctr] = 10*(sqrt((fft_output[ctr].r * fft_output[ctr].r) + (fft_output[ctr].i * fft_output[ctr].i)))/(0.5*n);
if(fft_mag[ctr] > peak)
{
peak = fft_mag[ctr];
freq_bin = ctr;
}
frequency = (freq_bin*(10989/n)); // 10989 is the sampling freq
//************************************
//Usart write
char filtResult[10];
//sprintf(filtResult, "%04d %04d %04d\n", (int)peak, (int)freq_bin, (int)frequency);
sprintf(filtResult, "%04d %04d %04d\n", (int)x[ctr], (int)fft_mag[ctr], (int)frequency);
char c;
char *ptr = &filtResult[0];
do
{
c = *ptr;
ptr++;
usart_bw_write_char(&AVR32_USART2, (int)c);
// sendByte(c);
} while (c != '\n');
}
The main problem is likely to be how you declared fft_input.
Based on your previous question, you are allocating fft_input as an array of kiss_fft_cpx. The function kiss_fftr, on the other hand, expects an array of scalars. By casting the input array to a kiss_fft_scalar pointer with:
kiss_fftr(fftConfig, (kiss_fft_scalar * )fft_input, fft_output);
KissFFT essentially sees an array of real-valued data which contains zeros every second sample (what you filled in as imaginary parts). This is effectively an upsampled version (although without interpolation) of your original signal, i.e. a signal with effectively twice the sampling rate (which is not accounted for in your freq_bin to frequency conversion). To fix this, I suggest you pack your data into a kiss_fft_scalar array:
kiss_fft_scalar fft_input[n];
...
for (ctr=0; ctr<n; ctr++)
{
...
fft_input[ctr] = window[ctr];
...
}
kiss_fftr_cfg fftConfig = kiss_fftr_alloc(n,0,NULL,NULL);
kiss_fftr(fftConfig, fft_input, fft_output);
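The halving effect of the zero-interleaved input can be reproduced with a naive DFT. The Python sketch below is not KissFFT code. It assumes each complex sample is stored as two adjacent scalars (real part, then a zero imaginary part), and the FFT size of 64 and tone bin of 8 are made-up test values. The interleaved input also produces an equally high image peak at a higher bin, so the sketch reports the lowest bin that reaches the maximum.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes of bins 0..n/2 of a naive DFT of a real sequence."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * m / n) for m, s in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]

def lowest_peak_bin(samples):
    """Lowest bin whose magnitude reaches the maximum (up to rounding)."""
    mags = dft_magnitudes(samples)
    top = max(mags)
    return next(k for k, v in enumerate(mags) if v >= top * (1 - 1e-9))

n = 64          # made-up FFT size
tone_bin = 8    # made-up test tone: 8 cycles over n samples
signal = [math.sin(2 * math.pi * tone_bin * m / n) for m in range(n)]

# What a real FFT of size n sees when (re, im=0) pairs are read as scalars:
# the first n scalars are re0, 0, re1, 0, ...
interleaved = []
for s in signal:
    interleaved.extend([s, 0.0])
interleaved = interleaved[:n]

correct_bin = lowest_peak_bin(signal)       # input packed as plain scalars
misread_bin = lowest_peak_bin(interleaved)  # input read through the cast
```

The packed input peaks at the tone's own bin, while the interleaved input peaks at half that bin. Converting that bin with the unchanged sampling rate reports half the true frequency.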
Note also that while looking for the peak magnitude, you probably are only interested in the final largest peak, instead of the running maximum. As such, you could limit the loop to only computing the peak (using freq_bin instead of ctr as an array index in the following sprintf statements if needed):
for (ctr=0; ctr<n1; ctr++)
{
fft_mag[ctr] = 10*(sqrt((fft_output[ctr].r * fft_output[ctr].r) + (fft_output[ctr].i * fft_output[ctr].i)))/(0.5*n);
if(fft_mag[ctr] > peak)
{
peak = fft_mag[ctr];
freq_bin = ctr;
}
} // close the loop here before computing "frequency"
Finally, when computing the frequency associated with the bin with the largest magnitude, you need to ensure the computation is done using floating point arithmetic. If, as I suspect, n is an integer, your formula would compute the 10989/n factor using integer arithmetic, resulting in truncation. This can be simply remedied with:
frequency = (freq_bin*(10989.0/n)); // 10989 is the sampling freq
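The size of the truncation error is easy to check by hand. In this Python sketch, // stands in for C integer division, n = 256 is taken from the comment in the question's code, and the peak bin of 10 is a made-up value for illustration.

```python
sampling_freq = 10989
n = 256        # FFT size, from the comment in the question's code
freq_bin = 10  # hypothetical peak bin

# Integer division drops the fractional part of the bin width first.
truncated = freq_bin * (sampling_freq // n)

# Floating point division keeps the full bin width.
exact = freq_bin * (sampling_freq / n)
```

The integer version gives 420, while the floating point version gives about 429.26. The error grows with the bin index, because the truncated bin width of 42 Hz is multiplied by freq_bin.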