Merging several HDF5 files into one PyTable

I have several HDF5 files, each with the same structure. I'd like to create one PyTable out of them by somehow merging the HDF5 files.
What I mean is that if an array in file1 has size x and the corresponding array in file2 has size y, the resulting array in the PyTable will be of size x+y, containing first all the entries from file1 and then all the entries from file2.

How you want to do this depends slightly on the data type that you have. Arrays and CArrays have a static size, so you need to preallocate the data space. Thus you would do something like the following:
import tables as tb
# open the sources read-only and the destination writable
file1 = tb.open_file('/path/to/file1', 'r')
file2 = tb.open_file('/path/to/file2', 'r')
file3 = tb.open_file('/path/to/file3', 'w')
x = file1.root.x
y = file2.root.y
# preallocate space for both arrays, then fill each half
z = file3.create_array('/', 'z', atom=x.atom, shape=(x.nrows + y.nrows,))
z[:x.nrows] = x[:]
z[x.nrows:] = y[:]
However, EArrays and Tables are extendable, so you don't need to preallocate the size; you can use copy_node() and append() instead.
import tables as tb
file1 = tb.open_file('/path/to/file1', 'r')
file2 = tb.open_file('/path/to/file2', 'r')
file3 = tb.open_file('/path/to/file3', 'w')  # destination must be writable
x = file1.root.x
y = file2.root.y
# copy the extendable node into the new file, then append the other file's data
z = file1.copy_node('/', name='x', newparent=file3.root, newname='z')
z.append(y[:])
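Since the question says "several" files, the same append pattern extends to any number of sources. Here is a minimal sketch for EArrays/Tables, assuming every file stores the node under the same name ('x' here); the paths are placeholders:
import tables as tb
sources = ['/path/to/file1', '/path/to/file2']  # extend with further files
with tb.open_file('/path/to/merged', 'w') as out:
    z = None
    for path in sources:
        with tb.open_file(path, 'r') as src:
            if z is None:
                # seed the output by copying the first node
                z = src.copy_node('/', name='x', newparent=out.root, newname='z')
            else:
                # append each subsequent file's data
                z.append(src.root.x[:])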

Related

h5py reading raw data with escape characters

Hi, I want to read the HDF5 file data exactly as it was written. But when I read it with the following code I get the output below.
Code:
import h5py
import numpy as np

hf = h5py.File('Json.h5', 'r')
data_read = hf.get("BinaryData_metadata")
rmdwrite = open("Test.json", "w")
rmdwrite.write(str(np.array(data_read)))
rmdwrite.close()
hf.close()
Output
[b'{\n\t"TestReport": {\n\t\t"TestName": "XYZ",\n\t\t"Description"................
How to get the exact output with the same formatting in my output file?
When I print it with
Data_arr = str(np.array(data_read))
Data_arr = repr(Data_arr)
I get
'[b\'{\\n\\t"TestReport": {\\n\\t\\t"Te................
OK, this is how I am writing the data via C++:
DataSpace dataspace(1, dimsf); //Creating Dataspace
StrType datatype(PredType::C_S1); //Creating Datatype of type char
datatype.setOrder(order); //Data Store Order
datatype.setSize(file_datastring.length()); //Datalength
datatype.setCset(H5T_CSET_UTF8);
DataSet dataset = Hdf5::fileObject.createDataSet(WriteDataSet, datatype, dataspace); //Create dataset
dataset.write(file_datastring, datatype); //Write to dataset
Is there something here which is appending that extra \?
The solution which I found was:
hf = h5py.File(H5FileName, 'r')
FileObj = open(OutFileName, "wb")  # binary mode, since tofile() writes raw bytes
# note: .value was removed in h5py 3.x; read the data with ds[...] instead
hf[H5DataSetName][...].tofile(FileObj)
FileObj.close()
hf.close()
This works perfectly
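On h5py 3.x there is also a built-in way to get a real Python string back: Dataset.asstr() decodes the stored bytes on read. A minimal sketch, assuming the dataset holds a single UTF-8 string at index 0:
import h5py

with h5py.File('Json.h5', 'r') as hf:
    # asstr() tells h5py to decode the stored bytes to str when reading
    text = hf['BinaryData_metadata'].asstr()[0]
with open('Test.json', 'w') as out:
    out.write(text)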

Custom dataset loader in PyTorch

I am doing COVID-19 classification. I took the dataset from Kaggle; it has a folder named dataset which contains 3 folders (normal, pneumonia and covid-19), each containing images for that class. I am stuck on writing __getitem__ in the PyTorch custom dataset.
The dataset has 189 covid images, but with this __getitem__ I get 920 covid images. Kindly help.
import os
import random
import shutil

class_names = ['normal', 'viral', 'covid']
root_dir = 'COVID-19 Radiography Database'
source_dirs = ['NORMAL', 'Viral Pneumonia', 'COVID-19']

if os.path.isdir(os.path.join(root_dir, source_dirs[1])):
    os.mkdir(os.path.join(root_dir, 'test'))
    for i, d in enumerate(source_dirs):
        os.rename(os.path.join(root_dir, d), os.path.join(root_dir, class_names[i]))
    for c in class_names:
        os.mkdir(os.path.join(root_dir, 'test', c))
    for c in class_names:
        images = [x for x in os.listdir(os.path.join(root_dir, c)) if x.lower().endswith('png')]
        selected_images = random.sample(images, 30)
        for image in selected_images:
            source_path = os.path.join(root_dir, c, image)
            target_path = os.path.join(root_dir, 'test', c, image)
            shutil.move(source_path, target_path)
The above code is used to create the test dataset, which has 30 images of each class.
import torch
from PIL import Image

class ChestXRayDataset(torch.utils.data.Dataset):
    def __init__(self, image_dirs, transform):
        def get_images(class_name):
            images = [x for x in os.listdir(image_dirs[class_name])
                      if x.lower().endswith('png')]
            print(f'Found {len(images)} {class_name} examples')
            return images

        self.images = {}
        self.class_names = ['normal', 'viral', 'covid']
        for class_name in self.class_names:
            self.images[class_name] = get_images(class_name)
        self.image_dirs = image_dirs
        self.transform = transform

    def __len__(self):
        return sum([len(self.images[class_name]) for class_name in self.class_names])

    def __getitem__(self, index):
        class_name = random.choice(self.class_names)
        index = index % len(self.images[class_name])
        image_name = self.images[class_name][index]
        image_path = os.path.join(self.image_dirs[class_name], image_name)
        image = Image.open(image_path).convert('RGB')
        return self.transform(image), self.class_names.index(class_name)
**Stuck on __getitem__ of this**
**Code for the confusion matrix:**
nb_classes = 3
confusion_matrix = torch.zeros(nb_classes, nb_classes)
with torch.no_grad():
    for data in tqdm_notebook(dl_train, total=len(dl_train), unit='batch'):
        img, lab = data
        print(lab)
        img, lab = img.to(device), lab.to(device)
        _, output = torch.max(model(img), 1)
        print(output)
        for t, p in zip(lab.view(-1), output.view(-1)):
            confusion_matrix[t.long(), p.long()] += 1
In the confusion matrix output, only one class ever gets predicted.
Putting your images in a dictionary complicates the manipulation; use a list instead. Also, your Dataset should not have any randomness: shuffling of the data should happen in the DataLoader, not in the Dataset.
Use something like below:
class ChestXRayDataset(torch.utils.data.Dataset):
    def __init__(self, image_dirs, transform):
        def get_images(class_name):
            images = [x for x in os.listdir(image_dirs[class_name])
                      if x.lower().endswith('png')]
            print(f'Found {len(images)} {class_name} examples')
            return images

        self.images = []
        self.labels = []
        self.class_names = ['normal', 'viral', 'covid']
        for class_name in self.class_names:
            images = get_images(class_name)
            # This is a list containing all the images
            self.images.extend(images)
            # This is a list containing all the corresponding image labels
            self.labels.extend([class_name] * len(images))
        self.image_dirs = image_dirs
        self.transform = transform

    def __len__(self):
        return len(self.images)

    # Will return the image and its label at position `index`
    def __getitem__(self, index):
        # Image at position `index` of all the images
        image_name = self.images[index]
        # Its label
        class_name = self.labels[index]
        image_path = os.path.join(self.image_dirs[class_name], image_name)
        image = Image.open(image_path).convert('RGB')
        return self.transform(image), self.class_names.index(class_name)
If you enumerate it, say using
ds = ChestXRayDataset(image_dirs, transform)
for x, y in ds:
    print(x.shape, y)
you should see all the images and labels in sequential order.
However, in a real case you would rather use a torch DataLoader and pass it the ds object with the shuffle parameter set to True. The DataLoader will then take care of shuffling the Dataset by calling __getitem__ with shuffled index values.
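A minimal sketch of that usage (the batch size here is arbitrary):
from torch.utils.data import DataLoader

# shuffle=True makes the DataLoader permute the indices each epoch
dl_train = DataLoader(ds, batch_size=16, shuffle=True)
for batch_images, batch_labels in dl_train:
    print(batch_images.shape, batch_labels.shape)
    break  # just inspect the first batch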

Removing the file paths and using the file number to perform some calculations while plotting

I am trying to read .txt files from a directory which have the following names:
x-23.txt
x-43.txt
x-83.txt
:
:
x-243.txt
I am calling these files using filename = system("ls ../Data/*.txt"). The goal is to load these files and plot certain columns. At the same time, I am trying to parse the file names so that they look like below, so that I can use them as titles in the plot and add/subtract them from a certain column:
23
43
83
:
:
243
For that, I tried the following:
dirname = '../Data/'
str = system('echo "'.dirname. '" | perl -pe ''s/x[\d-](\d+).txt/\1.\2/'' ')
cv = word(str, 1)
The above lines don't seem to trim the names down to the numbers. The code all together:
filelist1 = system("ls ../Data/*.txt")
print filelist1
dirname = '../Data/'
str = system('echo "'.dirname. '" | perl -pe ''s/x[\d-](\d+).txt/\1.\2/'' ')
cv = word(str, 1)
plot for [filename1 in filelist1] filename1 using (-cv/1000+ Tx($4)):(X($3)) with points pt 7 lc 6 title system('basename '.filename1),\
I am trying to use the file numbers "cv" after parsing the .txt files to subtract them from column Tx($4) while plotting.
directory = "../temp/"
filelist = system("cd ../temp/ ; ls *.txt")
files = words(filelist)
filename(i) = directory . word(filelist,i)
title(i) = word(filelist,i)[3 : strstrt(word(filelist,i),'.')-1]
plot for [i=1:files] filename(i) using ... title title(i)
Test case (edited to show pulling files from another directory):
gnuplot> print filelist
x-234.txt
x-23.txt
x-2.txt
x-34.txt
gnuplot> do for [i=1:files] { print i, ": ", filename(i) }
1: ../temp/x-234.txt
2: ../temp/x-23.txt
3: ../temp/x-2.txt
4: ../temp/x-34.txt
gnuplot> plot for [i=1:files] x*i title title(i)

scilab save('-append') doesn't seem to work

I am trying to create a dataset for ML using Scilab, and I need to save during data generation because it's too big for Scilab's max stack.
Here is a toy example I made to find out what goes wrong, but I'm not able to figure it out:
datas = [];
labels = [];
for i = 1:10
    for j = 1:100
        if j == 1
            disp(i)
        end
        data = sin(-%pi:0.01:%pi);
        label = rand();
        datas = [datas, data];
        labels = [labels, label];
    end
    save(chemin+'\test.h5', '-append', 'datas', 'labels')
    datas = [];
    labels = [];
end
I expect the shape of the data to be [1000, 629] at the end, but I get [62900, 0].
Do you have any idea why that is?
Here is an example of how to incrementally save a big matrix without any memory pressure:
// create a new HDF5 file
a = h5open(TMPDIR + "/test.h5", "w")
// create the dataset
N = 3;      // number of chunks
nrows = 5;  // rows of a single chunk
ncols = 10; // cols of a single chunk
chsize = [nrows, ncols];
maxrows = N*nrows; // final number of rows of the concatenated matrix
maxcols = ncols;   // final number of cols of the concatenated matrix
for k = 1:N
    // warning: x is viewed as a C matrix (row-major), transpose if applicable
    x = rand(nrows, ncols);
    h5dataset(a, "My_Dataset", ...
        [chsize; 1 1; 1 1; chsize; chsize], ...
        x, ...
        [k*nrows ncols; maxrows maxcols; 1+(k-1)*nrows 1; 1 1; chsize; chsize])
    h5dump(a, "My_Dataset");
end
disp(a.root.My_Dataset.data)
h5close(a)
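For comparison, here is a rough sketch of the same chunk-by-chunk idea in Python with h5py (the file name and sizes are illustrative, not part of the original answer):
import h5py
import numpy as np

N, nrows, ncols = 3, 5, 10  # number of chunks, rows and cols per chunk

with h5py.File('test.h5', 'w') as f:
    # preallocate the full dataset; only one chunk lives in memory at a time
    dset = f.create_dataset('My_Dataset', shape=(N * nrows, ncols), dtype='f8')
    for k in range(N):
        x = np.random.rand(nrows, ncols)
        dset[k * nrows:(k + 1) * nrows, :] = x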
You have to vertically concatenate (semicolon) instead of horizontally (comma):
datas = [datas; data];
labels = [labels; label];
BTW this won't solve your memory problem, as the matrices still grow in Scilab's workspace, and using "-append" just overwrites the objects in the HDF5 file (you are using the same names).

Randoms in one Array

I have six variables that have different positions on the screen. I want to put different images into these variables, so I have an array with the images:
misImagenes = {[1] = "rec/ro.png",[2] ="rec/az.png",[3] ="rec/ros.png",[4] ="rec/ne.png",[5] ="rec/ve.png",[6] ="rec/am.png"}
I put the elements of this array into another array that holds 2 different random picks, like this:
randoms = {[1] = misImagenes[math.random(1,6)],[2] = misImagenes[math.random(1,6)] }
Then, to use these picks in a random order, I create a random choice of the randoms:
randomRan = randoms[math.random(1,2)]
I put randomRan into the 6 variables, but the images in the variables are always the same:
uno = display.newImageRect(randomRan,340,280)
dos = display.newImageRect(randomRan,340,280)
tres = display.newImageRect(randomRan,340,280)
cuatro = display.newImageRect(randomRan,340,280)
cinco = display.newImageRect(randomRan,340,280)
seis = display.newImageRect(randomRan,340,280)
These variables all receive randomRan, but the images are always the same. I need the images to be different: 2 different images distributed randomly across the variables.
Thanks
It looks like what you want to do is commonly called shuffling and filtering an array.
Once you assign randoms[math.random(1,2)] to the randomRan variable, it is going to stay the same no matter what. It isn't like randomRan is going to be random each time it's used. However, if it were a function call, like randomRan(), then that would be a different case, depending on what the function did. A variable, once assigned to, generally stays the same unless changed.
math.randomseed(os.time()) -- Make sure to seed the random number generator.

local function shuffle(t)
    local n = #t
    while n >= 2 do
        -- n is now the last pertinent index
        local k = math.random(n) -- 1 <= k <= n
        -- Quick swap
        t[n], t[k] = t[k], t[n]
        n = n - 1
    end
    return t
end

local misImagenes = {"rec/ro.png", "rec/az.png", "rec/ros.png", "rec/ne.png", "rec/ve.png", "rec/am.png"}
local randomImages = {}

-- Make a copy of misImagenes for randomImages.
for i, v in ipairs(misImagenes) do randomImages[i] = v end

-- Shuffle the new array. This will randomize the order of its contents.
shuffle(randomImages)

-- Since you want only two unique images for a total of six rectangles,
-- we'll have to duplicate and overwrite the other four, randomly.
for i = 1+2, 6 do
    randomImages[i] = randomImages[math.random(1, 2)]
end

-- Now to filter the array with newImageRect.
for i = 1, #randomImages do
    randomImages[i] = display.newImageRect(randomImages[i], 340, 280)
end

-- randomImages now contains all of your randomized image rectangles.
The shuffle algorithm was borrowed from here to show an example of how this could work.
If you are doing
var1 = randomRan
var2 = randomRan
then var1 and var2 will have the same value: randomRan does not get recomputed each time it's evaluated. If that is not what you want, you can repeat the expression you used to initialize randomRan:
var1 = randoms[math.random(1,2)]
var2 = randoms[math.random(1,2)]
and if you want to avoid retyping that complex expression, you can encapsulate it in a function:
-- Return a random image
local function randomRan()
    return randoms[math.random(1,2)]
end

var1 = randomRan()
var2 = randomRan()
