dev-clean folder of LibriSpeecch Dataset

dev-clean folder of LibriSpeecch Dataset - machine-learning

I am working on LibriSpeech Dev-Clean Dataset.
I am not able to understand the structure of the dataset.
I did come to know that Directories like 84, 1272 etc. under the dev-clean folder represent the ID of the speakers.
But what does sub-folders represent?
I meant if we look inside 1272 directory under the dev-clean folder, It is again divided into 3 folders i.e. - 128104, 135031, 141231.
This seems to be ambiguous to me. Any ideas?

Librispeech is made from the audiobooks by certain amount of speakers. 1272 is the speaker ID. 128104, 135031, 141231 are book IDs.
inside each folder there are recordings relevant to certain book.

Related

Netlogo Behaviorspace How to save data not per tick but based on reporter

I have a netlogo model, for which a run takes about 15 minutes, but goes through a lot of ticks. This is because per tick, not much happens. I want to do quite a few runs in an experiment in behaviorspace. The output (only table output) will be all the output and input variables per tick. However, not all this data is relevant: it's only relevant once a day (day is variable, a run lasts 1095 days).
The result is that the model gets so slow running experiments via behaviorspace. Not only would it be nicer to have output data with just 1095 rows, it perhaps also causes the experiment to slow down tremendously.
How to fix this?

It is possible to write your own output file in a BehaviorSpace experiment. Program your code to create and open an output file that contains only the results you want.
The problem is to keep BehaviorSpace from trying to open the same output file from different model runs running on different processors, which causes a runtime error. I have tried two solutions.
Tell BehaviorSpace to only use one processor for the experiment. Then you can use the same output file for all model runs. If you want the output lines to include which model run it's on, use the primitive behaviorspace-run-number.
Have each model run create its own output file with a unique name. Open the file using something like:
file-open (word "Output-for-run-" behaviorspace-run-number ".csv")
so the output files will be named Output-for-run-1.csv etc.
(If you are not familiar with it, the CSV extension is very useful for writing output files. You can put everything you want to output on a big list, and then when the model finishes write the list into a CSV file with:
csv:to-file (word "Output-for-run-" behaviorspace-run-number ".csv") the-big-list
)

How would I label training and testing data for a Convolutional Neural Network?

This is a bit of an abstract question.
I have a group of 28x28 px images from certain people, and I would like to label that data with each person who wrote it. How would I go about labeling it for training and testing? This is my first neural network, and I'm having difficulty finding any tutorials that suit my particular need. It feels like most Data, like MNIST/EMNIST, are already labeled.
Some more info is that I'm using Python 3, and Keras with Tensorflow backend.

I am assuming that you know who wrote each image. Then this is a matter of associating that information (the class label) with each image. There are several ways of doing this. Two common approaches are:
Folder structure
Make a folder for each class (person), and put the images inside.
Folder contents:
john/01.png
john/02.png
jane/03.png
susan/...
CSV file
In this case the images can be all in one folder, and then a dedicate Comma-Separated-Values file is used to contain
Folder contents:
dataset.csv
images/01.png
images/02.png
images/03.png
images/....
dataset.csv contents:
filename,person
images/01.png,john
images/02.png,john
images/03.png,jane
...
The CSV approach is nice if you have additional data about each file that you want to store. For instance metadata that could be relevant such as who recorded the file, when was it recorded, with what kind of equipment, what locations etc.
Combinations of the two are also possible, of course.

How to combine multiple h5 files?

Limited by the device, I could only produce several h5 files (the format of each file are same with shape of [idx, 1, 224, 224]) for huge dataset (>100GB) and now I'm confused about the solution to combine these files into a single one for further training on PyTorch. enter image description here

In h5py, groups and files support copy(), which can be used to move groups (including the root group) and their contents between files.
See the docs here (scroll down a bit to find copy()):
http://docs.h5py.org/en/latest/high/group.html
The HDF5 distribution also includes a command-line tool called h5copy that can be used to move things around, and the C API has an H5Ocopy() function.

Finding what hard drive sectors occupy a file

I'm looking for a nice easy way to find what sectors occupy a given file. My language preference is C#.
From my A-Level Computing class I was taught that a hard drive has a lookup table on the first few KB of the disk. In this table there is a linked list for each file detailing what sectors that file occupies. So I'm hoping there's a convinient way to look in this table for a certain file and see what sectors it occupies.
I have tried Google'ing but I am finding nothing useful. Maybe I'm not searching for the right thing but I can't find anything at all.
Any help is appreciated, thanks.

About Drives
The physical geometry of modern hard drives is no longer directly accessible by the operating system. Early hard drives were simple enough that it was possible to address them according to their physical structure, cylinder-head-sector. Modern drives are much more complex and use systems like zone bit recording , in which not all tracks have the same amount of sectors. It's no longer practical to address them according to their physical geometry.
from the fdisk man page:
If possible, fdisk will obtain the disk geometry automatically. This is not necessarily the physical disk geometry (indeed, modern disks do not really have anything
like a physical geometry, certainly not something that can be described in simplistic Cylinders/Heads/Sectors form)
To get around this problem modern drives are addressed using Logical Block Addressing, which is what the operating system knows about. LBA is an addressing scheme where the entire disk is represented as a linear set of blocks, each block being a uniform amount of bytes (usually 512 or larger).
About Files
In order to understand where a "file" is located on a disk (at the LBA level) you will need to understand what a file is. This is going to be dependent on what file system you are using. In Unix style file systems there is a structure called an inode which describes a file. The inode stores all the attributes a file has and points to the LBA location of the actual data.
Ubuntu Example
Here's an example of finding the LBA location of file data.
First get your file's inode number
$ ls -i
659908 test.txt
Run the file system debugger. "yourPartition" will be something like sda1, it is the partition that your file system is located on.
$sudo debugfs /dev/yourPartition
debugfs: stat <659908>
Inode: 659908 Type: regular Mode: 0644 Flags: 0x80000
Generation: 3039230668 Version: 0x00000000:00000001
...
...
Size of extra inode fields: 28
EXTENTS:
(0): 266301
The number under "EXTENTS", 266301, is the logical block in the file system that your file is located on. If your file is large there will be multiple blocks listed. There's probably an easier way to get that number, I couldn't find one.
To validate that we have the right block use dd to read that block off the disk. To find out your file system block size, use dumpe2fs.
dumpe2fs -h /dev/yourPartition | grep "Block size"
Then put your block size in the ibs= parameter, and the extent logical block in the skip= parameter, and run dd like this:
sudo dd if=/dev/yourPartition of=success.txt ibs=4096 count=1 skip=266301
success.txt should now contain the original file's contents.

sudo hdparm --fibmap file
For ext, vfat and NTFS ..maybe more.
fibmap is also a linux C library.

Organizing thousands of images on a server

I'm developing a website which might grow up to a few thousand users, all of which would upload up to ten pictures on the server.
I'm wondering what would be the best way of storing pictures.
Lets assume that I have, 5000 users with 10 pictures each, which gives us 50 000 pics. (I guess it wouldn't be a good idea to store them in the database in blobs ;) )
Would it be a good way to dynamically create directories for every 100 users registered, (50 dirs in total, assuming 5000 users), and upload their pictures there? Would naming convention 'xxx_yy.jpg' (xxx being user id and yy picture number) be ok?
In this case, however, there would be 1000 (100x10) pictures in one folder, isn't it too many?

I would most likely store the images by a hash of their contents. A 128-bit SHA, for instance. So, I'd rename a user's uploaded image 'foo.jpg' to be its 128-bit sha (probably in base 64, for uniform 16-character names) and then store the user's name for the file and its SHA in a database. I'd probably also add a reference count. Then if some folks all upload the same image, it only gets stored once and you can delete it when all references vanish.
As for actual physical storage, now that you have a guaranteed uniform naming scheme, you can use your file system as a balanced tree. You can either decide how many files maximum you want in a directory, and have a balancer move files to maintain this, or you can imagine what a fully populated tree would look like, and store your files that way.
The only real drawback to this scheme is that it decouples file names from contents so a database loss can mean not knowing what any file is called, but you should be careful to back up that kind of information anyway.

Different filesystems perform differently with directories holding large numbers of files. Some slow down tremendously. Some don't mind at all. For example, IBM JFS2 stores the contents of directory inodes as a B+ Tree sorted by filename.... so it probably provides log(n) access time even in the case of very large directories.
getting ls or dir to read, sort, get size/date info, and print them to stdout is a completely different task from accessing the file contents given the filename.... So don't let the inability of ls to list a huge directory guide you.
Whatever you do, don't optimize too early. Just make sure your file access mechanism can be asbstracted (make a FileStorage that you .getfile(id) from, or something...).
That way you can put in whatever directory structure you like, or for example if you find it's better to store these items as a BLOB column in a database, you have that option...

granted i have never stored 50,000 images, but i usually just store all images in the same directory and name them as such to avoid conflict. then store the reference in the db.
$ext = explode( '.', $filename );
$newName = md5( microtime() ) . '.' . $ext;
that way you never have the same two filenames as microtime will never be the same.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart