I'm trying to download a small chunk of the YouTube-8M dataset. It is a dataset of video features and labels, on which you can train your own classification model.
The command that they claim will download the dataset is:
curl storage.googleapis.com/data.yt8m.org/download_fix.py | shard=1,100 partition=2/frame/train mirror=us python
This didn't work at all, and the error produced is:
'shard' is not recognized as an internal or external command, operable program or batch file.
I found a forum post that says to add 'set' before the variables, which seems to fix my problem partially.
curl storage.googleapis.com/data.yt8m.org/download_fix.py | set shard=1,100 partition=2/video/train mirror=us python
The download seemingly started for a split second and then an error popped up. The error now is (23) Failed writing body.
So what is the correct command line for downloading the dataset?
I'd try using the Kaggle API instead. You can install the API using:
pip install kaggle
Then download your credentials (step-by-step guide here). Finally, you can download the dataset like so:
kaggle competitions download -c youtube8m
If you only want part of the dataset, you can first list all the downloadable files:
kaggle competitions files -c youtube8m
And then only download the file(s) you want:
kaggle competitions download -c youtube8m -f name_of_your_file.extension
Hope that helps! :)
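As for why the original command fails on Windows: the name=value prefix in front of python is POSIX shell syntax for setting environment variables for a single command, and cmd.exe does not understand it. With set in front, cmd most likely assigns the whole rest of the line to shard and never starts python, so curl has nothing to write to, hence error 23. A minimal cross-platform sketch in Python follows. It assumes the script reads the shard, partition and mirror environment variables as in the command above, that the URL is reachable over plain http, and that the script runs under the same interpreter; download_env and run_downloader are hypothetical helper names.

```python
import os
import subprocess
import sys
import urllib.request

# URL taken from the command in the question; the http scheme is an assumption.
SCRIPT_URL = "http://storage.googleapis.com/data.yt8m.org/download_fix.py"


def download_env(shard, partition, mirror, base=None):
    """Copy the environment and add the three variables the script expects."""
    env = dict(os.environ if base is None else base)
    env.update({"shard": shard, "partition": partition, "mirror": mirror})
    return env


def run_downloader(shard="1,100", partition="2/frame/train", mirror="us"):
    """Fetch the script and feed it to Python on stdin, like `curl ... | python`."""
    script = urllib.request.urlopen(SCRIPT_URL).read()
    env = download_env(shard, partition, mirror)
    # "-" tells the interpreter to read the program from standard input.
    return subprocess.run([sys.executable, "-"], input=script, env=env).returncode
```

Calling run_downloader() would then stand in for the piped command; nothing is downloaded until it is called.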
Quick question: what is the compiler flag to allow g++ to spawn multiple instances of itself in order to compile large projects quicker (for example 4 source files at a time for a multi-core CPU)?
You can do this with make. With GNU make it is the -j flag (this will also help on a uniprocessor machine).
For example if you want 4 parallel jobs from make:
make -j 4
You can also run gcc in a pipe with
gcc -pipe
This will pipeline the compile stages, which will also help keep the cores busy.
If you have additional machines available too, you might check out distcc, which will farm compiles out to those as well.
There is no such flag, and having one runs against the Unix philosophy of having each tool perform just one function and perform it well. Spawning compiler processes is conceptually the job of the build system. What you are probably looking for is the -j (jobs) flag to GNU make, a la
make -j4
Or you can use pmake or similar parallel make systems.
People have mentioned make but bjam also supports a similar concept. Using bjam -jx instructs bjam to build up to x concurrent commands.
We use the same build scripts on Windows and Linux and using this option halves our build times on both platforms. Nice.
If using make, invoke it with -j. From man make:
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously.
If there is more than one -j option, the last one is effective.
If the -j option is given without an argument, make will not limit the
number of jobs that can run simultaneously.
And most notably, if you want to script this or detect the number of cores you have available (which can vary a lot if you run in many environments), you can use the ubiquitous Python function cpu_count():
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.cpu_count
Like this:
make -j $(python3 -c 'import multiprocessing as mp; print(int(mp.cpu_count() * 1.5))')
If you're asking why 1.5 I'll quote user artless-noise in a comment above:
The 1.5 number is because of the noted I/O bound problem. It is a rule of thumb. About 1/3 of the jobs will be waiting for I/O, so the remaining jobs will be using the available cores. A number greater than the cores is better and you could even go as high as 2x.
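If the one-liner gets unwieldy, the same calculation can be sketched as a small Python script. The names jobs_for and make_command are illustrative only, and the 1.5 factor is just the rule of thumb quoted above; os.cpu_count() can return None, so the sketch falls back to 1 core.

```python
import os


def jobs_for(cores, factor=1.5):
    """Return the number of make jobs for a given core count (at least 1)."""
    return max(1, int(cores * factor))


def make_command(factor=1.5):
    """Build the make invocation for this machine as an argument list."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    return ["make", "-j{}".format(jobs_for(cores, factor))]


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.call() to run it.
    print(" ".join(make_command()))
```

On a 4-core machine this yields 6 jobs, and a factor of 2 gives the upper end of the range mentioned in the quote.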
make will do this for you. Investigate the -j and -l switches in the man page. I don't think g++ is parallelizable.
distcc can also be used to distribute compiles not only on the current machine, but also on other machines in a farm that have distcc installed.
I'm not sure about g++, but if you're using GNU Make then "make -j N" (where N is the number of jobs make may run at once) will allow make to run multiple g++ jobs at the same time (so long as the files do not depend on each other).
GNU parallel
I was making a synthetic compilation benchmark and couldn't be bothered to write a Makefile, so I used:
sudo apt-get install parallel
ls | grep -E '\.c$' | parallel -t --will-cite "gcc -c -o '{.}.o' '{}'"
Explanation:
{.} takes the input argument and removes its extension
-t prints out the commands being run to give us an idea of progress
--will-cite removes the request to cite the software if you publish results using it...
parallel is so convenient that I could even do a timestamp check myself:
ls | grep -E '\.c$' | parallel -t --will-cite "\
if ! [ -f '{.}.o' ] || [ '{}' -nt '{.}.o' ]; then
gcc -c -o '{.}.o' '{}'
fi
"
xargs -P can also run jobs in parallel, but it is a bit less convenient to do the extension manipulation or run multiple commands with it: Calling multiple commands through xargs
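For comparison, the timestamp check above can be sketched in Python with a thread pool. This is only an illustration under the same assumptions as the shell version (gcc on the PATH, one .o per .c file in the same directory); needs_rebuild, compile_one and compile_all are made-up names.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def needs_rebuild(src, obj):
    """True if obj is missing or src is newer than obj (like `[ src -nt obj ]`)."""
    if not os.path.exists(obj):
        return True
    return os.path.getmtime(src) > os.path.getmtime(obj)


def compile_one(src):
    """Compile one .c file to a .o next to it, skipping up-to-date objects."""
    obj = os.path.splitext(src)[0] + ".o"
    if not needs_rebuild(src, obj):
        return 0
    return subprocess.call(["gcc", "-c", "-o", obj, src])


def compile_all(directory=".", workers=None):
    """Compile every stale .c file in directory; return the gcc exit codes."""
    sources = sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".c")
    )
    # Threads are enough here: each one just waits on a gcc child process.
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count() or 1) as pool:
        return list(pool.map(compile_one, sources))
```

Calling compile_all() in the source directory plays the role of the parallel pipeline; a non-zero entry in the returned list means that gcc invocation failed.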
Parallel linking was asked at: Can gcc use multiple cores when linking?
TODO: I think I read somewhere that compilation can be reduced to matrix multiplication, so maybe it is also possible to speed up single file compilation for large files. But I can't find a reference now.
Tested in Ubuntu 18.10.
I am building core-image-minimal with "beaglebone" as the target machine.
I'd like to edit the kernel config to remove some features to improve boot time. I've learned I can do a bitbake -c menuconfig virtual/kernel to launch the ncurses editor, but I don't really understand what configuration I'm editing. Is it the one for beaglebone, or just a generic kernel?
How do I take the base beaglebone kernel config, edit it, and then have bitbake use it when I build core-image-minimal?
Thanks.
To find out which kernel the beaglebone machine uses, look at its machine configuration file, for example beaglebone.conf.
In there, you will see a line such as PREFERRED_PROVIDER_virtual/kernel = "linux-mainline"
The recipe for that kernel lives under recipes-kernel, in this example linux-mainline.
After that, there are two ways to get to the kernel's configuration utility. Either run menuconfig directly:
bitbake -c menuconfig linux-mainline
Or open a devshell for the kernel recipe and run nconfig from there:
bitbake -c devshell linux-mainline
make nconfig
There is a tutorial on installing drivers HERE
I encountered the following inexplicable behaviour in Vowpal Wabbit. Sometimes it simply doesn't save a model when the -f flag is specified, without raising any errors.
The command is composed automatically by a script and has the following form (file names are changed):
vw -d ./data/train_set -p ./predictions \
-f ./model --cache --passes 3 \
--ftrl_alpha 0.106920149657 --ignore T -l 0.83184072971 \
-b 29 --loss_function logistic --ftrl_beta 0.97391780827 \
--ftrl -q SE -q SZ -q DR
Then it trains normally and the standard diagnostic information is displayed. But the model is not saved!
The weirdest thing about it is that everything works fine with other parameter configurations!
The context: I'm working on hyperparameter optimization and my script successively composes vw training and validation commands. It always reaches the 5th iteration, and always fails on the 6th (on exactly the same command). Any help will be appreciated.
That was a bug in the Vowpal Wabbit source code. It is fixed now and models are saved as expected. Here is the issue on GitHub:
https://github.com/JohnLangford/vowpal_wabbit/issues/859
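Because vw reported no error in the case described above, a tuning script can guard against this kind of silent failure by checking that the model file actually exists before moving on to validation. A minimal sketch: run_and_check is a hypothetical helper, not part of Vowpal Wabbit, and it works for any command that is supposed to write a file.

```python
import os
import subprocess


def run_and_check(command, model_path):
    """Run a training command and fail loudly if no model file was written."""
    # Remove any stale model so a leftover file cannot mask a failed save.
    if os.path.exists(model_path):
        os.remove(model_path)
    exit_code = subprocess.call(command)
    if exit_code != 0:
        raise RuntimeError("command failed with exit code {}".format(exit_code))
    if not os.path.isfile(model_path) or os.path.getsize(model_path) == 0:
        raise RuntimeError("no model was written to {}".format(model_path))
    return model_path
```

With the command from the question, the call would look like run_and_check(["vw", "-d", "./data/train_set", "-f", "./model"], "./model"), with the remaining flags appended to the list.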
I have a training.arff file, where each entry has 2000 features (attributes). I want to select the top n of those attributes using the Information Gain criterion. How can I do that using WEKA and the command line? I have checked online and it seems that it is a two-stage process, because I have to use a ranker as the second step. Could someone explain to me how to do it?
The way to do it is this:
java weka.filters.supervised.attribute.AttributeSelection \
-E "weka.attributeSelection.InfoGainAttributeEval" \
-S "weka.attributeSelection.Ranker -N 10" -i training.arff -o training_IG.arff
The -E option specifies the class to use as the attribute evaluator, and -S specifies the search method (in this case the Ranker, where -N 10 keeps the 10 top-ranked attributes).
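For intuition about what the evaluator and the ranker compute together, here is a small pure-Python sketch that scores nominal attributes by information gain and keeps the top n. It is not WEKA's implementation (WEKA also deals with numeric attributes and missing values); the function names are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def info_gain(values, labels):
    """H(class) - H(class | attribute) for one nominal attribute."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    conditional = sum(
        len(group) / len(labels) * entropy(group) for group in groups.values()
    )
    return entropy(labels) - conditional


def top_n_attributes(rows, labels, n):
    """Return the indices of the n attributes with the highest information gain."""
    gains = [
        (info_gain([row[i] for row in rows], labels), i)
        for i in range(len(rows[0]))
    ]
    # Sort by gain descending; ties keep the lower attribute index first.
    gains.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in gains[:n]]
```

Here rows is a list of equal-length tuples of attribute values and labels is the matching list of class values; the returned indices correspond to the attributes a ranker would keep.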