I have a training.arff file, where each entry has 2000 features (attributes). I want to select the top n of those attributes using the Information Gain criterion. How can I do that using WEKA and the command line? I have checked online and it seems that it is a two-stage process, because I have to use a ranker as the second step. Could someone explain to me how to do it?
The way to do it is this:
java weka.filters.supervised.attribute.AttributeSelection \
-E "weka.attributeSelection.InfoGainAttributeEval" \
-S "weka.attributeSelection.Ranker -N 10" -i training.arff -o training_IG.arff
The -E option specifies which class to use as the evaluator, and -S specifies the search method to use (in this case, a ranker).
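To select a different n, just change the Ranker's -N parameter; for example, to keep the top 50 attributes (a sketch of the same command with only -N and the output file name changed):
java weka.filters.supervised.attribute.AttributeSelection \
-E "weka.attributeSelection.InfoGainAttributeEval" \
-S "weka.attributeSelection.Ranker -N 50" -i training.arff -o training_IG50.arff
If I remember the Ranker options correctly, -N -1 (the default) keeps all attributes and merely reorders them by rank.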
I want to do some comparison between the output of perf-stat and that of likwid-perfctr. Is there a way to do that? I tried running two commands, one for perf-stat and the other for likwid-perfctr.
The commands are:
sudo perf stat -C 2 -e instructions,BR_INST_RETIRED.ALL_BRANCHES,branches,rc004,INST_RETIRED.ANY ./loop
sudo likwid-perfctr -C 2 -g MYLIST1 -f ./loop
The first command uses perf-stat, which captures (somewhat redundantly) branch and instruction counts. The second uses likwid-perfctr, which captures similar data. Just to mention, I wrote my own event group called MYLIST1 for likwid-perfctr.
But when I compare the two sets of results, they turn out to be quite different.
Output Comparison
So, looking at the output, INSTR_RETIRED_ANY is 15552 in perf stat versus 190594 in likwid-perfctr, and branches are 3168 vs 42744.
I'm not sure what I'm doing wrong, or whether there is a proper way to do this comparison.
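One thing I plan to check (just a guess on my part, not a verified diagnosis): perf stat -C 2 counts whatever happens to run on core 2, without pinning ./loop there, while likwid-perfctr -C 2 pins ./loop to core 2 before measuring. A more apples-to-apples sketch would pin the process for perf as well, or count per-process instead of per-core:
# Pin the workload to core 2 so that what runs there is (mostly) just ./loop:
sudo taskset -c 2 perf stat -C 2 -e instructions,branches ./loop
# Or count per-process instead of per-core:
sudo perf stat -e instructions,branches ./loop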
I am creating a BQ table from the second tab of a Google Sheets spreadsheet. First, I create the table definition file as follows:
bq mkdef \
--noautodetect \
--source_format=source_format \
"drive_uri" \
path_to_schema_file > /tmp/mytable_def.json
and then I manually modify mytable_def.json to indicate that the table should be created from the second tab:
"googleSheetsOptions":{"range": "sheetB"}
However, I am looking for a way to do this directly from the first mkdef command. Is this possible?
I think it is worth trying the jq tool as a reliable way to process JSON objects, used here as a data-manipulation step in the approach mentioned above:
bq mkdef \
--noautodetect \
--source_format=source_format \
"drive_uri" \
path_to_schema_file | jq '.googleSheetsOptions += {"range": "sheetB"}' > /tmp/mytable_def.json
As was mentioned by @Daniel Zagales, referencing the documentation page, the table definition file has to be adjusted by hand or run through any tool that can do the adjustment as part of command-line processing.
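For completeness, the adjusted definition file can then be consumed the usual way (the dataset and table names below are placeholders of my own):
bq mk --external_table_definition=/tmp/mytable_def.json mydataset.mytable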
I use vi keybindings in Tmux's copy-mode, and I'd like to make Esc clear the current selection if there is one, or exit copy-mode if nothing was selected.
bind -T copy-mode-vi Escape if-shell -F '#{selection_active_flag}' \
'send-keys -X clear-selection' \
'send-keys -X cancel'
I was hoping Tmux might expose a variable that indicates the selection state (I made up selection_active_flag to express my intent, it doesn't actually exist), similar to window_zoomed_flag (which does exist).
Is there a way to achieve this?
Tmux 2.6 introduced selection_present. As stated in the changelog,
Add selection_present format when in copy mode (allows key bindings that do
something different if there is a selection).
This is exactly what I was looking for. Though I'm running Tmux 2.6, it seems I have an outdated man page, as it makes no mention of selection_present.
The final working solution is:
bind -T copy-mode-vi Escape if-shell -F '#{selection_present}' \
'send-keys -X clear-selection' \
'send-keys -X cancel'
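To sanity-check that the binding took effect, you can dump it back out of the running server (list-keys just prints the bindings; the grep only narrows the output):
tmux list-keys -T copy-mode-vi | grep Escape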
Quick question: what is the compiler flag to allow g++ to spawn multiple instances of itself in order to compile large projects quicker (for example 4 source files at a time for a multi-core CPU)?
You can do this with make; with GNU make it is the -j flag (it will also help on a uniprocessor machine).
For example if you want 4 parallel jobs from make:
make -j 4
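To see why this helps, here is a minimal hypothetical Makefile (the file names are invented): the object-file rules are independent of one another, so make -j 4 can run up to four g++ instances at once, while the final link still waits for all of them (recipe lines must be indented with tabs):
CXX = g++
OBJS = a.o b.o c.o d.o

# Link step: runs only after all objects are built.
app: $(OBJS)
	$(CXX) -o $@ $(OBJS)

# Pattern rule: each object depends only on its own source file,
# so these compile jobs can run in parallel.
%.o: %.cpp
	$(CXX) -c -o $@ $<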
You can also run gcc in a pipe with
gcc -pipe
This will pipeline the compile stages, which will also help keep the cores busy.
If you have additional machines available too, you might check out distcc, which will farm compiles out to those as well.
There is no such flag, and having one runs against the Unix philosophy of having each tool perform just one function and perform it well. Spawning compiler processes is conceptually the job of the build system. What you are probably looking for is the -j (jobs) flag to GNU make, a la
make -j4
Or you can use pmake or similar parallel make systems.
People have mentioned make but bjam also supports a similar concept. Using bjam -jx instructs bjam to build up to x concurrent commands.
We use the same build scripts on Windows and Linux and using this option halves our build times on both platforms. Nice.
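For example, to allow up to eight concurrent commands:
bjam -j8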
If you are using make, invoke it with -j. From man make:
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously.
If there is more than one -j option, the last one is effective.
If the -j option is given without an argument, make will not limit the
number of jobs that can run simultaneously.
And most notably, if you want to script around the number of cores you have available (which can change a lot depending on your environment, especially if you run in many environments), you may use the ubiquitous Python function multiprocessing.cpu_count():
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.cpu_count
Like this:
make -j $(python3 -c 'import multiprocessing as mp; print(int(mp.cpu_count() * 1.5))')
If you're asking why 1.5, I'll quote user artless-noise from a comment above:
The 1.5 number is because of the noted I/O bound problem. It is a rule of thumb. About 1/3 of the jobs will be waiting for I/O, so the remaining jobs will be using the available cores. A number greater than the cores is better and you could even go as high as 2x.
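If you'd rather not depend on Python, the same 1.5x rule of thumb can be expressed with plain shell arithmetic, assuming GNU coreutils' nproc is available:
make -j $(( ($(nproc) * 3) / 2 ))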
make will do this for you. Investigate the -j and -l switches in the man page. I don't think g++ is parallelizable.
distcc can also be used to distribute compiles not only on the current machine, but also on other machines in a farm that have distcc installed.
I'm not sure about g++, but if you're using GNU Make then "make -j N" (where N is the number of threads make can create) will allow make to run multiple g++ jobs at the same time (as long as the files do not depend on each other).
GNU parallel
I was making a synthetic compilation benchmark and couldn't be bothered to write a Makefile, so I used:
sudo apt-get install parallel
ls | grep -E '\.c$' | parallel -t --will-cite "gcc -c -o '{.}.o' '{}'"
Explanation:
{.} takes the input argument and removes its extension
-t prints out the commands being run to give us an idea of progress
--will-cite removes the request to cite the software if you publish results using it...
parallel is so convenient that I could even do a timestamp check myself:
ls | grep -E '\.c$' | parallel -t --will-cite "\
if ! [ -f '{.}.o' ] || [ '{}' -nt '{.}.o' ]; then
gcc -c -o '{.}.o' '{}'
fi
"
xargs -P can also run jobs in parallel, but it is a bit less convenient to do the extension manipulation or run multiple commands with it: Calling multiple commands through xargs
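For the record, a hypothetical xargs equivalent of the parallel invocation above might look like this (GNU xargs; the sh -c wrapper does the extension manipulation that xargs itself cannot):
# Compile all .c files, up to 4 at a time; $1 is the file name passed by xargs.
ls *.c | xargs -P 4 -I{} sh -c 'gcc -c -o "${1%.c}.o" "$1"' _ {}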
Parallel linking was asked at: Can gcc use multiple cores when linking?
TODO: I think I read somewhere that compilation can be reduced to matrix multiplication, so maybe it is also possible to speed up single file compilation for large files. But I can't find a reference now.
Tested in Ubuntu 18.10.
This question is based on this answer.
Why do you get the same output from both commands?
Command A
$ sudo grep muel * /tmp
masi:muel
Command B
$ sudo grep -H muel * /tmp
masi:muel
Rob's comment suggests that Command A should not give me the masi: prefix, but only muel.
In short, what is the practical purpose of -H?
Grep will list the filenames by default if more than one filename is given. The -H option makes it do that even if only one filename is given. In both your examples, more than one filename is given.
Here's a better example:
$ grep Richie notes.txt
Richie wears glasses.
$ grep -H Richie notes.txt
notes.txt:Richie wears glasses.
It's more useful when you're giving it a wildcard for an unknown number of files, and you always want the filenames printed even if the wildcard only matches one file.
If you grep a single file, -H makes a difference:
$ grep muel masi
muel
$ grep -H muel masi
masi:muel
This could be significant in various scripting contexts. For example, a script (or a non-trivial piped series of commands) might not be aware of how many files it's actually dealing with: one, or many.
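As a hypothetical sketch of the kind of pipeline that breaks without it (the pattern and glob are invented): a script that parses file:line prefixes silently changes shape when the glob happens to match exactly one file, unless -H forces the prefix.
# -H keeps the output format stable: always file:line, however many files match.
grep -Hn ERROR *.log | cut -d: -f1,2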
When you grep from multiple files, by default it shows the name of the file where the match was found. If you specify -H, the file name will always be shown, even if you grep from a single file. You can specify -h to never show the file name.
Emacs has a grep interface (M-x grep, M-x lgrep, M-x rgrep). If you ask Emacs to search for foo in the current directory, Emacs calls grep, processes the grep output, and then presents you with results as clickable links. Clickable links, just like Google.
What Emacs does is pass two options to grep: -n (show line numbers) and -H (show file names even if there is only one file; the point is consistency), and then turn the output into clickable links.
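Concretely, the command Emacs composes looks roughly like this (the exact flags vary by Emacs version, and foo is just a placeholder pattern):
grep -nH -e foo *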
In general, consistency makes for a good API, but consistency conflicts with DWIM (do what I mean).
When you use grep directly, you want DWIM, so you don't pass -H.