Handling ID variables and Factors in R - machine-learning

I have a dataset and I want to build some models and compare them.
However, I'm not sure how to handle the product ID independent variable.
All variables are numeric, except that the product ID is stored as an int, as shown below:
str(data)
'data.frame': 16 obs. of 6 variables:
$ Productid: int 1 2 3 4 5 6 7 8 9 10 ...
$ x1 : num 6.21 7.75 7.21 8.33 4.87 5.09 6.04 6.09 6.08 6.17 ...
$ x2 : num 7.08 3.29 4.38 2.79 7.71 7.5 6.58 5.13 5.5 5.58 ...
$ x3 : num 2 1.54 1.79 1.63 1.96 2.13 2.04 2 2.09 2.13 ...
$ x4 : num 2.54 2.26 2.58 2.71 1.7 2.42 2.04 2.42 2.46 2.48 ...
$ Y : num 4.97 6.98 4.58 6.45 4.33 4.26 6.16 6.26 5.83 5.74 ...
How should I handle this product ID? Should I do one-hot encoding?
And if the solution is to convert it into a factor, which ML algorithms accept factors?

The ID is there only to identify a product and has no impact on the dependent variable, so it should not be included in any model; simply drop it before fitting, as sketched below.
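A minimal sketch in R, assuming the data frame is called data as in str(data) above: drop Productid before fitting, e.g. with a linear model. (If you ever have a genuine categorical predictor, as.factor() is the usual encoding; base functions like lm() and glm(), and tree-based packages like rpart and randomForest, handle factors directly, so manual one-hot encoding is rarely needed in R.)
# Drop the ID column, then fit on all remaining predictors
fit <- lm(Y ~ ., data = subset(data, select = -Productid))
summary(fit)

# For a real categorical variable (not an ID) you would convert it first:
# data$category <- as.factor(data$category)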

Related

Ubuntu 18.04: GNU parallel can't find and use the full number of cores on a local system

I am using GNU parallel (version 20200522) on Ubuntu Linux 18.04.4 and running jobs on all cores of a local server minus 2 cores; that is, I am using the -j-2 parameter.
find /folder/ -type f -iname "*.pdf" | parallel -j-2 --nice 2 "script.sh {1} {1/.}; mv -f -v {1} /folder2/; mv -f {1/.}.txt /folder3/" :::: -
However, the program shows
Error: Cannot run any jobs.
I tried using the -j100% parameter and saw that it uses just 1 core (job), so I deduce that, for GNU parallel, 100% of the available cores on this system is just one core.
If I use the -j5 parameter (which does not imply autodetection of the total number of cores), everything is alright, parallel launches 5 jobs and uses 5 cores.
The interesting part is that the file /root/.parallel/tmp/sshlogin/MACHINE_NAME/cpuspec contains the following:
1
6
6
which means, I think, that GNU parallel should see 6 available cores.
I have tried deleting the cpuspec file and running parallel again to redetect the total number of cores, but the cpuspec file and the behavior of the program remain the same.
On other systems, deleting the cpuspec file solved all issues, but on this particular system it does not work. The virtual machine was copied from another server with a different configuration, which is why I needed to delete the cpuspec file.
What should I do to get GNU parallel to correctly detect the number of cores on the system, so that I can use the -j-2 parameter?
Update 21.07:
After deleting the folder with the cpuspec file once again, running the parallel --number-of-sockets/cores/threads commands, and using the -S 6/: parameter just once, the problem seems to have resolved itself. Now GNU parallel correctly detects the number of cores and the -j-2 parameter works.
I am not sure what fixed it, and I am not able to reproduce the bug anymore.
Ole, thank you for your answer. If I meet the bug again or am able to reproduce it, I will post it here.
And here is the output to the commands:
parallel --number-of-sockets
1
parallel --number-of-cores
6
parallel --number-of-threads
6
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping : 2
microcode : 0xffffffff
cpu MHz : 2397.218
cache size : 15360 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm pti fsgsbase bmi1 avx2 smep bmi2 erms xsaveopt
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 4794.43
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
And it repeats itself for 5 more cores.
You may have found a bug. Please post the output of:
cat /proc/cpuinfo
parallel --number-of-sockets
parallel --number-of-cores
parallel --number-of-threads
Also see if you can make an MCVE.
As a workaround you can use -S 6/: to force GNU Parallel to detect 6 cores on your system.
find /folder/ -type f -iname "*.pdf" |
parallel -S 6/: -j-2 --nice 2 "script.sh {1} {1/.}; mv -f -v {1} /folder2/; mv -f {1/.}.txt /folder3/"
(Also, :::: - can be left out completely: if there is no ::: or ::::, GNU Parallel reads from stdin.)

Creating static archives with ar or libtool fails on jailbroken iOS

I'm trying to build and run Go on a self-hosted jailbroken iOS device (Corellium virtual devices). Go bootstraps and can build and run regular binaries. However, Go archives fail to build because ar doesn't work:
$ dpkg -S /usr/bin/clang /usr/bin/ar /usr/bin/libtool
org.coolstar.llvm-clang64: /usr/bin/clang
org.coolstar.cctools: /usr/bin/ar
org.coolstar.cctools: /usr/bin/libtool
$ apt policy org.coolstar.cctools org.coolstar.llvm-clang64
org.coolstar.cctools:
Installed: 895
Candidate: 895
Version table:
*** 895 500
500 http://apt.thebigboss.org/repofiles/cydia stable/main iphoneos-arm Packages
100 /Library/dpkg/status
org.coolstar.llvm-clang64:
Installed: 5.0.1-2
Candidate: 5.0.1-2
Version table:
*** 5.0.1-2 500
500 http://apt.thebigboss.org/repofiles/cydia stable/main iphoneos-arm Packages
100 /Library/dpkg/status
$ ../bin/go build -buildmode=c-archive -ldflags="-v" -work ../misc/ios/detect.go
WORK=/tmp/go-build734936238
# command-line-arguments
HEADER = -H1 -T0x2000 -R0x1000
0.00 deadcode
0.05 symsize = 0
0.08 pclntab=740717 bytes, funcdata total 185492 bytes
0.08 dodata
0.08 symsize = 0
0.08 symsize = 0
0.09 dynreloc
0.09 dwarf
0.12 asmb
0.13 datblk
0.14 reloc
0.20 sym
0.22 header
archive: ar -q -c -s $WORK/b001/exe/a.out.a /tmp/go-link-569975294/go.o /tmp/go-link-569975294/000000.o /tmp/go-link-569975294/000001.o /tmp/go-link-569975294/000002.o /tmp/go-link-569975294/000003.o /tmp/go-link-569975294/000004.o /tmp/go-link-569975294/000005.o /tmp/go-link-569975294/000006.o /tmp/go-link-569975294/000007.o /tmp/go-link-569975294/000008.o /tmp/go-link-569975294/000009.o /tmp/go-link-569975294/000010.o /tmp/go-link-569975294/000011.o /tmp/go-link-569975294/000012.o /tmp/go-link-569975294/000013.o /tmp/go-link-569975294/000014.o
/var/mobile/go-tip/pkg/tool/darwin_arm64/link: running ar failed: exit status 1
fatal error: ar: can't find or exec: /usr/bin/arm64-apple-darwin14-ranlib (No such file or directory)
ar: internal ranlib command failed
Optimistic as I am, I tried symlinking ranlib:
# ln -s /usr/bin/ranlib /usr/bin/arm64-apple-darwin14-ranlib
$ ../bin/go build -buildmode=c-archive -ldflags="-v" -work ../misc/ios/detect.go
WORK=/tmp/go-build239621581
# command-line-arguments
HEADER = -H1 -T0x2000 -R0x1000
0.00 deadcode
0.06 symsize = 0
0.08 pclntab=740717 bytes, funcdata total 185492 bytes
0.08 dodata
0.09 symsize = 0
0.09 symsize = 0
0.09 dynreloc
0.10 dwarf
0.12 asmb
0.13 datblk
0.15 reloc
0.20 sym
0.22 header
archive: ar -q -c -s $WORK/b001/exe/a.out.a /tmp/go-link-618780113/go.o /tmp/go-link-618780113/000000.o /tmp/go-link-618780113/000001.o /tmp/go-link-618780113/000002.o /tmp/go-link-618780113/000003.o /tmp/go-link-618780113/000004.o /tmp/go-link-618780113/000005.o /tmp/go-link-618780113/000006.o /tmp/go-link-618780113/000007.o /tmp/go-link-618780113/000008.o /tmp/go-link-618780113/000009.o /tmp/go-link-618780113/000010.o /tmp/go-link-618780113/000011.o /tmp/go-link-618780113/000012.o /tmp/go-link-618780113/000013.o /tmp/go-link-618780113/000014.o
/var/mobile/go-tip/pkg/tool/darwin_arm64/link: running ar failed: exit status 1
/usr/bin/arm64-apple-darwin14-ranlib: archive member: $WORK/b001/exe/a.out.a(go.o) offset in archive not a multiple of 8 (must be since member is an 64-bit object file)
fatal error: ar: fatal error in /usr/bin/arm64-apple-darwin14-ranlib
I even switched Go to use libtool:
$ ../bin/go build -buildmode=c-archive -ldflags="-v" -work ../misc/ios/detect.go
WORK=/tmp/go-build502332895
# command-line-arguments
HEADER = -H1 -T0x2000 -R0x1000
0.00 deadcode
0.04 symsize = 0
0.08 pclntab=740717 bytes, funcdata total 185492 bytes
0.08 dodata
0.08 symsize = 0
0.09 symsize = 0
0.09 dynreloc
0.10 dwarf
0.12 asmb
0.14 datblk
0.15 reloc
0.20 sym
0.22 header
archive: libtool -static -o $WORK/b001/exe/a.out.a /tmp/go-link-320669019/go.o /tmp/go-link-320669019/000000.o /tmp/go-link-320669019/000001.o /tmp/go-link-320669019/000002.o /tmp/go-link-320669019/000003.o /tmp/go-link-320669019/000004.o /tmp/go-link-320669019/000005.o /tmp/go-link-320669019/000006.o /tmp/go-link-320669019/000007.o /tmp/go-link-320669019/000008.o /tmp/go-link-320669019/000009.o /tmp/go-link-320669019/000010.o /tmp/go-link-320669019/000011.o /tmp/go-link-320669019/000012.o /tmp/go-link-320669019/000013.o /tmp/go-link-320669019/000014.o
/var/mobile/go-tip/pkg/tool/darwin_arm64/link: running libtool failed: signal: segmentation fault
Now, ar (with symlinked ranlib) fails on even simple object files:
$ cat blah.c
int blah() {
    return 0;
}
$ clang -c blah.c
$ ar -q -c -s blah.a blah.o
/usr/bin/arm64-apple-darwin14-ranlib: archive member: blah.a(blah.o) offset in archive not a multiple of 8 (must be since member is an 64-bit object file)
Libtool works:
$ libtool -static -o blah.a blah.o
But it segfaults on more complex object files, as seen above.
Go can already successfully create archives using the Xcode ar tool. It's only the iOS native "cctools" version that fails.
I worked around the problem by using llvm-ar. The command-line flags are slightly different, but at least I now have a working archive; see the sketch below.
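For reference, a rough sketch of that workaround, assuming an llvm-ar binary is installed and on the PATH (llvm-ar understands the GNU-style flags and builds the symbol table itself, so it never execs an external ranlib):
$ clang -c blah.c
$ llvm-ar rcs blah.a blah.o    # r = insert members, c = create archive, s = add symbol table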

Why is multi-GPU faster than single-GPU in Caffe training?

Same hardware/software environment, same net and solver; the only difference is the command line.
When the command line is:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=6
it takes about 50 seconds per 100 iterations.
When the command is:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=4,5,6,7
it takes about 48 seconds per 100 iterations.
Normally, multi-GPU training should cost more time per iteration than single-GPU because of costs such as replication. Can anyone tell me why it is faster here? Thanks very much!
Env:
2 * Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
8 * Nvidia Tesla V100 PCIE 16GB
Caffe 1.0.0 / USE_CUDNN on
CUDA 9.0.176
cuDNN 6.0.21

grep invert match on two files

I have two text files containing one column each, for example -
File_A File_B
1 1
2 2
3 8
If I do grep -f File_A File_B > File_C, I get File_C containing 1 and 2. I want to know how to use grep -v on two files so that I can get the non-matching values, 3 and 8 in the above example.
Thanks.
You can use comm, if your version supports an empty output delimiter:
$ # -3 means suppress lines common to both input files
$ # by default, tab character appears before lines from second file
$ comm -3 f1 f2
3
8
$ # change it to empty string
$ comm -3 --output-delimiter='' f1 f2
3
8
Note: comm requires sorted input, so use comm -3 --output-delimiter='' <(sort f1) <(sort f2) if they are not already sorted
You can also feed the common lines obtained from grep back into grep -v. Tested with GNU grep; some versions might not support all of these options.
$ grep -Fxf f1 f2 | grep -hxvFf- f1 f2
3
8
-F option to match strings literally, not as regex
-x option to match whole lines only
-h to suppress file name prefix
f- to accept stdin instead of file input
awk 'NR==FNR{a[$0];next} $0 in a{delete a[$0];next} {print} END{for(k in a) print k}' f1 f2
8
3
Lines of f2 missing from f1 are printed as they are read; lines unique to f1 are printed at the end.
To understand the meaning of NR and FNR, check the output of printing them:
awk '{print NR,FNR}' f1 f2
1 1
2 2
3 3
4 4
5 1
6 2
7 3
8 4
The condition NR==FNR selects the first file, since NR and FNR are equal only while the first file is being read.
With GNU diff command (to compare files line by line):
diff --suppress-common-lines -y f1 f2 | column -t
The output (the left column contains lines from f1, the right column lines from f2):
3 | 8
-y, --side-by-side - output in two columns

how to use svm-scale in LIBSVM?

I tried running the command svm-scale -l 0 -u 1 -s range data.data > data_scaled.data but I get the error SyntaxError: invalid syntax.
I am running the command from a Python interface in a Windows command shell. Is my command format wrong?
I assume that you are using the original LIBSVM package (as mentioned in the title of your question).
There the call should be svm-scale -l 0 -u 1 -s scaledParameters.txt input.data
According to the code, it prints the scaled output to your terminal, and the -s option writes the ranges of your feature values to a file, e.g.:
x
0 1
1 63375 13454352
2 1 10
3 1 10
4 1 10
5 1 10
6 1 10
7 1 10
8 1 10
9 1 10
10 1 10
If you just want to scale your data, either adapt the LIBSVM scale code to write the scaled data into a file, or simply redirect the printed output into a file from a real shell; the SyntaxError suggests the command was typed inside the Python interpreter rather than at the Windows command prompt. A call from Python is sketched below.
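A minimal sketch of the Python route, with hypothetical file names, assuming the svm-scale executable is on the PATH:
import subprocess

# svm-scale prints the scaled data to stdout, so capture stdout into a file
with open("data_scaled.data", "w") as out:
    subprocess.run(
        ["svm-scale", "-l", "0", "-u", "1", "-s", "range", "data.data"],
        stdout=out,   # the scaled data ends up in data_scaled.data
        check=True,   # raise an error if svm-scale fails
    )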
