Docker X server for NVIDIA OpenGL application (without X on the host)

I am trying to create a Docker image that runs an X server on an NVIDIA GPU for headless OpenGL applications (for example, generating textures or running Unity3D without a screen). The host does not run an X server; I want to do everything inside the container.
I am using this Dockerfile for the image:
FROM ubuntu:18.04
ENV DEBIAN_FRONTEND=noninteractive
RUN apt update && \
    apt install -y \
        libglvnd0 \
        libgl1 \
        libglx0 \
        libegl1 \
        libgles2 \
        xserver-xorg-video-nvidia-440
COPY xorg.conf.nvidia-headless /etc/X11/xorg.conf
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES graphics
ENV DISPLAY :1
ENTRYPOINT ["/bin/bash"]
I generated xorg.conf.nvidia-headless with nvidia-xconfig:
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0"
EndSection
Section "Files"
EndSection
Section "Module"
Load "dbe"
Load "extmod"
Load "type1"
Load "freetype"
Load "glx"
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
Option "UseDisplayDevice" "None"
SubSection "Display"
Virtual 1920 1080
Depth 24
EndSubSection
EndSection
I run the container with --privileged and --gpus all (using nvidia-docker), and I share the device with --device=/dev/dri/card0. Inside the container, nvidia-smi runs without problems.
After starting the container, I start an X server with
Xorg -noreset +extension GLX +extension RANDR +extension RENDER -logfile ./xserver.log vt1 :1
But it shows an error:
(EE)
Fatal server error:
(EE) no screens found(EE)
(EE)
This is the complete log:
X.Org X Server 1.19.6
Release Date: 2017-12-20
[ 1296.109] X Protocol Version 11, Revision 0
[ 1296.109] Build Operating System: Linux 4.4.0-168-generic x86_64 Ubuntu
[ 1296.109] Current Operating System: Linux ubuntu 4.15.0-112-generic #113-Ubuntu SMP Thu Jul 9 23:41:39 UTC 2020 x86_64
[ 1296.109] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-112-generic root=UUID=8f2dc01d-1666-4abd-9bd1-cfe0a20afdf1 ro splash quiet vt.handoff=1
[ 1296.109] Build Date: 14 November 2019 06:20:00PM
[ 1296.109] xorg-server 2:1.19.6-1ubuntu4.4 (For technical support please see http://www.ubuntu.com/support)
[ 1296.109] Current version of pixman: 0.34.0
[ 1296.109] Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
[ 1296.109] Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
[ 1296.110] (++) Log file: "./xserver.log", Time: Wed Aug 19 08:38:46 2020
[ 1296.110] (==) Using config file: "/etc/X11/xorg.conf"
[ 1296.110] (==) Using system config directory "/usr/share/X11/xorg.conf.d"
[ 1296.111] (==) ServerLayout "Layout0"
[ 1296.111] (**) |-->Screen "Screen0" (0)
[ 1296.111] (**) | |-->Monitor "Monitor0"
[ 1296.112] (**) | |-->Device "Device0"
[ 1296.112] (**) |-->Input Device "Keyboard0"
[ 1296.112] (**) |-->Input Device "Mouse0"
[ 1296.112] (==) Automatically adding devices
[ 1296.112] (==) Automatically enabling devices
[ 1296.112] (==) Automatically adding GPU devices
[ 1296.112] (==) Automatically binding GPU devices
[ 1296.112] (==) Max clients allowed: 256, resource mask: 0x1fffff
[ 1296.114] (WW) The directory "/usr/share/fonts/X11/cyrillic" does not exist.
[ 1296.114] Entry deleted from font path.
[ 1296.114] (WW) The directory "/usr/share/fonts/X11/100dpi/" does not exist.
[ 1296.114] Entry deleted from font path.
[ 1296.114] (WW) The directory "/usr/share/fonts/X11/75dpi/" does not exist.
[ 1296.114] Entry deleted from font path.
[ 1296.114] (WW) The directory "/usr/share/fonts/X11/Type1" does not exist.
[ 1296.114] Entry deleted from font path.
[ 1296.114] (WW) The directory "/usr/share/fonts/X11/100dpi" does not exist.
[ 1296.114] Entry deleted from font path.
[ 1296.114] (WW) The directory "/usr/share/fonts/X11/75dpi" does not exist.
[ 1296.114] Entry deleted from font path.
[ 1296.114] (==) FontPath set to:
/usr/share/fonts/X11/misc,
built-ins
[ 1296.114] (==) ModulePath set to "/usr/lib/xorg/modules"
[ 1296.114] (WW) Hotplugging is on, devices using drivers 'kbd', 'mouse' or 'vmmouse' will be disabled.
[ 1296.114] (WW) Disabling Keyboard0
[ 1296.114] (WW) Disabling Mouse0
[ 1296.115] (II) Loader magic: 0x55dca9edc020
[ 1296.115] (II) Module ABI versions:
[ 1296.115] X.Org ANSI C Emulation: 0.4
[ 1296.115] X.Org Video Driver: 23.0
[ 1296.115] X.Org XInput driver : 24.1
[ 1296.115] X.Org Server Extension : 10.0
[ 1296.116] (EE) dbus-core: error connecting to system bus: org.freedesktop.DBus.Error.FileNotFound (Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory)
[ 1296.116] (++) using VT number 1
[ 1296.116] (II) systemd-logind: logind integration requires -keeptty and -keeptty was not provided, disabling logind integration
[ 1296.116] (II) xfree86: Adding drm device (/dev/dri/card0)
[ 1296.119] (**) OutputClass "nvidia" ModulePath extended to "/usr/lib/x86_64-linux-gnu/nvidia/xorg,/usr/lib/xorg/modules"
[ 1296.122] (--) PCI:*(0:1:0:0) 10de:100c:1043:84b7 rev 161, Mem @ 0xf9000000/16777216, 0xd0000000/134217728, 0xd8000000/33554432, I/O @ 0x0000e000/128, BIOS @ 0x????????/131072
[ 1296.122] (II) LoadModule: "glx"
[ 1296.123] (II) Loading /usr/lib/xorg/modules/extensions/libglx.so
[ 1296.131] (EE) Failed to load /usr/lib/xorg/modules/extensions/libglx.so: /usr/lib/xorg/modules/extensions/libglx.so: undefined symbol: glxServer
[ 1296.131] (II) UnloadModule: "glx"
[ 1296.131] (II) Unloading glx
[ 1296.131] (EE) Failed to load module "glx" (loader failed, 7)
[ 1296.131] (II) LoadModule: "nvidia"
[ 1296.131] (II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
[ 1296.138] (II) Module nvidia: vendor="NVIDIA Corporation"
[ 1296.139] compiled for 1.6.99.901, module version = 1.0.0
[ 1296.139] Module class: X.Org Video Driver
[ 1296.140] (II) NVIDIA dlloader X Driver 440.100 Fri May 29 08:21:27 UTC 2020
[ 1296.140] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 1296.141] (II) Loading sub module "fb"
[ 1296.141] (II) LoadModule: "fb"
[ 1296.141] (II) Loading /usr/lib/xorg/modules/libfb.so
[ 1296.143] (II) Module fb: vendor="X.Org Foundation"
[ 1296.143] compiled for 1.19.6, module version = 1.0.0
[ 1296.143] ABI class: X.Org ANSI C Emulation, version 0.4
[ 1296.143] (II) Loading sub module "wfb"
[ 1296.143] (II) LoadModule: "wfb"
[ 1296.143] (II) Loading /usr/lib/xorg/modules/libwfb.so
[ 1296.144] (II) Module wfb: vendor="X.Org Foundation"
[ 1296.144] compiled for 1.19.6, module version = 1.0.0
[ 1296.144] ABI class: X.Org ANSI C Emulation, version 0.4
[ 1296.144] (II) Loading sub module "ramdac"
[ 1296.144] (II) LoadModule: "ramdac"
[ 1296.144] (II) Module "ramdac" already built-in
[ 1296.145] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 1296.145] (EE) NVIDIA: system's kernel log for additional error messages and
[ 1296.145] (EE) NVIDIA: consult the NVIDIA README for details.
[ 1296.145] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 1296.145] (EE) NVIDIA: system's kernel log for additional error messages and
[ 1296.145] (EE) NVIDIA: consult the NVIDIA README for details.
[ 1296.145] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 1296.145] (EE) NVIDIA: system's kernel log for additional error messages and
[ 1296.145] (EE) NVIDIA: consult the NVIDIA README for details.
[ 1296.145] (EE) No devices detected.
[ 1296.145] (II) Applying OutputClass "nvidia" to /dev/dri/card0
[ 1296.145] loading driver: nvidia
[ 1296.145] (==) Matched nvidia as autoconfigured driver 0
[ 1296.145] (==) Matched nouveau as autoconfigured driver 1
[ 1296.145] (==) Matched nouveau as autoconfigured driver 2
[ 1296.145] (==) Matched modesetting as autoconfigured driver 3
[ 1296.145] (==) Matched fbdev as autoconfigured driver 4
[ 1296.145] (==) Matched vesa as autoconfigured driver 5
[ 1296.145] (==) Assigned the driver to the xf86ConfigLayout
[ 1296.145] (II) LoadModule: "nvidia"
[ 1296.145] (II) Loading /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so
[ 1296.145] (II) Module nvidia: vendor="NVIDIA Corporation"
[ 1296.145] compiled for 1.6.99.901, module version = 1.0.0
[ 1296.145] Module class: X.Org Video Driver
[ 1296.145] (II) UnloadModule: "nvidia"
[ 1296.145] (II) Unloading nvidia
[ 1296.145] (II) Failed to load module "nvidia" (already loaded, 21980)
[ 1296.145] (II) LoadModule: "nouveau"
[ 1296.146] (WW) Warning, couldn't open module nouveau
[ 1296.146] (II) UnloadModule: "nouveau"
[ 1296.146] (II) Unloading nouveau
[ 1296.146] (EE) Failed to load module "nouveau" (module does not exist, 0)
[ 1296.146] (II) LoadModule: "modesetting"
[ 1296.146] (II) Loading /usr/lib/xorg/modules/drivers/modesetting_drv.so
[ 1296.147] (II) Module modesetting: vendor="X.Org Foundation"
[ 1296.147] compiled for 1.19.6, module version = 1.19.6
[ 1296.147] Module class: X.Org Video Driver
[ 1296.147] ABI class: X.Org Video Driver, version 23.0
[ 1296.147] (II) LoadModule: "fbdev"
[ 1296.147] (WW) Warning, couldn't open module fbdev
[ 1296.147] (II) UnloadModule: "fbdev"
[ 1296.147] (II) Unloading fbdev
[ 1296.147] (EE) Failed to load module "fbdev" (module does not exist, 0)
[ 1296.147] (II) LoadModule: "vesa"
[ 1296.147] (WW) Warning, couldn't open module vesa
[ 1296.147] (II) UnloadModule: "vesa"
[ 1296.147] (II) Unloading vesa
[ 1296.147] (EE) Failed to load module "vesa" (module does not exist, 0)
[ 1296.147] (II) NVIDIA dlloader X Driver 440.100 Fri May 29 08:21:27 UTC 2020
[ 1296.147] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 1296.147] (II) modesetting: Driver for Modesetting Kernel Drivers: kms
[ 1296.147] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
[ 1296.147] (WW) xf86OpenConsole: setsid failed: Operation not permitted
[ 1296.147] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 1296.147] (EE) NVIDIA: system's kernel log for additional error messages and
[ 1296.147] (EE) NVIDIA: consult the NVIDIA README for details.
[ 1296.147] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 1296.147] (EE) NVIDIA: system's kernel log for additional error messages and
[ 1296.147] (EE) NVIDIA: consult the NVIDIA README for details.
[ 1296.147] (WW) Falling back to old probe method for modesetting
[ 1296.147] (EE) Screen 0 deleted because of no matching config section.
[ 1296.147] (II) UnloadModule: "modesetting"
[ 1296.147] (EE) Device(s) detected, but none match those in the config file.
[ 1296.147] (EE)
Fatal server error:
[ 1296.147] (EE) no screens found(EE)
[ 1296.147] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
[ 1296.147] (EE) Please also check the log file at "./xserver.log" for additional information.
[ 1296.147] (EE)
[ 1296.149] (EE) Server terminated with error (1). Closing log file.
Could anyone help me with this? It will run on a headless machine with an NVIDIA GPU.
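With a log this long, it can help to pick out the error lines first. The following is only a minimal C sketch, assuming the line layout seen in the log above (an optional [ timestamp ] prefix followed by a marker such as (EE)). The function name xorg_line_is_error is made up for illustration and is not part of Xorg.

```c
#include <string.h>

/* Returns 1 if the log line starts (after an optional "[ timestamp ]"
 * prefix) with the (EE) marker, 0 otherwise.
 * Assumption: the layout matches the Xorg log shown in the question. */
int xorg_line_is_error(const char *line)
{
    /* Skip leading blanks. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* Skip the "[ 1296.145]" style timestamp if one is present. */
    if (*line == '[') {
        const char *close = strchr(line, ']');
        if (close == NULL)
            return 0;
        line = close + 1;
        while (*line == ' ' || *line == '\t')
            line++;
    }

    /* Only a marker at the start of the message counts, so the legend
     * line that merely mentions (EE) is not reported as an error. */
    return strncmp(line, "(EE)", 4) == 0;
}
```

Calling it on each line of xserver.log would leave the glx load failure, the "Failed to initialize the NVIDIA kernel module" messages, and the final "no screens found" entries.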

First things first: if you want headless OpenGL, do not use an X server!
It has been years since an X server was required to talk to the GPU. You can do headless rendering just fine without one. Nvidia has a nice article on how to do it: https://developer.nvidia.com/blog/egl-eye-opengl-visualization-without-x-server/
The gist is that you use EGL to set up a context and make it current without a surface by calling eglMakeCurrent(eglDpy, EGL_NO_SURFACE, EGL_NO_SURFACE, eglCtx);.
You will still need the Nvidia driver for Xorg, since it also carries all the offscreen functionality, but there is an important caveat: the Nvidia userland driver must match the version of the host system's nvidia kernel module. If you wrap the driver up in a Docker image, you are essentially tying that image to the particular kernel module version on the host system, which is not a desirable situation. Instead, you should configure your Docker image to bind the driver and OpenGL implementation libraries from the host system. Unfortunately, there is no universal location for those libraries and drivers, so it takes a little more effort to pull them all in reliably. But despair not, Nvidia already did the work for you:
https://gitlab.com/nvidia/container-images/opengl
Also, to set up the off-screen context reliably, it helps to unset the DISPLAY variable. Since Nvidia built all their Vulkan and EGL support on top of the Xorg driver, some code paths evaluate that variable, and unsetting it helps nudge them in the right direction. So inside your program, before setting up the OpenGL context, call unsetenv("DISPLAY").
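As a small illustration of the last point, here is a minimal C sketch that removes DISPLAY from the process environment with the POSIX unsetenv call. The wrapper name clear_display_variable is hypothetical; the idea is simply to call it once, before any EGL or OpenGL setup.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Removes DISPLAY from this process's environment so that later
 * EGL/OpenGL setup does not try to reach an X server.
 * Returns 0 on success and -1 on failure (errno is set by unsetenv).
 * Removing a variable that is not set also counts as success. */
int clear_display_variable(void)
{
    return unsetenv("DISPLAY");
}
```

Note that this affects only the calling process and the children it starts afterwards; the shell that launched the program keeps its own DISPLAY value.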

You're very close! My understanding is that you don't want the NVIDIA drivers inside your container. You just want the container to use the drivers that the host system already has installed (install them on the host first if it doesn't have them).
So instead of installing xserver-xorg-video-nvidia-440, install xserver-xorg-video-dummy in your Dockerfile. Then change the Device section of xorg.conf to
Section "Device"
Identifier "Device0"
Driver "dummy"
EndSection
Because the dummy driver doesn't support virtual displays, remove the Virtual line from your Display subsection and optionally set Modes instead:
SubSection "Display"
Depth 24
Modes "1920x1080"
EndSubSection
The real magic then happens in the docker run command: you have to mount your host system's CUDA libraries and set the linker path.
docker run -v /usr/lib/wsl:/usr/lib/wsl -e LD_LIBRARY_PATH=/usr/lib/wsl/lib --device=/dev/dri/card0 --gpus all --rm -it <your-image-name>
I am using the NVIDIA GPU on my Windows 11 machine under WSL2 and see glorious GPU rendering with Open3D's OffscreenRenderer, which does not yet support truly headless rendering with EGL.
# nvidia-smi
Fri Jan 13 18:05:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 527.92.01 Driver Version: 528.02 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| N/A 55C P8 4W / 40W | 145MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 21 G /Xwayland N/A |
| 0 N/A N/A 21 G /Xwayland N/A |
| 0 N/A N/A 23 G /Xwayland N/A |
| 0 N/A N/A 29 G /Xorg N/A |
| 0 N/A N/A 51 C+G /python3.7 N/A |
+-----------------------------------------------------------------------------+
(The Xorg process is from the command you provided, python3.7 is my Open3d script using some OpenGL rendering enabled by said Xorg process, and the Xwayland processes are some WSL things that allow running Linux GUI applications)
I haven't tested this on Ubuntu, but I believe the relevant path is /usr/local/nvidia/lib64 (at least that is the path on the GKE machines I use; see: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus). I don't have an Ubuntu box with a GPU handy, otherwise I would try to be more helpful.
Note: if you're using a more recent version of Mesa for OpenGL under WSLg on Windows, you'll need to tell Mesa which GPU to use. If you only have one NVIDIA GPU, you can just add
ENV MESA_D3D12_DEFAULT_ADAPTER_NAME NVIDIA
to your Dockerfile. See https://github.com/microsoft/wslg/wiki/GPU-selection-in-WSLg for more details.

Related

Enable HDMI output for AM3358 Debian 10.3 2020-04-06 4GB SD IoT (BeagleBone Black) [closed]

I am trying to get my BeagleBone Black to output video to my monitor so that I can use it as a stand-alone PC.
This was possible with the image that the BeagleBone shipped with.
I have just installed a new image on my BeagleBone Black Rev C.
The new image is AM3358 Debian 10.3 2020-04-06 4GB SD IoT.
I am able to SSH into it using PuTTY, and this way I have successfully performed some actions on it, such as using wget to download files from Google Drive.
So the board itself seems to be working well, and the HDMI output appears to be disabled somewhere.
I have copied the contents of the uEnv.txt file below to show that the lines that disable video are commented out:
#Docs: http://elinux.org/Beagleboard:U-boot_partitioning_layout_2.0
uname_r=4.19.94-ti-r42
#uuid=
#dtb=
###U-Boot Overlays###
###Documentation: http://elinux.org/Beagleboard:BeagleBoneBlack_Debian#U-Boot_Overlays
###Master Enable
enable_uboot_overlays=1
###
###Overide capes with eeprom
#uboot_overlay_addr0=/lib/firmware/<file0>.dtbo
#uboot_overlay_addr1=/lib/firmware/<file1>.dtbo
#uboot_overlay_addr2=/lib/firmware/<file2>.dtbo
#uboot_overlay_addr3=/lib/firmware/<file3>.dtbo
###
###Additional custom capes
#uboot_overlay_addr4=/lib/firmware/<file4>.dtbo
#uboot_overlay_addr5=/lib/firmware/<file5>.dtbo
#uboot_overlay_addr6=/lib/firmware/<file6>.dtbo
#uboot_overlay_addr7=/lib/firmware/<file7>.dtbo
###
###Custom Cape
#dtb_overlay=/lib/firmware/<file8>.dtbo
###
###Disable auto loading of virtual capes (emmc/video/wireless/adc)
#disable_uboot_overlay_emmc=1
#disable_uboot_overlay_video=1
#disable_uboot_overlay_audio=1
#disable_uboot_overlay_wireless=1
#disable_uboot_overlay_adc=1
###
###PRUSS OPTIONS
###pru_rproc (4.14.x-ti kernel)
#uboot_overlay_pru=/lib/firmware/AM335X-PRU-RPROC-4-14-TI-00A0.dtbo
###pru_rproc (4.19.x-ti kernel)
uboot_overlay_pru=/lib/firmware/AM335X-PRU-RPROC-4-19-TI-00A0.dtbo
###pru_uio (4.14.x-ti, 4.19.x-ti & mainline/bone kernel)
#uboot_overlay_pru=/lib/firmware/AM335X-PRU-UIO-00A0.dtbo
###
###Cape Universal Enable
enable_uboot_cape_universal=1
###
###Debug: disable uboot autoload of Cape
#disable_uboot_overlay_addr0=1
#disable_uboot_overlay_addr1=1
#disable_uboot_overlay_addr2=1
#disable_uboot_overlay_addr3=1
###
###U-Boot fdt tweaks... (60000 = 384KB)
#uboot_fdt_buffer=0x60000
###U-Boot Overlays###
cmdline=coherent_pool=1M net.ifnames=0 lpj=1990656 rng_core.default_quality=100 quiet
#In the event of edid real failures, uncomment this next line:
#cmdline=coherent_pool=1M net.ifnames=0 lpj=1990656 rng_core.default_quality=100 quiet video=HDMI-A-1:1024x768@60e
##enable Generic eMMC Flasher:
##make sure, these tools are installed: dosfstools rsync
#cmdline=init=/opt/scripts/tools/eMMC/init-eMMC-flasher-v3.sh
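Whether an option in uEnv.txt is actually in effect depends only on whether its line is uncommented. A minimal C sketch of that check follows, assuming the simple "key=value" and "#comment" layout of the file shown above. The function name uenv_line_sets_key is made up for this example.

```c
#include <string.h>

/* Returns 1 if `line` is an active (uncommented) "key=value" assignment
 * for exactly `key`, and 0 otherwise.
 * Assumption: one setting per line, with '#' starting a comment. */
int uenv_line_sets_key(const char *line, const char *key)
{
    size_t key_len = strlen(key);

    /* Skip leading blanks. */
    while (*line == ' ' || *line == '\t')
        line++;

    /* A commented-out line never takes effect. */
    if (*line == '#')
        return 0;

    /* The key must match completely and be followed directly by '='. */
    return strncmp(line, key, key_len) == 0 && line[key_len] == '=';
}
```

Applied to the file above, disable_uboot_overlay_video is not set on any line, while enable_uboot_overlays and enable_uboot_cape_universal are.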
Furthermore, here is the output of /opt/scripts/tools/version.sh:
git:/opt/scripts/:[a335abcf87d2ef5fd96e7de83cdf3f0ff5a4da2b]
eeprom:[A335BNLT00C02128SBB11942]
model:[TI_AM335x_BeagleBone_Black]
dogtag:[BeagleBoard.org Debian Buster IoT Image 2020-04-06]
bootloader:[microSD-(push-button)]:[/dev/mmcblk0]:[U-Boot SPL 2019.04-00002-gc9b3922522 (Aug 24 2020 - 16:42:18 -0500)]:[location: dd MBR]
bootloader:[microSD-(push-button)]:[/dev/mmcblk0]:[U-Boot 2019.04-00002-gc9b3922522]:[location: dd MBR]
bootloader:[eMMC-(default)]:[/dev/mmcblk1]:[U-Boot SPL 2019.04-00002-g07d5700e21 (Mar 06 2020 - 11:24:55 -0600)]:[location: dd MBR]
bootloader:[eMMC-(default)]:[/dev/mmcblk1]:[U-Boot 2019.04-00002-g07d5700e21]:[location: dd MBR]
UBOOT: Booted Device-Tree:[am335x-boneblack-uboot-univ.dts]
UBOOT: Loaded Overlay:[AM335X-PRU-RPROC-4-19-TI-00A0]
UBOOT: Loaded Overlay:[BB-ADC-00A0.bb.org-overlays]
UBOOT: Loaded Overlay:[BB-BONE-eMMC1-01-00A0.bb.org-overlays]
UBOOT: Loaded Overlay:[BB-HDMI-TDA998x-00A0.bb.org-overlays]
kernel:[4.19.94-ti-r42]
nodejs:[v10.24.0]
/boot/uEnv.txt Settings:
uboot_overlay_options:[enable_uboot_overlays=1]
uboot_overlay_options:[uboot_overlay_pru=/lib/firmware/AM335X-PRU-RPROC-4-19-TI-00A0.dtbo]
uboot_overlay_options:[enable_uboot_cape_universal=1]
pkg check: to individually upgrade run: [sudo apt install --only-upgrade <pkg>]
pkg:[bb-cape-overlays]:[4.14.20210821.0-0~buster+20210821]
pkg:[bb-customizations]:[1.20211215.2-0~buster+20220102]
pkg:[bb-usb-gadgets]:[1.20220112.3-0~buster+20220112]
pkg:[bb-wl18xx-firmware]:[1.20211222.2-0~buster+20211222]
pkg:[kmod]:[26-1]
pkg:[librobotcontrol]:[1.0.5-git20200715.0-0~buster+20200716]
pkg:[firmware-ti-connectivity]:[20190717-2rcnee1~buster+20200305]
groups:[debian : debian adm kmem dialout cdrom floppy audio dip video plugdev users systemd-journal bluetooth netdev i2c gpio pwm eqep remoteproc admin spi iio docker tisdk weston-launch xenomai cloud9ide]
cmdline:[console=ttyO0,115200n8 bone_capemgr.uboot_capemgr_enabled=1 root=/dev/mmcblk1p1 ro rootfstype=ext4 rootwait coherent_pool=1M net.ifnames=0 lpj=1990656 rng_core.default_quality=100 quiet video=HDMI-A-1:1024x768@60e]
dmesg | grep remote
[ 43.365021] remoteproc remoteproc0: wkup_m3 is available
[ 43.467266] remoteproc remoteproc0: powering up wkup_m3
[ 43.467297] remoteproc remoteproc0: Booting fw image am335x-pm-firmware.elf, size 217168
[ 43.474100] remoteproc remoteproc0: remote processor wkup_m3 is now up
[ 45.923727] remoteproc remoteproc1: 4a334000.pru is available
[ 45.927100] remoteproc remoteproc2: 4a338000.pru is available
dmesg | grep pru
[ 45.923727] remoteproc remoteproc1: 4a334000.pru is available
[ 45.923942] pru-rproc 4a334000.pru: PRU rproc node pru@4a334000 probed successfully
[ 45.927100] remoteproc remoteproc2: 4a338000.pru is available
[ 45.927243] pru-rproc 4a338000.pru: PRU rproc node pru@4a338000 probed successfully
dmesg | grep pinctrl-single
[ 0.949255] pinctrl-single 44e10800.pinmux: 142 pins, size 568
[ 1.475029] pinctrl-single 44e10800.pinmux: pin PIN108 already requested by ocp:A15_pinmux; cannot claim for 0-0070
[ 1.485668] pinctrl-single 44e10800.pinmux: pin-108 (0-0070) status -22
[ 1.492351] pinctrl-single 44e10800.pinmux: could not request pin 108 (PIN108) from group nxp_hdmi_bonelt_pins on device pinctrl-single
dmesg | grep gpio-of-helper
[ 0.962575] gpio-of-helper ocp:cape-universal: ready
lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
END
I solved the problem by following a solution found here.
The solution was to run a kernel update:
sudo /opt/scripts/tools/update_kernel.sh
So the real hero here is RobertCNelson who provided that answer.

About Beaglebone Black CAN protocol setting

Thank you for reading this.
I have been struggling with CAN communication on my BBB for months.
I would be really grateful for even a little help!
I am working on CAN communication between the BBB and another CAN device.
The other device is confirmed to work correctly with CAN.
I am using my BBB with the Cloud9 platform on a Windows laptop,
and the other device uses CAN0.
I configured the pins on the BBB with config-pin as shown below, using CAN1, and tried the cansend utility.
The bitrate on the other device is set to the same value.
config-pin p9.24 can
config-pin p9.26 can
ip link set can1 up type can bitrate
cansend can1 300#AC.AB.AD.AE.75.49.AD.D1
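For reference, the cansend argument has the form identifier, then #, then the payload bytes in hex. The C sketch below parses only that basic form (hex identifier, up to 8 data bytes, optional '.' separators). It is an illustration, not the can-utils implementation, which also accepts other forms such as remote and CAN FD frames. The function names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Converts one hex digit to its value, or returns -1 if it is not one. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parses "<hex id>#<hex bytes, optionally separated by '.'>".
 * On success stores the identifier in *id and the payload in data and
 * returns the payload length (0..8). Returns -1 on malformed input. */
int parse_can_frame_text(const char *text, uint32_t *id, uint8_t data[8])
{
    const char *hash = strchr(text, '#');
    if (hash == NULL || hash == text || hash - text > 8)
        return -1;

    /* Identifier: every character before '#' must be a hex digit. */
    uint32_t value = 0;
    for (const char *p = text; p < hash; p++) {
        int digit = hex_value(*p);
        if (digit < 0)
            return -1;
        value = (value << 4) | (uint32_t)digit;
    }
    *id = value;

    /* Payload: pairs of hex digits; '.' between bytes is skipped. */
    int len = 0;
    const char *p = hash + 1;
    while (*p != '\0') {
        if (*p == '.') {
            p++;
            continue;
        }
        int hi = hex_value(p[0]);
        int lo = (hi < 0) ? -1 : hex_value(p[1]);
        if (hi < 0 || lo < 0 || len == 8)
            return -1;
        data[len++] = (uint8_t)((hi << 4) | lo);
        p += 2;
    }
    return len;
}
```

For the command above this yields identifier 0x300 (a standard 11-bit ID) and eight data bytes starting with 0xAC.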
Yet no CAN packets seem to be sent or received.
(The receiving code is included under Additional Info.)
I also tried to capture a signal with an oscilloscope, but saw nothing at all.
Then I modified some lines of uEnv.txt, located in the boot folder of the BBB, as shown below.
###Additional custom capes
uboot_overlay_addr4=/lib/firmware/BB-CAN0-00A0.dtbo
uboot_overlay_addr5=/lib/firmware/BB-CAN1-00A0.dtbo
#uboot_overlay_addr6=/lib/firmware/<file6>.dtbo
#uboot_overlay_addr7=/lib/firmware/<file7>.dtbo
###
But CAN still does not work, and after this uEnv.txt change the config-pin command shows the error below:
debian@beaglebone:/lib/firmware$ config-pin -q p9.24
ERROR: open() for /sys/devices/platform/ocp/ocp:P9_24_pinmux/state failed, No such file or directory
I strongly suspect that something is wrong with the driver or the pinmux settings,
because the code worked well in other situations.
The same message appears for the other overlaid pins. In fact, no config-pin command works on these pins (and of course the CAN bus is still not working).
I am currently using the latest AM3358 Debian 10.3 (2020-04-06) SD IoT image, and all packages seem to be up to date. The image is flashed to the board and no SD card is inserted.
Thank you very much for reading this!
Additional Info:
CAN receive part code
#include <linux/can.h>
#include <linux/can/raw.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <errno.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
int InitCanInterface(const char *ifname){
    // Create a raw CAN socket and check it before using it.
    int sock = socket(PF_CAN, SOCK_RAW, CAN_RAW);
    if (sock == -1){
        printf("Fail to create CAN socket for %s - %m \n", ifname);
        return -1;
    }
    // Non-blocking mode: read() returns -1 when no frame is pending.
    fcntl(sock, F_SETFL, O_NONBLOCK);
    printf("Success to create CAN socket for %s\n", ifname);
    // Look up the interface index for the given interface name.
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
    int ret = ioctl(sock, SIOCGIFINDEX, &ifr);
    if (ret == -1){
        perror("Fail to get CAN interface index");
        close(sock);
        return -1;
    }
    printf("Success to get CAN interface index: %d\n", ifr.ifr_ifindex);
    // Bind the socket to that interface.
    struct sockaddr_can addr;
    memset(&addr, 0, sizeof(addr));
    addr.can_family = AF_CAN;
    addr.can_ifindex = ifr.ifr_ifindex;
    ret = bind(sock, (struct sockaddr*)&addr, sizeof(addr));
    if (ret == -1){
        perror("Fail to bind CAN socket -");
        close(sock);
        return -1;
    }
    printf("Success to bind CAN socket\n");
    return sock;
}
int TransmitCanFrame(const int sock, const uint32_t id, const uint8_t *data, const size_t data_len){
    // A classic CAN frame carries at most CAN_MAX_DLEN (8) data bytes.
    if (data_len > CAN_MAX_DLEN){
        printf("Invalid data length: %zu\n", data_len);
        return -1;
    }
    struct can_frame frame;
    memset(&frame, 0, sizeof(frame));
    // Send as an extended frame: 29-bit identifier plus the EFF flag.
    frame.can_id = (id & CAN_EFF_MASK) | CAN_EFF_FLAG;
    memcpy(frame.data, data, data_len);
    frame.can_dlc = data_len;
    int tx_bytes = write(sock, &frame, sizeof(frame));
    if (tx_bytes == -1){
        perror("Fail to transmit CAN frame -");
        return -1;
    }
    printf("Success to transmit CAN frame - %d bytes is transmitted\n", tx_bytes);
    return 0;
}
#define CAN_FRAME_MAX_LEN 8
int ReceiveCanFrame(const int sock){
    struct can_frame frame;
    int rx_bytes = read(sock, &frame, sizeof(frame));
    if (rx_bytes < 0){
        //perror("Fail to receive CAN frame -");
        return -1;
    }
    else if (rx_bytes < (int)sizeof(struct can_frame)){
        printf("Incomplete CAN frame is received - rx_bytes: %d\n", rx_bytes);
        return -1;
    }
    else if (frame.can_dlc > CAN_FRAME_MAX_LEN){
        printf("Invalid dlc: %u\n", frame.can_dlc);
        return -1;
    }
    if (((frame.can_id >> 29) & 1) ==1) {
        printf("Error frame is received\n");
    }
    else if (((frame.can_id >> 30) & 1) ==1) {
        printf("RTR frame is received\n");
    }
    else {
        if (((frame.can_id >> 31) & 1) == 0){
            printf("11bit long std CAN frame is received\n");
            printf("%#x\n",frame.can_id);
        }
        else {
            printf("29bit long ext CAN frame is received\n");
            printf("%#x\n",frame.can_id & 0x0001fffffff );
        }
    }
    // Only the first can_dlc bytes of the payload are valid.
    for (int ii=0; ii<frame.can_dlc; ii++) {
        printf("0x%X ", frame.data[ii]);
    }
    printf("\n");
    printf("\n");
    return 0;
}
int main(){
    int sock = InitCanInterface("can0");
    if (sock < 0 ){
        return -1;
    }
    // uint8_t can_data[CAN_FRAME_MAX_LEN] = {};
    while(1) {
    //printf("hello\n");
    //printf("%d",ReceiveCanFrame(sock));
    //if(ReceiveCanFrame(sock) == 0)
        //printf("No response\n");
    ReceiveCanFrame(sock);
    sleep(1);    
    }    
    return 0;
}
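The bit tests on can_id in the code above can also be written with named masks. The sketch below is a self-contained illustration: the constants are local copies with the same values as CAN_EFF_FLAG, CAN_RTR_FLAG, CAN_ERR_FLAG, CAN_SFF_MASK and CAN_EFF_MASK from linux/can.h, and the enum and function names are made up for this example.

```c
#include <stdint.h>

/* Local copies of the flag and mask values defined in <linux/can.h>. */
#define SKETCH_CAN_EFF_FLAG 0x80000000U /* extended (29-bit) frame format */
#define SKETCH_CAN_RTR_FLAG 0x40000000U /* remote transmission request   */
#define SKETCH_CAN_ERR_FLAG 0x20000000U /* error message frame           */
#define SKETCH_CAN_SFF_MASK 0x000007FFU /* 11-bit standard identifier    */
#define SKETCH_CAN_EFF_MASK 0x1FFFFFFFU /* 29-bit extended identifier    */

typedef enum {
    FRAME_STANDARD,
    FRAME_EXTENDED,
    FRAME_RTR,
    FRAME_ERROR
} frame_kind;

/* Classifies a raw can_id value in the same order as the receive code:
 * error frames first, then RTR frames, then data frames. */
frame_kind classify_can_id(uint32_t can_id)
{
    if (can_id & SKETCH_CAN_ERR_FLAG)
        return FRAME_ERROR;
    if (can_id & SKETCH_CAN_RTR_FLAG)
        return FRAME_RTR;
    if (can_id & SKETCH_CAN_EFF_FLAG)
        return FRAME_EXTENDED;
    return FRAME_STANDARD;
}

/* Strips the flag bits and returns only the 11-bit or 29-bit identifier. */
uint32_t can_identifier(uint32_t can_id)
{
    if (can_id & SKETCH_CAN_EFF_FLAG)
        return can_id & SKETCH_CAN_EFF_MASK;
    return can_id & SKETCH_CAN_SFF_MASK;
}
```

Note that the 29-bit mask is 0x1FFFFFFF (seven F digits after the 1), so a mask with fewer digits would silently drop the top identifier bits.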
uname -a outputs
debian@beaglebone:/lib/firmware$ uname -a
Linux beaglebone 4.19.94-ti-r64 #1buster SMP PREEMPT Fri May 21 23:57:28 UTC 2021 armv7l GNU/Linux
debian@beaglebone:/lib/firmware$ sudo /opt/scripts/tools/version.sh
git:/opt/scripts/:[e8ae28ccc34a177e9435a0d24cdf8421e081c19a]
eeprom:[A335BNLT00C00620BBBK11BC]
model:[TI_AM335x_BeagleBone_Black]
dogtag:[BeagleBoard.org Debian Buster IoT Image 2020-04-06]
bootloader:[eMMC-(default)]:[/dev/mmcblk1]:[U-Boot SPL 2019.04-00002-g07d5700e21 (Mar 06 2020 - 11:24:55 -0600)]:[location: dd MBR]
bootloader:[eMMC-(default)]:[/dev/mmcblk1]:[U-Boot 2019.04-00002-g07d5700e21]:[location: dd MBR]
UBOOT: Booted Device-Tree:[am335x-boneblack-uboot-univ.dts]
UBOOT: Loaded Overlay:[AM335X-PRU-RPROC-4-19-TI-00A0]
UBOOT: Loaded Overlay:[BB-ADC-00A0]
UBOOT: Loaded Overlay:[BB-BONE-eMMC1-01-00A0]
UBOOT: Loaded Overlay:[BB-CAN0-00A0]
UBOOT: Loaded Overlay:[BB-CAN1-00A0]
UBOOT: Loaded Overlay:[BB-HDMI-TDA998x-00A0]
kernel:[4.19.94-ti-r64]
nodejs:[v10.24.0]
/boot/uEnv.txt Settings:
uboot_overlay_options:[enable_uboot_overlays=1]
uboot_overlay_options:[uboot_overlay_addr4=/lib/firmware/BB-CAN0-00A0.dtbo]
uboot_overlay_options:[uboot_overlay_addr5=/lib/firmware/BB-CAN1-00A0.dtbo]
uboot_overlay_options:[uboot_overlay_pru=/lib/firmware/AM335X-PRU-RPROC-4-19-TI-00A0.dtbo]
uboot_overlay_options:[enable_uboot_cape_universal=1]
pkg check: to individually upgrade run: [sudo apt install --only-upgrade <pkg>]
pkg:[bb-cape-overlays]:[4.14.20210416.0-0~buster+20210416]
pkg:[bb-customizations]:[1.20210708.0-0~buster+20210708]
pkg:[bb-usb-gadgets]:[1.20200504.0-0~buster+20200504]
pkg:[bb-wl18xx-firmware]:[1.20210520.0-0~buster+20210520]
pkg:[kmod]:[26-1]
pkg:[librobotcontrol]:[1.0.5-git20200715.0-0~buster+20200716]
pkg:[firmware-ti-connectivity]:[20190717-2rcnee1~buster+20200305]
groups:[debian : debian adm kmem dialout cdrom floppy audio dip video plugdev users systemd-journal bluetooth netdev i2c gpio pwm eqep remoteproc admin spi iio docker tisdk weston-launch xenomai cloud9ide]
cmdline:[console=ttyO0,115200n8 bone_capemgr.uboot_capemgr_enabled=1 root=/dev/mmcblk1p1 ro rootfstype=ext4 rootwait coherent_pool=1M net.ifnames=0 lpj=1990656 rng_core.default_quality=100 quiet]
dmesg | grep remote
[ 65.289088] remoteproc remoteproc0: wkup_m3 is available
[ 65.320630] remoteproc remoteproc0: powering up wkup_m3
[ 65.320664] remoteproc remoteproc0: Booting fw image am335x-pm-firmware.elf, size 217148
[ 65.320951] remoteproc remoteproc0: remote processor wkup_m3 is now up
[ 68.227786] remoteproc remoteproc1: 4a334000.pru is available
[ 68.241566] remoteproc remoteproc2: 4a338000.pru is available
dmesg | grep pru
[ 68.227786] remoteproc remoteproc1: 4a334000.pru is available
[ 68.227985] pru-rproc 4a334000.pru: PRU rproc node pru@4a334000 probed successfully
[ 68.241566] remoteproc remoteproc2: 4a338000.pru is available
[ 68.241750] pru-rproc 4a338000.pru: PRU rproc node pru@4a338000 probed successfully
dmesg | grep pinctrl-single
[ 0.943044] pinctrl-single 44e10800.pinmux: 142 pins, size 568
dmesg | grep gpio-of-helper
[ 0.956633] gpio-of-helper ocp:cape-universal: ready
lsusb
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
END
dmesg for CAN
debian@beaglebone:/lib/firmware$ dmesg | grep can
[ 1.205500] c_can_platform 481cc000.can: c_can_platform device registered (regs=3377e4b7, irq=42)
[ 1.206878] c_can_platform 481d0000.can: c_can_platform device registered (regs=292aef38, irq=43)
[ 1.422353] can: controller area network core (rev 20170425 abi 9)
[ 992.007971] c_can_platform 481d0000.can can1: setting BTR=2701 BRPE=0000
[ 992.016624] IPv6: ADDRCONF(NETDEV_UP): can1: link is not ready
[ 992.017512] IPv6: ADDRCONF(NETDEV_CHANGE): can1: link becomes ready
The full uEnv.txt
debian@beaglebone:/lib/firmware$ cat /boot/uEnv.txt
#Docs: http://elinux.org/Beagleboard:U-boot_partitioning_layout_2.0
uname_r=4.19.94-ti-r64
#uuid=
#dtb=
###U-Boot Overlays###
###Documentation: http://elinux.org/Beagleboard:BeagleBoneBlack_Debian#U-Boot_Overlays
###Master Enable
enable_uboot_overlays=1
###
###Overide capes with eeprom
#uboot_overlay_addr0=/lib/firmware/<file0>.dtbo
#uboot_overlay_addr1=/lib/firmware/<file1>.dtbo
#uboot_overlay_addr2=/lib/firmware/<file2>.dtbo
#uboot_overlay_addr3=/lib/firmware/<file3>.dtbo
###
###Additional custom capes
uboot_overlay_addr4=/lib/firmware/BB-CAN0-00A0.dtbo
uboot_overlay_addr5=/lib/firmware/BB-CAN1-00A0.dtbo
#uboot_overlay_addr6=/lib/firmware/<file6>.dtbo
#uboot_overlay_addr7=/lib/firmware/<file7>.dtbo
###
###Custom Cape
#dtb_overlay=/lib/firmware/<file8>.dtbo
###
###Disable auto loading of virtual capes (emmc/video/wireless/adc)
#disable_uboot_overlay_emmc=1
#disable_uboot_overlay_video=1
#disable_uboot_overlay_audio=1
#disable_uboot_overlay_wireless=1
#disable_uboot_overlay_adc=1
###
###PRUSS OPTIONS
###pru_rproc (4.14.x-ti kernel)
#uboot_overlay_pru=/lib/firmware/AM335X-PRU-RPROC-4-14-TI-00A0.dtbo
###pru_rproc (4.19.x-ti kernel)
uboot_overlay_pru=/lib/firmware/AM335X-PRU-RPROC-4-19-TI-00A0.dtbo
###pru_uio (4.14.x-ti, 4.19.x-ti & mainline/bone kernel)
#uboot_overlay_pru=/lib/firmware/AM335X-PRU-UIO-00A0.dtbo
###
###Cape Universal Enable
enable_uboot_cape_universal=1
###
###Debug: disable uboot autoload of Cape
#disable_uboot_overlay_addr0=1
#disable_uboot_overlay_addr1=1
#disable_uboot_overlay_addr2=1
#disable_uboot_overlay_addr3=1
###
###U-Boot fdt tweaks... (60000 = 384KB)
#uboot_fdt_buffer=0x60000
###U-Boot Overlays###
cmdline=coherent_pool=1M net.ifnames=0 lpj=1990656 rng_core.default_quality=100 quiet
#In the event of edid real failures, uncomment this next line:
#cmdline=coherent_pool=1M net.ifnames=0 lpj=1990656 rng_core.default_quality=100 quiet video=HDMI-A-1:1024x768@60e
##enable Generic eMMC Flasher:
##make sure, these tools are installed: dosfstools rsync
#cmdline=init=/opt/scripts/tools/eMMC/init-eMMC-flasher-v3.sh
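Because most of uEnv.txt is commented out, it can help to list only the settings that are actually active. The following is a minimal Python sketch of that idea; the function name active_settings and the SAMPLE text (a few lines copied from the file above) are illustrative assumptions, not part of any BeagleBone tooling.

```python
def active_settings(text):
    """Return {key: value} for every uncommented key=value line."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first "=" only, so values such as the kernel
        # cmdline keep their own "=" characters.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# A few lines taken from the uEnv.txt shown above (hypothetical excerpt).
SAMPLE = """\
uname_r=4.19.94-ti-r64
#uuid=
enable_uboot_overlays=1
uboot_overlay_addr4=/lib/firmware/BB-CAN0-00A0.dtbo
uboot_overlay_addr5=/lib/firmware/BB-CAN1-00A0.dtbo
#uboot_overlay_addr6=/lib/firmware/<file6>.dtbo
cmdline=coherent_pool=1M net.ifnames=0 lpj=1990656 rng_core.default_quality=100 quiet
"""

if __name__ == "__main__":
    for key, value in active_settings(SAMPLE).items():
        print(f"{key} = {value}")
```

On the board itself, the same function could be fed the contents of /boot/uEnv.txt instead of SAMPLE.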
Some help on CAN and SocketCAN for the BBB and other boards in the family can be found here:
https://www.beyondlogic.org/adding-can-to-the-beaglebone-black/
Also:
https://github.com/craigpeacock/CAN-Examples
These examples are a bit old, and I have noticed that the Linux distro for the BBB, if you get it from bbb.io/latest-images, is going through an overhaul.
For instance, the config-pin utility still works, but some of the overlays and Device Trees for the BBB peripherals are being sent into mainline, especially for the BBAI.
If those examples (they are not mine, but I figured they would help) do not help you configure SocketCAN on Linux, please reply. I am working on a simple CAN interface based on them. Using Linux is helpful, but some things, such as the Device Trees, are in a state of change, and I think the same goes for config-pin.
For instance...
If you go to their forum at forum.beagleboard.org, you will see some people from GSOC working on examples ranging from config-pin utilities to the PRU cores, which will be helpful for people getting into the shared-memory, microcontroller game.
Here is the config-pin idea I found on their forum page:
https://forum.beagleboard.org/t/beagle-config-logs/30174
I have set up CAN on Debian 10.3 (Buster) on a BeagleBone Black.
I left uEnv.txt at its defaults and issued these commands (as root) to enable CAN:
config-pin p9.24 can
config-pin p9.26 can
ip link set can1 up type can bitrate 1000000
candump can1
Once this is working, you can automate the setup using uEnv.txt and /etc/network/interfaces, as described at https://www.beyondlogic.org/adding-can-to-the-beaglebone-black/ and at https://www.thomas-wedemeyer.de/beaglebone-canbus-python.html
If this doesn't work, I'd suggest ruling out other installed software or updates that could be interfering (try a fresh Debian install on another SD card) and checking that the hardware, i.e. the bus driver and the wiring, is OK.
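Once can1 is up, frames can also be read from Python instead of candump. The sketch below uses the standard library's SocketCAN support (Linux only). It assumes classic CAN frames (not CAN FD) and the interface name can1 from the commands above; the helper names build_frame, parse_frame, and dump are made up for illustration.

```python
import socket
import struct

# Layout of a classic struct can_frame: 32-bit CAN ID, 8-bit data length,
# 3 padding bytes, then 8 data bytes.
CAN_FRAME_FMT = "=IB3x8s"
CAN_FRAME_SIZE = struct.calcsize(CAN_FRAME_FMT)


def build_frame(can_id, data):
    """Pack a CAN ID and up to 8 payload bytes into a raw frame."""
    if len(data) > 8:
        raise ValueError("classic CAN frames carry at most 8 data bytes")
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data.ljust(8, b"\x00"))


def parse_frame(frame):
    """Unpack a raw frame into (can_id, payload)."""
    can_id, length, data = struct.unpack(CAN_FRAME_FMT, frame)
    return can_id, data[:length]


def dump(interface="can1", count=5):
    """Print `count` frames from `interface`, roughly like candump does."""
    with socket.socket(socket.AF_CAN, socket.SOCK_RAW, socket.CAN_RAW) as sock:
        sock.bind((interface,))
        for _ in range(count):
            can_id, data = parse_frame(sock.recv(CAN_FRAME_SIZE))
            payload = " ".join(f"{byte:02X}" for byte in data)
            # Mask off the flag bits (EFF/RTR/ERR) before printing the ID.
            print(f"{interface} {can_id & socket.CAN_EFF_MASK:03X} [{len(data)}] {payload}")
```

Calling dump("can1") on the board should block until frames arrive on the bus; build_frame and parse_frame can be tried on any Linux machine without CAN hardware.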
I solved my BBB CAN problem simply by swapping the transceiver board for a different one.
Don't use the cjmcu-230 CAN transceiver board. I use the one from Waveshare: https://www.waveshare.com/sn65hvd230-can-board.htm
Both CAN transceiver boards use the same SN65HVD230 chip, but there seems to be a ground-pin circuit issue on the cjmcu-230 board.
I hope this saves you some time if you run into the same issue.

Disabling UART0 console output and using it for general-purpose I/O on a PocketBeagle

I am using a debian image running on the pocketbeagle (the smaller version of the beaglebone). I can't seem to figure out how to disable the UART0 for console debugging and use it for my own purposes.
When I type:
dmesg | grep tty
I get the following:
[ 0.000000] Kernel command line: console=ttyO0,115200n8 root=/dev/mmcblk0p1 ro rootfstype=ext4 rootwait coherent_pool=1M net.ifnames=0 quiet
[ 0.002567] WARNING: Your 'console=ttyO0' has been replaced by 'ttyS0'
[ 1.446237] 44e09000.serial: ttyS0 at MMIO 0x44e09000 (irq = 158, base_baud = 3000000) is a 8250
[ 1.459149] console [ttyS0] enabled
[ 1.460177] 48022000.serial: ttyS1 at MMIO 0x48022000 (irq = 159, base_baud = 3000000) is a 8250
[ 1.461029] 48024000.serial: ttyS2 at MMIO 0x48024000 (irq = 160, base_baud = 3000000) is a 8250
[ 1.462034] 481a8000.serial: ttyS4 at MMIO 0x481a8000 (irq = 161, base_baud = 3000000) is a 8250
I have tried looking at the uEnv.txt file in /boot but there's nothing related to UART0. This is what I found in uEnv.txt:
#Docs: http://elinux.org/Beagleboard:U-boot_partitioning_layout_2.0
uname_r=4.9.82-ti-r102
#uuid=
#dtb=
###U-Boot Overlays###
###Documentation: http://elinux.org/Beagleboard:BeagleBoneBlack_Debian#U-Boot_Overlays
###Master Enable
enable_uboot_overlays=1
###
###Overide capes with eeprom
#uboot_overlay_addr0=/lib/firmware/<file0>.dtbo
#uboot_overlay_addr1=/lib/firmware/<file1>.dtbo
#uboot_overlay_addr2=/lib/firmware/<file2>.dtbo
#uboot_overlay_addr3=/lib/firmware/<file3>.dtbo
###
###Additional custom capes
#uboot_overlay_addr4=/lib/firmware/<file4>.dtbo
#uboot_overlay_addr5=/lib/firmware/<file5>.dtbo
#uboot_overlay_addr6=/lib/firmware/<file6>.dtbo
#uboot_overlay_addr7=/lib/firmware/<file7>.dtbo
###
###Custom Cape
#dtb_overlay=/lib/firmware/<file8>.dtbo
###
###Disable auto loading of virtual capes (emmc/video/wireless/adc)
#disable_uboot_overlay_emmc=1
#disable_uboot_overlay_video=1
#disable_uboot_overlay_audio=1
#disable_uboot_overlay_wireless=1
#disable_uboot_overlay_adc=1
###
###PRUSS OPTIONS
###pru_rproc (4.4.x-ti kernel)
#uboot_overlay_pru=/lib/firmware/AM335X-PRU-RPROC-4-4-TI-00A0.dtbo
###pru_uio (4.4.x-ti, 4.14.x-ti & mainline/bone kernel)
#uboot_overlay_pru=/lib/firmware/AM335X-PRU-UIO-00A0.dtbo
###
###Cape Universal Enable
enable_uboot_cape_universal=1
###
###Debug: disable uboot autoload of Cape
#disable_uboot_overlay_addr0=1
#disable_uboot_overlay_addr1=1
#disable_uboot_overlay_addr2=1
#disable_uboot_overlay_addr3=1
###
###U-Boot fdt tweaks... (60000 = 384KB)
#uboot_fdt_buffer=0x60000
###U-Boot Overlays###
cmdline=coherent_pool=1M net.ifnames=0 quiet
#In the event of edid real failures, uncomment this next line:
#cmdline=coherent_pool=1M net.ifnames=0 quiet video=HDMI-A-1:1024x768@60e
##enable Generic eMMC Flasher:
##make sure, these tools are installed: dosfstools rsync
#cmdline=init=/opt/scripts/tools/eMMC/init-eMMC-flasher-v3.sh
What should I do to use the UART0 on my pocketbeagle? I can't seem to find any documentation about it whatsoever.
I don't think you will find a good answer by looking in the bootloader, device tree, or kernel. I think the ideal way to solve this problem is by configuring the operating system.
I did this on a PocketBeagle running Debian 9.9, which uses systemd. It turns out that two systemd services interfere with UART0: serial-getty@ttyO0.service and serial-getty@ttyS0.service. systemd lets you force-disable a service with mask, and this only has to be done once, so 'systemctl mask serial-getty@ttyO0.service' and 'systemctl mask serial-getty@ttyS0.service' should do it. Note that before using ttyO0 you may have to configure the serial port manually. I did it from a C program using the termios API, but there is probably a way to do it on the command line as well, such as stty.
To verify my answer, you can inspect the /etc/systemd/system directory and run 'systemctl list-units --type=service' to see what is running on your system. More information about systemctl is here: https://www.tecmint.com/list-all-running-services-under-systemd-in-linux/
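The answer configured the port from C with termios; the same calls are available in Python's standard termios module. The sketch below is not the answerer's program. It assumes raw 8N1 mode at 115200 baud is what you want and that the port shows up as /dev/ttyS0 (as in the dmesg output above); the names raw_8n1 and open_uart are hypothetical.

```python
import os
import termios


def raw_8n1(attrs, baud=termios.B115200):
    """Return a copy of a tcgetattr() list set to raw 8N1 mode at `baud`."""
    iflag, oflag, cflag, lflag, _ispeed, _ospeed, cc = attrs
    cc = list(cc)
    # No input translation, break handling, or software flow control.
    iflag &= ~(termios.IGNBRK | termios.BRKINT | termios.PARMRK | termios.ISTRIP
               | termios.INLCR | termios.IGNCR | termios.ICRNL | termios.IXON)
    # No output post-processing.
    oflag &= ~termios.OPOST
    # No echo, line editing, or signal characters.
    lflag &= ~(termios.ECHO | termios.ECHONL | termios.ICANON
               | termios.ISIG | termios.IEXTEN)
    # 8 data bits, no parity, 1 stop bit; enable receiver, ignore modem lines.
    cflag &= ~(termios.CSIZE | termios.PARENB | termios.CSTOPB)
    cflag |= termios.CS8 | termios.CREAD | termios.CLOCAL
    # Blocking read that returns as soon as one byte is available.
    cc[termios.VMIN] = 1
    cc[termios.VTIME] = 0
    return [iflag, oflag, cflag, lflag, baud, baud, cc]


def open_uart(path="/dev/ttyS0", baud=termios.B115200):
    """Open the device at `path` (an assumed name) and apply raw 8N1 settings."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    termios.tcsetattr(fd, termios.TCSANOW, raw_8n1(termios.tcgetattr(fd), baud))
    return fd
```

After masking the two getty services, something like fd = open_uart() followed by os.write(fd, b"hello\r\n") would be the first thing to try. Keeping raw_8n1 separate from the open call makes the flag handling easy to check without hardware.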

Installing TensorFlow-GPU

I am trying to install tensorflow-gpu. The problem is that I have the nvidia-375.82 driver, while TensorFlow requires 375.66.
I got this error:
ImportError: libnvidia-fatbinaryloader.so.375.66: cannot open shared object file: No such file or directory
so I tried creating a symlink:
sudo ln -s /usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.82 /usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.66
That avoids the ImportError, but nothing more. If I try to run something like
import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
I get the result computed on the CPU, and it prints
2017-10-07 15:56:03.329769: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329832: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329850: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329864: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.329878: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-07 15:56:03.429055: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_NO_DEVICE
2017-10-07 15:56:03.429198: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: sklert-new-comp
2017-10-07 15:56:03.429226: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: sklert-new-comp
2017-10-07 15:56:03.429317: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.66.0
2017-10-07 15:56:03.429384: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
"""
2017-10-07 15:56:03.429446: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.82.0
2017-10-07 15:56:03.429473: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 375.82.0 does not match DSO version 375.66.0 -- cannot find working devices in this configuration
Device mapping: no known devices.
2017-10-07 15:56:03.430336: I tensorflow/core/common_runtime/direct_session.cc:300] Device mapping:
MatMul: (MatMul): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467133: I tensorflow/core/common_runtime/simple_placer.cc:872] MatMul: (MatMul)/job:localhost/replica:0/task:0/cpu:0
b: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467201: I tensorflow/core/common_runtime/simple_placer.cc:872] b: (Const)/job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
2017-10-07 15:56:03.467226: I tensorflow/core/common_runtime/simple_placer.cc:872] a: (Const)/job:localhost/replica:0/task:0/cpu:0
[[ 22. 28.]
[ 49. 64.]]
Is there any way to use tensorflow with gpu without downgrading?
...
It seems that the problem is not in TensorFlow but in the NVIDIA drivers:
sudo dmesg | grep NVRM
[ 1.267417] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 375.82 Wed Jul 19 21:16:49 PDT 2017 (using threaded interrupts)
[ 108.803115] NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 375.82. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
[ 1419.021917] NVRM: API mismatch: the client has the version 375.66, but
NVRM: this kernel module has the version 375.82. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
Some driver libraries have a different version:
locate 375.66
/usr/lib/i386-linux-gnu/libcuda.so.375.66
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.375.66
/usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.66
/usr/lib/x86_64-linux-gnu/libcuda.so.375.66
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.375.66
/usr/lib32/nvidia-375/libnvidia-fatbinaryloader.so.375.66
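The mismatch in the logs above (kernel module 375.82, user-space libraries 375.66) can be checked mechanically. This is a minimal Python sketch under a few assumptions: the kernel module version is taken from text in the format TensorFlow printed as "driver version file contents", library versions are taken from the numeric suffix of the .so file names, and the function names are invented for illustration.

```python
import re


def kernel_module_version(version_text):
    """Extract e.g. '375.82' from the NVRM version banner."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", version_text)
    return match.group(1) if match else None


def library_version(path):
    """Extract e.g. '375.66' from '.../libcuda.so.375.66'; None if absent."""
    match = re.search(r"\.so\.(\d+(?:\.\d+)+)$", path)
    return match.group(1) if match else None


def stale_libraries(version_text, lib_paths):
    """Return the library paths whose version differs from the kernel module."""
    kernel = kernel_module_version(version_text)
    return [p for p in lib_paths if library_version(p) not in (None, kernel)]


# Sample data copied from the logs in this question.
VERSION_TEXT = ("NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.82 "
                "Wed Jul 19 21:16:49 PDT 2017")
LIBS = [
    "/usr/lib/x86_64-linux-gnu/libcuda.so.375.66",
    "/usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.82",
]

if __name__ == "__main__":
    print("kernel module:", kernel_module_version(VERSION_TEXT))
    for path in stale_libraries(VERSION_TEXT, LIBS):
        print("version mismatch:", path)
```

On a live system the version text would presumably come from /proc/driver/nvidia/version, and the paths from a search such as the locate command above.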

Tensorflow Bazel 0.3.0 build CUDA 8.0 GTX 1070 fails

Here are my specs:
GTX 1070
Driver 367 (installed from .run)
Ubuntu 16.04
CUDA 8.0 (installed from .run)
Cudnn 5
Bazel 0.3.0 (potential problem?)
gcc 4.9.3
Tensorflow installed from source
To verify versions:
volcart@volcart-Precision-Tower-7910:~/$ nvidia-smi
Fri Aug 5 15:03:32 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:03:00.0      On |                  N/A |
|  0%   38C    P8    11W / 185W |    495MiB /  8113MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     20303    G   /usr/lib/xorg/Xorg                             280MiB |
|    0     20909    G   compiz                                         114MiB |
|    0     21562    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    98MiB |
+-----------------------------------------------------------------------------+
volcart@volcart-Precision-Tower-7910:~/$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
volcart@volcart-Precision-Tower-7910:~/$ bazel version
Build label: 0.3.0
Build target: bazel-out/local-fastbuild/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Fri Jun 10 11:38:23 2016 (1465558703)
Build timestamp: 1465558703
Build timestamp as int: 1465558703
volcart@volcart-Precision-Tower-7910:~/$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.9/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.9.3-13ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)
I did switch bazel versions, so I executed bazel clean successfully.
I can verify that CUDA is functional by running deviceQuery from ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery:
volcart@volcart-Precision-Tower-7910:~/NVIDIA_CUDA-8.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1070"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8113 MBytes (8507162624 bytes)
(15) Multiprocessors, (128) CUDA Cores/MP: 1920 CUDA Cores
GPU Max Clock rate: 1797 MHz (1.80 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1070
Result = PASS
When I ./configure I enter all the defaults.
The current errors
When I build the training example I get this:
volcart@volcart-Precision-Tower-7910:/usr/local/lib/python2.7/dist-packages/tensorflow$ sudo bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
Sending SIGTERM to previous Bazel server (pid=7108)... done.
.
INFO: Found 1 target...
...
./tensorflow/core/platform/default/logging.h: In instantiation of 'std::string* tensorflow::internal::Check_LTImpl(const T1&, const T2&, const char*) [with T1 = int; T2 = long unsigned int; std::string = std::basic_string<char>]':
tensorflow/core/common_runtime/gpu/gpu_device.cc:567:5: required from here
./tensorflow/core/platform/default/logging.h:197:35: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
TF_DEFINE_CHECK_OP_IMPL(Check_LT, < )
^
./tensorflow/core/platform/macros.h:54:29: note: in definition of macro 'TF_PREDICT_TRUE'
#define TF_PREDICT_TRUE(x) (x)
^
./tensorflow/core/platform/default/logging.h:197:1: note: in expansion of macro 'TF_DEFINE_CHECK_OP_IMPL'
TF_DEFINE_CHECK_OP_IMPL(Check_LT, < )
^
ERROR: /usr/local/lib/python2.7/dist-packages/tensorflow/tensorflow/cc/BUILD:199:1: Linking of rule '//tensorflow/cc:tutorials_example_trainer' failed: crosstool_wrapper_driver_is_not_gcc failed: error executing command third_party/gpus/crosstool/clang/bin/crosstool_wrapper_driver_is_not_gcc -o bazel-out/local_linux-opt/bin/tensorflow/cc/tutorials_example_trainer ... (remaining 805 argument(s) skipped): com.google.devtools.build.lib.shell.BadExitStatusException: Process exited with status 1.
bazel-out/local_linux-opt/bin/tensorflow/cc/_objs/tutorials_example_trainer/tensorflow/cc/tutorials/example_trainer.o: In function `tensorflow::example::ConcurrentSteps(tensorflow::example::Options const*, int)':
example_trainer.cc:(.text._ZN10tensorflow7example15ConcurrentStepsEPKNS0_7OptionsEi+0x517): undefined reference to `google::protobuf::internal::empty_string_'
bazel-out/local_linux-opt/bin/tensorflow/core/kernels/libidentity_reader_op.lo(identity_reader_op.o): In function `tensorflow::IdentityReader::SerializeStateLocked(std::string*)':
identity_reader_op.cc:(.text._ZN10tensorflow14IdentityReader20SerializeStateLockedEPSs[_ZN10tensorflow14IdentityReader20SerializeStateLockedEPSs]+0x36): undefined reference to `google::protobuf::MessageLite::SerializeToString(std::string*) const'
bazel-out/local_linux-opt/bin/tensorflow/core/kernels/libwhole_file_read_ops.lo(whole_file_read_ops.o): In function `tensorflow::WholeFileReader::SerializeStateLocked(std::string*)':
And when I try to build the pip package I get this:
volcart@volcart-Precision-Tower-7910:/usr/local/lib/python2.7/dist-packages/tensorflow$ sudo bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
WARNING: /usr/local/lib/python2.7/dist-packages/tensorflow/util/python/BUILD:11:16: in includes attribute of cc_library rule //util/python:python_headers: 'python_include' resolves to 'util/python/python_include' not in 'third_party'. This will be an error in the future.
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/bit_depth.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/gemmlowp.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/map.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/public/output_stages.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/profiling/instrumentation.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
WARNING: /home/volcart/.cache/bazel/_bazel_root/109ad80a732aaece8a87d1e3693889e7/external/gemmlowp/BUILD:102:12: in hdrs attribute of cc_library rule @gemmlowp//:eight_bit_int_gemm: Artifact 'external/gemmlowp/profiling/profiler.h' is duplicated (through '@gemmlowp//:eight_bit_int_gemm_public_headers' and '@gemmlowp//:gemmlowp_headers').
INFO: Found 1 target...
INFO: From Compiling external/protobuf/src/google/protobuf/util/internal/utility.cc [for host]:
...
INFO: From Compiling tensorflow/core/distributed_runtime/tensor_coding.cc:
tensorflow/core/distributed_runtime/tensor_coding.cc: In member function 'bool tensorflow::TensorResponse::ParseTensorSubmessage(google::protobuf::io::CodedInputStream*, tensorflow::TensorProto*)':
tensorflow/core/distributed_runtime/tensor_coding.cc:123:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (num_bytes != buf.size()) return false;
^
ERROR: /usr/local/lib/python2.7/dist-packages/tensorflow/tensorflow/core/kernels/BUILD:1498:1: undeclared inclusion(s) in rule '//tensorflow/core/kernels:batchtospace_op_gpu':
this rule is missing dependency declarations for the following files included by 'tensorflow/core/kernels/batchtospace_op_gpu.cu.cc':
'/usr/local/cuda-8.0/include/cuda_runtime.h'
'/usr/local/cuda-8.0/include/host_config.h'
'/usr/local/cuda-8.0/include/builtin_types.h'
'/usr/local/cuda-8.0/include/device_types.h'
'/usr/local/cuda-8.0/include/host_defines.h'
'/usr/local/cuda-8.0/include/driver_types.h'
'/usr/local/cuda-8.0/include/surface_types.h'
'/usr/local/cuda-8.0/include/texture_types.h'
'/usr/local/cuda-8.0/include/vector_types.h'
'/usr/local/cuda-8.0/include/library_types.h'
'/usr/local/cuda-8.0/include/channel_descriptor.h'
'/usr/local/cuda-8.0/include/cuda_runtime_api.h'
'/usr/local/cuda-8.0/include/cuda_device_runtime_api.h'
'/usr/local/cuda-8.0/include/driver_functions.h'
'/usr/local/cuda-8.0/include/vector_functions.h'
'/usr/local/cuda-8.0/include/vector_functions.hpp'
'/usr/local/cuda-8.0/include/common_functions.h'
'/usr/local/cuda-8.0/include/math_functions.h'
'/usr/local/cuda-8.0/include/math_functions.hpp'
'/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.h'
'/usr/local/cuda-8.0/include/math_functions_dbl_ptx3.hpp'
'/usr/local/cuda-8.0/include/cuda_surface_types.h'
'/usr/local/cuda-8.0/include/cuda_texture_types.h'
'/usr/local/cuda-8.0/include/device_functions.h'
'/usr/local/cuda-8.0/include/device_functions.hpp'
'/usr/local/cuda-8.0/include/device_atomic_functions.h'
'/usr/local/cuda-8.0/include/device_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/device_double_functions.h'
'/usr/local/cuda-8.0/include/device_double_functions.hpp'
'/usr/local/cuda-8.0/include/sm_20_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_20_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_32_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_32_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_35_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_60_atomic_functions.h'
'/usr/local/cuda-8.0/include/sm_60_atomic_functions.hpp'
'/usr/local/cuda-8.0/include/sm_20_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_20_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_30_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_30_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_32_intrinsics.h'
'/usr/local/cuda-8.0/include/sm_32_intrinsics.hpp'
'/usr/local/cuda-8.0/include/sm_35_intrinsics.h'
'/usr/local/cuda-8.0/include/surface_functions.h'
'/usr/local/cuda-8.0/include/texture_fetch_functions.h'
'/usr/local/cuda-8.0/include/texture_indirect_functions.h'
'/usr/local/cuda-8.0/include/surface_indirect_functions.h'
'/usr/local/cuda-8.0/include/device_launch_parameters.h'
'/usr/local/cuda-8.0/include/cuda_fp16.h'
'/usr/local/cuda-8.0/include/math_constants.h'
'/usr/local/cuda-8.0/include/curand_kernel.h'
'/usr/local/cuda-8.0/include/curand.h'
'/usr/local/cuda-8.0/include/curand_discrete.h'
'/usr/local/cuda-8.0/include/curand_precalc.h'
'/usr/local/cuda-8.0/include/curand_mrg32k3a.h'
'/usr/local/cuda-8.0/include/curand_mtgp32_kernel.h'
'/usr/local/cuda-8.0/include/cuda.h'
'/usr/local/cuda-8.0/include/curand_mtgp32.h'
'/usr/local/cuda-8.0/include/curand_philox4x32_x.h'
'/usr/local/cuda-8.0/include/curand_globals.h'
'/usr/local/cuda-8.0/include/curand_uniform.h'
'/usr/local/cuda-8.0/include/curand_normal.h'
'/usr/local/cuda-8.0/include/curand_normal_static.h'
'/usr/local/cuda-8.0/include/curand_lognormal.h'
'/usr/local/cuda-8.0/include/curand_poisson.h'
'/usr/local/cuda-8.0/include/curand_discrete2.h'.
nvcc warning : option '--relaxed-constexpr' has been deprecated and replaced by option '--expt-relaxed-constexpr'.
nvcc warning : option '--relaxed-constexpr' has been deprecated and replaced by option '--expt-relaxed-constexpr'.
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 138.913s, Critical Path: 102.63s
I saw some people complaining about Bazel 0.3.1, so you may need to downgrade to 0.3.0. The error you gave is not very informative: it is just the parent script reporting that a child script failed. There should be more information on the console with the actual error.
I went through the setup steps two days ago for a GTX 1080, and it worked with this config:
Ubuntu 16.04
Nvidia Driver: nvidia-367.35 (installed from .run file)
Bazel 0.3.0
gcc: 4.9.3 (default with 16.04)
CUDA 8.0.27 (installed from .run file into default dirs)
compute capability: (use default values for config)
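To compare a toolchain against the configuration listed above, the version strings can be pulled out of the same command outputs shown in the question. The sketch below is only an illustration: the regular expressions are assumptions based on the output formats quoted in this thread, KNOWN_GOOD simply restates the list above, and a reported difference is a hint to investigate, not a diagnosis.

```python
import re

# The working configuration reported in this answer.
KNOWN_GOOD = {"driver": "367.35", "bazel": "0.3.0", "gcc": "4.9.3", "cuda": "8.0.27"}

# Patterns matched against `nvidia-smi`, `bazel version`, `gcc -v`, `nvcc -V`.
PATTERNS = {
    "driver": r"Driver Version:\s*([\d.]+)",
    "bazel": r"Build label:\s*([\d.]+)",
    "gcc": r"gcc version\s+([\d.]+)",
    "cuda": r"release [\d.]+, V([\d.]+)",
}


def extract_versions(outputs):
    """Map tool name -> version string found in that tool's output text."""
    found = {}
    for name, text in outputs.items():
        match = re.search(PATTERNS[name], text)
        if match:
            found[name] = match.group(1)
    return found


def differences(found, expected=KNOWN_GOOD):
    """Return {tool: (found, expected)} for every tool that does not match."""
    return {name: (found.get(name), want)
            for name, want in expected.items() if found.get(name) != want}


# Output fragments copied from the question.
QUESTION_OUTPUTS = {
    "driver": "| NVIDIA-SMI 367.35 Driver Version: 367.35 |",
    "bazel": "Build label: 0.3.0",
    "gcc": "gcc version 4.9.3 (Ubuntu 4.9.3-13ubuntu2)",
    "cuda": "Cuda compilation tools, release 8.0, V8.0.26",
}

if __name__ == "__main__":
    for tool, (got, want) in differences(extract_versions(QUESTION_OUTPUTS)).items():
        print(f"{tool}: found {got}, known-good {want}")
```

In practice the text for each tool could be captured with subprocess.run(..., capture_output=True, text=True); note that gcc -v writes its version banner to stderr.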

Resources