Error running NVIDIA deepfacelab - model doesn't start training - machine-learning

See the screenshot. I'm running NVIDIA DeepFaceLab on a Paperspace virtual machine, and the model does not start training. The machine runs Windows 10 with a P6000 GPU. Training parameters are visible in the screenshot.
The GPU should be supported, and I just installed the newest driver from NVIDIA. What could be the problem? It gets stuck after loading the data. There is no clear error message; it just doesn't go forward. I also tried different batch sizes for training, down to 2.
I saw two threads about this issue on GitHub, but no one has replied to them. I found an old thread that recommended installing the new driver, and did so.

Related

Problem with display rate of images (photos) using Docker + GPU

I want to ask a semi-theoretical question.
I'm using a Docker image which uses nvidia-container-runtime to communicate with the GPU on my machine.
The purpose of the Docker image is to run a GUI application that presents images (photos) at a high rate (1 Hz - 10 Hz). However, we noticed delays in the presentation rate compared to running the same application on the bare OS (without the overhead of the Docker container). Has anyone encountered this issue? Can it be resolved somehow? As a note, the display rate should be as exact as possible, meaning we can't allow delays of more than 10 ms.
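One way to narrow this down is to measure timing jitter separately from the display path. The following is a minimal sketch, not a diagnosis: it assumes Python is available both inside the container and on the bare OS, and the function name `measure_jitter` and its parameters are hypothetical. It only measures how late a sleep-based scheduler wakes up relative to an ideal 10 Hz schedule; it does not measure GPU or display latency.

```python
import time


def measure_jitter(rate_hz=10.0, frames=20):
    """Return how late (in ms) each tick fired relative to its ideal schedule."""
    period = 1.0 / rate_hz
    start = time.perf_counter()
    lateness_ms = []
    for i in range(1, frames + 1):
        # Ideal time of the i-th frame, computed from the start time so that
        # errors do not accumulate from one frame to the next.
        target = start + i * period
        remaining = target - time.perf_counter()
        if remaining > 0:
            time.sleep(remaining)
        # A real application would present the image here.
        lateness_ms.append((time.perf_counter() - target) * 1000.0)
    return lateness_ms


if __name__ == "__main__":
    late = measure_jitter()
    print(f"max lateness: {max(late):.3f} ms, "
          f"mean lateness: {sum(late) / len(late):.3f} ms")
```

Running the same script on the host and inside the container and comparing the numbers may show whether the delay comes from scheduling or from the rendering and display path.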

Raspberry Pi 3 + Windows IoT Core crashes after some time

I'm developing a UWP app on a Raspberry Pi 3 with Windows IoT Core. After I deploy my app and use it for a couple of days, the OS crashes with the message "Your PC ran into a problem and needs to restart". It restarts a couple of times but shows the same error on every boot.
I tried removing the SD card (Class 10, 64 GB), formatting it, and reinstalling everything. At first it was okay, but after some time the same error appeared.
I tried different OS builds, and that didn't work.
I also tried an industrial power supply (5 V, 3 A), and that didn't work either.
My SD card is not one of the recommended ones, but do I really have to get a recommended SD card to use Windows IoT Core properly?
"Your PC ran into a problem and needs to restart" is a typical blue screen message seen on Windows systems from the last few years - laptops and desktops with far larger hard drives and no SD card. The error is not associated with a RAM or disk space shortage (operating systems running in graphical mode usually monitor and actively warn about either). In your case, it is showing at startup, when not much is running (taking up RAM), and you can check the amount of space used on the card with the PC.
The key stats for SD cards are size (you have plenty) and speed (clearly enough or you would have trouble installing/running anything after starting the Pi). The cause is something else, and finding out what will require getting a more detailed error message from Windows - "a problem" could mean anything. In my experience, blue screen errors have mostly involved having a wrong driver installed, sometimes a bad Windows update - but IoT Core has its own alternatives, like "bad system configuration". Look for the underscored string (e.g., BAD_SYSTEM_CONFIG_INFO) at the end of your blue screen message, as that is the first hint.
Unfortunately, most Windows BSoD documentation is for traditional PCs, so I cannot recommend specific troubleshooting tools and be sure that they will run on the Pi.
You can use the Windows Debugger (WinDbg) to debug the kernel and drivers on Windows IoT Core. WinDbg is a very powerful debugger that most Windows developers are familiar with. You can also refer to this topic in MSDN, which shows how to create a dump file when the app crashes. If possible, share your code so that we can reproduce the issue.

nvidia-smi command could not communicate with NVIDIA driver on Microsoft Azure DSVM

Right after creating and starting up a data science virtual machine and connecting through ssh, I tried to use nvidia-smi to see if the built-in NVIDIA driver and CUDA were working properly. The returned message read
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA
driver. Make sure that the latest NVIDIA driver is installed and
running.
These were supposed to be part of the VM, yet when I tried to run the program I created, my local computer's default CPU was used instead of the VM's GPU. The ultimate goal of my project is to run an object detection model faster than my current lousy 11 sec/image, so I figured I would use a VM and take advantage of its computing power. Yet it seems like this may not be the best option, so if anyone else has some advice there, I would appreciate it.
The issue you are seeing is because you are using a D Series VM. Only the N series VMs have GPUs. So in order to utilize the GPU you need to select one of the following sizes:
https://learn.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu
For this size family, the vCPU (core) quota in your subscription is initially set to 0 in each region. You will need to request a vCPU quota increase for this family in an available region.
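After switching to a GPU-backed size, it can help to confirm from inside the VM that the driver responds before launching the workload. The following is a minimal sketch that assumes Python 3.7+ is installed on the VM; the helper name `check_nvidia_smi` is hypothetical. It simply runs `nvidia-smi` with no arguments and reports whether the command exists and exits successfully.

```python
import shutil
import subprocess


def check_nvidia_smi(cmd="nvidia-smi"):
    """Return (ok, detail) describing whether the NVIDIA driver responds."""
    # The tool is missing entirely (for example, no driver package installed).
    if shutil.which(cmd) is None:
        return False, f"{cmd} not found on PATH"
    try:
        result = subprocess.run([cmd], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run {cmd}: {exc}"
    # A non-zero exit code covers the "couldn't communicate with the NVIDIA
    # driver" case, which is what a VM without a GPU reports.
    if result.returncode != 0:
        return False, (result.stdout + result.stderr).strip()
    return True, result.stdout.strip()


if __name__ == "__main__":
    ok, detail = check_nvidia_smi()
    print("GPU driver OK" if ok else "GPU driver NOT available")
    print(detail)
```

On a VM size without a GPU, this is expected to print the same failure text shown in the question.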

BeagleBone Black doesn't power on

I am working in a technology laboratory. We have 15 BeagleBone Blacks (BBB), and suddenly 5 of them no longer power on.
The power LED stays on, but nothing else happens.
How can I solve the problem?
Thank you
Before solving the problem, you probably have to investigate it first.
I would verify that those BeagleBones are still functional, that is, check whether the BeagleBone Black displays any messages on the serial console. The procedure for connecting a USB-to-TTL adapter is described here. I would strongly suggest buying the exact adapter featured in the article above on eBay if you don't have one.
If no messages are displayed on the serial console, I would attempt to load u-boot from the serial port. This can be done by connecting both P8.44/SYS_BOOT3/LCD_DATA3/GPIO2_9 and P8.43/SYS_BOOT2/LCD_DATA2/GPIO2_8 to ground (two of P9.43/P9.44/P9.45/P9.46) using two 4.7 kΩ resistors, powering the BeagleBone from an external 5 V power supply (not USB), and power-cycling the BeagleBone. Power-cycling IS required; performing a 'reset' is not enough for the new SYSBOOT configuration to be taken into account.
You can then download u-boot from your PC using Tera Term: u-boot-spl-.bin should be sent using XMODEM, and u-boot.bin using YMODEM, as described in the 'Boot over UART' section of this TI wiki article.
Once you have u-boot running, you should be able to reinstall your BeagleBone using information available on the Internet.
If you cannot boot using the boot ROM and the serial port, that is probably a bad sign.
I would suggest first trying the procedure for loading u-boot from the serial port on a BeagleBone you know is working. This is totally non-intrusive, provided that you don't modify the eMMC from u-boot.

Understanding the Android emulator: Testing images? Network connectivity dependencies?

To better clarify my generic question:
I have gotten the Android emulator to work by running a full "make full-eng" build, as per the Google documentation. However, I wanted to debug it, so once I ran the emulator, and called "$ adb shell dmesg" and routed that to an output text file, I found a couple of strange lines:
...
<4>goldfish_new_pdev goldfish_interrupt_controller at ff000000 irq -1
<4>goldfish_new_pdev goldfish_device_bus at ff001000 irq 1
<4>goldfish_new_pdev goldfish_timer at ff003000 irq 3
<4>goldfish_new_pdev goldfish_rtc at ff01000
So when you run the Android full build, it gives you Goldfish as the system image? I want to know if it's testing the things I want for Galaxy Nexus. The kernel was a modified maguro kernel (omap project) for Galaxy Nexus, that I put into the build tree. But the platform I want to be testing is IceCreamSandwich. Is the emulator testing this platform? (b/c the output in this log is leading me to believe it isn't) Or is the emulator testing a "generic" image?
Also, an important further question: I modified the kernel's "socket.h" file, to override the INET protocol with an undefined protocol (FINS). In theory the phone should boot up, but NO internet access. Does the phone emulator care what you do to the internet protocols? Does it use your host computer's networking capabilities?
One further follow-up: What processes/system-services/events (that are involved in booting to a stable state) of the phone DEPEND on the internet protocols of the traditional underlying network stack? (protocols being defined to set up the network sockets)
At the time I wrote the question I did not understand a few things, and I think I've learned a little while messing with the emulator at the "kernel level". First of all, the emulator tests the "goldfish kernel" (Linux version 2.6.29, ARM architecture) of a "generic" phone brand. It's almost as if the emulator is a type of phone in and of itself, and you cannot mix these image kernels. For example, I tried building a Nexus S crespo phone image with the goldfish kernel (in other words, no crespo kernel), and the phone just "hangs" at the Google splash screen (at least it's not a boot loop).
My research (FINS) worked on this emulator, but did not work on any of the 3 platforms supported on actual hardware: Nexus S, Galaxy Nexus, and Motorola Xoom. I am not sure why, given that Google does not seem to give users the ability to debug at the lowest level of a phone (I'm sure the actual developers use such tools when building and testing these phones). This leads to one major issue, which answers my last follow-up: the Android Debug Bridge depends on the INET protocol. My emulator boots up successfully and runs as I want (no internet, because there is no INET), but the actual phones do NOT. My hypothesis is this: if INET is overridden with a protocol that is empty (in this case FINS, which intends to deal with INET at the userspace level, but this appears to be too late for the phone system to be satisfied), the ADB daemon (perhaps classified as a type of system service) cannot work or be connected to, and Android hardware will crash because of this. I believe the emulator is more flexible than a real phone, as its hardware is virtual and perhaps does not have the same limitations as physical hardware.
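To make the dependency concrete, here is a minimal sketch of the kind of check a userspace service effectively performs when it opens a TCP socket. It is illustrative only: it assumes a Python interpreter on the system under test (not something Android ships by default), and the function name `inet_available` is hypothetical. It does not reproduce what the ADB daemon actually does; it only shows that creating an AF_INET socket fails when the kernel does not provide that protocol family.

```python
import socket


def inet_available():
    """Return True if the kernel lets us create an AF_INET (IPv4) TCP socket."""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    except OSError as exc:
        # On a kernel without the INET family, socket creation is expected
        # to fail here instead of returning a usable socket.
        print(f"AF_INET socket creation failed: {exc}")
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("AF_INET available:", inet_available())
```

A service that cannot obtain such a socket at startup has to handle the failure itself; whether a given Android service does so gracefully is exactly the open question in the hypothesis above.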
You can consult my wiki/documentation (part of my research team's larger site) of my struggle with the Android phone boot process for more details and my various attempts: http://finsframework.org/mediawiki/index.php/Alexander_G._Ororbia_II
If anyone ever figures out how to get a working boot log from a Nexus S, Galaxy Nexus, or Motorola Xoom that gets stuck in a "boot loop" (without ADB), please let me know, as I will be working on this problem for a while to come (and I will update my other Stack Overflow Android questions to reflect this correction). Any corrections to my understanding would also be appreciated.
NOTE: This answer is editable, as I still think there is some way of getting the phone to produce boot logs on the host machine without the ADB daemon.
