Watchdog timeout during call to file.format? - esp8266

This question is entirely unrelated to my code, but to satisfy the obligatory show your code directive:
file.format()
Before the call above returns, on this one SoC I always get a wdt reset. Sometimes but not always the flash does appear to be formatted when the chip is started again. And sometimes if freezes after the wdt reset message, and has to be powered off (looks like wrong comm parameters after pressing hardware reset, but none of the terminal app options seemed to match.)
(Note: since starting this draft I built another copy of my device, using another new, recently received ESP8266-12E, and it behaves identically. Previously built copies still work normally, with the identical firmware.)
So this must be a bad chip, right? Or maybe bad on-board flash? It is a brand new one I just bought. I've also seen file.write issues, with buffer size always 255 bytes or less, though no read issues at all.
One other quirk, after burning a cloud-built nodemcu image to this ESP8266-12E device, adc.read returned 65535 and adc.readvdd33 returned an apparently valid value. (I corrected that by burning esp_init_data_default.bin to 0x3FC000.) This was the first (out of 15, maybe 20) I have seen that was like that. I did not check to see if an older version of nodemcu was already on it.
This wouldn't be the first chip with which I've had issues on arrival; it's at least the 2nd, likely the 3rd or 4th.
So maybe the larger question, what percentage of the ESP8266's that you buy, are either DOA or suffer infant mortality? (Not counting the ones that you have reason to believe were inadvertently killed.)

The problem can be something other than the ESPs, like a inappropriate power supply. I know from my own experience that the Arduino Uno and most USB-TTL converters cannot safely deliver enough current to ESPs. If you're not already, consider using a dedicated power supply circuit that are connected to a USB power source.

It does indeed seem to be a hardware issue, 2 bad out of 6, not good! I think it might be a certain vendor but don't want to name names without being sure... Whatever is wrong with the chip hangs it up long enough to make the watchdog bark.
Much more than the cost of the part, the time consumed figuring out whether it's lua code, firmware, supporting connections, peripherals or the chip itself, is the costly thing (not to mention frustration, and wasted storage on SO.)

Related

STM32 Memory Dump and Extracting Secret Key

I am quite new at embedded development and started with a STM32F429 board to improve myself.
I have just developed a basic Caesar encryption application for my board. It is working well, and defined the secret key as "3". Now I would like to extract this super secret(!) key from my device.
How can I do it? Should I dump the memory or firmware of my device, and how?
May you suggest me any software for this proccess? (Not ST Utility or STM softwares please. Because I would like to try gained experiences on other devices as well.)
Thanks!
I take it the value you're looking for is hardcoded. In that case it resides in the internal flash. So yes, a memory dump will be necessary.
I will go the long way and assume that you know very little about how it works, so if you know some of this stuff, well, good for you. I will try to give a few pointers.
Specifically about STM32:
You have an option to boot the microcontroller from the so-called system memory, which is read-only memory, and it is already preprogrammed from factory with a bootloader. You can talk to that running bootloader via UART (most common way, comes with ST-Link, but any cheapo USB-UART bridge also works). Or it can be some other protocol. You can ask that bootloader to read its flash out to you, among other things. This is covered in AN2606.pdf. It has some useful links in it, such as:
names of documents, where you can find specific bootloader commands for any interface you wish. Of course, you only care about interfaces, that the bootloader of your specific MCU F429 supports, which are found in the same AN2606, page 172 (for bootloader version 0.7, there is also 0.9 for those MCUs, I have no idea how to tell which one you have, so...try? UART configuration seems identical anyway):
So what exactly needs to be done? Flip the state of BOOT0 pin - permanently - of the MCU and reset it (power cycle or reset pin, both ok). You will boot the MCU into bootloader instead of booting program from flash. You can read about it in the Reference manual STM32F429, page 69. It talks about states of BOOT0 and BOOT1 pins on boot. What pins are boot0 - if they're not marked on your board, then you'll have to consult F429 datasheet, page 69 (I swear, it's a coincidence). Depending on your specific IC, it will be one pin or another.
It will activate all MCUs peripherals as per docs above and it and wait on its UART and other pins for commands. Commands listed in the documents I provided above. Let's take a look at AN3155 about USART of bootloader:
And the commands are
are all in that document, the table of contents in pdf really helps to find stuff quickly. Of course, if you need specific details, and you will need specific information about specific commands, it's all in there too. How many bytes in command, how many bytes at a time you read from flash etc. Basically, you can either write your own program that does that (even program another microcontroller to program that microcontroller using victim's bootloader), or use any other software that knows what commands to send to the bootloader. It can be ST utility, it can be any other program. They all implement the very same command set, so it doesn't actually matter much. I couldn't find many programs that do that, the only thing that stood out was stm32flash. Never used it myself. I'm ok with ST stuff, since I know what it does (I think).
Oh yeah back to getting the secret value out. I almost forgot about that. Well, then you open the dump in hex viewer/editor, and scroll it around looking for interesting combination of values. Yeah, that's kinda what it looks like. One can run it via disassembly. Scroll disassembled code around, see if there are any numeric values that stand out. You know, some random number 0xD35B581 or something hardcoded in the middle of pretty program could mean something, like be a serial number or a secret number. Unfortunately, I'm reaching boundaries of my competence here, so won't go any further on what one can do with dump.

Multiple boards Windows Device Manager prompts resource conflict, error code is 12

I inserted multiple same boards into the system. Device PCIe is implemented with Xilinx IP core. After each FPGA program is programed, manually refresh the device manager to check whether the device and driver are working properly.
My confusion is that it seems that this approach can only work on two boards at the same time. After the third board is programmed , and then refresh the task manager, the system prompts insufficient resources, "This device cannot find enough free resources that it can use (Code 12)"
I tried to disable the other two boards, but the device still prompted conflicts. I don't know how to query conflicting resources.
My boards have 2 BARs(BAR0: 2KB, BAR1:16MB), and 1 IRQ.
I did a few experiments and felt that it was caused by the conflict of memory resources. The conflicting party was the AMD integrated graphics card that came with the motherboard.
1st, 2nd, 3rd all indicate my board number.
3rd is not recognized because of conflicts.
After I shut it down, I plugged in all three boards and then turned it on again. As a result, during the startup process, the system restarted again after a sudden power failure, and then all three boards were normal. At this time, it is found that the memory address of the graphics card has changed
I want to know how to resolve the conflict? modify my driver code or FPGA configuration?
PCIe enumeration should resolve memory allocations issues, however there are a couple implementation issues to be aware of. Case in point, I have used Xilinx XDMA's with a 64 bit BAR of 2GB in size and I have literally bricked a DELL XPS motherboard. But I have done the same with an IBM system and it just worked. The point here is that enumeration can be done with Firmware, Hardware or OS driven event. If you doing hardware manager, that sounds like OS driven, but when I toasted the XPS board, that was some kind of FW issue that was related to BAR size that resulted in a permanent failure. 16MB isn't big and it should not be a problem, but I would recommend going with the Xilinx defaults first and show reliability there. I think it's one BAR at 1M. I have run 3x 64 bit BARS at 1MB a piece without issue, but keep it simple and show reliability. Then move up. This will help isolate if it is a system flakiness or not.
I have seen some system use FW based enumeration that comes up really fast, before the FPGA has configured, in which case there is no PCIe target to ID. If you frequently find that your FPGA is not detected on power up, but detected on a rescan, this can be a symptom. How to resolve this is a bit of a pain. We ended up using partial reconfiguration. Start with the PCIe interface, then have a reconfig to load the remaining image. Let's hope it isn't this problem
The next thing is to be aware of is your reset mechanism within the FPGA. You probably hooked the PCIe IP reset to the bus reset which is great, however, I have in the past also hooked that reset to internal PLL locked signals that may not be up. For troubleshooting purposes, keep that reset simple and get rid of everything else to show that just the PCIe IP by itself is reliable first.
You also have to be careful here too. If you strip things down, make sure it is clean. If you ignore the PLL lock and try to use a Xilinx driver, such as the XDMA driver, it has a routine where it tries to identify the XDMA with data transactions. It is looking for the DMA BAR one BAR at a time. But when it does this, the transaction it attempts may go out on the AXI bus if the BAR isn't the XDMA control BAR. If the AXI bus isn't out of reset or clocked when this happens you will lock the AXI bus and I have locked up a Linux box this way on several occasions. AXI requires that transactions complete, otherwise it just sits there waiting.
BTW, on a Linux box, you can look at the enumeration output in the kernel log. I'm not sure if Windows shows you the same thing or not. But this can be helpful if you see that the device was initially probed but then something invalid was detected in the config register, versus not being seen at all.
So a couple things to look at.

DSP on Beaglebone

I have a Beaglebone running Ubuntu. We want to continuously sample from 3 on-board ATD converters at 100KS/s, and every window of samples we will run a cross correlation DSP algorithm. Once we find a correlation value above a threshold, we will send the value to a PC.
My concern is the process scheduling in Ubuntu. If our process gets swapped out and an ATD sample becomes available during this time, the process will miss the sample. We need to ensure that our process will capture every sample and save it in memory.
With this being said, is there a way to trigger interrupts on the Beaglebone so that if an ATD sample is ready, the sample will be saved in the memory of our program even if the program does not have the processor at the time?
Thanks!
You might be able to trigger the EDMA or use the PRUSS. Probably best to ask on beagleboard#googlegroups.com. There isn't a DSP per-se on the BeagleBone.
This is not exactly an answer to your question, but hopefully it explains how the process works. Since you didn't mention what hardware you are running for AD conversion, maybe this is the best that can be done:
With audio hardware, which faces the same problem, the solution comes from the hardware and the drivers working together: whenever the hardware has filled up enough of the buffer it signals the driver (via an interrupt or some similar mechanism). In some cases, it's also possible that the driver polls the hardware or something like that, but that's a less efficient solution, and I'm not sure anyone does it that way anymore (maybe on cheaper hardware?). From there, the driver process may call right into the end-user process, or it may simply mark the relevant end-user process as "runnable". Either way, control needs to be transferred to the end user process.
For that to happen, the end user process must be running at a higher priority than anything else occupying the CPUs at that moment. To guarantee that your process will always be first in the queue, you can run it at a high priority, with the appropriate permissions, you can even run in very high priorities.
The time it takes for the top priority process to go from runnable to running is sometimes called the "latency" of the OS, though I am sure there's a more specific technical term. The latency of Linux is on the order of 1 ms, but since it's not a "hard" real-time OS, this is not a guarantee. If this is too long to handle your chunks of data, you may have to buffer some of it in your driver.

Trouble reading memory

When I run my code through the debugger, after a series of steps it eventually gets lost and executes commands out of order. I'm not sure if the stack is overflowing or what.
This is the error I usually get:
MSP430: Trouble Reading Memory Block at 0xffe2e on Page 0 of Length 0x1d2: Invalid parameter(s)
Any suggestions on what it could be? I read briefly about possible issues with not handling some interrupts.
Also, I'm trying to fill my RAM with a specific value so that I can tell if the stack is overflowing, any suggestions on how to fill the entire RAM with, say a value of 0x1234?
Thanks!
What debugger and compiler are you using? I've found that msp430-gcc and msp430-gdb/gdbproxy can get very confused with GCC optimizations turned on. However, broken code is sometimes is emitted without them turned on (its a quality product, really).
The easiest way to fill memory is to modify you crt0.s startup file and link it yourself. When memory is set to 0, you can change the pattern there.
Which device are you using? On 16-bit devices, 0xffe2e is outside of the address space of the processor, likely an array index or similar which has gone negative.
I have seen this error as well when using code composer studio and TI's USBFET programmer although I have not been able to nail down a single, definite cause.
Assuming you are using CCS, here are some tips:
1) Catch ACCV (UNMI) and VMA (SYSNMI) interrupts and set a break point within the handlers. If one of these trips, examine the stack for clues as to what triggered the interrupt.
2) If you have any interrupt handlers which re-enable interrupts (GIE bit), make sure they are not being retriggered repeatedly.
3) I have seen this error (inexplicably) when stepping through optimized code; so it may help to turn off optimizations.
If you are using code composer studio, as an alternative to initializing your RAM, you can set a breakpoint on stack overflow. Also, with a paused debug session, CCS gives you the option to fill a portion of memory with any value you choose via the "Memory" sub-window.

Hunting down EOutOfResources

Question:
Is there an easy way to get a list of types of resources that leak in a running application? IOW by connecting to an application ?
I know memproof can do it, but it slows down so much that the application won't even last a minute. Most taskmanager likes can show the number, but not the type.
It is not a problem that the check itself is catastrophic (halts the app process), since I can check with a taskmgr if I'm getting close (or at least I hope)
Any other insights on resource leak hunting (so not memory) is also welcomed.
Background:
I've an Delphi 7/2006/2009 app (compiles with all three) and after about a few week it starts acting funny. However only on one of the places it runs, on several other systems it runs till the power goes out.
I've tried to put in some debug code to narrow the problem down. and found out that the exception is EOutofResources on a save of a file. (the file save can happen thousands of times a day).
I have tried to reason out memory leaks (with fastmm), but since the dataflow is quite high (60MByte/s from gigabit industrial camera's), I can only rule out "creeping" memory leaks with fastmm, not quick flashes of memoryleaks that exhaust memory around the time it happens. If something goes wrong, the app fills memory in under half a minute.,
Main suspects are filehandles that are somehow left on some error and TMetafiles (which are streamed to these files). Minor suspects are VST, popupmenu and tframes
Updates:
Another possible tip: It ran fine for two years with D7, and now the problems are with Turbo Explorer (which I use for stable projects not converted to D2009 ).
Paul-Jan: Since it only happens once a week (and that can happen at night), information acquisition is slow. Which is why I ask this question, need to combine stuff for when I'm there thursday. In short: no I don't know 100% sure. I intend to bring the entire Systemtools collection to see if I can find something (because then it will be running for days). There is also a chance that I see open files. (maybe should try to find some mingw lsof and schedule it)
But the app sees very little GUI action (it is an machine vision inspection app), except screen refresh +/- 15/s which is tbitmap stretchdraw + tmetafile, but I get this error when saving to disk (TFileStream) handles are probably really exhausted. However in the same stream, TMetafile is also savetostreamed, something which later apps don't have anymore, and they can run from months.
------------------- UPDATE
I've searched and searched and searched, and managed to reproduce the problems in-vitro two or three times. The problems happened when memusage was +/- 256MB (the systems have 2GB), user objects 200, gdi objects 500, not one file more open than expected ).
This is not really exceptional. I do notice that I leak small amounts of handles, probably due to reparenting frames (something in the VCL seems to leak HPalette's), but I suspect the core cause is a different problem. I reuse TMetafile, and .clear it inbetween. I think clearing the metafile doesn't really (always?) resize the resource, eventually each metafile in the entire pool of tmetafile at maximum size, and with 20-40+ tmetafiles (which can be several 100ks each) this will hit the desktop heap limit.
That's theory, but I'll try to verify this by setting the desktop limit to 10MB at the customers, but it will be several weeks before I have confirmation if this changes anything. This theory also confirms why this machine is special (it's possible that this machine naturally has slightly larger metafiles on average). Occasionally freeing and recreating a tmetafile in the pool might also help.
Luckily all these problems (both tmetafile and reparenting) have already been designed out in newer generations of the apps.
Due to the special circumstances (and the fact that I have very limited test windows), this is going to be a while, but I decided to accept the desktop heap as an example for now (though the GDILeaks stuff was also somewhat useful).
Another thing that the audit revealed GDI-types usage in a thread (though only saving tmetafiles (that weren't used or connected otherwise) to streams.
------------- Update 2.
Increasing the desktop limit only seemed to minorly increase the time till the problem occurred.
Unfortunately, I won't be able to follow up on this further, since the machines were updated to a newer version of the framework that doesn't have the problem.
In summary I can only state what the three core modifications were going from the old to the new framework:
I no longer change screens by reparenting frames. I now work with forms that I hide and show. I changed this since I also had very rare crashes or exceptions (that could be clicked away) due to this. The crashes were all while operating the GUI though, not spontaneously like the main problem
The routine where the crash happened dealt with TMetafile. TMetafile has been designed out, and replace by a simpler own made format. (basically arrays with Opengl vertices)
Drawing no longer happened with tbitmap with a tmetafile overlay strechdrawn over it, but using OpenGL.
Of course it could be something else too, that got changed in the rewrite of the above parts, fixing some very nasty detail bug. It would have to be an extremely bad one, since I analysed the above system as much as I could.
Updated nov 2012 after some private mail discussion: In retrospect, the next step would have been adding a counter to the metafiles objects, and simply reinstantiate them every x * 1000 uses or so, and see if that changes anything. If you have similar problems, try to see if you can somewhat regularly destroy and reinitialize long living resources that are dynamically allocated.
There is a slim chance that the error is misleading. The VCL naively reports EOutOfResources if it is unable to obtain a DC for a window (see TWinControl.GetDeviceContext in Controls.pas).
I say "naively" because there are other reasons why GetDC() might return a NULL handle and the VCL should report the OS error, not assume an out of resources condition (there is a Windows version check required for this to be reliably possible, but the VCL could and should take of that too).
I had a situation where I was getting the EOutOfResources error as the result of a window handle becoming invalid. Once I'd discovered the true problem, finding the cause and fixing it was simple, but I wasted many, many hours trying to find a non-existent resource leak.
If possible I would examine the stack trace leading to this exception - if it is coming from TWinControl.GetDeviceContext then the problem may not be what you think (it's impossible to say what it might be of course, but eliminating the impossible is always the first step toward discovering the solution, no matter how improbable).
If they are GDI handle leaks you can have a look at MSDN Magazine January 2003 which uses the tool GDILeaks. Other tools are GDIObj or GDIView. Also see here.
Another source of EOutOfResources could be that the Desktop Heap is full. I've had that issue on busy terminal servers with large screens.
If there are lots of file handles you are leaking you could check out Process Explorer and have a look at the open file handles of your process and see any out of the ordinary. Or use WinDbg with the !htrace command.
I've run into this problem before. From what I've been able to tell, Delphi may throw an EOutOfResources any time the Windows API returns ERROR_NOT_ENOUGH_MEMORY, and (as the other answers here discuss) Windows may return ERROR_NOT_ENOUGH_MEMORY for a variety of conditions.
In my case, EOutOfResources was being caused by a TBitmap - in particular, TBitmap's call to CreateCompatibleBitmap, which it uses with its default PixelFormat of pfDevice. Apparently Windows may enforce fairly strict systemwide limits on the memory available for device-dependent bitmaps (see, e.g, this discussion), even if your system otherwise has plenty of memory and plenty of GDI resources. (These systemwide limits are apparently because Windows may allocate device-dependent bitmaps in the video card's memory.)
The solution is simply to use device-independent bitmaps (DIBs) instead (although these may not offer quite as good of a performance). To do this in Delphi, set TBitmap.PixelFormat to anything other than pfDevice. This KB article describes how to pick the optimal DIB format for a device, although I generally just use pf32Bit instead of trying to determine the optimal format for each of the monitors the application is displayed on.
Most of the times I saw EOutOfResources, it was some sort of handle leak.
Did you try something like MadExcept?
--jeroen
"I've tried to put in some debug code to narrow the problem down. and found out that the exception is EOutofResources on a save of a file. (the file save can happen thousands of times a day)."
I'm shooting in the dark here, but could it be that you're using the Windows API to (GetTempFileName) create a temp file and you're blowing out some file system indexes or forgetting to close a file handle?
Either way, I do agree that with your supposition about it being a file handle problem. That seems to be the most likely thing given your symptoms and diagnosis.
Also try to check handle count for the application with Process Explorer from SysInternals. Handle leaks can be very dangerous and they build slowly through time.
I am currently having this problem, in software that is clearly not leaking any handles in my own code, so if there are leaks they could be happening in a component's source code or the VCL sourcecode itself.
The handle count and GDI and user object counts are not increasing, nor is anything being created. Deltic's answer shows corner cases where the message is kind of a red-herring, and Allen suggests that even a file write can cause this error.
So far, The best strategy I have found for hunting them down is to use either JCL JCLDEBUG stack tracebacks, or the exception report save features in MadExcept to generate the context information to find out what is actually failing.
Secondly, AQTime contains many tools to help you, including a resource profiler that can keep the links between where the code that created the resources is, and how it was called, along with counts of the total numbers of handles. It can grab results MID RUN and so it is not limited to detecting unfreed resources after you exit. So, run AQTime, do a results capture in mid run, wait several hours, and capture again, and you should have two points in time to compare handle counts. Just in case it is the obvious thing. But as Deltics wisely points out, this exception class is raised in cases where it probably shouldn't have been.
I spent all of today chasing this issue down. I found plenty of helpful resources pointing me in the direction of GDI, with the fact that I'm using GDI+ to produce high-speed animations directly onto the main form via timer/invalidate/onpaint (animation performed in separate thread). I also have a panel in this form with some dynamically created controls for the user to make changes to the animation.
It was extremely random and spontaneous. It wouldn't break anywhere in my code, and when the error dialog appeared, the animation on the main form would continue to work. At one point, two of these errors popped up at the same time (as opposed to sequential).
I carefully observed my code and made sure I wasn't leaking any handles related to GDI. In fact, my entire application tends to keep less than 300 handles, according to Task Manager. Regardless, this error would randomly pop up. And it would always correspond with the simplest UI related action, such as just moving the mouse over a standard VCL control.
Solution
I believe I have solved it by changing the logic to performing the drawing within a custom control, rather than directly to the main form as I had been doing before. I think the fact that I was rapidly drawing on the same form canvas which shared other controls, somehow they interfered. Now that it has its own dedicated canvas to draw on, it seems to be perfectly fixed.
That is with about 1 hour of vigorous testing at least.
[Fingers crossed]

Resources