Software memory bit-flip detection for platforms without ECC

Software memory bit-flip detection for platforms without ECC - memory

Most available desktop (cheap) x86 platforms now still nave no ECC memory support (Error Checking & Correction). But the rate of memory bit-flip errors is still growing (not the best SO thread, Large scale CERN 2007 study "Data integrity": "Bit Error Rate of 10-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; 2009 Google's "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware with data-intensive load (8 GB/s of reading) this means that single bit flip may occur every minute (10-12 vendors BER from CERN07) or once in two days (10-16 BER from CERN07). Google09 says that there can be up to 25000-75000 one-bit FIT per Mbit (failures in time per billion hours), which is equal to 1 - 5 bit errors per hour for 8GB of RAM ("mean correctable error rates of 2000–6000 per GB per year").
So, I want to know, is it possible to add some kind of software error detection in system-wide manner (check both user and kernel memory). For example, create a patch for Linux kernel and/or to system compiler to add some checksumming of every memory page, and try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?
For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
I also understand that better way of data protection from memory bitflips is to switch to ECC hardware, but most PC there are still non-ECC.

The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect if they have ECC modules and complain (or print a warning) when they don't.
http://www.cyberciti.biz/faq/ecc-memory-modules/
For example, can we see all writes to memory (both from user and kernel space), to distinguish between intended memory changes from in-memory bit flips? Or can we somehow instrument all codes with some helper?
Er, you you will never "see" the bit-flips on the bus. They are literally caused by a particle hitting RAM, flipping a bit. Only much later can you notice that you read out something different than your wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. create a shadow copy of what is in your real RAM, so you can verify every read returns what was written to that location.)
try to detect silent memory corruptions (bit-flips) by regular recomputing of checksums?
The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.
If "recompute checksums" only works when you are NOT writing to the memory. That might be "good enough" but you'll need to figure out which pages are not being written to.
To catch 100% of the errors, every write must be pre-ceeded by computing the checksum of that block of memory, then comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write and then update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower) performance.
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
Well, there is a simple method to detect 100% of the errors, at a cost of 50% performance: Just run the computation on 2 boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid.) If the results differ, you have detected an error.
See also:
https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/

The answer to the question is yes, and a proof for that is the software SoftECC posted in the comments!
Just a note that SoftECC is a kernel level solution. If a user-land app is used, it will be a third stage of redundancy, that seems not necessary.

Related

Memory allocation during startup of a C++ program

I sometimes read that the code segment is placed into ROM/FLASH. Others state that it is also loaded into RAM.
Is my understanding correct that it is common to place it into FLASH primary memory in case of an embedded system? And what are the advantages? I assume the startup of the program will be faster but since the FLASH memory is much slower it would be better to additionally load it from FLASH to RAM during the startup phase when RAM usage does not matter?

Sometimes you don't have enough memory for your program, so you just leave it in ROM or flash. On a system flush with memory you just load everything into RAM, it's much faster.
Some embedded CPUs have 2K of memory, but 2MB of flash. As an example the RP2040 has 264KB of SRAM (RAM) but 2MB of Flash memory for your programs. That's a lot bigger than the memory footprint.
Flash is slow compared to modern DRAM but in an embedded environment the CPU isn't always that fast either. The RP2040 only runs at 133MHz, so it won't notice the difference between flash latency and SRAM latency like a chip running in the 2GHz range might. It's clocked 15x slower.
If you want to explore this more, embedded CPUs like the RP2040 are really cheap, some less than $1, so you can experiment on them and see how it plays out in real life without having to spend much money at all.

Generally, RAM is much faster than flash. Where to run the code from however, depends on the system. On most traditional embedded systems, you don't execute from RAM.
On low-end embedded systems (8 and 16-bitters) you always keep all code in flash and there won't be a performance difference between executing from RAM or flash. Such systems typically don't have a MMU nor protection against writing to the code area, so running code from RAM is highly dangerous since bugs can write straight into physical memory. Also, these systems tend to have very limited RAM.
On mid-range embedded systems (Cortex M etc) where you start to clock the core faster than the flash can keep up with, you need to introduce wait states, where the CPU waits for the flash to read. Typically you need wait states when you go beyond somewhere around 40-50MHz system clock on modern systems. The higher the clock, the more wait states you need.
Such systems do not typically execute code from RAM either, since they usually don't need extreme performance. And they typically don't have a lot of RAM either. In some cases like mid-range Power PC, you'll have instruction cache, which helps a lot in compensating for the slower flash, since instructions can be pre-loaded from flash to cache by branch prediction.
On high-end systems (Cortex A, x86 etc) there will be lots of RAM available for the purpose of executing the code from there and then you are expected to do so. On these systems, cache rather serves the purpose of speeding up access to RAM.
Historically, RAM was also much more prone to electromagnetic interference and could also lose charge over time unless you kept writing to the cells, so you didn't want to keep code in RAM for those reasons alone. That's not much of an issue today though.

ESP32: Best way to store data frequently?

I'm developing a C++ application in the ESP32-DevKitC board where I sense acceleration from an accelerometer. The application goal is to store the accelerometer data until storage is full and then send all the data through WiFi and start all again. The micro also goes to deep-sleep mode when is possible.
I'm currently using the ESP32 NVS library which is very well documented and pretty easy to use. The negative side of this is that the library uses Flash memory, therefore a lot of writings will end up degrading the drive.
I know that Espressif also offers some other storage libraries (FAT, SPIFFS, etc.) but, as far as I know (correct me if I'm wrong), they all use Flash drive.
Is there any other possibility of doing what I want to but without using the Flash storage?
Aclarations
Using Flash memory is not the problem itself, but degrading it.
Storage has to be non volatile or at least not being erased when the micro goes to deep-sleep mode.
I'm not using any Arduino library.

That's a great question that I wish more people would ask.
ESP32s use NOR flash storage, which is usually rated for between 10,000 to 100,000 write cycles (100,000 seems to be the standard these days). Flash can't write single bytes; instead of writes a "page" of bytes, which I believe is 256 bytes. So each 256 byte page is rated for at least 100,000 cycles. When a device is rated for 100,000 cycles it's likely to be usable for at least 10 times that, but the manufacturer is not going to make any promises beyond the 100,000.
SPIFFS (and LittleFS, now used on the ESP8266 Arduino Core) perform "wear leveling", to minimize the number of times a particular page is written. So if you modify the same section of a file repeatedly, it will automatically be written to different pages of flash. FAT is not designed to work well with flash storage; I would avoid it.
Whether SPIFFS with wear leveling will be adequate for your needs depends on your needed lifetime of the device versus how much data you'll be writing and how frequently.
NVS may perform some level of wear levelling, to an extent I'm unsure about. Here, in a forum post with 2 ESP employees, they both confirm that NVS does do some form of wear levelling. NVS is best used to persist things like configuration information that doesn't change frequently. It's not a great choice for storing information that's updated often.
You mentioned that the data just needs to survive deep sleep. If that's the case, your best option (if it's large enough) is to use the ESP32's RTC static RAM. This chunk of memory will survive restarts and deep sleep mode, but will lose its state if power is interrupted. It's real RAM so you won't wear it out by writing to it frequently, and it doesn't cost a lot of energy to write to. The catch is there's only 8KB of it.
If the 8KB of RTC RAM isn't enough and you're writing too much data too frequently to trust that SPIFFS will be okay, your best bet would be an SD card. The ESP32 can talk to an SD card adapter. SD cards use NAND flash, which has a much greater lifespan than NOR and can be safely overwritten many more times (which is why these kinds of cards are usable for filesystems in devices like Raspberry Pis).
Writing to flash also takes much more energy than writing to regular RAM. If your device is going to be battery powered, the RTC RAM is also a better choice than SPIFFS or an SD card from a power savings perspective.
Finally, if you use the RTC RAM I'd recommend starting to write it over wifi before it's full, as bringing up wifi and transmitting the data could easily take long enough that you might run out of space for some samples. Using it as a ring buffer and starting the transmit process when you hit a high water mark rather than when the buffer is full would probably be your best bet.

I know i'm late with this answer but you can buy ESP32 modules with external RAM even with 4-8mb. External ram is really fast ( at least much faster than the flash, it uses SPI interface to communicate ) and you can fit a lot of sensor readings in there.
I'm using an ESP32_WROVER_E module with 8mb external ram ( 4mb is usable with normal function calls ) and 16mb flash.
Here is a link of the module that i'm using at TME's site.

Should a process always consume the same amount of memory if executed in the same way?

Hi folks and thanks for your time in advance.
I'm currently extending our C# test framework to monitor the memory consumed by our application. The intention being that a bug is potentially raised if the memory consumption significantly jumps on a new build as resources are always tight.
I'm using System.Diagnostics.Process.GetProcessByName and then checking the PrivateMemorySize64 value.
During developing the new test, when using the same build of the application for consistency, I've seen it consume differing amounts of memory despite supposedly executing exactly the same code.
So my question is, if once an application has launched, fully loaded and in this case in it's idle state, hence in an identical state from run to run, can I expect the private bytes consumed to be identical from run to run?
I need to clarify that I can expect memory usage to be consistent as any degree of varience starts to reduce the effectiveness of the test as a degree of tolerance would need to be introduced, something I'd like to avoid.
So...
1) Should the memory usage be 100% consistent presuming the application is behaving consistenly? This was my expectation.
or
2) Is there is any degree of variance in the private byte usage returned by windows or in the memory it allocates when requested by an app?
Currently, if the answer is memory consumed should be consistent as I was expecteding, the issue lies in our app actually requesting a differing amount of memory.
Many thanks
H

Almost everything in .NET uses the runtime's garbage collector, and when exactly it runs and how much memory it frees depends on a lot of factors, many of which are out of your hands. For example, when another program needs a lot of memory, and you have a lot of collectable memory at hand, the GC might decide to free it now, whereas when your program is the only one running, the GC heuristics might decide it's more efficient to let collectable memory accumulate a bit longer. So, short answer: No, memory usage is not going to be 100% consistent.
OTOH, if you have really big differences between runs (say, a few megabytes on one run vs. half a gigabyte on another), you should get suspicious.

If the program is deterministic (like all embedded programs should be), then yes. In an OS environment you are very unlikely to get the same figures due to memory fragmentation and numerous other factors.
Update:
Just noted this a C# app, so no, but the numbers should be relatively close (+/- 10% or less).

How to find number of memory accesses

Can anybody tell me a unix command that can be used to find the number of memory accesses that took place in a given interval. vmstat, top and sar only give the amount of physical memory space occupied/available .. But do not give the number of memory of accesses in a given interval

If I understand what you're asking, such a feature would almost certainly require hardware support at a very low level (e.g. a counter of some sort that monitors memory bus activity).
I don't think such support is available for the common architectures supported by
Unix or Linux, so I'm going to go out on a limb and say that no such Unix command exists.
The situation is somewhat different when considering memory in units of pages,
because most architectures that support virtual memory have dedicated MMU hardware
which operates at that level of granularity, and can be accessed by the operating
system. But as far as I know, the sorts of counter data you'd get from the MMU would
represent events like page faults, allocations, and releases, rather than individual
reads or writes.

cooperative memory usage across threads?

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue and in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds to several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes to around 2GBs of memory. The memory consumption is my problem. I'm running as a 64bit process on a 8GB machine with 8 parallel threads. If each of them hits a worst case work item at the same time I run out of memory. I'm wondering about the best way to work around this.
plan conservatively and run 4 threads only. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
make each thread check available memory (or rather total allocated memory by all threads) before starting with a new item. Only start when more than 2GB memory are left. Recheck periodically, hoping that other threads will finish their memory hogs and we may start eventually.
try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
more ideas?
I'm currently tending towards number 2 because it seems simple to implement and solve most cases. However, I'm still wondering what standard ways of handling situations like this exist? The operating system must do something very similar on a process level after all...
regards,
Sören

So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption, or increase available memory. Your option 1, running only 4 threads, under-utilitises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.

I have continued the discussion on Herb Sutter's blog and provoced some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören

Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart