Hardware issue? zfs scrub always repairs - memory

Hello Stackoverflowers,
I have a curious problem with an old Tyan server motherboard and ZFS.
In short, I can run zfs scrub every hour; it always repairs checksums, with no further errors.
I ran memtest86 all night long with no errors (16GB ECC memory).
I ran smartctl -t long /dev/ada{0,1,2}, which showed no errors either.
But scrubbing keeps showing checksum errors.
Thanks for any clue
Xav

This means that either a) you're writing bad sectors to the disk, or b) you're reading bad sectors back. If it's a small number of sectors being corrected each time, my experience is that it's a bad controller or driver.
That is all assuming you don't get console errors.
Reasoning? Well ... if it's the drives, they generally are smart enough to report their errors (at least most of them). If it's the cables, generally you'll be getting checksum errors from the driver on your console. You've mostly eliminated memory, so... You're left with controllers and drivers.
Luckily with ZFS, you can "try" the drives in another machine without too much hassle, usually.
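If you want to narrow it down before moving hardware, here is a small sketch of the kind of check I mean (the pool name "tank" and the commands' exact output are just examples, not from the question):

# See which devices accumulate read/write/checksum errors, and which files were hit
zpool status -v tank

# Reset the counters, scrub again, and watch whether new errors follow one controller
zpool clear tank
zpool scrub tank
zpool status -v tank

And if you do move the disks (or just the cables) to another controller or machine, ZFS finds the pool by its on-disk labels rather than by device paths, so something like this usually works:

# On the old setup, before powering down
zpool export tank

# After reconnecting the disks elsewhere
zpool import          # lists pools found on the attached disks
zpool import tank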

Thanks,
I suspect the controller too, especially since it didn't want to load its BIOS on the last reboot (two Adaptec 1210 cards).
Is ZFS smart enough to recognise the pool if I blindly move the cables to the motherboard's controller?

Related

Do I need to worry about corrupt memory in an otherwise correct program?

We're working on an application meant to run on an embedded system, in a moderately harsh environment (a controller for a heating system in a residential building).
That application should run for years without needing to reboot the system. It runs on an embedded PC running Linux. The program instantiates several classes whose lifetime is the same as the application's.
Should I worry about memory becoming corrupt over such a long lifetime? Does it make sense to periodically check the class invariants to detect any such memory corruption? Or does modern hardware make such corruption astronomically unlikely?
I have seen my share of cheap SD cards on boards; they can die on you easily.
A few months ago I was dealing with one maker: under high data throughput the SD card was unable to react in time. Some IRQ failure messages popped up and the whole partition blew up.
If it's not intended for mass production, I would definitely suggest you choose some good, well-regarded storage.
But really, I cannot remember memory corruption issues (besides ROM); I would worry about memory leaks instead. Those are the nastiest problems for an embedded system intended to run for a long time without a reboot.
You have to be really careful: leaks can happen either in user space or in kernel space. Even software you have always had confidence in may leak, depending on the build version. Choose your Linux distribution carefully; if there is no dedicated kernel development team, this work is usually outsourced to companies that build stable systems, where every included package is tested and confirmed not to leak.
In the end, a few cycles of stress testing are definitely needed; if there are problems with memory, you will notice.
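A minimal sketch of what such a stress/leak check could look like, assuming a Linux userspace process (the binary name heating_controller is just a placeholder):

# One-off leak hunt while driving the system hard (needs valgrind, so typically on a dev build)
valgrind --leak-check=full ./heating_controller

# Long-running check: log the resident set size hourly and watch for steady growth
while true; do
    echo "$(date -Is) $(ps -o rss= -p "$(pidof heating_controller)")" >> /var/log/rss.log
    sleep 3600
done

Steadily climbing RSS over days is usually a better leak indicator on an embedded box than any single snapshot.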

What determines the mosquitto.db file size limit

I am new to mosquitto and have a few questions I hope you all can help me with:
1. What determines the size limit of the persistence file in mosquitto? Is it the system memory or disk space?
2. What happens when the persistence file gets larger than the limit size? Can I transfer it to another server for temporary storage?
3. How would mosquitto use the transferred file to publish messages when it restarts?
I appreciate any feedback.
Thanks,
1. Probably a combination of the filesystem's maximum file size and system/process memory, whichever is smaller. But I would expect the performance problems that would become apparent before you reached those limits to be the bigger issue.
2. Mosquitto probably crashes. If mosquitto exceeds the system/process memory limits, it's going to get killed by the OS or crash outright. I doubt there would be any benefit to moving the file to a different machine: if mosquitto crashes from hitting either of those limits, the file is likely to be corrupted and unable to be read back in, even if you restart on the same machine.
3. See answer 2.
In reality you should never come close to these limits; having that many in-flight messages means there are some very serious issues with the design of your whole system.
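If the goal is to keep mosquitto.db bounded in the first place, here is a sketch of the relevant broker settings (the option names are from the standard mosquitto.conf; the values and paths are just examples):

# /etc/mosquitto/mosquitto.conf (excerpt)
persistence true
persistence_location /var/lib/mosquitto/
# write the in-memory database to disk every 30 minutes, not only on shutdown
autosave_interval 1800
# cap the QoS 1/2 messages queued per client beyond those currently in flight
max_queued_messages 1000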

Digital Ocean server memory usage above 50%

I am deploying a Flask-based website on a DigitalOcean server. The deployed site is mainly static pages, config files and JSON files.
This morning I found the memory usage had exceeded 51%. Here is the snapshot.
My server has 512MB of memory. Would someone please instruct me how to lower the memory usage? Thanks so much!
Update: I've used the "top" command in the shell as suggested. Here is the snapshot; does it mean the server itself is eating up that memory?
The memory issue is not related to my application.
I just received the answer from Digital Ocean. Here it is:
Hi there!
Thank you for contacting us! We can help with any memory issues you're having!
Since the Droplet is set up with only 512MB of RAM, once the system and any installed services start, it doesn't take much to push it past 50%. As a result, I don't think what you're seeing is necessarily abnormal under the circumstances. This leaves a few options: the Droplet can be resized and made larger to provide more memory (see https://www.digitalocean.com/community/tutorials/how-to-resize-your-droplets-on-digitalocean), you can add swap space to use part of the Droplet's file system as RAM (see https://www.digitalocean.com/community/tutorials/how-to-add-swap-on-ubuntu-14-04), or you can review the applications and services running on the Droplet and attempt to optimize them to reduce memory use.
We hope this is helpful! Please let us know if there is anything else we can do!
Regards,
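If you do go the swap route mentioned above, here is a minimal sketch for a small droplet (a 1GB swap file; the size and path are just examples):

sudo fallocate -l 1G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# make it survive reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Keep in mind that heavy swapping will hurt performance; it mostly buys headroom against the OOM killer.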
I am assuming you are running a Linux server. If so, you can use the top command. It shows you all of the running processes and the system resources they are using. You would then be able to optimize from there.
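For example, to see which processes are actually holding memory (assuming GNU ps):

ps aux --sort=-%mem | head -n 10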
I found out the cause! Linux borrows unused memory for disk caching. This makes it look like you are low on memory, but you are not! Everything is fine! If your application, or any other process needs more memory, Linux will automatically clear the cache and give memory for your application. Linux does this to speed up the system for you.
If, however, you find yourself needing to clear some RAM quickly to workaround another issue, like a VM misbehaving, you can force Linux to nondestructively drop caches using:
echo 3 | sudo tee /proc/sys/vm/drop_caches
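To see how much of the "used" memory is really just cache, free is enough (on newer procps the "available" column is what applications can actually claim):

free -h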

How much free memory does Redis need to run?

I'm pretty sure at this stage that Redis needs a certain amount of free memory on the OS in order to run. In the past few weeks, I've seen Redis (Linux) run out of memory with a couple of gigabytes of RAM still free, and on Windows, it refuses to start when you are using a lot of memory on the system but still have a bunch left free, as in the screenshot below.
The error on Windows gives a hint as to why this is happening (although I'm not assuming it's the same on Linux). However, my question is more generic. How much free memory does Redis need in order to operate?
Redis requires RAM between 2x and 3x the size of your data. The maxheap flag is Windows-specific.
According to the Redis FAQ, without a specific Linux configuration it might need 2x the memory of your dataset. From the document:
Short answer: echo 1 > /proc/sys/vm/overcommit_memory :)
With this configuration, the forked process (responsible for saving the dataset to disk) will be able to share memory pages more easily with the original process, so it won't need that much memory.
You can read more about this here: https://redis.io/topics/faq#background-saving-fails-with-a-fork-error-under-linux-even-if-i-have-a-lot-of-free-ram
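If you want that setting to persist across reboots rather than re-running the echo, a small sketch (the file name under /etc/sysctl.d/ is just an example):

# apply now
sudo sysctl -w vm.overcommit_memory=1

# persist across reboots
echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/99-redis-overcommit.conf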

Windows Mobile memory corruption

Does the WM operating system protect processes' memory from one another?
Can one badly written application crash some other application just by mistakenly writing over the other one's memory?
Windows Mobile, at least in all current incarnations, is built on Windows CE 5.0 and therefore uses CE 5.0's memory model (which is the same as it was in CE 3.0). The OS doesn't actually do a lot to protect process memory, but it does enough to generally keep processes from interfering with one another. It's not hard and fast, though.
CE processes run in "slots", of which there are 32. The currently running process gets swapped to slot zero, and its addresses are re-based to zero (so all memory in the running process effectively has two addresses: the slot 0 address and its non-zero slot address). These addresses are protected (though there's a simple API call to cross the boundary). This means that pointer corruption, etc. will not step on other apps, but if you want to, you still can.
CE also has the concept of shared memory. All processes have access to this area and it is 100% unprotected. Your app may be using shared memory without specifically asking for it: the memory manager can give you a shared address depending on your allocation and its size. If you have shared memory then yes, any process can access that data, including corrupting it, and you will get no error or warning in either process.
Does the WM operating system protect processes' memory from one another?
Yes.
Can one badly written application crash some other application just by mistakenly writing over the other one's memory?
No (but it might do other things like use up all the 'disk' space).
Even if you're a device driver, to get permission to write to memory that's owned by a different process there's an API which you must invoke explicitly.
While ChrisW's answer is technically correct, my experience of Windows Mobile is that it is much easier to crash the entire device from an application than it is on the desktop. I could guess at a few reasons why this is the case:
The operating system is often much more heavily OEMed than desktop Windows; that is, the amount of manufacturer-specific low-level code can be very high, which leads to manufacturer-specific bugs at a level that can cause bad crashes. On many devices it is common to see a new firmware revision every month or so, where the revisions are fixes for such bugs.
Resources are scarcer, and an application that exhausts all available resources is liable to cause a crash.
The protection mechanisms and architecture vary quite a bit. The device I'm currently working with is SH4-based, while you mostly see ARM, x86 and the odd MIPS CPU.
