Understanding file mapping - memory

I am trying to understand mmap and was given the following link to read:
http://duartes.org/gustavo/blog/post/page-cache-the-affair-between-memory-and-files
I understand the text in general and it makes sense to me. But at the end there is a paragraph that I don't really understand, or that doesn't fit my understanding.
The read-only page table entries shown above do not mean the mapping is read only, they’re merely a kernel trick to share physical memory until the last possible moment. You can see how ‘private’ is a bit of a misnomer until you remember it only applies to updates. A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from. Once copy-on-write is done, changes by others are no longer seen. This behavior is not guaranteed by the kernel, but it’s what you get in x86 and makes sense from an API perspective. By contrast, a shared mapping is simply mapped onto the page cache and that’s it. Updates are visible to other processes and end up in the disk. Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.
The following two lines don't add up for me. I can't make sense of them.
A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from.
It is private. So it can't see changes by others!
Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.
I don't know what the author means by this. Is there a flag "MAP_READ_ONLY"? Until a write occurs, every pointer from the program's virtual pages to the page-table entries in the page cache is read-only.
Can you help me understand these two lines?
Thanks
Update
It seems I got it, with some help.
A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from.
Although the mapping is private, the virtual page really can see changes made by others until the program itself modifies the page. The modification then becomes private and is only visible through the virtual page of the writing program.
Finally, if the mapping above were read-only, page faults would trigger a segmentation fault instead of copy on write.
I'm told that pages themselves can also have permissions (read/write/execute).
Tell me if I'm wrong.

This fragment:
A consequence of this design is that a virtual page that maps a file privately sees changes done to the file by other programs as long as the page has only been read from.
is telling you that the kernel cheats a little bit in the name of optimization. Even though you've asked for a private mapping, the kernel will actually give you a shared one at first. Then, if you write the page, it becomes private.
Observe that this "cheating" doesn't matter (doesn't make any difference) if all processes which are accessing the file are doing it with MAP_PRIVATE, because no actual changes to the file will ever occur in that case. Different processes' mappings will simply be upgraded from "fake cheating MAP_PRIVATE" to true "MAP_PRIVATE" at different times according to whenever each process first writes to the file. This is probably a common scenario. It's only if the file is being concurrently updated by other means (MAP_SHARED with PROT_WRITE or else regular, non-mmap I/O operations) that it makes a difference.
I'm told that pages themselves can also have permissions (read/write/execute).
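Sure, they can. In fact, you have to ask for the permissions you want when you initially map the file: the third argument to mmap is either PROT_NONE or a bitwise OR of PROT_READ, PROT_WRITE, and PROT_EXEC. A mapping created without PROT_WRITE is the "read-only" mapping the article means: a write to it raises a segmentation fault (SIGSEGV) instead of triggering copy-on-write.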
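To make the private/shared difference concrete, here is a minimal sketch using Python's mmap module rather than the C API. It assumes a Unix-like system, where ACCESS_COPY corresponds to MAP_PRIVATE and ACCESS_WRITE to MAP_SHARED; the function name demo_private_vs_shared is made up for this illustration.

```python
import mmap
import os
import tempfile


def demo_private_vs_shared():
    """Write once through a private and once through a shared mapping.

    Returns (first two bytes of the file, first two bytes of the private view).
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"A" * mmap.PAGESIZE)
        # ACCESS_COPY asks for a private, copy-on-write mapping.
        private = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_COPY)
        # ACCESS_WRITE asks for a write-through (shared) mapping.
        shared = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_WRITE)

        private[0:1] = b"P"  # first write: this mapping gets its own copy of the page
        shared[1:2] = b"S"   # lands in the page cache and ends up in the file
        shared.flush()

        with open(path, "rb") as f:
            file_bytes = f.read(2)
        private_bytes = private[0:2]

        private.close()
        shared.close()
        return file_bytes, private_bytes
    finally:
        os.close(fd)
        os.unlink(path)
```

The write through the private mapping never reaches the file, while the write through the shared mapping does. Whether the private view would have seen the shared write had it not been written first is exactly the part the article says is not guaranteed, so the sketch does not rely on it.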

Related

swapbuffers minifilter problems

I implemented a minifilter driver using the swapbuffers example. I made two changes:
attach only to \Device\HarddiskVolume3
encryption XORing with 0xFF
Encryption works, but volume 3 (which on my system is E:) does not: E: is no longer recognized as a file system, and chkdsk E: reports that all boot sectors are corrupted.
After investigating with procmon.exe, I found that chkdsk.exe creates a shadow copy of the volume. If the driver also attaches to the shadow copy, chkdsk E: succeeds and reports the file system as perfect. But E: itself remains unrecognized.
Any idea what I should change?
Assuming no simple mistake was made, that is, the volume was unmounted, you added the filter, and remounted, obviously the mount/filesystem is not using your filter.
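As a toy model only (this is plain Python, not driver code, and the sample bytes are just an illustration), the following shows why a transform like XOR with 0xFF makes data unrecognizable whenever one side of the I/O goes through the filter and the other side does not:

```python
def xor_ff(data: bytes) -> bytes:
    """Toy 'cipher' from the question: flip every bit of every byte."""
    return bytes(b ^ 0xFF for b in data)


# Sample bytes standing in for something a reader expects to recognize.
boot_signature = b"NTFS    "

stored = xor_ff(boot_signature)        # the writer went through the filter
seen_with_filter = xor_ff(stored)      # the reader also goes through the filter
seen_without_filter = stored           # the reader bypasses the filter
```

Here seen_with_filter equals the original bytes, while seen_without_filter does not. The same mismatch appears in the other direction if the data was stored in plain form and only the reader goes through the filter.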
I noticed a comment in the example code about "not for kernel mode drivers".
What you want to research is "whole disk encryption". A Google search for "windows whole disk encryption" will help.
In particular, TrueCrypt does what you want. Since it is open source, and is available on sourceforge.net, you could download the source and figure out how to hook your stuff in by learning how TrueCrypt does it.
Just one problem: TrueCrypt has security gaps, so the sourceforge.net page is now just migration info to BitLocker. But it still exists, and other pages have been created where you can get it. Notably, VeraCrypt is a fork of TrueCrypt.
Just one of the pages in the search is: http://www.howtogeek.com/203708/3-alternatives-to-the-now-defunct-truecrypt-for-your-encryption-needs/
UPDATE
Note: After I wrote this update, I realized that there may be hope ... So, keep reading.
Minifilters appear to be for file systems but not the underlying storage. It may work; you just need to find a lower-level hook. What about the filter stack altitude? Here's a link: https://msdn.microsoft.com/en-us/library/windows/hardware/ff540402%28v=vs.85%29.aspx It also has documentation on fltmc and the !fltkd debugger extension.
In this [short] blog: http://blogs.msdn.com/b/erick/archive/2006/03/27/562257.aspx it says:
The Filter Manager was meant to create a simple mechanism for drivers to filter file system operations: file system minifilter drivers. File system minifilter driver are located between the I/O manager and the base filesystem, not between the filesystem and the storage driver(s) like legacy file system filter drivers.
Figuring out what that means will help. Is the hook point between FS and I/O manager [which I don't know about] sufficient? Or, do you need to hook between filesystem and storage drivers [implying legacy filter]?
My suspicion is that a "legacy" driver filter may be what you need, if the minifilter does not have something that can do the same.
Since your hooks need to work on unmounted storage so that chkdsk will work, this may imply the legacy filter. On the other hand, you mentioned that you were able to hook the shadow copy and it worked for chkdsk. That implies minifilter has the right stuff.
Here's a link that I think is a bit more informative: http://blogs.msdn.com/b/ntdebugging/archive/2013/03/25/understanding-file-system-minifilter-and-legacy-filter-load-order.aspx It has a direct example about the altitude of an encryption filter. You may just need more hook points and a lower altitude for your minifilter.
UPDATE #2
Swapbuffers just hooks a few things: IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL. These are file I/O related, not device I/O related. The example is fine, just not necessarily for your purposes.
The link I gave you to fltmc is one page in MS's entire reference for filters. If you meander around that, you'll find more interesting things like IoGetDeviceAttachmentBaseRef, IoGetDiskDeviceObject. You need to find the object for the device and filter its I/O operations.
I think that you'll have to read the reference material in addition to examples. As I've said previously, your filter needs to hook more or different things.
In the VeraCrypt source, the Driver subdirectory is an example of the types of things you may need to do. In DriveFilter.c, it uses IRP_MJ_READ but also uses IRP_MN_START_DEVICE [A hook when the device is started].
Seriously, this may be more work than you imagine. Is this just for fun, or is this just a test case for a much larger project?

How can i regard a page as volatile or predict the next time of a page's content modification?

I'm running a virtual machine, so I can get all of the system information. How can I use it to detect whether a page, or a set of relevant pages, is volatile? The result can be just an approximate, empirically derived volatility time. I want to use time series analysis to predict the next time a page's content will be modified. Is that possible, and would it be accurate? Are there any better methods? Thanks very much!
I'm going to answer for pages inside a process, because the question gets very complex if it concerns the OS as a whole.
You can use VirtualQuery() and VirtualQueryEx() to determine the status of a given memory page. This includes whether it is read-only, a guard page, a DLL image section, writeable, etc. From these statuses you can infer the volatility of some pages.
All the read-only pages can be assumed to be non-volatile. But that isn't strictly accurate, since you can use VirtualProtect() to change the protection status of a page, and you can use VirtualProtectEx() to do the same in a different process. So you'd need to re-check these.
What about the other pages? You're going to have to check every writeable page periodically. For example, calculate a checksum and compare it to the previous checksum to see whether the page has changed, and then record the time between changes.
You could use the NTDLL function NtQueryInformationProcess() with ProcessWorkingSetWatch to get data on the page faults for the process.
Not sure if this is what you're looking for, but it's the simplest approach I can think of. It's potentially a bit CPU hungry, and reading each page regularly to calculate the checksums will trash your cache.
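The checksum polling could be sketched roughly like this. It works on a bytes-like snapshot of the memory rather than on a live process, and the 4096-byte page size, the CRC32 checksum, and the function names are all assumptions made for the illustration:

```python
import zlib

PAGE_SIZE = 4096  # assumed page size


def page_checksums(memory, page_size=PAGE_SIZE):
    """Return the CRC32 of each page-sized slice of a bytes-like snapshot."""
    return [zlib.crc32(memory[i:i + page_size])
            for i in range(0, len(memory), page_size)]


def record_changes(previous, current, last_change, intervals, now):
    """Compare two checksum lists and record the time between changes per page.

    last_change maps page index -> time of the last observed change;
    intervals maps page index -> list of gaps between observed changes.
    """
    for page, (old, new) in enumerate(zip(previous, current)):
        if old != new:
            if page in last_change:
                intervals.setdefault(page, []).append(now - last_change[page])
            last_change[page] = now
```

You would call page_checksums() on each poll and feed consecutive results to record_changes(). The per-page interval lists are then the raw series you could hand to a time series model. Note that a page that changes twice between two polls only counts once, so the polling period bounds the resolution.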

How do I save the memory state of a C program to jumpstart later

In a large, complex C program, I'd like to save to a file the contents of all memory that is used by static variables, global structures and dynamically allocated variables. There are more than 10,000 such variables.
The C program is single-threaded and performs no file operations, and the program itself is not so complex (the calculation is complex).
Then, in a same execution of the program, I want to initialize the memory from this saved state.
If this is even possible, can someone offer an approach to accomplish this?
You have to define a struct to keep all your data in, and then you have to implement a function to save it to a file.
Something like this: Saving struct to file
Please note, however, that this method is the simplest, but comes with no portability at all.
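As a rough sketch of the idea, here is the same thing done with Python's struct module standing in for a C struct written with fwrite. The field layout (one 32-bit integer, one double, four double coefficients) and the function names are hypothetical:

```python
import struct

# Hypothetical state layout: int32 iteration, float64 total, four float64 coefficients.
# The "<" prefix fixes little-endian byte order and disables padding.
STATE_FORMAT = "<id4d"


def save_state(path, iteration, total, coeffs):
    """Pack the state into a fixed binary layout and write it to a file."""
    with open(path, "wb") as f:
        f.write(struct.pack(STATE_FORMAT, iteration, total, *coeffs))


def load_state(path):
    """Read the file back and unpack it into (iteration, total, coeffs)."""
    with open(path, "rb") as f:
        fields = struct.unpack(STATE_FORMAT, f.read(struct.calcsize(STATE_FORMAT)))
    return fields[0], fields[1], list(fields[2:])
```

Pinning the byte order and padding in the format string is what a raw dump of a C struct does not give you, which is where the portability problem comes from. Pointers to dynamically allocated data cannot be saved this way at all; the data they point to has to be written out explicitly.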
Edit after Comment: basically, what you would like to do is save whatever is happening in the program and then restart it after a load. I don't think this is possible in any simple way. You MUST understand what "status of your application" means.
Think about it: doing a dump of the memory saves not only the data, but also the current Instruction Pointer. So, with that "dumb" dump, you would have also saved the actual instruction currently running. And many more complications you really don't want to care about.
The closest thing to what you are thinking about is running the program in a virtual machine. If you pause the VM, the execution status will be "saved", and whenever you resume the VM, the program will continue from the exact execution point at which you paused it.
If the configurations are scattered through the application, you can still use a global struct to save everything.
But still you have to know your program and identify what you have to save. No shortcuts on that.

How to revert changes with procedural memory?

Is it possible to store all changes of a set by using some means of logical paths - of the changes as they occur - such that one may revert the changes by essentially "stepping back"? I assume that something would need to map the changes as they occur, and the process of reverting them would thus ultimately be linear.
Apologies for any incoherence; this isn't specific to any particular language. Rather, it's a problem of memory, i.e. can a set (e.g. some store of user input) of a finite size that's changed continuously (e.g. at any given time for any amount of time; there's no limit on how much it can be changed) be mapped procedurally such that new, future changes are assumed to be the consequence of prior change (in a second, mirror store that can be used to revert the state of the set all the way to its initial state)?
You might want to look at some functional data structures. Functional languages, like Erlang, make it easy to roll back to an earlier state, since changes are always made on new data structures instead of mutating existing ones. While this feature can be used repeatedly internally, Erlang programming typically uses it abundantly at the top level of a "process", so that on any kind of failure it aborts both the processing and all the changes in their entirety simply by throwing an exception (in a non-functional language using mutable data structures, you'd be able to throw an exception to abort, but restoring the originals would be your program's job, not the runtime's). This is one reason that Erlang has a solid reputation.
Some of this functional style of programming is usefully applied to non-functional languages, in particular, use of immutable data structures, such as immutable sets, lists, or trees.
Regarding immutable sets, for example, one might design a functionally-oriented data structure where modifications always generate a new set given some changes and an existing set (a change set consisting of additions and removals). You'd leave the old set hanging around for reference (by whomever); languages with automatic garbage collection reclaim the old ones when they're no longer being used (referenced).
You can put an id or tag into your set data structure; this way you can do some introspection to see which data structure id someone has a hold of. You can also capture the id of the base from which each new version was generated; this gives you some history or lineage.
If desired, you can also capture a reference to the entire old data structure in the new one, or maintain a global list of all of the sets as they are being generated. If you do, however, you'll have to take over more responsibility for storage management, as an automatic collector will probably not find any unused (unreferenced) garbage to collect without some additional help.
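A minimal sketch of the id-plus-lineage idea, assuming the simplest possible design: every version is a frozenset kept in a global table, and each version records the id it was derived from. The class and method names are invented for this illustration.

```python
import itertools


class VersionedSet:
    """Keeps every version of a set as an immutable frozenset, with lineage."""

    def __init__(self, initial=()):
        self._ids = itertools.count(1)
        self.versions = {0: frozenset(initial)}  # version id -> contents
        self.parent = {0: None}                  # version id -> id it was derived from
        self.head = 0                            # id of the current version

    def apply(self, add=(), remove=()):
        """Create a new version from the current one; the old one is untouched."""
        new_id = next(self._ids)
        base = self.versions[self.head]
        self.versions[new_id] = (base | frozenset(add)) - frozenset(remove)
        self.parent[new_id] = self.head
        self.head = new_id
        return new_id

    def step_back(self):
        """Move the head to the parent version, if there is one."""
        if self.parent[self.head] is not None:
            self.head = self.parent[self.head]
        return self.versions[self.head]

    def current(self):
        return self.versions[self.head]
```

Stepping back is linear in the number of changes, as the question guessed. Because the versions table holds a reference to every version, nothing is ever garbage collected here; deciding when to drop old entries is the storage-management responsibility mentioned above.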
Database designs do some of this in their transaction controllers. For the purposes of your question, you can think of a database as a glorified set. You might look into MVCC (Multi-Version Concurrency Control) as one example that is reasonably well written up in the literature. This technique keeps old snapshot versions of data structures around (temporarily), meaning that mutations always appear to be in new versions of the data. An old snapshot is maintained until no active transaction references it; then it is discarded.
When two concurrently running transactions both modify the database, they each get a new version based off the same current and latest data set. (The transaction controller knows exactly which version each transaction is based off of, though the transaction's client doesn't see the version information.) Assuming both concurrent transactions choose to commit their changes, the versioning control in the transaction controller recognizes that the second committer is trying to commit a change set that is not a logical successor to the first (since both change sets, as we postulated above, were based on the same earlier version).
If possible, the transaction controller will merge the changes as if the second committer had really been working off the newer version committed by the first committer. (There are varying definitions of when this is possible; MVCC says it is when there are no write conflicts, which is a less-than-perfect answer but fast and scalable.) If it is not possible, it will abort the second committer's transaction and inform the second committer (who then has the opportunity to retry the transaction starting from the newer base).
Under the covers, the various snapshot versions in flight for concurrent transactions will probably share the bulk of the data (with some transaction-specific change sets that are consulted first) in order to make the snapshots cheap.
There is usually no API provided to access older versions, so in this domain, the transaction controller knows that as transactions retire, the original snapshot versions they were using can also be (reference counted and) retired.
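Here is a toy version of that commit rule, assuming the "no write conflicts" definition: a commit succeeds unless some key it wrote was committed by another transaction after this one started. The Store and Transaction classes are invented for the sketch, and the snapshot is a naive full copy rather than shared data.

```python
class Store:
    def __init__(self):
        self.version = 0      # bumped on every successful commit
        self.data = {}        # key -> committed value
        self.last_write = {}  # key -> version of the commit that last wrote it

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.start_version = store.version
        self.snapshot = dict(store.data)  # naive copy; real systems share the bulk
        self.writes = {}                  # transaction-specific change set

    def read(self, key):
        # The change set is consulted first, then the snapshot.
        return self.writes.get(key, self.snapshot.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        """Return True if committed, False if aborted on a write conflict."""
        for key in self.writes:
            if self.store.last_write.get(key, 0) > self.start_version:
                return False
        self.store.version += 1
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.last_write[key] = self.store.version
        return True
```

Two transactions that started from the same version and wrote different keys both commit, which is the "merge" case. If they wrote the same key, the second commit returns False and the caller can retry from a fresh begin().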
Another area where this is done is append-only files. Logging is a way of recording changes; some databases are based 100% on log-oriented designs.
BerkeleyDB has a nice log structure. Though used mostly for recovery, it does contain all the history so you can recreate the database from the log (up to the point you purge the log in which case you should also archive the database). Again someone has to decide when they can start a new log file, and when they can purge old log files, which you'd do to conserve space.
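The log idea in its smallest form, as a sketch that is not modelled on BerkeleyDB's actual log format: every change to the set is appended as an (operation, item) record, and any earlier state is recreated by replaying a prefix of the log. The record shape and the replay function are assumptions for the illustration.

```python
def replay(log, upto=None):
    """Rebuild the set by replaying the first `upto` records (all if None)."""
    state = set()
    for op, item in log[:upto]:
        if op == "add":
            state.add(item)
        elif op == "remove":
            state.discard(item)
    return state


# The log is only ever appended to; reverting means replaying a shorter prefix.
log = [("add", "a"), ("add", "b"), ("remove", "a")]
```

Replaying with upto=0 gives the initial empty set, and replaying the whole log gives the current state. Purging old records trades away the ability to go back past that point, which is the space decision mentioned above.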
These database techniques can be applied in memory as well. (Nothing is free, though, of course ;)
Anyway, yes, there are fields where this is done.
Immutable data structures help preserve history, by simply keeping old copies; changes always go to new copies. (And efficiency techniques can make this not as bad as it sounds.)
Ids can help you understand lineage without necessarily holding onto all the old copies.
If you do want to hold onto all the old copies, you have to look at your domain design to understand when/how/if old data structures can possibly get accessed, with an eye toward how to eventually reclaim them. You'll most likely have to get involved in defining how they get released, if ever, or how they get archived for posterity, though at the cost of slower access later.

possible to have two working sets? 1) data 2) code

In regards to Operating System concepts... Can a process have two working sets, one that represents data and another that represents code?
A "Working Set" is a term associated with Virtual Memory Management in operating systems; however, it is an abstract idea.
A working set is just the concept that there is a set of virtual memory pages that the application is currently working with and that there are other pages it isn't working with. Any page that is currently being used by the application is by definition part of the 'Working Set', so it's impossible to have two.
Operating systems often do distinguish between code and data in a process using various page permissions and memory protection but this is a different concept than a "Working set".
This depends on the OS.
But on common OSes like Windows there is no real difference between data and code, so no, it can't split up its working set into data and code.
As you know, the working set is the set of pages that a process needs to have in primary store to avoid thrashing. If some of these are code, and others data, it doesn't matter - the point is that the process needs regular access to these pages.
If you want to subdivide the working set into code and data and possibly other categorizations, to try to model what pages make up the working set, that's fine, but the working set as a whole is still all the pages needed, regardless of how these pages are classified.
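To illustrate, here is a small sketch of the window-based working set model: the working set at time t is the set of pages referenced in the last few references. The reference trace below is made up, with each entry tagged as a code or data reference, and the function names are hypothetical.

```python
def working_set(trace, t, window):
    """Pages referenced in the last `window` references up to and including index t."""
    start = max(0, t - window + 1)
    return set(trace[start:t + 1])


def split_by_kind(pages):
    """Classify the pages of a working set into code pages and data pages."""
    code = {page for kind, page in pages if kind == "code"}
    data = {page for kind, page in pages if kind == "data"}
    return code, data


# Made-up reference trace: (kind, page number) per memory reference.
trace = [("code", 1), ("data", 7), ("code", 1),
         ("data", 8), ("code", 2), ("data", 9)]
```

With this trace, working_set(trace, 5, 4) contains four pages, and split_by_kind() sorts them into code pages {1, 2} and data pages {8, 9}. The split is just a classification of one working set; the process still needs all four pages resident.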
EDIT: Blocking on I/O - does this affect the working set?
Remember that the working set is a model of the pages used over a given time period. When the length of time the process is blocked is short compared to the time period being modelled, then it changes little - the wait is insignificant and the working set over the time period being considered is unaffected.
But when the I/O wait is long compared to the modelled period, then it changes a lot. During the period the process is blocked, its working set is empty. An OS could theoretically swap out all of the process's pages on the basis of this.
The working set model attempts to predict what pages the process will need based on its past behaviour. In this case, if the process is still blocked at time t+1, then the model of an empty working set is correct, but as soon as the process is unblocked, its working set will be non-empty. The prediction by the model still says no pages are needed, so the predictive power of the model breaks down. But this is to be expected: you can't really predict the future. Normally. And the working set is expected to change over time.
This question is from the book "operating system concepts". The answer they are looking for (found elsewhere on the web) is:
Yes, in fact many processors provide two TLBs for this very reason. As
an example, the code being accessed by a process may retain the same
working set for a long period of time. However, the data the code
accesses may change, thus reflecting a change in the working set for
data accesses.
Which seems reasonable but is completely at odds with some of the other answers...

Resources