Recently, while taking a deeper dive into memory in computer science, I came across a question: how do text editors typically manage their memory? When you insert a character, does the entire rest of the document get copied one unit forward (O(n)) in RAM? Is there some sort of linked list involved? Is there padding around each character to prevent many reallocations?
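For instance, I could imagine keeping all the padding in one movable block at the cursor (I believe this is called a gap buffer). A rough C sketch of the idea, just to illustrate what I mean (all names made up):

    #include <stdlib.h>
    #include <string.h>

    /* Minimal gap buffer: text[0..gap_start) and text[gap_end..cap) hold
     * the document; the bytes in between are the "padding", concentrated
     * at the cursor instead of around each character. */
    typedef struct {
        char  *text;
        size_t gap_start, gap_end, cap;
    } GapBuf;

    static void gb_init(GapBuf *b, size_t cap) {
        b->text = malloc(cap);
        b->gap_start = 0;
        b->gap_end = cap;
        b->cap = cap;
    }

    /* Moving the gap is the only O(n) memmove, and its cost is
     * proportional to the cursor distance, not the document size. */
    static void gb_move_gap(GapBuf *b, size_t pos) {
        if (pos < b->gap_start) {
            size_t n = b->gap_start - pos;
            memmove(b->text + b->gap_end - n, b->text + pos, n);
            b->gap_start = pos;
            b->gap_end -= n;
        } else if (pos > b->gap_start) {
            size_t n = pos - b->gap_start;
            memmove(b->text + b->gap_start, b->text + b->gap_end, n);
            b->gap_start = pos;
            b->gap_end += n;
        }
    }

    /* Inserting at the cursor is O(1) while the gap has room. */
    static void gb_insert(GapBuf *b, size_t pos, char c) {
        gb_move_gap(b, pos);
        if (b->gap_start == b->gap_end) return; /* full: a real editor regrows */
        b->text[b->gap_start++] = c;
    }

Is this roughly what real editors do, or is it something else entirely?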
Related
I now understand how virtual memory works and what is responsible for setting it up. However, some days ago, I encountered memory segmentation, which splits the address space up into segments like data and text. I cannot find any clear, unambiguous resources (at least to me) that explain memory segmentation. For instance, I would like to know:
What is responsible for splitting up address spaces into segments?
How exactly does it work? Like how are segments translated to physical addresses, and what checks if an address within a certain segment has been accessed?
I have found this Wikipedia article, but it does not really answer these questions.
The term "segment" appears in at least two distinct memory contexts.
In ye olde days, segmentation was a method used for memory protection. The Intel chips continued to use segments for decades after they were obsolete. Intel finally dropped the use of segments in 64-bit mode, but they still exist in vestigial form, and they still exist in 32-bit mode.
That is the type of "segmentation" described in the Wikipedia link.
The "code" and "data"-type segmentation is something entirely different. Another term for this is "program section."
When you link your code, the linker usually groups memory with the same attributes into "program sections" (aka "segments"). Typically you will have memory that:
Is read-only/executable (code)
Is read/write and initialized to zero
Is read/write and initialized to specified values
Is read-only
In order to control the grouping of related memory, linkers generally use named segments/program sections. A linker may, by default, create a program section/segment called "Code" and place all the executable code in that segment. It may create, by default, a segment called "Data" and place the read-only data in that segment.
Powerful linkers allow the programmer to override these defaults. Some assembly languages and systems languages allow you to specify program sections.
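For illustration, here is roughly how a GCC/Clang toolchain groups a few C declarations, using the conventional ELF section names (the custom section name .mydata is just a made-up example):

    #include <stdio.h>

    int zeroed;                    /* read/write, zero-initialized  -> .bss    */
    int initialized = 42;          /* read/write, initialized value -> .data   */
    const char banner[] = "hi";    /* read-only data                -> .rodata */

    /* Overriding the default grouping, GCC/Clang style: */
    int special __attribute__((section(".mydata"))) = 7;

    int main(void) {               /* read-only/executable code     -> .text   */
        printf("%d %d %s %d\n", zeroed, initialized, banner, special);
        return 0;
    }

You can confirm the grouping by running objdump -h (or readelf -S) on the compiled binary.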
"Segments" in this context only exist only in the linking process. There is no area in memory marked "Code" or "Data" (unless you are using the olde Intel system).
What is responsible for splitting up address spaces into segments?
The address space is not split up into segments of this second type on modern systems (i.e., those designed after 1970 and not from Intel). Some confusing books use this as a pedagogical concept in diagrams. A process can (and usually does) have code pages interspersed with data pages.
Like how are segments translated to physical addresses, and what checks if an address within a certain segment has been accessed?
That question relates to the use of the term "Segment" described at the top. That translation is done using hardware registers.
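As a concrete illustration of that first kind of segmentation: in old x86 real mode, a segment register participates directly in the address arithmetic. A toy sketch of just the arithmetic (protected mode instead looks the segment up in a descriptor table, adds the descriptor's base, and faults if the offset exceeds the segment limit):

    #include <stdint.h>
    #include <stdio.h>

    /* x86 real mode: physical = segment * 16 + offset. */
    static uint32_t real_mode_addr(uint16_t segment, uint16_t offset) {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void) {
        /* e.g. segment 0xB800 (text video memory), offset 0x0000 -> 0xB8000 */
        printf("0x%05X\n", real_mode_addr(0xB800, 0x0000));
        return 0;
    }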
Well, to be honest, I would prefer that you consult books with thorough coverage of the basics rather than reading articles, because articles tend to be narrowly focused and pitched above the basic level (at least for me).
Every term in your question is a separate topic, and each is very well described in the reference below. If you really want answers and clear concepts, then you should go through this:
Read Abraham Silberschatz's "Operating System Concepts".
Chapter 8: Memory Management
Subtopics: Paging (basic method and hardware support), Segmentation
In my program, all the state is held in a giant map in an atom, which is updated by a load of pure functions in each iteration. I have determined that the heap size is increasing. How do I find the code that's responsible? I tried VisualVM, but it gives only generic information, and I can't find which part of my state is growing or which function is causing it to grow.
Look for common gotchas like forgetting to use with-open, hanging onto the head of a sequence, etc.
Isolate smaller segments of your code and see if you still see the same kinds of memory growth using JVisualVM. If knocking out or mocking some piece makes no difference then put it back, and if it makes a difference then you can focus on that and figure out what is going on.
I don't know of any silver bullet tool or technique, it's just a process of divide and conquer, and thinking about what you are doing in your code.
I have a requirement to parse a huge text file and send parts of this file to be added as separate rows in Content Manager. What is the best way of parsing it and then updating the DB?
I would also need to identify certain tokens within this text file.
Please suggest what language I should use to code this requirement.
Thanks
All widely used programming languages can do that, though scripting languages (especially Perl) may be better suited to the task than others. However, your personal experience is a bigger factor: using the language you're most familiar with would probably be best, unless you have specific reasons not to use it, or to use a different language.
A classic problem when working with large files is just reading them in the first place. A lot of standard libraries tend to want to read the entire file into memory / array. However for really large files this is usually not practical.
For whatever language you end up choosing, look over the file I/O libraries carefully and select a method that will allow you to read the file in chunks. Then run your parsing logic over the chunks, and when you get to the end of a chunk, read in the next. Be careful with the parsing logic: it can sometimes be tricky to handle a chunk that ends in a place your parser is not expecting.
Additionally, a double-buffer system sometimes works well: process one chunk, and when you get near the end, fill the other buffer with the next chunk. If your parsing is CPU-intensive, you might even look at filling a buffer on another thread to overlap the file I/O with the parsing. However, I wouldn't do this first; start with just getting the logic working before any performance optimizations.
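A minimal sketch of such a chunked loop in C, assuming a line-oriented text file (the actual parsing is left as a comment; note the carry-over that handles a chunk ending mid-line):

    #include <stdio.h>
    #include <string.h>

    #define CHUNK 65536

    /* Read a large file in fixed-size chunks, carrying any partial last
     * line over to the front of the next chunk so no token is split. */
    static void parse_chunked(const char *path) {
        static char buf[CHUNK + 1];
        size_t carry = 0;                     /* bytes held over from last chunk */
        FILE *f = fopen(path, "rb");
        if (!f) return;

        size_t n;
        while ((n = fread(buf + carry, 1, CHUNK - carry, f)) > 0) {
            size_t len = carry + n;
            buf[len] = '\0';

            char *cut = strrchr(buf, '\n');   /* last complete line boundary */
            if (!cut) {                       /* no newline: line > buffer;   */
                carry = len;                  /* a real parser must handle it */
                continue;
            }
            *cut = '\0';
            /* ... run the parsing logic over buf, which now ends on a
             *     complete line boundary ... */

            carry = len - (size_t)(cut + 1 - buf);
            memmove(buf, cut + 1, carry);     /* keep the partial tail */
        }
        if (carry) {
            /* ... parse the final unterminated line in buf[0..carry) ... */
        }
        fclose(f);
    }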
Without more detailed requirements it's difficult to suggest a particular language. Certainly no language is going to magically solve the problem of parsing such a big file. Depending on the format of the file, there might be a parsing library particularly suited to the job, which might guide your choice of language.
If by "Content Manager" you mean Microsoft Content Manager Server I guess one of the Microsoft languages such as C# or VB.Net might be a better choice.
So my answer would be: pick one of the languages you already know, probably the one you know best.
I am trying to write a statistics tool for a game by extracting values from the game's process memory (as there is no other way). The biggest challenge is to find the addresses that store the data I am interested in. What makes it even harder is dynamic memory allocation: I need to find not only the addresses that store the data but also pointers to those memory blocks, because the addresses change every time the game restarts.
For now I am just manually searching the game's memory using a memory editor (ArtMoney), looking for addresses whose values change as the data changes (or don't change). After an address is found, I look for a pointer that points to that memory block in a similar way.
I wonder what techniques/tools exist for such tasks? Maybe there are some articles I can read? Is mastering a disassembler the only way to go? For example, game trainers solve similar tasks, but they manage it in days, while I have already been struggling for weeks.
Thanks.
PS. It's all under Windows.
Is mastering a disassembler the only way to go?
Yes; go download WinDbg from http://www.microsoft.com/whdc/devtools/debugging/default.mspx, or, if you've got some money to blow, IDA Pro is probably the best tool for doing this.
If you know how to code in C, it is easy to search for memory values. If you don't know C, this page might point you to your solution if you can code in C#. It would not be hard to port the C# they have to Java.
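To sketch what that looks like in C on Windows (since the question is Windows-specific): the documented Win32 calls OpenProcess, VirtualQueryEx, and ReadProcessMemory are enough for a first-pass value scan. The PID and target value are placeholders:

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    /* First-pass scan of another process's committed, readable memory for
     * a 4-byte value: the same idea ArtMoney automates. Hits that straddle
     * the 4 KB read window are missed in this sketch. */
    static void scan_process(DWORD pid, int target) {
        HANDLE h = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ,
                               FALSE, pid);
        if (!h) return;

        MEMORY_BASIC_INFORMATION mbi;
        unsigned char *addr = NULL;
        static unsigned char buf[4096];

        while (VirtualQueryEx(h, addr, &mbi, sizeof mbi) == sizeof mbi) {
            if (mbi.State == MEM_COMMIT && !(mbi.Protect & PAGE_NOACCESS)) {
                for (SIZE_T off = 0; off < mbi.RegionSize; off += sizeof buf) {
                    SIZE_T got = 0;
                    if (ReadProcessMemory(h, (char *)mbi.BaseAddress + off,
                                          buf, sizeof buf, &got))
                        for (SIZE_T i = 0; got >= sizeof target &&
                                           i <= got - sizeof target; i++)
                            if (memcmp(buf + i, &target, sizeof target) == 0)
                                printf("hit at %p\n",
                                       (void *)((char *)mbi.BaseAddress + off + i));
                }
            }
            addr = (unsigned char *)mbi.BaseAddress + mbi.RegionSize;
        }
        CloseHandle(h);
    }

The pointer hunt the question describes is then a second pass: scan for 4- or 8-byte values equal to a hit's address.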
You might take a look at DynInst (Dynamic Instrumentation). In particular, look at the Dynamic Probe Class Library (DPCL). These tools will let you attach to running processes via the debugger interface and insert your own instrumentation (via special probe classes) into them while they're running. You could probably use this to instrument the routines that access your data structures and trace when the values you're interested in are created or modified.
You might have an easier time doing it this way than doing everything manually. There are a bunch of papers on those pages you can look at to see how other people built similar tools, too.
I believe the Windows support is maintained, but I have not used it myself.
Most of the literature on virtual memory points out that, as an application developer, understanding virtual memory can help me harness its powerful capabilities. I have been developing applications on Linux for some time, but I didn't care about the intricacies of virtual memory while coding. Am I missing something? If so, please shed some light on how I can leverage the workings of virtual memory. Otherwise, let me know if I am not making sense with the question!
Well, the concept is pretty simple, actually. I won't repeat it here, but you should pick up any book on OS design and it will be explained there. I recommend "Operating System Concepts" by Silberschatz and Galvin; it's what I had to use at university, and it's good.
A couple of things I can think of that virtual memory knowledge might give you:
Allocating memory on page boundaries to avoid waste (applies only to virtual memory, not the usual heap/stack memory; see the sketch after this list);
Locking some pages in RAM so they don't get swapped out to the HDD;
Guard pages;
Reserving an address range and committing actual memory later;
Perhaps using the NX (non-executable) bit to increase security, but I'm not sure on this one;
PAE for accessing >4 GB on a 32-bit system.
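The first two items, in Linux/POSIX terms (posix_memalign and mlock are standard calls; note that mlock usually requires privilege or a raised RLIMIT_MEMLOCK):

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        void *buf;

        /* Item 1: allocate exactly one page, on a page boundary. */
        if (posix_memalign(&buf, page, page) != 0)
            return 1;

        /* Item 2: pin the page in RAM so it is never swapped out
         * (useful e.g. for cryptographic keys or real-time data). */
        if (mlock(buf, page) == 0) {
            memset(buf, 0, page);     /* ... use the locked memory ... */
            munlock(buf, page);
        }
        free(buf);
        return 0;
    }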
Still, all of these things have uses only in quite specific scenarios. Indeed, 99% of applications need not concern themselves with this.
Added: That said, it's definitely good to know all these things, so that you can identify such scenarios when they arise. Just beware: with power comes responsibility.
It's a bit of a vague question.
The main way you can use virtual memory explicitly is through memory-mapped files. See the mmap() man page for more details.
You are probably using them implicitly anyway, as any dynamic library is implemented as a mapped file, and many database libraries use them too.
The interface to use mapped files from higher level languages is often quite inconvenient, which makes them less useful.
The chief benefits of using mapped files are:
No system call overhead when accessing parts of the file (this actually might be a disadvantage, as a page fault probably has as much overhead anyway, if it happens)
No need to copy data from OS buffers to application buffers - this can improve performance
Ability to share memory between processes.
Some drawbacks are:
32-bit machines can run out of address space easily
Tricky to handle file extending correctly
No easy way to see how many / which pages are currently resident (there may be some ways however)
Not good for real-time applications, as a page fault may cause an IO request, which blocks the thread (the file can be locked in memory, however, but only if there is enough RAM).
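For reference, a minimal read-only mapping on Linux (error handling kept to a bare minimum):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0) return 1;

        /* Map the whole file: pages are faulted in on first access, with
         * no read() system calls and no copy into an application buffer. */
        char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) return 1;

        long lines = 0;
        for (off_t i = 0; i < st.st_size; i++)  /* each new page faults once */
            if (p[i] == '\n') lines++;
        printf("%ld lines\n", lines);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }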
In maybe 9 out of 10 cases you need not worry about virtual memory management; that's the job of the kernel. Only in some highly specialized applications might you need to tweak it.
I know of one article that talks about computer memory management with an emphasis on Linux [ http://lwn.net/Articles/250967 ]. Hope this helps.
For most applications today, the programmer can remain unaware of the workings of computer memory without any harm. But sometimes, for example when you want to reduce your program's memory footprint, you do end up having to manipulate memory yourself. In such situations, knowing how memory is designed to work is essential.
In other words, although you can indeed survive without it, learning about virtual memory will only make you a better programmer.
And I would think the Wikipedia article can be a good start.
If you are concerned with performance, understanding the memory hierarchy is important.
For small data sets that are fully contained in physical memory, you need to be concerned with caching (accessing memory from the cache is much faster).
When dealing with large data sets, which may be paged out due to lack of physical memory, you need to be careful to keep your access patterns localized.
For example, if you declare a matrix in C (int a[rows][cols]), it is allocated row by row. Thus, when scanning the matrix, you should scan by rows rather than by columns; otherwise you will be paging the same data in and out many times.
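Concretely, the two scan orders over the same matrix (identical work, very different cache and paging behavior):

    #define ROWS 4096
    #define COLS 4096
    static int a[ROWS][COLS];   /* stored row by row: a[r][0], a[r][1], ... */

    long sum_row_major(void) {           /* fast: walks memory sequentially */
        long s = 0;
        for (int r = 0; r < ROWS; r++)
            for (int c = 0; c < COLS; c++)
                s += a[r][c];
        return s;
    }

    long sum_col_major(void) {           /* slow: jumps COLS*sizeof(int)
                                            bytes per step, so it revisits
                                            every row's pages repeatedly */
        long s = 0;
        for (int c = 0; c < COLS; c++)
            for (int r = 0; r < ROWS; r++)
                s += a[r][c];
        return s;
    }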
Another issue is the difference between dirty and clean data held in memory. Clean data is information loaded from a file that has not been modified by the program. The OS may discard clean pages (perhaps depending on how they were loaded) without writing them to disk. Dirty pages must first be written to the swap file.