Memory allocation while zlib-compressing data?

Go:
Assume there are 10 threads running in parallel, each compressing 100 MB of data with zlib.
A single thread takes nearly 2 seconds to compress 100 MB. What happens when all the threads compress their data in parallel?
I also need clarification about the memory allocated for each thread.
Case 1:
With 1 GB of RAM, if the 10 threads start compressing in parallel, will the compression use up all of the RAM?
10 threads * 100 MB = 1000 MB (approx.)

zlib itself needs a relatively trivial amount of memory, up to about 256 KB per thread. This will be dominated by the memory you use to store your input and output, if you are keeping those in memory. For details, see the zlib web site (look for the "Memory Footprint" topic).
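As a rough illustration of the point above, here is a minimal Python sketch. The function names (deflate_state_bytes, compress_stream) are made up for this example. The formula is the one given in zlib's own documentation for deflate with its default windowBits=15 and memLevel=8; zlib also needs a few extra kilobytes for small objects, which the sketch ignores. The streaming helper shows one way to avoid holding a whole 100 MB input and its output in memory per thread.

```python
import zlib


def deflate_state_bytes(window_bits=15, mem_level=8):
    """Approximate working memory of one deflate stream, in bytes.

    Uses the formula from zlib's documentation:
    (1 << (windowBits + 2)) + (1 << (memLevel + 9)).
    A few kilobytes of small bookkeeping objects are not counted.
    """
    return (1 << (window_bits + 2)) + (1 << (mem_level + 9))


def compress_stream(chunks, level=6):
    """Compress an iterable of byte chunks, yielding compressed pieces.

    Only one input chunk and the compressor state are held at a time,
    so memory use does not grow with the total input size.
    """
    comp = zlib.compressobj(level)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    yield comp.flush()


if __name__ == "__main__":
    per_thread = deflate_state_bytes()
    print("zlib state per thread:", per_thread // 1024, "KiB")
    print("zlib state for 10 threads:", 10 * per_thread // 1024, "KiB")
```

With the defaults this gives 256 KiB per stream, so ten threads need about 2.5 MiB for zlib itself. Whether the process approaches 1 GB depends on whether each thread keeps its full 100 MB input (and output) in memory or feeds it through in chunks.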

Related

GPU's memory is far from zero, why?

I have a GPU with 6 GB of total memory. However, around 400 to 500 MB is always in use (see the screenshot below), which prevents me from using that portion of memory for my own needs. How can I release that memory so that all 6 GB is available to me?

Unity profiling my scripts' memory usage

I took a look at the profiler when running my game and I can watch a whole bunch of stuff there - but not my scripts' memory usage. Thing is, my game's total mem allocation is 223 MB, but the textures are only 112 of this and one or two MB that I see except for that. I have no clue where my other 100 MB memory went and I would like to optimize my scripts a bit. Side note: I code using Visual Studio. Maybe I should look there?
Here are a few memory profiling tips I have learned from shipping apps in Unity.
GC ALLOC
In the CPU Usage view of the Unity Profiler window you can see how much memory your script allocated in any given frame. This is shown in the GC Alloc (Garbage Collection Allocations) column.
This won't give you the total memory usage of a single script, but it is very useful for improving performance and memory consumption. If your scripts are allocating every single frame, memory will continue to accumulate until the Garbage Collector gets triggered. This accumulation will increase your memory footprint and running the GC will cause a performance hit.
See here for more info: https://docs.unity3d.com/Manual/ProfilerCPU.html
Memory View
The Detailed Memory View of the Unity Profiler Window tells you the memory usage of anything loaded in the game (including lots of built-in assets). This will let you identify which textures, meshes, or other assets are overly large. When you look at an asset in this view it tells you where it is referenced in your scene, which can help you identify which game objects may be using too much memory.
One problem with this view is that many assets show up as blank because they don't have a name. This happens when you create assets (textures, meshes, etc.) in script. You can change this by setting the .name property of any asset you create. The name will then show up in the memory profiler window.
See here for more info: https://docs.unity3d.com/Manual/ProfilerMemory.html
Build Report
After you perform a build (Standalone, Windows Store, etc.) a build report is generated in the editor log. It can be hard to find, but it provides a lot of good information on what assets are contributing to your build size. One thing to remember is that this report uses uncompressed asset sizes, so many types of assets (textures in particular) will actually end up smaller than it shows here. In the upper-right corner of the console window, there's a drop-down menu for opening up the Editor logs. The part you are interested in will look something like this:
Textures 0.0 kb 0.0%
Meshes 0.0 kb 0.0%
Animations 0.0 kb 0.0%
Sounds 0.0 kb 0.0%
Shaders 18.6 kb 0.4%
Other Assets 0.7 kb 0.0%
Levels 5.2 kb 0.1%
Scripts 460.8 kb 10.2%
Included DLLs 3.9 mb 89.1%
File headers 8.4 kb 0.2%
Complete size 4.4 mb 100.0%
Used Assets and files from the Resources folder, sorted by uncompressed size:
18.9 kb 0.4% Resources/unity_builtin_extra
4.0 kb 0.1% ...UnityEngine.UI.dll
1.8 kb 0.0% ...UnityEngine.Networking.dll
0.1 kb 0.0% ...UnityEngine.Advertisements.dll
0.1 kb 0.0% ...UnityEngine.Purchasing.dll
0.1 kb 0.0% Assets/TestClass.cs
0.1 kb 0.0% Assets/MemoryTester.cs
0.1 kb 0.0% Assets/Rotator.cs
For the lazy, there is a Unity addon in the asset store that can parse this for you: https://www.assetstore.unity3d.com/en/#!/content/8162
(I have neither used this addon nor endorse its use)
System.GC.GetTotalMemory
If you are developing on a Windows Desktop PC you can ask the system for the amount of memory it thinks is allocated for your app using System.GC.GetTotalMemory(...). The actual number it reports may not be of interest to you, but if you place calls to this function at various points in your app you can see when total memory usage goes up. For example, you might put a call to GetTotalMemory before a large block of initialization and then again at the end of the initialization. Comparing the two numbers gives you an estimate of how much your memory grew*
*It may not be totally accurate because there are processes going on in the background, like GC, that can affect this number
See here for more info: https://msdn.microsoft.com/en-us/library/system.gc(v=vs.110).aspx
Hope some of this helps!!

What applications require 1GB pages?

x86 and x64 processors allow for 1GB pages when the PDPE flag is set on the CPU. In what application would this be practical or required, and for what reason?
Huge pages help when you have a large memory footprint and an access pattern that spans large distances (across many 4K pages).
They not only reduce TLB misses but also shrink the page tables the OS memory manager has to maintain.
A very good example is packet processing. In high-throughput network applications (1 Gbps or more), packets are normally stored in a packet buffer pool (i.e. a pooling technique). For example, every packet buffer is 2KB in size and the pool contains 512 buffers. The access pattern of this pool might not be sequential (buffers indexed 1,2,3,4,5...) but rather random over time (1,104,407,45,905...). With the normal 4K page size, the TLB helps little here: the buffers sit on many different pages, so a packet access can incur a TLB miss each time.
In contrast, if you put the pool in a 1GB hugepage, then all packet buffers share the same hugepage TLB entry, thus avoiding misses.
This is used in DPDK (Data Plane Development Kit), where the packet rate is so high that the cycles wasted on TLB misses are not negligible.
Hugepage support is required for the large memory pool allocation used
for packet buffers (the HUGETLBFS option must be enabled in the
running kernel as indicated the previous section). By using hugepage
allocations, performance is increased since fewer pages are needed,
and therefore less Translation Lookaside Buffers (TLBs, high speed
translation caches), which reduce the time it takes to translate a
virtual page address to a physical page address. Without hugepages,
high TLB miss rates would occur with the standard 4k page size,
slowing performance.
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html#bios-setting-prerequisite-on-x86
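The packet-pool example above can be checked with a little arithmetic. The following Python sketch counts how many distinct pages a pool of 512 buffers of 2KB each touches, for a 4K page size and for a 1GB page size. The function name pages_touched is made up for this illustration, and the sketch assumes the pool starts at a page-aligned address (offset 0) and that one TLB entry is needed per distinct page.

```python
KIB = 1024
GIB = 1024 ** 3


def pages_touched(buffer_indices, buffer_size, page_size, base=0):
    """Count the distinct pages covered by the start addresses of the buffers.

    Assumes buffers are laid out back to back starting at address `base`.
    """
    return len({(base + i * buffer_size) // page_size for i in buffer_indices})


if __name__ == "__main__":
    pool = range(512)  # 512 buffers, as in the example above
    print("4K pages :", pages_touched(pool, 2 * KIB, 4 * KIB))
    print("1GB pages:", pages_touched(pool, 2 * KIB, GIB))
```

Under these assumptions the pool spans 256 separate 4K pages but fits in a single 1GB page, so random access over the pool competes for 256 TLB entries in one case and needs one in the other.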
Another example from Oracle:
...almost 6.8 GB of memory used for page tables when hugepages were not
configured...
...after hugepages were allocated and used by the Oracle database. The page table overhead was reduced to slightly less than 23 MB
http://www.databasejournal.com/features/oracle/understanding-hugepages-in-oracle-database.html
Related links:
https://en.wikipedia.org/wiki/Object_pool_pattern
--Edit--
However, huge pages should be used carefully. Above I mentioned that a memory pool would benefit from a 1GB hugepage. However, if your access pattern crosses 1GB page boundaries, it might not help. There is an excellent blog post on this:
http://www.pvk.ca/Blog/2014/02/18/how-bad-can-1gb-pages-be/
Imagine an application that uses huge amounts of memory (molecular modeling, weather prediction), especially if it has no user interaction.
Large pages:
(1) reduce the amount of memory spent on page table overhead
(2) increase the amount of memory that can be covered by the MMU cache. (The same number of cache entries references more memory.)
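To put a number on point (1), here is a small Python sketch. It assumes x86-64 paging, where each page table entry is 8 bytes, and it only counts the lowest-level entries needed to map a region (the upper levels of the tree are ignored, and the figure is per process mapping the region). The function name lowest_level_table_bytes is made up for this example.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def lowest_level_table_bytes(mapped_bytes, page_size, entry_size=8):
    """Bytes of lowest-level page table entries needed to map a region.

    entry_size=8 assumes x86-64; upper paging levels are not counted.
    """
    entries = -(-mapped_bytes // page_size)  # ceiling division
    return entries * entry_size


if __name__ == "__main__":
    for name, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
        print(name, "pages:", lowest_level_table_bytes(GIB, size), "bytes per mapped GiB")
```

Under these assumptions, mapping 1 GiB costs 2 MiB of entries with 4K pages, 4 KiB with 2M pages, and a single 8-byte entry with a 1GB page. When many processes map the same large shared region, this overhead is multiplied by the number of processes.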
I have LabView installed on my Dell workstation with 8 cores and 16GB DDR RAM, driving four 24" monitors. If I create a video processor or compositor of almost any type, with a 1024px x 1024px 'drawing' display, LabView reserves 1.5GB before I even begin to composite. It was built from C and C++. I often store image details in 3D arrays of 256 x 256 x 256 U32 integers, each holding an RGB pixel color plus the alpha channel for opacity or masking. That's 64MB for each layer of buffered video. If I need to remember 128 layers, that's 8GB right there. LabView is a programming language structured much like a CAD program. If I need 8GB for a series of video (HDTV) buffers, that is what it will give me, with a few seconds' wait for malloc to do its work. If I created an 8GB 3D array for a database, it would be no different, even if I did it in MySQL (not as an array). To me, having many gigabytes of RAM to play with is the norm, not an exception.

linux virtual memory parameters

Can anyone explain how dirty_bytes and dirty_background_bytes work among the Linux VM tunable parameters?
I infer that dirty_bytes specifies the amount of dirty memory after which an application doing a write starts writing directly to disk. Is that correct? Or, once that amount of memory is used up, is that portion first transferred to disk, after which new data is again stored in memory? For example, suppose I want to write a 1 GB file to disk and I set dirty_bytes to 100 MB. Once 100 MB have been written to memory, does the application start writing the data directly to disk, or are the 100 MB transferred to disk, then another 100 MB written to memory and transferred to disk, and so on?
As for dirty_background_bytes, when the amount of dirty memory exceeds this value, pdflush writes the dirty data back to disk in the background.
Is my understanding of these two parameters correct?
No, exceeding dirty_bytes (or dirty_ratio) does not cause processes to start writing directly to disk.
Instead, when a process dirties a page in excess of the limit, that process is used to perform synchronous writeout of some dirty pages - exactly which ones is still decided by the usual heuristics. They may not necessarily even be pages that were originally dirtied by that particular process.
Effectively, the process sees its write (which may just be a memory write) suspended until some writeout has occurred.
You are correct about dirty_background_*. When the background limit is exceeded, asynchronous writeout is started, but the userspace process is allowed to continue.
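The two thresholds described above can be summarized as a toy decision function. This is a deliberately simplified Python model, not the kernel's actual logic: the function name and the returned labels are made up for this sketch, and the real kernel applies further heuristics when deciding what to write and how long to stall a writer.

```python
def writeback_action(dirty_now, background_limit, hard_limit):
    """Simplified model of what happens at a given amount of dirty memory.

    dirty_now        -- bytes of dirty page cache at the moment of a write
    background_limit -- value of dirty_background_bytes
    hard_limit       -- value of dirty_bytes
    """
    if dirty_now > hard_limit:
        # The writing process is made to perform synchronous writeout of
        # some dirty pages and sees its write stall until that completes.
        return "writer_does_synchronous_writeout"
    if dirty_now > background_limit:
        # Asynchronous writeout starts; the process continues unhindered.
        return "background_writeout"
    return "no_writeout_triggered"


if __name__ == "__main__":
    MB = 1024 * 1024
    for dirty in (10 * MB, 60 * MB, 120 * MB):
        print(dirty // MB, "MB dirty ->", writeback_action(dirty, 50 * MB, 100 * MB))
```

On a Linux system the current settings can be read from /proc/sys/vm/dirty_bytes and /proc/sys/vm/dirty_background_bytes; a value of 0 means the corresponding *_ratio tunable is in effect instead.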

Clarify: Processor operates at 800 MHz and 200 MHz DDR RAM

I have an evaluation kit which has an implementation of ARM Cortex-A8 core. The processor data sheet states that it has a
ARM Cortex A8™ core, which operates at speeds as high as 800MHz and Up to 200MHz DDR2 RAM.
What can I expect from this system? Am I right to assume that the memory accesses will be a bottleneck because it operates at only 200MHz?
Need more info on how to interpret this.
The processor works with an internal cache (actually, several) which it can access at "full speed". The cache is small (typically 8 to 32 kilobytes) and is filled by chunks ("cache lines") from the external RAM (a cache line will be a few dozen consecutive bytes). When the code needs some data which is not presently in the cache, the processor will have to fetch the line from main RAM; this is called a cache miss.
How fast the cache line can be obtained from main RAM is described by two parameters, called latency and bandwidth. Latency is the amount of time between the moment the processor issues the request and the moment the first byte of the cache line is received. Typical latencies are about 30ns. At 800 MHz, 30ns means 24 clock cycles. Bandwidth describes how many bytes per nanosecond can be sent on the bus. "200 MHz DDR2" means that the bus clock runs at 200 MHz. DDR2 RAM can send two data elements per cycle (hence 400 million elements per second). Bandwidth then depends on how many wires there are between the CPU and the RAM: with a 64-bit bus and 200 MHz DDR2 RAM, you could hope for 3.2 GBytes/s in ideal conditions. So while the first byte takes quite some time to be obtained (latency is high compared to what the CPU can do in that time), the rest of the cache line is read quite quickly.
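The figures in the paragraph above follow from simple arithmetic, sketched here in Python. The function names are made up for this illustration; the 30ns latency and the 64-bit bus width are the example values used above, not properties read from the evaluation kit's data sheet.

```python
def latency_in_cpu_cycles(latency_ns, cpu_mhz):
    """Number of CPU clock cycles that elapse during a memory latency."""
    return latency_ns * cpu_mhz / 1000


def peak_bandwidth_bytes_per_s(bus_mhz, bus_width_bits, transfers_per_clock=2):
    """Ideal bus bandwidth; DDR memory transfers twice per bus clock."""
    return bus_mhz * 1_000_000 * transfers_per_clock * bus_width_bits // 8


if __name__ == "__main__":
    print("cycles lost per miss:", latency_in_cpu_cycles(30, 800))
    print("peak bandwidth (GB/s):", peak_bandwidth_bytes_per_s(200, 64) / 1e9)
```

With these inputs a cache miss costs the 800 MHz core about 24 cycles before the first byte arrives, while the ideal transfer rate is 3.2 GB/s. A narrower bus scales the bandwidth down proportionally: a 32-bit bus would give half of that under the same assumptions.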
In the other direction: the CPU writes some data to its cache, and some circuitry will propagate the modification to main RAM at its leisure.
The description above is overly simplistic; caches and cache management are a complex area. Bottom-line is the following: if your code uses big data tables in memory and accesses them in a seemingly random way, then the application will be slow, because most of the time the processor will just wait for data from main memory. On the other hand, if your code can operate with little RAM, less than a few dozen kilobytes, then chances are that it will run most of the time with the innermost cache, and external RAM speed will be unimportant. Ability to make memory accesses in a way which operates well with the caches is called locality of reference.
See the Wikipedia page on caches for an introduction and pointers on the matter of caches.
(Big precomputed tables were a common optimization trick during the '80s because at that time processors were not faster than RAM, and one-cycle memory access was the rule. That is why an 8 MHz Motorola 68000 CPU had no cache. But those days are long gone.)
Yes, the memory may well be a bottleneck but you will be very unlikely to be running an application that does nothing but read and write to memory.
Inside the CPU, the memory bottleneck will not have an effect.
