2-way set associative cache hit/miss ratio calculations - memory

I am having a hard time figuring out how to know when there will be a hit or a miss. Here is the problem I'm doing (I have the answer but I can't figure out how they got the answer):
A 2-way set associative cache consists of four sets. Main memory contains 2K blocks of eight words each.
Show the main memory address format that allows us to map addresses from main memory to cache. Be sure to include the fields as well as their sizes. (I understand this and have done work and gotten the answer)
Compute the hit ratio for a program that loops 3 times from locations 8 to 51 in main memory. You may leave the hit ratio in terms of a fraction. Here is the answer:
First iteration of the loop: Address 8 is a miss, and the entire
block is brought into Set 1. Hence, 9-15 are hits. 16 is a miss;
the entire block is brought into Set 2, and 17-23 are hits. 24 is a miss;
the entire block is brought into Set 3, and 25-31 are hits. 32 is a miss;
the entire block is brought into Set 0, and 33-39 are hits. 40 is a miss;
the entire block is brought into Set 1 (note we do NOT have to throw out
the block with address 8, as this is 2-way set associative), and 41-47
are hits. 48 is a miss; the entire block is brought into Set 2, and 49-51 are hits.
For the first iteration of the loop, we have 6 misses, and 5*7 + 3
hits, or 38 hits. On the remaining iterations, we have 5*8+4 hits, or
44 hits each, for 88 more hits.
Therefore, we have 6 misses and 126 hits, for a hit ratio of 126/132,
or 95.45%.
I'm still having trouble wrapping my head around how to figure out which memory addresses/blocks will be hits or misses.

There are some ambiguities in the question:
Cache line size is not given
Each memory entry is said to be 8 words long.
Therefore I've made a few assumptions:
Cache line is 8 words
Main memory is word addressed
Main memory contains 2K = 2^11 blocks of 8 words each, i.e. 2^14 words, so a word address is 14 bits. A cache line is 8 words wide, hence the least significant 3 bits select the word within a line. There are 4 sets, hence the next 2 bits are used as the set index. This leaves 9 bits for the TAG.
When address 8 (binary 00000000001000) is issued, the set index is 01 and the TAG is 000000000.
This is not in the cache, hence a miss.
For address 9 (binary 00000000001001), the set index is 01 and the TAG is 000000000. This is already in the cache, hence it's a hit.
For address 10 (binary 00000000001010), the set index is 01 and the TAG is 000000000. This is already in the cache, hence it's a hit.
The same pattern continues until address 15 (binary 00000000001111).
When address 16 (binary 00000000010000) is issued, the set index is 10 and the TAG is 000000000.
This is not in the cache, hence a miss.
When address 17 (binary 00000000010001) is issued, the set index is 10 and the TAG is 000000000. This is already in the cache, hence it's a hit.
The same pattern continues until address 23 (binary 00000000010111).
8-15  : set index 01, TAG 000000000: 1 miss, 7 hits
16-23 : set index 10, TAG 000000000: 1 miss, 7 hits
24-31 : set index 11, TAG 000000000: 1 miss, 7 hits
32-39 : set index 00, TAG 000000000: 1 miss, 7 hits
Now for address 40 (binary 00000000101000), the set index is again 01, but the TAG is 000000001. This is a miss; when the data is brought in from memory, it can go into the second way of the set indexed by 01, since the cache is 2-way.
40-47 : set index 01, TAG 000000001: 1 miss, 7 hits
48-51 : set index 10, TAG 000000001: 1 miss, 3 hits
For the first iteration: 6 misses and 38 hits. For the second iteration, 44 hits, and for the third iteration, 44 hits.
So overall we have 126 hits over 132 accesses.
The hit ratio is 126/132, or about 95.45%.
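The hand counts above can be checked with a small simulation (my sketch, not from the thread) that decomposes each word address into block, set index, and tag, and models 2-way LRU replacement:

```python
# Minimal sketch: simulate a 2-way set-associative cache with LRU
# replacement and count hits/misses for three passes over word
# addresses 8..51.

WORDS_PER_BLOCK = 8
NUM_SETS = 4

def simulate(addresses, ways=2):
    # Each set holds up to `ways` tags; list order tracks recency
    # (front = least recently used).
    sets = [[] for _ in range(NUM_SETS)]
    hits = misses = 0
    for addr in addresses:
        block = addr // WORDS_PER_BLOCK
        index = block % NUM_SETS
        tag = block // NUM_SETS
        lines = sets[index]
        if tag in lines:
            hits += 1
            lines.remove(tag)   # move to most-recently-used position
            lines.append(tag)
        else:
            misses += 1
            if len(lines) == ways:
                lines.pop(0)    # evict least recently used
            lines.append(tag)
    return hits, misses

trace = [a for _ in range(3) for a in range(8, 52)]
hits, misses = simulate(trace)
print(hits, misses)   # 126 hits, 6 misses -> hit ratio 126/132
```

Since the working set of the loop (six blocks spread over four sets, at most two per set) fits entirely in the cache, no evictions ever happen and every access after the first pass is a hit.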

Related

How to debug leak in native memory on JVM?

We have a Java application running on Mule. We have the Xmx value configured as 6144M, but we routinely see the overall memory usage climb and climb. It was getting close to 20 GB the other day before we proactively restarted it.
Thu Jun 30 03:05:57 CDT 2016
top - 03:05:58 up 149 days, 6:19, 0 users, load average: 0.04, 0.04, 0.00
Tasks: 164 total, 1 running, 163 sleeping, 0 stopped, 0 zombie
Cpu(s): 4.2%us, 1.7%sy, 0.0%ni, 93.9%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 24600552k total, 21654876k used, 2945676k free, 440828k buffers
Swap: 2097144k total, 84256k used, 2012888k free, 1047316k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3840 myuser 20 0 23.9g 18g 53m S 0.0 79.9 375:30.02 java
The jps command shows:
10671 Jps
3840 MuleContainerBootstrap
The jstat command shows:
S0C S1C S0U S1U EC EU OC OU PC PU YGC YGCT FGC FGCT GCT
37376.0 36864.0 16160.0 0.0 2022912.0 1941418.4 4194304.0 445432.2 78336.0 66776.7 232 7.044 17 17.403 24.447
The startup arguments are (sensitive bits have been changed):
3840 MuleContainerBootstrap -Dmule.home=/mule -Dmule.base=/mule -Djava.net.preferIPv4Stack=TRUE -XX:MaxPermSize=256m -Djava.endorsed.dirs=/mule/lib/endorsed -XX:+HeapDumpOnOutOfMemoryError -Dmyapp.lib.path=/datalake/app/ext_lib/ -DTARGET_ENV=prod -Djava.library.path=/opt/mapr/lib -DksPass=mypass -DsecretKey=aeskey -DencryptMode=AES -Dkeystore=/mule/myStore -DkeystoreInstance=JCEKS -Djava.security.auth.login.config=/opt/mapr/conf/mapr.login.conf -Dmule.mmc.bind.port=1521 -Xms6144m -Xmx6144m -Djava.library.path=%LD_LIBRARY_PATH%:/mule/lib/boot -Dwrapper.key=a_guid -Dwrapper.port=32000 -Dwrapper.jvm.port.min=31000 -Dwrapper.jvm.port.max=31999 -Dwrapper.disable_console_input=TRUE -Dwrapper.pid=10744 -Dwrapper.version=3.5.19-st -Dwrapper.native_library=wrapper -Dwrapper.arch=x86 -Dwrapper.service=TRUE -Dwrapper.cpu.timeout=10 -Dwrapper.jvmid=1 -Dwrapper.lang.domain=wrapper -Dwrapper.lang.folder=../lang
Adding up the "capacity" items from jstat shows that only my 6144m is being used for the Java heap. Where the heck is the rest of the memory being used? Stack memory? Native heap? I'm not even sure how to proceed.
If left to continue growing, it will consume all memory on the system and we will eventually see the system freeze up throwing swap space errors.
I have another process that is starting to grow. Currently at about 11g resident memory.
pmap 10746 > pmap_10746.txt
cat pmap_10746.txt | grep anon | cut -c18-25 | sort -h | uniq -c | sort -rn | less
Top 10 entries by count:
119 12K
112 1016K
56 4K
38 131072K
20 65532K
15 131068K
14 65536K
10 132K
8 65404K
7 128K
Top 10 entries by allocation size:
1 6291456K
1 205816K
1 155648K
38 131072K
15 131068K
1 108772K
1 71680K
14 65536K
20 65532K
1 65512K
And top 10 by total size:
Count Size Aggregate
1 6291456K 6291456K
38 131072K 4980736K
15 131068K 1966020K
20 65532K 1310640K
14 65536K 917504K
8 65404K 523232K
1 205816K 205816K
1 155648K 155648K
112 1016K 113792K
This seems to be telling me that because the Xmx and Xms are set to the same value, there is a single allocation of 6291456K for the java heap. Other allocations are NOT java heap memory. What are they? They are getting allocated in rather large chunks.
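For reference, the shell pipeline above can be sketched in Python (my sketch, not part of the original post), fed raw `pmap` output; the column format is assumed to match the typical `address sizeK perms mapping` layout:

```python
# Hypothetical sketch: aggregate anonymous mappings from `pmap <pid>`
# output by size, producing the same (count, size, aggregate) summary
# as the grep/cut/sort/uniq pipeline above.
from collections import Counter

def top_anon_mappings(pmap_output, n=10):
    sizes = Counter()
    for line in pmap_output.splitlines():
        parts = line.split()
        # Typical pmap line: "00007f1000000000  131072K rw---   [ anon ]"
        if "anon" in line and len(parts) >= 2 and parts[1].endswith("K"):
            sizes[int(parts[1][:-1])] += 1
    # Rows of (count, size_K, aggregate_K), largest aggregate first.
    rows = [(count, size, count * size) for size, count in sizes.items()]
    return sorted(rows, key=lambda r: r[2], reverse=True)[:n]

sample = """\
0000000700000000 6291456K rw--- [ anon ]
00007f1000000000  131072K rw--- [ anon ]
00007f1800000000  131072K rw--- [ anon ]
00007f2000000000    1016K rw--- [ anon ]
00007f2100000000     128K r-x-- libfoo.so
"""
for count, size_k, total_k in top_anon_mappings(sample):
    print(count, size_k, total_k)
```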
Expanding a bit more details on Peter's answer.
You can take a binary heap dump from within VisualVM (right click on the process in the left-hand side list, and then on heap dump - it'll appear right below shortly after). If you can't attach VisualVM to your JVM, you can also generate the dump with this:
jmap -dump:format=b,file=heap.hprof $PID
Then copy the file and open it with Visual VM (File, Load, select type heap dump, find the file.)
As Peter notes, a likely cause of the leak is uncollected DirectByteBuffers (e.g. some instance of another class is not properly de-referencing the buffers, so they are never GC'd).
To identify where these references are coming from, you can use VisualVM to examine the heap: find all instances of DirectByteBuffer in the "Classes" tab, right-click the class, and go to the instances view.
This will give you a list of instances. You can click on one and see who's keeping a reference to each one:
Note the bottom pane: we have a "referent" of type Cleaner and 2 "mybuffer". These would be fields in other classes that reference the DirectByteBuffer instance we drilled into (it should be OK to ignore the Cleaner and focus on the others).
From this point on you need to proceed based on your application.
Another equivalent way to get the list of DBB instances is from the OQL tab. This query:
select x from java.nio.DirectByteBuffer x
Gives us the same list as before. The benefit of using OQL is that you can execute more complex queries. For example, this gets all the instances that are keeping a reference to a DirectByteBuffer:
select referrers(x) from java.nio.DirectByteBuffer x
What you can do is take a heap dump and look for objects which store data off-heap, such as ByteBuffers. Those objects will appear small but are a proxy for larger off-heap memory areas. See if you can determine why lots of them might be retained.

Assembly Memory Diagram Verification

Given this data,
.data
Alpha WORD 0022h, 45h
Beta BYTE 56h
Gamma DWORD 4567h
Delta BYTE 23h
Assuming that the data segment begins at 0x00404000, can anyone verify how correct this table is?
Address Variable Data
00404000 Alpha 22
00404001 Alpha + 1 00
00404002 Alpha + 2 45
00404003 Beta 56
00404004 Gamma 67
00404005 Gamma+1 45
00404006 Delta 23
Impossible to answer without knowing the addressing of the processor in question (and how the assembler views the addressing). Nonetheless, you'd need a pretty unusual system for it to be correct.
Alpha is defined as having the type "word". You're showing the first word as occupying two bytes (fairly reasonable), but the second only one byte. This is much less reasonable: a word might be one byte or it might be two, but its size is normally going to be consistent at least.
For the moment, let's assume a word is two bytes, and a dword is four bytes. In that case, I'd expect something more like:
Alpha   22h
Alpha+1 00h
Alpha+2 45h
Alpha+3 00h
Beta    56h
Gamma   67h
Gamma+1 45h
Gamma+2 00h
Gamma+3 00h
Delta   23h
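That layout can be generated byte by byte (a sketch of mine, assuming WORD = 2 bytes, DWORD = 4 bytes, little-endian, and no alignment padding, as in MASM-style defaults):

```python
# Sketch: byte-by-byte layout of the data segment starting at 0x00404000.
import struct

BASE = 0x00404000
defs = [
    ("Alpha", struct.pack("<HH", 0x0022, 0x45)),  # WORD 0022h, 45h
    ("Beta",  struct.pack("<B", 0x56)),           # BYTE 56h
    ("Gamma", struct.pack("<I", 0x4567)),         # DWORD 4567h
    ("Delta", struct.pack("<B", 0x23)),           # BYTE 23h
]

layout = []     # (address, label, byte value)
addr = BASE
for label, data in defs:
    for i, byte in enumerate(data):
        layout.append((addr + i, label if i == 0 else f"{label}+{i}", byte))
    addr += len(data)

for a, label, byte in layout:
    print(f"{a:08X} {label:8} {byte:02X}")
```

Note how little-endian storage puts 22h at the lowest address of Alpha's first word and 67h at the lowest address of Gamma.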

How to reduce Ipython parallel memory usage

I'm using Ipython parallel in an optimisation algorithm that loops a large number of times. Parallelism is invoked in the loop using the map method of a LoadBalancedView (twice), a DirectView's dictionary interface and an invocation of a %px magic. I'm running the algorithm in an Ipython notebook.
I find that the memory consumed by both the kernel running the algorithm and one of the controllers increases steadily over time, limiting the number of loops I can execute (since available memory is limited).
Using heapy, I profiled memory use after a run of about 38 thousand loops:
Partition of a set of 98385344 objects. Total size = 18016840352 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 5059553 5 9269101096 51 9269101096 51 IPython.parallel.client.client.Metadata
1 19795077 20 2915510312 16 12184611408 68 list
2 24030949 24 1641114880 9 13825726288 77 str
3 5062764 5 1424092704 8 15249818992 85 dict (no owner)
4 20238219 21 971434512 5 16221253504 90 datetime.datetime
5 401177 0 426782056 2 16648035560 92 scipy.optimize.optimize.OptimizeResult
6 3 0 402654816 2 17050690376 95 collections.defaultdict
7 4359721 4 323814160 2 17374504536 96 tuple
8 8166865 8 196004760 1 17570509296 98 numpy.float64
9 5488027 6 131712648 1 17702221944 98 int
<1582 more rows. Type e.g. '_.more' to view.>
You can see that about half the memory is used by IPython.parallel.client.client.Metadata instances. A good indicator that results from the map invocations are being cached is the 401177 OptimizeResult instances, equal to the number of optimize invocations made via lbview.map; I am not caching them in my own code.
Is there a way I can control this memory usage on both the kernel and the IPython parallel controller (whose memory consumption is comparable to the kernel's)?
IPython parallel clients and controllers store results and other metadata from past transactions.
The IPython.parallel.Client class provides a method for clearing this data:
Client.purge_everything()
documented here. There are also purge_results() and purge_local_results() methods that give you some control over what gets purged.
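One way to keep the cache bounded is to purge periodically inside the optimisation loop. A minimal sketch (mine, not from the answer), where `client` is assumed to expose `purge_everything()` as the IPython.parallel Client does, and `step` stands in for one iteration's worth of map calls:

```python
# Sketch: run the optimisation loop, purging the client's cached
# results/metadata every `purge_every` iterations so Metadata and
# OptimizeResult instances don't accumulate for the whole run.

def optimisation_loop(client, step, n_iterations, purge_every=1000):
    outputs = []
    for i in range(n_iterations):
        outputs.append(step(i))            # one loop body: map calls etc.
        if (i + 1) % purge_every == 0:
            # Drop cached results and metadata held by the client and hub.
            client.purge_everything()
    return outputs
```

The trade-off is that purged results are gone for good, so anything you still need must be copied out of the client before each purge.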

Logical Addresses & Page numbers

I just started learning memory management and have an idea of pages, frames, virtual memory and so on, but I don't understand the procedure for converting logical addresses to their corresponding page numbers.
Here is the scenario-
Page Size = 100 words /8000 bits?
The process generates these logical addresses:
10 11 104 170 73 309 185 245 246 434 458 364
The process takes up two page frames, and none of its pages are resident (in page frames) when the process begins execution.
Determine the page number corresponding to each logical address and fill them into a table with one row and 12 columns.
I know the answer is :
0 0 1 1 0 3 1 2 2 4 4 3
But can someone explain how this is done? Is there an equation or something? I remember seeing something involving a table, converting values to binary, and putting them in the page table, like 00100 for page 1, but I'm not really sure. Graphical representations of how this works would be more than appreciated. Thanks
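The answer row above follows from a single division: the page number is the logical address divided (integer division) by the page size, and the offset within the page is the remainder. A quick sketch:

```python
# Sketch: with a page size of 100 words, page number = address // 100
# and the offset within the page = address % 100.
PAGE_SIZE = 100

addresses = [10, 11, 104, 170, 73, 309, 185, 245, 246, 434, 458, 364]
pages = [addr // PAGE_SIZE for addr in addresses]
offsets = [addr % PAGE_SIZE for addr in addresses]
print(pages)    # [0, 0, 1, 1, 0, 3, 1, 2, 2, 4, 4, 3]
```

When the page size is a power of two, the same split can be done by taking the high bits of the binary address as the page number and the low bits as the offset, which is the table-and-binary procedure you half remember.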

How to change hours in duration to a number in google spreedsheet?

I want to subtract a number from a duration but am not sure how to do it.
A1 : 137:47:00 (formatted as duration)
A2 : 126 (formatted as number)
When I subtract it is showing unexpected value
=(A1-A2) = -120.26
I was expecting something similar to 11.
Subtracting a dimensionless number from a duration does not really make a lot of sense, but if 137:47:00 represents 137 hours and 47 minutes, then subtracting 126 hours from it would make sense (and give a result between 11 and 12 hours). To compare like with like, the duration can be converted to a number using the fact that Google Sheets treats 24 hours as the number 1. So multiply 137:47:00 (if it represents hours, minutes, seconds) by 24 to get a number from which another number can be subtracted to give a meaningful result: 11.7833333, representing 11 hours 47 minutes, when 126 hours are subtracted from 137 hours and 47 minutes. Therefore:
=24*A1-A2
might suit.
Calculating time worked per day on Web Applications addresses a vaguely similar issue.
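The arithmetic behind both formulas can be reproduced outside the spreadsheet (my sketch, using the serial-number convention described above):

```python
# Sketch: Google Sheets stores a duration as a fraction of a day
# (24 h == 1.0), so =(A1-A2) mixes units. Multiplying the duration
# by 24 converts it to hours first.
a1_days = (137 + 47 / 60) / 24    # 137:47:00 as a serial value (in days)
a2 = 126                          # plain number, intended as hours

naive = a1_days - a2              # what =(A1-A2) computes
fixed = 24 * a1_days - a2         # what =24*A1-A2 computes

print(round(naive, 2))   # -120.26, the unexpected value from the question
print(round(fixed, 2))   # 11.78, i.e. 11 hours 47 minutes
```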
