What is the Average instruction Execution time? - memory

This is my homework question. I just want to confirm my approach and hence the answer.
A computer system uses a two-level paging scheme in which a regular memory access takes 300 nanoseconds and servicing a page fault takes 500 ns. An average instruction takes 200 ns of CPU time and one memory access. The TLB hit ratio is 80% and the page fault ratio is 20%. What is the average instruction execution time?
My Approach =>
**Average time To Execute Instruction = CPU Time + Memory Access Time**
It is given that CPU Time = 200 ns
Probability of having a page fault for an instruction = 20% = 1/5
Hence, probability of not having a page fault = 4/5
If a TLB hit occurs,
then Memory Access Time = 0 + 300 = 300 ns (here the TLB access time is taken as negligible, so 0)
and
if a TLB miss occurs,
then Memory Access Time = TLB access time + page table 1 access + page table 2 access + one memory access = 0 + 300 + 300 + 300 = 900 ns
(assuming all the page tables reside in main memory).
Hit ratio of TLB = 80%
Hence, Memory Access Time = prob. of no page fault * (effective memory access time) + prob. of page fault * (effective memory access time + page fault service time)
Memory Access Time = 4/5 * (0.80 * 300 + 0.20 * 900) + 1/5 * ((0.80 * 300 + 0.20 * 900) + 500 ns for servicing a page fault)
= 336 + 184
= 520 ns
Average time To Execute Instruction = CPU Time + Memory Access Time
Average time To Execute Instruction = 200 ns + 520 ns = 720 ns .
Please correct me if I am making any mistake.
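As a quick sanity check of the arithmetic, here is a small Python sketch of the same calculation (it simply mirrors the approach above; the variable names are my own):

cpu_time = 200                     # ns of CPU time per instruction
tlb_hit, tlb_miss = 0.80, 0.20     # TLB hit/miss ratios
p_no_fault, p_fault = 0.80, 0.20   # page fault ratios
mem_hit = 300                      # ns on a TLB hit: one memory access
mem_miss = 900                     # ns on a TLB miss: two page-table accesses + one memory access
fault_service = 500                # ns to service a page fault

effective_access = tlb_hit * mem_hit + tlb_miss * mem_miss    # 420.0
memory_time = p_no_fault * effective_access + p_fault * (effective_access + fault_service)
print(memory_time)                 # 520.0
print(cpu_time + memory_time)      # 720.0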

Related

2-way set associative cache hit/miss ratio calculations

I am having a hard time figuring out how to know when there will be a hit or a miss. Here is the problem I'm doing (I have the answer but I can't figure out how they got the answer):
A 2-way set associative cache consists of four sets. Main memory contains 2K blocks of eight words each.
Show the main memory address format that allows us to map addresses from main memory to cache. Be sure to include the fields as well as their sizes. (I understand this and have done work and gotten the answer)
Compute the hit ratio for a program that loops 3 times from locations 8 to 51 in main memory. You may leave the hit ratio in terms of a fraction. Here is the answer:
First iteration of the loop: Address 8 is a miss, and then entire block brought into Set 1. Hence, 9-15 are then hits. 16 is a miss, entire block brought into Set 2, 17-23 are hits. 24 is a miss, entire block brought into Set 3, 25-31 are hits. 32 is a miss, entire block brought into Set 0, 33-39 are then hits. 40 is a miss, entire block brought into Set 1 (note we do NOT have to throw out the block with address 8 as this is 2-way set associative), 41-47 are hits. 48 is a miss, entire block brought into Set 2, 49-51 are hits.
For the first iteration of the loop, we have 6 misses, and 5*7 + 3 hits, or 38 hits. On the remaining iterations, we have 5*8 + 4 hits, or 44 hits each, for 88 more hits.
Therefore, we have 6 misses and 126 hits, for a hit ratio of 126/132, or 95.45%.
I'm still having trouble wrapping my head around how to figure out which memory addresses/blocks will be hits or misses.
There are some ambiguities in the question:
The cache line size is not given.
Each memory block is said to be 8 words long.
Therefore I've made a few assumptions:
A cache line is 8 words.
Main memory is word addressed.
Main memory contains 2K blocks of eight words each, i.e. 2^11 * 2^3 = 2^14 words, hence 14 bits for a word address. A cache line is 8 words wide, hence the least-significant 3 bits are used for word selection within a cache line. There are 4 sets, hence the next two bits are used for the indexing. This leaves 9 bits for the TAG.
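As an illustration of that split, here is a tiny Python sketch (my own addition, not part of the original answer) that extracts the TAG, set index, and word offset from a word address:

# Assumes 3 offset bits, 2 index bits, and the remaining bits as the TAG.
def split_address(addr, offset_bits=3, index_bits=2):
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

for a in (8, 16, 24, 32, 40, 48):
    tag, index, offset = split_address(a)
    print(a, format(tag, '09b'), format(index, '02b'), format(offset, '03b'))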
When address 8 (00000000001000) is issued, the index is 01 and the TAG is 000000000.
This is not in the cache, hence a miss.
For address 9 (00000000001001), the index is 01 and the TAG is 000000000. This is already in the cache, hence it's a hit.
For address 10 (00000000001010), the index is 01 and the TAG is 000000000. This is already in the cache, hence it's a hit.
The same pattern continues until address 15 (00000000001111).
When address 16 (00000000010000) is issued, the index is 10 and the TAG is 000000000.
This is not in the cache, hence a miss.
When address 17 (00000000010001) is issued, the index is 10 and the TAG is 000000000. This is already in the cache, hence it's a hit.
The same pattern continues until address 23 (00000000010111).
8-15: cache index 01, TAG 000000000: 1 miss, 7 hits
16-23: cache index 10, TAG 000000000: 1 miss, 7 hits
24-31: cache index 11, TAG 000000000: 1 miss, 7 hits
32-39: cache index 00, TAG 000000001: 1 miss, 7 hits
Now for address 40 (00000000101000), the index is again 01, but the TAG is 000000001. This is a miss; when the data is brought from memory it can go into the second entry of the set indexed by 01, as the cache is 2-way set associative.
40-47: cache index 01, TAG 000000001: 1 miss, 7 hits
48-51: cache index 10, TAG 000000001: 1 miss, 3 hits
For the first iteration: 6 misses and 38 hits. For the second iteration, 44 hits, and for the third iteration, 44 hits.
So overall we have 126 hits over 132 accesses.
The hit ratio is 126/132, or about 95.45%.
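To double-check the counting, here is a small Python simulation of a 2-way set associative cache with LRU replacement under the same assumptions (word-addressed memory, 8-word lines, 4 sets). It is my own sketch, not part of the original answer:

NUM_SETS, WORDS_PER_LINE, WAYS = 4, 8, 2

sets = [[] for _ in range(NUM_SETS)]   # each set holds up to WAYS tags, most recently used last
hits = misses = 0
for _ in range(3):                     # the program loops 3 times
    for addr in range(8, 52):          # locations 8 to 51 inclusive
        block = addr // WORDS_PER_LINE
        index = block % NUM_SETS
        tag = block // NUM_SETS
        ways = sets[index]
        if tag in ways:
            hits += 1
            ways.remove(tag)           # refresh LRU order
        else:
            misses += 1
            if len(ways) == WAYS:
                ways.pop(0)            # evict the least recently used line
        ways.append(tag)

print(hits, misses)                    # 126 6, i.e. a hit ratio of 126/132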

opencv write video without manually timing the frames

So it seems, according to this answer, that the OpenCV VideoWriter is not really smart about handling frames (or, well, maybe not suited for the purpose I would like to use it for). According to the answer to that question, you have to time your frames manually, so the creation of a two-hour-long video will take two hours.
If you want to check, the following script creates a 100 fps VideoWriter and writes 1500 frames to it, which should be exactly 15 seconds long, but ends up being 26 seconds or so.
EDIT: The code was edited to create six videos, at three frame rates, intended to be 15 and 30 seconds long. The table at the end of the question was made using these.
import numpy as np
import cv2
import time

# Write three videos intended to be 15 seconds long (15*fps frames each).
for fps in [20, 50, 100]:
    vWriter = cv2.VideoWriter("test" + str(fps) + ".avi", cv2.VideoWriter_fourcc('P', 'I', 'M', '1'), fps, (500, 500), True)
    y = 0
    for x in range(15 * fps):
        # Black frame with a white dot that moves down over the clip.
        img = np.zeros((500, 500, 3)).astype(np.uint8)
        cv2.circle(img, (250, int(y)), 5, (255, 255, 255), -1, cv2.LINE_AA)
        y += 500 / 15 / fps
        vWriter.write(img)
    vWriter.release()  # finalize the file

# Write three videos intended to be 30 seconds long (30*fps frames each).
for fps in [20, 50, 100]:
    vWriter = cv2.VideoWriter("test2_" + str(fps) + ".avi", cv2.VideoWriter_fourcc('P', 'I', 'M', '1'), fps, (500, 500), True)
    y = 0
    ts = time.time()  # start time (not used further here)
    for x in range(30 * fps):
        img = np.zeros((500, 500, 3)).astype(np.uint8)
        cv2.circle(img, (250, int(y)), 5, (255, 255, 255), -1, cv2.LINE_AA)
        y += 500 / 30 / fps
        vWriter.write(img)
    vWriter.release()
Is there any workaround for this? This manual timing of frames seems really cumbersome. Or if there are no workarounds, any other cross-platform video creation method that you can recommend, that does not suffer from this problem?
I made a little test with different lengths and frame rates: I checked 20, 50 and 100 fps with 15- and 30-second-long videos (intended length, so I generated 15 or 30 times the fps frames).
FPS intended_length actual_length
20 15 12
50 15 15
100 15 25
20 30 25
50 30 30
100 30 50
Looks like 50 fps is the one it gets right, but why?
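One way to see what actually ended up in each file, independent of a media player, is to read back the frame count and the fps recorded in the container with cv2.VideoCapture. This is just a checking sketch I added (it assumes OpenCV 3-style property names and the file names from the script above):

import cv2

for name in ["test20.avi", "test50.avi", "test100.avi"]:
    cap = cv2.VideoCapture(name)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    # Duration implied by the container itself, in seconds.
    print(name, fps, frames, frames / fps if fps else "n/a")
    cap.release()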

How to reduce Ipython parallel memory usage

I'm using Ipython parallel in an optimisation algorithm that loops a large number of times. Parallelism is invoked in the loop using the map method of a LoadBalancedView (twice), a DirectView's dictionary interface and an invocation of a %px magic. I'm running the algorithm in an Ipython notebook.
I find that the memory consumed by both the kernel running the algorithm and one of the controllers increases steadily over time, limiting the number of loops I can execute (since available memory is limited).
Using heapy, I profiled memory use after a run of about 38 thousand loops:
Partition of a set of 98385344 objects. Total size = 18016840352 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 5059553 5 9269101096 51 9269101096 51 IPython.parallel.client.client.Metadata
1 19795077 20 2915510312 16 12184611408 68 list
2 24030949 24 1641114880 9 13825726288 77 str
3 5062764 5 1424092704 8 15249818992 85 dict (no owner)
4 20238219 21 971434512 5 16221253504 90 datetime.datetime
5 401177 0 426782056 2 16648035560 92 scipy.optimize.optimize.OptimizeResult
6 3 0 402654816 2 17050690376 95 collections.defaultdict
7 4359721 4 323814160 2 17374504536 96 tuple
8 8166865 8 196004760 1 17570509296 98 numpy.float64
9 5488027 6 131712648 1 17702221944 98 int
<1582 more rows. Type e.g. '_.more' to view.>
You can see that about half the memory is used by IPython.parallel.client.client.Metadata instances. A good indicator that the results of the map invocations are being cached is the 401177 OptimizeResult instances - the same as the number of optimize invocations via lbview.map - since I am not caching them in my code.
Is there a way I can control this memory usage on both the kernel and the IPython parallel controller (whose memory consumption is comparable to the kernel's)?
Ipython parallel clients and controllers store past results and other metadata from past transactions.
The IPython.parallel.Client class provides a method for clearing this data:
Client.purge_everything()
documented here. There are also purge_results() and purge_local_results() methods, which give you finer control over what gets purged.
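A minimal sketch of how periodic purging might look inside such an optimisation loop (hypothetical function and variable names; the only IPython.parallel calls used are the ones mentioned above plus load_balanced_view and map):

from IPython.parallel import Client

rc = Client()
lbview = rc.load_balanced_view()

def step(x):
    # stand-in for the real per-iteration work
    return x * x

for i in range(38000):
    results = list(lbview.map(step, range(100)))
    # ... use `results` to update the optimiser's state ...
    if i % 1000 == 0:
        # Drop cached results and metadata on both the client and the hub
        # so memory does not grow without bound.
        rc.purge_everything()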

optimize hive query for multitable join

INSERT OVERWRITE TABLE result
SELECT /*+ STREAMTABLE(product) */
i.IMAGE_ID,
p.PRODUCT_NO,
p.STORE_NO,
p.PRODUCT_CAT_NO,
p.CAPTION,
p.PRODUCT_DESC,
p.IMAGE1_ID,
p.IMAGE2_ID,
s.STORE_ID,
s.STORE_NAME,
p.CREATE_DATE,
CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
FROM image i
JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
JOIN STORE s ON p.STORE_NO = s.STORE_NO
JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
I have a join query where I am joining huge tables and I am trying to optimize this Hive query. Here are some facts about the tables:
image table has 60m rows,
product table has 1b rows,
product_cat has 1000 rows,
store has 1m rows,
stock_info has 100 rows,
customizable_image has 200k rows.
A product can have one or two images (image1 and image2), and product-level information is stored only in the product table. I tried moving the join with product to the bottom, but I couldn't, as all the other joins that follow require data from the product table.
Here is what I have tried so far:
1. I gave the hint to Hive to stream the product table, as it is the biggest one.
2. I bucketed the table (during create table) into 256 buckets (on image_id) and then did the join - it didn't give me any significant performance gain.
3. I changed the input format from textfile (gzip files) to sequence file, so that it is splittable and hence more mappers can be run if Hive wants to run more mappers.
Here are some key logs from the Hive console. I ran this Hive query on AWS. Can anyone help me understand the primary bottleneck here? This job is only processing a subset of the actual data.
Stage-14 is selected by condition resolver.
Launching Job 1 out of 11
Number of reduce tasks not specified. Estimated from input data size: 22
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job -kill job_201403242034_0001
Hadoop job information for Stage-14: number of mappers: 341; number of reducers: 22
2014-03-24 20:55:05,709 Stage-14 map = 0%, reduce = 0%
.
2014-03-24 23:26:32,064 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 34198.12 sec
MapReduce Total cumulative CPU time: 0 days 9 hours 29 minutes 58 seconds 120 msec
.
2014-03-25 00:33:39,702 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 20879.69 sec
MapReduce Total cumulative CPU time: 0 days 5 hours 47 minutes 59 seconds 690 msec
.
2014-03-26 04:15:25,809 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 3903.4 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 3 seconds 400 msec
.
2014-03-26 04:25:05,892 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 2707.34 sec
MapReduce Total cumulative CPU time: 45 minutes 7 seconds 340 msec
.
2014-03-26 04:45:56,465 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3901.99 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 1 seconds 990 msec
.
2014-03-26 04:54:56,061 Stage-26 map = 100%, reduce = 100%, Cumulative CPU 2388.71 sec
MapReduce Total cumulative CPU time: 39 minutes 48 seconds 710 msec
.
2014-03-26 05:12:35,541 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 3792.5 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 3 minutes 12 seconds 500 msec
.
2014-03-26 05:34:21,967 Stage-5 map = 100%, reduce = 100%, Cumulative CPU 4432.22 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 13 minutes 52 seconds 220 msec
.
2014-03-26 05:54:43,928 Stage-21 map = 100%, reduce = 100%, Cumulative CPU 6052.96 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 40 minutes 52 seconds 960 msec
MapReduce Jobs Launched:
Job 0: Map: 59 Reduce: 18 Cumulative CPU: 3903.4 sec HDFS Read: 37387 HDFS Write: 12658668325 SUCCESS
Job 1: Map: 48 Cumulative CPU: 2707.34 sec HDFS Read: 12658908810 HDFS Write: 9321506973 SUCCESS
Job 2: Map: 29 Reduce: 10 Cumulative CPU: 3901.99 sec HDFS Read: 9321641955 HDFS Write: 11079251576 SUCCESS
Job 3: Map: 42 Cumulative CPU: 2388.71 sec HDFS Read: 11079470178 HDFS Write: 10932264824 SUCCESS
Job 4: Map: 42 Reduce: 12 Cumulative CPU: 3792.5 sec HDFS Read: 10932405443 HDFS Write: 11812454443 SUCCESS
Job 5: Map: 45 Reduce: 13 Cumulative CPU: 4432.22 sec HDFS Read: 11812679475 HDFS Write: 11815458945 SUCCESS
Job 6: Map: 42 Cumulative CPU: 6052.96 sec HDFS Read: 11815691155 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 32 minutes 59 seconds 120 msec
OK
The query is still taking longer than 5 hours in Hive, whereas in an RDBMS it takes only 5 hrs. I need some help in optimizing this query so that it executes much faster. Interestingly, when I ran the task with 4 large core instances, the time taken improved only by 10 minutes compared to the run with 3 large core instances, but when I ran the task with 3 medium core instances, it took 1 hr 10 mins more.
This brings me to the question, "is Hive even the right choice for such complex joins?"
I suspect the bottleneck is just in sorting your product table, since it seems much larger than the others. I think joins with Hive for tables over a certain size become untenable, simply because they require a sort.
There are parameters to optimize sorting, like io.sort.mb, which you can try setting so that more sorting occurs in memory rather than spilling to disk, re-reading and re-sorting. Look at the number of spilled records, and see if it is much larger than your inputs. There are a variety of ways to optimize sorting. It might also help to break your query up into multiple subqueries so it doesn't have to sort as much at one time.
For the stock_info and product_cat tables, you could probably keep them in memory since they are so small (check out the 'distributed_map' UDF in Brickhouse: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java). For the customizable images, you might be able to use a Bloom filter, if having a few false positives is not a big problem.
To completely remove the join, perhaps you could store the image info in a key-value store like HBase to do lookups instead. Brickhouse also has UDFs for HBase, like hbase_get and hbase_cached_get.

SPSS: Recoding time to make sampling rate explicit

I have data from an experiment that samples responses at between 59 and 60 Hz. There is no way to predict the drops in sampling rate throughout the experiment, which runs for 18 minutes.
Each of the sampled responses is numbered from 1 to N (the total number of rows), showing the relative passage of time, and stored in the variable 'frame'. I also have a unix time stamp marking absolute time, stored in 'unixtime'. But unixtime is reported in whole integers, not in fractional units. For example:
1376925380 may be repeated 59 times;
1376925381 may be repeated 60 times in the data file.
I would like to create a new variable that tracks each consecutive frame (or sampled response) from 1 to 60 or from 1 to 59, as the case may be, for each given unixtime stamp in SPSS. See the desired re-arrangement below. Any help w/ appropriate SPSS-syntax is appreciated!
unixtime newframe
1376925380 1
1376925380 2
1376925380 3
1376925380 4
1376925380 5
1376925380 6
....
1376925380 58
1376925380 59
1376925381 1
1376925381 2
1376925381 3
1376925381 4
.... ....
1376925381 60
1376925382 1
1376925382 2
....
If I understand correctly, you can use LAG to figure out your counter between the time stamps. Example below.
*fake data.
set seed 10.
input program.
loop #i = 1 to 100.
loop #j = 1 to TRUNC(RV.UNIFORM(59,61)).
compute unixtime = 1376925379 + #i.
end case.
end loop.
end loop.
end file.
end input program.
*Using lag to calculate newframe variable.
DO IF ($casenum = 1) OR (unixtime <> lag(unixtime)).
compute newframe = 1.
ELSE.
compute newframe = lag(newframe) + 1.
END IF.
exe.
See the related discussion of using LAG at: Using sequential case processing for data management in SPSS.
