perf report: multiple entries that each show > 90% overhead - perf

I am presenting perf's output for samples collected with perf record -g -p.
I don't know how to interpret the fact that there are many entries that each take > 90% of the time. After all, if a process spent 90% of its time in start_thread (and its children), it cannot also have spent > 90% of its time in java_start (for example).
Please explain.

Let us start with the command that you used: perf record.
The -g switch indicates that you are collecting callchain information along with the overhead information.
When perf collects callchains, the overhead can be shown in two columns, Children and Self.
Children    Self   Command          Shared Object  Symbol
-  14.19%   0.00%  qemu-system-x86  [unknown]      [.] 0xbbbe258d4c544155
        0xbbbe258d4c544155
        __libc_start_main
      + main
The 'self' overhead value indicates the count of timer ticks (period values) spent in an individual function alone. So if only the 'self' overhead values were displayed in perf report, they would always sum to 100%, as you are probably expecting.
However, when both the 'children' and 'self' columns are displayed, things become more confusing. The 'children' overhead column sums up the overhead values of all of the child functions called from a parent.
In your case, start_thread has a 'children' overhead of 98.78%, but its 'self' overhead is 0.00%. This means that the sum of the time spent in all of the functions called by start_thread (i.e. its child functions) is 98.78%, but start_thread itself contributes no overhead at all, since its 'self' overhead is 0.00%.
Now coming to Java_start: it looks like start_thread calls Java_start. Once again, the 'children' overhead for Java_start includes the sum of the overheads of all the functions it calls. That is why you see almost the same overhead values for both functions.
Consider an example:

void main() {
    do_main();
}

void do_main() {
    foo();
}

void foo() {
    bar();
}

void bar() {
    /* do something here */
}
Let us assume the 'self' overheads of foo() and bar() are 60% and 40% respectively, and that main() and do_main() each have a 'self' overhead of 0%.
Then the 'children' overheads of the functions will be:

main()     children: 100%  self: 0%
do_main()  children: 100%  self: 0%
foo()      children: 100%  self: 60%
bar()      children: 40%   self: 40%
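The relationship can be stated mechanically: a function's 'children' overhead equals its own 'self' overhead plus the 'children' overheads of everything it calls. A toy sketch of that recurrence (not perf's actual implementation; the Frame type and the numbers are just the hypothetical example above):

```rust
// Each frame: its 'self' overhead plus the functions it calls.
struct Frame {
    name: &'static str,
    self_pct: u32,
    callees: Vec<Frame>,
}

// 'children' overhead = own 'self' overhead
//                     + sum of the callees' 'children' overheads.
fn children_pct(f: &Frame) -> u32 {
    f.self_pct + f.callees.iter().map(children_pct).sum::<u32>()
}

// The call chain main -> do_main -> foo -> bar from the example,
// with 'self' overheads 0%, 0%, 60%, 40%.
fn example_tree() -> Frame {
    let bar = Frame { name: "bar", self_pct: 40, callees: vec![] };
    let foo = Frame { name: "foo", self_pct: 60, callees: vec![bar] };
    let do_main = Frame { name: "do_main", self_pct: 0, callees: vec![foo] };
    Frame { name: "main", self_pct: 0, callees: vec![do_main] }
}

fn main() {
    let tree = example_tree();
    // main's 'children' column covers everything below it: 100%.
    println!("{} children: {}%", tree.name, children_pct(&tree));
}
```

This also shows why several entries can each display near-100% 'children' overhead at the same time: every frame on the hot path inherits the whole subtree below it.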


Understanding memory leaks with Rc in Rust

If we look at the following code (playground link):
use std::cell::RefCell;
use std::rc::Rc;

struct Person {
    name: String,
    parent: Option<Rc<RefCell<Person>>>,
    child: Option<Rc<RefCell<Person>>>,
}

impl Drop for Person {
    fn drop(&mut self) {
        println!("dropping {}", &self.name);
    }
}

pub fn main() {
    let anakin = Rc::new(RefCell::new(Person {
        parent: None,
        child: None,
        name: "anakin".to_owned(),
    }));
    let luke = Rc::new(RefCell::new(Person {
        parent: None,
        child: None,
        name: "luke".to_owned(),
    }));
    luke.borrow_mut().parent = Some(anakin.clone());
    anakin.borrow_mut().child = Some(luke.clone());
    println!("anakin {}", Rc::strong_count(&anakin));
    println!("luke {}", Rc::strong_count(&luke));
}
When the code runs, the messages from the Drop implementation are not printed, because this code leaks memory: when it finishes, the count for both luke and anakin ends up at 1 rather than 0 (0 is when the managed heap data would be cleaned up).
Why does the count not end up being 0? I would have thought the following sequence would happen:

1. We start with two data locations on the heap: one pointed to by luke and anakin.child, the other pointed to by anakin and luke.parent. The object luke owns luke.parent; the object anakin owns anakin.child.
2. luke goes out of scope, which means the members it owns are also dropped. So luke and luke.parent drop. Both are Rcs, so the ref count for both memory locations goes down to 1.
3. anakin is dropped, causing the Rc objects anakin and anakin.child to drop, which causes the ref count for both memory locations to go down to 0.

This is where the heap data should be cleaned up? Are the members of an object not dropped when it is dropped?
When I remove or comment out the lines connecting luke and anakin via borrow_mut, the drops happen as expected and the messages are correctly printed.
luke goes out of scope, which means its members which it owns are also dropped. So luke and luke.parent drop. Both are Rcs, so the ref count for both the memory locations therefore goes down to 1
This is the step where you are misunderstanding.
Are the members of an object not dropped when it is dropped?
Exactly. The count that is decreased is only the one directly associated with the data the Rc points to. When luke goes out of scope, that Rc goes out of scope, and the count of references to luke's RefCell decreases from 2 to 1. Because the count never reaches 0, the Person inside is not dropped, and so luke.parent is never dropped either.
In your code, after
luke.borrow_mut().parent = Some(anakin.clone());
anakin.borrow_mut().child = Some(luke.clone());
there are 4 Rc objects (2 on the stack, 2 on the heap) and 2 RefCell objects on the heap, each with a count associated with it. Each RefCell has a count of 2, as you've seen, because two Rcs reference each RefCell.
When anakin drops, the count for its RefCell decreases, so you then have luke's RefCell on the heap with a count of 2 and anakin's RefCell with a count of 1.
When luke drops, that decreases the count of its RefCell, so each one now has a count of 1. You end up with no Rc values on the stack that reference the RefCells, but each RefCell references the other, so there is no way for Rust to know that they are safe to drop.
This is an expected limitation of the Rc type: combined with RefCell, it allows the introduction of cycles in the ownership of objects.
Your code is nearly the minimum reproducible example of an Rc cycle: What is a minimal example of an Rc dependency cycle?
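The standard way to break such a cycle is to make one direction of the link weak. A sketch of the same program with the parent pointer changed to Weak (the usual fix for parent/child cycles, though not the only possible design):

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Person {
    name: String,
    // Weak does not contribute to strong_count, so the back-edge
    // of the cycle no longer keeps the other Person alive.
    parent: Option<Weak<RefCell<Person>>>,
    child: Option<Rc<RefCell<Person>>>,
}

impl Drop for Person {
    fn drop(&mut self) {
        println!("dropping {}", &self.name);
    }
}

fn main() {
    let anakin = Rc::new(RefCell::new(Person {
        parent: None,
        child: None,
        name: "anakin".to_owned(),
    }));
    let luke = Rc::new(RefCell::new(Person {
        parent: None,
        child: None,
        name: "luke".to_owned(),
    }));
    // Downgrade instead of clone: the child -> parent edge is now weak.
    luke.borrow_mut().parent = Some(Rc::downgrade(&anakin));
    anakin.borrow_mut().child = Some(luke.clone());
    println!("anakin strong {}", Rc::strong_count(&anakin)); // 1
    println!("luke strong {}", Rc::strong_count(&luke));     // 2
    // At the end of main, both strong counts reach 0 and both
    // "dropping ..." messages are printed.
}
```

Since anakin's strong count is now 1, dropping the stack binding frees its Person, which in turn drops the child Rc and frees luke's Person as well.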

Calculate the time complexity of this function

Given this function (written in pseudocode), what is its time complexity? Trying it out, I would say the time complexity is Θ(n^3), since we need to traverse the tree, Θ(n), then multiply by the contribution of ANCESTOR, Θ(n), and the contribution of ADDTOQUEUE, Θ(n). Is this correct?
ANCESTOR does a number of operations proportional to the depth of the node.
ADDHEAD does a constant number of operations.
ADDTOQUEUE does a number of operations proportional to the length of the list.
FUNCTION(T) /* T is a tree filled with integers */
    L.head = NULL /* L is a new empty linked list (of integers) */
    REC_FUNC(T.root, L)
    return L

REC_FUNC(v, L)
    if (v == NULL) return
    if (ANCESTOR(v))
        ADDTOQUEUE(L, v.info)
    else
        ADDHEAD(L, v.info)
    REC_FUNC(v.left, L)
    REC_FUNC(v.right, L)
Your instinct about which operations dominate is right, but the costs add rather than multiply, so the bound is O(n^2), not O(n^3): each of the n recursive calls pays O(depth) for ANCESTOR plus at most O(|L|) for ADDTOQUEUE, and both of those are bounded by n.
Note also (though it is not provable from your pseudocode alone) that ANCESTOR and the list operations pull in opposite directions: early in the traversal L is short while the node may be deep, so ANCESTOR is expensive and the insert is cheap; after some steps the roles reverse, with shallow ANCESTOR work but a long list to scan.
If their costs are exactly complementary, each visit costs about 1 + n, 2 + (n - 1), 3 + (n - 2), ..., i.e. roughly n + 1 per step, which again sums to Θ(n^2).
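A quick sanity check of the quadratic bound, under the assumed worst case: a path-shaped tree where the k-th node visited sits at depth k and every queue insert scans the whole current list. This is only a cost model of the pseudocode above, not the algorithm itself:

```rust
// Cost model: visiting the k-th node pays ~k for ANCESTOR (depth)
// and ~(k - 1) for ADDTOQUEUE (current list length).
fn total_ops(n: u64) -> u64 {
    (1..=n).map(|k| k + (k - 1)).sum()
}

fn main() {
    // Doubling n quadruples the work: Theta(n^2), not Theta(n^3).
    println!("n = 100 -> {} ops", total_ops(100)); // 10000
    println!("n = 200 -> {} ops", total_ops(200)); // 40000
}
```

The closed form is sum of (2k - 1) for k = 1..n, which is exactly n^2.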

What is the most memory-efficient array of nullable vectors when most of the second dimension will be empty?

I have a large fixed-size array of variable-sized arrays of u32. Most of the second dimension arrays will be empty (i.e. the first array will be sparsely populated). I think Vec is the most suitable type for both dimensions (Vec<Vec<u32>>). Because my first array might be quite large, I want to find the most space-efficient way to represent this.
I see two options:
I could use a Vec<Option<Vec<u32>>>. I'm guessing that as Option is a tagged union, this would result in each cell being sizeof(Vec<u32>) rounded up to the next word boundary for the tag.
I could directly use Vec::with_capacity(0) for all cells. Does an empty Vec allocate zero heap until it's used?
Which is the most space-efficient method?
Actually, both Vec<Vec<T>> and Vec<Option<Vec<T>>> have the same space efficiency.
A Vec contains a pointer that will never be null, so the compiler is smart enough to recognize that in the case of Option<Vec<T>>, it can represent None by putting 0 in the pointer field. What is the overhead of Rust's Option type? contains more information.
What about the backing storage the pointer points to? A Vec doesn't allocate (see the same link) when you create it with Vec::new or Vec::with_capacity(0); in that case it uses a special, non-null dangling pointer. Vec only allocates heap space when you push something or otherwise force it to allocate. Therefore the space used, both for the Vec itself and for its backing storage, is the same in either case.
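Both claims are easy to check directly with std::mem::size_of and Vec::capacity; a small sketch:

```rust
use std::mem::size_of;

fn main() {
    // Niche optimization: Vec's data pointer is never null, so
    // Option<Vec<u32>> can encode None in the pointer field for free.
    assert_eq!(size_of::<Vec<u32>>(), size_of::<Option<Vec<u32>>>());

    // An empty Vec does not touch the heap until it has to grow.
    let v: Vec<u32> = Vec::with_capacity(0);
    assert_eq!(v.capacity(), 0);

    println!("Vec<u32>: {} bytes, Option<Vec<u32>>: {} bytes",
             size_of::<Vec<u32>>(),
             size_of::<Option<Vec<u32>>>());
}
```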
Vec<Vec<T>> is a decent starting point. Each entry costs 3 words (pointer, length, capacity) even if it is empty, and for filled entries there can be additional per-allocation overhead. But depending on which trade-offs you're willing to make, there might be a better solution.
Vec<Box<[T]>>: this reduces the size of an entry from 3 words to 2. The downside is that changing the number of elements in a box is both inconvenient (convert to and from Vec<T>) and more expensive (reallocation).
HashMap<usize, Vec<T>>: this saves a lot of memory if the outer collection is sufficiently sparse. The downsides are higher access cost (hashing, scanning) and a higher per-element memory overhead.
If the collection is only filled once and you never resize the inner collections, you could use a split data structure. This not only reduces the per-entry size to 1 word, it also eliminates the per-allocation overhead:
struct Nested<T> {
    data: Vec<T>,
    indices: Vec<usize>, // points after the last element of the i-th slice
}

impl<T> Nested<T> {
    fn get_range(&self, i: usize) -> std::ops::Range<usize> {
        assert!(i < self.indices.len());
        if i > 0 {
            self.indices[i - 1]..self.indices[i]
        } else {
            0..self.indices[i]
        }
    }

    pub fn get(&self, i: usize) -> &[T] {
        let range = self.get_range(i);
        &self.data[range]
    }

    pub fn get_mut(&mut self, i: usize) -> &mut [T] {
        let range = self.get_range(i);
        &mut self.data[range]
    }
}
For additional memory savings you can shrink the indices to u32, limiting you to about 4 billion elements per collection.
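A hypothetical from_slices constructor makes the intended fill-once usage concrete. The sketch repeats a trimmed-down Nested<T> so it compiles on its own; from_slices is not part of the answer above, just an illustration:

```rust
struct Nested<T> {
    data: Vec<T>,
    indices: Vec<usize>, // end offset of the i-th slice in `data`
}

impl<T: Clone> Nested<T> {
    // Hypothetical fill-once constructor: concatenate the slices and
    // record the running end offsets (prefix sums of the lengths).
    fn from_slices(slices: &[&[T]]) -> Self {
        let mut data = Vec::new();
        let mut indices = Vec::with_capacity(slices.len());
        for s in slices {
            data.extend_from_slice(s);
            indices.push(data.len());
        }
        Nested { data, indices }
    }

    fn get(&self, i: usize) -> &[T] {
        let start = if i > 0 { self.indices[i - 1] } else { 0 };
        &self.data[start..self.indices[i]]
    }
}

fn main() {
    let n = Nested::from_slices(&[&[1u32, 2, 3], &[], &[4, 5]]);
    assert_eq!(n.get(0), &[1, 2, 3]);
    assert!(n.get(1).is_empty()); // an empty entry costs one usize
    assert_eq!(n.get(2), &[4, 5]);
    println!("total elements: {}", n.data.len());
}
```

Note that all the payload lives in a single allocation, so empty entries cost one usize each rather than a whole Vec header.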

Julia - nested loops are consuming a lot of memory

I have a nested loop iteration scheme that's taking up a lot of memory. I understand that I should not use globals, but even after I wrapped everything in a function, the memory situation didn't improve: allocations just accumulate with each iteration, as if there were no garbage collection.
Here is a workable example that's similar to my code.
I have two files. First, functions.jl:
##functions.jl
module functions

function getMatrix(A)
    L = rand(A, A)
    return L
end

function loopOne(A, B)
    res = 0
    for i = 1:B
        res = inv(getMatrix(A))
    end
    return res
end

end
Second file main.jl:
##main.jl
include("functions.jl")

function main(C)
    res = 0
    A = 50
    B = 30
    for i = 1:C
        mat = functions.loopOne(A, B)
        res = mat .+ 1
    end
    return res
end

main(100)
When I execute julia main.jl, the reported memory grows as I increase C in main(C) (sometimes to millions of allocations and more than 10 GiB when I increase C to 1000000).
I know that the example looks useless but it resembles the structure that I have. Can someone please help? Thank you.
UPDATE:
Michael K. Borregaard gave an answer which is very helpful:
module Functions #1

using Random: rand!   # rand! lives in the Random standard library

function loopOne!(res, mymatrix, B) #2
    for i = 1:B
        res .= inv(rand!(mymatrix)) #3
    end
    return res #4
end

end

function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
However, when I time it, allocations and memory still increase as I dial up C.
@time some_descriptive_name(30)
  0.057177 seconds (11.77 k allocations: 58.278 MiB, 9.58% gc time)
@time some_descriptive_name(60)
  0.113808 seconds (23.53 k allocations: 116.518 MiB, 9.63% gc time)
I believe that the problem comes from the inv function. If I change the code to:
function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= res .+ 1
    end
    return res
end
The memory and allocations will then stay constant:
@time some_descriptive_name(3)
  0.000007 seconds (8 allocations: 39.438 KiB)
@time some_descriptive_name(60)
  0.000037 seconds (8 allocations: 39.438 KiB)
Is there a way to "clear" the memory after using inv? Since I'm not creating or storing anything new, the memory usage should stay constant.
A few pointers at least:
The getMatrix function allocates a new A×A matrix every time. That will certainly consume memory. It is better to avoid the allocations if you can, e.g. by using rand! to fill an existing array with random values.
The res = 0 line defines res as an Int, but you subsequently assign a Matrix{Float64} to it (the result of inv(getMatrix(A))). Changing the type of a variable makes it hard for the compiler to figure out what the type is, which makes for slow code.
You have a module called functions, but module names are conventionally capitalized.
The res = inv(...) line overwrites res on every iteration, so the loop does nothing useful!
The structure and code look like C++. Try looking at the Julia style guide.
Here's how the code would look in a more idiomatic way that avoids allocations:
module Functions #1

using Random: rand!   # rand! lives in the Random standard library

function loopOne!(res, mymatrix, B) #2
    for i = 1:B
        res .= inv(rand!(mymatrix)) #3
    end
    return res #4
end

end

function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
Comments:
Use a module if you like - it's up to you whether to put things in different files. Module names are capitalized.
If you can, it's an advantage to use functions that overwrite the values of an existing container. Such functions end with ! to signal that they will modify their arguments (like passing a variable by reference, without making it const, in C++).
Use the .= operator to indicate that you're not creating a new container but overwriting the elements of the existing one. The rand! function overwrites mymatrix.
The return keyword is not strictly needed, but DNF suggested in a comment that it is better style.
The main convention isn't used in Julia, as most code gets called by the user, not by execution of a program.
Compact assignment format for multiple variables.
Note that in this case none of these optimisations matter much, as 99% of the calculation time is spent in the expensive inv function.
RESPONSE TO THE UPDATE:
There's nothing wrong with the inv function; it is just a costly operation. But again, I think you may be misunderstanding what the memory counting does. It is not that the memory use is increasing the way it would in C++ if you had a pointer to an object that was never released (a memory leak). The memory use is constant, but the total sum of allocations increases, because the inv function has to make some internal allocations each time it is called.
Consider this example:

for i in 1:n
    b = [1, 2, 3, 4] # a length-4 Array{Int64} is initialized in memory; cost is 32 bytes
end                  # here, that memory is released
For each run through the for loop, 32 bytes are allocated and 32 bytes are released. When the loop ends, regardless of n, 0 bytes will remain allocated from this operation. But Julia's memory tracking only adds up the allocations, so after running the code you will see an allocation of 32*n bytes reported.
The reason Julia reports this is that allocating space in RAM is one of the costliest operations in computing, so reducing allocations is a good way to speed up code. But you cannot always avoid it.
There is thus nothing wrong with your code (in the new form): the memory allocation and time you see are just the result of doing a big (expensive) operation.

Swift stack and heap understanding

I want to understand what is stored on the stack and what on the heap in Swift. My rough guess: whatever prints as a memory address rather than a value is stored on the stack, and whatever prints as a value lives on the heap - basically following the split between value and reference types. Am I completely wrong? And optionally, could you provide a visual representation of the stack and heap?
As @Juul stated, reference types are stored on the heap and value types on the stack.
Here is the explanation:
Stack and Heap
The stack is used for static memory allocation and the heap for dynamic memory allocation; both are stored in the computer's RAM.
Variables allocated on the stack are stored directly in memory, and access to this memory is very fast; its allocation is determined when the program is compiled. When a function or method calls another function, which in turn calls another, and so on, the execution of all those functions remains suspended until the very last one returns its value. The stack is always reserved in LIFO order; the most recently reserved block is always the next to be freed. This makes it really simple to keep track of the stack: freeing a block is nothing more than adjusting one pointer.
Variables allocated on the heap have their memory allocated at run time, and accessing this memory is a bit slower, but the heap size is limited only by the size of virtual memory. Elements of the heap have no dependencies on each other and can be accessed randomly at any time. You can allocate a block at any time and free it at any time, which makes it more complex to track which parts of the heap are allocated or free at any given moment.
For Escaping Closure:
An important note to keep in mind is that in cases where a value stored on a stack is captured in a closure, that value will be copied to the heap so that it's still available by the time the closure is executed.
For more reference: http://net-informations.com/faq/net/stack-heap.htm
Classes (reference types) are allocated on the heap; value types (like Struct, String, Int, Bool, etc.) live on the stack. See this topic for more detailed answers: Why Choose Struct Over Class?
Stack vs Heap
The stack is a part of a thread. It consists of method (function) frames in LIFO order; a method frame contains only local variables. It is effectively the method stack trace you see while debugging or analysing an error.
The heap is another part of memory, where ARC comes into play. It takes more time to allocate memory here (finding an appropriate place and allocating it in a synchronized way).
These concepts are the same as in the JVM.
Xcode offers a view of this via Debug Memory Graph.
To see a backtrace, use:
Edit Scheme... -> <Action> -> Diagnostics -> Malloc Stack Logging
[Value vs Reference type]
[Class vs Struct]
Usually when we ask a question like this (stack or heap?) we care about performance and are motivated by the desire to avoid the excessive cost of heap allocation. Following the general rule that "reference types are heap-allocated and value types are stack-allocated" may lead to suboptimal design decisions and needs further discussion.
One may falsely conclude that passing around structs (value types) is universally faster than passing classes (reference types) because it never requires heap allocation. It turns out this is not always true.
The important counter-example is protocol types, where concrete polymorphic types with value semantics (structs) implement a protocol, as in this toy example:
protocol Vehicle {
    var mileage: Double { get }
}

struct CombustionCar: Vehicle {
    let mpg: Double
    let isDiesel: Bool
    let isManual: Bool
    var fuelLevel: Double // gallons
    var mileage: Double { fuelLevel * mpg }
}

struct ElectricCar: Vehicle {
    let mpge: Double
    var batteryLevel: Double // kWh
    var mileage: Double { batteryLevel * mpge / 33.7 }
}

func printMileage(vehicle: Vehicle) {
    print("\(vehicle.mileage)")
}

let datsun: Vehicle = CombustionCar(mpg: 18.19,
                                    isDiesel: false,
                                    isManual: false,
                                    fuelLevel: 12)
let tesla: Vehicle = ElectricCar(mpge: 132,
                                 batteryLevel: 50)

let vehicles: [Vehicle] = [datsun, tesla]
for vehicle in vehicles {
    printMileage(vehicle: vehicle)
}
Note that CombustionCar and ElectricCar objects have different sizes, yet we are able to mix them in an array of Vehicle protocol types. This raises a question: don't the array elements need to be the same size? How can the compiler compute the offset of an array element if it doesn't know the element size at compile time?
It turns out there's quite a lot of logic under the hood. The Swift compiler creates what is called an Existential Container: a fixed-size data structure acting as a wrapper around an object. It is this container that gets passed to a function call (pushed onto the stack) instead of the actual struct.
The Existential Container is five words long:
| |
|valueBuffer|
| |
| vwt |
| pwt |
The first three words are called the valueBuffer, and this is where the actual struct gets stored - unless the struct is larger than three words, in which case the compiler allocates the struct on the heap and stores a reference to it in the valueBuffer:
    STACK              STACK                  HEAP
|     mpge     |   | reference |------->|    mpg    |
| batteryLevel |   |           |        | isDiesel  |
|              |   |           |        | isManual  |
|     vwt      |   |    vwt    |        | fuelLevel |
|     pwt      |   |    pwt    |
So passing a protocol-typed object to a function may actually require a heap allocation. The compiler does the allocation and copies the values, so you still get value semantics, but the cost varies depending on whether your struct fits in three words. This renders the rule "value types on the stack, reference types on the heap" not always correct.
