Too large perf.data with too few events - perf

Recorded some stats using:
perf record -a -F 20 -o perf.data -e major-faults sleep 1800
and got perf.data ~ 1GiB with samples: 355, event count: 7592:
# Total Lost Samples: 0
#
# Samples: 355 of event 'major-faults'
# Event count (approx.): 7592
Why did so few samples take up so much space? Is there a tool to look inside perf.data to find out what it actually contains?
Using command:
perf report -i perf.data -D
I've found these events in perf.data:
(cut hex dumps of each event)
0 0 0x63a0 [0x30]: PERF_RECORD_COMM: gmain:1000/1010
0x63d0 [0x38]: event: 7
0 0 0x63d0 [0x38]: PERF_RECORD_FORK(1004:1004):(1:1)
0x6408 [0x30]: event: 3
0 0 0x6408 [0x30]: PERF_RECORD_COMM: cron:1004/1004
0x6438 [0x70]: event: 10
0 0 0x6438 [0x70]: PERF_RECORD_MMAP2 1004/1004: [0x5586d93d0000(0xb000) # 0 fd:01 3285696 93896821003936]: r-xp /usr/sbin/cron
But I didn't ask perf to record those events with the -e selector. How can I avoid recording them?

As osgx has already mentioned, the tool to look inside the perf.data file is perf script. perf script -D dumps the raw events from the perf.data file in hex format.
The perf.data file contains all the events generated by the Performance Monitoring Units, plus some metadata. The on-disk perf.data file usually begins with a perf_header struct of the form below -
struct perf_header {
    char     magic[8];   /* PERFILE2 */
    uint64_t size;       /* size of the header */
    uint64_t attr_size;  /* size of an attribute in attrs */
    struct perf_file_section attrs;
    struct perf_file_section data;
    struct perf_file_section event_types;
    uint64_t flags;
    uint64_t flags1[3];
};
The bulk of the perf.data file consists of perf events. Each sample record can carry metadata such as the 32-bit process ID and thread ID, the instruction pointer, information about the CPU being used, and so on. Which of this metadata is stored depends on the sample_type flags that were requested, such as PERF_SAMPLE_PID or PERF_SAMPLE_TID; see the perf_event_open(2) man page. You can disable recording of some of this metadata and thereby reduce the size of each event record written to the file.
PERF_RECORD_COMM, PERF_RECORD_FORK, PERF_RECORD_MMAP and the like are sideband events recorded by the kernel to help with later post-processing and detailed analysis. The perf tool enables them by default; the relevant flags live in struct perf_event_attr:
struct perf_event_attr {
    ...
    mmap           : 1, /* include mmap data */
    comm           : 1, /* include comm data */
    freq           : 1, /* use freq, not period */
    inherit_stat   : 1, /* per task counts */
    enable_on_exec : 1, /* next exec enables */
    task           : 1, /* trace fork/exit */
    ...
The ": 1" here only declares each of these as a one-bit flag; it is the perf record tool that sets mmap, comm and task to 1 by default. To disable logging of these events you would have to set them to 0 in the source code and recompile only the userspace perf tool that ships with the kernel. If they are set to 0, these events will not be recorded.
perf record has no command-line switch or option to disable these events.
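If you really need that level of control, one option is to open the event yourself. Below is a minimal sketch (my own illustration, not perf's source code) of opening the same major-faults counter via perf_event_open(2) with a small sample_type and the mmap/comm/task sideband events left off. The wrapper name perf_event_open follows the convention used in the man page, and error handling is minimal:
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc has no wrapper for perf_event_open, so call the raw syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size        = sizeof(attr);
    attr.type        = PERF_TYPE_SOFTWARE;
    attr.config      = PERF_COUNT_SW_PAGE_FAULTS_MAJ; /* like -e major-faults */
    attr.freq        = 1;
    attr.sample_freq = 20;                            /* like -F 20 */
    /* Ask for only the instruction pointer per sample; leaving out
       PERF_SAMPLE_PID, PERF_SAMPLE_TID etc. keeps each record small. */
    attr.sample_type = PERF_SAMPLE_IP;
    /* attr was zeroed above, so the mmap, comm and task sideband events
       stay disabled; listed explicitly here only for clarity. */
    attr.mmap = 0;
    attr.comm = 0;
    attr.task = 0;

    int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any CPU */,
                             -1 /* no group */, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    /* ... mmap the fd's ring buffer and read samples here ... */
    close(fd);
    return 0;
}
With a setup like this, only the requested sample data and none of the sideband records are generated, which is essentially what perf record cannot currently be told to do from the command line.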

Related

How to open and use huge page and transparent huge page in code on Ubuntu

I want to use huge pages or transparent huge pages in my code to optimize the performance of a data structure. But when I use madvise() in my code, it cannot allocate memory for me.
There is always [madvise] never in /sys/kernel/mm/transparent_hugepage/enabled.
There is always defer defer+madvise [madvise] never in /sys/kernel/mm/transparent_hugepage/defrag.
#include <iostream>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>

int main()
{
    void* ptr;
    std::cout << madvise(ptr, 1, MADV_HUGEPAGE) << std::endl;
    std::cout << strerror(errno) << std::endl;
    return 0;
}
The result of the above code is:
-1
Cannot allocate memory
Problems with the provided code example in the question
On my system, your code prints:
-1
Invalid argument
And I don't see how it would work in the first place. madvise does not allocate memory for you; it is used to set policies for existing memory ranges. Therefore, passing an uninitialized pointer as the first argument is not going to work.
The MADV_HUGEPAGE argument is documented in the madvise manual:
Enable Transparent Huge Pages (THP) for pages in the range
specified by addr and length. Currently, Transparent Huge
Pages work only with private anonymous pages (see
mmap(2)). The kernel will regularly scan the areas marked
as huge page candidates to replace them with huge pages.
The kernel will also allocate huge pages directly when the
region is naturally aligned to the huge page size (see
posix_memalign(2)).
How to use permanently reserved huge pages
Here is rewritten code that uses mmap instead of madvise. With that I can reproduce your "Cannot allocate memory" error:
#include <cerrno>
#include <cstring>
#include <iostream>
#include <sys/mman.h>
int main()
{
    const auto memorySize = 16ULL * 1024ULL * 1024ULL;
    void* data = mmap(
        /* "If addr is NULL, then the kernel chooses the (page-aligned) address at which to create the mapping" */
        nullptr,
        memorySize,
        /* memory protection / permissions */ PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
        /* fd should for compatibility be -1 even though it is ignored for MAP_ANONYMOUS */ -1,
        /* "The offset argument should be zero [when using MAP_ANONYMOUS]." */ 0
    );
    if ( data == MAP_FAILED ) {
        std::cout << "Failed to allocate memory: " << strerror( errno ) << "\n";
    } else {
        std::cout << "Allocated pointer at: " << data << "\n";
    }
    munmap( data, memorySize );
    return 0;
}
That error can be solved by making the kernel actually reserve some huge pages that can then be allocated. Normally this should be done at boot time, when most memory is still unused, for a better chance of success. In my case I was only able to reserve 37 huge pages of 2 MiB, i.e., 74 MiB of memory. I find that surprisingly low because I have 370 MiB "free" and 3.9 GiB "available" memory. Maybe I should close Firefox first and then try to reserve more huge pages, or maybe kswapd can somehow be triggered to defragment memory before reserving more huge pages.
echo 128 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
head /sys/kernel/mm/hugepages/hugepages-2048kB/*
Output:
==> /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy <==
37
==> /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages <==
0
==> /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages <==
0
==> /sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages <==
0
Now when I run the code snippet with clang++ hugePages.cpp && ./a.out, I get this output:
Allocated pointer at: 0x7f4454e00000
As can be seen from the trailing zeros, it is aligned to quite a large alignment value of 2 MiB.
How to use transparent huge pages
I have not seen any system actually using these fixed reserved huge pages. It seems that transparent huge pages have superseded that usage. Probably, partly because:
Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.
To mitigate these complexities, transparent huge pages were introduced:
No application changes need to be made to take advantage of THP, but interested application developers can try to optimize their use of it. A call to madvise() with the MADV_HUGEPAGE flag will mark a memory range as being especially suited to huge pages, while MADV_NOHUGEPAGE will suggest that huge pages are better used elsewhere. For applications that want to use huge pages, use of posix_memalign() can help to ensure that large allocations are aligned to huge page (2MB) boundaries.
That basically says it all, but I think the first statement is no longer true, because most systems nowadays are configured to madvise in /sys/kernel/mm/transparent_hugepage/enabled instead of always, which the statement was probably written for. So, here is another try with madvise:
#include <array>
#include <chrono>
#include <fstream>
#include <iostream>
#include <string_view>
#include <thread>
#include <errno.h>  // errno
#include <stdlib.h> // posix_memalign, free
#include <string.h> // strerror
#include <sys/mman.h>
int main()
{
    const auto memorySize = 16ULL * 1024ULL * 1024ULL;
    void* data{ nullptr };
    const auto memalignError = posix_memalign(
        &data, /* alignment equal or higher to huge page size */ 2ULL * 1024ULL * 1024ULL, memorySize );
    if ( memalignError != 0 ) {
        std::cout << "Failed to allocate memory: " << strerror( memalignError ) << "\n";
        return 1;
    }
    std::cout << "Allocated pointer at: " << data << "\n";

    if ( madvise( data, memorySize, MADV_HUGEPAGE ) != 0 ) {
        std::cerr << "Error on madvise: " << strerror( errno ) << "\n";
        return 2;
    }

    const auto intData = reinterpret_cast<int*>( data );
    intData[0] = 3;
    /* This access is at offset 3000 * sizeof( int ) = 12 kB, i.e.,
     * still in the same 2 MiB page as the access above. */
    intData[3000] = 3;
    intData[memorySize / sizeof( int ) / 2] = 3;

    /* Check whether transparent huge pages have been allocated. */
    std::ifstream smapsFile( "/proc/self/smaps" );
    std::array<char, 4096> lineBuffer;
    while ( smapsFile.good() ) {
        /* Getline always appends null. */
        smapsFile.getline( lineBuffer.data(), lineBuffer.size(), '\n' );
        std::string_view line{ lineBuffer.data() };
        if ( line.starts_with( "AnonHugePages:" ) && !line.contains( " 0 kB" ) ) {
            std::cout << "We are successfully using transparent huge pages!\n " << line << "\n";
        }
    }

    /* During this sleep, /proc/meminfo and /proc/vmstat can be checked for
     * transparent anonymous huge pages. */
    using namespace std::chrono_literals;
    std::this_thread::sleep_for( 100s );

    /* Read the test value before freeing to avoid a use-after-free. */
    const auto stillThree = intData[3000] == 3;
    free( data );
    return stillThree ? 0 : 3;
}
Running this with clang++ -std=c++2b hugeTransparentPages.cpp && ./a.out (C++23 is necessary for string_view functionality such as contains), the output on my system is:
Allocated pointer at: 0x7f38cd600000
We are successfully using transparent huge pages!
AnonHugePages: 4096 kB
And this test was executed while cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages yields 0, i.e., there are no persistently reserved huge pages.
Note that only two pages (4096 kB) out of the requested 16 MiB were actually used because the other pages have not been written to. This is also why the call to madvise is possible and yields huge pages. It has to be done before the actual physical allocation, i.e., before writing to the allocated memory.
The example code includes a check for transparent huge pages for the process itself. This site lists multiple ways to check the amount of anonymous transparent huge pages that are in use. For example, you can check system-wide with:
grep AnonHugePages /proc/meminfo
What I find interesting is that this is normally 0 kB on my system, but while the example code with madvise is running it shows 4096 kB.
To me, this seems to mean that none of the programs I normally use employ either persistently reserved huge pages or transparent huge pages. I find that very surprising because there should be many use cases for which the advantages of huge pages outweigh their disadvantages (wasted memory).

Questions when using Streams with input from Linux pipes

These are questions from a beginner to Dart and RxDart. The versions of Dart and RxDart are the latest as of yesterday.
In the following example Dart program, saved in the file 't.dart', only one of the two options, A or B, is uncommented at a time. Before executing it, a fifo is created by running 'mkfifo fifo'. The results of the execution are shown below.
Questions:
Why does a Stream opened using File show only one byte received, whereas the stdin Stream, fed from the same fifo, sees all the input?
Why does the RxDart operator take emit only one value?
Option-A: Executed as 'dart t.dart' in one window, and '(for i in A B C D; do echo -n $i; sleep 1; done) > fifo' in another window in the same directory. The output is:
byte count: 1, bytes: A
File is now closed.
Option-B: Executed as 'cat fifo | dart t.dart' in one window, and '(for i in A B C D; do echo -n $i; sleep 1; done) > fifo'. The output is:
byte count: 1, bytes: A
byte count: 1, bytes: B
byte count: 1, bytes: C
byte count: 1, bytes: D
File is now closed.
import 'dart:io';
import 'dart:convert';

main(List<String> args) {
  // Option-A
  // Stream<List<int>> inputStream = File("fifo").openRead();
  // Option-B
  // Stream<List<int>> inputStream = stdin;
  inputStream
      .transform(utf8.decoder)
      .take(16)
      .listen((bytes) => print('byte count: ${bytes.length}, bytes: ${bytes}'),
          onDone: () { print('File is now closed.'); },
          onError: (e) { print(e.toString()); });
}
(I'm not knowledgeable enough in the internals of how Dart I/O works to give a firm answer, so this is my best guess as to what is happening.)
What seems to be going on is that in Option A, you are creating a stream to a yet-to-exist file. Dart sees that the file doesn't exist yet, so publishing to the stream is delayed. Then when you run the echo script, it creates the file and appends the first value, "A", after which you tell it to sleep for 1 second.
During that second, Dart sees that the file now exists and begins streaming data from it. It reads "A", and then it reaches the end of the file. As far as Dart is concerned, that's the end of the story, so it closes the stream. By the time the script adds the "B", "C", and "D" to the file, Dart has already finished executing the program and exited the process.
In Option B, rather than telling Dart to stream from a file, you are tapping into the process's input stream, which (as far as I am aware) is going to remain open for as long as there is stuff being written to it. I have a feeling that understanding exactly what is happening requires better knowledge of cat and of how piping works in the terminal than I possess, but I believe the long-story-short of it is that the cat program knows the file is still being written to, which prevents it from terminating early. As such, whenever cat gets new data, it pipes that data to the Dart process's input stream.
Back to the Dart code: you are listening to the input stream, which is still expecting data since cat is still executing, and as such hasn't closed. Only when the file-writing process is complete does cat recognize that it has reached the true end of the file and shut down, at which point Dart recognizes that it isn't going to get more data and closes the input stream.
(As I said, this is merely my best guess, but I suspect an easy way to tell would be to look at the times at which your Dart program and the other script finish. If in Option A the Dart program finishes long before the script does, and in Option B they finish at roughly the same time, that would be sufficient evidence for me that the above is indeed what is happening.)

Linux size command, why are bss and data sections not zero?

I came across the size command, which gives the section sizes of an ELF file. While playing around with it, I created an output file for the simplest C++ program:
int main(){return 0;}
Clearly, I have not defined any initialized or uninitialized data, so why are my BSS and DATA sections 512 and 8 bytes in size?
I thought it might be because of int main(), so I tried creating an object file for the following C program:
void main(){}
I still don't get 0 for the BSS and DATA sections.
Is it because some minimum amount of memory is allocated to those sections?
EDIT: I thought it might be because of linked libraries, but my binary is dynamically linked, so that probably shouldn't be the issue.
int main(){return 0;} puts data in .text only.
$ echo 'int main(){return 0;}' | gcc -xc - -c -o main.o && size main.o
text data bss dec hex filename
67 0 0 67 43 main.o
You're probably running size on a fully linked executable.
$ gcc main.o -o main && size main
text data bss dec hex filename
1415 544 8 1967 7af main
In fact, if you compile with the libc attached to the binary, functions are added before (and after) the main() function. They are there mostly to load dynamic libraries (even if you do not need them in your case) and to unload them properly once main() ends.
These functions have global variables that require storage; uninitialized (zero initialized) global variables in the BSS segment and initialized global variables in the DATA segment.
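As a small illustration of the same effect in your own code (my own example, not from the original answer; the file name globals.c and variable names are made up), the following shows where globals end up; compare the size output before and after adding them:
/* globals.c - where global variables end up:
 *   zeroed -> .bss  (zero-initialized, not stored in the file)
 *   answer -> .data (initialized, stored in the file)
 * Try: gcc -c globals.c && size globals.o && nm globals.o
 * (compile without optimization so nothing gets folded away) */
#include <stdio.h>

static int zeroed[256];   /* 1024 bytes of .bss  */
static int answer = 42;   /*    4 bytes of .data */

int main(void)
{
    printf("%d %d\n", zeroed[0], answer);
    return 0;
}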
This is why you will always see BSS and DATA in all binaries compiled with the libc. If you want to get rid of this, you should write your own assembly program, like this (asm.s):
.globl _start
_start:
    /* exit(0) via the x86-64 exit system call so the program terminates cleanly */
    mov $60, %rax
    xor %rdi, %rdi
    syscall
And then compile it without the libc:
$> gcc -nostdlib -o asm asm.s
The BSS and DATA footprint of this ELF binary should now shrink to zero.

Is Linux perf imprecise for page fault and tlb miss?

I wrote a simple program to test page faults and TLB misses with perf. The code is as follows. It writes 1 GB of data sequentially and is expected
to trigger 1 GB / 4 KB = 256 K TLB misses and page faults.
#include <stdio.h>
#include <stdlib.h>

#define STEP 64
#define LENGTH (1024*1024*1024)

int main(){
    char* a = malloc(LENGTH);
    int i;
    for(i = 0; i < LENGTH; i += STEP){
        a[i] = 'a';
    }
    return 0;
}
However, the result is as follows and is far smaller than expected. Is perf really so imprecise? I would appreciate it if anyone could run the code on their machine.
$ perf stat -e dTLB-load-misses,page-faults ./a.out
Performance counter stats for './a.out':
12299 dTLB-load-misses
1070 page-faults
0.427970453 seconds time elapsed
Environment: Ubuntu 14.04.5 LTS , kernel 4.4.0; gcc 4.8.4 glibc 2.19. No compile flags.
The CPU is Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz.
The kernel prefetches pages on a fault, at least once it has evidence of a pattern. I can't find a definitive reference on the algorithm, but perhaps https://github.com/torvalds/linux/blob/master/mm/readahead.c is a starting point for seeing what is going on. I'd look for other performance counters that capture the behavior of this mechanism.
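One way to observe this effect directly (a sketch of my own, not taken from the referenced kernel code) is to ask the kernel which pages of a freshly touched mapping are resident, using mincore(2). Depending on fault-around and transparent-huge-page behavior, touching a single byte may populate more than one 4 KB page:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    const size_t length = 1024 * 1024;            /* 1 MiB test region */
    const long pageSize = sysconf(_SC_PAGESIZE);  /* typically 4096 */
    const size_t pages = (length + pageSize - 1) / pageSize;

    char *a = mmap(NULL, length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED) { perror("mmap"); return 1; }

    a[0] = 'a';                                   /* fault in (at least) one page */

    unsigned char vec[256];                       /* one byte per page; 256 suffices for 1 MiB / 4 KiB */
    if (mincore(a, length, vec) != 0) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < pages; ++i)
        resident += vec[i] & 1;

    printf("%zu of %zu pages resident after touching one byte\n", resident, pages);
    munmap(a, length);
    return 0;
}
Comparing that count with the number of page-fault events perf reports should help explain why far fewer than 256 K faults show up.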

BSS, Stack, Heap, Data, Code/Text - Where each of these start in memory?

Segments of memory - BSS, Stack, Heap, Data, Code/Text (Are there any more?).
Say I have 128 MB of RAM. Can someone tell me:
How much memory is allocated for each of these memory segments?
Where do they start? Please specify the address range or something like that for better clarity.
What factors influence which should start where?
That depends on the number of variables used. Since you did not specify the compiler, language, or even the operating system, it is a difficult one to pin down! It all rests with the operating system, which is responsible for the memory management of applications. In short, there is no definite answer to this question. Think about it this way: at run time the program (as set up by the compiler/linker) requests the operating system to allocate blocks of memory, and that allocation depends on how many variables there are, how big they are, and the scope and usage of those variables. For instance, take this simple C program, in a file called simpletest.c:
#include <stdio.h>

int main(int argc, char **argv)
{
    int num = 42;
    printf("The number is %d!\n", num);
    return 0;
}
Supposing the environment was Unix/Linux based and was compiled like this:
gcc -o simpletest simpletest.c
If you run objdump or nm on the binary image simpletest, you will see the sections of the executable, in this instance 'bss' and 'text'. Make a note of the sizes of these sections, then add an int var[100]; to the above code, recompile, and re-run objdump or nm: you will find that the bss section has grown (an initialized array would grow the data section instead; see the example below). Why? Because we added a global array of 100 ints.
This simple exercise shows that the sections grow and hence the binary gets bigger, and it also shows that you cannot pre-determine how much memory will be allocated, as the runtime implementation varies from compiler to compiler and from operating system to operating system.
In short, the OS calls the shots on memory management!
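As a concrete illustration of the exercise described above (my own sketch; the variable names are made up), this variant of simpletest.c adds one uninitialized and one initialized global, which you can compare with nm or size before and after:
#include <stdio.h>

int var[100];                     /* uninitialized global: shows up in bss (or COMMON) */
int var_init[100] = { 1, 2, 3 };  /* initialized global: shows up in the data section  */

int main(int argc, char **argv)
{
    int num = 42;                 /* local variable: lives on the stack at run time */
    printf("The number is %d!\n", num + var[0] + var_init[0]);
    return 0;
}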
You can get all this information by compiling your program
# gcc -o hello hello.c    # you might compile with -static for simplicity
and then running readelf:
# readelf -l hello
Elf file type is EXEC (Executable file)
Entry point 0x80480e0
There are 3 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x08048000 0x08048000 0x55dac 0x55dac R E 0x1000
LOAD 0x055dc0 0x0809edc0 0x0809edc0 0x01df4 0x03240 RW 0x1000
NOTE 0x000094 0x08048094 0x08048094 0x00020 0x00020 R 0x4
Section to Segment mapping:
Segment Sections...
00 .init .text .fini .rodata __libc_atexit __libc_subfreeres .note.ABI-tag
01 .data .eh_frame .got .bss
02 .note.ABI-tag
The output shows the overall structure of hello. The first program header corresponds to the process' code segment, which will be loaded from file at offset 0x000000 into a memory region that will be mapped into the process' address space at address 0x08048000. The code segment will be 0x55dac bytes large and must be page-aligned (0x1000). This segment will comprise the .text and .rodata ELF sections discussed earlier, plus additional sections generated during the linking procedure. As expected, it's flagged read-only (R) and executable (E), but not writable (W).
The second program header corresponds to the process' data segment. Loading this segment follows the same steps mentioned above. However, note that the segment size is 0x01df4 on file and 0x03240 in memory. This is due to the .bss section, which is to be zeroed and therefore doesn't need to be present in the file. The data segment will also be page-aligned (0x1000) and will contain the .data and .bss ELF sections. It will be flagged readable and writable (RW). The third program header results from the linking procedure and is irrelevant for this discussion.
If you have a proc file system, you can check this, as long as you get "Hello World" to run long enough (hint: gdb), with the following command:
# cat /proc/`ps -C hello -o pid=`/maps
08048000-0809e000 r-xp 00000000 03:06 479202 .../hello
0809e000-080a1000 rw-p 00055000 03:06 479202 .../hello
080a1000-080a3000 rwxp 00000000 00:00 0
bffff000-c0000000 rwxp 00000000 00:00 0
The first mapped region is the process' code segment, the second and third build up the data segment (data + bss + heap), and the fourth, which has no correspondence in the ELF file, is the stack. Additional information about the running hello process can be obtained with GNU time, ps, and /proc/pid/stat.
example taken from:
http://www.lisha.ufsc.br/teaching/os/exercise/hello.html
How much memory each segment gets depends on the program's global and local variables.
