Usually, the sk_buff is allocated on the transmit side by a module I developed (an interface to the network driver) through alloc_skb and then handed to the network driver. It is normally freed by the network driver after the send.
Is there a way to free the SKB from my module instead, or a way to verify that the network driver is freeing the skb properly?
If I recall correctly, you may use skb_get() to increment the reference counter of your buffer on the send side. This means that as soon as the network driver decides to free the buffer after putting it on the wire, it will call the proper function, which will decrement the reference counter, but the counter will still remain non-zero (because of your previous +1), so the buffer won't be freed. You will then be able to hold the buffer and free it yourself (as the last user) whenever you want.
However, if you have a chain of sk_buff-s, I'm afraid you need to perform skb_get() on each segment.
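A minimal sketch of that pattern, assuming your module already has a struct net_device *dev (e.g. obtained via dev_get_by_name) and hands the buffer off with dev_queue_xmit; the helper name is ours and error handling is simplified:

```c
#include <linux/skbuff.h>
#include <linux/netdevice.h>

/* Hypothetical helper: build an skb, keep an extra reference, transmit it. */
static void xmit_and_hold(struct net_device *dev, const void *data, unsigned int len)
{
    struct sk_buff *skb = alloc_skb(len + LL_RESERVED_SPACE(dev), GFP_ATOMIC);
    if (!skb)
        return;

    skb_reserve(skb, LL_RESERVED_SPACE(dev));
    skb_put_data(skb, data, len);          /* copy the payload into the skb */
    skb->dev = dev;

    skb_get(skb);                          /* +1: the driver's free only drops its own ref */
    dev_queue_xmit(skb);                   /* hand the buffer to the driver */

    /* ... later, once we are done holding/inspecting the buffer ... */
    kfree_skb(skb);                        /* drop our extra reference */
}
```

skb_put_data() exists on recent kernels; on older ones the equivalent is memcpy(skb_put(skb, len), data, len).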
As for making sure that the network driver itself frees the buffer correctly, you may, for example, find the places in the network driver where a buffer is freed on transmit and insert debug printouts. Then recompile, interact with the driver in the normal way (i.e. without doing skb_get() prior to handing the buffer to the driver) and observe the debug printouts appearing in dmesg.
I have several doubts about processes and memory management; I'll list the main ones. I'm slowly trying to solve them by myself, but I would still like some help from you experts =).
I understood that the data structures associated with a process are more or less these:
text, data, stack, kernel stack, heap, PCB.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied, for example, to the SSD, or maybe just text and data (with the PCB staying in kernel space)?
Paging allows you to allocate processes in a non-contiguous way:
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
If processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow, since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in the PCBs as memory pointers that are contiguous for each process, so at each function call it checks whether the VIRTUAL pointer to the top of the stack has touched the heap?
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
How do processes end? Is the exit system call invoked both in case of normal termination (the last instruction has finished) and in case of killing (by the parent process, the kernel, etc.)? Does the process itself enter kernel mode and free its associated memory?
When are the kernel schedulers (LTS, MTS, STS) invoked? From what I understand there are three types of kernels:
separate kernel, below all processes.
the kernel runs inside the processes (they only change modes) but there are "process switching functions".
the kernel itself is based on processes but still everything is based on process switching functions.
I guess the number of pages allocated for the text and data depends on the "length" of the code and of the "global" data. On the other hand, is the number of pages allocated for the heap and the stack variable for each process? For example, I remember that the JVM allows you to change the size of the stack.
When a running process wants to write n bytes to memory, does the kernel try to fill a page already dedicated to it and create a new one for the remaining bytes (so the page table is lengthened)?
Many thanks in advance to anyone who can help.
Have a good day!
I think you have lots of misconceptions. Let's try to clear some of these.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied, for example, to the SSD, or maybe just text and data (with the PCB staying in kernel space)?
I don't know what you mean by LTS. The kernel can decide to send some pages to secondary memory, but only at page granularity: it won't send a whole text segment or a complete data segment to the hard disk, only a page or a few pages at a time. Yes, the PCB is stored in kernel space and is never swapped out (see here: Do Kernel pages get swapped out?).
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
On x86-64, each page table entry has 12 bits reserved for flags. The first (right-most) bit is the present bit. On access to the page referenced by this entry, it tells the processor whether it should raise a page fault. If the present bit is 0, the processor raises a page fault and calls a handler defined by the OS in the IDT (interrupt 14). Virtual memory is not secondary memory; they are not the same thing. Virtual memory doesn't have a physical medium to back it. It is a concept that is implemented in hardware, yes, but with logic, not with a physical medium. The kernel holds a memory map of the process in the PCB. On a page fault, if the access was not within this memory map, it will kill the process.
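As a small illustration of the present bit (the constant follows the x86-64 page-table entry format; the helper name is ours):

```c
#include <stdint.h>
#include <stdbool.h>

#define PTE_PRESENT (1ULL << 0)   /* bit 0 of a page-table entry */

/* 0 => any access to the page raises a page fault (IDT vector 14),
 * and the OS handler decides whether to map the page or kill the process. */
static bool pte_is_present(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}
```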
If processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow, since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in the PCBs as memory pointers that are contiguous for each process, so at each function call it checks whether the VIRTUAL pointer to the top of the stack has touched the heap?
The processes are allocated contiguously in virtual memory but not in physical memory. See my answer here for more info: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?. I think stack overflow is checked with a guard page. The stack has a maximum size (8 MB by default) and one page marked not-present is left underneath it, so that if this page is accessed, the kernel is notified via a page fault that it should kill the process. In itself, there can be no stack overflow attack on another process in user mode, because the paging mechanism already isolates different processes via the page tables. The heap has a portion of virtual memory reserved for it, and it is very big. The heap can thus grow according to how much physical space you actually have to back it, that is, the size of the swap file plus RAM.
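As a small aside (assuming Linux/glibc), the per-process stack ceiling mentioned above can be queried from user space:

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_STACK is the kernel-enforced maximum stack size (8 MiB by default). */
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        printf("stack limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);

    return 0;
}
```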
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
Programs assume an address (often 0x400000) for the base of the executable. Today you also have ASLR, where all symbols are kept in the executable and their addresses are determined at load time of the executable. In practice, this is not done much (but it is supported).
How do processes end? Is the exit system call invoked both in case of normal termination (the last instruction has finished) and in case of killing (by the parent process, the kernel, etc.)? Does the process itself enter kernel mode and free its associated memory?
The kernel has a memory map for each process. When the process dies via abnormal termination, the memory map is walked and cleared of that process's use.
When are the kernel schedulers (LTS, MTS, STS) invoked?
All your assumptions are wrong. The scheduler cannot be called other than via a timer interrupt. The kernel isn't a process. There can be kernel threads, but they are mostly created via interrupts. The kernel starts a timer at boot and, when there is a timer interrupt, the kernel calls the scheduler.
I guess the number of pages allocated for the text and data depends on the "length" of the code and of the "global" data. On the other hand, is the number of pages allocated for the heap and the stack variable for each process? For example, I remember that the JVM allows you to change the size of the stack.
The heap and stack have portions of virtual memory reserved for them. The text/data segments start at 0x400000 and end wherever they need to. The space reserved for them in virtual memory is really big, so they are limited by the amount of physical memory available to back them. The JVM is another matter: the stack in the JVM is not the real stack. The JVM's stack is probably heap, because the JVM allocates heap for all of the program's needs.
When a running process wants to write n bytes to memory, does the kernel try to fill a page already dedicated to it and create a new one for the remaining bytes (so the page table is lengthened)?
The kernel doesn't do that. On Linux, the libstdc++/libc C++/C implementation does that instead. When you allocate memory dynamically, the C++/C implementation keeps track of the allocated space so that it won't request a new page for a small allocation.
EDIT
Do compiled (and interpreted?) programs only work with virtual addresses?
Yes, they do. Everything is a virtual address once paging is enabled. Paging is enabled via a control register set at boot by the kernel. The processor's MMU will automatically read the page tables (some of which are cached) and translate these virtual addresses to physical ones.
So do pointers inside PCBs also use virtual addresses?
Yes. For example, the PCB on Linux is the task_struct. Its mm field points to an mm_struct, which holds a field called pgd. That field holds a virtual address and, when dereferenced, yields the first entry of the PML4 on x86-64.
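A rough sketch of how a kernel module running in process context could reach that pointer (the helper name is ours; kernel threads are skipped because they have no mm):

```c
#include <linux/sched.h>
#include <linux/mm_types.h>
#include <linux/printk.h>

static void show_pgd(void)
{
    struct task_struct *task = current;   /* the Linux PCB */

    if (task->mm)                         /* NULL for kernel threads */
        pr_info("%s: pgd at %p\n", task->comm, task->mm->pgd);
}
```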
And since the virtual memory of each process is contiguous, the kernel can immediately recognize stack overflows.
The kernel doesn't recognize stack overflows as such. It simply will not allocate more pages to the stack than the maximum stack size, which is a simple global variable in the Linux kernel. The stack is used with push/pop instructions, and a push cannot move the stack pointer by more than 8 bytes at a time, so it is simply a matter of reserving a guard page that creates page faults on access.
However, from what I understand, the scheduler is invoked (at least in modern systems) with timer mechanisms (like round robin). Is that correct?
Round-robin is not a timer mechanism. The timer is interacted with using memory-mapped registers. These registers are detected using the ACPI tables at boot (see my answer here: https://cs.stackexchange.com/questions/141870/when-are-a-controllers-registers-loaded-and-ready-to-inform-an-i-o-operation/141918#141918). It works similarly to the answer I provided there for USB. Round-robin is a scheduling scheme, often called naive because it simply gives every process a time slice and executes them in order; it is not what the Linux kernel currently uses (I think).
I did not understand the last point. How is the allocation of new memory managed?
The allocation of new memory is done with a system call. See my answer here for more info: Who sets the RIP register when you call the clone syscall?.
The user-mode process jumps into a handler for the system call by executing the syscall instruction in assembly. It jumps to an address specified at boot by the kernel in the LSTAR register. The kernel then jumps from assembly to a function. This function does the work the user-mode process requires and returns to the user-mode process. This is often not done by the programmer but by the C/C++ implementation (often called the standard library), a user-mode library that is linked against dynamically.
The C/C++ standard library keeps track of the memory it has allocated by itself allocating some memory and keeping records. Then, if you ask for a small allocation, it will use the pages it has already allocated instead of requesting new ones using mmap (on Linux).
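A rough illustration of that bookkeeping (assuming glibc, whose default mmap threshold is 128 KiB; the exact behavior is an implementation detail):

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    char *small = malloc(64);        /* carved out of pages the allocator already holds */
    char *large = malloc(1 << 20);   /* large enough to be served by a fresh mmap() */

    printf("small: %p\nlarge: %p\n", (void *)small, (void *)large);

    free(small);                     /* returned to the allocator's free lists */
    free(large);                     /* mmap'ed chunk is given back with munmap() */
    return 0;
}
```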
I want to develop an application with OpenCL to run on multiple GPUs. At some point, data from one GPU should be transferred to another one. Is there any way to avoid transferring it through the host? This can be done in CUDA via the cudaMemcpyPeerAsync function. Is there any similar function in OpenCL?
In OpenCL, a context is treated as a memory space. So if you have multiple devices associated with the same context, and you create a command queue per device, you can potentially access the same buffer object from multiple devices.
When you access a memory object from a specific device, the memory object first needs to be migrated to the device so it can physically access it. Migration can be done explicitly using clEnqueueMigrateMemObjects.
So a simple producer-consumer sequence with multiple devices can be implemented as outlined below (a code sketch follows the outline):
command queue on device 1:
migrate memory buffer1
enqueue kernels that process this buffer
save last event associated with buffer1 processing
command queue on device 2:
migrate memory buffer1 - use the event produced by queue 1 to sync the migration.
enqueue kernels that process this buffer
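A minimal sketch of that outline in the C API (assumed: both devices share one cl_context, in-order queues q1/q2 and kernels k1/k2 with their arguments already set, a buffer buffer1, and a 1-D global size gsz; error checking omitted):

```c
#include <CL/cl.h>

void produce_then_consume(cl_command_queue q1, cl_command_queue q2,
                          cl_mem buffer1, cl_kernel k1, cl_kernel k2,
                          size_t gsz)
{
    cl_event done_on_dev1;

    /* Device 1: migrate the buffer, run the producer, remember its event. */
    clEnqueueMigrateMemObjects(q1, 1, &buffer1, 0, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, k1, 1, NULL, &gsz, NULL, 0, NULL, &done_on_dev1);

    /* Device 2: migrate only after device 1 has finished producing. */
    clEnqueueMigrateMemObjects(q2, 1, &buffer1, 0, 1, &done_on_dev1, NULL);
    clEnqueueNDRangeKernel(q2, k2, 1, NULL, &gsz, NULL, 0, NULL, NULL);

    clReleaseEvent(done_on_dev1);
    clFinish(q2);
}
```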
How exactly migration occurs under the hood I cannot tell, but I assume that it can either be DMA from device 1 to device 2 or (more likely) DMA from device 1 to host and then host to device 2.
If you wish to avoid the limitation of using a single context, or would like to ensure the data transfer is efficient, then you are at the mercy of vendor-specific extensions.
For example, AMD offers DirectGMA technology, which allows explicit remote DMA between the GPU and any other PCIe device (including other GPUs). From experience it works very nicely.
If I execute a kernel that uses a small piece of constant memory, then write to that constant memory while the kernel is running, does the kernel immediately see the change, or is the contents of the constant memory "cached" upon kernel launch - or does the OpenCL driver unconditionally delay the constant memory update until the kernel is done running?
If the first or third options occur, then how can I execute the same kernel with different constant memory data simultaneously? Do I need to create multiple kernel/constant buffer objects and work with that? Note I can't precalculate anything as kernel launches are a result of external signals that can occur at any time and rate. I could also create kernel objects on the fly, but that seems like an ugly solution.
It's a fundamental concept in OpenCL that commands enqueued into the same (in-order) command queue are executed in order. This includes WriteBuffer and similar commands. This means that if you do
EnqueueNDRangeKernel()
EnqueueWriteBuffer()
EnqueueNDRangeKernel()
then, regardless of whether they are blocking or non-blocking, the write will only affect the second set of kernels.
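In C API terms that sequence looks roughly like this (assumed: an in-order queue q, a kernel k whose constant buffer cbuf is already set as an argument, host data hdata of nbytes, and a 1-D global size gsz):

```c
#include <CL/cl.h>

void run_update_run(cl_command_queue q, cl_kernel k, cl_mem cbuf,
                    const void *hdata, size_t nbytes, size_t gsz)
{
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);

    /* Even a non-blocking write is ordered after the first launch in this queue,
     * so only the second launch sees the new constant data. */
    clEnqueueWriteBuffer(q, cbuf, CL_FALSE, 0, nbytes, hdata, 0, NULL, NULL);

    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clFinish(q);
}
```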
If you're updating via a mapped pointer then it should be unmapped before any kernels run. Running kernels which access a buffer that is currently mapped is undefined (Spec 1.1 - Section 5.4.2.1).
As EnqueueMapBuffer and EnqueueUnmapMemObject are also placed on the command queue, the ordering of updates is still guaranteed as long as you unmap first.
Does that answer your question, or are you updating your buffer in another way?
how can I execute the same kernel with different constant memory data simultaneously? Do I need to create multiple kernel/constant buffer objects and work with that?
Yes, multiple buffer objects.
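A sketch of that approach (assumed: context ctx, queue q, a kernel k whose argument 0 is the __constant pointer, host arrays a and b of nbytes each, and a 1-D global size gsz):

```c
#include <CL/cl.h>

void launch_with_two_constant_sets(cl_context ctx, cl_command_queue q,
                                   cl_kernel k, const void *a, const void *b,
                                   size_t nbytes, size_t gsz)
{
    cl_mem ca = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               nbytes, (void *)a, NULL);
    cl_mem cb = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               nbytes, (void *)b, NULL);

    /* Argument values are captured at enqueue time, so each launch gets its
     * own constant data even though the kernel object is shared. */
    clSetKernelArg(k, 0, sizeof(cl_mem), &ca);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);

    clSetKernelArg(k, 0, sizeof(cl_mem), &cb);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);

    clFinish(q);
    clReleaseMemObject(ca);
    clReleaseMemObject(cb);
}
```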
First, I malloc a buffer in userspace and fill it with all 'A's.
Then, I pass the pointer to the buffer to the kernel, using a netlink socket.
Finally, I can read and write the buffer in the kernel, using the raw pointer passed directly from userspace.
Why?
Why is direct access to user-space memory from the kernel allowed?
Linux Device Drivers, Third Edition, page 415, says that "The kernel cannot directly manipulate memory that is not mapped into the kernel’s address space."
The point is that accessing user addresses directly in the kernel only sometimes works.
The access will work as long as you access the user address in the context of the same process that allocated it, the process has already faulted the page in, you are using a kernel with a 3:1 memory split (as opposed to the 4:4 split that is sometimes used), and the kernel has not swapped out the page the allocation is in.
The problem is that all these conditions are not always true, and they can change even from one run of the program to another. Therefore kernel driver writers must not count on being able to access user addresses directly.
The worst thing that can happen is for you to assume it works, have it always work in the lab, and then have it crash at a customer site every so often. This is the reason for the statement in the book.
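For contrast, the pattern a driver is expected to use instead of the raw dereference looks roughly like this (the helper name is ours; the pointer and length are assumed to arrive in the netlink message):

```c
#include <linux/uaccess.h>
#include <linux/slab.h>

static char *copy_in_user_buffer(const void __user *uptr, size_t len)
{
    char *kbuf = kmalloc(len, GFP_KERNEL);
    if (!kbuf)
        return NULL;

    /* copy_from_user handles faults and unmapped pages gracefully; a raw
     * dereference of uptr only works under the fragile conditions above. */
    if (copy_from_user(kbuf, uptr, len)) {
        kfree(kbuf);
        return NULL;
    }
    return kbuf;
}
```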
In this book, the statement 'The kernel cannot directly manipulate memory that is not mapped into the kernel’s address space' is about physical memory. In other words, the kernel has only 800-900 MB (on 32-bit x86) that can be mapped to physical memory at any one time; to access the whole of physical memory, the kernel needs to constantly remap this region.
Netlink does not deal with physical memory at all; it is designed for bidirectional communication between userspace<->userspace or userspace<->kernelspace.
We have an application server which has been observed sending headers with a TCP window size of 0 at times when the network was congested (at a client's site).
We would like to know whether it is Indy or the underlying Windows layer that is responsible for adjusting the TCP window size down from the nominal 64K in adaptation to the available throughput.
And we would like to be able to act upon it becoming 0 (nothing gets sent, users wait => no good).
So, any info, link, pointer to Indy code are welcome...
Disclaimer: I'm not a network specialist. Please keep the answer understandable for the average me ;-)
Note: it's Indy9/D2007 on Windows Server 2003 SP2.
More gory details:
The TCP zero window cases happen on the middle tier talking to the DB server.
It happens at the same moments when end users complain of slowdowns in the client application (that's what triggered the network investigation).
Two major network issues causing bottlenecks have been identified.
The TCP zero window happened when there was network congestion, but may or may not be caused by it.
We want to know when that happens and have a way to do something about it (logging, at least) in our code.
So the core question is: who sets the window size to 0, and where?
Where to hook (in Indy?) to know when that condition occurs?
The window size in the TCP header is normally set by the TCP stack software to reflect the amount of buffer space available. If your server is sending packets with the window set to zero, it is probably because the client is sending data faster than the application running on the server is reading it, and the buffers associated with the TCP connection are now full.
This is perfectly normal operation for the TCP protocol if the client sends data faster than the server can read it. The client should refrain from sending more data until the server advertises a non-zero window size (there's no point, as it would be discarded anyway).
This may or may not reflect a serious problem between client and server, but if the condition persists it probably means the application running on the server has stopped reading the received data (once it starts reading, this frees up buffer space for TCP, and the TCP stack will send a new non-zero window size).
A TCP header with a window size of zero indicates that the receiver's buffers are full. This is a normal condition when the writer is faster than the reader.
In reading your description, it's not clear if this is unexpected. What caused you to open a protocol analyzer?
Since you might be interested in a solution to your problem, too:
If you have some control on what's running on the server side (the one that sends the 0 window size messages):
Did you consider using setsockopt() with SO_RCVBUF to significantly increase the size of the receive buffer of your socket?
In Indy, setsockopt() is a method of the TIdSocketHandle.
You should apply it to all the TIdSocketHandle objects associated with your socket.
And in Indy 9, those are located through the Bindings property of your TIdTCPServer.
I suggest first using getsockopt() with SO_RCVBUF to see what the OS gives you as a default buffer size. Then increase it significantly, maybe by successive trials, doubling the size each time.
You might also want to re-run getsockopt() after your setsockopt() to ensure that your setsockopt() was actually honored: there is usually an upper limit that the socket implementation sets on buffer sizes. In that case, there is usually an OS-dependent way to raise that ceiling, but those are rather extreme cases and you are not too likely to need it.
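In plain Winsock terms (which is what a TIdSocketHandle wraps underneath), the sequence looks roughly like this; the 256 KB target is an arbitrary example:

```c
#include <stdio.h>
#include <winsock2.h>

static void grow_recv_buffer(SOCKET sock)
{
    int size = 0;
    int len = sizeof(size);

    /* 1. See what the OS gives you by default. */
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *)&size, &len);
    printf("default SO_RCVBUF: %d\n", size);

    /* 2. Ask for something significantly larger. */
    size = 256 * 1024;
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, (const char *)&size, sizeof(size));

    /* 3. Re-read to see what the stack actually granted. */
    len = sizeof(size);
    getsockopt(sock, SOL_SOCKET, SO_RCVBUF, (char *)&size, &len);
    printf("effective SO_RCVBUF: %d\n", size);
}
```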
If you don't have control over the source code on the side that gets overflowed, just check whether the software running there exposes some parameter to change that buffer size.
Good luck!