We wrote the simplest possible TCP server (with minor logging) to examine its memory footprint (see server.go below). The server simply accepts connections and does nothing else. It is run on an Ubuntu 12.04.4 LTS server (kernel 3.2.0-61-generic) with Go version go1.3 linux/amd64.
The attached benchmarking program (pulse.go) creates, in this example, 10k connections, disconnects them after 30 seconds, repeats this cycle three times, and then continuously repeats small pulses of 1k connections/disconnections. The command used to test was ./pulse -big=10000 -bs=30.
The first attached graph is obtained by recording runtime.ReadMemStats when the number of clients has changed by a multiple of 500, and the second graph is the RES memory size seen by “top” for the server process.
The server starts with a negligible 1.6KB of memory. Then the memory is set by the "big" pulses of 10k connections at ~60MB (as seen by top), or at about 16MB of system memory as seen by ReadMemStats. As expected, when the 10k pulses end, the in-use memory drops, and eventually the program starts releasing memory back to the OS, as evidenced by the grey "Released Memory" line.
The problem is that the System Memory (and correspondingly, the RES memory seen by “top”) never drops significantly (although it drops a little as seen in the second graph).
We would expect that after the 10k pulses end, memory would continue to be released until the RES size is the minimum needed to handle each 1k pulse (8MB RES as seen by top, and 2MB in-use reported by runtime.ReadMemStats). Instead, the RES stays at about 56MB and in-use never drops at all from its high of 60MB.
We want to ensure scalability for irregular traffic with occasional spikes as well as be able to run multiple servers on the same box that have spikes at different times. Is there a way to effectively ensure that as much memory is released back to the system as possible in a reasonable time frame?
Code (https://gist.github.com/eugene-bulkin/e8d690b4db144f468bc5):
server.go:
package main

import (
	"log"
	"net"
	"runtime"
	"sync"
)

var m sync.Mutex
var num_clients = 0
var cycle = 0

func printMem() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	log.Printf("Cycle #%3d: %5d clients | System: %8d Inuse: %8d Released: %8d Objects: %6d\n", cycle, num_clients, ms.HeapSys, ms.HeapInuse, ms.HeapReleased, ms.HeapObjects)
}

func handleConnection(conn net.Conn) {
	//log.Println("Accepted connection:", conn.RemoteAddr())
	m.Lock()
	num_clients++
	if num_clients%500 == 0 {
		printMem()
	}
	m.Unlock()

	buffer := make([]byte, 256)
	for {
		_, err := conn.Read(buffer)
		if err != nil {
			//log.Println("Lost connection:", conn.RemoteAddr())
			err := conn.Close()
			if err != nil {
				log.Println("Connection close error:", err)
			}
			m.Lock()
			num_clients--
			if num_clients%500 == 0 {
				printMem()
			}
			if num_clients == 0 {
				cycle++
			}
			m.Unlock()
			break
		}
	}
}

func main() {
	printMem()
	cycle++
	listener, err := net.Listen("tcp", ":3033")
	if err != nil {
		log.Fatal("Could not listen.")
	}
	for {
		conn, err := listener.Accept()
		if err != nil {
			log.Println("Could not listen to client:", err)
			continue
		}
		go handleConnection(conn)
	}
}
pulse.go:
package main

import (
	"flag"
	"log"
	"net"
	"sync"
	"time"
)

var (
	numBig   = flag.Int("big", 4000, "Number of connections in big pulse")
	bigIters = flag.Int("i", 3, "Number of iterations of big pulse")
	bigSep   = flag.Int("bs", 5, "Number of seconds between big pulses")
	numSmall = flag.Int("small", 1000, "Number of connections in small pulse")
	smallSep = flag.Int("ss", 20, "Number of seconds between small pulses")
	linger   = flag.Int("l", 4, "How long connections should linger before being disconnected")
)

var m sync.Mutex
var active_conns = 0
var connections = make(map[net.Conn]bool)

func pulse(n int, linger int) {
	var wg sync.WaitGroup

	log.Printf("Connecting %d client(s)...\n", n)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			m.Lock()
			defer m.Unlock()
			defer wg.Done()

			active_conns++
			conn, err := net.Dial("tcp", ":3033")
			if err != nil {
				log.Panicln("Unable to connect: ", err)
				return
			}
			connections[conn] = true
		}()
	}
	wg.Wait()

	if len(connections) != n {
		log.Fatalf("Unable to connect all %d client(s).\n", n)
	}
	log.Printf("Connected %d client(s).\n", n)

	time.Sleep(time.Duration(linger) * time.Second)

	for conn := range connections {
		active_conns--
		err := conn.Close()
		if err != nil {
			log.Panicln("Unable to close connection:", err)
			continue
		}
		delete(connections, conn)
	}

	if len(connections) > 0 {
		log.Fatalf("Unable to disconnect all %d client(s) [%d remain].\n", n, len(connections))
	}
	log.Printf("Disconnected %d client(s).\n", n)
}

func main() {
	flag.Parse()

	for i := 0; i < *bigIters; i++ {
		pulse(*numBig, *linger)
		time.Sleep(time.Duration(*bigSep) * time.Second)
	}
	for {
		pulse(*numSmall, *linger)
		time.Sleep(time.Duration(*smallSep) * time.Second)
	}
}
First, note that Go itself doesn't always shrink its own memory space:
https://groups.google.com/forum/#!topic/Golang-Nuts/vfmd6zaRQVs
The heap is freed, you can check this using runtime.ReadMemStats(),
but the process's virtual address space does not shrink -- ie, your
program will not return memory to the operating system. On Unix based
platforms we use a system call to tell the operating system that it
can reclaim unused parts of the heap, this facility is not available
on Windows platforms.
But you're not on Windows, right?
Well, this thread is less definitive, but it says:
https://groups.google.com/forum/#!topic/golang-nuts/MC2hWpuT7Xc
As I understand, memory is returned to the OS about 5 minutes after it has been marked
as free by the GC. And the GC runs every two minutes at most, if not
triggered by an increase in memory use. So the worst case would be 7
minutes for it to be freed.
In this case, I think that the slice is not marked as freed, but in
use, so it would never be returned to the OS.
It's possible you weren't waiting long enough for the GC sweep followed by the OS return sweep, which could be up to 7 minutes after the final "big" pulse. You can force the release explicitly with debug.FreeOSMemory (in package runtime/debug), which runs a garbage collection itself and then asks the OS to reclaim whatever it can.
(Edit: Note that you can also force a garbage collection with runtime.GC(), though obviously you need to be careful how often you use it; you may be able to sync it with sudden downward spikes in connections.)
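For illustration, here is a minimal sketch of syncing the release with those downward spikes. It is my own construction, not code from the question, and onClientCountChange is a hypothetical hook you would call wherever the server updates its client count:

package main

import (
	"log"
	"runtime/debug"
)

var lastClients int

// onClientCountChange is a hypothetical hook: call it under the same
// mutex the server already uses for num_clients. When the count falls
// back to zero after a spike, force a release. debug.FreeOSMemory runs
// a garbage collection itself before returning memory, so no separate
// runtime.GC() call is needed.
func onClientCountChange(numClients int) {
	if numClients == 0 && lastClients > 0 {
		log.Println("all clients gone; returning memory to the OS")
		debug.FreeOSMemory()
	}
	lastClients = numClients
}

func main() {
	onClientCountChange(10000) // spike
	onClientCountChange(0)     // drop back to zero: triggers the release
}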
As a slight aside, I can't find an explicit source for this (other than the second thread I posted, where someone mentions the same thing), but I recall it being mentioned several times that not all of the memory Go uses is "real" memory: if address space is allocated by the runtime but not actually touched by the program, the OS can still use the underlying physical pages regardless of what top or MemStats says, so the amount of memory the program is "really" using is often overreported.
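You can see that distinction in MemStats itself: HeapSys counts address space obtained from the OS, while HeapReleased counts the part already handed back. A rough sketch, under the assumption that released-but-unreclaimed pages don't count as "real" use:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// HeapSys - HeapReleased approximates the heap's physical footprint;
	// top's RES additionally includes stacks, runtime structures, etc.
	fmt.Printf("HeapSys: %d KB, HeapReleased: %d KB, approx resident heap: %d KB\n",
		ms.HeapSys/1024, ms.HeapReleased/1024, (ms.HeapSys-ms.HeapReleased)/1024)
}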
Edit: As kostix notes in the comments, and as JimB's answer confirms, this question was crossposted on golang-nuts and we got a rather definitive answer from Dmitry Vyukov:
https://groups.google.com/forum/#!topic/golang-nuts/0WSOKnHGBZE/discussion
I don't think there is a solution today.
Most of the memory seems to be occupied by goroutine stacks, and we don't release that memory to the OS.
It will be somewhat better in the next release.
So what I outlined above only applies to heap variables; memory on a goroutine stack will never be released. How exactly this interacts with my last point, that not all shown allocated system memory is "real" memory, remains to be seen.
As LinearZoetrope said, you should wait at least 7 minutes to check how much memory is freed. Sometimes it needs two GC passes, so it would be 9 minutes.
If that is not working, or it is too much time, you can add a periodic call to FreeOSMemory (there is no need to call runtime.GC() first, as debug.FreeOSMemory() does that itself).
Something like this: http://play.golang.org/p/mP7_sMpX4F
package main

import (
	"runtime/debug"
	"time"
)

func main() {
	go periodicFree(1 * time.Minute)

	// Your program goes here
}

func periodicFree(d time.Duration) {
	tick := time.Tick(d)
	for _ = range tick {
		debug.FreeOSMemory()
	}
}
Take into account that every call to FreeOSMemory will take some time (not much), and since Go 1.3 it can partly run in parallel if GOMAXPROCS > 1.
The answer is unfortunately pretty simple: goroutine stacks can't currently be released.
Since you're connecting 10k clients at once, you need 10k goroutines to handle them. Each goroutine has an 8k stack, and even if only the first 4k page of each is faulted in, that still comes to at least 40MB of permanent memory to handle your peak connection count.
There are some pending changes that may help in go1.4 (like 4k stacks), but it's a fact we have to live with for now.
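That arithmetic is easy to check empirically. A hedged sketch of mine (not from the answer) that parks N goroutines the way a goroutine-per-connection server does, then reads MemStats.StackSys; on go1.3 it should report on the order of 8KB per goroutine:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	const n = 10000
	block := make(chan struct{})
	for i := 0; i < n; i++ {
		go func() { <-block }() // parked, like an idle connection handler
	}

	runtime.ReadMemStats(&after)
	grown := after.StackSys - before.StackSys
	fmt.Printf("StackSys grew by %d KB for %d goroutines (~%d bytes each)\n",
		grown/1024, n, grown/n)
	close(block)
}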
Related
I have a weird drop problem, and the best way to understand my question is to have a look at this simple snippet:
while (1)
{
    if (config->running == false) {
        break;
    }

    num_of_pkt = rte_eth_rx_burst(config->port_id,
                                  config->queue_idx,
                                  buffers,
                                  MAX_BURST_DEQ_SIZE);

    if (unlikely(num_of_pkt == MAX_BURST_DEQ_SIZE)) {
        rx_ring_full = true; // probably not the best name
    }

    if (likely(num_of_pkt > 0))
    {
        pk_captured += num_of_pkt;

        num_of_enq_pkt = rte_ring_sp_enqueue_bulk(config->incoming_pkts_ring,
                                                  (void*)buffers,
                                                  num_of_pkt,
                                                  &rx_ring_free_space);
        // if num_of_enq_pkt == 0, free the mbufs..
    }
}
This loop is retrieving packets from the device and pushing them into a queue for further processing by another lcore.
When I run a test with a Mellanox card sending 20M (20878300) packets at 2.5M packets/s, the loop seems to miss some packets, and pk_captured always ends up around 19M or so.
rx_ring_full is never true, which means that num_of_pkt is always < MAX_BURST_DEQ_SIZE, so according to the documentation I should not have drops at the HW level. Also, num_of_enq_pkt is never 0, which means that all the packets are enqueued.
Now, if I remove the rte_ring_sp_enqueue_bulk call from that snippet (and make sure to release all the mbufs), then pk_captured is always exactly equal to the number of packets I've sent to the NIC.
So it seems (though I can't quite accept the idea) that rte_ring_sp_enqueue_bulk is somehow too slow, and that between one call to rte_eth_rx_burst and the next some packets are dropped due to a full ring on the NIC. But then why is num_of_pkt (from rte_eth_rx_burst) always smaller than MAX_BURST_DEQ_SIZE (much smaller), as if there were always sufficient room for the packets?
Note, MAX_BURST_DEQ_SIZE is 512.
edit 1:
Perhaps this information might help: the drops also seem to be visible via rte_eth_stats_get, or more correctly, no drops are reported (imissed and ierrors are 0), but the value of ipackets equals my counter pk_captured (have the missing packets simply disappeared?).
edit 2:
According to ethtool, rx_crc_errors_phy is zero and all the packets are received at the PHY level (rx_packets_phy is updated with the correct number of transferred packets).
The value of rx_nombuf from rte_eth_stats seems to contain trash (this is a print from our test application):
OUT(4): Port 1 stats: ipkt:19439285,opkt:0,ierr:0,oerr:0,imiss:0, rxnobuf:2061021195718
For a transfer of 20M packets, as you can see, rxnobuf is garbage, OR it has a meaning which I do not understand. The log is generated by:
log("Port %"PRIu8" stats: ipkt:%"PRIu64",opkt:%"PRIu64",ierr:%"PRIu64",oerr:%"PRIu64",imiss:%"PRIu64", rxnobuf:%"PRIu64,
config->port_id,
stats.ipackets, stats.opackets,
stats.ierrors, stats.oerrors,
stats.imissed, stats.rx_nombuf);
where stats came from rte_eth_stats_get.
The packets are not generated on the fly but replayed from an existing PCAP.
edit 3:
After the answer from Adriy (thanks!) I've included the xstats output for the Mellanox card. While reproducing the same problem with a smaller set of packets, I can see that rx_mbuf_allocation_errors gets updated, but it seems to contain trash:
OUT(4): rx_good_packets = 8094164
OUT(4): tx_good_packets = 0
OUT(4): rx_good_bytes = 4211543077
OUT(4): tx_good_bytes = 0
OUT(4): rx_missed_errors = 0
OUT(4): rx_errors = 0
OUT(4): tx_errors = 0
OUT(4): rx_mbuf_allocation_errors = 146536495542
Also, those counters seem interesting:
OUT(4): tx_errors_phy = 0
OUT(4): rx_out_of_buffer = 257156
OUT(4): tx_packets_phy = 9373
OUT(4): rx_packets_phy = 8351320
Where rx_packets_phy is the exact number of packets I've been sending, and summing rx_out_of_buffer with rx_good_packets gives exactly that amount. So it seems that the mbufs get depleted and some packets are dropped.
I made a tweak in the original code: now I'm making a copy of the mbuf from the RX ring and immediately releasing the original, with further processing done on the copy by another lcore. Sadly this does not fix the problem; it turns out that to solve it I have to disable the packet processing and also release the packet copy (on the other lcore), which makes no sense.
Well, I will do a bit more investigation, but at least rx_mbuf_allocation_errors seems to need a fix here.
I guess debugging the rx_nombuf counter is the way to go. It might look like garbage, but in fact this counter does not reflect the number of dropped packets (like ierrors or imissed do), but rather the number of failed RX attempts.
Here is a snippet from MLX5 PMD:
uint16_t
mlx5_rx_burst(void *dpdk_rxq, struct rte_mbuf **pkts, uint16_t pkts_n)
{
    [...]
    while (pkts_n) {
        [...]
        rep = rte_mbuf_raw_alloc(rxq->mp);
        if (unlikely(rep == NULL)) {
            ++rxq->stats.rx_nombuf;
            if (!pkt) {
                /*
                 * no buffers before we even started,
                 * bail out silently.
                 */
                break;
So the plausible scenario for the issue is as follows:
1. There is a packet in the RX queue.
2. There are no buffers in the corresponding mempool.
3. The application polls for new packets, i.e. calls in a loop: num_of_pkt = rte_eth_rx_burst(...)
4. Each time rte_eth_rx_burst() is called, the rx_nombuf counter gets increased.
Please also have a look at rte_eth_xstats_get(). For MLX5 PMD there is a hardware rx_out_of_buffer counter, which might confirm this theory.
The solution to the missing packets is to change the ring API from bulk to burst. In DPDK there are two ring operation modes, bulk and burst. In bulk dequeue mode, if the requested number of elements is 32 and only 31 are available, the API returns 0.
I have faced a similar issue too.
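Since DPDK itself is C and the ring API is documented there, what follows is only a hedged, language-neutral model of the bulk-versus-burst difference, written in Go with a buffered channel standing in for the ring (bulkEnqueue and burstEnqueue are illustrative names, not DPDK functions):

package main

import "fmt"

// bulkEnqueue models rte_ring_sp_enqueue_bulk's all-or-nothing
// semantics: if there is no room for every element, nothing is
// enqueued and 0 is returned.
func bulkEnqueue(ring chan int, items []int) int {
	if cap(ring)-len(ring) < len(items) {
		return 0 // room for 31, asked for 32: nothing happens
	}
	for _, v := range items {
		ring <- v
	}
	return len(items)
}

// burstEnqueue models rte_ring_sp_enqueue_burst's best-effort
// semantics: enqueue as many elements as fit and report how many,
// so the caller can free only the mbufs left behind.
func burstEnqueue(ring chan int, items []int) int {
	n := 0
	for _, v := range items {
		select {
		case ring <- v:
			n++
		default:
			return n
		}
	}
	return n
}

func main() {
	ring := make(chan int, 4)
	batch := []int{1, 2, 3, 4, 5}
	fmt.Println(bulkEnqueue(ring, batch))  // 0: the whole batch doesn't fit
	fmt.Println(burstEnqueue(ring, batch)) // 4: partial enqueue succeeds
}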
I'm having trouble here. I launch two kernels, check if some value is the one expected (by memcpy to the host); if it is, I stop, and if it isn't, I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
    int obj = threadIdx.x;
    int ant = blockIdx.x;
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    *(data->added) = 1;

    while (*(data->added) == 1)
    {
        *(data->added) = 0;

        // check if obj fits
        int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
        fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));

        if (obj == 0)
            printf("ant %d going..\n", ant);
        __syncthreads();
        ...
The code goes on after this, but that printf never gets printed; the __syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, I just removed it for now. This "added" variable isn't the smartest thing to do, but it's faster than the alternative, which is checking whether any variable within an array has some value on the host and deciding whether to keep iterating or not.
getElement simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and it doesn't get printed either. After the printf, that kernel sets a variable in device memory, and this memory is copied to the CPU after the kernel finishes; it isn't the right value when I print it in the CPU code. So I think the first kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances: when I launch 8 blocks and 512 threads, it runs OK; 32 blocks and 512 threads, OK. But with 8 blocks and 1024 threads, this happens: the kernel doesn't work, and the same goes for 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: I tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work; nothing gets printed, even if the printf is right after the three initial lines, and the next kernel doesn't print anything either.
edit: another thing: I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with a maximum of 512, or check how high I can set this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by checking the return codes of all CUDA function calls for errors.
In my kernel, if a condition is met, I update an item of the output buffer:
if (condition(input[i])) //?
    output[i] = 1;
otherwise the output may stay the same, having a value of 0.
The density of updates is quite unpredictable, depending on the input. Furthermore, which output location will be updated is also not known. (I may force them, though, in some cases.)
My question is, is it better to write all items, to achieve coalescing, or do a selective write?
output[i] = condition(input[i]); //?
Would you mind discussing your statements?
Coalescing is achieved even if some threads in the warp do not participate in the load or store, as long as all participating threads satisfy the requirements of coalescing. So conditional writes should have no effect on memory throughput.
However, doing a conditional write may involve additional instructions due to involving a branch (this would probably explain, for example, the difference in performance measured by Eugene in his answer).
On my setup, the kernel that does the conditional set (option 1) runs for 1.727 us and option 2 for 1.399 us. This is my code (setConditional is the faster one):
__global__ void conditionalSet(unsigned int* array) {
    if ((threadIdx.x & 3) == 0) {
        array[threadIdx.x] = 1;
    }
}

__global__ void setConditional(unsigned int* array) {
    array[threadIdx.x] = (threadIdx.x & 3) == 0 ? 1 : 0;
}
I'm getting ready to release a tool that is only effective with regular hard drives, not SSDs (solid-state drives). In fact, it shouldn't be used with SSDs, because it will result in a lot of reads/writes with no real benefit.
Does anyone know of a way to detect whether a given drive is solid-state?
Finally a reliable solution! Two of them, actually!
Check /sys/block/sdX/queue/rotational, where sdX is the drive name. If it's 0, you're dealing with an SSD; 1 means a plain old HDD.
I can't put my finger on the Linux version where it was introduced, but it's present in Ubuntu's Linux 3.2 and in vanilla Linux 3.6 and not present in vanilla 2.6.38. Oracle also backported it to their Unbreakable Enterprise kernel 5.5, which is based on 2.6.32.
There's also an ioctl to check if the drive is rotational since Linux 3.3, introduced by this commit. Using sysfs is usually more convenient, though.
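As a quick illustration, a hedged sketch of the sysfs check (in Go, to match the question at the top of this page; it assumes a kernel new enough to expose the attribute):

package main

import (
	"fmt"
	"os"
	"strings"
)

// isRotational reads /sys/block/<dev>/queue/rotational for a device
// name like "sda" and reports whether the kernel flags it as rotational.
func isRotational(dev string) (bool, error) {
	data, err := os.ReadFile("/sys/block/" + dev + "/queue/rotational")
	if err != nil {
		return false, err // attribute missing on old kernels
	}
	return strings.TrimSpace(string(data)) == "1", nil
}

func main() {
	rot, err := isRotational("sda")
	if err != nil {
		fmt.Println("cannot read sysfs:", err)
		return
	}
	if rot {
		fmt.Println("sda looks like a plain old HDD")
	} else {
		fmt.Println("sda looks like an SSD")
	}
}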
You can actually fairly easily determine the rotational latency; I did this once as part of a university project. It is described in this report. You'll want to skip to page 7, where you see some nice graphs of the latency. It goes from about 9.3 ms to 1.1 ms, a drop of 8.2 ms. That corresponds directly to 60 s / 8.2 ms = 7317 RPM.
It was done with simple C code; here's the part that measures the access time between positions a and b in a scratch file. We did this with larger and larger b values until we had wandered all the way around a cylinder:
/* Measure the difference in access time between a and b. The result
 * is measured in nanoseconds. */
int measure_latency(off_t a, off_t b) {
    cycles_t ta, tb;

    overflow_disk_buffer();

    lseek(work_file, a, SEEK_SET);
    read(work_file, buf, KiB/2);

    ta = get_cycles();
    lseek(work_file, b, SEEK_SET);
    read(work_file, buf, KiB/2);
    tb = get_cycles();

    int diff = (tb - ta) / cycles_per_ns;
    fprintf(stderr, "%i KiB to %i KiB: %i nsec\n",
            (int)(a / KiB), (int)(b / KiB), diff);  /* cast off_t values for %i */
    return diff;
}
The command lsblk -d -o name,rota lists your drives and shows a 1 under ROTA if it's a rotational disk and a 0 if it's an SSD.
Example output:
NAME ROTA
sda 1
sdb 0
Detecting SSDs is not as impossible as dseifert makes out. There is already some progress in Linux's libata (http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg03625.html), though it doesn't seem user-ready yet.
And I definitely understand why this needs to be done. It's basically the difference between a linked list and an array: defragmentation and such is usually counter-productive on an SSD.
You could get lucky by running
smartctl -i /dev/sda
from smartmontools. Almost all SSDs have "SSD" in the model field. No guarantee, though.
My two cents on answering this old but very important question... If a disk is accessed via SCSI, then you can (potentially) use the SCSI INQUIRY command to request its rotation rate. The VPD (Vital Product Data) page for that is called Block Device Characteristics and has number 0xB1. Bytes 4 and 5 of this page contain a number with the following meaning:
0000h "Medium rotation rate is not reported"
0001h "Non-rotating medium (e.g., solid state)"
0002h - 0400h "Reserved"
0401h - FFFEh "Nominal medium rotation rate in rotations per minute (i.e.,
rpm) (e.g., 7 200 rpm = 1C20h, 10 000 rpm = 2710h, and 15 000 rpm = 3A98h)"
FFFFh "Reserved"
So an SSD must have 0001h in this field. The T10.org document about this page can be found here.
However, the implementation status of this standard is not clear to me.
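Assuming you have already fetched the raw 0xB1 page by some means (an SG_IO ioctl, or a tool such as sg_vpd from sg3_utils), decoding bytes 4 and 5 is simple. A hedged Go sketch with a made-up example page:

package main

import "fmt"

// rotationRate decodes bytes 4-5 of a Block Device Characteristics
// VPD page (0xB1) per the table above. Fetching the page itself is
// outside this sketch.
func rotationRate(page []byte) string {
	if len(page) < 6 {
		return "page too short"
	}
	rate := uint16(page[4])<<8 | uint16(page[5])
	switch {
	case rate == 0x0000:
		return "medium rotation rate is not reported"
	case rate == 0x0001:
		return "non-rotating medium (e.g., solid state)"
	case rate >= 0x0401 && rate <= 0xFFFE:
		return fmt.Sprintf("%d rpm", rate)
	default:
		return "reserved"
	}
}

func main() {
	// Hypothetical page whose bytes 4-5 are 1C20h = 7200 rpm.
	page := []byte{0x00, 0xB1, 0x00, 0x3C, 0x1C, 0x20}
	fmt.Println(rotationRate(page)) // prints: 7200 rpm
}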
I wrote the following JavaScript (WSH) code. I needed to determine whether the machine was using an SSD drive and whether it was the boot drive. The solution uses the MSFT_PhysicalDisk WMI interface.
function main()
{
    var retval = false;

    // MediaType - 0 Unknown, 3 HDD, 4 SSD
    // SpindleSpeed - -1 has rotational speed, 0 has no rotational speed (SSD)
    // DeviceID - 0 boot device
    var objWMIService = GetObject("winmgmts:\\\\.\\root\\Microsoft\\Windows\\Storage");
    var colItems = objWMIService.ExecQuery("select * from MSFT_PhysicalDisk");
    var enumItems = new Enumerator(colItems);

    for (; !enumItems.atEnd(); enumItems.moveNext())
    {
        var objItem = enumItems.item();
        if (objItem.MediaType == 4 && objItem.SpindleSpeed == 0)
        {
            if (objItem.DeviceID == 0)
            {
                retval = true;
            }
        }
    }

    if (retval)
    {
        WScript.Echo("You have SSD Drive and it is your boot drive.");
    }
    else
    {
        WScript.Echo("You do not have SSD Drive");
    }

    return retval;
}

main();
SSD devices emulate a hard disk device interface, so they can just be used like hard disks. This also means that there is no general way to detect what they are.
You probably could use some characteristics of the drive (latency, speed, size), though this won't be accurate for all drives. Another possibility may be to look at the S.M.A.R.T. data and see whether you can determine the type of disk through this (by model name, certain values), however unless you keep a database of all drives out there, this is not gonna be 100% accurate either.
Write a text file, read the text file, and repeat 10,000 times; then compute 10000/elapsed. For an SSD the resulting I/O rate will be much higher. Python 3:
import time

def ssd_test():
    doc = 'ssd_test.txt'
    start = time.time()
    for i in range(10000):
        with open(doc, 'w+') as f:   # 'with' closes the file; no f.close() needed
            f.write('ssd test')
        with open(doc, 'r') as f:
            ret = f.read()
    stop = time.time()
    elapsed = stop - start
    ios = int(10000 / elapsed)
    hd = 'HDD'
    if ios > 6000:  # ssd > 8000; hdd < 4000
        hd = 'SSD'
    print('detecting hard drive type by read/write speed')
    print('ios', ios, 'hard drive type', hd)
    return hd
I've written a complicated Lua script which uses the Lua sockets library. It reads a list of files from disk, sorts them by date and sends them to an HTTP process. The number of files on disk is around 65K. The memory usage in Task Manager doesn't exceed 200MB.
After quite a while the script returns:
lua: not enough memory
I print out the current GC count at various points, and it never goes above 110MB:
local freeMem = collectgarbage('count');
print("GC Count : " .. freeMem/1024 .. " MB");
This is on a 32 bit windows machine.
What's the best way to diagnose this?
All memory goes through the single lua_Alloc function. This takes the form of:
typedef void* (*lua_Alloc) (void* ud, void* ptr, size_t osize, size_t nsize);
All allocations, reallocations and frees go through this. The documentation for it can be found in the Lua reference manual. You can easily write your own to track all memory operations. For example:
void* MyAlloc(void* ud, void* ptr, size_t osize, size_t nsize)
{
    (void)ud;  /* not used; osize is used by the trackers below */

    if (nsize == 0)
    {
        free(ptr);
        TrackSubtract(osize);
        return NULL;
    }
    else
    {
        void* p = realloc(ptr, nsize);
        TrackSubtract(osize);
        if (p) TrackAdd(nsize);
        return p;
    }
}
You can write the TrackAdd() and TrackSubtract() functions to do whatever you want: output to a log, adjust a counter, and so on.
To use your new function you pass a pointer to it when you create the Lua state:
lua_State* L = lua_newstate(&MyAlloc,0);
The documentation for lua_newstate is in the same manual.
Good luck.
Use perfmon to monitor your process and add counters for private bytes and virtual bytes.
When your script ends with 'not enough memory', check the value of each counter. If you see sudden peaks in your memory usage, try to add more points at which you print the memory usage.