I have a program which calculates the 'Printer Queues Total' value using '/usr/bin/lpstat' through a popen() call.
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
    int n = 0;
    FILE *fp = NULL;

    printf("Before popen()");
    fp = popen("/usr/bin/lpstat -o | grep '^[^ ]*-[0-9]*[ \t]' | wc -l", "r");
    printf("After popen()");
    if (fp == NULL)
    {
        printf("Failed to start lpstat - %s", strerror(errno));
        return -1;
    }
    printf("Before fscanf");
    fscanf(fp, "%d", &n);
    printf("After fscanf");
    printf("Before pclose()");
    pclose(fp);
    printf("After pclose()");
    printf("Value=%d", n);
    printf("=== END ===");
    return 0;
}
Note: From the command line, the '/usr/bin/lpstat' command hangs for some time because there are many printers available on the network.
The problem is that execution hangs at the popen() call, whereas I would expect it to hang at fscanf(), which reads the output from the file stream fp.
If anybody can tell me the reason for the hang at the popen() call, it will help me modify the program to meet my requirement.
Thanks for taking the time to read this post, and for your efforts.
What people expect does not always have a basis in reality :-)
The command you're running doesn't actually generate any output until it's finished. That would be why it would seem to be hung in the popen rather than the fscanf.
There are two possible reasons for that which spring to mind immediately.
The first is that it's implemented this way, with popen capturing the output in full before delivering the first line. Based on my knowledge of UNIX, this seems unlikely but I can't be sure.
Far more likely is the impact of the pipe. One thing I've noticed is that some filters (like grep) batch up their lines for efficiency. So, while lpstat itself may be spewing forth its lines immediately (well, until it gets to the delay bit anyway), the fact that grep is holding on to the lines until it gets a big enough block may be causing the delay.
In fact, it's almost certainly the pipe-through-wc, which cannot generate any output until all lines are received from lpstat (you cannot figure out how many lines there are until all the lines have been received). So, even if popen just waited for the first character to be available, that would seem to be where the hang was.
It would be a simple matter to test this by removing the pipe-through-grep-and-wc bit and seeing what happens.
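If you want to see for yourself, here is a minimal test sketch (my own code, not the questioner's program, and assuming lpstat -o is available): it drops the grep/wc stages and timestamps each line as it arrives. If the lines trickle in one at a time, the delay is lpstat plus the counting pipeline, not popen() itself.

/* Rough test harness: read the lpstat output line by line, without the
 * counting pipeline, and timestamp each line as it arrives. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    char line[1024];
    int count = 0;
    FILE *fp = popen("/usr/bin/lpstat -o", "r");

    if (fp == NULL)
    {
        perror("popen");
        return 1;
    }
    while (fgets(line, sizeof line, fp) != NULL)
    {
        count++;
        printf("[%ld] line %d: %s", (long)time(NULL), count, line);
        fflush(stdout);
    }
    pclose(fp);
    printf("Total lines: %d\n", count);
    return 0;
}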
Just one other point I'd like to raise. Your printf statements do not have newlines following and, even if they did, there are circumstances where the output may still be fully buffered (so that you probably wouldn't see anything until that program exited, or the buffer filled up).
I would start by changing them to the form:
printf ("message here\n"); fflush (stdout); fsync (fileno (stdout));
to ensure they're flushed fully before continuing. I'd hate this to be a simple misunderstanding of a buffering issue :-)
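To see the buffering effect in isolation, here is a tiny standalone sketch (mine, unrelated to lpstat): run it with output going to a terminal and then redirected to a file, and note when "Before sleep" actually becomes visible.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    printf("Before sleep");   /* no newline: may sit in the stdio buffer */
    sleep(5);
    printf("After sleep\n");  /* on a terminal, the newline flushes the line buffer */
    fflush(stdout);           /* explicit flush, as suggested above */
    return 0;
}

With output redirected to a file, stdout is fully buffered, so nothing appears until the explicit flush (or program exit).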
It sounds as if popen may be hanging whilst lpstat attempts to retrieve information from remote printers. There is a fair amount of discussion on this particular problem. Have a look at that thread, and especially the ones that are linked from that.
I'm new to Lua and I have a question regarding memory management in Lua.
Question 1) When calling a function using io.popen(), I've seen that many Lua programmers write a close statement after using the popen() function. I wonder what the reason for that is? For example, look at this code:
handle = io.popen("ls -a")
output = handle:read("*all")
handle:close()
print(output)
handle = io.popen("date")
output = handle:read("*all")
handle:close()
print(output)
I heard Lua can manage memory itself, so do I really need to write handle:close() like above? What will happen to memory if I just omit the handle:close() statement and write it like this?
handle = io.popen("ls -a")
handle = io.popen("date")
output = handle:read("*all")
Question 2) For the code in question 1, in terms of memory usage, can we write the handle:close() statement at the end, with only one line instead of two, like this?:
handle = io.popen("ls -a")
output = handle:read("*all")
-- handle:close() -- dont close it yet do at the end
print(output)
handle = io.popen("date") -- this use the same variable `handle` previously
output = handle:read("*all")
handle:close() -- only one statement to close all above
print(output)
You can see that I didn't close the handle from the first io.popen call but only closed it at the end. Will this make the program slow because I close it with only one close statement at the end?
Lua will close the file handle automatically when the garbage collector gets around to collecting it.
Lua Manual 5.4: file:close
Closes file. Note that files are automatically closed when their handles are garbage collected, but that takes an unpredictable amount of time to happen.
BUT, it is best practice to close the handles yourself as soon as you are done with them, because it will take an unknown amount of time for the GC to do it.
This is not an issue of memory but of a much more limited resource: open file handles, something like 512 on a Windows machine, a small pool for all the applications running on it.
As for the second question, when you reassign a variable AND there are no other remaining references to the previous value, that value will eventually be collected by the GC.
Question 1
In this case the close is not for memory reasons, but to close the file. When a file handle gets collected, it will be closed automatically, but if a program doesn't generate much garbage (which some programmers specifically optimize for), the GC might not run for quite a while after the program is done with the file handling and the file would stay open.
Also, if the variable stays in scope, then the GC won't get to collect it at all until the scope ends, which might be a very long time.
Question 2
That wouldn't work. Methods get called on values, not on variables, so when you assign a new value to a variable, the old one just disappears. Calling a method on the new value won't affect any other value that used to be stored in the variable.
I'm learning about buffer overflow shellcode methods under Linux. https://seedsecuritylabs.org/Labs_16.04/Software/Buffer_Overflow/
The shellcode I've used ends with movb $0x0b, %al and then int $0x80. The shellcode executes and I get my command prompt. I've read in many places that execve and int 0x80 "do not return". Well, okay, but where ~does~ program execution flow go when the execve'd process succeeds and exits (i.e. I enter "exit" at the command-line prompt)?
I thought the calling program has its stack frame replaced with the new execve'd code's information. Does the new execve'd code preserve the return address of the overwritten process and return to that address as if it were its own? (So it does sort of return... to a borrowed address?) As far as int $0x80 goes, doesn't execution continue at the next byte after the int 0x80 instruction? If not, at which byte?
In the context of the buffer overflow problem and int 0x80, say (for example) a 517-byte hack overwrites a 24-byte buffer. Bytes will replace values at stack memory addresses beyond the buffer, including the return address, which now points to the exploit's own executable code higher up in the stack. But the injected code stomps on hundreds of other stack bytes higher in memory, destroying the stack frames of unrelated outer-scope callers. With these destroyed stack frames, what happens when...
1) the shell returns from the int 0x80 and executes more stack data that is not part of the hack? What's there now is unspecified bytes that are probably invalid CPU opcodes.
2) the context of the outer stack frames has been destroyed, so how does the system gracefully continue after I enter "exit" at my shell command prompt?
Any help appreciated!
I think you'll understand what is going on if we discuss what execve is and how it works.
I've read many places that execve and int 0x80 "do not return". Well.. okay, but where ~does~ program execution flow go when execve process succeeds and exits (aka I enter "exit" on the command line prompt)?
The following is from execve's manpage.
execve() executes the program pointed to by filename. filename must be
either a binary executable, or a script starting with a line of the
form:
#! interpreter [optional-arg]
execve is a system call which executes a specified program.
Continuing,
execve() does not return on success, and the text, data, bss, and stack
of the calling process are overwritten by that of the program loaded.
This statement deals with your question.
Every process has its own memory layout. The memory layout consists of a text segment, data segment, stack, heap, dependent libraries, etc. Refer to /proc/PID/maps for any process to get a clear picture of its memory layout.
When execve is executed and it succeeds, the complete memory layout is erased (the contents of the caller process are lost forever) and the contents of the new program are loaded into memory. New text segment, new data segment, new stack, new heap, everything new.
So, when you type exit at your command line, you'll just terminate the /bin/sh which you ran using execve. There are no seg-faults, no errors.
Does the new execve code preserve the return address of the overwritten process and return to that address as if it were its own? (So it does sort of return .. to a borrowed address?)
No. This doesn't happen. The new process launched by execve has no clue about the old process.
As far as int $0x80 goes, doesn't execution continue at the next byte after the int 0x80 instruction? If not, what next byte?
The int 0x80 instruction is there to request that the OS execute a specified system call. So whether execution continues after int 0x80 returns depends entirely on what the system call is.
Generally, read, write, open, creat, etc. all execute and return. But the exec class of functions (see man exec) is different: each of those functions never returns on success. They return only on failure.
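To make the "does not return" point concrete, here is a minimal sketch (plain C, nothing to do with the shellcode): the child replaces itself with /bin/ls via execve, and the statement after execve only ever runs if the exec itself fails.

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();

    if (pid == 0)                        /* child */
    {
        char *argv[] = { "/bin/ls", NULL };
        char *envp[] = { NULL };
        execve("/bin/ls", argv, envp);
        perror("execve");                /* reached only if execve failed */
        _exit(127);
    }
    else if (pid > 0)                    /* parent */
    {
        int status;
        waitpid(pid, &status, 0);        /* returns once /bin/ls exits */
        printf("Child finished; parent continues normally\n");
    }
    else
    {
        perror("fork");
        return 1;
    }
    return 0;
}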
As for the last part of the question: because the memory layout has been erased and new contents loaded, there is no sign of the buffer overflow here, and no memory corruption.
I hope this answers your questions.
Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with the -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_global_variable_x++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and by being nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can then enter the critical section safely. Now, a subtle problem with this approach arises if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens, both threads will end up in the C.S. at the same time. This is fairly easy to demonstrate with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: Since you have only two threads and 1-threadIdx.x works, you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions in SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general, this is a great example of why code like this is unsafe on GPUs and should not be used. I realize, though, that this is just an investigative experiment. In general, CUDA GPUs do not implement sequential consistency (as you mention, most processors do not).
Original Answer
The variables turn and flag need to be volatile, otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.x will not be re-evaluated but instead will be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.
I am running into the following issue while profiling an application under VC6. When I profile the application, the profiler is indicating that a simple getter method similar to the following is being called many hundreds of thousands of times:
int SomeClass::getId() const
{
return m_iId;
};
The problem is, this method is not called anywhere in the test app. When I change the code to the following:
int SomeClass::getId() const
{
std::cout << "Is this method REALLY being called?" << std::endl;
return m_iId;
};
The profiler never includes getId in the list of invoked functions. Comment out the cout and I'm right back to where I started, 130+ thousand calls! Just to be sure it wasn't some cached profiler data or corrupted function lookup table, I'm doing a clean and rebuild between each test. Still the same results!
Any ideas?
I'd guess that what's happening is that the compiler and/or the linker is 'coalescing' this very simple function to one or more other functions that are identical (the code generated for return m_iId is likely exactly the same as many other getters that happen to return a member that's at the same offset).
Essentially, a bunch of different functions that happen to have identical machine code implementations are all resolved to the same address, confusing the profiler.
You may be able to stop this from happening (if this is the problem) by turning off optimizations.
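As a hypothetical illustration (plain C, not the questioner's class), here are two unrelated getters whose generated machine code is likely byte-for-byte identical; if the toolchain folds identical functions, both names can end up at a single address, and the profiler then attributes calls to the "wrong" one.

#include <stdio.h>

struct Order   { int id; };
struct Invoice { int id; };

/* Both getters compile to the same instructions: load the int at
 * offset 0 and return it. */
int order_get_id(const struct Order *o)     { return o->id; }
int invoice_get_id(const struct Invoice *i) { return i->id; }

int main(void)
{
    struct Order o = { 42 };
    struct Invoice i = { 7 };

    printf("%d %d\n", order_get_id(&o), invoice_get_id(&i));
    /* On some toolchains with identical-code folding enabled, these two
     * addresses may compare equal. */
    printf("%p %p\n", (void *)order_get_id, (void *)invoice_get_id);
    return 0;
}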
I assume you are profiling because you want to find out if there are ways to make the program take less time, right? You're not just profiling because you like to see numbers.
There's a simple, old-fashioned, tried-and-true way to find performance problems. While the program is running, just hit the "pause" button and look at the call stack. Do this several times, like from 5 to 20 times. The bigger a problem is, the fewer samples you need to find it.
Some people ask if this isn't basically what profilers do, and the answer is that only very few do. Most profilers fall for one or more common myths, with the result that your speedup is limited because they don't find all the problems:
Some programs are spending unnecessary time in "hotspots". When that is the case, you will see that the code at the "end" of the stack (where the program counter is) is doing needless work.
Some programs do more I/O than necessary. If so, you will see that they are in the process of doing that I/O.
Large programs are often slow because their call trees are needlessly bushy, and need pruning. If so, you will see the unnecessary function calls mid-stack.
Any code you see on some percentage of stacks will, if removed, save that percentage of execution time (more or less). You can't go wrong. Here's an example, over several iterations, of saving over 97%.
I've been writing some scripts for a game; the scripts are written in Lua. One of the requirements the game has is that the Update method in your Lua script (which is called every frame) may take no longer than about 2-3 milliseconds to run; if it takes longer, the game just hangs.
I solved this problem with coroutines: all I have to do is call Multitasking.RunTask(SomeFunction) and the task runs as a coroutine. I then have to scatter Multitasking.Yield() throughout my code, which checks how long the task has been running and, if it's over 2 ms, pauses the task and resumes it next frame. This works, except that I have to scatter Multitasking.Yield() everywhere, and it's a real mess.
Ideally, my code would automatically yield when it has been running too long. So, is it possible to take a Lua function as an argument and then execute it line by line (maybe by interpreting Lua inside Lua, which I know is possible, but I doubt it's possible if all you have is a function pointer)? That way I could automatically check the runtime and yield if necessary between every single line.
EDIT: To be clear, I'm modding a game; that means I only have access to Lua. No C++ tricks allowed.
Check lua_sethook in the Debug Interface.
I haven't actually tried this solution myself yet, so I don't know for sure how well it will work.
debug.sethook(coroutine.yield,"",10000);
I picked the number arbitrarily; it will have to be tweaked until it's roughly the time limit you need. Keep in mind that time spent in C functions, etc. will not increase the instruction count value, so a loop will reach this limit far faster than calls to long-running C functions. It may be viable to set a far lower value and instead provide a function that checks how much os.clock() or similar has increased.