windbg crash dump analysis, high cpu usage - - memory

My application(web api) is suffering with high cpu, while analyzing dump, I see my most of the threads have this !dumpstack -:
Child-SP RetAddr Caller, Callee
00000030497bec00 00007ffbb19e1118 KERNELBASE!WaitForSingleObjectEx+0x94,
calling ntdll!NtWaitForSingleObject
00000030497beca0 00007ffba8375dda clr!CLRSemaphore::Wait+0xee, calling kernel32!WaitForSingleObjectEx
00000030497becd0 00007ffba837345d clr!GCCoop::GCCoop+0xe, calling clr!GetThread
00000030497bed60 00007ffba8375842 clr!ThreadpoolMgr::WorkerThreadStart+0x482, calling clr!CLRSemaphore::Wait
00000030497bee00 00007ffba8393e1e clr!Thread::intermediateThreadProc+0x7d
00000030497bee30 00007ffbb19e1f86 KERNELBASE!ConsoleCallServerGeneric+0xf2, calling KERNELBASE!_security_check_cookie
00000030497bee50 00007ffbb45011a5 ntdll!RtlpLowFragHeapAllocFromContext+0x355, calling ntdll!memset
00000030497beed0 00007ffbb450d0c6 ntdll!LdrpGetProcedureAddress+0x66, calling ntdll!RtlImageNtHeaderEx
00000030497bef50 00007ffbb450c6f5 ntdll!LdrpResolveNonStaticDependency+0x1cd, calling ntdll!LdrpDereferenceNode
00000030497befd0 00007ffbb4500d07 ntdll!RtlAllocateHeap+0xd7, calling ntdll!RtlpLowFragHeapAllocFromContext
00000030497bf000 00007ffbb45011a5 ntdll!RtlpLowFragHeapAllocFromContext+0x355, calling ntdll!memset
00000030497bf020 00007ffbb450f5f3 ntdll!LdrGetProcedureAddressForCaller+0x153, calling ntdll!_security_check_cookie
00000030497bf030 00007ffb93fc8c84 mfc120u!DllMain+0x210, calling mfc120u!__security_check_cookie
00000030497bf080 00007ffba8ce2cbb mscoreei!operator delete+0x34, calling kernel32!HeapFreeStub
00000030497bf0d0 00007ffbb4500d07 ntdll!RtlAllocateHeap+0xd7, calling ntdll!RtlpLowFragHeapAllocFromContext
00000030497bf140 00007ffbb4500d07 ntdll!RtlAllocateHeap+0xd7, calling ntdll!RtlpLowFragHeapAllocFromContext
00000030497bf180 00007ffbac0c51bd gzip!DllMainCRTStartup+0x139, calling gzip!DllMain
00000030497bf1e0 00007ffb979dbc9d clrcompression!calloc_impl+0x5d, calling ntdll!RtlAllocateHeap
00000030497bf210 00007ffb979d8eff clrcompression!initptd+0xb7, calling clrcompression!unlock
00000030497bf230 00007ffbb44ebf57 ntdll!RtlDeactivateActivationContextUnsafeFast+0xc7, calling ntdll!_security_check_cookie
00000030497bf240 00007ffb979d7919 clrcompression!CRT_INIT+0x135, calling kernel32!GetCurrentThreadId
00000030497bf270 00007ffb979d7a0e clrcompression!__DllMainCRTStartup+0x8a, calling clrcompression!DllMain
00000030497bf280 0000000056b32052 msvcr100!_initptd+0xaa, calling msvcr100!_unlock
00000030497bf2a0 00007ffbac051779 IitTlsCleanupHelper!UnregisterTLSCleanupCallback+0x679, calling IitTlsCleanupHelper!UnregisterTLSCleanupCallback+0xf0
00000030497bf2b0 0000000056b31308 msvcr100!__CRTDLL_INIT+0x16c, calling msvcr100!_CrtEndBoot
00000030497bf2d0 00007ffbb450bee8 ntdll!LdrpReleaseModuleEnumLock+0x1c, calling ntdll!RtlReleaseSRWLockShared
00000030497bf2e0 00007ffbb44ec0f4 ntdll!LdrpCallInitRoutine+0x4c
00000030497bf300 00007ffbb450be9b ntdll!LdrpReleaseLoaderLock+0x27, calling ntdll!LdrpReleaseModuleEnumLock
00000030497bf340 00007ffbb44ebe53 ntdll!LdrpInitializeThread+0x1f3, calling ntdll!LdrpReleaseLoaderLock
00000030497bf3b0 00007ffbb44ebd93 ntdll!LdrpInitializeThread+0x133, calling ntdll!RtlActivateActivationContextUnsafeFast
00000030497bf3b8 00007ffbb44ebdc6 ntdll!LdrpInitializeThread+0x166, calling ntdll!RtlDeactivateActivationContextUnsafeFast
00000030497bf420 00007ffbb44e8d73 ntdll!_LdrpInitialize+0x93, calling ntdll!NtTestAlert
00000030497bf490 00007ffbb44e8c98 ntdll!LdrInitializeThunk+0x18, calling ntdll!NtContinue
00000030497bf900 00007ffba8393e07 clr!Thread::intermediateThreadProc+0x66, calling clr!_chkstk
00000030497bf940 00007ffbb43613d2 kernel32!BaseThreadInitThunk+0x22
00000030497bf970 00007ffbb44e54e4 ntdll!RtlUserThreadStart+0x34
My doubt is in these 3 lines -:
ntdll!RtlAllocateHeap+0xd7, calling ntdll!RtlpLowFragHeapAllocFromContext
00000030497bf140 00007ffbb4500d07 ntdll!RtlAllocateHeap+0xd7, calling ntdll!RtlpLowFragHeapAllocFromContext
00000030497bf180 00007ffbac0c51bd gzip!DllMainCRTStartup+0x139, calling gzip!DllMain
Can this thread be cause of high cpu usage ?

The first WinDBG command you will want to run is: !runaway.This command will show you which thread was using the CPU for the longest time.After receiving input from this command we can think forward on what that is going on...

Windbg is not the right tool for this job. Dumps are only snapshots so you have no idea what happened before. Use ETW and here the CPU Sampling, which sums all calls and shows you in detail the CPU usage.
Install the Windows Performance Toolkit which is part of the Windows 10 SDK (V1607 works on Win8/8.1(Server2012/R2) and Win10 or the V1511 SDK if you use Windows 7/Server2008R2)), run WPRUi.exe and select CPU Usage
and press on Start. Capture 1 minute of the high CPU usage and next click on Save. Open the generated ETL with WPA.exe (Perf analyzer), drag and drop the CPU Usage (Sampled) graph to the analysys pane
and load the Debug Symbols. Now select your process in the graph, zoom in and expand the stack, here you see the weight of the CPU usage of all calls
In this sample most CPU usage from Internet Explorer comes from HTML stuff.
For .NET applications WPA shows you .net related groupings like GC or JIT:

Set correct symbols path after any analysis.
Set at File->Symbol File Path menu:
YOUR_SYMBOLS_PATH;OTHERS_PATH;SRVC:\symcachehttp://msdl.microsoft.com/download/symbols
Try this commands to view managed stack to:
.cordll -ve -u -l
ld*
!EEStack

As per article - http://msdn.microsoft.com/en-us/library/bb742546.aspx I should not focus on this thread.. because it is waiting and perhaps is in sleep mode -WaitForSingleObjectEx and sleeping does not cause cpu usage..
A few more resources if somebody is in same situation -:
https://channel9.msdn.com/Series/-NET-Debugging-Stater-Kit-for-the-Production-Environment
https://msdn.microsoft.com/en-IN/library/ms182372.aspx

Related

Segmentation Fault after creating PyQGIS standalone application the second time even after exitQgis() - Debian

I am trying to create several .qgs project files to be served at a later time by an instance of qgis Server.
To this end I need to start a new PyQGIS application several times upon request. The application runs smoothly the first time it is called, but if I try to run it a second time I get a Segmentation Fault error.
Here is an example code that triggers the problem:
from qgis.core import QgsApplication
import os
os.environ["QT_QPA_PLATFORM"] = "offscreen"
for i in range(2):
print(f'Iteration number {i}')
print('\tSet prefix path')
QgsApplication.setPrefixPath('/usr', False)
print('\tInstantiating application')
qgs = QgsApplication([], False)
print('\tInitializing application')
qgs.initQgis()
print('\tExiting')
qgs.exitQgis()
When executed, I get this output:
Iteration number 0
Set prefix path
Instantiating application
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-root'
Initializing application
Exiting
Iteration number 1
Set prefix path
Instantiating application
Initializing application
Segmentation fault
Something similar happens if I enclose the content of the loop inside a function and call it multiple times. In this case the segmentation fault happens upon calling qgs.exitQgis() the second time (and any vector or raster layers added before that would be invalid).
Maybe the problem is that for some reason qgs.exitQgis() is not really cleaning up before exit?
The code is running on a Python:3.9 docker container that comes with Debian Bullseye.
Qgis has been installed following the instruction from the QGIS docs:
https://qgis.org/en/site/forusers/alldownloads.html#debian-ubuntu. QGIS version is QGIS 3.22.3-Białowieża 'Białowieża'.
To prevent an import error when loading qgis.core I had to set the environment variable PYTHONPATH = /usr/lib/python3/dist-packages/.
UPDATE: Following a suggestion of one comment found on this post:
https://gis.stackexchange.com/questions/250933/using-exitqgis-in-pyqgis,
I substituted qgs.exitQgis() with qgs.exit() and now the app can be instantiated again any number of times without crashing.
It is still not clear what causes the segmentation fault, but at least I found this workaround.
It seems like the problem was fixed in QGIS ver. 3.24 Tisler. Now qgs.exitQgis() can be called in a loop without triggering a segmentation fault.

Lua script in FreeSWITCH exits bridge immediately when bypass_media = true

How do I make the script wait for the bridge to terminate before continuing when I have "bypass_media" set to true?
This snippet -
freeswitch.consoleLog("err","session before="..tostring(session:ready()).."\n")
session:execute("set","bypass_media=true")
session:execute("bridge","sofia/gateway/carrierb/12345678")
freeswitch.consoleLog("err","session after="..tostring(session:ready()).."\n")
from an audio perspective, it works perfectly with bridge_media set to either true or false, and a wireshark trace shows the audio either passing through (false) or end to end (true).
But with bypass set to true, the script continues without pausing, and the session is no longer ready (session:ready() == false).
The channel seems to go into a hibernate state, but I have housekeeping to do after the bridge is finished which I simply cannot do.
Same happens if i do the bridge in XML dialplan, immediately carries on causing my housekeeping to fire early.
FreeSWITCH (Version 1.6.20 git 43a9feb 2018-05-07 18:56:11Z 64bit)
Lua 5.2
EDIT 1 -
I can get "api_hangup_hook=lua housekeeping.lua" to work, but then I have to pass tons of variables and it fires up a new process/thread/whatever, which seems a little overkill unless that's the only way.
I have a workaround, but would still like an answer to the question if anyone has one (ie how to stop the bridge exiting immediately).
If I set this before the bridge :
session:execute("set","bypass_media=true")
session:execute("set","session_in_hangup_hook=true")
session:execute("set","api_hangup_hook=lua housekeeping.lua "..<vars>)
then "housekeeping.lua" fires when the bridge actually terminates (which is long after the script does).
In housekeeping.lua I can then do :
session:getVariable("billmsec")
and it seems to have the correct values in them, allowing me to do my housekeeping.
My problem with this is the uncertainty of playing with variables from a session that appears to have gone away from a script that fires at some point in the future.
Anyway, it'll have to do until I can find out how to keep control inside the original script.

Windows driver dev: Can ntoskrnl code get paged out?

I'm trying my driver with Driver Verifier turned on in Windows 7 x64, and get IRQL_NOT_LESS_OR_EQUAL(0A) bugcheck. From analyze -v info, it seems that the memory page of RtlAnsiCharToUnicodeChar function gets paged out, so calling that function causes bugcheck 0A . RtlAnsiCharToUnicodeChar is an ntoskrnl.exe exported function. Can it really be paged out? If so, how can I prevent it?
On spot debug info screen shot below:
yes. of course - very many ntoskrnl routines in PAGE* section.
RtlAnsiCharToUnicodeChar also paged - read in documentation:
IRQL <= APC_LEVEL
also read about DbgPrintEx routine
DbgPrint and DbgPrintEx can be called at IRQL<=DIRQL. However, Unicode
format codes (%wc and %ws) can be used only at IRQL = PASSIVE_LEVEL.
and
However, the Unicode format codes (%C, %S, %lc, %ls, %wc, %ws, and
%wZ) can only be used with IRQL = PASSIVE_LEVEL.
so if you not use Unicode format you can use DbgPrint or KdPrint(this is macro) at any IRQL but if you use Unicode format - only on PASSIVE_LEVEL or APC_LEVEL (about APC_LEVEL i say by self)
You can try to use the MmLockPagableCodeSection on that specific routine to prevent it being paged out, however it's probably not advisable (and you don't know what dependencies it has, if they're located in pagable sections as well). In any case, make sure you read the documentation thoroughly.
A better approach is to run at Passive/APC level in the first place before invoking the printing function - e.g., by scheduling work item (you can also force lowering the IRQL with KeLowerIrql function but it's not advisable by MSFT).

Are Indy 10 TCP command handlers asynchronous?

I have been having exceptions crop in my application, either stack overflow or our of memory. They show up in different places, depending on when the system has had enough. To put it another way, running the app twice won’t lead to the same exception in the same place.
I have some timers which cause database access. The AnyDac d/b component guys tell me that I can't reuse a global TADConnection but have to allocate it dynamically in each timer handler, which I have done.
I just thought that I had had a d'oh! moment when I looked at the latest stack trace.
fMainForm.TMainForm.GetToolNumberFromContext($31846FB4)
fMainForm.TMainForm.Received_HEART_BEAT($249AEFD0)
IdCommandHandlers.TIdCommand.DoCommand
IdCommandHandlers.TIdCommandHandler.DoCommand(???,$31846FB4,'')
IdCommandHandlers.TIdCommandHandler.Check('HEART_BEAT',$31846FB4)
IdCommandHandlers.TIdCommandHandlers.HandleCommand($31846FB4,'HEART_BEAT') <===
uADDatSManager.TADDatSRow.SetBlobLength($7DA10FDC,0,$C18DDDC,10,0,1,False)
uADDatSManager.TADDatSRow.SetBlobData($7DA10FDC,0,$C18DDDC,10,False)
uADDatSManager.TADDatSRow.SetData(0,$C18DDDC,10)
uADPhysMySQL.TADPhysMySQLCommand.FetchRow($7D2F4F90,nil)
uADPhysMySQL.TADPhysMySQLCommand.InternalFetchRowSet($7D2F4F90,nil,50)
uADPhysManager.DoFetch(0,50,50,False)
uADPhysManager.TADPhysCommand.FetchBase($7D2F4F90,False)
uADPhysManager.TADPhysCommandAsyncFetch.Execute
uADStanAsync.TADStanAsyncExecutor.ExecuteOperation(False)
uADStanAsync.TADStanAsyncExecutor.Run
uADPhysManager.TADPhysCommand.ExecuteTask(TADPhysCommandAsyncFetch($7DA24FEC) as IADStanAsyncOperation,TADPhysCommandAsyncFetch($7DA24FF8) as IADStanAsyncHandler,True)
uADPhysManager.TADPhysCommand.Fetch($7D2F4F90,False,True)
uADCompClient.TADCustomCommand.Fetch($7D2F4F90,False,True)
uADCompClient.TADCustomTableAdapter.Fetch(False)
uADCompClient.TADAdaptedDataSet.DoFetch($7D2F4F90,False,fdDown)
uADCompDataSet.TADDataSet.InternalFetchRows(False,True,fdDown)
uADCompDataSet.TADDataSet.GetRecord($7DA1AFF4,gmNext,True)
Data.DB.TDataSet.GetNextRecord
Data.DB.TDataSet.GetNextRecords
Data.DB.TDataSet.SetBufferCount(???)
Data.DB.TDataSet.UpdateBufferCount
Data.DB.TDataSet.DoInternalOpen
Data.DB.TDataSet.OpenCursor(???)
uADCompDataSet.TADDataSet.OpenCursor(False)
uADCompClient.TADRdbmsDataSet.OpenCursor(False)
Data.DB.TDataSet.SetActive(???)
uADCompDataSet.TADDataSet.SetActive(True)
Data.DB.TDataSet.Open
uADCompClient.TADRdbmsDataSet.Open('SELECT * FROM tagged_chemicals',(...),(...))
uADCompClient.TADRdbmsDataSet.Open('SELECT * FROM tagged_chemicals')
fMainForm.TMainForm.CheckEndOfScheduleTimerTimer($B116FAC)
Vcl.ExtCtrls.TTimer.Timer
Vcl.ExtCtrls.TTimer.WndProc(???)
System.Classes.StdWndProc(133584,275,1,0)
:768a62fa ; C:\Windows\syswow64\USER32.dll
:768a6d3a USER32.GetThreadDesktop + 0xd7
:768a77c4 ; C:\Windows\syswow64\USER32.dll
:768a788a USER32.DispatchMessageW + 0xf
Vcl.Forms.TApplication.ProcessMessage(???)
I don't understand that marked line, the sudden switch from AnyDac to Indy code
IdCommandHandlers.TIdCommandHandlers.HandleCommand($31846FB4,'HEART_BEAT') <===
uADDatSManager.TADDatSRow.SetBlobLength($7DA10FDC,0,$C18DDDC,10,0,1,False)
Can someone please explain it? Thanks
My first thought was that Indy was interrupting AnyDac, perhaps because it called Applciation.ProcessMessages or similar, but I don't see that on the stack ...
But if it can do that, then can it interrupt "normal" non-timer handler code?
I was sure that I had it cracked and that the problem was that my TCP command handlers were reusing an AnyDac component used by something else ... then I looked at my code and saw that there is no database access in the command handlers or in anything that they call.
I am stumped. Does what I wrote even make sense? Can anyone offer any advice?
Thanks a 1,000,000 in advance for any help.
Indy's commands handlers are used by TIdCmdTCPServer and TIdCmdTCPClient, which are both multi-threaded components. The command handlers are invoked inside of worker threads that Indy creates internally. There is no way that a command handler can interrupt an operation that is running in a different thread.

Why is my Ruby script utilizing 90% of my CPU?

I wrote a admin script that tails a heroku log and every n seconds, it summarizes averages and notifies me if i cross a certain threshold (yes I know and love new relic -- but I want to do custom stuff).
Here is the entire script.
I have never been a master of IO and threads, I wonder if I am making a silly mistake. I have a couple of daemon threads that have while(true){} which could be the culprit. For example:
# read new lines
f = File.open(file, "r")
f.seek(0, IO::SEEK_END)
while true do
select([f])
line = f.gets
parse_heroku_line(line)
end
I use one daemon to watch for new lines of a log, and the other to periodically summarize.
Does someone see a way to make it less processor-intensive?
This probably runs hot because you never really block while reading from the temporary file. IO::select is a thin layer over POSIX select(2). It looks like you're trying to block until the file is ready for reading, but select(2) considers EOF to be ready ("a file descriptor is also ready on end-of-file"), so you always return right away from select then call gets which returns nil at EOF.
You can get a truer EOF reading and nice blocking behavior by avoiding the thread which writes to the temp file and instead using IO::popen to fork the %x[heroku logs --ps router --tail --app pipewave-cedar] log tailer, connected to a ruby IO object on which you can loop over gets, exiting when gets returns nil (indicating the log tailer finished). gets on the pipe from the tailer will block when there's nothing to read and your script will only run as hot as it takes to do your line parsing and reporting.
EDIT: I'm not set up to actually try your code, but you should be able to replace the log tailer thread and your temp file read loop with this code to get the behavior described above:
IO.popen( %w{ heroku logs --ps router --tail --app my-heroku-app } ) do |logf|
while line = logf.gets
parse_heroku_line(line) if line =~ /^/
end
end
I also notice your reporting thread does not do anything to synchronize access to #total_lines, #total_errors, etc. So, you have some minor race conditions where you can get inconsistent values from the instance vars that parse_heroku_line method updates.
select is about whether a read would block. f is just a plain old file, so you when get to the end reads don't block, they just return nil instantly. As a result select returns instantly rather than waiting for something to be appending to the file as I assume you're expecting. Because of this you're sitting in a tight busy loop, so high cpu is to be expected.
If you are at eof (you could either check f.eof? or whether gets returns nil), then you could either start sleeping (perhaps with some sort of back off) or use something like listen to be notified of filesystem changes

Resources