I'm writing an Electron app, and a few builds back testers started noticing that two electron.exe processes were constantly consuming a lot of CPU: one pegging a core and the other using about 85% of one.
I'm certain this was not always the case, as builds from several months ago didn't do this, but I'm at a loss as to how to track down which code change introduced it, since the code base has evolved dramatically over that time.
process.getIOCounters() reports that several gigabytes of IO occur every few minutes. The application is not deadlocked and everything still works; it is just chewing through CPU. It happens any time the app is open, even when it is in the background without any user input. I've only deployed this to Windows 10 x64 systems, on Electron 1.7.9 and also 1.7.5.
Based on the behavior, I'm certain this IO is interprocess communication between the renderer and main processes, but I'm not manually performing any IPC. I think this problem is caused by some module we've introduced that improperly resides in the renderer process.
My question: how does one debug the Electron renderer/main process IPC pipe? Can it be hooked to learn what the contents of the gigabytes of traffic are?
Based on the past few days of attempting to debug this, I've answered the question for myself:
My question: how does one debug the Electron renderer/main process IPC pipe?
Don't. Electron seemed like a good idea: writing all your client and platform code in the same place. But there are a lot of catches, and out of the blue, libraries will have strange bugs that are costly to address because they sit outside the mainstream use case. This certainly has a lot to do with me not being an Electron expert, but in the real world there are deadlines and timelines, and I can't always get up to speed as much as I would like to.
I've updated my architecture to the tried-and-true service/GUI model. I'll be maintaining full browser support for the client code, as well as an Electron mode with hooks for some features when Electron is detected.
This allows me to quickly identify issues that are specific to a browser, version, or platform framework. It also lets me use whichever version of Node.js I'd like for the service, which has also been an issue in my case.
I still love Electron, though; I'm just going to be more careful as I use it. If I do discover the specifics of why I had this problem, I'll check back and report those details.
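As for the literal question of hooking the pipe: the crudest trick I found is to wrap the public ipcRenderer methods from the devtools console and log each channel plus a rough payload size. A minimal sketch, assuming the Node integration that Electron 1.x renderers have by default (the wrapper is my own illustration, not an official debugging API):

const { ipcRenderer } = require('electron');

const realSend = ipcRenderer.send.bind(ipcRenderer);
ipcRenderer.send = function (channel, ...args) {
  // Approximate the serialized payload; enough to spot a chatty module.
  let size = 0;
  try { size = JSON.stringify(args).length; } catch (e) { /* non-serializable */ }
  console.log('ipc ->', channel, '~' + size + ' bytes');
  return realSend(channel, ...args);
};

// Incoming messages surface as emitted events, so wrapping emit catches them.
const realEmit = ipcRenderer.emit.bind(ipcRenderer);
ipcRenderer.emit = function (channel, ...args) {
  console.log('ipc <-', channel);
  return realEmit(channel, ...args);
};

Anything sent by a misplaced module shows up here under whatever channel name it uses.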
Update
So this issue was not directly related to Electron as I had supposed; the IPC was not between the renderer and main processes and was a red herring. It was actually a Chrome keyframe animation issue that was causing a 60 FPS redraw rate. I'm still not sure why this caused gigabytes of IPC, but there it is. See https://github.com/Microsoft/vscode/issues/22900
I was able to discover this by porting the app back to a plain browser client (with a Node.js service). I then ran it in Chrome, Edge, and Firefox; only Chrome behaved this way.
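For anyone hunting a similar stray animation today: newer Chromium builds (though not the Chromium 58 that Electron 1.7 shipped) expose document.getAnimations(), which lets you list every running CSS/web animation from the devtools console. A small sketch, using the standard Web Animations API properties:

for (const anim of document.getAnimations()) {
  // anim.effect.target is the element the keyframe animation is driving.
  const el = anim.effect && anim.effect.target;
  console.log(anim.playState, el && el.tagName, el && el.className);
}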
Related
We decided to upgrade our website's ASP.NET Core code from .NET 5 to .NET 6; we simply set the 'target framework' of the web application to .NET 6 from .NET 5. There were no compilation errors, we gave it a test in our development environment, and all seemed well.
There were no code changes at all, and the .NET 5 application had previously been running for many months without issue (and before that, .NET Framework 4.8).
When we deployed the app to our live production environment, within a few minutes we noticed a slowdown in external calls (calls to HTTPS endpoints, often REST-like). We log any call that takes more than 5 seconds, and over the space of a few minutes all calls went from slow to timing out (20 seconds).
We are using System.Net.WebClient for all of our calls, which I understand is now obsolete in .NET 6. However, I would not expect this to suddenly change behavior, and in any case we tried switching to HttpClient, the recommended approach, with the same results.
I feel like I must be missing something really fundamental: we just upgraded the target framework and redeployed, and now all calls made by WebClient eventually time out.
It feels like a running-out-of-resources issue in code, given the slowdown followed by timeouts, but I am at a loss to explain what is going on here.
To be clear, we are not doing anything special: just calling about 3 external services via WebClient for each user, with maybe 100 users a minute at peak. Previously, there were no timeouts.
Any pointers on what might be causing the timeouts would be greatly appreciated.
I guess time will tell whether this is the answer, but we changed all of our calls to use DownloadStringTaskAsync and UploadStringTaskAsync, i.e. converted every call from blocking to async/await, and after 24 hours we have not seen the same behaviour in our live environment under full load.
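For illustration, the shape of the change was roughly the following. This is a minimal sketch with a placeholder endpoint, not our production code (WebClient still compiles under .NET 6, just with an obsolescence warning):

using System.Net;
using System.Threading.Tasks;

public class QuoteClient
{
    // Before: DownloadString parks a thread-pool thread for the whole round
    // trip, so under load the pool can starve and calls slow down, then time out.
    public string GetQuoteBlocking()
    {
        using (var client = new WebClient())
            return client.DownloadString("https://example.com/api/quote");
    }

    // After: awaiting hands the thread back to the pool while the request is in flight.
    public async Task<string> GetQuoteAsync()
    {
        using (var client = new WebClient())
            return await client.DownloadStringTaskAsync("https://example.com/api/quote");
    }
}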
Why a web app on .NET 5 would not have these issues while .NET 6 does is hard to understand. For context, we are not under crazy high load (a peak of perhaps 150 users per minute), but that is what we are seeing.
Perhaps it was something specific to our setup, but I am writing this to save someone else the pain of trying to debug this issue in the future.
That is suspicious and unexpected. If you have an HttpClient repro, can you please post it on GitHub at https://github.com/dotnet/runtime/issues? (Ideally a minimal repro we can run locally for debugging.)
If your repro is not transferable to another machine, or requires specific endpoints you can't expose, we may have to guide you through some local debugging ...
-Karel (.NET Networking team)
Problem
I have an application that has been running on a Cloud Run instance for 5 months now.
The application has a startup time of about 3 minutes, and once startup is over it does not need much RAM.
Here are two snapshots of docker stats when I run the app locally.
When the app is idle:
When the app is receiving 10 requests per second (which is way over our use case for now):
There aren't any problems when I run the app locally; however, problems arise when I deploy it on Cloud Run. I keep receiving "OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k" messages, followed by a restart of the app. This is a problem because, as I said, the app takes up to 3 minutes to start, during which requests take a long time to be served.
I already fixed the cold-start issue by using a minimum instance count of 1 and a Google Cloud Scheduler job that queries the service every minute.
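(For reference, the minimum-instance half of that mitigation is a one-liner with the gcloud CLI; "my-service" is a placeholder name:)

gcloud run services update my-service --min-instances=1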
Examples
Here are examples of what I see in the logs.
In the second example, the warnings came once again just after the application restarted, which caused a second restart in a row; this happens quite often.
Also note that these warnings/restarts do not necessarily happen when users are connected to the app; they can happen when the only activity comes from the Google Cloud Scheduler.
I tried increasing the allocated resources to 4 CPUs and 4 GB of RAM (which is huge overkill), and yet the problem remains.
Update 02/21
As of 01/01/21 we stopped witnessing this behavior from our Cloud Run service (maybe due to an update, I don't know). I did contact GCP support, but they just told me to raise an issue on the OpenBLAS GitHub repo; since I can no longer reproduce the behavior, I did not do so. I'll leave the question open, as nothing I did really worked.
OpenBLAS performs high-performance compute optimizations and needs to know the CPU's capabilities to tune itself optimally.
However, when you run a container on Cloud Run, you run it in a gVisor sandbox, which increases the security and isolation of all the containers running on the same serverless platform.
This sandbox intercepts low-level kernel calls and discards the abnormal/dangerous ones. I guess that is why OpenBLAS can't determine the L2 cache size. In your local environment you don't have this sandbox, and you can access the CPU info directly.
Why does it restart? It could be a problem with OpenBLAS or a problem with Cloud Run (a suspicious kernel call could cause it to kill the instance and restart it).
I don't have an immediate solution because I don't know OpenBLAS. I saw similar behavior with TensorFlow Serving; TensorFlow offers a version compiled without any CPU optimizations, less efficient but more portable and resilient to different environment constraints. If a similar build exists for OpenBLAS, it would be great to test it.
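One related workaround worth naming explicitly (my own suggestion, not part of the answer above): OpenBLAS reads several environment variables, so you can pin its threading before it probes the hardware. A sketch assuming a Python app that loads OpenBLAS through NumPy; it reduces the probing and thread churn, though it may not silence the warning itself:

import os

# Must be set before NumPy first loads OpenBLAS.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # skip spawning per-core worker threads

import numpy as np  # OpenBLAS initializes here, with threading pinned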
I've been tasked with analyzing some memory consumption issues with one of our web apps. I'd made myself passably familiar with tools like Mission Control and VisualVM and used them to resolve a number of leaks, but in doing so came across behavior for which I can't account.
Setup
JBoss 7.1.1 AS
Java 1.7.0_67
Specifically, I've found that even when I run JBoss 7 by itself (that is, I turn off the deployer and just let the server run), I can see regular allocations (followed by garbage collection) of about 1 MB every 3 seconds or so.
On a whim, I took heap dumps immediately after doing a GC and then again once the allocations had been going on for a while. It seems like the majority of the objects I'm seeing have to do with modules: either Xerces activity (reading the module XML, I guess?) or objects associated with ModuleLoader. The majority of the objects have 'References' that look something like this:
http://i.stack.imgur.com/LlUmv.png (sorry, I can't mark up images)
My thinking (which may be entirely off base) was that JBoss scans for new modules to support hot deploys. The thing is, that use case isn't one I ever use: new deployments always involve shutting down the server, so dynamically scanning for modules is really unnecessary.
I guess my questions are:
Does my belief about module loading have any merit?
If so, is there any way to get JBoss to stop scanning?
If not, does anyone have any suggestions about what else I can investigate?
Thanks for reading!
I am experiencing heavy performance problems when generating PDFs with Jasper Reports in my Grails application. I am invoking the jasperService:
def reportDef = jasperService.buildReportDefinition(parameter, LocaleContextHolder.getLocale(), [data: emptyData])
Run in JBoss several times, performance is good. After X hours, however, performance is 100+ times worse than just after JBoss starts: response time goes from 7-12 seconds to several minutes for creating a single-page PDF. I am sure the performance lag is within this invocation, because I have added time measurements around it. As the report data is passed within the parameters, I can also rule out database connection issues.
I have analyzed the heap: it is ~50% used and not changing much during PDF creation. Overall memory is also not fully used.
I have analyzed the PermGen: it is also far from full.
The CPU is permanently at 100% during creation, which is OK, knowing that PDF creation is very CPU-intensive. I have ensured that no other process is holding the PDF creation up, first by restarting the process several times and measuring no difference (so I can exclude external interruption), and second by noting that performance is much better after JBoss is restarted.
Given these facts, I started analyzing JBoss itself, examining thread dumps while the PDF creation thread runs. I see that nothing else is running (except the thread-dumping thread), neither when it is slow nor when it is fast after a restart. I can only see that in several thread dumps Groovy is making AST transformations, which is not strange for Groovy...
Now I am desperate. Heap/PermGen are fine, CPU is fine. What the hell are Jasper Reports and Grails doing?
Maybe someone has had similar experiences or has an idea of the root cause? Is there something in Jasper Reports that needs or ought to be cleaned up?
EDIT: My further analysis led to the unproven but firm conclusion that JBoss 7.1.1 (latest stable) is the root cause. After installing the app on Tomcat, everything runs smoothly, even after several days. I'll keep this open; maybe someone has had the same experience and would like to post it? Otherwise I will close it with this solution. I may also test my app on earlier versions of JBoss, or on 7.2/7.3.
The solution: we had not noticed that JBoss was partially ignoring our Log4j configuration and was logging massively into server.log, which we were not monitoring. The Jasper and Grails plugins were writing dozens of MB into the log file for each PDF generation. After silencing these log writes, performance was good again.
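For Grails 2.x apps, the silencing can be done in Config.groovy's log4j DSL. A hypothetical fragment (the logger names are illustrative, not the ones from our app):

log4j = {
    // Raise the chatty categories above info so each PDF run stops
    // writing megabytes into server.log.
    error 'net.sf.jasperreports'   // assumed Jasper logger name
    error 'grails.app'             // the application's own debug logging
}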
We are using TestComplete from AutomatedQA to test the GUI of our client/server application, which is compiled with Delphi 2007. The client source is about 1.4 million lines. The hardware is an Intel dual-core at 2.13 GHz with 2 GB RAM, running Windows XP Pro.
I compile the application with all debug options and also link in TCOpenApp, tcOpenAppClasses, tcPublicInfo, and tcDUnitSupport, as described in the documentation, to make it an Open Application. The resulting exe file is about 50 MB.
The test script now runs and it works, but it runs very, very slowly. The CPU runs at 100%, and it is a bit frustrating to change the test script because of the slowness. I have turned off all desktop effects like rounded window corners, and there is no desktop background.
Has anyone else had the same experience, or even a solution?
Your problem probably lies in the fact that you compiled with debug info and are using the tcXXX units, resulting in an enormous number of objects being created.
A transcript from the AutomatedQA message boards:
Did you compile it in debug mode? We have an app that, when compiled in debug mode, is slow when used with TC. This is because of the enormous # of objects in it. If we compile w/o debug but with the TC enabler(s), everything is fine.
and this one might help too:
A couple of areas where you can increase speed.
If you are just using record and playback, then look into replacing the .Keys("xxx") calls with .wText = "xxx". The Keys function will use the ms delay between keystrokes, while wText just forces the text overwrite internally.
The second suggestion (which you have likely already looked at) is Tools -> Default Project Properties -> Project -> Playback, setting the delays to 100 ms, 5 ms, and 5 ms to keep the pauses to a minimum.
As for the object properties, yes, TC loads them all. You can force this with a process refresh on your application, so that the data is available without a load delay when called. This might help with reducing the appearance of delay.
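To make the first suggestion concrete, here is a hypothetical TestComplete JScript fragment; the process and window names are placeholders, not from the original posts:

var edit = Sys.Process("MyClient").Window("TEdit", "*", 1);
edit.Keys("hello");    // simulates keystrokes, honoring the inter-key delay
edit.wText = "hello";  // writes the text directly, with no per-key delay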
Edit:
We have also been evaluating TestComplete and encountered these performance problems. I would be very interested to know if and how you finally solved them.
That said, I think it is a product with great potential and can really help you with organizing all of your unit, integration and GUI tests.
I recommend that you try changing the TCP ports that TestComplete uses for remote connections. You can change them in the Network Suite Options dialog; for example, you can set ports 6100-6102. Does this help? A similar issue was described in the "TC 9.20 consuming high 98% cpu" SmartBear forum thread.