Why is saving an ID3D11Texture2D to file so painfully slow, and what to do about it?

I have some code (compiled into a standalone executable) that takes a screenshot. It runs pretty fast (around 100 ms); however, as soon as I run something graphics-intensive (like a game), it takes up to 2 minutes, even though neither the CPU nor the GPU is fully utilised. Specifically, I use IDXGIOutputDuplication to grab the screen, and then some code I have "borrowed" from https://github.com/microsoft/DirectXTex/blob/main/ScreenGrab/ScreenGrab11.cpp to save the screen as a picture to file. The "encoding" part, i.e. the use of the encoder, seems to be the problem. Any ideas about (1) where this massive performance impact is coming from and (2) how to improve it?
Full code here: https://github.com/plengauer/DXGIOutputDuplication

Related

Dask looping overhead from libraries

When calling another library with dask, such as scikit-image's contrast stretch, I realise that dask creates a result for each block, storing it either in memory or spilling it to disk separately. Then it attempts to merge all the results. That's fine if you're on a cluster, or on a single computer where the array's dataset is small; everything is fairly controlled. The problems start when you work with datasets that are much larger than your RAM or disk. Is there a way to mitigate this, or to use the zarr file format to save and update values as you go along? Maybe that's too fanciful. Any other ideas, apart from buying more RAM, would be helpful.
Edit
I was looking at the dask documentation and its suggestions on chunk sizes; something like 100 MB is recommended. I ended up reducing this significantly, to 30-70 MB depending on file size. I then ran a contrast stretch (not from a library, but with a numpy ufunc) and I didn't have any issues! In fact, I played with the way the computation is done. Since I start with a uint8 3-dimensional array, multiplying by the ratio for the contrast stretch inevitably promotes the array chunks to float64, which takes significant memory and computation. So what I have been doing is casting to float64 only just before the multiplication by the float, then returning to uint8 to finish the computation. The stretch time has come down to just under 5 minutes for a 20 GB file, so I think that's a positive step. It just means doing the image processing without libraries; I will have a look at rechunker, though.
The image-processing pipeline I am building will inevitably be used on a merged dataset of about 250-300 GB (definitely outside the limits of my laptop). I also don't have time to get to grips with cloud or parallel processing in the cloud; that's for a few months down the line. Right now it's about getting through this analysis.
Yes, you can do the kind of thing you are talking about. I encourage you to check out the rechunker project, which specialises in changing the layout of data in zarr storage, but it shows the idea of saving temporary intermediates to mitigate memory and communication issues.

How can I investigate failing calibration on Spartan 6 MIG DDR

I’m having problems with a Spartan 6 (XC6SLX16-2CSG225I) and DDR (IS43R86400D) memory interface on some custom hardware. I've tried it on an SP601 dev board and everything works as expected.
Using the example project, when I enable soft_calibration, it never completes and calib_done stays low.
If I disable calibration, I can write to the memory perfectly as far as I can see. But when I try to read from it, I get a variable number of successful read commands before the Xilinx memory controller stops executing the commands. Once this happens, the command FIFO fills up and stays full. The number of successful commands varies from 8 to 300.
I'm fairly convinced it's a timing issue, probably related to DQS centering. But because I can't get calibration to complete when enabled, I don't have continuous DQS Tuning. So I'm assuming it works with calibration disabled until the timing drifts.
Are there any obvious places I should be looking at for why calibration fails?
I know this isn't a typical stack overflow question, so if it's an inappropriate place then I'll withdraw.
Thanks
Unfortunately, the calibration process just tries to write and read content successively while adjusting taps internally. It finds one end of the successful range, then goes in the other direction to identify the successful tap at that end, and finally settles somewhere in the middle.
This is probably more HW-centric as well, so I'll post what I think and let someone else move the thread.
1. Is it just this board, or all of them? Have you checked? If it's one board, and the RAM is BGA style, it could be a bad solder job. Push your finger down slightly on the chip and see if you get different results... after this it gets more HW-centric.
2. Does the FPGA image you are running on your custom board have the ability to work on your devkit? A lot of times that isn't practical, I know, but I thought I would ask, as it rules out that the image you are using on the devkit has FPGA constraints you aren't getting in your custom image.
3. Check your length tolerances on the traces. There should have been a length constraint, plus or minus 50 mils or something like that. No one likes to hear they need a board re-spin, but if those are out, it explains a lot.
4. Signal integrity. Did you get your termination resistors in there, and are they the right values? Don't suppose you have an active probe?
5. Did you get the right DDR memory? Sometimes they use a different speed grade, and that can cause all sorts of issues.
Slowing down the interface will usually help items 4 and 5, so if you are just trying to get work done, you might ask for a new FPGA image with a slower clock.

Is "Running Time", "CPU Usage" a useful metric under Instruments to draw any conclusions?

I have profiled an app on an iPhone 4 using "Time Profiler" and "CPU Monitor", and I'm trying to make sense of the results.
Given an execution time of 8 minutes, the CPU "Running Time" is around 2 minutes.
About 67% of that is on the main thread, out of which 52% is coming from "own code".
Now, I can see the majority of time being spent in enumerating over arrays (and associated work), UIKit operations, etc.
The problem is, how do I draw any meaningful conclusions from this data, i.e. that there is something wrong going on here that needs fixing?
I can see a lot of CPU load over that running time (median at 70%) that isn't "justifiable" given the nature of the app.
Having said that, some things do stand out: parsing HTTP responses on the main thread, and creating objects eagerly (backed up by memory profiling as well).
However, what I am looking for here is offending code, along with useful conclusions based solely on CPU running time, i.e. "this spends too much time here".
Update
Let me try and elaborate in order to give a better picture.
Based on the functional requirements of this app, I can't see why it shouldn't be able to run on an iPhone 3G. A median CPU usage of around 70%, with a peak of 97%, simply looks like a red flag on an iPhone 4.
The most obvious response to this is to investigate the code and draw conclusions from that.
What I am hoping for is a categorical answer of the following form:
- if you spend anywhere between 25% and 50% of your time in CA, there is something wrong with your animations
- if you spend 1000 ms on anything related to UIKit, better check your processing
Then again, maybe there aren't any answers, only indications of things being off when it comes to running time and CPU usage.
The answer to the question "is there something wrong going on here that needs fixing" is simple: do you see a problem while using the application? If yes (you see glitches in animation, or the app hangs for a while), you probably want to fix it. If not, you may be looking at premature optimization.
Nonetheless, parsing HTTP responses on the main thread may be a bad idea.
In dev presentations, Apple has pointed out that whilst CPU usage is not an accurate indicator in the simulator, it is something to take stock of when profiling on a device. Personally, I would consider any thread that takes significant CPU time without good reason a problem that needs to be resolved.
Find the time sinks, prioritise them by percentage, and start working through them. They may not be visible problems now, but they will begin to degrade the user's experience of the app, and potentially the device too, if they have not already.
Check out their documentation on how to effectively use CPU profiling for some handy hints.
If enumeration of arrays is taking a lot of time then I would suggest that dictionaries or other more effective caches could be appropriate, assuming you can spare some memory to ease CPU.
An effective approach may be to remove all business logic from the main thread (a given) and make a good boundary layer between the app and the parsing / business logic. From here you can hook in some test suites that can better tell you whether the code is at fault or whether it's simply the heavy requirements of the app's UI itself...
Eight minutes?
Without beating around the bush, you want to make your application faster, right?
Forget looking at CPU load and wondering if it's the right amount.
Forget guessing if it's HTTP parsing. Maybe it is, but guessing won't tell you.
Forget rummaging around in the code timing things in hopes that you will find the problem(s).
You can find out directly why it is spending so much time.
Here's the method I use,
and here's an (amateurish) video of it.
Here's what will happen if you do that.
First you will find something you would never have guessed, and when you fix it you will lop a big chunk off that 8 minutes, like maybe down to 6 minutes.
Then you do it again, and lop off another big chunk.
You repeat until you can't find anything to fix, and then it will be much faster than your 8 minutes.
OK, now the ball is in your court.

How to improve accuracy of profiling

I want to improve the running time of some code.
In order to do that, I first time the running time of all the relevant code, using code like this:
before := rdtsc;  // cycle counter before the call
myobject.run;
after := rdtsc;   // cycle counter after the call
Then I zoom in and time a relevant part, like so:
procedure myobject.part;
begin
  StartTime := rdtsc;
  ...                                   // the actual work being timed
  EndTime := rdtsc;
  inc(TotalTime, EndTime - StartTime);  // accumulate cycles over all calls
end;
I have some code to copy-paste the timings into Excel; a typical outcome would look like this:
(The 89.8% and 10.2% adding up to 100% is a coincidence and has nothing to do with the data or the question.)
(When the data shows 1, it means 0; this is to avoid divide-by-zero errors.)
Note the difference between run A and run B.
I have not changed anything yet so run A and B should give the same running time.
Further note that I know that on both runs procedure part was invoked exactly the same number of times (the data is the same and the algorithm is deterministic).
The running time of procedure part is very short (it is just called many times).
If there were some way to block out other processes during these short bursts of runtime (less than 700 CPU cycles), my timings would be much more accurate.
How do I get these timings to be more reliable?
Is there a way to monopolize the CPU to only run my task when timing and nothing else?
Note that I'm not looking for obvious answers like:
- Close other running programs
- Disable the virus scanner, etc.
I've tagged the question Delphi because I'm using Delphi right now (and there may be some Delphi specific option to achieve this result).
I've also tagged it language-agnostic because there may be some more general way.
Update
Because I'm using the CPU instruction RDTSC I'm not affected by CPU throttling. If the CPU slows down, the number of cycles stays the same.
Update2
I have 2 answers, but neither answers the question...
The question is: how do I prevent these variations in running time?
Do I have to run the code 20 times and always take the lowest running time out of the 20 runs?
Or do I set my program priority to realtime?
Or is there some other trick to use so my code sample does not get interrupted?
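To illustrate what I mean by "some other trick", here is a minimal sketch (assuming a recent 32-bit Delphi on Windows; the types and the rdtsc mnemonic may need adjusting on older compilers) of pinning the thread to one core, raising its priority, and keeping the best of several runs:
uses Windows, SysUtils;

// rdtsc as in the question; the result lands in EDX:EAX, which matches the
// 32-bit Int64 return convention. Older assemblers that don't accept the
// mnemonic can use "db $0F, $31" instead.
function ReadTSC: Int64;
asm
  rdtsc
end;

// Run Work several times on a single core at high priority and keep the
// fastest run, i.e. the one that was interrupted the least.
function TimeBestOf(Work: TProcedure; Runs: Integer): Int64;
var
  OldAffinity: DWORD_PTR;
  StartTicks, Elapsed: Int64;
  I: Integer;
begin
  Result := High(Int64);
  OldAffinity := SetThreadAffinityMask(GetCurrentThread, 1); // stay on one core
  SetThreadPriority(GetCurrentThread, THREAD_PRIORITY_TIME_CRITICAL);
  try
    for I := 1 to Runs do
    begin
      StartTicks := ReadTSC;
      Work;
      Elapsed := ReadTSC - StartTicks;
      if Elapsed < Result then
        Result := Elapsed; // keep the least-disturbed measurement
    end;
  finally
    SetThreadPriority(GetCurrentThread, THREAD_PRIORITY_NORMAL);
    SetThreadAffinityMask(GetCurrentThread, OldAffinity);
  end;
end;
Whether this is the right approach, or whether keeping the minimum of many runs is the best one can do, is exactly what I'm asking.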
I want to improve the running time of some code.
In order to do that, I first time the running time of all relevant code, ...
OK, I'm a bit of a stuck record on this subject, but lots of people think that improving running time requires first measuring it accurately.
Not so.
Improving running time requires finding out what's taking a large fraction of time (the exact fraction does not matter) and doing it differently or maybe not at all.
What it's doing is often not revealed by timing individual routines.
Here's the method I use,
and here's a very amateur video of it.
The problem with profiling your code like that, by sticking special statements into it, is that those special statements themselves take time to run. And since the things taking the most time are likely to be things happening in tight loops, the more they run, the more they distort your timings. What you need for good information is something that will observe your program from outside, without modifying the executing code.
In other words, you need a sampling profiler. And there just happens to be a very good one for Delphi available for free, by the rather descriptive name of Sampling Profiler. It runs your program and watches what it's doing, then correlates that against the map file (make sure to set up your project options to generate a Detailed map file) to give you an intelligible readout on what your program is spending its time on.
And if you want to narrow things down, you can use OutputDebugString to output profiling commands to make it only pay attention to specific parts of your code. It's got instructions in the help file.
I've used a lot of different methods, and this is the most useful way I've found to figure out what Delphi programs are spending their time on. And it's free. Give it a try.
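For reference, a minimal sketch of the OutputDebugString approach. The exact directive strings that Sampling Profiler recognises are listed in its help file; the 'SAMPLING ON' / 'SAMPLING OFF' strings below are an assumption and may need adjusting:
uses Windows;

procedure ProfileOnlyThisPart;
begin
  // Assumed directive strings; check the Sampling Profiler help file for the
  // exact commands it recognises.
  OutputDebugString('SAMPLING ON');
  try
    // ... the code you want the profiler to pay attention to ...
  finally
    OutputDebugString('SAMPLING OFF');
  end;
end;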

What is the fastest way of loading and re-sizing an image?

I need to display thumbnails of images in a given directory. I use TFileStream to read the image file before loading the image into an image component. The bitmap is then resized to the thumbnail size, and assigned to a TImage component on a TScrollBox.
It seems to work ok, but slows down quite a lot with larger images.
Is there a faster way of loading (image) files from disk and resizing them?
Thanks, Pieter
Not really. What you can do is resize them in a background thread and use a "placeholder" image until the resizing is done. I would then save these resized images to some sort of cache file for later processing (Windows does this and calls the cache Thumbs.db, stored in the current directory).
You have several options for the thread architecture itself: a single thread that does all the images, or a thread pool where each thread only knows how to process a single image. The AsyncCalls library is yet another way and can keep things fairly simple.
I'll complement the answer by skamradt with an attempt to design this to be as fast as possible. For this you should:
- optimize I/O
- use multiple threads to make use of multiple CPU cores, and to keep even a single CPU core working while you read (or write) files
The use of multiple threads implies that using VCL classes for the resizing isn't going to work, as the VCL isn't thread-safe, and all hacks around that don't scale well. efg's Computer Lab has links for image processing code.
It's important not to cause several concurrent I/O operations when using multiple threads. If you choose to write the thumbnail images back to files, then once you have started reading a file you should read it completely, and once you have started writing a file you should also write it completely. Interleaving both operations will kill your I/O, because you potentially cause a lot of seek operations of the hard disk head.
For best results, the reading (and writing) of files should also not happen in the main (GUI) thread of your application. That would suggest the following design (a minimal sketch of the hand-off follows the list):
Have one thread read files into TGraphic objects, and put these into a thread-safe list.
Have a thread pool wait on the list of files in original size, and have one thread process one TGraphic object, resize it into another TGraphic object, and add this to another thread-safe list.
Notify the GUI thread for each thumbnail image added to the list, so it can be displayed.
If thumbnails are to be written to file, do this in the reading thread as well (see above for an explanation).
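Here is a minimal sketch of the hand-off between the reading thread and the resizing pool, assuming a TThreadList is used as the thread-safe list (the names are illustrative, not from the original answer):
uses Classes, Graphics;

var
  PendingImages: TThreadList; // created at startup, freed at shutdown

// Called from the reading thread once a file has been loaded completely.
procedure QueueForResizing(Graphic: TGraphic);
begin
  PendingImages.Add(Graphic);
end;

// Called from a worker thread; returns nil when nothing is queued.
function TakeNextImage: TGraphic;
var
  List: TList;
begin
  Result := nil;
  List := PendingImages.LockList;
  try
    if List.Count > 0 then
    begin
      Result := TGraphic(List[0]);
      List.Delete(0);
    end;
  finally
    PendingImages.UnlockList;
  end;
end;
The worker would then resize into a second TGraphic, push it onto a second list of the same shape, and notify the GUI thread (for instance with PostMessage or TThread.Synchronize) so the thumbnail can be displayed.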
Edit:
On re-reading your question I notice that you may only need to resize one image, in which case a single background thread is of course enough. I'll leave my answer in place anyway; maybe it will be of use to someone else some time. It's what I learned from one of my latest projects, where the final program could have used a little more speed but was only using about 75% of the quad-core machine at peak times. Decoupling I/O from processing would have made the difference.
I often use TJPEGImage with Scale := jsEighth (in Delphi 7). This is really fast, because the JPEG decompression can skip a lot of the data to fill a bitmap of only an eighth of the width and height.
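A minimal sketch of that approach, assuming a JPEG source file and a Delphi version that ships the Jpeg unit (the routine name is illustrative):
uses Graphics, Jpeg;

// Decode a JPEG at one eighth of its size and hand it to a bitmap.
procedure LoadEighthSizeThumbnail(const FileName: string; Target: TBitmap);
var
  Jpg: TJPEGImage;
begin
  Jpg := TJPEGImage.Create;
  try
    Jpg.Scale := jsEighth;      // the decoder skips most of the data
    Jpg.LoadFromFile(FileName);
    Target.Assign(Jpg);         // decompression happens here, at 1/8 size
  finally
    Jpg.Free;
  end;
end;
From there the 1/8-size bitmap can be stretched down further to the exact thumbnail size, which is far cheaper than decoding the full image first.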
Another option is to use the shell's method to extract a thumbnail, which is pretty fast as well.
I'm in the vision business, and I simply upload the images to the GPU using OpenGL (typically 20 images of 2048x2000x8bpp per second), one bmp per texture, and let the video card do the scaling (Win32, Mike Lischke's OpenGL headers).
Uploading such an image costs 5-10 ms depending on the exact video card (if it is not integrated and is an NVIDIA 7300 series or newer; very recent integrated GPUs might be doable too). Scaling and displaying costs 300 µs, which means customers can pan and zoom like crazy without touching the app. I draw an overlay (which used to be a TMetafile but is now a custom format) on top of it.
My biggest picture is 4096x7000x8bpp, which shows and scales in under 30 ms (GF 8600).
A limitation of this technology is max texture size. It can be resolved by fragmenting the picture into multiple textures, but I haven't bothered yet because I deliver the systems with the software.
Some typical maximum texture sizes:
- nv6x00 series: 2k*2k, but uploading is just about break-even compared to GDI
- nv7x00 series: 4k*4k; for me the baseline cards. GF7300s are like $20-40
- nv8x00 series: 8k*8k
Note that this might not be for everybody. But if you are in the lucky position of being able to specify hardware limits, it might work. The main problem is laptops like Thinkpads, whose GPUs are older than the average laptop's, which in turn are often a generation behind desktops.
I chose OpenGL over DirectX because it is more stable over time and it is easier to find non-game-related examples.
Take a look at the Graphics32 library: it's very good at drawing things and works great with bitmaps. It is thread-safe, comes with good examples, and it's totally free.
Exploit Windows' capability to create thumbnails. Remember those hidden Thumbs.db files in folders that contain images?
I have implemented something like this feature, but in VB. My software is able to build thumbnails of 100 files (of mixed sizes) in around 10 seconds.
I am not able to convert it to Delphi, though.
