What is the limit of scissor rectangles in Metal?

I am executing a big file with Metal, and it shows the following error:
-[MTLDebugRenderCommandEncoder initWithRenderCommandEncoder:parent:descriptor:]_block_invoke:807: failed assertion `Exceeded HW limit of scissor rectangles for render encoder working in Memoryless mode.'
Message from debugger: failed to send the k packet
Is there any way to solve it?

According to the Metal Feature Set tables (PDF), the limit is 16 for macOS 10.13 (and, presumably, later) and 1 everywhere else. Annoyingly, these limits are not queryable programmatically; they're only available in these tables and, as you've found, empirically by exceeding them.
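Since the limit can't be queried, the pragmatic approach is to hard-code the table value and clamp before encoding. A minimal sketch, assuming Apple's metal-cpp C++ wrapper (whose setScissorRects mirrors the Objective-C -setScissorRects:count: method); encodeScissors and kMaxScissorRects are names invented here:

#include <Metal/Metal.hpp>
#include <algorithm>
#include <vector>

// 16 on macOS 10.13+ according to the feature-set tables; 1 elsewhere.
constexpr NS::UInteger kMaxScissorRects = 16;

void encodeScissors(MTL::RenderCommandEncoder* encoder,
                    const std::vector<MTL::ScissorRect>& rects)
{
    // The limit cannot be queried at run time, so clamp to the table value.
    NS::UInteger count = std::min<NS::UInteger>(rects.size(), kMaxScissorRects);
    encoder->setScissorRects(rects.data(), count);
}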

Related

iOS 16.1 MapKit: [VKDefault] Exceeded Metal Buffer threshold of 50000

For our maps we are using MapKit, and we overlay a layer of MKPolygons above the map. This feature has been working since iOS 15, but since 16.1 we get the following error and the app freezes (it does not crash).
[VKDefault] Exceeded Metal Buffer threshold of 50000 with a count of 50892 resources, pruning resources now (Time since last prune:6.497636): Assertion with expression - false : Failed in file - /Library/Caches/com.apple.xbs/Sources/VectorKit/src/MDMapEngine.mm line - 1363
Metal API Validation Enabled [PipelineLibrary] Mapping the pipeline data cache failed, errno 22
Another interesting log line is the following:
[IconManager] No config pack found for key SPR London Landmarks
Any idea how to manually clear the Metal cache?
There seems to be a problem related to the number of MKOverlayRenderers making draw calls; having more than a few seems to trigger this issue. Using MKMultiPolyline/MKMultiPolygon seems to be a workaround, but that does not work if you need to support iOS 12 or to style the lines and polygons independently.

There can be at most 65535 Thread Groups in each dimension of a Dispatch call

I have a DirectCompute application performing computations on images (like computing the average pixel value, applying a filter, and much more). For some computations, I simply treat the image as an array of integers and dispatch a compute shader like this:
FImmediateContext.Dispatch(PixelCount, 1, 1);
The result is exactly the expected value, so the computation is correct. Nevertheless, at run time, I see the following message in the debug log:
D3D11 ERROR: ID3D11DeviceContext::Dispatch: There can be at most 65535 Thread Groups in each dimension of a Dispatch call. One of the following is too high: ThreadGroupCountX (3762013), ThreadGroupCountY (1), ThreadGroupCountZ (1) [ EXECUTION ERROR #2097390: DEVICE_DISPATCH_THREADGROUPCOUNT_OVERFLOW]
This error is shown only in the debug log; everything else is correct, including the computation result. This makes me think that the GPU somehow manages the very large thread-group count, probably breaking it into smaller groups that are executed sequentially.
My question is: should I care about this error, or is it OK to leave it and let the GPU do the work for me?
Thx.
If you only care about it working on your particular piece of hardware and driver, then it's fine. If you care about it working on all Direct3D Feature Level 11.0 cards, then it's not fine as there's no guarantee it will work on any other driver or device.
See Microsoft Docs for details on the limits for DirectCompute.
If you care about robust behavior, it's important to test DirectCompute applications across a selection of cards & drivers. The same is true of basically any use of DirectX 12. Much of the correctness behavior is left up to the application code.
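If you do need to stay within the limit, the usual fix is to fold the linear group count into two dimensions and rebuild the flat index in the shader. A minimal C++ sketch; DispatchLinear is a hypothetical helper, and the shader-side index math is only sketched in the trailing comment:

#include <d3d11.h>

void DispatchLinear(ID3D11DeviceContext* context, UINT groupCount)
{
    if (groupCount == 0)
        return;
    // 65535 per dimension on Feature Level 11.0 hardware.
    const UINT maxDim = D3D11_CS_DISPATCH_MAX_THREAD_GROUPS_PER_DIMENSION;
    UINT x = (groupCount < maxDim) ? groupCount : maxDim;
    UINT y = (groupCount + x - 1) / x;   // ceil(groupCount / x)
    context->Dispatch(x, y, 1);
    // Shader side: pass x and groupCount in a constant buffer, rebuild the
    // flat index as group = SV_GroupID.y * x + SV_GroupID.x, and return
    // early when group >= groupCount.
}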

Maximum Number of Modules

This doc states that the maximum number of modules in a deployment is 20. I am having problems getting over 15: nothing ever happens, there are no error messages, but the modules don't get deployed.
I would also like to know whether this is a soft limit and, if it is, what the process is to override it.
Did you find any errors in the edgeAgent log? You probably hit the limit on twin message size; the maximum size per twin section (tags, desired properties, reported properties) is 8 KB.

CAN 2.0 - fault confinement - modification of error counters

There's one thing in the CAN 2.0B specification that I'm not sure I understand correctly.
In chapter 8 - Fault Confinement - there are the following rules regarding modification of the error counters:
Rule 2: When a RECEIVER detects a 'dominant' bit as the first bit after sending an ERROR FLAG, the RECEIVE ERROR COUNT will be increased by 8.
And
Rule 6: Any node tolerates up to 7 consecutive 'dominant' bits after sending an ACTIVE ERROR FLAG, PASSIVE ERROR FLAG or OVERLOAD FLAG. After detecting the 14th consecutive 'dominant' bit (in case of an ACTIVE ERROR FLAG or an OVERLOAD FLAG) or after detecting the 8th consecutive 'dominant' bit following a PASSIVE ERROR FLAG, and after each sequence of additional eight consecutive 'dominant' bits, every TRANSMITTER increases its TRANSMIT ERROR COUNT by 8 and every RECEIVER increases its RECEIVE ERROR COUNT by 8.
So does Rule 2 mean that if node A sends an error frame and, after sending the 6 dominant bits of its error flag, detects that the next bit is also dominant, it should increase its RECEIVE ERROR COUNT? I thought it was OK for there to be more than 6 dominant bits in the error flag (6 to 12, to be precise)... Also, Rule 6 says "Any node tolerates up to 7 consecutive 'dominant' bits after (...)".
Rule 6 also talks about sequences of 8 consecutive dominant bits. But what exactly does this rule apply to? Only to sequences that come after the initial transmission of an error frame?
Let's take an example:
Node A sends an error frame, and the other nodes start to send their own error frames.
Node A sends the 6 bits of its error flag, then detects a 7th dominant bit (increase the counter? - Rule 2).
Then we have another 6 dominant bits, and after the 14th dominant bit node A increases its error counter again (first part of Rule 6).
Then we have another 8 dominant bits - node A increases its error counter again (second part of Rule 6).
Am I correct? I'm so confused by these rules and I need to understand them thoroughly. Hope somebody can help me :)
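To make this reading concrete, here is a toy C++ sketch that simply mechanizes the increments exactly as described in the example above; it encodes the questioner's interpretation (which is the very thing being asked about), not an authoritative one, and all names and bit counts are made up for illustration:

#include <cstdio>

int main()
{
    int rec = 0;            // RECEIVE ERROR COUNT of node A
    int consecutive = 6;    // node A's own 6-bit ACTIVE ERROR FLAG

    int extraDominant = 17; // hypothetical: the bus stays dominant this much longer

    for (int i = 1; i <= extraDominant; ++i) {
        ++consecutive;
        if (i == 1)
            rec += 8;       // Rule 2: first bit after our own flag is dominant
        if (consecutive == 14)
            rec += 8;       // Rule 6: 14th consecutive dominant bit
        else if (consecutive > 14 && (consecutive - 14) % 8 == 0)
            rec += 8;       // Rule 6: each further run of 8 dominant bits
    }
    printf("REC = %d after %d consecutive dominant bits\n", rec, consecutive);
    return 0;               // prints: REC = 24 after 23 consecutive dominant bits
}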

CL_OUT_OF_RESOURCES for 2 million floats with 1 GB VRAM?

It seems like 2 million floats should be no big deal: only 8 MB out of 1 GB of GPU RAM. I am able to allocate that much at times, and sometimes more, with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like this at clEnqueueReadBuffer, right? It should fail when I allocate the data, right? Is there some way to get more details than just the error code? It would be cool if I could see how much VRAM was allocated when OpenCL declared CL_OUT_OF_RESOURCES.
I just had the same problem you had (took me a whole day to fix).
I'm sure people with the same problem will stumble upon this, that's why I'm posting to this old question.
You probably didn't check the maximum work group size of the kernel.
This is how you do it:
// Query the largest work-group size this particular kernel supports on this device.
size_t kernel_work_group_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(size_t), &kernel_work_group_size, NULL);
My devices (2x NVIDIA GTX 460 and an Intel i7 CPU) support a maximum work group size of 1024, but the above code returns something around 500 for my path-tracing kernel.
When I used a work group size of 1024, it obviously failed and gave me the CL_OUT_OF_RESOURCES error.
The more complex your kernel becomes, the smaller its maximum work group size will become (or at least that's what I experienced).
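For what it's worth, a minimal sketch of using the queried value when enqueuing, instead of a hard-coded 1024 (queue, kernel, and n are hypothetical names for your command queue, kernel, and element count):

size_t local_size = 1024;                        // what we would like to use
if (local_size > kernel_work_group_size)
    local_size = kernel_work_group_size;         // respect the per-kernel limit
// Round the global size up to a multiple of the local size.
size_t global_size = ((n + local_size - 1) / local_size) * local_size;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                       0, NULL, NULL);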
Edit:
I just realized you said "clEnqueueReadBuffer" instead of "clEnqueueNDRangeKernel"...
My answer was related to clEnqueueNDRangeKernel.
Sorry for the mistake.
I hope this is still useful to other people.
From another source:
- calling clFinish() gets you the error status for the calculation (rather than getting it when you try to read data).
- the "out of resources" error can also be caused by a 5s timeout if the (NVidia) card is also being used as a display
- it can also appear when you have pointer errors in your kernel.
A follow-up suggests running the kernel first on the CPU to ensure you're not making out-of-bounds memory accesses.
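A minimal sketch of picking a CPU device for such a debug run, assuming the platform exposes one (GetCpuDevice is a name invented here):

#include <CL/cl.h>

cl_device_id GetCpuDevice(void)
{
    cl_platform_id platform;
    if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS)
        return NULL;                    // no OpenCL platform at all
    cl_device_id device = NULL;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
    return device;                      // NULL if the platform has no CPU device
}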
Not all available memory can necessarily be supplied to a single allocation request. Read up on heap fragmentation to learn more about why the largest allocation that can succeed is the largest contiguous block of memory, and how blocks get divided into smaller pieces as a result of using the memory.
It's not that the resource is exhausted... It just can't find a single piece big enough to satisfy your request...
Out-of-bounds accesses in a kernel are typically silent, since there is still no error at the kernel enqueue call. However, if you try to read the kernel result later with clEnqueueReadBuffer(), this error will show up; it indicates that something went wrong during kernel execution.
Check your kernel code for out-of-bounds reads/writes.
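As a concrete illustration, the guard usually looks like this: every work-item compares its global id against the element count before touching the buffer. A minimal sketch with a hypothetical kernel, shown as OpenCL C source carried in a C++ raw string the way host code typically embeds it:

const char* kGuardedKernelSrc = R"CLC(
__kernel void scale(__global float* data, const uint n)
{
    size_t gid = get_global_id(0);
    if (gid >= n)
        return;             // excess work-items touch nothing
    data[gid] *= 2.0f;
}
)CLC";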
