Receiving ERR_QUEUE_FULL error while replaying a log file over a Vector interface via python-can but runs fine while using Vector CANoe replay block - can-bus

Context:
I am trying to replay a BLF file using python-can over a vector interface with an implementation of MessageSync iterator object and can.send operating on the yielded messages. While it functions for 20-30 seconds as expected but after that it keeps raising ERR_QUEUE_FULL exception while sending CAN messages. Have tried to handle that using can_bus.flush_tx_buffer() and can_bus.reset() but to no effect. I understand that the transmit buffer gets full while the messages are written too fast at a given segment causing buffer overflow.
Usage:
replayReaderObj = LogReader(replay_file_path)
msgSyncObj = MessageSync(messages=replayReaderObj, timestamps=True)
I am iterating via msgSyncObj using a for loop and using can.send() on messages (provided message is not an error frame). Default args of gap(0.0001) and skip(60) are considered in which case replay timestamps are considerably delayed compared to the replay file. Hence gap as 0 is included in next attempt to ensure only offset difference is considered. It aligns the replay timestamps but causes buffer overflow in few seconds.
The same replay file while run over a Vector CANoe replay blocks runs just fine without any buffer issues in given replay duration(+10%).
Question:
Can anyone shed light on whether python-can and Vector CANoe (both running on Win10 PC) has different way of configuring transmit queue buffer? Any suggestions on how I can increase the transmit queue buffer used by python-can is highly appreciated along with handling such buffer overflows(since flush_tx_buffer isn't having any impact).
Note: In Vector Hardware Configuration, transmit queue size is configured as 256 messages. I am not sure if python-can uses the same configuration before I want to change it.
Additional context
OS and version: Win 10
Python version: Python 3
python-can version: 3.3.4
python-can interface/s (if applicable): Vector VN1630
There is another real ECU for acknowledgement of Tx messages. This runs fine if I keep a decent wait time(10 ms - minimum that time.sleep() in Python Windows can provide) between consecutive messages. Drawback is that with the wait time injection, it takes 6x-7x times the actual replay time.
Let me know for any further information on top of this. Sorry, I will not be able to share the trace file as it is proprietary, but any details regarding it's nature I can get back on it.

Related

Should CUDA stream be waited to be complete even if the output data are to be sent to OpenGL instead of CPU?

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that will effectively get image from a webcam (yielding a host-memory located cv::Mat), upload it to a GPU device memory in CUDA (i.e. cv::GpuMat), do some processing using CUDA and get a result finalCudaMat, and finally send the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to be able to make CUDA calls asynchronous and not wait on every single operation to be finished on CPU side. Now if someone instead is to eventually copy the result to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in a customary situation, typically a wait on the stream is required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU side matrix.
In my case, the the result never gets back to the CPU as it continues to be rendered on the screen (a bit of OpenGL operations and shaders are also involved).
One way, it might be appropriate to wait for CUDA work to finish before starting to copy the GpuMat to OpenGL Buffer. So if I add the wait on stream, everything is working fine and the CUDA operations take ~2.5ms.
Another way, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway -- CPU is never invovled again). Therefore I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to be working fine. The whole CUDA processing operation (basically any GPU task minus OpenGL related) apparently takes ~1.8ms for me.
In the past I have had bad experience of not properly synchronization GPU work if two different APIs were involved (e.g. do something on Direct3D 9, do not wait for it to finish, and then copy the resulting texture to a Direct3D 10 texture, and clearly on some frames the image becomes empty or torn).
At this point, the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder if I am technically doing a correct work by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular it says that
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.
As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). Now this used the be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However due to the advent of Vulkan and with it the support for external resources (and OpenGL extensions for that) you can in fact synchonize between CUDA and OpenGL command streams, by having both sides import platform native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU side driver, but since that part has to manage the command streams anyway it's not really an issue of efficiency.

Why do you use `stream` for GRPC/protobuf file transfers?

I've seen a couple examples like this:
service Service{
rpc updload(stream Data) returns (google.protobuf.Empty) {};
rpc download(google.protobuf.Empty) returns (stream Data) {};
}
message Data { bytes bytes = 1; }
What is the purpose of using stream, does it make the transfer more efficient?
In theory - yes - I obviously wan't to stream my file transfers but that's what happens over a connection... So, what is the actual benefit to this keyword, does it enforce some form of special buffering to reduce some overhead? Either way, the data is being transmitted, in full!
It's more efficient because, within a single call, multiple messages may be sent.
This avoids, not only re-establishing another (hopefully TLS i.e. even more work) connection with the server but also avoids spinning up client and server "stubs"; both the client and server are ready for more messages.
It's somewhat similar to being connected on a telephone call with your friend who, before hanging up, says "Oh, another thing...". Instead of hanging up the call and then, 5 minutes later, calling you back, interrupting dinner and causing you to pause a movie.
The answer is very similar to the gRPC + Image Upload question, although from a different perspective.
Doing a large download (10+ MB) as a single response message puts strong limits on the size of that download, as the entire response message is sent and processed at once. For most use cases, it is much better to chunk a 100 MB file into 1-10 MB chunks than require all 100 MB to be in memory at once. That also allows the downloader to begin processing the file before the entire file is acquired which reduces processing latency.
Without streaming, chunking would require multiple RPCs, which are annoying to coordinate and have performance complications. Because there is latency to complete RPCs, for reasonable performance you either have to do many RPCs in parallel (but how many?) or have a large batch size (but how big?). Multiple RPCs can also hit colder application caches, as each RPC goes to a different backend.
Using streaming provides the same throughput as the non-chunking approach without as many headaches of normal chunking approaches. Since streaming is pipelined (server can start sending next chunk as soon as previous chunk is sent) there's no added per-chunk latency between the client and server. This makes it much easier to choose a chunk size, as there is a wide range of "reasonable" sizes that will behave similarly and the system will naturally react as network performance varies.
While sending a message on an existing stream has less overhead than creating a new RPC, for many users the difference is negligible and it is generally better to structure your RPCs in a way that is architecturally beneficial to your application and not just to eek out small optimizations in gRPC. The reason to use the stream in this case is to make your application perform better at lower complexity.

Time between callback calls?

I have a lab project that uses mainly PyAudio and to further understand its way of working I made some measurements, in this case time between callbacks (using callback mode).
I timed it, and got an interesting result
(#256 chunk size, 44.1k fs): 0.0099701;0.0000365;0.0000201;0.0201579
This pattern goes on and on.
Between two longer calls, we have two shorter calls and sometimes the longer call is shorter (mind you I don't do anything else in the program than time the callbacks).
If we average this out we get our desired callback time:
1/44100 * 256 (roughly 5.8ms)
Here is my measurement visualized:
So can someone explain what exactly happens here under the hood?
What happens under the hood in PortAudio is dependent on a number of factors, including:
Which native audio API PortAudio is talking to
What buffer size and latency parameters you passed to Pa_OpenStream()
The capabilities of the audio hardware and its drivers, including its supported buffer sizes, buffering model and timing characteristics.
Under some circumstances PortAudio will request larger buffers from the native audio API and then invoke the PortAudio user callback multiple times in quick succession. This can happen if you have selected a small callback buffer size and a long latency.
Another scenario is that the native audio API doesn't support the buffer size that you requested for your callback size (framesPerBuffer parameter to Pa_OpenStream()). In this case PortAudio will be forced to use a driver-supported buffer size and then "adapt" between that buffer size and your callback buffer size. This adaption process can cause irregular timing.
Yet another possibility is that the native audio API uses a large ring buffer. Each time PortAudio polls the native host API, it will work to fill the native ring buffer by calling your callback as many times as needed. In this case irregular timing is related to the polling rate.
The above are not the only possibilities.
One likely explanation of what is happening in your case is that PortAudio is calling your callback 3 times in fast succession (a guess would be that the native buffer size is 3x your callback buffer size), for one of the reasons above.
Another possibility is that the native audio subsystem is signalling PortAudio irregularly. This can happen if a system layer below PortAudio is doing similar kinds of buffering to what I described above. I have seen this happen with DirectSound on Windows 7 for example. ASIO4ALL drivers will exhibit +/- 1ms jitter (which is not what you're seeing).
You can try reducing the requested stream latency to 0 and see if that changes the result. This will force double-buffering, which may or may not produce stable output. Another thing to try is to use the paFramesPerBufferUnspecified parameter, which will cause the callback to be called with the native buffer size -- then you can observe whether there is greater periodicity, what that buffer size is, and also whether the buffer size varies from callback to callback.
You didn't say which operating system and host API you're targetting, so it's hard to give more specific details than the above.
The internal buffering models used by the various PortAudio host API backends are described in some detail on the PortAudio wiki.
To answer a related question: why is it like this? Aside from the cases where it is a function of the lower layers of the native audio subsystem, or the buffer adaption process, it is often a result of specifying a large suggested latency to Pa_OpenStream(). Some PortAudio host APIs will relax the buffer periodicity if the specified latency is very high, in order to reduce system load that would be caused by high-frequency timer callbacks.

What is the best size for a buffer in BlackBerry?

In my application I need to read data from an input stream. I have set the current buffer size for reading as 1024. But I have seen in some Android applications buffer size has been kept as 8192 (8 KB). Will there be any specific advantage if I increase the buffer size in my application to 8KB?
Any expert opinion will be much appreciated.
Edit: (I am using BB OS 6 and 7 and I am dealing with network inputstream.)
I can't say that I've found the universally best buffer size, but it seems to me that something in the range of 1KB to 8KB should be fine in most situations (for BlackBerry Java apps).
Keep in mind that if the amount of data is small (so you'd probably only need one or two buffers at 1KB-8KB), it's probably best just to use the IOUtilities method:
byte[] result = IOUtilities.streamToBytes(inputStream);
with which you don't need to actually pick a buffer size. But, if you know that result would be a large block of data, you're probably right in wanting to read one buffer at a time.
However, I would argue that the answer should almost always be obtained simply by building the app, and measuring performance with a few different values for byte buffer size. It's easy enough to change one constant, build, run and measure again, and then you're not guessing, or taking the advice of someone who doesn't know all the details of your app.
See here for information about BlackBerry Eclipse plugin memory analysis, and
here for BlackBerry Eclipse plugin profiling.
These tools are found in Eclipse by selecting the Window menu, then Show View -> Other... -> BlackBerry -> BlackBerry Memory Statistics View, or BlackBerry Profiler View, while debugging.
This way, you can see how much memory, or processor, the network code is using during the call to retrieve data and populate your buffer.
More
BlackBerry InputStream to String conversion
This question was also asked in the official BlackBerry forum here:
http://supportforums.blackberry.com/t5/Java-Development/What-is-the-best-size-for-a-buffer-in-BlackBerry/td-p/2559417
The OP gave this clarification:
"I am reading from network. Once I establish socket connection with the server, the server will send me notifications one after the other. I need to read the notifications/data from the inputstream available in the socket connection. For this I have a background thread which checks anything is available in the inputstream and if something is available, it will read with the help of a buffer and then passes the read data to a StringBuffer."
Given this information, I have a different take, in that I think the BlackBerry network handling abstracts the Java application from the network buffer processing to the extent that the application buffer size will have little if any impact on the performance.
But be aware, this is only my opinion.
My response on that thread was as follows:
First thing to note is that the method "isAvailable()", in my experience, does not work correctly on OS 5.0 and earlier. It is fixed in OS 6 (at least from my testing).
Because isAvailable() was broken, (and for other application reasons) what I have implemented for a socket connection is that each message is preceded by a length. So in the socket connection, I read the length of the next message, and then the actual data. This is done with no blocking - in other words I read the entire message, regardless of size. I recommend you do the same. The message must exist in full somewhere so it makes no difference if it is in some memory managed by the socket connection, or in some memory managed by you.
Note also, until OS 6.0, when you did the read you would get all the data to fill the buffer you had - in other words it waited till the buffer was full. In OS 6.0 and later, the read can complete without giving you a full buffer.
In your case, you might be working in a post OS 6.0 only, so you could use isAvailable() - create a buffer of that size, and read everything. I can't see that it makes any difference whether you have the bytes in memory managed by the socket, or memory managed by you.
But in fact, I would argue that the best approach is the one that makes your processing simplest. So for example, if you know that the next message is 200 bytes, then read 200 bytes, and then process that message. Then read the next message.
You could spend a lot of time attempting to manage the buffers to match the underlying socket buffers. I don't know exactly how the underlying BlackBerry socket processing code works, but it doesn't put data directly into your buffers. So let it manage its buffer size to optimize the network, you manage your buffer size to optimize your processing. That will work best for everyone.

DSP on Beaglebone

I have a Beaglebone running Ubuntu. We want to continuously sample from 3 on-board ATD converters at 100KS/s, and every window of samples we will run a cross correlation DSP algorithm. Once we find a correlation value above a threshold, we will send the value to a PC.
My concern is the process scheduling in Ubuntu. If our process gets swapped out and an ATD sample becomes available during this time, the process will miss the sample. We need to ensure that our process will capture every sample and save it in memory.
With this being said, is there a way to trigger interrupts on the Beaglebone so that if an ATD sample is ready, the sample will be saved in the memory of our program even if the program does not have the processor at the time?
Thanks!
You might be able to trigger the EDMA or use the PRUSS. Probably best to ask on beagleboard#googlegroups.com. There isn't a DSP per-se on the BeagleBone.
This is not exactly an answer to your question, but hopefully it explains how the process works. Since you didn't mention what hardware you are running for AD conversion, maybe this is the best that can be done:
With audio hardware, which faces the same problem, the solution comes from the hardware and the drivers working together: whenever the hardware has filled up enough of the buffer it signals the driver (via an interrupt or some similar mechanism). In some cases, it's also possible that the driver polls the hardware or something like that, but that's a less efficient solution, and I'm not sure anyone does it that way anymore (maybe on cheaper hardware?). From there, the driver process may call right into the end-user process, or it may simply mark the relevant end-user process as "runnable". Either way, control needs to be transferred to the end user process.
For that to happen, the end user process must be running at a higher priority than anything else occupying the CPUs at that moment. To guarantee that your process will always be first in the queue, you can run it at a high priority, with the appropriate permissions, you can even run in very high priorities.
The time it takes for the top priority process to go from runnable to running is sometimes called the "latency" of the OS, though I am sure there's a more specific technical term. The latency of Linux is on the order of 1 ms, but since it's not a "hard" real-time OS, this is not a guarantee. If this is too long to handle your chunks of data, you may have to buffer some of it in your driver.

Resources