I am wondering if there are any good frameworks available that would allow me to:
Identify a "ball" object in a video. There will ALWAYS be a ball object, usually an identifiable color, but not always the same darkness, etc
Track the movement of that ball object over time. For example, I need to know how far it moves (x, y coordinates) in a 5 second period.
Take into consideration camera movement. If the user backs up, twitches, etc, I still need my x,y calculations to be accurate based on the new scale factor of the video frame.
Can anyone point me to a library that would get me started down this path?
Thanks
You should look at the OpenCV library. You might be asking too much, but that's probably your best bet.
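If you do go the OpenCV route, a minimal color-threshold sketch like the one below (Python) is a common starting point for the first two requirements; the HSV bounds and the video filename are placeholders you would tune, and compensating for camera motion would still need something like feature-based registration on top of this.

```python
# Minimal sketch: color-based ball tracking with OpenCV (Python, OpenCV 4.x).
# The HSV bounds below are placeholders -- tune them for your ball's color.
import cv2
import numpy as np

LOWER_HSV = np.array([20, 100, 100])   # assumed lower bound (e.g. a yellowish ball)
UPPER_HSV = np.array([35, 255, 255])   # assumed upper bound

cap = cv2.VideoCapture("ball_video.mp4")  # hypothetical input file
positions = []                            # (frame_index, x, y) of the ball center

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, LOWER_HSV, UPPER_HSV)          # keep only ball-colored pixels
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        ball = max(contours, key=cv2.contourArea)          # assume the largest blob is the ball
        m = cv2.moments(ball)
        if m["m00"] > 0:
            cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
            positions.append((frame_idx, cx, cy))
    frame_idx += 1
cap.release()

# Displacement over a 5-second window is then the difference between the positions
# at frame_idx and frame_idx + 5 * fps (camera motion not yet compensated).
```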
We have this camera array arranged in an arc around a person (red dot). Think The Matrix - each camera fires at the same time and then we create an animated gif from the output. The problem is that it is nearly impossible to align the cameras exactly, so I am looking for a way in OpenCV to align the images better and make the result smoother.
Looking for general steps. I'm unsure of the order I would do it in. If I start with image 1 and align image 2 to it, then 2 ends up further from 3 than it was at the start, so aligning 3 to 2 would require a larger change... and the error would propagate. I have seen similar alignments done, though. Any help much appreciated.
Here's a thought. How about performing a quick and very simple "calibration" of the imaging system by using a single reference point?
The best thing about this is that you can try it out quickly, and even if the results are too poor for your purposes, they can give you more insight into the problem. The bad thing is that it may simply not be good enough, because it's hard to think of anything "less advanced" than this. Here's the description:
Remove the object from the scene
Place a small object (let's call it a "dot") at a position that roughly corresponds to the center of mass of the object you are about to record (the center of the area denoted by the red circle).
Record a single image with each camera
Use some simple algorithm to find the position of the dot on every image
Compute the offset (x, y) from the image center to the dot on every image
Shift each image by (-x, -y), where (x, y) is the offset just mentioned; after that, the dot should be located in the center of every image.
When recording an actual object, use these precomputed offsets to shift all images. After you translate the images, they will be roughly aligned. But since you are shooting an object that is three-dimensional and has considerable size, I am not sure whether the alignment will be very convincing ... I wonder what results you'd get, actually.
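A minimal sketch of these steps with OpenCV in Python, under the assumption that the dot is simply the brightest small feature in the calibration images (the filenames and the detector choice are placeholders):

```python
# Minimal sketch of the "dot" calibration described above (OpenCV, Python).
# Assumes the dot is the brightest small feature in an otherwise static scene;
# in practice, use whatever simple detector fits your dot.
import cv2
import numpy as np

def dot_offset(image_path):
    """Return (dx, dy) that translates the detected dot onto the image center."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(gray, (11, 11), 0)
    _, _, _, (x, y) = cv2.minMaxLoc(blurred)          # brightest point = assumed dot position
    h, w = gray.shape
    return (w / 2.0 - x, h / 2.0 - y)

def shift_image(image, dx, dy):
    """Translate the image by (dx, dy) pixels."""
    h, w = image.shape[:2]
    M = np.float32([[1, 0, dx], [0, 1, dy]])          # 2x3 affine translation matrix
    return cv2.warpAffine(image, M, (w, h))

# Calibration: one offset per camera, computed once from the "dot" images.
offsets = [dot_offset(f"calib_cam{i}.png") for i in range(10)]   # hypothetical filenames

# Capture: apply the precomputed shift to each camera's real image.
aligned = [shift_image(cv2.imread(f"shot_cam{i}.png"), *offsets[i]) for i in range(10)]
```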
If I understand the application correctly, you should be able to obtain the relative pose of each camera in your array using homographies:
https://docs.opencv.org/3.4.0/d9/dab/tutorial_homography.html
From here, the next step would be to correct for alignment issues by estimating the transform between each camera's actual position and their 'ideal' position in the array. These ideal positions could be computed relative to a single camera, or relative to the focus point of the array (which may help simplify calculation). For each image, applying this corrective transform will result in an image that 'looks like' it was taken from the 'ideal' position.
Note that you may need to estimate relative camera pose in 3-4 array 'sections', as it looks like you have a full 180deg array (e.g. estimate homographies for 4-5 cameras at a time). As long as you have some overlap between sections it should work out.
Most of my experience with this sort of thing comes from using MATLAB's stereo camera calibrator app and related functions. Their help page gives a good overview of how to get started estimating camera pose. OpenCV has similar functionality.
https://www.mathworks.com/help/vision/ug/stereo-camera-calibrator-app.html
The cited paper by Zhang gives a great description of the mathematics of pose estimation from correspondence, if you're interested.
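For the OpenCV side, a rough sketch of estimating a homography between a camera and a neighbouring "reference" camera from generic feature matches, and warping one image into the other's frame, might look like this (Python; the filenames are placeholders, and a calibration target would give more reliable correspondences than plain ORB matching):

```python
# Minimal sketch: align one camera's image to a neighbouring reference camera
# via a homography estimated from ORB feature matches (OpenCV, Python).
import cv2
import numpy as np

ref = cv2.imread("cam_ref.png", cv2.IMREAD_GRAYSCALE)   # hypothetical filenames
img = cv2.imread("cam_next.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(2000)
kp_ref, des_ref = orb.detectAndCompute(ref, None)
kp_img, des_img = orb.detectAndCompute(img, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_img, des_ref), key=lambda m: m.distance)[:200]

src = np.float32([kp_img[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# Robustly estimate the homography mapping the neighbour into the reference frame.
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

h, w = ref.shape
warped = cv2.warpPerspective(cv2.imread("cam_next.png"), H, (w, h))
```

Note that a single homography is only strictly valid for a (near-)planar scene or for cameras that share a centre of projection, so with an arc of cameras around a 3D subject it is an approximation; that is part of why estimating per-section transforms with some overlap helps.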
I have a few related questions.
1) Can you get the position of the device in global coordinates? I tried to get this value using ARFrame.camera.transform.columns.3, but it seems like the [X, Y, Z] in this column is always [0, 0, 0]. I interpreted this transform to be the camera's orientation with respect to the body frame. Can someone explain what exactly you get out of ARFrame.camera.transform?
2) If we have the position of the device (camera) in global coordinates, I assume we can easily get the velocity of the device. Is this a valid statement?
3) Can you only get the global position when you are tracking an object? Thus, do you get your position relative to the tracked object? I would like to get the speed of the device even when the camera shakes a lot and the tracking quality is not always good.
Yes, you can make a speedometer with ARKit. A few people have already.
Regarding your more specific questions...
ARKit doesn’t have “global coordinates”, or probably not in the sense you’re thinking. Camera and anchor transforms use a shared reference frame (“world” space in traditional 3D graphics parlance), but that reference frame is valid only within the session: 0,0,0 is where your camera/device was at the beginning of the session.
If you have two positions at two different times in any shared reference frame, the difference between those positions is the average velocity over that time.
ARKit doesn’t track objects. The camera transform is always relative to “world” space. As mentioned above, it’s 0,0,0 at the beginning of your session because the reference frame is based within the session.
If you want Global positioning — that is, relative to the Earth — you should be looking at Core Location. Note that there’s a difference of scale and precision, though: GPS is accurate to a meter or two but operates at planet scale, and ARKit is accurate to a centimeter or two but operates at room scale.
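To make the velocity point concrete, here is a minimal numpy sketch of the arithmetic only (this is not ARKit API code): the translation sits in the last column of each 4x4 camera transform, and the average speed is the displacement divided by the elapsed time between the two frames.

```python
# Minimal sketch of the arithmetic, assuming you already have two 4x4
# camera-to-world transforms (as plain matrices) and the frame timestamps.
import numpy as np

def average_speed(transform_a, transform_b, t_a, t_b):
    """transform_a/b: 4x4 camera transforms; t_a/b: frame timestamps in seconds."""
    pos_a = np.asarray(transform_a)[:3, 3]   # translation part (last column) of the transform
    pos_b = np.asarray(transform_b)[:3, 3]
    return np.linalg.norm(pos_b - pos_a) / (t_b - t_a)   # meters per second
```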
How would I go about getting the screen coordinates of something that enters the frame with the motionDetection filter? I'm fairly new to programming and would prefer a Swift answer if possible.
Example - I have the iPhone pointing at a wall, monitoring it with the motion detector. If I bounce a tennis ball against the wall, I want the app to place an image of a tennis ball on the iPhone display at the spot where it hit the wall.
To do this, I would need the coordinates of where the motion occurred.
I thought maybe the "centroid" argument did this.... but I'm not sure.
I should point out that the motion detector is pretty crude. It works by taking a low-pass filtered version of the video stream (a composite image generated by a weighted average of incoming video frames) and then subtracting that from the current video frame. Pixels that differ above a certain threshold are marked. The number of these pixels, along with the centroid of the marked pixels, is provided as a result.
The centroid is a normalized (0.0-1.0) coordinate representing the centroid of all of these differing pixels. A normalized strength gives you the percentage of the pixels that were marked as differing.
Movement in a scene will generally cause a bunch of pixels to differ, and for a single moving object the centroid will generally be the center of that object. However, it's not a reliable measure, as lighting changes, shadows, other moving objects, etc. can also cause pixels to differ.
For true object tracking, you'll want to use feature detection and tracking algorithms. Unfortunately, the framework as published does not have a fully implemented version of any of these at present.
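For reference, here is a rough numpy/OpenCV sketch of the low-pass-and-difference scheme described above; this is not the GPUImage implementation itself, and ALPHA and THRESHOLD are assumed tuning values.

```python
# Minimal sketch: running-average background, per-pixel difference, threshold,
# then centroid and strength of the differing pixels (OpenCV, Python).
import cv2
import numpy as np

ALPHA = 0.1        # weight of each new frame in the running average
THRESHOLD = 25     # per-pixel difference (0-255) considered "motion"

cap = cv2.VideoCapture(0)
background = None  # low-pass filtered composite of past frames

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
    if background is None:
        background = gray.copy()
    diff = cv2.absdiff(gray, background)
    moving = diff > THRESHOLD                      # pixels that differ above the threshold
    cv2.accumulateWeighted(gray, background, ALPHA)

    ys, xs = np.nonzero(moving)
    if len(xs) > 0:
        h, w = gray.shape
        centroid = (xs.mean() / w, ys.mean() / h)  # normalized 0.0-1.0, like the filter's output
        strength = len(xs) / float(h * w)          # fraction of pixels marked as differing
        # Multiplying the centroid by the view's width and height gives the on-screen position.
cap.release()
```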
I'm doing some work with a camera and video stabilization with OpenCV.
Let's suppose I know exactly (in meters) how much my camera has moved from one frame to another, and I want to use this to shift the second frame back to where it should be.
I'm sure I have to do some math with this number before I build the translation matrix, but I'm a little lost there... Any help?
Thanks.
EDIT: OK, I'll try to explain it better:
I want to remove the movement (shaking) of the camera from a video, and I know how much the camera has moved (and in which direction) from one frame to the next.
So what I want to do is move the second frame back to where it should be, using that information.
I have to build a translation matrix for each pair of frames and apply it to the second frame.
But here is where I have doubts: the information I have is in meters and describes the movement of the camera, while now I'm working with an image and pixels, so I think I have to do some conversion for the translation to be correct, but I'm not sure exactly what it is.
Knowing how much the camera has moved is not enough for creating a synthesized frame. For that you'll need the 3D model of the world as well, which I assume you don't have.
To demonstrate that, assume the camera movement is pure translation and you are looking at two objects: one very far away (a few kilometers) and one very close (a few centimeters). The far object will hardly move in the new frame, while the close one can move dramatically or even disappear from the field of view of the second frame. You need to know how much the viewing angle has changed for each point, and for that you need the 3D model.
Having sensor information may help in the case of rotation but it is not as useful for translations.
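That said, if you are willing to approximate the whole scene as a single plane at a roughly known distance, a sketch of the meters-to-pixels conversion and the translation matrix might look like this (Python/OpenCV; FOCAL_PX and DEPTH_M are assumptions you would replace with calibrated values, and the caveats above still apply):

```python
# Minimal sketch under a strong planar-scene assumption: a sideways camera shift
# of shift_m meters maps to roughly f_px * shift_m / depth_m pixels.
import cv2
import numpy as np

FOCAL_PX = 1200.0   # assumed focal length in pixels (from camera calibration)
DEPTH_M = 5.0       # assumed distance of the scene plane, in meters

def stabilise(frame, dx_m, dy_m):
    """Shift `frame` back to compensate a camera translation of (dx_m, dy_m) meters."""
    dx_px = FOCAL_PX * dx_m / DEPTH_M
    dy_px = FOCAL_PX * dy_m / DEPTH_M
    h, w = frame.shape[:2]
    M = np.float32([[1, 0, -dx_px], [0, 1, -dy_px]])   # translate opposite to the camera motion
    return cv2.warpAffine(frame, M, (w, h))
```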
I want to track the head of a player in order to move the camera inside XNA.
When the player rotates left or right, the camera inside XNA will respond to this action and will also rotate.
I tried using the head joint from the Skeleton Data and taking its X, Y vector values, but this is not an accurate solution. I need another solution that can rotate the camera inside XNA.
Any suggestions?
You could use the Face Tracking API and watch how a certain point on the user's face (like their nose) moves to decide whether or not the user looked in a different direction. The API gives you a fixed set of tracked points across the user's face.
Then you can check whether the X coordinate has changed, and by how much, to derive the rotation.
(You might want to see Facial Recognition with Kinect)
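As a rough illustration of that idea (generic Python, not the Kinect Face Tracking API itself; the scale factor is an assumption you would tune):

```python
# Minimal sketch: map the horizontal drift of a tracked face point (e.g. the nose)
# from its rest position to a camera yaw angle for the XNA camera.
import math

DEGREES_PER_UNIT = 60.0   # assumed: drifting across the full image width ~ 60 degrees of head turn

def camera_yaw(nose_x, reference_x, image_width):
    """Return a yaw angle in radians from the nose's horizontal offset."""
    normalised_offset = (nose_x - reference_x) / float(image_width)
    return math.radians(normalised_offset * DEGREES_PER_UNIT)
```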