I know my question sounds very broad but that's the nature of the problem statement I am working on. I am not looking for code samples (unless it helps explain the solution) but an explanation more from a technology perspective and challenges therein.
My company has indoor positioning figured out from a 3rd party vendor using beacons. Accuracy is about 3-5m which is acceptable for the kind of use case we have. What we also want to try and do is map few fixed assets in 3D space based on their 2D position (x,y) information that we receive from the indoor positioning solution. We would like to use ARKit for doing the spatial mapping of these assets in AR view. What we have with us in order to realise a solution like this is real-time position information of the user in (x, y), real-time azimuth as observed on the user's device hence giving the user's bearing and as stated earlier the position information in (x, y) of the fixed assets in the building.
I would like to know, for doing spatial mapping of these assets in 3D space (assuming a fixed elevation for these assets), would the data that we have suffice? What else does ARKit need in order to accurately map these assets in AR view? I have done some reading on AR sessions and world tracking configurations in ARKit and not very sure if we should be using world alignment of gravity or gravity/heading? Should we be converting the x, y coordinates as received from the indoor positioning software to lat, long?
Any pointers towards the correct approach would be much appreciated. And sorry again if the questions sound too broad, I will try to be as specific in my responses to your questions if need be.
Thanks,
Myra
Related
I can see that it can measure horizontal and vertical distances with +/-5% accuracy. I have a use case scenario in which I am trying to formulate an algorithm to detect distances between two points in an image or video. Any pointers to how it could be working would be very useful to me.
I don't think the source is available for the Android measure app, but it is ARCore based and I would expect it uses a combination of triangulation and knowledge it reads from the 'scene', using the Google ARCore term, it is viewing.
Like a human estimating distance to a point, by basic triangulation between two eyes and the point being looked at, a measurement app is able to look at multiple views of the scene and to measure using its sensors how far the device has moved between the different views. Even a small movement allows the same triangulation techniques be used.
The reason for mentioning all this is to highlight that you do not have the same tools or information available to you if you are analysing image or video files without any position or sensor data. Hence, the Google measure app may not be the best template for you to look to for your particular problem.
I recently managed to get my augmented reality application up and running close to what is expected. However, I'm having an issue where, even though the values are correct, the augmentation is still off by some translation! It would be wonderful to get this solved as I'm so close to having this done.
The system utilizes an external tracking system (Polaris Spectra stereo optical tracker) with IR-reflective markers to establish global and reference frames. I have a LEGO structure with a marker attached which is the target of the augmentation, a 3D model of the LEGO structure created using CAD with the exact specs of its real-world counterpart, a tracked pointer tool, and a camera with a world reference marker attached to it. The virtual space was registered to the real world using a toolset in 3D Slicer, a medical imaging software which is the environment I'm developing in. Below are a couple of photos just to clarify exactly the system I'm dealing with (May or may not be relevant to the issue).
So a brief overview of exactly what each marker/component does (Markers are the black crosses with four silver balls):
The world marker (1st image on right) is the reference frame for all other marker's transformations. It is fixed to the LEGO model so that a single registration can be done for the LEGO's virtual equivalent.
The camera marker (1st image, attached to camera) tracks the camera. The camera is registered to this marker by an extrinsic calibration performed using cv::solvePnP().
The checkerboard is used to acquire data for extrinsic calibration using a tracked pointer (unshown) and cv::findChessboardCorners().
Up until now I've been smashing my face against the mathematics behind the system until everything finally lined up. When I move where I estimate the camera origin to be to the reference origin, the translation vector between the two is about [0; 0; 0]. So all of the registration appears to work correctly. However, when I run my application, I get the following results:
As you can see, there's a strange offset in the augmentation. I've tried removing distortion correction on the image (currently done with cv::undistort()), but it just makes the issue worse. The rotations are all correct and, as I said before, the translations all seem fine. I'm at a loss for what could be causing this. Of course, there's so much that can go wrong during implementation of the rendering pipeline, so I'm mostly posting this here under the hope that someone has experienced a similar issue. I already performed this project using a webcam-based tracking method and experienced no issues like this even though I used the same rendering process.
I've been purposefully a little ambiguous in this post to avoid bogging down readers with the minutia of the situation as there are so many different details I could include. If any more information is needed I can provide it. Any advice or insight would be massively appreciated. Thanks!
Here are a few tests that you could do to validate that each module works well.
First verify your extrinsic and intrinsic calibrations:
Check that the position of the virtual scene-marker with respect to the virtual lego scene accurately corresponds to the position of the real scene-marker with respect to the real lego scene (e.g. the real scene-marker may have moved since you last measured its position).
Same for the camera-marker, which may have moved since you last calibrated its position with respect to the camera optical center.
Check that the calibration of the camera is still accurate. For such a camera, prefer a camera matrix of the form [fx,0,cx;0,fy,cy;0,0,1] (i.e. with a skew fixed to zero) and estimate the camera distortion coefficients (NB: OpenCV's undistort functions do not support camera matrices with non-zero skews; using such matrices may not raise any exception but will result in erroneous undistortions).
Check that the marker tracker does not need to be recalibrated.
Then verify the rendering pipeline, e.g. by checking that the scene-marker reprojects correctly into the camera image when moving the camera around.
If it does not reproject correctly, there is probably an error with the way you map the OpenCV camera matrix into the OpenGL projection matrix, or with the way you map the OpenCV camera pose into the OpenGL model view matrix. Try to determine which one is wrong using toy examples with simple 3D points and simple projection and modelview matrices.
If it reprojects correctly, then there probably is a calibration problem (see above).
Beyond that, it is hard to guess what could be wrong without directly interacting with the system. If I were you and I still had no idea where the problem could be after doing the tests above, I would try to start back from scratch and validate each intermediate step using toy examples.
I am totally new to AR and I searched on the internet about marker based and markerless AR but I am confused with marker based and markerless AR..
Lets assume an AR app triggers AR action when it scans specific images..So is this marker based AR or markerless AR..
Isn't the image a marker?
Also to position the AR content does marker based AR use devices' accelerometer and compass as in markerless AR?
In a marker-based AR application the images (or the corresponding image descriptors) to be recognized are provided beforehand. In this case you know exactly what the application will search for while acquiring camera data (camera frames). Most of the nowadays AR apps dealing with image recognition are marker-based. Why? Because it's much more simple to detect things that are hard-coded in your app.
On the other hand, a marker-less AR application recognizes things that were not directly provided to the application beforehand. This scenario is much more difficult to implement because the recognition algorithm running in your AR application has to identify patterns, colors or some other features that may exist in camera frames. For example if your algorithm is able to identify dogs, it means that the AR application will be able to trigger AR actions whenever a dog is detected in a camera frame, without you having to provide images with all the dogs in the world (this is exaggerated of course - training a database for example) when developing the application.
Long story short: in a marker-based AR application where image recognition is involved, the marker can be an image, or the corresponding descriptors (features + key points). Usually an AR marker is a black&white (square) image,a QR code for example. These markers are easily recognized and tracked => not a lot of processing power on the end-user device is needed to perform the recognition (and optionally tracking).
There is no need of an accelerometer or a compass in a marker-based app. The recognition library may be able to compute the pose matrix (rotation & translation) of the detected image relative to the camera of your device. If you know that, you know how far the recognized image is and how it is rotated relative to your device's camera. And from now on, AR begins... :)
Well. Since I got downvoted without explanation. Here is a little more detail on markerless tracking:
Actual there are several possibilities for augmented reality without "visual" markers but none of them called markerless tracking.
Showing of the virtual information can be triggered by GPS, Speech or simply turning on your phone.
Also, people tend to confuse NFT(Natural feature tracking) with markerless tracking. With NFT you can take a real life picture as a marker. But it is still a "marker".
This site has a nice overview and some examples for each marker:
Marker-Types
It's mostly in german but so beware.
What you call markerless tracking today is a technique best observed with the Hololens(and its own programming language) or the AR-Framework Kudan. Markerless Tracking doesn't find anything on his own. Instead, you can place an object at runtime somewhere in your field of view.
Markerless tracking is then used to keep this object in place. It's most likely uses a combination of sensor input and solving the SLAM( simultaneous localization and mapping) problem at runtime.
EDIT: A Little update. It seems the hololens creates its own inner geometric representation of the room. 3D-Objects are then put into that virtual room. After that, the room is kept in sync with the real world. The exact technique behind that seems to be unknown but some speculate that it is based on the Xbox Kinect technology.
Let's make it simple:
Marker-based augmented reality is when the tracked object is black-white square marker. A great example that is really easy to follow shown here: https://www.youtube.com/watch?v=PbEDkDGB-9w (you can try out by yourself)
Markerless augmented reality is when the tracked object can be anything else: picture, human body, head, eyes, hand or fingers etc. and on top of that you add virtual objects.
To sum it up, position and orientation information is the essential thing for Augmented Reality that can be provided by various sensors and methods for them. If you have that information accurate - you can create some really good AR applications.
It looks like there may be some confusion between Marker tracking and Natural Feature Tracking (NFT). A lot of AR SDK's tote their tracking as Markerless (NFT). This is still marker tracking, in that a pre-defined image or set of features is used. It's just not necessarily a black and white AR Toolkit type of marker. Vuforia, for example, uses NFT, which still requires a marker in the literal sense. Also, in the most literal sense, hand/face/body tracking is also marker tracking in that the marker is a shape. Markerless, inherent to the name, requires no pre-knowledge of the world or any shape or object be present to track.
You can read more about how Markerless tracking is achieved here, and see multiple examples of both marker-based and Markerless tracking here.
Marker based AR uses a Camera and a visual marker to determine the center, orientation and range of its spherical coordinate system. ARToolkit is the first full featured toolkit for marker based tracking.
Markerless Tracking is one of best methods for tracking currently. It performs active tracking and recognition of real environment on any type of support without using special placed markers. Allows more complex application of Augmented Reality concept.
I am looking for a way to approach a computer vision problem I'm having.
I have working tracking system:
4-8 cameras
Gives (x,y,z) of a infrared led
Each led Transmits a unique 8 bit signal
The tracking system is expensive and the interface is too hard for our users to work with. I want to replace it with a possible my own/ OpenCV implementation.
My current approach which seems to require a lot of development of what seems to be common problems:
Calibrate the cameras to make a 3D space - The cameras need to know where they are in space and in relation to each other.
Given two or more camera sees a unique led it uses gray-scale image with the pixel to calculate the 3D position (x, y, z) of that led.
Right now I am attempting to write my own custom algorithm for both task and its proving to be a lot of work. Is it possible to approach this with OpenCV to help with the heavy lifting.
Take a look at Free track : http://www.free-track.net/english/ you can download sources there.
I have a very specific application in which I would like to try structure from motion to get a 3D representation. For now, all the software/code samples I have found for structure from motion are like this: "A fixed object that is photographed from all angle to create the 3D". This is not my case.
In my case, the camera is moving in the middle of a corridor and looking forward. Sometimes, the camera can look on other direction (Left, right, top, down). The camera will never go back or look back, it always move forward. Since the corridor is small, almost everything is visible (no hidden spot). The corridor can be very long sometimes.
I have tried this software and it doesn't work in my particular case (but it's fantastic with normal use). Does anybody can suggest me a library/software/tools/paper that could target my specific needs? Or did you ever needed to implement something like that? Any help is welcome!
Thanks!
What kind of corridors are you talking about and what kind of precision are you aiming for?
A priori, I don't see why your corridor would not be a fixed object photographed from different angles. The quality of your reconstruction might suffer if you only look forward and you can't get many different views of the scene, but standard methods should still work. Are you sure that the programs you used aren't failing because of your picture quality, arrangement or other reasons?
If you have to do the reconstruction yourself, I would start by
1) Calibrating your camera
2) Undistorting your images
3) Matching feature points in subsequent image pairs
4) Extracting a 3D point cloud for each image pair
You can then orient the point clouds with respect to one another, for example via ICP between two subsequent clouds. More sophisticated methods might not yield much difference if you don't have any closed loops in your dataset (as your camera is only moving forward).
OpenCV and the Point Cloud Library should be everything you need for these steps. Visualization might be more of a hassle, but the pretty pictures are what you pay for in commercial software after all.
Edit (2017/8): I haven't worked on this in the meantime, but I feel like this answer is missing some pieces. If I had to answer it today, I would definitely suggest looking into the keyword monocular SLAM, which has recently seen a lot of activity, not least because of drones with cameras. Notably, LSD-SLAM is open source and may not be as vulnerable to feature-deprived views, as it operates directly on the intensity. There even seem to be approaches combining inertial/odometry sensors with the image matching algorithms.
Good luck!
FvD is right in the sense that your corridor is a static object. Your scenario is the same and moving around and object and taking images from multiple views. Your views are just not arranged to provide a 360 degree view of the object.
I see you mentioned in your previous comment that the data is coming from a video? In that case, the problem could very well be the camera calibration. A camera calibration tells the SfM algorithm about the internal parameters of the camera (focal length, principal point, lens distortion etc.) In the absence of knowledge about these, the bundler in VSfM uses information from the EXIF data of the image. However, I don't think video stores any EXIF information (not a 100% sure). As a result, I think the entire algorithm is running with bad focal length information and cannot solve for the orientation.
Can you extract a few frames from the video and see if there is any EXIF information?