I have recently been studying the paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (https://www.matthewtancik.com/nerf), and I am wondering: what is it used for? Will there be any practical applications of NeRF?
The results of this technique are very impressive, but what is it actually for? I keep coming back to this question. The renderings are realistic and the quality is excellent, but we don't want to watch the camera swing around a scene all the time, right?
In my view, this technique has some limitations:
It cannot generate views of anything that was never seen in the input images; it interpolates between captured views.
Long training and rendering times: according to the authors, it takes about 12 hours to train on a scene and about 30 seconds to render one frame.
The scene is static and not interactive.
I don't know whether it is appropriate to compare NeRF with panoramas and 360° images/videos; they are essentially different, since only NeRF uses deep learning to generate new views, while the others basically just capture the scene with a smartphone/camera plus some computer vision techniques. Still, the long training time makes NeRF less competitive in this application area. Am I correct?
Another use I can think of is product rendering; however, NeRF doesn't show clear advantages over rendering with 3D software. Commercial advertisements, for example, usually require animation and special effects, which 3D software can definitely do better.
A potential use of NeRF might be 3D reconstruction, but that seems out of scope, even though it is capable of it. Why would we need NeRF for 3D reconstruction? Why not use other reconstruction techniques? The unique feature of NeRF is its ability to create photo-realistic views; if we use NeRF only for 3D reconstruction, that feature becomes pointless.
Does anyone have new ideas? I would like to know.
Why do we need to use NeRF for 3D reconstruction?
The alternative would be multi-view stereo, which produces point clouds of finite resolution and is susceptible to illumination changes. If you then render such a point cloud without non-trivial post-processing, it will not look photorealistic.
I don't know if it is appropriate to compare NeRF with Panorama and 360° image/video,
Well, if you are dealing with a perfectly flat scene with simple lighting (i.e. ambient light and Lambertian objects), then you can use panorama techniques for new view synthesis. In general, though, it won't produce the result you expect: you have to know the depth to interpolate correctly.
When it comes to the practical limitations (it is slow and does not model deformations), NeRF should be considered a milestone: a proof of concept that representing a surface as a level set of an MLP-modelled function can result in sharp renderings. There is already good progress in addressing those limitations, and multiple works apply this idea to practical tasks.
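For intuition only, here is a minimal PyTorch sketch of the core idea (this is not the authors' code; the view-direction input, hierarchical sampling, and the actual volume-rendering loss are all omitted): a positionally encoded 3D point goes through an MLP that outputs a density and a colour.

```python
import torch
import torch.nn as nn

def positional_encoding(x, n_freqs=10):
    # Map each coordinate to sin/cos at increasing frequencies; this is what lets the MLP fit sharp detail.
    out = [x]
    for k in range(n_freqs):
        out += [torch.sin((2.0 ** k) * x), torch.cos((2.0 ** k) * x)]
    return torch.cat(out, dim=-1)

class TinyNeRF(nn.Module):
    """Toy radiance field: 3D position -> (density, RGB). View dependence omitted for brevity."""
    def __init__(self, n_freqs=10, hidden=256):
        super().__init__()
        in_dim = 3 * (1 + 2 * n_freqs)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density + 3 colour channels
        )

    def forward(self, xyz):
        h = self.net(positional_encoding(xyz))
        density = torch.relu(h[..., :1])   # non-negative volume density
        rgb = torch.sigmoid(h[..., 1:])    # colour in [0, 1]
        return density, rgb

# Query the field at random points; training would supervise ray-marched renderings against the input photos.
density, rgb = TinyNeRF()(torch.rand(1024, 3))
```

Training then consists of marching camera rays through this field, compositing the densities and colours into pixel values, and minimising the difference to the captured photographs.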
Problem
I would like to build a realistic view of the Earth from low orbit (~300 km here) with WebGL, that is to say, on the web, with all that implies, and moreover on mobile. Don't stop reading here: to make this a little less difficult, the user can look around but not pan, so the view only covers a small, roughly 3000 km wide area. But the view follows a satellite, so a few minutes later the user comes back to where they were before, with the slight shift of the Earth's rotation, and so on. So the clouds cannot be in the same place all the time.
So far I have been able to include city lights, auroras, lightning... everything except clouds. I have seen a lot of demos from real-time rendering enthusiasts and researchers, but none of them had a nice, realistic cloud layer. However, I am sure I am the 100(...)00th person to think about doing this, so please enlighten me.
A few questions are implied:
What input should I use for the clouds? Live meteorological data?
What are the rendering possibilities? A transparent layer with a cloud map, modified with shaders? A few transparent layers to get a feeling of volumetric rendering? But then how would they cast shadows on one another: would the only solution be to use a mesh? Or could the shadows be procedurally computed and mapped on the server every x minutes?
A few specifications
Here are some ideas summing up what I have not seen yet, sorted by importance:
Clouds hide about 60% of the Earth.
Clouds scatter the lights of cities and lightning, and show Rayleigh scattering at night.
At this distance the parallax effect is visible, and even quite striking with the smallest clouds.
As far as I've seen, even expensive real-time meteorological online resources are not useful: they target rainy or stormy clouds with the help of UV and IR wavelengths, so they don't capture 100% of the clouds and don't give the 'normal' view we all know. Moreover, the rare good cloud textures shot in visible light hardly differentiate the ground from the clouds: sometimes a 5000 km long coastline stands in the middle of nowhere. A server might be able to use those images to create better textures.
When I look at those pictures, I imagine that the least costly way would be to merge a few nice cloud meshes from a database of different models, then slightly transform those meshes inside a shader while the user passes over. If the user is still there 90 minutes later when the view comes back around, it doesn't matter if the models are not exactly the same again. However, a hurricane cannot just disappear.
What do you think about this?
For such effects there is probably only one way to do it properly, and that is:
voxel maps + volume rendering, probably with a backward ray tracer
As your position is fixed, it should not be too hard on memory requirements. You need to implement both Mie and Rayleigh scattering. Scattering can be simplified a lot and still look good; see these (a rough ray-marching sketch is also included at the end of this answer):
simplified atmosphere scattering
realistic n-body solar system simulation
Voxel maps handle light gaps, shadows, and scattering relatively easily, but they need a lot of memory and computational power. All the 2D techniques usually just painfully work around what 3D voxel maps do natively with little effort. For example:
Voxel map shadows
Procedural cloud map generators
You need one of these for each type of cloud so that you have something to render. There are libraries/demos/examples out there; see:
first relevant google hit
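To make the voxel-map + volume-rendering suggestion above concrete, here is a rough CPU-side Python sketch (not WebGL; the density function, step size, and extinction coefficient are all placeholder values) of marching one ray through a cloud density field with simple emission-absorption compositing. A GLSL fragment shader would run essentially the same loop per pixel, sampling a 3D texture instead of the placeholder function.

```python
import numpy as np

def cloud_density(p):
    # Placeholder procedural density in [0, 1]; a real version would sample a voxel map or fractal noise.
    return float(np.clip(np.sin(p[0] * 0.1) * np.cos(p[1] * 0.1) - p[2] * 0.05, 0.0, 1.0))

def march_ray(origin, direction, n_steps=64, step=1.0, sigma=0.3):
    """Accumulate brightness along a ray with simple emission-absorption volume rendering."""
    transmittance, brightness = 1.0, 0.0
    p = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d /= np.linalg.norm(d)
    for _ in range(n_steps):
        rho = cloud_density(p)
        absorb = np.exp(-sigma * rho * step)          # Beer-Lambert absorption over one step
        brightness += transmittance * (1.0 - absorb)  # light scattered toward the eye (white clouds)
        transmittance *= absorb
        if transmittance < 1e-3:                      # early exit once the ray is effectively opaque
            break
        p += d * step
    return brightness

print(march_ray(origin=[0.0, 0.0, 10.0], direction=[0.0, 0.3, -1.0]))
```

Mie/Rayleigh phase functions and shadow rays toward the sun would be added inside the loop; that is where most of the cost (and the memory pressure of the voxel map) comes from.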
I want to do something like this but in reverse, so that the cameras are on the outside and pointing inward. Let's start with the abstract and get specific:
1) Are there any TOOLS that will do this for me? How close can I get using existing software?
2) Say the nearest tool is a graphics library like OpenCV. I've taken linear algebra and have an undergraduate degree in CS, but no special training in graphics. Where should I go from there?
3) If I really am undertaking a decade-long spiritual quest of a self-teaching and programming exercise to make this happen, are there any papers or other resources you are aware of that might help me?
I think the demo you linked uses a 360° camera (see the black circle at the bottom) and does not involve stitching in any way.
Regarding your question, are you aware of this work? They don't do stitching either, just blending between different views.
If you use inward views, then the objects you will observe will probably be quite close to the cameras, while standard stitching assumes that objects are far away. Close 3D objects mean high distortion when you change the viewpoint (i.e. parallax & occlusions), which makes it difficult to interpolate between two views. Hence, if you want stitching, then your main problem is to correctly handle parallax effects & occlusions between the views.
In my opinion, the most promising approach would be to do live stereo matching (i.e. dense 3D reconstruction) between the two camera images closest to your current viewpoint, and then interpolate across the estimated disparities to generate the expected image. However, unlike the demo you linked, it's not likely to run in real time, and the result could be quite ugly...
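As a rough illustration of that pipeline (the file names are hypothetical and the matcher parameters are untuned), OpenCV's semi-global block matcher gives a dense disparity map which you can then use to warp the nearest camera image toward an intermediate viewpoint:

```python
import cv2
import numpy as np

# Hypothetical inputs: the two camera images closest to the desired viewpoint, rectified.
left = cv2.imread("cam_left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("cam_right.png", cv2.IMREAD_GRAYSCALE)

# Dense disparity via semi-global block matching (parameters are rough guesses).
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # SGBM returns fixed-point * 16

# Very naive intermediate-view synthesis: resample the left image shifted by half the disparity.
h, w = left.shape
xs, ys = np.meshgrid(np.arange(w, dtype=np.float32), np.arange(h, dtype=np.float32))
map_x = xs + 0.5 * np.maximum(disparity, 0.0)   # alpha = 0.5 -> roughly halfway between the cameras
middle = cv2.remap(left, map_x, ys, cv2.INTER_LINEAR)
cv2.imwrite("middle_view.png", middle)
```

A proper implementation would forward-warp both views, fill occlusion holes, and blend; this sketch only shows where the disparity estimate enters the interpolation.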
EDIT
You can also have a look at this paper, which uses a different but interesting approach; however, it may not be directly useful in your case, since it requires the new viewpoint to be visible in the available images.
Does anyone have any experience with using large and complex images as markers (e.g. a magazine layout, a photo, or a text layout) for AR?
I am not sure which way to go:
Flash, Papervision, and FLAR would be nice for distribution, but I suspect their performance is too poor for a marker more complex than the usual 9x9 or 12x12 blocks. I had difficulty achieving both good 3D performance and smooth, solid detection.
I can also do Java or Objective-C with OpenGL/OpenCV, and that is definitely an option for this project as well.
I would just like to know beforehand whether anyone has experience in this field and could give me a few hints or warnings. I know it has been done already, so there must be a way to do it smoothly.
Thanks,
Anton
It sounds like you might want to start investigating natural feature tracking libraries. In general the tracking is smoother and more robust than with markers, and any feature-rich natural image can be used as the marker. The downside is that I'm not aware of any non-proprietary solutions.
Metaio Unifeye works in a web browser via Flash, if I recall correctly; something like that might be what you're looking for.
You should look at MOPED.
MOPED is a real-time Object Recognition and Pose Estimation system. It recognizes objects from point-based features (e.g. SIFT, SURF) and their geometric relationships extracted from rigid 3D models of objects.
See this video for a demonstration.
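As a small illustration of that kind of point-feature pipeline (this is plain OpenCV, not MOPED itself, and the image names are hypothetical): match SIFT keypoints from a reference photo of the object against the current frame, and require a geometrically consistent homography before declaring a detection.

```python
import cv2
import numpy as np

# Hypothetical images: a reference photo of the object and a query frame.
ref = cv2.imread("object_reference.png", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("query_frame.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(ref, None)
kp2, des2 = sift.detectAndCompute(frame, None)

# Lowe ratio-test matching, then a RANSAC homography to enforce geometric consistency.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
if len(good) >= 4:
    src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    found = H is not None and int(mask.sum()) > 10   # inlier count as a crude confidence
    print("object found" if found else "no confident match")
else:
    print("not enough matches")
```

For full 6-DoF pose (which is what MOPED actually does), you would replace the homography with a PnP solve against the feature coordinates of a rigid 3D model.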
I have a number of images for which I know the focal length, pixel count, dimensions, and position (from GPS). They are all high-oblique views, taken on the ground with commercially available cameras.
(Example image: http://desmond.yfrog.com/Himg411/scaled.php?tn=0&server=411&filename=mjbm.jpg&xsize=640&ysize=640)
What would be the best method for calculating the Euclidean distances between certain pixels within an image, if that is indeed possible?
Assuming you're not looking for full landscape modelling but just a simple approximation, this shouldn't be too hard. Basically, a first approximation of your image reduces to a camera with a known focal length looking along a plane, so we can create a 3D model of the system very easily; it's not too far from the classic observer-looking-over-a-checkerboard demo.
Normally the graphics problem would be to project the 3D model into 2D so we can render the image. Although most programs nowadays use an API (such as OpenGL) to do this, the equations are not particularly complex or difficult to understand. I wrote my first code using the examples from 3D Graphics in Pascal, which is a nice, clear treatise, but there are lots of other similar sources (although probably fewer nowadays, as a hardware API is almost always used).
What's useful here is that the projection is invertible: if you have a point in the image and the model, you can run the data back through the projection to retrieve the original 3D coordinates, which is what you want to do (see the sketch at the end of this answer).
So a couple of approaches suggest themselves: either write the code to do the above yourself directly, or, probably more simply, use OpenGL (I'd recommend the GLUT toolkit for this). If your math is good and manipulating matrices causes you no issues, I'd recommend the former, as the solution will be tighter and it's interesting stuff; otherwise take the OpenGL approach. You'd probably also want to turn the camera/plane approximation into a camera/sphere one fairly early.
If this isn't sufficient for your needs, then in theory going to actual landscape modelling would be feasible. The SRTM data is freely available (albeit not in the friendliest of formats), so combined with your GPS position it should be possible to create a mesh model to which you apply the same algorithms as above.
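To make the flat-ground back-projection concrete, here is a back-of-the-envelope Python sketch (all camera numbers are placeholders; it assumes a level ground plane, a known camera height, and a known downward tilt of the optical axis):

```python
import numpy as np

def pixel_to_ground(u, v, fx, fy, cx, cy, cam_height, pitch_deg):
    """Back-project pixel (u, v) onto the flat ground plane z = 0.

    The camera sits at (0, 0, cam_height) looking along world +Y, tilted down by pitch_deg.
    fx, fy are the focal length in pixels; (cx, cy) is the principal point.
    """
    # Viewing ray in camera coordinates: x right, y down (image), z forward.
    a = (u - cx) / fx
    b = (v - cy) / fy
    p = np.radians(pitch_deg)
    # The same ray expressed in world axes (X right, Y forward, Z up).
    ray = np.array([
        a,
        -b * np.sin(p) + np.cos(p),
        -b * np.cos(p) - np.sin(p),
    ])
    if ray[2] >= 0:
        raise ValueError("ray is at or above the horizon and never hits the ground")
    t = cam_height / -ray[2]                 # scale so the ray descends from cam_height to z = 0
    return np.array([0.0, 0.0, cam_height]) + t * ray

# Ground distance between two image points (placeholder pixel coordinates and intrinsics).
cam = dict(fx=3000, fy=3000, cx=960, cy=540, cam_height=1.6, pitch_deg=10)
p1 = pixel_to_ground(400, 900, **cam)
p2 = pixel_to_ground(1500, 850, **cam)
print(np.linalg.norm(p1 - p2), "metres, under the flat-ground approximation")
```

The accuracy degrades quickly near the horizon, which is exactly where the camera/sphere (or SRTM terrain) refinement mentioned above starts to matter.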
I'm looking for the fastest and most efficient method of detecting an object in a moving video. Things to note about this video: it is very grainy and low resolution, and both the background and the foreground are moving simultaneously.
Note: I'm trying to detect a moving truck on a road in a moving video.
Methods I've tried:
Training a Haar cascade: I attempted to train a classifier to identify the object by cropping multiple images of the desired object. This produced either many false detections or no detections at all (the desired object was never detected). I used about 100 positive images and 4000 negatives.
SIFT and SURF keypoints: when attempting to use either of these feature-based methods, I discovered that the object I wanted to detect was too low in resolution, so there were not enough features to match for an accurate detection. (The desired object was never detected.)
Template matching: this is probably the best method I've tried. It's the most accurate, although the hackiest of them all. I can detect the object in one specific video using a template cropped from that video. However, there is no guaranteed accuracy, because all that is known is the best match in each frame; no analysis is done on how well the template actually matches that frame. Basically, it only works if the object is always in the video; otherwise it produces a false detection.
So those are the big three methods I've tried, and all have failed. What would work best is something like template matching but with scale and rotation invariance (which is what led me to try SIFT/SURF), but I have no idea how to modify the template matching function.
Does anyone have any suggestions on how best to accomplish this task?
Apply optical flow to the image and then segment it based on the flow field. The background flow is very different from the "object" flow, which mainly diverges or converges depending on whether the object is moving towards or away from you, with some lateral component as well.
Here's an oldish project which worked this way:
http://users.fmrib.ox.ac.uk/~steve/asset/index.html
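A rough OpenCV sketch of that idea (the video name is hypothetical and the thresholds are untuned): compute dense Farnebäck flow, treat the median flow as the camera/background motion, and segment the pixels whose flow deviates strongly from it.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("road.mp4")        # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Median flow approximates the global background motion of the moving camera.
    background = np.median(flow.reshape(-1, 2), axis=0)
    deviation = np.linalg.norm(flow - background, axis=2)
    mask = (deviation > 2.0).astype(np.uint8) * 255              # pixels/frame threshold, untuned
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    cv2.imshow("independently moving regions", mask)
    if cv2.waitKey(1) == 27:                                     # Esc to quit
        break
    prev_gray = gray
```

On grainy, low-resolution footage you would likely add blurring/downsampling before the flow computation and some temporal smoothing of the mask.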
This vehicle detection paper uses a Gabor filter bank for low-level detection and then uses the responses to build the feature space in which it trains an SVM classifier.
The technique seems to work well and is at least scale invariant; I am not sure about rotation, though.
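A condensed sketch of that kind of pipeline (the parameters, window size, and training data below are placeholders, and the paper's actual feature construction differs): filter candidate windows with a small bank of Gabor kernels, pool the responses into a feature vector, and train an SVM on labelled windows.

```python
import cv2
import numpy as np
from sklearn.svm import SVC

def gabor_features(patch, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Mean and standard deviation of Gabor responses at a few orientations, as a crude descriptor."""
    feats = []
    for theta in thetas:
        kernel = cv2.getGaborKernel((21, 21), sigma=4.0, theta=theta, lambd=10.0, gamma=0.5)
        response = cv2.filter2D(patch.astype(np.float32), cv2.CV_32F, kernel)
        feats += [response.mean(), response.std()]
    return np.array(feats)

# Placeholder training data: grayscale 64x64 windows labelled 1 (vehicle) or 0 (background).
patches = [np.random.randint(0, 255, (64, 64), np.uint8) for _ in range(20)]
labels = [i % 2 for i in range(20)]

X = np.stack([gabor_features(p) for p in patches])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:4]))   # at detection time, slide a window over the frame and classify each patch
```

Because the Gabor responses are pooled into orientation statistics, the descriptor tolerates small shifts; multi-scale detection comes from running the window at several image scales.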
Not knowing your application, my initial suggestion is normalized cross-correlation, especially since I remember seeing a purely optical cross-correlator that had vehicle tracking as its example application (tracking a vehicle as it passes using only optical components and an image of the side of the vehicle; I wish I could find the link). This is similar, if not identical, to "template matching", which you say sort of works, but as you know it won't work if the images are rotated.
However, there's a related method based on log-polar coordinates that will work regardless of rotation, scale, shear, and translation.
I imagine this would also let you detect that the object has left the scene, since the maximum correlation will decrease.
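A compact illustration of the log-polar trick (a simplified Fourier-Mellin-style sketch, not production code; sign conventions and the 180° ambiguity of the magnitude spectrum are glossed over): rotation and scale become translations in the log-polar map of the Fourier magnitude, so phase correlation recovers them independently of any translation in the original images.

```python
import cv2
import numpy as np

def rotation_and_scale(img_a, img_b):
    """Estimate relative rotation/scale between two same-sized images via log-polar spectra."""
    def log_polar_spectrum(img):
        f = np.fft.fftshift(np.fft.fft2(img.astype(np.float32)))
        mag = np.log1p(np.abs(f)).astype(np.float32)             # translation-invariant magnitude
        centre = (mag.shape[1] / 2.0, mag.shape[0] / 2.0)
        return cv2.warpPolar(mag, (mag.shape[1], mag.shape[0]), centre,
                             mag.shape[0] / 2.0, cv2.INTER_LINEAR + cv2.WARP_POLAR_LOG)

    a, b = log_polar_spectrum(img_a), log_polar_spectrum(img_b)
    (shift_x, shift_y), response = cv2.phaseCorrelate(a, b)
    rotation_deg = 360.0 * shift_y / a.shape[0]                  # vertical shift <-> rotation angle
    scale = np.exp(shift_x * np.log(img_a.shape[0] / 2.0) / a.shape[1])  # horizontal shift <-> log scale
    return rotation_deg, scale, response                         # response doubles as a confidence score

# Synthetic sanity check: a blob and a 90-degree-rotated copy of it.
img = np.zeros((256, 256), np.uint8)
cv2.rectangle(img, (60, 100), (200, 160), 255, -1)
print(rotation_and_scale(img, cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)))
```

The returned correlation response is also the number you would threshold to decide whether the object is present at all, which ties in with the tracking-loss point above.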
How low resolution are we talking? Could you also elaborate on the object? Is it a specific color? Does it have a distinctive pattern? The answers affect what you should use.
Also, I might be reading your template matching statement wrong, but it sounds like you are overfitting it (by testing on the same video you extracted the template from?).
A Haar cascade is going to require significant training data on your part, and will handle changes in orientation poorly.
Your best bet might be to combine template matching with an algorithm similar to CamShift in OpenCV (5.7 MB PDF), along with a probabilistic model (which you'll have to work out yourself) of whether the truck is still in the image.
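One cheap way to get the "is the truck still there" signal is to look at the actual matchTemplate score rather than only its argmax, and only hand the box over to a tracker once the score clears a threshold (the file names and the 0.6 threshold below are placeholders):

```python
import cv2

template = cv2.imread("truck_template.png", cv2.IMREAD_GRAYSCALE)   # hypothetical cropped template
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)               # hypothetical video frame

# Normalised correlation gives a score in [-1, 1]; a low maximum means the truck is probably absent.
result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

if max_val > 0.6:                                                    # presence threshold, needs tuning
    x, y = max_loc
    h, w = template.shape
    print(f"truck at ({x}, {y}) size {w}x{h}, confidence {max_val:.2f}; seed the tracker with this box")
else:
    print(f"no confident detection (best score {max_val:.2f})")
```

Running the match at a few template scales, and feeding the accepted box into CamShift or a similar tracker between detections, gets you part of the scale invariance the question asks for without a full log-polar pipeline.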