I am working on a project to identify damages and scratches to the packages while they move through different legs of transport.
Is there a way to detect damages and scratches by just using the OpenCV library and not diving into Neural Networks, ML, etc? At this point, we are mainly concerned with results and we don't care about accuracy. We can even work with algorithms that require a special environment for their proper functioning (eg: light, camera angle, etc)
I tried web search but most of the articles take me to neural network models which are not what we are trying to explore at this stage of our project.
Related
i am doing final year project as lane tracking using a camera. the most challenging task now is how i can measure distance between the camera (the car that carries it actually) and the lane.
While the lane is easily recognized (Hough line transform) but i found no way to measure distance to it.
given the fact that there is a way to measure distance to object in front of camera based on Pixel width of the object, but it does not work here be because the nearest point of the line, is blind in the camera.
What you want is to directly infer the depth map with a monocular camera.
You can refer my answer here
https://stackoverflow.com/a/64687551/11530294
Usually, we need a photometric measurement from a different position in the world to form a geometric understanding of the world(a.k.a depth map). For a single image, it is not possible to measure the geometric, but it is possible to infer depth from prior understanding.
One way for a single image to work is to use a deep learning-based method to direct infer depth. Usually, the deep learning-based approaches are all based on python, so if you only familiar with python, then this is the approach that you should go for. If the image is small enough, i think it is possible for realtime performance. There are many of this kind of work using CAFFE, TF, TORCH etc. you can search on git hub for more option. The one I posted here is what i used recently
reference:
Godard, Clément, et al. "Digging into self-supervised monocular depth estimation." Proceedings of the IEEE international conference on computer vision. 2019.
Source code: https://github.com/nianticlabs/monodepth2
The other way is to use a large FOV video for a single camera-based SLAM. This one has various constraints such as need good features, large FOV, slow motion, etc. You can find many of this work such as DTAM, LSDSLAM, DSO, etc. There are a couple of other packages from HKUST or ETH that does the mapping given the position(e.g if you have GPS/compass), some of the famous names are REMODE+SVO open_quadtree_mapping etc.
One typical example for a single camera-based SLAM would be LSDSLAM. It is a realtime SLAM.
This one is implemented based on ROS-C++, I remember they do publish the depth image. And you can write a python node to subscribe to the depth directly or the global optimized point cloud and project it into a depth map of any view angle.
reference: Engel, Jakob, Thomas Schöps, and Daniel Cremers. "LSD-SLAM: Large-scale direct monocular SLAM." European conference on computer vision. Springer, Cham, 2014.
source code: https://github.com/tum-vision/lsd_slam
In comma.ai's self-driving car software they use a client/server architecture. Two processes are started separately, server.py and train_steering_model.py.
server.py sends data to train_steering_model.py via http and sockets.
Why do they use this technique? Isn't this a complicated way of sending data? Isn't this easier to make train_steering_model.py load the data set by it self?
The document DriveSim.md in the repository links to a paper titled Learning a Driving Simulator. In the paper, they state:
Due to the problem complexity we decided to learn video prediction with separable networks.
They also mention the frame rate they used is 5 Hz.
While that sentence is the only one that addresses your question, and it isn't exactly crystal clear, let's break down the task in question:
Grab an image from a camera
Preprocess/downsample/normalize the image pixels
Pass the image through an autoencoder to extract representative feature vector
Pass the output of the autoencoder on to an RNN that will predict proper steering angle
The "problem complexity" refers to the fact that they're dealing with a long sequence of large images that are (as they say in the paper) "highly uncorrelated." There are lots of different tasks that are going on, so the network approach is more modular - in addition to allowing them to work in parallel, it also allows scaling up the components without getting bottlenecked by a single piece of hardware reaching its threshold computational abilities. (And just think: this is only the steering aspect. The Logs.md file lists other components of the vehicle to worry about that aren't addressed by this neural network - gas, brakes, blinkers, acceleration, etc.).
Now let's fast forward to the practical implementation in a self-driving vehicle. There will definitely be more than one neural network operating onboard the vehicle, and each will need to be limited in size - microcomputers or embedded hardware, with limited computational power. So, there's a natural ceiling to how much work one component can do.
Tying all of this together is the fact that cars already operate using a network architecture - a CAN bus is literally a computer network inside of a vehicle. So, this work simply plans to farm out pieces of an enormously complex task to a number of distributed components (which will be limited in capability) using a network that's already in place.
I want to use an already implemented SLAM algorithm for mapping my college campus.
I have found some algorithms on OpenSLAM.org and some other independent ones such as LSD-SLAM and Hector SLAM, which show some promises but they have limitation such as they use LIDAR, or don't extend to large dataset etc.
SLAM has been an active topic for many years and some groups have also mapped an entire town. Can someone point me to such efficient algorithm?
My requirements are:
It must use RGB camera/cameras.
Preferably produce (somewhat) dense map of area.
It should be able to map large area (I have seen some algo which can only map up to a desk table or room, but they usually lose track if there is jerk in camera motion (Observed in LSD SLAM) or take very few landmarks which is only useful for study purposes).
Preferably a ROS implementation.
Recently, I'm studying about facial recognition with OpenCV, and I'm trying some simple example based on study.
I'm considering to use it at front door condition.
Nowadays some buildings or apartments use facial recognition for preventing intruders. When someone joins them (such as company or houses), they require the person's picture. As I know, they require just one picture.
I didn't care about that last time, but now, I'm very curious about it.
The famous algorithms such as PCA, LDA use machine learning, so they increase successful percentages(cases). To use machine learning, they need sample images as many as I can provide. That's why I'm curious about that. Buildings or companys require just one picture, but they can recognize each person. Moreover, their accuracy is very good. How can this happen? Is there any other algorithm besides PCA or LDA?
Thanks for reading!
As far as I know, this hasn't been achieved yet. So I don't think they can develop a software recognizing a person by using only one picture.
It is most likely that they teach the algorithm with the authorized person's pictures. So if that one picture does not match with the trained ones, the algorithm can say this is an intrusion.
Edit:
As linuxqwerty pointed out those commercial products are already trained with huge datasets.
As a result of this training, learning happens and the algorithm achieves feature extraction of all those sample faces.
Then the algorithm knows almost every kinds of features that an human face can have.
For example: thickness of eyebrows, distance between eyes, roundness of chin... These are only a human can say about faces. The algorithm can extract thousands of these features.
It can keep faces as a representation of those features.
So now we have this commercial software which can represent faces as binary codes with a lot of digits.
I am getting your question again.
The apartment or company bought this software.
They included the picture of authorized person.
What the software does is simply converting the picture as it was a thousand digits password.
So that person has this unique password which the system can only reproduce that password only from his face.
To sum up:
The learning part was achieved using big face databases.
Thanks to learning part, the recognition part can be done by using only one picture.
PS: Corrections are welcome.
I happened to read about facial recognition before, that time I wanted to do it as my semester project. And of course I have heard and thought of using OpenCV as well.
Your question is simple, those company or home that use facial recognition, they usually use very well-developed product, which normally includes well-programmed facial recognition. As we are talking about security here, normally companies will buy these security products, unless if they just want to use it as a tool to deter intruders which focus less on the practical usage, and recognition accuracy, they can opt for free facial recognition software.
So, when I'm talking about well-programmed facial recognition, it means that it was trained with huge amount of databases (the photos to be recognized that you mentioned), this means the training is done even before the software is officially launched, which is during the development stage. A good facial recognition software requires both good, complete and detailed programming coding, and also huge photo databases (taken at different ambient light intensity, different facial features like hair style, spectacles) to train it.
Therefore, the accuracy of the software does not depend solely on the amount of pictures given during the usage of the software provided that it is well-programmed in the first place. Thanks and hope I answered your question and wonder.
ps: recognize is spelled this way (US); recognise (UK) =)
In brief, what are the available options for implementing the Tracking of a particular Image(A photo/graphic/logo) in webcam feed using OpenCv?In particular i am trying to collate opinion about the following:
Would HaarTraining be overkill(considering that it is not 3d objects but simply Images to be tracked) or is it the only way out?
Have tried Template Matching, Color-based detection but these don't offer reliable tracking under varying illumination/Scale/Orientation at all.
Would SIFT,SURF feature matching work as reliably in video as with static image
comparison?
Am a relative beginner to OpenCV , as is evident by my previous queries on SO (very helpful replies). Any cues or links to what could be good resources for beginning NFT implementation with OpenCV?
Can you talk a bit more about your requirements? Namely, what type of appearance variations do you expect/how much control you have over the environment. What type of constraints do you have in terms of speed/power/resource footprint?
Without those, I can only give some general assessment to the 3 paths you are talking about.
1.
Haar would work well and fast, particularly for instance recognition.
Note that Haar doesn't work all that well for 3D unless you train with a full spectrum of templates to cover various perspectives. The poster child application of Haar cascades is Viola Jones' face detection system which is largely geared towards frontal faces (can certainly be trained for many other things)
For a tutorial on doing Haar training using OpenCV, see here.
2.
Try NCC or better yet, Lucas Kanade tracking (cvCalcOpticalFlowPyrLK which is a pyramidal as in coarse-to-fine LK - a 4 level pyramid usually works well) for a template. Usually good upto 10% scale or 10 degrees rotation without template changes. Beyond that, you can have automatically evolving templates which can drift over time.
For a quick Optical Flow/tracking tutorial, see this.
3.
SIFT/SURF would indeed work very well. I'd suggest some additional geometric verification step to remove spurious matches.
I'd be a bit concerned about the amount of computational time involved. If there isn't significant illumination/scale/in-plane rotation, then SIFT is probably overkill. If you truly need it, check out Changchang Wu's excellent SIFTGPU implmentation. Note: 3rd party, not OpenCV.
It seems that none of the methods when applied alone could bring reliable results unless it is a hobby project. Probably some adaptive algorithm would be more or less acceptable. For example see a famous opensource project where they use machine learning.