Firebase ML Kit text recognition on iOS seems to work pretty well when the text is formatted as a paragraph or a long phrase. However, it fails when the numbers are scattered around the scene, which is our case, and when there is some line geometry around them. Some digits are recognized correctly while other, identical digits are not.
I would like to know:
Can the ML Kit team improve these cases? We are so close to perfect results, but something causes one or two numbers to be missed each time.
Is there any way to hint ML Kit about what kind of text we are looking for in the scene? I imagine this could improve performance, especially for live video, and allow a smaller model to be used.
Any kind of image processing that can be done to improve results?
For this test, I used the official ML Kit quick-start project, in particular MLVisionExample. Here are some cases:
Thank you very much!
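For reference, the recognition path I am exercising in MLVisionExample boils down to roughly the following call (a minimal sketch of the on-device Firebase ML Kit API; `uiImage` stands in for one of the test images):

```swift
import UIKit
import FirebaseMLVision

// Minimal sketch of on-device text recognition with Firebase ML Kit.
// Assumes `uiImage` is a UIImage obtained from the camera or photo library.
func recognizeText(in uiImage: UIImage) {
    let vision = Vision.vision()
    let textRecognizer = vision.onDeviceTextRecognizer()
    let visionImage = VisionImage(image: uiImage)

    textRecognizer.process(visionImage) { result, error in
        guard error == nil, let result = result else { return }
        // Iterate blocks -> lines -> elements to inspect individual tokens
        // (useful when digits are scattered rather than grouped in paragraphs).
        for block in result.blocks {
            for line in block.lines {
                for element in line.elements {
                    print(element.text, element.frame)
                }
            }
        }
    }
}
```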
I have been searching for an emotion detection through voice/speech solution for mobile (iOS) and the web.
I found the Moodies-iOS and Vokaturi solutions, but they are not free.
I couldn't find any open-source or paid software I could integrate into my app to test the approach.
Could someone share any information they have on this?
Is there any open-source solution for emotion analysis and detection through voice/speech on iOS? Please let me know.
As a former researcher in affective computing, I highly doubt you will find a ready-to-use open-source iOS solution for emotion recognition from speech. The main reason is that it is a damn difficult task that requires a lot of research and a lot of proper data to train models. That is why companies like BeyondVerbal and Vokaturi do not share their models with others. Thus, you will be very lucky to find anything open source at all, let alone an iOS solution.
I am aware of some toolkits you can use for this task (namely, the openEAR toolkit), but to build something working from them you need expert knowledge in the field and data to train models. A comprehensive list of databases can be found here: http://emotion-research.net/wiki/Databases. A lot of them are freely available.
As Dmytro Prylipko said, it is very doubtful that there is any open-source library for emotion recognition from speech.
You may write your own solution. It is not hard. The trouble is, as mentioned before, that proper training and/or thresholding takes a lot of time and nerves.
I will give you a short outline of how you should begin writing the algorithm, but the training and so on is up to you.
The first big problem is that different people convey their emotions vocally in very different ways.
For example: one shocked person will respond to the shock with an over-exclaimed sentence, while another will "freeze" and their response will sound very flat (almost robot-like).
Therefore you will need a lot of templates from which to learn how to classify your input speech by emotions.
You can remove some difficulties by using context recognition along with voice prosody.
That is what I'd advise you to do.
First, make an algorithm that uses the speech-recognized text to put the utterance into an emotional context. E.g. you can use specific words and phrases that people use when expressing different emotions.
That is easily done. You may use a neural network, simple branching, or whatever.
So, by combining context recognition with emotions from prosody, you will be able to recognize, say, that a person is thankful and surprised at the same time.
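As an illustration of that branching/keyword idea, here is a minimal sketch (the tiny emotion lexicon below is a made-up placeholder, not a real resource):

```swift
import Foundation

// Naive keyword-based "emotion context" scoring of a recognized transcript.
// The lexicon is a tiny illustrative placeholder, not a real linguistic resource.
let emotionLexicon: [String: [String]] = [
    "joy":      ["thank", "great", "wonderful", "love"],
    "surprise": ["wow", "really", "unbelievable"],
    "anger":    ["hate", "stupid", "terrible", "enough"]
]

func emotionScores(for transcript: String) -> [String: Int] {
    let words = transcript.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
    var scores: [String: Int] = [:]
    for (emotion, cues) in emotionLexicon {
        // Count how many transcript words contain one of the cue stems.
        scores[emotion] = words.filter { word in cues.contains { word.contains($0) } }.count
    }
    return scores
}

// Example: "wow, thank you, that is wonderful" -> joy: 2, surprise: 1, anger: 0
```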
Now, to recognize the emotion from prosody, you have to extract prosody parameters and some others.
For example, some emotions may be recognized by looking at the duration of particular words in a sentence.
So you have the sentence and the text of that sentence. You know that normal speech runs at approximately 200 words per minute. Knowing this and the number of words in the sentence, you can see how fast someone is talking. Then you measure the duration of each word and get its speed. Knowing how fast the speech is and how long the word is, you can compute normalized ratios that can be used for classification, to determine the closest guess of the emotion.
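A small sketch of those rate features (the per-word timings are assumed to come from whatever speech recognizer you use; the ~200 wpm figure is the baseline mentioned above):

```swift
import Foundation

// Per-word timing as a speech recognizer would report it.
struct TimedWord {
    let text: String
    let start: TimeInterval
    let end: TimeInterval
    var duration: TimeInterval { end - start }
}

// Overall speaking rate of the utterance, in words per minute.
func wordsPerMinute(_ words: [TimedWord]) -> Double {
    guard let first = words.first, let last = words.last,
          last.end > first.start else { return 0 }
    return Double(words.count) / ((last.end - first.start) / 60.0)
}

// Normalized per-word ratios: each word's duration relative to the
// average word duration in the sentence (values > 1 mean "stretched" words).
func normalizedDurations(_ words: [TimedWord]) -> [Double] {
    let average = words.map(\.duration).reduce(0, +) / Double(max(words.count, 1))
    guard average > 0 else { return [] }
    return words.map { $0.duration / average }
}

// The ~200 wpm figure from the answer can serve as a rough baseline:
// wordsPerMinute(...) / 200 < 1 suggests slower-than-normal speech, > 1 faster.
```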
For instance, when someone is presented with a present that he/she likes very much, the "thank you" will sound pretty long. It will also be of higher pitch than that person's usual speech.
So the next step is to get the average pitch of each word to see the relation between them. That way you can see how the prosody of the sentence modulates, from lower to higher or vice versa.
Also look at how the prosody changes inside the phrases within the sentence.
You may go about this by comparing curves of known emotions directly, or you may use approximation to get coefficients from the prosody curve vector. A quadratic function fits normal speech prosody (with no particular emotion in it) reasonably well, so some higher-order polynomial should do. You can then take the coefficients of the polynomial and use them to decide which emotion the whole sentence or phrase conveys.
The same goes for individual words within the sentence. You get the pitch for each phoneme or syllable, or just the pitch curve for, say, every 20 ms of the word. Then you either calculate a few coefficients to approximate the polynomial you decided is good enough, or you take the whole curve and normalize it to, say, 30 points to use for recognition.
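To make the two options concrete (resample the curve to a fixed number of points, or fit a low-order polynomial), here is a sketch; the pitch values themselves are assumed to come from whatever pitch tracker you use:

```swift
import Foundation

// Resample a pitch curve (one value per analysis frame, e.g. every 20 ms)
// to a fixed number of points so curves of different lengths can be compared.
func resample(_ curve: [Double], to count: Int) -> [Double] {
    guard curve.count > 1, count > 1 else { return curve }
    return (0..<count).map { i in
        let position = Double(i) * Double(curve.count - 1) / Double(count - 1)
        let lower = Int(position.rounded(.down))
        let upper = min(lower + 1, curve.count - 1)
        let t = position - Double(lower)
        return curve[lower] * (1 - t) + curve[upper] * t
    }
}

// Least-squares fit of a quadratic y = a + b*x + c*x^2 to the curve,
// solved directly via the 3x3 normal equations (no external library).
func quadraticFit(_ curve: [Double]) -> (a: Double, b: Double, c: Double)? {
    guard curve.count >= 3 else { return nil }
    let n = Double(curve.count)
    let xs = (0..<curve.count).map(Double.init)
    func s(_ p: Int) -> Double { xs.reduce(0) { $0 + pow($1, Double(p)) } }
    func sy(_ p: Int) -> Double { zip(xs, curve).reduce(0) { $0 + pow($1.0, Double(p)) * $1.1 } }
    // Normal-equations matrix and right-hand side.
    var m = [[n, s(1), s(2)], [s(1), s(2), s(3)], [s(2), s(3), s(4)]]
    var r = [sy(0), sy(1), sy(2)]
    // Gaussian elimination on the 3x3 system.
    for i in 0..<3 {
        let pivot = m[i][i]
        guard abs(pivot) > 1e-12 else { return nil }
        for j in (i + 1)..<3 {
            let factor = m[j][i] / pivot
            for k in i..<3 { m[j][k] -= factor * m[i][k] }
            r[j] -= factor * r[i]
        }
    }
    // Back substitution.
    var coeffs = [0.0, 0.0, 0.0]
    for i in stride(from: 2, through: 0, by: -1) {
        var sum = r[i]
        for k in (i + 1)..<3 { sum -= m[i][k] * coeffs[k] }
        coeffs[i] = sum / m[i][i]
    }
    return (coeffs[0], coeffs[1], coeffs[2])
}
```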
To compare curves directly you may use gesture recognition algorithm by Oleg Dopertchouk:
http://www.gamedev.net/reference/articles/article2039.asp
I tried it on pitch curves of melodies, it works just fine.
The trouble is, you need a database of speech with context and emotion, with clear manual classification, to give your algorithm something to compare against.
If you use polynomials instead of whole curves, you can do some recognition by applying thresholds to the coefficients, but the results will be a bit shaky. The only real excuse for using coefficients at all is that you do not need to know how long the word in question is, i.e. the same polynomial should work on a word with 2 phonemes and on one with 5 (it should, at least).
You see, the theory is nice and easy: use speech recognition, measure the speech rate and the duration of each word, construct the pitch curve for the whole phrase and for each word using an FFT, do some comparison between a ready-made database and the input, and voilà, emotion recognized.
But where will you find a database with word curves marked with emotions?
For example, you would need, for each emotion, at least one pitch curve for words with different numbers of phonemes. At least one, because it matters whether the word starts or ends with a vowel, and because people may convey the same emotion differently even when the curve represents the same word.
OK, so you might say you can build one yourself. Where would you find recorded samples from which to build your curves or compute the coefficients? Hm, perhaps a recording of some drama. Not a bad idea, but acted emotions aren't the same as natural ones.
It is a big job to teach a machine such a thing.
Oh, yeah, I almost forgot: emotions aren't only, and sometimes aren't at all, conveyed through pitch changes; sometimes it's only the way in which the word is pronounced.
So, for some cases, you would probably need LPC or some other coefficients that carry more information about how the phonemes in the word sound. Or you would need to look at other harmonics from the FFT, not just the one representing the pitch of the excitation train.
The best you can do without following my hints and developing your own algorithm is to use NLTK (the Natural Language Toolkit) to develop a statistical (emotionally rich) speech model and use algorithms from there (perhaps slightly modified) to try to get at the emotion in question.
But I fear it would be a bigger job than starting from zero. As far as I know, NLTK doesn't support emotions, just normal speech prosody.
You may try to integrate some of the things I wrote about into Sphinx, to develop emotion-based speech models and introduce emotion recognition directly into Sphinx's voice-recognition algorithm.
If you really need this, I advise you to learn enough DSP to write your own algorithm, then pay someone to build an initial database for you from audiobooks, radio dramas and similar material (using a tool you provide).
After your algorithm starts to work reasonably well, implement auto-learning by giving users an option to correct its wrong guesses. After some time you will get an algorithm that recognizes emotions from speech with about 90% reliability.
Is there a way to accomplish something similar to what the iTunes and App Store Apps do when you redeem a Gift Card using the device camera, recognizing a short string of characters in real time on top of the live camera feed?
I know that in iOS 7 there is now the AVMetadataMachineReadableCodeObject class which, AFAIK, only represents barcodes. I'm more interested in detecting and reading the contents of a short string. Is this possible using publicly available API methods, or some other third party SDK that you might know of?
There is also a video of the process in action:
https://www.youtube.com/watch?v=c7swRRLlYEo
I'm working on a project that does something similar to the Apple App Store redeem-with-camera flow you mentioned.
A great starting place for processing live video is a project I found on GitHub. It uses the AVFoundation framework, and you implement the AVCaptureVideoDataOutputSampleBufferDelegate methods.
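For illustration, the delegate wiring looks roughly like this (a bare-bones sketch; camera permissions, orientation and error handling are left out):

```swift
import AVFoundation

// Minimal camera frame stream via AVCaptureVideoDataOutputSampleBufferDelegate.
final class CameraStream: NSObject, AVCaptureVideoDataOutputSampleBufferDelegate {
    private let session = AVCaptureSession()
    private let output = AVCaptureVideoDataOutput()

    func start() throws {
        guard let device = AVCaptureDevice.default(for: .video) else { return }
        let input = try AVCaptureDeviceInput(device: device)
        if session.canAddInput(input) { session.addInput(input) }
        output.setSampleBufferDelegate(self, queue: DispatchQueue(label: "camera.frames"))
        if session.canAddOutput(output) { session.addOutput(output) }
        session.startRunning()
    }

    // Called once per captured video frame.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        guard let pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer) else { return }
        // Hand the pixel buffer to your OCR pipeline (OpenCV, Vision, Tesseract, ...).
        _ = pixelBuffer
    }
}
```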
Once you have the image stream (video), you can use OpenCV to process the video. You need to determine the area in the image you want to OCR before you run it through Tesseract. You have to play with the filtering, but the broad steps you take with OpenCV are:
Convert the images to B&W using cv::cvtColor(inputMat, outputMat, CV_RGBA2GRAY);
Threshold the images to eliminate unnecessary elements. You specify the threshold value to eliminate, and then set everything else to black (or white).
Determine the lines that form the boundary of the box (or whatever you are processing). You can either create a "bounding box" if you have eliminated everything but the desired area, or use the HoughLines algorithm (or the probabilistic version, HoughLinesP). Using this, you can determine line intersection to find corners, and use the corners to warp the desired area to straighten it into a proper rectangle (if this step is necessary in your application) prior to OCR.
Process the portion of the image with Tesseract OCR library to get the resulting text. It is possible to create training files for letters in OpenCV so you can read the text without Tesseract. This could be faster but also could be a lot more work. In the App Store case, they are doing something similar to display the text that was read overlaid on top of the original image. This adds to the cool factor, so it just depends on what you need.
Some other hints:
I used the book "Instant OpenCV" to get started quickly with this. It was pretty helpful.
Download OpenCV for iOS from OpenCV.org/downloads.html
I have found adaptive thresholding to be very useful; you can read all about it by searching for "OpenCV adaptiveThreshold". Also, if you have an image with very little in between the light and dark elements, you can use Otsu's binarization, which automatically determines the threshold value based on the histogram of the grayscale image.
This Q&A thread seems to consistently be one of the top search hits for the topic of OCR on iOS, but is fairly out of date, so I thought I'd post some additional resources that might be useful that I've found as of the time of writing this post:
Vision Framework
https://developer.apple.com/documentation/vision
As of iOS 11, you can now use the included CoreML-based Vision framework for things like rectangle or text detection. I've found that I no longer need to use OpenCV with these capabilities included in the OS. However, note that text detection is not the same as text recognition or OCR so you will still need another library like Tesseract (or possibly your own CoreML model) to translate the detected parts of the image into actual text.
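A small illustration of the detection-only API referred to here, assuming you already have a CGImage of the frame:

```swift
import Vision
import CoreGraphics

// Detect text regions (bounding boxes only, no OCR) with the Vision framework.
// `cgImage` is assumed to be a frame you grabbed from the camera.
func detectTextRegions(in cgImage: CGImage) {
    let request = VNDetectTextRectanglesRequest { request, error in
        guard error == nil,
              let observations = request.results as? [VNTextObservation] else { return }
        for observation in observations {
            // Normalized coordinates (0...1, origin bottom-left); convert before drawing.
            print(observation.boundingBox)
        }
    }
    request.reportCharacterBoxes = true // also return per-character boxes

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    try? handler.perform([request])
}
```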
SwiftOCR
https://github.com/garnele007/SwiftOCR
If you're just interested in recognizing alphanumeric codes, this OCR library claims significant speed, memory consumption, and accuracy improvements over Tesseract (I have not tried it myself).
ML Kit
https://firebase.google.com/products/ml-kit/
Google has released ML Kit as part of its Firebase suite of developer tools, in beta at the time of writing this post. Similar to Apple's CoreML, it is a machine learning framework that can use your own trained models, but it also has pre-trained models for common image processing tasks, as the Vision framework does. Unlike the Vision framework, it also includes a model for on-device text recognition of Latin characters. Currently, use of this library is free for on-device functionality, with charges for using Google's cloud/SaaS API offerings. I have opted to use this in my project, as the speed and accuracy of recognition seem quite good, and I will also be creating an Android app with the same functionality, so having a single cross-platform solution is ideal for me.
ABBYY Real-Time Recognition SDK
https://rtrsdk.com/
This commercial SDK for iOS and Android is free to download for evaluation and limited commercial use (up to 5000 units as of time of writing this post). Further commercial use requires an Extended License. I did not evaluate this offering due to its opaque pricing.
'Real time' is just a series of images. You don't even need to process all of them, just enough to broadly represent the motion of the device (or the change in the camera position). There is nothing built into the iOS SDK to do what you want, but you can use a third-party OCR library (like Tesseract) to process the images you grab from the camera.
I would look into Tesseract. It's an open-source OCR library that takes image data and processes it. You can add different regular expressions and only look for specific characters as well. It isn't perfect, but in my experience it works pretty well. It can also be installed as a CocoaPod if you're into that sort of thing.
If you want to capture that in real time, you might be able to use GPUImage to catch images in the live feed and process the incoming frames to speed up Tesseract, by using different filters or reducing the size or quality of the incoming images.
There's a project similar to that on github: https://github.com/Devxhkl/RealtimeOCR
Currently my company's R&D department is working on an iOS application that recognizes credit card numbers and expiration dates using a mobile device's camera. We have successfully completed most of the project and are now working on the stage of splitting the card number digits into separate blocks for further recognition. Read more for the details: http://rnd.azoft.com/optical-recognition-ios-application/
If you have worked on similar tasks, we would appreciate it if you could share your approaches and the results you obtained.
All credit cards I know of use the Luhn scheme, something like a checksum, so you should use it to validate numbers. The most significant digits are on the left, as they encode the brand, which in turn tells you the length. That said, there are exceptions that make absolute assumptions difficult. The Luhn code is going to be your best friend.
There is an iOS Credit Card Entry project that has code for the Luhn check and brand identification that should be helpful.
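For reference, the Luhn check itself is only a few lines (a sketch that validates the checksum only, not the brand/length rules mentioned above):

```swift
// Luhn checksum validation for a candidate card number string.
// Non-digit characters (spaces, dashes) are ignored; returns false for empty input.
func passesLuhn(_ candidate: String) -> Bool {
    let digits = candidate.compactMap { $0.wholeNumberValue }
    guard !digits.isEmpty else { return false }
    var sum = 0
    // Walk from the rightmost digit, doubling every second one.
    for (index, digit) in digits.reversed().enumerated() {
        if index % 2 == 1 {
            let doubled = digit * 2
            sum += doubled > 9 ? doubled - 9 : doubled
        } else {
            sum += digit
        }
    }
    return sum % 10 == 0
}

// passesLuhn("4539 1488 0343 6467") == true   (a well-known Luhn test number)
// passesLuhn("4539 1488 0343 6468") == false
```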
1) Image quality judgement: only send images with good quality to the next stages. Poor-quality images will give you incorrect results, which makes for a very bad user experience on a smartphone. Many smartphone cameras have autofocus, so you may want to use a frame only once the camera has finished focusing.
2) Card detection: detect the entire card. This stage is very important; many algorithms fail at it. You can detect a card with morphological operators or other image operators, but these are a bit slow on a smartphone. Alternatively, you can train a cascade classifier for card detection, which is quicker, but it is hard to make it general: such a classifier typically detects only one kind of card.
3) Character zone extraction: it is easy to get the character zone, because the numbers are in a fixed position on a given type of card.
4) Character segmentation: use a vertical projection to split the zone into individual digit images (see the sketch after this list).
5) Character recognition: find the connected component in each digit image, crop the image so it contains only the connected component, and send the cropped image to a single-digit classifier. I tried Tesseract; the results were not bad. You may also be able to use other, better digit classifiers.
6) Number validation: some card numbers have validation rules; you can make use of them.
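A minimal sketch of the vertical-projection idea from step 4, operating on an already binarized character zone (pixels stored as 0/1 in row-major order; the gap and width thresholds are placeholders you would tune):

```swift
// Split a binarized character zone into digit slices using a vertical projection.
// `pixels` holds 0 (background) / 1 (ink) values, row-major, for a width x height zone.
func segmentDigits(pixels: [UInt8], width: Int, height: Int,
                   minGap: Int = 2, minWidth: Int = 3) -> [Range<Int>] {
    // Column-wise ink counts: high values mean a digit, near-zero runs mean gaps.
    var projection = [Int](repeating: 0, count: width)
    for y in 0..<height {
        for x in 0..<width where pixels[y * width + x] != 0 {
            projection[x] += 1
        }
    }

    var segments: [Range<Int>] = []
    var start: Int? = nil
    var gap = 0
    for x in 0..<width {
        if projection[x] > 0 {
            if start == nil { start = x }
            gap = 0
        } else if let s = start {
            gap += 1
            // Close the segment once a long-enough empty run has been seen.
            if gap >= minGap {
                let end = x - gap + 1
                if end - s >= minWidth { segments.append(s..<end) }
                start = nil
                gap = 0
            }
        }
    }
    if let s = start, width - s >= minWidth { segments.append(s..<width) }
    return segments // column ranges, each covering one digit candidate
}
```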
I need to write a speech detection algorithm (not speech recognition).
At first I thought I just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I have to measure the current power spikes against the average sound level or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research already existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD) – it's implemented widely in telecomms, particularly in Acoustic Echo Cancellation (AEC).
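As a naive sketch of the adaptive-threshold idea from the question (energy gating only, nothing like the VADs used in telecoms; the smoothing factors and the x3 margin are arbitrary starting points):

```swift
import AVFoundation

// Very naive energy-based voice activity detection: speech is flagged when the
// short-term RMS rises well above a slowly adapting estimate of the noise floor.
final class NaiveVAD {
    private let engine = AVAudioEngine()
    private var noiseFloor: Float = 0.0

    func start(onSpeech: @escaping (Bool) -> Void) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)
        input.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
            guard let self = self,
                  let samples = buffer.floatChannelData?[0] else { return }
            let count = Int(buffer.frameLength)
            // Short-term energy (RMS) of this buffer.
            var sum: Float = 0
            for i in 0..<count { sum += samples[i] * samples[i] }
            let rms = (sum / Float(max(count, 1))).squareRoot()

            // Track the ambient level; adapt faster downwards than upwards so
            // loud speech does not drag the floor up immediately.
            let rate: Float = rms < self.noiseFloor ? 0.2 : 0.01
            self.noiseFloor += rate * (rms - self.noiseFloor)

            // Flag speech when energy clearly exceeds the floor (margin is a guess to tune).
            onSpeech(rms > self.noiseFloor * 3 && rms > 0.01)
        }
        try engine.start()
    }
}
```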
I had an idea for which I need to be able to recognize certain objects or models in a rendered three-dimensional digital movie.
After limited research, I know now that what I need is called feature detection in the field of Computer Vision.
So, what I want to do is:
create a few screenshots of a certain character in the movie (eg. front/back/leftSide/rightSide)
play the movie
while playing the movie, continuously create new screenshots of the movie
for each screenshot, perform feature detection (SIFT?, with openCV?) to see if any of our character appearances are there (they must still be recognized if the character is further away and thus appears smaller, or if the character is eg. lying down).
give a notice whenever the character is found
This would be possible with OpenCV, right?
The "issue" is that I would have to learn c++ or python to develop this application. This is not a problem if my movie and screenshots are applicable for what I want to do.
So, I would like to first test my screenshots of the movie. Is there a GUI version of OpenCV that I can input my test data and then execute it's feature detection algorithms manually as a means of prototyping?
Any feedback is appreciated. Thanks.
There is no GUI of OpenCV able to do what you want. You will be able to use OpenCV for some aspects of your problem, but there is no ready-made solution waiting there for you.
While it's definitely possible to solve your problem, the learning curve for this problem is quite long. If you're a professional, then an alternative to learning about it yourself would be to hire an expert to do it for you. It would cost money, but save you time.
EDIT
As far as template matching goes, you wouldn't normally use it to solve such a problem because the thing you're looking for is changing appearance and shape. There aren't really any "dynamic parameters to set". The closest thing you could try is have a massive template collection that would try to cover the expected forms that your target may take. But it would hardly be an elegant solution. Plus it wouldn't scale.
Next, to your point about face recognition. This is kind of related, but most facial recognition applications deal with a controlled environment: lighting, distance, pose, angle, etc. Outside of that controlled environment face detection effectiveness drops significantly. If you're detecting objects in a movie, then your environment isn't really controlled.
You may want to first try a simpler problem of accurately detecting where the characters are, without determining who they are (video surveillance, essentially). While it may sound simple, you'll find that it's actually non-trivial for arbitrary scenes. The result of solving that problem may be useful in identifying the characters.
There is Find-Object by Mathieu Labbé. It was very helpful for me in getting an understanding of the descriptors, since you can change them while your video is running and see what happens.
This is probably too late, but might help someone else looking for a solution.
Well, using OpenCV you would be taking a frame of a video file and doing your computations on it.
There are several different methods of detecting a character in that image, but it's not so easy to make it flexible enough to still find the person if, for example, they are lying on the floor when you only provided reference images of that character standing.
Basically, you could try extracting all the important features from your set of reference pictures and training a (in your case supervised) learning algorithm to obtain a good feature vector of that character for classification.
You then need to write code that plays the video, takes a video frame, say, every 500 ms (or whatever interval you like), gets a segmentation of the object you think could be that character, and compares it with the reference values you got from your learning algorithm. If there's a match, your code can yell "Yehaaawww!" or do other things...
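As one way to do the "frame every 500 ms" part on iOS, a sketch using AVAssetImageGenerator (the comparison/classification step itself is left out):

```swift
import AVFoundation
import CoreGraphics

// Pull one frame roughly every 500 ms out of a video file, as a stand-in for the
// "take a video frame each 500 ms" step; the actual character matching is up to you.
func sampleFrames(from url: URL, everySeconds interval: Double = 0.5,
                  handler: (CGImage, Double) -> Void) {
    let asset = AVAsset(url: url)
    let generator = AVAssetImageGenerator(asset: asset)
    generator.appliesPreferredTrackTransform = true
    // Exact-time sampling; by default the generator may snap to nearby keyframes.
    generator.requestedTimeToleranceBefore = .zero
    generator.requestedTimeToleranceAfter = .zero

    let duration = CMTimeGetSeconds(asset.duration)
    var t = 0.0
    while t < duration {
        let time = CMTime(seconds: t, preferredTimescale: 600)
        if let frame = try? generator.copyCGImage(at: time, actualTime: nil) {
            handler(frame, t) // run your feature extraction / comparison here
        }
        t += interval
    }
}
```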
But all this depends on how flexible you want it to be. You could also try template matching or cross-correlation, which basically shifts the reference image(s) over the frame and checks how similar both parts are. Unfortunately, this is very sensitive to rotation, deformation and other noise, so you wouldn't catch the person if they are, for example, lying down. And I doubt you can get all those calculations done in real time...
Basically: yes, OpenCV is a good fit for your image processing/computer vision tasks. But it offers a lot of methods and approaches, and you'd need to find one that works for your images... it's not a trivial task, though...
Hope that helps...
Have you tried looking at some of the work of the Oxford visual geometry group?
Their Video Google system describes to a large extent what you want, instance detection.
Their work into Naming People in TV shows is also pretty relevant. A face detection and facial feature pipeline is included that can be run from Matlab. Are you familiar with Matlab?
Have you tried computer vision frameworks like Cassandra? There you can do exactly that with just a few mouse clicks.