Poor recognition due to background noise using OpenEars on iOS - ios

I'm using OpenEars in my app to recognise some words and sentences. I followed the basic tutorial for offline speech recognition and ported it to Swift. This is the setup procedure:
self.openEarsEventsObserver = OEEventsObserver()
self.openEarsEventsObserver.delegate = self
let lmGenerator: OELanguageModelGenerator = OELanguageModelGenerator()
addWords()
let name = "LanguageModelFileStarSaver"
lmGenerator.generateLanguageModelFromArray(words, withFilesNamed: name, forAcousticModelAtPath: OEAcousticModel.pathToModel("AcousticModelEnglish"))
lmPath = lmGenerator.pathToSuccessfullyGeneratedLanguageModelWithRequestedName(name)
dicPath = lmGenerator.pathToSuccessfullyGeneratedDictionaryWithRequestedName(name)
The recognition works well in a quiet room for both single words and whole sentences (I would say it has a 90% hit rate). However, when I tried it in a quiet pub with light background noise, the app had serious difficulty recognising even a single word.
Is there any way to improve the speech recognition when there is background noise?

If the background noise is more or less uniform (i.e. has a regular pattern), you can try adapting the acoustic model. Otherwise it's an open problem, sometimes referred to as the cocktail party effect, which can be partly solved using DNNs.

Try these settings; they work well for me.
try? OEPocketsphinxController.sharedInstance().setActive(true)
OEPocketsphinxController.sharedInstance().secondsOfSilenceToDetect = 2
OEPocketsphinxController.sharedInstance().setSecondsOfSilence()
OEPocketsphinxController.sharedInstance().vadThreshold = 3.5
OEPocketsphinxController.sharedInstance().removingNoise = true
Or you can try the iSphinx library.

Related

AKOscillator frequency range for theremin sound in iOS

I want to create a theremin-like sound from touch coordinates on the screen, using the y-axis for frequency and the x-axis for amplitude.
From my brief research, I believe I can create it using AKOscillator or AKFMOscillator from the AudioKit framework (please let me know if another oscillator works better in this case). I'm also open to other frameworks, such as the built-in AudioToolbox (MIDINoteMessage etc.), if they let me create a theremin-like sound.
Here it says a theremin has two oscillators: one at a fixed frequency of 260 kHz and one that varies between 257 and 260 kHz. It superimposes their outputs (I guess it takes the difference between them?) and produces an output frequency between 0 and 3 kHz.
When I create sounds using AKFMOscillator with baseFrequency between 257-260 kHz, it sounds high-pitched.
When I try a single oscillator in the 0-3 kHz range, it sounds very robotic. How can I simulate the timbre of a theremin?
How can I make it sound better? Should I mix two oscillators? I tried mixing with AKMixer, but when both oscillators use the same frequency and amplitude, it makes no difference.
I tried mapping to the nearest note (auto-tune) and limiting the frequency range to 3-4 octaves. It sounds better, but still not as good as a theremin.
Which oscillator should I use (AKOscillator, AKFMOscillator, OscillatorBank), and with which parameters (rampDuration, baseFrequency, modulationIndex, amplitude), to get a more theremin-like sound?
Update:
I did some more research and played with the Synth One presets. Now I know I need two oscillators mixed (both set to a saw-shaped wave). Changing the ADSR (envelope) values to specific ranges creates a richer sound (this gives the instrumental character), and an LFO creates the wavy (or spooky) effect. Playing notes (specific frequencies) sounds good; playing every frequency in between the note frequencies doesn't.
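To make the numbers above concrete, here is a minimal Python sketch (not AudioKit code) of three ideas from this question: the audible pitch as the difference of the two oscillator frequencies, an exponential mapping from the touch position to pitch, and an LFO that wobbles the pitch slightly. The function names, the 220-1760 Hz range, the vibrato settings and the plain sine waveform are all assumptions made for illustration.

```python
import math

SAMPLE_RATE = 44100  # samples per second (assumed)

def heterodyne_pitch(fixed_hz, variable_hz):
    """Audible pitch of a two-oscillator theremin: the frequency difference."""
    return abs(fixed_hz - variable_hz)

def touch_to_pitch(y_norm, low_hz=220.0, high_hz=1760.0):
    """Map a normalised touch position (0..1) to a pitch in Hz.

    The mapping is exponential, so equal finger movements give equal
    musical intervals (three octaves with the assumed defaults).
    """
    return low_hz * (high_hz / low_hz) ** y_norm

def theremin_samples(pitch_hz, amplitude, seconds,
                     vibrato_hz=5.0, vibrato_depth=0.01):
    """Generate a sine tone whose pitch is modulated by a slow LFO."""
    samples = []
    phase = 0.0
    for i in range(int(SAMPLE_RATE * seconds)):
        t = i / SAMPLE_RATE
        # The LFO shifts the instantaneous frequency by +/- vibrato_depth.
        freq = pitch_hz * (1.0 + vibrato_depth * math.sin(2 * math.pi * vibrato_hz * t))
        phase += 2 * math.pi * freq / SAMPLE_RATE
        samples.append(amplitude * math.sin(phase))
    return samples
```

For example, heterodyne_pitch(260_000, 257_000) gives 3000 Hz, which matches the 0-3 kHz output range quoted above. Accumulating the phase, rather than computing sin(2πft) directly, keeps the waveform continuous while the pitch changes.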

How to render from ob openCV

In my Python NEAT code I am using OpenCV to downscale every frame of the environment and convert it to grayscale. What I want to achieve is for OpenCV to open a window displaying the frame/video that it is processing.
In short, I want to watch the NEAT algorithm learning and evolving.
Because there are 3 environments running in parallel, I want OpenCV to display the frame/video of the one that is performing best right now.
I am working with the Python NEAT library to do some machine learning tasks. At the moment I am doing parallel learning with 3 threads in the Sonic the Hedgehog environment. I have tried some simple OpenCV frame commands, but they just open a black window.
net = neat.nn.FeedForwardNetwork.create(self.genome, self.config)
fitness = 0
xpos = 0
xpos_max = 0
counter = 0
imgarray = []
while not done:
    # self.env.render()
    # Downscale the frame and convert it to grayscale
    ob = cv2.resize(ob, (inx, iny))
    ob = cv2.cvtColor(ob, cv2.COLOR_BGR2GRAY)
    ob = np.reshape(ob, (inx, iny))
    # Flatten the frame into the network's input vector
    imgarray = np.ndarray.flatten(ob)
    actions = net.activate(imgarray)
    ob, rew, done, info = self.env.step(actions)
    xpos = info['x']
This is the part of the code that downscales the frame and converts it to gray scale.
Bonus if it could only show the frame/worker that is doing the best based on the fitness value.
View full code here: https://gitlab.com/lucasrthompson/Sonic-Bot-In-OpenAI-and-NEAT/blob/master/neat-paralle-sonic.py
by lucasrthompson
The output that I expect is one window that shows the frame/video of the environment.
The built-in render,
self.env.render()
pops up many, many windows with past and present versions of the environment.
Thanks.
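One way to get the "only show the best worker" behaviour is to have every worker report its latest frame and fitness to a shared object, and let a single loop display whichever entry currently has the highest fitness. The sketch below assumes the workers are threads in one process (with separate processes you would need a queue or similar instead). BestFrameBoard and show_best are hypothetical names, not part of neat-python or OpenCV. The cv2.imshow / cv2.waitKey pair is the standard OpenCV way to show a frame, and a window that never reaches waitKey commonly stays blank.

```python
import threading

class BestFrameBoard:
    """Thread-safe store of the latest (fitness, frame) per worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}  # worker_id -> (fitness, frame)

    def report(self, worker_id, fitness, frame):
        """Called by a worker after each environment step."""
        with self._lock:
            self._latest[worker_id] = (fitness, frame)

    def best(self):
        """Return (worker_id, fitness, frame) of the current leader, or None."""
        with self._lock:
            if not self._latest:
                return None
            worker_id, (fitness, frame) = max(
                self._latest.items(), key=lambda item: item[1][0])
            return worker_id, fitness, frame

def show_best(board, window_name="best worker"):
    """Display the leading worker's frame; call repeatedly from one thread."""
    import cv2  # imported here so the board itself does not need OpenCV
    leader = board.best()
    if leader is not None:
        cv2.imshow(window_name, leader[2])
    cv2.waitKey(1)  # lets OpenCV process window events and actually draw
```

Inside the loop above, a worker would call something like board.report(worker_id, fitness, ob) after each step, while the main thread calls show_best(board) in its own loop.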
I am writing my own NEAT implementation and I am also testing with OpenAI Gym.
You can use wrappers to record the video for you, and this will be the real video, without downscaling or changing colors:
import gym
from gym import wrappers
env_wrapped = gym.make('OpenAI-env-id')
env = wrappers.Monitor(env_wrapped, dir, video_callable=record_video_function)
Here "record_video_function" is a callable that returns True or False depending on whether you want the episode to be recorded.
What I usually do to see the best performing genomes is:
1. Sort the genomes by fitness.
2. Run the evaluation loop.
3. If one of the last generation's species champions is next, I change a global variable to True.
4. In the "record_video_function" I return the value of this global variable, so if it's True it enables video recording for the episode.
5. After the episode is over, I set this global variable back to False.
So, with this I can see the best genome performers of last generation. You can't see the best of the current generation because there's no way to know how they will perform. If the environment is deterministic, you would be able to see the best performance in the next generation. If it's stochastic, then it may not be the best anymore.
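The steps above can be sketched roughly like this. It is a minimal illustration with no Gym dependency: RecordSwitch stands in for the global variable plus "record_video_function", and evaluate, run_episode and champions are hypothetical names. The only assumption about Gym is that the video_callable is called with the episode id and returns a boolean.

```python
class RecordSwitch:
    """Callable flag that can be passed as video_callable."""

    def __init__(self):
        self.enabled = False

    def __call__(self, episode_id):
        # The monitor asks once per episode whether to record it.
        return self.enabled

def evaluate(genomes, champions, run_episode, switch):
    """Run every genome, recording only last generation's champions.

    genomes     -- list of (genome_id, last_fitness) pairs
    champions   -- set of genome ids that were species champions
    run_episode -- function that plays one episode for a genome id
    """
    ordered = sorted(genomes, key=lambda g: g[1], reverse=True)
    for genome_id, _ in ordered:
        switch.enabled = genome_id in champions
        try:
            run_episode(genome_id)
        finally:
            switch.enabled = False  # never leave recording on by accident
```

You would pass the same RecordSwitch instance as video_callable when creating the monitor, so flipping switch.enabled before an episode decides whether that episode ends up on video.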

Marker based initial positioning with ARCore/ARKit?

Problem situation: creating AR visualizations that always appear at the same place (on a table) in a convenient way. We don't want the customer to place the objects themselves, as in countless ARCore/ARKit examples.
I'm wondering if there is a way to implement those steps:
Detect marker on the table
Use the position of the marker as the initial position of the AR-Visualization and go on with SLAM-Tracking
I know there is something like a marker-detection API included in the latest build of the Tango SDK, but this technology is limited to a small number of devices (two, to be exact...).
best regards and thanks in advance for any idea
I am also interested in that topic. I think the true power of AR can only be unleashed when paired with environment understanding.
I think you have two options:
Wait for the new Vuforia 7 to be released; it is supposedly going to support visual markers with ARCore and ARKit.
Engage CoreML / Computer Vision - in theory it is possible but I haven't seen many examples. I think it might be a bit difficult to start with (e.g. build and calibrate model).
However Apple have got it sorted:
https://youtu.be/E2fd8igVQcU?t=2m58s
If you are using Google Tango, you can implement this with the built-in Area Description File (ADF) system.
The system shows a holding screen and you are told to "walk around". Within a few seconds, you can relocalise to an area the device has previously been in (or pull the information from a server, etc.).
Google's VPS (Visual Positioning Service) is a similar idea (still in closed beta) which should come to ARCore. As far as I understand, it will allow you to localise a specific location using the camera feed and a global shared map of all scanned locations. I think, when released, it will try to fill the gap of an AR Cloud type system, which will solve these problems for regular developers.
See https://developers.google.com/tango/overview/concepts#visual_positioning_service_overview
The general problem of relocalising to a space using prior knowledge of the space and the camera feed only is solved in academia and in other AR offerings (HoloLens, etc.). Markers/tags aren't required.
I'm unsure, however, which other commercial systems provide this feature.
This is what I've got so far for ARKit.
@objc func tap(_ sender: UITapGestureRecognizer) {
    let touchLocation = sender.location(in: sceneView)
    let hitTestResult = sceneView.hitTest(touchLocation, types: .featurePoint)
    if let hitResult = hitTestResult.first {
        if first == nil {
            // First tap: point A
            first = SCNVector3Make(hitResult.worldTransform.columns.3.x, hitResult.worldTransform.columns.3.y, hitResult.worldTransform.columns.3.z)
        } else if second == nil {
            // Second tap: point B
            second = SCNVector3Make(hitResult.worldTransform.columns.3.x, hitResult.worldTransform.columns.3.y, hitResult.worldTransform.columns.3.z)
        } else {
            // Third tap: point C
            third = SCNVector3Make(hitResult.worldTransform.columns.3.x, hitResult.worldTransform.columns.3.y, hitResult.worldTransform.columns.3.z)
            let x2 = first!.x
            let z2 = -first!.z
            let x1 = second!.x
            let z1 = -second!.z
            let z3 = -third!.z
            // Slope and angle of AB in the horizontal plane
            let m = (z1 - z2) / (x1 - x2)
            var a = atan(m)
            if (x1 < 0 && z1 < 0) {
                a = a + (Float.pi * 2)
            } else if (x1 > 0 && z1 < 0) {
                a = a - (Float.pi * 2)
            }
            sceneView.scene.rootNode.addChildNode(yourNode)
            let rotate = SCNAction.rotateBy(x: 0, y: CGFloat(a), z: 0, duration: 0.1)
            yourNode.runAction(rotate)
            yourNode.position = first!
            // Use C to decide whether the node needs to be flipped
            if z3 - z1 < 0 {
                let rotate = SCNAction.rotateBy(x: 0, y: CGFloat.pi, z: 0, duration: 0.1)
                yourNode.runAction(rotate)
            }
        }
    }
}
The theory is:
Make three dots A, B, C such that AB is perpendicular to AC. Tap the dots in the order A-B-C.
Find the angle of AB relative to the x-axis of the AR scene view, which gives the required rotation for the node.
Any one of the points can be used as the reference for the position at which to place the node.
From C, find out whether the node needs to be flipped.
I am still working on some edge cases that need to be handled.
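The geometry behind this can be sketched outside ARKit. The Python snippet below is an illustration under my own assumptions, not a port of the Swift code: it uses atan2 instead of atan of the slope, which avoids the quadrant special cases and the division by zero when A and B share the same x, and it reports which side of AB the point C lies on through the sign of a 2D cross product. The names placement_from_taps and c_side are hypothetical.

```python
import math

def placement_from_taps(a, b, c):
    """Derive position, yaw and side from three tapped world points.

    a, b, c are (x, y, z) tuples for the dots A, B and C. The work is
    done in the horizontal plane, with z negated as in the code above.
    Returns (position, yaw_radians, c_side), where c_side is +1 if C is
    to the left of the direction A->B, -1 if to the right, 0 if on it.
    """
    ax, az = a[0], -a[2]
    bx, bz = b[0], -b[2]
    cx, cz = c[0], -c[2]

    # Angle of AB measured from the x-axis, valid in all four quadrants.
    yaw = math.atan2(bz - az, bx - ax)

    # The 2D cross product of AB and AC tells on which side of AB C lies.
    cross = (bx - ax) * (cz - az) - (bz - az) * (cx - ax)
    c_side = (cross > 0) - (cross < 0)

    return a, yaw, c_side
```

The returned yaw can be applied as the node's rotation about the y-axis, and c_side can drive the decision whether the node needs the extra half turn.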
At the moment, both ARKit 3.0 and ARCore 1.12 have all the API tools needed to handle almost any marker-based task for precise positioning of a 3D model.
ARKit
Right out of the box, ARKit can detect 3D objects and place ARObjectAnchors in a scene, as well as detect images and use ARImageAnchors for accurate positioning. The main ARWorldTrackingConfiguration class includes both instance properties: .detectionImages and .detectionObjects. It is also worth noting that ARKit builds on features from several frameworks:
CoreMotion
SceneKit
SpriteKit
UIKit
CoreML
Metal
AVFoundation
In addition to the above, ARKit 3.0 has tight integration with a brand-new RealityKit module helping to implement multiuser connectivity, list of ARAnchors and shared sessions.
ARCore
Although ARCore has a feature called Augmented Images, the framework has no built-in machine learning algorithms for detecting real-world 3D objects; the Google ML Kit framework does. So, as an Android developer, you can use both frameworks together to precisely composite a 3D model over a real object in an AR scene.
It is worth recognizing that ARKit 3.0 has a more robust and advanced toolkit than ARCore 1.12.

IPhone X true depth image analysis and CoreML

I understand that my question is not directly related to programming itself and looks more like research, but perhaps someone here can advise.
I have an idea for an app in which the user takes a photo and the app analyzes it, cuts out everything except the required object (a piece of clothing, for example) and saves the result as a separate image. Until recently this was a very difficult task, because the developer had to create a pretty good neural network and train it. But now that Apple has released the iPhone X with its TrueDepth camera, half of the problems can be solved. As I understand it, a developer can remove the background much more easily, because the iPhone will know where the background is located.
So only several questions left:
I. What is the format of photos which are taken by iPhone X with true depth camera? Is it possible to create neural network that will be able to use information about depth from the picture?
II. I've read about CoreML and tried some examples, but it's still not clear to me how the following behaviour can be achieved with an external neural network imported into CoreML:
Neural network gets an image as an input data.
NN analyzes it, finds required object on the image.
NN returns not only the detected type of object, but also the cropped object itself or an array of coordinates/pixels of the area that should be cropped.
Application gets all required information from NN and performs necessary actions to crop an image and save it to another file or whatever.
Any advice will be appreciated.
Ok, your question is actually directly related to programming:)
Ad I. The format is HEIF, but you access data of the image (if you develop an iPhone app) by means of iOS APIs, so you easily get information about bitmap as CVPixelBuffer.
Ad II.
1. Neural network gets an image as an input data.
As mentioned above, you want to get your bitmap first, so create a CVPixelBuffer. Check out this post for example. Then you use the CoreML API, specifically the MLFeatureProvider protocol. An object that conforms to it is where you put your vector data, as an MLFeatureValue under a key name picked by you (like "pixelData").
import CoreML

class YourImageFeatureProvider: MLFeatureProvider {
    let imageFeatureValue: MLFeatureValue
    var featureNames: Set<String> = []

    init(with imageFeatureValue: MLFeatureValue) {
        self.imageFeatureValue = imageFeatureValue
        featureNames.insert("pixelData")
    }

    func featureValue(for featureName: String) -> MLFeatureValue? {
        // Only the "pixelData" feature is provided.
        guard featureName == "pixelData" else {
            return nil
        }
        return imageFeatureValue
    }
}
Then you use it like this; the feature value is created with the initWithPixelBuffer initializer on MLFeatureValue:
let imageFeatureValue = MLFeatureValue(pixelBuffer: yourPixelBuffer)
let featureProvider = YourImageFeatureProvider(with: imageFeatureValue)
Remember to crop/scale the image before this operation so that your network is fed a vector of the proper size.
2. NN analyzes it, finds required object on the image.
Use prediction function on your CoreML model.
do {
    let outputFeatureProvider = try yourModel.prediction(from: featureProvider)
    // success! your output feature provider has your data
} catch {
    // your model failed to predict, check the error
}
3. NN returns not only the detected type of object, but also the cropped object itself or an array of coordinates/pixels of the area that should be cropped.
This depends on your model and whether you imported it correctly. Assuming you did, you access the output data by checking the returned MLFeatureProvider (remember that this is a protocol, so you would have to implement another one similar to what I made for you in step 1, something like YourOutputFeatureProvider), and there you have a bitmap and the rest of the data your NN spits out.
4. Application gets all required information from NN and performs necessary actions to crop an image and save it to another file or whatever.
Just reverse step 1, so from MLFeatureValue -> CVPixelBuffer -> UIImage. There are plenty of questions on SO about this so I won't repeat answers.
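If your network returns a per-pixel mask instead of a ready-made crop, the cropping itself is simple. Here is a small language-agnostic sketch in Python (not CoreML code): it assumes the mask is a 2D grid with truthy values where the object is, and that the image is a grid of the same size. Both function names are hypothetical.

```python
def mask_bounding_box(mask):
    """Return (left, top, right, bottom) around truthy mask cells, or None.

    right and bottom are exclusive, so they can be used directly as slices.
    """
    if not mask or not mask[0]:
        return None
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # the network found nothing
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return min(cols), min(rows), max(cols) + 1, max(rows) + 1

def crop(image, box):
    """Cut the boxed region out of a row-major 2D image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

On iOS you would do the same thing on the pixel buffer or the image: find the box from the model output, then crop to that rectangle.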
If you are a beginner, don't expect results overnight, but the path is here. For an experienced dev, I would estimate several hours to get this done (plus the time to train the model and port it to CoreML).
Apart from CoreML (maybe you'll find your model too sophisticated and you won't be able to port it to CoreML), check out Matthijs Hollemans' GitHub (very good resources on different ways of porting models to iOS). He is also around here and knows a lot about the subject.

Tesseract on iOS - bad results

After spending over 10 hours compiling Tesseract with libc++ so that it works with OpenCV, I'm having trouble getting any meaningful results. I'm trying to use it for digit recognition; the image data I'm passing is a small square (50x50) image with either one digit or no digit in it.
I've tried using both the eng and equ tessdata (from Google Code); the results are different, but neither recognises any digits correctly. Using the eng data I get '4\n\n' or '\n\n' as a result most of the time (even when there's no digit in the image), with confidence anywhere from 1 to 99.
Using the equ data I get '\n\n' with confidence 0-4.
I also tried binarizing the image and the results are more or less the same. I don't think there's a need for it, though, since the images are already filtered pretty well.
I'm assuming that something is wrong, since the images are pretty easy to recognize compared to even the simplest of the example images.
Here's the code:
Initialization:
_tess = new TessBaseAPI();
_tess->Init([dataPath cStringUsingEncoding:NSUTF8StringEncoding], "eng");
_tess->SetVariable("tessedit_char_whitelist", "0123456789");
_tess->SetVariable("classify_bln_numeric_mode", "1");
Recognition:
char *text = _tess->TesseractRect(imageData, (int)bytes_per_pixel, (int)bytes_per_line, 0, 0, (int)imageSize.width, (int)imageSize.height);
I'm getting no errors. TESSDATA_PREFIX is set properly and I've tried different methods for recognition. imageData looks ok when inspected.
Here are some sample images:
http://imgur.com/a/Kg8ar
Should this work with the regular training data?
Any help is appreciated; this is my first time trying Tesseract out and I could have missed something.
EDIT:
I've found this:
_tess->SetPageSegMode(PSM_SINGLE_CHAR);
I'm assuming it must be used in this situation, tried it but got the same results.
I think Tesseract is a bit overkill for this. You would be better off with a simple neural network trained explicitly for your images. At my company, we recently tried to use Tesseract on iOS for an OCR task (scanning utility bills with the camera), but it was too slow and inaccurate for our purposes (scanning took more than 30 seconds on an iPhone 4, at a tremendously low FPS). In the end, I trained a neural network specifically for our target font, and this solution beat not only Tesseract (it could scan flawlessly even on an iPhone 3GS) but also a commercial ABBYY OCR engine, of which the company had given us a sample.
This course's material would be a good start in machine learning.
