Waveform Comparison - machine-learning

Waveform Comparison - machine-learning

I am working on a personal research project.
My objective is to be able to recognize a sound and identify if it belongs to the IPA or not by comparing it's waveform to a wave form in my data base. I have some skill with Mathematica, SciPy, and PyBrain.
For the first phase, I'm only using the English (US) phonetic alphabet.
I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:
I want to separate a sound file into wave forms that correspond to different syllables- this will take a learning algorithm. So, 'I like apples' would be cut up into the syllable waveforms that would make up the sentence.
Each waveform is then compared against the English PA's wave forms. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture the image of the wave form and compare it to the one stored in the database with image analysis (which is kind of fun to do).
The damage here, is that I don't know how to make Praat generate a wave form file automatically then cut it up between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the comp to do it.
Instead of needing a wave form image- could I do this with fast Fourier transformation and compare two fft's- within x% margin of error consider it y syllable?

Frankly I don't really know about Praat, But I find your project super cool and interesting. I have experience with car motor's fault detection using it's sound, which might be connected to your project. I used Neural Networks and SVM to do the classification because multiple research papers proved it. Thus I didn't have any doubt about the way I chose. So my advice is maybe you should research and read some Papers about it. It really helps when you have questions like this (Will it work?, Can I use it instead or Am I using optimal solution? etc...). And good luck that's an awesome project :)

You could try Praat scripting.
Using just FFT will give you rather terrible results. Very long feature vector that will be really difficult to segment and run any training on it. That's thousands of points for a single syllable. Some deep neural networks are able to cope with it, but that's assuming you design them properly and provide huge training set. The advantage of using neural networks is that they can build features for you from the "raw data" (and I would consider fft also "raw"). However, when you work with sound, it's not that badly needed - you can manually engineer features. In case of sounds, science knows very well what sort of "features" sound have.
You can calculate these features with libraries like Yaafe. I recommend checking it even if you are not doing it in C++ or Python - the link I provided also delivers formulas for calculating them. I used some of them in my kiwi classifier.
Another good approach comes from scikit-talkbox, which provides exactly the tooling you might need.

Related

Best approach to build a model to recognise licence plates (ALPR)

I am trying to make a deep learning model to detect and read number plates using deep learning techniques like CNN. I would be making a model in tensorflow. But i still don't know what can be the best approach to build such model.
i have checked few models like this
https://matthewearl.github.io/2016/05/06/cnn-anpr/
i have also checked some research papers but none show the exact way.
So the steps what i am planning to follow are
Image preprocessing using opencv ( grayscale,transformations etc i dont know much about this part)
Licence plate Detection (probably by sliding window method)
Train using CNN by building a synthetic dataset as in the above link.
My questions
Is there any better way to do this?
Can RNN also be combined after CNN for variable length number?
Should i prefer detecting and recognising individual characters rather the whole plate?
There are many old methods too who prefer image preprocessing and the directly passing to OCR.What will be the best?
PS- i want to make a commercial real time system. So i need good accuracy.

Firstly, I don't think combining RNN and CNN can achieve real time system. And I personally prefer detecting individual characters if I want real time system because there will not more than 10 characters on license plate. When detecting plates with variable length, detecting individual characters can be more feasible.
Before I learned deep learning, I also have tried to use OCR to detect plate. In my case, OCR is fast but the accuracy is limited especially when the plate is not clear enough. Even image processing cannot rescue some unclear case.......
So if I were you I will try as follows:
Simple image preprocessing on the whole image
Licence plate Detection (probably by sliding window method)
Image processing (filters and geometric transformations) on the extracted plate part to make it more clear. Separate characters.
Deploy CNN to each character. (Maybe I will try some short CNNs because of real time, such as LeNet used in MNIST handwritten digit data ) (Multithreading might be needed)
Hope my response can help.

Recognize "generic" objects

I'm working on a project for visually impaired people that converts the visual world to audio.
We prefer to create a prototype that doesn't need an internet connection. So we chose to work with OpenCV. After reading (a lot of) tutorials and documentation we were able to train OpenCV in recognizing specific objects.
For example: we trained OpenCV to recognize a certain chair and a door. That works fine.
But, we also tried to train OpenCV on a "generic" level. It should be possible to recognize (almost) all chairs. We did that by training OpenCV with a lot of positive and negative images as explained here: http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html
The actual result wasn't what we expected -he could not recognize any chair-. I know, there are a lot of different parameters to take into account (maybe we did something wrong with that) and we experimented a lot. But our time (and unfortunately our knowledge of opencv) is limited.
We are looking for some advice on how to train opencv to recognize generic objects.
Where do we start?
Is opencv even suited to do that?
Thank you for your time!

Open CV is the library to use. But object recognition is tricky. Often when people say they are doing "object recognition" they are not, they are processing one image, or at best a series of related images, to separate into object and background.
To recognise a "chair" - everything from an armchair to a dining chair to a throne - would be almost impossible. I'd want at least stereo images to give a chance to detect flat surfaces. I don't doubt that with a lot of work you can get quite a good result, maybe just recognising dining -style chairs, but it's skilled work, it's not just a case of feeding a few parameters to a hierarchical classifier.

Emotion detection through voice/speech solution for Mobile and web [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
I have been searching for emotion detection through voice/speech solution on mobile (iOS) and web.
I found Moodies-iOS and Vokaturi solution, but they are not free.
I couldn't find any open source or paid version software available to integrate in my app and test the solution.
Could someone share if you have any info on this related.
Is there any OPEN SOURCE for iOS for Emotion analysis and detection through Voice/Speech, Please let me know.

As a former research in affective computing, I highly doubt you can find a ready-for-use iOS open source solution for emotion recognition from speech. The main reason is that it is a damn difficult task that requires a lot of research and a lot of proper data to train models. That is why companies like BeyondVerbal and Vokaturi do not share their models with others. Thus, you will be very lucky if you can find anything in open source, I am not even talking about iOS solutions.
I am aware about some toolkits you can use for this task (namely, the openEAR toolkit), but to build something working from it, you need an expert knowledge in the field and data to train models. A comprehensive list of databases can be found here: http://emotion-research.net/wiki/Databases. A lot of them a freely available.

As Dmytro Prylipko said it is very doubtful that there is any open-source lib for emotion recognition from speech.
You may write your own solution. It is not hard. Trouble is, as mentioned before, proper training and/or trasholding takes a lot of time and nerves.
I will give you a short theory how you should begin writing the algo, but training and so on is on you.
First big trouble is that different people differently relay their emotions vocally.
For example: one shocked person will to their shock respond with overexclaimed sentence while another will "freeze" and their response would sound very flat (almost robot-like).
Therefore you will need a lot of templates from which to learn how to classify your input speech by emotions.
You can remove some difficulties by using context recognition along with voice prosody.
That is what I'd advise you to do.
First make an algorithm that will use speech-recognized text to put it into emotion context. E.g. you can use specific words and phrases that people use when expressing different emotions.
That is easily done. You may use a neural network or simple branching or whatever.
So you will be able to recognize whether person is thankful and surprised at the same time by combining context recognition and emotions from prosody.
Now, to recognize the emotion from prosody you have to get prosody parameters and some others.
For example, some emotions may be recognized by looking at duration of particular words in a sentence.
So you have the sentence and the text of that sentence. You know that the speed of normal speech is approximately 200 words per minute. Knowing this and number of words in the sentence you can see how fast is someone talking. Then you measure the duration of each word and get its speed. By knowing how fast is the speech and how long is the word you can get normalized ratios that can be used for classifications in order to determine the closest guess of the emotion.
For instance, when someone is presented with a present that he/she likes very much, the "thank you" will sound pretty long. It will also be of higher pitch than that person's usual speech.
So the next step would be to get the average pitch for each word to see the relation between them. So you will be able to see how the sentence prosody modulates. From lower to higher, or vice versa.
Also, how prosody changes inside the phrases within the sentence.
You may go about this by comparing curves of known emotion directly, or you may use aproximation to get coefficients from the prosody curve vector. The square function does good for normal speech prosody (with no particular emotions in). So some higher order polynomial should do. So, you can get coefficients of the polynom and use them to get what emotion should whole sentence or phrase relay.
The same goes for individual words within the sentence. You get the pitch for each phoneme or syllable or just the pitch curve for e.g. every 20 ms of the word. Then you either calculate few coefficients to aproximate the polynom you decided is good enough for you, or you take the whole curve and normalize it to e.g. 30 points to use it with recognition.
To compare curves directly you may use gesture recognition algorithm by Oleg Dopertchouk:
http://www.gamedev.net/reference/articles/article2039.asp
I tried it on pitch curves of melodies, it works just fine.
The trouble is, you need a database of speech with context and emotion with clear manually done classification to give your algo something to compare with.
If you use polynomials instead of whole curves, you can do some recognition by using thresholds on coefficients, but results will be a bit shaky. Only real excuse for using coeffs at all is that you do not need to know how long is the word in question. I.e. the same polynom should work on a word with 2 phonemes and on one with 5. (should work)
You see, a theory is nice and easy. Use speech recognition, measure speech rate, and duration of each word, construct pitch curve for whole phrase and pitch curve for each word using FFT, do some comparison between ready database and the input. And walla, emotion recognized.
But where will you find the database with word curves marked with emotions.
For example, you would need for each emotion at least one pitch curve for words with different number of phonemes. At least one, because it is important whether the word starts with vowel or ends with one, or simply someone differently relays the same emotion even if the curve represents the same word.
OK, so you can say that you can make one. Where would you find recorded samples to make your curves or calculate coeffs? Hm, perhaps a recording of some drama. Not bad idea, but the acted emotions aren't the same as the natural ones.
It is a big job to teach a machine such a thing.
Oh, yeah, I almost forgot, emotions aren't only, or sometimes at all transfered using pitch changes, sometimes it's only the way in which the word is being pronounced.
So, for some cases, you would probably need LPC or some other coefficients showing some more info on how phonemes in the word sounds. Or you would need to take in view other harmonics from FFT, not just the one representing the pitch of excitation train.
The best that you can do without following my hints and developing your own algo, is to use NLTK (natural language toolkit) to develop a statistical speech (emotionally rich) model and use algorithms from there (perhaps a bit modified) to try to get to the emotion in question.
But I fear it would be a greater job than going from zero. As far as I know NLTK doesn't support emotions. Just normal speech prosody.
You may try to integrate some things I wrote about into Sphinx, to develop emotion based speech models and introduce emotion recognition directly into sphinxes VR algorithm.
If you really need this, I advise you to learn enough DSP to write your own algo, then pay someone to make you initial database from audiobooks, radio dramas and similar stuff (using a tool you provide).
After your algo starts to work reasonably well, implement autolearning by giving users an option to correct the algo's wrong guesses. After some time you will get 90% reliable algo to recognize emotions from speech.

Machine Learning: sign visibility

I work at an airport where we need to determine the visibility conditions of pilots.
To do this, we have signs placed every 200 meters along the runway that allow us to determine how far the visibility is. We have multiple runways, and the visibility needs to be checked every hour.
Right now the visibility check is done manually with a human being who looks at the photos from the cameras placed at the end of each runway. So it can be tedious.
I'm a programmer who has very little experience with machine learning, but this sounds like an easy problem to automate. How should I approach this problem? Which algorithms should I study? Would OpenCV help me?
Thanks!

I think this can be automated using computer vision techniques. openCV could make the implementation easier. If all the signs are similar then ,we can train our program to recognize the sign in a specific conditions(lights). Then, we can use the trained classifier to check for the visibility of signs every hours using a simple script.
There is harr-like feature extraction already in openCV. You can use to train classifier which will output a .xml file and use that .xml file for detecting the sign regularly.
I have done a similar project RTVTR(Real Time Vehicle Tracking and Recognition) using openCV and it worked great. http://www.youtube.com/watch?v=xJwBT76VEZ4

Answering to your questions:
How should I approach this problem?
It depends on the result you want/need to obtain. Is this an "hobby" project (even if job-related) or do you need to build a machine vision system to solve the problem and should it be compliant with some regulations or standard?
Which algorithms should I study?
I am very interested in your question but I am not an expert in the field of meteorology and so searching in the relative literature is, for me, a time consuming task... so I reserve to update this part of the answer in the future. I think there will be different algorithms involved in the solution of the problem, some are very general like for example algorithms for the image segmentation, some are very specific like for example how to measure the visibility.
Update: one of the keyword for searching in the literature is Meteorological Visibility, for example
HAUTIERE, Nicolas, et al. Automatic fog detection and estimation of visibility distance through use of an onboard camera. Machine Vision and Applications, 2006, 17.1: 8-20.
LENOR, Stephan, et al. An Improved Model for Estimating the Meteorological Visibility from a Road Surface Luminance Curve. In: Pattern Recognition. Springer Berlin Heidelberg, 2013. p. 184-193.
Would OpenCV help me?
Yes, I think OpenCV can help giving you a starting point.
An idea for a naïve algorithm:
Segment the image in order to get the pixel regions belonging to the signs and to the background.
Compute the measure of visibility according to some procedure, the measure is computed by a function that has as input the regions of all the signs and the background region.
The segmentation can be simplified a lot if the signs are always in the same fixed and known position inside the image.
The measure of visibility is obviously the core of the algorithm and it can be performed in a lot of ways...
You can follow a simple approach where you compute the visibility with a mathematical formula based on the average gray level of the signs and background regions.
You can follow a more sophisticated and machine-learning oriented approach where you implement an algorithm that mimics your current human being based procedure. In this case your problem can be framed as a supervised learning task: you have a set of training examples, each training example is a pair composed by a) the photo of the runway (the input) and b) the visibility related to that photo and computed by human (the desired output). Then the system is trained by means of the training set and when you give a new photo as input it will give you back the visibility measure. I think you have a log for past visibility measures (METAR?) and if you saved the related images too, you will already have a relevant amount of data in order to build a training set and a test set.
Update in the age of Convolutional Neural Networks:
YOU, Yang, et al. Relative CNN-RNN: Learning Relative Atmospheric Visibility from Images. IEEE Transactions on Image Processing, 2018.

Both Tensor and uvts_cvs 's replies are very helpful. While the opencv mainly aims to recognize the sign pattern or even segment it from the background, when you extract the core feature in your problem : visibility, you may still need to include the background signal in your training set. I assume manual check of visibility is based on image contrast, if so, the signal-to-noise ratio(SNR) or contrast-to-noise ratio(CNR) is a good feature in learning. A threshold is defined to classify 'visible-1' and 'invisible-0'. The SNR/CNR can be obtained automatically especially if your sign position and size are fixed in your camera images.

Gather whole bunch of photos and videos and propose it as a challenge on Kaggle. I am sure many people would like to try solve it, even if reward would not be very high.

You can use the template matching functionality of openCV:
http://docs.opencv.org/doc/tutorials/imgproc/histograms/template_matching/template_matching.html
Where the template is the sign. If you manage to find a correct match, then the sign is visible. I think you can also get a sense of the scale of the sign in the image from that code.

As this is a very controlled and static environment, you have perfect conditions to estimate the visibility with vision-based approaches. Nonetheless, it is not so easy to decide which approach to take. In my thesis, I am reviewing this topic in depth for the less well-controlled environment of road traffic. See: LENOR, Stephan. Model-Based Estimation of Meteorological Visibility in the Context of Automotive Camera Systems. 2016. Doktorarbeit. (https://archiv.ub.uni-heidelberg.de/volltextserver/20855/1/20160509_lenor_thesis_final_print.pdf).
I see two major directions you could follow up:
Model-based approaches: Advantages: Not so much dependent on your very specific setup. You do not need heavy collection of data.
Data-based approaches/ML: Advantages: Can hide the whole complexity of different light and weather conditions. You seem to have a good source of data if there are people doing the job right now. Very promising without much engineering effort (just use a light-weighted CNN with few layers or so).
You could also combine both, etc. etc. If you are still interested in a solution, you can contact me again and I am happy to consult in more depth.

How to train an artificial neural network to play Diablo 2 using visual input?

I'm currently trying to get an ANN to play a video game and and I was hoping to get some help from the wonderful community here.
I've settled on Diablo 2. Game play is thus in real-time and from an isometric viewpoint, with the player controlling a single avatar whom the camera is centered on.
To make things concrete, the task is to get your character x experience points without having its health drop to 0, where experience point are gained through killing monsters. Here is an example of the gameplay:
Now, since I want the net to operate based solely on the information it gets from the pixels on the screen, it must learn a very rich representation in order to play efficiently, since this would presumably require it to know (implicitly at least) how divide the game world up into objects and how to interact with them.
And all of this information must be taught to the net somehow. I can't for the life of me think of how to train this thing. My only idea is have a separate program visually extract something innately good/bad in the game (e.g. health, gold, experience) from the screen, and then use that stat in a reinforcement learning procedure. I think that will be part of the answer, but I don't think it'll be enough; there are just too many levels of abstraction from raw visual input to goal-oriented behavior for such limited feedback to train a net within my lifetime.
So, my question: what other ways can you think of to train a net to do at least some part of this task? preferably without making thousands of labeled examples.
Just for a little more direction: I'm looking for some other sources of reinforcement learning and/or any unsupervised methods for extracting useful information in this setting. Or a supervised algorithm if you can think of a way of getting labeled data out of a game world without having to manually label it.
UPDATE(04/27/12):
Strangely, I'm still working on this and seem to be making progress. The biggest secret to getting a ANN controller to work is to use the most advanced ANN architectures appropriate to the task. Hence I've been using a deep belief net composed of factored conditional restricted Boltzmann machines that I've trained in an unsupervised manner (on video of me playing the game) before fine tuning with temporal difference back-propagation (i.e. reinforcement learning with standard feed-forward ANNs).
Still looking for more valuable input though, especially on the problem of action selection in real-time and how to encode color images for ANN processing :-)
UPDATE(10/21/15):
Just remembered I asked this question back-in-the-day, and thought I should mention that this is no longer a crazy idea. Since my last update, DeepMind published their nature paper on getting neural networks to play Atari games from visual inputs. Indeed, the only thing preventing me from using their architecture to play, a limited subset, of Diablo 2 is the lack of access to the underlying game engine. Rendering to the screen and then redirecting it to the network is just far too slow to train in a reasonable amount of time. Thus we probably won't see this sort of bot playing Diablo 2 anytime soon, but only because it'll be playing something either open-source or with API access to the rendering target. (Quake perhaps?)

I can see that you are worried about how to train the ANN, but this project hides a complexity that you might not be aware of. Object/character recognition on computer games through image processing it's a highly challenging task (not say crazy for FPS and RPG games). I don't doubt of your skills and I'm also not saying it can't be done, but you can easily spend 10x more time working on recognizing stuff than implementing the ANN itself (assuming you already have experience with digital image processing techniques).
I think your idea is very interesting and also very ambitious. At this point you might want to reconsider it. I sense that this project is something you are planning for the university, so if the focus of the work is really ANN you should probably pick another game, something more simple.
I remember that someone else came looking for tips on a different but somehow similar project not too long ago. It's worth checking it out.
On the other hand, there might be better/easier approaches for identifying objects in-game if you're accepting suggestions. But first, let's call this project for what you want it to be: a smart-bot.
One method for implementing bots accesses the memory of the game client to find relevant information, such as the location of the character on the screen and it's health. Reading computer memory is trivial, but figuring out exactly where in memory to look for is not. Memory scanners like Cheat Engine can be very helpful for this.
Another method, which works under the game, involves manipulating rendering information. All objects of the game must be rendered to the screen. This means that the locations of all 3D objects will eventually be sent to the video card for processing. Be ready for some serious debugging.
In this answer I briefly described 2 methods to accomplish what you want through image processing. If you are interested in them you can find more about them on Exploiting Online Games (chapter 6), an excellent book on the subject.

UPDATE 2018-07-26: That's it! We are now approaching the point where this kind of game will be solvable! Using OpenAI and based on the game DotA 2, a team could make an AI that can beat semi-professional gamers in a 5v5 game. If you know DotA 2, you know this game is quite similar to Diablo-like games in terms of mechanics, but one could argue that it is even more complicated because of the team play.
As expected, this was achieved thanks to the latest advances in reinforcement learning with deep learning, and using open game frameworks like OpenAI which eases the development of an AI since you get a neat API and also because you can accelerate the game (the AI played the equivalent of 180 years of gameplay against itself everyday!).
On the 5th of August 2018 (in 10 days!), it is planned to pit this AI against top DotA 2 gamers. If this works out, expect a big revolution, maybe not as mediatized as the solving of the Go game, but it will nonetheless be a huge milestone for games AI!
UPDATE 2017-01: The field is moving very fast since AlphaGo's success, and there are new frameworks to facilitate the development of machine learning algorithms on games almost every months. Here is a list of the latest ones I've found:
OpenAI's Universe: a platform to play virtually any game using machine learning. The API is in Python, and it runs the games behind a VNC remote desktop environment, so it can capture the images of any game! You can probably use Universe to play Diablo II through a machine learning algorithm!
OpenAI's Gym: Similar to Universe but targeting reinforcement learning algorithms specifically (so it's kind of a generalization of the framework used by AlphaGo but to a lot more games). There is a course on Udemy covering the application of machine learning to games like breakout or Doom using OpenAI Gym.
TorchCraft: a bridge between Torch (machine learning framework) and StarCraft: Brood War.
pyGTA5: a project to build self-driving cars in GTA5 using only screen captures (with lots of videos online).
Very exciting times!
IMPORTANT UPDATE (2016-06): As noted by OP, this problem of training artificial networks to play games using only visual inputs is now being tackled by several serious institutions, with quite promising results, such as DeepMind Deep-Qlearning-Network (DQN).
And now, if you want to get to take on the next level challenge, you can use one of the various AI vision game development platforms such as ViZDoom, a highly optimized platform (7000 fps) to train networks to play Doom using only visual inputs:
ViZDoom allows developing AI bots that play Doom using only the visual information (the screen buffer). It is primarily intended for research in machine visual learning, and deep reinforcement learning, in particular.
ViZDoom is based on ZDoom to provide the game mechanics.
And the results are quite amazing, see the videos on their webpage and the nice tutorial (in Python) here!
There is also a similar project for Quake 3 Arena, called Quagents, which also provides easy API access to underlying game data, but you can scrap it and just use screenshots and the API only to control your agent.
Why is such a platform useful if we only use screenshots? Even if you don't access underlying game data, such a platform provide:
high performance implementation of games (you can generate more data/plays/learning generations with less time so that your learning algorithms can converge faster!).
a simple and responsive API to control your agents (ie, if you try to use human inputs to control a game, some of your commands may be lost, so you'd also deal with unreliability of your outputs...).
easy setup of custom scenarios.
customizable rendering (can be useful to "simplify" the images you get to ease processing)
synchronized ("turn-by-turn") play (so you don't need your algorithm to work in realtime at first, that's a huge complexity reduction).
additional convenience features such as crossplatform compatibility, retrocompatibility (you don't risk your bot not working with the game anymore when there is a new game update), etc.
To summarize, the great thing about these platforms is that they alleviate much of the previous technical issues you had to deal with (how to manipulate game inputs, how to setup scenarios, etc.) so that you just have to deal with the learning algorithm itself.
So now, get to work and make us the best AI visual bot ever ;)
Old post describing the technical issues of developping an AI relying only on visual inputs:
Contrary to some of my colleagues above, I do not think this problem is intractable. But it surely is a hella hard one!
The first problem as pointed out above is that of the representation of the state of the game: you can't represent the full state with just a single image, you need to maintain some kind of memorization (health but also objects equipped and items available to use, quests and goals, etc.). To fetch such informations you have two ways: either by directly accessing the game data, which is the most reliable and easy; or either you can create an abstract representation of these informations by implementing some simple procedures (open inventory, take a screenshot, extract the data). Of course, extracting data from a screenshot will either have you to put in some supervised procedure (that you define completely) or unsupervised (via a machine learning algorithm, but then it'll scale up a lot the complexity...). For unsupervised machine learning, you will need to use a quite recent kind of algorithms called structural learning algorithms (which learn the structure of data rather than how to classify them or predict a value). One such algorithm is the Recursive Neural Network (not to confuse with Recurrent Neural Network) by Richard Socher: http://techtalks.tv/talks/54422/
Then, another problem is that even when you have fetched all the data you need, the game is only partially observable. Thus you need to inject an abstract model of the world and feed it with processed information from the game, for example the location of your avatar, but also the location of quest items, goals and enemies outside the screen. You may maybe look into Mixture Particle Filters by Vermaak 2003 for this.
Also, you need to have an autonomous agent, with goals dynamically generated. A well-known architecture you can try is BDI agent, but you will probably have to tweak it for this architecture to work in your practical case. As an alternative, there is also the Recursive Petri Net, which you can probably combine with all kinds of variations of the petri nets to achieve what you want since it is a very well studied and flexible framework, with great formalization and proofs procedures.
And at last, even if you do all the above, you will need to find a way to emulate the game in accelerated speed (using a video may be nice, but the problem is that your algorithm will only spectate without control, and being able to try for itself is very important for learning). Indeed, it is well-known that current state-of-the-art algorithm takes a lot more time to learn the same thing a human can learn (even more so with reinforcement learning), thus if can't speed up the process (ie, if you can't speed up the game time), your algorithm won't even converge in a single lifetime...
To conclude, what you want to achieve here is at the limit (and maybe a bit beyond) of current state-of-the-art algorithms. I think it may be possible, but even if it is, you are going to spend a hella lot of time, because this is not a theoretical problem but a practical problem you are approaching here, and thus you need to implement and combine a lot of different AI approaches in order to solve it.
Several decades of research with a whole team working on it would may not suffice, so if you are alone and working on it in part-time (as you probably have a job for a living) you may spend a whole lifetime without reaching anywhere near a working solution.
So my most important advice here would be that you lower down your expectations, and try to reduce the complexity of your problem by using all the information you can, and avoid as much as possible relying on screenshots (ie, try to hook directly into the game, look for DLL injection), and simplify some problems by implementing supervised procedures, do not let your algorithm learn everything (ie, drop image processing for now as much as possible and rely on internal game informations, later on if your algorithm works well, you can replace some parts of your AI program with image processing, thus gruadually attaining your full goal, for example if you can get something to work quite well, you can try to complexify your problem and replace supervised procedures and memory game data by unsupervised machine learning algorithms on screenshots).
Good luck, and if it works, make sure to publish an article, you can surely get renowned for solving such a hard practical problem!

The problem you are pursuing is intractable in the way you have defined it. It is usually a mistake to think that a neural network would "magically" learn a rich reprsentation of a problem. A good fact to keep in mind when deciding whether ANN is the right tool for a task is that it is an interpolation method. Think, whether you can frame your problem as finding an approximation of a function, where you have many points from this function and lots of time for designing the network and training it.
The problem you propose does not pass this test. Game control is not a function of the image on the screen. There is a lot of information the player has to keep in memory. For a simple example, it is often true that every time you enter a shop in a game, the screen looks the same. However, what you buy depends on the circumstances. No matter how complicated the network, if the screen pixels are its input, it would always perform the same action upon entering the store.
Besides, there is the problem of scale. The task you propose is simply too complicated to learn in any reasonable amount of time. You should see aigamedev.com for how game AI works. Artitificial Neural Networks have been used successfully in some games, but in very limited manner. Game AI is difficult and often expensive to develop. If there was a general approach of constructing functional neural networks, the industry would have most likely seized on it. I recommend that you begin with much, much simpler examples, like tic-tac-toe.

Seems like the heart of this project is exploring what is possible with an ANN, so I would suggest picking a game where you don't have to deal with image processing (which from other's answers on here, seems like a really difficult task in a real-time game). You could use the Starcraft API to build your bot, they give you access to all relevant game state.
http://code.google.com/p/bwapi/

As a first step you might look at the difference of consecutive frames. You have to distinguish between background and actual monster sprites. I guess the world may also contain animations. In order to find those I would have the character move around and collect everything that moves with the world into a big background image/animation.
You could detect and and identify enemies with correlation (using FFT). However if the animations repeat pixel-exact it will be faster to just look at a few pixel values. Your main task will be to write a robust system that will identify when a new object appears on the screen and will gradually all the frames of the sprite frame to a database. Probably you have to build models for weapon effects as well. Those can should be subtracted so that they don't clutter your opponent database.

Well assuming at any time you could generate a set of 'outcomes' (might involve probabilities) from a set of all possible 'moves', and that there is some notion of consistency in the game (eg you can play level X over and over again), you could start with N neural networks with random weights, and have each of them play the game in the following way:
1) For every possible 'move', generate a list of possible 'outcomes' (with associated probabilities)
2) For each outcome, use your neural network to determine an associated 'worth' (score) of the 'outcome' (eg a number between -1 and 1, 1 being the best possible outcome, -1 being the worst)
3) Choose the 'move' leading to the highest prob * score
4) If the move led to a 'win' or 'lose', stop, otherwise go back to step 1.
After a certain amount of time (or a 'win'/'lose'), evaluate how close the neural network was to the 'goal' (this will probably involve some domain knowledge). Then throw out the 50% (or some other percentage) of NNs that were farthest away from the goal, do crossover/mutation of the top 50%, and run the new set of NNs again. Continue running until a satisfactory NN comes out.

I think your best bet would be a complex architecture involving a few/may networks: i.e. one recognizing and responding to items, one for the shop, one for combat (maybe here you would need one for enemy recognition, one for attacks), etc.
Then try to think of the simplest possible Diablo II gameplay, probably a Barbarian. Then keep it simple at first, like Act I, first area only.
Then I guess valuable 'goals' would be disappearance of enemy objects, and diminution of health bar (scored inversely).
Once you have these separate, 'simpler' tasks taken care of, you can use a 'master' ANN to decide which sub-ANN to activate.
As for training, I see only three options: you could use the evolutionary method described above, but then you need to manually select the 'winners', unless you code a whole separate program for that. You could have the networks 'watch' someone play. Here they will learn to emulate a player or group of player's style. The network tries to predict the player's next action, gets reinforced for a correct guess, etc. If you actually get the ANN you want this could be done with video gameplay, no need for actual live gameplay. Finally you could let the network play the game, having enemy deaths, level ups, regained health, etc. as positive reinforcement and player deaths, lost health, etc. as negative reinforcement. But seeing how even a simple network requires thousands of concrete training steps to learn even simple tasks, you would need a lot of patience for this one.
All in all your project is very ambitious. But I for one think it could 'in theory be done', given enough time.
Hope it helps and good luck!

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart