I have finished my speech-to-text conversion program, and the results typically show 80-90% confidence in the transcript.
Is there a service that can be used (or is commonly used) to improve the confidence of the transcript?
For example:
I want to search for a name in a directory
.
.
Sunil Chauhan
Sunit Chavan
Sumit Chawhan
.
.
All three of the above names are valid (they all exist in the directory), but Sunit is less common than Sunil or Sumit, and the surnames are almost identical.
A human listener can hear the difference, but how do I deal with the text response from Google speech recognition, which returns the most common first name (Sunil/Sumit) combined with the most common surname (Chauhan) even when the speaker said Sunit Chavan?
Is there an available AI or ML service which can be used in such cases?
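To illustrate how close these candidates already are as plain text, here is a quick sketch using Python's difflib with the names from the example (the script and any threshold you might pick are only illustrative):

import difflib

directory = ["Sunil Chauhan", "Sunit Chavan", "Sumit Chawhan"]
transcript = "Sunit Chavan"  # what the caller actually said

# Similarity ratio (0.0 - 1.0) between the spoken name and each directory entry
for name in directory:
    score = difflib.SequenceMatcher(None, transcript.lower(), name.lower()).ratio()
    print(f"{name}: {score:.2f}")

The scores make the collision concrete: all three entries are close to one another as plain text, so a recognizer that only returns its single best guess cannot disambiguate them on its own.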
I have built a Microsoft Azure ML Studio workspace predictive web service, and I have a scenario where I need to run the service with different training datasets.
I know I can set up multiple web services via Azure ML, each with a different training set attached, but I am trying to find a way to do it all within the same workspace, passing a Web Input Parameter to choose which training set to use.
I have found this article, which describes almost my scenario. However, this article relies on the training dataset that is being pulled from the Load Trained Data module, as having a static endpoint (or blob storage location). I don't see any way to dynamically (or conditionally) change this location based on a Web Input Parameter.
Basically, does Azure ML support a "conditional training data" loading?
Or, might there be a way to combine training datasets, then filter based on the passed Web Input Parameter?
This probably isn't exactly what you need, but hopefully it helps you out.
To combine data sets, you can use the Join Data module.
To filter, that may be accomplished by executing a Python script. Here's an example.
As an example, using the Adult Census Income Binary Classification dataset, the age column has a minimum age of 17.
To filter the data set by age, connect it to an Execute Python Script module; here is the filtering code using the pandas query method.
# The script MUST contain a function named azureml_main,
# which is the entry point for this module.
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # The return value must be a sequence of pandas.DataFrame
    return dataframe1.query("age >= 25")
Looking at the output, the data set has been filtered so that the minimum age is now 25.
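Tying this back to the original question, here is a hedged sketch of the "combine then filter" idea. It assumes the training sets have already been joined into one table with an extra, hypothetical dataset_id column, and that the requested id arrives on the module's second input as a one-row DataFrame (how you route the Web Input Parameter onto that second input depends on your experiment layout):

import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1: the combined training data, with a hypothetical 'dataset_id' column
    # dataframe2: a one-row DataFrame carrying the requested dataset id
    requested_id = dataframe2["dataset_id"].iloc[0]
    # Keep only the rows belonging to the requested training set
    return dataframe1[dataframe1["dataset_id"] == requested_id]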
Sure, you can do that. What you would want is an Execute R Script or SQL Transformation module that determines, based on your input data, which model to use. The flow looks something like this:
Your input data is cleaned/updated/feature-engineered, then passed to two different SQL transforms that route it down one of two paths.
Each path has its own training data.
Note: I am not exactly sure what your use case is, but if it were me, I would instead train two different models on the two different training datasets and just use those models in my web service, rather than training inside the web service, as that would likely be quite slow.
I went through tutorials on implementing recommender systems, and most of them take only one variable (a rank/rating).
I want to implement an item-based recommender system that takes multiple variables.
E.g.: let's say an item (a bar) has the following variables (values ranging from -10 to +10, to express opposite polarities):
- price (cheap to expensive)
- environment (casual to fine)
- age range (young to adults)
Now I want to recommend items (bars) based on the list of bars registered in the user's history.
Is this kind of "multi-dimensional recommender system" possible to implement using Mahout or any other framework?
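Before reaching for a framework, the core similarity idea can be sketched in plain Python/numpy. This is only an illustration with made-up bar vectors over the three variables above, not any particular library's recommender:

import numpy as np

# Hypothetical bars described by (price, environment, age range), each in [-10, 10]
bars = {
    "Bar A": np.array([-8.0, -5.0, -7.0]),   # cheap, casual, young crowd
    "Bar B": np.array([ 6.0,  7.0,  5.0]),   # expensive, fine, adult crowd
    "Bar C": np.array([-6.0, -3.0, -5.0]),
    "Bar D": np.array([ 2.0,  8.0,  6.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Bars already in the user's history
history = ["Bar A"]

# Score every unseen bar by its average similarity to the bars in the history
scores = {
    name: float(np.mean([cosine(vec, bars[h]) for h in history]))
    for name, vec in bars.items() if name not in history
}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

Frameworks add scale and smarter similarity measures, but the underlying question (how similar is a candidate item to the items in the user's history, across several dimensions at once) is the same.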
You want the multi-modal, multi-indicator, multi-variable (however you want to describe it) Universal Recommender. It can handle all of this data. We've tested it on real datasets and get a significant boost in precision tests because of what we call "secondary indicators".
Good intuition. Give the UR a look at blog.actionml.com and check out the slides in one of the posts. Code here: https://github.com/actionml/template-scala-parallel-universal-recommendation/tree/v0.3.0 It is built on the new Spark version of Mahout: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
I have a data set as below,
Code | Description
AB123 | Cell Phone
B467A | Mobile Phone
12345 | Telephone
WP9876 | Wireless Phone
SP7654 | Satellite Phone
SV7608 | Sedan Vehicle
CC6543 | Car Coupe
I need to create an automated grouping based on the Code and Description. Let's assume I already have a large amount of such data classified into groups 0-99. Whenever a new record comes in with a Code and Description, the machine learning algorithm needs to classify it automatically based on the previously available data.
Code | Description | Group
AB123 | Cell Phone | 1
B467A | Mobile Phone | 1
12345 | Telephone | 1
WP9876 | Wireless Phone | 1
SP7654 | Satellite Phone | 1
SV7608 | Sedan Vehicle | 2
CC6543 | Car Coupe | 3
Can this be achieved with some level of accuracy? Currently the process is entirely manual. If there are any ideas or references on this, please share them.
Try reading up on Supervised Learning. You need to provide labels for your training data so that the algorithms know what the correct answers are and can generate appropriate models for you.
Then you can "predict" the output classes for your new incoming data using the generated model(s).
Finally, you may wish to circle back and check the accuracy of the predicted results. If you then enter the labels for the newly received and predicted data, that data can be used for further training of your model(s).
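As a concrete, heavily simplified sketch of that workflow in Python (assuming scikit-learn is available, and using the toy rows from the question as the labelled data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Labelled examples (Description -> Group), taken from the question
descriptions = ["Cell Phone", "Mobile Phone", "Telephone", "Wireless Phone",
                "Satellite Phone", "Sedan Vehicle", "Car Coupe"]
groups = [1, 1, 1, 1, 1, 2, 3]

# Character n-gram bag-of-words features plus a simple classifier
model = make_pipeline(CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
                      MultinomialNB())
model.fit(descriptions, groups)

# Predict the group of a new, unseen record
print(model.predict(["Smart Phone"]))

With only seven rows this is obviously toy-sized; the point is the train/predict/re-train loop described above. The Code column could be added as extra features once you decide how to encode it.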
Yes, it's possible with supervised learning. You pick a model and "train" it with the data you already have. The model/algorithm then "generalizes" from the known data to previously unseen data.
What you call a group would be called a class or "label", which needs to be predicted from two input features (code and description). Whether you feed these features in directly or preprocess them into more abstract features that suit the algorithm better depends on which algorithm you choose.
If you have no experience with machine learning, you might start by learning the basics while testing already-implemented algorithms in tools such as RapidMiner, Weka or Orange.
I don't think machine learning methods are the most appropriate solution to this problem, because text-based machine learning algorithms tend to be quite complicated. From the examples you provided I'm not sure how
I think the simplest way of solving, or attempting to solve, this problem is the following, which can be implemented in many free programming languages, such as Python. Each description can be stored as a string. You could store all the substrings of all the strings that belong to a particular group in a list (i.e. if 'Phone' is your string, the substrings are 'P', 'h', 'Ph', ..., 'e'; see this question for how to implement it in Python: Substrings of a string using Python). Then, across all the substrings you have stored, find the ones that are unique to a certain group, and keep only the substrings over a certain length (say 3 characters, to get rid of random letter concatenations) as your classification criteria. When new data comes in, check whether its description contains a substring that is unique to a certain group. With this, for instance, you would be able to classify all objects in group 1 based on whether their description contains the word 'phone'.
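A rough Python sketch of that approach (the helper names are made up, and the 3-character minimum is just the example value from above):

from collections import defaultdict

def substrings(text, min_len=3):
    # All substrings of text with at least min_len characters, lower-cased
    text = text.lower()
    return {text[i:j] for i in range(len(text))
            for j in range(i + min_len, len(text) + 1)}

def build_signatures(training):
    # training: list of (description, group) pairs.
    # Returns the substrings that occur in exactly one group.
    seen_in = defaultdict(set)
    for description, group in training:
        for sub in substrings(description):
            seen_in[sub].add(group)
    return {sub: groups.pop() for sub, groups in seen_in.items() if len(groups) == 1}

def classify(description, signatures):
    # Return the group whose unique substring appears in the description, if any
    for sub, group in signatures.items():
        if sub in description.lower():
            return group
    return None

training = [("Cell Phone", 1), ("Mobile Phone", 1), ("Telephone", 1),
            ("Sedan Vehicle", 2), ("Car Coupe", 3)]
signatures = build_signatures(training)
print(classify("Satellite Phone", signatures))  # 1, via substrings like "phone" that only occur in group 1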
It's hard to provide code tailored exactly to your problem without knowing which languages you are familiar with or are feasible to use, but I hope this helps anyway. Yves
I would like to perform a binary classification of documents (.txt, .pdf, .jpeg, .img, etc.) into two categories: printable and non-printable. Essentially our school runs a free printing service for clubs, but the reality is that many clubs abuse the free printing and end up printing their homework, papers, etc., which amounts to thousands of dollars in ink and paper. Thus we would like to take some unsupervised methods to help limit this by determining whether a document is with high probability not club related (e.g. Biophysics paper, there is no biophysics club!).
So this is a very simple binary classification problem. I am not looking for low-level implementation details or which ML algorithms I should use, but rather how I should discover the relevant features that will then be fed to the training, etc.
My first idea was to gather all the documents that students print in the library. The idea is that if you have actual club printing, you'll do it for free at the club printing center rather than pay for it at the library. That would be a massive dataset, assuming every document printed at the library is assigned the non-printable/club material category. Unfortunately, the school is very liberal and opposed to allowing this due to privacy concerns, so it is not really an option without legal risks.
A similar-minded option would be to collect documents that are tied to courses / school work, e.g. course syllabi, available course documents online (homeworks, papers, etc.) and do feature extraction / selection on these. The assumption is that students would be abusing the printing to generally print material relevant to their studies.
While this approach should perform reasonably for .pdf and .txt based documents, I am at a loss as to how to classify image-based documents, besides perhaps using the title of the document and other metadata. A clever violator could simply convert all their text documents to an image format to circumvent the system. However, that is outside the scope of this question and should be saved for future questions / research. For now the scope is just text-based documents.
Note that there are previous questions on topics similar to this, but mine is very specific and I believe it may pose challenges that something like movie review classification might not have to face.
I just wanted to leave a comment, but it ended up way longer than I imagined.
While this is an interesting problem, I'm not sure ML will get you what you need easily.
Firstly, your classification problem is of the type "A vs. the world", and A isn't strictly defined. Unless you know exactly what kind of material the clubs print, you can't really say whether new material belongs to that class or not.
This will prove particularly difficult when you need to assemble a training set large enough to cover whatever can or cannot be printed. Such a task will be extremely tedious, and as you said, you won't have access to what the clubs usually print, so at best you will have a large class imbalance in your training set.
Since the goal is to make the system automated (if there is human interaction anyway, it's faster to check what will be printed than to build an ML algorithm that produces a score a human has to investigate anyway), the number of false positives and false negatives will also be problematic. There will be cases where the clubs won't be able to print things they have the right to print.
As you said, you could simplify the problem greatly by classifying Course Material vs. Not Course Material. For that I would look towards bag-of-words (BoW) features, because some words are more present in papers or course material than elsewhere (anything remotely technical). The number of words, as well as the overall size of the file, also seem like sensible things to extract. The structure is often distinctive too, so it might be a good idea to extract features such as "number of lines with fewer than x words", "number of lines per page", or "number of pictures" (if that's something you can extract from the file); a sketch of such feature extraction follows below.
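As an illustration only, here is a small Python sketch of extracting a few of those structural features from a plain-text document (the feature names and the 5-word threshold are made up for the example; PDFs and images would need separate handling):

def structural_features(text, short_line_words=5):
    # Very rough structural features for a plain-text document
    lines = [line for line in text.splitlines() if line.strip()]
    words = text.split()
    return {
        "num_words": len(words),
        "num_lines": len(lines),
        "num_chars": len(text),
        # Lines with fewer than `short_line_words` words (headings, bullets, equations, ...)
        "short_lines": sum(1 for line in lines if len(line.split()) < short_line_words),
        "avg_words_per_line": len(words) / len(lines) if lines else 0.0,
    }

sample = "Problem Set 3\nDue Friday\n\nProve that every bounded sequence has a convergent subsequence."
print(structural_features(sample))

These would sit alongside the BoW features (for example from something like scikit-learn's CountVectorizer) in whatever classifier you end up training.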
For pictures, the major thing to check would be whether the file is a scan of something (I would guess they often scan and print course-related material); the image format is already a good indication of that, but I don't see other features that would be particularly "course related".
So, for me: if you can't really define one of your two classes precisely, don't go with classification, or reduce the problem to something you can really define (course-related material).
If you are able to compile a "black list" of documents students are not allowed to print, you can then implement a rejection mechanism with several layers.
I would suggest these 3 levels:
1) Compare the md5 of the file they want to print against a database of the md5 hashes of all black-listed documents (a minimal sketch of this level follows below).
2) If 1) is passed, repeat 1) at the page level rather than the document level (perhaps they want to print just a few pages rather than the entire document).
3) If 2) is passed, compare the page they want to print with the pages of the black-listed documents using an image similarity method, like SSIM. If you get a high score between the page they want to print and one of the black-listed items, do not print, and update your md5 database accordingly.
If 3) is passed: print!
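A minimal Python sketch of the first level, assuming the black list has already been hashed into a set (the example hash and file paths are hypothetical):

import hashlib

def md5_of_file(path, chunk_size=8192):
    # MD5 of a file, read in chunks so large documents don't fill memory
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical database of black-listed document hashes
blacklisted_md5s = {"9e107d9d372bb6826bd81d3542a419d6"}

def level_one_check(path):
    # True means the document passes level 1 (its hash is not black-listed)
    return md5_of_file(path) not in blacklisted_md5s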
A few words about SSIM: this method is quite robust to noise, so even a smart student who adds some sort of noise to the image will be caught.
However:
- You have to find a proper way to extract a region of interest (ROI) from the page and from the documents in the database (if the two ROIs are in two different areas of the page, the SSIM score will be low).
- SSIM might be slow! A C implementation is definitely needed here.
- I think SSIM is not rotationally invariant, so the check will fail if they print the page upside down (unless you have a smart way to rotate the page).
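For reference, a minimal sketch of the SSIM comparison using scikit-image (assuming that library is acceptable; both pages must first be rendered to same-sized 8-bit grayscale arrays, which is not shown here, and the 0.9 threshold is arbitrary):

from skimage.metrics import structural_similarity as ssim

def page_matches_blacklist(page, blacklisted_pages, threshold=0.9):
    # page and blacklisted_pages are same-sized 2D uint8 grayscale arrays
    for blacklisted in blacklisted_pages:
        score = ssim(page, blacklisted, data_range=255)
        if score > threshold:
            return True
    return False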
I want to tag text based on the category it belongs to ...
For example ...
"Clutch and gear is monitored using microchip " -> clutch /mechanical , gear/mechanical , microchip / electronic
"software used here to monitor hydrogen levels" -> software/computer , hydrogen / chemistry ..
How can I do this using OpenNLP or other NLP engines?
MY WORK
I tried an NER model, but it needs a large training corpus, which I don't have.
My Need
Are there any ready-made training corpora available for NER or classification? They must contain scientific and engineering words.
If you want to create a set of class labels for an entire sentence, then you will want to use the Doccat lib. With Doccat you get a probability distribution for each chunk of text.
With Doccat, your sample would produce something like this:
"Clutch and gear is monitored using microchip " -> mechanical 0.85847568, electronic 0.374658
With Doccat you will lose the keyword -> class-label mapping, so if you really need that, Doccat might not cut it.
As for NER, OpenNLP has an addon called Modelbuilder-addon that may help you. It is designed to expedite NER model building. You can create a file/list of as many terms for each category as you can think of, then create a file with a bunch of sentences, then use the addon to create an NER model from the seed terms and the sentence file. See this post where I described it before with a code example. You will have to pull the addon down from SVN.