Customizing the Named Entity Recognition model in Azure ML - named-entity-recognition

Can we customize the Named Entity Recognition (NER) model in Azure ML Studio with a separate training dataset? I want to find non-English names in a text. (The training dataset contains the set of names that will be used for training.)

Unfortunately, this module's ability to perform NER with a custom set of entities is planned for the future, but not currently available.
If you're familiar with Python and willing to put in the extra footwork, you might consider using the Natural Language Toolkit (NLTK). Sujit Pal has a nice blog post and sample code describing the creation of a custom NER with that package. You may be able to train an NLTK NER model and apply it to your data of interest from within an Execute Python Script module on Azure ML.

Related

HOG and LBP on weka

I'm new to the subject of ML so I apologize as my questions may seem too basic.
I have an image dataset and my supervisor asked me to do feature extraction using HOG and LBP filters. So far I have been working with Weka, and I couldn't find any useful tutorials on how to implement these filters in Weka. Is it possible? If not, how else can I implement these filters to extract features from my dataset?
Weka
You could use and/or extend the imageFilter Weka package. It uses LIRE under the hood, which has a range of feature extraction methods implemented.
ADAMS
If you don't mind using another framework, then you could make use of the plugins for LIRE in ADAMS. Its adams-imaging module also offers support for a range of LIRE feature generators (download either the adams-annotator or adams-base-all snapshot).
Using its workflow engine, you can run the flow adams-imaging-feature_generation, which generates PHOG and LocalBinaryPatterns features from a range of images and displays them as a spreadsheet. You could use this flow as a basis and turn it into one that lets you select images interactively and then saves the features as a CSV or ARFF file.
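If neither package fits, the basic LBP descriptor is simple enough to compute by hand. The following is only a minimal sketch in plain Python, not what LIRE or ADAMS do internally: it assumes a grayscale image given as a list of rows of integers, uses the plain 8-neighbour variant, and skips border pixels. The function names and the bit order are choices made for this illustration.

```python
def lbp_codes(image):
    """Return the 8-neighbour LBP code of every interior pixel.

    `image` is assumed to be a 2-D list of grayscale intensities.
    """
    # Neighbour offsets, clockwise from the top-left pixel.
    # The bit order is a convention picked for this sketch.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(image), len(image[0])
    codes = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            row.append(code)
        codes.append(row)
    return codes


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes, usable as a feature vector."""
    hist = [0] * 256
    for row in lbp_codes(image):
        for code in row:
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


if __name__ == "__main__":
    tiny = [[9, 0, 0],
            [0, 5, 0],
            [0, 0, 0]]
    print(lbp_codes(tiny))
```

Each image then yields one 256-value row, which can be written to CSV and loaded into Weka like any other numeric dataset.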

Does AutoML accept external models?

I used random search and got the best hyperparameters for my model. Can I pass that model to AutoML?
Does AutoML do the random search for the best hyperparameters by itself, or is there something I need to pass?
I presume you're referring to Google Cloud AutoML. It is a cloud-based Machine Learning (ML) platform that offers a no-code approach to building data-driven solutions. AutoML was designed to let both newcomers and experienced machine learning engineers build custom models.
For newcomers, you could use Vertex AI (fully automated) to build an ML model.
For experienced ML engineers, you could also use AutoML Tabular to build a custom model, with the ability to select a model and input the selected hyperparameters.
You can read more details from here

Tensorflow Object Detection API

I decided to take a dip into ML and, with a lot of trial and error, was able to create a model using TensorFlow's Inception.
To take this a step further, I want to use their Object Detection API. Their input preparation instructions reference the Pascal VOC 2012 dataset, but I want to train on my own dataset.
Does this mean I need to set up my datasets in either the Pascal VOC or the Oxford-IIIT format? If yes, how do I go about doing this?
If no (my instinct says this is the case), what are the alternatives for using TensorFlow object detection with my own datasets?
Side note: I know that my trained Inception model can't be used for localization because it's a classifier.
Edit:
For those still looking to achieve this, here is how I went about doing it.
The training jobs in the Tensorflow Object Detection API expect to get TFRecord files with certain fields populated with ground-truth data.
You can either set up your data in the same format as the Pascal VOC or Oxford-IIIT examples, or you can just directly create the TFRecord files ignoring the XML formats.
In the latter case, the create_pet_tf_record.py or create_pascal_tf_record.py scripts are likely to still be useful as a reference for which fields the API expects to see and what format they should take. Currently we do not provide a tool that creates these TFRecord files generally, so you will have to write your own.
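To give a rough idea of what such a conversion involves, here is a sketch that reads one Pascal VOC style XML annotation and collects the per-image fields. It is an illustration only: the key names below are the ones I recall from create_pascal_tf_record.py and should be checked against that script, `voc_to_fields` and `label_map` are hypothetical names, and the result still has to be combined with the encoded image bytes and wrapped in a tf.train.Example before it can be written to a TFRecord file.

```python
import xml.etree.ElementTree as ET


def voc_to_fields(xml_text, label_map):
    """Parse a Pascal VOC style annotation into a dict of per-image fields.

    `label_map` (hypothetical) maps class names to integer ids.
    Box coordinates are normalised to [0, 1] by the image size.
    """
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.find("width").text)
    height = int(size.find("height").text)

    fields = {
        "image/filename": root.find("filename").text,
        "image/width": width,
        "image/height": height,
        "image/object/bbox/xmin": [],
        "image/object/bbox/xmax": [],
        "image/object/bbox/ymin": [],
        "image/object/bbox/ymax": [],
        "image/object/class/text": [],
        "image/object/class/label": [],
    }
    for obj in root.findall("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        fields["image/object/bbox/xmin"].append(float(box.find("xmin").text) / width)
        fields["image/object/bbox/xmax"].append(float(box.find("xmax").text) / width)
        fields["image/object/bbox/ymin"].append(float(box.find("ymin").text) / height)
        fields["image/object/bbox/ymax"].append(float(box.find("ymax").text) / height)
        fields["image/object/class/text"].append(name)
        fields["image/object/class/label"].append(label_map[name])
    return fields


if __name__ == "__main__":
    sample = """
    <annotation>
      <filename>cat_001.jpg</filename>
      <size><width>200</width><height>100</height></size>
      <object>
        <name>cat</name>
        <bndbox><xmin>50</xmin><ymin>20</ymin><xmax>150</xmax><ymax>80</ymax></bndbox>
      </object>
    </annotation>
    """
    print(voc_to_fields(sample, {"cat": 1}))
```

If your annotations are not in XML at all, only the parsing part changes; the dictionary of fields is the part the training job cares about.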
Besides the TF Object Detection API, you may want to look at OpenCV Haar Cascades. That is where I started with object detection, and if you provide a well-prepared dataset it works pretty well.
There are also many articles and tutorials about creating your own cascades, so it's easy to start.
I was using this blog, and it helped me a lot.

Looking for a dataset that contain string value in Machine Learning

I'm learning Machine Learning with Tensorflow. I've worked with some datasets like the Iris flower data and Boston Housing, but all the values in those datasets were floats.
So I'm looking for a dataset whose values are in string format to practice on. Can you give me some suggestions?
Thanks
Here are two easy places to start:
The Tensorflow website has three very good tutorials on word embeddings, language modeling and sequence-to-sequence models. I don't have enough reputation to link them directly but you can easily find them here. They provide some Tensorflow code for dealing with human language.
Moreover, if you want to build a model from scratch and you only need the dataset, try the NLTK corpora. They are easy to download directly from code.
Facebook's ParlAI project lists a good number of datasets for Natural Language Processing tasks.
IMDB's reviews dataset is a classic example, as are Amazon's reviews for sentiment analysis. If you take a look at the kernels posted on Kaggle you'll get a lot of insights about the dataset and the task.

Disease named entity recognition

I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:
Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.
What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.
Thanks in advance for your answers!
Here are two pipelines that are specifically designed for medical document parsing:
Apache cTAKES
NLM's MetaMap
Both use UMLS, the unified medical language system, and thus require that you have a (free) license. Both are Java and more or less easy to set up.
See http://www.ebi.ac.uk/webservices/whatizit/info.jsf
Whatizit is a text processing system that allows you to do textmining tasks on text. The tasks come defined by the pipelines in the drop down list of the above window and the text can be pasted in the text area.
You could also ask biostars: http://www.biostars.org/show/questions/
There are many tools to do that. Some popular ones:
NLTK (python)
LingPipe (java)
Stanford NER (java)
OpenCalais (web service)
Illinois NER (java)
Most of them come with some predefined models, i.e. they've already been trained on some general datasets (news articles, etc.). However, your texts are pretty specific, so you might want to first build a corpus and re-train one of those tools in order to adjust it to your data.
More simply, as a first test, you can try a dictionary-based approach: design a list of entity names, and perform some exact or approximate matching. For instance, this operation is described in LingPipe's tutorial.
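The dictionary-based idea can be tried in a few lines before committing to any of the tools above. This is a minimal sketch of exact, case-insensitive matching only, not LingPipe's method: the lexicon and its identifiers are made-up placeholders (not real MeSH IDs), and `find_terms` is a hypothetical helper name. A real lexicon would be exported from the controlled vocabulary you map to.

```python
import re


def find_terms(text, lexicon):
    """Return (matched text, start offset, concept id) for each lexicon hit.

    `lexicon` maps a term to a concept id. Matching is exact on whole
    words and ignores case; approximate matching is not attempted here.
    """
    if not lexicon:
        return []
    lookup = {term.lower(): cid for term, cid in lexicon.items()}
    # Longest terms first, so "pulmonary hypertension" is preferred
    # over the shorter "hypertension" when both could match.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(0), m.start(), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    lexicon = {
        "pulmonary hypertension": "EXAMPLE:0001",
        "hypertension": "EXAMPLE:0002",
        "right ventricular failure": "EXAMPLE:0003",
    }
    sentence = (
        "Primary pulmonary hypertension is a progressive disease in which "
        "widespread occlusion of the smallest pulmonary arteries leads to "
        "increased pulmonary vascular resistance, and subsequently right "
        "ventricular failure."
    )
    for hit in find_terms(sentence, lexicon):
        print(hit)
```

Exact matching misses spelling variants and inflections, which is where the approximate matching mentioned above, or one of the trained tools, becomes worthwhile.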
Open Targets has a module for this as part of LINK. It's not meant to be used directly so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for python. For more info, read their blog post.
A bash script that includes, as an example, a lexicon generated from the Disease Ontology:
https://github.com/lasigeBioTM/MER

Resources