Machine Learning in GATE - machine-learning

In GATE machine learning module 11, they have used an annotation set called 'Key', which has been manually prepared. when I applied that ML technique on my corpus, it didn't work. my data set does not have annotation set name, instead it has a default 'Original markups' name which is not been recognized by JAPE Transducer --> inputAsName property. How to give my annotation set a name so that it can process ML?
I followed this GATE tutorial http://gate.ac.uk/sale/talks/gate-course-may11/track-3/module-11-machine-learning/module-11.pdf

I just followed the same instructions, creating a blank corpus, and populating from the corpus folder in the sample materials they provide. It worked fine for me, I can see the Key annotations set by opening one of the documents and clicking the annotations sets button.
Did you populate a blank corpus by right clicking, populate, and then selecting the folder containing the course files?

Related

Any ideas on why my coreml model created with turicreate isn't working?

Pretty much brand new to ML here. I'm trying to create a hand-detection CoreML model using turicreate.
The dataset I'm using is from https://github.com/aurooj/Hand-Segmentation-in-the-Wild , which provides images of hands from an egocentric perspective, along with masks for the images. I'm following the steps in turicreate's "Data Preparation" (https://github.com/apple/turicreate/blob/master/userguide/object_detection/data-preparation.md) step-by-step to create the SFrame. Checking the contents of the variables throughout this process, there doesn't appear to be anything wrong.
Following data preparation, I follow the steps in the "Introductory Example" section of https://github.com/apple/turicreate/tree/master/userguide/object_detection
I get the hint of an error when turicreate is performing iterations to create the model. There doesn't appear to be any Loss at all, which doesn't seem right.
After the model is created, I try to test it with a test_data portion of the SFrame. The results of these predictions are just empty arrays though, which is obviously not right.
After exporting the model as a CoreML .mlmodel and trying it out in an app, it is unable to recognize anything (not surprisingly).
Me being completely new to model creation, I can't figure out what might be wrong. The dataset seems quite accurate to me. The only changes I made to the dataset were that some of the masks didn't have explicit file extensions (they are PNGs), so I added the .png extension. I also renamed the images to follow turicreate's tutorial formats (i.e. vid4frame025.image.png and vid4frame025.mask.0.png. Again, the SFrame creation process using this data seems correct at each step. I was able to follow the process with turicreate's tutorial dataset (bikes and cars) successfully. Any ideas on what might be going wrong?
I found the problem, and it basically stemmed from my unfamiliarity with Python.
In one part of the Data Preparation section, after creating bounding boxes out of the mask images, each annotation is assigned a 'label' indicating the type of object the annotation is meant to be. My data had a different name format than the tutorial's data, so rather than each annotation having 'label': 'bike', my annotations had 'label': 'vid4frame25`, 'label': 'vid4frame26', etc.
Correcting this such that each annotation has 'label': 'hand' seems to have corrected this (or at least it's creating a legitimate-seeming model so far).

Why is there a discrepancy in the imagenet dataset labels?

Are the labels used for training and the ones used for validation the same? I thought they should be the same; however, there seem to be a discrepancy in the labels that are available online. When I downloaded the imagenet 2012 labels for its validation data from the official website, I get labels that start with kit_fox as the first label, which matches the exact 2012's dataset validation images I downloaded from the official website. This is the example of the labels: https://gist.github.com/aaronpolhamus/964a4411c0906315deb9f4a3723aac57
However, for almost all the pretrained models, including those trained by Google, the imagenet labels they use for training, actually start with tench, tinca tinca instead. See here: https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a
Why is there such a huge discrepancy? Where did the 'tinca tinca' kind of labels come from?
If we use the first label mapping that corresponds to the actual validation images, we face another problem: 2 classes ("Crane" and "maillot") are actually duplicated, i.e. they have the same name but refer to different kind of crane - the mechanical crane and the animal crane - resulting in 100 image in 2 of the classes instead of the supposed 50. If we do not use the first mapping, where is a reliable source of the validation images that correspond to the second label mapping?
I have the same problem in my finetuning. You solve your problem change the name of classes tench, tinca tinca to the synset number. You can find here the mapping

can SPSS regard ordinal measures as producing continuous data?

In SPSS, when defining the measure of a variable, the usual options are "Scale", "Ordinal", and "Nominal" (see image).
However, when using actual dialog boxes to do analyses, SPSS will often ask us to describe whether the data are "Continuous" or "Categorical". E.g., I was watching this video by James Gaskin (a great YouTube teacher by the way), and saw this dialog box (image below).
My Question: In the second image, you can see that the narrator put some "Ordinal" variables in the "Continuous" box. Is it okay to do that? How come?
For most procedures, the treatment of a variable is determined by how you use it. The measurement level is just a reminder, so you can treat a variable however it makes sense.
There are some procedures that automatically determine how to treat a variable based on the measurement level, including CTABLES, the Chart Builder, and TREE, but you can change the level temporarily in the dialog box or in syntax or change it persistently via VARIABLE LEVEL or in the Data Editor. Also, most of the statistical extension commands use the declared measurement level to determine whether a variable is continuous or a factor.

Categorical PCA (CATPCA) in SPSS (23)

I am trying to conduct nonlinear principal component analysis using CATPCA in SPSS. I am following [a tutorial] (http://www.ncbi.nlm.nih.gov/pubmed/22176263) by Linting & Kooij (2012) and did not find that certain steps are straightforward. For the timebeing, my questions are:
How do I get a screeplot within CATPCA. The authors describe it as a necessary step but I can't seem to find it within the CATPCA drop-down menu.
Similarly, the tutorial describes the use of bootstrap confidence interval to test the significance of the factor loadings but the Bootstrap Confidence Ellipses option under the Save menu seems disabled (or I can't seem to activate those). What am I missing?
These are the most pressing questions that I encountered thus far. Thank you.
CATPCA does not produce a scree plot. You can create one manually by copying the eigenvalues out of the Model Summary table in the output, or (if you will need to create a lot of scree plots) you can use the SPSS Output Management System (OMS) to automate pulling the values out of the table and creating the plot.
In order to enable the the Bootstrap Confidence Ellipses controls on the Save subdialog, you need to check "Perform bootstrapping" on the Bootstrap subdialog.
See footnote of the literature (Linting & Kooij,2012 p20). "Eigenvalues are from the bottom row of the Correlations transformed variables table." You can create scree plot from these eigenvalues.

How to tag text based on its category using OpenNLP?

I want to tag text based on the category it belongs to ...
For example ...
"Clutch and gear is monitored using microchip " -> clutch /mechanical , gear/mechanical , microchip / electronic
"software used here to monitor hydrogen levels" -> software/computer , hydrogen / chemistry ..
How to do this using openNLP or other NLP engines.
MY WORKS
I tried NER model , but It needs large number of training corpus which I don't have ?
My Need
Do any ready made training corpus available for NER or classification (it must contains scientific and engineering words).. ?
If you want to create a set of class labels for an entire sentence, then you will want to use the Doccat lib. With Doccat you would get a prob distribution for each chunk of text.
with doccat your sample would produce something like this:
"Clutch and gear is monitored using microchip " -> mechanical 0.85847568, electronic 0.374658
with doocat you will lose the keyword->classlabel mapping, so if you really need it doccat might not cut it.
as for NER, OpenNLP has an addon called Modelbuilder-addon that may help you. It is designed to expedite the creation of NER model building. You can create a file/list of as many of the terms for each category as you can think of, then create a file of a bunch of sentences, then use the addon to create an NER model using the seed terms and the file of sentences. see this post where I described it before with code example. You will have to pull down the addon from SVN.
OpenNLP: foreign names does not get recognized

Resources