DL4J - When using a ComputationGraph, is it possible to get the Class labels from it? - deeplearning4j

I saw how to do this from a DataSet object, and I saw a setLabel method, and I saw a getLabelMaskArrays, but none of these are what I'm looking for.
Am I just blind or is there not a way?
Thanks

Masking is for variable-length time series in RNNs. Most of the time you don't need it. Our built-in sequence dataset iterators also tend to handle these cases. For more details, see our RNN page: https://deeplearning4j.org/usingrnns

Related

BERT Certainty (iOS)

I am currently integrating the BERT model listed on https://developer.apple.com/machine-learning/models/#text into an iOS application and have had difficulty removing answers that have low certainty.
I have used the sample code found at the link above but because I wanted to answer questions based on larger volumes of text, I loop over an array of paragraphs and predict an answer for each one. However, the model does not return nil or "No Answer" if an answer is not found and instead returns a (seemingly) random substring. I suppose what I am trying to ask is: is it possible to access the certainty of BERT's response to filter out unlikely results? Or is there another way to get BERT to only return results above a set certainty threshold?
After hours of searching, I've now found a solution. Ironically it only took three lines of code, but here it is anyway:
if bestSum < 7.5 {
    return nil
}
I implemented this in the findBestLogitPair() method in the BERTOutput.swift file as provided in Apple's sample code for text analysis using BERT. I have now discovered that the word logit does kind of mean probability in statistics - but being a programmer, I had no idea!
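A note for readers new to the term: a logit is a raw, unnormalized score rather than a probability, so a cutoff such as 7.5 only makes sense on this particular model's output scale. Below is a minimal Python sketch of the same thresholding idea. It does not reproduce Apple's Swift sample; `best_span`, its parameters, and the assumption that the score is the sum of a start logit and an end logit are all hypothetical stand-ins.

```python
from typing import List, Optional, Tuple


def best_span(start_logits: List[float],
              end_logits: List[float],
              threshold: float = 7.5,
              max_len: int = 30) -> Optional[Tuple[int, int]]:
    """Return the (start, end) token pair with the highest summed logit.

    Returns None when even the best pair scores below `threshold`,
    which is how a "no answer" result is signalled in this sketch.
    """
    best_sum = float("-inf")
    best = None
    for i, start_score in enumerate(start_logits):
        # The end token must not come before the start token.
        for j in range(i, min(i + max_len, len(end_logits))):
            total = start_score + end_logits[j]
            if total > best_sum:
                best_sum, best = total, (i, j)
    if best is None or best_sum < threshold:
        return None  # low-confidence answer: filter it out
    return best
```

The threshold has to be tuned empirically against paragraphs that are known to contain no answer.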

How to recover a valuation from a satisfiable formula, a question about model

I'm using Z3 with the ML interface. I have created a formula
f(x_i)
that is satisfiable, according to the solver
Solver.mk_simple_solver ctx.
The problem is: I can get a model, but it gives me values for only some of the variables in the formula, not all of them (some of my Model.get_const_interp_e calls return None).
How is it possible that the model gives me only some of the x_i? In my understanding, if the model works for one of the values, it means that the formula was satisfiable (in my case, it is), and so all the values can be given...
There is something I don't understand.
Thanks for reading!
You should always post full examples so people can help with actual coding issues; without seeing your actual code, it's impossible to know what might be the actual reason.
Having said that, this sounds very much like the following question: Why Z3Py does not provide all possible solutions. Perhaps the answer given there will help you.
Long story short: Z3 models will only contain values for variables that matter for the model. For anything that is not explicitly assigned, any value will do. There are ways to get "full" models, as explained in that answer, and I'm sure this is also possible from the ML interface.

Missing Values in WEKA output

I'm trying to compare J48 and MLP on a variety of datasets using WEKA. One of these is: https://archive.ics.uci.edu/ml/datasets/primary+tumor. I have converted this to CSV form which can be easily imported into WEKA. You can download this file here: https://ufile.io/8nj13
I used the "numeric to nominal" on the class and all the attributes to fit the natural structure of the data. However, when I ran J48 (and MLP), I got a bunch of question marks "?" in my output, presumably due to not having enough observations/instances of the appropriate type.
How can I get around this? I'm sure there must be a filter for this kind of thing. I've attached a picture below.
The detailed accuracy table is displaying a question mark since no instance was actually classified as that specific class. This for example means that since no instance was classified as class 16, WEKA can not provide you with details regarding said class 16 classifications. This image might help you understand.
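To see why such a value is undefined rather than zero, here is a minimal Python sketch (not WEKA's code; `per_class_precision` is a hypothetical helper). Precision for a class divides the correct predictions of that class by all predictions of that class, so a class that is never predicted gives 0/0.

```python
from collections import Counter
from typing import Dict, List, Optional


def per_class_precision(actual: List[str],
                        predicted: List[str]) -> Dict[str, Optional[float]]:
    """Precision per class; None where the class was never predicted."""
    predicted_counts = Counter(predicted)
    correct_counts = Counter(a for a, p in zip(actual, predicted) if a == p)
    result = {}
    for label in sorted(set(actual) | set(predicted)):
        n_predicted = predicted_counts[label]
        # 0 predictions means 0/0: report "undefined" instead of a number.
        result[label] = correct_counts[label] / n_predicted if n_predicted else None
    return result
```

With actual labels `["a", "a", "b", "c"]` and predictions `["a", "b", "b", "a"]`, class `c` is never predicted, so its precision comes back as `None`, the analogue of the "?" in the table.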
Regarding the number of instances of each class, you can use the ClassBalancer filter, found at weka/filters/supervised/instance/ClassBalancer. This should help balance out the class distribution.
Also note that your dataset contains some missing values. This can be handled either by discarding the instances with missing data or by running the ReplaceMissingValues filter, found at weka/filters/unsupervised/attribute/ReplaceMissingValues.

Deap: Want to know the generation that created the best individual

I'm running a genetic algorithm program and can find the best individual at the end of the run (hof[0]), but I want to know which generation produced it. Are there any attributes of hof[0] that would help me print the individual and the generation that created it?
I tried looking at the manuals and Google for answers but could not find it anywhere.
I also couldn't find a list of the attributes of individuals that I could print. Can someone point me to the right link and documentation for that?
Thanks
This DEAP post suggests tracking the logbook, or explicitly adding the generation to the individual along with the fitness:
https://groups.google.com/g/deap-users/c/r7fZbMwHg3I/m/BAzHh2ogBAAJ
For the latter:
If you are working with a local copy of the algorithm (recommended if you are going beyond a tutorial, as something always comes up, like adding plotting or this very question), then you can modify the fitness update lines to resemble:
fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
for ind, fit in zip(invalid_ind, fitnesses):
    ind.fitness.values = fit
    ind.generation = gen  # now we can: print(hof[0].generation)
if halloffame is not None:
    halloffame.update(population)
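The same idea can be sketched without DEAP, so that it runs on the standard library alone. Everything here is a hypothetical stand-in for illustration: the `Individual` class, the toy "one-max" fitness, and the single-bit mutation. The point is only that each individual is tagged with the generation in which it was first evaluated, and that the tag survives when the best individual is copied.

```python
import copy
import random


class Individual(list):
    """A list of bits that can also carry `fitness` and `generation`."""
    fitness = None
    generation = None


def evaluate(ind):
    return sum(ind)  # toy fitness: number of 1 bits


def evolve(n_gen=20, pop_size=10, n_bits=16, seed=0):
    rng = random.Random(seed)
    population = [Individual(rng.randint(0, 1) for _ in range(n_bits))
                  for _ in range(pop_size)]
    best = None
    for gen in range(n_gen + 1):
        if gen > 0:
            children = []
            for parent in population:
                child = Individual(parent)  # copies genes only, not fitness
                flip = rng.randrange(n_bits)
                child[flip] = 1 - child[flip]
                children.append(child)
            population = population + children
        for ind in population:
            if ind.fitness is None:  # only individuals created this generation
                ind.fitness = evaluate(ind)
                ind.generation = gen  # tag with the generation that produced it
        # Keep the fittest pop_size individuals (parents and children compete).
        population = sorted(population, key=lambda i: i.fitness,
                            reverse=True)[:pop_size]
        if best is None or population[0].fitness > best.fitness:
            best = copy.deepcopy(population[0])  # the tag is copied too
    return best
```

After `best = evolve()`, `best.generation` holds the generation in which the best individual was created.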
There is no built-in way to do this (yet, to the best of my knowledge), and implementing it would probably be quite a large task. The simplest approach (simplest in thought, not in implementation) would be to change the individual to be a tuple, where tup[0] is the individual and tup[1] is the generation it was produced in, or something similar.
If you're looking for a hacky way, you could maybe try writing the children of each generation to a text file and cross-checking your final solution with the text file; but other than that I'm not sure.
You could always try posting on their Google Group, though it can take a couple of days for a reply.
Good luck!

What are some good methods to find the "relatedness" of two bodies of text?

Here's the problem -- I have a few thousand small text snippets, anywhere from a few words to a few sentences - the largest snippet is about 2k on disk. I want to be able to compare each to each, and calculate a relatedness factor so that I can show users related information.
What are some good ways to do this? Are there known algorithms for doing this that are any good, are there any GPL'd solutions, etc?
I don't need this to run in realtime, as I can precalculate everything. I'm more concerned with getting good results than runtime.
I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.
These articles on semantic relatedness and semantic similarity may be helpful. And this SO question about Latent Semantic Analysis.
You could also look into Soundex for words that "sound alike" phonetically.
I've never used it, but you might want to look into Levenshtein distance
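For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch in Python follows (the function name is my own). Note that it measures character-level similarity, so it suits near-duplicate strings better than topical relatedness.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using two rows of the DP table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` returns 3.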
Jeff talked about something like this on the podcast (episode 32): it is how the Related questions listed on the right side here are found.
One big tip was to remove all common words, like "the" "and" "this" etc. This will leave you with more meaningful words to compare.
And here is a similar question: Is there an algorithm that tells the semantic similarity of two phrases
This is quite doable for reasonably large texts, but harder for smaller texts.
I did it once like this, and it worked pretty well:
Filter all "general" words (like a, an, the, in, etc...) (filters about 10-30% of the words)
Count the frequencies of the remaining words and store the top x most frequent words; these are your topics.
As an extra step you can create groups of 2/3/4 subsequent words and compare them with the groups in other texts. I used it as a measure for plagiarism.
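The steps above could be sketched in Python roughly as follows. The stop-word list is a tiny illustrative one, and two choices are my own assumptions rather than part of the answer: the word groups are built after stop-word filtering, and Jaccard overlap (shared items divided by all items) is used as the comparison measure.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "in", "and", "this", "of", "to", "is", "on"}


def tokens(text):
    """Lowercased words with the "general" words filtered out."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOP_WORDS]


def top_words(text, x=5):
    """The x most frequent remaining words: the text's "topics"."""
    return {word for word, _ in Counter(tokens(text)).most_common(x)}


def ngrams(text, n=2):
    """Set of groups of n subsequent words."""
    toks = tokens(text)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def jaccard(s1, s2):
    """Shared items divided by all items; 0.0 if both sets are empty."""
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)


def relatedness(t1, t2, x=5, n=2):
    return {"topics": jaccard(top_words(t1, x), top_words(t2, x)),
            "ngrams": jaccard(ngrams(t1, n), ngrams(t2, n))}
```

Since everything can be precalculated, the token sets can be computed once per snippet and only the `jaccard` calls repeated for each pair.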
See Manning and Raghavan course notes about MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
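As a rough illustration of MinHashing (a sketch under my own assumptions, not the code from those notes): each text is reduced to a set of word shingles, and for each of several hash functions only the minimum hash value over the set is kept. The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying sets. Salted `hashlib.blake2b` stands in here for a family of independent hash functions.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-word shingles (at least one, even for very short texts)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k])
            for i in range(max(len(words) - k + 1, 1))}


def _hash(seed, item):
    """64-bit hash of item; each seed acts as a different hash function."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """One minimum per hash function; `items` must be non-empty."""
    return [min(_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

The benefit for a few thousand snippets is that each snippet is hashed once, and every pairwise comparison then works on short fixed-length signatures instead of the full texts.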
This book may be relevant.
Edit: here is a related SO question
Phonetic algorithms
The article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library in SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms.
I ended up mostly using Jaro Winkler to match on names. Here's more information where I asked about matching names on SO: Matching records based on Person Name
A few algorithms based on Levenshtein Distance are also available in the SimMetrics library and would probably be useful in your application.
