Bert model output interpretation - translation

I searched a lot for this but still haven't got a clear idea, so I hope you can help me out.
I am trying to translate German texts to English. I used this code:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")
batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt")["input_ids"]
results = model(batch)
This raised a size error. I fixed it (thanks to the community: https://github.com/huggingface/transformers/issues/5480) by changing the last line of code to:
results = model(input_ids=batch, decoder_input_ids=batch)
Now my output looks like a really long array. What is this output precisely? Are these some sort of word embeddings? And if yes: How shall I go on with converting these embeddings to the texts in the english language? Thanks a lot!

Adding to Timbus's answer,
What is this output precisely? Are these some sort of word embeddings?
results is of type <class 'transformers.modeling_outputs.Seq2SeqLMOutput'> and you can do
results.__dict__.keys()
to check that results contains the following:
dict_keys(['loss', 'logits', 'past_key_values', 'decoder_hidden_states', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_hidden_states', 'encoder_attentions'])
You can read more about this class in the huggingface documentation.
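Of these fields, logits is the one that relates to the translated text. It is not a set of word embeddings but a tensor of shape (batch size, target length, vocabulary size) holding one unnormalized score per vocabulary entry at every decoder position. A minimal pure-Python sketch of how such scores map to tokens follows; the four-word vocabulary and all the numbers are invented for illustration and are not taken from the real model:

```python
import math

# Hypothetical toy vocabulary; the real model has tens of thousands of entries.
VOCAB = ["<pad>", "hello", "world", "</s>"]

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_tokens(logits):
    """Pick the highest-scoring vocabulary entry at each position."""
    return [VOCAB[max(range(len(row)), key=row.__getitem__)] for row in logits]

# One sentence, three decoder positions, four scores per position (invented).
toy_logits = [
    [0.1, 3.2, 0.5, -1.0],
    [0.0, 0.3, 2.9, 0.1],
    [-0.5, 0.2, 0.4, 4.0],
]

probs = [softmax(row) for row in toy_logits]
tokens = greedy_tokens(toy_logits)
print(tokens)
```

Note that with decoder_input_ids=batch the decoder is fed the German token IDs as if they were the translation so far, so the argmax of those logits is unlikely to be a usable translation. Translations are normally produced token by token, each step conditioned on the tokens generated before it.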
How shall I go on with converting these embeddings to the texts in the english language?
To get the English text, use model.generate, which returns token IDs that the tokenizer can decode back into strings:
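predictions = model.generate(batch)
# skip_special_tokens drops markers such as <pad> from the decoded strings
english_text = tokenizer.batch_decode(predictions, skip_special_tokens=True)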
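Putting the pieces together, a hedged end-to-end sketch could look like the following. The helper names translate_batch and load_and_translate are made up for this example. Unlike the code in the question, it keeps the whole tokenizer output rather than only input_ids, so the attention mask is passed along and the model is told explicitly which positions are padding:

```python
def translate_batch(texts, tokenizer, model, max_length=250):
    """Translate a list of strings with a seq2seq tokenizer/model pair."""
    # Keep the full encoding (input_ids and attention_mask), not just input_ids.
    encoded = tokenizer(
        list(texts),
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    # generate() runs the decoder step by step and returns token IDs.
    generated = model.generate(**encoded)
    # Drop padding and end-of-sentence markers when decoding back to text.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


def load_and_translate(texts, name="Helsinki-NLP/opus-mt-de-en"):
    """Load the checkpoint from the question and translate `texts`."""
    # Imported lazily so translate_batch can be tried without the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    return translate_batch(texts, tokenizer, model)
```

Calling load_and_translate(list(data_bert[:100])) would then return a list of English strings, one per input text. For larger datasets it is probably better to loop over slices of the data than to send everything in one batch.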

One possible answer to your question is given here:
https://stackoverflow.com/questions/61523829/how-can-i-use-bert-fo-machine-translation
In practice, the output of BERT is a vector representation for each of your tokens. That output is easy to reuse for other tasks, but harder to use for machine translation.
A good starting point for using a seq2seq model from the transformers library for machine translation is this notebook: https://github.com/huggingface/notebooks/blob/master/examples/translation.ipynb.
It shows how to translate from English to Romanian.

Related

BERT Certainty (iOS)

I am currently integrating the BERT model listed on https://developer.apple.com/machine-learning/models/#text into an iOS application and have had difficulty removing answers that have low certainty.
I have used the sample code found at the link above but because I wanted to answer questions based on larger volumes of text, I loop over an array of paragraphs and predict an answer for each one. However, the model does not return nil or "No Answer" if an answer is not found and instead returns a (seemingly) random substring. I suppose what I am trying to ask is: is it possible to access the certainty of BERT's response to filter out unlikely results? Or is there another way to get BERT to only return results above a set certainty threshold?
After hours of searching, I've now found a solution. Ironically it only took three lines of code, but here it is anyway:
if bestSum < 7.5 {
    return nil
}
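I implemented this in the findBestLogitPair() method in the BERTOutput.swift file provided in Apple's sample code for text analysis using BERT. The value 7.5 is simply a threshold that worked for my data. I have since learned that a logit is a statistical term for an unnormalized score (the log-odds) from which a probability can be derived, so a higher sum means a more confident answer. Being a programmer, I had no idea!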
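The same filtering idea can be sketched outside Swift. The Python example below assumes the model yields one start logit and one end logit per token and that the best answer is the span with the largest sum of the two. The function name best_span, the max_span limit and the example numbers are illustrative assumptions, not Apple's implementation; 7.5 is the threshold from the snippet above and would need tuning for other data:

```python
def best_span(start_logits, end_logits, threshold=7.5, max_span=30):
    """Return (start, end, score) for the best answer span, or None if too weak."""
    best = None
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within max_span tokens.
        for e in range(s, min(s + max_span, len(end_logits))):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    # Reject low-confidence answers instead of returning an arbitrary substring.
    if best is None or best[2] < threshold:
        return None
    return best


# Invented logits: the first paragraph has a clear answer, the second does not.
confident = best_span([0.1, 6.0, 0.2], [0.0, 0.3, 5.5])
unsure = best_span([0.1, 1.0, 0.2], [0.0, 0.3, 1.5])
```

When looping over many paragraphs, the returned score can also be used to rank the surviving answers and keep only the strongest one.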

Elixir/Erlang - Split paragraph into sentences based on the language

In Java there is a class called BreakIterator which lets me pass in a paragraph of text in any language (the language it is written in is known) and splits the text into separate sentences. The magic is that it can take the locale of the text's language as an argument and split the text according to that language's rules (if you look into it, this is actually a very complex issue even in English; it is certainly not a case of 'split by full stops/periods').
Does anybody know how I would do this in Elixir? I can't find anything in a Google search.
I am almost at the point of deploying a very thin public API that does only this basic task and that I can call from Elixir, but this is really not desirable.
Any help would be really appreciated.
The i18n library should be usable for this. I have no experience using it, so going only from the examples provided, something like the following should work (:en is the locale code):
str = :i18n_string.from("some string")
iter = :i18n_iterator.open(:en, :sentence)
sentences = :i18n_string.split(iter, str)
There's also Cldr, which implements a lot of locale-dependent Unicode algorithms directly in Elixir, but it doesn't seem to include iteration in particular at the moment (you may want to raise an issue there).

DL4J - When using a ComputationGraph, is it possible to get the Class labels from it?

I saw how to do this from a DataSet object, and I saw a setLabel method, and I saw a getLabelMaskArrays, but none of these are what I'm looking for.
Am I just blind or is there not a way?
Thanks
Masking is for variable length time series in RNNs. Most of the time you don't need it. Our built in sequence dataset iterators also tend to handle these cases. For more details see our rnn page: https://deeplearning4j.org/usingrnns

How to parse a treebank in Python?

I have several .tree files, each containing more than one tree, and I am trying to parse these files in the easiest way.
When I used
for line in txt.readlines():
I got parsing errors because a line sometimes contains two trees.
How can I separate the trees so that each one is on its own line?
Is there an efficient solution to this problem?
Let the corpus reader take care of the segmentation. If the trees are in Treebank format, this might work by itself:
from nltk.corpus import BracketParseCorpusReader
reader = BracketParseCorpusReader("path/to/corpus", r".*\.tree")
for sent in reader.parsed_sents():
    print(sent)
If this doesn't match your tree format, read the documentation for the options that customize the input.
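If the corpus reader cannot be used, the trees can also be separated by counting brackets. The following is a minimal sketch, assuming Penn-style bracketing in which tokens never contain a literal "(" or ")"; the function name split_trees and the sample string are made up for illustration:

```python
def split_trees(text):
    """Yield each top-level bracketed tree in `text` as a one-line string."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError("unbalanced ')' at offset %d" % i)
            depth -= 1
            if depth == 0:
                # Collapse internal whitespace so the tree fits on one line.
                yield " ".join(text[start:i + 1].split())
    if depth != 0:
        raise ValueError("unbalanced '(' in input")


# Two trees on the first line, the second one continuing onto the next line.
sample = "(S (NP I) (VP (V saw) (NP him))) (S (NP She)\n (VP (V left)))"
trees = list(split_trees(sample))
```

Because the whole file content is scanned rather than individual lines, it does not matter whether a line holds two trees or a tree spans several lines. Each resulting string can then be handed to nltk.Tree.fromstring.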

SPSS split file syntax that worked before is ignored in general linear models

I am trying to run a general linear model analysis using SPSS syntax. I wrote syntax to split my file and then apply GLM. This code used to work perfectly for other variables in the same dataset, but today when I look at the output file I can see that the split command is being ignored. Even the earlier analyses no longer work. Could you please help me with this? The syntax is below. SW_CODE is the variable (coded 0/1) that I would like to split by. Am I missing something?
sort cases by SW_CODE.
split file by SW_CODE.
GLM ATTITUDE_2 BY SW_CODE COND_CODE newfactor
/METHOD = SSTYPE(3)
/INTERCEPT = INCLUDE
/PRINT = DESCRIPTIVE
/CRITERIA = ALPHA(.05)
/EMMEANS=TABLES(newfactor*COND_CODE) compare (newfactor) /DESIGN.
split file off.
If you can help me fix this I'd appreciate it.
Thanks in advance.
Remove the SW_CODE variable from the GLM command and it should work as intended:
sort cases by SW_CODE.
split file by SW_CODE.
GLM ATTITUDE_2 BY COND_CODE newfactor
As you wrote it, SPSS cannot test an effect of SW_CODE while the file is split by the levels of that same variable, because SW_CODE is constant within each split group.

Resources