How to parse a treebank in Python? - parsing

I have several .tree files; each file contains more than one tree, and I'm trying to parse these files in the easiest way.
When I used
for line in txt.readlines():
I got parsing errors because sometimes a line contains two trees.
The question is: how do I separate the trees onto separate lines?
Is there an efficient way to solve this problem?

Let the corpus reader take care of the segmentation. If the trees are in Treebank format, this might work by itself:
from nltk.corpus import BracketParseCorpusReader
reader = BracketParseCorpusReader("path/to/corpus", r".*\.tree")
for sent in reader.parsed_sents():
    print(sent)
If this doesn't match your tree format, read the documentation for the options that customize the input.
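For example, if each tree in your files is a plain s-expression (possibly spanning several lines), you can tell the reader how to detect tree boundaries and skip comment lines. This is just a sketch; the detect_blocks and comment_char values below are assumptions you should adapt to your data:
from nltk.corpus import BracketParseCorpusReader

# "sexpr" treats every balanced (...) expression as one tree,
# no matter how the trees are split across lines.
reader = BracketParseCorpusReader(
    "path/to/corpus", r".*\.tree",
    detect_blocks="sexpr",
    comment_char="#",   # only needed if your files contain comment lines
)
print(reader.parsed_sents()[0])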

Related

BERT model output interpretation

I searched a lot for this but still haven't got a clear idea, so I hope you can help me out:
I am trying to translate German texts to English! I used this code:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")
batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt")["input_ids"]
results = model(batch)
This returned a size error! I fixed the problem (thanks to the community: https://github.com/huggingface/transformers/issues/5480) by switching the last line of code to:
results = model(input_ids=batch, decoder_input_ids=batch)
Now my output looks like a really long array. What is this output, precisely? Is it some sort of word embeddings? And if so: how should I go about converting these embeddings into English text? Thanks a lot!
Adding to Timbus's answer,
What is this output precisely? Are these some sort of word embeddings?
results is of type <class 'transformers.modeling_outputs.Seq2SeqLMOutput'> and you can do
results.__dict__.keys()
to check that results contains the following:
dict_keys(['loss', 'logits', 'past_key_values', 'decoder_hidden_states', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_hidden_states', 'encoder_attentions'])
You can read more about this class in the huggingface documentation.
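For instance, results.logits holds the raw scores over the target vocabulary for every decoder position, which is why the output looks like one very long array (this assumes you kept results = model(input_ids=batch, decoder_input_ids=batch) from the question):
print(type(results))          # transformers.modeling_outputs.Seq2SeqLMOutput
print(results.logits.shape)   # (batch_size, target_length, vocab_size)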
How shall I go on with converting these embeddings to the texts in the English language?
To get the text in English, you can use model.generate, whose output is easily decoded in the following way:
predictions = model.generate(batch)
english_text = tokenizer.batch_decode(predictions)
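If the decoded strings still contain padding or end-of-sentence markers, batch_decode also accepts skip_special_tokens to strip them (the print line is just an illustration):
english_text = tokenizer.batch_decode(predictions, skip_special_tokens=True)
print(english_text[0])   # first translated sentence as a plain string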
I think one possible answer to your dilemma is provided in this question:
https://stackoverflow.com/questions/61523829/how-can-i-use-bert-fo-machine-translation
Practically, with the output of BERT you get a vectorized representation for each of your words. In essence, that output is easy to use for other tasks, but trickier in the case of machine translation.
A good starting point of using a seq2seq model from the transformers library in the context of machine translation is the following: https://github.com/huggingface/notebooks/blob/master/examples/translation.ipynb.
The example there shows how to translate from English to Romanian.

Parse batch of SequenceExample

There is a function to parse a single SequenceExample --> tf.parse_single_sequence_example().
But it parses only one SequenceExample at a time, which is not efficient.
Is there any way to parse a batch of SequenceExamples?
tf.parse_example can parse many Examples.
The documentation for tf.parse_example contains a little info about SequenceExample:
Each FixedLenSequenceFeature df maps to a Tensor of the specified type (or tf.float32 if not specified) and shape (serialized.size(), None) + df.shape. All examples in serialized will be padded with default_value along the second dimension.
But it is not clear how to do that. I have not found any examples on Google.
Is it possible to parse many SequenceExamples using parse_example(), or does some other function exist?
Edit:
Where can I ask the TensorFlow developers whether they plan to implement a parse function for multiple SequenceExamples?
Any help will be appreciated.
If you have many small sequences where batching at this stage is important, I would recommend VarLenFeatures or FixedLenSequenceFeatures with regular Example protos (which, as you note, can be parsed in batches with parse_example). For examples of this, see the unit tests associated with example parsing (testSerializedContainingSparse parses Examples with FixedLenSequenceFeatures).
SequenceExamples are more geared toward cases where there is a significant amount of preprocessing work to be done for each SequenceExample (which can be done in parallel with queues). parse_example does not support SequenceExamples.
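As a rough sketch of the first suggestion (TF 1.x style API; the feature names are made up): store each sequence as a VarLenFeature in a regular Example, and a whole batch can then be parsed in one call:
import tensorflow as tf

# serialized: a 1-D string tensor holding a batch of serialized Example protos
serialized = tf.placeholder(tf.string, shape=[None])

features = {
    "tokens": tf.VarLenFeature(tf.int64),        # variable-length sequence per example
    "label": tf.FixedLenFeature([], tf.int64),   # one scalar label per example
}

parsed = tf.parse_example(serialized, features)
# parsed["tokens"] is a SparseTensor; pad it to a dense [batch, max_len] tensor if needed
dense_tokens = tf.sparse_tensor_to_dense(parsed["tokens"], default_value=0)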

Missing words in Stanford NLP dependency tree parser

I'm making an application using a dependency tree parser. Actually, the parser is this one:
Stanford Parser, but it occasionally changes one or two letters of some words in a sentence that I want to parse. This is a big problem for me, because I can't see any pattern in these changes, and I need the dependency tree to contain the same words as my sentence.
All I can see is that only some words have this problem. I'm working with a database of tweets, so I have a lot of grammar mistakes in this data. For example, the hashtag '#AllAmericanhumour' becomes AllAmericanhumor. It loses one letter (u).
Is there anything I can do to solve this problem? My first thought was to use an edit distance algorithm, but I think there might be an easier way to do it.
Thanks everybody in advance
You can give options to the tokenizer with the -tokenize.options flag/property. For this particular normalization, you can turn it off with
-tokenize.options americanize=false
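For example, a full command line using that option might look like this (input.txt is just a placeholder file name; pick the annotators your application needs):
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,depparse -tokenize.options americanize=false -file input.txt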
There are also various other normalizations that you can turn off (see PTBTokenizer or http://nlp.stanford.edu/software/tokenizer.shtml). You can turn off a lot with
-tokenize.options ptb3Escaping=false
However, the parser is trained on data that looks like the output of ptb3Escaping=true and so will tend to degrade in performance if used with unnormalized tokens. So, you may want to consider alternative strategies.
If you're working at the Java level, you can look at the word tokens, which are actually Maps, and they have various keys. OriginalTextAnnotation will give you the unnormalized token, even when it has been normalized. CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation will map to character offsets into the text.
p.s. And you should accept some answers :-).

Parsing binary data

I got interested in parser generators, but I don't have the theoretical background; I've just read a few things on the internet.
Currently I'm trying to do something with ANTLR.
My data frames have a special format:
The first byte of a frame is a tag that describes the nature of the data
The second byte contains the length (number of bytes) of the data itself
Then follows the data itself
The data can itself contain data frames, and data frames can be listed one after another
I hope my description is clear. My questions:
Can I create a parser with ANTLR that reads the length of the frame and then knows where the frame ends?
In ANTLR, can I load the different tags I use from a generated file?
Thank you!
I'm not 100% sure about this, but:
Parser generators like ANTLR handle grammars that are at most context-free
using length fields in your data makes your grammar not context-free (context-sensitive, I think)
It is the latter point I'm not sure about - maybe you want to research that some more.
You probably have to write a packet "parser" yourself (which then has to be a parser for your context-sensitive packet grammar).
Alternatively, you could drop the length field and use something like s-expressions, JSON or XML; these would be parseable by something generated with ANTLR.
I think you will be better off creating a hand-written binary parser instead of using ANTLR, because ANTLR is primarily intended to read and make sense of text files, not binary data. The lexer part is focused on tokenizing text, so trying to make it read binary data instead would be an uphill battle.
It sounds as if your structure would need some kind of recursive way of reading the data, although it could be done more easily with just a tree structure that you fill as you read your file.
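Since both answers point toward a hand-written parser, here is a minimal Python sketch of a tag-length-value reader for the frame layout described in the question (one tag byte, one length byte, then the payload); the function name and the sample bytes are purely illustrative:
def parse_frames(data, offset=0, end=None):
    """Return a list of (tag, payload) pairs found in data[offset:end]."""
    frames = []
    end = len(data) if end is None else end
    while offset < end:
        tag = data[offset]              # first byte: tag describing the data
        length = data[offset + 1]       # second byte: payload length in bytes
        payload = data[offset + 2 : offset + 2 + length]
        frames.append((tag, payload))
        offset += 2 + length            # frames are listed one after another
    return frames

# A payload that is itself a frame can be handled by calling parse_frames(payload)
# on the payloads of tags known to contain nested frames.
blob = bytes([0x01, 0x02, 0xAA, 0xBB,                # frame 1: tag 0x01, 2 data bytes
              0x10, 0x04, 0x02, 0x02, 0xCC, 0xDD])   # frame 2: tag 0x10 wrapping a nested frame
for tag, payload in parse_frames(blob):
    print(hex(tag), payload.hex())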

Mahout: How to convert custom documents to SparseVector format for use with LDA

I have a set of documents in which each line has a certain number of strings separated by "\t|\t". Each string (which may contain spaces) is an indivisible dictionary item. Now I have to use LDA to find the correlation between these documents with respect to each dictionary word (string in my vocab).
Please guide me: how can I convert these documents to sparse vector format, and how do I then apply LDA to them?
This is one of the best links I have found that might answer your queries.
http://www.theglassicon.com/computing/machine-learning/running-lda-algorithm-mahout
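Independent of Mahout, this small Python sketch shows what "sparse vector format" means for files whose items are separated by "\t|\t": assign an integer id to every dictionary item and keep per-document (id, count) pairs. Mahout's own tooling (seq2sparse) does the equivalent at scale; the file name below is a placeholder:
from collections import Counter

def load_docs(path):
    # each line is one document; items are separated by "\t|\t" and may contain spaces
    with open(path, encoding="utf-8") as f:
        for line in f:
            items = [t.strip() for t in line.split("\t|\t") if t.strip()]
            if items:
                yield items

vocab = {}          # dictionary item -> integer id
sparse_docs = []    # one {term_id: count} mapping per document
for items in load_docs("corpus.txt"):
    counts = Counter(items)
    sparse_docs.append({vocab.setdefault(term, len(vocab)): n for term, n in counts.items()})

print(len(vocab), "terms,", len(sparse_docs), "documents")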
