I want to use Stanford Parser to create a .conll file for further processing.
So far I managed to parse the test sentence with the command:
stanford-parser-full-2013-06-20/lexparser.sh stanford-parser-full-2013-06-20/data/testsent.txt > output.txt
Instead of a txt file I would like to have a file in .conll format. I'm pretty sure it is possible, as it is mentioned in the documentation (see here). Can I somehow modify my command, or will I have to write Java code?
Thanks for your help!
If you're looking for dependencies printed out in CoNLL X (CoNLL 2006) format, try this from the command line:
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt >testsent.tree
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx
Here's the output for the first test sentence:
1 Scores _ NNS NNS _ 4 nsubj _ _
2 of _ IN IN _ 0 erased _ _
3 properties _ NNS NNS _ 1 prep_of _ _
4 are _ VBP VBP _ 0 root _ _
5 under _ IN IN _ 0 erased _ _
6 extreme _ JJ JJ _ 8 amod _ _
7 fire _ NN NN _ 8 nn _ _
8 threat _ NN NN _ 4 prep_under _ _
9 as _ IN IN _ 13 mark _ _
10 a _ DT DT _ 12 det _ _
11 huge _ JJ JJ _ 12 amod _ _
12 blaze _ NN NN _ 15 xsubj _ _
13 continues _ VBZ VBZ _ 4 advcl _ _
14 to _ TO TO _ 15 aux _ _
15 advance _ VB VB _ 13 xcomp _ _
16 through _ IN IN _ 0 erased _ _
17 Sydney _ NNP NNP _ 20 poss _ _
18 's _ POS POS _ 0 erased _ _
19 north-western _ JJ JJ _ 20 amod _ _
20 suburbs _ NNS NNS _ 15 prep_through _ _
21 . _ . . _ 4 punct _ _
I'm not sure you can do this through the command line, but here is a Java version:
// Assumes lp = LexicalizedParser.loadModel(...) and
// gsf = lp.treebankLanguagePack().grammaticalStructureFactory() are already set up.
for (List<HasWord> sentence : new DocumentPreprocessor(filename)) { // takes a file path; a StringReader would tokenize the filename itself
    Tree parse = lp.apply(sentence);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    // the first boolean flag selects CoNLL-X output
    GrammaticalStructure.printDependencies(gs, gs.typedDependencies(), parse, true, false);
}
There is also a conll2007 output format; see the TreePrint documentation for all options.
Here is an example using version 3.8 of the Stanford parser. It assumes an input file with one sentence per line and outputs Stanford Dependencies (not Universal Dependencies), with no propagation/collapsing, punctuation kept, and conll2007 output:
java -Xmx4g -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -outputFormat conll2007 -originalDependencies -outputFormatOptions "basicDependencies,includePunctuationDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz input.txt
The code gets a JSON string from a server, parses it into a JObject, and then branches appropriately.
let j = JObject.Parse x
match x, j with
| _ when x = "pong" -> ()
| _ when j.ContainsKey "table" -> HandleTableMessages j x
| _ when j.ContainsKey "success" -> HandleSuccessMessages j
| _ when j.ContainsKey "error" -> HandleErrorMessages j
| _ when j.ContainsKey "info" -> j.SelectToken "info" |> string |> this.Print
| _, null -> this.Error ("malformed message: " + x)
| _ -> this.Error("unknown message type: " + x)
I think there is something a little heavy about the _ when part, and I am wondering whether there is a better use of the F# grammar to express this.
It's a good sign that you realize this code is bad. It shows you may have better taste than most beginners. Using the simplest structure for a task is very important.
match x, j with
| _ when x = "pong" -> ()
...
First note that (x,j) is unused, so this simplifies to:
match () with
| _ when x = "pong" -> ()
...
Then you can realize that matching on unit is silly, and that a simple conditional is clearer:
if x = "pong" then ()
elif j.ContainsKey "table" then HandleTableMessages j x
...
else this.Error("unknown message type: " + x)
I have just noticed that elif is not in the F# cheatsheet, so I will try to get it added there, as it's a basic keyword.
// Active Pattern
let (|Table|Success|Error|Info|Unknown|) (j: Newtonsoft.Json.Linq.JObject) =
if j.ContainsKey "table" then Table
elif j.ContainsKey "success" then Success
elif j.ContainsKey "error" then Error
elif j.ContainsKey "info" then Info
else Unknown
match x, j with
| "pong", _ -> ()
| _, null -> this.Error ("malformed message: " + x) // test null before the active pattern runs, or it will throw on j.ContainsKey
| _, Table -> HandleTableMessages j x
| _, Success -> HandleSuccessMessages j
| _, Error -> HandleErrorMessages j
| _, Info -> j.SelectToken "info" |> string |> this.Print
| _, Unknown -> this.Error ("unknown message type: " + x)
I want to parse French text with Universal Dependencies using Stanford Parser version 3.7.0 (the latest one).
Here is my command:
"java -mx2100m -cp stanford-parser.jar:stanford-french-corenlp-2016-10-31-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -MAX_ITEMS 5000000 -encoding utf-8 -outputFormat conll2007 -outputFormatOptions includePunctuationDependencies -sentences newline frenchFactored.ser.gz "+startinDir+"/"+fic+" > "+startinDir+"/Parses_FR/"+fic_name
I use the latest models available: https://nlp.stanford.edu/software/lex-parser.shtml#Download
But my output doesn't contain any dependency relations, and the POS tags are not the UD ones:
1 La _ D D _ 2 NULL _ _
2 pluie _ N N _ 3 NULL _ _
3 bat _ V V _ 0 root _ _
4 les _ D D _ 5 NULL _ _
5 carreaux _ N N _ 3 NULL _ _
I am also trying to use the parser tool of CoreNLP; here is my command line:
java -mx1g -cp stanford-corenlp-3.7.0.jar:stanford-french-corenlp-2016-10-31-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -annotators tokenize,ssplit,pos,depparse -file /Users/Rafael/Desktop/LANGAGES/CORPUS/Sentences_FR/3aube_schtrouFR30.txt -outputFormat sortie.txt
My properties file contains these lines:
annotators = tokenize, ssplit, pos, parse
tokenize.language = fr
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz
pos.model = edu/stanford/nlp/models/pos-tagger/french/french.tagger
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_French.gz
depparse.language = french
I get the following error message:
Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/french/french.tagger" as class path, filename or URL
How can I fix that?
I built and ran SyntaxNet successfully on a set of 1400 tweets. I have difficulty understanding what each field in the parsed file means. For example, I have the sentence:
Shoutout #Aetna for covering my doctor visit. Love you!
for which the parsed file contents are:
1 Shoutout _ NOUN NNP _ 9 nsubj _ _
2 # _ ADP IN _ 1 prep _ _
3 Aetna _ NOUN NNP _ 2 pobj _ _
4 for _ ADP IN _ 1 prep _ _
5 covering _ VERB VBG _ 4 pcomp _ _
6 my _ PRON PRP$ _ 8 poss _ _
7 doctor _ NOUN NN _ 8 nn _ _
8 visit. _ NOUN NN _ 5 dobj _ _
9 Love _ VERB VBP _ 0 ROOT _ _
10 you _ PRON PRP _ 9 dobj _ _
11 ! _ . . _ 9 punct _ _
What exactly does each column mean? Why are there blanks, and numbers other than the POS tags?
This format is called the CoNLL format. There are various versions of it; the meaning of each column is described here.
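To make the columns concrete, here is a minimal sketch (Python; the function name `parse_conll_line` is just illustrative) that splits one CoNLL-X token line into its ten named fields:

```python
# The ten CoNLL-X columns, in order.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]

def parse_conll_line(line):
    """Split one CoNLL-X token line into a dict of named columns."""
    return dict(zip(COLUMNS, line.split()))

tok = parse_conll_line("1\tShoutout\t_\tNOUN\tNNP\t_\t9\tnsubj\t_\t_")
# HEAD is the 1-based index of the governing token (here 9 = 'Love');
# '_' marks columns the parser left unfilled.
print(tok["form"], tok["cpostag"], tok["head"], tok["deprel"])
```

So the "numbers other than the POS tags" are the HEAD column (which token this one depends on), and the blanks are simply unused columns.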
I have used the Stanford Parser to parse some of my tweets, already tokenized and POS-tagged (by the Stanford POS tagger with the GATE Twitter model). But the resulting conll2007-formatted output does not include any punctuation. Why is that?
The command I have used:
java -mx16g -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -tokenized -tagSeparator § -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -escaper edu.stanford.nlp.process.PTBEscapingProcessor -outputFormat conll2007 edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ..test.tagged > ../test.conll
e.g.
Original tweet:
bbc sp says they don't understand why the tories aren't 8% ahead in the polls given the current economics stats ; bbc bias ? surely not ?
POS tagged tweet, used as input for Stanford parser:
bbc§NN sp§NN says§VBZ they§PRP don't§VBP understand§VB why§WRB the§DT tories§NNS aren't§VBZ 8%§CD ahead§RB in§IN the§DT polls§NNS given§VBN the§DT current§JJ economics§NNS stats§NNS ;§: bbc§NN bias§NN ?§. surely§RB not§RB ?§.
Resulting conll 2007 formatted parse:
1 bbc _ NN NN _ 2 compound _ _
2 sp _ NN NN _ 3 nsubj _ _
3 says _ VBZ VBZ _ 0 root _ _
4 they _ PRP PRP _ 5 nsubj _ _
5 don't _ VBP VBP _ 3 ccomp _ _
6 understand _ VB VB _ 5 xcomp _ _
7 why _ WRB WRB _ 10 advmod _ _
8 the _ DT DT _ 9 det _ _
9 tories _ NNS NNS _ 10 nsubj _ _
10 aren't _ VBZ VBZ _ 6 ccomp _ _
11 8% _ CD CD _ 12 nmod:npmod _ _
12 ahead _ RB RB _ 15 advmod _ _
13 in _ IN IN _ 15 case _ _
14 the _ DT DT _ 15 det _ _
15 polls _ NNS NNS _ 10 nmod _ _
16 given _ VBN VBN _ 15 acl _ _
17 the _ DT DT _ 19 det _ _
18 current _ JJ JJ _ 19 amod _ _
19 economics _ NNS NNS _ 16 dobj _ _
20 stats _ NNS NNS _ 19 dep _ _
22 bbc _ NN NN _ 23 compound _ _
23 bias _ NN NN _ 20 dep _ _
25 surely _ RB RB _ 26 advmod _ _
26 not _ RB RB _ 16 neg _ _
As you can see, most of the punctuation is not included in the parse. But why?
I think adding "-parse.keepPunct" to your command will fix this issue. Please let me know if that doesn't work.
Finally found the answer: use
-outputFormatOptions includePunctuationDependencies
I contacted Stanford Parser and CoreNLP support a long time ago, but got no response at all.
I'm working with MaltParser and NLTK to process texts. I have an integration between MaltParser and NLTK that works fine, but every time I run the program NLTK starts a Java VM, and this takes a lot of time. So I thought of making a web service that takes a CoNLL .txt file and returns the CoNLL output parsed by the Java app.
The problem comes when I test examples from the MaltParser sources. I picked one that just initializes a model and parses an array of tokens, and only changed the model to the regular English one (engmalt.linear-1.7.mco). But when I execute it, the returned sentences are just like the input.
The code is this:
public static void main(String[] args) {
    // Load the English model engmalt.linear-1.7
    ConcurrentMaltParserModel model = null;
    try {
        URL modelURL = new File("inputs/engmalt.linear-1.7.mco").toURI().toURL();
        System.out.println(modelURL.getFile());
        model = ConcurrentMaltParserService.initializeParserModel(modelURL);
    } catch (Exception e) {
        e.printStackTrace();
    }
    // Create an array of tokens containing the sentence
    // 'This is a test.' in the CoNLL data format.
    String[] tokens = new String[5];
    tokens[0] = "1\tThis\t_\tDT\tDT\t_\t0\ta\t_\t_";
    System.out.println(tokens[0]);
    tokens[1] = "2\tis\t_\tVBZ\tVBZ\t_\t0\ta\t_\t_";
    System.out.println(tokens[1]);
    tokens[2] = "3\ta\t_\tZ\tZ\t_\t0\ta\t_\t_";
    System.out.println(tokens[2]);
    tokens[3] = "4\ttest\t_\tNN\tNN\t_\t0\ta\t_\t_";
    System.out.println(tokens[3]);
    tokens[4] = "5\t.\t_\tFp\tFp\t_\t0\ta\t_\t_";
    System.out.println(tokens[4]);
    try {
        String[] outputTokens = model.parseTokens(tokens);
        ConcurrentUtils.printTokens(outputTokens);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
and the output is:
/home/tomas/workspace/PruebaMalt/inputs/engmalt.linear-1.7.mco
1 This _ DT DT _ 0 a _ _
2 is _ VBZ VBZ _ 0 a _ _
3 a _ Z Z _ 0 a _ _
4 test _ NN NN _ 0 a _ _
5 . _ Fp Fp _ 0 a _ _
1 This _ DT DT _ 0 a _ _
2 is _ VBZ VBZ _ 0 a _ _
3 a _ Z Z _ 0 a _ _
4 test _ NN NN _ 0 a _ _
5 . _ Fp Fp _ 0 a _ _
I tried other models and languages, with the same result. Any suggestions? Thanks!
I figured it out myself. The problem is that NLTK sends Java this format:
1 This _ DT DT _ 0 a _ _
and gets back: 1 This _ DT DT _ 2 SUBJ _ _
But the format the Java API expects is a little different: the dummy head/relation columns at the end (0 a _ _) have to be removed. With that, it works!
input: 1 This _ DT DT _
return: 1 This _ DT DT _ 2 SUBJ _ _
I hope this helps others.
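The conversion above can be sketched in a few lines (Python; `to_malt_input` is a hypothetical helper name): keep only the first six columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS) and drop the dummy HEAD/DEPREL/PHEAD/PDEPREL columns before handing the line to the Java API.

```python
def to_malt_input(conll_line):
    # Keep ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS and drop the
    # dummy head/relation columns ("0\ta\t_\t_") that NLTK appends.
    return "\t".join(conll_line.split("\t")[:6])

print(to_malt_input("1\tThis\t_\tDT\tDT\t_\t0\ta\t_\t_"))
```

The parsed output coming back from MaltParser then carries the real HEAD and DEPREL values, as in the example above.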