I want to parse French text into Universal Dependencies using Stanford Parser version 3.7.0 (the latest one).
Here is my command:
"java -mx2100m -cp stanford-parser.jar:stanford-french-corenlp-2016-10-31-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -MAX_ITEMS 5000000 -encoding utf-8 -outputFormat conll2007 -outputFormatOptions includePunctuationDependencies -sentences newline frenchFactored.ser.gz "+startinDir+"/"+fic+" > "+startinDir+"/Parses_FR/"+fic_name
I use the latest models available: https://nlp.stanford.edu/software/lex-parser.shtml#Download
But my output doesn't contain any grammatical functions (the relation column is NULL), and the POS tags are not the UD ones:
1 La _ D D _ 2 NULL _ _
2 pluie _ N N _ 3 NULL _ _
3 bat _ V V _ 0 root _ _
4 les _ D D _ 5 NULL _ _
5 carreaux _ N N _ 3 NULL _ _
I am also trying to use the parser tool of CoreNLP; here is my command line:
java -mx1g -cp stanford-corenlp-3.7.0.jar:stanford-french-corenlp-2016-10-31-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-french.properties -annotators tokenize,ssplit,pos,depparse -file /Users/Rafael/Desktop/LANGAGES/CORPUS/Sentences_FR/3aube_schtrouFR30.txt -outputFormat sortie.txt
My properties file contains these lines:
annotators = tokenize, ssplit, pos, parse
tokenize.language = fr
parse.model = edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz
pos.model = edu/stanford/nlp/models/pos-tagger/french/french.tagger
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_French.gz
depparse.language = french
I get the following error message:
Caused by: java.io.IOException: Unable to open "edu/stanford/nlp/models/pos-tagger/french/french.tagger" as class path, filename or URL
How can I fix that?
How can I create a recursive descent parser, without using Parsec or any other library, for this grammar?
The output should be an error message if the string does not belong to the grammar.
parse::String -> AST
Re -> Sq | Sq + Re
Sq -> Ba | Ba Sq
Ba -> El | Ba*
El -> lower-or-digit | (Re)
where lower-or-digit is just a lowercase letter or a digit.
First, you need to define your abstract syntax tree, likely as some declared data types. Then you want to define your basic parsing action. For instance,
type ParseResult = Either String AST
type ParseState = (ParseResult, String)
Your parse action is straightforward:
re, sq, ba, el :: ParseState -> ParseState
where re is the top-level parser action.
The concrete parsing step might look like this:
el (_, ('(':restOfInput)) = case re (Left "", restOfInput) of
    err@(Left _, _)       -> err
    (Right result, ')':s) -> (Right (El result), s)
    (_, s)                -> (Left "no closing parens", s)
el (_, c:restOfInput) = if lowerOrDigit c
    then (Right (El c), restOfInput)
    else (Left "bad character", "")
Where a parsing library buys you a lot of traction is in handling all of the parsing state and propagating errors up the call stack.
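To make the sketch above concrete, here is one complete, self-contained recursive-descent parser for this grammar. All the names (Lit, Star, Seq, Alt, parse) are my own choices, not from the question; I simplify the state to a plain (result, remaining input) pair via Either, parse + as right-associative alternation, and rewrite the left-recursive rule Ba -> Ba* as a postfix loop:

```haskell
import Data.Char (isDigit, isLower)

data AST = Lit Char | Star AST | Seq AST AST | Alt AST AST
  deriving (Show, Eq)

-- A parser consumes input and yields an AST plus the unconsumed suffix.
type Parser = String -> Either String (AST, String)

-- Re -> Sq | Sq + Re
re :: Parser
re s = do
  (left, rest) <- sq s
  case rest of
    '+':rest' -> do
      (right, rest'') <- re rest'
      Right (Alt left right, rest'')
    _ -> Right (left, rest)

-- Sq -> Ba | Ba Sq  (continue only if another Ba can start here)
sq :: Parser
sq s = do
  (left, rest) <- ba s
  case rest of
    c:_ | isLower c || isDigit c || c == '(' -> do
      (right, rest') <- sq rest
      Right (Seq left right, rest')
    _ -> Right (left, rest)

-- Ba -> El | Ba*  (left recursion becomes a loop over trailing '*'s)
ba :: Parser
ba s = do
  (e, rest) <- el s
  Right (stars e rest)
  where
    stars e ('*':rest) = stars (Star e) rest
    stars e rest       = (e, rest)

-- El -> lower-or-digit | (Re)
el :: Parser
el ('(':rest) = do
  (e, rest') <- re rest
  case rest' of
    ')':rest'' -> Right (e, rest'')
    _          -> Left "no closing paren"
el (c:rest)
  | isLower c || isDigit c = Right (Lit c, rest)
el _ = Left "unexpected end of input or bad character"

-- Top-level entry point: the whole string must be consumed.
parse :: String -> Either String AST
parse s = case re s of
  Right (ast, "")  -> Right ast
  Right (_, extra) -> Left ("unconsumed input: " ++ extra)
  Left err         -> Left err
```

In GHCi, parse "ab+c*" yields Right (Alt (Seq (Lit 'a') (Lit 'b')) (Star (Lit 'c'))), while parse "(a" reports the missing closing paren.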
I built and ran SyntaxNet successfully on a set of 1,400 tweets, but I have difficulty understanding what each field in the parsed file means. For example, I have the sentence:
Shoutout #Aetna for covering my doctor visit. Love you!
for which the parsed file contents are:
1 Shoutout _ NOUN NNP _ 9 nsubj _ _
2 # _ ADP IN _ 1 prep _ _
3 Aetna _ NOUN NNP _ 2 pobj _ _
4 for _ ADP IN _ 1 prep _ _
5 covering _ VERB VBG _ 4 pcomp _ _
6 my _ PRON PRP$ _ 8 poss _ _
7 doctor _ NOUN NN _ 8 nn _ _
8 visit. _ NOUN NN _ 5 dobj _ _
9 Love _ VERB VBP _ 0 ROOT _ _
10 you _ PRON PRP _ 9 dobj _ _
11 ! _ . . _ 9 punct _ _
What exactly does each column mean? Why are there blanks and numbers other than the POS tags?
This format is called the CoNLL format; there are several versions of it. In CoNLL-X, the layout shown above, the ten columns are: ID, FORM (the token), LEMMA, CPOSTAG (coarse part of speech), POSTAG (fine-grained part of speech), FEATS, HEAD (the index of this token's head, 0 for the root), DEPREL (the dependency relation to the head), PHEAD, and PDEPREL. Columns a tool does not fill in are written as underscores, which explains the blanks; the numbers in column 7 are head indices, not POS tags.
I have used the Stanford Parser to parse some of my tweets that were already tokenized and POS-tagged (by the Stanford POS tagger with the GATE Twitter model). But the resulting CoNLL 2007 formatted output does not include any punctuation. Why is that?
The command I have used:
java -mx16g -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -tokenized -tagSeparator § -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory -escaper edu.stanford.nlp.process.PTBEscapingProcessor -outputFormat conll2007 edu/stanford/nlp/models/lexparser/englishPCFG.caseless.ser.gz ..test.tagged > ../test.conll
e.g.
Original tweet:
bbc sp says they don't understand why the tories aren't 8% ahead in the polls given the current economics stats ; bbc bias ? surely not ?
POS tagged tweet, used as input for Stanford parser:
bbc§NN sp§NN says§VBZ they§PRP don't§VBP understand§VB why§WRB the§DT tories§NNS aren't§VBZ 8%§CD ahead§RB in§IN the§DT polls§NNS given§VBN the§DT current§JJ economics§NNS stats§NNS ;§: bbc§NN bias§NN ?§. surely§RB not§RB ?§.
Resulting conll 2007 formatted parse:
1 bbc _ NN NN _ 2 compound _ _
2 sp _ NN NN _ 3 nsubj _ _
3 says _ VBZ VBZ _ 0 root _ _
4 they _ PRP PRP _ 5 nsubj _ _
5 don't _ VBP VBP _ 3 ccomp _ _
6 understand _ VB VB _ 5 xcomp _ _
7 why _ WRB WRB _ 10 advmod _ _
8 the _ DT DT _ 9 det _ _
9 tories _ NNS NNS _ 10 nsubj _ _
10 aren't _ VBZ VBZ _ 6 ccomp _ _
11 8% _ CD CD _ 12 nmod:npmod _ _
12 ahead _ RB RB _ 15 advmod _ _
13 in _ IN IN _ 15 case _ _
14 the _ DT DT _ 15 det _ _
15 polls _ NNS NNS _ 10 nmod _ _
16 given _ VBN VBN _ 15 acl _ _
17 the _ DT DT _ 19 det _ _
18 current _ JJ JJ _ 19 amod _ _
19 economics _ NNS NNS _ 16 dobj _ _
20 stats _ NNS NNS _ 19 dep _ _
22 bbc _ NN NN _ 23 compound _ _
23 bias _ NN NN _ 20 dep _ _
25 surely _ RB RB _ 26 advmod _ _
26 not _ RB RB _ 16 neg _ _
As you can see, most of the punctuation is not included in the parse. But why?
I think adding "-parse.keepPunct" to your command will fix this issue. Please let me know if that doesn't work.
Finally, I found the answer: use
-outputFormatOptions includePunctuationDependencies
I contacted Stanford Parser and CoreNLP support a long time ago, with no response at all.
I'm fairly new to F#, and I'm struggling to find how to properly represent the null character in the language. Can anyone tell me how to do that?
More to the point, what started me down the path is I'm trying to do some string processing with String.mapi, but I can't figure out how to remove a character in the below function:
let GetTargetFrameworkFolder version =
    let versionMapper i c =
        match c with
        | 'v' -> if i = 0 then char(0x000) else c
        | '.' -> char(0x000)
        | _ -> c
    match version with
    | "v3.5" -> "net35"
    | "v4.0" -> "net40"
    | "v4.5" -> "net45"
    | vers -> vers |> String.mapi versionMapper

GetTargetFrameworkFolder "v4.5.1" |> Dump
How can I remove a character from a string while doing character by character processing, as in the case with String.map and String.mapi?
You cannot remove a character using String.mapi, as this function maps exactly one character from the input to one character from the output. The null character is not the same thing as removing a character; it's just another character that happens to have the code 0.
In your case, if I understand correctly you want to remove the initial 'v' (if any) and remove dots. I would do it like this:
let GetTargetFrameworkFolder version =
    match version with
    | "v3.5" -> "net35"
    | "v4.0" -> "net40"
    | "v4.5" -> "net45"
    | vers ->
        let vers = if vers.[0] = 'v' then vers.[1..] else vers
        vers.Replace(".", "")
Another way of doing this if you wanted to keep your original approach would be to write your own choose function for strings:
module String =
    let choosei predicate str =
        let sb = System.Text.StringBuilder()
        let choose i (c: char) =
            match predicate i c with
            | Some(x) -> sb.Append(x) |> ignore
            | None -> ()
        str |> String.iteri choose
        sb.ToString()
Then use it as follows:
let GetTargetFrameworkFolder version =
    let versionMapper i = function
        | 'v' when i = 0 -> None
        | '.' -> None
        | c -> Some(c)
    match version with
    | "v3.5" -> "net35"
    | "v4.0" -> "net40"
    | "v4.5" -> "net45"
    | vers -> vers |> String.choosei versionMapper

GetTargetFrameworkFolder "v4.5.1" |> Dump
You can achieve this by using an array comprehension:
let GetTargetFrameworkFolder version =
    match version with
    | "v3.5" -> "net35"
    | "v4.0" -> "net40"
    | "v4.5" -> "net45"
    | vers ->
        new String([|
            for i in 0 .. vers.Length - 1 do
                match i, vers.[i] with
                | 0, 'v' | _, '.' -> () // skip 'v' at [0] and all '.'s
                | _, c -> yield c // let everything else through
        |])
Character-by-character processing that removes characters is filtering (a string is a sequence of chars):
let version (s: String) =
    s
    |> Seq.filter (fun ch -> ch <> '.' && ch <> 'v')
    |> String.Concat
UPDATE:
To skip only a leading 'v':
let version (s: String) =
    s
    |> Seq.skip (if s.StartsWith "v" then 1 else 0)
    |> Seq.filter ((<>) '.')
    |> String.Concat
I want to use Stanford Parser to create a .conll file for further processing.
So far I managed to parse the test sentence with the command:
stanford-parser-full-2013-06-20/lexparser.sh stanford-parser-full-2013-06-20/data/testsent.txt > output.txt
Instead of a txt file, I would like to have a file in .conll format. I'm pretty sure it is possible, as it is mentioned in the documentation (see here). Can I somehow modify my command, or will I have to write Java code?
Thanks for your help!
If you're looking for dependencies printed out in CoNLL X (CoNLL 2006) format, try this from the command line:
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt >testsent.tree
java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx
Here's the output for the first test sentence:
1 Scores _ NNS NNS _ 4 nsubj _ _
2 of _ IN IN _ 0 erased _ _
3 properties _ NNS NNS _ 1 prep_of _ _
4 are _ VBP VBP _ 0 root _ _
5 under _ IN IN _ 0 erased _ _
6 extreme _ JJ JJ _ 8 amod _ _
7 fire _ NN NN _ 8 nn _ _
8 threat _ NN NN _ 4 prep_under _ _
9 as _ IN IN _ 13 mark _ _
10 a _ DT DT _ 12 det _ _
11 huge _ JJ JJ _ 12 amod _ _
12 blaze _ NN NN _ 15 xsubj _ _
13 continues _ VBZ VBZ _ 4 advcl _ _
14 to _ TO TO _ 15 aux _ _
15 advance _ VB VB _ 13 xcomp _ _
16 through _ IN IN _ 0 erased _ _
17 Sydney _ NNP NNP _ 20 poss _ _
18 's _ POS POS _ 0 erased _ _
19 north-western _ JJ JJ _ 20 amod _ _
20 suburbs _ NNS NNS _ 15 prep_through _ _
21 . _ . . _ 4 punct _ _
I'm not sure you can do this through the command line, but here is a Java version:
// lp is a loaded LexicalizedParser; gsf is a GrammaticalStructureFactory
// obtained from the parser's TreebankLanguagePack
for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
    Tree parse = lp.apply(sentence);
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    GrammaticalStructure.printDependencies(gs, gs.typedDependencies(), parse, true, false);
}
There is a conll2007 output format; see the TreePrint documentation for all the options.
Here is an example using the 3.8 version of the Stanford parser. It assumes an input file of one sentence per line, output in Stanford Dependencies (not Universal Dependencies), no propagation/collapsing, keep punctuation, and output in conll2007:
java -Xmx4g -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -outputFormat conll2007 -originalDependencies -outputFormatOptions "basicDependencies,includePunctuationDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz input.txt