Stanford parser can't read German umlauts

The Stanford parser (http://nlp.stanford.edu/software/lex-parser.html), version 3.6.0, comes with trained grammars for English, German, and other languages. To parse German text, the Stanford parser provides the tool lexparser-lang.sh:
./lexparser-lang.sh
Usage: lexparser-lang.sh lang len grammar out_file FILE...
lang : Language to parse (Arabic, English, Chinese, German, French)
len : Maximum length of the sentences to parse
grammar : Serialized grammar file (look in the models jar)
out_file : Prefix for the output filename
FILE : List of files to parse
So I call it with these options:
sadik@sadix:stanford-parser-full-2015-12-09$ ./lexparser-lang.sh German 500 edu/stanford/nlp/models/lexparser/germanFactored.ser.gz factored german_test.txt
The input file german_test.txt contains a single German sentence:
Fußball findet um 8 Uhr in der Halle statt.
But the "ß" results in a warning and a wrong result. Same with "ä", "ö" and "ü". Now, lexparser-lang.sh is supposed to be designed to deal with German text as input. Is there any option I am missing?
How it is:
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/germanFactored.ser.gz ...
done [3.8 sec].
Parsing file: german_test.txt
Apr 01, 2016 12:48:45 AM edu.stanford.nlp.process.PTBLexer next
WARNING: Untokenizable: (U+9F, decimal: 159)
Parsing [sent. 1 len. 11]: Fuà ball findet um 8 Uhr in der Halle statt .
Parsed file: german_test.txt [1 sentences].
Parsed 11 words in 1 sentences (32.07 wds/sec; 2.92 sents/sec).
With a parse tree that looks like crap:
(S (ADV FuÃ) (ADV ball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle))
(PTKVZ statt) ($. .))
How it should be:
When written as "Fussball", there is no problem (apart from the incorrect orthography):
[main] INFO edu.stanford.nlp.parser.lexparser.LexicalizedParser - Loading parser from serialized file edu/stanford/nlp/models/lexparser/germanFactored.ser.gz ...
done [3.5 sec].
Parsing file: german_test.txt
Parsing [sent. 1 len. 10]: Fussball findet um 8 Uhr in der Halle statt .
Parsed file: german_test.txt [1 sentences].
Parsed 10 words in 1 sentences (40.98 wds/sec; 4.10 sents/sec).
The correct tree:
(S (NN Fussball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle))
(PTKVZ statt) ($. .))

The demo script is not running the tokenizer with the correct character set. So if your text is pre-tokenized, you can add the option "-tokenized" and it will just use space as the token delimiter.
Also you want to tell the parser to use "-encoding ISO-8859-1" for German.
Here is the full java command (alter the one found in the .sh script):
java -Xmx2g -cp "./*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 500 -tLPP edu.stanford.nlp.parser.lexparser.NegraPennTreebankParserParams -hMarkov 1 -vMarkov 2 -vSelSplitCutOff 300 -uwm 1 -unknownSuffixSize 2 -nodeCleanup 2 -writeOutputFiles -outputFilesExtension output.500.stp -outputFormat "penn" -outputFormatOptions "removeTopBracket,includePunctuationDependencies" -encoding ISO_8859-1 -tokenized -loadFromSerializedFile edu/stanford/nlp/models/lexparser/germanFactored.ser.gz german_example.txt
I get this output:
(NUR
(S (NN Fußball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle) (ADJA statt.))))
UPDATED AGAIN:
Make sure to separate "statt." into "statt ." since we are now saying the tokens are whitespace-separated. If we apply this fix, we get this parse:
(S (NN Fußball) (VVFIN findet)
(PP (APPR um) (CARD 8) (NN Uhr))
(PP (APPR in) (ART der) (NN Halle))
(PTKVZ statt) ($. .))
So just to summarize: the issue is that we need to tell both the PTBTokenizer and the LexicalizedParser to use ISO_8859-1.
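To make the mismatch concrete: 'ß' is the byte pair 0xC3 0x9F in UTF-8, and decoding those bytes with a single-byte charset splits it into 'Ã' plus the control character U+009F, which matches the "FuÃ" and the "Untokenizable: (U+9F, decimal: 159)" warning in the log above. Here is a small stand-alone sketch of that effect (plain Java, not the parser's own code path, and assuming the input file was saved as UTF-8):

import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    public static void main(String[] args) {
        // "ß" is encoded as the two bytes 0xC3 0x9F in UTF-8.
        byte[] utf8Bytes = "Fußball".getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes with a single-byte charset turns 'ß' into
        // 'Ã' (0xC3) followed by the control character U+009F (decimal 159).
        String misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        // Prints "FuÃ", an invisible U+009F, then "ball".
        System.out.println(misread);
    }
}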
I would recommend just using the full pipeline to accomplish this.
Download Stanford CoreNLP 3.6.0 from here:
http://stanfordnlp.github.io/CoreNLP/
Download the German model jar from here:
http://stanfordnlp.github.io/CoreNLP/download.html
Run this command:
java -Xmx3g -cp "stanford-corenlp-full-2015-12-09/*:stanford-corenlp-3.6.0-models-german.jar" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,parse -props StanfordCoreNLP-german.properties -file german_example_file.txt -outputFormat text
This will tokenize and parse the text and use the correct character encoding.
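If you would rather call the pipeline from Java code than from the command line, something along these lines should work. This is only a sketch: it assumes the CoreNLP jars and stanford-corenlp-3.6.0-models-german.jar are on the classpath, so that StanfordCoreNLP-german.properties can be loaded from there.

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;

public class GermanParseDemo {
    public static void main(String[] args) throws Exception {
        // StanfordCoreNLP-german.properties ships inside the German models jar,
        // so it can be read straight from the classpath.
        Properties props = new Properties();
        props.load(StanfordCoreNLP.class.getClassLoader()
                .getResourceAsStream("StanfordCoreNLP-german.properties"));
        props.setProperty("annotators", "tokenize, ssplit, pos, parse");

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation doc = new Annotation("Fußball findet um 8 Uhr in der Halle statt.");
        pipeline.annotate(doc);

        // Print the constituency tree for each sentence.
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            tree.pennPrint();
        }
    }
}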

Related

Greek and special characters show as mojibake - how to decode?

I'm trying to figure out how to decode some corrupt characters I have in a spreadsheet. There is a list of website titles: some in English, some in Greek, some in other languages. For example, the Greek phrase ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ shows as ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë. So the whitespace is OK, but the actual letters have gone all wrong.
I have noticed that letters got converted to pairs of symbols:
Ε - Œï
Λ - Œõ
And so on. So it's almost always Œ and then some other symbol after it.
I went further: I removed the repeated letter and checked the difference in character codes between the actual phrase and what was left of the corrupted phrase, i.e. ord('ï') - ord('Ε') and so on. The difference is almost the same all the time:
678
678
678
676
676
677
676
678
0 (this is a whitespace)
676
678
678
0 (this is a whitespace)
765
768
753
678
I have manually decoded some of the other letters from other titles:
Greek
Œë Α
Œî Δ
Œï Ε
Œõ Λ
Œó Η
Œô Ι
Œö Κ
Œù Ν
Œ° Ρ
Œ§ Τ
Œ© Ω
Œµ ε
Œª λ
œÑ τ
ŒØ ί
Œø ο
œÑ τ
œâ ω
ŒΩ ν
Symbols
‚Äò ‘
‚Äô ’
‚Ķ …
‚Ć †
‚Äú “
Other
√© é
It's good that I have a translation for this phrase, but there are a couple of others I don't have a translation for. I would be glad for any kind of advice, because searching around Stack Overflow didn't show me anything related.
It's a character encoding issue. The string appears to be encoded as Mac OS Roman (I figured it out by educated guesses on this site). The IANA name for this encoding is macintosh, and its Windows code page number is 10000.
Here's a Python function that will decode the macintosh-encoded mojibake back to proper UTF-8 strings:
def macToUtf8(s):
    return bytes(s, 'macintosh').decode('utf-8')

print(macToUtf8('ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë'))
# outputs: ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ
My best guess is that your spreadsheet was saved on a Mac computer, or perhaps saved using some Macintosh-based setting.
See also this question: What encoding does Mac Excel use?
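If you need the same repair on the JVM rather than in Python, the equivalent round-trip looks roughly like the sketch below; it assumes the x-MacRoman charset is available in your Java runtime (it is normally included with the JDK's extended charsets).

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MacRomanFix {
    public static void main(String[] args) {
        // Mojibake produced by reading UTF-8 bytes as Mac OS Roman.
        String garbled = "ŒïŒõŒõŒóŒùŒôŒöŒë ŒùŒïŒë Œ§Œ©Œ°Œë";

        // Reverse the damage: re-encode as Mac OS Roman, then decode as UTF-8.
        Charset macRoman = Charset.forName("x-MacRoman");
        String fixed = new String(garbled.getBytes(macRoman), StandardCharsets.UTF_8);
        System.out.println(fixed);  // ΕΛΛΗΝΙΚΑ ΝΕΑ ΤΩΡΑ
    }
}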

Running Antlr4 parser with lexer grammar gets token recognition errors

I'm trying to create a grammar to parse Solr queries (only mildly relevant, and you don't need to know anything about Solr to answer the question; you just need to know more than I do about ANTLR 4.7). I'm basing it on the QueryParser.jj file from Solr 6. I looked for an existing grammar, but there doesn't seem to be one that isn't old and out of date.
I'm stuck because when I try to run the parser I get "token recognition error" messages.
The lexer I created uses lexer modes, which, as I understand it, means I need to have a separate lexer grammar file. So I have a parser file and a lexer file.
I whittled it down to a simple example to show what I'm seeing. Maybe someone can tell me what I'm doing wrong. Here's the parser (Junk.g4):
grammar Junk;
options {
language = Java;
tokenVocab=JLexer;
}
term : TERM '\r\n';
I can't use an import because of the lexer modes in the lexer file I'm trying to create (the tokens in the modes become "undefined" if I use an import). That's why I reference the lexer file with the tokenVocab option (as shown in the XML example on GitHub).
Here's the lexer (JLexer.g4):
lexer grammar JLexer;
TERM : TERM_START_CHAR TERM_CHAR* ;
TERM_START_CHAR : [abc] ;
TERM_CHAR : [efg] ;
WS : [ \t\n\r\u3000]+ -> skip;
If I copy the lexer code into the parser, then things work as expected (e.g., "aeee" is a term). Also, if I run the lexer file with grun (specifying tokens as the target), then the string parses as a TERM (as expected).
If I run the parser ("grun Junk term -tokens"), then I get:
line 1:0 token recognition error at: 'a'
line 1:1 token recognition error at: 'e'
line 1:2 token recognition error at: 'e'
line 1:3 token recognition error at: 'e'
[#0,4:5='\r\n',<'
'>,1:4]
I "compile" the lexer first, then "compile" the parser and then javac the resulting java files. I do this in a batch file, so I'm pretty confident that I'm doing this every time.
I don't understand what I'm doing wrong. Is it the way I'm running grun? Any suggestions would be appreciated.
Always trust your intuition! There is some convention internal to grun :-) See TestRig.java, around lines 125 and 150. It would have been a lot nicer if some additional CLI args were also added.
When the lexer and parser grammars are compiled separately, the grammar name, in your case, would be "Junk" (as far as TestRig is concerned), and the two files must be named "JunkLexer.g4" and "JunkParser.g4". Accordingly, the header of the parser file JunkParser.g4 should be modified too:
parser grammar JunkParser;
options { tokenVocab=JunkLexer; }
... stuff
Now you can run your tests
> antlr4 JunkLexer
> antlr4 JunkParser
> javac Junk*.java
> grun Junk term -tokens
aeee
^Z
[#0,0:3='aeee',<TERM>,1:0]
[#1,6:5='<EOF>',<EOF>,2:0]
>

How can I prevent OpenNLP Parser from tokenizing strings?

I need to use OpenNLP Parser for a specific task. The documentation suggests that you send it tokenized input, which implies that no further tokenization will take place. However, when I pass a string with parentheses, brackets, or braces, OpenNLP tokenizes them and converts them to PTB tokens.
I don't want this to happen, but I can't figure out how to prevent it.
Specifically, if my input contains "{2}", I want it to stay that way, not become "-LCB- 2 -RCB-". I now have 3 tokens where I once had one. I'd also strongly prefer not to have to post-process the output to undo the PTB tokens.
Is there a way to prevent OpenNLP Parser from tokenizing?
Looking at the javadocs, there are two parseLine methods, and for one of them a tokenizer can be specified. I haven't tried the following, but I guess that training your own tokenizer (https://opennlp.apache.org/docs/1.8.0/manual/opennlp.html#tools.tokenizer.training), which shouldn't be that much of a problem (or reverting to simple whitespace splitting if need be), and feeding it to that parseLine method, in addition to the sentence and the number of desired parses, should do the trick. E.g. something like the following:
public static void main(String args[]) throws Exception {
    InputStream inputStream = new FileInputStream(FileFactory.generateOrCreateFileInstance(<location to en-parser-chunking.bin>));
    ParserModel model = new ParserModel(inputStream);
    Parser parser = ParserFactory.create(model);
    String sentence = "An example with a {2} string.";

    // Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
    // Instead of using the line above, feed it a tokenizer, like so:
    Parse topParses[] = ParserTool.parseLine(sentence, parser, new SimpleTokenizer(), 1);

    for (Parse p : topParses)
        p.show();
}
This particular piece of code still splits the { from the 2 in the input, resulting in:
(TOP (NP (NP (DT An) (NN example)) (PP (IN with) (NP (DT a) (-LRB- -LCB-) (CD 2) (-RRB- -RCB-) (NN string))) (. .)))
but if you train your own tokenizer and don't split on the cases you want to keep as a single token, I guess this should work.
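Alternatively, if plain whitespace splitting is good enough for your input, you can skip training entirely and pass OpenNLP's stock WhitespaceTokenizer to parseLine. A rough sketch of that variation (note that you then have to separate punctuation yourself, e.g. "string ." instead of "string."):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class WhitespaceParseExample {
    public static void main(String[] args) throws Exception {
        // Adjust the path to wherever en-parser-chunking.bin lives.
        try (InputStream in = new FileInputStream("en-parser-chunking.bin")) {
            ParserModel model = new ParserModel(in);
            Parser parser = ParserFactory.create(model);

            // Whitespace-only tokenization keeps "{2}" as a single token;
            // the final period has been separated by hand.
            String sentence = "An example with a {2} string .";
            Parse[] topParses = ParserTool.parseLine(
                    sentence, parser, WhitespaceTokenizer.INSTANCE, 1);
            for (Parse p : topParses) {
                p.show();
            }
        }
    }
}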

solution or workarounds for haskell-src-exts parsing modules with CPP failing

I'm trying to do some parsing of a bunch of haskell source files using haskell-src-exts but ran into trouble in the first file I tested on. Here is the first bit:
{-# LANGUAGE CPP, MultiParamTypeClasses, ScopedTypeVariables #-}
{-# OPTIONS_GHC -Wall -fno-warn-orphans #-}
----------------------------------------------------------------------
-- |
-- Module : FRP.Reactive.Fun
-- Copyright : (c) Conal Elliott 2007
-- License : GNU AGPLv3 (see COPYING)
--
-- Maintainer : conal@conal.net
-- Stability : experimental
--
-- Functions, with constant functions optimized, with instances for many
-- standard classes.
----------------------------------------------------------------------
module FRP.Reactive.Fun (Fun, fun, apply, batch) where
import Prelude hiding
( zip, zipWith
#if __GLASGOW_HASKELL__ >= 609
, (.), id
#endif
)
#if __GLASGOW_HASKELL__ >= 609
import Control.Category
#endif
And the code I'm using to test:
*Search> f <- parseFile "/tmp/file.hs"
*Search> f
ParseFailed (SrcLoc {srcFilename = "/tmp/file.hs", srcLine = 19, srcColumn = 1}) "Parse error: ;"
The issue appears to be the CPP conditional sections, but CPP seems to be a supported extension. I'm using haskell-src-exts-1.11.1 with GHC 7.0.4.
I'm just trying to do some quick and dirty analysis, so I don't mind stripping out those sections before parsing if I have to, but better solutions would be welcomed.
Possibly use cpphs to "evaluate" the pre-processor statements first?
Also, the list where CPP shows up as a known extension is just copied (and extended) from Cabal; haskell-src-exts doesn't actually support CPP.

What Character Encoding Is This?

I need to clean up some files containing French text. The problem is that the files erroneously contain multiple encodings within the same file.
I think some sections are ISO 8859-1 (Latin-1), but other parts have text encoded as single-byte characters that look like 'extended' ASCII. In other words, it looks like plain ASCII plus the following:
0x82 for é (e acute)
0x8a for è (e grave)
0x88 for ê (e circumflex)
0x85 for à (a grave)
0x87 for ç (c cedilla)
What encoding is this?
That's the original IBM PC encoding, Code page 437.
This website here shows 0x87 for the cedilla. I haven't looked much further than this, but I bet the rest of your information could be found there as well.
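If you want to double-check that diagnosis programmatically, decoding those exact bytes as code page 437 should give back the accented letters. A quick sketch (it assumes the IBM437 / Cp437 charset is available in your Java runtime):

import java.nio.charset.Charset;

public class Cp437Check {
    public static void main(String[] args) {
        // The bytes listed in the question, in the same order.
        byte[] raw = {(byte) 0x82, (byte) 0x8A, (byte) 0x88, (byte) 0x85, (byte) 0x87};

        // Code page 437 maps them to é, è, ê, à, ç.
        Charset cp437 = Charset.forName("IBM437");
        System.out.println(new String(raw, cp437));  // prints: éèêàç
    }
}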
