I'm trying to parse Chinese data in the CoNLL format using the Stanford dependency parser edu/stanford/nlp/models/parser/nndep/CTB_CoNLL_params.txt.gz, but I seem to have some encoding difficulties.
My input file is in UTF-8 and already segmented into words; a sentence looks like this: 那时 的 坎纳里 鲁夫 , 有着 西海岸 最大 的 工业化 罐头 工厂 。
The commands I use to run the model are these:
java -mx2200m -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
-language Chinese \
-encoding utf-8 \
-props StanfordCoreNLP-chinese.properties \
-annotators tokenize,ssplit,pos,depparse \
-file ./ChineseCorpus/ChineseTestSegmented.txt \
-outputFormat conll \
It all seems to work fine except that the Chinese characters are not encoded correctly; this is the output I get:
1 ?? _ NT _ 2 DEP
2 ? _ DEG _ 4 NMOD
3 ??? _ NR _ 4 NMOD
4 ?? _ NR _ 6 SUB
5 ? _ PU _ 6 P
6 ?? _ VE _ 0 ROOT
7 ??? _ NN _ 12 NMOD
8 ?? _ JJ _ 9 DEP
9 ? _ DEG _ 12 NMOD
10 ??? _ NN _ 12 NMOD
11 ?? _ NN _ 12 NMOD
12 ?? _ NN _ 6 OBJ
13 ? _ PU _ 6 P
According to the Stanford Parser FAQ, the standard encoding for Chinese is GB18030, but it also says "However, the parser is able to parse text in any encoding, providing you pass the correct encoding option on the command line", which I did.
I have looked at this question: How to use Stanford LexParser for Chinese text? but its iconv-based solution doesn't work for me; I get the error "cannot convert", and I have tried several combinations of encodings.
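For example, this is roughly one of the conversions I attempted (the output filename here is illustrative):
iconv -f UTF-8 -t GB18030 ./ChineseCorpus/ChineseTestSegmented.txt > ./ChineseCorpus/ChineseTestSegmented.gb.txt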
Any suggestions on what is going wrong?
Try something like:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
-language Chinese -props StanfordCoreNLP-chinese.properties \
-annotators segment,ssplit,pos,parse -file chinese-in.txt -outputFormat conll
E.g.:
alvas@ubi:~/stanford-corenlp-full-2015-12-09$ cat chinese-in.txt
那时的坎纳里鲁夫,有着西海岸最大的工业化罐头工厂。
alvas@ubi:~/jose-stanford/stanford-corenlp-full-2015-12-09$ \
> java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
> -language Chinese -props StanfordCoreNLP-chinese.properties \
> -annotators segment,ssplit,pos,parse -file chinese-in.txt -outputFormat conll
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Registering annotator segment with class edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator segment
Loading classifier from edu/stanford/nlp/models/segmenter/chinese/ctb.gz ... [main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
[main] INFO edu.stanford.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
done [14.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [1.4 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz ...
done [5.2 sec].
Processing file /home/alvas/jose-stanford/stanford-corenlp-full-2015-12-09/chinese-in.txt ... writing to /home/alvas/jose-stanford/stanford-corenlp-full-2015-12-09/chinese-in.txt.conll
Annotating file /home/alvas/jose-stanford/stanford-corenlp-full-2015-12-09/chinese-in.txt
[main] INFO edu.stanford.nlp.wordseg.TagAffixDetector - INFO: TagAffixDetector: useChPos=false | useCTBChar2=true | usePKChar2=false
[main] INFO edu.stanford.nlp.wordseg.TagAffixDetector - INFO: TagAffixDetector: building TagAffixDetector from edu/stanford/nlp/models/segmenter/chinese/dict/character_list and edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
[main] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list
[main] INFO edu.stanford.nlp.wordseg.affDict - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb
done.
Annotation pipeline timing information:
ChineseSegmenterAnnotator: 0.2 sec.
WordsToSentencesAnnotator: 0.0 sec.
POSTaggerAnnotator: 0.0 sec.
ParserAnnotator: 0.9 sec.
TOTAL: 1.2 sec. for 13 tokens at 11.0 tokens/sec.
Pipeline setup: 21.1 sec.
Total time for StanfordCoreNLP pipeline: 22.3 sec.
[out]:
http://pastebin.com/raw/Y9J0UBDF
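If the characters still come out as question marks with a setup like this, it may also be worth forcing UTF-8 explicitly, both for the input (the -encoding flag you already pass) and for the JVM's default charset; an untested variant of the command above:
java -mx4g -Dfile.encoding=UTF-8 -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
-language Chinese -props StanfordCoreNLP-chinese.properties \
-annotators segment,ssplit,pos,parse \
-encoding utf-8 -file chinese-in.txt -outputFormat conll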
I'm trying to replicate the example provided here. In my case, though, when I do torch.CharStorage('hello.txt') I get [torch.CharStorage of size 0]. Here's the full output:
$ echo "Hello World" > hello.txt
$ th
______ __ | Torch7
/_ __/__ ________/ / | Scientific computing for Lua.
/ / / _ \/ __/ __/ _ \ | Type ? for help
/_/ \___/_/ \__/_//_/ | https://github.com/torch
| http://torch.ch
th> x = torch.CharStorage('hello.txt')
[0.0001s]
th> x
[torch.CharStorage of size 0]
I also noticed that when I do torch.CharStorage('hello.txt', false, 11) the data is read correctly. However, in the documentation the shared and size parameters are specified as optional. Is the documentation out of date, or am I doing something wrong?
You appear to be running into distro bug #245, introduced by commit 6a35cd9. As stated in torch7 bug #1064, you can work around it by either updating your pkg/torch submodule to commit 89ede3b or newer, or rolling it back to commit 2186e41 or older.
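For instance, the submodule rollback could look roughly like this, assuming a standard clone of the torch/distro repository (the paths and the rebuild step are illustrative):
cd ~/torch                          # your clone of torch/distro
git -C pkg/torch fetch
git -C pkg/torch checkout 2186e41   # or 89ede3b or newer, per torch7 issue #1064
./install.sh                        # rebuild the distro so the submodule change takes effect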
I have a directory of 130,000+ .tif files that I want to process with find and GNU parallel. All my files are named following the pattern and sequence k-001 to k-163. One of the challenges is matching 001 with seq's 1.
I tried this:
seq 111 163 | parallel -j10 find . -name 'k-{}\*' -print0 | parallel -0 'tesseract {/} /mnt/ramdisk/output/{/.} > /dev/null 2>&1'
I am not getting parallelism from the seq part. Where am I going wrong?
Not sure what the actual issue is, but you can generate zero-padded numbers like this if that is the problem:
printf "%03d\n" {0..10} | parallel -k echo
000
001
002
003
004
005
006
007
008
009
010
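Combining that with the pipeline from your question, something along these lines should give zero-padded names while keeping the parallelism (a sketch, assuming GNU seq and GNU parallel; the find pattern and the tesseract call are taken from your command):
seq -f 'k-%03g' 1 163 \
| parallel -j10 find . -name '{}\*' -print0 \
| parallel -0 'tesseract {/} /mnt/ramdisk/output/{/.} > /dev/null 2>&1'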
I need to parse a log file so that I output the errors (with the stack trace below each one) and the 10 lines above each error.
For example:
2017-10-29 00:00:10,440 INFO ...
2017-10-29 00:00:10,473 WARN ...
2017-10-29 00:00:10,504 INFO ...
2017-10-29 00:00:10,547 INFO ...
2017-10-29 00:00:10,610 INFO ...
2017-10-29 00:00:11,176 WARN ...
2017-10-29 00:00:11,894 WARN ..
2017-10-29 00:00:11,900 INFO ...
2017-10-29 00:00:11,900 INFO ...
2017-10-29 00:00:12,632 WARN ...
2017-10-29 00:00:12,946 ERROR...
...(stack trace)...
...(stack trace)...
...(stack trace)...
2017-10-29 00:00:12,946 WARN
I need to output the 10 lines above each ERROR, and then everything from the ERROR until the next date line (2017-10-29) below it, not including that date line.
I thought about doing it with grep -n -B10 "ERROR" (for the 10 lines above) and sed '/ERROR/,/29/p' (for the stack trace), but how do I combine the two?
With grep + head pipeline:
grep -B10 'ERROR' g1 | head -n -1
This might work for you (GNU sed):
sed -n ':a;N;/ERROR/bb;s/[^\n]\+/&/11;Ta;D;:b;p;n;/^2017-10-29/!bb' file
Gather up at most 10 lines in the pattern space, then use these lines as a moving window through the file. When the string ERROR is encountered, print the window and then any further lines until (but not including) a line beginning with 2017-10-29. Repeat as necessary.
Try this one:
grep -no -B10 ' [0-9:,]* ERROR.*' infile
You may need to substitute ' ' with [[:blank:]].
Here is a way using awk:
awk ' {a[++b]=$0}
/^([0-9]{2,4}-?){3}/ {f=0}
/ERROR/ {f=1; for(i=NR-10;i<NR;i++) print a[i]}
f' file
We store each line in an array. When a line matches the date pattern, we unset the flag. When a line matches ERROR, we set the flag and print the last 10 lines from the array. And while the flag is on, we print the current line (the default action, so we just wrote f).
This should print the expected lines for every ERROR in the file.
Note: the date regexp used is not strict, but it seems sufficient for this case.
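For reference, on the sample log in the question this should print the 10 dated lines before the ERROR, then the ERROR line and its stack trace, and stop before the trailing WARN line, roughly:
2017-10-29 00:00:10,440 INFO ...
2017-10-29 00:00:10,473 WARN ...
...
2017-10-29 00:00:12,632 WARN ...
2017-10-29 00:00:12,946 ERROR...
...(stack trace)...
...(stack trace)...
...(stack trace)...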
I have two text files containing one column each, for example -
File_A File_B
1 1
2 2
3 8
If I do grep -f File_A File_B > File_C, I get File_C containing 1 and 2. I want to know how to use grep -v on two files so that I can get the non-matching values, 3 and 8 in the above example.
Thanks.
You can also use comm if it supports an empty output delimiter:
$ # -3 means suppress lines common to both input files
$ # by default, tab character appears before lines from second file
$ comm -3 f1 f2
3
8
$ # change it to empty string
$ comm -3 --output-delimiter='' f1 f2
3
8
Note: comm requires sorted input, so use comm -3 --output-delimiter='' <(sort f1) <(sort f2) if they are not already sorted
You can also pass the common lines obtained from grep as input to grep -v. Tested with GNU grep; some versions might not support all of these options:
$ grep -Fxf f1 f2 | grep -hxvFf- f1 f2
3
8
-F option to match strings literally, not as regex
-x option to match whole lines only
-h to suppress file name prefix
f- to accept stdin instead of file input
awk 'NR==FNR{a[$0]=$0;next} !($0 in a) {print a[(FNR)], $0}' f1 f2
3 8
To understand the meaning of NR and FNR, check the output of printing them below:
awk '{print NR,FNR}' f1 f2
1 1
2 2
3 3
4 4
5 1
6 2
7 3
8 4
The condition NR==FNR is used to extract the data from the first file, since NR and FNR are equal only while the first file is being read.
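As a minimal sketch of the same NR==FNR idiom applied directly to the task, the following prints the lines of f2 not present in f1, and then the reverse:
awk 'NR==FNR{a[$0]; next} !($0 in a)' f1 f2    # prints 8
awk 'NR==FNR{a[$0]; next} !($0 in a)' f2 f1    # prints 3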
With the GNU diff command (to compare files line by line):
diff --suppress-common-lines -y f1 f2 | column -t
The output (the left column contains lines from f1, the right column lines from f2):
3 | 8
-y, --side-by-side - output in two columns
I am trying to run the Lanczos example of Mahout.
I am having trouble finding the input file, and I don't know what format the input file should be in.
I used the following commands to convert the .txt file into SequenceFile format and then ran the solver:
bin/mahout seqdirectory -i input.txt -o outseq -c UTF-8
bin/mahout seq2sparse -i outseq -o ttseq
bin/hadoop jar mahout-examples-0.9-SNAPSHOT-job.jar org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver --input /user/hduser/outputseq --output /out1 --numCols 2 --numRows 4 --cleansvd "true" --rank 5
14/03/20 13:36:12 INFO lanczos.LanczosSolver: Finding 5 singular vectors of matrix with 4 rows, via Lanczos
14/03/20 13:36:13 INFO mapred.FileInputFormat: Total input paths to process : 7
Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:245)
at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:200)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:152)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.run(DistributedLanczosSolver.java:111)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver$DistributedLanczosSolverJob.run(DistributedLanczosSolver.java:283)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver.main(DistributedLanczosSolver.java:289)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/user/hduser/ttseq/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:51)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.mahout.math.hadoop.DistributedRowMatrix.timesSquared(DistributedRowMatrix.java:237)
... 13 more
Any ideas, please?
In your case you are doing input.txt -> outseq -> ttseq.
You are using outputseq (not outseq) as the input to generate out1.
Yet you are getting an error about ttseq. That is strange; perhaps a step is missing from your post.
For me:
This PASSES: text-files -> output-seqdir -> output-seq2sparse-normalized
This FAILS: text-files -> output-seqdir -> output-seq2sparse -> output-seq2sparse-normalized
More details.
I am seeing this error in a different situation:
Create sequence files:
$ mahout seqdirectory -i /data/lda/text-files/ -o /data/lda/output-seqdir -c UTF-8
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 20:47:25 INFO common.AbstractJob: Command line arguments: {--charset=[UTF-8], --chunkSize=[64], --endPhase=[2147483647], --fileFilterClass=[org.apache.mahout.text.PrefixAdditionFilter], --input=[/data/lda/ohsumed_full_txt/ohsumed_full_txt/], --keyPrefix=[], --output=[/data/lda/output], --startPhase=[0], --tempDir=[temp]}
14/03/24 20:57:20 INFO driver.MahoutDriver: Program took 594764 ms (Minutes: 9.912733333333334)
Convert the sequence files to sparse vectors (TF-IDF is used by default):
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse/ -ow
Running on hadoop, using ....hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ....mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:00:08 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 1
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 1.0
14/03/24 21:00:09 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 1
14/03/24 21:00:10 INFO input.FileInputFormat: Total input paths to process : 1
14/03/24 21:00:11 INFO mapred.JobClient: Running job: job_201403241418_0001
.....
14/03/24 21:02:51 INFO driver.MahoutDriver: Program took 162906 ms (Minutes: 2.7151)
The following command fails (using /data/lda/output-seq2sparse as input):
$ mahout seq2sparse -i /data/lda/output-seq2sparse -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Exception in thread "main" java.io.FileNotFoundException: File does not exist: hdfs://localhost:54310/data/lda/output-seq2sparse/df-count/data
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:528)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:63)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
....SKIPPED....
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
However, this works just fine (using /data/lda/output-seqdir as input):
$ mahout seq2sparse -i /data/lda/output-seqdir -o /data/lda/output-seq2sparse-normalized -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 -nr 5
Running on hadoop, using .../hadoop-1.1.1/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: ..../mahout-distribution-0.7/mahout-examples-0.7-job.jar
14/03/24 21:35:55 INFO vectorizer.SparseVectorsFromSequenceFiles: Maximum n-gram size is: 2
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Minimum LLR value: 50.0
14/03/24 21:35:56 INFO vectorizer.SparseVectorsFromSequenceFiles: Number of reduce tasks: 5
14/03/24 21:35:57 INFO input.FileInputFormat: Total input paths to process : 1
...SKIPPED...
14/03/24 21:45:11 INFO common.HadoopUtil: Deleting /data/lda/output-seq2sparse-normalized/partial-vectors-0
14/03/24 21:45:11 INFO driver.MahoutDriver: Program took 556420 ms (Minutes: 9.273666666666667)