I'm trying out training OpenNLP's Name Finder on some data, following the guide in the documentation. However, I ran into the error Unsupported language: en, which doesn't seem to make any sense.
The command I ran is: opennlp TokenNameFinderTrainer.conll03 -model model.bin -lang en -types per,loc,org,misc -data train.txt -encoding UTF-8
I downloaded OpenNLP 1.9.0 from https://opennlp.apache.org/download.html. The OPENNLP_HOME environment variable does seem to be properly set, and the lang folder in the base folder contains an en folder.
EDIT: This seems to have something to do with the CoNLL 2003 format. If I run the trainer directly, without the .conll03 format suffix, it works; however, my input data is in CoNLL 2003 format. Running TokenNameFinderConverter gives me the same error. It even fails on the official sample: https://github.com/apache/opennlp/blob/master/opennlp-tools/src/test/resources/opennlp/tools/formats/conll2003-en.sample
OK, so apparently in some version after 1.5.3, OpenNLP changed the language codes for the CoNLL 2003-related commands from two letters to three letters, i.e. one should pass eng instead of en. But the documentation was never updated to reflect this. (There are various outdated portions of the documentation.) I was banging my head against this for 2 hours trying to figure it out! I made a PR to fix the documentation.
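For reference, this is the command from above with the corrected language code (everything else unchanged):
opennlp TokenNameFinderTrainer.conll03 -model model.bin -lang eng -types per,loc,org,misc -data train.txt -encoding UTF-8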
I'm taking my first steps in ML using IPython in Jupyter. I was advised to start with Nasdaq's ITCH order book dataset to create models, and I'm following the steps in this tutorial on GitHub.
I can't seem to unzip/expand files from the ITCH dataset when executing the function may_be_download(url) and the following code (code cell no. 5 in the tutorial):
file_name = may_be_download(urljoin(FTP_URL, SOURCE_FILE))
date = file_name.name.split('.')[0]
I get the following error: EOFError: Compressed file ended before the end-of-stream marker was reached
Nor am I able to simply unzip the file by clicking on it in Finder or by using gzip and gunzip in Terminal.
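For reference, here is a minimal way I can reproduce the error outside the notebook (a rough sketch; it assumes the archive was copied to a local data folder, as in the steps below):

# Rough sketch: stream through the archive to check it decompresses cleanly.
# Assumes the file sits in ./data as described in the steps below.
import gzip

path = 'data/03272019.NASDAQ_ITCH50.gz'
try:
    with gzip.open(path, 'rb') as f:
        while f.read(1 << 20):  # read 1 MiB at a time; discard the data
            pass
    print('archive decompressed cleanly')
except EOFError as e:
    print('truncated/corrupted archive:', e)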
I took the following steps:
Executed all previous code cells (1-4)
Copied the file 03272019.NASDAQ_ITCH50.gz to a folder named data in the relative path
First I clicked on the sample link in the notebook
Then logged in as a guest and navigated to the folder Nasdaq ITCH
Then located the file 03272019.NASDAQ_ITCH50.gz and copied it to a local folder
Executed code cell no. 5 listed above
I've searched for and tried numerous solutions to similar issues listed here on Stack Overflow and GitHub, but none seem to solve this particular problem. I would deeply appreciate any help and thoughts on what may be occurring and how I might go about solving it.
I'll leave you with a picture of the error logs, in case it is of some help.
Thanks for reading.
I downloaded that file and one other from that site. They both appear to be corrupted, both failing with incomplete deflate data.
What's more, the site publishes MD5 signatures for the files, and the MD5 signatures of what is downloaded do not match.
This is not being caused by the FTP server doing end-of-line conversions, because the lengths of the files in bytes exactly match the lengths on the server. Also, a histogram of the byte values shows no bias.
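If you want to check this yourself, a quick sketch along these lines compares the download against the published checksum (EXPECTED below is a placeholder, not a real value; substitute the MD5 listed next to the file on the server):

# Sketch: compute the MD5 of the downloaded archive and compare it to the
# checksum published on the FTP site.
import hashlib

def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)  # feed the file through MD5 in 1 MiB chunks
    return h.hexdigest()

EXPECTED = '<md5 listed on the server>'  # placeholder
actual = md5_of('data/03272019.NASDAQ_ITCH50.gz')
print('match' if actual == EXPECTED else 'MISMATCH', actual)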
I have an SPSS syntax file that I need to run on multiple files, each in a different directory with the same name as the file, and I am trying to do this automatically. So far I have tried doing it with syntax code, as I am trying to avoid using Python in SPSS, but all I have been able to come up with is the code below, which does not work.
VECTOR v = key.
LOOP #i = 1 to 41.
GET
FILE=CONCAT('C:\Users\myDir\otherDir\anotherDir\output\',v(#i),'\',v(#i),'.sav').
DATASET NAME Data#i WINDOW=FRONT.
*Do stuff to the opened file
END LOOP.
EXE.
key is the only column in a file that contains all the names of the files.
I am also having trouble debugging, since I don't know how to print to the screen, if that is even possible. So my question is: is there a way to get the code above to work, or another option that accomplishes the same thing?
You can't use an expression like that on a GET command. There are two choices: use the macro language to put this together (see DEFINE in the Command Syntax Reference, via the Help menu), or use the SPSSINC PROCESS FILES extension command or your own Python code to select the files with a wildcard.
The extension command or a Python program requires the free Python Essentials, available from the SPSS Community website or installable with your version of Statistics.
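If you do go the Python route, here is a rough, untested sketch. It assumes Python Essentials is installed, and names.txt is a hypothetical plain-text export of your key column, one file name per line:

# Rough sketch using the SPSS-Python integration; run it inside
# BEGIN PROGRAM. ... END PROGRAM. in a syntax window.
import spss

base = r'C:\Users\myDir\otherDir\anotherDir\output'
with open(r'C:\Users\myDir\names.txt') as f:  # hypothetical export of the key column
    names = [line.strip() for line in f if line.strip()]

for i, name in enumerate(names, start=1):
    spss.Submit(r"""
GET FILE='{base}\{name}\{name}.sav'.
DATASET NAME Data{i} WINDOW=FRONT.
* Do stuff to the opened file.
""".format(base=base, name=name, i=i))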
I was using Deedle in F# to read a txt file (no header) into a data frame, and I cannot find any example of how to specify the schema.
let df = Frame.ReadCsv(datafile, separators="\t", hasHeaders=false, schema=schema)
I tried passing a string with names separated by ',', but it doesn't seem to work.
let schema = "name, age, address"
I did some searching in the docs, but only found the following, and I don't know where to find the info. :(
schema - A string that specifies CSV schema. See the documentation
for information about the schema format.
The schema format is the same as in the CSV type provider in F# Data: a comma-separated list of column names with optional types in parentheses, e.g. "Name (string), Age (int), Address (string)".
The only problem (quite important!) is that the Deedle library had a bug where it completely ignored the schema parameter, so no matter what you provided, it was ignored.
I just submitted a pull request that fixes the bug and also includes some examples (in the form of unit tests). See the pull request here (and click on "Files changed" to see the samples).
If you do not want to wait for a new release, just get the code from my GitHub fork and build it using build.cmd in the root (run it once first to restore packages). The complete build requires a local installation of R (because it builds the R plugin too), but it should build Deedle.dll before failing... (After the first run of build.cmd, you can just use the Deedle.sln solution.)
I'm using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab. I'm having some trouble getting it to tokenize all possible words. Specifically, I cannot tokenize "吉本興業" into the two pieces "吉本" and "興業". Are there any options I could use to fix this? The iPhone library does not expose any, but it uses C++ underneath the Objective-C wrapper. I assume there must be some sort of setting I could change for more fine-grained control, but I have no idea where to start.
By the way, if anyone wants to tag this 'mecab' that would probably be appropriate. I'm not allowed to create new tags yet.
UPDATE: The iOS library is calling mecab_sparse_tonode2() defined in libmecab.cpp. If anyone could point me to some English documentation on that file it might be enough.
There is nothing iOS-specific in this. The dictionary you are using with mecab (probably ipadic) contains an entry for the company name 吉本興業. Although both parts of the name are listed as separate nouns as well, mecab has a strong preference to tag the compound name as one word.
Mecab lacks a feature that allows the user to choose whether or not compounds should be split into parts. Note that such a feature is generally hard to implement because not everyone agrees on which compounds can be split and which ones can't. E.g. is 容疑者 a compound made up of 容疑 and 者? From a purely morphological point of view perhaps yes, but for most practical applications probably no.
If you have a list of compounds you'd like to get segmented, a quick fix is to create a user dictionary for the parts they consist of, and make mecab use this in addition to the main dictionary.
There is Japanese documentation on how to do this here. For your particular example, it would involve the steps below.
Make a user dictionary with two entries, one for 吉本 and one for 興業:
吉本,,,100,名詞,固有名詞,人名,名,*,*,よしもと,ヨシモト,ヨシモト
興業,,,100,名詞,一般,*,*,*,*,こうぎょう,コウギョウ,コウギョウ
I suspect that both entries exist in the default dictionary already, but by adding them to a user dictionary with a relatively low cost value (I've used 100 for both -- the lower the cost, the more strongly mecab prefers the entry), you can get mecab to tend to prefer the parts over the whole.
Compile the user dictionary:
$> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic
You may have to adjust the command. The above assumes:
Mecab was installed from source in $MECAB. If you use mecab installed by a package manager, you might have difficulties finding the mecab-dict-index tool. Best install from source.
The default dictionary is in /usr/lib64/mecab/dic/ipadic. This is not part of the mecab package; it comes as a separate package, and you may have difficulties finding this, too.
mydic is the name of the user dictionary source created in step 1. mydic.dic is the name of the compiled dictionary you'll get as output (it need not exist beforehand).
The user dictionary source is encoded in UTF-8 (-f option), as is the compiled output (-t option), which must match the encoding of the system dictionary. If either is wrong, you'll get an error message later when you use mecab.
Modify the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or similar. In your case it may be located somewhere else. Add the following line to the end of the configuration file:
userdic = /home/myhome/mydic.dic
Make sure the absolute path to the dictionary compiled above is correct.
If you then run mecab against your input, it will split the compound into its parts (I tested it, using mecab 0.994 on a Linux system).
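For a quick check from the command line, pipe the compound through mecab with the user dictionary; the feature columns below just mirror the user dictionary entries from step 1, so your actual output may differ slightly:

$> echo 吉本興業 | mecab -u /home/myhome/mydic.dic
吉本	名詞,固有名詞,人名,名,*,*,よしもと,ヨシモト,ヨシモト
興業	名詞,一般,*,*,*,*,こうぎょう,コウギョウ,コウギョウ
EOS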
A more thorough fix would be to get the source of the default dictionary, manually remove all compound nouns you want to have split, and then recompile the dictionary. As a general remark, using a CJK tokenizer in a serious production application over a longer period of time usually involves a certain amount of regular dictionary maintenance (adding/removing entries).
I'm having big trouble with the libraries I have to use in my project.
Whenever I try one of the libraries, a problem appears, and I don't have much time to lose :( My project is "Image Understanding",
so I need feature extraction, image segmentation, and machine learning.
After some reading, it turned out that SVM is the best option,
and I want some code to build mine on and start off.
1- First I looked at AForge and Accord, and there was an example named "SupportVectorMachine", but it's not about images.
2- I found a great example in EmguCV named "LatentSvmDetector", and it detected every image of a cat I tried!! But the problem is the XML file!
I just wanted to know how they generated it, and I couldn't find a simple answer.
Actually, I asked about it here and nobody answered me :(
[link] How to extract features from image for classification and object recognition?
3- I found an example that uses OpenCV on this site:
[link] http://www.di.ens.fr/~laptev/download.html
But the same problem: the XML file?!
I tried taking the XML file from this example and using it in the EmguCV example, but it didn't work either.
4- In all the papers I read, they use ImageNet and PASCAL VOC. I downloaded them and they're not working!! Errors in the tools' code!! I've fixed them all,
but they're still not compiling; those tools are written in MATLAB.
Here's my question on this site:
[link] Matlab Mex32 link error while compiling Felzenszwalb VOC on Windows
For God's sake, can anybody tell me what I should do?!
I'm running out of time and need your help!
Thanks.
I'm not sure, because I have never used SVM (though I have used Haar training), but I think they trained the detector using a program that outputs an XML file at the end of training. I did a quick search and found this link (the OpenCV docs about SVM training) and this link (a post with an example). I hope it helps you and sheds some light.
MATLAB supports XML files, both reading and writing. Try:
xmlfile = fullfile(matlabroot, 'path/to/xml/file/myfile.xml');  % build the full path to your XML file (placeholder path)
xDoc = xmlread(xmlfile)  % parse the file into a DOM document node
If you don't have the xmlread function, you can try this toolbox: http://www.mathworks.com/matlabcentral/fileexchange/4278-xml-toolbox