GIZA++: Forbidden zero sentence length 0

I have been using GIZA++ for sentence translation. When I ran it on a test dataset, it displayed the error "ERROR: Forbidden zero sentence length 0". Is there any way to avoid this error?

I had the same problem with an en-vi (English-Vietnamese) corpus.
It happens because your corpus contains lines that are too long or not clean (for example, empty lines).
You should clean up your corpus data. The Moses command below does that: it drops sentence pairs in which either side has fewer than 1 or more than 80 tokens.
~/mosesdecoder/scripts/training/clean-corpus-n.perl ~/corpus/train en vi ~/corpus/train.clean 1 80
Or you can adjust the corpus manually.
Try to cut each line down to fewer than 100 characters or 80 words.
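If you would rather do the manual clean-up in a script, here is a minimal Python sketch of the same idea. It is not a reimplementation of clean-corpus-n.perl (which applies further checks); the function name clean_parallel and the sample sentences are made up for illustration. It keeps a sentence pair only if both sides have between min_len and max_len whitespace-separated tokens, so empty lines are dropped together with their counterpart.

```python
def clean_parallel(src_lines, tgt_lines, min_len=1, max_len=80):
    """Keep only sentence pairs whose token counts are within [min_len, max_len].

    Hypothetical helper: lines are paired by position, so a pair is
    dropped as a whole to keep both sides of the corpus aligned.
    """
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        src_tokens = src.split()
        tgt_tokens = tgt.split()
        if (min_len <= len(src_tokens) <= max_len
                and min_len <= len(tgt_tokens) <= max_len):
            # Re-join tokens so stray whitespace is normalised as well.
            kept.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return kept


# Made-up sample data: the second pair has an empty target line,
# the third pair has a source side that is too long.
english = ["hello world", "this line has no translation", "word " * 81]
vietnamese = ["xin chào thế giới", "   ", "từ"]

pairs = clean_parallel(english, vietnamese)
print(pairs)  # [('hello world', 'xin chào thế giới')]
```

To use it on real files, read ~/corpus/train.en and ~/corpus/train.vi line by line and write the kept pairs back out to two new files.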

Related

How to generate a sequence code string in Rails

I have a model which has a column named code, which is a combination of the model's name column and its ID with leading zeros.
name = 'Rocky'
id = 16
I have an after_create callback which runs and generates the code:
update(code: "#{self.name[0..2].upcase}%.4d" % self.id)
The generated code will be:
"ROC0016"
The code is working.
I found (%.4d" % self.id) from another project, but I don't know how it works.
How does it determine the number of zeros to be preceded based on the passed integer.
You’re using a "format specifier". There are many specifiers, but the one you’re using, "%d", is the decimal specifier:
% starts the specifier. .4 is the precision, which for an integer means "use at least four digits", so if the number has only two digits, it gets padded with 0s on the left to fill in the rest. d formats the argument as a decimal integer. The second %, outside the string, is the String#% operator: it substitutes whatever comes after it into the specifier. So in your case, %.4d is replaced with "0016".
sprintf has more information about format specifiers.
You can read more about String#% in the documentation also.
After the percent sign ("%") comes a decimal point (".") and a number. That number is the minimum number of digits in the result. If the value has fewer digits than that, leading zeros are added.
Thus, in the first example below, the value is "34" but the minimum was set to "4", so the result gets two leading zeros to bring it up to four digits.
"This is test string %.4d" % 34
result => "This is test string 0034"
"I want more zeroes in my code %.7d" % 34
result => "I want more zeroes in my code 0000034"
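For comparison, here is a small sketch in Python rather than Ruby. Python's % operator follows the same printf-style conventions for the cases shown above, so it is a convenient way to poke at the padding rules. The variable names are made up for illustration, and the negative-number lines describe Python's behaviour only.

```python
# Same idea as the Rails callback: first three letters, upper-cased,
# followed by the id padded to at least four digits.
name, record_id = "Rocky", 16
code = "%s%.4d" % (name[:3].upper(), record_id)
print(code)  # ROC0016

# Precision (.4) and the zero flag with a width (04) agree for positive numbers.
padded_precision = "%.4d" % 34
padded_flag = "%04d" % 34
print(padded_precision, padded_flag)  # 0034 0034

# The number is a minimum, not a maximum: longer values are not truncated.
wide = "%.4d" % 123456
print(wide)  # 123456

# They differ for negative numbers: precision counts digits only,
# while the width in %04d also counts the minus sign.
neg_precision = "%.4d" % -34
neg_flag = "%04d" % -34
print(neg_precision, neg_flag)  # -0034 -034
```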

pytesseract ocr limit length of out characters using config

I am building an application in Python using OpenCV that extracts characters from an image and runs pytesseract to convert them to text.
I know that the values are always 2 digits long (range 10-99). How do I configure the parameters so that single-digit outputs are not returned?
I have the following in my code:
text = pytesseract.image_to_string(Image.open(filename),config='--psm 100 --eom 3 -c tessedit_char_whitelist=0123456789')
What do I put instead of config='--psm 100 --eom 3 -c tessedit_char_whitelist=0123456789' so that it only returns 2-digit numbers (i.e. 01 but not 5)?

uniVocity CSV parser: does it handle rows of varying length?

I have a dataset of 26 million rows, but when I parse it with the uniVocity parser it reads only 18 million rows.
The field count of my rows varies from 158 to 162, and the delimiter is ASCII '\u0001'.
Output of wc -l on Linux:
wc -l withHeader.dat
26351323 withHeader.dat
But the parser reports Total # of rows in file = 18554088 (from list.size() of parser.parseAll()).
Can someone explain what the issue could be?
These are my parser settings:
settings.getFormat().setLineSeparator("\n");
settings.selectFields("acctId","tcat", "transCode");
settings.getFormat().setDelimiter('\u0001');
//settings.setAutoConfigurationEnabled(true);
//settings.setMaxColumns(86);
settings.setHeaderExtractionEnabled(false);
// creates a CSV parser
CsvParser parser = new CsvParser(settings);
// parses all rows in one go.
List<String[]> allRows = parser.parseAll(newReader(filePath));
System.out.println("Total # of rows in file = " + allRows.size());
If your values can contain line separators, then the number of parsed records won't be equal to the number of lines.
If that's not the case, then it's likely you are not configuring the format correctly. You might need to configure quotes, quote escapes, etc.
My first suggestion is to try to detect the format automatically with:
settings.detectFormatAutomatically();
After parsing, check if you got the row count you expect to find. You can get what has been detected by calling:
CsvFormat detectedFormat = parser.getDetectedFormat();
Keep in mind this process is not guaranteed to work but in the majority of cases it does the trick. These features are available as of version 2.0.0.
If nothing helps, please attach (part of) your input file so I can take a look and update my answer.
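To illustrate the first point (records versus lines), here is a small sketch using Python's built-in csv module instead of uniVocity. The sample data is made up. A quoted value that contains a line separator makes the record count smaller than the line count that wc -l would report.

```python
import csv
import io

# Made-up sample: the second record has a quoted field spanning two lines.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

line_count = data.count("\n")                  # what wc -l would count
records = list(csv.reader(io.StringIO(data)))  # what a CSV parser returns

print(line_count)    # 4
print(len(records))  # 3
print(records[1])    # ['1', 'first line\nsecond line']
```

If the data is not supposed to use quoting at all, a stray quote character in a value can have the same effect, which is one reason to check how quotes are configured.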

Huffman tree, is this correct?

I'm trying to create a correct Huffman tree and was wondering if this one is correct. In each node, the top number is the frequency/weight and the bottom number is the ASCII code (-1 for internal nodes). The string is
"hhiiiisssss". If I entered this into a text file, there would be only one LF, correct? I'm not sure why my program is reading in two.
              14
              -1
             /  \
            9    5
           -1    s(115)
          /  \
         5    4
        -1    i(105)
       /  \
      3    2
 h(104)    LF(10)
In a text file there would only be one LF if there is only one line of text, correct.
Something else is wrong though. There are only two 'h' in your string but your tree shows three, and a total of 14 characters. I'm guessing it's a typo?
Aside from that it looks ok and your huffman codes would be (depending on whether you pick '0' for left or right):
s: 1
i: 01
LF: 001
h: 000
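If you want to check a tree like this by machine, here is a minimal Python sketch of the usual heap-based construction, using the frequencies from the tree in the question (h=3, LF=2, i=4, s=5). The function name huffman_codes is made up for illustration. Note that when two weights tie, different tie-breaking gives a differently shaped tree with different codes, so this sketch may not reproduce the codes listed above; what should match is the total encoded length, which works out to 28 bits for both.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Return {symbol: bit string} for a dict of symbol frequencies."""
    tiebreak = count()  # unique counter so the heap never compares nodes
    heap = [(weight, next(tiebreak), sym) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}

    # Repeatedly merge the two lightest nodes into one internal node.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):      # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                            # leaf: a symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


freqs = {"h": 3, "\n": 2, "i": 4, "s": 5}
codes = huffman_codes(freqs)
total_bits = sum(freqs[sym] * len(codes[sym]) for sym in freqs)

# Cost of the codes listed in the answer: s=1, i=2, LF=3, h=3 bits.
answer_bits = 5 * 1 + 4 * 2 + 2 * 3 + 3 * 3

print(total_bits, answer_bits)  # 28 28
```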

COBOL accurate LENGTH OF (XML-TEXT) when ampersands escaped?

I'm using the intrinsic function XML-PARSE with XML that looks like this:
<MSGBODYTXT>
<LN>One &amp; two</LN>
</MSGBODYTXT>
By my count, the following string is 13 bytes long.
"One &amp; two"
But when I take LENGTH OF(XML-TEXT), I get only 9 bytes.
What can I do to get the correct, 13-byte length?
The problem is that XML PARSE translates &amp; into the character it represents, an ampersand. If you look at the CONTENT-CHARACTERS associated with the <LN> tag, you will see One & two, which is 9 characters long, just as the LENGTH OF operator on XML-TEXT says it is.
Note that if you were to use XML GENERATE on a data item LN having the value One & two, it would generate <LN>One &amp; two</LN>, which is a symmetrical operation.
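The same round trip can be seen outside COBOL. The following is only an illustrative Python sketch using the standard library, not COBOL code: parsing decodes the entity (9 characters), and escaping the parsed text again restores the 13-character form that appears in the document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "<MSGBODYTXT><LN>One &amp; two</LN></MSGBODYTXT>"

# Parsing decodes &amp; into a single ampersand character.
parsed_text = ET.fromstring(raw).find("LN").text
print(parsed_text, len(parsed_text))    # One & two 9

# Escaping the parsed text again gives back the on-the-wire form.
escaped_text = escape(parsed_text)
print(escaped_text, len(escaped_text))  # One &amp; two 13
```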
