Ruby calling a Python script causes encoding error with German characters - ruby-on-rails

A Ruby on Rails application launches a Python script to get German word lemmas. The Python script exits with the following error:
  File "/PATHTOSCRIPT/script.py", line 15, in <module>
    for l in sys.stdin:
  File "/PATHTOPYTHON/python3.4/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Ruby on Rails:
require 'open3'
@in, @out, stderr = Open3.popen3("/PATHTOSCRIPT/script.py") if ['de'].include? lang
a = "übervölkerung"
@in.write "#{a}\n"
logger.info(@treetagger_out.read.nil?)
logger.info(stderr.read)
Python:
import sys
import os
sys.stdin = os.fdopen(sys.stdin.fileno(), 'r', buffering=1)
for l in sys.stdin:
    l = l.strip()
I've detected that in Ruby and Python there are different character counts:
Ruby:
2.2.3 :006 > a="übervölkerung"
=> "übervölkerung"
2.2.3 :007 > print a.bytes
[195, 188, 98, 101, 114, 118, 195, 182, 108, 107, 101, 114, 117, 110, 103] => nil
Python:
>>> a="übervölkerung"
>>> print(list(map(ord, a)))
[252, 98, 101, 114, 118, 246, 108, 107, 101, 114, 117, 110, 103]

The input to your Python script is apparently text encoded with UTF-8.
If you encode your test string "übervölkerung" with UTF-8, the first byte is C3, which is the byte reported in the traceback at the beginning of your post.
This means you need to read STDIN with a text stream that decodes UTF-8, not ASCII.
You already have a line that creates a wrapper around sys.stdin:
sys.stdin = os.fdopen(sys.stdin.fileno(), 'r', buffering=1)
This replaces the default text stream reader (an io.TextIOWrapper instance) with a new one.
But you don't specify the input encoding, so the default encoding is used – which is determined by the environment (based on OS-specific environment variables).
In your case the encoding apparently defaults to ASCII, which is not what you need.
You need UTF-8, so write:
sys.stdin = os.fdopen(sys.stdin.fileno(), 'r', encoding='UTF-8')
(Of course you can leave the buffering=1 parameter there if you think you need it.)
Also, os.fdopen is just a more restricted version of the built-in open function. So you can just use that one, without losing anything:
sys.stdin = open(sys.stdin.fileno(), 'r', encoding='UTF-8')
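As a minimal sketch of the same idea, the snippet below decodes a byte stream as UTF-8 with io.TextIOWrapper and shows that ASCII decoding of the same bytes fails. The helper name utf8_lines and the in-memory io.BytesIO stand-in for the pipe are assumptions made for illustration; in the real script one would presumably pass sys.stdin.buffer instead.

```python
import io

def utf8_lines(binary_stream):
    """Yield stripped lines from a binary stream, decoded as UTF-8.

    Hypothetical helper for illustration; not part of the original script.
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    for line in text_stream:
        yield line.strip()

# Stand-in for the pipe: the bytes Ruby would write for "übervölkerung\n".
fake_stdin = io.BytesIO("\u00fcberv\u00f6lkerung\n".encode("utf-8"))
words = list(utf8_lines(fake_stdin))

# Decoding the same bytes as ASCII fails on the first byte (0xC3).
try:
    "\u00fcberv\u00f6lkerung".encode("utf-8").decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Wrapping the binary buffer explicitly keeps the decoding independent of the environment's locale settings, which is what went wrong when the script was launched from Rails.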
By the way, the difference in character count you see between Ruby and Python comes from the fact that you are looking at different things.
In the Ruby code, you look at the bytes of the UTF-8 encoded text, while in Python you look at the (Unicode) code points.
In the Python output, each number corresponds to a single character, while in the Ruby output ü and ö each take two numbers (195, 188 and 195, 182).
To see the byte values in Python, do:
>>> a = "übervölkerung"
>>> list(a.encode('utf8'))
[195, 188, 98, 101, 114, 118, 195, 182, 108, 107, 101, 114, 117, 110, 103]
I don't know how to see code points in Ruby, though.
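To make the two views easy to compare, here is a small sketch that computes both lists for the same word in Python. The variable names are illustrative only. In Ruby, String#codepoints should give the equivalent of the first list.

```python
word = "\u00fcberv\u00f6lkerung"  # "übervölkerung"

code_points = [ord(ch) for ch in word]    # one number per character
utf8_bytes = list(word.encode("utf-8"))   # one number per encoded byte

print(len(code_points), code_points)
print(len(utf8_bytes), utf8_bytes)
```

The lengths differ because each of the two umlauts is a single code point but takes two bytes in UTF-8.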

Related

How does the control sequence ~s work?

The following output is as I expected:
125> [97, 98, 99].
"abc"
126> [97, 98, 0].
[97,98,0]
But the output using ~s is not what I expected:
127> io:format("~s~n", [[97, 98, 0]]).
ab^@
ok
How do I interpret that output?
The ~s control sequence expects to get a string, a binary or an atom, and prints it "with string syntax". As Erlang strings are just lists of integers, it tries to print [97, 98, 0] in this example as a string as well. The shell on the other hand tries to guess whether this list of integers is meant to be a string.
^@ represents the NUL character. You might be familiar with caret notation, where ^A represents byte 1, since A is the 1st letter of the alphabet. In other words, ^X represents the byte whose value is 64 less than the ASCII value of X, since A is 65 in ASCII. Extrapolate that to byte 0 and you get @, which is 64 in ASCII.
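The caret-notation rule can be sketched in a few lines of Python. This only illustrates how a terminal typically renders control bytes; it is not how Erlang implements ~s, and the function names are made up for the example.

```python
def caret_notation(byte):
    """Return caret notation for an ASCII control byte (0-31 or 127)."""
    if not (0 <= byte < 32 or byte == 127):
        raise ValueError("not an ASCII control character")
    # Flipping bit 6 adds 64 for bytes 0-31 and maps 127 (DEL) to '?'.
    return "^" + chr(byte ^ 0x40)

def render(codes):
    """Render character codes, showing control bytes in caret notation."""
    return "".join(
        caret_notation(c) if c < 32 or c == 127 else chr(c) for c in codes
    )

print(render([97, 98, 0]))
```

Under this rule byte 0 maps to the character with code 64, which is why NUL appears as a caret followed by an at sign.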

IOError while retrieving sequences from fasta file using biopython

I have a FASTA file containing papillomavirus sequences (entire genomes, partial CDS, ...) and I'm using Biopython to retrieve the entire genomes (around 7 kb) from this file. Here's my code:
rec_dict = SeqIO.index("hpv_id_name_all.fasta","fasta")
for k in rec_dict.keys():
    c=c+1
    if len(rec_dict[k].seq)>7000:
        handle=open(rec_dict[k].description+"_"+str(len(rec_dict[k].seq))+".fasta","w")
        handle.write(">"+rec_dict[k].description+"\n"+str(rec_dict[k].seq)+"\n")
        handle.close()
I'm using an indexed dictionary to avoid loading everything into memory. The variable "c" counts how many iterations are made before this error pops up:
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
IOError: [Errno 2] No such file or directory: 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta'
When I print the value of "c", I get 9013, while the file contains 10447 sequences, meaning the for loop didn't go through all of them (the count is done before the "if" condition, so it counts all iterations, not only those that match the condition). I don't understand the I/O error: it should create the file 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta' instead of checking for its existence, shouldn't it?
The file you were trying to create, 'EU410347.1|Human papillomavirus FA75/KI88-03_7401.fasta', contains a slash ('/'). The part before it is interpreted as a directory, 'EU410347.1|Human papillomavirus FA75', and the part after it as the file name, 'KI88-03_7401.fasta', so Python complains that the directory does not exist.
You may want to replace the slash with something else, such as:
handle=open(rec_dict[k].description.replace('/', '_')+"_"+str(len(rec_dict[k].seq))+".fasta","w")
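If more than one record description can contain path separators, the replacement could be factored into a small helper. The sketch below is one possible way to do it; safe_filename is a hypothetical name, and the choice of an underscore as the replacement is arbitrary.

```python
import re

def safe_filename(description, length, extension=".fasta"):
    """Build a file name from a record description, replacing path separators."""
    # Replace forward and backward slashes so no part is read as a directory.
    cleaned = re.sub(r"[\\/]", "_", description)
    return "{}_{}{}".format(cleaned, length, extension)

name = safe_filename("EU410347.1|Human papillomavirus FA75/KI88-03", 7401)
print(name)
```

Other characters in the description, such as the pipe, may also be rejected on some file systems, so the pattern might need to be widened depending on where the files are written.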

Why is [9] returned as "\t" in Erlang?

I am working through some Erlang tutorials and noticed when I enter
[8].
the VM returns "\b"
or if I enter
[9].
the VM returns "\t"
I am confused on why this is happening. Other numbers are returned as a list of that number:
[3].
is returned as [3]
[4].
is returned as [4], etc.
I guess the question is: why does the Erlang VM return it this way? Perhaps an explanation of the difference between the list [65] and the string "A" would help.
Another related item is confusing as well.
Type conversion from a list to an integer is done as:
list_to_integer("3").
not
list_to_integer([3]).
which returns an error.
In Erlang there are no real strings. Strings are lists of integers. So if you give a list of integers that all represent characters, it will be displayed as a string.
1> [72, 101, 108, 108, 111].
"Hello"
If the list contains at least one element that does not have a character counterpart, it will be displayed as a list.
2> [72, 101, 108, 108, 111, 1].
[72,101,108,108,111,1]
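The guessing rule described above can be approximated in Python as follows. This is only a sketch: the set of codes treated as printable is an assumption modelled on Latin-1 plus the usual escape characters, and the real shell's behaviour depends on its configuration.

```python
# Approximation of the Latin-1 printable set; an assumption for illustration.
CONTROL_ESCAPES = {8, 9, 10, 11, 12, 13, 27}  # \b \t \n \v \f \r \e

def is_printable(code):
    """Return True if the code is treated as a printable character here."""
    return code in CONTROL_ESCAPES or 32 <= code <= 126 or 160 <= code <= 255

def shown_as_string(codes):
    """Guess whether a list of integers would be displayed as a string."""
    return len(codes) > 0 and all(is_printable(c) for c in codes)

for example in ([72, 101, 108, 108, 111], [72, 101, 108, 108, 111, 1], [9], [3]):
    print(example, shown_as_string(example))
```

Under this approximation [8] and [9] count as strings because backspace and tab have escape sequences, while [3] and [4] do not.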
In Erlang strings are lists and the notation is exactly the same.
[97,98,99].
returns "abc"
The following excerpt is taken directly from "Learn You Some Erlang for Great Good!", Fred Hébert, (C)2013, No Starch Press. p. 18.
This is one of the most disliked things in Erlang: strings. Strings are lists, and the notation is exactly the same. Why do people dislike it?
Because of this:
3> [97,98,99,4,5,6].
[97,98,99,4,5,6]
4> [233].
"é"
Erlang will print lists of numbers as numbers only when at least one of them could not also represent a letter. There is no such thing as a real string in Erlang!
"Learn You Some Erlang for Great Good!" is also available online at: http://learnyousomeerlang.com/
kadaj answered your first question. Regarding the second one, about list_to_integer: if you look at the documentation, most list_to_XXX functions (except the binary, bitstring, and tuple ones) treat their argument as a string. Calling them string_to_XXX would be clearer, but changing the names would break a lot of code.

Unable to read in a string as ASCII or UTF-8

If I check the charset of a file that I'm reading from I get:
text/plain; charset=us-ascii
So I am reading it in like:
File.open(@file_path, "r:ASCII") do |f|
  f.each_line do |line|
    line = line.rstrip.force_encoding("ASCII")
Which works fine until I hit this line:
"Seat 2: tchin\xE9 ($423 in chips)"
Where I get this error:
ArgumentError: invalid byte sequence in US-ASCII
This line looks like this in my text editor:
"Seat 2: tchin? ($423 in chips)"
If I try reading it in as UTF-8 instead of ASCII, I get the same error:
ArgumentError: invalid byte sequence in UTF-8
Any ideas about what I should be doing? I have tried using iconv to convert it from ASCII to UTF-8, and I get this error:
Iconv::IllegalSequence: "\xE9 ($423 in chips"
ASCII is a 7-bit encoding (128 characters, maximum value 127), not an 8-bit one (256 characters, maximum value 255).
E9 (decimal 233, probably an é) is greater than 127. So the file is not pure ASCII, and the Ruby error message is correct.
I expect it is cp1252.
Update:
I'm quite sure it is an é. The sentence "Seat 2: tchiné ($423 in chips)" makes sense (I don't know what it is, but it seems to be something in poker).
This line looks like this in my text editor:
"Seat 2: tchin? ($423 in chips)"
Your editor may not display the é, so it displays a substitute character.
Reading the file as "r:ISO8859-1" worked.
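The same diagnosis can be reproduced in Python by trying a few candidate encodings on the offending bytes. This is a sketch for illustration; try_decode is a made-up helper, and the byte string is the line quoted in the question.

```python
raw = b"Seat 2: tchin\xe9 ($423 in chips)"

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_ascii = try_decode(raw, "ascii")        # None: 0xE9 is above 127
as_utf8 = try_decode(raw, "utf-8")         # None: 0xE9 starts a sequence that is not completed
as_latin1 = try_decode(raw, "iso-8859-1")  # decodes 0xE9 as é
as_cp1252 = try_decode(raw, "cp1252")      # decodes 0xE9 as é

print(as_ascii, as_utf8, as_latin1, as_cp1252)
```

For this particular byte, ISO-8859-1 and cp1252 agree, which is consistent with both the cp1252 guess and the ISO8859-1 fix reported above.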

Is there an alphabetical character check in prolog?

Greetings,
Is there a test or predicate I can use in Prolog to verify that a given character is alphabetical? Right now, what I'm doing is:
% List of disallowed characters: \n -> 10, space -> 32, ! -> 33, . -> 46, , -> 44, : -> 58, ; -> 59,
% ? -> 63, - -> 45, " -> 34, ' -> 39
\+ member(Ch, [10, 32, 33, 34, 39, 44, 45, 46, 58, 59, 63]), % Checking for line return (\n), space, punctuation
Those are only a few of the characters I need to check for. Having a test such as letter(Ch) would save me a great deal of time and, above all, be a far more defensive approach.
Thank you
is_alpha/1
There are also other predicates such as is_lower/1 etc.
In SWI-Prolog, this can be done with char_type/2, for example:
% X is either a single-character atom or a character code
alphabetical(X) :- char_type(X, alpha).
SWI-Prolog also offers the ctypes library which provides is_alpha, etc.
:- use_module(library(ctypes)).
alphabetical(X) :- is_alpha(X).
