Perform file encoding conversion with Rebol 3 - character-encoding

I want to use Rebol 3 to read a file in Latin1 and convert it to UTF-8. Is there a built-in function I can use, or some external library? Where I can find it?

Rebol has an invalid-utf? function that scours a binary value for a byte that is not part of a valid UTF-8 sequence. We can just loop until we've found and replaced all of them, then convert our binary value to a string:
latin1-to-utf8: function [binary [binary!]][
mark: :binary
while [mark: invalid-utf? mark][
change/part mark to char! mark/1 1
]
to string! binary
]
This function modifies the original binary. We can create a new string instead that leaves the binary value intact:
latin1-to-utf8: function [binary [binary!]][
mark: :binary
to string! rejoin collect [
while [mark: invalid-utf? binary][
keep copy/part binary mark ; keeps the portion up to the bad byte
keep to char! mark/1 ; converts the bad byte to good bytes
binary: next mark ; set the series beyond the bad byte
]
keep binary ; keep whatever is remaining
]
]
Bonus: here's a wee Rebmu version of the above—rebmu/args snippet #{DECAFBAD} where snippet is:
; modifying
IUgetLOAD"invalid-utf?"MaWT[MiuM][MisMtcTKm]tsA
; copying
IUgetLOAD"invalid-utf?"MaTSrjCT[wt[MiuA][kp copy/partAmKPtcFm AnxM]kpA]

Here's a version that should be a bit faster, and at least use less memory.
latin1-to-utf8: func [
"Transcodes a Latin-1 encoded string to UTF-8"
bin [binary!] "Bytes of Latin-1 data"
] [
to binary! head collect/into [
foreach b bin [
keep to char! b
]
] make string! length? bin
]
It takes advantage of Latin-1 characters having the same numeric values as the corresponding Unicode codepoints. If you wanted to convert from another character set for which that isn't the case, you can do a calculation on the b to remap the characters.
It uses less memory and is faster for a variety of reasons:
Normally, collect creates a block. We use collect/into and pass it a string as a target. Strings use less memory than blocks of integers or characters.
We preallocate the string to the length of the input data, which saves on reallocations.
We let Rebol's native code convert the characters rather than doing our own math.
There's less code in the loop, so it should run faster.
This method still loads the file into memory all at once, and still generates an intermediate value to store the results, but at least the intermediate value is smaller. Maybe this will let you process larger files.
If the reason you need it to be UTF-8 is that you need to process the file as a string in Rebol, just skip the to binary! and return the string as-is. Or you can just process the binary source data, just convert the bytes in the binary by using to char! on each one as you go.

Nothing built in at the moment, sorry. Here's a straightforward implementation of Latin-1 to UTF-8 conversion which I wrote and used with Rebol 3 a while back:
latin1-to-utf8: func [
"Transcodes a Latin-1 encoded string to UTF-8"
bin [binary!] "Bytes of Latin-1 data"
] [
to-binary collect [foreach b bin [keep to-char b]]
]
Note: this code is optimised for legibility, and not in any way for performance. (From a performance perspective, it's outright stupid. You have been warned.)
Update: Incorporated #BrianH's neat "Latin-1 byte values correspond to Unicode codepoints" optimisation, which makes the above collapse to a one-liner (and mildly less stupid at the same time). Still. for a more optimised version regarding memory usage, see #BrianH's nice answer.

latin1-to-utf8: func [
"Transcodes bin as a Latin-1 encoded string to UTF-8"
bin [binary!] "Bytes of Latin-1 data"
/local t
] [
t: make string! length? bin
foreach b bin [append t to char! b ]
t
]

Related

How to correctly parse a sequence of arbitrary bytes

Edit: changed title and question to make them more general and easier to find when looking for this specific issue
I'm parsing a sequence of arbitrary bytes (u8) from a file, and I've come to a point where I had to use std::str::from_utf8_unchecked (which is unsafe) instead of the usual std::str::from_utf8, because it seemingly was the only way to make the program work.
fn parse(seq: &[u8]) {
. . .
let data: Vec<u8> = unsafe {
str::from_utf8_unchecked(&value[8..data_end_index])
.as_bytes()
.to_vec()
};
. . .
}
This, however, is probably leading to some inconsistencies with the actual use of the program, because with non trivial use I get an io::Error::InvalidData with the message "stream did not contain valid UTF-8".
In the end the issue was caused by the fact that I was converting the sequence of arbitrary bytes into a string which, in Rust, must contain only valid UTF-8 characters. The conversion was done because I thought I could call to_vec only after as_bytes, without realizing that a &[u8] is in fact a slice of bytes already.
I was converting the byte slice into a vec, but only after performing a useless and harmful conversion into a string, which doesn't make sense because not all bytes are valid ASCII characters that can be parsed as UTF-8 characters to create a Rust string.
The correct code is the following:
let data: Vec<u8> = value[8..data_end_index].to_vec();

Handing strings with binary data in it using java.nio

I am having issues parsing text files that have illegal characters(binary markers) in them. An answer would be something as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using the java.nio to validate the line before I process it. So, I was thinking of introducing a Validator trait as follows:
import java.nio.charset._
trait Validator{
private def encoder = Charset.forName("UTF-8").newEncoder
def isValidEncoding(line:String):Boolean = {
encoder.canEncode(line)
}
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late when you already have a String, UTF-8 can always encode any string*. You need to go to the point where you are decoding the file initially.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
//Pseudo code
str = file.decode("ISO-8859-1");
str = str.replace( "[\u0000-\u0019\u007F-\u00FF]", "");
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates which is probably not the case here.
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.

list of garbage characters like ’

I am using librets to retrieve data form my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with ’. I am unable to find a fix for librets so i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters. I googled for this but not found any resource. Can anyone point me to the list of such garbage letters and their actual values or a piece of code which can generate such letter.
thanx
Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as ’. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.
Question reminder:
"...I noticed characters like '’' is replaced with ’... i decided to
replace such garbage characeters with actual values after downloading
data. What I need is a list of such garbage string and their
equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage string and their equivalent
characters."
Using php, you can generate these characters and their equivalence. Working with all 1,111,998 Unicode points or 109,449 Utf8 symbols is impractical. You may use the ASCII range in the following loop between &#128 and &#258 or another range that is more relevant to your context.
<?php
for ($i=128; $i<258; $i++)
$tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";
echo "<table border=1>
<tr><td>&#</td><td>"Garbage"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>
From experience, in an ASCII context, most "garbage" symbols originate in the range &#128 to &#257 + (seldom) &#8129 to &#8246.
In order for the "garbage" symbols to display, the html page charset must be set to iso-1 or whichever other charset that caused the problem in the first place. They will not show if the charset is set to utf-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
.
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with php utf8_decode(), which would actually create more "garbage" on already "garbage". But, you may use the simple and fast search and replace php str_replace() function.
First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i=128; $i<258; $i++)
$tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replace term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i=128; $i<258; $i++)
$tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols. But, it can be used to prevent further contamination. Alternatively a mb_ function can be useful.

In Erlang how do I convert a String to a binary value?

In Erlang how do I convert a string to a binary value?
String = "Hello"
%% should be
Binary = <<"Hello">>
In Erlang strings are represented as a list of integers. You can therefore use the list_to_binary (built-in-function, aka BIF). Here is an example I ran in the Erlang console (started with erl):
1> list_to_binary("hello world").
<<"hello world">>
the unicode (utf-8/16/32) character set needs more number of bits to express characters that are greater than 1-byte in length:
this is why the above call failed for any byte value > 255 (the limit of information that a byte can hold, and which is sufficient for IS0-8859/ASCII/Latin1)
to correctly handle unicode characters you'd need to use
unicode:characters_to_binary() R1[(N>3)]
instead, which can handle both Latin1 AND unicode encoding.
HTH ...

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but is a Unicode en-dash (hex 2013). Your code is recieving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.

Resources