I'm parsing a sequence of arbitrary bytes (u8) from a file, and I've come to a point where I had to use std::str::from_utf8_unchecked (which is unsafe) instead of the usual std::str::from_utf8, because it seemingly was the only way to make the program work.
fn parse(seq: &[u8]) {
. . .
let data: Vec<u8> = unsafe {
str::from_utf8_unchecked(&value[8..data_end_index])
.as_bytes()
.to_vec()
};
. . .
}
This, however, is probably leading to some inconsistencies with the actual use of the program, because with non trivial use I get an io::Error::InvalidData with the message "stream did not contain valid UTF-8".
In the end the issue was caused by the fact that I was converting the sequence of arbitrary bytes into a string, which in Rust must be valid UTF-8. The conversion was done because I thought I could call to_vec only after as_bytes, without realizing that a &[u8] is in fact a slice of bytes already.
I was converting the byte slice into a Vec, but only after performing a useless and harmful conversion into a string. That doesn't make sense, because an arbitrary byte sequence is not necessarily valid UTF-8, so it cannot in general be turned into a Rust string.
The correct code is the following:
let data: Vec<u8> = value[8..data_end_index].to_vec();
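If a &str really were needed at some point, the safe std::str::from_utf8 is the right call, because it reports invalid byte sequences instead of silently building a broken string. A minimal sketch, reusing value and data_end_index from the snippet above:
// Hedged sketch: `value` and `data_end_index` are the variables from the snippet above.
// If a &str is genuinely needed, validate instead of using the unsafe function:
let text = std::str::from_utf8(&value[8..data_end_index])
    .expect("slice was not valid UTF-8"); // or match on the Result and propagate the error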
I have this little code:
import 'dart:convert';

void main(List<String> args) {
const data = 'amigo+/=:chesu';
var encoded = base64Encode(utf8.encode(data));
var encoded2 = base64Encode(data.codeUnits);
var decoded = utf8.decode(base64Decode(encoded));
var decoded2 = utf8.decode(base64Decode(encoded2));
print(encoded);
print(encoded2);
print(decoded);
print(decoded2);
}
The output is:
YW1pZ28rLz06Y2hlc3U=
YW1pZ28rLz06Y2hlc3U=
amigo+/=:chesu
amigo+/=:chesu
The codeUnits property gives an unmodifiable list of the UTF-16 code units. Is it OK to use the utf8.decode function here, or what function should be used for encoded2?
It's simply not a good idea to do base64Encode(data.codeUnits) because base64Encode encodes bytes, and data.codeUnits isn't necessarily bytes.
Here they happen to be bytes (all the characters of the string have code points below 256; in fact they are even ASCII).
Using utf8.encode before base64Encode is good. It works for all strings.
The best way to convert from UTF-16 code units to a String is String.fromCharCodes.
Here you are using base64Encode(data.codeUnits), which only works if the data string contains only code units up to 255. If you assume that, then decoding the result can be done using either latin1.decode or String.fromCharCodes.
Using ascii.decode and utf8.decode also works if the string only contains ASCII (which it does here, but which isn't guaranteed by base64Encode succeeding).
In short, don't do base64Encode(data.codeUnits). Convert the string to bytes before doing base64Encode, then use the reverse conversion to convert bytes back to strings.
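A minimal sketch of that round trip, using only the dart:convert helpers already shown above:
import 'dart:convert';

void main() {
  const data = 'amigo+/=:chesu';
  // String -> UTF-8 bytes -> base64; this works for any string.
  final encoded = base64Encode(utf8.encode(data));
  // Reverse conversion: base64 -> bytes -> String.
  final decoded = utf8.decode(base64Decode(encoded));
  print(decoded == data); // true
}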
I tried this
print(utf8.decode('use âsmartâ symbols like â thisâ'.codeUnits));
and got this
use “smart” symbols like ‘ this’
The ” and ‘ are smart characters from the iOS keyboard.
I am using the Erlang ASN.1 compiler and I have the following ASN.1 definition:
DecryptedCertificate ::= SEQUENCE {
certificateProfileIdentifier INTEGER(0..255),
certificateAuthorityReference CertificateAuthority,
certificateHolderAuthorization CertificateHolderAuthorization,
endOfValidity TimeReal,
certificateHolderReference KeyIdentifier,
rsaPublicKey RsaPublicKey
}
KeyIdentifier ::= CHOICE {
extendedSerialNumber ExtendedSerialNumber,
certificateRequestID CertificateRequestID,
certificationAuthorityKID CertificationAuthorityKID
}
When I decode a binary, it always picks the CertificateRequestID choice. I would like to specify a specific choice for the decoder; is that possible?
PS: I am using PER.
Edit:
I am including more information to make the question clearer.
The CHOICE types are:
ExtendedSerialNumber ::= SEQUENCE {
serialNumber INTEGER(0..2^32-1),
monthYear BCDString(SIZE(2)),
type OCTET STRING(SIZE(1)),
manufacturerCode ManufacturerCode
}
CertificateRequestID ::= SEQUENCE {
requestSerialNumber INTEGER(0..2^32-1),
requestMonthYear BCDString(SIZE(2)),
crIdentifier OCTET STRING(SIZE(1)),
manufacturerCode ManufacturerCode
}
CertificationAuthorityKID ::= SEQUENCE {
nationNumeric NationNumeric,
nationAlpha NationAlpha,
keySerialNumber INTEGER(0..255),
additionalInfo OCTET STRING(SIZE(2)),
caIdentifier OCTET STRING(SIZE(1))
}
ManufacturerCode ::= INTEGER(0..255)
NationNumeric ::= INTEGER(0..255)
NationAlpha ::= IA5String(SIZE(3))
There are some deterministic things like:
caIdentifier is always equal to 1;
crIdentifier is always equal to 0xFF.
I tried to specify the number by using caIdentifier INTEGER(1) and crIdentifier INTEGER(255); however, it always picks the first choice and throws a parse error.
When I decode a binary, it always picks the CertificateRequestID choice. I would like to specify a specific choice for the decoder; is that possible?
It's not possible to specify what is to be decoded. As you are decoding a particular binary message/record/PDU, the decoder is going to pick whatever is contained in the binary, according to the ASN.1 definitions and the UPER/APER encoding rules.
Somewhere in the binary there are two bits that determine which alternative KeyIdentifier contains (there are three alternatives, so the index needs two bits). If you are able to find and change them, the decoder will try to decode a different alternative, but it will then most probably fail, because your binary message actually contains a different one.
You could try to create a KeyIdentifier, fill in whatever values you want, and then encode it to get an idea of what this different binary is going to look like.
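For example, a sketch only: the generated module name Cert and the field values are assumptions, the spec is assumed to be compiled with asn1ct:compile("Cert.asn1", [per]), and depending on the OTP release the generated encode/2 returns either the binary directly or {ok, Binary}:
%% In Erlang's ASN.1 mapping, a CHOICE value is an {AlternativeName, Value} tuple
%% and SEQUENCE values use the records generated in Cert.hrl.
Value = {certificationAuthorityKID,
         #'CertificationAuthorityKID'{nationNumeric   = 17,
                                      nationAlpha     = "FIN",
                                      keySerialNumber = 1,
                                      additionalInfo  = <<255,255>>,
                                      caIdentifier    = <<1>>}},
Bytes = 'Cert':encode('KeyIdentifier', Value).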
UPDATE
Regarding the claim that the "PER format does not contain header for option types": in the PER encoding a CHOICE does contain an index (header) that specifies the encoded type. See X.691, clause 23, "Encoding the choice type":
23 Encoding the choice type
NOTE – (Tutorial) A choice type is encoded by encoding an index specifying the
chosen alternative. This is encoded as for a constrained integer (unless the
extension marker is present in the choice type, in which case it is a normally
small non-negative whole number) and would therefore typically occupy a fixed
length bit-field of the minimum number of bits needed to encode the index.
(Although it could in principle be arbitrarily large.) This is followed by the
encoding of the chosen alternative, with alternatives that are extension
additions encoded as if they were the value of an open type field. Where the
choice has only one alternative, there is no encoding for the index.
It seems you are taking the wrong path to fix your problem: you must not modify an ASN.1 specification to fix it.
In your question, you should add more information about the Erlang code you have written to decode the input.
Either the Erlang compiler has a bug or you are not using it properly...
You can also use https://asn1.io/asn1playground/ as a test to decode your PER data.
After digging in ASN.1 Communication Between Heterogeneous Systems, I found a nice paragraph called "Selecting a CHOICE alternative", which specifies exactly what I was looking for:
The chosen type can be selected by the left angle bracket “ < ”
preceded by the alternative to be extracted and followed by a
reference to the CHOICE type
So the solution for my structure is to use:
DecryptedCertificate ::= SEQUENCE {
certificateProfileIdentifier INTEGER(0..255),
certificateAuthorityReference CertificateAuthority,
certificateHolderAuthorization CertificateHolderAuthorization,
endOfValidity TimeReal,
certificateHolderReference certificationAuthorityKID < KeyIdentifier,
rsaPublicKey RsaPublicKey
}
Of course I could have just hardcoded the type myself; however, this is more flexible when it comes to specification modifications, and more readable.
I want to use Rebol 3 to read a file in Latin-1 and convert it to UTF-8. Is there a built-in function I can use, or some external library? Where can I find it?
Rebol has an invalid-utf? function that scours a binary value for a byte that is not part of a valid UTF-8 sequence. We can just loop until we've found and replaced all of them, then convert our binary value to a string:
latin1-to-utf8: function [binary [binary!]][
mark: :binary
while [mark: invalid-utf? mark][
change/part mark to char! mark/1 1
]
to string! binary
]
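A quick usage sketch (the file name is hypothetical); in Rebol 3, read without a refinement returns a binary! value, which is exactly what the function expects:
text: latin1-to-utf8 read %my-latin1-file.txt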
This function modifies the original binary. We can create a new string instead that leaves the binary value intact:
latin1-to-utf8: function [binary [binary!]][
mark: :binary
to string! rejoin collect [
while [mark: invalid-utf? binary][
keep copy/part binary mark ; keeps the portion up to the bad byte
keep to char! mark/1 ; converts the bad byte to good bytes
binary: next mark ; set the series beyond the bad byte
]
keep binary ; keep whatever is remaining
]
]
Bonus: here's a wee Rebmu version of the above—rebmu/args snippet #{DECAFBAD} where snippet is:
; modifying
IUgetLOAD"invalid-utf?"MaWT[MiuM][MisMtcTKm]tsA
; copying
IUgetLOAD"invalid-utf?"MaTSrjCT[wt[MiuA][kp copy/partAmKPtcFm AnxM]kpA]
Here's a version that should be a bit faster, and at least use less memory.
latin1-to-utf8: func [
"Transcodes a Latin-1 encoded string to UTF-8"
bin [binary!] "Bytes of Latin-1 data"
] [
to binary! head collect/into [
foreach b bin [
keep to char! b
]
] make string! length? bin
]
It takes advantage of Latin-1 characters having the same numeric values as the corresponding Unicode code points. If you wanted to convert from another character set for which that isn't the case, you can do a calculation on b to remap the characters.
It uses less memory and is faster for a variety of reasons:
Normally, collect creates a block. We use collect/into and pass it a string as a target. Strings use less memory than blocks of integers or characters.
We preallocate the string to the length of the input data, which saves on reallocations.
We let Rebol's native code convert the characters rather than doing our own math.
There's less code in the loop, so it should run faster.
This method still loads the file into memory all at once, and still generates an intermediate value to store the results, but at least the intermediate value is smaller. Maybe this will let you process larger files.
If the reason you need UTF-8 is that you need to process the file as a string in Rebol, just skip the to binary! and return the string as-is. Or you can process the binary source data directly, converting the bytes with to char! on each one as you go; see the sketch below.
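The name latin1-to-string in this sketch is mine; it is just the function above with the final to binary! dropped, so it returns the decoded string directly:
latin1-to-string: func [
    "Decodes Latin-1 bytes into a Rebol string"
    bin [binary!] "Bytes of Latin-1 data"
] [
    head collect/into [
        foreach b bin [keep to char! b]
    ] make string! length? bin
]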
Nothing built in at the moment, sorry. Here's a straightforward implementation of Latin-1 to UTF-8 conversion which I wrote and used with Rebol 3 a while back:
latin1-to-utf8: func [
"Transcodes a Latin-1 encoded string to UTF-8"
bin [binary!] "Bytes of Latin-1 data"
] [
to-binary collect [foreach b bin [keep to-char b]]
]
Note: this code is optimised for legibility, and not in any way for performance. (From a performance perspective, it's outright stupid. You have been warned.)
Update: Incorporated @BrianH's neat "Latin-1 byte values correspond to Unicode code points" optimisation, which makes the above collapse to a one-liner (and mildly less stupid at the same time). Still, for a more optimised version regarding memory usage, see @BrianH's nice answer.
latin1-to-utf8: func [
"Transcodes bin as a Latin-1 encoded string to UTF-8"
bin [binary!] "Bytes of Latin-1 data"
/local t
] [
t: make string! length? bin
foreach b bin [append t to char! b ]
t
]
I have a requirement to escape a string containing URL information, but also some special characters such as '<'.
Using cl_http_utility=>escape_url, this translates to '%3c'. However, our backend webserver is unable to recognize this as a special character and takes the value literally. What it does recognize as a special character is '%3C' (with an upper-case C). Also, if one checks http://www.w3schools.com/tags/ref_urlencode.asp, it shows the values with all caps as the proper encoding.
I guess my question is: is there an alternative to cl_http_utility=>escape_url that does essentially the same thing, except that it outputs the value in upper case?
Thanks.
Use the string function escape:
l_escaped = escape( val = l_unescaped
format = cl_abap_format=>e_url ).
Other possible formats are e_url_full, e_uri, e_uri_full, and a bunch of xml/json stuff too. The string function escape is documented pretty well, demo programs and all.
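A minimal usage sketch (the variable names and the sample value are assumptions, not from the question):
DATA(l_unescaped) = `a<b`.
DATA(l_escaped) = escape( val    = l_unescaped
                          format = cl_abap_format=>e_url ).
" l_escaped should now contain 'a%3Cb', i.e. with an upper-case hex digit.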
I am having issues parsing text files that have illegal characters (binary markers) in them. An example would be something as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using java.nio to validate each line before I process it. So I was thinking of introducing a Validator trait as follows:
import java.nio.charset._
trait Validator{
private def encoder = Charset.forName("UTF-8").newEncoder
def isValidEncoding(line:String):Boolean = {
encoder.canEncode(line)
}
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late once you already have a String; UTF-8 can always encode any string*. You need to go back to the point where you are initially decoding the file.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
// Runnable Scala version of the pseudo code (the file name is an assumption):
val bytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("test.csv"))
val str = new String(bytes, java.nio.charset.StandardCharsets.ISO_8859_1)
val cleaned = str.replaceAll("[\\u0000-\\u0019\\u007F-\\u00FF]", "")
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with unpaired surrogates, which is probably not the case here.
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.