Handling strings containing binary data using java.nio - parsing

I am having issues parsing text files that contain illegal characters (binary markers). An example line from the source file (test.csv) looks as follows:
^000000^id1,text1,text2,text3
Here ^000000^ is a textual representation of the illegal characters.
I was thinking about using java.nio to validate each line before I process it, by introducing a Validator trait as follows:
import java.nio.charset._

trait Validator {
  private def encoder = Charset.forName("UTF-8").newEncoder
  def isValidEncoding(line: String): Boolean = {
    encoder.canEncode(line)
  }
}
Do you guys think this is the correct approach to handle the situation?
Thanks

By the time you already have a String it is too late: UTF-8 can encode any string*. You need to go back to the point where you first decode the file.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
// Scala version of the pseudocode: decode as ISO-8859-1, then strip the offending bytes
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}
val str     = new String(Files.readAllBytes(Paths.get("test.csv")), StandardCharsets.ISO_8859_1)
val cleaned = str.replaceAll("[\\u0000-\\u0019\\u007F-\\u00FF]", "")
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
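For instance, a minimal sketch of that line-by-line filtering (file name taken from the question, character range as above):

import scala.io.{Codec, Source}

// Decode as ISO-8859-1 (never fails), then drop lines containing binary markers
val binaryMarker = "[\\u0000-\\u0019\\u007F-\\u00FF]".r
val validLines = Source.fromFile("test.csv")(Codec.ISO8859)
  .getLines()
  .filterNot(line => binaryMarker.findFirstIn(line).isDefined)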
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates which is probably not the case here.
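And if you would rather reject bad input at the decode step, as suggested at the top, you can configure the decoder to report malformed bytes instead of silently replacing them. A sketch, assuming the file is supposed to be UTF-8:

import java.nio.charset.{Charset, CodingErrorAction}
import scala.io.{Codec, Source}

// Reading will now throw a MalformedInputException on the first invalid byte
implicit val strictUtf8: Codec = Codec(Charset.forName("UTF-8"))
  .onMalformedInput(CodingErrorAction.REPORT)
  .onUnmappableCharacter(CodingErrorAction.REPORT)

val lines = Source.fromFile("test.csv").getLines().toList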

Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.


How to correctly parse a sequence of arbitrary bytes

Edit: changed title and question to make them more general and easier to find when looking for this specific issue
I'm parsing a sequence of arbitrary bytes (u8) from a file, and I've come to a point where I had to use std::str::from_utf8_unchecked (which is unsafe) instead of the usual std::str::from_utf8, because it seemingly was the only way to make the program work.
fn parse(seq: &[u8]) {
    // ...
    let data: Vec<u8> = unsafe {
        str::from_utf8_unchecked(&value[8..data_end_index])
            .as_bytes()
            .to_vec()
    };
    // ...
}
This, however, probably leads to inconsistencies in actual use of the program: with non-trivial input I get an io::ErrorKind::InvalidData error with the message "stream did not contain valid UTF-8".
In the end the issue was caused by the fact that I was converting the sequence of arbitrary bytes into a string, which in Rust must contain only valid UTF-8. The conversion was done because I thought I could call to_vec only after as_bytes, without realizing that a &[u8] is already a slice of bytes.
So I was converting the byte slice into a Vec, but only after a useless and harmful round-trip through a string, which makes no sense: an arbitrary byte sequence is not necessarily valid UTF-8, so it cannot always be turned into a Rust string.
The correct code is the following:
let data: Vec<u8> = value[8..data_end_index].to_vec();
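If part of the payload really does need to become a string, the safe route is std::str::from_utf8, which validates instead of assuming. A sketch with a hypothetical helper (names and error handling are illustrative):

use std::str;

// Hypothetical helper: keep the payload as bytes and only try to view it as text.
fn inspect_payload(payload: &[u8]) {
    let data: Vec<u8> = payload.to_vec(); // raw bytes stay bytes

    // from_utf8 is the safe counterpart of from_utf8_unchecked: it reports failure
    match str::from_utf8(&data) {
        Ok(text) => println!("valid UTF-8: {text}"),
        Err(e) => eprintln!("not UTF-8 (valid up to byte {})", e.valid_up_to()),
    }
}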

Converting Extended ASCII characters in Dart

My Flutter app retrieves information via a REST interface which can contain extended ASCII characters, e.g. e-acute (0xE9). How can I convert this into UTF-8 (e.g. 0xC3 0xA9) so that it displays correctly?
0xE9 corresponds to e-acute (é) in the ISO-8859/Latin 1 encoding. (It's one of many possible encodings for "extended ASCII", although personally I associate the term "extended ASCII" with code page 437.)
You can decode it to a Dart String (which internally stores UTF-16) using Latin1Codec. If you really want UTF-8, you can encode that String to UTF-8 afterward with Utf8Codec.
import 'dart:convert';

void main() {
  var s = latin1.decode([0xE9]);
  print(s); // Prints: é
  var utf8Bytes = utf8.encode(s);
  print(utf8Bytes); // Prints: [195, 169]
}
I was getting confused because sometimes the data contained extended ASCII characters and sometimes UTF-8, and when I tried doing a UTF-8 decode it baulked at the extended ASCII.
I fixed it by attempting the UTF-8 decode first and, when that throws because the data is extended ASCII, falling back to a Latin-1 decode, which handles it fine.
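A minimal Dart sketch of that try-UTF-8-then-Latin-1 fallback, assuming the response body is available as raw bytes:

import 'dart:convert';

String decodeBody(List<int> bytes) {
  try {
    return utf8.decode(bytes); // most responses are valid UTF-8
  } on FormatException {
    return latin1.decode(bytes); // fall back for extended-ASCII (Latin-1) data
  }
}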

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API response is "manhã", but the API returns "manh�". Other characters like á or ç are returned correctly, so I guess this is a bug in the API.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You're on the right track: you should be able to solve the problem with the encode method. When the source encoding is known, you can simply use:
response.encode('UTF-8', 'ISO-8859-1')
There may be times when there are invalid characters in the source encoding; to avoid exceptions, you can instruct Ruby how to handle them:
# Transcode to UTF-8 and replace any invalid/undefined characters with '' (empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
This is all laid out in the Ruby docs for String - check them out!
Note that many people incorrectly assume force_encoding will somehow fix encoding problems. force_encoding simply tags the string as the specified encoding; it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other.
As pointed out in the comments, you can combine force_encoding with encode to transcode the string: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first encode example above).
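A short sketch of the difference, using the byte sequence from the question (results shown as comments):

raw = "manh\xC3\x2F".force_encoding('ASCII-8BIT')   # bytes as received from the API

raw.encode('UTF-8', 'ISO-8859-1')
# => "manhÃ/"  (every byte is valid ISO-8859-1, so this always succeeds)

raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '')
# => "manh/"   (the byte that cannot be converted is simply dropped)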

Specific decode choice ASN.1

I am using the Erlang ASN.1 compiler and I have the following ASN.1 definition:
DecryptedCertificate ::= SEQUENCE {
  certificateProfileIdentifier   INTEGER(0..255),
  certificateAuthorityReference  CertificateAuthority,
  certificateHolderAuthorization CertificateHolderAuthorization,
  endOfValidity                  TimeReal,
  certificateHolderReference     KeyIdentifier,
  rsaPublicKey                   RsaPublicKey
}

KeyIdentifier ::= CHOICE {
  extendedSerialNumber      ExtendedSerialNumber,
  certificateRequestID      CertificateRequestID,
  certificationAuthorityKID CertificationAuthorityKID
}
When I decode a binary, it always picks the CertificateRequestID alternative. I would like to specify a particular choice for the decoder; is that possible?
PS: I am using PER.
Edit:
I am including more information to make the question clearer.
The CHOICE types are:
ExtendedSerialNumber ::= SEQUENCE {
  serialNumber     INTEGER(0..2^32-1),
  monthYear        BCDString(SIZE(2)),
  type             OCTET STRING(SIZE(1)),
  manufacturerCode ManufacturerCode
}

CertificateRequestID ::= SEQUENCE {
  requestSerialNumber INTEGER(0..2^32-1),
  requestMonthYear    BCDString(SIZE(2)),
  crIdentifier        OCTET STRING(SIZE(1)),
  manufacturerCode    ManufacturerCode
}

CertificationAuthorityKID ::= SEQUENCE {
  nationNumeric   NationNumeric,
  nationAlpha     NationAlpha,
  keySerialNumber INTEGER(0..255),
  additionalInfo  OCTET STRING(SIZE(2)),
  caIdentifier    OCTET STRING(SIZE(1))
}

ManufacturerCode ::= INTEGER(0..255)
NationNumeric ::= INTEGER(0..255)
NationAlpha ::= IA5String(SIZE(3))
There are some deterministic things like:
caIdentifier is always equal to 1;
crIdentifier is always equal to 0xFF.
I tried to specify the number by using caIdentifier INTEGER(1) and crIdentifier INTEGER(255); however, it still always picks the first choice and throws a parse error.
When I decode a binary, it always picks the CertificateRequestID choice, I would like to specify a specific choice for decoder, is that possible?
It's not possible to specify what is to be decoded. As you are decoding a particular binary message/record/PDU, the decoder is going to pick whatever is contained in the binary, according to the ASN.1 definitions and the UPER/APER encoding rules.
Somewhere in the binary there are two bits that determine which alternative the KeyIdentifier contains. If you are able to find and change them, the decoder will try to decode a different alternative, but then it will most probably fail, because your binary message actually contains a different field.
You could try to create a KeyIdentifier, fill in whatever values you want, and then encode it to get an idea of what this different binary is going to look like.
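For example, a sketch of that experiment in Erlang (module and file names are assumed, the field values are made up, and the generated encode/decode API follows the OTP asn1 User's Guide):

%% Compile the spec for PER, then round-trip a chosen alternative to see its bits.
asn1ct:compile("Certificates.asn", [per]),
Kid = {certificationAuthorityKID,
       {'CertificationAuthorityKID', 17, "EST", 1, <<0,0>>, <<1>>}},
{ok, Bytes}   = 'Certificates':encode('KeyIdentifier', Kid),
{ok, Decoded} = 'Certificates':decode('KeyIdentifier', Bytes).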
UPDATE
Regarding the claim that the PER format does not contain a header for option types: in the PER encoding a CHOICE does contain an index (header) that specifies the encoded type. See X.691, clause 23, "Encoding the choice type":
23 Encoding the choice type
NOTE – (Tutorial) A choice type is encoded by encoding an index specifying the chosen alternative. This is encoded as for a constrained integer (unless the extension marker is present in the choice type, in which case it is a normally small non-negative whole number) and would therefore typically occupy a fixed-length bit-field of the minimum number of bits needed to encode the index. (Although it could in principle be arbitrarily large.) This is followed by the encoding of the chosen alternative, with alternatives that are extension additions encoded as if they were the value of an open type field. Where the choice has only one alternative, there is no encoding for the index.
It seems you are taking the wrong path to fix your problem: you must not modify an ASN.1 specification to fix it.
You should add more information to your question about the Erlang code you have written to decode the input.
Either the Erlang compiler has a bug or you are not using it properly ...
You can also use https://asn1.io/asn1playground/ as a test bed to decode your PER data.
After digging into ASN.1: Communication Between Heterogeneous Systems, I found a nice paragraph called "Selecting a CHOICE alternative", which specifies exactly what I was looking for:
The chosen type can be selected by the left angle bracket "<" preceded by the alternative to be extracted and followed by a reference to the CHOICE type.
So the solution for my structure is to use:
DecryptedCertificate ::= SEQUENCE {
  certificateProfileIdentifier   INTEGER(0..255),
  certificateAuthorityReference  CertificateAuthority,
  certificateHolderAuthorization CertificateHolderAuthorization,
  endOfValidity                  TimeReal,
  certificateHolderReference     certificationAuthorityKID < KeyIdentifier,
  rsaPublicKey                   RsaPublicKey
}
Of course I could have just hardcoded the type myself; however, this is more flexible when it comes to specification changes, and more readable.
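With the modified specification recompiled, a sketch of the effect (same assumed module name as above; CertBytes stands for the encoded certificate, which is not shown here):

%% certificateHolderReference now decodes directly as a CertificationAuthorityKID
%% record instead of whichever alternative the CHOICE index would have selected.
asn1ct:compile("Certificates.asn", [per]),
{ok, Cert} = 'Certificates':decode('DecryptedCertificate', CertBytes).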

URL Escape in Uppercase

I have a requirement to escape a string with url information but also some special characters such as '<'.
Using cl_http_utility=>escape_url this translates to '%3c'. However, our backend web server is unable to recognize this as a special character and takes the value literally. What it does recognize as a special character is '%3C' (with an upper-case C). Also, if one checks http://www.w3schools.com/tags/ref_urlencode.asp, it shows the all-caps value as the proper encoding.
I guess my question is: is there an alternative to cl_http_utility=>escape_url that does essentially the same thing, except it outputs the value in upper case?
Thanks.
Use the string function.
l_escaped = escape( val    = l_unescaped
                    format = cl_abap_format=>e_url ).
Other possible formats are e_url_full, e_uri, e_uri_full, and a bunch of xml/json stuff too. The string function escape is documented pretty well, demo programs and all.
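A small sketch of the difference, using the '<' example from the question (cl_demo_output is the output class used in the ABAP documentation demos; the result comment reflects the answer's claim of upper-case hex):

DATA(lv_raw)     = `<value>`.
DATA(lv_escaped) = escape( val    = lv_raw
                           format = cl_abap_format=>e_url ).
" lv_escaped now contains '%3Cvalue%3E' (upper-case hex), unlike the '%3c' from cl_http_utility=>escape_url
cl_demo_output=>display( lv_escaped ).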
