I am using the Erlang ASN.1 compiler and I have the following ASN.1 definition:
DecryptedCertificate ::= SEQUENCE {
certificateProfileIdentifier INTEGER(0..255),
certificateAuthorityReference CertificateAuthority,
certificateHolderAuthorization CertificateHolderAuthorization,
endOfValidity TimeReal,
certificateHolderReference KeyIdentifier,
rsaPublicKey RsaPublicKey
}
KeyIdentifier ::= CHOICE {
extendedSerialNumber ExtendedSerialNumber,
certificateRequestID CertificateRequestID,
certificationAuthorityKID CertificationAuthorityKID
}
When I decode a binary, it always picks the CertificateRequestID choice. I would like to specify a specific choice for the decoder. Is that possible?
PS: I am using PER.
Edit:
I am including more information to make the question clearer.
The CHOICE types are:
ExtendedSerialNumber ::= SEQUENCE {
serialNumber INTEGER(0..4294967295),
monthYear BCDString(SIZE(2)),
type OCTET STRING(SIZE(1)),
manufacturerCode ManufacturerCode
}
CertificateRequestID ::= SEQUENCE {
requestSerialNumber INTEGER(0..4294967295),
requestMonthYear BCDString(SIZE(2)),
crIdentifier OCTET STRING(SIZE(1)),
manufacturerCode ManufacturerCode
}
CertificationAuthorityKID ::= SEQUENCE {
nationNumeric NationNumeric,
nationAlpha NationAlpha,
keySerialNumber INTEGER(0..255),
additionalInfo OCTET STRING(SIZE(2)),
caIdentifier OCTET STRING(SIZE(1))
}
ManufacturerCode ::= INTEGER(0..255)
NationNumeric ::= INTEGER(0..255)
NationAlpha ::= IA5String(SIZE(3))
There are some deterministic things like:
caIdentifier is always equal to 1;
crIdentifier is always equal to 0xFF.
I tried to pin the values down by using caIdentifier INTEGER(1) and crIdentifier INTEGER(255); however, it always picks the first choice and throws a parse error.
When I decode a binary, it always picks the CertificateRequestID choice. I would like to specify a specific choice for the decoder. Is that possible?
It's not possible to specify what is to be decoded. As you are decoding a particular binary message/record/PDU the decoder is going to pick whatever is contained in the binary according to the ASN.1 definitions and the UPER/APER encoding rules.
Somewhere in the binary there are two bits that determine which alternative KeyIdentifier contains. If you are able to find and change them, the decoder will try to decode a different field, but then it will most probably fail, as your binary message actually contains a different one.
You could try to create a KeyIdentifier, fill in whatever values you want, and then encode it to get an idea of what this different binary is going to look like.
UPDATE
Regarding "PER format does not contain a header for option types": in the PER encoding, a CHOICE does contain an index (a header) that specifies the encoded type. See X.691, clause 23, Encoding the choice type:
23 Encoding the choice type
NOTE – (Tutorial) A choice type is encoded by encoding an index specifying the chosen alternative. This is encoded as for a constrained integer (unless the extension marker is present in the choice type, in which case it is a normally small non-negative whole number) and would therefore typically occupy a fixed-length bit-field of the minimum number of bits needed to encode the index. (Although it could in principle be arbitrarily large.) This is followed by the encoding of the chosen alternative, with alternatives that are extension additions encoded as if they were the value of an open type field. Where the choice has only one alternative, there is no encoding for the index.
It seems you are taking the wrong path to fix your problem: you must not modify an ASN.1 specification to fix it.
In your question, you should add more information about the Erlang code you have written to decode the input.
Either the Erlang compiler has a bug or you are not using it properly ...
You can also use https://asn1.io/asn1playground/ as test to decode your PER data
After digging in ASN.1 – Communication Between Heterogeneous Systems, I found a nice paragraph called Selecting a CHOICE alternative, which specifies exactly what I was looking for:
The chosen type can be selected by the left angle bracket “ < ”
preceded by the alternative to be extracted and followed by a
reference to the CHOICE type
So the solution for my structure is to use:
DecryptedCertificate ::= SEQUENCE {
certificateProfileIdentifier INTEGER(0..255),
certificateAuthorityReference CertificateAuthority,
certificateHolderAuthorization CertificateHolderAuthorization,
endOfValidity TimeReal,
certificateHolderReference certificationAuthorityKID < KeyIdentifier,
rsaPublicKey RsaPublicKey
}
Of course I could have just hardcoded the type myself; however, this is more flexible when it comes to specification modifications, and more readable.
Edit: changed title and question to make them more general and easier to find when looking for this specific issue
I'm parsing a sequence of arbitrary bytes (u8) from a file, and I've come to a point where I had to use std::str::from_utf8_unchecked (which is unsafe) instead of the usual std::str::from_utf8, because it seemingly was the only way to make the program work.
fn parse(seq: &[u8]) {
    // ...
    let data: Vec<u8> = unsafe {
        str::from_utf8_unchecked(&value[8..data_end_index])
            .as_bytes()
            .to_vec()
    };
    // ...
}
This, however, is probably leading to some inconsistencies in actual use of the program, because with non-trivial input I get an io::Error::InvalidData with the message "stream did not contain valid UTF-8".
In the end, the issue was caused by the fact that I was converting the sequence of arbitrary bytes into a string which, in Rust, must contain only valid UTF-8. The conversion was done because I thought I could call to_vec only after as_bytes, without realizing that a &[u8] is already a slice of bytes.
In other words, I was converting the byte slice into a Vec, but only after a useless and harmful round-trip through a string, which doesn't make sense because not every byte sequence is valid UTF-8.
The correct code is the following:
let data: Vec<u8> = value[8..data_end_index].to_vec();
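As a self-contained sketch (the byte values here are made up, and `value` / `data_end_index` stand in for the original program's variables), the direct copy preserves bytes that a UTF-8 round-trip would reject:

```rust
fn main() {
    // A buffer whose payload (after the 8-byte prefix) is NOT valid UTF-8.
    let value: &[u8] = &[0, 1, 2, 3, 4, 5, 6, 7, 0xFF, 0xFE, 0x41];
    let data_end_index = value.len();

    // Copy the raw bytes directly; no string conversion is needed.
    let data: Vec<u8> = value[8..data_end_index].to_vec();

    assert_eq!(data, vec![0xFF, 0xFE, 0x41]);
    // A UTF-8 conversion would have failed on these bytes:
    assert!(std::str::from_utf8(&data).is_err());
}
```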
I was reading some common concepts regarding parsing in compilers and came across the lookahead symbol and the read-ahead symbol. I searched and read about them, but I am stuck on why we need both of them. I would be grateful for any suggestions.
Lookahead symbol: when the node being considered in the parse tree is for a terminal, and the terminal matches the lookahead symbol, then we advance in both the parse and the input.
Read-ahead symbol: the lexical analyzer may need to read some characters before it can decide on the token to be returned.
One of these is about parsing and refers to the next token to be produced by the lexical scanner. The other one, which is less formal, is about lexical analysis and refers to the next character in the input stream. It should be clear which is which.
Note that while most parsers only require a single lookahead token, it is not uncommon for lexical analysis to have to backtrack, which is equivalent to examining several unconsumed input characters.
I hope I got your question right.
Consider C.
It has several punctuators that begin the same way:
+, ++, +=
-, --, -=, ->
<, <=, <<, <<=
...
In order to figure out which one it is when you see the first + or - or <, you need to look ahead one character in the input (and then maybe one more for <<=).
A similar thing can happen at a higher level:
{
ident1 ident2;
ident3;
ident4:;
}
Here ident1, ident3 and ident4 can each begin a declaration, an expression or a label, and you can't tell which immediately. You can consult your existing declarations to see whether ident1 or ident3 is already known (as a type, variable, function or enumeration), but it's still ambiguous, because a colon may follow; if it does, it's a label. It's permitted to use the same identifier for both a label and a type/variable/function/enumeration (those two name spaces do not intersect), e.g.:
{
typedef int ident1;
ident1 ident2; // same as int ident2
int ident3 = 0;
ident3; // unused expression of value 0
ident1:; // unused label
ident2:; // unused label
ident3:; // unused label
}
So, you may very well need to look ahead a character or a token (or "unread" one) to deal with situations like these.
Hello and thank you for reading my post.
The Apache Commons StringEscapeUtils.escapeHtml3() and StringEscapeUtils.escapeHtml4() functions convert, in particular, accented characters (like é, à...) in a string into character entity references, which have the format &name; where name is a case-sensitive alphanumeric string.
How can I get the escaped string of a given string with numeric character references instead (&#nnnn; or &#xhhhh; where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form)?
I actually need to escape strings for an XML document, which doesn't know about entities such as &eacute;, &agrave;, etc.
Best regards.
To solve this problem, I wrote a method which takes a string as an argument and replaces, in this string, character entity references (like &eacute;) with their corresponding numeric character references (&#233; in this case).
I used this W3C list of references: http://www.sagehill.net/livedtd/xhtml1-transitional/xhtml-lat1.ent.html
Note: it would be great to be able to pass another argument to the StringEscapeUtils.escapeHtml4() method to tell it whether we would like character entity references or numeric character references in the output string...
Create your CharSequenceTranslator:
final CharSequenceTranslator XML_ESCAPE = StringEscapeUtils.ESCAPE_XML11.with(
        NumericEntityEscaper.between(0x7f, Integer.MAX_VALUE));
and use it:
XML_ESCAPE.translate(…)
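If Apache Commons is not available, the same idea can be approximated in plain Java. This is a minimal sketch (the class name NumericEscape is made up): escape the XML special characters and replace every code point above 0x7E with a decimal numeric character reference.

```java
public class NumericEscape {
    public static String escape(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            switch (cp) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                default:
                    if (cp > 0x7E) {
                        // Numeric reference, valid in any XML document.
                        out.append("&#").append(cp).append(';');
                    } else {
                        out.appendCodePoint(cp);
                    }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("déjà vu & more")); // d&#233;j&#224; vu &amp; more
    }
}
```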
I have a requirement to escape a string with url information but also some special characters such as '<'.
Using cl_http_utility=>escape_url, this translates to '%3c'. However, our backend web server is unable to recognize this as a special character and takes the value literally. What it does recognize as a special character is '%3C' (with an upper-case C). Also, if one checks http://www.w3schools.com/tags/ref_urlencode.asp, it shows the encoding with all caps as the proper one.
I guess my question is: is there an alternative to cl_http_utility=>escape_url that does essentially the same thing, except it outputs the value in upper case?
Thanks.
Use the string function.
l_escaped = escape( val = l_unescaped
format = cl_abap_format=>e_url ).
Other possible formats are e_url_full, e_uri, e_uri_full, and a bunch of xml/json stuff too. The string function escape is documented pretty well, demo programs and all.
I am having issues parsing text files that have illegal characters (binary markers) in them. An example would be as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using java.nio to validate each line before I process it, by introducing a Validator trait as follows:
import java.nio.charset._
trait Validator {
  private def encoder = Charset.forName("UTF-8").newEncoder
  def isValidEncoding(line: String): Boolean = {
    encoder.canEncode(line)
  }
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late once you already have a String: UTF-8 can always encode any string*. You need to go back to the point where you are decoding the file initially.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
//Pseudo code
str = file.decode("ISO-8859-1");
str = str.replace( "[\u0000-\u0019\u007F-\u00FF]", "");
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates which is probably not the case here.
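The pseudocode above could look like this in plain Java (a sketch; the class name StripBinary is made up):

```java
import java.nio.charset.StandardCharsets;

public class StripBinary {
    public static String clean(byte[] input) {
        // Every byte sequence is valid ISO-8859-1, so this never fails,
        // and each byte maps to the code point of equal value.
        String s = new String(input, StandardCharsets.ISO_8859_1);
        // Strip control characters and non-ASCII bytes, as in the answer.
        return s.replaceAll("[\\u0000-\\u0019\\u007F-\\u00FF]", "");
    }

    public static void main(String[] args) {
        byte[] raw = {0x00, 0x00, 'i', 'd', '1', ',', 't', 'e', 'x', 't', (byte) 0xFF};
        System.out.println(clean(raw)); // prints id1,text
    }
}
```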
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.