MIME encoded headers with extra '=' (==?utf-8?b?base64string?=) - parsing

This might be a silly question but... here it goes!
I wrote my own MIME parser in native C++. It's a nightmare with the encodings! It was stable for the last 3 months or so but recently I noticed this Subject: header.
Subject: =?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?===?UTF-8?B?YWNpb25lcw==?=
which should decode to this:
Subject: Oficina de Información, Iniciativas y Reclamaciones
The problem is the extra = (equals sign) in there, binding the two encoded words; I can't figure out why there are two of them or why they are separated at all. In theory the format should be =?charset?encoding?encoded_string?=, but I also found another subject that starts with two = signs:
==?UTF-8?B?blahblahlblah?=
How should I handle the extra =?
I could replace ==? with =? (which is what I'm doing) before any other processing, and it works... but I'm wondering whether there's any spec covering this, so I don't hack my way into proper functionality.
PS: How I hate these relic protocols! All text communication should be UTF-8 and XML :)

MIME headers use encoded-words (RFC 2047, Section 2).
... (why 2?)
To overcome the 75-character encoded-word limit, which exists because of the 76-character limit RFC 2047 places on header lines containing encoded-words (or to mix two different encodings, e.g. Chinese and Polish, in one header).
RFC 2047:
An 'encoded-word' may not be more than 75 characters long,
including 'charset', 'encoding', 'encoded-text', and delimiters.
If it is desirable to encode more text than will fit in an
'encoded-word' of 75 characters, multiple 'encoded-word's
(separated by CRLF SPACE) may be used.
Here's the example from RFC 2047 (note there is no '=' in between):
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
Your subject should be decoded as:
"Oficina de InformaciΓ³n, Iniciativas y Reclam=aciones"
mraq's answer is incorrect: soft line breaks apply only to the Quoted-Printable Content-Transfer-Encoding, which is used in MIME bodies.
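As a sanity check, here's how a compliant decoder joins adjacent encoded-words, using Python's standard email.header module (just to illustrate the RFC behavior; the asker's parser is C++):

from email.header import decode_header, make_header

# The well-formed RFC 2047 example: two encoded-words separated by
# folding whitespace, which the decoder drops when joining them.
subject = ("=?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?= "
           "=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=")
print(str(make_header(decode_header(subject))))
#=> If you can read this you understand the example.

Fed the malformed header from the question, a strict decoder treats the stray = as literal text, which is why it decodes to "Reclam=aciones" above.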

It is called the "Soft Line Break" and it is a legacy of the SMTP protocol.
Quoting page 20 of RFC 2045:
(Soft Line Breaks) The Quoted-Printable encoding
REQUIRES that encoded lines be no more than 76
characters long. If longer lines are to be encoded
with the Quoted-Printable encoding, "soft" line breaks
must be used. An equal sign as the last character on an
encoded line indicates such a non-significant ("soft")
line break in the encoded text.
And also Wikipedia on Quoted-printable
A soft line break consists of an "=" at the end of an encoded line,
and does not appear as a line break in the decoded text.
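Soft line breaks are easy to see with any quoted-printable codec, e.g. Python's quopri module (a quick illustration of the mechanism, nothing more):

import quopri

body = b"x" * 100                       # a single line longer than 76 chars
encoded = quopri.encodestring(body)
print(encoded)                          # note the '=' before each line break
assert quopri.decodestring(encoded) == body  # soft breaks vanish on decode

But again, this applies to Quoted-Printable bodies, not to B-encoded header words.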

From what I can see in the MIME RFCs, double equals signs are not valid input (for encoding), but keep in mind you could interpret the first equals sign as a literal character and use what follows for decoding. Seriously though, those extra equals signs look like artifacts, probably from a buggy encoder.

Related

How to convert a formatted string into plain text

Users copy, paste and send data in the following format: "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖"
I need to convert it into plain text (ASCII characters, say), like 'jovy debbie'.
It comes in different fonts and formats, e.g.:
'π‘±π’†π’π’Šπ’„π’‚ π‘«π’–π’ˆπ’π’”'
'π™ΆπšŽπšŸπš’πšŽπš•πš’πš— π™½πš’πšŒπš˜πš•πšŽ π™»πšžπš–πš‹πšŠπš'
Any help will be appreciated. I already referred to other Stack Overflow questions, but no luck :(
Those letters are from the Mathematical Alphanumeric Symbols block.
Since they have a fixed offset from their ASCII counterparts, you can use tr to map them, e.g.:
"𝕛𝕠𝕧π•ͺ π••π•–π•“π•“π•šπ•–".tr("𝕒-𝕫", "a-z")
#=> "jovy debbie"
The same approach can be used for the other styles, e.g.
"π‘±π’†π’π’Šπ’„π’‚ π‘«π’–π’ˆπ’π’”".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
#=> "Jenica Dugos"
This gives you full control over the character mapping.
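The fixed offset the tr ranges rely on is easy to verify outside Ruby too; here's the same idea as a small, purely illustrative Python sketch:

# The double-struck small letters form one contiguous block, so every
# letter sits at a constant distance from its ASCII counterpart.
offset = ord("𝕒") - ord("a")        # same offset for the whole a-z run

def unstyle(s):
    return "".join(chr(ord(c) - offset) if "𝕒" <= c <= "𝕫" else c for c in s)

print(unstyle("𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖"))
#=> jovy debbie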
Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:
"𝕛𝕠𝕧π•ͺ π••π•–π•“π•“π•šπ•–".unicode_normalize(:nfkc)
#=> "jovy debbie"
"π‘±π’†π’π’Šπ’„π’‚ π‘«π’–π’ˆπ’π’”".unicode_normalize(:nfkc)
#=> "Jenica Dugos"

AES encrypt string appear \r\n

When I use AES-128 to encrypt a string, if the encrypted string is long enough it contains \r\n sequences.
Right now I just replace them with an empty string. Why does the encrypted string contain \r\n, and is there a better way to avoid or fix it?
Thanks.
Answer summary: it's caused by the Base64 encoding step, which inserts a \r\n every 64 characters.
That is a Base64 encoded string.
The actual encryption output is an array of 8-bit bytes, not characters. The code is Base64 encoding the encrypted data with an option that inserts line breaks every 64 characters, which sometimes allows better printing of the output. When decoding, use the NSDataBase64DecodingIgnoreUnknownCharacters option to skip the line breaks.
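The effect is not specific to Apple's API; any Base64 encoder with line wrapping behaves the same way. A quick illustration in Python (the wrap width differs from Apple's 64, but the principle is identical):

import base64, os

ciphertext = os.urandom(96)                 # stand-in for raw AES output
wrapped = base64.encodebytes(ciphertext)    # inserts '\n' every 76 chars
compact = base64.b64encode(ciphertext)      # no line breaks at all
assert base64.b64decode(wrapped) == ciphertext  # decoder drops the breaks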
In particular, the Objective-C method to create a Base64 string from NSData is:
- (NSString *)base64EncodedStringWithOptions:(NSDataBase64EncodingOptions)options
The options include:
NSDataBase64Encoding64CharacterLineLength
Set the maximum line length to 64 characters, after which a line ending is inserted.
Which inserts "\r\n" (carriage return, line feed) characters every 64 characters.
If that is not what you want pass 0 as the option value.
To decode Base64 use the Objective-C method:
- (instancetype)initWithBase64EncodedString:(NSString *)base64String options:(NSDataBase64DecodingOptions)options
With the option: NSDataBase64DecodingIgnoreUnknownCharacters.
Apple code:
The default implementation of this method will reject non-alphabet characters, including line break characters. To support different encodings and ignore non-alphabet characters, specify an options value of NSDataBase64DecodingIgnoreUnknownCharacters.
The thing that gives it away as a Base64 string is a length that is a multiple of 4, the characters used "a-zA-Z0-9+/" and the trailing "=" characters.
Historic note: These days on OSX and iOS, line breaks are a single "\n" (0x0a) line feed character. Back when we used teletypes as terminals, "\r" (0x0d) carriage return moved the carriage or print head back but did not move the paper up to the next line; "\n" newline moved the paper up one line but did not move the carriage or print head back. They were two distinct mechanical operations. Later, some systems used "\r\n", "\n\r", "\r" or "\n". Unix chose "\n", and thus so do OSX and iOS.

.gsub erroring with non-regular character 194

I've seen this posted a couple of times but none of the solutions seem to work for me so far...
I'm trying to remove a spurious Â character from a string...
e.g.
"myÂstring here Â$100"
..but it should be "my string here $100"
I've tried:
string.gsub(/\194/,'')
string.gsub(194.chr,'')
string.delete 194.chr
All of these still leave the Â intact.
Any thoughts?
By default, Rails supports UTF-8.
You can use your favorite editor to write a gsub call using the proper character you want to replace, as in:
"myΓ‚string here Γ‚$100".gsub(/Γ‚/,"")
If this does not work either, you might have an encoding error somewhere in your stack, probably in your HTML document. Try running rails console, extract that string somehow (if it comes from the model, try a find on the containing class) and run the gsub there. It won't solve your problem, but you'll get a clue to where exactly the problem lies.
Looks like a character encoding problem to me. For every Unicode code point in the range U+0080..U+00BF inclusive, the UTF-8 encoding is a two-byte sequence: 0xC2 (194 decimal) followed by a byte equal to the code point's numeric value. For example, a non-breaking space (U+00A0) becomes 0xC2 0xA0. Was there another extra character in there that you already removed?
At any rate, gsub(/\194/,'') is wrong: \nnn is an octal escape, but 194 is the decimal value. 194 in octal is \302.
"myΓ‚string here Γ‚$100".gsub("Γ‚","") # "mystring here $100"
Is that what you meant?
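To see why the stray 0xC2 renders as Â in the first place: UTF-8 bytes being re-read as Latin-1 produce exactly this pattern. A quick byte-level illustration (Python here purely for demonstration; the Ruby fix above doesn't depend on it):

utf8 = "\u00a0".encode("utf-8")    # U+00A0, the no-break space
print(utf8)                        # b'\xc2\xa0'
print(utf8.decode("latin-1"))      # 'Â' followed by a no-break space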

What's valid and what's not in a URI query?

Background (question further down)
I've been Googling this back and forth, reading RFCs and SO questions, trying to crack this, but I still haven't got jack.
So I guess we just vote for the "best" answer and that's it, or?
Basically it boils down to this.
3.4. Query Component
The query component is a string of information to be interpreted by the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "#", "&", "=", "+", ",", and "$" are reserved.
The first thing that boggles me is that *uric is defined like this
uric = reserved | unreserved | escaped
reserved = ";" | "/" | "?" | ":" | "#" | "&" | "=" | "+" | "$" | ","
This is however somewhat clarified by paragraphs such as
The "reserved" syntax class above refers to those characters that are allowed within a URI, but which may not be allowed within a particular component of the generic URI syntax; they are used as delimiters of the components described in Section 3.
Characters in the "reserved" set are not reserved in all contexts. The set of characters actually reserved within any given URI component is defined by that component. In general, a character is reserved if the semantics of the URI changes if the character is replaced with its escaped US-ASCII encoding.
This last excerpt feels somewhat backwards, but it clearly states that the reserved character set depends on context. Yet 3.4 states that all the reserved characters are reserved within a query component; however, the only thing that would change the semantics here is escaping the question mark (?), since URIs do not define the concept of a query string.
At this point I've given up on the RFCs entirely but found RFC 1738 particularly interesting.
An HTTP URL takes the form:
http://<host>:<port>/<path>?<searchpart>
Within the <path> and <searchpart> components, "/", ";", "?" are reserved. The "/" character may be used within HTTP to designate a hierarchical structure.
I interpret this, at least with regard to HTTP URLs, as RFC 1738 superseding RFC 2396. Because the URI query has no notion of a query string, the interpretation of "reserved" doesn't really allow me to define query strings the way I'm used to doing by now.
Question
This all started when I wanted to pass a list of numbers together with a request for another resource. I didn't think much of it and just passed them as comma-separated values. To my surprise, though, the comma was escaped: the query page.html?q=1,2,3 turned into page.html?q=1%2C2%2C3 when encoded. It works, but it's ugly, and I didn't expect it. That's when I started going through RFCs.
My first question is simply, is encoding commas really necessary?
My answer, according to RFC 2396: yes, according to RFC 1738: no
Later I found related posts regarding passing lists between requests, where the CSV approach was deemed bad. This showed up instead (I hadn't seen it before):
page.html?q=1;q=2;q=3
My second question, is this a valid URL?
My answer, according to RFC 2396: no, according to RFC 1738: no (; is reserved)
I don't have any issues with passing CSV as long as it's numbers, but yes, you do run the risk of having to encode and decode values back and forth if the comma is suddenly needed for something else. Anyway, I tried the semicolon query-string thing with ASP.NET, and the result was not what I expected.
Default.aspx?a=1;a=2&b=1&a=3
Request.QueryString["a"] = "1;a=2,3"
Request.QueryString["b"] = "1"
I fail to see how this differs greatly from the CSV approach: when I ask for "a" I get a string with commas in it. ASP.NET certainly is not a reference implementation, but it hasn't let me down yet.
But most importantly -- my third question -- where is the specification for this? And what would you do, or for that matter not do?
That a character is reserved within a generic URL component doesn't mean it must be escaped when it appears within the component or within data in the component. For escaping to be required, the character must also be defined as a delimiter within the generic or scheme-specific syntax, and it must appear within data.
The current standard for generic URIs is RFC 3986, which has this to say:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by characters in the "reserved" set. These characters are called "reserved" because they may (or may not) be defined as delimiters by the generic syntax, by each scheme-specific syntax, or by the implementation-specific syntax of a URI's dereferencing algorithm. If data for a URI component would conflict with a reserved character's purpose as a delimiter [emphasis added], then the conflicting data must be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
3.3. Path Component
[...]
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
[...]
3.4 Query Component
[...]
query = *( pchar / "/" / "?" )
Thus commas are explicitly allowed within query strings and only need to be escaped in data if a specific scheme defines them as a delimiter. The HTTP scheme doesn't use the comma or the semicolon as a delimiter in query strings, so they don't need to be escaped. Whether browsers follow this standard is another matter.
Using CSV should work fine for string data, you just have to follow standard CSV conventions and either quote data or escape the commas with backslashes.
As for RFC 2396, it also allows for unescaped commas in HTTP query strings:
2.2. Reserved Characters
Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.
Since commas don't have a reserved purpose under the HTTP scheme, they don't have to be escaped in data. The note from § 2.3 about reserved characters being those that change semantics when percent-encoded applies only in general; characters may be percent-encoded without changing semantics for specific schemes and yet still be reserved.
I think the real question is: "What characters should be encoded in a query string?" And that depends mainly on two things: The validity and the meaning of a character.
Validity according to the RFC standard
In RFC3986 we can find which special characters are valid and which are not inside a query string:
// Valid:
! $ & ' ( ) * + , - . / : ; = ? @ _ ~
% (should be followed by two hex chars to be completely valid (e.g. %7C))
// Invalid:
" < > [ \ ] ^ ` { | }
space
# (marks the end of the query string, so it can't be a part of it)
extended ASCII characters (e.g. °)
Deviations from the standard
Browsers and web frameworks do not always strictly follow the RFC standard. Below are some examples:
[, ] are not valid, but Chrome and Firefox do not encode these characters inside a query string. The reasoning given by Chrome devs is simply: "If other browsers and an RFC disagree, we will generally match other browsers." QueryHelpers.AddQueryString from ASP.NET Core on the other hand will encode these characters.
Other invalid characters that are not encoded by Chrome and Firefox are:
\ ^ ` { | }
' is a valid character inside a query string but will be encoded by Chrome, Firefox and QueryHelpers nevertheless. The explanation given by Firefox devs is that they knew that they don't have to encode it according to the RFC standard, but did it to reduce vulnerabilities.
Special meaning
Some characters are valid and also don't get encoded by browsers, but should still be encoded in certain cases.
+: Spaces are normally encoded as %20 but alternatively they can be encoded as +. So + inside a query string means it's an encoded space. If you want to include a character that's actually supposed to literally mean plus, then you have to use the encoded version of + which is %2B.
~: Some old Unix systems interpreted URI parts that started with ~ as a path to a home directory. So it's a good idea to encode ~ if it's not meant to denote the start of a Unix home directory path for an old system (so nowadays probably always encode).
=, &: Usually (although the RFC doesn't require it) query strings contain parameters in the format "key1=value1&key2=value2". If that's the case, and =s or &s should be part of a parameter key or value rather than separating keys from values or parameters from each other, then you have to encode those =s and &s. So if a parameter value should for some reason consist of the string "=&", it has to be encoded as %3D%26, giving "weirdparam=%3D%26" for the full key and value.
%: Usually web frameworks figure out that a % not followed by two hex characters simply means % itself, but it's still a good idea to always encode % when it's meant literally and not as the start of a percent-encoded character (e.g. %7C), because RFC 3986 specifies that % is only valid when followed by two hex characters. So don't use "percentageparam=%"; use "percentageparam=%25" instead.
Encoding guidelines
Encode every character that is otherwise invalid* according to RFC3986 and every character that can have special meaning but should only be interpreted in a literal way without giving it a special meaning. You can also encode things that aren't required to be encoded, like '. Why? Because it doesn't hurt to encode more than necessary. Servers and web frameworks when parsing a query string will decode every encoded character, no matter if it was really necessary to previously encode that character or not.
The only characters of a query string that shouldn't be encoded are those that can have a special meaning and shouldn't lose that special meaning, e.g. don't encode the = of "key1=value1". To achieve that, don't apply an encoding method to the whole query string (nor to the whole URI); apply it only, and separately, to the query parameter keys and values. For example, with JS:
var url = "http://example.com?" + encodeURIComponent(myKey1) + "=" + encodeURIComponent(myValue1) + "&" + encodeURIComponent(myKey2)...;
Note that encodeURIComponent encodes many more characters than necessary, i.e. characters that are valid in a query string and have no special meaning there, e.g. / and ?.
The reason is that encodeURIComponent wasn't created for query strings alone but instead encodes characters that have special meaning outside of the query string as well, e.g. / for the path URI component. QueryHelpers.AddQueryString works in a similar manner. Under the hood it uses System.Text.Encodings.Web.DefaultUrlEncoder which is not just meant for query strings but also for isegment, ipath-noscheme and ifragment.
* You could probably get away with only regarding those characters as invalid that are both not allowed by the RFC and that are also always encoded by Chrome for instance. This would be Space " < >. But it's probably better to be on the safer side and encode at least everything that RFC3986 considers invalid.
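For what it's worth, the standard URL-encoding helpers in other languages follow the same per-key/per-value pattern. For example, Python's urlencode (illustrative; the parameter names are made up):

from urllib.parse import urlencode

params = {"q": "1,2,3", "weirdparam": "=&"}
print("http://example.com?" + urlencode(params))
#=> http://example.com?q=1%2C2%2C3&weirdparam=%3D%26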
OP's questions
My first question is simply, is encoding commas really necessary? -> No, it's not necessary, but it doesn't hurt (except for ugliness), it will happen with default encoding methods such as encodeURIComponent, and decoding and query-string parsing should work regardless.
My second question, is this a valid URL (page.html?q=1;q=2;q=3)? -> It's RFC-valid, but your server / web framework might have a hard time parsing the query string, since it probably expects the typical "key1=value1&key2=value2" format.
Where is the specification for this? -> There isn't a single specification that covers everything, because some things are implementation-specific. For instance, there are different ways of representing arrays in query strings.
Just use ?q=1+2+3
I am answering a fourth question here :) that was not asked but started it all: how do I pass a list of numbers, a la comma-separated values? It seems to me the best approach is just to pass them space-separated, and the spaces will get form-url-encoded to +. This works great as long as you know the values in the list contain no spaces (something numbers tend not to).
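A quick sketch of the round trip, using Python's stdlib just for illustration (any form-decoding library behaves the same way):

from urllib.parse import parse_qs, urlencode

print(urlencode({"q": "1 2 3"}))            #=> q=1+2+3
print(parse_qs("q=1+2+3")["q"][0].split())  #=> ['1', '2', '3']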
page.html?q=1;q=2;q=3
is this a valid URL?
Yes. The ; is reserved, but not by an RFC. The context that defines this component is the definition of the application/x-www-form-urlencoded media type, which is part of the HTML standard (section 17.13.4.1). In particular the sneaky note hidden away in section B.2.2:
We recommend that HTTP server implementors, and in particular, CGI implementors support the use of ";" in place of "&" to save authors the trouble of escaping "&" characters in this manner.
Unfortunately many popular server-side scripting frameworks including ASP.NET do not support this usage.
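Some libraries do support it behind an explicit switch. For instance, Python's query-string parser accepts an alternative separator (recent Python versions; shown only as an illustration):

from urllib.parse import parse_qs

# parse_qs splits on '&' by default; 'separator' opts in to ';'
print(parse_qs("q=1;q=2;q=3", separator=";"))
#=> {'q': ['1', '2', '3']}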
I would like to note that page.html?q=1&q=2&q=3 is a valid URL as well. This is a completely legitimate way of expressing an array in a query string. Your server technology determines how exactly it is presented.
In Classic ASP, you check Request.QueryString("q").Count and then use Request.QueryString("q")(0) (and (1) and (2)).
Note that you saw this in your ASP.NET, too (I think it was not intended, but look):
Default.aspx?a=1;a=2&b=1&a=3
Request.QueryString["a"] = "1;a=2,3"
Request.QueryString["b"] = "1"
Notice that the semicolon is not treated as a separator, so a is defined twice, and you get its value twice, separated by a comma. Using all ampersands, Default.aspx?a=1&a=2&b=1&a=3 will yield a as "1,2,3". But I am sure there is a method to get each individual element, in case the elements themselves contain commas; it is simply the default property of the non-indexed QueryString that concatenates the sub-values with comma separators.
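Frameworks that expose the individual values usually hand back a list per key. Python's parser, for example, does it this way (again, just illustrative):

from urllib.parse import parse_qs

print(parse_qs("a=1&a=2&b=1&a=3"))
#=> {'a': ['1', '2', '3'], 'b': ['1']}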
I had the same issue. The hyperlinked URL was a third-party URL that expected a list of parameters in the format page.html?q=1,2,3 ONLY, and page.html?q=1%2C2%2C3 did not work. I was able to get it working using JavaScript. It may not be the best approach, but you can check out the solution here if it helps anyone.
If you are sending the ENCODED characters to a Flash/SWF file, then you should ENCODE the characters twice! (because of the Flash parser)

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the numbers.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain:
what I'm doing wrong here?
why the '–' character seemingly requires three integers for representation [226, 128, 147]?
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode: the dash in your input isn't an ASCII hyphen (hex 2D) but a Unicode en dash (hex 2013), and your code is receiving it in UTF-8 encoding rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E2 80 93 in UTF-8.
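This is easy to confirm in any language that exposes the raw bytes; for example, in Python (illustrative only):

s = "0 \u2013 1"                 # U+2013 is the en dash
print(list(s.encode("utf-8")))   #=> [48, 32, 226, 128, 147, 32, 49]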
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
