"BAD Could not parse command" is returned if mailbox name contains non-English symbols - imap

I'm using Chilkat.IMAP components to get emails from IMAP servers. If a mailbox name contains non-English symbols, "BAD Could not parse command" is returned:
----IMAP REQUEST----
aaai LIST "[Gmail]/" "%"
----IMAP RESPONSE----
* LIST (\All \HasNoChildren) "/" "[Gmail]/All Mail"
* LIST (\HasChildren \Trash) "/" "[Gmail]/Bin"
* LIST (\Drafts \HasNoChildren) "/" "[Gmail]/Drafts"
* LIST (\HasNoChildren \Important) "/" "[Gmail]/Important"
* LIST (\HasNoChildren \Sent) "/" "[Gmail]/Sent Mail"
* LIST (\HasNoChildren \Junk) "/" "[Gmail]/Spam"
* LIST (\HasNoChildren) "/" "[Gmail]/&BB8EMAQ,BDoEMA-"
aaai OK Success
----IMAP REQUEST----
aaaj LIST "[Gmail]/All Mail/" "%"
----IMAP RESPONSE----
aaaj OK Success
----IMAP REQUEST----
aaap LIST "[Gmail]/Папка/" "%"
----IMAP RESPONSE----
aaap BAD Could not parse command

IMAP by default does not send 8-bit characters, and the original protocol defines mailbox names with non-ASCII characters to be UTF-7 encoded (with some modifications). This is the &BB8EMAQ,BDoEMA- you're seeing.
You can either add UTF-7 encoding/decoding to your application, or, if your server is new enough, ENABLE UTF-8 mode. Note: enabling UTF-8 may get you Unicode in places you do not expect. Gmail does support this extension.
> a LIST "" *
< ...
< * LIST (\HasChildren) "/" "&AOk-cole"
> b ENABLE UTF8=ACCEPT
< ...
< * LIST (\HasChildren) "/" "école"
Here's how that UTF-7 string breaks down:
[Gmail]/&BB8EMAQ,BDoEMA-
& and - shift in and out of decoding mode, so this looks like
"[Gmail]/" + mUTF7decode("BB8EMAQ,BDoEMA")
And here's a Python 3 one-liner that decodes it, with "===" added to satisfy the base64 padding requirement and altchars specifying the last two characters of the base64 alphabet:
>>> import base64; base64.b64decode("BB8EMAQ,BDoEMA===", altchars="+,").decode("utf-16be")
'Папка'
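If you go the "add UTF-7 encoding/decoding to your application" route, the breakdown above can be turned into a pair of helpers. This is only a minimal sketch: the function names decode_mutf7 and encode_mutf7 are made up for illustration, and it assumes well-formed input (every "&" has a closing "-").

```python
import base64


def decode_mutf7(name: str) -> str:
    """Decode an IMAP modified UTF-7 mailbox name (RFC 3501, section 5.1.3)."""
    out = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch != "&":
            out.append(ch)                 # plain ASCII represents itself
            i += 1
            continue
        end = name.index("-", i)           # the shifted run ends at the next "-"
        chunk = name[i + 1:end]
        if chunk == "":
            out.append("&")                # "&-" is a literal ampersand
        else:
            padded = chunk + "=" * (-len(chunk) % 4)
            raw = base64.b64decode(padded, altchars=b"+,")
            out.append(raw.decode("utf-16-be"))
        i = end + 1
    return "".join(out)


def encode_mutf7(name: str) -> str:
    """Encode a Unicode mailbox name as IMAP modified UTF-7."""
    out = []
    pending = []                           # characters waiting to be base64-encoded

    def flush():
        if pending:
            raw = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(raw, altchars=b"+,").decode("ascii")
            out.append("&" + b64.rstrip("=") + "-")
            pending.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.append("&-")
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ch)
        else:
            pending.append(ch)
    flush()
    return "".join(out)


print(decode_mutf7("[Gmail]/&BB8EMAQ,BDoEMA-"))
print(encode_mutf7("[Gmail]/Папка"))
```

The encoded form is what goes on the wire in LIST or SELECT; the decoded form is what you show to the user.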

It may be that you're using a very old version of Chilkat. Try the latest version; it should work fine. If not, please let us know.

Related

Encoding percent symbol in URL creates invalid folder name on Sharepoint Group

I'm trying to upload a file to a specific non-existing path on a Microsoft SharePoint Group, assuming the folder hierarchy will be created based on that path. And that's true.
The problem appears when a path segment has special characters. I found MS documentation stating that path segments should be encoded (using the escape function in JavaScript).
So let's say I'm uploading file File1.txt to a path Test 1/Whatever%Text!Here
Here's what url would look like:
PUT https://graph.microsoft.com/v1.0/groups/<group-id>/drive/items/root:/Test%201/Whatever%25Text%21Here:/children/File1.txt/content
You can see encoded path segment (/Test%201/Whatever%25Text%21Here) and how % is encoded to %25. Seems fine to me. But this URL will create subfolder called Whatever%25Text!Here, not Whatever%Text!Here
%25 stays %25, it's not decoded to %. Does anyone have a clue what's going on?
I was mainly testing through the Microsoft Graph API explorer, trying several different URLs, such as changing % to %2525, but without luck.
The % symbol is one of OneDrive for Business' "Reserved Characters".
From the documentation:
OneDrive reserved characters
The following characters are OneDrive reserved characters, and can't be used in OneDrive folder and file names.
onedrive-reserved = "/" / "\" / "*" / "<" / ">" / "?" / ":" / "|"
onedrive-business-reserved = "/" / "\" / "*" / "<" / ">" / "?" / ":" / "|" / "#" / "%"
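Given those two lists, one way to avoid the surprise is to check each folder or file name before building the upload URL. The sketch below is only an illustration: the helper name reserved_chars_in is invented here, the character sets are copied from the documentation excerpt above, and it is meant to be applied to one path segment at a time (not the full path, since "/" is itself reserved).

```python
# Character sets copied from the documentation excerpt above.
ONEDRIVE_RESERVED = set('/\\*<>?:|')
ONEDRIVE_BUSINESS_RESERVED = ONEDRIVE_RESERVED | set('#%')


def reserved_chars_in(name: str, business: bool = True) -> list:
    """Return the reserved characters that appear in a single file or folder name."""
    reserved = ONEDRIVE_BUSINESS_RESERVED if business else ONEDRIVE_RESERVED
    return sorted(set(name) & reserved)


for segment in "Test 1/Whatever%Text!Here".split("/"):
    print(segment, reserved_chars_in(segment))
```

For the path in the question, the first segment comes back clean and the second one reports "%", which is the character that cannot be used on OneDrive for Business.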

Discrepancies of Percent Encoding for URLs

After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encodings are correct - the Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded.
Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.
The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. Here is the definition from HTML 4.01:
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
This specification does not specify all valid submission methods or content types that may be used with forms. However, HTML 4 user agents must support the established conventions in the following cases:
• If the method is "get" and the action is an HTTP URI, the user agent takes the value of action, appends a `?' to it, then appends the form data set, encoded using the "application/x-www-form-urlencoded" content type. The user agent then traverses the link to this URI. In this scenario, form data are restricted to ASCII codes.
• If the method is "post" and the action is an HTTP URI, the user agent conducts an HTTP "post" transaction using the value of the action attribute and a message created according to the content type specified by the enctype attribute.
Section 17.13.4 Form content types, application/x-www-form-urlencoded
This is the default content type. Forms submitted with this content type must be encoded as follows:
1. Control names and values are escaped. Space characters are replaced by '+', and then reserved characters are escaped as described in [RFC1738], section 2.2: Non-alphanumeric characters are replaced by '%HH', a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., '%0D%0A').
2. The control names/values are listed in the order they appear in the document. The name is separated from the value by '=' and name/value pairs are separated from each other by '&'.
The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are far more refined and detailed, but for purposes of this discussion, the gist is roughly the same.
So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax:
URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.
'+' is a reserved character:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
The query component explicitly allows unencoded '+' characters, as it allows characters from sub-delims:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded = "%" HEXDIG HEXDIG
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
query = *( pchar / "/" / "?" )
So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.
So, for example: http://server/script?field=hello+world
However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.
So, for example: http://server/script?hello%20world
Similar rules also apply to the path component, due to its use of pchar:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.
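To make the distinction concrete, here is a small sketch using Python's standard urllib.parse module. The field name and values are just the examples from above; nothing here is specific to any particular server.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus, urlencode

# Webform submission: application/x-www-form-urlencoded, space becomes '+'.
form_query = urlencode({"field": "hello world"})
print(form_query)                      # field=hello+world

# Generic URI percent-encoding (query or path outside a form): space becomes %20.
print(quote("hello world"))            # hello%20world

# In a path segment '+' is an ordinary sub-delim, so it may stay as-is
# while the space still has to be percent-encoded.
print(quote("a b+c", safe="+"))        # a%20b+c

# Decoding shows the same asymmetry: only the form decoder maps '+' to space.
print(unquote("hello+world"))          # hello+world
print(unquote_plus("hello+world"))     # hello world

# quote_plus is the per-value form encoder that urlencode applies.
print(quote_plus("hello world"))       # hello+world
```

In other words, pick the encoder by where the data comes from (a form or not), and pick the decoder to match.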
Now, regarding the charset used to encode characters -
For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or hidden _charset_ field directly in the <form> itself, otherwise the charset is typically the charset used by the parent HTML.
However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.
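As a rough illustration of why the charset matters, the same character produces different %XX sequences depending on the charset chosen, and decoding with the wrong one silently garbles it. The sketch uses Python's urllib.parse; the character "é" is just an arbitrary example.

```python
from urllib.parse import quote, unquote

# The same character, percent-encoded under two different charsets.
utf8_form = quote("é")                        # UTF-8 is the default
latin1_form = quote("é", encoding="latin-1")
print(utf8_form)                              # %C3%A9
print(latin1_form)                            # %E9

# Decoding only round-trips if both sides agree on the charset.
print(unquote(latin1_form, encoding="latin-1"))   # é
print(repr(unquote(latin1_form)))                 # '\ufffd' (invalid UTF-8, replaced)
```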

Do browsers ignore slashes in URLs? [duplicate]

I noticed that both Chrome and Firefox ignore slashes between words in a URL.
So, github.com/octocat/hello-world seems to be equivalent to github.com//////octocat////hello-world.
I am writing an application that parses a URL and retrieves a part of it, and thanks to this behavior, I am able to return the original URL without modifying the code, which in my case is rather convenient. I don't know if it would be a good idea to rely on this quirk though.
Path separators are defined to be a single slash according to this. (Search for Path Component)
Note that browsers don't usually modify the URL. Browsers could append a / at the end of a URL, but in your case, the URL with extra slashes is simply sent along in the request, so it is the server ignoring the slashes instead.
Also, have a look at:
Is a URL with // in the path-section valid?
URL with multiple forward slashes, does it break anything?
What does the double slash mean in URLs?
Even if this behavior is convenient for you, it is generally not recommended. In addition, caching may also be affected (source):
Since both your browser and the server cache individual pages (according to their caching settings), requesting same file multiple times via slightly different URIs might affect the caching (depending on server and client implementation).
An empty path segment is valid as per specification:
path = path-abempty ; begins with "/" or is empty
/ path-absolute ; begins with "/" but not "//"
/ path-noscheme ; begins with a non-colon segment
/ path-rootless ; begins with a segment
/ path-empty ; zero characters
path-abempty = *( "/" segment )
path-absolute = "/" [ segment-nz *( "/" segment ) ]
path-noscheme = segment-nz-nc *( "/" segment )
path-rootless = segment-nz *( "/" segment )
path-empty = 0<pchar>
segment = *pchar
segment-nz = 1*pchar
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
In the latter URI https://github.com//////octocat////hello-world, the path //////octocat////hello-world would be composed of:
//////octocat////hello-world: path-abempty
/: segment
/: segment
/: segment
/: segment
/: segment
/octocat: segment-nz
/: segment
/: segment
/: segment
/hello-world: segment-nz
Removing these empty path segments would make up a completely different URI. How the server would handle these empty path segments is a completely different question.
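For an application that parses URLs, the difference is easy to see by splitting the path yourself. The following is a sketch only: path_segments and collapse_empty are hypothetical helper names, and collapsing empty segments merely mimics what a lenient server might do; it is not something the URI specification prescribes.

```python
from urllib.parse import urlsplit


def path_segments(url: str) -> list:
    """Split the path of a URL into its segments, keeping empty ones."""
    path = urlsplit(url).path
    if not path:
        return []
    parts = path.split("/")
    # A leading "/" yields an empty string before the first segment; drop it.
    return parts[1:] if path.startswith("/") else parts


def collapse_empty(segments: list) -> list:
    """Drop empty segments, as a server that ignores repeated slashes might."""
    return [s for s in segments if s]


plain = path_segments("https://github.com/octocat/hello-world")
noisy = path_segments("https://github.com//////octocat////hello-world")

print(plain)                    # ['octocat', 'hello-world']
print(noisy)                    # five '' entries, 'octocat', three '' entries, 'hello-world'
print(plain == noisy)           # False: different URIs
print(plain == collapse_empty(noisy))   # True only after explicit normalization
```

If your parser should treat both forms as the same resource, doing the normalization explicitly like this is safer than relying on whatever the server happens to do.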
Actually browsers do not ignore them, they pass them to the web server in the HTTP request. It's the server that may decide to ignore them, but technically multiplying slashes results in a different URL.
W3.org specifies that the path part of a URL consists of "path segments", separated by /, and a path segment consists of zero or more "URL units" (characters) except / and ?, so empty path segments are allowed, which is what you get when you duplicate slashes.
See http://www.w3.org/TR/url-1/ for details
Actually browsers do not ignore slashes between URLs.
If you use document.URL in (client side) JavaScript you get the URL with the repeating '///'s.
Similarly in (server side) PHP, when using $_SERVER['REQUEST_URI'] you get the URL with the repeating '///'s.
It is the server, e.g., Apache, that resolves the request to the proper page, ignoring the repeated slashes. In Apache you can write rules in the .htaccess file to stop it from ignoring them.

IMAP SEARCH complex query

I need to find all mails in an IMAP mailbox which contain somestring in BODY and are FROM someone@me.com or TO someone@me.com.
Trying to do:
49:51.53 > JBPM3 SEARCH CHARSET utf-8 "BODY \"somestring\" (OR (TO \"someone@me.com\") (FROM \"someone@me.com\"))"
Receiving:
49:51.71 < JBPM3 BAD Could not parse command
How to make it work using GMail?
You may skip the parentheses '(' ')' used to group logical expressions in IMAP.
Parentheses are not needed in Polish Notation (see edit below):
A0001 SEARCH CHARSET utf-8 BODY "somestring" OR TO "someone@me.com" FROM "someone@me.com"
You could also use gmail search syntax (X-GM-RAW) command:
http://www.limilabs.com/blog/search-gmail-using-gmails-search-syntax
[Edit]
Parentheses are sometimes required in IMAP SEARCH. This is because the AND operator can have more than 2 operands and is not explicitly written out:
http://www.limilabs.com/blog/imap-search-requires-parentheses
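If you build such queries in code, a few tiny helpers keep the prefix notation straight. This is just a sketch: the helper names (key, or_, and_) are invented for illustration, the quoting is simplified, and the resulting string still has to be sent unquoted as the SEARCH arguments (for example via imaplib's IMAP4.search), not wrapped in one big quoted string as in the failing request.

```python
def quoted(value: str) -> str:
    """Wrap a value as an IMAP quoted string, escaping backslash and double quote."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def key(name: str, value: str) -> str:
    """A single search key such as BODY "text"."""
    return f"{name} {quoted(value)}"


def or_(left: str, right: str) -> str:
    """OR is a prefix operator that takes exactly two search keys."""
    return f"OR {left} {right}"


def and_(*keys: str) -> str:
    """AND is implicit; parentheses make a group count as one search key."""
    return "(" + " ".join(keys) + ")"


address = "someone@me.com"
criteria = " ".join([
    key("BODY", "somestring"),
    or_(key("TO", address), key("FROM", address)),
])
print(criteria)

# Parentheses become necessary when one OR operand is itself an AND of keys.
grouped = or_(and_(key("TO", address), key("BODY", "somestring")), key("FROM", address))
print(grouped)
```

The first query needs no parentheses because OR always consumes the next two keys; the second one does, because without them OR would only take TO and BODY as its operands.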

Is there any legal way to create IMAP folder with hierarchy separator character in the name?

In the IMAP protocol there is a folder hierarchy character. If you try to create folder with such character in the name, mailserver will create two folders. For example, if a delimiter character is "/", then command CREATE "aaa/bbb" will create two folders aaa and bbb in folder aaa.
Is it possible to create single folder with delimiter character inside? For example, the single folder with the name aaa/bbb, without aaa and bbb in aaa folder.
@Pawel - tried your trick, didn't work with dovecot. Resorted to reading the RFC.
The correct way is to create a folder with a trailing /
Here's an example (taken direct from a manual IMAP session):
[root@mailer-daemon ~]# telnet localhost imap
Trying 127.0.0.1...
Connected to mailer-daemon.co (127.0.0.1).
Escape character is '^]'.
* OK Dovecot ready.
A1 LOGIN user password
A1 OK Logged in.
A1 LIST "" *
* LIST (\NoInferiors \UnMarked) "/" "Drafts"
* LIST (\NoInferiors \UnMarked) "/" "Deleted Messages"
* LIST (\NoInferiors \UnMarked) "/" "INBOX"
A1 OK List completed.
A1 CREATE test/
A1 OK Create completed.
A1 CREATE test/case
A1 OK Create completed.
A1 LIST "" test*
* LIST (\Noselect \HasChildren) "/" "test"
* LIST (\NoInferiors \UnMarked) "/" "test/case"
A1 OK List completed.
And here's the RFC saying the same thing
If the mailbox name is suffixed with the server's hierarchy
separator character (as returned from the server by a LIST
command), this is a declaration that the client intends to create
mailbox names under this name in the hierarchy. Server
implementations that do not require this declaration MUST ignore
the declaration. In any case, the name created is without the
trailing hierarchy delimiter.
If the server's hierarchy separator character appears elsewhere in
the name, the server SHOULD create any superior hierarchical names
that are needed for the CREATE command to be successfully
completed. In other words, an attempt to create "foo/bar/zap" on
a server in which "/" is the hierarchy separator character SHOULD
create foo/ and foo/bar/ if they do not already exist.
If a new mailbox is created with the same name as a mailbox which
was deleted, its unique identifiers MUST be greater than any
unique identifiers used in the previous incarnation of the mailbox
UNLESS the new incarnation has a different unique identifier
validity value. See the description of the UID command for more
detail.
I think this is the money shot:
Example: C: A003 CREATE owatagusiam/
S: A003 OK CREATE completed
C: A004 CREATE owatagusiam/blurdybloop
S: A004 OK CREATE completed
Note: The interpretation of this example depends on whether
"/" was returned as the hierarchy separator from LIST. If
"/" is the hierarchy separator, a new level of hierarchy
named "owatagusiam" with a member called "blurdybloop" is
created. Otherwise, two mailboxes at the same hierarchy
level are created.
You may try UTF7 encoding:
CREATE "one&AC8-two"
But RFC says:
If the server's hierarchy separator
character appears elsewhere in the
name, the server SHOULD create any
superior hierarchical names that are
needed for the CREATE command to be
successfully completed. In other
words, an attempt to create
"foo/bar/zap" on a server in which "/"
is the hierarchy separator character
SHOULD create foo/ and foo/bar/ if
they do not already exist.
http://www.faqs.org/rfcs/rfc3501.html
Strictly speaking, no, there is no way officially allowed by the protocol.
The accepted answer violates the protocol:
In modified UTF-7, printable US-ASCII characters, except for "&", represent themselves; that is, characters with octet values 0x20-0x25 and 0x27-0x7e. […] Modified BASE64 MUST NOT be used to represent any printing US-ASCII character which can represent itself.
— 5.1.3. Mailbox International Naming Convention
The hack only works because most servers aren't strict about the above rule.
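To see what the quoted rule forbids, you can decode the base64 runs of a mailbox name and look for printable ASCII hiding inside them. The function below is a hypothetical checker written for this answer (not part of any IMAP library), and it assumes the name is otherwise well-formed modified UTF-7.

```python
import base64
import re


def overencoded_ascii(name: str) -> list:
    """Return printable US-ASCII characters that were base64-encoded in the name."""
    found = []
    # Each "&...-" run (other than the literal "&-") is modified base64 of UTF-16BE.
    for chunk in re.findall(r"&([A-Za-z0-9+,]+)-", name):
        padded = chunk + "=" * (-len(chunk) % 4)
        text = base64.b64decode(padded, altchars=b"+,").decode("utf-16-be")
        found.extend(c for c in text if 0x20 <= ord(c) <= 0x7E)
    return found


print(overencoded_ascii("one&AC8-two"))               # ['/']
print(overencoded_ascii("[Gmail]/&BB8EMAQ,BDoEMA-"))  # []
```

A strict server could run a check like this and reject "one&AC8-two", which is why the trick cannot be relied on.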
