What encoding is this text in? - character-encoding

When I search Google in Thai, Google converts the query into something like this:
%E0%B8%A0%E0%B8%B2%E0%B8%A9%E0%B8%B2%E0%B9%84%E0%B8%97%E0%B8%A2

URL Encoding: See http://www.w3schools.com/tags/ref_urlencode.asp
It's a URL encoding in which all non-alphanumeric characters except -_. are replaced with a percent (%) sign followed by two hex digits and spaces encoded as plus (+) signs. It is encoded the same way that the posted data from a WWW form is encoded, that is the same way as in application/x-www-form-urlencoded media type.
(Information copied from http://php.net/manual/en/function.urlencode.php)

UTF-8 + URL Encoding.
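As a minimal sketch of what "UTF-8 + URL encoding" means for the string in the question, the snippet below reverses both layers with Python's standard urllib.parse module. Python is only an illustrative choice here; the question does not name a language.

```python
from urllib.parse import quote, unquote

encoded = "%E0%B8%A0%E0%B8%B2%E0%B8%A9%E0%B8%B2%E0%B9%84%E0%B8%97%E0%B8%A2"

# unquote() turns each %hh escape into one byte, then decodes the
# resulting byte string as UTF-8 (its default encoding).
decoded = unquote(encoded)
print(decoded)  # ภาษาไทย

# The reverse direction: encode the text as UTF-8, then percent-encode each byte.
reencoded = quote(decoded, safe="")
print(reencoded == encoded)  # True
```

Each Thai character takes three UTF-8 bytes, which is why every character shows up as three %hh groups.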

Related

What scheme is used to encode Unicode characters in a .url shortcut?

What scheme is used to encode Unicode characters in a Windows URL shortcut?
For example, a new shortcut for the URL "http://Ψαℕ℧▶" produces a .url file with the text:
[{000214A0-0000-0000-C000-000000000046}]
Prop3=19,2
[InternetShortcut]
IDList=
URL=http://?aN??/
[InternetShortcut.A]
URL=http://?aN??/
[InternetShortcut.W]
URL=http://+A6gDsSEVIScltg-/
What is the algorithm to decode "+A6gDsSEVIScltg-" to "Ψαℕ℧▶"?
I am not asking for API code, but I would like to know the encoding scheme details.
Note: The encoding scheme is not UTF-8, UTF-16, or UCS-2, and it is not percent-encoding.
+A6gDsSEVIScltg- is the UTF-7 encoded form of Ψαℕ℧▶.
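This can be checked quickly with Python's built-in utf-7 codec. The snippet is only a verification sketch, not the way Windows itself reads the file:

```python
encoded = "+A6gDsSEVIScltg-"

# The value is pure ASCII, so turn it into bytes and decode those as UTF-7.
decoded = encoded.encode("ascii").decode("utf-7")
print(decoded)  # Ψαℕ℧▶

# The code points behind the five characters.
code_points = [f"U+{ord(c):04X}" for c in decoded]
print(code_points)  # ['U+03A8', 'U+03B1', 'U+2115', 'U+2127', 'U+25B6']
```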
The correct way to process a .url file is to use the IUniformResourceLocator and IPropertyStorage interfaces from the CLSID_InternetShortcut COM object. See Internet Shortcuts on MSDN for details.
The answer (UTF-7) allowed me to develop the URL conversion routine successfully.
Let me summarize the steps:
To obtain the Unicode URL from the InternetShortcut.W entry found in a .url file:
. Pass ASCII characters through until CRLF, after making them internet safe.
. An unescaped + character starts a UTF-7 encoded Unicode sequence:
. Collect 6-bit groups from the base64-coded ASCII characters.
. For each 16 bits collected, convert the UTF-16 code unit to UTF-8 (1, 2, or 3 bytes).
. Emit each generated UTF-8 byte as %hh.
. Continue until a "-" character occurs.
. At that point the bit collector should hold only zero bits.
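The steps above can be sketched in Python roughly as follows. The function name shortcut_w_to_percent_url is hypothetical. The sketch assumes the input is a single URL value without the trailing CRLF, and it passes plain ASCII through unchanged instead of "making it internet safe". Rather than collecting bits by hand, it base64-decodes each shifted run and reads the result as big-endian UTF-16, which is equivalent.

```python
import base64
import string
from urllib.parse import quote

# Characters that may appear inside a UTF-7 shifted (base64) run.
BASE64_CHARS = set(string.ascii_letters + string.digits + "+/")


def shortcut_w_to_percent_url(value: str) -> str:
    """Convert a UTF-7 encoded URL value into a percent-encoded UTF-8 URL."""
    out = []
    i, n = 0, len(value)
    while i < n:
        ch = value[i]
        if ch != "+":
            out.append(ch)  # plain ASCII: passed through unchanged
            i += 1
            continue
        # '+' opens a shifted run: gather base64 characters up to the terminator.
        j = i + 1
        while j < n and value[j] in BASE64_CHARS:
            j += 1
        chunk = value[i + 1:j]
        if chunk:
            padded = chunk + "=" * (-len(chunk) % 4)  # base64 needs length % 4 == 0
            raw = base64.b64decode(padded)
            raw = raw[: len(raw) - len(raw) % 2]  # keep whole 16-bit units only
            text = raw.decode("utf-16-be")  # also combines surrogate pairs
            out.append(quote(text, safe=""))  # UTF-8 bytes written as %hh
        else:
            out.append("+")  # "+-" is UTF-7 for a literal plus sign
        if j < n and value[j] == "-":
            j += 1  # the closing '-' is consumed, not copied
        i = j
    return "".join(out)


print(shortcut_w_to_percent_url("http://+A6gDsSEVIScltg-/"))
# http://%CE%A8%CE%B1%E2%84%95%E2%84%A7%E2%96%B6/
```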

append space in URL in form of variable

I have a URL, say:
WWW.XYZ.COM
I have a variable in the backend that contains a space, and I want to append it to that URL,
e.g. www.xyz.com/variable or www.xyz.com/stack over
But the URL is not accepted. How can I do this?
A typical URL embeds %20 in place of a space.
That means your URL www.xyz.com/stack over will be treated as www.xyz.com/stack%20over. So one solution is to write a function that replaces every space in the value retrieved from the backend with %20 before building the URL, and to link to the pages using the %20 form.
You have to use a URL-encoding function to replace the characters that are not allowed in a URL with their escaped equivalents. For example, urlencode in PHP.
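For illustration, here is a minimal Python sketch of the same idea. The question does not say which backend language is used, so Python and the variable names are assumptions. urllib.parse.quote escapes the space, and any other character that is unsafe in a path segment, before the value is appended:

```python
from urllib.parse import quote

base = "http://www.xyz.com/"
variable = "stack over"  # value coming from the backend

# safe="" also escapes "/" so the value stays a single path segment.
url = base + quote(variable, safe="")
print(url)  # http://www.xyz.com/stack%20over
```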
For example, take this Stack Overflow URL:
"stackoverflow.com/questions/51166674/append-space-in-url-in-form-of-variable"
Replace each "-" in it with a space.
Press Enter.
You will see the same page with the URL
"stackoverflow.com/questions/51166674/append%20space%20in%20url%20in%20form%20of%20variable"
So just use %20 instead of a space. I hope this is helpful.
Thank you
Theory:
From HTML URL Encoding Reference
URLs can only be sent over the Internet using the ASCII character-set.
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format.
URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits.
URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.
In other words, any character ('a', 'b', ' ', '_', ...) can be replaced with a % followed by the hexadecimal value of its ASCII code.
For example, the space character (ASCII 0x20) is written as %20.
Example: When you send the attribute "text" containing "Hello World" through a form or URL, the web server will receive the input "text=Hello%20World" or, less frequently, "text=Hello+World".
Your example: So, your URL www.xyz.com/stack over will usually be represented as www.xyz.com/stack%20over
Reserved characters
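The two spellings of the space in a query string can be reproduced with Python's urllib.parse. This is only a sketch to show that both forms decode to the same value:

```python
from urllib.parse import parse_qs, quote, urlencode

params = {"text": "Hello World"}

# urlencode() defaults to form-style encoding, where a space becomes "+".
form_style = urlencode(params)
print(form_style)  # text=Hello+World

# Passing quote_via=quote produces the %20 form instead.
percent_style = urlencode(params, quote_via=quote)
print(percent_style)  # text=Hello%20World

# Both spellings decode to the same value on the receiving side.
print(parse_qs(form_style) == parse_qs(percent_style))  # True
```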
= | ; | / | # | ? | : | space are reserved characters. RFC 1630

percent encoding - how is a greek letter percent encoded?

I've been browsing the web for an answer but cannot find one. I have an HTML form (method=GET) and submit the text helloΩ (hello with the Greek letter Omega appended) in a text field.
The URL in the browser encodes it as:
mytext=hello%26%23937%3B
Without the greek letter Omega appended, I get (as expected):
mytext=hello
So how is the Greek letter Omega percent-encoded into:
%26%23937%3B
Thanks
This happens when your web server declares an encoding that doesn't support the character. For example, ISO-8859-1, which is the default encoding for many web servers, doesn't support it.
That is an HTML numeric character reference, &#937;, percent-encoded: %26 is &, %23 is #, and %3B is ;. Because &, #, the digits, and ; are all ASCII characters, this is the only way not to lose information when the browser thinks the server only supports ISO-8859-1.
To fix this, declare UTF-8 in your http header:
Content-Type: text/html; charset=utf-8
This isn't even consistent behavior between browsers, because IE encodes it as hello%D9, which is Ù in CP1252/ISO-8859-1.
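To make the two layers visible, here is a short Python sketch that undoes the percent-encoding and then the HTML character reference. It also shows what the same text looks like when it is percent-encoded as UTF-8 directly:

```python
import html
from urllib.parse import quote, unquote

submitted = "hello%26%23937%3B"

# Layer 1: percent-decoding yields an HTML numeric character reference.
reference = unquote(submitted)
print(reference)  # hello&#937;

# Layer 2: resolving the reference yields the original text (937 == 0x3A9).
text = html.unescape(reference)
print(text)  # helloΩ

# For comparison, the same text percent-encoded as UTF-8.
utf8_form = quote(text)
print(utf8_form)  # hello%CE%A9
```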

Is % percentage a valid url character

I am trying to put a URN, something like urn:test.project:123, into a URL as one of its path segments.
Does it make sense to encode urn:test.project:123 as urn%3atest.project%3a123 and decode it back to urn:test.project:123 at the receiving end?
http://{domain}/abc/urn%3atest.project%3a123/Manifest
Yes, it's a valid character. It's the escape character for URLs in a similar way to how the ampersand & is the escape character for xml/html, and the backslash \ is the escape character for string literals in c-like languages. It's the (very important) character that allows you to specify (through an escape sequence) all the other characters that wouldn't be valid in a URL.
(And yes, it makes sense to encode such a string so it's a legal URL, and as @PaulPRO mentions, most frameworks will automatically decode it for you on the server side.)
Yes, %3a is an escape in which 3a is the hexadecimal value of ':'.
If you put it in the URL as %3a, your server will most likely decode it automatically.
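A minimal round-trip sketch in Python follows. The host example.com is a placeholder for the {domain} in the question, and the variable names are illustrative:

```python
from urllib.parse import quote, unquote

urn = "urn:test.project:123"

# safe="" makes quote() escape ":" (and "/") as well.
segment = quote(urn, safe="")
print(segment)  # urn%3Atest.project%3A123

url = f"http://example.com/abc/{segment}/Manifest"
print(url)

# Hex digits in an escape are case-insensitive, so %3a and %3A decode alike.
restored = unquote("urn%3atest.project%3a123")
print(restored == urn)  # True
```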

Interesting Encoding

I have an interesting problem with the social network http://www.odnoklassniki.ru/.
When I use advanced search, my Cyrillic characters are encoded as symbols I cannot make sense of.
For example:
Иван Иванов is encoded as %25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2+%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2
Any ideas?
It's a double URL-encoded string. The %25 sequences represent the percent sign. Decoding once gives %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2+%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2.
Decoding again gives the UTF-8 string иванов иванов.
That's URL encoding, also called percent-encoding. Each percent sign starts an escape and is followed by two hex digits that give one byte of the character. The + is a space.
See: http://en.wikipedia.org/wiki/Percent-encoding
Well, it appears to be twice URL encoded. If we unwrap it once, we get
%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2 %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2
and again, we get
иванов иванов
This appears to be UTF-8 with the bytes encoded separately.
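The two decoding passes can be reproduced with Python's urllib.parse. This is a sketch in which the variable names are illustrative and the second pass also turns "+" into a space:

```python
from urllib.parse import unquote, unquote_plus

word = "%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2"
double_encoded = word + "+" + word

# Pass 1: every %25 becomes a literal "%", leaving ordinary percent-escapes.
once = unquote(double_encoded)
print(once)
# %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2+%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2

# Pass 2: the escapes become UTF-8 bytes and "+" becomes a space.
twice = unquote_plus(once)
print(twice)  # иванов иванов
```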

Resources