I have collected some data that is URL-encoded and contains Turkish letters.
But which Turkish letter is %263287 or %23305?
Can someone give me a list of some sort? I could not find the corresponding letters on the internet.
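I found the solution: a Turkish letter such as İ is transferred to %26%23304%3B (the percent-encoded form of the HTML character reference &#304;). My decoding program only decodes %26 and %3B, so I get &%23304;, which looks pretty messed up.
Following this logic, I have to percent-decode the whole sequence first and only then resolve the character reference to get the Turkish letter.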
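The two decoding steps can be sketched in Python. This is only an illustration using the standard library; the helper name decode_url_entity is made up for this example, and the sample strings are built from the decimal code points mentioned in the question.

```python
import html
from urllib.parse import unquote


def decode_url_entity(encoded: str) -> str:
    """Percent-decode first, then resolve HTML character references."""
    percent_decoded = unquote(encoded)      # e.g. "%26%23287%3B" becomes "&#287;"
    return html.unescape(percent_decoded)   # "&#287;" becomes the character itself


# Sample inputs assembled from the code points 287, 304 and 305.
for sample in ("%26%23287%3B", "%26%23304%3B", "%26%23305%3B"):
    print(sample, "->", decode_url_entity(sample))
```

Doing the steps in this order avoids the half-decoded &%23304; form, because %23 (the # sign) is decoded together with %26 and %3B.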
On Wikipedia you see URLs like these:
https://zh.wiktionary.org/wiki/附录:字母索引 (but copy-pasting the URL results in the equivalent https://zh.wiktionary.org/wiki/%E9%99%84%E5%BD%95:%E5%AD%97%E6%AF%8D%E7%B4%A2%E5%BC%95).
https://th.wiktionary.org/wiki/หน้าหลัก (which when copy-pasted becomes https://th.wiktionary.org/wiki/%E0%B8%AB%E0%B8%99%E0%B9%89%E0%B8%B2%E0%B8%AB%E0%B8%A5%E0%B8%B1%E0%B8%81).
First, I'm wondering what is happening here, what the encoding transformation is called and what it's doing and why it's doing that. I don't see why you can't just have the original native characters in the URL.
Second, I'm wondering if what Wikipedia is doing is considered valid. If it is okay to include these non-ASCII glyphs in the URL, and if not, why not (other than perhaps because the standard says so). Also would be interested to know how many browsers support showing the link in the URL bar using the native glyphs vs. this encoded thing, and even would be interesting to know how native Chinese/Thai/etc. people enter in the URL in their language, if they use the encoding or what (but that probably makes this question too complicated; still would be an interesting bonus).
The reason I ask is that I would like to put, say, words/definitions from a few different languages onto a webpage, and I would like the URL to show the actual word used in the language. So in English it might be /hello, but the equivalent word/definition in Thai would be /สวัสดี. That makes way more sense to me than having to turn it into the encoded form.
From https://en.wikipedia.org/wiki/Uniform_Resource_Identifier:
Strings of data octets within a URI are represented as characters. Permitted characters within a URI are the ASCII characters for the lowercase and uppercase letters of the modern English alphabet, the Arabic numerals, hyphen, period, underscore, and tilde. Octets represented by any other character must be percent-encoded.
Non-ASCII characters cannot appear literally in a URI, but they can still be represented with percent-encoding: the character is converted to bytes (UTF-8 in the Wikipedia examples above) and each byte is written as %XX. You see the non-ASCII characters in the URL field because your browser chooses to display them that way; the actual HTTP requests are made using the encoded strings.
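As a rough illustration of that transformation, Python's standard urllib.parse module can percent-encode and decode a path. The path below is taken from the Thai Wiktionary URL in the question; quote() uses UTF-8 and leaves "/" unescaped by default.

```python
from urllib.parse import quote, unquote

path = "/wiki/หน้าหลัก"

# Each non-ASCII character is turned into its UTF-8 bytes, and every byte
# is written as %XX. ASCII letters and "/" pass through unchanged.
encoded = quote(path)
decoded = unquote(encoded)

print(encoded)
print(decoded == path)
```

A web framework normally does the decoding for you, so a route such as /สวัสดี can be displayed with native characters in the address bar while the encoded form travels over the wire.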
I'm trying to reproduce a character conversion, essentially starting from the Chinese word for Login.
In this example, the Chinese word for Login, "登录", should be converted into the text "µÇ¼".
It would be nice if there was a piece of software that did this for me already...
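One possible explanation, and it is only an assumption, is that the target text is what you get when the GBK bytes of the Chinese word are read as Latin-1. A minimal Python sketch under that assumption follows; the function name misread and the choice of GBK and Latin-1 are guesses, not something stated in the question, and the exact output may differ from the string quoted above if a different code page is involved.

```python
def misread(text: str, actual: str = "gbk", assumed: str = "latin-1") -> str:
    """Encode text with its real encoding, then decode the bytes with the wrong one."""
    return text.encode(actual).decode(assumed)


garbled = misread("登录")
print(garbled)

# The conversion is reversible as long as the wrong codec maps every byte.
restored = garbled.encode("latin-1").decode("gbk")
print(restored)
```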
I just decompiled a file with luadec. It works well and, although the output is not perfect, it is still usable. However, I'm getting a weird string of numbers, \198\247\184\181\188\177\177\219\183\161\189\186, that I know for a fact is Korean text, but I don't know what this encoding is called and can't find anything about it.
I just need to correctly translate the string from numbers to symbols or gibberish text, like this c±Ý»ö´À³¦Ç¥.
If someone could point me in the right direction I would be grateful, thanks.
I ran this script with Lua
print"\198\247\184\181\188\177\177\219\183\161\189\186"
and saved the output to a text file which I then loaded into Safari.
I got gibberish with the default encoding. I got 포링선글래스 with the Korean (Mac OS) encoding. Same thing with Korean (Windows, DOS), but not with Korean (ISO 2022-KR).
Note that escaped numbers in Lua are in decimal.
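The same check can be done without a browser. The sketch below, in Python rather than Lua, turns the decimal escapes into raw bytes and decodes them with two Korean codecs; treating euc-kr and cp949 as the counterparts of the Safari menu entries is an assumption on my part.

```python
import re

# The Lua string literal, kept raw so the backslashes survive.
lua_literal = r"\198\247\184\181\188\177\177\219\183\161\189\186"

# Lua escapes are decimal, so each \ddd is one byte value.
raw = bytes(int(number) for number in re.findall(r"\\(\d{1,3})", lua_literal))
print(raw.hex(" "))

for codec in ("euc-kr", "cp949"):
    print(codec, raw.decode(codec))
```

If the decoded text matches what the Korean encodings showed in Safari, the numbers are simply EUC-KR style double-byte characters written out byte by byte.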
I've had this problem for a long time but I've been implementing this ugly hack on the backend to get around it.
Now I've decided to act as a real developer and deal with it.
My problem is that when parsing an XML feed with any of the Norwegian characters æ, ø or å in the title node, all the letters appearing before these special characters are omitted.
So if the word is "Bålhuset", it only displays "ålhuset". The funny thing is that æ, ø and å characters AFTER the initial problem character are included.
So if I put, for example, "ÅBålhuset", I will get "Bålhuset". So it seems it's only the first occurrence of one of these special characters that causes a problem.
Any help would be immensely appreciated!
-Chris
When you create the XML, try wrapping the text in CDATA tags, like this:
<title><![CDATA[Transport "Bålhuset"Classic World's]]></title>
XML with those characters may be rejected as invalid unless they are contained within a CDATA section.
Otherwise you need to use the special character code. For example, to represent ö you need to type &#246;.
You can validate your XML while creating it and easily fix the bug.
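A quick way to validate a feed is to run it through any XML parser. The sketch below uses Python's xml.etree.ElementTree purely as an illustration; the item/title structure and the helper name title_of are made up for this example. It suggests that Norwegian characters parse fine with or without CDATA when the bytes match the declared encoding, and that a mismatch is reported as a parse error.

```python
import xml.etree.ElementTree as ET


def title_of(xml_bytes: bytes) -> str:
    """Parse a feed item and return the text of its <title> node."""
    return ET.fromstring(xml_bytes).findtext("title")


declaration = '<?xml version="1.0" encoding="UTF-8"?>'
plain = declaration + "<item><title>Bålhuset</title></item>"
cdata = declaration + '<item><title><![CDATA[Transport "Bålhuset"]]></title></item>'

# Bytes match the declared encoding: both variants keep the whole title.
print(title_of(plain.encode("utf-8")))
print(title_of(cdata.encode("utf-8")))

# Bytes are ISO-8859-1 but the declaration says UTF-8: the parser rejects it.
try:
    title_of(plain.encode("iso-8859-1"))
except ET.ParseError as err:
    print("not well-formed:", err)
```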
What did it for me was getting the data in JSON and using the native JSON methods; no dropped characters and other sporadic behaviour.
So what that means to me is that there is an issue with NSXMLParser that makes it choke on international characters (only on the first occurrence, mind you), even though everything is in order with the encoding etc.
How can I output whatever æ would be, if ø = Ã¸?
I'm guessing the left side is Unicode and the right side is something else, for example ISO-8859-1, but how can I print out what a Unicode character would be when messed up?
Backstory: I have a bit of a strange problem here with Steam messing up character encodings. I'm trying to help a friend recover their account, and I think they used the letter æ in their secret answer. The dialog for resetting the password doesn't accept that letter, and it says the answer is wrong if we try natural alternatives. In the recovery email I get, the letter ø shows up as Ã¸ in the secret question. So I'm thinking that when the answer and question were created, the letter æ was accepted but messed up. I figured I could try the messed-up equivalent, but I don't know what that would be, and my programming skills fail me in finding it myself :p
In Python, you can encode the string to a byte-string in UTF-8, and then convert the byte-string to a (text) string using iso-8859-1. The result will be the desired mojibake.
In Python 3:
>>> 'æ'
'æ'
>>> 'æ'.encode('utf8')
b'\xc3\xa6'
>>> 'æ'.encode('utf8').decode('iso-8859-1')
'Ã¦'
In Python 2, use u'æ' instead of 'æ'.
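Since it is not known which wrong encoding Steam applied, it may help to print a few candidates at once. This is a small sketch along the same lines; the function name mojibake_candidates and the list of codecs to try are my own guesses, not anything Steam documents.

```python
def mojibake_candidates(text, wrong_codecs=("iso-8859-1", "cp1252", "cp437")):
    """Return the UTF-8 bytes of text decoded with each wrong codec."""
    raw = text.encode("utf-8")
    # errors="replace" keeps going when a codec has no character for a byte.
    return {codec: raw.decode(codec, errors="replace") for codec in wrong_codecs}


for letter in "æøå":
    print(letter, mojibake_candidates(letter))
```

Each printed value is a candidate to type into the secret answer field in place of the original letter.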