Why won't WSO2 Registry store my Chinese characters? - character-encoding

I ran a WSO2 Registry server to maintain our configurations. It played its role perfectly, until one day I put some Chinese characters into resource content.
I created a resource with 'Create Text Content' and entered Chinese characters into both the 'Description' and 'Content' fields. When I opened the resource again, the 'Description' field was still in Chinese, but the 'Content' text had become a sequence of '?' (question marks, one per Chinese character).
Why does this happen, and how can I prevent it?

You should start the Governance Registry with the "carbon.registry.character.encoding" system property.
Please see more details in documentation [1].
[1] http://docs.wso2.org/wiki/display/Governance450/Supported+System+Properties
For example:
sh wso2server.sh -Dcarbon.registry.character.encoding={chinese encoding type}
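The one-question-mark-per-character pattern is what a lossy encode typically produces. Here is a minimal Python sketch of the symptom. It assumes (the original post does not confirm this) that the content was passed through a single-byte charset such as ISO-8859-1 on its way to storage; the helper name `store_with` is made up for illustration.

```python
# Assumption: somewhere on the way to storage the text is encoded with a
# charset that cannot represent Chinese. ISO-8859-1 is used here as an example.
text = "中文配置"


def store_with(value: str, charset: str) -> str:
    """Encode the text as it would be written, then decode it as it would be read back."""
    return value.encode(charset, errors="replace").decode(charset)


lossy = store_with(text, "iso-8859-1")  # every unencodable character becomes '?'
safe = store_with(text, "utf-8")        # UTF-8 can represent every character

print(lossy)
print(safe)
```

If the round trip through UTF-8 preserves the text while the other charset does not, the encoding setting is the likely culprit.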

Related

How to differentiate a link from text?

I have lines of text and I need to find out whether they contain a link. How can I do it? First I thought of looking for www in the text, but some links don't have www. Then I thought of looking for http, but again, not all links contain http. What should I do?
Here is a regexp adapted from the http://mathiasbynens.be/demo/url-regex entry by @diegoperini (Ruby syntax; you might need to change some details, such as the Unicode \uXXXX escapes, to whatever your system uses):
(?:(?:https?|ftp):\/\/)?(?:\S+(?::\S*)?@)?(?:(?!10(?:\.\d{1,3}){3})(?!127(?:\.\d{1,3}){3})(?!169\.254(?:\.\d{1,3}){2})(?!192\.168(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]+-?)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/[^\s]*)?
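To show how such a pattern would be used, here is a much simplified Python sketch of the same idea: the scheme is optional, and a dotted host name ending in a letters-only top-level label is what marks a link. It is not the full expression above. The names `LINK_RE` and `contains_link` are made up for illustration, and the pattern accepts only ASCII host names, so it will miss internationalized domains and can give false positives on things like file names.

```python
import re

# Simplified link pattern (an illustration, not the full regexp):
LINK_RE = re.compile(
    r"(?:(?:https?|ftp)://)?"   # optional scheme
    r"(?:[a-z0-9-]+\.)+"        # one or more labels, each followed by a dot
    r"[a-z]{2,}"                # top-level label: at least two letters
    r"(?::\d{2,5})?"            # optional port
    r"(?:/\S*)?",               # optional path
    re.IGNORECASE,
)


def contains_link(line: str) -> bool:
    """Return True if the line contains something that looks like a link."""
    return LINK_RE.search(line) is not None


for sample in ("see example.com/page for details",
               "visit http://example.org now",
               "no links here"):
    print(sample, "->", contains_link(sample))
```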

Why is my Rails app on Heroku sometimes displaying apostrophes as HTML entities?

Sometimes we have jobs whose name has an apostrophe in it. I always want those apostrophes to display as ' , never as their HTML entity (&#39;).
The apostrophe displays correctly on most pages most of the time. But in some instances, the apostrophe displays as the HTML entity instead. Here's a screen-capture showing the behavior:
The apostrophes in the "Notes" field (a textarea) display correctly, but not in the "Job name" and "Display as" fields. Luckily, the apostrophes display correctly on the public-facing side, and I only see this behavior on the admin-side.
This sentence is going to sound insane, but stick with me, here: When I look at a page's source code where this problem occurs, it looks like the leading ampersand in the apostrophe's HTML entity is being replaced with the HTML entity for ampersand, thus becoming &amp;#39;
Here's a gist of the form's code.
When I look up this job's record in console, the job name and display name are "Job's Got An Apostrophe", so I know (think?) I'm not storing the HTML entity in my database. My database.yml specifies unicode encoding. It's a PostgreSQL 9.2.7 database. Not sure what other information is needed to help resolve this, if any.
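The symptom described, an entity whose leading ampersand has itself been turned into an entity, is what escaping an already-escaped string produces. The following Python sketch only illustrates that mechanism. It is not the Rails code in question, and Python's `html.escape` writes the apostrophe as `&#x27;`, which may differ from the entity Rails emits.

```python
import html

name = "Job's Got An Apostrophe"

once = html.escape(name)    # escaped exactly once: a browser shows an apostrophe
twice = html.escape(once)   # escaped again: the '&' of the entity is escaped too

print(once)
print(twice)

# Unescaping the double-escaped text once (which is what a browser does)
# leaves the entity visible as literal text.
print(html.unescape(twice))
```

If something similar is happening here, the thing to look for is a value that is escaped once explicitly and then escaped a second time by the view or form helper.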

How to transform encoded URL to readable texts?

It's about Bangla Unicode texts, but can be a problem for any language other than Latin glyphs.
I host a Bangla blog with all its texts and categories in Bangla (I prefer not to say Bengali, because the name of the language is Bangla rather than Bengali).
So the Bangla category "বাংলা" has a URL like:
http://www.example.com/category/বাংলা
But whenever I copy the URL from the address bar and paste it into a chat panel or somewhere else, it turns into strange characters, for example:
http://www.example.com/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0*
* (it's just an example, not the exact encoding of the word "বাংলা")
So in many cases I get encoded URLs like the one above, with no way to tell which Unicode text they stand for. Recently I've been getting 404 errors logged by one of my plugins. There I found a URI like:
/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0
I used Jetpack's Omnisearch to look for a match, but the result is empty. I can't even tell which category is causing such a 404.
So here comes the question:
How can I transform the encoded URL to readable glyphs?
http://www.example.com/category/বাংলা
isn't a URL; URLs can only contain ASCII characters. This is an IRI.
http://www.example.com/category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE
is the URI representation of that IRI. They are otherwise equivalent. A browser may display the ‘pretty’ IRI version in the user interface, but put the URI version on the clipboard so that you can paste it into other tools that don't support IRI.
The 404 address you pasted translates to:
/category/স্নায়ুবিদ্য�
where the last character is a � because it is an invalid, truncated UTF-8 sequence. (This is probably why the request failed.) Someone may have mis-pasted a partial URI here.
If you're using JavaScript you can do:
decodeURIComponent(url);
This turns the percent-escapes back into the original characters. Note that it throws a URIError on a malformed sequence, such as the truncated URI above.
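The same decoding can be done outside the browser. Here is a minimal Python sketch using the standard library's `urllib.parse.unquote`; the helper name `decode_uri` and the shortened `truncated` sample are made up for illustration. With `errors="replace"`, an incomplete UTF-8 sequence becomes U+FFFD (the � character) instead of raising an error, which makes truncated log entries easy to spot.

```python
from urllib.parse import unquote


def decode_uri(uri: str) -> str:
    """Turn percent-escapes back into readable text, assuming UTF-8 bytes."""
    return unquote(uri, encoding="utf-8", errors="replace")


complete = "/category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE"
truncated = "/category/%E0%A6%AC%E0"  # cut off in the middle of a character

print(decode_uri(complete))
print(decode_uri(truncated))
```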

Copying a UTF-8 URL from browser's address bar, gives only the ugly encoded one [closed]

When I copy a UTF-8 URL from the browser's address bar (almost any browser on any OS) and then try to paste it into another text field (to post it on Facebook or Twitter, for example), it gives only the encoded URL, which makes it ugly. For example, in the address bar, the URL appears like this one:
https://www.chaino.com/وذكر
But when trying to copy and paste it in any other place, it gives the following ugly url:
https://www.chaino.com/%D9%88%D8%B0%D9%83%D8%B1
And whenever I want the original URL back, I have to decode it with the Raw URL Decoder online tool.
The question is: is there a short, direct way to copy this kind of URL and paste it without this hideous process? (Maybe using Chrome extensions or something.)
You can add a 'space' at the end of the URL in the address bar, then you can select it all and copy it directly.
You can select URL without selecting scheme (e.g. http://), and copy it. This will give you what you expected.
P.S. The point is to select only part of the link. E.g. you can select the whole URL without its first character and then add that character manually.
In Firefox 53+ you can set the browser.urlbar.decodeURLsOnCopy option in about:config to true.
The URI you get by copying from the address bar is the only valid URI the browser can give you.
From the RFC 3986 (and other URL RFCs):
A URI is a sequence of characters from a very limited set: the
letters of the basic Latin alphabet, digits, and a few special
characters.
So: https://www.chaino.com/وذكر
Is an invalid URI, yet a valid IRI (Internationalized Resource Identifier), which your browser converts to a valid URI when it requests the server over HTTP (HTTP does not allow IRIs, only URIs).
TL;DR: Your browser is giving you what you should expect: a valid URI that you can use everywhere, not an IRI that is only supported here and there.
PS If "facebook or twitter for example" are kind, they may display a readable form to their users, so don't worry about giving an encoded form.
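The IRI-to-URI step the browser performs can be approximated in a few lines. This is a rough Python sketch, not the full mapping defined for IRIs: the function name `iri_to_uri` is made up for illustration, only the path and query are percent-encoded as UTF-8, existing `%` escapes are left untouched, and a non-ASCII host name is not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri: str) -> str:
    """Percent-encode the non-ASCII characters in the path and query of an IRI."""
    parts = urlsplit(iri)
    # '%' is kept safe so that already-encoded input is not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(iri_to_uri("https://www.chaino.com/وذكر"))
```

The result is the ASCII-only form that the browser puts on the clipboard.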
You can use Chrome Extensions like below:
https://chrome.google.com/webstore/detail/copy-unicode-urls/fnbbfiapefhkicjhecnoepbijhanpkjp
https://chrome.google.com/webstore/detail/copy-cyrilic-urls/alnknpnpinldhpkjkgobkalmoaeolhnf
Create a bookmark with this url: javascript:console.log(prompt('copy (Control+C) this link:', decodeURIComponent(window.location))).
Click this bookmark on that page.
Example page: https://www.google.com.hk/search?q=中文
The best answer I have found so far is this Chrome extension:
https://chrome.google.com/webstore/detail/copy-cyrilic-urls/alnknpnpinldhpkjkgobkalmoaeolhnf?hl=en-US
which enables me to copy the url (in a decoded state) with only one click :)
You can use the Chrome and Firefox extension called "Copy Unicode URLs", which I created.
It is:
Open source.
Gives you an option to leave URL terminators encoded so, e.g., links that end with a dot will have that dot encoded and email clients won't wrongly recognize this dot as a sentence/URL terminator.
If you love my work then, please, donate some sum here.
Copy the address without the 'h' in http.
Then paste it and add the 'h' back at the front.

Domain Name with unicode Pitfalls

Yahoo and stackoverflow.com both advise serving static content from a separate domain that you don't set cookies on. http://developer.yahoo.com/performance/rules.html#cookie_free http://sstatic.net
Based on the desire for a static-only domain name, I thought it would be cool to have the domain name made up of Unicode characters. From what I understand, the pitfalls of Unicode characters include difficulty of typing and automatic Punycode conversion due to the paypal.com incident.
For example if I wanted to link my stylesheet.
<link rel=stylesheet href=☺.com/s.css>
....
<script src=☺.com/s.js></script>
Considering I only plan to link to static content, are there any issues or pitfalls?
Do all browsers natively support Unicode -> Punycode conversion? It has been unclear to me whether Internet Explorer versions below 7 support Punycode. Also, would IE display a notice if you are simply linking to server content in Unicode format?
Bonus: Is there any place to find a list of Unicode characters that are legal in a URL? Supposedly some characters aren't permitted?! Or would a URL containing non-permitted characters simply be translated to Punycode immediately, therefore not affecting my situation?
Read http://www.ietf.org/rfc/rfc3454.txt.
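To see what the Unicode to Punycode conversion produces for the host name in the question, here is a small Python sketch using the built-in `idna` codec. The helper name `to_ascii_host` is made up for illustration. The codec follows the older IDNA 2003 rules (RFC 3490, with the string preparation from RFC 3454), so a registry or browser that applies newer rules may reject names that this codec accepts.

```python
def to_ascii_host(host: str) -> str:
    """Convert a Unicode host name to its ASCII (Punycode) form, label by label."""
    return host.encode("idna").decode("ascii")


ascii_host = to_ascii_host("☺.com")
print(ascii_host)

# The conversion is reversible, so the ASCII form can be turned back for display.
print(ascii_host.encode("ascii").decode("idna"))
```

Linking to the ASCII form directly in the markup avoids depending on each browser's own conversion.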
