Non UTF-8 chars in tweets, how to force them? - twitter

I have weird characters in my tweets, and I've realised it is because the tweets are not UTF-8 encoded the way my HTML page is.
I might be able to use a function (I've seen a couple of references) to force the output to UTF-8, but I am using this plugin (it's on a WordPress platform) and I don't know where the output is coming from, so I don't know what to apply such a function to.
Could anyone please point me in the right direction? Is it $widgetContent from line 516 that I have to "sanitise"? My meta charset is indeed charset="UTF-8".
Thanks.
http://pastebin.com/wKMBSARj
P.S. These ones => ����

Related

Can not decode \\u00e2\\u0080\\u0099 to ’ in iOS

The exact text written in the admin panel is Test’s, and our PHP server uses the utf8_encode() method to encode this text, which results in a response on the mobile end like:
Test\u00e2\u0080\u0099s
How can I decode it back to ’ to display it in the mobile app?
I have tried many of the suggested solutions, including UTF-8 decoding, but nothing works. Please help.
I also tried the solution given in "How to replace the \u00e2\u0080\u0099 this string into ' in iOS", but that solution handles only one specific character and I am looking for a general one. Replacing \u00e2\u0080\u0099 with ’ is only a temporary fix, as it does not cover any other Unicode characters I might get in the response.
As per the OP...
The problem was with the server encoding, and not on the decoding end.
I'm adding this as an answer so other folks don't have to dig through the comments.
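To see why the fix belongs on the server, here is a minimal Ruby sketch of what most likely happened, assuming the admin panel text was already UTF-8 when utf8_encode() was applied. utf8_encode() treats its input as ISO-8859-1, so the sketch imitates it with force_encoding plus encode; the variable names are illustrative only.

```ruby
# Text as stored by the admin panel: already valid UTF-8 (assumption).
original = "Test\u2019s"

# Imitate PHP's utf8_encode(): read every byte as ISO-8859-1, then
# convert to UTF-8. On input that is already UTF-8 this double-encodes.
double_encoded = original.dup
                         .force_encoding(Encoding::ISO_8859_1)
                         .encode(Encoding::UTF_8)

# The three UTF-8 bytes of U+2019 (E2 80 99) are now three code points,
# which a JSON encoder escapes as \u00e2\u0080\u0099.
escaped = double_encoded.codepoints[4, 3].map { |cp| format("\\u%04x", cp) }.join

# Undo the damage: turn each code point back into one byte, then
# reinterpret the bytes as UTF-8.
repaired = double_encoded.encode(Encoding::ISO_8859_1)
                         .force_encoding(Encoding::UTF_8)

puts escaped
puts repaired == original
```

The repair step only works while every character of the damaged string fits in ISO-8859-1, so dropping the extra utf8_encode() call on the server is the more robust fix.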

\u0092 is not printed in UILabel

I have a local JSON file with some descriptions of an app, and I have found weird behaviour when parsing the \u0092 and \u0091 characters.
When the JSON file contains these characters, the corresponding parsed NSString is printed as "?" and in a UILabel the character disappears completely.
Example: "L\u0092H\u00e9r." is shown as "LHér." instead of "L'Hér."
If I replace these characters with \u2019, then I can see the character ’ in the UILabel.
Does anybody have any clue about this?
EDIT: For the moment I will substitute both of them with character \u2019, it is also a ' and there is no problem confusing it with a control character. Thank you all!
This answer is a little speculative, but I hope it gets you on the right tracks.
Your best bet may be to give up and replace \u0091 and \u0092 with something else as a preprocessing step before the string is displayed. These are control characters and are unprintable in most encodings. But:
If the rest of the file is proper UTF, your JSON file probably has a problem: the encoding is wrong (CP-1250?) while you read the file as UTF, some error was made when converting the file, or something similar. So another solution is, of course, fixing your file.
If you're not sure how your file is encoded, it may simply be encoded in CP-1250, so reading the file using NSWindowsCP1250StringEncoding might fix your problem.
BTW, if you hardcode the string @"\u0091", you'll get the compile-time error "Universal character name refers to a control character". Yes, not even a warning; it's that unprintable in Unicode ;)
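The code page guess can be checked with a small sketch, shown here in Ruby for brevity rather than in Objective-C. In Windows-1252 (and CP-1250), bytes 0x91 and 0x92 are the curly quotes ‘ and ’, so \u0091 and \u0092 typically appear when such bytes were converted as if they were ISO-8859-1. The helper below, with the hypothetical name repair_c1_controls, maps C1 control characters back through Windows-1252; it assumes that this is how the file was damaged and that the input string is UTF-8.

```ruby
# Hypothetical helper: reinterpret C1 control characters (U+0080..U+009F)
# as the Windows-1252 characters they most likely stood for.
def repair_c1_controls(text)
  text.gsub(/[\u0080-\u009F]/) do |ch|
    begin
      ch.encode("ISO-8859-1")            # code point -> single byte
        .force_encoding("Windows-1252")  # relabel that byte
        .encode("UTF-8")                 # byte -> intended character
    rescue EncodingError
      ch # byte is undefined in Windows-1252; leave it untouched
    end
  end
end

puts repair_c1_controls("L\u0092H\u00e9r.")
```

Running this kind of repair once over the JSON file, instead of at display time, corresponds to the "fix your file" option above.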

Rails View Encoding Issues

I'm using Ruby 2.0 and Rails 3.2.14. My view is littered with several UTF-8 characters, mainly currency symbols like บาท and د.إ. I noticed some
(ActionView::Template::Error) "incompatible character encodings: ASCII-8BIT and UTF-8"
in our production code and promptly tried visiting the page URL in my browser, without any issues. On digging in, I realised the error was actually caused by BingBot and a few other spiders. When I tried to curl the same URL, I was able to reproduce the issue. So, if I try
curl http://localhost:3000/?x=✓
I get the error wherever UTF-8 symbols are used in the view code. I also realised that if I use HTML-encoded strings in place of the symbols, this does not occur. However, I prefer using the actual symbols.
I have already tried setting Encoding.default_external = Encoding::UTF_8 in environment.rb and adding the #encoding: utf-8 magic comment to the top of the file, and it does not help.
So, why does this error occur? What is the difference between hitting this URL in a browser and with curl, besides cookies? And how do I go about fixing this issue so that BingBot can index our site? Thanks.
The culprit that was leaking non-UTF-8 characters into my template was an innocuous meta tag for Facebook Open Graph:
%meta{property: "og:url", content: request.url}
When the request is non-standard, this causes the encoding issue. Changing it to
%meta{property: "og:url", content: request.url.force_encoding('UTF-8')}
did the trick.
That error message usually occurs when you try to concatenate strings with different character encodings.
Is your database set to use UTF-8 as well?
If not, you could have a problem when you try to insert the non-UTF8 values into your UTF-8 template.
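The failure and the fix can be reproduced outside Rails. The sketch below assumes, as the answer above suggests, that request.url arrives tagged as ASCII-8BIT while the template is UTF-8; raw_url and template are stand-in names, not Rails APIs.

```ruby
# Stand-in for request.url: binary-tagged bytes containing a UTF-8 check mark.
raw_url = "http://localhost:3000/?x=\xE2\x9C\x93".b

# Stand-in for a template fragment with non-ASCII UTF-8 text.
template = "บาท "

# Concatenating the two raises, because both sides hold non-ASCII bytes
# under different encodings.
error_message =
  begin
    template + raw_url
    nil
  rescue Encoding::CompatibilityError => e
    e.message
  end

# Relabelling the bytes as UTF-8 (no conversion happens) makes them compatible.
url = raw_url.dup.force_encoding(Encoding::UTF_8)
page = template + url

puts error_message
puts page.valid_encoding?
```

This also hints at why a browser did not trigger the error: a URL that contains only ASCII bytes is compatible with any encoding, whereas curl sent the raw ✓ bytes. Note that force_encoding only relabels the string; if a bot sends bytes that are not valid UTF-8, valid_encoding? will report false and the bytes still need cleaning (String#scrub is available from Ruby 2.1).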

HTML DOM Parse and Character encoding on XMLHTTPRequest at Firefox extension

I am writing a bootstrapped extension for Firefox 4.
Here is my story:
When I use @mozilla.org/xmlextras/xmlhttprequest;1 (nsIXMLHttpRequest), the content of the target URL loads successfully and is available in req.responseText.
I parse the responseText into a DOM by creating a BODY element with createElement and assigning the text to its innerHTML property.
Everything seems to work.
However, there is a problem with the character encoding (charset).
Since I need the extension to detect the charset of the target documents, overriding the MIME type of the request with text/html; charset=blahblah.. does not seem to meet my need.
I've tried @mozilla.org/intl/utf8converterservice;1 (nsIUTF8ConverterService), but it seems that XMLHttpRequest has no ScriptableInputStream, nor any other InputStream or readable stream.
I have no idea how to read the content of a target document in a suitable, automatically detected charset, whether through the Auto-Detect Character Encoding function of the GUI or through the charset read from the head meta tag of the document.
EDIT: Would it be practical to parse the whole document, including the HTML, HEAD and BODY tags, into a DOM object, but without loading external documents such as js, css and ico files?
EDIT: The method in the MDC article titled "HTML to DOM", which uses @mozilla.org/feed-unescapehtml;1 (nsIScriptableUnescapeHTML), is inappropriate: it parses with lots of errors, and the baseURI cannot be set for type text/html. The HREF attribute of every A element is lost when it contains a relative path.
EDIT#2: It would still be nice if there were a method that could convert the incoming responseText into readable UTF-8 strings. :-)
Any ideas or work that solves the encoding problem are appreciated. :-)
PS. The target documents are arbitrary, so there is no specific (or previously known) charset, and of course they are not only in UTF-8, which is already the default.
SUPP:
Until now, I have two main ideas for solving this problem.
Could anybody help me work out the names of the XPCOM modules and methods?
1. Specify the charset while parsing the content into the DOM.
We first need to know the charset of the document (by extracting it from the head meta tag, or from the header).
Then,
- find a method that can specify the charset when parsing the body content;
- find a method that can parse both head and body.
2. Convert the incoming responseText into UTF-8, so that parsing into a DOM element with the default charset UTF-8 still works.
X This seems neither practical nor sensible: overriding the MIME type with a charset is an implementation of this idea, but we cannot know the charset before initiating a request.
It seems that there are no other answers.
After a day of tests, I've found that there is a way (although it is clumsy) to solve my problem:
xhr.overrideMimeType('text/plain; charset=x-user-defined'); where xhr stands for the XMLHttpRequest handler.
"To force Firefox to treat it as plain text, using a user-defined character set. This tells Firefox not to parse it, and to let the bytes pass through unprocessed."
See the MDC document: Using_XMLHttpRequest#Receiving_binary_data
Then use the scriptable Unicode converter: @mozilla.org/intl/scriptableunicodeconverter (nsIScriptableUnicodeConverter).
The variable charset can be extracted from the head meta tags, either with a regexp over req.responseText (whose charset is still unknown) or by some other method.
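// str initially holds the response text fetched with the x-user-defined override.
var unicodeConverter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
    .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
unicodeConverter.charset = charset; // charset detected from the document, e.g. its meta tag
str = unicodeConverter.ConvertToUnicode(str);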
A Unicode string, which can be handled as UTF-8, is finally produced. :-)
Then I simply parse it into a body element, and that meets my need.
Other brilliant ideas are still welcome. Feel free to object to my answer if you have good reason. :-)
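The same pipeline (keep the raw bytes, sniff the charset, only then decode) can be sketched outside XPCOM. The Ruby below is an illustration rather than extension code: the helper names are hypothetical, the meta-tag regexp is deliberately naive, and it relies on the documented behaviour that x-user-defined exposes bytes 0x80-0xFF as code points U+F780-U+F7FF, so masking with 0xFF recovers the original byte.

```ruby
# Recover the raw bytes from text that was read as x-user-defined.
def bytes_from_user_defined(text)
  text.codepoints.map { |cp| cp & 0xFF }.pack("C*")
end

# Naive charset sniffing from a <meta ... charset=...> tag; falls back to a default.
def sniff_charset(bytes, default = "UTF-8")
  match = bytes.match(/<meta[^>]*charset\s*=\s*["']?([\w-]+)/i)
  match ? match[1] : default
end

# Decode the page into a UTF-8 string using the sniffed charset.
def decode_body(text)
  bytes = bytes_from_user_defined(text)
  bytes.force_encoding(sniff_charset(bytes)).encode("UTF-8")
end

# Simulate a Latin-1 page as it would appear in an x-user-defined responseText.
latin1 = "<meta charset=\"ISO-8859-1\"><p>caf\u00e9</p>".encode("ISO-8859-1")
response_text = latin1.bytes.map { |b| b < 0x80 ? b : 0xF700 + b }.pack("U*")

puts decode_body(response_text)
```

A charset name that the runtime does not know would raise here, so a real implementation would need a fallback, and would probably prefer the Content-Type response header over the meta tag when one is present.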

Rails character encoding problem view to controller

The character encoding starts to irritate me.
It took me a while to get everything from the DB in the right encoding on the screen, but with help from the i18n helper, this worked out.
Now I only have one more problem: saving text...
If I add some letters with accents (e.g. é, ç ...) to a text field and want to save it, it already shows up in my controller as some exotic combination of characters.
Could someone tell me why this is and how I can fix it?
Everything is in UTF-8, btw.
Thanks!
//Edit:
When I save the form, this is my log output
Parameters: {"free_text"=>"test 1 2 é",
And everything is capable of UTF-8...
Can you show your output?
Let me guess at your situation.
Suppose you log those characters from the controller to log/development.log or production.log.
If you view that log in a terminal, you should make sure the terminal can display UTF-8, with an appropriate font. Your shell and your text viewer must be able to display UTF-8 as well.
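Because a terminal or log viewer can itself garble the output, it may help to look at the bytes rather than the rendered text. The hypothetical helper below summarises a string; in a controller it could be applied to something like params[:free_text], which is an assumption about the form in question.

```ruby
# Hypothetical helper: report how a string is tagged and which bytes it holds.
def describe_string(value)
  {
    encoding: value.encoding.name,
    valid: value.valid_encoding?,
    hex: value.unpack1("H*")
  }
end

p describe_string("\u00e9")
```

For é, a hex value of c3a9 means correct UTF-8, e9 means Latin-1 bytes, and c383c2a9 means the text was encoded to UTF-8 twice somewhere between the form and the controller.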

Resources