meta keywords characterset - character-encoding

On my site sinj.com.hr I have Croatian letters with diacritics, which I encode as HTML character entities. For example,
Obiteljski liječnici, na području koje
pokriva sinjska ispostava Nastavnog
zavoda za javno zdravstvo, danas su
počeli s cijepljenjem protiv sezonske
gripe. Dodajmo kako cjepivo protiv
sezonske gripe nije otporno na virus
nove pandemijske H1N1 gripe.
Cijepljenje protiv svinjske gripe
počet će u prosincu
is written in the HTML like this:
Obiteljski liječnici, na području koje pokriva sinjska ispostava Nastavnog zavoda za javno zdravstvo, danas su počeli s cijepljenjem protiv sezonske gripe. Dodajmo kako cjepivo protiv sezonske gripe nije otporno na virus nove pandemijske H1N1 gripe. Cijepljenje protiv svinjske gripe počet će u prosincu
How should this string be written in a meta tag? I ask because some search engines show the raw entity in their results instead of the character. Google displays it correctly, but Yahoo does not (if the link does not work, try searching for "sinj").

There's really not much you can do; the Yahoo search engine is the one at fault here. You could try writing the characters directly in UTF-8 instead of as entities, though, since you have declared the charset in the content-type meta tag correctly.
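The suggestion of writing the characters directly can be sketched in Python. This is only an illustration: it assumes the entities are numeric references such as &#269; (the site's actual markup is not shown here), and the function names and the default "keywords" attribute value are hypothetical.

```python
import html


def entities_to_text(text: str) -> str:
    """Turn character references such as &#269; into the characters themselves."""
    return html.unescape(text)


def meta_tag(text: str, name: str = "keywords") -> str:
    """Build a meta tag whose content holds literal characters, not entities.

    Only the characters that are special inside an attribute value
    (&, <, > and quotes) are escaped again.
    """
    content = html.escape(entities_to_text(text), quote=True)
    return f'<meta name="{name}" content="{content}">'
```

For example, meta_tag("po&#269;et &#263;e") returns a tag whose content is "počet će". The page then has to be saved and served as UTF-8 so that the bytes match the declared charset.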


JabRef citation in LibreOffice with institutional author

I'm using JabRef as a reference manager and LibreOffice writer as document editor.
I'm using the ooPlugin to cite JabRef sources in LibreOffice, but I'm having trouble with sources that have institutional authors. For example, the following source
@Misc{RevistaSemana2013,
Title = {Un buen año para la economía},
HowPublished = {Online. Available at http://www.foo.bar},
Institution = {Revista Semana},
Month = {Dec},
Year = {2013},
Comment = {Last visited 21-10-2015},
}
will appear in the references as:
Revista Semana (2013a). Un buen año para la economía. Online. Available at http://www.foo.bar. Last visited 21-10-2015.
But it is cited as (Semana 2013a).
My style file can be found at http://pastebin.com/j5vNgyDR
Thanks,
It looks to me like JabRef always splits the name and keeps only the last part, even if the author is an institution (Java code):
AuthorList al = AuthorList.getAuthorList(author);
sb.append(getAuthorLastName(al, 0));
However I found a simple workaround. In the source, use a non-breaking space instead of an ordinary space between Revista and Semana. Then we get the desired result in LibreOffice:
(Revista Semana 2013)
You might use an additional pair of curly braces around the name:
{{Revista Semana}}
as mentioned in the biblatex manual, section 2.3.3, for corporate authors:
http://mirrors.ibiblio.org/CTAN/macros/latex/contrib/biblatex/doc/biblatex.pdf

how to convert unicode text to utf8 text readable?

I have a serious problem with Unicode and UTF-8.
I pasted a paragraph of Arabic/Persian text into Notepad and saved it; now the text looks like this:
Êæ Çíä ÓæÑÓ ÈÑäÇãå ÚÏÏ ÏáÎæÇåí Ñæ ÇÒ æÑæÏí ãííÑå æ Èå Øæá åãæä ÚÏÏ ãËáËí Ñæ ÑÓã ãí ˜äå
My question is how to get my data back. It is important for me to recover it. Thanks in advance.
The paragraph was scrambled by saving as code page 1256 (Arabic/Persian), then interpreted as code page 1252 (Western Europe), and finally saved as Unicode text. You can use C# to reverse this procedure:
string scrambled = "Êæ Çíä ÓæÑÓ ÈÑäÇãå ÚÏÏ ÏáÎæÇåí Ñæ ÇÒ æÑæÏí ãííÑå æ " +
"Èå Øæá åãæä ÚÏÏ ãËáËí Ñæ ÑÓã ãí ˜äå";
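// Turn the scrambled text back into the bytes it was wrongly read from (code page 1252).
byte[] bytes = Encoding.GetEncoding("windows-1252").GetBytes(scrambled);
// Decode those bytes with the code page the text was really saved in (1256).
string plainText = Encoding.GetEncoding("windows-1256").GetString(bytes);
Console.WriteLine(plainText);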
The plain text output is:
"تو اين سورس برنامه عدد دلخواهي رو از ورودي ميگيره و به طول همون عدد مثلثي رو رسم مي کنه"
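The same round trip can be sketched in Python, for readers without a C# toolchain. The function name is hypothetical; the codec names cp1252 and cp1256 are Python's names for the two Windows code pages mentioned above.

```python
def unscramble(mojibake: str, wrong: str = "cp1252", actual: str = "cp1256") -> str:
    """Undo text that was saved as `actual` but read back as `wrong`."""
    # Step 1: recover the original bytes by re-encoding with the wrong code page.
    raw = mojibake.encode(wrong)
    # Step 2: decode those bytes with the code page the text was really saved in.
    return raw.decode(actual)
```

Calling unscramble("Êæ Çíä") gives the first two words of the recovered paragraph. If the scrambled text contains a character that cp1252 cannot encode, the encode step raises UnicodeEncodeError, so this is a best-effort recovery rather than a guaranteed one.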
On Linux you can use Gedit to open it as a 1256 encoded file:
gedit shahnameh.txt --encoding WINDOWS-1256
You can do the same through the GUI: select the correct encoding in the "Open" dialog box when opening the file. The encoding selector should be at the bottom of the dialog.

How to clean up a string (email body) with regards to special characters?

I have an email extracted from an IMAP account. I have encoded it like this:
body = imap.uid_fetch(uid, "BODY[TEXT]")[0].attr["BODY[TEXT]"].force_encoding('UTF-8')
So now it looks like this:
puts body.inspect => "\n--Apple-Mail-028364EC-0K8B-4FD7-87E8-97C28C324717\nContent-Type: text/plain; charset=\"utf-8\"\nContent-Transfer-Encoding: quoted-printable\n\nHej=20\n\nI m=C3=A5 meget undskylde men jeg vil ikke k=C3=B8be produktet alligevel hvord=\nan g=C3=B8r vi det...=20\n\nHans Nielsen. =20\nR=C3=B8rgade 65=20\n1234 G=C3=B8rlev\n\n"
I want to present the email in my Rails app, so the user of the app can review the email. But how do I clean up the body?
I want to remove this part:
--Apple-Mail-028364EC-0K8B-4FD7-87E8-97C28C324717
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
And clean up this part:
Hej=20
I m=C3=A5 meget undskylde men jeg vil ikke k=C3=B8be produktet alligevel hvord=
an g=C3=B8r vi det...=20
Hans Nielsen. =20
R=C3=B8rgade 65=20
1234 G=C3=B8rlev
This means replacing the weird characters with the originally intended characters. Fyi, these are:
=C3=A5 is å
=C3=B8 is ø
=20 is ???
= is ???
How to do this (without just using gsub)?
You need to use a MIME parser, which takes care of removing the headers and undoing the quoted-printable encoding. Depending on the layout of your email, BODY[TEXT] might get you a lot more than you want. You need to either download the BODYSTRUCTURE and pick out the parts you want, or download the entire message (BODY[]) and run it through a MIME parser.
The decoding result is:
Hej
I må meget undskylde men jeg vil ikke købe produktet alligevel hvordan gør vi det...
Hans Nielsen.
Rørgade 65
1234 Gørlev
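In quoted-printable, =20 is an encoded space (byte 0x20), and a = at the very end of a line is a soft line break: it and the newline after it are removed, so "hvord=" followed by "an" becomes "hvordan".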
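Both approaches can be sketched with Python's standard library. The question uses Ruby, so this is only an illustration of the technique, and the function names are hypothetical. The first function decodes a part body that has already been isolated; the second assumes the entire message was downloaded (BODY[]) and lets a MIME parser find and decode the plain-text part.

```python
import quopri
from email import message_from_bytes, policy


def decode_qp_body(part_body: bytes, charset: str = "utf-8") -> str:
    """Decode one quoted-printable part body that has no MIME headers left."""
    # =XX sequences become bytes; a trailing '=' joins the line with the next one.
    return quopri.decodestring(part_body).decode(charset)


def plain_text_from_message(raw_message: bytes) -> str:
    """Parse a complete raw message and return its decoded text/plain part."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""  # the message has no text/plain part
    # get_content() undoes the transfer encoding and applies the charset.
    return part.get_content()
```

With the sample from the question, decode_qp_body(b"k=C3=B8be hvord=\nan") returns "købe hvordan". The parser-based function is the safer choice for real mail, because it reads the charset and transfer encoding from each part's headers instead of assuming them.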

Not sure if web crawlers read my website correctly

The W3C Internationalization Checker (http://validator.w3.org/i18n-checker/) reports no language-related errors for my website fxrehber.com, but when I check how web crawlers see the site with a tool like http://tools.seobook.com/general/spider-test/ I get text like this:
SPK Lisansl Forex irketlerinin Kar la t rmalar ve Kullan c Yorumlar
FXrehber com Forex irketleri kar la t rma ve yorumlar Forex'te g venle
i lem yap n T rkiye'de ofisi bulunan
My website is in Turkish, so it should look like this:
SPK Lisanslı Forex şirketlerinin Karşılaştırmaları ve Kullanıcı Yorumları
FXrehber com Forex şirketleri karşılaştırma ve yorumları Forex'te güvenle
işlem yapın Türkiye'de ofisi bulunan
I'm not sure whether this is normal behaviour or whether it is a problem for SEO.
Consider the tools.seobook.com service useless; it apparently cannot even read UTF-8 data correctly (when in document body – it seems to get the meta tag contents OK, making the behavior even more absurd).
If you search for e.g. “SPK Lisanslı Forex şirketlerinin Karşılaştırmaları” in Google, you’ll see your page well placed, with the extract of page content correctly displayed by Google. Ditto when searching with Bing, Yahoo, Yandex.
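The garbled spider-test output is consistent with a tool that simply discards non-ASCII characters. That is a guess about the tool, not something confirmed; the sketch below, with a hypothetical function name, only shows that such a filter reproduces the text quoted in the question.

```python
import re


def ascii_only_view(text: str) -> str:
    """Replace every non-ASCII character with a space, then collapse whitespace."""
    replaced = re.sub(r"[^\x00-\x7f]", " ", text)
    return " ".join(replaced.split())
```

Applied to "SPK Lisanslı Forex şirketlerinin Karşılaştırmaları ve Kullanıcı Yorumları", it returns "SPK Lisansl Forex irketlerinin Kar la t rmalar ve Kullan c Yorumlar", which matches the first line the tool displayed. If that is what happens, the loss occurs inside the tool and not in the page itself.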

Escape html elements in blackberry

I was wondering if there is something for BlackBerry to escape HTML values. Basically, I want to show just the plain text that comes from an RSS feed. However, the feed returns values like this:
<item><guid isPermaLink="true"><![CDATA[http://www4.elcomercio.com/deportes
/Vettel_F1_China.aspx]]>
</guid>
<title><![CDATA[ Vettel domina primer día de ensayos en China]]></title>
<description><![CDATA[El alemán Sebastian Vettel, de Red Bull, realizó el mejor tiempo en la segunda sesión de entrenamientos libres del Gran Premio de China de Fórmula 1, el viernes en el circuito de Shanghai, tercera prueba del campeonato, tras haber dominado el primer ensayo.<br />
<br />
I can successfully retrieve the content of the title and description tags, but now I would like to remove the CDATA markers, the <br /> tags, and any other HTML tags that might appear.
I tried using JSoup, but it uses JVM 1.5+ classes such as Enum, and as a result I couldn't preverify the jar for use on BlackBerry Java ME. I also haven't found any class in the RIM API that could help with this task; maybe I missed a class or a library that I could use. I am asking to avoid writing code that probably already exists in several libraries.
Thanks a lot.
Have you tried using a SAX parser and just collecting the values passed to the characters(...) method for each element, reading them out in endElement?
Here is a brief tutorial on SAX Parser for Blackberry:
http://jsinghfoss.wordpress.com/2009/09/06/sax-parsing-revising/
Well, I couldn't find a ready-made class, but there is a library called regexp-me that allows regular expressions in BlackBerry projects. It helped me remove the tags easily. A SAX parser is also a solution, but if you want something simpler, as in this case, I think regexp-me is the best option.
Thanks.
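The regex approach can be sketched as follows. It is written in Python for illustration and does not use the regexp-me API, which is not shown in this thread; the patterns and the function name are assumptions, but the patterns themselves should carry over to other regex engines with little change.

```python
import html
import re

# Unwrap CDATA first; otherwise the tag pattern would swallow the whole section.
CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def to_plain_text(fragment: str) -> str:
    """Strip CDATA markers and HTML tags, decode entities, tidy whitespace."""
    unwrapped = CDATA.sub(r"\1", fragment)
    without_tags = TAG.sub(" ", unwrapped)
    return " ".join(html.unescape(without_tags).split())
```

For the title in the question, to_plain_text("<![CDATA[ Vettel domina primer día de ensayos en China]]>") returns "Vettel domina primer día de ensayos en China". A regex like this is adequate for simple feed markup such as <br />, but it is not a general HTML parser.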
