I am building a Rails application that needs a list of countries and cities. Each city should have at least 100,000 people. I found the data on Wikipedia, but I need a clarification: some city names contain special letters.
Durrës - ë
Vicente López - ó
São Paulo - ã
I googled and found that these are accented letters.
My questions are:
Can I directly insert these values into the database?
Can I search the database without any problem?
Thank you.
If you set your database to store values as utf-8 you should be able to store a wide range of such values without problem.
When it comes to sorting and comparing the important thing is which collation you ask the database to use. In a nutshell a collation is a set of rules saying how strings are compared, for example how is é sorted relative to e, is ß equal to ss and so on.
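To make the idea of a collation concrete, here is a minimal sketch in plain Ruby, not how any particular database implements it. The helper name naive_collation_key and the sample words are assumptions made up for illustration; real collations are locale-specific and much more detailed.

```ruby
# Hypothetical helper (not from the original answer): builds a crude sort key
# by decomposing accented letters (NFD), dropping the combining marks,
# and lowercasing the result.
def naive_collation_key(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase
end

words = ["zebra", "école", "eagle"]

# A plain sort compares raw bytes, so "é" ends up after "z".
binary_order = words.sort

# Sorting by the key treats "é" like "e" for ordering purposes.
folded_order = words.sort_by { |w| naive_collation_key(w) }

puts binary_order.inspect
puts folded_order.inspect
```

The two orderings differ only in where "école" lands, which is the kind of decision a database collation makes for you.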
When using full-text search (Solr, Sphinx, etc.) you should ensure that your stop words, choice of stemmer and so on are Unicode-aware.
I have created a job in talend open studio for data integration v5.5.1.
I am trying to find matches between two customer names columns, one is a lookup and the other contain dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found regardless of the underlying match algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode or even UTF-8 interpretation in Talend?
I am using excel data sources through tFileInputExcel
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the collation.
The problem I have is that I wish to support Unicode within my project (an MVC project),
so that a user can post a comment using characters such as ペ without it becoming ????.
Any information that can be shared on this subject would be greatly appreciated.
You need to understand what a character set and an encoding are.
A character set defines a set of textual and graphic symbols, each of which is mapped to a non-negative integer. For example, when the database stores the letter A, it actually stores a numeric code that software interprets as the letter "A". That numeric code is called the code point or encoded value.
A character encoding is the rule for representing and storing each code point of a character set as bytes.
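As a small illustration of the difference between a code point and its encoded bytes, here is a sketch in plain Ruby using the ペ character from the question. The variable names are arbitrary, and the last line only imitates the kind of loss that happens when some layer cannot represent the character; it is not a diagnosis of the asker's setup.

```ruby
text = "ペ"

# The code point is the integer the character set assigns to the character.
code_point = text.ord                       # U+30DA as a decimal integer

# The encoding decides which bytes represent that code point.
utf8_bytes  = text.encode("UTF-8").bytes    # three bytes in UTF-8
utf16_bytes = text.encode("UTF-16BE").bytes # two bytes in UTF-16

# Converting to an encoding that lacks the character forces a substitute.
lossy = text.encode("US-ASCII", undef: :replace)

puts code_point
puts utf8_bytes.inspect
puts utf16_bytes.inspect
puts lossy
```

The question marks seen in the page are typically the result of a conversion like the last one happening somewhere between the browser, the application and the database.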
You also need to know what a collation is: the set of rules applied when stored characters are compared and sorted, which matters if you are storing the text in your database.
In a nutshell, you need to set the charset of all your web pages to UTF-8 (charset="UTF-8") and do the same for your database.
I am facing an issue in one of my Rails projects.
My users table contains names with special characters, and I want them to show up in search results when searching with plain characters.
Example: suppose I have a user whose name is "Noël Nocciolo" (please notice the diaeresis on the e) and I want that user to be found if I pass "Noel Nocciolo" as a parameter.
Can anyone tell me how to handle these cases? Not everyone knows how to type an "e with two dots".
I am using Postgres as my database.
Regards,
Karan
You can create a separate field, "indexed_name", for searching and fill it only with ASCII characters.
Then you have to preprocess the query string with .gsub('ë', 'e') (and likewise map any other non-ASCII character to its ASCII analog) and search with the processed query.
I believe there is a more elegant way to convert any string to its ASCII analog; I just gave you a direction.
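One way to generalize the substitution without listing every character, sketched with plain Ruby's built-in Unicode normalization. The helper name ascii_fold is hypothetical and not part of Rails or the answer above; it only strips combining accents, so letters without a decomposition (such as ß or ø) pass through unchanged.

```ruby
# Hypothetical helper: decompose each accented letter into base letter plus
# combining mark (NFD), then drop the combining marks.
def ascii_fold(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

stored_name = "Noël Nocciolo"

# Value that would be written to the separate search column.
indexed_name = ascii_fold(stored_name).downcase

# The incoming query is folded the same way before comparing.
query = ascii_fold("Noel Nocciolo").downcase

match = (indexed_name == query)
puts indexed_name
puts match
```

Applying the same fold to both the stored value and the query is what makes the accented and unaccented spellings meet.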
.parameterize or ActiveSupport::Inflector.transliterate will probably be acceptable for your use case.
"àáâãäå".parameterize
=> "aaaaaa"
However, it won't handle ligatures such as ﬃ, so for that you'll need:
"àáâãÀﬃ".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/,'').to_s
=> "aaaaAffi"
I'm using Phonegap to do a dictionary app for iOS.
When querying the database for an alphabetical list I use COLLATE NOCASE:
ORDER BY term COLLATE NOCASE ASC
This solved the problem that terms starting with a lower-case letter were appended to the end (I picked it up from that question).
However, non-ASCII characters such as öäüéêè still get sorted at the end. Here are two examples:
Expected:        Observed:
Öffnungszeiten   Zuzahlung
Zuzahlung        Öffnungszeiten
(or)
clé              cliquer sur
cliquer sur      clé
I looked around and found similar matters discussed here or here but it seems the general advice is to install some type of extension
This extension can probably help you ...
...use ICU either as an extension
SQLite supports integration with ICU ...
But I'm not sure if this is applicable in my situation, where the database is not hosted by me but runs on the customer's device. So I guess I'd have to ship this extension with my app package.
I'm not very familiar with iOS, but I've got the feeling that would be complicated, to say the least.
Also in the official forum I've found that hint:
SQLite does not properly handle accented characters.
and a little bit down in the text the poster mentions a bug in SQLite.
All the links I've found haven't been active for a year or more, and none of them seems to deal with the mobile environment I'm currently developing in.
So I was wondering if anyone else found a solution on their iOS projects.
The documentation states there are only three built-in collation options:
6.0 Collating Sequences
When SQLite compares two strings, it uses a collating sequence or
collating function (two words for the same thing) to determine which
string is greater or if the two strings are equal. SQLite has three
built-in collating functions: BINARY, NOCASE, and RTRIM.
BINARY - Compares string data using memcmp(), regardless of text encoding.
NOCASE - The same as binary, except the 26 upper case characters of ASCII are folded to their lower case equivalents before the
comparison is performed. Note that only ASCII characters are case
folded. SQLite does not attempt to do full UTF case folding due to the
size of the tables required.
RTRIM - The same as binary, except that trailing space characters are ignored.
For now my best guess would be to do the sorting in JavaScript, but I suspect that this wouldn't be good for overall performance.
The reason is that the SQLite shipped with iOS doesn't come with ICU (International Components for Unicode) enabled. So you need to build your own SQLite with ICU enabled, build ICU as a static library, add the ICU .dat file, and make SQLite load it. Then you can load any collation with a simple SQL command (e.g. 'icu_load_collation("de_DE", "DEUTSCH")', run once after the database is opened).
It doesn't only sound like dirty work, it really is. Try to find a build of SQLite + ICU where all of this has already been done.
I have a stored procedure for SQL Server 2000 that has an input parameter of type varchar(17) to hold a vehicle identification number (VIN), which is alphanumeric. However, whenever I execute the procedure with a parameter value that contains a numeric digit, it gives me an error. It appears to accept only alphabetic characters. What am I doing wrong here?
Based on comments, there is a subtle "feature" of SQL Server that allows letters a-z to be used as stored proc parameters without delimiters. It's been there forever (since 6.5 at least)
I'm not sure of the full rules, but it's demonstrated in MSDN (rename SQL Server etc): there are no delimiters around the "local" parameter. And I just found this KB article on it
In this case, it could be that a value starting with a number is what breaks. I assume it works when the number is inside the value (but, as I said, I'm not sure of the full rules).
Edit: confirmed by Martin as "breaks with leading number", OK for "containing number"
This doesn't help much, but somewhere you have a bug, typo, or oversight in your code. I spent 2+ years working with VINs as parameters, and other than regretting not having made it char(17) instead of varchar(17), we never had any problems passing in alphanumeric VIN values. Somewhere, and I'd guess it's in the application layer, something is not liking digits; perhaps a filter looking for only alphabetical characters?