UTF-8 String Capitalization correction in Rails


I have an application where users enter their names, often in an international format (e.g. Sarah Grüße). I would like to fix the capitalization of users' names when they enter them incorrectly due to laziness, or different international conventions.
Some users do not capitalize their names at all (e.g. mario bel). When this is the case, I would like to capitalize the first letter of the first and last name (which are different columns in the DB). So the result would be Mario Bel.
Some users (often French), capitalize their last names entirely, but this does not fit how we want to display their names. For example Baptist ALBERT should become Baptist Albert.
Some names correctly have multiple capital letters, thus if a name isn't ALL CAPS or all lowercase, I would like to leave it as is. For example McDonald or von Ramstein.
Some characters in UTF-8 obviously have no sense of capitalization, thus these names would also stay the same. (Japanese names for example.)
So what I think I need is a way to detect when a name is in all caps or all lowercase, including non-ASCII case-sensitive characters like Ü/ü and ignoring non-ASCII non-case-sensitive characters like ß. Then, once detected, correctly change these two cases to First Letter Capitalized.
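One way to approach the detection described above, as a rough sketch only: compare the name with its own upcased and downcased forms. The helper name normalize_name_case is hypothetical (it is not a Rails API), and the sketch assumes Ruby 2.4 or later, where String#upcase, #downcase and #capitalize apply Unicode case mapping.

```ruby
# Hypothetical helper, not part of Rails. Assumes Ruby 2.4+ (Unicode case mapping).
def normalize_name_case(name)
  all_upper = (name == name.upcase)
  all_lower = (name == name.downcase)

  # Mixed-case names (McDonald) fail both checks; caseless text (山田) passes both.
  # Only names that pass exactly one check are rewritten.
  return name unless all_upper != all_lower

  # Capitalize each run of letters and combining marks,
  # so "jean-pierre" becomes "Jean-Pierre".
  name.gsub(/[\p{L}\p{M}]+/) { |word| word.capitalize }
end

puts normalize_name_case("mario")
puts normalize_name_case("ALBERT")
puts normalize_name_case("McDonald")
```

Because "ß".upcase is "SS" in Ruby, an all-caps name written with ß (such as GROßE) passes neither check and is left untouched, which may or may not be what you want.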

Related

What is the meaning of 'Swift are Unicode correct and locale insensitive' in Swift's String document?

I found this sentence in Swift's String document
(https://developer.apple.com/documentation/swift/string)
Overview
A string is a series of characters, such as "Swift", that forms a collection. Strings in Swift are Unicode correct and locale insensitive, and are designed to be efficient. The String type bridges with the Objective-C class NSString and offers interoperability with C functions that works with strings.
But, I can't understand this one hundred percent and I don't know where to start.

To expand on @matt's answer a little:
The Unicode Consortium maintains certain standards for interoperation of data, and one of the most well-known standards is the Unicode string standard. This standard defines a huge list of characters and their properties, along with rules for how those characters interact with one another. (As Matt notes: letters, emoji, combining characters [letters with diacritics, like é], etc.)
Swift strings being "Unicode-correct" means that Swift strings conform to this Unicode standard, offering the same characters, rules, and interactions as any other string implementation which conforms to the same standard. These days, being the main standard that many string implementations already conform to, this largely means that Swift strings will "just work" the way that you expect.
However, along with the character definitions, Unicode also defines many rules for how to perform certain common string actions, such as uppercasing and lowercasing strings, or sorting them. These rules can be very specific, and in many cases, depend entirely on context (e.g., the locale, or the language and region the text might belong to, or be displayed in). For instance:
Case conversion:
In English, the uppercase form of i ("LATIN SMALL LETTER I" in Unicode) is I ("LATIN CAPITAL LETTER I"), and vice versa
In Turkish, however, the uppercase form of i is actually İ ("LATIN CAPITAL LETTER I WITH DOT ABOVE"), and the lowercase form of I ("LATIN CAPITAL LETTER I") is ı ("LATIN SMALL LETTER DOTLESS I")
Collation (sorting):
In English, the letter Å ("LATIN CAPITAL LETTER A WITH RING ABOVE") is largely considered the same as the letter A ("LATIN CAPITAL LETTER A"), just with a modifier on it. Sorted in a list, words starting with Å would appear along with other A words, but before B words
In certain Scandinavian languages, however, Å is its own letter, distinct from A. In Danish and Norwegian, Å comes at the end of the alphabet: ... X, Y, Z, Æ, Ø, Å. In Swedish and Finnish, the alphabet ends with: ... X, Y, Z, Å, Ä, Ö. For these languages, words starting with Å would come after Z words in a list
In order to perform many string operations in a way that makes sense to users in various languages, those operations need to be performed within the context of their language and locale.
In the context of the documentation's description, "locale-insensitive" means that Swift strings do not offer locale-specific rules like these, and default to Unicode's default case conversion, case folding, and collation rules (effectively: English). So, in contexts where correct handling of these are needed (e.g. you are writing a localized app), you'll want to use the Foundation extensions to String methods which do take a Locale for correct handling:
localizedUppercase/uppercased(with locale: Locale?) over just uppercased()
localizedLowercase/lowercased(with locale: Locale?) over just lowercased()
localizedStandardCompare(_:)/compare(_:options:range:locale:) over just <
among others.
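The Turkish case-conversion example above can also be reproduced outside Swift. As a small illustration, assuming Ruby 2.4 or later (the language of the main question on this page), String#upcase and String#downcase use the default Unicode mapping unless the :turkic option is passed:

```ruby
# Default (locale-insensitive) Unicode mapping.
default_upper = "i".upcase            # LATIN CAPITAL LETTER I
default_lower = "I".downcase          # LATIN SMALL LETTER I

# Turkic mapping, requested explicitly.
turkic_upper = "i".upcase(:turkic)    # LATIN CAPITAL LETTER I WITH DOT ABOVE
turkic_lower = "I".downcase(:turkic)  # LATIN SMALL LETTER DOTLESS I

puts [default_upper, default_lower, turkic_upper, turkic_lower].join(" ")
```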

It basically just means that Swift strings are Unicode strings. A Swift string "character" is a character in a Unicode sense: a letter, an emoji, a combined letter-and-diacritic, whatever. A string can also be viewed not merely as a character sequence but as a sequence of UTF-8, UTF-16, or UTF-32 code units. The "locale insensitive" stuff means they don't have a locale dependent encoding, as strings did in the bad old days before Unicode.
This is delightful but it has some downsides, most notably that strings qua character-sequence are not directly indexable by integers.

How to create a model with table names as integers?

Is there a way to generate models with table field names as numbers?
rails g model Numbers 1-10:string 11-20:string

You should not do this: it goes against both the SQL standard and Ruby syntax.
PostgreSQL 4.1.1. Identifiers and Key Words
SQL identifiers and key words must begin with a letter (a-z, but also letters with diacritical marks and non-Latin letters) or an underscore (_). Subsequent characters in an identifier or key word can be letters, underscores, digits (0-9), or dollar signs ($). Note that dollar signs are not allowed in identifiers according to the letter of the SQL standard, so their use might render applications less portable. The SQL standard will not define a key word that contains digits or starts or ends with an underscore, so identifiers of this form are safe against possible conflict with future extensions of the standard.
Many SQL servers, such as MSSQL and MySQL/MariaDB, will allow column names starting with numbers. However, referencing these columns requires explicit quoting with double quotes or square brackets (MSSQL) or backticks (MySQL/MariaDB); otherwise 1-10 would be treated as an expression rather than a column reference, resulting in -9.
Even if this weren't true, Ruby does not allow method names to begin with a number, so this would make for very awkward code usage.
Method names may be one of the operators or must start a letter or a character with the eight bit set. It may contain letters, numbers, an _ (underscore or low line) or a character with the eight bit set. The convention is to use underscores to separate words in a multiword method name.
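To make the awkwardness concrete, here is a plain-Ruby sketch (no Rails involved). The class NumbersRow and the name range_1_10 are hypothetical and only stand in for a model and a more conventional column name.

```ruby
class NumbersRow
  # define_method accepts a name that can never be written as a normal call.
  define_method("1-10") { "value of 1-10" }

  # Hypothetical conventional alternative for the same data.
  define_method(:range_1_10) { "value of 1-10" }
end

row = NumbersRow.new

# The digit-led name is only reachable through send/public_send.
awkward = row.public_send("1-10")

# The conventional name works with ordinary call syntax.
normal = row.range_1_10

# Writing row.1-10 directly is rejected by the Ruby parser.
puts awkward == normal
```

A generator call along the lines of rails g model Numbers range_1_10:string range_11_20:string (again, hypothetical names) avoids both the SQL quoting and the Ruby syntax problem.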

Delphi - create Title/Proper/Mixed Case for Strings

I have a list of approx 100,000 names I need to process. Some are business names, some are people names. Unfortunately, some are lower, some are upper, and some are mixed. I am looking for a routine to convert them to proper case (sometimes called Mixed or Title case). I realize I can just loop through the string and capitalize every character that starts a new word. That would be an incredibly simplistic approach. For businesses, short words should be lowercase (of, with, for, ...). For last names, if it starts with Mc, the 3rd letter should be capitalized (McDermot, McDonald, etc). Roman numerals should always be capitalized (John Smith II), etc.
I have not been able to find any routines for this, built into Delphi or otherwise. Surely this is out there. Where can I find this?
Thanks

As others have already said, a fully automated routine for this is nearly impossible because there are so many special variations, so some human interaction will almost certainly be needed.
What you can do instead is make the problem much easier for a human to solve. How? Build a dictionary of all the name variations in lowercase and present it to a reviewer.
Before presenting the names, you can make sure that the first letter of each name is already capitalized.
Once all the corrections have been made in the dictionary, you automatically replace all the names in the original database.
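The dictionary workflow described above could look roughly like this. It is sketched in Ruby rather than Delphi, and the method names build_review_dictionary and apply_review as well as the sample data are hypothetical.

```ruby
# Step 1: one entry per distinct lowercase form, pre-filled with a naive guess
# (first letter of every word capitalized) for the reviewer to correct.
def build_review_dictionary(names)
  names.map(&:downcase).uniq.each_with_object({}) do |key, dict|
    dict[key] = key.split(" ").map(&:capitalize).join(" ")
  end
end

# Step 3: replace every original name with its reviewed form.
def apply_review(names, review)
  names.map { |name| review.fetch(name.downcase, name) }
end

names = ["JOHN MCDONALD", "john mcdonald", "ACME OF TEXAS"]
review = build_review_dictionary(names)

# Step 2: a person fixes the guesses that are wrong.
review["john mcdonald"] = "John McDonald"
review["acme of texas"] = "Acme of Texas"

puts apply_review(names, review).inspect
```

Since the dictionary is keyed by the lowercase form, each distinct name is reviewed once no matter how many casing variants appear in the data.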

Smarter Autocapitalization

I've been looking around, and I am wondering whether there is a simple way to capitalize all words in a UITextField, while leaving certain words (such as of, the, or, etc.) lowercase, unless they are the first word of the phrase.
This is an Example of the Effect I'm Trying to Convey.
One of the methods I've found is to search the text field value for the certain words and replace them with lowercase versions, as the user types a new word or character, perhaps listening for the space bar.
I'm not sure if the method above is best practice, or whether my searches haven't been broad enough to find a solution already in the mix.
I was originally thinking something along these "pseudocode" lines:
When value of textfield is changed
    Get current value textfield
    For each word in value:
        If the word matches ("For", "Of", "The", etc.) and the word is not the first word in the value:
            Change the word to lowercase, and replace word
        Go to next word
My actual question is mainly one of performance. Would this method be overly strenuous on my application? If so, are there any better solutions?
Thank you all for your assistance!
Update:
Thanks to holex, cluemein, and others who have already commented and answered. I will try your solutions when I get the opportunity to do so.

A better way than converting the words to lowercase is to capitalize the words that are NOT the words you specified. Set up if statements to capitalize the beginning letter of the first word, and to capitalize the words following that if they are not the words you specified. Then, if you want to make sure the specified words weren't capitalized after the first word, use an else statement. "Pseudocode" example:
Capitalize letter of first word;
Move on to next word;
While not end of textfield (or while typing):
    if word is not ("the"|"and"|"of"|"or"|...):
        capitalize first letter;
    else:
        set first letter to lowercase;
    move to next word at space;
In terms of runtime, this will on average be roughly twice as fast as going back through the text looking for the specified words. This isn't the code you would use, but the algorithm you would implement. Also, take into account what holex said about spaces. I leave how you implement this algorithm up to you. Just to clarify, this algorithm is for both autocapitalizing and auto-setting to lowercase.
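For illustration, the single-pass algorithm above can be sketched as follows. This is Ruby rather than iOS code, and the names SMALL_WORDS and smart_title_case plus the word list itself are assumptions made for the example.

```ruby
require "set"

# Hypothetical list of words that stay lowercase unless they come first.
SMALL_WORDS = Set.new(%w[a an and for of or the]).freeze

def smart_title_case(text)
  # split(" ") also collapses runs of whitespace into single separators.
  words = text.split(" ")
  words.each_with_index.map do |word, index|
    lower = word.downcase
    if index.positive? && SMALL_WORDS.include?(lower)
      lower             # small word after the first position stays lowercase
    else
      lower.capitalize  # first word and all other words get a capital
    end
  end.join(" ")
end

puts smart_title_case("this is an example of the effect")
```

Each word is visited once and the set lookup is constant time, so the work grows linearly with the length of the text, which is unlikely to be a problem for the contents of a single text field.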

Regexp for a name

I need to make sure people enter their first, middle and last names correctly for a form in Rails. So the first thought for a regular expression is:
\A[[:upper:]][[:alpha:]'-]+( [[:upper:]][[:alpha:]'-]*)*\z
That'll make sure every word in the name starts with an uppercase letter followed by a letter or hyphen or apostrophe.
My first question I guess doesn't have much to do with regular expressions, though I'm hoping there's a regular expression I can copy for this. Are letters, hyphens and apostrophes the only characters I should be checking in a name?
My second question is if it's important to make sure each name has at least 1 uppercase letter? So many people enter all lowercase names and I really want to avoid that, but is it sometimes legitimate?
Here's what I have so far that makes sure there's at least 1 uppercase letter somewhere in the name:
\A([[:alpha:]'-]+ )*[[:alpha:]'-]*[[:upper:]][[:alpha:]'-]*( [[:alpha:]'-]+)*\z
Isn't there a [:name:] bracket expression? :)
UPDATE: I added . and , to the characters allowed, surprised I didn't think of them originally. So many people must have to deal with this kind of regular expression! Nobody has any pre-made regular expressions for this sort of thing?
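For reference, the two patterns in the question behave as follows in Ruby, where POSIX bracket expressions such as [[:upper:]] and [[:alpha:]] are Unicode-aware. The constant names are hypothetical, and the sketch assumes Ruby 2.4 or later for Regexp#match?.

```ruby
# First pattern: every word must start with an uppercase letter.
EVERY_WORD_CAPITALIZED =
  /\A[[:upper:]][[:alpha:]'-]+( [[:upper:]][[:alpha:]'-]*)*\z/

# Second pattern: at least one uppercase letter somewhere in the name.
AT_LEAST_ONE_UPPERCASE =
  /\A([[:alpha:]'-]+ )*[[:alpha:]'-]*[[:upper:]][[:alpha:]'-]*( [[:alpha:]'-]+)*\z/

["Mary O'Brien", "van Rossum", "mario bel", "Über Grüße"].each do |name|
  first = EVERY_WORD_CAPITALIZED.match?(name)
  second = AT_LEAST_ONE_UPPERCASE.match?(name)
  puts "#{name}: every word capitalized=#{first}, one uppercase=#{second}"
end
```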

A good start would be to allow letters, marks, punctuation and whitespace, so that a given name like "María-Jose" and a last name like "van Rossum" (note the whitespace) are both accepted.
So that boils down to something like:
[\p{Letter}\p{Mark}\p{Punctuation}\p{Separator}]+
If you want to restrict that a bit you could have a look at classes like \p{Lowercase_Letter}, \p{Uppercase_Letter}, \p{Titlecase_Letter}, but there may be scripts that don't have casing. \p{Space_Separator} and \p{Dash_Punctuation} can narrow it down to names that I know. But names I don't...I don't know...
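As a sketch of how these property classes look in Ruby, the patterns below use the short aliases \p{L}, \p{M}, \p{P}, \p{Z}, \p{Pd} and \p{Zs} for the categories named above. The constant names, the anchoring with \A and \z, and the extra apostrophe in the narrower pattern are assumptions for the example, not part of the original suggestion.

```ruby
# Permissive: letters, marks, any punctuation, any separator.
PERMISSIVE_NAME = /\A[\p{L}\p{M}\p{P}\p{Z}]+\z/

# Narrower: letters, marks, dashes, plain spaces, and the ASCII apostrophe.
NARROWER_NAME = /\A[\p{L}\p{M}\p{Pd}\p{Zs}']+\z/

["María-Jose", "van Rossum", "Björk Guðmundsdóttir", "Smith, Jr.", "R2D2"].each do |name|
  puts "#{name}: permissive=#{PERMISSIVE_NAME.match?(name)}, " \
       "narrower=#{NARROWER_NAME.match?(name)}"
end
```

Digits belong to neither pattern, and the comma and full stop in "Smith, Jr." are only accepted by the permissive one, which shows how quickly a narrower class starts rejecting real names.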
But before you start constructing your regex for "validating" a name, please read this excellent piece on names by W3C. It will shake even your concepts of first, middle and last names.
For example:
In some cultures you are given a name (Björk, Osama) and an indication of who your father (or mother) was (Guðmundsdóttir, bin Mohammed). So the "first name" could be "Björk" but:
Björk wouldn’t normally expect to be called Ms. Guðmundsdóttir. Telephone directories in Iceland are sorted by given name.
But in other cultures, the first name is not the given name but the family name. In "Zhāng Mànyù", "Zhāng" is the family name. How to address her would depend on how well you know her, but again "Ms. Zhāng" would be strange.
The list of examples goes on and ends with some 30+ links to Wikipedia for more examples.
The article does end with suggestions for field design and some pointers on what characters to allow:
Don't forget to allow people to use punctuation such as hyphens, apostrophes, etc. in names. Don't require names to be entered all in upper case – this can be difficult on a mobile device. Allow the user to enter a name with spaces , eg. to support prefixes and suffixes such as de in French, von in German, and Jnr/Jr in American names, and also because some people consider a space-separated sequence of characters to be a single name, eg. Rose Marie.

To answer your question about capital letters: in many areas of the world, names do not necessarily start with a capital letter. In Dutch for instance, you have surnames like "van der Vliet" where words like "van", "de", "den" and "der" are not capitalised. Additionally, you have special cases like "De fauw" and "Van pellicom" where an administrative error never got rectified, and the correct capitalisation is fairly illogical. Please do not make the mistake of rejecting such names.
I also know about town names in South Africa such as eThekwini, where the capital letter is not necessarily the first letter of the word. This could very well appear in surnames or given names as well.
