How to find which Unicode letters look good in a URL

For some examples:
These characters are too short or overlap the surrounding characters:
/b5/ີ/foo
/31/ั/foo
/39/᤹/foo
/a3/ᮣ/foo
These are too long to fit into monospace character slot:
/4b/ോ/foo
/23/ᠣ/fo
/61/ᡡ/foo
/86/ᢆ/foo
/ba/຺/foo
Then blank/whitespace/invisible characters would also be considered ones that don't fit well in the URL.
Wondering if there is a simple way to figure out which characters fall into these slots:
Fits well in URL (latin characters, chinese characters, etc.).
Too large for monospace (chinese characters, the above examples, etc.).
Combining character or overlaps surrounding URL characters (examples above).
Maybe there is a way to tell this programmatically by checking some property of the Unicode character, so I don't need to go through each character individually and visually check which category it falls into.
Mainly I am looking for which characters either (a) need to be placed on another character (combining characters), or (b) need some extra padding, like the examples above, so you can see them in the URL.

The problem is ill-defined. You claim that the latter five don't fit, but for me they render in one column, which is precisely according to how it's specified in Unicode. Also see: https://stackoverflow.com/a/56216985/46395
use 5.030;
use Unicode::GCString qw();
for (
"\N{WORD JOINER}", # U+2060
"\N{LATIN SMALL LETTER L}", # U+006C
"\N{CJK UNIFIED IDEOGRAPH-4E2D}", # U+4E2D
"\N{LAO VOWEL SIGN II}", # U+0EB5
"\N{THAI CHARACTER MAI HAN-AKAT}", # U+0E31
"\N{LIMBU SIGN MUKPHRENG}", # U+1939
"\N{SUNDANESE CONSONANT SIGN PANYIKU}", # U+1BA3
"\N{MALAYALAM VOWEL SIGN OO}", # U+0D4B
"\N{MONGOLIAN LETTER O}", # U+1823
"\N{MONGOLIAN LETTER SIBE U}", # U+1861
"\N{MONGOLIAN LETTER ALI GALI THREE BALUDA}", # U+1886
"\N{LAO SIGN PALI VIRAMA}", # U+0EBA
) {
say Unicode::GCString->new($_)->columns
}
__END__
0
1
2
0
0
0
0
1
1
1
1
1
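A rough way to approximate the same classification without a width library is to look at each character's Unicode general category. The Ruby sketch below is only an illustration under stated assumptions, not part of the answer above: the helper name `url_slot` and its labels are made up here, and treating `\p{Han}` as "wide" is a crude stand-in for the East Asian Width property, which (as far as I know) Ruby regexes do not expose directly.

```ruby
# Hypothetical helper: classify one character by its Unicode general category.
def url_slot(char)
  case char
  when /\A[\p{Mn}\p{Me}]\z/      then :combining     # nonspacing/enclosing marks sit on the previous character
  when /\A[\p{Cc}\p{Cf}\p{Z}]\z/ then :invisible     # control, format, and separator characters
  when /\A\p{Mc}\z/              then :spacing_mark  # occupies a column, but still wants a base character
  when /\A\p{Han}\z/             then :wide          # assumption: Han ideographs render double width
  else                                :regular
  end
end

# A few of the code points from the Perl listing above.
samples = ["\u2060", "l", "\u4E2D", "\u0EB5", "\u0E31", "\u0D4B", "\u1823"]

samples.each do |char|
  puts format("U+%04X %s", char.ord, url_slot(char))
end
```

Characters reported as `:combining` or `:invisible` correspond to the zero-column rows in the output above. Whether a `:regular` character looks good still depends on the font in use.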


Using an escaped (magic) character as boundary in a character range in Lua patterns

The Lua manual in section 6.4.1 on Lua Patterns states
A character class is used to represent a set of characters. The
following combinations are allowed in describing a character class:
x: (where x is not one of the magic characters ^$()%.[]*+-?) represents the character x itself.
.: (a dot) represents all characters.
%a: represents all letters.
%c: represents all control characters.
%d: represents all digits.
%g: represents all printable characters except space.
%l: represents all lowercase letters.
%p: represents all punctuation characters.
%s: represents all space characters.
%u: represents all uppercase letters.
%w: represents all alphanumeric characters.
%x: represents all hexadecimal digits.
%x: (where x is any non-alphanumeric character) represents the character x. This is the standard way to escape the magic characters.
Any non-alphanumeric character (including all punctuation characters,
even the non-magical) can be preceded by a % when used to represent
itself in a pattern.
[set]: represents the class which is the union of all characters in set. A range of characters can be specified by separating the end
characters of the range, in ascending order, with a -. All classes
%x described above can also be used as components in set. All other
characters in set represent themselves. For example, [%w_] (or
[_%w]) represents all alphanumeric characters plus the underscore,
[0-7] represents the octal digits, and [0-7%l%-] represents the
octal digits plus the lowercase letters plus the - character.
You can put a closing square bracket in a set by positioning it as the
first character in the set. You can put a hyphen in a set by
positioning it as the first or the last character in the set. (You can
also use an escape for both cases.)
The interaction between ranges and classes is not defined. Therefore, patterns like [%a-z] or [a-%%] have no meaning.
[^set]: represents the complement of set, where set is interpreted
as above.
For all classes represented by single letters (%a, %c, etc.), the
corresponding uppercase letter represents the complement of the class.
For instance, %S represents all non-space characters.
The definitions of letter, space, and other character groups depend on
the current locale. In particular, the class [a-z] may not be
equivalent to %l.
(Highlighting and some formatting added by me)
So, since the "interaction between ranges and classes is not defined.", how do you create a character class set that starts and/or ends with a (magic) character that needs to be escaped?
For example,
[%%-c]
does not define a character class that ranges from % to c and includes all characters in-between but a set that consists only of the three characters %, -, and c.
The interaction between ranges and classes is not defined.
Obviously, this is not a hard and fast rule (of regex character sets in general) but a Lua implementation decision. While using shorthand classes in character sets/ranges works in some (most) regex flavors, it does not work in all of them (for example, Python's re module).
However, the second example is misleading:
Therefore, patterns like [%a-z] or [a-%%] have no meaning.
While the first example is fine since %a is a shorthand class (that represents all letters) in a set, [%a-z] is undefined and will return nil if matched against a string.
Escaped range characters in a [set]
In the second example, [a-%%], %% simply defines an escaped % sign and not a shorthand character class. The superficial problem is that the range is defined upside down, from high to low (going by the US-ASCII codes of the characters: a is 97, % is 37), much like an erroneous Lua pattern such as [f-a]. If the set is defined in the reverse order, [%%-a], it seems to work, but all it does is match the three individual characters %, -, and a instead of the range of characters between % and a (credit: cyclaminist).
This could be considered a bug and, indeed, means it is not possible to create a range of characters in a [set] if one of the characters defining the range needs to be escaped.
Possible Solution
Start the character range from the next character that does not need to be escaped - and then add the remaining escaped characters individually, e.g.
[%%&-a]
Sample:
for w in string.gmatch("%&*()-0Aa", "[%%&-a]") do
print(w)
end
This is the answer I have found. Still, maybe somebody else has something better.

Combine these regex expressions

I have two regular expressions: ^(\\p{L}|[0-9]|_)+$ and #[^[:punct:][:space:]]+ (the first is used in Java, the second on iOS). I want to combine these into one expression, to match either one or the other in iOS.
The first one is for a username so I also need to add a # character to the start of that one. What would that look like?
The ^(\\p{L}|[0-9]|_)+$ pattern in Java matches the same way as in the ICU library used on iOS (the two flavors are very similar): a whole string consisting of 1 or more Unicode letters, ASCII digits or _. It is poorly written, because the alternation group is quantified, which is much less efficient than a character class based solution, ^[\\p{L}0-9_]+$.
The #[^[:punct:][:space:]]+ pattern matches a # followed with 1 or more chars other than punctuation/symbols and whitespace chars (that is, 1 or more letters or digits, or alphanumeric chars).
What you seek can be written as
#[\\p{L}0-9_]+|#[^[:punct:][:space:]]+
or
#[\\p{L}0-9_]+|#[[:alnum:]]+
or if you want to limit to ASCII digits and not match Unicode digits:
#[\\p{L}0-9_]+|#[\\p{L}0-9]+
It matches
# - a # symbol
[\\p{L}0-9_]+ - 1 or more Unicode letters, ASCII digits, _
| - or
# - a # char
[[:alnum:]]+ - 1 or more letters or digits.
[^[:punct:][:space:]]+ - any 1+ chars other than punctuation/symbols and whitespace.
Basically, all these expressions match strings like this.
If you want to match ＃SomeThing_123 in full, just use [#＃]\\w+, a # or a ＃ (the fullwidth number sign) and then 1 or more letters, digits or _, or to only allow ASCII digits, [#＃][\\p{L}0-9_]+.
A word boundary may be required at the end of the pattern, [#＃][\\p{L}0-9_]+\\b.

Ruby: Split a string into substring of maximum 40 characters

I have some strings containing a sentence, and I need to divide each one into substrings of at most 40 characters.
But I don't want to split the sentence in the middle of a word.
I tried the .gsub function; it returns at most 40 characters and avoids cutting the string in the middle of a word, but it returns only the first occurrence.
sentence[0..40].gsub(/\s\w+$/,'')
I tried split, but I can only select the first 40 characters, and it splits in the middle of a word...
sentence.split(...){40}
My string is "Sure, we will show ourselves only when we know the east door has been opened.".
The string output i want is
["Sure, we will show ourselves only when we","know the east door has been opened."]
Do you have a solution ? Thanks
Your first attempt:
sentence[0..40].gsub(/\s\w+$/,'')
almost works, but it has one fatal flaw. You are splitting on the number of characters before cutting off the last word. This means you have no way of knowing whether the bit being trimmed off was a whole word, or a partial word.
Because of this, your code will always cut off the last word.
I would solve the problem as follows:
sentence[/\A.{0,39}[a-z]\b/mi]
\A is an anchor to fix the regex to the start of the string.
.{0,39}[a-z] matches 1 to 40 characters, where the last character must be a letter. This prevents the last selected character from being punctuation or a space. (Is that the desired behaviour? Your question didn't really specify. Feel free to tweak or remove the [a-z] part, e.g. use [a-z.] to also allow a full stop.)
\b is a word boundary. It is a zero-width assertion that matches at the beginning or end of a word.
The /mi modifiers make the match case insensitive (so [a-z] also covers A-Z) and, in Ruby, let . match newline characters as well.
One very minor note: because this regex matches 1 to 40 characters (rather than zero), it is possible to get a nil result. (This is very unlikely, since you'd need a string that starts with a single word of 41+ letters!) To account for this edge case, call .to_s on the result if needed.
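As a quick check of the pattern and of the nil edge case, here is a small sketch using the sentence from the question. The constant name `FIRST_CHUNK` and the 41-letter test word are only illustrative.

```ruby
sentence = "Sure, we will show ourselves only when we know the east door has been opened."
FIRST_CHUNK = /\A.{0,39}[a-z]\b/mi

# The longest prefix of at most 40 characters that ends on a complete word.
first = sentence[FIRST_CHUNK]
puts first.inspect   # "Sure, we will show ourselves only when"

# A string that starts with a word longer than 40 letters has no word boundary
# inside the 40-character window, so the lookup returns nil.
long_word = "a" * 41
puts long_word[FIRST_CHUNK].inspect        # nil
puts long_word[FIRST_CHUNK].to_s.inspect   # ""
```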
Update: Thank you for the improved edit to your question, providing a concrete example of an input/result. This makes it much clearer what you are asking for, as the original post was somewhat ambiguous.
You could solve this with something like the following:
sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
String#scan returns an array of strings that match the pattern - so you can then re-join these strings to reconstruct the original.
Again, I have added a few more characters (!?,;) to the list of "final characters in the substring". Feel free to tweak this as desired.
(?:\b|$) means "either a word boundary, or the end of the line". This fixes the issue of the result not including the final . in the substrings. Note that I have used a non-capture group (?:) to prevent the result of scan from changing.
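Applied to the sentence from the question, the scan can be sketched as follows. The names `CHUNK`, `chunks` and `trimmed` are illustrative, and stripping the pieces is an optional extra step that the answer does not prescribe.

```ruby
sentence = "Sure, we will show ourselves only when we know the east door has been opened."
CHUNK = /.{0,39}[a-z.!?,;](?:\b|$)/mi

chunks = sentence.scan(CHUNK)
p chunks
# ["Sure, we will show ourselves only when", " we know the east door has been opened."]

# Every chunk after the first keeps its leading space, which is why joining the
# chunks restores the original sentence. Strip them if you want clean pieces.
trimmed = chunks.map(&:strip)
p trimmed
p chunks.join == sentence   # true
```

Note that the first chunk stops after "when" rather than after "we": the requested output in the question is 41 characters long, one more than the 40-character limit allows.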

ruby/rails detect financial track data and return nil/empty string

I read through similar Stack Overflow questions to understand financial track card data.
I think the issue I am facing might be slightly different, or maybe I am just weak at regex.
We have a service that accidentally returns track data instead of the guest name.
My goal is to display "" (an empty string) every time I receive track data, and otherwise return the guest name. (This is a temporary solution until we fix the root cause.)
This is my regular expression, but it doesn't seem to detect track data.
irb(main):043:0> guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
irb(main):044:0> (/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/.match(guestname)) ? "" : guestname
=> "%4234242xx12^TEST/GUEST L ^324532635645744646462"
(Not real data)
Now, looking at the wiki for track data information I want to cover most cases, if not all:
https://en.wikipedia.org/wiki/Magnetic_stripe_card#Financial_cards
Could someone help with my regex? This is what I have:
/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/
Track 1, Format B:
Start sentinel — one character (generally '%')
Format code="B" — one character (alpha only)
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Field Separator — one character (generally '^')
Name — 2 to 26 characters
Field Separator — one character (generally '^')
Expiration date — four characters in the form YYMM.
Service code — three characters
Discretionary data — may include Pin Verification Key Indicator (PVKI,
1 character), PIN Verification Value (PVV, 4 characters), Card
Verification Value or Card Verification Code (CVV or CVC, 3
characters)
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track.
Track 2: This format was developed by the banking industry (ABA). This
track is written with a 5-bit scheme (4 data bits + 1 parity), which
allows for sixteen possible characters, which are the numbers 0-9,
plus the six characters : ; < = > ? . The selection of six
punctuation symbols may seem odd, but in fact the sixteen codes simply
map to the ASCII range 0x30 through 0x3f, which defines ten digit
characters plus those six symbols. The data format is as follows:
Start sentinel — one character (generally ';')
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Separator — one char (generally '=')
Expiration date — four characters in the form YYMM.
Service code — three digits. The first digit specifies the interchange
rules, the second specifies authorisation processing and the third
specifies the range of services
Discretionary data — as in track one
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track. Most
reader devices do not return this value when the card is swiped to the
presentation layer, and use it only to verify the input internally to
the reader.
Your example input string does not contain a format code after the first sentinel.
You are trying to parse the HTML-encoded version, which is weird.
So, I would start with HTML decoding, e.g. with Nokogiri:
▶ guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
▶ parsed = Nokogiri::HTML.parse(guestname).text
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
OK, now we at least have a leading percent. Now let us ask ourselves: how many users have a guest name starting with a percent sign? I bet none. You might re-check yourself by running a query against your database. Since it is a temporary solution, I would definitely shut the perfectionism up and go with:
▶ parsed =~ /\A%/ ? '' : parsed
Hope it helps.
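If a bare leading-percent check feels too coarse, a slightly stricter variant could follow the track layouts quoted in the question. This is only a sketch under assumptions: the patterns `TRACK1` and `TRACK2` and the helper `display_guest_name` are hypothetical, they are deliberately loose (masked digits such as "x" and a missing format code are tolerated, as in the sample input), and they expect a string that has already been HTML-decoded.

```ruby
# Hypothetical, deliberately loose patterns derived from the quoted track layouts.
# Track 1: start sentinel, optional format code, PAN (up to 19), ^, name (2-26), ^
TRACK1 = /\A%[A-Za-z]?[\dxX]{1,19}\^[^\^]{2,26}\^/
# Track 2: start sentinel, PAN (up to 19), = separator
TRACK2 = /\A;[\dxX]{1,19}=/

# Returns "" when the value looks like track data, otherwise the trimmed value.
def display_guest_name(raw)
  text = raw.to_s.strip
  (text.match?(TRACK1) || text.match?(TRACK2)) ? "" : text
end

puts display_guest_name("%4234242xx12^TEST/GUEST L ^324532635645744646462").inspect
puts display_guest_name("Jane Doe").inspect
```

The sample string in the first call is the (not real) data from the question, and "Jane Doe" is a made-up guest name.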

Full name regex in Ruby

I know there are lots of similar questions, but I couldn't find my case anywhere.
I'm trying to write a Full Name RegEx in Ruby on Rails user model.
It should validate that the first name and last name are both present and separated by exactly one whitespace character. Each of the names should contain at least 2 characters (ex: Li Ma).
As a bonus (not required), I would like to collapse the whitespace to a single character in case the user mistypes and enters more than one space (ex: Li    Ma would be trimmed to Li Ma).
Currently I'm validating it like that (Warning: It might be incorrect):
validates :name,
  presence: true,
  length: {
    maximum: 64,
    minimum: 5,
    message: 'must be a minimum: 5 letters and a maximum: 64 letters'
  },
  format: {
    # Full Name RegEx
    with: /[\w\-\']+([\s]+[\w\-\']){1}/
  }
This works for me, but it doesn't enforce a minimum of 2 characters for each name (ex: Peter P currently passes). It also accepts multiple whitespace characters between the names, which is not good (ex: Peter   P).
I know that this problem of identifying names is very culture-centric and it might be not a proper way to validate full name (maybe there are people with one character name), but this is currently a requirement.
I don't want to split this field to 2 different fields First name and Last name as it will complicate user interface.
You could match the following regex:
/([\w\-\']{2,})([\s]+)([\w\-\']{2,})/
and replace the match with (assuming capturing groups are supported)
'\1 \3' or $1 $3, whatever the syntax is.
It gets rid of the extra whitespace and keeps only one space, as you wanted.
Demo: http://regex101.com/r/oQ6aO7
result = subject.gsub(/\A(?=[\w' -]{5,64})([\w'-]{2,})([\s]{1})\s*?([\w'-]{2,})\Z/, '\1\2\3')
http://regex101.com/r/dT1fJ4
Assert position at the beginning of the string «\A»
Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=[\w' -]{5,64})»
Match a single character present in the list below «[\w' -]{5,64}»
Between 5 and 64 times, as many times as possible, giving back as needed (greedy) «{5,64}»
A word character (letters, digits, and underscores) «\w»
The character “'” «'»
The character “ ” « »
The character “-” «-»
Match the regular expression below and capture its match into backreference number 1 «([\w'-]{2,})»
Match a single character present in the list below «[\w'-]{2,}»
Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
A word character (letters, digits, and underscores) «\w»
The character “'” «'»
The character “-” «-»
Match the regular expression below and capture its match into backreference number 2 «([\s]{1})»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «[\s]{1}»
Exactly 1 times «{1}»
Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) «\s*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 3 «([\w'-]{2,})»
Match a single character present in the list below «[\w'-]{2,}»
Between 2 and unlimited times, as many times as possible, giving back as needed (greedy) «{2,}»
A word character (letters, digits, and underscores) «\w»
The character “'” «'»
The character “-” «-»
Assert position at the end of the string (or before the line break at the end of the string, if any) «\Z»
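Putting the two ideas together (collapse the whitespace first, then validate), a plain-Ruby sketch might look like this. The names `FULL_NAME`, `normalize_name` and `valid_full_name?` are illustrative and not part of the answers above. In a Rails model, the normalization would typically run before validation, and ActiveSupport's String#squish does the same collapsing.

```ruby
# Exactly two names of at least 2 characters each, separated by one space.
# Note: in Ruby, \w only covers ASCII letters, digits and underscore.
FULL_NAME = /\A[\w'-]{2,} [\w'-]{2,}\z/

# Trim the ends and collapse any run of whitespace into a single space.
def normalize_name(raw)
  raw.to_s.strip.gsub(/\s+/, " ")
end

# Validate the normalized value against the length limits and the pattern.
def valid_full_name?(raw)
  name = normalize_name(raw)
  name.length.between?(5, 64) && name.match?(FULL_NAME)
end

puts normalize_name("  Li    Ma ").inspect
puts valid_full_name?("Li    Ma")
puts valid_full_name?("Peter P")
```

To accept accented or non-Latin letters, `\w` could be swapped for `\p{L}`, at the cost of departing from the patterns shown above.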

Resources