How are URLs with right-to-left TLDs represented? - url

I'm writing some Ruby code that does some text analysis on domain names. In looking at the list of valid TLDs, I see some that use right-to-left languages such as:
تونس.
سوريا.
السعودية.
Just looking at those TLDs alone shows that the dot (.) appears to the right instead of the left. If I came across a domain like this in the wild, how would the URL be structured? Specifically, a left-to-right URL is structured as:
<protocol>://[<user>:<pass>@]<host>:<port>/<path>[?<query>]
Additionally, the <host> portion above could be broken out to look like:
[<subdomain>.]<domain>.<tld>
(e.g. "foo.example.com")
What is the structure of a right-to-left language URL?

The short answer: the structure is the same.
As for the dot: by default the system does not render the dot as right-to-left unless some text is written before it. In your case, once you deleted the domain, the dot became the first character with nothing before it, so the system displayed it as a left-to-right character.
Example:
In a left-to-right string, for example, when we have
A[dot]B
and you delete A, it becomes:
[dot]B
In a right-to-left string (such as Arabic), when we have B[dot]A and you delete A, it should be displayed as B[dot]. But because the dot is now the first character, the system displays it as a left-to-right character. So it is shown as [dot]B, and the text that follows is displayed right-to-left.
As for the structure: the order of characters in the string does not depend on the language direction. So when you split نطاق.السعودية on the dot, for example, you get string[0] = "نطاق" // domain
and string[1] = "السعودية" // TLD.
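Since the question mentions Ruby, here is a minimal sketch of that split. It assumes the host string is already in logical (storage) order, which is how it arrives from a file or the network; the variable names are illustrative only.

```ruby
# Logical (in-memory) order is domain first, then TLD, regardless of
# how a bidi-aware terminal or editor chooses to draw the string.
host = "نطاق.السعودية"

labels = host.split(".")
domain = labels[0]   # the domain label
tld    = labels[-1]  # the TLD label

puts "labels: #{labels.length}"
puts "domain: #{domain}"
puts "tld:    #{tld}"
```

The on-screen order of the two labels may look reversed, but the array indices follow the stored order.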

Related

Eggplant : How to read text with special characters like ' _ etc

I am trying to read a text in a given rectangle using readText() function.
The function works correctly except when it has to read some text which has special characters like ' _ & etc.
I tried using validCharacters with readText() function. But it didn't help.
Code -
put ReadText((287,125,810,164),validCharacters:"_-'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890") into Login
I tried working with character collections, but that doesn't seem right because the text I am trying to pick up is a dynamic combination of numbers, letters, and a special character. One cannot create a character collection covering every letter (a-z, A-Z), number (0-9), and special character.
Example of text trying to read:
Login_Userid1_1, Login'Userid1_1
So how do I read such text correctly?
Debugging OCR is a bit of an imprecise science. EggPlant has a lot of OCR parameters to tweak. When designing test cases, it's best to use other mechanisms to gather information whenever possible. ReadText() should be considered a last resort when more reliable methods are unavailable. When I've used it, I've often needed a lot of trial and error to find the right combination of settings and SearchRectangle to get consistent results. Without seeing exactly what images you are trying to read text from, it's difficult, if not impossible, to troubleshoot where the issue might be.
One thing that does stand out to me is that you're trying to read strings that may contain underscores. ReadText() has an optional property IgnoreUnderscores which treats underscores as spaces. By default this property is set to ON. It defaults to ON because some OCR engines have problems identifying underscore characters consistently.
If you want to have ReadText() handle underscores you'll want to explicitly set this property to OFF.
ReadText(rect, validCharacters:chars, ignoreUnderscores:OFF)

NSRegularExpression not matching number sign (#)

I'm working on a Guitar Chord transposer, and so from a given text file, I want to identify guitar chords. e.g. G#, Ab, F#m, etc.
I'm almost there! I have run into a few problems already due to the number sign (hash tag).
#
For example, you can't include the number sign in your regex pattern. The NSRegularExpression will not initialize with this:
let fail: String = "\\b[ABCDEFG](b|#)?\\b"
let success: String = "\\b[CDEFGAB](b|\\u0023)?\\b"
I had to specifically provide the unicode character. I can live with that.
However, now that I have a NSRegularExpression object, it won't match these (sharps = number sign) when I have a line of text such as:
Am Bb G# C Dm F E
When it starts processing the G#, the sharp associated with that second capture group is not matched. (i.e. the NSTextCheckingResult's second range has a location of NSNotFound) Note, it works for Bb... it matches the 'b'
I'm wondering what I need to do here. The documentation doesn't seem to cover this case of '#', which IS in fact sometimes used in regex patterns (I think related to comments or something).
One thing that would be great would be to not have to look up the unicode identifier for a #, but just use it as a String "#" then convert that so it plays nicely with the pattern. There exists the chance that \u0023 is in fact not the code associated with # ...
The \b word boundary is a context-dependent construct. It matches in four contexts: 1) between the start of the string and a word char, 2) between a word char and the end of the string, 3) between a word char and a non-word char, and 4) between a non-word char and a word char.
Your regex is written in such a way that ultimately the regex engine sees a \b after # and that means a # will only match if there is a word char after it.
If you replace \b with (?!\w), a negative lookahead that fails the match if there is a word char immediately to the right of the current location, it will work.
So, you may use
\\b[CDEFGAB](b|\\u0023)?(?!\\w)
See the regex demo.
Details
\b - a word boundary
[CDEFGAB] - a char from the set
(b|\\u0023)? - an optional sequence of b or #
(?!\\w) - a negative lookahead that fails the match if there is a word char immediately to the right of the current position. Note that a failed lookahead causes backtracking into the preceding optional group; to prevent that, add + after the ? to make the quantifier possessive.
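The difference between the trailing \b and the (?!\w) lookahead can be seen in a short sketch. It is written in Ruby rather than Swift, so it uses Ruby's regex engine and not NSRegularExpression, and it uses a non-capturing group so that scan returns whole matches; the word-boundary logic being illustrated is the same.

```ruby
line = "Am Bb G# C Dm F E"

# Trailing \b: there is no word boundary between "#" and the space,
# so the engine backtracks and matches "G" without the sharp.
with_boundary = line.scan(/\b[CDEFGAB](?:b|#)?\b/)

# Negative lookahead: only requires that no word char follows.
with_lookahead = line.scan(/\b[CDEFGAB](?:b|#)?(?!\w)/)

p with_boundary   # => ["Bb", "G", "C", "F", "E"]
p with_lookahead  # => ["Bb", "G#", "C", "F", "E"]
```

Am and Dm are skipped in both cases because this pattern does not yet include chord types.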
(I'd like to first say @WiktorStribiżew has been a tremendous help and what I am writing now would not have been possible without him! I'm not concerned about StackOverflow points and rep, so if you like this answer, please upvote his answer.)
This issue took many turns and had a few issues going on. Ultimately this question should be called How do I use Regex on iOS to detect Musical Chords in a text file?
The answer is (so far), not simply.
CRASH COURSE IN MUSIC THEORY
In music you have notes. Each is made up of a letter from A to G and an optional symbol called an accidental. (A note relates to the acoustic frequency of the sound you hear when that note is played.) An accidental can be a flat (represented as ♭ or simply b) or a sharp (represented as ♯ or simply #, as these are easier to type on a keyboard). An accidental makes a note a semitone higher (#) or lower (b). As such, an F# is the same acoustic frequency as a Gb. On a piano, the white keys are notes with no accidentals, and the black keys represent notes with an accidental. A piece of music generally won't mix accidental types: it will use either flats or sharps throughout. (This depends on the musical key of the composition, but that is not relevant here.)
In terms of regex, you have something like [ABCDEFG](b|#)? to determine the note. In reality it's more complicated.
Then, a musical chord is composed of the root note and its chord type. There are over 50 types of chords. Each has a unique 'text signature'; a 'major' chord has an empty signature. So in terms of pseudo-regex, a chord looks like:
[ABCDEFG](b|#)?(...|...|...)?
where the first part you recognize as the note (as before), and the last optional is to determine the chord type. The different types were omitted, but can be as simple as a m (for Minor chord), or maj7#5 (for a major 7th chord with an augmented 5th... don't worry about it. Just know there are many string constants to represent a chord type)
Then finally, with guitar you often have a corresponding bass note that changes the chord's tonality somewhat. You denote this by adding a slash and then the note, giving the general pseudoform:
[ABCDEFG](b|#)?(...|...|...)?(/[ABCDEFG](b|#)?)? // NOT real Regex
real examples: C/F or C#m/G# and so on
where the last part has a slash then the same pattern to recognize a note.
So putting these all together, in general we want to find chords that could take on many forms, such as:
F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11
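As a rough illustration of how the pseudo-form above could become a real pattern, here is a minimal sketch in Ruby (not the Swift code this answer refers to, and not NSRegularExpression). The CHORD_TYPES list is a hypothetical four-entry subset of the 50+ signatures, and parse_chord is an illustrative helper that expects one whitespace-separated token at a time.

```ruby
# A note: a letter A-G with an optional flat (b) or sharp (#).
NOTE = /[A-G](?:b|#)?/

# Hypothetical subset of chord-type signatures, longest first so that
# "maj13#11" is tried before the shorter alternatives.
CHORD_TYPES = %w[maj13#11 maj7 m 7]

# root, optional type, optional "/bass", anchored to the whole token.
CHORD = %r{\A(?<root>#{NOTE})(?<type>#{Regexp.union(CHORD_TYPES)})?(?:/(?<bass>#{NOTE}))?\z}

# Returns a hash of the chord's parts, or nil if the token is not a chord.
def parse_chord(token)
  m = CHORD.match(token)
  m && { root: m[:root], type: m[:type], bass: m[:bass] }
end

"F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11".split.each do |token|
  p parse_chord(token)
end
```

Anchoring each token with \A and \z sidesteps the word-boundary problem around #, at the cost of requiring the input to be split on whitespace first.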
I was hoping to find one Regex to rule them all. I ended up writing code that works, though it seems like I kind of hacked around a bit to get the results I desired.
You can see this code here, written in Swift. It is not complete for my purposes, but it will parse a string, return a list of Chord Results and their text range within the original string. From there you would have to finish the implementation to suit your needs.
There have been a few issues on iOS:
iOS does not handle the number sign (#) well at all. When providing regex patterns or match text, I either had to replace the # with its Unicode escape \u0023, or (what ultimately worked) replace all occurrences of # with another character (such as 'S') and then convert it back once the regex had done its thing. So the code I wrote often has to 'sanitize' the pattern or the input text before doing anything.
I couldn't get a single regex pattern to perfectly parse a chord structure. It would successfully match a chord with a bass note but not fully parse it, so I had to split those two components, parse them separately, and then recombine them.
Regex is really a bit of voodoo, and I think it sucks that for something so confusing to many people, there are also different platform-dependent implementations of it. For example, Wiktor referred me to regex patterns he wrote on www.regex101.com to help me solve the problem; they would work on that website but not on iOS, where NSRegularExpression would throw an error (often it had something to do with the # character).
My solution pays absolutely no regard to performance. I just wanted it to work.

Unicode URLs shown in wrong order

I have enabled Unicode URLs on my Joomla site. My language is Persian, which is a right-to-left language, but URLs written in Persian appear in the wrong order. For example:
Mysite.com/محصولات/محصول-اول
It translates to:
Mysite.com/first-product/products
Which should have been:
Mysite.com/products/first-product
This is only a matter of displaying text. I know that the actual text the server receives is in correct order because url-encoded version has the correct order.
(If you don't get the idea type "something.com/" in your url bar. Now copy/paste this at the end of url
محصولات
Now type a slash and copy/paste this at the end
محصول
You see? The last one should have gone to the right but goes to the left)
I have two questions regarding this issue:
1- Is there anything I can do to display URLs in the correct order?
2- Can it affect how Google indexes my pages? Can it misdirect Google?
The behaviour of the URL display is entirely correct in the Unicode sense, as the slash has no strong direction of its own (its Unicode bidi class is CS, common separator, which behaves like a neutral between letters):
http://www.fileformat.info/info/unicode/char/002f/index.htm
Thus, standing between two Arabic-script (right-to-left) words, the slash adopts the writing direction of the surrounding words. Within a right-to-left neighborhood, though, the slash never adopts the writing direction of the whole line.
To answer your questions:
(1) It is not possible to influence this behaviour if you do not change the URL, as Jukka K. Korpela already assumed.
(2) As long as the order of the words is correctly encoded, I do not see any bad consequences for search engine indexings.
If you want to change it anyway, and assuming that your URLs are artificial and do not represent real paths, I can see the following workarounds:
(a) Substitute the slash with another "strong" symbol which influences the writing direction.
(b) Insert a "pseudo-strong" character, the left-to-right mark (U+200E), before the slash, which will enforce LTR for the slash.
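Workaround (b) could be sketched as follows in Ruby. The display_url helper is hypothetical, and the sketch assumes the marks are added only to text shown to the user (for example link text), not to the address that is actually requested.

```ruby
LRM = "\u200E"  # LEFT-TO-RIGHT MARK: invisible, strongly left-to-right

# Hypothetical helper: prefix every slash with an LRM, for display only.
def display_url(url)
  url.gsub("/", "#{LRM}/")
end

url   = "Mysite.com/محصولات/محصول-اول"
shown = display_url(url)

puts shown
# Strip the marks again before treating the string as a real URL.
puts shown.delete(LRM) == url  # prints true
```

How the result is drawn still depends on the renderer; the sketch only shows where the marks go.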
Hope this helps.

How do I implement hotkeys in ideographic languages?

I have an application implemented in German / English. It uses property files for the translation strings in the various menus and dialogs. The problem I have is that these files have a separate mnemonic field like so:
Field1_Label=Open a file
Field1_Label_MNEMONIC=1O
So in this example, the MNEMONIC tells the dialog to underline the O, and if the user types ALT+O, the dialog will set focus to the entry field / button associated with the label.
So far so good.
The problem I face is that the product is being translated into Chinese and Japanese. These ideographic languages use input method editors (IMEs) to compose their symbols. A symbol might be composed by phonetically typing the word into the IME which then produces the corresponding Chinese text. So I can't underline a symbol because there is no key equivalent to it.
So what do I do? What is best practice for dealing with this? I could potentially just remove all mnemonics altogether. I could potentially add an ASCII character at the end of the string to act as the mnemonic.
But what is the best industry practice for this?
The usual practice is what you hinted at in your question: the Latin character used for the original mnemonic is appended to the translated text in parentheses. Look at some screenshots of e.g. Japanese user interfaces and you will notice that UI elements tend to look like this:
File(F) | Edit(E) | View(V) | ...
Here are some examples:
http://www.komeiharada.com/Japanese/Tategaki.gif
http://i.stack.imgur.com/7N5XB.png
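A minimal sketch of that convention, in Ruby for illustration: label_with_mnemonic is a hypothetical helper, and the Japanese strings are example translations of "File" and "Edit", not taken from any particular product.

```ruby
# Hypothetical helper: append the Latin mnemonic key in parentheses
# to a translated label.
def label_with_mnemonic(translated, mnemonic)
  "#{translated}(#{mnemonic.upcase})"
end

puts label_with_mnemonic("ファイル", "f")  # prints ファイル(F)
puts label_with_mnemonic("編集", "e")      # prints 編集(E)
```

The mnemonic key itself stays the same as in the original language, so ALT+F still works without going through the IME.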

Why use - instead of _ in a URL?

Why use - instead of _ in a URL?
A URL containing '_' seems to have no ill effects.
Underscores are not allowed in a host name. Thus some_place.com is not a valid URL because the host name is not valid. Underscores are permissible in the rest of a URL, such as the path. Thus some-place.com/which_place/ is perfectly legitimate, other concerns aside.
From RFC 1738:
host
[...] Fully qualified domain names take the form as described
in Section 3.5 of RFC 1034 [13] and Section 2.1 of RFC 1123
[5]: a sequence of domain labels separated by ".", each domain
label starting and ending with an alphanumerical character and
possibly also containing "-" characters. The rightmost domain
label will never start with a digit, though, which
syntactically distinguishes all domain names from the IP
addresses.
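The rule quoted above can be turned into a small check. This is a sketch in Ruby under stated assumptions: valid_hostname? is a hypothetical helper, it covers only the syntax described in the quote, and it ignores length limits and internationalized names.

```ruby
# One label: starts and ends with an alphanumeric character and may
# contain "-" in between.
LABEL = /\A[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z/i

# Hypothetical helper implementing only the quoted RFC 1738 rule.
def valid_hostname?(host)
  labels = host.split(".")
  return false if labels.empty?
  # Every label must be well formed, and the rightmost label must not
  # start with a digit.
  labels.all? { |label| LABEL.match?(label) } && (labels.last !~ /\A\d/)
end

puts valid_hostname?("some-place.com")  # prints true
puts valid_hostname?("some_place.com")  # prints false
```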
When you read a_long_sentence_with_many_underscores, because you are reading it by letter or word recognition, your eye tracks along the middle of the line, but when you reach an underscore, your eye is more likely to track down a bit and back up for the next word.
When you read a-long-sentence-with-many-dashes, your eye keeps tracking along the same horizon, and by sight, it is easier for your brain to try and ignore them.
Another good reason is that Google and other search engines rank URLs that match search terms higher when the word separator is a dash.
One main reason is that most anchor tags have text-decoration:underline, which effectively hides your underscore.
And a non-tech-savvy user won't automatically assume that there is an underscore :)
By the way, java.net.URI (and Java network libraries that rely on it) will not be able to extract the host from a URL whose host name contains an underscore:
import java.net.URI;
URI withHyphen = URI.create("http://www.google-plus.com/");
System.out.println(withHyphen.getHost()); // prints www.google-plus.com
URI withUnderscore = URI.create("http://www.google_plus.com/");
System.out.println(withUnderscore.getHost()); // prints null
It's easier to type (at least on my German keyboard) and see.
