Unicode URLs shown in wrong order - url

I have enabled Unicode URLs on my Joomla site.
My language is Persian, which is a right-to-left language, but
URLs written in Persian appear in the wrong order. For example:
Mysite.com/محصولات/محصول-اول
It translates to:
Mysite.com/first-product/products
Which should have been:
Mysite.com/products/first-product
This is only a matter of displaying text. I know that the actual text the server receives is in the correct order because the URL-encoded version has the correct order.
(If you don't get the idea, type "something.com/" in your URL bar. Now copy and paste this at the end of the URL:
محصولات
Now type a slash and copy/paste this at the end
محصول
You see? The last one should have gone to the right, but it goes to the left.)
I have two questions regarding this issue:
1- Is there anything I can do to display URLs in the correct order?
2- Can it affect how Google indexes my pages? Can it misdirect Google?

The behaviour of the URL display is entirely correct in the Unicode sense, as the slash has no strong directionality of its own and is effectively treated as bidirectionally neutral here:
http://www.fileformat.info/info/unicode/char/002f/index.htm
Thus, standing between two Arabic (right-to-left) words, the slash has to adapt to the writing direction of the surrounding words. Within a right-to-left neighborhood, however, the slash would never adapt to the writing direction of the whole line.
To answer your questions:
(1) It is not possible to influence this behaviour if you do not change the URL, as Jukka K. Korpela already assumed.
(2) As long as the order of the words is correctly encoded, I do not see any bad consequences for search engine indexing.
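As a rough illustration of point (2), the sketch below percent-encodes the example path from the question with Python's standard urllib.parse and checks that the encoded segments keep their logical order (products first, then first-product), whatever the address bar displays. The variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# Logical (in-memory) order: "products" segment first, then "first-product".
segments = ["محصولات", "محصول-اول"]
path = "/" + "/".join(segments)

# Percent-encode the path as UTF-8; "/" is left unescaped by default.
encoded = quote(path)
print(encoded)

# Decoding each encoded segment gives back the segments in the same order.
decoded = [unquote(part) for part in encoded.split("/") if part]
print(decoded == segments)  # True
```

The encoded string is what the server and a crawler see, so the visual reordering in the address bar does not change which segment comes first.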
If you want to change it anyway, and assuming that your URLs are artificial and do not represent real paths, I can see the following workarounds:
(a) Substitute the slash with another "strong" symbol which influences the writing direction.
(b) Insert a "pseudo strong" character (U+200E, LEFT-TO-RIGHT MARK) before the slash, which will enforce LTR for the slash.
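A minimal sketch of workaround (b), assuming the displayed text is built in Python: it looks up the bidirectional classes with the standard unicodedata module and inserts U+200E before each slash. The helper name with_ltr_slashes is hypothetical, and the resulting string is a different sequence of characters, so it is suitable for display text only, not for the actual link target.

```python
import unicodedata

LRM = "\u200e"             # LEFT-TO-RIGHT MARK, an invisible strong LTR character
STRONG = {"L", "R", "AL"}  # the strong bidirectional classes

def with_ltr_slashes(path):
    """Hypothetical helper: put an LRM before every slash (display text only)."""
    return path.replace("/", LRM + "/")

# The slash is not a strong character, so it follows its neighbours.
slash_class = unicodedata.bidirectional("/")
print(slash_class in STRONG)                  # False

# The mark is strongly left-to-right; Arabic letters are strongly right-to-left.
print(unicodedata.bidirectional(LRM))         # L
print(unicodedata.bidirectional("\u0645"))    # AL (ARABIC LETTER MEEM)

original = "محصولات/محصول-اول"
marked = with_ltr_slashes(original)
print(len(marked) - len(original))            # 1 (one mark was inserted)
```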
Hope this helps.

Related

Eggplant : How to read text with special characters like ' _ etc

I am trying to read text in a given rectangle using the readText() function.
The function works correctly except when it has to read text that contains special characters like ' _ & etc.
I tried using validCharacters with the readText() function, but it didn't help.
Code -
put ReadText((287,125,810,164),validCharacters:"_-'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890") into Login
I tried working with character collections, but that doesn't seem right because the text I am trying to read is a dynamic combination of numbers, letters and a special character. One cannot create a character collection library covering every letter (a-z, A-Z), number (0-9) and special character.
Example of text trying to read:
Login_Userid1_1, Login'Userid1_1
So how do I read such text correctly?
Debugging OCR is a bit of an imprecise science. EggPlant has a lot of OCR parameters to tweak. When designing test cases it's best to try to use other mechanisms to gather information whenever possible. ReadText() should be considered a last resort when more reliable methods are unavailable. When I've used it, I've often needed a lot of trial and error to find the right combination of settings and SearchRectangle to get consistent results. Without seeing exactly what images you are trying to read text from, it's difficult or impossible to troubleshoot where the issue might be.
One thing that does stand out to me is that you're trying to read strings that may contain underscores. ReadText() has an optional property IgnoreUnderscores which treats underscores as spaces. By default this property is set to ON. It defaults to ON because some OCR engines have problems identifying underscore characters consistently.
If you want to have ReadText() handle underscores you'll want to explicitly set this property to OFF.
ReadText(rect, validCharacters:chars, ignoreUnderscores:OFF)

How are URLs with right-to-left TLDs represented?

I'm writing some Ruby code that does some text analysis on domain names. In looking at the list of valid TLDs, I see some that use right-to-left languages such as:
تونس.
سوريا.
السعودية.
Just looking at those TLDs alone shows that the dot (.) appears to the right instead of the left. If I came across a domain like this in the wild, how would the URL be structured? Specifically, a left-to-right URL is structured as:
<protocol>://[<user>:<pass>@]<host>:<port>/<path>[?<query>]
Additionally, the <host> portion above could be broken out to look like:
[<subdomain>.]<domain>.<tld>
(e.g. "foo.example.com")
What is the structure of a right-to-left language URL?
The short answer: the structure is the same.
As for the dot: by default the system does not display the dot as right-to-left unless there is text written before it. In your case, when you deleted the domain, the dot became the first character with nothing before it, so the system displayed it as an LTR character.
Example:
In a left-to-right string, for example when we have
A[dot]B
and you delete A, it becomes:
[dot]B.
In a right-to-left string (such as Arabic), for example when we have B[dot]A and you delete A, it should be displayed as B[dot]. But because the dot is now the first character, the system shows it as a left-to-right character. So it is shown as [dot]B, and whatever comes after B is displayed right-to-left.
As for the structure: the order of the characters does not depend on the language direction, so when you split نطاق.السعودية for example, you will find string[0] = "نطاق"//domain
and string[1] = "السعودية"//TLD.
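The point about logical order can be checked with a short sketch. It is written in Python rather than the asker's Ruby, and the scheme and path used in the sample URL are assumptions made only for illustration.

```python
import unicodedata
from urllib.parse import urlsplit

domain = "نطاق"
tld = "السعودية"
host = domain + "." + tld  # logical order: domain, dot, TLD

# Splitting on the dot gives the labels in the same order as for an LTR name.
labels = host.split(".")
print(labels[0] == domain)  # True
print(labels[1] == tld)     # True (the TLD is still the last label)

# The dot has no strong direction, which is why its displayed position moves.
print(unicodedata.bidirectional(".") in {"L", "R", "AL"})  # False

# The usual URL structure applies unchanged (sample scheme and path assumed).
parts = urlsplit("http://" + host + "/path")
print(parts.hostname == host)  # True
```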

Using commas in URLs can break the URL sometimes?

Is anyone aware of any problems with using commas in SEO-friendly URLs? I'm working with some software that uses a lot of commas in its SEO-friendly URLs, but I am 100% certain I have seen some instances where some programs/platforms don't recognize the URL correctly and cut the "linking" of the URL off after the first comma.
I just tested this out with Thunderbird, Gmail, Hotmail and on an SMF forum with no problems; however, I know I have seen the issue before.
So my question is: is there anything in particular that would cause some platforms to stop linking URLs at a comma? Such as a certain character after the comma?
There will be countless implementations that cut off the automatic linking at that point, as they do with many other characters. But that’s not a problem caused by using these characters; it is caused by a wrong or incomplete implementation.
See for example this very site, Stack Overflow. It will cut off the link at the * when manually entering/pasting this URL (see bug; in case it gets fixed, here’s a screenshot of it):
http://wayback.archive.org/web/*/http://www.example.com/
But when using the hyperlink syntax, it works fine:
http://wayback.archive.org/web/*/http://www.example.com/
The * character is allowed in an HTTP URL path, so the link detection should have recognized the first URL instead of breaking it at the occurrence of *.
Regarding the comma:
The comma is a reserved character and its meaning is relevant for the URL path (bold emphasis mine):
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same.
So, if you don’t intend to use the comma for the function it has as a reserved character, you may want to percent-encode it as %2C. Users copying such a URL from their browser’s address bar would paste it in the encoded form, so it should work almost everywhere.
However, especially because it’s a reserved character, the unencoded form should work, too.
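A minimal sketch of both forms, using Python's standard urllib.parse and the "name,1.1" segment from the quoted RFC text:

```python
from urllib.parse import quote, unquote

segment = "name,1.1"

# Percent-encode the comma so it cannot be read as a delimiter.
encoded = quote(segment, safe="")
print(encoded)           # name%2C1.1
print(unquote(encoded))  # name,1.1

# Keep the comma unencoded when it is meant to act as a delimiter.
kept = quote(segment, safe=",")
print(kept)              # name,1.1
```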

Why use - instead of _ in a URL

Why use - instead of _ in a URL?
A URL containing '_' seems to have no bad effects.
Underscores are not allowed in a host name. Thus some_place.com is not a valid URL because the host name is not valid. Underscores are permissible in URLs. Thus some-place.com/which_place/ is perfectly legitimate, other concerns aside.
From RFC 1738:
host
[...] Fully qualified domain names take the form as described
in Section 3.5 of RFC 1034 [13] and Section 2.1 of RFC 1123
[5]: a sequence of domain labels separated by ".", each domain
label starting and ending with an alphanumerical character and
possibly also containing "-" characters. The rightmost domain
label will never start with a digit, though, which
syntactically distinguishes all domain names from the IP
addresses.
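The label rule quoted above can be sketched as a small check. This is a simplified illustration only: the function name is hypothetical, and it ignores length limits and the rule about the rightmost label not starting with a digit.

```python
import re

# One label: starts and ends with an alphanumeric, may contain "-" in between.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

def is_valid_hostname(host):
    """Hypothetical helper: True if every dot-separated label fits the rule."""
    return all(LABEL.fullmatch(label) is not None for label in host.split("."))

print(is_valid_hostname("some-place.com"))  # True
print(is_valid_hostname("some_place.com"))  # False (underscore in a label)
print(is_valid_hostname("-place.com"))      # False (label starts with "-")
```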
When you read a_long_sentence_with_many_underscores, because you are reading it by letter or word recognition, your eye tracks along the middle of the line, but when you reach an underscore, your eye is more likely to track down a bit and back up for the next word.
When you read a-long-sentence-with-many-dashes, your eye keeps tracking along the same horizon, and by sight, it is easier for your brain to try and ignore them.
Another good reason is that Google and other search engines rank URLs that match search terms higher when the word separator is a dash.
One main reason is that most anchor tags have text-decoration:underline, which effectively hides your underscore.
And a non-tech-savvy user won't automatically assume that there is an underscore :)
By the way... it seems several Java network libraries will not be able to interpret a URL correctly when it uses an underscore:
import java.net.URI;

// A hyphen in the host name parses fine
URI withHyphen = URI.create("http://www.google-plus.com/");
System.out.println(withHyphen.getHost()); // prints www.google-plus.com
// An underscore makes the host unparseable as a server-based authority
URI withUnderscore = URI.create("http://www.google_plus.com/");
System.out.println(withUnderscore.getHost()); // prints null
It's easier to type (at least on my German keyboard) and see.

Valid URL separators

I have a long URL with several values.
Example 1:
http://www.domain.com/list?seach_type[]=0&search_period[]=1&search_min=3000&search_max=21000&search_area=6855%3B7470%3B7700%3B7730%3B7741%3B7742%3B7752%3B7755%3B7760%3B7770%3B7800%3B7840%3B7850%3B7860%3B7870%3B7884%3B7900%3B7950%3B7960%3B7970%3B7980%3B7990%3B8620%3B8643%3B8800%3B8830%3B8831%3B8832%3B8840%3B8850%3B8860%3B8881%3B9620%3B9631%3B9632
My search_area variable contains only 4-digit numbers (for example 4000, 5000), but it can contain a lot of them. Right now I separate these in the URL by using ; as the separator symbol. However, as seen in Example 1, the ; is converted into %3B. This makes me believe that this is a bad symbol to use.
What is the best URL separator?
Moontear, I think you have misread the linked document. That limitation only applies to the "scheme" portion of the URL. In the case of WWW URLs, that is the "http".
The next section of the document goes on to say:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
I'd personally use comma (,). However, plus (+) and dash (-) are both reasonable options as well.
BTW, that document also mentions that semi-colon (;) is reserved in some schemes.
Well, according to RFC1738, valid URLs may only contain the letters a-z, the plus sign (+), period and hyphen (-).
Generally I would go with a plus to separate your search areas. So your URL would become http://www.domain.com/list?seach_type=0&search_period=1&search_min=3000&search_max=21000&search_area=6855+7470+7700+...
--EDIT--
As GinoA pointed out I misread the document. Hence "$-_.+!*'()," are valid characters too. I'd still go with the + sign though.
If there are only numbers to separate, you have a large choice of separators. You can choose any letter for example.
Probably a space can be a good choice. It will be transformed into + character in the URL, so will be more readable than a letter.
Example: search_area=4000+5000+6000
I'm very late to the party, but a valid query string can repeat variables so instead of...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855+7470+7700
...you could also use...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855&area=7470&area=7700
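A minimal sketch of how repeated variables are read and produced, using Python's standard urllib.parse (how a given server framework exposes repeated keys may differ):

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855&area=7470&area=7700"

# parse_qs collects every occurrence of a repeated key into a list.
params = parse_qs(urlsplit(url).query)
print(params["area"])  # ['6855', '7470', '7700']

# urlencode with doseq=True writes one key=value pair per list item.
query = urlencode({"type": 0, "area": [6855, 7470, 7700]}, doseq=True)
print(query)           # type=0&area=6855&area=7470&area=7700
```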
"+" is to be interpreted as a space " " when the content-type is application/x-www-form-urlencoded (standard for HTML forms). This may be handled by your server software.
I prefer "!". It doesn't get URL encoded (at least not in Chrome) and it reserves "+" for use as a real space character in the typical case.
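To illustrate the "+" as space behaviour mentioned in this thread, here is a small sketch with Python's standard urllib.parse, which follows the application/x-www-form-urlencoded convention. Other server software may decode query strings differently.

```python
from urllib.parse import parse_qs, quote_plus

# A "+" in the query string is decoded as a space.
params = parse_qs("area=6855+7470+7700")
print(params["area"])  # ['6855 7470 7700']

# Splitting on whitespace recovers the individual values.
areas = params["area"][0].split()
print(areas)           # ['6855', '7470', '7700']

# Going the other way, spaces become "+" and a literal "+" becomes %2B.
print(quote_plus("6855 7470 7700"))  # 6855+7470+7700
print(quote_plus("a+b"))             # a%2Bb
```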
