Named anchor (A) with NAME same as a DIV ID conflict - url

I am working on a site that uses a listener for the hash to show and hide content DIVs and to scroll to the named anchor of the same name.
I was having a weird issue where instead of scrolling to the anchor, it would scroll to the DIV with the ID the same as the anchor's name.
Once I changed the DIV ID to something different, the behavior was as expected.
I can't seem to find any documentation on this and was wondering if this is documented behavior.
Code that works:
<a name="top">top</a>
<p id="bottomx" style="height: 1800px;">
top
bottom
<br>
</p>
<a name="bottom">bottom</a>
Not working as expected:
<a name="top">top</a>
<p id="bottom" style="height: 1800px;">
top
bottom
<br>
</p>
<a name="bottom">bottom</a>
In the second example, it would scroll to the P named "bottom". Likewise, if I make a DIV at the bottom of the page with an ID of "bottom" and I hit page.html#bottom, it scrolls down to that DIV.
Just seems confusing. Any idea why this is working this way? Same behavior in Safari and FF.

id has precedence over name:
For HTML documents (and HTML MIME types), the following processing
model must be followed to determine what the indicated part of the
document is.
Parse the URL, and let fragid be the fragment component of the
URL.
If fragid is the empty string, then the indicated part of the
document is the top of the document; stop the algorithm here.
Let decoded fragid be the result of expanding any sequences of
percent-encoded octets in fragid that are valid UTF-8 sequences into
Unicode characters as defined by UTF-8. If any percent-encoded
octets in that string are not valid UTF-8 sequences (e.g. they
expand to surrogate code points), then skip this step and the next
one.
If this step was not skipped and there is an element in the DOM that
has an ID exactly equal to decoded fragid, then the first such
element in tree order is the indicated part of the document; stop
the algorithm here.
If there is an a element in the DOM that has a name attribute whose
value is exactly equal to fragid (not decoded fragid), then the
first such element in tree order is the indicated part of the
document; stop the algorithm here.
If fragid is an ASCII case-insensitive match for the string top,
then the indicated part of the document is the top of the document;
stop the algorithm here.
Otherwise, there is no indicated part of the document.
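
The quoted steps can be sketched in a few lines. This is only an illustration of the lookup order, not browser code: the Element class is a hypothetical stand-in for DOM nodes listed in tree order, and the percent-decoding is a simplification of the wording above.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import unquote_to_bytes


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element (not a real DOM API)."""
    tag: str
    id: Optional[str] = None
    name: Optional[str] = None


TOP = "top-of-document"  # marker value used by this sketch only


def indicated_part(fragid: str, elements: List[Element]):
    """Return the element (or TOP, or None) that a fragment points at."""
    if fragid == "":
        return TOP
    # Percent-decode as UTF-8; if that fails, skip the id lookup entirely.
    try:
        decoded = unquote_to_bytes(fragid).decode("utf-8")
    except UnicodeDecodeError:
        decoded = None
    if decoded is not None:
        for el in elements:  # ids are checked first, on any element
            if el.id == decoded:
                return el
    for el in elements:  # only then <a name=...>, using the raw fragid
        if el.tag == "a" and el.name == fragid:
            return el
    if fragid.isascii() and fragid.lower() == "top":
        return TOP
    return None


page = [Element("a", name="top"), Element("p", id="bottom"), Element("a", name="bottom")]
print(indicated_part("bottom", page).tag)  # the <p>, because id wins over name
```

With the id changed to "bottomx", the same call falls through to the a element named "bottom", which is the behavior described in the question.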

The HTML 4.01 and XHTML 1.0 specifications require that the value of a name attribute in an a element must not be the same as the value of an id attribute, except when both are set on the same element; otherwise the document is in error. Browsers are free to apply their own error handling, which can be rather unpredictable.
HTML5 drafts specify complicated error handling rules, but they also declare the name attribute in an a element as obsolete.
It would be illogical (and formally forbidden) to use the same id value for two elements in the same document, as the very purpose, and sole purpose, of id is to provide a unique identifier for an element. The <a name=...> construct predates id in the history of HTML and was always meant to play the same role as id, in a restricted setting. It is thus natural that it is treated the same way.

Related

Complex text substitution algorithm or design pattern

I need to do multiple substitutions in a text coming from a database before displaying it to the user.
My example is for data most likely found in a CRM and the output is HTML for the web, but the question generalizes to any other text-substitution need and to any programming language. In my case I use PHP, but it's more an algorithm question than a PHP question.
Problem
Each of the 3 examples I'm writing below is super-easy to do via regular expressions. But combining them in a single shot is not so direct, even if I do multi-step substitutions. They interfere.
Question
Is there a design pattern for doing multiple interfering text substitutions?
Example #1 of substitution: The IDs.
We work with IDs. The IDs are sha-1 digests. IDs are universal and can represent any entity in the company, from a user to an airport, from an invoice to a car.
So in the database we can find this text to be displayed to a user:
User d19210ac35dfc63bdaa2e495e17abe5fc9535f02 paid 50 EUR
in the payment 377b03b0b4e92502737eca2345e5bdadb1262230. We sent
an email a49c6737f80eadea0eb16f4c8e148f1c82e05c10 to confirm.
We want all IDs to be translated into links so the user viewing the info can click them. There's one general URL for decoding IDs. Let's assume it's http://example.com/id/xxx
The transformed text would be this:
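User <a href="http://example.com/id/d19210ac35dfc63bdaa2e495e17abe5fc9535f02">d19210ac35dfc63bdaa2e495e17abe5fc9535f02</a> paid 50 EUR
in the payment <a href="http://example.com/id/377b03b0b4e92502737eca2345e5bdadb1262230">377b03b0b4e92502737eca2345e5bdadb1262230</a>. We sent
an email <a href="http://example.com/id/a49c6737f80eadea0eb16f4c8e148f1c82e05c10">a49c6737f80eadea0eb16f4c8e148f1c82e05c10</a> to confirm.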
Example #2 of substitution: The Links
We want anything that resembles a URI to be clickable. Let's focus only on the http and https protocols and forget the rest.
If we find this in the database:
Our website is http://mary.example.com and the info
you are requesting is in this page http://mary.example.com/info.php
would be converted into this:
Our website is <a href="http://mary.example.com">http://mary.example.com</a> and the info
you are requesting is in this page <a href="http://mary.example.com/info.php">http://mary.example.com/info.php</a>
Example #3 of substitution: The HTML
When the original text contains HTML it must not be sent raw, as it would be interpreted. We want to change the < and > chars into the escaped forms &lt; and &gt;. The translation table for HTML-5 also contains the & symbol, to be converted to &amp;. This also affects the translation of the Message-IDs of the emails, for example.
For example if we find this in the database:
We need to change the CSS for the <code> tag to a pure green.
Sent to John&Partners in Message-ID: <aaa#bbb.ccc> this morning.
The resulting substitution would be:
We need to change the CSS for the &lt;code&gt; tag to a pure green.
Sent to John&amp;Partners in Message-ID: &lt;aaa#bbb.ccc&gt; this morning.
All right... But... combinations?
Up to here, every change "per se" is super-easy.
But when we combine things we want them to still be "natural" to the user. Let's assume that the original text contains HTML. And one of the tags is an <a> tag. We still want to see the complete tag "displayed" and the HREF be clickable. And also the text of the anchor if it was a link.
Combination sample: #2 (inject links) then #3 (flatten HTML)
Let's say we have this in the database:
Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.
If we first apply #2 to transform the links and then #3 to encode HTML we would have:
Applying rule #2 (inject links) on the original, the link http://example.com/data.xml is detected and substituted by <a href="http://example.com/data.xml">http://example.com/data.xml</a>
Paste this <a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>">Download</a> into your text editor.
which obviously is broken HTML and makes no sense but, in addition, applying rule #3 (flatten HTML) on the output of #2 we would have:
Paste this &lt;a class="dark" href="&lt;a href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt;"&gt;Download&lt;/a&gt; into your text editor.
which in turn is the mere flat HTML representation of the broken HTML and not clickable. Wrong output: neither #2 nor #3 was satisfied.
Reversed combination: First #3 (flatten HTML) then #2 (inject links)
If I first apply rule #3 to "encode all HTML" and then afterwards apply rule #2 to "inject link HTML", this happens:
Original (same as above):
Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.
Result of applying #3 (flatten HTML):
Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;Download&lt;/a&gt; into your text editor.
Then we apply rule #2 (inject links) and it seems to work:
Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;Download&lt;/a&gt; into your text editor.
This works because " is not a valid URL character, so http://example.com/data.xml is detected with the exact URL boundaries.
But... what if the original text had also a link inside the link text? This is a very common case scenario. Like this original text:
Paste this <a class="dark" href="http://example.com/data.xml">http://example.com/data.xml</a> into your text editor.
Then applying #3 (flatten HTML) would give this:
Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt; into your text editor.
HERE WE HAVE A PROBLEM
As all of &, ; and / are valid URL characters, the URL parser would find this: http://example.com/data.xml&lt;/a&gt; as the URL instead of ending at the .xml point.
This would result in this wrong output:
Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;<a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a> into your text editor.
So http://example.com/data.xml&lt;/a&gt; got substituted by <a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a>, but the problem is that the URL was not correctly detected.
Let's mix it up with rule #1
If rules #2 and #3 are a mess when processed together imagine if we mix them with rule #1 and we have a URL which contains a sha-1 like this database entry:
Paste this <a class="dark" href="http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9">http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9</a> into your text editor.
Could you imagine??
Tokenizer?
I have thought of creating a syntax tokenizer. But I feel it's overkill.
Is there a design pattern?
I wonder if there's a design pattern to read and study for doing multiple text substitutions, what it is called, and where it is documented.
If there's not any pattern... then... is building a syntax tokenizer the only solution?
I feel there must be a much simpler way to do this. Do I really have to tokenize the text in a syntax-tree and then re-render by traversing the tree?
The design pattern is the one you already rejected, left-to-right tokenisation. Of course, that's easier to do in languages for which there are code generators which produce lexical scanners.
There's no need to parse or to build a syntax tree. A linear sequence of tokens suffices. In effect, the scanner becomes a transducer. Each token is either passed through unaltered, or is replaced immediately with the translation required.
Nor does the tokeniser need to be particularly complicated. The three regular expressions you currently have can be used, combined with a fourth token type representing any other character. The important part is that all patterns are tried at each point, one is selected, the indicated replacement is performed, and the scan resumes after the match.
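
A minimal sketch of such a transducer, written in Python rather than the PHP mentioned in the question. The URL and ID patterns are simplified assumptions (not the asker's actual expressions), and TOKEN, ID_URL and render are hypothetical names; ID_URL uses the example URL from the question.

```python
import html
import re

# All patterns are tried at each position, in this order; the last
# alternative matches any single character, so the scan never stalls.
TOKEN = re.compile(
    r"(?P<url>https?://[^\s<>\"']+)"   # simplified URL pattern (assumption)
    r"|(?P<id>\b[0-9a-f]{40}\b)"       # sha-1 digest
    r"|(?P<special>[<>&])"             # characters that must be escaped
    r"|(?P<other>.)",                  # anything else, one character at a time
    re.DOTALL,
)

ID_URL = "http://example.com/id/"  # the example URL from the question


def render(text: str) -> str:
    """Translate each token exactly once, left to right."""
    out = []
    for m in TOKEN.finditer(text):
        kind, tok = m.lastgroup, m.group()
        if kind == "url":
            safe = html.escape(tok)  # the URL may contain '&'
            out.append(f'<a href="{safe}">{safe}</a>')
        elif kind == "id":
            out.append(f'<a href="{ID_URL}{tok}">{tok}</a>')
        elif kind == "special":
            out.append(html.escape(tok))
        else:
            out.append(tok)
    return "".join(out)


print(render('Paste <a href="http://example.com/data.xml">http://example.com/data.xml</a> here'))
```

Because every token is emitted as final output and never rescanned, the injected <a> tags are not escaped afterwards, and a sha-1 inside a URL is consumed as part of the URL token rather than linked a second time. A URL directly followed by sentence punctuation would still swallow that punctuation; tightening the URL pattern is left out of this sketch.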

Trix formatting rules

When users paste items from MS Word, for example numbered lists or bullet points, Trix leaves the symbols in but does not use the default style rules. E.g. see below. Note the indenting.
I want to replace pasted bullet points with '<li>' tags, since that is how the browser would normally render a list, or else just add the default style rules to the text.
As a workaround I was thinking of using JavaScript/CoffeeScript to replace all occurrences of '•' with <li> during a paste command using onPaste=''. However, this is problematic, since the implementation could cause unforeseen effects.
Another way might be to create a regular expression to remove the symbols and do it just in time while pasting.
Any other suggestions would be welcome in achieving this.
Edit
/\d\.\s+|[a-z]\)\s+|•\s+|[A-Z]\.\s+|[IVX]+\.\s+[•].\s/g
This regular expression can find numbered-list and bullet markers, and a simple replace on the pasted string will give the desired results.
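
As a rough illustration of that replace step, here is a sketch in Python (not Trix or CoffeeScript code). The MARKER pattern is an anchored variant of the expression above and is an assumption, as is the hypothetical paste_to_html helper; every list is emitted as <ul> for simplicity.

```python
import html
import re

# Marker at the start of a line: "1.", "IV.", "A.", "a)" or a bullet,
# followed by whitespace. Anchoring avoids matches in the middle of a line.
MARKER = re.compile(r"^\s*(?:\d+\.|[IVX]+\.|[A-Z]\.|[a-z]\)|•)\s+")


def paste_to_html(text: str) -> str:
    """Turn runs of marked lines into a <ul>, passing other lines through."""
    out = []
    in_list = False
    for line in text.splitlines():
        m = MARKER.match(line)
        if m:
            if not in_list:
                out.append("<ul>")
                in_list = True
            # Drop the pasted marker and keep only the item text.
            out.append("<li>" + html.escape(line[m.end():].strip()) + "</li>")
        else:
            if in_list:
                out.append("</ul>")
                in_list = False
            out.append(html.escape(line))
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


print(paste_to_html("Intro\n• one\n•  two\n1. three\nEnd"))
```

A line that merely starts with an initial, such as "A. Smith wrote", would also be treated as a list item, so a real paste handler would probably need a stricter rule.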

CommonMark Parsing ***

Let's say I want to parse the string ***cat*** into Markdown using the CommonMark standard. The standard says (http://spec.commonmark.org/0.28/#phase-2-inline-structure):
....
If one is found:
Figure out whether we have emphasis or strong emphasis: if both closer
and opener spans have length >= 2, we have strong, otherwise regular.
Insert an emph or strong emph node accordingly, after the text node
corresponding to the opener.
Remove any delimiters between the opener and closer from the delimiter
stack.
Remove 1 (for regular emph) or 2 (for strong emph) delimiters from the
opening and closing text nodes. If they become empty as a result,
remove them and remove the corresponding element of the delimiter
stack. If the closing node is removed, reset current_position to the
next element in the stack.
....
Based on my reading of this the result should be <em><strong>cat</strong></em> since first the <strong> is added, THEN the <em>. However, all online markdown editors I have tried this in output <strong><em>cat</em></strong>. What am I missing?
Here is a visual representation of what I think should be happening
TextNode[***] TextNode[cat] TextNode[***]
TextNode[*] StrongEmphasis TextNode[cat] TextNode[*]
TextNode[] Emphasis StrongEmphasis TextNode[cat] TextNode[]
Emphasis StrongEmphasis TextNode[cat]
It's important to remember that Commonmark and Markdown are not necessarily the same thing. Commonmark is a recent variant of Markdown. Most Markdown parsers existed and established their behavior long before the Commonmark spec was even started.
While the original Markdown rules make no comment on whether the <em> or <strong> tag should be first in the given example, the reference implementation's (markdown.pl) actual behavior was to list the <strong> tag before the <em> tag in the output. In fact, the MarkdownTest package (which was created by the author of Markdown and markdown.pl) explicitly required that output (the original is no longer available online that I know of, but mdtest is a faithful copy with its history showing no modifications of that test since the initial import from MarkdownTest). AFAICT, every (non-Commonmark) Markdown parser has followed that behavior exactly.
The Commonmark spec took a different route. The spec specifically states in Rule 14 of Section 6.4 (Emphasis and strong emphasis):
An interpretation <em><strong>...</strong></em> is always preferred to <strong><em>...</em></strong>.
... and backs it up with example 444:
***foo***
<p><em><strong>foo</strong></em></p>
In fact, you can see that that is exactly the behavior of the reference implementation of Commonmark.
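
For the special case of an opener and closer run of equal length, the nesting order can be traced with a toy function. This is only a sketch of the step quoted in the question (strong when at least 2 delimiters remain on both sides, otherwise regular emphasis), not a CommonMark parser; wrap_symmetric_run is a hypothetical name.

```python
def wrap_symmetric_run(text: str, run_length: int) -> str:
    """Wrap text as the quoted step would for equal opener/closer runs."""
    node = text
    remaining = run_length
    while remaining > 0:
        if remaining >= 2:
            # Both runs still have 2 or more delimiters: strong emphasis,
            # and 2 delimiters are consumed on each side.
            node = f"<strong>{node}</strong>"
            remaining -= 2
        else:
            # A single delimiter is left on each side: regular emphasis.
            node = f"<em>{node}</em>"
            remaining -= 1
    return node


print(wrap_symmetric_run("foo", 3))
```

The strong node is created first and therefore ends up innermost; the leftover single delimiters wrap it in em afterwards, which is the <em><strong>...</strong></em> order shown in the spec example.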
As an aside, the original question quotes from the Appendix to the spec which recommends how to implement a parser. While potentially useful to a parser creator, I would not recommend using that section to determine proper syntax handling and/or output. The actual rules should be consulted instead; and in fact, they clearly provide the expected output in this instance. But this question is about an apparent disparity between implementations and the spec, not interpretation of the spec.
For a more complete comparison, see Babelmark. With the exception of a few (completely) broken implementations, every "classic" Markdown parser follows markdown.pl, while every Commonmark parser follows the Commonmark spec. Therefore, there is no actual disparity between the spec and implementations. The disparity is between Markdown and Commonmark.
As for why the Commonmark authors chose a different route in this regard, or why they insist on calling Commonmark "Markdown" when it is clearly different, those questions are off topic here and better asked of the authors themselves.

Unicode URLs shown in wrong order

I have enabled Unicode URLs in my Joomla site.
My language is Persian, which is a right-to-left language, but
URLs written in Persian appear in the wrong order. For example:
Mysite.com/محصولات/محصول-اول
It translates to:
Mysite.com/first-product/products
Which should have been:
Mysite.com/products/first-product
This is only a matter of displaying text. I know that the actual text the server receives is in correct order because url-encoded version has the correct order.
(If you don't get the idea type "something.com/" in your url bar. Now copy/paste this at the end of url
محصولات
Now type a slash and copy/paste this at the end
محصول
You see? The last one should have gone to the right but goes to the left)
I have two questions regarding this issue:
1- Is there anything I can do to display URLs in the correct order?
2- Can it affect how Google indexes my pages? Can it misdirect Google?
The behaviour of the url display is totally correct in Unicode sense, as the slash is defined as bidirectionally neutral:
http://www.fileformat.info/info/unicode/char/002f/index.htm
Thus, standing between two Arabic-script (right-to-left) words, the slash has to adapt to the writing direction of the surrounding words. The slash would, though, never adapt to the writing direction of the whole line within a right-to-left neighborhood.
To answer your questions:
(1) It is not possible to influence this behaviour if you do not change the URL, as Jukka K. Korpela already assumed.
(2) As long as the order of the words is correctly encoded, I do not see any bad consequences for search engine indexings.
If you want to change it anyway, and assuming that your URLs are artificial and do not represent real paths, I can see the following workarounds:
(a) Substitute the slash with another "strong" symbol which influences the writing direction.
(b) Insert a "pseudo-strong" character (U+200E, LEFT-TO-RIGHT MARK) before the slash, which will enforce LTR for the slash.
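
Workaround (b) could look roughly like this for the displayed link text. It is a sketch only: display_url is a hypothetical helper, and the marked-up string should be used for display, not as the actual href, since the extra character is not part of the real URL.

```python
LRM = "\u200e"  # U+200E LEFT-TO-RIGHT MARK


def display_url(url: str) -> str:
    """Return a display-only copy with an LRM before every slash."""
    return url.replace("/", LRM + "/")


shown = display_url("Mysite.com/محصولات/محصول-اول")
print(shown.count(LRM))         # number of slashes that were marked
print(shown.replace(LRM, ""))   # stripping the marks restores the URL
```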
Hope this helps.

Should I retain html markup when performing language translation (ie via MS translate API)?

Is there any advantage in passing HTML fragments to the translation API as opposed to only plain text? For example, translating the following
Please click <a href='#'>here</a> to continue
returns a valid, translated html fragment - but what happens under the hood? Is the returned translation equivalent to the translation of three sentence fragments
Please click > here > to continue
Or the single sentence
Please click here to continue
Why do I ask?
I have one or two HTML fragments that are larger than the permitted size and I need to chunk them up in some way. Using the htmlagilitypack I could just replace the HTML document text nodes with the translated equivalent values, but do I lose anything by doing this? Will the quality of the translation improve if I translate whole sentences (i.e. <H1>, <H2>, <p> tags)?
Many thanks in advance
Duncan
From MSDN here I got the following reply:
In the translation it matters where the sentence boundary lies. HTML
markup has sentence-internal elements, like <em> or <a>, and sentence
breaking elements, like <p>. HTML mode will process the elements
according to their behavior in HTML document rendering.
So there it is!
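
Given that reply, one way to chunk an oversized fragment is to cut only after sentence-breaking elements, so that inline elements such as <em> or <a> stay with their sentence. The following is a sketch under assumptions: the set of block tags, the regex-based splitting (which ignores nested blocks), and the names split_blocks and chunk are all illustrative, and no translation API is called.

```python
import re

# Closing tags treated as sentence-breaking in this sketch (assumption).
BLOCK_CLOSE = re.compile(r"(</(?:p|h[1-6]|li|div)\s*>)", re.IGNORECASE)


def split_blocks(fragment: str) -> list:
    """Split after each block-level closing tag, keeping the tag."""
    parts = BLOCK_CLOSE.split(fragment)
    # parts alternates: content, closing tag, content, closing tag, ..., tail
    blocks = [parts[i] + parts[i + 1] for i in range(0, len(parts) - 1, 2)]
    if parts[-1]:
        blocks.append(parts[-1])
    return blocks


def chunk(fragment: str, max_chars: int) -> list:
    """Pack whole blocks into chunks of at most max_chars where possible."""
    chunks, current = [], ""
    for block in split_blocks(fragment):
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = block
        else:
            current += block
    if current:
        chunks.append(current)
    return chunks


doc = "<h1>Title</h1><p>One <em>two</em>.</p><p>Three.</p>"
for piece in chunk(doc, 40):
    print(piece)
```

A single block longer than max_chars is still emitted as one oversized chunk here; splitting inside a block would reintroduce the sentence-boundary problem the reply describes.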

Resources