Complex text substitution algorithm or design pattern - parsing

I need to perform multiple substitutions on text that comes from a database, before displaying it to the user.
My example is for data most likely found in a CRM, and the output is HTML for the web, but the question generalizes to any other text-substitution need and to any programming language. In my case I use PHP, but it's more an algorithm question than a PHP question.
Problem
Each of the 3 examples below is super-easy to do via regular expressions. But combining them in a single shot is not so direct, even with multi-step substitutions: they interfere with each other.
Question
Is there a design pattern for doing multiple interfering text substitutions?
Example #1 of substitution: The IDs.
We work with IDs. The IDs are SHA-1 digests. IDs are universal and can represent any entity in the company, from a user to an airport, from an invoice to a car.
So in the database we can find this text to be displayed to a user:
User d19210ac35dfc63bdaa2e495e17abe5fc9535f02 paid 50 EUR
in the payment 377b03b0b4e92502737eca2345e5bdadb1262230. We sent
an email a49c6737f80eadea0eb16f4c8e148f1c82e05c10 to confirm.
We want all IDs to be translated into links, so the user viewing the info can click them. There's one general URL for decoding IDs. Let's assume it's http://example.com/id/xxx
The transformed text would be this:
User <a href="http://example.com/id/d19210ac35dfc63bdaa2e495e17abe5fc9535f02">d19210ac35dfc63bdaa2e495e17abe5fc9535f02</a> paid 50 EUR
in the payment <a href="http://example.com/id/377b03b0b4e92502737eca2345e5bdadb1262230">377b03b0b4e92502737eca2345e5bdadb1262230</a>. We sent
an email <a href="http://example.com/id/a49c6737f80eadea0eb16f4c8e148f1c82e05c10">a49c6737f80eadea0eb16f4c8e148f1c82e05c10</a> to confirm.
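In isolation this is a one-line regex replacement, for example in PHP (a sketch; the \b-delimited 40-hex-digit pattern is my assumption of what counts as an ID):

$html = preg_replace(
    '~\b([0-9a-f]{40})\b~',
    '<a href="http://example.com/id/$1">$1</a>',
    $text
);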
Example #2 of substitution: The Links
We want anything that resembles a URI to be clickable. Let's focus only on the http and https protocols and forget the rest.
If we find this in the database:
Our website is http://mary.example.com and the info
you are requesting is in this page http://mary.example.com/info.php
would be converted into this:
Our website is <a href="http://mary.example.com">http://mary.example.com</a> and the info
you are requesting is in this page <a href="http://mary.example.com/info.php">http://mary.example.com/info.php</a>
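Again trivial on its own, for example (a sketch; the character class deciding where a URL ends is a simplification):

$html = preg_replace(
    '~https?://[^\s"<>]+~',
    '<a href="$0">$0</a>',
    $text
);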
Example #3 of substitution: The HTML
When the original text contains HTML, it must not be sent raw, as it would be interpreted. We want to change the < and > chars into the escaped forms &lt; and &gt;. The HTML5 translation table also includes the & symbol, to be converted into &amp;. This also affects the translation of the Message-IDs of emails, for example.
For example if we find this in the database:
We need to change the CSS for the <code> tag to a pure green.
Sent to John&Partners in Message-ID: <aaa#bbb.ccc> this morning.
The resulting substitution would be:
We need to change the CSS for the &lt;code&gt; tag to a pure green.
Sent to John&amp;Partners in Message-ID: &lt;aaa#bbb.ccc&gt; this morning.
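On its own, this is a single call in PHP (ENT_NOQUOTES restricts the escaping to the three characters named above):

$html = htmlspecialchars($text, ENT_NOQUOTES); // & to &amp;, < to &lt;, > to &gt;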
Alright... But... combinations?
Up to here, every change "per se" is super-easy.
But when we combine things, we want the result to still be "natural" to the user. Let's assume that the original text contains HTML, and one of the tags is an <a> tag. We still want to see the complete tag "displayed", with the href clickable, and the anchor text clickable too if it was a link.
Combination sample: #2 (inject links) then #3 (flatten HTML)
Let's say we have this in the database:
Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.
If we first apply #2 to transform the links and then #3 to encode HTML, here is what happens.
Applying rule #2 (inject links) on the original, the link http://example.com/data.xml is detected and substituted by <a href="http://example.com/data.xml">http://example.com/data.xml</a>, giving:
Paste this <a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>">Download</a> into your text editor.
which is obviously broken HTML and makes no sense. But in addition, applying rule #3 (flatten HTML) on the output of #2, we get:
Paste this &lt;a class="dark" href="&lt;a href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt;"&gt;Download&lt;/a&gt; into your text editor.
which in turn is the mere flat representation of the broken HTML, and not clickable. Wrong output: neither #2 nor #3 was satisfied.
Reversed combination: First #3 (flatten HTML) then #2 (inject links)
If I first apply rule #3 to flatten all the HTML and afterwards apply rule #2 to inject the link HTML, this happens.
Original (same as above):
Paste this <a class="dark" href="http://example.com/data.xml">Download</a> into your text editor.
Result of applying #3 (flatten HTML):
Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;Download&lt;/a&gt; into your text editor.
Then we apply rule #2 (inject links), and it seems to work:
Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;Download&lt;/a&gt; into your text editor.
This works because " is not a valid URL character, so the parser detects http://example.com/data.xml with its exact limits.
But... what if the original text also had a link inside the link text? This is a very common scenario. Like this original text:
Paste this <a class="dark" href="http://example.com/data.xml">http://example.com/data.xml</a> into your text editor.
Applying rule #3 (flatten HTML) first gives this:
Paste this &lt;a class="dark" href="http://example.com/data.xml"&gt;http://example.com/data.xml&lt;/a&gt; into your text editor.
HERE WE HAVE A PROBLEM
As all of &, ; and / are valid URL characters, the URL parser would find http://example.com/data.xml&lt;/a&gt; as the URL, instead of ending the match at .xml.
This would result in this wrong output:
Paste this &lt;a class="dark" href="<a href="http://example.com/data.xml">http://example.com/data.xml</a>"&gt;<a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a> into your text editor.
So http://example.com/data.xml&lt;/a&gt; got substituted by <a href="http://example.com/data.xml&lt;/a&gt;">http://example.com/data.xml&lt;/a&gt;</a>, but the problem is that the URL was not correctly detected.
Let's mix it up with rule #1
If rules #2 and #3 are a mess when processed together, imagine mixing in rule #1 with a URL that contains a SHA-1, like this database entry:
Paste this <a class="dark" href="http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9">http://example.com/id/89019b16ab155ba1c19e1ab9efdb9134c8f9e2b9</a> into your text editor.
Could you imagine??
Tokenizer?
I have thought of creating a syntax tokenizer, but I feel it's overkill.
Is there a design pattern?
I wonder if there is a design pattern for doing multiple text substitutions that I can read and study: what is it called, and where is it documented?
If there isn't any pattern... then... is building a syntax tokenizer the only solution?
I feel there must be a much simpler way to do this. Do I really have to tokenize the text into a syntax tree and then re-render it by traversing the tree?

The design pattern is the one you already rejected, left-to-right tokenisation. Of course, that's easier to do in languages for which there are code generators which produce lexical scanners.
There's no need to parse or to build a syntax tree. A linear sequence of tokens suffices. In effect, the scanner becomes a transducer. Each token is either passed through unaltered, or is replaced immediately with the translation required.
Nor does the tokeniser need to be particularly complicated. The three regular expressions you currently have can be used, combined with a fourth token type representing any other character. The important part is that all patterns are tried at each point, one is selected, the indicated replacement is performed, and the scan resumes after the match.
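As a minimal sketch of that transducer in PHP (the asker's language): one alternation combines simplified versions of the three rules plus the single-character fallback. PCRE tries the alternatives in order at each position, consumes the winning match, and resumes scanning after it. The URL and ID patterns are simplified assumptions, and http://example.com/id/ is the decoder URL from the question.

function renderForHtml(string $text): string
{
    // Token types, tried in this order at every position of a single
    // left-to-right scan (UTF-8 input assumed): url, id, then any one char.
    $pattern = '~(?<url>https?://[^\s"<>]+)|(?<id>\b[0-9a-f]{40}\b)|(?<other>.)~su';

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $out = '';
    foreach ($matches as $t) {
        if (!empty($t['url'])) {           // rule #2: URLs become links
            $url = htmlspecialchars($t['url'], ENT_QUOTES);
            $out .= '<a href="' . $url . '">' . $url . '</a>';
        } elseif (!empty($t['id'])) {      // rule #1: IDs link to the decoder
            $out .= '<a href="http://example.com/id/' . $t['id'] . '">' . $t['id'] . '</a>';
        } else {                           // rule #3: escape everything else
            $out .= htmlspecialchars($t['other'], ENT_NOQUOTES);
        }
    }
    return $out;
}

Because the scan runs over the raw text in a single pass, the problematic example comes out right: the anchor-text URL ends at .xml (nothing has been escaped yet, and < is excluded by the pattern), the surrounding tag characters are escaped one by one, and a URL containing a SHA-1 is consumed whole by the url alternative before the id rule ever sees it.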

Related

Trix formatting rules

When users paste items from MS Word, for example numbered lists or bullet points, Trix leaves the symbols in but does not apply the default style rules. E.g. see below; note the indenting.
I want to replace pasted bullet points with '<li>' tags, since that is how the browser renders them, or else just apply the default style rules to the text.
As a workaround I was thinking of using JavaScript/CoffeeScript to replace all instances of '•' with <li> during a paste command using onPaste=''. However this is problematic, since the implementation could cause unforeseen effects.
Another way might be to create a regex to remove the symbols and do it just-in-time while pasting.
Any other suggestions would be welcome in achieving this.
Edit
/\d\.\s+|[a-z]\)\s+|•\s+|[A-Z]\.\s+|[IVX]+\.\s+/g
This regex can find numbered lists, and a simple replace on the pasted string will give the desired results.
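For example, in PHP (a sketch; Trix itself would be handled in JavaScript, and the u modifier is needed because the bullet is a multi-byte character):

$clean = preg_replace('~\d\.\s+|[a-z]\)\s+|•\s+|[A-Z]\.\s+|[IVX]+\.\s+~u', '<li>', $pasted);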

OpenNLP does not recognize Twitter input

I have a file that contains Twitter posts, and I am trying to identify the structure of each post per line, like getting the nouns, verbs and so on, using OpenNLP.
It works perfectly until it reaches a line that contains only a hashtag and a link.
example :
#birthday www.mybirthday/test/mypi.com
and give error com.cybozu.labs.langdetect.LangDetectException: no features in text
When I write a sentence next to the line, it just works. Any idea how to handle this? There are more than a thousand lines almost like the example.
To use the POS tagger, you need to pass tokens (in layman's terms, individual words). The link contains multiple words separated by a slash (/), and the link in itself is not associated with any part of speech. See here the list of tags and how they are assigned to a word. If you want it to identify your link and give it a separate tag, say LN, either provide your own training data (here you will learn how to create training data), or split the words in the link into separate tokens (you can split a link on slash (/), question mark (?), equals sign (=) or ampersand (&)) to get the underlying words, and then use the POSTagger to get the part of speech (similarly for the hashtag). For tokenization, too, you can use the OpenNLP tokenizer and, for your special case, train it. Go through the documentation; it will help you a lot.
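For the splitting step only, a sketch (in PHP like the rest of this page, not the OpenNLP API; the delimiters are the ones listed above):

$tokens = preg_split('~[/?=&]+~', 'www.mybirthday/test/mypi.com', -1, PREG_SPLIT_NO_EMPTY);
// ["www.mybirthday", "test", "mypi.com"]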

Should I retain HTML markup when performing language translation (i.e. via MS Translate API)?

Is there any advantage in passing HTML fragments to the translation API as opposed to only plain text? For example, translating the following
Please click <a href='#'>here</a> to continue
returns a valid, translated html fragment - but what happens under the hood? Is the returned translation equivalent to the translation of three sentence fragments
"Please click", "here", "to continue"
Or the single sentence
Please click here to continue
Why do I ask?
I have one or two HTML fragments that are larger than the permitted size, and I need to chunk them up in some way. Using the HtmlAgilityPack I could just replace the HTML document text nodes with the translated equivalent values, but do I lose anything by doing this? Will the quality of the translation improve if I translate whole sentences (i.e. <H1>, <H2>, <p> tags)?
Many thanks in advance
Duncan
From MSDN here I got the following reply:
In the translation it matters where the sentence boundary lies. HTML
markup has sentence-internal elements, like <em> or <a>, and sentence
breaking elements, like <p>. HTML mode will process the elements
according to their behavior in HTML document rendering.
So there it is!
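For the chunking approach mentioned in the question, here is a minimal sketch of the idea using PHP's DOMDocument in place of the HtmlAgilityPack; translateChunk is a hypothetical stand-in for the call to the translation API. Note the caveat from the reply above: translating text nodes one by one splits sentences at sentence-internal elements like <em> or <a>, which can hurt quality.

function translateHtmlFragment(string $html, callable $translateChunk): string
{
    $doc = new DOMDocument();
    // Wrap the fragment so it parses as a unit and is treated as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8"?><div>' . $html . '</div>', LIBXML_NOERROR);

    // Translate each non-empty text node in place.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//div//text()') as $node) {
        if (trim($node->nodeValue) !== '') {
            $node->nodeValue = $translateChunk($node->nodeValue);
        }
    }

    // Serialize the wrapper's children back into a fragment.
    $out = '';
    foreach ($xpath->query('//div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}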

Named anchor (A) with NAME same as a DIV ID conflict

I am working on a site that is using a listener for the hash to show and hide content DIVs and to scroll to the named anchor of the same name.
I was having a weird issue where instead of scrolling to the anchor, it would scroll to the DIV with the ID the same as the anchor's name.
Once I changed the DIV ID to something different, the behavior was as expected.
I can't seem to find any documentation on this and was wondering if this is documented behavior.
Code that works:
<a name="top">top</a>
<p id="bottomx" style="height: 1800px;">
top
bottom
<br>
</p>
<a name="bottom">bottom</a>
Not working as expected:
<a name="top">top</a>
<p id="bottom" style="height: 1800px;">
top
bottom
<br>
</p>
<a name="bottom">bottom</a>
In the second example, it would scroll to the P named "bottom". Likewise, if I make a DIV at the bottom of the page with an ID of "bottom" and I hit page.html#bottom, it scrolls down to that DIV.
Just seems confusing. An idea why this is working this way? Same behavior in Safari and FF.
id has precedence over name:
For HTML documents (and HTML MIME types), the following processing
model must be followed to determine what the indicated part of the
document is.
1. Parse the URL, and let fragid be the fragment component of the URL.
2. If fragid is the empty string, then the indicated part of the document is the top of the document; stop the algorithm here.
3. Let decoded fragid be the result of expanding any sequences of percent-encoded octets in fragid that are valid UTF-8 sequences into Unicode characters as defined by UTF-8. If any percent-encoded octets in that string are not valid UTF-8 sequences (e.g. they expand to surrogate code points), then skip this step and the next one.
4. If this step was not skipped and there is an element in the DOM that has an ID exactly equal to decoded fragid, then the first such element in tree order is the indicated part of the document; stop the algorithm here.
5. If there is an a element in the DOM that has a name attribute whose value is exactly equal to fragid (not decoded fragid), then the first such element in tree order is the indicated part of the document; stop the algorithm here.
6. If fragid is an ASCII case-insensitive match for the string top, then the indicated part of the document is the top of the document; stop the algorithm here.
7. Otherwise, there is no indicated part of the document.
The HTML 4.01 and XHTML 1.0 specifications require that a name attribute in an a element must not be the same as the value of an id attribute, except when both are set on the same element; otherwise the document is in error. Browsers are free to apply their own error handling, which can be rather unplanned.
HTML5 drafts specify complicated error handling rules, but they also declare the name attribute in an a element as obsolete.
It would be illogical (and is formally forbidden) to use the same id value on two elements in the same document, as the very purpose, and sole purpose, of id is to provide a unique identifier for an element. The <a name=...> construct predates id in the history of HTML and was always meant to play the same role as id, in a restricted setting. It is thus natural that it is treated the same way.

Sanitize pasted text from MS-Word

Here's my wild and whacky pseudo-code. Anyone know how to make this real?
Background:
This dynamic content comes from a CKEditor, and a lot of folks paste Microsoft Word content into it. No worries: if I just render the attribute untouched, it loads pretty. But the catch is that I want it abbreviated to just 125 characters. When I add truncation, all of the Microsoft Word junk starts popping up. So I added simple_format, and sanitize, and truncate, and even made my controller spot specific variables that MS Word produces and gsub them out. But there are too many of them, and it seems like an awfully messy way to accomplish this. Rendered by itself, the text is clean, so I thought: why not just slice it? However, the Microsoft Word text becomes blank but still holds its numbered position in the string. So I came up with this (probably awful) solution below.
It's in three steps.
When the text parses, it doesn't display any of the MS Word junk, but that text still holds a numbered position in a slice statement. So I want to use a regexp to find the first actual character.
Take that character and find out what its numbered position is in the total string.
Use a slice statement to cut 125 characters starting from that position.
def about_us_truncated
  # Find the index of the first actual (alphanumeric) character,
  # then slice 125 characters starting there.
  start = about_us.index(/[A-Za-z0-9]/) || 0
  about_us[start, 125]
end
The only other idea I got is a regex statement that explicitly slices only actual characters, something like:
about_us([a-zA-Z][0..125]), but that is definitely not how it is written.
Here is some sample text of MS Word junk:
&Lt;! [If Gte Mso 9]>&Lt;Xml>&Lt;Br /> &Lt;O:Office Document Settings>&Lt;Br /> &Lt;O:Allow Png/>&Lt;Br /> &Lt;/O:Off...
You haven't provided much information to go off of, but don't be too leery of trying to build this regex on your own before you seek help...
Take your sample text and paste it in Rubular in the test string area and start building your regex. It has a great quick reference at the bottom.
Stumbled across this:
http://gist.github.com/139987
It looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best possible one you can find.
In order to prevent the MS Word junk, you should be using CKEditor's built-in MS Word sanitizer. Writing a regex for this can be very complicated, and you can very easily break tags in half and destroy your site with it.
What I did as a workaround is force paste-as-plain-text in CKEditor.
