Processing HTML with UIMA - html-parsing

I am trying to get my head around the UIMA architecture.
I would like to create a pipeline that starts with HTML markup. I need to strip this to plain text, so it can be processed by different annotators, like POS, chunking, entity detection, etc. However, I would also like to keep track of which regions correspond to the original HTML tags, like links, paragraphs, em, etc. Basically I would like a final annotator that takes advantage of structural annotations (from the HTML) and semantic annotations (from the other components), all at once.
So, I can imagine starting off with a component that strips the HTML markup and adds annotations to keep track of the tags I am interested in. Does such a component exist already? It seems like something a lot of people would want.
If I do have to create it from scratch, what kind of component is it? It's not just a straight annotator, because it needs to change the SOFA: it needs to replace the markup with plain text.
Or should I have it create a new view of the document, so we maintain a markup view and a plain text view of the document? This seems weird, considering I will never care about the markup view again. Also, how would I make sure the other annotators (which I won't be coding myself) operate on the plain text view of the document rather than the markup view?

Depending on the complexity of the markup, some people use Apache Tika, and some people use Boilerpipe.
Here is a blog post from someone who wanted to use Boilerpipe in UIMA but ran into a snag because he wanted to preserve offsets back to the HTML.
Here is the UIMA annotator that calls Tika.

UIMA Ruta provides some analysis engines for this task. The HtmlAnnotator creates annotations in the HTML text for the different tags. The HtmlConverter is able to create a new view that contains only the text of the HTML, but with the corresponding annotations for the tags. There are some configuration parameters for handling line breaks and so on. For further processing without sofa mappings in a pipeline, there is the ViewWriter, which is able to copy the new plain text view to the _InitialView of a new file.
DISCLAIMER: I am a developer of UIMA Ruta
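To make the "strip the markup but keep the tag regions" idea concrete, here is a minimal sketch in plain Python. It is not UIMA code and not how HtmlConverter is implemented; the class name StripWithOffsets and the function strip_html are made up for illustration. It only shows the bookkeeping involved: every tag becomes a (tag, begin, end) span whose offsets point into the stripped text rather than into the original HTML.

```python
from html.parser import HTMLParser


class StripWithOffsets(HTMLParser):
    """Collects plain text plus (tag, begin, end) spans measured in that text.

    Hypothetical helper for illustration; not part of UIMA or UIMA Ruta.
    """

    # Elements that never have a closing tag, so they cover no text.
    VOID = {"br", "hr", "img", "input", "meta", "link"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # pieces of plain text, in document order
        self.length = 0        # length of the plain text emitted so far
        self.open_tags = []    # stack of (tag, begin offset in plain text)
        self.annotations = []  # finished (tag, begin, end) spans

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the nearest matching open tag, tolerating sloppy nesting.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, begin = self.open_tags.pop(i)
                self.annotations.append((tag, begin, self.length))
                break

    def handle_data(self, data):
        self.parts.append(data)
        self.length += len(data)


def strip_html(markup):
    """Return (plain_text, annotations) for an HTML string."""
    parser = StripWithOffsets()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts), parser.annotations


text, annotations = strip_html('<p>Hello <em>big</em> <a href="x">world</a></p>')
for tag, begin, end in annotations:
    print(tag, begin, end, repr(text[begin:end]))
```

In a real pipeline the spans would become annotations on the plain text view, so a downstream annotator can combine them with POS or entity annotations over the same offsets.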

Related

Search in Content and skip html elements in mvc

I want to search in content without getting false results.
Assume a user searches for 'br': I don't want the results to include content that matches only because it contains <br>, <P>, or other HTML elements.
Simply, you must strip the tags before you search. However, that would mean not being able to query the database directly. Rather, you'd have to pull all the objects first, and then query the collection in memory.
If you're going to be doing a lot of this or have large collections of objects (where pulling all of them for the initial query would be a performance drag), then you should look into a true search solution. I've been working with Elasticsearch, which seems to be just about the best out there in my opinion. It's easy to set up, easy to use, and has third-party .NET integration through the nuget package, NEST.
With a true search solution, you can index your content fields, stripped of HTML, and then run your queries on the index instead of directly on your database. You'll also get powerful advanced features such as faceting, which would be difficult or impossible to do directly with Entity Framework.
Alternatively, if you can't go all in on a search solution and it's unacceptable to query everything up front (which really it pretty much always is), then your only other option is to create another companion field for each HTML content property, and always save an HTML-stripped copy of the text there. Then, use that field for your search queries.
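The companion-field idea can be sketched as follows. This is a language-neutral illustration in Python rather than MVC or Entity Framework code, and the field names body_html and body_text, as well as the helpers strip_tags, save, and search, are assumptions made up for the example. In a real app the stripped copy would be a database column filled in on save and queried directly.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the text content with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html_text)
    parser.close()
    # Join with spaces so "one<br>two" does not become "onetwo".
    return " ".join(" ".join(parser.chunks).split())


def save(record):
    """Fill the companion field whenever the HTML field is written."""
    record["body_text"] = strip_tags(record["body_html"])
    return record


def search(records, term):
    """Search the stripped copy only, so tag names never match."""
    term = term.lower()
    return [r for r in records if term in r["body_text"].lower()]


records = [
    save({"id": 1, "body_html": "<p>First line<br>second line</p>"}),
    save({"id": 2, "body_html": "<p>A bright idea</p>"}),
]
print([r["id"] for r in search(records, "br")])
```

Here a search for 'br' only finds the record containing the word "bright"; the record whose only "br" is the <br> tag is not returned.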

Ruby on rails, markup interpreter with custom tags provided on runtime? For forms, not views

Site for writers and readers; both groups will be non-technical users (writers will be familiar with BBCode already, but I can choose other markup). Writers will write guides using markup tags to embed info. Readers will be presented with the parsed text, with the tags expanded into the corresponding info.
The number of tags needed, as well as the info tied to a particular tag, will change, so they cannot be hard-coded.
I'm looking for an interpreter that can use tags provided at run time, for my next Ruby on Rails app. Does anyone know of one?
Edit: Yeah. I'm not looking for view markup, but for textarea markup in forms, to be used by website users (to format their guides; I do need ONE markup for both formatting and embedding info).
Based on my current understanding of your needs, I recommend mustache. This is described as a "logic-less" template processor. It doesn't have programming logic, simply run-time replacements.
Here's one way to use it (from the GitHub README).
Given this template (winner.mustache):
Hello {{name}}
You have just won {{value}} bucks!
We can fill in the values at will:
view = Winner.new
view[:name] = 'George'
view[:value] = 100
view.render
Which returns:
Hello George
You have just won 100 bucks!
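To show what "logic-less, run-time replacement" boils down to, here is a small stand-alone sketch in Python. It is not the mustache library and covers only plain {{name}} tags; the render function and its behaviour for unknown tags are assumptions for illustration. The point is that the set of tags lives in data (a dictionary that could be loaded from the database), not in code.

```python
import re

# Matches {{name}} with optional spaces inside the braces.
_TAG = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, values):
    """Replace each {{name}} with values[name]; unknown tags are left as-is."""

    def replace(match):
        key = match.group(1)
        if key in values:
            return str(values[key])
        return match.group(0)

    return _TAG.sub(replace, template)


template = "Hello {{name}}\nYou have just won {{value}} bucks!"
print(render(template, {"name": "George", "value": 100}))
```

Leaving unknown tags untouched is a design choice for this sketch: it makes a writer's typo visible in the output instead of silently dropping it.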

Rails: Embed metadata in templates – YAML in my HAML?

I would like to be able to set things like the page title and <meta> description from within HAML “pages” served up by my static page controller.
Is there a good way to do this? Ideally, I see it working something like:
Name files like about_us.html.haml.yaml
Use the normal render method
But now there is a hash of metadata available to my controller and layout templates, which set various headers and elements, respectively.
Thoughts?
(Since no one contributed a full answer)
If you want to set up title, description, noindex or similar tags in the head, then github.com/kpumuk/meta-tags is the best way to do it! I've used it in various projects, and I think it's the best gem ever for manipulating the title, description and other stuff that sits in the head tag.
— Dmitry Polushkin
It seems to work well for me, though it is a touch less powerful than what my question was looking for. Further answers welcome.

Making tagsoup markup cleansing optional

Tagsoup is interfering with input and formatting it incorrectly. For instance when we have the following markup
Text outside anchor
It is formatted as follows
Text outside anchor
This is a simple example but we have other issues as well. So we made tagsoup cleanup/formatting optional by adding an extra attribute to textarea control.
Here is the diff (https://github.com/binnyg/orbeon-forms/commit/044c29e32ce36e5b391abfc782ee44f0354bddd3).
Textarea would now look like this
<textarea skip-cleanmarkup="true" mediatype="text/html" />
Two questions
Is this the right approach?
If I provide a patch can it make it to orbeon codebase?
Thanks
BinnyG
Erik, Alex, et al
I think there are two questions here:
The first concern is a question of TagSoup and the clean-up that happens OOTB: empty tags are converted to singleton tags which, when sent to the client browser as markup, get "fixed" by browsers like Firefox, but because of the loss of precision they do the wrong thing.
Turning off this clean-up helps in this case, but for this issue alone it is not really the right answer because it takes away a security feature and a well-formed markup feature... so there may need to be some adjustment to the handling of at least certain empty tags (other than turning them into invalid singleton tags).
All this brings us to the second concern, which is: do we always want those features in play? Our use case says no. We want the user to be able to spit out whatever markup they want, invalid or not. We're not putting the form in an app that needs to protect the user from cross-site scripting; we're building a tool that lets users edit web pages, hence we have turned off the clean-up.
But turning off clean-up wholesale? Well, it's important that we can do it if that's what our use case calls for, but the implementation we have is all or nothing. It would be nice to be able to define strategies for clean-up and make that function pluggable. For example:
* In the XML config of the system, define a "map" of config names to class names which implement a given strategy. In the XForm definition, the author would specify the name from the map.
If TagSoup transforms:
Text outside anchor
Into:
Text outside anchor
Wouldn't that be a bug in TagSoup? If that was the case, then I'd say that it is better to fix this issue rather than disable TagSoup. But it isn't a bug in TagSoup; here is what seems to be happening. Say the browser sends the following to the server:
<a shape="rect"></a>After<br clear="none">
This goes through TagSoup, the result goes through the XSLT clean-up code, and the following is sent to the browser:
<a shape="rect"/>After<br clear="none"/>
The issue is on the browser, which transforms this into:
<a shape="rect">After</a><br clear="none"/>
The problem is that we serialize this as XML with Dom4jUtils.domToString(cleanedDocument), while it would be more prudent to serialize it as HTML. Here we could use the Saxon serializer, which is also used by HTMLSerializer. Maybe you can try changing this code to use it instead of Dom4jUtils.domToString(). Let us know what you find when you get a chance to do that.
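The difference between the two serializations is easy to reproduce outside Orbeon. The sketch below uses Python's standard xml.etree.ElementTree purely as a stand-in; it is not the Saxon serializer or Dom4jUtils, and the wrapping div and the function name serialize_both are assumptions added so the fragment has a single root element.

```python
import xml.etree.ElementTree as ET


def serialize_both(fragment):
    """Return the same tree serialized as XML and as HTML."""
    root = ET.fromstring(fragment)
    as_xml = ET.tostring(root, encoding="unicode", method="xml")
    as_html = ET.tostring(root, encoding="unicode", method="html")
    return as_xml, as_html


as_xml, as_html = serialize_both(
    '<div><a shape="rect"></a>After<br clear="none"/></div>'
)
print(as_xml)   # empty <a> is collapsed to a self-closing tag
print(as_html)  # empty <a> keeps its explicit closing tag
```

With the XML method the empty anchor comes out self-closed, which an HTML parser in the browser reads as an open tag that swallows the following text. With the HTML method the anchor is written as a start tag plus an end tag, and br is written as a void element without a slash.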
Binesh and I agree: if there is a bug, it would be a good idea to address the issue closer to the root. But I think the specific issue here is only part of the matter.
We're thinking it would be best to have some kind of name-to-strategy mapping so that RTEs can call in the server-side processing that is right for them or the default if it's not specified.
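A name-to-strategy mapping of this kind could look roughly like the following. This is a generic Python sketch, not Orbeon configuration or code; the strategy names "escape" and "none", the STRATEGIES table, and the clean function are all hypothetical. In the proposal above the table would come from the XML config and map names to class names.

```python
import html


def escape_all(markup):
    """Conservative strategy: show the markup as text instead of rendering it."""
    return html.escape(markup)


def passthrough(markup):
    """No clean-up at all: the author takes full responsibility for the markup."""
    return markup


# Hypothetical registry; a real system would load this from configuration.
STRATEGIES = {
    "escape": escape_all,
    "none": passthrough,
}
DEFAULT_STRATEGY = "escape"


def clean(markup, strategy_name=None):
    """Run the named clean-up strategy, or the default when none is given."""
    name = strategy_name or DEFAULT_STRATEGY
    try:
        strategy = STRATEGIES[name]
    except KeyError:
        raise ValueError(f"unknown clean-up strategy: {name!r}") from None
    return strategy(markup)


print(clean('<a shape="rect"></a>After', "none"))
print(clean("<b>bold</b>"))
```

Each rich text control would then name the strategy it wants, and controls that name nothing keep the safe default.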

Fill a rails form with a hashmap

I have a difficult situation.
I let the user create a form through a Rich Text Editor and then I save this.
So for example, I save this literally into my DB:
http://pastebin.com/DNdeetJp (how can you post HTML here? It gets interpreted, so now I use pastebin...)
On another page I wrap this in a form_tag and it gets presented as it should be.
What I want to do is save this as a template and save the answers as a hashmap to my DB.
This works well, but the problem is I want to recreate what checkbox/radiobutton/... is selected when the user goes back to the page. So I want to fill the form with the answers from the hashmap.
Is there a way to use a 'dummy' model or something else to accomplish this?
Thanks!
Since you're pasting in raw HTML which is not properly configured as a template, it is more difficult to enable the proper options based on whatever might be stored in your DB.
The reliable approach to making this work is to use Hpricot or Nokogiri to manipulate the bit of HTML you have and substitute values accordingly. This isn't too hard so long as you can define the elements in that form using a proper selector. For example, create a div with a unique id and operate on all input elements within it, comparing the name attribute with your properties. There may even be a library for this somewhere.
The second approach is to use JavaScript to enable the options in much the same fashion. This seems like a bit of a hack since the form itself will not have a proper default state.
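The first approach (rewrite the stored HTML on the server and set the saved values) can be sketched like this. The answer suggests Hpricot or Nokogiri for a Rails app; the version below uses Python's standard html.parser only as a stand-in to show the logic, and the names FormFiller and fill_form are made up for the example. It handles input elements only (text, checkbox, radio), not select or textarea.

```python
from html import escape
from html.parser import HTMLParser


class FormFiller(HTMLParser):
    """Re-emits an HTML fragment, setting input state from a dict of answers."""

    def __init__(self, answers):
        super().__init__(convert_charrefs=False)
        self.answers = answers
        self.out = []

    def _selected(self, name, value):
        answer = self.answers.get(name)
        if isinstance(answer, (list, tuple, set)):
            return value in answer  # several checkboxes sharing one name
        return answer is not None and value == str(answer)

    def _render(self, tag, attrs, end=">"):
        attrs = dict(attrs)
        if tag == "input" and "name" in attrs:
            kind = (attrs.get("type") or "text").lower()
            name = attrs["name"]
            if kind in ("checkbox", "radio"):
                attrs.pop("checked", None)  # drop the stored default state
                if self._selected(name, attrs.get("value") or "on"):
                    attrs["checked"] = "checked"
            elif name in self.answers:
                attrs["value"] = str(self.answers[name])
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs.items()
        )
        return f"<{tag}{rendered}{end}"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, end=" />"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def fill_form(form_html, answers):
    filler = FormFiller(answers)
    filler.feed(form_html)
    filler.close()
    return "".join(filler.out)


stored = ('<input type="radio" name="color" value="red">'
          '<input type="radio" name="color" value="blue">'
          '<input type="text" name="nick">')
print(fill_form(stored, {"color": "blue", "nick": "Al & Bo"}))
```

The stored template stays untouched in the database; the filled-in copy is produced per request from the template plus the saved hash of answers.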

Resources