UIMA Ruta: Make HTMLAnnotator annotate more tags - html-parsing

I'm relatively new to UIMA Ruta and I need to process HTML documents. I already have a ProcessHTML.ruta script which is basically the same as in the documentation (with minor adjustments):
ENGINE utils.HtmlAnnotator;
ENGINE utils.HtmlConverter;
ENGINE HtmlViewWriter;
TYPESYSTEM utils.HtmlTypeSystem;
TYPESYSTEM utils.SourceDocumentInformation;
Document{->CONFIGURE(HtmlAnnotator, "onlyContent"=true), EXEC(HtmlAnnotator, {TAG})};
Document { -> CONFIGURE(HtmlConverter, "inputView" = "_InitialView",
"outputView" = "plain", "expandOffsets"=false, "replaceLinebreaks"=true, "skipWhitespacs"=true, "linebreakReplacement"=" ", "processAll"=true),
EXEC(HtmlConverter)};
Document{ -> CONFIGURE(HtmlViewWriter, "inputView" = "plain",
"outputView" = "_InitialView", "output" = "../converted/"),
EXEC(HtmlViewWriter)};
I noticed that I might need layout information from the HTML source for my next script which is not present currently. For example, text is often marked up with tags, but there is no STRONG annotation in the output. If I understand correctly, all tags not implemented in HTMLTypeSystem are annotated with a default TAG annotation.
Is it possible to define additional annotations for specific HTML tags to be retained? Is there some configuration for this or do I need to extend the annotator somehow?

Adding the following to HTMLTypeSystem.xml did the trick:
<typeDescription>
<name>org.apache.uima.ruta.type.html.STRONG</name>
<description></description>
<supertypeName>org.apache.uima.ruta.type.html.TAG</supertypeName>
</typeDescription>
(Kudos to a colleague who figured that one out)

Related

Dart Markdown package, how to handle new lines

I am trying to make a WYSIWYG internal tool. And we decided to implement this feature with contentEditable. However, we save data to our databases in markdown. So I have to be able to parse from html to md and back. For html to md I use package html2md and for the other way around I use Markdown package.
The issue i've been having is that when you write to my editor text like
HEY
After many lines some text
It produces this in md
HEY
After many lines some text
Notably it uses 2 whitespace and 2 LF characters (or atleast i think so but i might be slightly wrong.) I solved this issue by parsing it like this
markdownToHtml(data.replaceAll('&', '&').replaceAll('<', '<').replaceAll('>', '>'), inlineSyntaxes: [TextSyntax(String.fromCharCodes([32,32,10,10]),sub: "<div><br></div>")],inlineOnly: true );
The inline only parameter was neccesary because without it the text syntax wasnt applied for some reason. However this inline only then bit me in the arse when I tried to implement parsing of unordered lists, which are parsed as blocks. So I need a way to correctly parse these empty lines without using inline only.
class EmptyLineBlockSyntax extends BlockSyntax{
RegExp get pattern => RegExp(r'^(?:[ \t][ \t]+)$');
const EmptyLineBlockSyntax();
Node parse(BlockParser parser) {
parser.encounteredBlankLine = true;
parser.advance();
return Element('p',[Element.empty('br')]);
}
}
return markdownToHtml(data.replaceAll('&', '&').replaceAll('<', '<').replaceAll('>', '>'), blockSyntaxes: [EmptyLineBlockSyntax()]);

How do I parse wikitext using built-in mediawiki support for lua scripting?

The wiktionary entry for faint lies at https://en.wiktionary.org/wiki/faint
The wikitext for the etymology section is:
From {{inh|en|enm|faynt}}, {{m|enm|feynt||weak; feeble}}, from
{{etyl|fro|en}} {{m|fro|faint}}, {{m|fro|feint||feigned; negligent;
sluggish}}, past participle of {{m|fro|feindre}}, {{m|fro|faindre||to
feign; sham; work negligently}}, from {{etyl|la|en}}
{{m|la|fingere||to touch, handle, usually form, shape, frame, form in
thought, imagine, conceive, contrive, devise, feign}}.
It contains various templates of the form {{xyz|...}}
I would like to parse them and get the text output as it shows on the page:
From Middle English faynt, feynt (“weak; feeble”), from Old French
faint, feint (“feigned; negligent; sluggish”), past participle of
feindre, faindre (“to feign; sham; work negligently”), from Latin
fingere (“to touch, handle, usually form, shape, frame, form in
thought, imagine, conceive, contrive, devise, feign”).
I have about 10000 entries extracted from the freely available dumps of wiktionary here.
To do this, my thinking is to extract templates and their expansions (in some form). To explore the possibilites I've been fiddling with the lua scripting facility on mediawiki. By trying various queries inside the debug console on edit pages of modules, like here:
https://en.wiktionary.org/w/index.php?title=Module:languages/print&action=edit
mw.log(p)
>> table
mw.logObject(p)
>> table#1 {
["code_to_name"] = function#1,
["name_to_code"] = function#2,
}
p.code_to_name("aaa")
>>
p.code_to_name("ab")
>>
But, I can't even get the function calls right. p.code_to_name("aaa") doesn't return anything.
The code that presumably expands the templates for the etymology section is here:
https://en.wiktionary.org/w/index.php?title=Module:etymology/templates
How do I call this code correctly?
Is there a simpler way to achieve my goal of parsing wikitext templates?
Is there some function available in mediawiki that I can call like "parse-wikitext("text"). If so, how do I invoke it?
To expand templates (and other stuff) in wikitext, use frame.preprocess, which is called as a method on a frame object. To get a frame object, use mw.getCurrentFrame. For instance, type = mw.getCurrentFrame():preprocess('{{l|en|word}}') in the console to get the wikitext resulting from {{l|en|word}}. That currently gives <span class="Latn" lang="en">[[word#English|word]]</span>.
You can also use the Expandtemplates action in the MediaWiki API ( https://en.wiktionary.org/w/api.php?action=expandtemplates&text={{l|en|word}}), or the Special:ExpandTemplates page, or JavaScript (if you open the browser console while browsing a Wiktionary page):
new mw.Api().get({
action: 'parse',
text: '{{l|en|word}}',
title: mw.config.values.wgPageName,
}).done(function (data) {
const wikitext = data.parse.text['*'];
if (wikitext)
console.log(wikitext);
});
If the mw.api library hasn't already been loaded and you get a TypeError ("mw.Api is not a constructor"):
mw.loader.using("mediawiki.api", function() {
// Use mw.Api here.
});
So these are some of the ways to expand templates.

Which settings should be used for TokensregexNER

When I try regexner it works as expected with the following settings and data;
props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, regexner");
Bachelor of Laws DEGREE
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE
What I would like to do is that using TokenRegex. For example
Bachelor of Laws DEGREE
Bachelor of ([{tag:NNS}] [{tag:NNP}]) DEGREE
I read that to do this, I should use TokensregexNERAnnotator.
I tried to use it as follows, but it did not work.
Pipeline.addAnnotator(new TokensRegexNERAnnotator("expressions.txt", true));
Or I tried setting annotator in another way,
props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, tokenregexner");
props.setProperty("customAnnotatorClass.tokenregexner", "edu.stanford.nlp.pipeline.TokensRegexNERAnnotator");
I tried to different TokenRegex formats but either annotator could not find the expression or I got SyntaxException.
What is the proper way to use TokenRegex (query on tokens with tags) on NER data file ?
BTW I just see a comment in TokensRegexNERAnnotator.java file. Not sure if it is related pos tags does not work with RegexNerAnnotator.
if (entry.tokensRegex != null) {
// TODO: posTagPatterns...
pattern = TokenSequencePattern.compile(env, entry.tokensRegex);
}
First you need to make a TokensRegex rule file (sample_degree.rules). Here is an example:
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
{ pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }
To explain the rule a bit, the pattern field is specifying what type of pattern to match. The action field is saying to annotate every token in the overall match (that is what $0 represents), annotate the ner field (note that we specified ner = ... in the rule file as well, and the third parameter is saying set the field to the String "DEGREE").
Then make this .props file (degree_example.props) for the command:
customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator
tokensregex.rules = sample_degree.rules
annotators = tokenize,ssplit,pos,lemma,ner,tokensregex
Then run this command:
java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props degree_example.props -file sample-degree-sentence.txt -outputFormat text
You should see that the three tokens you wanted tagged as "DEGREE" will be tagged.
I think I will push a change to the code to make tokensregex link to the TokensRegexAnnotator so you won't have to specify it as a custom annotator.
But for now you need to add that line in the .props file.
This example should help in implementing this. Here are some more resources if you want to learn more:
http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexRules
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html

How do I disable Transformations in TYPO3 RTE Editor?

I created a custom extension for TYPO3 CMS.
It basically does some database queries to get text from database.
As I have seen, TYPO3 editor, transforms data before storing it in database so for example a link <a href="....." >Link</a> is stored as <link href>My Link Text</link> and so on for many tags like this.
when I query data from DB, I get it as it is stored in DB (<link href>My Link Text</link>)
so links are not displayed as they shoud. They display as normal text..
As far as I know there are two ways to go:
disable RTE transformations (how to do that?)
use lib.parseFunc_RTE (which i have no Idea how to configure it properly)
any idea?
thanks.
I guess you're not using Extbase and Fluid? Just as a reference, if you are using Extbase and Fluid for your extension you can render text from the RTE using Fluid:
<f:format.html>{bodytext}</f:format.html>
This uses lib.parseFunc_RTE to render the RTE text as HTML. You can also tell it to use a different TypoScript object for the rendering:
<f:format.html parseFuncTSPath="lib.my_parseFunc">{bodytext}</f:format.html>
Useful documentation:
parseFunc
Fluid format.html
I came across the same problem, but using EXTBASE the function "pi_RTEcssText" ist not available anymore. Well maybe it is, but I didn't know how to include it.
Anyway, here's my solution using EXTBASE:
$this->cObj = $this->configurationManager->getContentObject();
$bodytext = $this->cObj->parseFunc($bodyTextFromDb, $GLOBALS['TSFE']->tmpl->setup['lib.']['parseFunc_RTE.']);
This way I get the RTE formatted text.
I have managed to do it by configuring the included typoscript:
# Creates persistent ParseFunc setup for non-HTML content. This is recommended to use (as a reference!)
lib.parseFunc {
makelinks = 1
makelinks.http.keep = {$styles.content.links.keep}
makelinks.http.extTarget < lib.parseTarget
makelinks.http.extTarget =
makelinks.http.extTarget.override = {$styles.content.links.extTarget}
makelinks.mailto.keep = path
tags {
link = TEXT
link {
current = 1
typolink.parameter.data = parameters : allParams
typolink.extTarget < lib.parseTarget
typolink.extTarget =
typolink.extTarget.override = {$styles.content.links.extTarget}
typolink.target < lib.parseTarget
typolink.target =
typolink.target.override = {$styles.content.links.target}
parseFunc.constants =1
}
}
allowTags = {$styles.content.links.allowTags}
And denied tag link:
denyTags = link
sword = <span class="csc-sword">|</span>
constants = 1
nonTypoTagStdWrap.HTMLparser = 1
nonTypoTagStdWrap.HTMLparser {
keepNonMatchedTags = 1
htmlSpecialChars = 2
}
}
Well, just so if anyone else runs into this problem,
I found one way to resolve it by using pi_RTEcssText() function inside my extension file:
$outputText=$this->pi_RTEcssText( $value['bodytext'] );
where $value['bodytext'] is the string I get from the database-query in my extension.
This function seems to process data and return the full HTML (links, paragraphs and other tags inculded).
Note:
If you haven't already, it requires to include this file:
require_once(PATH_tslib.'class.tslib_pibase.php');
on the top of your extension file.
That's it basically.

How to use locale entity in js-code

is it possible to get the value of an entity
<!ENTITY gatwayError "Gateway error">
using javascript? For now I reference them in my xul file using
&gatewayError;
UPDATE: In my ff-sidebar.xul within the <page> I have
<stringbundleset id="stringbundleset">
<stringbundle id="strings"
src="chrome://myaddon/locale/de/sidebar.properties"/>
</stringbundleset>
In my ff-sidebar.js I do on click:
var strbundle = document.getElementById("strings");
var localizedString = strbundle.getString("test");
This gives me following error
Should it not be
var strbundle = document.getElementById("stringbundleset");
This gives me no error but no result too.
Basically what Neil posted there is what you need to do (minus first paragraph rant :P )
Here's an example (basically digest from Neil's links):
Your XUL file:
<stringbundleset id="strbundles">
<stringbundle id="strings" src="chrome://yourextension/locale/something.properties"/>
</stringbundleset>
Your something.properties (there you define your localized strings key=value). Of course you can have as many files as you want/need:
something=Some text for localization
something2=Some more text
Your js file:
var strbundle = document.getElementById("strings");
var localizedString = strbundle.getString("something");
Hope this helps.
This works for small numbers of entities. For instance, menuitems sometimes have two entities with slightly different text depending on what the menuitem will be used for, and the correct entity is then copied to the label. The worst abuse of this was for the Delete menuitem in Thunderbird and SeaMonkey's mail windows, which had labels for unsubscribing from newsgroups, deleting folders, cancelling news posts, deleting single or multiple messages, or undeleting single or multiple messages from folders using the IMAP mark as delete model. Phew!
If you have lots of locale data then the best thing is to put it in its own .properties file and read it using a <stringbundle>. If your script doesn't have access to a <stringbundle> element it is also possible to manually retrieve an nsIStringBundle from the nsIStringBundleService.

Resources