How can I use Apache Tika to extract css and html text

I want to use Apache Tika to extract HTML text and CSS class names so that I can build a POI spreadsheet. I can get the text, but how do I extract the CSS class names?
Thank you in advance.

Try creating a custom handler. If you override the startElement method, you'll have access to the HTML attributes. Inheriting from BodyContentHandler is a simple starting point. If the element you're targeting isn't being mapped, and so is never passed into startElement, you'll need to tell the ParseContext to let it through, either by using the IdentityHtmlMapper or by writing your own mapper.
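A minimal sketch of the handler idea, not a definitive implementation. It uses only the JDK's SAX classes so it runs without Tika on the classpath; the names ClassNameCollector and collect are hypothetical. With Tika you would put the same startElement override in a subclass of BodyContentHandler and pass that to the parser. This assumes the class attribute reaches startElement unchanged once the mapper lets the element through.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every CSS class name seen on any element.
class ClassNameCollector extends DefaultHandler {
    private final List<String> classNames = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        // The "class" attribute may hold several whitespace-separated names.
        String value = attributes.getValue("class");
        if (value == null) {
            return;
        }
        for (String name : value.trim().split("\\s+")) {
            if (!name.isEmpty()) {
                classNames.add(name);
            }
        }
    }

    public List<String> getClassNames() {
        return classNames;
    }

    // Stand-in for the Tika parse call: feeds well-formed XHTML to the handler
    // through the JDK SAX parser (assumption: Tika emits comparable SAX events).
    static List<String> collect(String xhtml) throws Exception {
        ClassNameCollector handler = new ClassNameCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getClassNames();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div class=\"header main\">"
                + "<p class=\"intro\">Hi</p></div></body></html>";
        System.out.println(collect(xhtml)); // prints [header, main, intro]
    }
}
```

The collected names come out in document order, so they can be written to spreadsheet rows as they are found. To pair each class name with its text, the same handler could also override characters and buffer the text between startElement and endElement.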

You could run Tika from the command line:
java -jar tika-app.jar -h [file|port...]
(-h or --html is the option that outputs the content as HTML.)
You could also do it programmatically by using the HTML parser:
Parser parser = new HtmlParser();
That's not enough, since the HTML parser first transforms the incoming HTML document to well-formed XHTML and then maps the included elements to a “safe” subset. The default mapping drops things such as <div> and <span> elements that don’t affect the text content of the HTML page and applies other normalization rules. This default mapping produces good results in most use cases, but sometimes a client wants more direct access to the original HTML markup. The IdentityHtmlMapper class can be used to achieve this:
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, new IdentityHtmlMapper());
Finally you can get your content by calling the parse method:
parser.parse(stream, handler, metadata, context);
Hope this helps a bit. :)

Related

Parse a HTML string into a document in JScript ES3

As JScript is the 'out of browser', Microsoft ES3 variant of JavaScript, it's hard to do something as simple as parsing an HTML string into an object.
As mentioned, JScript does not reside in a browser, so it has neither the standard document type nor DOMParser.
I can make a document object as so:
var document = new ActiveXObject('htmlfile')
document.innerHTML = http.responseText
and while this will render the html response into a document, I cannot use getElementsByClassName, TagName or even ID - which is exactly what I need to do with the html responses I'm looking at (a mix of those mentioned).
I've tried using John Resig's "pure javascript HTML parser", but that won't run in ES3 and I am not versed enough in JScript/ES3 to understand why not.
https://johnresig.com/blog/pure-javascript-html-parser/
Ultimately, I want to parse the HTML file into a document object and be able to pull elements by their class, id, tag name, etc. It sounds like this should be easy, but it isn't.
Any help would be appreciated.

getElementById and getElementsByTagName seem to work:
var document = new ActiveXObject('htmlfile');
document.open();
document.write('<html><div id="div1" class="class1">test</div></html>');
document.close();
WScript.Echo(document.getElementById("div1").id);
WScript.Echo(document.getElementsByTagName("div")[0].id);
WScript.Echo(document.getElementsByTagName("div")[0].className);

Format dart code as html

I am knocking together a quick debugging view of a backend, as a small set of admin HTML pages (driven by angulardart, but not sure that is critical).
I get back from my XHR call a complex JSON object. I want to see that on the HTML page formatted nicely. It doesn't have to be a great implementation, as it's just a debug UI, but the goal is to format the object instead of having it be one long string with no newlines.
I looked at trying to pretty print JSON in Dart and then putting that inside <pre></pre> tags, as well as just dumping the Dart Map object to a string (again, inside or not inside <pre></pre> tags), but I'm not getting to where I want.
Even searched pub for something similar, such as a syntax highlighter that would output html, but didn't find something obvious.
Any recommendations?

I think what you're looking for is:
1. Format your JSON so it's readable
2. Have syntax highlighting
For 1: this can be done with JsonEncoder with an indent.
For 2: you can use the JS library HighlightJs pretty easily by appending your formatted JSON to a marked-up div. (See the HighlightJs docs to see what I mean.)

How to keep attributes with parseFragment in Firefox extension

In our Firefox extension we use parseFragment to parse a string of HTML (received from a third-party server) into a sanitized DocumentFragment, as required by Mozilla. The only problem is that the parser removes all the attributes we need, for example the class attribute.
Is it possible somehow to keep class attributes while parsing HTML with parseFragment?
P.S. I know that in Gecko 14.0 they replaced this function with another which supports sanitizing parameters. But what to do with Gecko < 14.0?

No, the whitelist is hardcoded and cannot be adjusted. However, the class attribute is in the whitelist and should be kept, you probably meant the style attribute? If you need a customized behavior you will have to use a different solution (like DOMParser which can parse HTML documents in Firefox 12).
As to older Firefox versions, you can parse XHTML data with DOMParser there. If you really have HTML then I am only aware of one way to parse it without immediately inserting it into a document (which might cause various security issues): range.createContextualFragment(). You need an HTML document for that, if you don't have one - a hidden <iframe> loading about:blank will do as well. Here is how it works:
// Get the HTML document
var doc = document.getElementById("dummyFrame").contentDocument;
// Parse data
var fragment = doc.createRange().createContextualFragment(htmlData);
// Sanitize it
sanitizeData(fragment);
Here sanitizing the data is your own responsibility. You probably want to base your sanitization on Mozilla's whitelist that I linked to above: remove all tags and attributes that are not on that list, and also make sure to check the links. The style attribute is a special case: it used to be insecure but IMHO no longer is, given that -moz-binding isn't supported on the web any more.

setting innerHTML in xul

In my browser.xul code, I am trying to fetch data from an HTML file and insert it into my div element.
I am trying to use div.innerHTML, but I am getting an exception:
Component returned failure code: 0x804e03f7
[nsIDOMNSHTMLElement.innerHTML]
I tried to parse the HTML using Components.interfaces.nsIScriptableUnescapeHTML and to append the parsed HTML to my div, but my problem is that style (both the attribute and the tag) and script aren't parsed.

First a warning: if your HTML data comes from the web then you are trying to build a security hole into your extension. HTML code from the web should never be trusted (even when coming from your own web server and via HTTPS) and you should really use nsIScriptableUnescapeHTML. Styles should be part of your extension, using styles from the web isn't safe. For more information: https://developer.mozilla.org/En/Displaying_web_content_in_an_extension_without_security_issues
As to your problem, this error code is NS_ERROR_HTMLPARSER_STOPPARSING which seems to mean a parsing error. I guess that you are trying to feed it regular HTML code rather than XHTML (which would be XML-compliant). Either way, a better way to parse XHTML code would be DOMParser, this gives you a document that you can then insert into the right place.
If the point is really to parse HTML code (not XHTML) then you have two options. One is using an <iframe> element and displaying your data there. You can generate a data: URL from your HTML data:
frame.src = "data:text/html;charset=utf-8," + encodeURIComponent(htmlData);
If you don't want to display the data in a frame you will still need a frame (can be hidden) that has an HTML document loaded (can be about:blank). You then use Range.createContextualFragment() to parse your HTML string:
var range = frame.contentDocument.createRange();
range.selectNode(frame.contentDocument.documentElement);
var fragment = range.createContextualFragment(htmlData);

XML documents don't have innerHTML, and nsIScriptableUnescapeHTML is one way to get the html parsed but it's designed for uses where the HTML might not be safe; as you've found out it throws away the script nodes (and a few other things).
There are a couple of alternatives, however. You can use the responseXML property, although this may be suboptimal unless you're receiving XHTML content.
You could also use an iframe. It may seem old-fashioned, but an iframe's job is to take a URL (the src property) and render the content it receives, which necessarily means parsing it and building a DOM. In general, when an extension running as chrome does this, it will have to take care not to give the remote content the same chrome privileges. Luckily that's easily managed; just put type="content" on the iframe. However, since you're looking to import the DOM into your XUL document wholesale, you must have already ensured that this remote content will always be safe. You're evidently using an HTTPS connection, and you've taken extra care to verify the identity of the server by making sure it sends the right certificate. You've also verified that the server hasn't been hacked and isn't delivering malicious content.

How do you change the default format to XML in Symfony?

I'm writing a restful XML API for a university assignment, the spec requires no HTML frontend.
There doesn't seem to be any documentation (or guessable functionality) regarding how to change the default format? Whilst thus far I have created all templates as ...Success.xml.php it would be easier to just use the regular ones and set this globally; I really expected this functionality to be configurable from YAML.. yet I have found some hard coded references to the HTML format.
The main issue I'm encountering is that part of the assessment is returning a 404 in a certain way (not as a 404 :/), but importantly it must always return XML, and the default setup of a missing route is an HTML 404, not XML (so it only works when I use forward404 from an action running via an XML route).
So in summary, is there a way to do this / what class(es) do I have to override?

Try putting this in factories.yml:
all:
  request:
    class: sfWebRequest
    param:
      default_format: xml
That will still need the template names changing though. It will just mean that urls that don't specify a format will revert to xml instead of html.
You can subclass sfPHPView and override the initialize method to affect this (copy and paste the initialize method from sfView); lines like this need changing:
if ('html' != $format)
You then need to change the view class used ... try this:
http://mirmodynamics.com/post/2009/02/23/symfony%3A-use-your-own-View-class
