Parse an HTML string into a document in JScript ES3 - html-parsing

As JScript is the out-of-browser, Microsoft ES3 variant of JavaScript, it's hard to do something as simple as parsing an HTML string into an object.
As mentioned, JScript does not reside in a browser, so it has neither the standard document object nor DOMParser.
I can make a document object like so:
var document = new ActiveXObject('htmlfile')
document.innerHTML = http.responseText
and while this will render the HTML response into a document, I cannot use getElementsByClassName, getElementsByTagName or even getElementById - which is exactly what I need to do with the HTML responses I'm looking at (a mix of those mentioned).
I've tried using John Resig's "pure javascript HTML parser", but that won't run in ES3 and I am not versed enough in JScript/ES3 to understand why not.
https://johnresig.com/blog/pure-javascript-html-parser/
Ultimately, I want to parse the HTML into a document object and be able to pull elements by their class, id, tag name, etc. To me it sounds like this should be easy, but it isn't.
Any help would be appreciated.

getElementById and getElementsByTagName seem to work:
var document = new ActiveXObject('htmlfile');
document.open();
document.write('<html><div id="div1" class="class1">test</div></html>');
document.close();
WScript.Echo(document.getElementById("div1").id);
WScript.Echo(document.getElementsByTagName("div")[0].id);
WScript.Echo(document.getElementsByTagName("div")[0].className);
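getElementsByClassName is not available in the older IE engine behind 'htmlfile', but you can emulate it on top of getElementsByTagName. A minimal ES3-style sketch, assuming 'document' is the htmlfile object created above:
function getElementsByClassName(doc, className) {
    var result = [];
    // '*' returns every element; doc.all is a fallback on very old IE engines
    var all = doc.getElementsByTagName('*');
    for (var i = 0; i < all.length; i++) {
        // className can hold several space-separated classes
        var classes = ' ' + all[i].className + ' ';
        if (classes.indexOf(' ' + className + ' ') !== -1) {
            result[result.length] = all[i];
        }
    }
    return result;
}
WScript.Echo(getElementsByClassName(document, 'class1')[0].id);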

Related

F# Data HTML Type Provider missing Table types

The documentation at
http://fsharp.github.io/FSharp.Data/library/HtmlProvider.html
claims the following:
"The generated type provides a type space of tables that it has managed to parse out of the given HTML Document. Each type's name is derived from either the id, title, name, summary or caption attributes/tags provided. If none of these entities exist then the table will simply be named Tablexx where xx is the position in the HTML document if all of the tables were flattened out into a list."
I am trying to parse the following url
type optionsdata = HtmlProvider<"http://finance.yahoo.com/q/op?s=DDD+Options">
I do not see any Tablexx types. When I view the page source there are </table> tags, and there certainly are tables on the HTML page. Any help is appreciated; thanks in advance.
It looks like Yahoo does not send the page with the same content that you see in a web browser when you make a plain GET request from a script. This is why the type provider cannot see the tables - they are actually missing from the HTML that reaches the type provider. You can see this by looking at the HTML that the type provider gets when you load the page with it:
type DDD = HtmlProvider<"http://finance.yahoo.com/q/op?s=DDD+Options">
DDD.GetSample().Html |> printfn "%A"
As a fix, you can view the source code in a browser, save it in a local file and then pass that to the type provider. Using this, I was able to write the following code:
type DDD = HtmlProvider<"c:/temp/yahoo.html">
let ddd = DDD.GetSample()
for r in ddd.Tables.Table1.Rows do
    printfn "%s" r.``Contract Name``
The GetSample method just loads the file from the file system. I assume you want to parse live web pages - for that, you'll need to figure out how to get the right HTML from Yahoo (presumably by setting some HTTP headers and cookies). Then you can call DDD.Parse(html) to load your actual data.
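A rough sketch of that last step using FSharp.Data's Http.RequestString; the User-Agent header here is only a guess, and the headers/cookies Yahoo actually requires are something you would have to work out yourself:
open FSharp.Data

// Same generated type as above, driven by the locally saved sample
type DDD = HtmlProvider<"c:/temp/yahoo.html">

// Fetch the live page (the header value is an assumption, not a known fix)
let html =
    Http.RequestString(
        "http://finance.yahoo.com/q/op?s=DDD+Options",
        headers = [ HttpRequestHeaders.UserAgent "Mozilla/5.0" ])

// Parse the downloaded markup with the same generated type
let live = DDD.Parse(html)
for r in live.Tables.Table1.Rows do
    printfn "%s" r.``Contract Name``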

Format dart code as html

I am knocking together a quick debugging view of a backend, as a small set of admin HTML pages (driven by AngularDart, though I'm not sure that is critical).
I get back from my XHR call a complex JSON object. I want to see that on the HTML page formatted nicely. It doesn't have to be a great implementation, as it's just a debug UI, but the goal is to format the object instead of having it be one long string with no newlines.
I looked at pretty-printing the JSON in Dart and putting that inside <pre></pre> tags, as well as just dumping the Dart Map object to a string (again, with or without <pre></pre> tags), but I'm not getting where I want.
I even searched pub for something similar, such as a syntax highlighter that would output HTML, but didn't find anything obvious.
Any recommendations?
I think what you're looking for is:
Format your JSON so it's readable
Have syntax highlighting
For 1 - this can be done with JsonEncoder.withIndent
For 2 - you can use the JS library highlight.js pretty easily by appending your formatted JSON into a marked-up div (see the highlight.js docs to see what I mean).
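For the formatting part, a minimal Dart sketch; the function name and the idea of appending a <pre> to the body are just illustrative:
import 'dart:convert';
import 'dart:html';

// Pretty-print a decoded JSON object and show it in a <pre> element.
void showDebugJson(Object data) {
  var pretty = const JsonEncoder.withIndent('  ').convert(data);
  var pre = new PreElement()..text = pretty;
  document.body.append(pre);
}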

How can I use Apache Tika to extract css and html text

I want to use Apache Tika to extract HTML text and CSS class names so that I can build a POI spreadsheet. I can get the text, but how do I extract the CSS class names?
Thank you in advance.
Try creating a custom handler. If you override the startElement method you'll have access to the HTML attributes. Inheriting from BodyContentHandler should be pretty simple as a starting point. If the element you're targeting isn't getting mapped and you're not getting it passed into startElement, you'll need to tell the ParseContext to let it through, either by using the IdentityHtmlMapper or by writing your own mapper.
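A rough sketch of such a handler (the class name and the way the class names are collected are illustrative only; it still needs the IdentityHtmlMapper described in the next answer to see the attributes):
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;

// Collects the class attribute of every element while still gathering
// the plain text the way BodyContentHandler normally does.
public class ClassCollectingHandler extends BodyContentHandler {
    private final List<String> classNames = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String cls = atts.getValue("class");
        if (cls != null) {
            classNames.add(localName + ": " + cls);
        }
        super.startElement(uri, localName, qName, atts);
    }

    public List<String> getClassNames() {
        return classNames;
    }
}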
You could run Tika from the command line:
java -jar tika-app.jar -h [file|port...]
(the -h or --html option outputs the content as HTML)
You could also do it programmatically by using the html parser:
Parser parser = new HtmlParser();
That's not enough, though, since the HTML parser first transforms the incoming HTML document to well-formed XHTML and then maps the included elements to a "safe" subset. The default mapping drops things such as <style> and <script> elements that don't affect the text content of the HTML page and applies other normalization rules. This default mapping produces good results in most use cases, but sometimes a client wants more direct access to the original HTML markup. The IdentityHtmlMapper class can be used to achieve this:
ParseContext context = new ParseContext();
context.set(HtmlMapper.class, new IdentityHtmlMapper());
Finally you can get your content by calling the parse method:
parser.parse(stream, handler, metadata, context);
Hope this helps a bit. :)
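Tying these pieces together with the handler sketched in the first answer, an end-to-end example might look roughly like this (the input file name is a placeholder):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.HtmlMapper;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.parser.html.IdentityHtmlMapper;

public class TikaClassExtractor {
    public static void main(String[] args) throws Exception {
        Parser parser = new HtmlParser();
        ParseContext context = new ParseContext();
        // Keep the original markup, including class attributes
        context.set(HtmlMapper.class, new IdentityHtmlMapper());

        ClassCollectingHandler handler = new ClassCollectingHandler();
        Metadata metadata = new Metadata();

        // "page.html" is a placeholder for your own input
        try (InputStream stream = Files.newInputStream(Paths.get("page.html"))) {
            parser.parse(stream, handler, metadata, context);
        }

        System.out.println(handler.getClassNames()); // the CSS class names
        System.out.println(handler.toString());      // the extracted text
    }
}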

setting innerHTML in xul

I have a div element in my browser.xul code; what I am trying to do is fetch data from an HTML file and insert it into that div.
I am trying to use div.innerHTML but I am getting an exception:
Component returned failure code: 0x804e03f7
[nsIDOMNSHTMLElement.innerHTML]
I tried to parse the HTML using Components.interfaces.nsIScriptableUnescapeHTML and to append the parsed HTML into my div, but my problem is that style (both the attribute and the tag) and script aren't parsed.
First a warning: if your HTML data comes from the web then you are trying to build a security hole into your extension. HTML code from the web should never be trusted (even when coming from your own web server and via HTTPS) and you should really use nsIScriptableUnescapeHTML. Styles should be part of your extension, using styles from the web isn't safe. For more information: https://developer.mozilla.org/En/Displaying_web_content_in_an_extension_without_security_issues
As to your problem, this error code is NS_ERROR_HTMLPARSER_STOPPARSING, which seems to mean a parsing error. I guess that you are trying to feed it regular HTML code rather than XHTML (which would be XML-compliant). Either way, a better way to parse XHTML code would be DOMParser; this gives you a document that you can then insert into the right place.
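For the XHTML route, a minimal sketch; xhtmlString and div stand in for your own markup and target element:
// Parse XML-compliant markup and import it into the XUL document
var parser = new DOMParser();
var parsed = parser.parseFromString(xhtmlString, "application/xhtml+xml");
div.appendChild(document.importNode(parsed.documentElement, true));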
If the point is really to parse HTML code (not XHTML) then you have two options. One is using an <iframe> element and displaying your data there. You can generate a data: URL from your HTML data:
frame.src = "data:text/html;charset=utf-8," + encodeURIComponent(htmlData);
If you don't want to display the data in a frame you will still need a frame (can be hidden) that has an HTML document loaded (can be about:blank). You then use Range.createContextualFragment() to parse your HTML string:
var range = frame.contentDocument.createRange();
range.selectNode(frame.contentDocument.documentElement);
var fragment = range.createContextualFragment(htmlData);
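The resulting fragment can then be inserted into your target element, e.g.:
div.appendChild(fragment);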
XML documents don't have innerHTML, and nsIScriptableUnescapeHTML is one way to get the html parsed but it's designed for uses where the HTML might not be safe; as you've found out it throws away the script nodes (and a few other things).
There are a couple of alternatives, however. You can use the responseXML property, although this may be suboptimal unless you're receiving XHTML content.
You could also use an iframe. It may seem old-fashioned, but an iframe's job is to take a URL (the src property) and render the content it receives, which necessarily means parsing it and building a DOM. In general, when an extension running as chrome does this, it has to take care not to give the remote content the same chrome privileges. Luckily that's easily managed; just put type="content" on the iframe. However, since you're looking to import the DOM into your XUL document wholesale, you must have already ensured that this remote content will always be safe. You're evidently using an HTTPS connection, and you've taken extra care to verify the identity of the server by making sure it sends the right certificate. You've also verified that the server hasn't been hacked and isn't delivering malicious content.

How do I remove HTML from the SAS URL access method?

What is the most convenient way to remove all the HTML tags when using the SAS URL access method to read web pages?
This should do what you want. It removes everything between the < and > (including the brackets themselves) and leaves just the content (i.e. the innerHTML).
filename INDEXIN URL "http://www.zug.com/";

Data HTMLData;
infile INDEXIN;
input;
textline = _INFILE_;
/*-- Clear out the HTML tags --*/
if _N_ = 1 then re1 = prxparse("s/<(.|\n)*?>//");
retain re1;
call prxchange(re1, -1, textline);
run;
I think the methodology is not to remove the HTML from the page, but to identify the standard patterns for the data you are trying to capture. This is the Perl / regular-expression style of approach.
An example might be some data or table that comes so many characters after the logo image. You could write a script to keep only the data.
If you want to post up some html, maybe we can help decode it.
