HTML parsing and extracting text - html-parsing

Several libraries can parse HTML pages and extract their textual content; Jsoup is one example. In my case, I would like to extract the text together with the HTML tags under which each sentence occurs. For example, take this page:
<html>
<head><title>Test Page</title>
<body>
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
</body>
</html>
I'm expecting the output to be like this:
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
In other words, I want to include specific html tags within the textual content of the page.

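To get your result you can use this (the snippet needs org.jsoup.Jsoup, org.jsoup.nodes.Document and org.jsoup.select.Elements imported):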
final String html = "<html>"
+ "<head><title>Test Page</title>"
+ "<body>"
+ "<h1>This is a test page</h1>"
+ "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages."
+ "</body>"
+ "</html>";
// Parse the String into a Jsoup Document
Document doc = Jsoup.parse(html);
Elements body = doc.body().children();
// Do further things here ...
System.out.println(body);
Instead of the String html you can also load a file or a website; Jsoup provides all of this.
In this example, body contains the elements you posted as the expected result.
Or do you need to select something like "an h1 followed by a p tag"?
In that case, take a look at the Jsoup Selector API.

You can do it in two steps. First, as you have described, create a DOM tree using Jsoup. Then process it with an XSL filter. In the XSL filter you can extract only those tags you are interested in.
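A minimal sketch of the second step, assuming the first step has already produced well-formed, namespace-free XHTML (for example by having Jsoup serialize the document as XML). The class name TagFilter, the stylesheet, and the choice of h1, p, b and em as the tags to keep are illustrative assumptions; the JDK's built-in XSLT processor is used here only as one possible processor.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TagFilter {

    // Hypothetical stylesheet: starts from the children of <body>, copies
    // h1, p, b and em elements, and keeps only the text of any other element.
    private static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'><xsl:apply-templates select='//body/*'/></xsl:template>"
      + "<xsl:template match='h1|p|b|em'>"
      + "<xsl:copy><xsl:apply-templates/></xsl:copy>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XHTML string.
    public static String filter(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Test Page</title></head><body>"
            + "<h1>This is a test page</h1>"
            + "<p>The goal is to extract <b>textual content <em>with html tags</em></b>"
            + " from html pages.</p>"
            + "</body></html>";
        System.out.println(filter(xhtml));
    }
}
```

Elements that are not listed in the match pattern fall through to XSLT's built-in rules, so their tags are dropped while their text is kept. Adding or removing tag names in the pattern changes which tags survive.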

Related

Parsing HTML in Jenkins

I'm using poll-mailbox-trigger-plugin to trigger Jenkins jobs based on incoming emails.
One of the build parameters (pmt_content) contains the body of the email specified in HTML.
Is there a Jenkins plugin that can parse the HTML and retrieve the values of user-specified tags?
Email content example:
<!DOCTYPE html>
<html>
<head>
<meta content="text/html; charset=UTF-8">
<title></title>
</head>
<body style='margin:20px'>
<p>The following user has registered a device, click on the link below to
review the user and make any changes if necessary.</p>
<ul style='list-style-type:none; margin:25px 15px;'>
<li><b>User name:</b> Test User</li>
<li><b>User email:</b> test@abc.com</li>
<li><b>Identifier:</b> abc123def132afd1213afas</li>
<li><b>Description:</b> Tom's iPad</li>
<li><b>Model:</b> iPad 3</li>
<li><b>Platform:</b></li>
<li><b>App:</b> Test app name</li>
<li><b>UserID:</b></li>
</ul>
<p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>
<hr style='height=2px; color:#aaa'>
<p>We hope you enjoy the app store experience!</p>
<p style='font-size:18px; color:#999'>Powered by App47</p><img alt='' src=
'https://cirrus.app47.com/notifications/562506219ac25b1033000904/img'>
</body>
</html>
Specifically, how could I retrieve the value of the "Identifier:" tag?
I'm sure I could write a script to do it, but I'd rather keep the logic in Jenkins.
On whether there is a Jenkins plugin that can parse the HTML and retrieve the values of user-specified tags: it's a one-liner in the shell, or a few lines in the scripting language of your choice, but it seems that's not what you are looking for.
In general, no, there isn't a plugin for the purpose of parsing HTML and retrieving the value of a tag; see https://wiki.jenkins-ci.org/display/JENKINS/Plugins
As for retrieving the value of the "Identifier:" item: there is a generic plugin called Conditional BuildStep, which supports regular expressions on parameters.
When the HTML email content is in pmt_content, you could use the following regular expression
<li><b>Identifier:<\/b>(.*)<\/li>
to extract the value abc123def132afd1213afas (or to match and execute another command if it is found).
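To check the expression outside Jenkins, a small stand-alone sketch like the following can help. It assumes the email body is available as a string; the class name IdentifierExtractor and the method extract are hypothetical, and the pattern is a slightly stricter variant of the one above (lazy match, surrounding whitespace excluded from the captured group).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierExtractor {

    // Variant of the expression from the answer: the group is lazy and the
    // whitespace around the value is matched outside the capturing group.
    private static final Pattern IDENTIFIER =
        Pattern.compile("<li><b>Identifier:</b>\\s*(.*?)\\s*</li>");

    // Returns the identifier value, or an empty Optional if the item is absent.
    public static Optional<String> extract(String html) {
        Matcher matcher = IDENTIFIER.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the content of pmt_content.
        String pmtContent = "<ul>\n"
            + "<li><b>User name:</b> Test User</li>\n"
            + "<li><b>Identifier:</b> abc123def132afd1213afas</li>\n"
            + "</ul>";
        System.out.println(extract(pmtContent).orElse("(not found)"));
    }
}
```

Returning an Optional keeps the "item missing" case explicit, which matters here because the email template can change without notice.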

Pass a link as a parameter to g:render tag in grails

I have a Grails render tag that renders a small chunk of HTML. Sometimes the HTML needs to display some text, and sometimes some text and a link.
This chunk of code is working fine:
<g:render template='accountInfo' model="[
'accountStatus':subscription.status,
'subscription':subscription.plan.name,
'msg':g.message (code: 'stripe.plan.info', args:'${[periodStatus, endDate]}')]"/>
But I want to be able to pass in some HTML for the 'msg' variable in the model, like this, to output some text and a link:
<g:render template='accountInfo' model="[
'accountStatus':profile.profileState,
'subscription':'None selected yet',
'msg':Your profile is currently inactive, select a plan now to publish your profile.]"/>
That code does not work. I've tried adding the link to a taglib, but I can't figure out how to pass the taglib to the 'msg' either:
<g:render template='accountInfo' model="[
'accountStatus':profile.profileState,
'subscription':'None selected yet',
'msg':<abc:showThatLink/>]"/>
I am open to suggestions on how best to pass in either text only, or text together with a Grails createLink closure and some text for the link.
You have multiple options:
1. Use quotes:
<g:render template='accountInfo' model='${ [msg: "<a href='www.someurl.com'>.. </a>"]}' />
2. Use another variable:
<g:set var="myMsg">
<a href="www.someurl.com">...</a>
</g:set>
<g:render template='accountInfo' model="${[msg: myMsg]}"/>
3. Use the body content of <g:render />:
<g:render template="accountInfo" model="[ .. ]">
..
</g:render>
You can use the body() method inside the template (accountInfo) to render the body content. Some time ago I wrote a small blog post about this which gives some more details.

Encoding issue asp.net mvc 4

I'm trying to render a dynamic title from the database in an ASP.NET MVC view. I have something like this in my view:
@section meta {
<meta name="title" content="@Model.title" />
}
When the model contains special characters, such as Misión in Spanish, the title shows up as something like
Misi&#243;n ... I'm using meta charset utf8 in my layout. Is there a special encoding I'm missing?
How can I render Misión in the page title?
Using @someproperty assumes you're rendering HTML and makes sure the value gets encoded, to prevent things like cross-site scripting. In this instance you want it to render the raw value, in which case you need to use Html.Raw(...) to render your content in its raw form.
@section meta {
<meta name="title" content="@Html.Raw(Model.title)" />
}
However, be aware that if Model.title can come from user-generated content (or some other untrusted source), you could be opening yourself up to security issues: for example, if the value of Model.title were "test" /> <script ...etc...", a malicious user could use it to inject code into your pages.
Edit: Just including the content of my comment below for future googlers, since it appears that was the actual solution...
If you put @Html.Raw(Model.title) directly in the page somewhere (i.e. not in the meta tag) and it works correctly there, you may be facing the same problem discussed here, in which case you could work around it by using the slightly uglier:
@section meta {
<meta name="title" @Html.Raw("content=\"" + Model.title + "\"") />
}
Approach - 1
string value1 = "&lt;html&gt;"; // &lt;html&gt;
string value2 = HttpUtility.HtmlDecode(value1); // <html> // while getting
string value3 = HttpUtility.HtmlEncode(value2); // &lt;html&gt; // while saving
Approach - 2
Html.Raw("PKKG StackOverFlow"); // PKKG StackOverFlow

How to use html in Foundation's tooltips?

Is it possible to use html in Foundation's tooltips?
Yes. It supports html in the title attribute.
from foundation.tooltip.js:
create : function ($target) {
var $tip = $(this.settings.tip_template(this.selector($target), $('<div></div>').html($target.attr('title')).html())),
...
Breaking that down: it creates a new div element and inserts the contents of the title attribute into it using the html() method, which converts any markup in the string into HTML elements.
The following code:
<img src="example.png"
class="general-infotip has-tip tip-top"
data-tooltip
title="<b>This is bold</b> This is not" />
Will result in a tool tip that looks like
This is bold This is not
In Foundation v6.3+, you can append the attribute data-allow-html="true" to the element to allow html in the tooltip.
For example:
<span data-tooltip data-allow-html="true" aria-haspopup="true"
class="has-tip" data-disable-hover="false" tabindex="1"
title="Fancy word for a <strong>beetle</strong>. <br><br><img src=https://pbs.twimg.com/profile_images/730481747679432704/uc08_dqy.jpg />">
Scarabaeus
</span>
Here it is working in jsfiddle.
For more information, check out the pull request.

How to preserve tags inside pre or code while sanitizing?

I need some way to preserve tags inside a code or a pre block, while sanitizing.
For example:
link
<code>
<a href="http://donotsanitize.com">link</a>
<p>The link above and this p should not be sanitized, just converted to html special chars.</p>
</code>
Should output something like:
link
<code>
&lt;a href="http://donotsanitize.com"&gt;link&lt;/a&gt;
&lt;p&gt;The link above and this p should not be sanitized, just converted to html special chars.&lt;/p&gt;
</code>
With common/regular sanitization methods the output is:
link
<code>
link
The link above and this p should not be sanitized, just converted to html special chars.
</code>
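One possible approach, sketched under assumptions rather than taken from any particular sanitizer: escape the contents of code and pre blocks first, then pass the result to the regular sanitizer with code and pre on its list of allowed tags. Because the escaped content no longer contains any tags, the sanitizer has nothing to strip inside those blocks. The class name CodeBlockEscaper and its methods are hypothetical, and the sketch assumes that the blocks are not nested and that the code and pre tags carry no attributes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CodeBlockEscaper {

    // Matches <code>...</code> or <pre>...</pre>, including line breaks.
    // Assumption: no nesting and no attributes on the opening tag.
    private static final Pattern BLOCK = Pattern.compile(
        "<(code|pre)>(.*?)</\\1>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Converts the HTML special characters &, < and > to entities.
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Escapes everything inside code/pre blocks and leaves the rest untouched,
    // so the remaining markup can still be sanitized as usual afterwards.
    public static String escapeCodeBlocks(String html) {
        Matcher matcher = BLOCK.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String tag = matcher.group(1);
            String escaped = "<" + tag + ">" + escapeHtml(matcher.group(2)) + "</" + tag + ">";
            matcher.appendReplacement(out, Matcher.quoteReplacement(escaped));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "<a href=\"http://example.com\">link</a>\n"
            + "<code>\n<a href=\"http://donotsanitize.com\">link</a>\n</code>";
        System.out.println(escapeCodeBlocks(input));
    }
}
```

A regular expression is used here only to keep the sketch short; for arbitrary input, walking a parsed DOM and escaping the inner markup of each code or pre element would be more robust.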

Resources