How to capture HTML tags using a Lua pattern

This is what the text I'm trying to extract from looks like: http://pastebin.com/VD0K3ZcN
lines:match([[title="(value here)">]])
How can I get the "value here"? It does not contain numbers or the ">" symbol, only letters, spaces, ', - and .
I have tried
lines:match([[title="(.+)">]])
but it simply captured everything up to the end of the line.

The problem with your pattern is this:
title=" -- This is fine, but you probably also want to check which tag the title attribute belongs to.
(.+) -- Problem: greedy match. I'll illustrate this below.
"> -- Matches a closing double quote followed by the end of the tag.
Now, if I have this HTML:
<html>
<head title="Foobar">
</head>
<body onload="somejs();">
</body>
</html>
Your pattern will capture (line breaks omitted):
Foobar"></head><body onload="somejs();
You can fix this by using (.-). This is the non-greedy version, and it will match the least amount possible, stopping once it finds the next "> instead of the last ">.
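The same greedy versus non-greedy behaviour can be reproduced outside Lua. The following is only a sketch in Python's re module, not Lua code: Python's `.*?` plays the role of Lua's `.-`, and `re.DOTALL` is passed so that `.` also matches newlines, as it does in Lua patterns. The variable names are made up for the illustration.

```python
import re

html = """<html>
<head title="Foobar">
</head>
<body onload="somejs();">
</body>
</html>"""

# Greedy: ".+" runs to the LAST '">' in the input.
greedy = re.search(r'title="(.+)">', html, re.DOTALL).group(1)

# Non-greedy: ".*?" stops at the FIRST '">' (Lua equivalent: "(.-)").
lazy = re.search(r'title="(.*?)">', html, re.DOTALL).group(1)

# Alternative: exclude the quote character from the capture entirely.
no_quotes = re.search(r'title="([^"]+)">', html).group(1)

print(repr(greedy))
print(lazy)
print(no_quotes)
```

Since the value never contains a double quote, the character-class form should carry over to Lua unchanged, e.g. lines:match('title="([^"]+)">').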

Related

Parsing HTML in Jenkins

I'm using poll-mailbox-trigger-plugin to trigger Jenkins jobs based on incoming emails.
One of the build parameters (pmt_content) contains the body of the email specified in HTML.
Is there a Jenkins plugin that can parse the HTML and retrieve the values of user-specified tags?
Email content example:
<!DOCTYPE html>
<html>
<head>
<meta content="text/html; charset=UTF-8">
<title></title>
</head>
<body style='margin:20px'>
<p>The following user has registered a device, click on the link below to
review the user and make any changes if necessary.</p>
<ul style='list-style-type:none; margin:25px 15px;'>
<li><b>User name:</b> Test User</li>
<li><b>User email:</b> test@abc.com</li>
<li><b>Identifier:</b> abc123def132afd1213afas</li>
<li><b>Description:</b> Tom's iPad</li>
<li><b>Model:</b> iPad 3</li>
<li><b>Platform:</b></li>
<li><b>App:</b> Test app name</li>
<li><b>UserID:</b></li>
</ul>
<p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>
<hr style='height=2px; color:#aaa'>
<p>We hope you enjoy the app store experience!</p>
<p style='font-size:18px; color:#999'>Powered by App47</p><img alt='' src=
'https://cirrus.app47.com/notifications/562506219ac25b1033000904/img'>
</body>
</html>
Specifically, how could I retrieve the value of the "Identifier:" tag?
I'm sure I could write a script to do it but I'd rather the logic in Jenkins.
Is there a Jenkins plugin that can parse the HTML and retrieve the values of user-specified tags?
It's a one-liner in the shell or a few lines in the scripting language of your choice, but it seems that's not what you are looking for.
In general, no, there isn't a plugin whose purpose is parsing HTML and retrieving the value of a tag; see https://wiki.jenkins-ci.org/display/JENKINS/Plugins
How could I retrieve the value of the "Identifier:" tag?
There is a generic plugin called Conditional BuildStep, which supports regular expressions on parameters.
When the HTML email content is in pmt_content, you could use the following regular expression:
<li><b>Identifier:<\/b>(.*)<\/li>
to extract the value abc123def132afd1213afas (or to match and execute another command if it is found).
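To see what that expression captures, here is a minimal sketch using Python's re module rather than the Jenkins plugin itself. The pmt_content value is a shortened copy of the email body from the question. One assumption worth flagging: the original (.*) group would also capture the space after the closing b tag, so this variant adds \s* on both sides to trim it.

```python
import re

# Shortened copy of the email body from the question.
pmt_content = """<ul style='list-style-type:none; margin:25px 15px;'>
<li><b>User name:</b> Test User</li>
<li><b>Identifier:</b> abc123def132afd1213afas</li>
<li><b>Description:</b> Tom's iPad</li>
</ul>"""

# "." does not match newlines by default, so the capture stays on one line.
match = re.search(r"<li><b>Identifier:</b>\s*(.*?)\s*</li>", pmt_content)
identifier = match.group(1) if match else None

print(identifier)
```

If the Identifier line is missing from the email, identifier ends up as None instead of raising an error.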

HTML parsing and extracting text

There are a number of resources to parse HTML pages and extract textual content. Jsoup is an example. In my case, I would like to extract the textual content tagged with the html tags under which each sentence occurs. For example, take this page
<html>
<head><title>Test Page</title>
<body>
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
</body>
</html>
I'm expecting the output to be like this:
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
In other words, I want to include specific html tags within the textual content of the page.
To get your result you can use this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
final String html = "<html>"
+ "<head><title>Test Page</title>"
+ "<body>"
+ "<h1>This is a test page</h1>"
+ "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages."
+ "</body>"
+ "</html>";
// Parse the String into a Jsoup Document
Document doc = Jsoup.parse(html);
Elements body = doc.body().children();
// Do further things here ...
System.out.println(body);
Instead of the String html, you can also load a file or a website; jsoup supports all of these.
In this example, body contains the HTML you posted as the expected result.
Or do you need to select something like "h1 followed by a p tag"?
If so, take a look at the jsoup Selector API.
You can do it in two steps. First, as you have described, create a DOM tree using jsoup. Then process it using an XSL filter. In the XSL filter you can extract only those tags you are interested in.
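For comparison, the idea of keeping the markup inside the body can also be sketched without jsoup. The following uses Python's standard html.parser, which is a different tool from the one in the answers; the class name BodyMarkup is made up for this illustration. It simply re-emits every tag and text node between the body tags. It is a sketch only: character references are decoded in the text, and self-closing tags are not specially handled.

```python
from html.parser import HTMLParser


class BodyMarkup(HTMLParser):
    """Collects everything between <body> and </body>, tags included."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # Re-emit the start tag exactly as it appeared in the source.
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)


html = (
    "<html>"
    "<head><title>Test Page</title>"
    "<body>"
    "<h1>This is a test page</h1>"
    "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages."
    "</body>"
    "</html>"
)

parser = BodyMarkup()
parser.feed(html)
parser.close()
body_markup = "".join(parser.parts)
print(body_markup)
```

To keep only specific tags, the start and end tag handlers could check the tag name against a set of allowed names before appending.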

How to preserve tags inside pre or code while sanitizing?

I need some way to preserve tags inside a code or a pre block, while sanitizing.
For example:
link
<code>
<a href="http://donotsanitize.com">link</a>
<p>The link above and this p should not be sanitized, just converted to html special chars.</p>
</code>
Should output something like:
link
<code>
<a href="http://donotsanitize.com">link</a>
<p>The link above and this p should not be sanitized, just converted to html special chars.</p>
</code>
With common/regular sanitization methods the output is:
link
<code>
link
The link above and this p should not be sanitized, just converted to html special chars.
</code>

Printing in IE8 Has #href contents inline

Can someone tell me how to stop IE8 from printing the value of the href for an A tag next to the link text? For example, this markup
<a href="/site/page.html">Some Link</a>
comes out as
Some Link(/site/page.html)
when printed. How can I stop this?
This doesn't happen for me in IE8 and I've never spotted it. I also can't find it in the Internet Options anywhere.
It is possible that you have some software on your computer that does this, for example AVG Anti-Virus adds content to web pages to tell you that it has checked the links being displayed for potentially harmful content - so your system-security software may be expanding all links to show you where they actually point, to prevent phishing attacks.
If you do have some anti-phishing software on your machine, you'll have to find the option within that.
Update - It is almost certainly some clever CSS.
I have created the following test page to demonstrate how you can add the URL to a link using CSS generated content. If this was used within a print stylesheet, this would explain how the URL is getting added to the link when you are printing the page. To stop this, you would have to save a copy of the web page, remove the style rule from the print-only style sheet and then open your copy and print it!
<html>
<head>
<title>Test</title>
<style type="text/css">
a:after {
content: " [" attr(href) "] ";
}
</style>
</head>
<body>
<h1>Test</h1>
<p>This is a test to see if this
Link Shows A URL</p>
</body>
</html>

Parsing atom/rss feed containing multiple <link> tags with Haml on RoR

So, firstly, here's an Atom feed snippet which I am trying to parse:
// http://somelink.com/atom
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title>Title Here</title>
<link href="http://somelink.com/link1&amp;ref=rss" rel="alternate" />
<link href="http://somelink.com/link2&amp;ref=rss" rel="tag:somelink.com/apply_url"/>
...
</entry>
I pull the Atom feed like so,
# In controller index method
@rss = SimpleRSS.parse open('http://somelink.com/atom')
Then I output the response in the view, which I am writing using Haml, as follows:
- @rss.entries.each do |item|
  .title-div
    = item.title
  .title-link
    = item.link # outputs the first link
I could run a second loop for the links but is there a way to get the second link without it? Like reading the "rel" attribute and outputting the correct link? How do I do this in Haml/Rails?
EDIT: The gem I am using: http://simple-rss.rubyforge.org/
I'm not familiar with that gem, but have you tried item.links to see if each item provides a collection of links?
I have never used SimpleRSS, but maybe you could give Nokogiri or Hpricot a try? You can then run an XPath query to select only the link with the right attribute. An example with Nokogiri:
require 'nokogiri'
require 'open-uri'
atom_doc = Nokogiri::XML(URI.open("http://www.example.com/atom.xml"))
atom_doc.xpath("/xmlns:feed/xmlns:entry/xmlns:link[@rel='tag:somelink.com/apply_url']")
Don't forget the namespaces if you are parsing an Atom feed.
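The namespace point is easy to trip over, so here is a small sketch of the same attribute-based selection using Python's standard xml.etree.ElementTree instead of Nokogiri. The feed_xml string is a trimmed copy of the snippet from the question with the feed element closed so that it parses, and the atom prefix is an arbitrary name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Map a prefix of our choosing to the Atom namespace URI.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed copy of the feed from the question, closed so it is well-formed.
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Title Here</title>
    <link href="http://somelink.com/link1&amp;ref=rss" rel="alternate" />
    <link href="http://somelink.com/link2&amp;ref=rss" rel="tag:somelink.com/apply_url"/>
  </entry>
</feed>"""

root = ET.fromstring(feed_xml)

# Select only the <link> elements whose rel attribute has the wanted value.
apply_links = [
    link.get("href")
    for link in root.findall(
        "atom:entry/atom:link[@rel='tag:somelink.com/apply_url']", ATOM_NS
    )
]

print(apply_links)
```

Without the namespace mapping, the unprefixed path entry/link would find nothing, because every element in the feed lives in the Atom namespace.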
