This is what the text I'm trying to extract from looks like: http://pastebin.com/VD0K3ZcN
lines:match([[title="(value here)">]])
How can I get the "value here"? It does not have numbers or the ">" symbol inside it, only letters, spaces, ' - and .
I have tried
lines:match([[title="(.+)">]])
but it simply captured the whole rest of the line.
The problem with your pattern is this:
title=" -- This is fine, but you probably want to find out what tag title is in.
(.+) -- Problem: Greedy match. I'll illustrate this later.
"> -- Will match a closing tag with a double quote.
Now, if I have this HTML:
<html>
<head title="Foobar">
</head>
<body onload="somejs();">
</body>
</html>
Your pattern will match:
Foobar"></head><body onload="somejs();
You can fix this by using (.-). This is the non-greedy version, and it will match the least amount possible, stopping once it finds the next "> instead of the last ">.
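The difference between the greedy and non-greedy match can be checked with a small script. The sketch below is in Ruby rather than Lua, purely for illustration; `.+?` plays the role of Lua's `(.-)`, and the HTML is written on one line because `.` in Ruby does not cross newlines by default. Since the value never contains a double quote, a negated character class is a third option (in Lua that would be `[[title="([^"]+)">]]`).

```ruby
# Sample input: the HTML from the answer above, collapsed onto one line.
html = '<html><head title="Foobar"></head><body onload="somejs();"></body></html>'

# Greedy: .+ runs as far as it can and stops at the LAST "> in the string.
greedy = html[/title="(.+)">/, 1]

# Non-greedy: .+? stops at the FIRST "> it finds (Lua: (.-)).
lazy = html[/title="(.+?)">/, 1]

# Negated class: the value cannot contain a double quote, so stop at the first one.
no_quotes = html[/title="([^"]+)">/, 1]

puts greedy
puts lazy
puts no_quotes
```

The negated class is often the safest of the three, because it cannot run past the closing quote even when no `">` follows.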
I'm using poll-mailbox-trigger-plugin to trigger Jenkins jobs based on incoming emails.
One of the build parameters (pmt_content) contains the body of the email specified in HTML.
Is there a Jenkins plugin that can parse the HTML and retrieve the values of user-specified tags?
Email content example:
<!DOCTYPE html>
<html>
<head>
<meta content="text/html; charset=UTF-8">
<title></title>
</head>
<body style='margin:20px'>
<p>The following user has registered a device, click on the link below to
review the user and make any changes if necessary.</p>
<ul style='list-style-type:none; margin:25px 15px;'>
<li><b>User name:</b> Test User</li>
<li><b>User email:</b> test@abc.com</li>
<li><b>Identifier:</b> abc123def132afd1213afas</li>
<li><b>Description:</b> Tom's iPad</li>
<li><b>Model:</b> iPad 3</li>
<li><b>Platform:</b></li>
<li><b>App:</b> Test app name</li>
<li><b>UserID:</b></li>
</ul>
<p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p>
<hr style='height=2px; color:#aaa'>
<p>We hope you enjoy the app store experience!</p>
<p style='font-size:18px; color:#999'>Powered by App47</p><img alt='' src=
'https://cirrus.app47.com/notifications/562506219ac25b1033000904/img'>
</body>
</html>
Specifically, how could I retrieve the value of the "Identifier:" tag?
I'm sure I could write a script to do it but I'd rather the logic in Jenkins.
Is there a Jenkins plugin that can parse the HTML and retrieve the values of user-specified tags?
It's a one-liner in the shell or a few lines in the scripting language of your choice. But it seems that's not what you are looking for.
In general, no, there isn't a plugin for the purpose of parsing HTML and retrieving the value of a tag, see https://wiki.jenkins-ci.org/display/JENKINS/Plugins
How could I retrieve the value of the "Identifier:" tag?
There is a generic plugin called Conditional BuildStep, which supports regular expressions on parameters.
When the HTML email content is in pmt_content, you could use the following regular expression to extract the value abc123def132afd1213afas (or to match and execute another command if it is found):
<li><b>Identifier:<\/b>(.*)<\/li>
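For the "few lines in a scripting language" route, a minimal Ruby sketch follows. It assumes the email body is already available as a string; `field_value` and `email_body` are hypothetical names, not part of Jenkins or any plugin. It uses the same idea as the regular expression above, with a non-greedy capture and surrounding whitespace trimmed.

```ruby
# Returns the text that follows "<b>label:</b>" inside a list item,
# or nil when no such item exists. field_value is a hypothetical helper.
def field_value(html, label)
  pattern = %r{<li><b>#{Regexp.escape(label)}:</b>\s*(.*?)\s*</li>}
  match = html.match(pattern)
  match && match[1]
end

# A shortened copy of the email from the question.
email_body = <<~HTML
  <ul style='list-style-type:none; margin:25px 15px;'>
  <li><b>User name:</b> Test User</li>
  <li><b>Identifier:</b> abc123def132afd1213afas</li>
  <li><b>Platform:</b></li>
  </ul>
HTML

identifier = field_value(email_body, 'Identifier')
puts identifier
```

An empty item such as "Platform:" yields an empty string, while a label that does not occur at all yields nil, so the two cases can be told apart.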
There are a number of resources to parse HTML pages and extract textual content. Jsoup is an example. In my case, I would like to extract the textual content tagged with the html tags under which each sentence occurs. For example, take this page
<html>
<head><title>Test Page</title>
<body>
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
</body>
</html>
I'm expecting the output to be like this:
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
In other words, I want to include specific html tags within the textual content of the page.
To get your result you can use this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

final String html = "<html>"
+ "<head><title>Test Page</title>"
+ "<body>"
+ "<h1>This is a test page</h1>"
+ "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages."
+ "</body>"
+ "</html>";
// Parse the String into a Jsoup Document
Document doc = Jsoup.parse(html);
Elements body = doc.body().children();
// Do further things here ...
System.out.println(body);
Instead of a String, you can also load the HTML from a file or a website; jsoup supports both.
In this example, body contains the HTML you posted as the expected result.
Or do you need to select something like "h1 followed by p tag"?
If so, take a look at the jsoup Selector API.
You do it in two steps. First, as you have described, create a DOM tree using jsoup. Then process it using an XSL filter. In the XSL filter you can extract only those tags you are interested in.
I need some way to preserve tags inside a code or a pre block, while sanitizing.
For example:
link
<code>
<a href="http://donotsanitize.com">link</a>
<p>The link above and this p should not be sanitized, just converted to html special chars.</p>
</code>
Should output something like:
link
<code>
<a href="http://donotsanitize.com">link</a>
<p>The link above and this p should not be sanitized, just converted to html special chars.</p>
</code>
With common/regular sanitization methods the output is:
link
<code>
link
The link above and this p should not be sanitized, just converted to html special chars.
</code>
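One possible approach is to split the input on code and pre blocks, escape the contents of those blocks, and sanitize only what lies outside them. The Ruby sketch below illustrates the idea under several assumptions: `strip_tags` is a hypothetical stand-in for whatever sanitizer is really in use, the blocks are written as bare `<code>` or `<pre>` tags without attributes, and they are not nested.

```ruby
# Converts HTML special characters to entities.
def escape_html(text)
  escapes = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }
  text.gsub(/[&<>"]/, escapes)
end

# Hypothetical stand-in for a real sanitizer: it simply removes every tag.
def strip_tags(text)
  text.gsub(/<[^>]*>/, '')
end

def sanitize_preserving_code(html)
  # split keeps the captured <code>/<pre> blocks as separate parts.
  parts = html.split(%r{(<(?:code|pre)>.*?</(?:code|pre)>)}m)
  parts.map do |part|
    block = part.match(%r{\A<(code|pre)>(.*)</\1>\z}m)
    if block
      # Inside a block: keep the wrapper tag, escape everything within it.
      "<#{block[1]}>#{escape_html(block[2])}</#{block[1]}>"
    else
      # Outside a block: sanitize as usual.
      strip_tags(part)
    end
  end.join
end

input = '<a href="http://x.com">link</a> <code><a href="http://donotsanitize.com">link</a></code>'
puts sanitize_preserving_code(input)
```

A regex split is fragile on malformed markup, so with a real sanitizer it may be preferable to pull the blocks out, replace them with placeholders, sanitize, and put the escaped blocks back afterwards.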
Can someone tell me how to stop IE8 printing the value of the href for an A tag next to the text? For example, this markup
<a href="/site/page.html">Some Link</a>
comes out as
Some Link(/site/page.html)
when printed. How can I stop this?
This doesn't happen for me in IE8 and I've never spotted it. I also can't find it in the Internet Options anywhere.
It is possible that you have some software on your computer that does this, for example AVG Anti-Virus adds content to web pages to tell you that it has checked the links being displayed for potentially harmful content - so your system-security software may be expanding all links to show you where they actually point, to prevent phishing attacks.
If you do have some anti-phishing software on your machine, you'll have to find the option within that.
Update - It is almost certainly some clever CSS.
I have created the following test page to demonstrate how you can add the URL to a link using CSS generated content. If this was used within a print stylesheet, this would explain how the URL is getting added to the link when you are printing the page. To stop this, you would have to save a copy of the web page, remove the style rule from the print-only style sheet and then open your copy and print it!
<html>
<head>
<title>Test</title>
<style type="text/css">
a:after {
content: " [" attr(href) "] ";
}
</style>
</head>
<body>
<h1>Test</h1>
<p>This is a test to see if this
Link Shows A URL</p>
</body>
</html>
So, firstly, here's an Atom feed snippet which I am trying to parse:
// http://somelink.com/atom
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title>Title Here</title>
<link href="http://somelink.com/link1&ref=rss" rel="alternate" />
<link href="http://somelink.com/link2&ref=rss" rel="tag:somelink.com/apply_url"/>
...
</entry>
I pull the Atom feed like so,
# In the controller's index method
@rss = SimpleRSS.parse open('http://somelink.com/atom')
Then I output the response in the view, which I am writing using Haml, as follows:
- @rss.entries.each do |item|
  .title-div
    = item.title
  .title-link
    = item.link # outputs the first link
I could run a second loop for the links but is there a way to get the second link without it? Like reading the "rel" attribute and outputting the correct link? How do I do this in Haml/Rails?
EDIT: The gem I am using: http://simple-rss.rubyforge.org/
I'm not familiar with that gem, but have you tried item.links to see if each item provides a collection of links?
I have never used SimpleRSS, but maybe you could give Nokogiri or Hpricot a try? You can then run an XPath query to select only the link with the right attribute. An example with Nokogiri:
require 'nokogiri'
require 'open-uri'
atom_doc = Nokogiri::XML(URI.open("http://www.example.com/atom.xml"))
atom_doc.xpath("/xmlns:feed/xmlns:entry/xmlns:link[@rel='tag:somelink.com/apply_url']")
Don't forget the namespaces if you are parsing an Atom feed.
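The same rel-based lookup can be tried without installing any gem. The sketch below swaps in REXML, which ships with standard Ruby installs, in place of Nokogiri; `link_for` and `atom_ns` are hypothetical helper names, and the sample feed is a simplified copy of the one in the question with the query strings dropped from the href values.

```ruby
require 'rexml/document'

# Prefix-to-URI mapping so the XPath queries can address the Atom namespace.
def atom_ns
  { 'atom' => 'http://www.w3.org/2005/Atom' }
end

# Returns the href of the entry's link with the given rel, or nil if absent.
def link_for(entry, rel)
  node = REXML::XPath.first(entry, "atom:link[@rel='#{rel}']", atom_ns)
  node && node.attributes['href']
end

feed_xml = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <entry>
      <title>Title Here</title>
      <link href="http://somelink.com/link1" rel="alternate" />
      <link href="http://somelink.com/link2" rel="tag:somelink.com/apply_url"/>
    </entry>
  </feed>
XML

doc = REXML::Document.new(feed_xml)
entries = REXML::XPath.match(doc, '/atom:feed/atom:entry', atom_ns)
apply_links = entries.map { |entry| link_for(entry, 'tag:somelink.com/apply_url') }
puts apply_links
```

In a Rails app, the resulting list could be built in the controller and handed to the Haml view, so the template needs only its one existing loop.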