Hpricot Element intersection - ruby-on-rails

I want to remove all images from an HTML page (actually TinyMCE user input) that do not meet certain criteria (class = "int" or class = "ext"), and I'm struggling to find the correct approach. This is what I have so far:
hbody = Hpricot(input)
@internal_images = hbody.search("//img[@class='int']")
@external_images = hbody.search("//img[@class='ext']")
But I don't know how to find images where the class has the wrong value (not "int" or "ext").
I also have to loop over the elements to check other attributes which are not standard html (I use them for setting internal values like the DB id, which I set in the attribute dbsrc). Can I access these attributes too and is there a way to remove certain elements (which are in the hpricot search result) when they don't meet my criteria?
Thanks for your help!

>> doc = Hpricot.parse('<html><img src="foo" class="int" /><img src="bar" bar="42" /><img src="foobar" class="int"></html>')
=> #<Hpricot::Doc {elem <html> {emptyelem <img class="int" src="foo">} {emptyelem <img src="bar" bar="42">} {emptyelem <img class="int" src="foobar">} </html>}>
>> doc.search("img")[1][:bar]
=> "42"
>> doc.search("img") - doc.search("img.int")
=> [{emptyelem <img src="bar" bar="42">}]
Once you have the results from search, you can use normal array operations on them. Non-standard attributes are accessible through [].

Check out the not CSS selector.
(hbody/"img:not(.int)")
(hbody/"img:not(.ext)")
Unfortunately, it doesn't seem that you can chain not expressions. You might want to fetch all img nodes and remove those whose class is neither int nor ext.
Additionally, you could use the difference operator to calculate which elements are not part of both collections.
Use the .remove method to remove nodes or elements: Hpricot Altering documentation.

Related

Nokogiri: Get text which is not inside the <a> tag

Take a look at this example:
<li><a href="...">This is a website</a>, it belongs to John Sullivan</li>
I can get the content of the <li> tag by using:
nodeset = doc.css('li')
I also can get the text inside the <a> tag by using:
nodeset.each do |element|
ahref = element.css('a') # the <a> node containing "This is a website"
name = ahref.text.strip # => "This is a website"
end
But how do I get the rest of the text within the <li> tag but without the text from the <a> tag?
From this example, I would like to get
", it belongs to John Sullivan"
How can I do this?
This is straightforward using XPath and the text() node test. If you have extracted the lis into nodeset, you can get the text with:
nodeset.xpath('./text()')
Or you can get it directly from the whole doc:
doc.xpath('//li/text()')
This uses the text() node test as part of the XPath expression, not the text Ruby method. It extracts any text nodes that are direct children of the li node, so it doesn't include the contents of the a element.
I found a cheap way to get the rest of the text:
ahref = element.css('a')
name = ahref.text.strip
suppl = element.text.strip.gsub(name, '')

Regular expression to determine each and every attribute of an anchor tag inside HTML content

I basically wanted the values of each and every attribute. The attributes may be optional and the href may contain HTTP or HTTPS.
A sample anchor tag inside content is:
<a class=\"direct_link\" rel=\"nofollow\" target=\"_blank\" href=\"http://google.com\">link text</a>
Sample HTML content is:
<p><br></p><h1>A beautiful <a class=\"f-link\" rel=\"nofollow\" target=\"_blank\" href=\"fake.com/abc.html\">jQuery</a>; a</h1><h3 class=\"text-light\">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's.</h3><p><br></p><p><br></p>
Don't use a regular expression to try to parse HTML. HTML can be expressed too many ways and still be valid, yet it will break your pattern and code.
The correct way to get the values for the parameters is to use a parser. Nokogiri is the de facto XML/HTML parser for Ruby:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(' <a class=\"direct_link\" rel=\"nofollow\" target=\"_blank\" href=\"http://google.com\">link text</a>')
That parses the document into a DOM and returns it.
link = doc.at('a')
at finds the first instance using the CSS 'a' selector. (If you want to iterate over them all you can use search, which returns a NodeSet, which is akin to an Array.)
At this point link is a Node, which we can consider to be like a pointer to the <a> tag.
link.to_h # => {"class"=>"\\\"direct_link\\\"", "rel"=>"\\\"nofollow\\\"", "target"=>"\\\"_blank\\\"", "href"=>"\\\"http://google.com\\\""}
Those are the link's parameters and their values turned into a hash. Alternatively, you can access the parameter names (keys) or their values directly:
link.values # => ["\\\"direct_link\\\"", "\\\"nofollow\\\"", "\\\"_blank\\\"", "\\\"http://google.com\\\""]
link.keys # => ["class", "rel", "target", "href"]
Or treat it like a hash and iterate over the key/value pairs:
link.each do |k, v|
puts 'parameter: "%s" value: "%s"' % [k, v]
end
# >> parameter: "class" value: "\"direct_link\""
# >> parameter: "rel" value: "\"nofollow\""
# >> parameter: "target" value: "\"_blank\""
# >> parameter: "href" value: "\"http://google.com\""
The advantage of using a parser is that the HTML format can change and the parser is still able to figure it out, so your code won't care. The following format works just as well as the tag used above:
doc = Nokogiri::HTML::DocumentFragment.parse(' <a
class=\"direct_link\"
rel=\"nofollow\" target=\"_blank\"
href=\"http://google.com\">
link text
</a>')
Try doing that with a pattern.
Well, if you want just the stuff in the quotes, it would be this:
"([\w:\/.]+)\\"
Otherwise if you want the name before the quotes it would be this:
(\w+=\\"[\w:\/.]+\\")
This one matches tags without backslashes:
(\w+="[\w:\/.-]+")

umbraco - how to get all of nodes by Document Type

How can I get all nodes by specific Document Type?
For example, I want to get in code behind all of nodes with Document Type: s3Article. How can I do this?
New information:
IEnumerable<Node> nodes = uQuery.GetNodesByType("s3Article").Where(x => x.NiceUrl.Contains("en"));
lvArticles.DataSource = nodes;
lvArticles.DataBind();
This is my code. I had to use Where(x => x.NiceUrl.Contains("en")) because I have two language versions. Without the Where I receive nodes from all catalogues with doctype s3Article, but I only want the nodes from one language version.
Problem is here:
<a href='<%# umbraco.library.NiceUrl(Tools.NumericTools.tryParseInt( Eval("id"))) %>'><%# Eval("title")%></a>
<%# Tools.TextTools.makeIMGHTML("../.."+ Eval("img").ToString(),"180") %>
<%# umbraco.library.StripHtml(Limit(Eval("Article"), 1000))%>
<%# Eval("author")%>
System.Web.HttpException: DataBinding:
'umbraco.presentation.nodeFactory.Node' does not contain a property named 'title'.
The same problem happens with title, img, article, and author. Only the ID works. How can I resolve this?
You can use the uQuery GetNodesByType(string or int) method:
IEnumerable<Node> nodes = uQuery.GetNodesByType("s3Article");
Alternatively, you can use an extension method to get all descendant nodes and then query them by type as in the following answer:
Umbraco 4.6+ - How to get all nodes by doctype in C#?
You could use this to databind to a control within a usercontrol like so:
lvArticles.DataSource = nodes.Select(n => new {
    ID = n.Id,
    Title = n.GetProperty("title").Value,
    Author = n.GetProperty("author").Value,
    Article = n.GetProperty("article").Value,
    Image = n.GetProperty("img").Value,
});
lvArticles.DataBind();
You would also need to strip the HTML, convert the image ID to a URL, etc. within the select statement.
As Shannon Deminick mentions, uQuery is somewhat obsolete, and ExamineManager has the fastest execution time. https://our.umbraco.org/forum/developers/api-questions/45777-uQuery-vs-Examine-vs-IPublishedContent-for-Querying
I also found ExamineManager's search builder to be the easiest approach. It is very flexible, and the fluent builder pattern the Umbraco team used makes it very readable.
This will search ALL nodes, so if you need within a specific branch, you can use .ParentId(1234) etc.
var query = ExamineManager.Instance.CreateSearchCriteria()
.NodeTypeAlias("yourDocumentType")
.Compile();
IEnumerable<IPublishedContent> myNodes = Umbraco.TypedSearch(query);
I prefer typed nodes, but you can also just use "Search()" instead of "TypedSearch()" if you prefer dynamic nodes.
Another example including a specific property value "myPropValue" == "ABC",
var query = ExamineManager.Instance.CreateSearchCriteria()
.NodeTypeAlias("yourDocumentType")
.Or() //Other predicate .And, .Not etc.
.Field("myPropValue", "ABC")
.Compile();
Ref - https://our.umbraco.org/documentation/reference/querying/umbracohelper/

Find child of child which attribute code is equal to the parameter passed on the url - XSL

On this dynamic website,
The url looks something like this : departments/CHEM.html
CHEM is a parameter.
<xsl:param name="dep" select="'CHEM'" />
a piece of the xml is below
<course acad_year="2012" cat_num="5085" offered="Y">
<term term_pattern_code="1" fall_term="Y" spring_term="N">fall term</term>
<department code="CHEM">
<dept_long_name>Department of Chemistry and Chemical Biology</dept_long_name>
<dept_short_name>Chemistry and Chemical Biology</dept_short_name>
</department>
</course> ....
I am trying to get the dept_short_name to use in my H1 tag, but I have not been successful. So far I have tried:
<h2><xsl:value-of select="course/department/[code={@$dep}]"/></h2>
Any suggestions??? Thanks!
Just use:
<xsl:value-of select="course/department[@code eq $dep]/dept_short_name"/>
Remember:
In XPath 2.0 (XSLT 2.0), use the eq operator for value comparisons. It is more efficient than the general comparison operator =, which is only needed when at least one of its operands is a sequence.
I would try this:
<xsl:value-of select="course/department[@code=$dep]/dept_short_name/text()"/>
That says: find the department element (inside a course element) whose code attribute is the value of parameter "dep", then find the dept_short_name child element, then get the text inside that element.
You have to use the @ to say that "code" is an attribute, but "dep" should not have it. I think the {} notation is for use inside attributes of the non-XSLT elements of your stylesheet, so I wouldn't use it inside a value-of expression.

How to parse a remote website and create a link on every single word for a dictionary tooltip?

I want to parse a random website, modify the content so that every word is a link (for a dictionary tooltip) and then display the website in an iframe.
I'm not looking for a complete solution, but for a hint or a possible strategy. The linking is my problem, parsing the website and displaying it in an iframe is quite simple. So basically I have a String with all the html content. I'm not even sure if it's better to do it serverside or after the page is loaded with JS.
I'm working with Ruby on Rails, jQuery, jRails.
Note: The content of the href tag depends on the word.
Clarification:
I tried a regexp and it already kind of works:
@site.gsub!(/[A-Za-z]+(?:['-][A-Za-z]+)?|\d+(?:[,.]\d+)?/) {|word| '<a href="...">' + word + '</a>'}
But the problem is to only replace words in the text and leave the HTML as it is. So I guess it is a regex problem...
Thanks for any ideas.
I don't think a regexp is going to work for this - or, at least, it will always be brittle. A better way is to parse the page using Hpricot or Nokogiri, then go through it and modify the nodes that are plain text.
It sounds like you have it mostly planned out already.
Split the content into words and then for each word, create a link, such as <a href="...">whatever</a>
EDIT (based on your comment):
Ahh ... I recommend you search around for screen scraping techniques. Most of them should start with removing anything between < and > characters, and replacing <br> and <p> with newlines.
I would use Nokogiri to remove the HTML structure before you use the regex.
no_html = Nokogiri::HTML(html_as_string).text
Simple. Hash the HTML, run your regex, then unhash the HTML.
<?php
class ht
{
    static $hashes = array();
    # hashes everything that matches $pattern and saves matches for later unhashing
    static function hash($text, $pattern) {
        return preg_replace_callback($pattern, array(__CLASS__, 'push'), $text);
    }
    # hashes all html tags and saves them
    static function hash_html($html) {
        return self::hash($html, '`<[^>]+>`');
    }
    # hashes and saves $value, returns key
    static function push($value) {
        if(is_array($value)) $value = $value[0];
        static $i = 0;
        $key = "\x05".++$i."\x06";
        self::$hashes[$key] = $value;
        return $key;
    }
    # unhashes all saved values found in $text
    static function unhash($text) {
        return str_replace(array_keys(self::$hashes), self::$hashes, $text);
    }
    static function get($key) {
        return self::$hashes[$key];
    }
    static function clear() {
        self::$hashes = array();
    }
}
?>
Example usage:
$hashed = ht::hash_html($your_html);
// your word->href converter here, run on $hashed and producing $your_formatted_html
$result = ht::unhash($your_formatted_html);
Oh... right, I wrote this in PHP. Guess you'll have to convert it to Ruby or JS, but the idea is the same.
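For reference, a Ruby version of the same placeholder idea might look roughly like this. It is a sketch, not a line-by-line port: the class name TagStash, the method names, and the /dictionary/ URL scheme are all made up for illustration, and the tag pattern is the same simple one as in the PHP code, so it shares its limits.

```ruby
require 'cgi'

# Swaps matched spans for numbered placeholders and restores them later.
class TagStash
  def initialize
    @saved = []
  end

  # Replaces every match of pattern with a placeholder and remembers the match.
  def protect(text, pattern)
    text.gsub(pattern) do |match|
      @saved << match
      "\x05#{@saved.size - 1}\x06"
    end
  end

  def protect_html(html)
    protect(html, /<[^>]+>/)
  end

  # Puts every saved span back in place of its placeholder.
  def restore(text)
    text.gsub(/\x05(\d+)\x06/) { @saved.fetch(Regexp.last_match(1).to_i) }
  end
end

def link_words_with_stash(html)
  stash = TagStash.new
  protected_html = stash.protect_html(html)
  # Letters only: the placeholders contain digits, so a pattern that also
  # matches numbers would have to skip the placeholders explicitly.
  linked = protected_html.gsub(/[A-Za-z]+/) do |word|
    %(<a href="/dictionary/#{CGI.escape(word.downcase)}">#{word}</a>)
  end
  stash.restore(linked)
end

puts link_words_with_stash('<p class="intro">Hello <b>big</b> world</p>')
```

Keeping the saved tags in an instance rather than in class-level state means two pages can be processed at once without their placeholders colliding.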

Resources