Using regex to get title - ruby-on-rails

I'm not sure how I'd select a title with regex. I've tried
match(/<title>(.*) .*<\/title>/)[1]
but that doesn't match anything.
This is the response body I'm trying to select from.
Trying to select "title I need to select."

The reason it doesn't work is the itemprop="name" attribute: your pattern expects a bare <title> tag, so it never matches <title itemprop="name">. To fix this, allow for attributes in the opening tag:
# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'
html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."
.*? is a lazy match: it basically means "match as many characters as are needed, but no more".
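To make the difference concrete, here is a small sketch comparing a greedy and a lazy quantifier. The sample string is made up for illustration (a trailing meta tag is added so that the two quantifiers behave differently); it is not the actual response body.

```ruby
# Hypothetical sample input, modeled on the <title> line from the question.
html = '<title itemprop="name">title I need to select.</title><meta name="x"></meta>'

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
greedy = html[/<.*>/]

# Lazy: .*? grabs as little as possible, so the match stops at the first ">".
lazy = html[/<.*?>/]

# The pattern from the answer: skip any attributes lazily, then capture the text.
title = html[/<title.*?>(.*)<\/title>/, 1]

puts greedy
puts lazy
puts title
```

Here the greedy match covers the whole string, while the lazy match covers only the opening title tag.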
However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose, Nokogiri:
require 'nokogiri'
page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."
Note that it can handle even malformed HTML, as is the case here.

If you're looking for a much more robust XML/HTML parser, try using Nokogiri, which supports XPath.
This post explains why: "Use xPath or Regex?"
require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text

Here's the regexp that will give you what you want:
<title.*>(.*)<\/title>
As was mentioned, there are better ways to parse HTML. You might want to check out something like Nokogiri.

When I have to get elements from XML I like to convert it to a hash
from_xml(xml, disallowed_types = nil) public
Returns a Hash containing a collection of pairs when the key is the
node name and the value is its content
# http://apidock.com/rails/Hash/from_xml/class
Now you can do something like:
hash = Hash.from_xml('XML') # Hash.from_xml is provided by ActiveSupport
hash['title'] # my favorite book

One solution would be to use the following pattern:
<title.*?>(.*?)<\/title>
https://regex101.com/r/piwm5H/1

Use an HTML/XML parser when dealing with XML or HTML data, except for extremely simple cases. HTML and XML are too complicated for normal regular expressions.
Using Nokogiri I'd do:
require 'nokogiri'
some_html = '
<html>
<head>
<title>the title</title>
</head>
</html>
'
doc = Nokogiri::HTML(some_html)
doc.title # => "the title"
Nokogiri already has a method to return the title so you can take advantage of that. Or, you can do it the normal way:
doc.at('title').text # => "the title"
The problem with a regular expression is that HTML could be written in many ways:
<title>foo</title>
or:
<title>
foo
</title>
or even:
<title>foo
</head>
which, while not correct, will be accepted by browsers and fixed up by Nokogiri, so the code above will still work. Writing a pattern to handle those variants is a pain and error-prone. It only gets worse as the HTML gets more complex, especially when you don't control the generation of the content.
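As a rough illustration of that fragility, the sketch below runs one typical single-line pattern against the three variants shown above. The pattern and the variable names are only examples, not a recommended way to extract titles.

```ruby
# The three ways of writing the title shown above.
variants = {
  one_line:   "<title>foo</title>",
  multi_line: "<title>\nfoo\n</title>",
  unclosed:   "<title>foo\n</head>"
}

# A typical single-line pattern (example only).
pattern = /<title[^>]*>(.*?)<\/title>/

# Capture group 1 for each variant, or nil when the pattern does not match.
results = variants.transform_values { |html| html[pattern, 1] }

results.each { |name, title| puts "#{name}: #{title.inspect}" }
```

Only the first variant yields "foo"; the other two return nil, because . does not cross line breaks by default and the last variant has no closing tag at all.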

Related

Rails 5 - How to strip tags from string in rails (NOT in/for html)

I need to strip tags from user input before saving it to the DB.
I'm well aware of the strip_tags method, but it also HTML-escapes the string, as do all the other recommended methods:
Rails::Html::FullSanitizer.new.sanitize '&'
=> "&amp;"
Rails::Html::WhiteListSanitizer.new.sanitize('&', tags: [])
=> "&amp;"
ActionController::Base.helpers.strip_tags "&"
=> "&amp;"
The string I want to sanitize must NOT be escaped: it gets exported via the API, used in files, etc. It is not only output as HTML (and even there, in cases like link_to ActionController::Base.helpers.strip_tags("&"), link_to double-escapes the string, so you get a link to &amp; in the frontend).
As a monkey patch I've wrapped strip_tags in CGI.unescapeHTML to get more or less the expected result, but I want to find a more direct solution (I'm also worried about what else strip_tags might do; there are too many moving parts for such a small piece of functionality, which means more that can go wrong or break).
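For reference, the unescaping half of that workaround can be sketched with the standard library alone. The input strings below are hypothetical stand-ins for what an escaping sanitizer might return; strip_tags itself is not called here.

```ruby
require 'cgi'

# Hypothetical sanitizer output: tags already removed, "&" escaped as an entity.
escaped = "JPMorgan Chase &amp; Co"
unescaped = CGI.unescapeHTML(escaped)
puts unescaped

# Caveat: unescaping also turns escaped markup back into literal tags.
escaped_markup = "&lt;b&gt;bold&lt;/b&gt;"
restored = CGI.unescapeHTML(escaped_markup)
puts restored
```

The second call shows why this is only "more or less" the expected result: text that was deliberately entered as escaped markup comes back as real tags.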
Real world example:
JPMorgan Chase & Co should remain JPMorgan Chase & Co after removing tags (not JPMorgan Chase &amp; Co)
test<script>alert('hacked!');</script>&test should become test&test after stripping tags
And also the string:
"test &lt;script&gt;alert('hacked!')&lt;/script&gt;"
should still be
"test &lt;script&gt;alert('hacked!')&lt;/script&gt;"
after stripping the HTML.
With the alternative solutions that I've found or that were proposed:
> Nokogiri::HTML("test &lt;script&gt;alert('hacked!')&lt;/script&gt;").text
=> "test <script>alert('hacked!')</script>"
> Loofah.fragment("test &lt;script&gt;alert('hacked!')&lt;/script&gt;").text(encode_special_chars: false)
=> "test <script>alert('hacked!')</script>"
So they're also a no-go.
You have to parse the HTML and extract the text elements. Use Nokogiri to do that.
Nokogiri::HTML("<div>Strip <i>this</i> & <b>this</b> & <u>this</u>!</div>").text
Nokogiri is already used by Rails so there's no cost to using it.
You will get all the text, including the content of <script> tags.
Nokogiri::HTML(%q[test<script>alert('hacked!');</script>&test]).text
# testalert('hacked!');&test
You can strip the <script> tags.
doc = Nokogiri::HTML(%q[test<script>alert('hacked!');</script>&test])
doc.search('//script').each { |node| node.replace('') }
doc.text
# test&test
But with the tags stripped out, the leftover script text is harmless, so it might not be worth the effort.
See the Nokogiri tutorials for more.

How do I fix this Nokogiri document result to make it legible?

I'm trying to scrape kickass.to and I'm having difficulty returning a legible document.
Here's my code:
require 'nokogiri'
require 'open-uri'
url = "http://kickass.to/usearch/Mobile%20Suit%20Gundam:%20Char%27s%20Counterattack%201988category:movies/"
doc = Nokogiri::HTML(open(url))
result:
#<Nokogiri::HTML::Document:0x3ffb45c23ab4 name="document" children=[#<Nokogiri::XML::DTD:0x3ffb45c23744 name="html">, #<Nokogiri::XML::Element:0x3ffb45c26fc0 name="html" children=[#<Nokogiri::XML::Element:0x3ffb45c26db8 name="body" children=[#<Nokogiri::XML::Element:0x3ffb45c26bb0 name="p" children=[#<Nokogiri::XML::Text:0x3ffb45c269a8 "\u008B å}ùvÛF²÷ßñSt8Ç\u009142H,Y\u0092©Åñ\u008Cíx,%\u0099\\_],\r\tÐX$Ñ\u0093y¢ï¾ÿî\u0093Ý_u ...
You get the picture. It's illegible and I can't seem to figure out where particular elements are. Any ideas where to go from here?
Works fine for me on MRI Ruby 2.1.1. You can either try to re-install/update Nokogiri and/or do the same with Ruby.
I think you misunderstand how Nokogiri works. Nokogiri does not return the raw HTML of the requested page; it wraps each DOM element in a Nokogiri object and returns an enumerable Nokogiri object that contains all of these elements.
It is difficult to help you, as it's unclear whether you want to extract all of the HTML or specific parts of the page. Nokogiri works by using CSS-style selectors to 'query' the Nokogiri object and extract the elements you want.
Referring to the Nokogiri docs will help, but using their example...
doc.css('h3.r a').each do |link|
  puts link.content
end
This assumes you have a variable containing results of a Nokogiri scrape (in your case you've also used 'doc').
This then performs a search for all nodes that are links (a tags) that are contained within an h3 tag with the class of 'r'.
In this case they are looping through the elements that match these criteria (the .css method also returns an enumerable, as there could be multiple matching elements) and printing them to the console.

Ruby -- trying to grab <title>this here</title> even if on multiple lines

Currently, I am grabbing titles using the following method:
title = html_response[/<title[^>]*>(.*?)<\/title>/,1]
This does a great job at catching "This is a title" from <title>This is a title</title>. However, there are some web pages that open the title tag on one line, print the title on the next line, and then close the title tag.
The Ruby line I presented above doesn't catch titles such as those, so I'm just trying to find a fix for that.
This famous stackoverflow post explains why it's a bad idea to use regular expressions to parse HTML. A better approach is to use a gem like Nokogiri to parse out the title tags.
Obligatory "don't use regex with HTML" sentence.
title = html_response[/<title[^>]*>(.*?)<\/title>/m,1]
The m flag enables multiline mode, in which . also matches newline characters.
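A quick sketch of the difference, using a made-up html_response in which the title text sits on its own line:

```ruby
# Hypothetical response body with the title spread over several lines.
html_response = "<html><head><title>\n  This is a title\n</title></head></html>"

# Without /m, "." stops at newlines, so the pattern finds nothing.
without_m = html_response[/<title[^>]*>(.*?)<\/title>/, 1]

# With /m, "." also matches newlines, so the capture spans the lines.
with_m = html_response[/<title[^>]*>(.*?)<\/title>/m, 1]

puts without_m.inspect
puts with_m.strip
```

The captured text keeps the surrounding line breaks and indentation, so calling strip on the result is usually worthwhile.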

newline characters screwing up <pre> tags (Ruby on Rails)

I'm developing a blog and some really annoying stuff is happening with newline characters (\n). Everything works fine, except that if I make a post that contains pre tags, my newline characters screw up the indentation.
So if I have code that looks like this
<pre>
<code>
some code some code
more code more code
</code>
</pre>
For some reason the newline characters that are saved in the db field with the post are causing whatever is inside the pre tag to be indented by a tab or two.
I have no idea why it's doing it, but if I do something like
string.gsub!(/\n/, "<br />")
The indentation is removed, so I know it has to do with the \n. But then my problem is that there are way too many line breaks and the format is then way off.
So then I tried to capture everything inside the pre tags with a method that looks like this
def remove_newlines(string)
  regexp = /<pre>\s?(.*?)\s?<\/pre>/
  code = regexp.match(string)
  code[1].gsub!(/\n/, "<br />")
end
But I can't get that to work properly.
Does anyone know how I can get rid of this weird indentation problem, or have any pointers on this?
Thanks!
It sounds like your template engine is auto-indenting the contents of the <pre> tags. Browsers render the whitespace inside <pre> tags as it is (and so they should, according to the specs). This means that the whitespace the template engine adds at the beginning of each line inside the <pre>, in order to make the HTML source more readable, is rendered in the actual page as well, unlike whitespace in most other places in the HTML source.
The solution therefore depends on your templating language.
If you are using HAML:
HAML FAQ: How do I stop Haml from indenting the contents of my pre and textarea tags?
Hope this helps.

rails, given a HTML string from a WYSIWYG - how to get just text

I have a large HTML string from a WYSIWYG editor and want to show a truncated string of just text, with no HTML or HTML tags. Is there any way to do this built into Rails, or do I need a gsub to get rid of all the HTML brackets?
Thanks
Rails already includes some powerful sanitization helpers.
string = '<span id="span_is"><br><br><u><i>Hi</i></u></span>'
strip_tags(string)
It depends upon how complex your HTML is, but you could certainly use Nokogiri and XPath to query the text that you want from the HTML. It depends upon how much you want to parse, and whether it justifies an extra library to do it.
A parser can do it but would be overkill if you have simple HTML to present. Something like Loofah or sanitize could strip all the tags using Nokogiri to parse the HTML then strip out the tags, leaving you with the text.
require 'sanitize'
html = '<html><body>Jackdaws love my giant sphinx of quartz.</body></html>'
puts Sanitize.clean(html)
# >> Jackdaws love my giant sphinx of quartz.
I think loofah is more capable than sanitize, but if all you want to do is toss tags away sanitize might be the way to go.
