Ruby -- trying to grab <title>this here</title> even if on multiple lines - ruby-on-rails

Currently, I am grabbing titles using the following method:
title = html_response[/<title[^>]*>(.*?)<\/title>/,1]
This does a great job at catching "This is a title" from <title>This is a title</title>. However, there are some web pages that open the title tag on one line, print the title on the next line, and then close the title tag.
The Ruby line I presented above doesn't catch titles such as those, so I'm just trying to find a fix for that.

This famous stackoverflow post explains why it's a bad idea to use regular expressions to parse HTML. A better approach is to use a gem like Nokogiri to parse out the title tags.

Obligatory don't use regex with HTML sentence.
title = html_response[/<title[^>]*>(.*?)<\/title>/m,1]
The m modifier enables Ruby's multiline mode, in which . also matches newline characters, so the capture can span several lines.
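A minimal sketch of the difference, assuming a made-up html_response whose title sits on its own line (the sample string is an illustration, not taken from a real page). The strip call is an addition here, used to drop the whitespace captured around the title:

```ruby
# Hypothetical response body: the title tag opens on one line,
# the text is on the next, and the tag closes on a third.
html_response = "<html><head><title>\n  This is a title\n</title></head></html>"

# Without /m, . stops at "\n", so the pattern never reaches </title>.
single_line = html_response[/<title[^>]*>(.*?)<\/title>/, 1]

# With /m, . also matches "\n"; strip removes the surrounding whitespace.
multi_line = html_response[/<title[^>]*>(.*?)<\/title>/m, 1]
title = multi_line && multi_line.strip

p single_line # => nil
p title       # => "This is a title"
```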

Related

Regex to ignore specific symbols together

Hello, I am working on a problem using regular expressions; the task is to extract information from a made-up HTML document. The title and the content are what I need. This is what I have come up with so far.
<body>([^<]*)(?:<[^>]*+>)*([^<]*)(?:<[^>]*+>)*([^<]*)(?:<[^>]*+>)*([^<]*)(?:<[^>]*+>)*<\/body>
I know it's just repeating the same RE, but I couldn't get a match otherwise, so please help me there as well.
The title is inside <title> </title> and the content is inside <body> </body>. But there is a problem: I need to ignore all the \n in the text and get only the text.
this is some sample text :
<html>\n<head><title>Some title</title></head>\n<body>Here<p> is some </p>content <a href="www.somesite.com">\nclick</body>\n</html>
Also, I know from "RegEx match open tags except XHTML self-contained tags" that I should not parse HTML with RE, but my task requires me to use RE.
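One possible approach, sketched in Ruby under the assumption that the \n sequences in the sample are real newline characters: capture the title and the body with two small patterns, then strip the tags and newlines from the body in a separate step instead of repeating the same group. The variable names are illustrative only.

```ruby
# Sample text from the question, with \n as real newlines.
html = "<html>\n<head><title>Some title</title></head>\n" \
       "<body>Here<p> is some </p>content <a href=\"www.somesite.com\">\nclick</body>\n</html>"

# /m lets . match newlines, so each capture may span several lines.
title = html[/<title[^>]*>(.*?)<\/title>/m, 1]
body  = html[/<body[^>]*>(.*?)<\/body>/m, 1]

# Remove every tag, turn newlines into spaces, then collapse repeated spaces.
content = body.gsub(/<[^>]*>/, "").tr("\n", " ").squeeze(" ").strip

p title   # => "Some title"
p content # => "Here is some content click"
```

Splitting the work into "find the element" and "clean the text" keeps each pattern short; it still shares the usual limits of regex-based HTML handling.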

Displaying user input html with newlines

I have a comments section in my application where users enter input in a text area. I want to preserve the line breaks they enter but also display HTML as a string. For example, if comment.body is
Hello, this is the code: <a href='foo'>foo</a>
Bye
I want it to be displayed just as above. The same with anything else, including iframe tags.
The closest I got is:
= simple_format(comment.body)
but it sanitizes html code and it's not displayed. Example: foo <iframe>biz</iframe> bar is displayed as:
foo biz bar
What should I do to achieve what I want?
Just use it without any method, it will be rendered as plain text:
= comment.body
Using your second example, the output will be:
foo <iframe>biz</iframe> bar
To make \n behave as <br>, you can use CSS:
.add-line {
  white-space: pre-wrap;
}
And use it in your view:
.add-line= comment.body
Using your first example:
comment.body = "Hello, this is the code: <a href='foo'>foo</a>\n\nBye"
The output will be:
Hello, this is the code: <a href='foo'>foo</a>
Bye
Having done something similar in the past, I think you must first understand why HTML is sanitized from user input.
Imagine I wrote the following into a field that accepted HTML and displays this to the front page.
<script>alert('Hello')</script>
The code would execute for anyone visiting the front-page and annoyingly trigger a JS alert for every visitor.
Maybe not much of an issue yet, but imagine I wrote some AJAX request that sent user session IDs to my own server. Now this is an issue... because people's sessions are being hijacked.
Furthermore, there is a full JavaScript based exploitation framework called BeEF that relies on this type of website exploit called Cross-site Scripting (XSS).
BeEF does extremely scary stuff and is worth taking a look at when considering user generated HTML.
http://guides.rubyonrails.org/security.html#cross-site-scripting-xss
So what to do? Well if you checked in your DB you'd see that the tags are actually being stored, but like you pointed out aren't displayed.
You could .html_safe the content, but again I strongly advise against this.
Maybe instead you should write an alternative .html_safe method yourself, something like html_safe_whitelisted_tags.
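As for the newlines, you say you want to display the text as is. So replacing \n with <br>, as pointed out by Michael, would be the solution for you. Note the double quotes: in Ruby, '\n' in single quotes is a literal backslash followed by n, not a newline.
comment.body.gsub("\n", '<br />').html_safe_whitelisted_tags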
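The html_safe_whitelisted_tags name above is only a suggestion, not an existing Rails method. A rough plain-Ruby sketch of the idea, written as a standalone function rather than a String method, might escape everything first and then restore a short allow-list of bare tags. ALLOWED_TAGS and its contents are assumptions for illustration, and attributes are deliberately not supported:

```ruby
require 'cgi'

# Hypothetical allow-list; tags with attributes stay escaped in this sketch.
ALLOWED_TAGS = %w[b i em strong].freeze

# Hypothetical helper named after the suggestion above.
def html_safe_whitelisted_tags(text)
  # Escape all markup so nothing the user typed is treated as HTML.
  escaped = CGI.escapeHTML(text)
  # Restore only bare opening/closing tags from the allow-list, e.g. <b> and </b>.
  pattern = /&lt;(\/?)(#{ALLOWED_TAGS.join('|')})&gt;/i
  escaped.gsub(pattern) { "<#{$1}#{$2}>" }
end

result = html_safe_whitelisted_tags("<b>hi</b> <script>alert(1)</script>")
p result # => "<b>hi</b> &lt;script&gt;alert(1)&lt;/script&gt;"
```

In a Rails view, the built-in sanitize helper with its tags: option is likely a better starting point than a hand-rolled filter; the sketch only illustrates the allow-list idea.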
html_safe allows the HTML in the comment to be rendered as HTML, but the newlines would still be ignored, so a quick replace of \n with <br /> covers the line breaks:
comment.body.gsub("\n", "<br />").html_safe
If you want the HTML to be displayed instead of rendered, check out CGI::escapeHTML(), and do the gsub afterwards so that the <br /> does not get escaped.
CGI::escapeHTML(comment.body).gsub("\n", "<br />")
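The order of the two steps matters, as this small sketch shows. The body string stands in for comment.body and is made up for the example:

```ruby
require 'cgi'

# Stand-in for comment.body.
body = "foo <iframe>biz</iframe> bar\nbaz"

# Escape first, then insert <br />, so the line break stays real markup.
escape_then_break = CGI.escapeHTML(body).gsub("\n", "<br />")

# Reversed order: the <br /> gets escaped too and shows up as text.
break_then_escape = CGI.escapeHTML(body.gsub("\n", "<br />"))

p escape_then_break # => "foo &lt;iframe&gt;biz&lt;/iframe&gt; bar<br />baz"
p break_then_escape # => "foo &lt;iframe&gt;biz&lt;/iframe&gt; bar&lt;br /&gt;baz"
```

In a Rails view the first result would still need to be marked as safe to avoid being escaped a second time; that is reasonable here only because everything the user typed has already been escaped.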

Using regex to get title

I'm not sure how I'd select a title with regex. I've tried
match(/<title>(.*) .*<\/title>/)[1]
but that doesn't match anything.
This is the response body I'm trying to select from.
Trying to select "title I need to select."
It doesn't work because the title tag carries an itemprop=\"name\" attribute, so the literal <title> never matches. To fix this, allow for attributes in the pattern as well:
# copy-paste from the page you provided
html = '<!doctype html>\n<html lang=\"en\" itemscope itemtype=\"https://schema.org/WebPage\">\n<head>\n<meta charset=\"utf-8\"><meta name=\"referrer\" content=\"always\" />\n<title itemprop=\"name\">title I need to select.</title>\n<meta itemprop=\"description\" name=\"description\" content=\\'
html.match(/<title.*?>(.*)<\/title>/)[1] # => "title I need to select."
.*? basically means "match as many characters as needed, but no more" (a lazy match).
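The difference between the lazy .*? and the greedy .* only becomes visible when more than one closing tag can match. A small sketch with a made-up input containing two title elements:

```ruby
# Hypothetical input with two <title> elements.
html = '<title itemprop="name">first</title><p>x</p><title>second</title>'

# Greedy capture: (.*) runs to the LAST </title> in the string.
greedy = html[/<title.*?>(.*)<\/title>/, 1]

# Lazy capture: (.*?) stops at the FIRST </title>.
lazy = html[/<title.*?>(.*?)<\/title>/, 1]

p greedy # => "first</title><p>x</p><title>second"
p lazy   # => "first"
```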
However, as others have pointed out, regexes are not ideal for HTML parsing. Instead, you could use a popular Ruby gem for that purpose, Nokogiri:
require 'nokogiri'
page = Nokogiri.parse(html)
page.css('title').text # => "title I need to select."
Note that it can handle even malformed HTML, as is the case here.
If you're looking for a much more robust XML/HTML parser, try using Nokogiri which supports XPath.
This post explains why
Use xPath or Regex?
require "nokogiri"
string = "<title itemprop=\"name\">title I need to select.</title>"
html_doc = Nokogiri::HTML(string)
html_doc.xpath("//title").first.text
Here's the regexp that will give you what you want:
<title.*>(.*)<\/title>
As was mentioned, there are better ways to parse HTML. You might want to check out something like Nokogiri.
When I have to get elements from XML I like to convert it to a hash
from_xml(xml, disallowed_types = nil) public
Returns a Hash containing a collection of pairs when the key is the
node name and the value is its content
# http://apidock.com/rails/Hash/from_xml/class
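Now you can do something like this (from_xml returns a plain Hash, so values are read with string keys; this assumes <title> is the root element of the XML):
hash = Hash.from_xml('XML')
hash['title'] # my favorite book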
One solution would be to use the following pattern:
<title.*?>(.*?)<\/title>
https://regex101.com/r/piwm5H/1
Use an HTML/XML parser when dealing with XML or HTML data, except for extremely simple cases. HTML and XML are too complicated for normal regular expressions.
Using Nokogiri I'd do:
require 'nokogiri'
some_html = '
<html>
<head>
<title>the title</title>
</head>
</html>
'
doc = Nokogiri::HTML(some_html)
doc.title # => "the title"
Nokogiri already has a method to return the title so you can take advantage of that. Or, you can do it the normal way:
doc.at('title').text # => "the title"
The problem with a regular expression is that HTML could be written in many ways:
<title>foo</title>
or:
<title>
foo
</title>
or even:
<title>foo
</head>
which, while not correct, will be accepted by browsers and fixed up by Nokogiri which will then still work. Writing a pattern to handle those variants is a pain and error-prone. It only gets worse as the HTML gets more complex, especially when you don't control the generation of the content.
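To make the point concrete, here is a small regex-only sketch that runs the multiline pattern from earlier on this page against the three variants. The variants hash and its keys are illustrative names:

```ruby
# The multiline-aware pattern from the first answer on this page.
pattern = /<title[^>]*>(.*?)<\/title>/m

# The three ways of writing the title shown above.
variants = {
  one_line:   "<head><title>foo</title></head>",
  multi_line: "<head><title>\nfoo\n</title></head>",
  unclosed:   "<head><title>foo\n</head>"
}

# Extract and strip the title from each variant; nil means no match.
results = variants.transform_values do |html|
  match = html[pattern, 1]
  match && match.strip
end

p results[:one_line]   # => "foo"
p results[:multi_line] # => "foo"
p results[:unclosed]   # => nil
```

The pattern copes with the first two forms but returns nil for the unclosed title, which is the kind of case a parser repairs for you.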

[Rails] Using Markdown Markup Language

In my Rails application, people are supposed to submit "posts." However, with the default scaffolding there are some problems with the text input: HTML code is not allowed, line breaks are not preserved, etc. From what I've learned, I need to use a markup language such as Markdown to solve this. Is there a guide I can follow to apply such a language to my problem?
UPDATE: Here are my problems.
1) Every sentence is combined into one line even if I put a line space.
first line
second line
becomes
first line second line
2) I can't make text bold or italic, or add hyperlinks. As on Stack Overflow, a user should be able to put <b> to make text bold, ** to make it italicized, etc. And a URL should automatically be turned into an href link.
To do these, I thought I had to use markdown library. I could be mistaken, so I needed someone to guide me through. Railscasts on Markdown
Well, yes, new lines in HTML have no meaning. You need to replace line breaks with <br> to preserve them in HTML. To automatically highlight links, you need to look for links in the text and wrap them in appropriate <a> tags. Finally, if you're not filtering HTML tags, they should still be in there. It all depends on what you're doing. Markdown is something entirely different, a special markup language that enables you to do the above while being easier to write than HTML. It depends on what you want to use.
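A plain-Ruby sketch of the first two steps (keeping line breaks and wrapping links), assuming the text is escaped first so that user-typed markup is shown rather than rendered. The format_post name and the simple URL pattern are assumptions for illustration, not an existing Rails helper:

```ruby
require 'cgi'

# Hypothetical helper: escape the text, link bare URLs, keep line breaks.
def format_post(text)
  escaped = CGI.escapeHTML(text)
  # Deliberately simple URL pattern: http(s):// up to the next whitespace.
  linked = escaped.gsub(%r{https?://[^\s<]+}) { |url| %(<a href="#{url}">#{url}</a>) }
  # Newlines mean nothing in HTML, so turn them into <br>.
  linked.gsub("\n", "<br>")
end

html = format_post("first line\nsee http://example.com/page\nsecond line")
p html
# => "first line<br>see <a href=\"http://example.com/page\">http://example.com/page</a><br>second line"
```

A Markdown library would replace this whole function; the sketch is only the "do it by hand" alternative described above.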

trouble using tinymce using ruby on rails

I am having trouble using the TinyMCE editor with Rails 3. I want to show text in bold, and tags are not working: for example, text I wrap in p tags should start a new paragraph, but instead it stays on the same line and the p tags themselves are displayed on the page.
The usual suspect when it comes to rails 3 printing raw html output to the site, is that someone forgot to call html_safe on whatever text should be printed.
So if you have a @my_model_instance.description that you edit with TinyMCE, you might want to make the view look like @my_model_instance.description.html_safe, or, as they suggest in the comment on the documentation, raw(@my_model_instance.description).
If the text is coming from user input, however, you might want to be a bit cautious, since it might be possible for users to input all sorts of nasty injection hacks this way.
