Interpret newlines as <br>s in markdown (Github Markdown-style) in Ruby

I'm using markdown for comments on my site and I want users to be able to create line breaks by pressing enter instead of space space enter (see this meta question for more details on this idea)
How can I do this in Ruby? You'd think Github Flavored Markdown would be exactly what I need, but (surprisingly), it's quite buggy.
Here's their implementation:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
  x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
end
This logic requires that the line start with a \w (or a <) for a linebreak at the end to create a <br>. The reason for this requirement is that you don't want to mess with lists: (But see the edit below; I'm not even sure this makes sense)
* we don't want a <br>
* between these two list items
However, the logic breaks in these cases:
*this line is in italics*
another line
> the start of a blockquote!
another line
I.e., in all of these cases there should be a <br> at the end of the first line, and yet GFM doesn't add one.
Oddly, this works correctly in the javascript version of GFM.
Does anyone have a working implementation of "new lines to <br>s" in Ruby?
Edit: It gets even more confusing!
If you check out Github's official Github Flavored Markdown repository, you'll find yet another newline to <br> regex!:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/(\A|^$\n)(^\w[^\n]*\n)(^\w[^\n]*$)+/m) do |x|
  x.gsub(/^(.+)$/, "\\1  ")
end
I have no clue what this regex means, but it doesn't do any better on the above test cases.
Also, it doesn't look like the "don't mess with lists" justification for requiring that lines start with word characters is valid to begin with. I.e., standard markdown list semantics don't change regardless of whether you add 2 trailing spaces. Here:
item 1
item 2
item 3
In the source of this question there are 2 trailing spaces after "item 1", and yet if you look at the HTML, there is no superfluous <br>
This leads me to think the best regex for converting newlines to <br>s is just:
text.gsub!(/^[^\n]+\n+/) do |x|
  x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
end
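As a rough sketch of that last idea (not GitHub's implementation), the simplified regex can be wrapped in a method and run against the two failing cases from above. The method name newlines_to_hard_breaks is made up for illustration, and rstrip is used instead of strip! so that leading indentation survives.

```ruby
# Hypothetical helper: add Markdown's two-space hard break to every line
# that ends in a single newline; paragraph breaks (2+ newlines) are kept.
def newlines_to_hard_breaks(text)
  text.gsub(/^[^\n]+\n+/) do |chunk|
    chunk =~ /\n{2}/ ? chunk : "#{chunk.rstrip}  \n"
  end
end

italics    = newlines_to_hard_breaks("*this line is in italics*\nanother line\n")
blockquote = newlines_to_hard_breaks("> the start of a blockquote!\nanother line\n\nnext paragraph\n")

puts italics.inspect
puts blockquote.inspect
```

Note that this treats every line the same way, so list items and code blocks also receive trailing spaces; whether that matters depends on the Markdown renderer in use.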

I'm not sure if this will help, but I just use simple_format() from ActionView::Helpers::TextHelper:
my_text = "Here is some basic text...\n...with a line break."
simple_format(my_text)
# => "<p>Here is some basic text...\n<br />...with a line break.</p>"
Even if it doesn't meet your specs, looking at the simple_format() source code .gsub! methods might help you out writing your own version of required markdown.
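For readers without Rails at hand, here is a minimal plain-Ruby approximation of the behaviour described above. It is a sketch, not the Rails source: the name simple_format_sketch is hypothetical, and CGI.escapeHTML stands in for the sanitizing that the real helper performs.

```ruby
require "cgi"

# Hypothetical approximation: blank lines separate paragraphs, and a single
# newline inside a paragraph gets a <br /> appended after it.
def simple_format_sketch(text)
  paragraphs = text.gsub(/\r\n?/, "\n").split(/\n\n+/)
  paragraphs.map do |para|
    escaped = CGI.escapeHTML(para)
    "<p>#{escaped.gsub(/([^\n]\n)(?=[^\n])/, '\1<br />')}</p>"
  end.join("\n\n")
end

formatted = simple_format_sketch("Here is some basic text...\n...with a line break.")
puts formatted
```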

A little too late, but perhaps useful for other people. I've gotten it to work (but not thoroughly tested) by preprocessing the text using regular expressions, like so. It's hideous as a result of the lack of zero-width lookbehinds, but oh well.
# Append two spaces to a simple line, if it ends in newline, to render the
# markdown properly. Note: do not do this for lists, instead insert two newlines. Also, leave double newlines
# alone.
text.gsub! /^ ([\*\+\-]\s+|\d+\s+)? (.+?) (\ \ )? \r?\n (\r?\n|[\*\+\-]\s+|\d+\s+)? /xi do
  full, pre, line, spaces, post = $~.to_a
  if post != "\n" && pre.blank? && post.blank? && spaces.blank?
    "#{pre}#{line}  \n#{post}"
  elsif pre.present? || post.present?
    "#{pre}#{line}\n\n#{post}"
  else
    full
  end
end


Remove contents within a specific tag

Using Rails 3.2. I want to remove the <b> tags together with all the text inside them, but I have only managed to find ways to strip the tags:
string = "
<b>Section 1</b>
Everything is good.<br>
<b>Section 2</b>
All is well."
strip_tags(string)
# => "Section 1 Everything is good. Section 2 All is well."
I want to achieve this:
"Everything is good. All is well."
Should I add regex matching too?
The "right" way would be to use an html parser like Nokogiri.
However for this simple task, you may use a regex. It's quite simple:
Search for : (?m)<b\s*>.*?<\/b\s*> and replace it with empty string. After that, use strip_tags.
Regex explanation:
(?m) # set the m modifier to match newlines with dots .
<b # match <b
\s* # match a whitespace zero or more times
> # match >
.*? # match anything ungreedy until </b found
<\/b # match </b
\s* # match a whitespace zero or more times
> # match >
Online demo
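A minimal plain-Ruby sketch of the two steps described in this answer. The second gsub is only a naive stand-in for Rails' strip_tags (an assumption made so the snippet runs without Rails), and the final split/join just collapses leftover whitespace.

```ruby
string = <<~HTML
  <b>Section 1</b>
  Everything is good.<br>
  <b>Section 2</b>
  All is well.
HTML

# Step 1: drop each <b>...</b> pair together with its contents.
# The /m flag plays the role of (?m): "." also matches newlines.
without_bold = string.gsub(/<b\s*>.*?<\/b\s*>/m, "")

# Step 2: remove whatever tags remain (naive stand-in for strip_tags).
without_tags = without_bold.gsub(/<[^>]+>/, "")

result = without_tags.split.join(" ")
puts result
```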
It would be much better to use an HTML/XML parser for this task. Ruby does not have a native one, but Nokogiri is good and wraps libxml/xslt
doc = Nokogiri::XML string
result = doc.text # or .inner_html to include `<p>`
You can do string.gsub(/<b>.*<\/b>/, '')
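One caveat with that pattern, sketched below on a made-up single-line input: .* is greedy, so when two <b> elements share a line it swallows everything from the first <b> to the last </b>. Adding ? makes the match stop at the nearest closing tag.

```ruby
html = "<b>Section 1</b> Everything is good. <b>Section 2</b> All is well."

greedy = html.gsub(/<b>.*<\/b>/, "")   # spans from the first <b> to the last </b>
lazy   = html.gsub(/<b>.*?<\/b>/, "")  # stops at the nearest </b>

puts greedy.inspect
puts lazy.inspect
```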
If you want to sanitize the tags, you can try this:
ActionController::Base.helpers.sanitize("test<br>test<br>test<br> test")
If you want to remove all the tags, you need to use this:
ActionView::Base.full_sanitizer.sanitize("test<br>test<br>test<br> test")
These two differ slightly. The first one is good for neutralizing script tags to prevent XSS attacks, but it doesn't remove every tag. The second one removes any HTML tags in the text.

extracting runs of text with Mechanize/Nokogiri

Is there a sensible way to extract each run of text in a Mechanize-parsed HTML document? For example, so that:
<p>Here is <b>some</b> text</p>
is broken into three elements:
Here is
some
text
My hunch is that there's a simple technique using recursive CSS search and/or #flatten, but I've not figured it out yet.
Borrowing from an answer in "Nokogiri recursively get all children":
result = []
doc.traverse { |node| result << node.text if node.text? }
That should give you the array ["Here is ", "some", " text"].
"Getting Mugged by Nokogiri" discusses traverse.
Since you want the contents of each text node, you can do this:
doc.search('//text()').map(&:text)
The only downside to this (and to the other answer) is that you get all the whitespace between elements as well. If you want to suppress this, you can do this:
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
This removes all elements that don't contain a word character.

counting line numbers of a poem with nokogiri / ruby

I've been struggling to do this with a simple regex, but it's never been very accurate. It doesn't have to be perfect.
Source has a combination of <p> and <br> tags. I don't want to count blank lines.
Old way:
self.words = rendered.gsub(/<p> <\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1
New way (not working):
Tries to turn all the <br>s into paragraphs, then throw it into Nokogiri to count paragraph tags with more than 3 chars in them (I have no idea how. Counting 1-letter lines would be nice too, but this worked OK in JavaScript)
h = rendered
h.gsub!(/<br>/gi,"<p>") if h =~ /<br>\s*<br>/
h.prepend "<p>" if !h =~ /^\s*<p[^>]*>/i
h.replace(/<p>\s*<p>/g,"<p> </p><p>")
# find+count p tags with at least 1-3 chars?
# this is javascript not ruby, but you get the idea
$('p', c).each(function(i) { // had to trim it to remove whitespaces from start/end.
if ($(this).children('img').length) return; // skip if it's just an image.
if ($.trim($(this).text()).length > 3)
$(this).append("<div class='num'>"+ (n += 1) +"</div>");
Other methods are welcome!
Example poem:
from the other side of silence<br>
you met me with change and a pocket<br>
of unhappy apples.</p>
we bled together to black<br>
and chose the path carefully to<br>
sometimes when you smile<br>
your radiant footsteps fall<br>
and all around us is silence:<br>
each dream step is<br>
false but full of such glory</p>
unhappiness never made a student of you:<br>
just two by two by two. now three<br>
this great we that overflows our<br>
each jewel-like addition to the delicate<br>
crown. but flowers fall and dreams,<br>
all dreams, come to and end with death.</p>
Thank you!
For posterity, here's what I'm using now and it seems to be quite accurate. Non latin chars cause some problems sometimes from ckeditor, so I'm stripping them out for now.
html = Nokogiri::HTML(rendered)
text = html.search('body').inner_text rescue nil
return self.words = rendered.gsub(/<p> <\/p>/,'').gsub(/<p><br\s?\/?>|(?:<br\s?\/?>){2,}/,'<br>').scan(/<br>|<br \/>|<p/).size+1 if !text
#bonus points to strip lines entirely non-letter. idk
#d "text is", text.gsub!(/([\x09|\x0D|\t])|(\xc2\xa0){1,}|[^A-z]/u,'')
#d "text is", text
self.words = text.strip.scan(/(\s*\n\s*)+/).size+1
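Another possible approach, sketched here without Nokogiri and not taken from the answer above: turn every <br>, <p> and </p> into a newline, drop the remaining tags, and count only the lines that contain at least one letter. The method name count_poem_lines and the sample markup are assumptions for illustration.

```ruby
# Hypothetical helper: counts the lines of a poem stored as HTML,
# skipping blank lines and lines without any letters.
def count_poem_lines(html)
  text = html.gsub(%r{<br\s*/?>|</p>|<p[^>]*>}i, "\n") # tags that end a line
  text = text.gsub(/<[^>]+>/, "")                      # any other tags
  text = text.gsub("&nbsp;", " ")                      # spacer paragraphs
  text.each_line.count { |line| line.match?(/\p{L}/) }
end

sample = "<p>from the other side of silence<br>\n" \
         "you met me with change and a pocket<br>\n" \
         "of unhappy apples.</p>\n" \
         "<p>&nbsp;</p>\n" \
         "<p>we bled together to black<br>\n<br>\n" \
         "and chose the path carefully to</p>"

line_count = count_poem_lines(sample)
puts line_count
```

Because the check uses \p{L}, lines made up only of punctuation or digits are skipped, and one-letter lines are still counted.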

Remove hard line breaks from text with Ruby

I have some text with hard line breaks in it like this:
This should all be on one line
since it's one sentence.

This is a new paragraph that
should be separate.
I want to remove the single newlines but keep the double newlines so it looks like this:
This should all be on one line since it's one sentence.

This is a new paragraph that should be separate.
Is there a single regular expression to do this? (or some easy way)
So far this is my only solution which works but feels hackish.
txt = txt.gsub(/(\r\n|\n|\r)/,'[[[NEWLINE]]]')
txt = txt.gsub('[[[NEWLINE]]][[[NEWLINE]]]', "\n\n")
txt = txt.gsub('[[[NEWLINE]]]', " ")
Replace all newlines that are not followed by or preceded by a newline:
text = <<END
This should all be on one line
since it's one sentence.

This is a new paragraph that
should be separate.
END
p text.gsub(/(?<!\n)\n(?!\n)/, ' ')
#=> "This should all be on one line since it's one sentence.\n\nThis is a new paragraph that should be separate. "
Or, for Ruby 1.8 without lookarounds:
txt.gsub! /([^\n])\n([^\n])/, '\1 \2'
text.gsub!(/(\S)[^\S\n]*\n[^\S\n]*(\S)/, '\1 \2')
The two (\S) groups serve the same purposes as the lookarounds ((?<!\s)(?<!^) and (?!\s)(?!$)) in sln's regexes:
they confirm that the linefeed really is in the middle of a sentence, and
they ensure that the [^\S\n]*\n[^\S\n]* part consumes any other whitespace surrounding the linefeed, making it possible for us to normalize it to a single space.
They also make the regex easier to read, and (perhaps most importantly) they work in pre-1.9 versions of Ruby that don't support lookbehinds.
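A small sketch of the (\S) version in use; the method name unwrap_hard_breaks and the sample strings are made up for illustration. The last call shows one limitation worth knowing about: because each match consumes the \S on both sides, the character that ends one match cannot also start the next, so a one-character line keeps the newline after it.

```ruby
def unwrap_hard_breaks(text)
  text.gsub(/(\S)[^\S\n]*\n[^\S\n]*(\S)/, '\1 \2')
end

wrapped = "This should all be on one line  \nsince it's one sentence.\n\n" \
          "This is a new paragraph that\n  should be separate.\n"

unwrapped = unwrap_hard_breaks(wrapped)
puts unwrapped

# One-character lines: "b" is consumed by the first match,
# so the newline after it is left alone.
short = unwrap_hard_breaks("a\nb\nc")
puts short.inspect
```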
There is more to formatting (turning off word wrap) than you think.
If the output is a result of a formatting operation, then you should go by
those rules to reverse engineer the original.
For instance, the test you have there is
This should all be on one line
since it's one sentence.
This is a new paragraph that
should be separate.
If you removed just the single newlines only, it would look like this:
This should all be on one line since it's one sentence.
This is a new paragraph thatshould be separate.
Also, other formatting such as intentional newlines will be lost, so something like:
This is Chapter 1
Section a
Section b
Turns into
This is Chapter 1 Section a Section b
Finding the newline in question is easy: /(?<!\n)\n(?!\n)/
But what do you replace it with?
Edit: Actually, it's not that easy even to find standalone newlines, because visually they sit amongst (horizontal) whitespace that is hidden from view.
There are 4 ways to go (shown in Perl syntax):
1. Remove newline, keep the surrounding formatting
$text =~ s/(?<!\s)([^\S\n]*)\n([^\S\n]*)(?!\s)/$1$2/g;
2. Remove newline and formatting, substitute a space
$text =~ s/(?<!\s)[^\S\n]*\n[^\S\n]*(?!\s)/ /g;
3. Same as 2, but ignore newline at beginning or end of string
$text =~ s/(?<!\s)(?<!^)[^\S\n]*\n[^\S\n]*(?!$|\s)/ /g;
4. Same as 1, but ignore newline at beginning or end of string
$text =~ s/(?<!\s)(?<!^)([^\S\n]*)\n([^\S\n]*)(?!$|\s)/$1$2/g;
Example breakdown of regex (this is the minimum required just to isolate a single newline):
(?<!\s) # Not a whitespace behind us (text,number,punct, etc..)
[^\S\n]* # 0 or more whitespaces, but no newlines
\n # a newline we want to remove
[^\S\n]* # 0 or more whitespaces, but no newlines
(?!\s) # Not a whitespace in front of us (text,number,punct, etc..)
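Since the question is about Ruby, here is a sketch of the "remove newline and formatting, substitute a space" pattern as a Ruby gsub (Ruby 1.9 or later, which supports lookbehind). The method name join_single_newlines is made up for illustration.

```ruby
def join_single_newlines(text)
  # (?<!\s)   no whitespace directly before the match
  # [^\S\n]*  horizontal whitespace around the newline is consumed
  # (?!\s)    no whitespace directly after the match
  text.gsub(/(?<!\s)[^\S\n]*\n[^\S\n]*(?!\s)/, " ")
end

joined   = join_single_newlines("one line  \nsince\n\nnew para")
chained  = join_single_newlines("a\nb\nc")
trailing = join_single_newlines("end\n")

puts joined.inspect
puts chained.inspect
puts trailing.inspect
```

The last call shows why the variants that ignore the beginning and end of the string exist: a lone trailing newline also satisfies (?!\s) and is turned into a space.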
Well, there is this:
s.gsub /([^\n])\n([^\n])/, '\1 \2'
It won't do anything to leading or trailing newlines. If you don't need leading or trailing white space at all, then you will win with this variation:
s.gsub(/([^\n])\n([^\n])/, '\1 \2').strip
$ ruby -00 -pne 'BEGIN{$\="\n\n"};$_.gsub!(/\n+/,"\0")' file
This should all be on one line since it's one sentence.
This is a new paragraph thatshould be separate.

How to make a Ruby string safe for a filesystem?

I have user entries as filenames. Of course this is not a good idea, so I want to drop everything except [a-z], [A-Z], [0-9], _ and -.
For instance:
my§document$is°° very&interesting___thisIs%nice445.doc.pdf
should become
and then ideally
Is there a nice and elegant way for doing this?
I'd like to suggest a solution that differs from the old one. Note that the old one uses the deprecated returning. By the way, it's anyway specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). Also, the existing solution fails to encode .doc.pdf into _doc.pdf, as you requested. And, of course, it doesn't collapse the underscores into one.
Here's my solution:
def sanitize_filename(filename)
  # Split the name when finding a period which is preceded by some
  # character, and is followed by some character other than a period,
  # if there is no following period that is followed by something
  # other than a period (yeah, confusing, I know)
  fn = filename.split /(?<=.)\.(?=[^.])(?!.*\.[^.])/m
  # We now have one or two parts (depending on whether we could find
  # a suitable period). For each of these parts, replace any unwanted
  # sequence of characters with an underscore
  fn.map! { |s| s.gsub /[^a-z0-9\-]+/i, '_' }
  # Finally, join the parts with a period and return the result
  return fn.join '.'
end
You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:
There should be at most one filename extension, which means that there should be at most one period in the filename
Trailing periods do not mark the start of an extension
Leading periods do not mark the start of an extension
Any sequence of characters beyond A–Z, a–z, 0–9 and - should be collapsed into a single _ (i.e. underscore is itself regarded as a disallowed character, and the string '$%__°#' would become '_' – rather than '___' from the parts '$%', '__' and '°#')
The complicated part of this is where I split the filename into the main part and extension. With the help of a regular expression, I'm searching for the last period, which is followed by something else than a period, so that there are no following periods matching the same criteria in the string. It must, however, be preceded by some character to make sure it's not the first character in the string.
My results from testing the function:
1.9.3p125 :006 > sanitize_filename 'my§document$is°° very&interesting___thisIs%nice445.doc.pdf'
=> "my_document_is_very_interesting_thisIs_nice445_doc.pdf"
which I think is what you requested. I hope this is nice and elegant enough.
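To make the assumptions about leading, trailing and multiple periods concrete, here is a compact restatement of the function above run on a few edge cases. The extra file names are made-up examples, not from the question.

```ruby
# Same split regex as above: only the last period that has a character
# before it and a non-period after it separates name and extension.
def sanitize_filename(filename)
  parts = filename.split(/(?<=.)\.(?=[^.])(?!.*\.[^.])/m)
  parts.map { |part| part.gsub(/[^a-z0-9\-]+/i, "_") }.join(".")
end

%w[archive.tar.gz .bashrc notes.].each do |name|
  puts "#{name.inspect} -> #{sanitize_filename(name).inspect}"
end
```

Here only the final .gz counts as an extension, while the leading period of .bashrc and the trailing period of notes. are treated as ordinary disallowed characters and become underscores.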
def sanitize_filename(filename)
  returning filename.strip do |name|
    # NOTE: File.basename doesn't work right with Windows paths on Unix
    # get only the filename, not the whole path
    name.gsub!(/^.*(\\|\/)/, '')
    # Strip out the non-ascii character
    name.gsub!(/[^0-9A-Za-z.\-]/, '_')
  end
end
In Rails you might also be able to use ActiveStorage::Filename#sanitized:
ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"
If you use Rails you can also use String#parameterize. This is not particularly intended for that, but you will obtain a satisfying result.
"my§document$is°° very&interesting___thisIs%nice445.doc.pdf".parameterize
For Rails I found myself wanting to keep any file extensions but using parameterize for the remainder of the characters:
filename = "my§doc$is°° very&itng___thsIs%nie445.doc.pdf"
cleaned = filename.split(".").map(&:parameterize).join(".")
For implementation details and ideas, see the source:
def parameterize(string, separator: "-", preserve_case: false)
# Turn unwanted chars into the separator.
parameterized_string.gsub!(/[^a-z0-9\-_]+/i, separator)
#... some more stuff
If your goal is just to generate a filename that is "safe" to use on all operating systems (and not to remove any and all non-ASCII characters), then I would recommend the zaru gem. It doesn't do everything the original question specifies, but the filename produced should be safe to use (and still keep any filename-safe unicode characters untouched):
Zaru.sanitize! " what\ēver//wëird:user:înput:"
# => "whatēverwëirduserînput"
Zaru.sanitize! "my§docu*ment$is°° very&interes:ting___thisIs%nice445.doc.pdf"
# => "my§document$is°° very&interesting___thisIs%nice445.doc.pdf"
There is a library that may be helpful, especially if you're interested in replacing weird Unicode characters with ASCII: unidecode.
irb(main):001:0> require 'unidecoder'
=> true
irb(main):004:0> "Grzegżółka".to_ascii
=> "Grzegzolka"
