Remove hard line breaks from text with Ruby - ruby-on-rails

I have some text with hard line breaks in it like this:
This should all be on one line
since it's one sentence.
This is a new paragraph that
should be separate.
I want to remove the single newlines but keep the double newlines so it looks like this:
This should all be on one line since it's one sentence.
This is a new paragraph that should be separate.
Is there a single regular expression to do this? (or some easy way)
So far this is my only solution which works but feels hackish.
txt = txt.gsub(/(\r\n|\n|\r)/,'[[[NEWLINE]]]')
txt = txt.gsub('[[[NEWLINE]]][[[NEWLINE]]]', "\n\n")
txt = txt.gsub('[[[NEWLINE]]]', " ")

Replace all newlines that are not followed by or preceded by a newline:
text = <<END
This should all be on one line
since it's one sentence.
This is a new paragraph that
should be separate.
END
p text.gsub /(?<!\n)\n(?!\n)/, ' '
#=> "This should all be on one line since it's one sentence.\n\nThis is a new paragraph that should be separate. "
Or, for Ruby 1.8 without lookarounds:
txt.gsub! /([^\n])\n([^\n])/, '\1 \2'

text.gsub!(/(\S)[^\S\n]*\n[^\S\n]*(\S)/, '\1 \2')
The two (\S) groups serve the same purposes as the lookarounds ((?<!\s)(?<!^) and(?!\s)(?!$)) in #sln's regexes:
they confirm that the linefeed really is in the middle of a sentence, and
they ensure that the [^\S\n]*\n[^\S\n]* part consumes any other whitespace surrounding the linefeed, making it possible for us to normalize it to a single space.
They also make the regex easier to read, and (perhaps most importantly) they work in pre-1.9 versions of Ruby that don't support lookbehinds.

There is more to formatting (turning off word wrap) than you think.
If the output is a result of a formatting operation, then you should go by
those rules to reverse engineer the original.
For instance, the test you have there is
This should all be on one line
since it's one sentence.
This is a new paragraph that
should be separate.
If you removed just the single newlines only, it would look like this:
This should all be on one line since it's one sentence.
This is a new paragraph thatshould be separate.
Also, other formatting such as intentional newlines will be lost, so something like:
This is Chapter 1
Section a
Section b
Turns into
This is Chapter 1 Section a Section b
Finding the newline in question is easy /(?<!\n)\n(?!\n)/
but, what do you replace it with.
Edit: Actually, its not that easy even to find standalone newlines, because visually they sit amongst hidden from view (horizontal) whitespaces.
There are 4 ways to go.
Remove newline, keep the surrounding formatting
$text =~ s/(?<!\s)([^\S\n]*)\n([^\S\n]*)(?!\s)/$1$2/g;
Remove newline and formatting, substitute a space
$text =~ s/(?<!\s)[^\S\n]*\n[^\S\n]*(?!\s)/ /g;
Same as above but ignore newline at beginning or end of string
$text =~ s/(?<!\s)(?<!^)[^\S\n]*\n[^\S\n]*(?!$|\s)/ /g;
$text =~ s/(?<!\s)(?<!^)([^\S\n]*)\n([^\S\n]*)(?!$|\s)/$1$2/g;
Example breakdown of regex (this is the minimum required just to isolate a single newline):
(?<!\s) # Not a whitespace behind us (text,number,punct, etc..)
[^\S\n]* # 0 or more whitespaces, but no newlines
\n # a newline we want to remove
[^\S\n]* # 0 or more whitespaces, but no newlines
(?!\s)/ # Not a whitespace in front of us (text,number,punct, etc..)

Well, there is this:
s.gsub /([^\n])\n([^\n])/, '\1 \2'
It won't do anything to leading or trailing newlines. If you don't need leading or trailing white space at all, then you will win with this variation:
s.gsub(/([^\n])\n([^\n])/, '\1 \2').strip

$ ruby -00 -pne 'BEGIN{$\="\n\n"};$_.gsub!(/\n+/,"\0")' file
This should all be on one line since it's one sentence.
This is a new paragraph thatshould be separate.

Related

How to delete non text characters, new line, tab, etc, from params of JSON string

I have JSON strings that may contain \n, \t, which I don't want to save into database. strip_tags helps only with simple strings. I am using gsub(/(\\n)|(\\t)/, "").
I wonder if there is another Rails helper method or a better way to achieve this.
e.g
"[{\"type\":\"checkbox-group\",\"label\":\"\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\nFill in\\nthe Gap (Please\\nfill in the blank box with correct wor\",\"name\":\"checkbox-group-1527245153706\",\"values\":[{\"label\":\"Option 1\",\"value\":\"option-1\",\"selected\":true}]},{\"type\":\"text\",\"label\":\"\\n\\t\\t\\n\\t\\n\\t\\n\\t\\t\\n\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\tWhat are the unique features of\\ne-commerce, digital markets, and\\ndigital goods? \\n\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\n\\t\\t\\t\\n\\t\\t\",\"className\":\"form-control\",\"name\":\"text-1527245426509\",\"subtype\":\"text\"}]"
You can make use of squish or squish!
" Some text \n\n and a tab \t\t with new line \n\n ".squish
#=> "Some text and a tab with new line"
Squish removes all the whitespace chars on both ends and grouping remaining whitespace chars (\n, \t, space) in one space
I think this one may help you,
JSON.parse(string).map{ |a| a['label'] = a['label'].squish; a}

Ruby: Split a string into substring of maximum 40 characters

I have some strings with a sentence and i need to subdivise it into a substring of maximum 40 characters.
But i don't want to split the sentence in the middle of a word.
I tried with .gsub function but it's return 40 characters maximum and avoid to cut the string in the middle of a word. But it's return only the first occurence.
sentence[0..40].gsub(/\s\w+$/,'')
I tried with split but i can select only the fist 40 characters and split in the middle of a word...
sentence.split(...){40}
My string is "Sure, we will show ourselves only when we know the east door has been opened.".
The string output i want is
["Sure, we will show ourselves only when we","know the east door has
been opened."]
Do you have a solution ? Thanks
Your first attempt:
sentence[0..40].gsub(/\s\w+$/,'')
almost works, but it has one fatal flaw. You are splitting on the number of characters before cutting off the last word. This means you have no way of knowing whether the bit being trimmed off was a whole word, or a partial word.
Because of this, your code will always cut off the last word.
I would solve the problem as follows:
sentence[/\A.{0,39}[a-z]\b/mi]
\A is an anchor to fix the regex to the start of the string.
.{0,39}[a-z] matches on 1 to 40 characters, where the last character must be a letter. This is to prevent the last selected character from being punctuation or space. (Is that desired behaviour? Your question didn't really specify. Feel free to tweak/remove that [a-z] part, e.g. [a-z.] to match a full stop, if desired.)
\b is a word boundary look-around. It is a zero-width matcher, on beginning/end of words.
/mi modifiers will include case insensitive (i.e. A-Z) and multi-line matches.
One very minor note is that because this regex is matching 1 to 40 characters (rather than zero), it is possible to get a null result. (Although this is seemingly very unlikely, since you'd need a 1-word, 41+ letter string!!) To account for this edge case, call .to_s on the result if needed.
Update: Thank you for the improved edit to your question, providing a concrete example of an input/result. This makes it much clearer what you are asking for, as the original post was somewhat ambiguous.
You could solve this with something like the following:
sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
String#scan returns an array of strings that match the pattern - so you can then re-join these strings to reconstruct the original.
Again, I have added a few more characters (!?,;) to the list of "final characters in the substring". Feel free to tweak this as desired.
(?:\b|$) means "either a word boundary, or the end of the line". This fixes the issue of the result not including the final . in the substrings. Note that I have used a non-capture group (?:) to prevent the result of scan from changing.

Pattern match dropping new lines characters

How to extract the values from a csv like string dropping the new lines characters (\r\n or \n) with a pattern.
A line looks like:
1.1;2.2;Example, 3
Notice there are only 3 values and the separator is ;. The problem I'm having is to come up with a pattern that reads the values while dropping the new line characters (the file comes from a windows machine so it has \r\n, reading it from a linux and would like to be independent from the new line character used).
My simple example right now is:
s = "1.1;2.2;Example, 3\r\n";
p = "(.-);(.-);(.-)";
a, b, c = string.match(s, p);
print(c:byte(1, -1));
The two last characters printed by the code above are the \r\n.
The problem is that both, \r and \n are detected by the %c and %s classes (control characters and space characters), as show by this code:
s = "a\r";
print(s:match("%c"));
print(s:match("%s"));
print(s:match("%d"));
So, is it possible to left out from the match the new lines characters? (It should not be assumed that the last two characters will be new lines characters)
The 3º value may contain spaces, punctuation and alphanumeric characters and since \r\n are detected as space characters a pattern like `"(.-);(.-);([%w%s%c]-).*" does not work.
Your pattern
p = "(.-);(.-);(.-)";
does not work: the third field is always empty because .- matches a little as possible. You need to anchor it at the end of the string, but then the third field will contain trailing newline chars:
p = "(.-);(.-);(.-)$";
So, just stop at the first trailing newline char. This also anchors the last match. Try this pattern instead:
p = "(.-);(.-);(.-)[\r\n]";
If trailing newline chars are optional, try this pattern:
p = "(.-);(.-);(.-)[\r\n]*$";
Without any lua experience I found a naive solution:
clean_CR = s:gsub("\r","");
clean_NL = clean_CR:gsub("\n","");
With POSIX regex syntax I'd use
^([^;]*);([^;]*);([^\n\r]*).*$
.. with "\n" and "\r" possibly included as "^M", "^#" (control/unicode characters) .. depending on your editor.

Interpret newlines as <br>s in markdown (Github Markdown-style) in Ruby

I'm using markdown for comments on my site and I want users to be able to create line breaks by pressing enter instead of space space enter (see this meta question for more details on this idea)
How can I do this in Ruby? You'd think Github Flavored Markdown would be exactly what I need, but (surprisingly), it's quite buggy.
Here's their implementation:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
This logic requires that the line start with a \w for a linebreak at the end to create a <br>. The reason for this requirement is that you don't to mess with lists: (But see the edit below; I'm not even sure this makes sense)
* we don't want a <br>
* between these two list items
However, the logic breaks in these cases:
[some](http://google.com)
[links](http://google.com)
*this line is in italics*
another line
> the start of a blockquote!
another line
I.e., in all of these cases there should be a <br> at the end of the first line, and yet GFM doesn't add one
Oddly, this works correctly in the javascript version of GFM.
Does anyone have a working implementation of "new lines to <br>s" in Ruby?
Edit: It gets even more confusing!
If you check out Github's official Github Flavored Markdown repository, you'll find yet another newline to <br> regex!:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/(\A|^$\n)(^\w[^\n]*\n)(^\w[^\n]*$)+/m) do |x|
x.gsub(/^(.+)$/, "\\1 ")
end
I have no clue what this regex means, but it doesn't do any better on the above test cases.
Also, it doesn't look like the "don't mess with lists" justification for requiring that lines start with word characters is valid to begin with. I.e., standard markdown list semantics don't change regardless of whether you add 2 trailing spaces. Here:
item 1
item 2
item 3
In the source of this question there are 2 trailing spaces after "item 1", and yet if you look at the HTML, there is no superfluous <br>
This leads me to think the best regex for converting newlines to <br>s is just:
text.gsub!(/^[^\n]+\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << " \n")
end
Thoughts?
I'm not sure if this will help, but I just use simple_format()
from ActionView::Helpers::TextHelper
ActionView simple_format
my_text = "Here is some basic text...\n...with a line break."
simple_format(my_text)
output => "<p>Here is some basic text...\n<br />...with a line break.</p>"
Even if it doesn't meet your specs, looking at the simple_format() source code .gsub! methods might help you out writing your own version of required markdown.
A little too late, but perhaps useful for other people. I've gotten it to work (but not thoroughly tested) by preprocessing the text using regular expressions, like so. It's hideous as a result of the lack of zero-width lookbehinds, but oh well.
# Append two spaces to a simple line, if it ends in newline, to render the
# markdown properly. Note: do not do this for lists, instead insert two newlines. Also, leave double newlines
# alone.
text.gsub! /^ ([\*\+\-]\s+|\d+\s+)? (.+?) (\ \ )? \r?\n (\r?\n|[\*\+\-]\s+|\d+\s+)? /xi do
full, pre, line, spaces, post = $~.to_a
if post != "\n" && pre.blank? && post.blank? && spaces.blank?
"#{pre}#{line} \n#{post}"
elsif pre.present? || post.present?
"#{pre}#{line}\n\n#{post}"
else
full
end
end

removing whitespaces in ActionScript 2 variables

let's say that I have an XML file containing this :
<description><![CDATA[
<h2>lorem ipsum</h2>
<p>some text</p>
]]></description>
that I want to get and parse in ActionScript 2 as HTML text, and setting some CSS before displaying it. Problem is, Flash takes those whitespaces (line feed and tab) and display it as it is.
<some whitespace here>
lorem ipsum
some text
where the output I want is
lorem ipsum
some text
I know that I could remove the whitespaces directly from the XML file (the Flash developer at my workplace also suggests this. I guess that he doesn't have any idea on how to do this [sigh]). But by doing this, it would be difficult to read the section in the XML file, especially when lots of tags are involved and that makes editing more difficult.
So now, I'm looking for a way to strip those whitespaces in ActionScript. I've tried to use PHP's str_replace equivalent (got it from here). But what should I use as a needle (string to search) ? (I've tried to put in "\t" and "\r", don't seem to be able to detect those whitespaces).
edit :
now that I've tried to throw in newline as a needle, it works (meaning that newline successfully got stripped).
mystring = str_replace(newline, '', mystring);
But, newlines only got stripped once, meaning that in every consecutive newlines, (eg. a newline followed by another newline) only one newline can be stripped away.
Now, I don't see that this as a problem in the str_replace function, since every consecutive character other than newline get stripped away just fine.
Pretty much confused about how stuff like this is handled in ActionScript. :-s
edit 2:
I've tried str_replace -ing everything I know of, \n, \r, \t, newline, and tab (by pressing tab key). Replacing \n, \r, and \t seem to have no effect whatsoever.
I know that by successfully doing this, my content can never have real line breaks. That's exactly my intention. I could format the XML the way I want without Flash displaying any of the formatting stuff. :)
Several ways to approach this. Perhaps the simplest answer is, in one sense your Flash developer is probably right, and you should move your whitespace outside of the CDATA container. The reason being, many people (me at least) tend to assume that everything inside a CDATA is "real data", as opposed to markup. On the other hand, whitespace outside a CDATA is normally assumed to be irrelevant, so data like this:
<description>
<![CDATA[<h2>lorem ipsum</h2>
<p>some text</p>]]>
</description>
would be easier to understand and to work with. (The flash developer can use the XML.ignoreWhite property to ignore the whitespace outside the CDATA.)
With that said, if you're editing the XML by hand, then I can see why it would be easier to use the formatting you describe. However, if the extra whitespace is inside the CDATA, then it will inevitable be included in the String data you extract, so your only option is to grab the content of the CDATA and remove the whitespace afterwards.
Then your question reduces to "how do I strip leading/trailing whitespace from a String in AS2?". And unfortunately, since AS2 doesn't support RegEx there's no simple way to do this. I think your best option would be to parse through from the beginning and end to find the first/last non-white character. Something along these lines (untested pseudocode):
myString = stuffFromXML;
whitespace = " " + "\t" + "\n" + "\r" + newline;
start = 0;
end = myString.length;
while ( testString( myString.substr(start,1), whitespace ) ) { start++; }
while ( testString( myString.substr(end-1,1), whitespace ) ) { end--; }
trimmedString = myString.substring( start, end );
function testString( needle, haystack ) {
return ( haystack.indexOf( needle ) > -1 );
}
Hope that helps!
Edit: I notice that in your example you'd also need to remove tabs and whitespace within your text data. This would be tricky, unless you can guarantee that your data will never include "real" tabs in addition to the ones for formatting. No matter what you do with the CDATA tags, it would probably be wiser not to insert extraneous formatting inside your real content and then remove it programmatically afterward. That's just making your own life difficult.
Second edit: As for what character to remove to get rid of newlines, it depends partially on what characters are actually in the XML to begin with (which probably depends on what OS is running where the file is generated), and partially on what character the client machine (that's showing the flash) considers a newline. Lots of gory details here. In practice though, if you remove \r, \n, and \r\n, that usually does the trick. That's why I added both \r and \n to the "whitespace" string in my example code.
its been a while since I've tinkered with AS2.
someXML = new XML();
someXML.ignoreWhite = true;
if you wanted to str_replace try '\n'
Is there a reason that you are using cdata? Admittedly I have no idea what the best practice for this sort of this is, but I tend to leave them out and just have the HTML sit there inside the node.
var foo = node.childnodes.join("") parses it out just fine and I never seem to come across these whitespace problems.
I'm reading this over and over again, and if I'm interpreting you right, all you want to know how to do is strip certain characters (tabs and newlines) from a string in AS2, right? I cannot believe no one has given you the simple one line answer yet:
myString = myString.split("\n").join("");
That's it. Repeat that for \r, \n, and \t and all newlines and tabs will be gone. If you want it as an easy function, then do this:
function stripWhiteSpace(str: String) : String
{
return str.split("\r").join("").split("\n").join("").split("\t").join("");
}
That function won't modify your old string, it will return a new one without \r, \n, or \t. To actually modify the old string use that function like this:
myString = stripWhiteSpace(myString);

Resources