Regular expressions: How do I grab a block of text using regex? (in ruby) - ruby-on-rails

I'm using ruby and I'm trying to find a way to grab text in between the {start_grab_entries} and {end_grab_entries} like so:
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
Something like so:
$1 => "i want to grab
the text that
you see here in
the middle"
So far, I tried this as my regular expression:
\{start_grab_entries}(.|\n)*\{end_grab_entries}
However, using $1, that gives me a blank. Do you know what I can do to grab that block of text in between the tags correctly?

There is a better way to allow the dot to match newlines (/m modifier):
regexp = /\{start_grab_entries\}(.*?)\{end_grab_entries\}/m
Also, make the * lazy by appending a ?, or you might match too much if more than one such section occurs in your input.
That said, the reason why you got a blank match is that you repeated the capturing group itself; therefore you only caught the last repetition (in this case, a \n).
It would have "worked" if you had put the capturing group outside of the repetition:
\{start_grab_entries\}((?:.|\n)*)\{end_grab_entries\}
but, as said above, there is a better way to do that.
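The difference between the two capture styles is easy to check in irb. The following is only a minimal sketch: the sample text and the variable names are made up for illustration.

```ruby
# Sample input; the "before"/"after" lines are invented for the demo.
text = <<~TEXT
  before
  {start_grab_entries}
  i want to grab
  the text that
  {end_grab_entries}
  after
TEXT

# The capturing group itself is repeated, so it only keeps its final repetition.
repeated_group = /\{start_grab_entries\}(.|\n)*\{end_grab_entries\}/
# The repetition sits inside the group, and /m lets the dot match newlines.
lazy_multiline = /\{start_grab_entries\}(.*?)\{end_grab_entries\}/m

last_repetition = text[repeated_group, 1]
whole_block     = text[lazy_multiline, 1]

p last_repetition       # the single character just before the end marker
puts whole_block.strip  # the block without the surrounding newlines
```

The capture from the lazy pattern still carries the newline after the start marker and the one before the end marker, hence the strip.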

I'm adding this because we often read data from a file or data stream where the range of lines we want is not all in memory at once. "Slurping" a file is discouraged if the data could exceed the available memory, which happens easily in production corporate environments. This is how we'd grab lines between boundary markers as the file is being scanned. It doesn't rely on regex; instead it uses Ruby's "flip-flop" .. operator:
#!/usr/bin/ruby
lines = []
DATA.each_line do |line|
lines << line if (line['{start_grab_entries}'] .. line['{end_grab_entries}'])
end
puts lines # << lines with boundary markers
puts
puts lines[1 .. -2] # << lines without boundary markers
__END__
this is not captured
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
this is not captured either
Output of this code would look like:
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
i want to grab
the text that
you see here in
the middle
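If the flip-flop feels too terse, the same line-by-line scan can be written with an explicit flag. This is a sketch of an alternative, not the code above: lines_between is a hypothetical helper, StringIO stands in for a real file handle, the marker lines are dropped, and reading stops after the first block.

```ruby
require 'stringio'

# Collects the lines strictly between the two markers, one line at a time,
# so the whole input never has to be in memory.
def lines_between(io, start_marker, end_marker)
  inside = false
  collected = []
  io.each_line do |line|
    if inside
      break if line.include?(end_marker)  # stop at the closing marker
      collected << line
    elsif line.include?(start_marker)
      inside = true                       # start collecting on the next line
    end
  end
  collected
end

# StringIO plays the role of an open file here.
sample = StringIO.new(<<~SAMPLE)
  this is not captured
  {start_grab_entries}
  i want to grab
  the middle
  {end_grab_entries}
  this is not captured either
SAMPLE

between = lines_between(sample, '{start_grab_entries}', '{end_grab_entries}')
puts between
```

With a real file, File.open(path) { |f| lines_between(f, ...) } would read it the same way, one line at a time.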

string=<<EOF
blah
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
blah
EOF
puts string.scan(/{start_grab_entries}(.*?){end_grab_entries}/m)
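Note that scan with a capturing group returns an array of one-element arrays, one per match, which matters once the input holds more than one section. A small sketch, with invented sample text and the braces escaped as in the first answer:

```ruby
# Two marked sections; the contents are made up for the demo.
string = <<~TEXT
  blah
  {start_grab_entries}
  first block
  {end_grab_entries}
  blah
  {start_grab_entries}
  second block
  {end_grab_entries}
TEXT

pattern = /\{start_grab_entries\}(.*?)\{end_grab_entries\}/m

# scan yields [[capture], [capture]]; flatten and strip give plain strings.
blocks = string.scan(pattern).flatten.map(&:strip)
p blocks
```

Without the lazy ?, the single greedy match would run from the first start marker to the last end marker.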

Related

Grep lines for multiple words and the line ending, and then replace line ending if matched

I need to grep a long text file for lines that contains multiple possible words and also end in "=1", and then replace the line with the same text except change the "=1" to "=0".
I'm using BBEdit.
So far I have this to find lines that contains the desired match that also ends with 1:
^(.*test|.*disabled|.*inactive|.*server).*(=1)
I'm unable to do the replacement successfully though.
Here are some example lines of text from the file:
OU>2020,OU>Disabled Accounts,DC>net,DC>example,DC>com=1
OU>Distribution Groups,DC>net,DC>example,DC>com=1
OU>Exchange Servers,DC>net,DC>example,DC>com=1
CN>Users,DC>net,DC>example,DC>com=1
OU>Test Servers,OU>Servers,OU>ABC,DC>net,DC>example,DC>com=1
As an example, the first line above would have its =1 changed to =0 like:
OU>2020,OU>Disabled Accounts,DC>net,DC>example,DC>com=0
Other matches would follow that pattern.
After playing around with it more, this seems to work:
Find:
(^.*(test|disable|inactive|server).*)(=1)$
Replace:
\1=0
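For anyone scripting this outside BBEdit, here is a rough Ruby equivalent of the same find/replace. It is a sketch under assumptions: the /i flag assumes the search is meant to be case-insensitive (the sample lines are capitalized), and the variable names are invented.

```ruby
# A few of the sample lines from the question.
input = <<~TEXT
  OU>2020,OU>Disabled Accounts,DC>net,DC>example,DC>com=1
  OU>Distribution Groups,DC>net,DC>example,DC>com=1
  OU>Exchange Servers,DC>net,DC>example,DC>com=1
  CN>Users,DC>net,DC>example,DC>com=1
TEXT

# Group 1 holds everything before the trailing "=1" on a line
# that contains one of the keywords.
pattern = /^(.*(?:test|disable|inactive|server).*)=1$/i

output = input.gsub(pattern, '\1=0')
puts output
```

Only the "Disabled Accounts" and "Exchange Servers" lines change; the other two contain none of the keywords and keep their =1.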

Copy a table from iPython notebook into Word?

I want to copy a table from iPython notebook into a Word doc. I'm using Word for Mac 2011. The table is a standard pandas output and looks like this:
If I use Apple+C to copy the table, and then paste it into a Word doc, I get this:
Surely there must be an easier way?
Creating a table with the same number of rows/columns in Word and then trying to paste the cells there doesn't work either.
I guess I could screenshot the table, but I'd like to include the raw data in the document if possible.
The problem in this case (from the Word perspective) is not the table layout - it's the paragraph layout. Each paragraph has a substantial indent on right and left, and more space before/after than you would normally want.
I don't think any of the Paste options (e.g. Paste Special) in Word is going to help, unless you paste as unformatted text, then select the text, convert to a table, then proceed from there.
But, even a simple Word VBA macro such as this one will leave you with something a bit more manageable. (Select a table you copied in, then run the macro). A little bit more work on the code would probably allow you to get most of the formatting you want, most of the time.
Sub fixupSelectedTable()
With Selection.Tables(1).Range.ParagraphFormat
.LeftIndent = 0
.RightIndent = 0
.SpaceBefore = 0
.SpaceAfter = 0
.LineSpacingRule = wdLineSpaceSingle
End With
End Sub
If you are more familiar with Applescript, the equivalent looks something like this:
-- you may need to fix up the application name
-- (I use this to ensure that the script uses the Open Word 2011 doc
-- and does not try to start Word for Mac 15 (2016))
tell application "/Applications/Microsoft Office 2011/Microsoft Word.app"
tell the paragraph format of the text object of table 1 of the text object of the selection
set paragraph format left indent to 0
set paragraph format right indent to 0
set space before to 0
set space after to 0
set line spacing rule to line space single
end tell
end tell

extracting runs of text with Mechanize/Nokogiri

Is there a sensible way to extract each run of text in a Mechanize-parsed HTML document, so that (for example):
<p>Here is <b>some</b> text</p>
is broken into three elements:
Here is
some
text
? My hunch is that there's a simple technique using recursive CSS search and/or #flatten, but I've not figured it out yet.
Borrowing from an answer in "Nokogiri recursively get all children":
result = []
doc.traverse { |node| result << node.text if node.text? }
That should give you the array ["Here is ", "some", " text"].
"Getting Mugged by Nokogiri" discusses traverse.
Since you want the contents of each text node, you can do this:
doc.search('//text()').map(&:text)
The only downside to this (and to the other answer) is that you get all the whitespace between elements as well. If you want to suppress this, you can do this:
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
This removes all elements that don't contain a word character.

Remove hard line breaks from text with Ruby

I have some text with hard line breaks in it like this:
This should all be on one line
since it's one sentence.

This is a new paragraph that
should be separate.
I want to remove the single newlines but keep the double newlines so it looks like this:
This should all be on one line since it's one sentence.

This is a new paragraph that should be separate.
Is there a single regular expression to do this? (or some easy way)
So far this is my only solution which works but feels hackish.
txt = txt.gsub(/(\r\n|\n|\r)/,'[[[NEWLINE]]]')
txt = txt.gsub('[[[NEWLINE]]][[[NEWLINE]]]', "\n\n")
txt = txt.gsub('[[[NEWLINE]]]', " ")
Replace all newlines that are not followed by or preceded by a newline:
text = <<END
This should all be on one line
since it's one sentence.

This is a new paragraph that
should be separate.
END
p text.gsub(/(?<!\n)\n(?!\n)/, ' ')
#=> "This should all be on one line since it's one sentence.\n\nThis is a new paragraph that should be separate. "
Or, for Ruby 1.8 without lookarounds:
txt.gsub!(/([^\n])\n([^\n])/, '\1 \2')
text.gsub!(/(\S)[^\S\n]*\n[^\S\n]*(\S)/, '\1 \2')
The two (\S) groups serve the same purposes as the lookarounds ((?<!\s)(?<!^) and (?!\s)(?!$)) in sln's regexes:
they confirm that the linefeed really is in the middle of a sentence, and
they ensure that the [^\S\n]*\n[^\S\n]* part consumes any other whitespace surrounding the linefeed, making it possible for us to normalize it to a single space.
They also make the regex easier to read, and (perhaps most importantly) they work in pre-1.9 versions of Ruby that don't support lookbehinds.
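To see what the extra whitespace handling buys, here is a small comparison of the plain lookaround version and the (\S) version. The sample text, with stray spaces around the first line break, is invented for illustration.

```ruby
# Note the two spaces before and after the first line break.
text = "This should all be on one line  \n  since it's one sentence.\n\n" \
       "This is a new paragraph that\nshould be separate."

# Replaces only the newline itself; surrounding spaces are left in place.
lookaround = text.gsub(/(?<!\n)\n(?!\n)/, ' ')

# Consumes the horizontal whitespace around the newline as well.
normalized = text.gsub(/(\S)[^\S\n]*\n[^\S\n]*(\S)/, '\1 \2')

p lookaround
p normalized
```

Both leave the blank line between paragraphs alone, but only the second collapses the run of spaces around the first break to a single space.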
There is more to formatting (turning off word wrap) than you think.
If the output is a result of a formatting operation, then you should go by
those rules to reverse engineer the original.
For instance, the test you have there is
This should all be on one line
since it's one sentence.

This is a new paragraph that
should be separate.
If you removed just the single newlines only, it would look like this:
This should all be on one line since it's one sentence.
This is a new paragraph thatshould be separate.
Also, other formatting such as intentional newlines will be lost, so something like:
This is Chapter 1
Section a
Section b
Turns into
This is Chapter 1 Section a Section b
Finding the newline in question is easy: /(?<!\n)\n(?!\n)/
But what do you replace it with?
Edit: Actually, it's not that easy even to find standalone newlines, because they sit among horizontal whitespace that is hidden from view.
There are 4 ways to go.
Remove newline, keep the surrounding formatting
$text =~ s/(?<!\s)([^\S\n]*)\n([^\S\n]*)(?!\s)/$1$2/g;
Remove newline and formatting, substitute a space
$text =~ s/(?<!\s)[^\S\n]*\n[^\S\n]*(?!\s)/ /g;
Same as above but ignore newline at beginning or end of string
$text =~ s/(?<!\s)(?<!^)[^\S\n]*\n[^\S\n]*(?!$|\s)/ /g;
$text =~ s/(?<!\s)(?<!^)([^\S\n]*)\n([^\S\n]*)(?!$|\s)/$1$2/g;
Example breakdown of regex (this is the minimum required just to isolate a single newline):
(?<!\s) # Not a whitespace behind us (text,number,punct, etc..)
[^\S\n]* # 0 or more whitespaces, but no newlines
\n # a newline we want to remove
[^\S\n]* # 0 or more whitespaces, but no newlines
(?!\s) # Not a whitespace in front of us (text,number,punct, etc..)
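Since the question is about Ruby, the breakdown above can be kept as-is in a Ruby regex using the /x (extended) flag, which ignores whitespace and allows comments inside the pattern. This is a sketch of the "remove newline and formatting, substitute a space" variant; the constant name and sample text are invented, and it needs Ruby 1.9+ for the lookbehind.

```ruby
SINGLE_NEWLINE = /
  (?<!\s)     # not a whitespace behind us
  [^\S\n]*    # 0 or more whitespaces, but no newlines
  \n          # the newline we want to remove
  [^\S\n]*    # 0 or more whitespaces, but no newlines
  (?!\s)      # not a whitespace in front of us
/x

text = "Chapter one starts \n here.\n\nNext paragraph\ncontinues."

# Each isolated newline, plus the spaces hugging it, becomes one space.
joined = text.gsub(SINGLE_NEWLINE, ' ')
p joined
```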
Well, there is this:
s.gsub /([^\n])\n([^\n])/, '\1 \2'
It won't do anything to leading or trailing newlines. If you don't need leading or trailing white space at all, then you will win with this variation:
s.gsub(/([^\n])\n([^\n])/, '\1 \2').strip
$ ruby -00 -pne 'BEGIN{$\="\n\n"};$_.gsub!(/\n+/,"\0")' file
This should all be on one line since it's one sentence.
This is a new paragraph thatshould be separate.

removing whitespaces in ActionScript 2 variables

let's say that I have an XML file containing this :
<description><![CDATA[
<h2>lorem ipsum</h2>
<p>some text</p>
]]></description>
that I want to get and parse in ActionScript 2 as HTML text, setting some CSS before displaying it. The problem is that Flash takes that whitespace (line feeds and tabs) and displays it as is.
<some whitespace here>
lorem ipsum
some text
where the output I want is
lorem ipsum
some text
I know that I could remove the whitespaces directly from the XML file (the Flash developer at my workplace also suggests this. I guess that he doesn't have any idea on how to do this [sigh]). But by doing this, it would be difficult to read the section in the XML file, especially when lots of tags are involved and that makes editing more difficult.
So now, I'm looking for a way to strip that whitespace in ActionScript. I've tried to use an equivalent of PHP's str_replace (got it from here). But what should I use as a needle (the string to search for)? (I've tried "\t" and "\r", but they don't seem to match that whitespace.)
edit :
now that I've tried to throw in newline as a needle, it works (meaning that newline successfully got stripped).
mystring = str_replace(newline, '', mystring);
But newlines only got stripped once, meaning that in every run of consecutive newlines (e.g. a newline followed by another newline), only one newline is stripped away.
Now, I don't see this as a problem in the str_replace function, since consecutive characters other than newline get stripped away just fine.
Pretty much confused about how stuff like this is handled in ActionScript. :-s
edit 2:
I've tried str_replace -ing everything I know of, \n, \r, \t, newline, and tab (by pressing tab key). Replacing \n, \r, and \t seem to have no effect whatsoever.
I know that by successfully doing this, my content can never have real line breaks. That's exactly my intention. I could format the XML the way I want without Flash displaying any of the formatting stuff. :)
Several ways to approach this. Perhaps the simplest answer is, in one sense your Flash developer is probably right, and you should move your whitespace outside of the CDATA container. The reason being, many people (me at least) tend to assume that everything inside a CDATA is "real data", as opposed to markup. On the other hand, whitespace outside a CDATA is normally assumed to be irrelevant, so data like this:
<description>
<![CDATA[<h2>lorem ipsum</h2>
<p>some text</p>]]>
</description>
would be easier to understand and to work with. (The flash developer can use the XML.ignoreWhite property to ignore the whitespace outside the CDATA.)
With that said, if you're editing the XML by hand, then I can see why it would be easier to use the formatting you describe. However, if the extra whitespace is inside the CDATA, then it will inevitably be included in the String data you extract, so your only option is to grab the content of the CDATA and remove the whitespace afterwards.
Then your question reduces to "how do I strip leading/trailing whitespace from a String in AS2?". And unfortunately, since AS2 doesn't support RegEx there's no simple way to do this. I think your best option would be to parse through from the beginning and end to find the first/last non-white character. Something along these lines (untested pseudocode):
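myString = stuffFromXML;
whitespace = " " + "\t" + "\n" + "\r" + newline;
start = 0;
end = myString.length;
// Skip leading whitespace; the bounds check keeps the loop from running past the end of an all-whitespace string.
while ( start < end && testString( myString.substr(start,1), whitespace ) ) { start++; }
// Skip trailing whitespace, never moving back past start.
while ( end > start && testString( myString.substr(end-1,1), whitespace ) ) { end--; }
trimmedString = myString.substring( start, end );
// Returns true if needle occurs anywhere in haystack.
function testString( needle, haystack ) {
return ( haystack.indexOf( needle ) > -1 );
}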
Hope that helps!
Edit: I notice that in your example you'd also need to remove tabs and whitespace within your text data. This would be tricky, unless you can guarantee that your data will never include "real" tabs in addition to the ones for formatting. No matter what you do with the CDATA tags, it would probably be wiser not to insert extraneous formatting inside your real content and then remove it programmatically afterward. That's just making your own life difficult.
Second edit: As for what character to remove to get rid of newlines, it depends partially on what characters are actually in the XML to begin with (which probably depends on what OS is running where the file is generated), and partially on what character the client machine (that's showing the flash) considers a newline. Lots of gory details here. In practice though, if you remove \r, \n, and \r\n, that usually does the trick. That's why I added both \r and \n to the "whitespace" string in my example code.
It's been a while since I've tinkered with AS2.
someXML = new XML();
someXML.ignoreWhite = true;
If you want to use str_replace, try '\n'.
Is there a reason that you are using CDATA? Admittedly I have no idea what the best practice for this sort of thing is, but I tend to leave it out and just have the HTML sit there inside the node.
var foo = node.childNodes.join("") parses it out just fine, and I never seem to come across these whitespace problems.
I'm reading this over and over again, and if I'm interpreting you right, all you want to know how to do is strip certain characters (tabs and newlines) from a string in AS2, right? I cannot believe no one has given you the simple one line answer yet:
myString = myString.split("\n").join("");
That's it. Repeat that for \r, \n, and \t and all newlines and tabs will be gone. If you want it as an easy function, then do this:
function stripWhiteSpace(str: String) : String
{
return str.split("\r").join("").split("\n").join("").split("\t").join("");
}
That function won't modify your old string, it will return a new one without \r, \n, or \t. To actually modify the old string use that function like this:
myString = stripWhiteSpace(myString);
