extracting runs of text with Mechanize/Nokogiri - ruby-on-rails

Is there a sensible way to extract each run of text in a Mechanize-parsed HTML document, so that (for example):
<p>Here is <b>some</b> text</p>
is broken into three elements:
Here is
some
text
? My hunch is that there's a simple technique using recursive CSS search and/or #flatten, but I've not figured it out yet.

Borrowing from an answer in "Nokogiri recursively get all children":
result = []
doc.traverse { |node| result << node.text if node.text? }
That should give you the array ["Here is ", "some", " text"].
"Getting Mugged by Nokogiri" discusses traverse.

Since you want the contents of each text node, you can do this:
doc.search('//text()').map(&:text)
The only downside to this (and to the other answer) is that you get all the whitespace between elements as well. If you want to suppress this, you can do this:
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
This removes all elements that don't contain a word character.
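To see what that filter keeps, here is a minimal sketch that runs it over a hand-written array standing in for the result of doc.search('//text()').map(&:text). The array contents and the helper name meaningful_runs are assumptions made for illustration, and reject is used as the non-mutating counterpart of delete_if.

```ruby
# Keep only the runs that contain at least one word character.
# (Hypothetical helper; the name is not part of Nokogiri.)
def meaningful_runs(runs)
  runs.reject { |run| run !~ /\w/ }
end

# Assumed stand-in for doc.search('//text()').map(&:text) on indented markup:
# the first and last entries are the whitespace between elements.
sample_runs = ["\n  ", "Here is ", "some", " text", "\n"]

p meaningful_runs(sample_runs)
```

Note that a run made up only of punctuation contains no \w character, so this filter drops it as well; matching on /\S/ instead would keep such runs.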

Related

Remove contents within a specific tag

Using Rails 3.2. I want to remove the <b> tags together with all the text inside them, but so far I have only found ways to strip the tags themselves:
string = "
<p>
<b>Section 1</b>
Everything is good.<br>
<b>Section 2</b>
All is well.
</p>"
string.strip_tags
# => "Section 1 Everything is good. Section 2 All is well."
I want to achieve this:
"Everything is good. All is well."
Should I add regex matching too?
The "right" way would be to use an html parser like Nokogiri.
However for this simple task, you may use a regex. It's quite simple:
Search for : (?m)<b\s*>.*?<\/b\s*> and replace it with empty string. After that, use strip_tags.
Regex explanation:
(?m) # set the m modifier to match newlines with dots .
<b # match <b
\s* # match a whitespace zero or more times
> # match >
.*? # match anything ungreedy until </b found
<\/b # match </b
\s* # match a whitespace zero or more times
> # match >
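Put together in Ruby, the approach might look like the sketch below. The helper name remove_bold_sections is made up for illustration, and because strip_tags is a Rails helper, a naive tag-stripping regex stands in for it here so the snippet runs in plain Ruby.

```ruby
# Remove every <b>...</b> pair with its contents, then drop the remaining tags.
# (Hypothetical helper; the second gsub is a rough stand-in for Rails' strip_tags.)
def remove_bold_sections(html)
  # Lazy match; the /m flag lets . cross newlines, like the (?m) prefix above.
  without_bold = html.gsub(/<b\s*>.*?<\/b\s*>/m, "")
  # Replace anything that looks like a tag with a space so words do not fuse.
  text = without_bold.gsub(/<[^>]+>/, " ")
  # Collapse the whitespace left behind by the removed markup.
  text.split.join(" ")
end

string = "\n<p>\n<b>Section 1</b>\nEverything is good.<br>\n<b>Section 2</b>\nAll is well.\n</p>"
puts remove_bold_sections(string)
```

A regex like this is fine for small, predictable snippets; it does not cope with nested <b> tags or with a > inside an attribute value.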
It would be much better to use an HTML/XML parser for this task. Ruby does not have a native one, but Nokogiri is good and wraps libxml/xslt:
doc = Nokogiri::HTML.fragment(string) # HTML parser, since the input contains an unclosed <br>
doc.css("b").remove
result = doc.text # or .to_html to include `<p>`
You can do string.gsub(/<b>.*<\/b>/, '')
http://rubular.com/r/hhmpY6Q6fX
If you want to sanitize the markup, you can try this:
ActionController::Base.helpers.sanitize("test<br>test<br>test<br> test")
If you want to remove all the tags, you need to use this:
ActionView::Base.full_sanitizer.sanitize("test<br>test<br>test<br> test")
These two differ slightly. The first one is good for dealing with script tags to prevent XSS attacks, but it doesn't remove ordinary tags. The second one removes every HTML tag in the text.

Remove \text generated by TeXForm

I need to remove all \text generated by TeXForm in Mathematica.
What I am doing now is this:
MyTeXForm[a_]:=StringReplace[ToString[TeXForm[a]], "\\text" -> ""]
But the result keeps the braces, for example:
for a=fx,
the result of TeXForm[a] is \text{fx}
the result of MyTeXForm[a] is {fx}
But what I would like is it to be just fx
You should be able to use string patterns. Based on http://reference.wolfram.com/mathematica/tutorial/StringPatterns.html, something like the following should work:
MyTeXForm[a_]:=StringReplace[ToString[TeXForm[a]], "\\text{"~~s___~~"}"->s]
I don't have Mathematica handy right now, but this should say 'Match "\text{" followed by zero or more characters that are stored in the variable s, followed by "}", then replace all of that with whatever is stored in s.'
UPDATE:
The above works in the simplest case of there being a single "\text{...}" element, but the pattern s___ is greedy, so on input a+bb+xx+y, which Mathematica's TeXForm renders as "a+\text{bb}+\text{xx}+y", it matches everything between the first "\text{" and last "}" --- so, "bb}+\text{xx" --- leading to the output
In[1]:= MyTeXForm[a+bb+xx+y]
Out[1]= a+bb}+\text{xx+y
A fix for this is to wrap the pattern with Shortest[], leading to a second definition
In[2]:= MyTeXForm2[a_] := StringReplace[
ToString[TeXForm[a]],
Shortest["\\text{" ~~ s___ ~~ "}"] -> s
]
which yields the output
In[3]:= MyTeXForm2[a+bb+xx+y]
Out[3]= a+bb+xx+y
as desired.
Unfortunately this still won't work when the text itself contains a closing brace. For example, the input f["a}b","c}d"] (for some reason...) would give
In[4]:= MyTeXForm2[f["a}b","c}d"]]
Out[4]= f(a$\$b},c$\$d})
instead of "f(a$\}$b,c$\}$d)", which would be the proper processing of the TeXForm output "f(\text{a$\}$b},\text{c$\}$d})".
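The same greedy-versus-shortest behaviour can be reproduced with a regular expression outside Mathematica. The Ruby sketch below is only an analogy, not Mathematica code: the method names are invented, and the third pattern is one possible way to tolerate an escaped \} inside the braces (it still assumes there are no nested braces).

```ruby
# Greedy: .* runs to the LAST closing brace, like the plain s___ pattern.
def strip_text_greedy(tex)
  tex.gsub(/\\text\{(.*)\}/) { $1 }
end

# Lazy: .*? stops at the FIRST closing brace, like Shortest[...].
def strip_text_lazy(tex)
  tex.gsub(/\\text\{(.*?)\}/) { $1 }
end

# Escape-aware: the body is any mix of non-brace, non-backslash characters
# and backslash-escaped characters, so \} does not end the match early.
def strip_text_escaped(tex)
  tex.gsub(/\\text\{((?:[^{}\\]|\\.)*)\}/) { $1 }
end

simple  = 'a+\text{bb}+\text{xx}+y'
escaped = 'f(\text{a$\}$b},\text{c$\}$d})'

puts strip_text_greedy(simple)    # a+bb}+\text{xx+y
puts strip_text_lazy(simple)      # a+bb+xx+y
puts strip_text_lazy(escaped)     # f(a$\$b},c$\$d})
puts strip_text_escaped(escaped)  # f(a$\}$b,c$\}$d)
```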
This is what I did (works fine for me):
MyTeXForm[a_] := ToString[ToExpression[StringReplace[ToString[TeXForm[a]], "\\text" -> ""]][[1]]]
This is a really late reply, but I just came up against the same issue and discovered a simple solution. Put a space between the variables in the Mathematica expression that you wish to convert using TeXForm.
For the original poster's example, the following code works great:
a=f x
TeXForm[a]
The output is as desired: f x
Since LaTeX will ignore that space in math mode, things will format correctly.
(As an aside, I was having the same issue with subscripted expressions that have two side-by-side variables in the subscript. Inserting a space between them solved the issue.)

Interpret newlines as <br>s in markdown (Github Markdown-style) in Ruby

I'm using markdown for comments on my site and I want users to be able to create line breaks by pressing enter instead of space space enter (see this meta question for more details on this idea)
How can I do this in Ruby? You'd think Github Flavored Markdown would be exactly what I need, but (surprisingly), it's quite buggy.
Here's their implementation:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
end
This logic requires that the line start with a \w for a linebreak at the end to create a <br>. The reason for this requirement is that you don't want to mess with lists (but see the edit below; I'm not even sure this makes sense):
* we don't want a <br>
* between these two list items
However, the logic breaks in these cases:
[some](http://google.com)
[links](http://google.com)
*this line is in italics*
another line
> the start of a blockquote!
another line
I.e., in all of these cases there should be a <br> at the end of the first line, and yet GFM doesn't add one.
Oddly, this works correctly in the javascript version of GFM.
Does anyone have a working implementation of "new lines to <br>s" in Ruby?
Edit: It gets even more confusing!
If you check out Github's official Github Flavored Markdown repository, you'll find yet another newline to <br> regex!:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/(\A|^$\n)(^\w[^\n]*\n)(^\w[^\n]*$)+/m) do |x|
x.gsub(/^(.+)$/, "\\1  ")
end
I have no clue what this regex means, but it doesn't do any better on the above test cases.
Also, it doesn't look like the "don't mess with lists" justification for requiring that lines start with word characters is valid to begin with. I.e., standard markdown list semantics don't change regardless of whether you add 2 trailing spaces. Here:
item 1
item 2
item 3
In the source of this question there are 2 trailing spaces after "item 1", and yet if you look at the HTML, there is no superfluous <br>
This leads me to think the best regex for converting newlines to <br>s is just:
text.gsub!(/^[^\n]+\n+/) do |x|
x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
end
Thoughts?
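For reference, the proposed rule can be wrapped in a small method and tried on the failing cases above. This is a sketch, with the method name newlines_to_breaks chosen for illustration; it returns a new string instead of mutating in place, and it appends the two trailing spaces that Markdown treats as a hard line break.

```ruby
# Append two trailing spaces to every non-empty line that is followed by a
# single newline; lines followed by a blank line (a paragraph break) are left alone.
# (Hypothetical helper name.)
def newlines_to_breaks(text)
  text.gsub(/^[^\n]+\n+/) do |chunk|
    chunk =~ /\n{2}/ ? chunk : (chunk.strip + "  \n")
  end
end

sample = "[some](http://google.com)\n" \
         "[links](http://google.com)\n" \
         "\n" \
         "> the start of a blockquote!\n" \
         "another line\n"

# inspect makes the added trailing spaces visible
puts newlines_to_breaks(sample).inspect
```

Because strip also removes leading whitespace, this flattens indented lines such as code blocks; using rstrip instead would avoid that.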
I'm not sure if this will help, but I just use simple_format() from ActionView::Helpers::TextHelper:
my_text = "Here is some basic text...\n...with a line break."
simple_format(my_text)
output => "<p>Here is some basic text...\n<br />...with a line break.</p>"
Even if it doesn't meet your specs, looking at the simple_format() source code .gsub! methods might help you out writing your own version of required markdown.
A little too late, but perhaps useful for other people. I've gotten it to work (but not thoroughly tested) by preprocessing the text using regular expressions, like so. It's hideous as a result of the lack of zero-width lookbehinds, but oh well.
# Append two spaces to a simple line, if it ends in newline, to render the
# markdown properly. Note: do not do this for lists, instead insert two newlines. Also, leave double newlines
# alone.
text.gsub! /^ ([\*\+\-]\s+|\d+\s+)? (.+?) (\ \ )? \r?\n (\r?\n|[\*\+\-]\s+|\d+\s+)? /xi do
full, pre, line, spaces, post = $~.to_a
if post != "\n" && pre.blank? && post.blank? && spaces.blank?
"#{pre}#{line}  \n#{post}"
elsif pre.present? || post.present?
"#{pre}#{line}\n\n#{post}"
else
full
end
end

Regular expressions: How do I grab a block of text using regex? (in ruby)

I'm using ruby and I'm trying to find a way to grab text in between the {start_grab_entries} and {end_grab_entries} like so:
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
Something like so:
$1 => "i want to grab
the text that
you see here in
the middle"
So far, I tried this as my regular expression:
\{start_grab_entries}(.|\n)*\{end_grab_entries}
However, using $1, that gives me a blank. Do you know what I can do to grab that block of text in between the tags correctly?
There is a better way to allow the dot to match newlines (/m modifier):
regexp = /\{start_grab_entries\}(.*?)\{end_grab_entries\}/m
Also, make the * lazy by appending a ?, or you might match too much if more than one such section occurs in your input.
That said, the reason why you got a blank match is that you repeated the capturing group itself; therefore you only caught the last repetition (in this case, a \n).
It would have "worked" if you had put the capturing group outside of the repetition:
\{start_grab_entries\}((?:.|\n)*)\{end_grab_entries\}
but, as said above, there is a better way to do that.
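The difference is easy to check in irb. The sketch below uses a shortened version of the sample text, and the method names are made up for illustration.

```ruby
# Original attempt: the capturing group itself is repeated, so group 1 only
# holds the last character it matched.
def grab_repeated_group(text)
  text[/\{start_grab_entries\}(.|\n)*\{end_grab_entries\}/, 1]
end

# Suggested fix: one lazy group, with /m so that . also matches newlines.
def grab_lazy(text)
  text[/\{start_grab_entries\}(.*?)\{end_grab_entries\}/m, 1]
end

sample = "{start_grab_entries}\ni want to grab\nthe text that\n{end_grab_entries}\n"

p grab_repeated_group(sample)  # "\n"
p grab_lazy(sample).strip      # "i want to grab\nthe text that"
```

The captured block still includes the newlines next to the markers, hence the strip in the last line.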
I'm adding this because often we're reading data from a file or data-stream where the range of lines we want are not all in memory at once. "Slurping" a file is discouraged if the data could exceed the available memory, something that easily happens in production corporate environments. This is how we'd grab lines between some boundary markers as the file is being scanned. It doesn't rely on regex, instead using Ruby's "flip-flop" .. operator:
#!/usr/bin/ruby
lines = []
DATA.each_line do |line|
lines << line if (line['{start_grab_entries}'] .. line['{end_grab_entries}'])
end
puts lines # << lines with boundary markers
puts
puts lines[1 .. -2] # << lines without boundary markers
__END__
this is not captured
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
this is not captured either
Output of this code would look like:
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
i want to grab
the text that
you see here in
the middle
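If the flip-flop operator feels too obscure, roughly the same streaming behaviour can be sketched with an explicit state flag. The method name lines_between and the StringIO input are assumptions for illustration; any object that responds to each_line, such as an open File, should work the same way.

```ruby
require "stringio"

# Collect the lines strictly between two marker lines while reading the
# input one line at a time, so the whole source never has to be in memory.
# (Hypothetical helper name.)
def lines_between(io, start_marker, end_marker)
  inside = false
  collected = []
  io.each_line do |line|
    if line.include?(end_marker)
      inside = false          # closing marker: stop collecting
    elsif inside
      collected << line       # a line between the markers
    elsif line.include?(start_marker)
      inside = true           # opening marker: start with the next line
    end
  end
  collected
end

input = StringIO.new(
  "this is not captured\n" \
  "{start_grab_entries}\n" \
  "i want to grab\n" \
  "the middle\n" \
  "{end_grab_entries}\n" \
  "this is not captured either\n"
)

puts lines_between(input, "{start_grab_entries}", "{end_grab_entries}")
```

Unlike the flip-flop version, this one never stores the marker lines, so no slicing is needed afterwards.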
string=<<EOF
blah
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
blah
EOF
puts string.scan(/{start_grab_entries}(.*?){end_grab_entries}/m)

removing whitespaces in ActionScript 2 variables

let's say that I have an XML file containing this :
<description><![CDATA[
<h2>lorem ipsum</h2>
<p>some text</p>
]]></description>
that I want to get and parse in ActionScript 2 as HTML text, setting some CSS before displaying it. The problem is that Flash takes that whitespace (line feeds and tabs) and displays it as is:
<some whitespace here>
lorem ipsum
some text
where the output I want is
lorem ipsum
some text
I know that I could remove the whitespaces directly from the XML file (the Flash developer at my workplace also suggests this. I guess that he doesn't have any idea on how to do this [sigh]). But by doing this, it would be difficult to read the section in the XML file, especially when lots of tags are involved and that makes editing more difficult.
So now, I'm looking for a way to strip those whitespaces in ActionScript. I've tried to use PHP's str_replace equivalent (got it from here). But what should I use as a needle (string to search) ? (I've tried to put in "\t" and "\r", don't seem to be able to detect those whitespaces).
edit :
now that I've tried to throw in newline as a needle, it works (meaning that newline successfully got stripped).
mystring = str_replace(newline, '', mystring);
But newlines only got stripped once: in a run of consecutive newlines (e.g. a newline followed by another newline), only one of them is stripped away.
Now, I don't see this as a problem in the str_replace function, since consecutive characters other than newline get stripped away just fine.
Pretty much confused about how stuff like this is handled in ActionScript. :-s
edit 2:
I've tried str_replace -ing everything I know of, \n, \r, \t, newline, and tab (by pressing tab key). Replacing \n, \r, and \t seem to have no effect whatsoever.
I know that by successfully doing this, my content can never have real line breaks. That's exactly my intention. I could format the XML the way I want without Flash displaying any of the formatting stuff. :)
Several ways to approach this. Perhaps the simplest answer is, in one sense your Flash developer is probably right, and you should move your whitespace outside of the CDATA container. The reason being, many people (me at least) tend to assume that everything inside a CDATA is "real data", as opposed to markup. On the other hand, whitespace outside a CDATA is normally assumed to be irrelevant, so data like this:
<description>
<![CDATA[<h2>lorem ipsum</h2>
<p>some text</p>]]>
</description>
would be easier to understand and to work with. (The flash developer can use the XML.ignoreWhite property to ignore the whitespace outside the CDATA.)
With that said, if you're editing the XML by hand, then I can see why it would be easier to use the formatting you describe. However, if the extra whitespace is inside the CDATA, then it will inevitably be included in the String data you extract, so your only option is to grab the content of the CDATA and remove the whitespace afterwards.
Then your question reduces to "how do I strip leading/trailing whitespace from a String in AS2?". And unfortunately, since AS2 doesn't support RegEx there's no simple way to do this. I think your best option would be to parse through from the beginning and end to find the first/last non-white character. Something along these lines (untested pseudocode):
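var myString:String = stuffFromXML;
var whitespace:String = " " + "\t" + "\n" + "\r" + newline;
var start:Number = 0;
var end:Number = myString.length;
// skip leading whitespace; the start < end guard stops the loop on an all-whitespace string
while ( start < end && testString( myString.substr(start,1), whitespace ) ) { start++; }
// skip trailing whitespace
while ( end > start && testString( myString.substr(end-1,1), whitespace ) ) { end--; }
var trimmedString:String = myString.substring( start, end );
// true if needle occurs anywhere in haystack
function testString( needle:String, haystack:String ):Boolean {
return ( haystack.indexOf( needle ) > -1 );
}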
Hope that helps!
Edit: I notice that in your example you'd also need to remove tabs and whitespace within your text data. This would be tricky, unless you can guarantee that your data will never include "real" tabs in addition to the ones for formatting. No matter what you do with the CDATA tags, it would probably be wiser not to insert extraneous formatting inside your real content and then remove it programmatically afterward. That's just making your own life difficult.
Second edit: As for what character to remove to get rid of newlines, it depends partially on what characters are actually in the XML to begin with (which probably depends on what OS is running where the file is generated), and partially on what character the client machine (that's showing the flash) considers a newline. Lots of gory details here. In practice though, if you remove \r, \n, and \r\n, that usually does the trick. That's why I added both \r and \n to the "whitespace" string in my example code.
It's been a while since I've tinkered with AS2.
someXML = new XML();
someXML.ignoreWhite = true;
if you wanted to str_replace try '\n'
Is there a reason that you are using CDATA? Admittedly I have no idea what the best practice for this sort of thing is, but I tend to leave it out and just have the HTML sit there inside the node.
var foo = node.childNodes.join("") parses it out just fine and I never seem to come across these whitespace problems.
I'm reading this over and over again, and if I'm interpreting you right, all you want to know how to do is strip certain characters (tabs and newlines) from a string in AS2, right? I cannot believe no one has given you the simple one line answer yet:
myString = myString.split("\n").join("");
That's it. Repeat that for \r, \n, and \t and all newlines and tabs will be gone. If you want it as an easy function, then do this:
function stripWhiteSpace(str: String) : String
{
return str.split("\r").join("").split("\n").join("").split("\t").join("");
}
That function won't modify your old string, it will return a new one without \r, \n, or \t. To actually modify the old string use that function like this:
myString = stripWhiteSpace(myString);
