Capybara rejects text with "<" (less-than character) - ruby-on-rails

I have a spec to test if I'm able to show some different characters inside a given element. Say we have the element:
<p class="my-strange-characters-text">
"Here they are: \" & ; ' > # <"
</p>
The problem is that, with Capybara's default driver, it doesn't retrieve the "<" character.
In my spec, if I do:
first(".my-strange-characters-text").text
The output is
Here they are: \" & ; ' > #
No "<" character! (nor whatever I insert after)
BUT, if i use :js => true, that will invoke the Poltergeist driver, it returns the text correctly.
I don't want to use :js => true on this specific text.
Obs:
I've tried '<', \< and other tricks to make it appear, but no success.
Any hint?

In HTML, a literal < has to be written as the entity &lt;, and ampersands must likewise be escaped as &amp;.
Browsers try their best to interpret invalid HTML, which is probably why Poltergeist (which uses WebKit under the hood) is able to guess at the text you wanted to insert.
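For reference, a minimal sketch of what this could look like once the view escapes the characters; the spec syntax, the path and the escaped markup below are assumptions, not taken from the question:
# The element once the view escapes the special characters:
#   <p class="my-strange-characters-text">
#     "Here they are: \" &amp; ; ' &gt; # &lt;"
#   </p>
it "shows the strange characters" do
  visit "/strange-characters"                      # hypothetical path
  text = first(".my-strange-characters-text").text
  expect(text).to include("> # <")                 # the trailing "<" is no longer dropped
end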

Related

Problem with attachments' character encoding using gmail gem in ruby/rails

What I am doing:
I am using the gmail gem in a Rails 4 app to get email attachments from a specific account at regular intervals. Here is an extract from the core part (here for simplicity only considering the first email and its first attachment):
require 'gmail'
Gmail.connect(@user_email, @user_password) do |gmail|
  if gmail.logged_in?
    emails = gmail.inbox.emails(:from => @sender_email)
    email = emails[0]
    attachment = email.message.attachments[0]
    File.open("~/temp.csv", 'w') do |file|
      file.write(
        StringIO.new(attachment.decoded.to_s[2..-2].force_encoding("ISO-8859-15").encode!('UTF-8')).read
      )
    end
  end
end
The encoding of the attached file can vary. The particular one that I am currently having issues with is in Finnish. It contains Finnish characters and a superscripted 3 character.
This is what I expect to get when I run the above code (it is also what I get when I download the attachment manually through the Gmail user interface):
What the problem is:
However, I am getting the following odd results.
From cat temp.csv (looks good to me):
With nano temp.csv (here I have no idea what I am looking at):
This is what temp.csv looks like opened in Sublime Text (directly via WinSCP). The first line and small parts look OK, but then Chinese/Japanese characters appear:
This is what temp.csv looks like in Notepad (after download via WinSCP). Looks OK, except that a blank space has been inserted between each character and the newlines seem to be missing:
What I have tried:
I have without success tried:
.force_encoding(...) with all the different "ISO-8859-x" character sets
putting the force_encoding("ISO-8859-15").encode!('UTF-8') outside the .read (works but doesn't solve the problem)
encode to UTF-8 without first forcing another encoding but this leads to Encoding::UndefinedConversionError: "\xC4" from ASCII-8BIT to UTF-8
writing as binary with 'wb' and 'w+b' in the File.open() (which oddly doesn't seem to make a difference to the outcome).
searching stackoverflow and the web for other ideas.
Any ideas would be much appreciated!
Not beautiful, but it will work for me now.
After re-encoding, I convert the string to a char array, then remove the chars I do not want and then join the remaining array elements to form a string.
decoded_att = attachment.decoded
data = decoded_att.encode("UTF-8", "ISO-8859-1", invalid: :replace, undef: :replace).gsub("\r\n", "\n")
data_as_array = data.chars
data_as_array = data_as_array.delete_if {|i| i == "\u0000" || i == "ÿ" || i == "þ"}
data = data_as_array.join('').to_s
File.write("~/temp.csv", data.to_s)
This will work for me now. However, I have no idea how these characters have ended up in the attachment ("ÿ" and "þ" in the start of the document and "\u0000" between all remaining characters).
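For what it's worth, "ÿþ" at the start (the bytes 0xFF 0xFE read as Latin-1) together with a NUL between every remaining character is exactly what a UTF-16LE file with a byte-order mark looks like when it is treated as a single-byte encoding. If that is really what the attachment contains (an assumption, not something the question confirms), decoding it directly would avoid the character stripping:
# Sketch under the assumption that the attachment bytes are UTF-16LE with a BOM.
raw  = attachment.decoded
data = raw.dup.force_encoding("UTF-16LE").encode("UTF-8")  # the interleaved NULs become part of normal characters
data = data.sub("\uFEFF", "").gsub("\r\n", "\n")           # drop the BOM, normalize line endings
File.write(File.expand_path("~/temp.csv"), data)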
It seems like you need to do attachment.body.decoded instead of attachment.decoded
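A short sketch of that suggestion, combined with a binary write so nothing gets re-encoded on the way to disk (note that File.open does not expand "~", so it is expanded explicitly here):
attachment = email.message.attachments[0]
raw = attachment.body.decoded                      # instead of attachment.decoded
File.open(File.expand_path("~/temp.csv"), "wb") do |file|
  file.write(raw)
end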

RoR: Handling Blanks AND/OR Special Characters

I'm processing emails for upload, and occasionally an embedded image in the email comes through either without a file extension or with an extension containing a random combination of letters, numbers and special characters (for example: image001.gif#01CFA02B.47556390). If either case arrives, I want to ignore it and move on. I think I've got the missing extension covered, but I wasn't clear on how best to handle the random characters, nor on the cleanest way to write the conditionals. Here is what I have so far:
filename_extension = File.extname(filename)
if filename_extension.blank?
  puts "FILENAME EXT IS BLANK"
elsif filename_extension # NEED REGEX or something to handle Random?
  puts "FILENAME EXT IS Random"
else # DO PROCESSING
Thanks.
known_extensions = %w[.csv .rb .rbw .html .htm .css]
filenames = %w[1.txt 2.csv 3]
filenames.each do |filename|
  filename_extension = File.extname(filename)
  if filename_extension.empty?
    puts "FILENAME EXT IS BLANK"
  elsif !known_extensions.include?(filename_extension)
    puts "FILENAME EXT IS Random"
  else # DO PROCESSING
    puts "Processing"
  end
end
The question was tagged ruby without any indication of gems that might give you the blank? method.
The idea of an 'invalid' extension is rather varied, and of course tied to what it means to be a valid file name. On most Unix file systems, for example, the only limitations on a file name are the 255-byte limit on its length and the reserved characters / and null. In fact, there is no specification that I am aware of for 'extensions' on Unix; they are simply part of the file name, and a period in a file name is valid without signifying anything special (with the exception of a file name that starts with a period, which marks a 'hidden' file). On a Windows system, it is a longer list of characters, some of which are / : < > ? \ | + , . ; = [ ] as well as the single and double quotes. On my Commodore, I think it was :, the comma, and =, and on my Amiga I could use anything except for ", :, or /.
So I think an 'invalid' extension might be easier to match than the 'valid' ones. If you are indeed using Rails, and hosting on Unix, then you have a smaller set of things to check for to ensure a valid extension (indeed, a valid filename). Base what counts as an invalid extension on your hosting system, plus any restrictions you would add because of what a valid extension means to your program.
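If a heuristic is good enough, one way to express "looks like a normal extension" is a small whitelist-style regex rather than a list of known extensions. This is only a sketch; the allowed length and character set are assumptions you would tune to your own uploads:
# A dot followed by one to five letters or digits, and nothing else.
NORMAL_EXTENSION = /\A\.[A-Za-z0-9]{1,5}\z/

filename = "image001.gif#01CFA02B.47556390"  # example from the question
filename_extension = File.extname(filename)  # => ".47556390"

if filename_extension.empty?
  puts "FILENAME EXT IS BLANK"
elsif filename_extension !~ NORMAL_EXTENSION
  puts "FILENAME EXT IS Random"
else
  puts "Processing"
end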

xpath with contains throws error if string starts with a number

I'm running into a strange problem with Nokogiri and XPath. I want to parse an HTML document and get all links by href value and the anchor text they contain.
Here's my xpath so far:
xpath = "//a[contains(text(), #{link['anchor_text']}) and #href='#{link['target_url']}']"
a = doc.search(xpath)
This works fine so far as long as link['anchor_text'] is a string without numbers.
If I'm trying to get a link with the anchor text "11example" it throws the following error:
Invalid expression: //a[contains(text(), 11example) and @href='http://www.example.com/']
Maybe it's just a stupid mistake, but I'm not seeing why this error occurs. If I put quotes around the #{link['anchor_text']} in the xpath, nothing works either.
Edit: Here's the sample HTML:
<!DOCTYPE html>
<head>
<title>Example.com</title>
</head>
<body>
<p>
<strong>Here is some text</strong><br />
<a href="http://www.example.com/">11example</a>Some text here and there
</p>
<p>
<strong>Another text</strong><br />
<a href="http://www.example.com/">example.com</a>Some text here and there
</p>
</body>
Edit2: If I run these queries manually in irb console everything works as expected, but only if I put the text in quotes.
Thanks in advance!
Kind regards,
madhippie
The simple answer is that you are missing quotes around #{link['anchor_text']}, like you have around #{link['target_url']}. The full XPath should be
xpath = "//a[contains(text(), '#{link['anchor_text']}') and #href='#{link['target_url']}']"
The reason it appears to work (at least not produce an error) when you don’t start with a number is that the string is being interpreted as a node query. For example Nokogiri is looking for a tag named <example.com> inside the <a> tag, then converting it to a string and seeing if the text nodes of the <a> tag contain that string. If the tag isn’t there (as in this case) then the result of contains is always true.
As a demonstration, with the HTML:
<a><q>foo</q>example</a>
<a><q>foo</q>foo</a>
<a>foo</a>
Then the query
doc.search("//a[contains(text(), q)]")
doesn’t match the first <a> tag, but does match the second and third.
When the string starts with a number, it can’t be parsed into a node query since names starting with digits aren’t valid XML (or HTML) element names, so you get an error.
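Putting it together, a small self-contained sketch (the variable names mirror the question, and the HTML is just the sample above):
require 'nokogiri'

html = <<-HTML
<p><a href="http://www.example.com/">11example</a>Some text here and there</p>
HTML

doc  = Nokogiri::HTML(html)
link = { 'anchor_text' => '11example', 'target_url' => 'http://www.example.com/' }

# With the quotes, the interpolated value is an XPath string literal,
# not a node test, so it is allowed to start with a digit.
xpath = "//a[contains(text(), '#{link['anchor_text']}') and @href='#{link['target_url']}']"
doc.search(xpath).each { |a| puts a.text }
# => 11example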

RegEx Not working in Ruby!

I am using the following regex
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s))
to match the name [ Burkhart, Peterson & Company ] in this
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Generally parsing (X)HTML using Regular Expressions is bad practice. Ruby has the fantastic Nokogiri Library which uses libxml2 for parsing XHTML efficiently.
With that being said, your . does not match newlines. Use the m modifier for your regexp, which tells the . to match newlines, or the Regexp::MULTILINE constant. Documented here.
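For example, these two spellings of the flag are equivalent (just a quick illustration):
re_literal = /Name:<\/td>(.*?)<\/td>/m
re_class   = Regexp.new('Name:</td>(.*?)</td>', Regexp::MULTILINE)

re_literal.options == re_class.options   # => true, both have the "dot matches newline" flag set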
Your regular expression is also capturing the HTML before the text you require.
Using nokogiri and XPath would mean you could grab the content of this table cell by referring to its CSS class. Like this:
#!/usr/bin/env ruby
require 'nokogiri'
doc = Nokogiri::HTML DATA.read
p doc.at("td[#class='generalinfo_right']").text
__END__
<td class="generalinfo_left" align="right">Name:</td>
<td class="generalinfo_right">Burkhart, Peterson & Company</td>
Which will return "Burkhart, Peterson & Company"
/m makes the dot match newlines
You'll want to use /m for multiline mode:
str.scan(/Name:<\/td>(.*?)<\/td>/m)
html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/s)) doesn't match the newline characters; and even if it did match them, the (.*?) part would grab everything between the two </td> tags, including the opening <td class="generalinfo_right">.
To make the regular expression more generic, and allow to match the exact text you want, you should change the code to
html.scan(Regexp.new(/Name:<\/td><td[^>]*>(.*?)<\/td>/s))
The regular expression could be better written, though.
I would also suggest not parsing HTML/XHTML content with regular expressions.
You can verify that all the answers suggesting you add /m or Regexp::MULTILINE are correct by going to rubular.com.
I also verified the solution in the console, and also modified the regex so that it returns only the name instead of all the extra junk.
Loading development environment (Rails 2.3.8)
ree-1.8.7-2010.02 > html = '<td class="generalinfo_left" align="right">Name:</td>
ree-1.8.7-2010.02'> <td class="generalinfo_right">Burkhart, Peterson & Company</td>
ree-1.8.7-2010.02'> '
=> "<td class="generalinfo_left" align="right">Name:</td>\n<td class="generalinfo_right">Burkhart, Peterson & Company</td>\n"
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>(.*?)<\/td>/m))
=> [["\n<td class="generalinfo_right">Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 > html.scan(Regexp.new(/Name:<\/td>.*<td[^>]*>(.*?)<\/td>/m))
=> [["Burkhart, Peterson & Company"]]
ree-1.8.7-2010.02 >

Help with grep in BBEdit

I'd like to grep the following in BBEdit.
Find:
<dc:subject>Knowledge, Mashups, Politics, Reviews, Ratings, Ranking, Statistics</dc:subject>
Replace with:
<dc:subject>Knowledge</dc:subject>
<dc:subject>Mashups</dc:subject>
<dc:subject>Politics</dc:subject>
<dc:subject>Reviews</dc:subject>
<dc:subject>Ratings</dc:subject>
<dc:subject>Ranking</dc:subject>
<dc:subject>Statistics</dc:subject>
OR
Find:
<dc:subject>Social web, Email, Twitter</dc:subject>
Replace with:
<dc:subject>Social web</dc:subject>
<dc:subject>Email</dc:subject>
<dc:subject>Twitter</dc:subject>
Basically, when there's more than one category, I need to find the comma and space, add a line break and wrap the open/close tags around the category.
Any thoughts?
Wow. Lots of complex answers here. How about find:
,
(there's a space after the comma)
and replace with:
</dc:subject>\r<dc:subject>
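If you want to sanity-check that pattern outside BBEdit, the same transformation is a one-line gsub in Ruby (just a sketch, using \n instead of BBEdit's \r):
line = "<dc:subject>Social web, Email, Twitter</dc:subject>"
puts line.gsub(", ", "</dc:subject>\n<dc:subject>")
# <dc:subject>Social web</dc:subject>
# <dc:subject>Email</dc:subject>
# <dc:subject>Twitter</dc:subject>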
Find:
(.+?),\s?
Replace:
\1\r
I'm not sure what you meant by "wrap the open/close around the category", but if you mean that you want to wrap it in some sort of tag or link, just add it to the replace.
Replace:
\1\r
Would give you
Social web
Email
Twitter
Or get fancier with Replace:
<a href="\1">\1</a>\r
Would give you
<a href="Social web">Social web</a>
<a href="Email">Email</a>
<a href="Twitter">Twitter</a>
In that last example you may have a problem with the “Social web” URL having a space in it. I wouldn't recommend that, but I wanted to show you that you could use the \1 backreference more than once.
The Grep reference in the BBEdit Manual is fantastic. Go to Help->User Manual and then Chapter 8. Learning how to use RegEx well will change your life.
UPDATE
Weird, when I first looked at this it didn't show me your full example. Based upon what I see now you should
Find:
(.+?),\s?
Replace:
<dc:subject>\1</dc:subject>\r
I don't use BBEdit, but in Vim you can do this:
%s/<dc:subject>\(\_[^<]\+\)<\/dc:subject>/\=substitute(submatch(0), ",[ \t]*", "<\/dc:subject>\r<dc:subject>", "g")/g
It will handle multiple lines and tags whose content spans line breaks. It handles lines with multiple <dc:subject> elements too, but won't always get the newline between the closing and opening tags.
If you post this to the Google group vim_use and ask for a Vim solution and the corresponding Perl version of it, you would probably get a bunch of suggestions, including something that works in BBEdit and also works outside any editor in Perl.
Don
You can use sed to do this too; in theory you just need to replace ", " with the closing and opening <dc:subject> tags with a newline character in between, and output to a new file. But sed doesn't seem to like the HTML angle brackets... I tried escaping them but still get error messages any time they're included. This is all I had time for so far, so if I get a chance to come back to it I will. Maybe someone else can solve the angle bracket issue:
sed s/, /</dc:subject>\n<dc:subject>/g file.txt > G:\newfile.txt
OK, I think I figured it out. Basically I had to put the replacement text containing angle brackets in double quotes and change the separator character sed uses to something other than a forward slash, since a slash appears in the replacement text and sed didn't like it. I don't know much about grep, but I read that grep just matches things whereas sed will replace, so it is better for this type of thing:
sed s%", "%"</dc:subject>\n<dc:subject>"%g file.txt > newfile.txt
You can't do this via normal grep. But you can add a "Unix Filter" to BBEdit doing this work for you:
#!/usr/bin/perl -w
while (<>) {
    my $line = $_;
    $line =~ /<dc:subject>(.+)<\/dc:subject>/;
    my $content = $1;
    my @arr;
    if ($content =~ /,/) {
        @arr = split(/,/, $content);
    }
    my $newline = '';
    foreach my $part (@arr) {
        $newline .= "\n" if ($newline ne '');
        $part =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;    # trim leading/trailing whitespace
        $newline .= "<dc:subject>$part</dc:subject>";
    }
    print $newline;
}
How to add this Unix filter to BBEdit is described in the "Installation" part of this URL: http://blog.elitecoderz.net/windows-zeichen-fur-mac-konvertieren-und-umgekehrt-filter-fur-bbeditconverting-windows-characters-to-mac-and-vice-versa-filter-for-bbedit/2009/01/
