I have a database in which many of the items have been HTML-escaped. I need to undo this (don't ask why!), for which I'll carry out a data migration.
But is there a way to unescape these strings? I've not been able to find anything.
Ruby's CGI::unescapeHTML can do HTML unescaping.
Unescape a string that has been HTML-escaped
require 'cgi'
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
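For the migration itself, a minimal sketch of applying CGI.unescapeHTML to a batch of stored values might look like the following. The rows array and the :title key are hypothetical stand-ins for whatever the real table holds. Values that were escaped more than once would need more than one pass.

```ruby
require 'cgi'

# Hypothetical rows standing in for database records.
rows = [
  { id: 1, title: "Tom &amp; Jerry" },
  { id: 2, title: "&lt;b&gt;bold&lt;/b&gt;" },
  { id: 3, title: "already plain" }
]

# Unescape one column; strings without entities pass through unchanged.
unescaped = rows.map do |row|
  row.merge(title: CGI.unescapeHTML(row[:title]))
end

unescaped.each { |row| puts "#{row[:id]}: #{row[:title]}" }
```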
You should take a look at the htmlentities gem
If I understand correctly, you need to replace strings like &gt; with >. If so, check the XML documentation and replace the required strings with their real values. I don't code in Ruby, so you'll have to figure this one out :]
XML special characters
Related
Given a Rails model column that contains
"Something & Something Else", when outputting with to_xml
Rails will escape the ampersand like so:
<MyElement>Something &amp; Something Else</MyElement>
Our client software is all UTF aware and it would be better if we can just leave the column content raw in our XML output.
There was an old solution that worked by setting $KCODE="UTF8" in an environment file, but this trick no longer works, and was always an All or Nothing solution.
Any recommendations on how to disable this on a case-by-case basis?
It does not matter if the client software is UTF-8-aware. An ampersand cannot be used unescaped in XML. If the software is supposed to also be XML-aware, then any content that includes ampersands is not allowed to be kept "raw".
This is nothing to do with Unicode (or "UTF"). Ampersands in XML must be escaped, otherwise it isn't XML, and no XML software will accept it. If you're saying you want the escaping disabled, then you're saying you don't want the output to be XML.
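To illustrate the point, here is a small round-trip sketch. It assumes CGI.escapeHTML and CGI.unescapeHTML as stand-ins for what an XML serializer and an XML parser do with the element text; the escaped form exists only on the wire, and the consumer still ends up with the raw ampersand.

```ruby
require 'cgi'

raw = "Something & Something Else"

# What must be written inside the element (assumption: CGI.escapeHTML
# stands in for the serializer's escaping of &, < and >).
escaped = CGI.escapeHTML(raw)
xml = "<MyElement>#{escaped}</MyElement>"
puts xml

# What a conforming consumer sees once the element text is decoded.
decoded = CGI.unescapeHTML(escaped)
puts decoded == raw
```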
I am getting text from a feed that has a lot of characters like:
Insignia™ 2.0 Stereo Computer Speaker System (2-Piece) - Black
4th-Generation AppleĀ® iPodĀ® touch
Is there an easy way to get rid of these, or do I have to anticipate which characters I want to delete and use the delete method to remove them? Also, when I try to remove
&amp;
with
str.delete("&")
It leaves behind "amp;" Is there a better way to delete this type of character? Do I need to re-encode the text?
String#delete is certainly not what you want, as it works on characters, not the string as a whole.
Try
str.gsub(/&amp;/, "")
You may also want to try replacing the &amp; with a literal ampersand, such as:
str.gsub(/&amp;/, "&")
If this is closer to what you really want, you may get the best results unescaping the HTML string. If so try this:
CGI::unescapeHTML(str)
Details of the unescapeHTML method are here.
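As a rough illustration of what unescaping does to feed text, here is a sketch with a made-up sample string (the exact entities in the real feed are not known). For UTF-8 strings, CGI.unescapeHTML decodes numeric character references as well as the basic named entities; other named entities such as &trade; are left as they are, which is where the htmlentities gem or an HTML/XML parser comes in.

```ruby
require 'cgi'

# Hypothetical feed text with numeric character references.
str = "4th-Generation Apple&#174; iPod&#174; touch &amp; case"

clean = CGI.unescapeHTML(str)
puts clean

# Named entities outside the basic set are not decoded by CGI.unescapeHTML.
untouched = CGI.unescapeHTML("Insignia&trade; 2.0")
puts untouched
```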
If you are getting data from a 'feed', aka RSS XML, then you should be using an XML parser like Nokogiri to process the XML. This will automatically unescape HTML entities and allow you to get the proper string representation directly.
To remove it, try the gsub method, something like this:
text = "foo&bar"
text.gsub(/\b&\b/, "") #=> "foobar"
I have a Ruby on Rails Application that is using the X virtual framebuffer along with another program to grab images from the web. I have structured my command as shown below:
xvfb-run --server-args=-screen 0 1024x768x24 /my/c++/app #{user_provided_url}
What is the best way to make this call in rails with the maximum amount of safety from user input?
You probably don't need to sanitize this input in rails. If it's a URL and it's in a string format then it already has properly escaped characters to be passed as a URL to a Net::HTTP call. That said, you could write a regular expression to check that the URL looks valid. You could also do the following to make sure that the URL is parse-able:
uri = URI.parse(user_provided_url)
You can then query the object for its relevant parts:
uri.path
uri.host
uri.port
Maybe I'm wrong, but why don't you just make sure that the string given is really a URL (URI::parse), surround it with single quotes and escape any single quote (') character that appears inside?
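One way to combine the suggestions above is to validate the URL and then avoid the shell altogether by passing the command as an argument list. The sketch below is only that: xvfb_command is a hypothetical helper, and it assumes the screen specification in the question is meant to be the value of --server-args. Restricting the scheme to http(s) is an extra assumption, not something the question requires.

```ruby
require 'uri'

# Hypothetical helper: validates the URL and builds an argument list.
def xvfb_command(user_provided_url)
  uri = begin
    URI.parse(user_provided_url)
  rescue URI::InvalidURIError
    raise ArgumentError, "not a parseable URL"
  end
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  if !uri.is_a?(URI::HTTP) || uri.host.to_s.empty?
    raise ArgumentError, "only http(s) URLs with a host are accepted"
  end

  ["xvfb-run", "--server-args=-screen 0 1024x768x24", "/my/c++/app", uri.to_s]
end

argv = xvfb_command("http://example.com/page?a=1&b=2")
p argv
# Passing the list as separate arguments means no shell parses the URL:
# system(*argv)
```

With the multi-argument form of system, each element reaches the program as-is, so characters like ; or & in the URL have no special meaning.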
I have a string that is a bunch of XML tags.
Basically there is the contents to one tag I want and ignore everything else:
The input would look like:
<Some><XML><stuff>
<title type='text'>key</title>
<Some><other><XML><stuff>
The output would look like:
key
I'm not sure if XML parsing is appropriate, since there doesn't seem to be very much structure to this particular XML.
Can regex do this in RoR or is it more of just a pattern matching thing (true or false) in ruby on rails?
Thanks so much!
Cheers,
Zigu
No. If your source might not be strictly valid XML, I strongly suggest you use Nokogiri.
Handle the source as an HTML document and extract the info you need in this way:
require 'nokogiri'
doc = Nokogiri::HTML("Your string with <key>some value</key>")
doc.search('key').each do |value|
  puts value.content # do whatever you want
end
Here's why you don't parse xml with regexen: RegEx match open tags except XHTML self-contained tags
Greetings everyone:
I would love to get some information from a huge collection of Google Search Result pages.
The only thing I need is the URLs inside a bunch of <cite></cite> HTML tags.
I cannot get a solution in any other proper way to handle this problem so now I am moving to ruby.
This is so far what I have written:
require 'net/http'
require 'uri'
url=URI.parse('http://www.google.com.au')
res= Net::HTTP.start(url.host, url.port){|http|
http.get('/#hl=en&q=helloworld')}
puts res.body
Unfortunately I cannot use the recommended hpricot ruby gem (because it misses a make command or something?)
So I would like to stick with this approach.
Now that I can get the response body as a string, the only thing I need is to retrieve whatever is inside the ciite(remove an i to see the true name :)) HTML tags.
How should I do that? using regular expression? Can anyone give me an example?
Here's one way to do it using Nokogiri:
Nokogiri::HTML(res.body).css("cite").map {|cite| cite.content}
I think this will solve it:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten
# This one ignores empty tags:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten.reject(&:empty?)
If you're having problems with hpricot, you could also try nokogiri which is very similar, and allows you to do the same things.
Split the string on the tag you want. Assuming only one instance of tag (or specify only one split) you'll have two pieces I'll call head and tail. Take tail and split it on the closing tag (once), so you'll now have two pieces in your new array. The new head is what was between your tags, and the new tail is the remainder of the string, which you may process again if the tag could appear more than once.
An example that may not be exactly correct, but you get the idea:
head1, tail1 = str.split('<tag>', 2) # split once, at the opening tag
head2, tail2 = tail1.split('</tag>', 2) # split once, at the closing tag; head2 is the content
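The split-and-repeat idea described above can be sketched as a small loop. texts_between is a hypothetical helper name, and the sample string is made up; note that String#split needs a limit of 2 to split only once, since a limit of 1 returns the whole string unsplit.

```ruby
# Hypothetical helper: returns every substring found between open_tag and close_tag.
def texts_between(str, open_tag, close_tag)
  results = []
  rest = str
  loop do
    _head, tail = rest.split(open_tag, 2)   # drop everything before the opening tag
    break if tail.nil?                      # no opening tag left
    inner, rest = tail.split(close_tag, 2)  # inner is the text between the tags
    break if rest.nil?                      # opening tag without a closing tag
    results << inner
  end
  results
end

sample = "a<cite>x.example</cite>b<cite>y.example</cite>c"
p texts_between(sample, "<cite>", "</cite>")
```

This only works for tags written exactly as given (no varying attributes or nesting), which is why the other answers point to a real parser.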