I have a string that is a bunch of XML tags.
Basically there is the contents to one tag I want and ignore everything else:
The input would look like:
<Some><XML><stuff>
<title type='text'>key</title>
<Some><other><XML><stuff>
The output would look like:
key
I'm not sure if an XML parser is appropriate, since there doesn't seem to be very much structure to this particular XML.
Can regex do this in RoR or is it more of just a pattern matching thing (true or false) in ruby on rails?
Thanks so much!
Cheers,
Zigu
No. If your source might not be strictly valid XML, I strongly suggest you use Nokogiri.
Handle the source as an HTML document and extract the info you need in this way:
require 'nokogiri'
doc = Nokogiri::HTML("Your string with <key>some value</key>")
doc.search('key').each do |value|
  puts value.content # do whatever you want
end
Here's why you don't parse xml with regexen: RegEx match open tags except XHTML self-contained tags
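To answer the literal question: a Ruby regex is not limited to true/false matching, it can also return the captured text. Here is a minimal sketch, assuming the input holds a single, simple, non-nested title tag. extract_title is a made-up helper name, and this will break on nested or unusual markup, which is exactly why a parser is the safer choice.

```ruby
# Hypothetical helper: returns the text of the first <title ...> element,
# or nil when there is none. Only suitable for simple, non-nested input.
def extract_title(xmlish)
  # String#[] with a regex and a group index returns that capture group.
  xmlish[/<title\b[^>]*>(.*?)<\/title>/m, 1]
end

input = "<Some><XML><stuff>\n<title type='text'>key</title>\n<Some><other><XML><stuff>"
puts extract_title(input)
```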
I'm trying to convert an XML document into a Ruby hash for the first time, and having no success. I have my XML document, doc.xml, in a folder along with my script hashrunner.rb.
In hashrunner.rb:
require 'active_support/core_ext/hash'
hash = Hash.from_xml("doc.xml")
puts hash
The first line of the XML document is <?xml version="1.0" encoding="US-ASCII"?>, if that is helpful.
In my console, when I run ruby hashrunner.rb, I get the error message:
/Users/me/.rvm/gems/ruby-1.9.3-p374/gems/activesupport-4.0.0/lib/active_support/xml_mini/rexml.rb:34:in `parse':The document "doc.xml" does not have a valid root (REXML::ParseException)
As someone relatively new to Ruby, I don't understand what this means, and some internet searching didn't turn up an explanation, either. To start, I'm not even sure if I'm calling the XML file correctly in the from_xml method, so please let me know if that's the case. I'd be open to using different gems or a different approach if that would help.
I'm pretty sure Hash::from_xml has to take an XML string, not a filename string. Try:
hash = Hash.from_xml(File.read("doc.xml"))
I have an XML document which I want to parse using NSXMLParser. One of the tags it can contain is <html>, and in my parsed representation I want the contents of that tag, verbatim. However, when I parse the document, my delegate methods are called for the start, end and contents of each tag inside the html tag.
I can't get the provider of the document to add CDATA tags; nor can I use something other than NSXMLParser to parse the document.
Is there a way for me to tell the parser to treat the contents of HTML tags as CDATA and to leave them unparsed, even if they contain other tags?
That's too bad that the owner of the XML feed won't fix it because, depending on the HTML, you may end up with a malformed XML feed. If it really is an XML document, they definitely should wrap it in a CDATA or replace all the < with &lt; and all the > with &gt;.
Frankly, if all you need is the HTML, and all you have is XML tag that contains the HTML without the CDATA or appropriate character replacement, I might not be inclined to try to run it through NSXMLParser at all (because the successful parsing is contingent on the nature of the HTML included). I'd use a NSScanner or NSRegularExpression to extract all of the text between the XML's opening and closing tag that wrap your HTML.
Or, if you really want to use NSXMLParser (because there's other stuff in addition to the HTML that you need), then manually alter the NSData, wrapping the HTML in a CDATA yourself.
If, on the other hand, the document you're trying to parse really isn't XML, but rather is just HTML, then of course, you shouldn't be parsing it with an XML parser. You should be using an HTML parser, like Hpple, as described in Galloway's article, How to Parse HTML on iOS, on the Ray Wenderlich site.
I have a database, and currently many of the items within it have been html escaped. I need to undo this (don't ask why!), for which I'll carry out a data migration.
But is there a way to un-escape these strings? I've not been able to find anything.
Ruby's CGI::unescapeHTML can do HTML unescaping.
Unescape a string that has been HTML-escaped
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
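For the migration, the core step could be sketched like this. The record is a plain hash standing in for a database row, and unescape_fields is a made-up helper name, not part of CGI. Note that a single pass removes only one level of escaping, so a double-escaped value stays escaped once.

```ruby
require 'cgi'

# Hypothetical helper: unescape every String value of a row-like hash,
# leaving non-string values (numbers, nil, ...) untouched.
def unescape_fields(record)
  record.transform_values do |value|
    value.is_a?(String) ? CGI.unescapeHTML(value) : value
  end
end

row = { name: "Tom &amp; Jerry &lt;3", qty: 2 }
p unescape_fields(row)
```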
You should take a look at the htmlentities gem
If I understand correctly, you need to replace strings like &gt; with >. If so, check the XML documentation and replace the required strings with their real values. I don't code in Ruby, so you'll have to figure that part out :]
XML special characters
How can I convert HTML to Word?
thanks.
I have created a Ruby html to word gem that should help you do just that. You can check it out at https://github.com/nickfrandsen/htmltoword - You simply pass it a html string and it will create a corresponding word docx file.
def show
  respond_to do |format|
    format.docx do
      file = Htmltoword::Document.create params[:docx_html_source], "file_name.docx"
      send_file file.path, :disposition => "attachment"
    end
  end
end
Hope you find it helpful.
I am not aware of any solution which does this, i.e. convert HTML to Word format. If you literally mean that, you will have to parse the HTML document first using something like Nokogiri. If you mean you want to output data persisted in your model objects, there is obviously no need to parse HTML! As far as outputting to Word, I'm afraid it looks as if you will have to directly interface with a running instance of Microsoft Word via OLE!
A quick google search for win32ole ruby word will get you started:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/241606
Good luck!
I agree with CodeJoust that it is better to generate a PDF. However, if you really need to generate a Word document then you can do the following:
If your server is a Windows machine, you can install Office in it and use ruby's OLE binding to generate the Word document into the public folder and then deliver the file in the response.
To use ruby's OLE binding, see the "Programming Ruby" ebook that comes with the one-click ruby installer for Windows. You may have to use custom logic to convert from HTML to Word unless you can find a function in the OLE api of Word to do that.
http://prawn.majesticseacreature.com/
You could allow the user to download a PDF or a .html file, but there aren't any helpful ruby libraries to do that. You're better off generating a 'printable and downloadable' version, without much styling, and/or a pdf version using a library like prawn.
You could always generate a simple .rtf file, I think word'll be pretty happy reading that...
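A rough sketch of what that could look like, using nothing but string building. The helper names rtf_escape and simple_rtf are made up for illustration, the font choice is arbitrary, and this only handles plain ASCII paragraphs (non-ASCII text would need RTF's \u escapes, which are not covered here).

```ruby
# Escape the three characters RTF treats specially: backslash and braces.
def rtf_escape(text)
  text.gsub(/[\\{}]/) { |ch| "\\#{ch}" }
end

# Build a minimal RTF document with one \par-terminated paragraph per entry.
def simple_rtf(paragraphs)
  body = paragraphs.map { |para| "#{rtf_escape(para)}\\par" }.join("\n")
  "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\n\\f0 #{body}\n}"
end

puts simple_rtf(["Hello from Ruby", "Braces {like these} are escaped"])
# To save it: File.write("report.rtf", simple_rtf([...]))
```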
Greetings everyone:
I would love to get some information from a huge collection of Google Search Result pages.
The only thing I need is the URLs inside a bunch of <cite></cite> HTML tags.
I cannot get a solution in any other proper way to handle this problem so now I am moving to ruby.
This is so far what I have written:
require 'net/http'
require 'uri'
url=URI.parse('http://www.google.com.au')
res= Net::HTTP.start(url.host, url.port){|http|
http.get('/#hl=en&q=helloworld')}
puts res.body
Unfortunately I cannot use the recommended hpricot ruby gem (because it misses a make command or something?)
So I would like to stick with this approach.
Now that I can get the response body as a string, the only thing I need is to retrieve whatever is inside the ciite(remove an i to see the true name :)) HTML tags.
How should I do that? using regular expression? Can anyone give me an example?
Here's one way to do it using Nokogiri:
Nokogiri::HTML(res.body).css("cite").map {|cite| cite.content}
I think this will solve it:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten
# This one to ignore empty tags:
res.body.scan(/<cite>([^<>]*)<\/cite>/imu).flatten.select{|x| !x.empty?}
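If the cite tags carry attributes or contain inline markup such as <b>, a pattern that forbids < and > inside the tag will skip them. Here is a slightly more tolerant sketch, under the assumption that cite tags are never nested. cite_texts is a made-up helper name, and a real parser remains the more robust option.

```ruby
# Hypothetical helper: collect the visible text of every <cite> element.
def cite_texts(html)
  html.scan(/<cite\b[^>]*>(.*?)<\/cite>/im) # lazy match up to the nearest closing tag
      .flatten
      .map { |inner| inner.gsub(/<[^>]+>/, "").strip } # drop inline tags like <b>
      .reject(&:empty?)
end

sample = '<cite class="x">www.<b>example</b>.com</cite><cite></cite><CITE>a.org</CITE>'
p cite_texts(sample)
```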
If you're having problems with hpricot, you could also try nokogiri which is very similar, and allows you to do the same things.
Split the string on the tag you want. Assuming only one instance of tag (or specify only one split) you'll have two pieces I'll call head and tail. Take tail and split it on the closing tag (once), so you'll now have two pieces in your new array. The new head is what was between your tags, and the new tail is the remainder of the string, which you may process again if the tag could appear more than once.
An example that may not be exactly correct but you get the idea:
head1, tail1 = str.split('<tag>', 2) # limit 2 splits once, at the first opening tag
head2, tail2 = tail1.split('</tag>', 2) # head2 is the text between the tags
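The repeat-on-the-remainder idea can be sketched as a loop. between_tags is a made-up helper name, and the sketch assumes the tags are literal strings with no attributes and are never nested.

```ruby
# Hypothetical helper: return every piece of text found between
# open_tag and close_tag, scanning left to right.
def between_tags(str, open_tag, close_tag)
  results = []
  rest = str
  loop do
    _head, tail = rest.split(open_tag, 2)   # split once at the opening tag
    break if tail.nil?                      # no more opening tags
    inner, rest = tail.split(close_tag, 2)  # split once at the closing tag
    break if rest.nil?                      # opening tag was never closed
    results << inner
  end
  results
end

p between_tags("x<tag>a</tag>y<tag>b</tag>z", "<tag>", "</tag>")
```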