I want to share some text on WhatsApp, so I'm converting HTML to text; otherwise it displays all the tags.
Currently I'm using strip_tags to remove the tags, but that also removes the breaks from the text. How do I convert HTML to text, convert breaks to newline characters, and URL-encode the text?
Currently I'm using the following:
@whatsapp_text = u strip_tags(@post.summary)
I suggest you use Nokogiri to solve this problem. Nokogiri can parse HTML and convert a website's source into human-readable text. Although it doesn't convert HTML breaks to line breaks, it can take many problems off your hands. To do this, add the following line to your Gemfile:
gem 'nokogiri'
Run bundle install. Then you can solve your problem like this:
Nokogiri::HTML.parse(@post.summary.gsub("<br>", "\r\n").gsub("<br/>", "\r\n")).inner_text
That should do it for you.
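If it helps, here is a minimal sketch of the whole helper, assuming @post.summary holds the HTML and that ERB::Util.url_encode is acceptable for the URL-encoding step: normalize the <br> variants to newlines, strip the remaining tags with Nokogiri, then percent-encode the result for the share link.
require 'nokogiri'
require 'erb'

html = @post.summary.gsub(/<br\s*\/?>/i, "\r\n")  # handles <br>, <br/>, <br />
text = Nokogiri::HTML.parse(html).inner_text      # drop the remaining tags
@whatsapp_text = ERB::Util.url_encode(text)       # percent-encode for the share URL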
I am trying to read an XML file from a third party with Nokogiri in my Rails project.
One of the nodes I have to parse contains a URL with unescaped ampersands (like foo.com/index.html?page=1&query=bar).
I understand that this is considered malformed XML, and Nokogiri just tries to parse it anyway, resulting in foo.com/index.html?page=1=bar.
How can I obtain the full URL? Can I tweak Nokogiri? Would you do a search & replace pre-run, or what would be the best practice?
Had the same issue parsing SVGs with image links containing ampersands.
Parsing the SVGs as HTML seems to handle the links correctly, escaping the & as &amp;:
fixed_svg = Nokogiri::HTML.fragment(raw_svg).to_html
# proceed with XML parsing
svg = Nokogiri::XML(fixed_svg)
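A quick round-trip check with a made-up node in the spirit of the question (the exact serialization may vary by Nokogiri/libxml2 version):
raw = '<url>http://foo.com/index.html?page=1&query=bar</url>'
fixed = Nokogiri::HTML.fragment(raw).to_html
# => "<url>http://foo.com/index.html?page=1&amp;query=bar</url>"
Nokogiri::XML(fixed).at('url').text
# => "http://foo.com/index.html?page=1&query=bar"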
I have a Rails application which accepts XML output from another application. Under some conditions the XML tag content comes with CSS code in it.
For example:
<sample>.headermenu{float:left;no-repeat right;font-size:0.75em; padding-bottom:3px}, #div{float:left} This is the test value from another site</sample>
In my Ruby application I parse the XML content and display it.
It ends up displaying the CSS content like the above. I want to strip the CSS code if it exists in the content.
Is there any way we can do this? Please help...
The raw method might help you. It outputs data without escaping the string. Check http://apidock.com/rails/ActionView/Helpers/RawOutputHelper/raw for more details.
I don't know if this is what you are looking for, but you can try the css_parser gem. By the way, whenever you need a Rails or Ruby gem, just search for it on rubygems.org.
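If the CSS always shows up as selector{...} blocks prepended to the real text, as in the sample above, a blunt regex pass before display may be enough. This is only a sketch under that assumption (node_text stands for whatever string you pulled out of the XML node), not a general CSS detector:
# strip leading "selector { ... }" blocks, optionally comma-separated
css_block = /[.#]?[\w-]+\s*\{[^}]*\}\s*,?\s*/
clean_text = node_text.gsub(css_block, '').strip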
I am parsing titles from PDF files using pdftk, and the titles contain various language-specific characters.
The Ruby on Rails application I need to do this in is using Ruby 1.8.7 and Rails 2.3.14, so any encoding solutions built into Ruby 1.9 aren't an option for me right now.
Example of what I need to do:
If the title includes a ü, when I read the PDF content using pdftk (either on the command line or via the Ruby pdf-toolkit gem), the "ü" gets converted to &#252;.
In my application, I really want this as the ü, as that seems to work fine for my needs in a web page and in an XML file.
I can convert the character explicitly in Ruby using
>> string = "&#252;"
=> "&#252;"
>> string.gsub("&#252;","ü")
=> "ü"
but obviously I don't want to do this one by one.
I've tried using Iconv to do this, but I don't know what to specify to get this converted to the rendered character. I thought maybe this was just UTF-8, but it doesn't seem to convert to the rendered character:
>> Iconv.iconv("latin1", "utf-8","ü").join
=> "ü"
I am a little confused about what to/from formats to use here to get the end result of the rendered character.
So how do I use Iconv or other tools to make this conversion for all characters converted to this HTML code by pdftk?
Or how do I tell pdftk to do this when I read the PDF file in the first place?
OK, I think the issue here is that the codes pdftk returns are HTML entities, so unescaping the HTML first is the path that works:
>> Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(string) ).join
=> "ü"
Update:
Using the following
pdf = PDF::Toolkit.open(file)
pdf.title = Iconv.iconv("utf8", "latin1", CGI.unescapeHTML(pdf.title)).join
This seems to work for most languages, but when I apply it to Japanese and Chinese, it mangles things and doesn't produce the original as it appears in the PDF.
Update:
Getting closer: it appears that the HTML codes pdftk puts in the title for Japanese and Chinese already render correctly if I just unescape them and don't attempt any Iconv conversion:
CGI.unescapeHTML(pdf.title)
This renders correctly.
So... how do I test pdf.title ahead of time to see if it is Chinese or Japanese (double-byte?) before I apply the conversion needed for other languages?
Maybe something like:
string.gsub(/&#\d+;/){|x| x[/\d+/].to_i.chr}
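Note that Integer#chr only covers codepoints 0-255 on Ruby 1.8, so the one-liner above will break on the Japanese/Chinese titles. Here is a sketch (not tested against the asker's PDFs) that packs every codepoint straight to UTF-8 instead, which also avoids the Latin-1/Iconv round trip:
def decode_numeric_entities(str)
  # [n].pack("U") emits the UTF-8 bytes for codepoint n, even on Ruby 1.8.7
  str.gsub(/&#(\d+);/) { [$1.to_i].pack("U") }
end

decode_numeric_entities("&#252;")    # => "ü"
decode_numeric_entities("&#26085;")  # => "日"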
I am getting text from a feed that has a lot of characters like:
Insignia™ 2.0 Stereo Computer Speaker System (2-Piece) - Black
4th-Generation Apple® iPod® touch
Is there an easy way to get rid of these, or do I have to anticipate which characters I want to delete and use the delete method to remove them? Also, when I try to remove
&amp;
with
str.delete("&")
It leaves behind "amp;". Is there a better way to delete this type of character? Do I need to re-encode the text?
String#delete is certainly not what you want, as it works on characters, not the string as a whole.
Try
str.gsub /&amp;/, ""
You may also want to try replacing the &amp; with a literal ampersand, such as:
str.gsub /&amp;/, "&"
If this is closer to what you really want, you may get the best results unescaping the HTML string. If so try this:
CGI::unescapeHTML(str)
Details of the unescapeHTML method are here.
If you are getting data from a 'feed', aka RSS XML, then you should be using an XML parser like Nokogiri to process the XML. This will automatically unescape HTML entities and allow you to get the proper string representation directly.
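A small sketch of that approach with a made-up feed snippet; Nokogiri decodes the entities when you read the node text:
require 'nokogiri'

xml = '<rss><channel><item><title>Insignia&#8482; Speakers &amp; Stand</title></item></channel></rss>'
doc = Nokogiri::XML(xml)
doc.at('item > title').text  # => "Insignia™ Speakers & Stand"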
For removing, try the gsub method, something like this:
text = "foo&amp;bar"
text.gsub /\b&amp;\b/, "" #=> "foobar"
I have a database, and currently many of the items within it have been HTML-escaped. I need to undo this (don't ask why!), for which I'll carry out a data migration.
But is there a way to un-escape these strings? I've not been able to find anything.
Ruby's CGI::unescapeHTML can do HTML unescaping.
Unescape a string that has been HTML-escaped:
CGI::unescapeHTML("Usage: foo &quot;bar&quot; &lt;baz&gt;")
# => "Usage: foo \"bar\" <baz>"
You should take a look at the htmlentities gem.
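For completeness, a quick sketch of the htmlentities route; it also decodes named entities beyond the handful CGI.unescapeHTML knows about:
require 'htmlentities'

coder = HTMLEntities.new
coder.decode("Caf&eacute; &amp; &#8220;bar&#8221;")  # => "Café & “bar”"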
If I understand it correctly, you need to replace strings like &gt; with >. If so, check the XML documentation and replace the required strings with their real values. I don't code in Ruby, so this one you'll have to figure out :]
XML special characters