nokogiri replace - ruby-on-rails

I'm parsing an HTML document and trying to replace the image src. It works when I try it in the console, but in my model the change doesn't seem to be saved. I'm not sure whether the problem is the way I'm saving in Rails (I'm trying to update the content field, replacing external images with local ones) or the way I'm using Nokogiri, but the result of the set_attribute call is not being saved.
The rest of the method works perfectly.
before_save :replace_zemanta_images
def replace_zemanta_images
  doc = Nokogiri::HTML(content)
  unless doc.css('div.zemanta-img').blank?
    doc.css('div.zemanta-img img').each do |img|
      io = open(URI.parse(img[:src]))
      if photos.find_by_data_remote_url(img[:src]).blank?
        photo = photos.build(:data => io, :data_remote_url => img[:src])
        img.set_attribute('src', photo.data.url(:original)) #doesn't work!
      end
    end
  end
end

I am assuming that content is an attribute on your model.
When you are doing img.set_attribute you are updating the attribute in the Nokogiri::XML::Element object but this doesn't update the text of content.
At the end of your method you will need to add something like:
self.content = doc.to_s

JackChance mentioned using Nokogiri::HTML::DocumentFragment.parse(content) here for a fragment (if you don't want DOCTYPE/HTML/BODY tags). I didn't have any luck with that, since my original HTML was a snippet instead of a whole document.
I ended up using something like this:
html = Nokogiri::HTML.fragment(content)
to initially convert the HTML string snippet to a Nokogiri object without the unnecessary tags.
Then, once we have used img.set_attribute, we can convert back with html.to_s.

Related

How to convert PDF to Excel or CSV in Rails 4

I have searched a lot and have no choice but to ask here. Do you know of an online converter with an API, or a gem, that can convert a PDF to an Excel or CSV file?
I am not sure whether this is the best place to ask, either.
My application is in Rails 4.2.
The PDF file contains a header and a big table with about 10 columns.
More info:
The user uploads the PDF via a form; I then need to grab the PDF, parse it to CSV, and read the content. I tried to read the content with the PDF Reader gem, but the result wasn't really promising.
I have used freepdfconvert.com/pdf-excel. Unfortunately, they don't provide an API. (I have contacted them.)
Sample PDF
This piece of code converts the PDF into text, which is handy.
Gem: pdf-reader
def self.parse
  reader = PDF::Reader.new("pdf_uploaded_by_user.pdf")
  reader.pages.each do |page|
    puts page.text
  end
end
Now, if you check the attached sample PDF, you will see that some fields might be empty, which means I can't simply split each text line on spaces and put it in an array, as I won't be able to map the array to the correct fields.
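To make the problem concrete, here is a minimal Ruby sketch. The sample rows, column names, and column widths are invented for illustration and are not taken from the sample PDF. The fixed-offset slicing is only one possible workaround, and it assumes the extracted text keeps its column alignment, which may not hold for every PDF.

```ruby
# Hypothetical rows shaped like lines from page.text; the second row has an
# empty "qty" column. Names and widths are assumptions for illustration.
headers     = %w[code qty price]
full_row    = format("%-7s%-7s%s", "A100", "3", "9.99")
missing_row = format("%-7s%-7s%s", "B200", "", "4.50")

# Splitting on whitespace silently drops the empty field, so "4.50"
# lands under "qty" and "price" is left without a value.
naive = headers.zip(missing_row.split).to_h

# Slicing by fixed character ranges keeps each value in its own column,
# provided the text preserves column alignment (an assumption).
columns   = { "code" => 0...7, "qty" => 7...14, "price" => 14..-1 }
slice_row = ->(line) { columns.transform_values { |range| line[range].to_s.strip } }

sliced_full    = slice_row.call(full_row)
sliced_missing = slice_row.call(missing_row)

p naive
p sliced_missing
```

With whitespace splitting the empty field disappears and later values shift left; with range slicing the empty field survives as an empty string.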
Thank you.
OK, after lots of research I couldn't find an API or even proper software that does it. Here is how I did it.
I first extract the table out of the PDF into an HTML table with the pdftables API. It is cheap.
Then I convert the HTML table to CSV.
(This is not ideal, but it works.)
Here is the code:
require 'httmultiparty'
require 'nokogiri'
require 'csv'

class PageTextReceiver
  include HTTMultiParty
  base_uri 'http://localhost:3000'

  # Upload the PDF to pdftables and save the returned HTML table to disk.
  def run
    response = PageTextReceiver.post('https://pdftables.com/api?key=myapikey', :query => { f: File.new("/path/to/pdf/uploaded_pdf.pdf", "r") })
    File.open('/path/to/save/as/html/response.html', 'w') do |f|
      f.puts response
    end
  end

  # Read the saved HTML and write one CSV row per table row.
  def convert
    doc = File.open("/path/to/saved/html/response.html") { |f| Nokogiri::HTML(f) }
    # Options are passed as keyword arguments so this also works on Ruby 3.
    csv = CSV.open("path/to/csv/t.csv", 'w', col_sep: ",", quote_char: "'", force_quotes: true)
    doc.xpath('//table/tr').each do |row|
      csv << row.xpath('td').map { |cell| cell.text }
    end
    csv.close
  end
end
Now run it like this:
#> page = PageTextReceiver.new
#> page.run
#> page.convert
It is not refactored. Just proof of concept. You need to consider performance.
I might use the gem Sidekiq to run it in background and move the result to the main thread.
Check Tabula-Extractor project and also check how it is used in projects like NYPD Moving Summonses Parser and CompStat criminal complaints parser.
Ryan Bates covers csv exports in his rails casts > http://railscasts.com/episodes/362-exporting-csv-and-excel this might give you some pointers.
Edit: as you now mention that you need the raw data from an uploaded PDF, you could use JavaScript to read the PDF file and then populate the data into Ryan Bates' export method. Reading PDFs was covered excellently in the following question:
extract text from pdf in Javascript
I would imagine the flow would be something like:
PDF new action
user uploads PDF
PDF show action
PDF is displayed
JavaScript reads PDF
JavaScript populates Ryan's raw data
Raw data is exported with PDF data included

Passing data to helper (including tags)

I have a helper function that is called inside a HAML template. It uses XPath to modify the content of data. For this to work, data must be valid HTML. The following example fails because data is not valid HTML.
%pre
  = foo(data)
The idea now, is to wrap data into the pre tag, and pass this valid html to foo like the following.
= foo("<pre>#{data}</pre>")
Is something like that possible in haml? Without using string manipulation?
The solution should work for any kind of tags. For example it should work for headings:
%h1.title
  = foo(heading_data)
Data
Firstly, why are you sending pure HTML to a helper?
As is the case in most programming, the "arguments" you send to a helper are just ways to call a method. This means you just send the most succinct data you need, and let the helper / method output it as you require.
Frankly, I'd recommend against sending the data file you're looking at.
Recommendation
Without knowing what your foo method is all about, I'd recommend this:
#view
= foo data
#app/helpers/application_helper.rb
module ApplicationHelper
  def foo(data)
    code = "<pre>#{data}</pre>"
    raw code
  end
end
It's not completely clear to me what you are trying to do, but it sounds to me that the usage of your function should instead be along the lines of
= foo do
  %pre= data
or
= foo do
  %h1= data
in the second case. To support this, your foo function should look like
def foo(&block)
  data = capture(&block)
  # transform data and return it
  # Don't forget about html safety!
end
Here data will be set to the result of evaluating the block of markup that was passed to your foo method.

Ruby on Rails PDF Stamper / iText

I have done a lot of searching and cannot find a solution for getting PDF-stamper to work in my Rails application. From the tutorials it appears that I should write a method in the model. I wrote a simple app with two fields: nameLast and nameFirst. All I want to do is write these to a PDF I have that contains fields for user info. Two fields happen to be FirstName and LastName, so it's the perfect time to use PDF-stamper, right? I just want to take user data from the Rails application and let the user push a button to generate a PDF. Here is the method I have in my model.
def savePDF
  pdf = PDF::Stamper.new("sample.pdf")
  pdf.text :nameFirst, "Jason"
  pdf.text :nameLast, "Yates"
  pdf.save_as "my_output.pdf"
end
That was clearly taken from a tutorial that I must not properly understand. I can actually get this working in java pretty easy, but I don't want to use jRuby. I am using rjb which is working fine. I just don't think I properly understand what needs to happen to get this working. Any help is greatly appreciated!
I'm the author of the pdf-stamper gem.
The save_as method saves the created PDF to the filesystem. If you are building a Rails application, I don't think that is what you want.
I'm guessing from your question you want to send a "stamped" PDF back to the browser. If that is the case, you should call to_s on the created PDF and then pass the output of that to Rails send_data method.
In your controller (not the model), you'll want to add some code like this.
# Note: an action named send overrides Object#send on this controller,
# so a different action name may be safer in a real application.
def send
  pdf = PDF::Stamper.new("sample.pdf")
  pdf.text :nameFirst, "Jason"
  pdf.text :nameLast, "Yates"
  send_data(pdf.to_s, :filename => "output.pdf", :type => "application/pdf", :disposition => "inline")
end
The problem here really is the documentation for the pdf-stamper gem. The feature you wanted was there just undocumented, hence your confusion. I'll have to fix that.
I was doing the same thing using XFDF as the source data for the fields. The following code worked for me; maybe it will be helpful to you as well:
pdfreader = Rjb::import('com.itextpdf.text.pdf.PdfReader')
pdfstamper = Rjb::import('com.itextpdf.text.pdf.PdfStamper')
pdffields = Rjb::import('com.itextpdf.text.pdf.AcroFields')
xfdfreader = Rjb::import('com.itextpdf.text.pdf.XfdfReader')
# Assumption: filestream was not defined in the original snippet; a Java
# FileOutputStream is what PdfStamper expects as its output stream.
filestream = Rjb::import('java.io.FileOutputStream')
# f (the XFDF source) and i (an output counter) come from surrounding code not shown here.
pdf = pdfreader.new("#{Rails.root}/public/out/temp/form1.pdf", nil)
xfdf = xfdfreader.new(f)
stamp = pdfstamper.new(pdf, filestream.new("/tmp/text#{i}.pdf"))
pdffields = stamp.getAcroFields()
pdffields.setFields(xfdf)
stamp.close

Rails way to offer modified attributes

The case is simple: I have markdown in my database, and want it parsed on output(*).
@post.body is mapped to the posts.body column in the database. Simple, default ActiveRecord ORM. That column stores the markdown text a user inserts.
Now, I see four ways to offer the markdown rendered version to my views:
First, in app/models/post.rb:
# ...
def body
  markdown = RDiscount.new(body)
  markdown.to_html
end
This allows me to simply call @post.body and get an already rendered version. I do see lots of potential problems with that, e.g. on edit the text field being pre-filled with the rendered HTML instead of the markdown code.
Second option would be a new attribute in the form of a method
In app/models/post.rb:
# ...
def body_markdownified
  markdown = RDiscount.new(body)
  markdown.to_html
end
Seems cleanest to me.
Or, third in a helper in app/helpers/application_helper.rb
def markdownify(string)
  markdown = RDiscount.new(string)
  markdown.to_html
end
This is used in the view: instead of <%= body %>, I would write <%= markdownify(body) %>.
The fourth way, would be to parse this in the PostsController.
def index
  @posts = Post.find(:all)
  @rendered_posts = []
  @posts.each do |p|
    p.body = RDiscount.new(p.body).to_html
    @rendered_posts << p
  end
end
I am not too familiar with Rails 3 proper method and attribute architecture. How should I go with this? Is there a fifth option? Should I be aware of gotchas, pitfalls or performance issues with one or another of these options?
(*) In future, potentially updated with a database caching layer, or even special columns for rendered versions. But that is beyond the point, merely pointing out, so to avoid discussion on filter-on-output versus filter-on-input :).
The first option you've described won't work as-is. It will cause an infinite loop because when you call RDiscount.new(body) it will use the body method you've just defined to pass into RDiscount (which in turn will call itself again, and again, and so on). If you want to do it this way, you'd need to use RDiscount.new(read_attribute('body')) instead.
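The recursion can be reproduced without Rails. In the sketch below, FakeRecord and its read_attribute are hypothetical stand-ins for ActiveRecord, and wrapping the text in a paragraph tag stands in for the RDiscount rendering, so treat it as an illustration of the call pattern rather than of the real libraries.

```ruby
# Hypothetical stand-in for an ActiveRecord model: attributes live in a hash
# and read_attribute returns the raw stored value.
class FakeRecord
  def initialize(attributes)
    @attributes = attributes
  end

  def read_attribute(name)
    @attributes[name.to_s]
  end
end

class RecursivePost < FakeRecord
  # `body` inside this method is the method itself, so it never terminates.
  def body
    "<p>#{body}</p>"
  end
end

class SafePost < FakeRecord
  # Reads the stored value directly, so there is no self-call.
  def body
    "<p>#{read_attribute('body')}</p>"
  end
end

attrs = { "body" => "hello" }

recursion_error =
  begin
    RecursivePost.new(attrs).body
    nil
  rescue SystemStackError => e
    e.class
  end

safe_html = SafePost.new(attrs).body
```

Here the first class exhausts the stack, while the second returns the wrapped stored value.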
Apart from this fact, I think the first option would be confusing for someone new looking at your app, as it would not be instantly clear when they see @post.body in your view that this is in fact a modified version of the body.
Personally, I'd go for the second or third options. If you're going to provide it from the model, having a method which describes what it's doing to the body will make it very obvious to anyone else what is going on. If the html version of body will only ever be used in views or mailers (which would be logical), I'd argue that it makes more sense to have the logic in a helper as it seems like the more logical place to have a method that outputs html.
Do not put it in the controller as in your fourth idea, it's really not the right place for it.
Yet another way would be extending the String class with a to_markdown method. This has the benefit of working on any string anywhere in your application.
class String
  def to_markdown
    RDiscount.new(self).to_html
  end
end
@post.body.to_markdown
normal bold italic
If you were using HAML, for example in app/views/posts/show.html.haml:
:markdown
  #{@post.body}
http://haml-lang.com/docs/yardoc/file.HAML_REFERENCE.html#markdown-filter
How about a reader for body that accepts a parse_with parameter?
def body(parse_with=nil)
  b = read_attribute('body')
  case parse_with
  when :markdown then RDiscount.new(b).to_html
  when :escape then CGI.escape(b)
  else b
  end
end
This way, a regular call to body will function as it used to, and you can pass a parameter to specify what to render with:
@post.body
normal **bold** *italic*
@post.body(:markdown)
normal bold italic

Remove all html tags from attributes in rails

I have a Project model and it has some text attributes, one is summary. I have some projects that have html tags in the summary and I want to convert that to plain text. I have this method that has a regex that will remove all html tags.
def strip_html_comments_on_data
self.attributes.each{|key,value| value.to_s.gsub!(/(<[^>]+>| |\r|\n)/,"")}
end
I also have a before_save filter
before_save :strip_html_comments_on_data
The problem is that the html tags are still there after saving the project. What am I missing?
And, is there a really easy way to have that method called in all the models?
Thanks,
Nicolás Hock Isaza
untested
include ActionView::Helpers::SanitizeHelper
def foo
  sanitized_output = sanitize(html_input)
end
where html_input is a string containing HTML tags.
EDIT
You can strip all tags by passing :tags=>[] as an option:
plain_text = sanitize(html_input, :tags=>[])
Although reading the docs I see there is a better method:
plain_text = strip_tags(html_input)
Then make it into a before filter per smotchkiss and you're good to go.
It would be better not to include view helpers in your model. Just use:
HTML::FullSanitizer.new.sanitize(text)
Just use the strip_tags() text helper as mentioned by zetetic
First, the issue here is that Array#each returns the input array regardless of the block contents. A couple people just went over Array#each with me in a question I asked: "Return hash with modified values in Ruby".
Second, Aside from Array#each not really doing what you want it to here, I don't think you should be doing this anyway. Why would you need to run this method over ALL the model's attributes?
Finally, why not keep the HTML input from the users and just use the standard h() helper when outputting it?
# this will output as plain text
<%=h string_with_html %>
This is useful because you can view the database and see the unmodified data exactly as it was entered by the user (if needed). If you really must convert to plain text before saving the value, @zetetic's solution gets you started.
class Comment < ActiveRecord::Base
  include ActionView::Helpers::SanitizeHelper
  before_save :sanitize_html

  protected

  def sanitize_html
    self.text = sanitize(text)
  end
end
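The point about each made in this answer can be sketched in plain Ruby. The attribute hash below is invented for illustration, and the non-mutating gsub is used deliberately (instead of the gsub! from the question) to isolate what each does with the block's result.

```ruby
tag_pattern = /<[^>]+>/
attributes  = { "summary" => "<p>Hello <b>world</b></p>", "title" => "Plain" }

# each discards whatever the block returns and hands back the receiver,
# so the stripped strings computed here are thrown away.
returned = attributes.each { |_key, value| value.gsub(tag_pattern, "") }

# To keep the result, either assign it back per attribute
# (in a model: self.summary = ...) or build a new hash.
cleaned = attributes.transform_values { |value| value.gsub(tag_pattern, "") }

p returned.equal?(attributes)
p cleaned
```

This is why the sanitize_html callback above assigns with self.text = rather than relying on the return value of an iteration.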
Reference Rails' sanitizer directly without using includes.
def text
  ActionView::Base.full_sanitizer.sanitize(html).html_safe
end
NOTE: I appended .html_safe to make HTML entities like &nbsp; render correctly. Don't use this if there is a potential for malicious JavaScript injection.
If you want to remove &nbsp; along with HTML tags, Nokogiri can be used:
include ActionView::Helpers::SanitizeHelper
def foo
  sanitized_output = strip_tags(html_input)
  Nokogiri::HTML.fragment(sanitized_output)
end

Resources