Scraping images from user input URL using MetaInspector (Rails) - ruby-on-rails

I'm trying to create an app where a user can submit a URL link, a title and a description, and it'll create a post with the title, description and an image. I want to scrape the best or main image directly from the URL that the user submitted and display it on the show page using MetaInspector. (I didn't use Nokogiri or Mechanize because I didn't understand them all that well, and MetaInspector seems a lot less daunting.)
The problem is I'm very new to Rails and I'm having a hard time following most tutorials.
Is anyone able to explain to me step by step how to do this or show me a source that's very detailed and noob friendly?
I have a Post model that contains the link, and should also save the scraped image as a Paperclip attachment:
class Post < ActiveRecord::Base
  belongs_to :user
  has_attached_file :image
end
# == Schema Information
#
# Table name: posts
#
# id :integer not null, primary key
# title :string
# link :string
# description :text
# created_at :datetime
# updated_at :datetime
# user_id :integer
# image_file_name :string
# image_content_type :string
# image_file_size :integer
# image_updated_at :datetime
The full code of my app is available at github.com/johnnyji/wanderful.
I really appreciate any help at all! Thank you

Let's walk through this step by step.
First, add the MetaInspector gem to your Gemfile
gem 'metainspector'
and run the bundle command.
We need one more piece: open-uri. With it, we can read remote files from URLs as if they were local files. It is part of Ruby's standard library, so it's already installed, but we still need to require it at the top of post.rb:
require 'open-uri'

class Post < ActiveRecord::Base
  belongs_to :user
  has_attached_file :image
end
We want to grab an image whenever a Post's link changes, so we add a before_save callback that runs whenever that happens:
class Post < ActiveRecord::Base
  belongs_to :user
  has_attached_file :image

  before_save :get_image_from_link,
    if: ->(post) { post.link_changed? }
end
You can find more about before_save and other callbacks in the ActiveRecord::Callbacks guide.
The link_changed? method is part of the "dirty tracking" functionality that ActiveModel::Dirty provides.
That if: ->(post) thing is called a "stabby lambda". It's basically just a Ruby function that is called with the current post as an argument. If it returns true, the before_save callback is run. It could also be written as if: Proc.new { |post| post.link_changed? }
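To see that the two spellings behave the same way, here is a minimal plain-Ruby sketch. FakePost is a hypothetical stand-in with a hand-written link_changed? (it is not an ActiveRecord model), and the constant names are made up for the example:

```ruby
# FakePost is a hypothetical stand-in, NOT an ActiveRecord model.
FakePost = Struct.new(:link, :previous_link) do
  # Mimics what dirty tracking would report for the `link` attribute.
  def link_changed?
    link != previous_link
  end
end

# The two equivalent ways of writing the condition.
STABBY   = ->(post) { post.link_changed? }
PROC_NEW = Proc.new { |post| post.link_changed? }

changed   = FakePost.new("http://example.com/b", "http://example.com/a")
unchanged = FakePost.new("http://example.com/a", "http://example.com/a")

puts STABBY.call(changed)
puts PROC_NEW.call(changed)
puts STABBY.call(unchanged)
```

Both callables return the same result for the same post; the callback only runs when that result is truthy.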
Now we need our get_image_from_link method. Since it's only supposed to be called from within the Post model itself and not from the outside (say, Post.find(5).get_image_from_link), we make it a private method:
class Post < ActiveRecord::Base
  belongs_to :user
  has_attached_file :image

  before_save :get_image_from_link,
    if: ->(post) { post.link_changed? }

  private

  def get_image_from_link
  end
end
According to MetaInspector's README, it has a handy method called page.images.best that does the hard work of selecting the right image from the page for us. So we are going to:
parse the link with MetaInspector
open the image it selected as best with open-uri as a File-like object
give that File-like object to Paperclip to save as an attachment
So:
def get_image_from_link
  # `link` here is `self.link` of the current post.
  # When reading attributes, `self` is implicit in Ruby.
  page = MetaInspector.new(link)

  # Maybe the page didn't have any images?
  return unless page.images.best.present?

  # When you use IO resources such as files, you need
  # to take care that you `.close` everything you open.
  # Using the block form takes care of that automatically.
  # `URI.open` is used because plain `open` no longer
  # accepts URLs on Ruby 3.0 and later.
  URI.open(page.images.best) do |file|
    # When writing/assigning a value, `self` is not
    # implicit: when you write `something = 5`, Ruby
    # cannot know whether you want to assign to
    # `self.something` or create a new local variable
    # called `something`.
    self.image = file
  end
end
This is far from perfect, because it lacks some error handling (what if MetaInspector fails to open the page? Or open-uri cannot read the image URL?). Also, this has the drawback that all that parsing, downloading and so on takes place right when the user submits or updates her post, so when she clicks on the save button, she'll have to wait for all this to complete.
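As a starting point for the missing error handling, here is a minimal sketch of a download helper that returns nil instead of raising. The name safe_fetch, the opener: parameter and the list of rescued errors are assumptions made for this example; MetaInspector raises its own exception classes, which are not covered here.

```ruby
require 'open-uri'
require 'stringio'
require 'timeout'
require 'socket'

# Errors we choose to treat as "no image available"
# (an assumption; extend the list as needed).
FETCH_ERRORS = [OpenURI::HTTPError, SocketError, Timeout::Error,
                Errno::ECONNREFUSED].freeze

# Returns an IO-like object for the URL, or nil if the download failed.
# `opener` is injectable so the method can be tried without network access.
def safe_fetch(url, opener: URI.method(:open))
  opener.call(url)
rescue *FETCH_ERRORS => e
  warn "Could not fetch #{url}: #{e.class}"
  nil
end

# Usage with fake openers instead of real HTTP requests:
ok     = safe_fetch("http://example.com/a.png", opener: ->(_url) { StringIO.new("img") })
failed = safe_fetch("http://example.com/a.png", opener: ->(_url) { raise SocketError, "no route" })
puts ok.read
puts failed.inspect
```

Inside get_image_from_link, the result could then be checked for nil before it is assigned to self.image.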
For the next iteration, look into doing things like these asynchronously, for example with a job queue. Rails' new Active Job system might be a good starting point.

Related

Rails Active Storage - Keep Existing Files / Uploads?

I have a Rails model with:
has_many_attached :files
By default, when you upload new files via Active Storage, it deletes all the existing uploads and replaces them with the new ones.
I have a controller hack from this question, which is less than desirable for many reasons:
What is the correct way to update images with has_many_attached in Rails 6
Is there a way to configure Active Storage to keep the existing ones?
It looks like there is a configuration option that does exactly that:
config.active_storage.replace_on_assign_to_many = false
Unfortunately, according to the current Rails source code, it is deprecated and will be removed in Rails 7.1:
config.active_storage.replace_on_assign_to_many is deprecated and will be removed in Rails 7.1. Make sure that your code works well with config.active_storage.replace_on_assign_to_many set to true before upgrading.
To append new attachables to the Active Storage association, prefer using attach.
Using association setter would result in purging the existing attached attachments and replacing them with new ones.
It looks like explicit usage of attach will be the only way forward.
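The difference between the two behaviours can be modelled in plain Ruby. FakeRecord below is a hypothetical stand-in written for this example; it only imitates "assignment replaces" versus "attach appends" and does not use Active Storage:

```ruby
# FakeRecord is a hypothetical plain-Ruby stand-in; it only models the
# two behaviours described above, not Active Storage itself.
class FakeRecord
  attr_reader :files

  def initialize(files = [])
    @files = files.dup
  end

  # Setter semantics with replace-on-assign: old attachments are dropped.
  def files=(new_files)
    @files = Array(new_files)
  end

  # `attach` semantics: new attachments are added to the existing ones.
  def attach(*new_files)
    @files += new_files.flatten
    self
  end
end

replaced = FakeRecord.new(%w[a.png])
replaced.files = %w[b.png]
p replaced.files

appended = FakeRecord.new(%w[a.png])
appended.attach("b.png")
p appended.files
```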
So one way is to set everything in the controller:
def update
  ...
  if model.update(model_params)
    model.files.attach(params[:model][:files]) if params.dig(:model, :files).present?
  else
    ...
  end
end
If you don't want this code in the controller, you can, for example, override the default setter in the model like this:
class Model < ApplicationModel
  has_many_attached :files

  def files=(attachables)
    files.attach(attachables)
  end
end
I'm not sure I'd suggest this solution. I'd prefer to add a new method just for appending files:
class Model < ApplicationModel
  has_many_attached :files

  def append_files=(attachables)
    files.attach(attachables)
  end
end
and in your form use
<%= f.file_field :append_files %>
It might also need a reader in the model, and probably a better name, but it should demonstrate the concept.
The solution suggested by @edariedl for overriding the writer DOES NOT WORK, because it causes a "stack level too deep" error.
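A likely reason, assuming that attach itself assigns through the public files= writer (which is how the Rails 6 source appears to work), is that the overridden writer and attach keep calling each other. The plain-Ruby sketch below reproduces that call cycle with a hypothetical RecursiveModel class; it flattens the attachment proxy into the model for brevity and is not Active Storage code:

```ruby
# RecursiveModel is a hypothetical stand-in that mimics the call cycle.
class RecursiveModel
  # Stand-in for the framework's `attach`, which (by assumption)
  # assigns through the public writer.
  def attach(attachables)
    self.files = attachables
  end

  # The overridden writer: it delegates straight back to `attach`.
  def files=(attachables)
    attach(attachables)
  end
end

# Returns true if assigning through the writer exhausts the stack.
def overflows?
  RecursiveModel.new.files = ["a.png"]
  false
rescue SystemStackError
  true
end

puts overflows?
```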
1st solution
Based on ActiveStorage source code at this line
You can override the writer for the has_many_attached like so:
class Model < ApplicationModel
  has_many_attached :files

  def files=(attachables)
    attachables = Array(attachables).compact_blank

    if attachables.any?
      attachment_changes["files"] =
        ActiveStorage::Attached::Changes::CreateMany.new("files", self, files.blobs + attachables)
    end
  end
end
Refactor / 2nd solution
You can create a model concern that will encapsulate all this logic and make it a bit more dynamic, by allowing you to specify the has_many_attached fields for which you want the old behaviour, while still maintaining the new behaviour for newer has_many_attached fields, should you add any after you enable the new behaviour.
In app/models/concerns/append_to_has_many_attached.rb:
module AppendToHasManyAttached
  def self.[](fields)
    Module.new do
      extend ActiveSupport::Concern

      fields = Array(fields).compact_blank # will always return an array (worst case is an empty array)

      fields.each do |field|
        field = field.to_s # We need the string version
        define_method :"#{field}=" do |attachables|
          attachables = Array(attachables).compact_blank

          if attachables.any?
            attachment_changes[field] =
              ActiveStorage::Attached::Changes::CreateMany.new(field, self, public_send(field).public_send(:blobs) + attachables)
          end
        end
      end
    end
  end
end
And in your model:
class Model < ApplicationModel
  include AppendToHasManyAttached['files'] # you can include it before or after, order does not matter, explanation below

  has_many_attached :files
end
NOTE: It does not matter if you prepend or include the module because the methods generated by ActiveStorage are added inside this generated module which is called very early when you inherit from ActiveRecord::Base here
==> So your writer will always take precedence.
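The precedence argument can be checked in plain Ruby. In the sketch below, Base, has_many_attached, Append and Doc are all hypothetical miniatures written for this example (they are not Rails code); the only assumption carried over from the note is that the framework puts its generated writer into a module that is included as soon as the subclass is created:

```ruby
# Base plays the role of ActiveRecord::Base in this miniature.
class Base
  def self.inherited(subclass)
    super
    # A per-class module is included as soon as the subclass is created.
    subclass.include(subclass.generated_methods)
  end

  def self.generated_methods
    @generated_methods ||= Module.new
  end

  # Miniature macro: defines a replacing writer and a reader in the module.
  def self.has_many_attached(name)
    generated_methods.send(:define_method, "#{name}=") do |values|
      instance_variable_set("@#{name}", Array(values)) # replaces
    end
    generated_methods.send(:define_method, name) do
      instance_variable_get("@#{name}") || []
    end
  end
end

# Parameterized module: Append[:files] builds a module with an appending writer.
module Append
  def self.[](*fields)
    Module.new do
      fields.each do |field|
        define_method("#{field}=") do |values|
          instance_variable_set("@#{field}", public_send(field) + Array(values)) # appends
        end
      end
    end
  end
end

class Doc < Base
  include Append[:files] # included BEFORE the macro call on purpose
  has_many_attached :files
end

doc = Doc.new
doc.files = ["a.png"]
doc.files = ["b.png"]
p doc.files
p Doc.ancestors.first(3)
```

Because the module built by Append[:files] is included later than the generated one, it sits earlier in the ancestor chain, so its writer is found first regardless of where the macro call appears in the class body.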
Alternative/Last solution:
If you want something even more dynamic and robust, you can still create a model concern, but instead you loop inside the attachment_reflections of your model like so :
# we filter to exclude `has_one_attached` fields
reflection_names = Model.reflect_on_all_attachments
                        .filter { _1.macro == :has_many_attached }
                        .map { _1.name.to_s }
# => returns ['files']
reflection_names.each do |name|
  define_method :"#{name}=" do |attachables|
    # ....
  end
end
However I believe for this to work, you need to include this module after all the calls to your has_many_attached otherwise it won't work because the reflections array won't be fully populated ( each call to has_many_attached appends to that array)

Model's changes for Algolia not showing in the rails console in PRODUCTION

I have a model as below:
class Note < Record
  include Shared::ContentBasedModel

  algoliasearch disable_indexing: AppConfig.apis.algolia.disable_indexing do
    attributes :id, :key
    [:keywords, :tags, :description, :summary].each do |attr|
      attribute [attr] do
        self.meta[attr.to_s]
      end
    end
    attribute :content do
      Nokogiri.HTML(self.meta["html"]).text.split(' ').reject { |i| i.to_s.length < 5 }.map(&:strip).join ' '
    end
    attribute :photo do
      unless self.meta["images"].blank?
        self.meta["images"].first["thumb"]
      end
    end
    attribute :slug do
      to_param
    end
    attribute :url do
      Rails.application.routes.url_helpers.note_path(self)
    end
  end
end
I am using the AlgoliaSearch gem to index my models through Algolia's API. When I tried to index the model with some long content, I got the following error:
Error: Algolia::AlgoliaProtocolError (400: Cannot POST to https://XXXX.algolia.net/1/indexes/Note/batch: {"message":"Record at the position 1 objectID=56 is too big size=20715 bytes. Contact us if you need an extended quota","position":1,"objectID":"56","status":400} (400))
After this, I removed EVERYTHING, as shown below, BUT I am still getting the exact same error!!
class Note < Record
  include Shared::ContentBasedModel

  algoliasearch disable_indexing: AppConfig.apis.algolia.disable_indexing do
    attributes :id
  end
end
It seems that Rails does not update the cached models.
Environment: production
Rails version: v6
Question: Why is this happening, and how can I clear the cached model?
Note: I have tried everything, including removing the tmp/cache folder, but the error does not go away!
It looks like the object's size itself is bigger than some max allowed size.
objectID=56 is too big size=20715 bytes
Contact Algolia (https://www.algolia.com/), as the error message suggests:
Contact us if you need an extended quota
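Since the error reports the serialized size of a single record (size=20715 bytes for objectID=56), it can help to measure locally what a record would weigh as JSON. This is a minimal sketch: record_bytes and too_big? are hypothetical helpers, the sample hashes are made up, and the default limit is a placeholder rather than Algolia's documented quota, which depends on your plan.

```ruby
require 'json'

# Byte size of a record once serialized to JSON.
def record_bytes(record)
  JSON.generate(record).bytesize
end

# `limit` is a placeholder value, NOT Algolia's documented quota.
def too_big?(record, limit: 10_000)
  record_bytes(record) > limit
end

# Made-up sample records for illustration.
small = { "objectID" => "56", "id" => 56 }
large = small.merge("content" => "x" * 20_000)

puts record_bytes(small)
puts too_big?(small)
puts too_big?(large)
```

If a record that should only contain id still measures in the tens of kilobytes, that points towards old code being run rather than towards the model definition.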
How are you checking your code? Are you running rails console on your server? If you deploy with Capistrano or Mina, could it be that you are running an old release instead of the new one?

Callback for Active Storage file upload

Is there a callback for Active Storage files on a model?
after_update or after_save is called when a field on the model is changed. However, when you upload a new file, no callback seems to be called.
Context:
class Person < ApplicationRecord
  # name :string
  has_one_attached :id_document

  after_update :call_some_service

  def call_some_service
    # do something
  end
end
When a new id_document is uploaded, after_update is not called; however, when the name of the person is changed, the after_update callback is executed.
For now, it seems like there is no callback for this case.
What you could do is create a model for the Active Storage attachment, which is the record that gets created when you attach a file to your Person model.
So create a new model:
class ActiveStorageAttachment < ActiveRecord::Base
  after_update :after_update

  private

  def after_update
    if record_type == 'Person'
      record.do_something
    end
  end
end
The table for this model normally already exists in your database, so there is no need for a migration; just create the model.
Erm i would just comment but since this is not possible without rep..
Uelb's answer works, but you need to fix the error mentioned in the comments and add it as an initializer instead of a model. E.g.:
require 'active_storage/attachment'

class ActiveStorage::Attachment
  before_save :do_something

  def do_something
    puts 'yeah!'
  end
end
In my case, tracking the attachment timestamp worked:
class Person < ApplicationRecord
  has_one_attached :id_document

  after_save do
    if id_document.attached? && (Time.now - id_document.attachment.created_at) < 5
      Rails.logger.info "id_document change detected"
    end
  end
end
The answer from @Uleb got me 90% of the way, but for completeness' sake I will post my final solution.
The issue I had was that I was not able to monkey patch the class (not sure why; even requiring the class as per @user10692737 did not help).
So I copied the source code (https://github.com/rails/rails/blob/fc5dd0b85189811062c85520fd70de8389b55aeb/activestorage/app/models/active_storage/attachment.rb#L20)
and modified it to include the callback:
require "active_support/core_ext/module/delegation"

# Attachments associate records with blobs. Usually that's a one record-many blobs relationship,
# but it is possible to associate many different records with the same blob. If you're doing that,
# you'll want to declare with <tt>has_one/many_attached :thingy, dependent: false</tt>, so that destroying
# any one record won't destroy the blob as well. (Then you'll need to do your own garbage collecting, though).
class ActiveStorage::Attachment < ActiveRecord::Base
  self.table_name = "active_storage_attachments"

  belongs_to :record, polymorphic: true, touch: true
  belongs_to :blob, class_name: "ActiveStorage::Blob"

  delegate_missing_to :blob

  # CUSTOMIZED AT THE END:
  after_create_commit :analyze_blob_later, :identify_blob, :do_something

  # Synchronously purges the blob (deletes it from the configured service) and destroys the attachment.
  def purge
    blob.purge
    destroy
  end

  # Destroys the attachment and asynchronously purges the blob (deletes it from the configured service).
  def purge_later
    blob.purge_later
    destroy
  end

  private

  def identify_blob
    blob.identify
  end

  def analyze_blob_later
    blob.analyze_later unless blob.analyzed?
  end

  # CUSTOMIZED:
  def do_something
  end
end
I'm not sure it's the best method, and I will update this if I find a better solution.
None of these really hit the nail on the head, but you can achieve what you were looking for by following this blog post https://redgreen.no/2021/01/25/active-storage-callbacks.html
I was able to modify the code there to work on attachments instead of blobs like this
Rails.configuration.to_prepare do
  module ActiveStorage::Attachment::Callbacks
    # Gives us some convenient shortcuts, like `prepended`
    extend ActiveSupport::Concern

    # When prepended into a class, define our callback
    prepended do
      after_commit :attachment_changed, on: %i[create update]
    end

    # Callback method: notifies the owning record if it defines the hook
    def attachment_changed
      record.after_attachment_update(self) if record.respond_to? :after_attachment_update
    end
  end

  # After defining the module, call on ActiveStorage::Attachment to prepend it in.
  ActiveStorage::Attachment.prepend ActiveStorage::Attachment::Callbacks
end
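The respond_to? guard is what makes the hook opt-in: only records that define after_attachment_update get notified. The plain-Ruby sketch below isolates that part; FakeAttachment, PersonWithHook and PersonWithoutHook are hypothetical stand-ins written for this example, not Active Storage classes:

```ruby
# All classes here are hypothetical stand-ins, not Active Storage classes.
class FakeAttachment
  attr_reader :record

  def initialize(record)
    @record = record
  end

  # Mirrors the guarded call in the callback above.
  def attachment_changed
    record.after_attachment_update(self) if record.respond_to?(:after_attachment_update)
  end
end

class PersonWithHook
  attr_reader :notified_with

  # The opt-in hook: remembers which attachment triggered it.
  def after_attachment_update(attachment)
    @notified_with = attachment
  end
end

class PersonWithoutHook; end

person = PersonWithHook.new
attachment = FakeAttachment.new(person)
attachment.attachment_changed
puts person.notified_with.equal?(attachment)

# No hook defined: the call is skipped instead of raising NoMethodError.
FakeAttachment.new(PersonWithoutHook.new).attachment_changed
```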
What I do is add a callback on my record:
after_touch :check_after_touch_data
This gets called if an ActiveStorage object is added, edited or deleted. I use this callback to check if something changed.

Storing an image based upon nothing but a URL in Paperclip

I am using paperclip for a profile picture upload feature in my rails app. This works nicely for the default case of uploading images to a profile, but I want to allow users without a picture to pick from one of a selection of precanned 'stock' images.
These images are hosted locally, within my assets images folder. Therefore on these occasions I want to be able to add an image to my EventImage object without actually uploading an image, more just referencing a URL at a local path.
I have tried pretty much every answer from this post : Save image from URL by paperclip but none of them seem to work. I am using paperclip version paperclip (4.3.1 37589f9)
When I try the solution of :
def photo_from_url(url)
  puts "we got:" + url
  Thread.new do
    self.photo = URI.parse(url)
  end
end
It results in no image reference being stored, and regardless of the image URL I pass into that method, my image is never displayed when I do <%= image_tag @event.event_images.first.photo.url %>. Instead, it shows the default image for when an image has not been located or stored.
I also have to put it in a new thread, otherwise it gets tied up and blocks, resulting in a timeout, which seems to be a problem with URI.parse. The image also ends up failing validation because photo is 'empty', which my validation does not allow, so I end up removing the presence validation on :photo, which still does not solve the problem. I really just want the model's Paperclip attachment :photo to point to a local URL sometimes, and to my normally uploaded files at other times.
See the whole class here:
# == Schema Information
#
# Table name: event_images
#
# id :integer not null, primary key
# caption :string
# event_id :integer
# created_at :datetime not null
# updated_at :datetime not null
# photo_file_name :string
# photo_content_type :string
# photo_file_size :integer
# photo_updated_at :datetime
#
class EventImage < ActiveRecord::Base
  attr_accessor :PAPERCLIP_STORAGE_OPTS

  has_attached_file :photo, PAPERCLIP_STORAGE_OPTS
  validates_attachment_presence :photo
  validates_attachment_content_type :photo, :content_type => ["image/jpg", "image/jpeg", "image/gif", "image/png"]
  validates_with AttachmentSizeValidator, :attributes => :photo, :less_than => 3.megabytes

  belongs_to :event

  def photo_from_url(url)
    Thread.new do
      self.photo = URI.parse(url)
    end
  end
end
Instead, I have decided that it would be best to add a 'canned_image_id' to the EventImage model, then use a photo_url method in the model which can choose to return either the paperclip url or the canned image url depending on whether or not the image is a canned image. It also hides this complexity behind a helper method :)
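A minimal plain-Ruby sketch of that idea follows. CANNED_IMAGES, its paths and FakeEventImage are hypothetical names invented for this example; in the real model, the fallback would presumably be photo.url, and the presence validation on :photo would need to be skipped when a canned image is chosen.

```ruby
# CANNED_IMAGES and FakeEventImage are hypothetical; in the real app the
# paths would point into the assets folder and the model would be EventImage.
CANNED_IMAGES = {
  1 => "/assets/stock/party.png",
  2 => "/assets/stock/concert.png"
}.freeze

FakeEventImage = Struct.new(:canned_image_id, :uploaded_url) do
  # Prefer the canned image when one is selected, else the uploaded photo.
  def photo_url
    CANNED_IMAGES.fetch(canned_image_id) { uploaded_url }
  end
end

canned   = FakeEventImage.new(2, nil)
uploaded = FakeEventImage.new(nil, "/system/photos/1/original/me.png")

puts canned.photo_url
puts uploaded.photo_url
```

Views then call photo_url everywhere and never need to know which kind of image they are showing.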

Displaying paperclip attachments stored in a non-public folder

My conundrum is how to embed in an html page an image whose source is not available to the Internet at large.
Let's say I have, in a Rails/Paperclip setup, the following model:
class Figure < ActiveRecord::Base
  has_attached_file :image
  ...
end

class User < ActiveRecord::Base
  ... (authentication code here)
  has_many :figures
end
In the controller:
class FiguresController < ActionController::Base
  def show
    # users must be authenticated, and they can only access their own figures
    @figure = current_user.figures.find(params[:id])
  end
end
In the view:
<%= image_tag(@figure.image.url) %>
The problem with this, of course, is that with the default Paperclip settings images are stored in the public directory, and anyone with the link can access the stored image, bypassing authentication/authorization.
Now, if we tell Paperclip to store attachments in a private location:
class Figure < ActiveRecord::Base
  has_attached_file :image,
    path: ":rails_root/private/:class/:attachment/:id_partition/:style/:filename",
    url: ":rails_root/private/:class/:attachment/:id_partition/:style/:filename"
  ...
end
Then it's easy to control who the image gets served to:
class FiguresController < ActionController::Base
  def show
    @figure = current_user.figures.find(params[:id])
    send_file @figure.image.path, type: 'image/jpeg', disposition: 'inline'
  end
end
The effect of this action is to display the image in its own browser window/tab.
On the other hand, image_tag(@figure.image.url) will understandably produce a routing error, because the source cannot be accessed!
Thus, is there a way to display the image via image_tag in a regular HTML page, while still restricting access to it?
You need to change the :url option passed to has_attached_file so that it matches the route for your figures controller.
For example, if the correct URL is /figures/123 for the figure with id 123, then the url you pass to has_attached_file should be
'/figures/:id'
Or even
'/:class/:id'
This works because the :class segment is interpolated to the pluralized, lowercase, underscored form of the class name. You could also append the extension or the filename if you wanted (but you would then have to change the controller code slightly to extract the id).
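To make the interpolation concrete, here is a toy re-implementation in plain Ruby. The interpolate_url helper is hypothetical and only handles :class and :id with naive pluralization; Paperclip's real interpolation supports many more placeholders and proper inflection.

```ruby
# A toy re-implementation for illustration only; this is NOT Paperclip's code.
def interpolate_url(pattern, class_name:, id:)
  # "EventImage" -> "event_image"
  underscored = class_name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
  plural = "#{underscored}s" # naive pluralization, an assumption
  pattern.sub(":class", plural).sub(":id", id.to_s)
end

puts interpolate_url("/:class/:id", class_name: "Figure", id: 123)
puts interpolate_url("/figures/:id", class_name: "Figure", id: 123)
```

Both patterns resolve to the same path for a Figure, which is why either one lines up with the show route of the figures controller.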
