Rails won't read a link with nokogiri and open-uri - ruby-on-rails

I have a controller that receives a URL as a parameter, and I am trying to scrape the entire page at that URL. But when I try to read the URL I get the following error: No such file or directory @ rb_sysopen - www.google.com
Controller:
class PageScraperController < ApplicationController
require 'nokogiri'
require 'open-uri'
require 'diffy'
require 'htmlentities'
def scrape
require 'open-uri'
@url = watched_link_params.to_s
@url = @url.slice(9..@url.length-3)
puts "LOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOG#{@url}"
page = Nokogiri::HTML(open(@url))
# coder = HTMLEntities.new
# encodedHTML = coder.encode(page)
puts page
end
def watched_link_params
params.require(:default).permit(:url)
end
end

Try this:
def scrape
@url = watched_link_params[:url]
page = Nokogiri::HTML(open(@url))
puts page
end
You will need to pass in the entire url, including the protocol designator; that is to say, you need to use http://www.google.com instead of www.google.com:
>> params = ActionController::Parameters.new(default: {url: 'http://www.google.com'})
>> watched_link_params = params.require(:default).permit(:url)
>> @url = watched_link_params[:url]
"http://www.google.com"
>> page = Nokogiri::HTML(open(@url))
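On Ruby 3.0 and later, Kernel#open no longer opens URLs even with open-uri loaded, so URI.open is the call to use there. A minimal sketch of the scheme check described above, assuming a hypothetical helper named normalize_url that is not part of the original controller:

```ruby
require 'open-uri'

# Hypothetical helper: make sure the URL carries a scheme so that open-uri
# treats it as a URL instead of a local file path.
def normalize_url(raw)
  url = raw.to_s.strip
  # Leave the value alone if it already starts with something like "http://".
  url.match?(%r{\A[a-z][a-z0-9+.-]*://}i) ? url : "http://#{url}"
end

# In the controller, assuming nokogiri is loaded:
#   page = Nokogiri::HTML(URI.open(normalize_url(watched_link_params[:url])))
```

Defaulting to http:// is an assumption made for illustration; an application may prefer to reject scheme-less input instead.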

Related

How to open a file with Ruby?

I am trying to open a file from my Rails User model and call it from the users controller, but it keeps raising wrong number of arguments (given 0, expected 1..3).
This is my file layout: 'app', 'assets', 'files', 'test_list.txt'
'app', 'controllers', 'users controller'
Can you help? Thanks.
class User < ApplicationRecord
def self.my_method
my_array = []
file = File.join(Rails.root, 'app' 'models','assets', 'files', 'test_list.txt')
File.open.each do |line|
my_array << line.gsub!(/\n?/, "")
end
return my_array.to_s
end
end
class UsersController < ApplicationController
require 'open-uri'
require 'net/http'
def show
# uri = URI('https://gist.githubusercontent.com/Kalagan/3b26be21cbf65b62cf05ab549433314e/raw')
# data = Net::HTTP.get(uri)
# anagrams = data.split(/\n/)
@vari = User.my_method
@query = params[:query]
@results = anagrams.select { |word| @query.split('').sort.join == word.split('').sort.join }
end
end
You're passing nothing to the open method. Pass it the filename.
Change
File.open
to
File.open(file)
The open method needs at least the name of the file it has to open.
I think you missed a comma. You can write the code below.
file = File.join(Rails.root, 'app', 'models','assets', 'files', 'test_list.txt')
and for reading the content:
File.open(file) do |f|
f.each_line do |line|
p line
end
end
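To collect the lines into an array, as the model method in the question tries to do, File.readlines with chomp: true returns them without trailing newlines. A minimal sketch; the method name read_word_list is hypothetical, and the path is passed in rather than hard-coded:

```ruby
# Hypothetical helper: returns every line of the file as an array of strings,
# with the trailing newline removed from each entry.
def read_word_list(path)
  File.readlines(path, chomp: true)
end

# Possible use inside the model, assuming the file lives under app/assets/files:
#   read_word_list(Rails.root.join('app', 'assets', 'files', 'test_list.txt'))
```

Returning the array itself, rather than my_array.to_s, keeps the result usable with select in the controller.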

Rails invalid byte sequence in UTF-8 when using htmlentities gem

I have a controller that scrapes the entire HTML of a page and stores it in a MySQL database. Before I store the data I want to encode it using the htmlentities gem. With some websites it works fine, e.g. https://www.lookagain.co.uk/, but with others, such as https://www.google.co.uk/, I get invalid byte sequence in UTF-8 and I do not know why. At first I thought something might be wrong with the database, so I changed all the fields to LONGTEXT, but the problem persists.
Controller:
class PageScraperController < ApplicationController
require 'nokogiri'
require 'open-uri'
require 'diffy'
require 'htmlentities'
def scrape
@url = watched_link_params[:url].to_s
puts "LOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOG#{@url}"
@page = Nokogiri::HTML(open(@url))
coder = HTMLEntities.new
@encodedHTML = coder.encode(@page)
create
end
def index
@savedHTML = ScrapedPage.all
end
def show
@savedHTML = ScrapedPage.find(params[:id])
end
def new
@savedHTML = ScrapedPage.new
end
def create
@savedHTML = ScrapedPage.create(domain: @url, html: @encodedHTML, css: '', javascript: '')
if @savedHTML.save
puts "ADDED TO THE DATABASE"
redirect_to(root_path)
else
puts "FAILED TO ADD TO THE DATABASE"
end
end
def edit
end
def update
end
def delete
@watched_links = ScrapedPage.find(params[:id])
end
def destroy
@watched_links = ScrapedPage.find(params[:id])
@watched_links.destroy
redirect_to(root_path)
end
def watched_link_params
params.require(:default).permit(:url)
end
end

Running into "uninitialized constant" with whenever gem

I'm trying to use the whenever gem today and am running into this error: uninitialized constant EntriesController::RedditScrapper ... how do I fix this?
Current Controller
class EntriesController < ApplicationController
def index
@entries = Entry.all
end
def scrape
RedditScrapper.scrape
respond_to do |format|
format.html { redirect_to entries_url, notice: 'Entries were successfully scraped.' }
format.json { entriesArray.to_json }
end
end
end
lib/reddit_scrapper.rb
require 'open-uri'
module RedditScrapper
def self.scrape
doc = Nokogiri::HTML(open("https://www.reddit.com/"))
entries = doc.css('.entry')
entriesArray = []
entries.each do |entry|
title = entry.css('p.title > a').text
link = entry.css('p.title > a')[0]['href']
entriesArray << Entry.new({ title: title, link: link })
end
if entriesArray.map(&:valid?)
entriesArray.map(&:save!)
end
end
end
config/schedule.rb
RAILS_ROOT = File.expand_path(File.dirname(__FILE__) + '/')
every 2.minutes do
runner "RedditScrapper.scrape", :environment => "development"
end
Please help me to figure out the right runner task to write in ...
Application.rb
require_relative 'boot'
require 'rails/all'
Bundler.require(*Rails.groups)
module ScrapeModel
class Application < Rails::Application
config.autoload_paths << Rails.root.join('lib')
end
end
Rails doesn't auto load the lib folder. You need to add the following line to your config/application.rb:
config.autoload_paths << Rails.root.join('lib')
From what I can tell, you've defined RedditScrapper as a module, but you are trying to use it as a class... (ie calling a method on it).
You can either: turn it into a class (just change module to class) OR define all relevant methods as module_functions
The former is probably preferable given your chosen usage.
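As a side note, Ruby allows a method defined with def self. to be called directly on a module, so both styles mentioned above work for a stateless helper. A small sketch with a hypothetical module DemoScraper standing in for the real one:

```ruby
# Hypothetical stand-in for lib/reddit_scrapper.rb; names are illustrative only.
module DemoScraper
  # Defined on the module itself, so DemoScraper.scrape works without a class.
  def self.scrape
    :scraped_via_self
  end

  module_function

  # module_function makes this callable as DemoScraper.parse as well.
  def parse(title)
    title.strip
  end
end

puts DemoScraper.scrape.inspect
puts DemoScraper.parse('  hello  ').inspect
```

If both calls succeed in a plain Ruby session, an uninitialized constant error in Rails is more likely related to how the file is loaded than to the module keyword.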

If open-uri works, why does net/http return an empty string?

I am attempting to download a page from Wikipedia. For such a task, I am using gems. When using net/http, all I get is an empty string. So I tried with open-uri and it works fine.
Nevertheless, I prefer the first option because it gives me a much more explicit control; but why is it returning an empty string?
class Downloader
attr_accessor :entry, :url, :page
def initialize
# require 'net/http'
require 'open-uri'
end
def getEntry
print "Article name? "
@entry = gets.chomp
end
def getURL(entry)
if entry.include?(" ")
@url = "http://en.wikipedia.org/wiki/" + entry.gsub!(/\s/, "_")
else
@url = "http://en.wikipedia.org/wiki/" + entry
end
@url.downcase!
end
def getPage(url)
=begin THIS FAULTY SOLUTION RETURNS AN EMPTY STRING ???
connector = URI.parse(url)
connection = Net::HTTP.start(connector.host, connector.port) do |http|
http.get(connector.path)
end
puts "Body:"
@page = connection.body
=end
@page = open(url).read
end
end
test = Downloader.new
test.getEntry
test.getURL(test.entry)
test.getPage(test.url)
puts test.page
P.S.: I am an autodidact programmer so the code might not fit good practices. My apologies.
Because your request returns a 301 redirect (check the value of connection.code), you have to follow the redirect manually when using net/http. open-uri follows redirects for you, which is why it returns the page.
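Following the redirect by hand could look roughly like this. It is a sketch under assumptions: the helper name fetch_following_redirects, the limit of 5 hops, and the injectable fetcher argument (there only so the loop can be exercised without network access) are all made up for illustration.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: requests `url`, follows up to `limit` redirects, and
# returns the final Net::HTTPResponse. `fetcher` defaults to a real request.
def fetch_following_redirects(url, limit: 5,
                              fetcher: ->(uri) { Net::HTTP.get_response(uri) })
  uri = URI.parse(url)
  limit.times do
    response = fetcher.call(uri)
    location = response['location']
    # Anything that is not a redirect with a Location header is the answer.
    return response unless response.is_a?(Net::HTTPRedirection) && location

    # Location may be relative, so resolve it against the current URI.
    uri = URI.join(uri.to_s, location)
  end
  raise "too many redirects for #{url}"
end

# Possible use in getPage:
#   @page = fetch_following_redirects(url).body
```

Capping the number of hops guards against redirect loops; the exact limit is a design choice.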

getting error bad URI when putting facebook access token on the url

This is the error I get:
Bad Request
bad URI `/v1/user/authenticate.json?access_key=1443560867|2.AQBCC2jMKOEzSjnO.3600.1312826400.0-1129666978|VttMJncSU17Br-g38R9eGF5_qCQ'.
The authenticate method:
require 'open-uri'
class UserController < ApplicationController
respond_to :json
def authenticate
file = open(URI.encode("https://graph.facebook.com/me/permissions?access_token=" + params[:access_key]))
facebook = JSON.parse(file.read)
if facebook["data"].present?
@result = "200"
else
@result = "403"
end
respond_with(@result)
end
end
EDIT: SOLVED THE CODE WORKS ON HEROKU... THE PROBLEM IS ON THE LOCALHOST:3000
On a WEBrick server, simply add the following to config/environments/development.rb:
# Allow webrick to accept pipe char in query string
URI::DEFAULT_PARSER = URI::Parser.new(:UNRESERVED => URI::REGEXP::PATTERN::UNRESERVED + '|')
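Independently of the server setting, pipe characters can be percent-encoded when a query string is built. A sketch using URI.encode_www_form from the standard library (URI.encode itself was removed in Ruby 3.0); the helper name permissions_url and the sample token are made up for illustration:

```ruby
require 'uri'

# Hypothetical helper: builds the Graph API URL with the token safely escaped.
def permissions_url(access_token)
  uri = URI('https://graph.facebook.com/me/permissions')
  # encode_www_form percent-encodes reserved characters such as '|'.
  uri.query = URI.encode_www_form(access_token: access_token)
  uri.to_s
end

puts permissions_url('123|abc|xyz')
```

The same idea applies to whatever client calls /v1/user/authenticate.json: if it escapes the token before sending, the server never sees a raw pipe in the query string.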
