Scrape image from YouTube video page

When I copy a YouTube link, like this one, for example (http://www.youtube.com/watch?v=_Ka2kpzTAL8), and paste it into a Facebook status update, Facebook somehow grabs a still of the video. I'd like to build a feature that does the same thing, but I don't really know how to find the image.
Can anyone point me in the right direction? I'm open to PHP, Python, Ruby or Perl.

Alright, found it!
Take this link: http://i4.ytimg.com/vi/something/hqdefault.jpg and replace something with the video's ID on YouTube. For a smaller image, just remove the hq part of the URL.
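For example, here's a minimal Ruby sketch of that pattern (the helper name is mine; it assumes the standard watch?v= URL format):
require 'uri'
require 'cgi'

# Build the thumbnail URLs described above from a watch URL.
def youtube_thumbnail_urls(video_url)
  video_id = CGI.parse(URI.parse(video_url).query)['v'].first
  {
    high:  "http://i4.ytimg.com/vi/#{video_id}/hqdefault.jpg",  # hq still
    small: "http://i4.ytimg.com/vi/#{video_id}/default.jpg"     # hq removed
  }
end

youtube_thumbnail_urls('http://www.youtube.com/watch?v=_Ka2kpzTAL8')
# => {:high=>"http://i4.ytimg.com/vi/_Ka2kpzTAL8/hqdefault.jpg",
#     :small=>"http://i4.ytimg.com/vi/_Ka2kpzTAL8/default.jpg"}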

I can also confirm this implementation is the way to go. I used it in a Ruby screen scraper (a search application): http://stripsearch.me
# search10b YouTube crawl9
@search10a1 = search_array[1]
if search_array[1].present?
  @results_9b = []
  params_9b = search_array[1].split.map(&:downcase).join('+')
  search_url_yt_9b = url_9b << params_9b
  doc_9b = Nokogiri::HTML(open(search_url_yt_9b).read, nil, 'utf-8')
  entries_9b = doc_9b.css('.yt-ui-ellipsis-2.spf-link')
  if entries_9b.size != 0
    entries_9b.each do |entry_9b|
      title = entry_9b['title']
      li = "https://www.youtube.com"
      nk = entry_9b['href']
      link = li << nk
      # the watch URL ends in ?v=<id>, so the video ID is the last '=' segment
      img_src = entry_9b['href']
      uid = img_src.split('=').last
      precede_url = "http://i4.ytimg.com/vi/"
      append_url = "/hqdefault.jpg"
      img_url = precede_url << uid << append_url
      precede_tag = "<img src="
      append_tag = ">"
      img = precede_tag << img_url << append_tag
      @results_9b.push(title, link, img)
    end
  else
    @error_9b = "no results found here for " + params_9b
  end
end

Related

Ruby on Rails + Google API - How to get the correct numeration from numbered lists

Context:
On our platform we allow users to upload Word documents; those documents are stored in Google Drive and then downloaded again to our platform in HTML format to create a section where users can interact with that content.
Rails 5.0.7
Ruby 2.5.7p206
google-api-client 0.53.0
nokogiri 1.10.0
Problem:
[Table with two screenshot columns: "Original Word Doc" vs. "After rendering in our site"]
The initial problem we had was that the lists, after being rendered on our site, didn't keep the correct numbering. We already fixed that, and the lists now render correctly, as shown below.
[Table with two screenshot columns: "Original Word Doc" vs. "After rendering in our site"]
The issue we are facing now is with lists that continue the numbering even though they are placed in different sections of the document: after running the script that fixes the numbering, it also groups those lists into one single list, as shown below.
[Table with two screenshot columns: "Original Word Doc" vs. "After rendering in our site"]
We want to know if there is an easy way, using google-api-client or other gems, to get more information from the file so we can tell that a list is a continuation even though it is not positioned immediately after the previous one.
This is the code of our current implementation; we are using Nokogiri to get the OL elements from the document:
def reorganize_list_html(ols, pos = 1)
  return ols if ols.length == pos || ols.empty?
  ol = ols[pos]
  list_position = get_list_position(ol)
  last_parent = ols.xpath('//*[@last_parent]')[0]
  last_parent_position = last_parent.present? ? last_parent.attribute('last_parent').to_s.to_i : list_position
  last_parent.remove_attribute('last_parent') if last_parent.present?
  if list_position == last_parent_position
    li = ol.css('li').last
    margin_value = calculate_margin_value(li)
    pending_parents = ols.xpath('//*[@pending_parent]')
    if pending_parents.present?
      prev_li = pending_parents.first
      prev_li.remove_attribute('pending_parent')
      prev_ol = prev_li.parent
    else
      prev_ol = ols[pos - 1]
      prev_li = prev_ol.css('li').last
    end
    prev_margin_value = calculate_margin_value(prev_li)
    if margin_value == prev_margin_value
      prev_ol.add_child(li)
      prev_ol.set_attribute('last_parent', list_position)
    elsif margin_value > prev_margin_value
      prev_li.add_child(ol)
      prev_ol.set_attribute('last_parent', list_position)
    else
      parent = prev_ol.parent
      while margin_value < prev_margin_value
        prev_margin_value = calculate_margin_value(parent)
        parent = parent.parent
      end
      if parent.name == 'li'
        parent.add_child(li)
      else
        last_child = ol.children.last
        last_child.set_attribute('pending_parent', true)
        parent.add_child(ol.children)
      end
      parent.set_attribute('last_parent', list_position)
    end
  end
  empty_ols = ols.xpath('//*[not(*) and normalize-space(text())=""]')
  full_ols = ols - empty_ols
  next_pos = full_ols.length < ols.length ? pos : pos + 1
  reorganize_list_html(full_ols, next_pos)
end

Convert image into base64 using capybara

I'm currently using Capybara to run some scraping tasks as well as site testing. I have been having difficulties downloading images/files with Capybara. All the documentation I found only covers simple button, form, and link interactions.
Would really appreciate it if someone knows how to download/convert images on a webpage into base64 format.
This example extracts an image from a web page with Capybara / Selenium:
require 'capybara'
require 'base64'

JS_GET_IMAGE = "
  var ele = arguments[0], callback = arguments[1], img = new Image();
  img.crossOrigin = 'Anonymous';
  img.onload = function(){
    var cnv = document.createElement('CANVAS');
    cnv.width = this.width;
    cnv.height = this.height;
    cnv.getContext('2d').drawImage(this, 0, 0);
    var type = this.src.endsWith('png') ? 'png' : 'jpeg';
    callback(cnv.toDataURL('image/' + type).substring(22));
  };
  var src = ele.src || window.getComputedStyle(ele).backgroundImage;
  img.src = /https?:/.test(src) ? src.match(/https?:[^\"')]+/)[0] : callback(''); "

session = Capybara::Session.new(:selenium)
driver = session.driver.browser
driver.manage.timeouts.script_timeout = 5000

# navigate to google
session.visit "https://www.google.co.uk/"

# get the logo element
ele = session.find(:css, '#hplogo img:nth-child(1)')

# get the logo as a base64 string
imgBase64 = driver.execute_async_script(JS_GET_IMAGE, ele.native)

# save to a file
file = File.new("C:\\temp\\image." + (imgBase64[0] == 'i' ? 'png' : 'jpg'), 'wb')
file.write(Base64.decode64(imgBase64))
file.close
Just looked through the Poltergeist driver (used with Capybara) and found render_base64 and save_screenshot methods, which can save the image as a PNG or JPG file; after that I could crop the part I wanted. The methods can be found here: https://github.com/teampoltergeist/poltergeist
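As a rough sketch of that approach (assuming the poltergeist gem and PhantomJS are installed; render_base64 and its options are per the Poltergeist README):
require 'capybara'
require 'capybara/poltergeist'
require 'base64'

Capybara.register_driver(:poltergeist) { |app| Capybara::Poltergeist::Driver.new(app) }
session = Capybara::Session.new(:poltergeist)
session.visit 'https://example.com/'

# render_base64 returns the rendered page as a base64-encoded string;
# :png or :jpeg plus options like full: true are supported
b64 = session.driver.render_base64(:png, full: true)
File.binwrite('page.png', Base64.decode64(b64))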

How do I pull the correct image URL from this Wikipedia table?

I built a scraper to pull all the information out of a Wikipedia table and upload it to my database. All was good until I realized I was pulling the wrong URL for the images: I wanted the actual image URL "http://upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg" and not the "/wiki/File:Baconbutty.jpg" it was apt to give me. Here is my code so far:
def initialize
  @url = "http://en.wikipedia.org/wiki/List_of_sandwiches"
  @nodes = Nokogiri::HTML(open(@url))
end

def summary
  sammich_data = @nodes
  sammiches = sammich_data.css('div.mw-content-ltr table.wikitable tr')
  sammich_data.search('sup').remove
  sammich_hashes = sammiches.map {|x|
    if content = x.css('td')[0]
      name = content.text
    end
    if content = x.css('td a.image').map {|link| link['href']}
      image = content[0]
    end
    if content = x.css('td')[2]
      origin = content.text
    end
    if content = x.css('td')[3]
      description = content.text
    end
    # build the row hash from the extracted fields
    { name: name, image: image, origin: origin, description: description }
  }
end
My issue is with these lines:
if content = x.css('td a.image').map {|link| link['href']}
  image = content[0]
If I go to td a.image img, it just gives me a null entry.
Any suggestions?
Here's how I'd do it (if I were to scrape Wikipedia, which I wouldn't, because they have an API for this stuff):
require 'nokogiri'
require 'open-uri'
require 'pp'
doc = Nokogiri::HTML(open("http://en.wikipedia.org/wiki/List_of_sandwiches"))
sammich_hashes = doc.css('table.wikitable tr').map { |tr|
  name, image, origin, description = tr.css('td,th')
  name, origin, description = [name, origin, description].map{ |n| n && n.text ? n.text : nil }
  image = image.at('img')['src'] rescue nil
  {
    name: name,
    origin: origin,
    description: description,
    image: image
  }
}
pp sammich_hashes
Which outputs:
[
{:name=>"Name", :origin=>"Origin", :description=>"Description", :image=>nil},
{
:name=>"Bacon",
:origin=>"United Kingdom",
:description=>"Often served with ketchup or brown sauce",
:image=>"//upload.wikimedia.org/wikipedia/commons/thumb/3/38/Baconbutty.jpg/120px-Baconbutty.jpg"
},
... [lots removed] ...
{
:name=>"Zapiekanka",
:origin=>"Poland",
:description=>"A halved baguette or other bread usually topped with mushrooms and cheese, ham or other meats, and vegetables",
:image=>"//upload.wikimedia.org/wikipedia/commons/thumb/1/12/Zapiekanka_3..jpg/120px-Zapiekanka_3..jpg"
}
]
If an image isn't available, the field will be set to nil in the returned hashes.
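Note that Wikipedia returns protocol-relative src values (starting with //), as in the output above; if you need a fetchable absolute URL, prefix a scheme:
image = "https:#{image}" if image && image.start_with?('//')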
You could use the srcset attribute of the img element, split it, and keep one of the available resized images:
if content = x.at_css('td a.image img')
  image = content['srcset'].split(' 1.5x,').first
end

Execute big scripts with Rails

I made a very big script to feed my initial data into my Rails app. I have about 3000 lines in my CSV and 10000 images.
After maybe 300 uploads I got this message:
/usr/local/lib/ruby/gems/1.9.1/gems/activesupport-3.0.9/lib/active_support/core_ext/kernel/agnostics.rb:7:in ``': Cannot allocate memory - identify -format %wx%h '/tmp/stream20111104-14788-1hsumv7.jpg[0]' (Errno::ENOMEM)
My upload script :
if (row[28] != nil)
  hotelalbum = HotelAlbumPhoto.find_or_create_by_title(h.title)
  hotelalbum.description = "Album photo de l'hotel " + h.title.capitalize
  hotelalbum.hotel_id = h.id
  hotelalbum.save

  files = Dir.glob('IMAGES/' + row[28].gsub(/\\/,'/') + '/*.jpg')
  i = 0
  for file in files
    i += 1
    photopath = File.expand_path('../../import', __FILE__) + '/' + file
    photoname = file.split('/').last
    if (i == 1)
      hotelalbum.thumbnail = open(photopath)
      hotelalbum.save
    end
    if (i == 1)
      h.thumbnail = open(photopath)
    end
    photo = HotelImage.find_or_create_by_image_file_name_and_hotel_album_photo_id(photoname, hotelalbum.id)
    if (photo.image_file_size == nil || photo.image_file_name != photoname)
      photo.image = open(photopath)
      photo.activated = true
      photo.alt = "Photo de l'hotel " + h.title
      photo.save
    else
      puts photopath + ' already updated'
    end
  end
end
When I check memory with the top command, I see the Ruby process using more memory on each upload. How can I manage this?
Thanks for any help.
PS: My server is a virtual machine with 512 MB of memory; one solution is to increase it, but I hope to find another.
I don't know where the open function is defined, but I'm suspicious that I don't see a corresponding close...
Update: Better idea, change photo.image = open(photopath) to photo.image = File.read(photopath).
According to the docs, read:
Opens the file, optionally seeks to the given offset, then
returns length bytes (defaulting to the rest of the file).
read ensures the file is closed before returning.
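Applied to the script above, the change for each assignment is simply:
# before: open leaves a file handle around for the GC to clean up later
photo.image = open(photopath)
# after: File.read opens, reads, and closes the file in one call
photo.image = File.read(photopath)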
Looks like a memory leak in ImageMagick? Maybe it will help to process the list in big blocks or chunks with in_groups_of and to force garbage collection with GC.start after each chunk:
files.in_groups_of(100) { |chunk|
  # process chunk
  GC.start
}

How to get Google search results links and store them in an array using Mechanize

I want to get the first 10 Google search results links (href) using Mechanize, so I wrote this code, but it does not return the right Google search results. What should I write?
@searchword = params[:q]
@sitesurl = Array.new
agent = Mechanize.new
page = agent.get("http://www.google.com")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = @searchword.to_s
search_results = agent.submit(search_form)
count = 0
c = 0
while c < 10
  if (search_results/"li")[count].attributes['class'].to_s == "g knavi"
    site = (search_results/"li")[count]
    code = (site/"a")[0].attributes['href']
    @sitesurl << code
    c += 1
  end
  count += 1
end
Something like this should work:
@searchword = params[:q]
@sitesurl = Array.new
agent = Mechanize.new
page = agent.get("http://www.google.com")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = @searchword.to_s
search_results = agent.submit(search_form)

(search_results/"li.g").each do |result|
  @sitesurl << (result/"a").first.attribute('href') if result.attribute('class').to_s == 'g knavi'
end
This is the updated one for now. Tested, and it works fine:
require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = Mechanize.new
agent.user_agent_alias = 'Linux Firefox'
page = agent.get('http://google.com/')
google_form = page.form('f')
google_form.q = 'your search'
page = agent.submit(google_form)

page.links.each do |link|
  if link.href.to_s =~ /url.q/
    str = link.href.to_s
    strList = str.split(%r{=|&})
    url = strList[1]
    # if you need cached URLs then just remove this condition and simply use the URLs
    if !url.include? "webcache"
      puts url
    end
  end
end
Just create a new array and push the URLs to it.
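For instance, a small variation of the loop above that collects into an array instead of printing (the urls name is mine):
urls = []
page.links.each do |link|
  next unless link.href.to_s =~ /url.q/
  url = link.href.to_s.split(%r{=|&})[1]
  urls << url unless url.include?("webcache")
end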
It's not working now; I suppose that could be because Google recently changed their HTML in the search results and URLs.
Nowadays, the answers above don't work anymore. We have released our own gem that is easy to use and allows custom locations:
query = GoogleSearchResults.new q: "coffee", location: "Portland"
hash_results = query.get_hash
Repository: https://github.com/serpapi/google-search-results-ruby
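If you just need the links, they are in the returned hash; for example (assuming the gem's current response shape, with organic results under :organic_results and each URL under :link):
links = hash_results[:organic_results].map { |r| r[:link] }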
