How to check if a url contains video? - ruby-on-rails

I am creating a Ruby on Rails script in which a user shares a link and, if the link contains a video, the embed code of the video is extracted. In other words, I am trying to implement a Facebook-style "post link" feature. Can someone please guide me on how this can be achieved?

The only way I can think of to do this would be to manually check each post for a link to the video, for example:
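# Assumes YouTube's standard /embed/ URL; VIDEO_ID is the placeholder replaced below
YOUTUBE_EMBED = '<iframe title="YouTube video player" width="640" height="390" src="https://www.youtube.com/embed/VIDEO_ID" frameborder="0" allowfullscreen></iframe>'
if comment =~ /https?:\/\/(\w+\.)?youtube\.com\/watch\?v=([\w-]+)/
  return YOUTUBE_EMBED.gsub(/VIDEO_ID/, $2)
end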
Then repeat this process for each video site. I am wrestling with a similar concept so if you figure out a better way to do it let me know!
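One way to avoid repeating the same if-block for every site is a small lookup table. The following is only a sketch: the constant PROVIDERS, the method embed_for, the iframe dimensions and the choice of providers are assumptions made for illustration, and the patterns cover only the most common URL shapes.

```ruby
# Each provider pairs a pattern that captures the video ID with an embed URL template.
PROVIDERS = [
  { pattern: %r{https?://(?:www\.)?youtube\.com/watch\?(?:.*&)?v=([\w-]+)},
    embed:   "https://www.youtube.com/embed/%s" },
  { pattern: %r{https?://youtu\.be/([\w-]+)},
    embed:   "https://www.youtube.com/embed/%s" },
  { pattern: %r{https?://(?:www\.)?vimeo\.com/(\d+)},
    embed:   "https://player.vimeo.com/video/%s" }
].freeze

# Returns iframe HTML for the first recognised video link in text, or nil.
def embed_for(text)
  PROVIDERS.each do |provider|
    match = provider[:pattern].match(text)
    next unless match

    src = format(provider[:embed], match[1])
    return %(<iframe width="640" height="390" src="#{src}" frameborder="0" allowfullscreen></iframe>)
  end
  nil
end
```

Supporting another site then means adding one more entry to the table rather than another branch.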

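You can analyze the HTTP headers for that link:
require "net/http"
Net::HTTP.start('', 80) do |http|
  # '/video.mpg' is a hypothetical path used for illustration
  puts http.head('/video.mpg')['content-type']
end
# Outputs "video/mpeg"
Now that you know it's really a video, do what you want with it.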
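The header check can be wrapped up roughly as follows. This is a sketch under assumptions: the method names video_content_type? and video_url? are made up here, redirects are not followed, and a page that merely embeds a player (such as a YouTube watch page) will report text/html, so this only detects links that point directly at a video file.

```ruby
require "net/http"
require "uri"

# True when a Content-Type header value names a video media type,
# ignoring parameters such as "; charset=..." and letter case.
def video_content_type?(content_type)
  return false if content_type.nil?

  media_type = content_type.split(";").first.to_s.strip.downcase
  media_type.start_with?("video/")
end

# Sends a HEAD request (so no body is downloaded) and inspects the header.
def video_url?(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    response = http.head(uri.request_uri)
    video_content_type?(response["content-type"])
  end
end
```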

You could also use a utility like file and fork a system process that runs a command such as file -b downloaded_file.mpg.
So your code would look something like this:
IO.popen("file -b /path/to/video.mpg") { |stdout| @stdout = stdout.gets }
if @stdout.to_s =~ /MPEG/
  puts "MPEG Detected"
end

Flash videos usually have the extension .flv, so you just need to look for files that have it.
If you need other file formats, just change the regexp.
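A minimal sketch of the extension check, using File.extname on the URL path instead of a regexp. The list VIDEO_EXTENSIONS and the method name video_extension? are assumptions; extend the list for other formats.

```ruby
require "uri"

# Extensions treated as video files (an illustrative, not exhaustive, list).
VIDEO_EXTENSIONS = %w[.flv .mp4 .mpg .mpeg .webm .mov .avi].freeze

# True when the path part of the URL ends in a known video extension.
# The query string is ignored, so "?file=x.flv" alone does not count.
def video_extension?(url)
  path = URI.parse(url).path.to_s
  VIDEO_EXTENSIONS.include?(File.extname(path).downcase)
rescue URI::InvalidURIError
  false
end
```

This only looks at the name, so it is cheap but easy to fool; combining it with a header check gives a more reliable answer.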

You can use an external service like
Here is a gem that may help you


Youtube link to embed - What is going on in the code?

Could someone break down what Ruby is doing here?
I would like to understand the code so I could adapt it in the future.
One of the things I wanted to do was to adapt it to Vimeo, but I need to understand what is going on so I can learn more.
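module ApplicationHelper
  def youtube_embed(youtube_url)
    if youtube_url[/youtu\.be\/([^\?]*)/]
      youtube_id = $1
    else
      # Regex from Stack Overflow. Assumed pattern: it leaves the video ID in capture group 5.
      youtube_url[/^.*((v\/)|(embed\/)|(watch\?))\??v?=?([^\&\?]*).*/]
      youtube_id = $5
    end
    # The embed base URL is assumed to be YouTube's standard /embed/ path.
    %Q{<iframe title="YouTube video player" src="https://www.youtube.com/embed/#{ youtube_id }?rel=0&enablejsapi=1" frameborder="0" allowfullscreen></iframe>}
  end
end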
This piece of code tries to build the embed code for a YouTube player from any YouTube video link.
The line if youtube_url[/youtu\.be\/([^\?]*)/] checks whether the given URL is in the short youtu.be format and, if so, extracts the video ID. ($1 contains the first capture group of the last regular expression match, i.e. the part matched by the parentheses in the expression between /.../.)
If the string is not in that format, the code uses a trickier regular expression (you can find its explanation by following the Stack Overflow link in the code) and then gets the video ID from $5.
It then returns a string containing the HTML for the embedded YouTube player for that video, inserting the video ID into the string with #{youtube_id}.
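To make the capture mechanics concrete, here is a small sketch. The method names are made up for illustration, and the Vimeo pattern assumes URLs of the form vimeo.com/<digits>, which does not cover every Vimeo URL shape.

```ruby
# String#[] with a regex returns the matched text (or nil). As a side effect
# it sets $1, $2, ... to the capture groups of that match.
def short_id_via_global(url)
  url[/youtu\.be\/([^\?]*)/] # result discarded; the match itself sets $1
  $1
end

# Passing a capture index returns that group directly, without globals.
def short_id_direct(url)
  url[/youtu\.be\/([^\?]*)/, 1]
end

# The same idiom adapted to Vimeo.
def vimeo_id(url)
  url[/vimeo\.com\/(\d+)/, 1]
end
```

The second form is usually easier to adapt, because the result does not depend on which regex ran last.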

How to scrape images from eBay and Amazon using XPath in Nokogiri from JSON

I'm trying to scrape images from websites using Nokogiri and XPath, so far with limited success. For a typical website whose HTML has img and src, I can use:
tmp2 = Nokogiri::HTML(open(site_url))
tmp2.xpath("//img/@src").each do |src|
  # whatever
end
However, some sites like Amazon and eBay only trigger certain images with JavaScript. If I look at the code I can see the data in arrays. For example, from Amazon:
<script type="text/javascript">
P.when('jQuery', 'cf').execute(function($, cf){
P.when('A', 'jQuery', 'ImageBlockATF', 'cf').register('ImageBlockBTF', function(A, $, imageBlockATF, cf){
var data = {"indexToColor":[],"burjImageBlock":0,"isSwatchHoverConsistent":1,"heroFocalPoint":null,"visualDimensions":["color_name"],"productGroupID":"apparel_display_on_website","newVideoMissing":0,"useIV":0,"useClickZoom":null,"useChildVideos":0,"numColors":7,"logMetrics":0,"defaultColor":"initial","airyConfig":{"enableContinuousPlay":null,"installFlashButtonText":"Install Flash Player","contentTitle":null,"autoplayCutOffTimeSeconds":null,"ageGate":{"monthNames":["January","February","March","April","May","June","July","August","September","October","November","December"],"deniedPrompt":"We're sorry. You are not old enough to watch this video.","submitText":"Submit","prompt":"This video is not intended for all audiences. What date were you born?"},"videoAds":null,"videoUnsupportedPrompt":"Sorry, this video is unsupported on this browser.","desiredMode":null,"swfUrl":"","isAutoplayEnabled":null,"installFlashPrompt":"Adobe Flash Player is required to watch this video.","isLiveStream":null,"regionCode":"NA","contentId":null,"playbackErrorPrompt":"Sorry, an error has occurred while attempting video playback. 
Please try again later.","contentMinAge":null,"isForesterTrackingDisabled":null,"streamingUrls":null,"parentId":null,"foresterMetadataParams":{"client":"Dpx","requestId":"1MX7VHFRVAS6TWY64BXC","marketplaceId":"ATVPDKIKX0DER","session":"182-9511970-7757812","method":"Apparel.ImageBlock"},"jsUrl":""},"mainImageMaxSizes":null,"staticStrings":{"playVideo":"Click to play video","rollOverToZoom":"Roll over image to zoom in","images":"Images","video":"video","clickToZoom":"Click on image to zoom in","touchToZoom":"Touch the image to zoom in","videos":"Videos","close":"Close","pleaseSelect":"Please select","clickToExpand":"Click to open expanded view","allMedia":"All Media"},"notThumbnailClickImmersiveView":1,"gIsNewTwister":1,"title":"Threads 4 Thought Women's Tabitha Basic Tank Top","ivRepresentativeAsin":{"6":"B00T46V76W","4":"B00WM3O7ES","1":"B00T46YZES","3":"B00WM3NLPE","2":"B00T46VD16","5":"B00T46VGXQ"},"mainImageSizes":[[342,445],[385,500],[425,550],[466,606],[522,679]],"isQuickview":0,"ipadVideoSizes":[[340,444],[384,500]],"colorToAsin":{"Coral Dreams":{"asin":"B00T46V76W"},"Heather Grey":{"asin":"B00WM3NLPE"},"Black":{"asin":"B00T46YZES"},"White":{"asin":"B00T46VGXQ"},"Deep Blue Sea":{"asin":"B00T46VD16"},"Sea Glass":{"asin":"B00WM3O7ES"}},"thumbExperimentEnabledValue":1,"showLITBOnClick":0,"videoSizes":[[342,445],[384,500]],"stretchyGoodnessWidth":[1280,1440,1640,1800],"autoplayVideo":0,"hoverZoomIndicator":"","sitbReftag":"","useHoverZoom":1,"staticImages":{"zoomOut":"","hoverZoomIcon":"","zoomIn":"","zoomLensBackground":"","videoThumbIcon":",0,0,38,50_.gif","spinner":"","zoomInCur":"","videoSWFPath":"","arrow":"","zoomOutCur":""},"videos":[],"gPreferChildVideos":0,"altsOnLeft":1,"ivImageSetKeys":{"Coral Dreams":"6","Heather Grey":"3","Black":"1","initial":0,"White":"5","Deep Blue Sea":"2","Sea 
Glass":"4"},"useHoverZoomIpad":"","isUDP":1,"alwaysIncludeVideo":0,"widths":[1280,1440,1640,1800],"maxAlts":7,"useChromelessVideoPlayer":1,"mainImageHeightPartitions":null};
data["customerImages"] = eval('[]');
data["colorImages"] = {"Coral Dreams":[{"large":"","variant":"MAIN","hiRes":"","thumb":",50_.jpg","main":{"":["466","606"],"":["522","679"],"":["423","550"],"":["342","445"],"":["385","500"]}},{"large":"","variant":"BACK","hiRes":"","thumb":",50_.jpg","main":{"":["385","500"],"":["522","679"],"":["342","445"],"":["466","606"],"":["423","550"]}}],"Heather Grey":[{"large":"","variant":"MAIN","hiRes":"","thumb":",50_.jpg","main":{"":["466","606"],"":["385","500"],"":["423","550"],"":["522","679"],"":["342","445"]}},{"large":"","variant":"BACK","hiRes":"","thumb":",50_.jpg","main":{"":["342","445"],"":["423","550"],"":["385","500"],"":["522","679"],"":["466","606"]}}],"Black":[{"large":"","variant":"MAIN","hiRes":"","thumb":",50_.jpg","main":{"":["423","550"],"":["342","445"],"":["522","679"],"":["385","500"],"":["466","606"]}},{"large":"","variant":"BACK","hiRes":"","thumb":",50_.jpg","main":{"":["385","500"],"":["522","679"],"":["342","445"],"":["466","606"],"":["423","550"]}}],"White":[{"large":"","variant":"MAIN","hiRes":"","thumb":",50_.jpg","main":{"":["423","550"],"":["522","679"],"":["385","500"],"":["342","445"],"":["466","606"]}},{"large":"","variant":"BACK","hiRes":"","thumb":",50_.jpg","main":{"":["466","606"],"":["342","445"],"":["522","679"],"":["385","500"],"":["423","550"]}}],"Deep Blue Sea":[{"large":"","variant":"MAIN","hiRes":"","thumb":",50_.jpg","main":{"":["342","445"],"":["522","679"],"":["423","550"],"":["385","500"],"":["466","606"]}},{"large":"","variant":"BACK","hiRes":"","thumb":",50_.jpg","main":{"":["342","445"],"":["385","500"],"":["522","679"],"":["466","606"],"":["423","550"]}}],"Sea Glass":[{"large":"","variant":"MAIN","hiRes":"","thumb":",50_.jpg","main":{"":["342","445"],"":["522","679"],"":["466","606"],"":["385","500"],"":["423","550"]}},{"large":"","variant":"BACK","hiRes":"","thumb":",50_.jpg","main":{"":["385","500"],"":["342","445"],"":["522","679"],"":["466","606"],"":["423","550"]}}]};
data["heroImage"] = {};
data["landingAsinColor"] = 'Coral Dreams';
data["shouldApplyResizeFix"] = false;
return data;
The filenames I want to grab aren't in src attributes. In this case they are in the array called data["colorImages"]. But I can't hard-code anything, because the same thing happens on eBay.
The filenames I need here are in enImgCarousel.
On a side note, when I use the following JavaScript bookmarklet for each URL to get images, I'm able to get the correct images:
var a = '';
for (var b = 0; b < document.images.length; b++) {
  a += '<img src=' + document.images[b].src + '><br>';
}
if (a != '') { document.write(a); } else { alert('No images!'); }
Back to Nokogiri and XPath, I've also tried:
tmp2.xpath("//img").each do |src|...
tmp2.xpath("html//img").each do |src|
Any ideas how I should do this or which direction to go in?
Here is an alternative way to solve this: you can use Capybara and Poltergeist.
With this approach you shouldn't have to dig into the JavaScript yourself.
For scraping pages like these I recommend Capybara with Poltergeist; there are many resources you can refer to.
This is the code I tried:
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'
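Capybara.register_driver :poltergeist_debug do |app|
  Capybara::Poltergeist::Driver.new(app, inspector: true)
end
Capybara.javascript_driver = :poltergeist_debug
Capybara.current_driver = :poltergeist_debug
include Capybara::DSL
# Amazon case
visit amazon_url # amazon_url is a placeholder for the Amazon product page URL
doc_amazon = Nokogiri::HTML.parse(page.html)
doc_amazon.xpath("//img/@src").each do |src|
  p src.value
end
# eBay case
visit ebay_url # ebay_url is a placeholder for the eBay item page URL
doc_ebay = Nokogiri::HTML.parse(page.html)
doc_ebay.xpath("//img/@src").each do |src|
  p src.value
end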
Are you trying to generate a database of competitors items with pricing, etc.?
Are you trying to grab entire categories or individual sellers?
The reason why I ask is you can get an RSS feed of items each seller lists if they have turned that feature on. This way, you do not have to waste time scraping a page when you can get the central data from an RSS feed.
When parsing web pages, the image indices you encounter depend on where you are in the page: in a carousel (which you mentioned), they come from the set of thumbnails that represent the larger images.
I recommend looking at the eBay API and the Amazon API and finding the RSS feeds for the sellers first.
As for getting past the JavaScript issues: the page loads rotating slideshows and carousels dynamically, so you will need a tool that drives a real browser, such as Selenium, to get fully rendered pages in which all images are in a scrapable state. (Mechanize, as RAJ suggested, and Beautiful Soup do not execute JavaScript on their own.)
Feel free to post your source if there is anything else I can help with.
Sorry, I am posting this answer from a mobile phone, so I can't write the full code right now, but I can point you in the right direction. You should use Mechanize with selenium-webdriver and Watir instead of Nokogiri alone.
Using this combination, you will be able to handle elements generated by JavaScript. You can simulate actual browser actions: you can click links and buttons, wait for an image to load, and then scrape it. All of this can be done quite easily.
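When the image data sits in an inline script as a JSON-like object, as with data["colorImages"] in the question, another option is to pull the object literal out of the script text and parse it. The sketch below rests on assumptions: the literal must be valid JSON and sit on a single line ending in "};", the helper names are hypothetical, and the sample script and its URL are made up. The script text itself could come from the //script nodes of the Nokogiri document.

```ruby
require "json"

# Finds `data["<key>"] = {...};` in JavaScript source and parses the
# object literal as JSON. Returns nil when the assignment is not found.
def extract_js_object(script_text, key)
  pattern = /data\["#{Regexp.escape(key)}"\]\s*=\s*(\{.*\});/
  match = pattern.match(script_text)
  match && JSON.parse(match[1])
end

# Collects the non-empty "large" and "hiRes" values from a structure shaped
# like the colorImages object: { colour name => [ { "large" => ..., ... } ] }.
def image_urls(color_images)
  urls = color_images.values.flatten.flat_map do |image|
    [image["large"], image["hiRes"]]
  end
  urls.compact.reject(&:empty?).uniq
end

# Made-up sample in the same shape as the script quoted in the question.
script = <<~JS
  data["customerImages"] = eval('[]');
  data["colorImages"] = {"Black":[{"large":"http://example.invalid/black-large.jpg","variant":"MAIN","hiRes":""}]};
  data["heroImage"] = {};
JS

color_images = extract_js_object(script, "colorImages")
puts image_urls(color_images)
```

This is brittle by nature: it depends on the exact way the site writes its script, so it needs a different key or pattern per site.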

Downloading a YouTube video through Wget

I am trying to download YouTube videos through Wget. The first thing necessary is to capture the URL of the actual video resource. Suppose I want to download this video: video. Opening up the page in the Firebug console reveals something like this:
The link which I have circled looks like the link to the resource, for there we see only the video. However, when I try to download this resource with Wget, only a 4 KB file named r-KBncrOggI#version=3&autohide=1 gets stored on my hard drive. What should I do to get the actual video?
And secondly, is there a way to capture different resources for videos of different resolutions, like 360px, 480px, etc.?
Here is a VERY simplified, yet functional, version of the youtube-download utility I cited in my other answer:
#!/usr/bin/env perl
use strict;
use warnings;
# CPAN modules we depend on
use JSON::XS;
use LWP::UserAgent;
use URI::Escape;
# Initialize the User Agent
# YouTube servers are weird, so *don't* parse headers!
my $ua = LWP::UserAgent->new(parse_head => 0);
# fetch video page or abort
my $res = $ua->get($ARGV[0]);
die "bad HTTP response" unless $res->is_success;
# scrape video metadata
if ($res->content =~ /\byt\.playerConfig\s*=\s*(\{.+?\});/sx) {
    # parse as JSON or abort
    my $json = eval { decode_json $1 };
    die "bad JSON: $1" if $@;
    # inside the JSON 'args' property, there's an encoded
    # url_encoded_fmt_stream_map property which points
    # to stream URLs and signatures
    while ($json->{args}{url_encoded_fmt_stream_map} =~ /\burl=(http.+?)&sig=([0-9A-F\.]+)/gx) {
        # decode URL and attach signature
        my $url = uri_unescape($1) . "&signature=$2";
        print $url, "\n";
    }
}
Usage example (it returns several URLs to streams with different encoding/quality):
$ perl | head -n 1
,quality=large&signature=A1E7E91DD087067ED59101EF2AE421A3503C7FED.87CBE6AE7FB8D9E2B67FEFA9449D0FA769AEA739
I'm afraid it's not that easy to get the right link for the video resource.
The link you got points to the player rather than the video itself. There is a Perl utility, youtube-download, which is well maintained and does the trick. This is how to get the HQ version (magic fmt=18) of that video:
stas@Stanislaws-MacBook-Pro:~$ youtube-download -o "{title}.{suffix}" --fmt 18 r-KBncrOggI
--> Working on r-KBncrOggI
Downloading `Sourav Ganguly in Farhan Akhtar's Show - Oye! It's Friday!.mp4`
75161060/75161060 (100.00%)
Download successful!
There might be better command-line YouTube Downloaders around. But sorry, one doesn't simply download a video using Firebug and wget any more :(
The only way I know to capture that URL manually is by watching the active downloads of the browser:
The largest data chunks are the video data, so you can copy their URL:
,919009,922403,916709,912806,929110,928008,920201,901451,909708,913605,925710,916623,929104,913302,910221,911116,914093,922405,929901&scoville=1&el=detailpage&bd=6676317&nsidf=1&vid=Yfg8gnutZoTD4G5SVKCxpsPvirbqG7pvR&bt=40.333&mos=0&vq=auto
However, for a large video, this will only return a part of the stream unless you figure out the URL query parameter responsible for stream range to be downloaded and adjust it.
A bonus: everything changes periodically, as YouTube is constantly evolving. So don't do this manually unless you crave pain.
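For readers more at home in Ruby, the decoding step of the Perl script can be sketched as below. The field names url, sig and quality are taken from the Perl snippet and its sample output; the method name parse_stream_map and the sample data are made up, and since the page format keeps changing there is no guarantee that a current YouTube page still contains such a map.

```ruby
require "uri"

# Splits a comma-separated list of URL-encoded stream descriptors and
# returns one hash per stream with its quality label and signed URL.
def parse_stream_map(stream_map)
  stream_map.split(",").map do |entry|
    fields = URI.decode_www_form(entry).to_h
    url = fields["url"]
    url += "&signature=#{fields['sig']}" if fields["sig"]
    { "quality" => fields["quality"], "url" => url }
  end
end

# Made-up sample in the shape the Perl script expects.
sample_map = [
  URI.encode_www_form("url" => "http://example.invalid/videoplayback?id=1&itag=18",
                      "quality" => "medium", "sig" => "AAAA.BBBB"),
  URI.encode_www_form("url" => "http://example.invalid/videoplayback?id=1&itag=22",
                      "quality" => "large", "sig" => "CCCC.DDDD")
].join(",")

parse_stream_map(sample_map).each do |stream|
  puts "#{stream['quality']}: #{stream['url']}"
end
```

Each entry corresponds to one encoding or resolution, which is how the different quality levels asked about in the question show up.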

SimplePie RSS Parser - Encoding and Weird Characters even on UTF-8

I am using SimplePie to Parse an RSS feed, and I am getting the following output:
Don't forget our "Spot It, Post It" .....
My code is:
header('Content-type:text/html; charset=utf-8');
// We'll process this feed with all of the default options.
$feed = new SimplePie();
// Set which feed to process.
I'm using HTML5 Doctype AND I also have: <meta charset="charset=utf-8">
I've looked it up and everything talks about changing the charset to UTF-8 which I clearly have.. so I'm not too sure what else is causing this.
Any ideas?
I don't know if you've managed to fix this, but thought I'd share my solution with anyone else who is looking. I had the same problem - the characters were being 'corrupted' in the feed. My code initially (with the problem) was:
include_once $_SERVER['DOCUMENT_ROOT'] . '/inc/';
$feed = new SimplePie('');
Seeing the post above, I tried adding the following header and it worked!
header('Content-type:text/html; charset=utf-8');
include_once $_SERVER['DOCUMENT_ROOT'] . '/inc/';
$feed = new SimplePie('');
I hope this helps someone else experiencing the same problems.
Does this happen with every feed, or just one particular feed? It might be the feed itself. You can use $item->get_content() and look at the content of the feed directly if the description itself is proving problematic. Sometimes it is necessary to process information from a feed or web API; there are PHP code examples for stripping and replacing characters, and the News Blocks 2.0 demo on the SimplePie site has some cleaning code I've been using a lot recently.
Good luck.
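Why the charset header matters can be shown with a short experiment. It is written in Ruby purely for illustration (the thread itself is about PHP), and it assumes the corruption is the common case of UTF-8 bytes being read as Windows-1252; the method names are made up.

```ruby
# Simulates a reader that wrongly treats UTF-8 bytes as Windows-1252.
def misread_as_windows_1252(utf8_text)
  utf8_text.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

# Reverses the damage, provided the text was garbled in exactly that way.
def repair_misread(garbled_text)
  garbled_text.encode(Encoding::Windows_1252).force_encoding(Encoding::UTF_8)
end

original = "Don\u2019t forget" # contains a curly apostrophe (bytes E2 80 99)
garbled  = misread_as_windows_1252(original)

puts garbled                 # the apostrophe now shows up as three characters
puts repair_misread(garbled) # back to the original text
```

Sending the correct charset, as the header() call above does, avoids the misreading in the first place, so no repair step is needed.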

How to parse embedded videos from youtube, vimeo, etc

I'm working with Ruby On Rails 2.3.8 and I'm using TinyMCE with image and video upload functionalities.
I've figured out that when I insert a Vimeo video, it won't work, because it needs its own iframe, like the following:
<iframe src="" width="400" height="225" frameborder="0"></iframe><p>YOU! - Heart from KUSKUS on Vimeo.</p>
I'm now wondering how to show YouTube (which works just fine), Vimeo, and other kinds of embedded videos.
Searching on the internet I've found the following code, in the file /plugins/media/media.js, within getType function:
// Vimeo
if ( v.match(/^http:\/\/(?:www\.){0,1}vimeo\.com\/(\d+)$/) ) {
    f.width.value = '400';
    f.height.value = '321';
    f.src.value = '' + v.match(/^http:\/\/(?:www\.){0,1}vimeo\.com\/(\d+)$/)[1];
    return 'flash';
}
But it's not working for me. At least, all I see is that it's treating it as if it were a common Flash video, instead of inserting an iframe in the HTML to play it (as is done when you click the "Embed" button).
The iframe tag usually gets removed (cleanup) if you do not specify otherwise.
Add this to your tinymce configuration to keep iframes inside the editor:
This thread might be of help too.
