How do some sites download YouTube captions? - youtube-api

This is somewhat of a duplicate question of Does YouTube API forbid to download video captions if you are not it's owner?, Get YouTube captions and Does YouTube API forbid to download video captions if you are not it's owner?, which all basically say it's not possible unless to download captions via the YouTube API unless you are the owner or third-party contributions are not enabled; however, my question is how to sites like http://downsub.com/ or http://www.lilsubs.com/ have access to all captions?
In other words, when I access the YouTube API myself (even with youtubepartner and youtube.force-ssl scopes), I can only download the captions of some videos, but when I try the same videos that failed for me with 403: The permissions associated with the request are not sufficient to download the caption track. The request might not be properly authorized, or the video order might not have enabled third-party contributions for this caption. on these other sites, it works fine. I'm assuming they are using the YouTube API to access the captions, but what special sauce are they using? Some special partner key? An different API version? Are they just scraping from the videos themselves or something?

Send a GET request on:
http://video.google.com/timedtext?lang={LANG}&v={VIDEOID}
Example for your video in comment: http://video.google.com/timedtext?lang=ko&v=0db1_qWZjRA
Let's look at another example of yours, i.e. https://www.youtube.com/watch?v=7068mw-6lmI (and I agree about differentiation part in your comment).
There are multiple subtitles available for the video
English
Korean
Spanish
Korean (auto-generated) also called asr (automatic speech recognition)
These stand for the subtitle name parameter (i.e., name=English).
lang stands for the country code.
In your example: https://www.youtube.com/api/timedtext?lang=es-MX&v=7068mw-6lmI&name=Spanish
If subtitle track is available, it is possible to do translation form it, namely using tlang parameter.
https://www.youtube.com/api/timedtext?lang=en&v=7068mw-6lmI&name=English&tlang=lv
https://www.youtube.com/api/timedtext?lang=ko&v=7068mw-6lmI&name=Korean&tlang=lv
This would be my bid for what these sites are using, i.e. translation of the available subtitle track (confirm by trying to use a video without subtitle track as input for one of their sites).
As for asr signature seems to always be needed, but as long as one of the subtitle tracks are available, you could use that for translation. E.g. in your OP comment example:
https://www.youtube.com/api/timedtext?lang=en&v=vx6NCUyg1NE&tlang=lv
Looks like the last example is special with both of subtitle tracks being asr (checked with Chrome -> Inspect -> Network) therefore you need to omit the subtitle name parameter part. This difference unfortunately is not visible in YouTube video's settings wheel.

A 2022 answer:
Option 1: Send a curl request to the webpage: curl -L "https://youtu.be/YbJOTdZBX1g", search for timedtext in the result, and you would get a URL. replace \u0026 with & and you get the link for the subtitle.
Option 2: Use the yt-dlp package:
# For installing see: https://github.com/yt-dlp/yt-dlp#with-pip
from yt_dlp import YoutubeDL
ydl_opts = {
"skip_download": True,
"writesubtitles": True,
"subtitleslangs": ["all", "-live_chat"],
# Looks like formats available are vtt, ttml, srv3, srv2, srv1, json3
"subtitlesformat": "json3",
# You can skip the following option
"sleep_interval_subtitles": 1,
}
with YoutubeDL(ydl_opts) as ydl:
ydl.download(["YbJOTdZBX1g"])

There is this unofficial API used by Youtube :
https://www.youtube.com/api/timedtext?lang={LANG}&v={VIDEO_ID}
LANG here is ISO 639-1 2 letter country code. For your example it would be :
https://www.youtube.com/api/timedtext?lang=ko&v=0db1_qWZjRA
You can check it in network tab while toggling the closed caption button :

I have used youtube-transcript-api successfully to retrieve transcripts. The below is a demo to dump the transcript into HTML with links back to the timestamps in the video:
import sys
from youtube_transcript_api import YouTubeTranscriptApi
video_id = sys.argv[1]
# Retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
# Just use the first transcript, let it raise an exception if none exist.
transcript = next(iter(transcript_list))
print("<html><body>")
for line_map in transcript.fetch():
st_sec = int(line_map['start'] / 60)
st_msec = int(line_map['start'] - st_sec * 60)
tstmp = f"{st_sec}:{st_msec}"
link_to_tstmp = f"https://youtu.be/{video_id}?t={st_sec*60}"
tstmp_str = ("%2d:%-2d" % (st_sec, st_msec)).replace(" ", " ")
#print(f"{st_sec}:{st_msec} {line_map['text']}")
print("""%s %s<br/>""" % (link_to_tstmp, tstmp_str, line_map['text']))
print("</html></body>")
If there are multiple transcripts, the library provides API to search by language etc.
You can further tweak the logic to merge text so you only get one link every so many minutes. I got good results for a lecture by linking at every 1 min and format the lines into a HTML table.

Related

Description linebreaks on YouTube v3 API

I've been playing around with the YouTube API but when I try to get the video description, it appears as one long string with no line breaks. Is there a method to get line breaks?
The API does not have any documentation to explain this matter.
Please use the PHP_EOL, like so:
$desp='Cette vidéo et trouvée sur: http://localhost/ '.PHP_EOL;
$desp.='Pour plus d\'infos visitez: http://localhost/123 '.PHP_EOL;
$desp.='CopyRight to: http://localhost/789';
$snippet->setDescription($desp);
When uploading a video using the API, I set .Snippet.Description, my data contains CrLf characters to cause line breaks. Later, when I retrieve the Video object via the API, the line breaks are present.

Read site html from a site in a different geo region

I am using python and Beautiful soup to read html pages. Unfortunately some sites redirect to my Geo region (AU) so I can't retrieve the target countries version i.e. (UK, US, FR, NZ...)
I have tried using a VPN service but this requires me to manually change the region so I can't automate the process. I have tried using the python quartz.Coregraphics library to click the options on screen but this is temperamental.
Is there a way I can achieve this programmatically?
I have manage to nut this one out myself. Best answered by example for reading a uk based site.
import urllib2
url = 'Some-uk-url'
req = urllib2.Request(url)
req.add_header('Accept-Language', 'en-gb')
req.add_header('X-Forwarded-For', [a uk proxy ipaddress here])
htmltext = urllib2.urlopen(req).read()

What top-level domains does youtube have?

I am trying to identify youtube link (generally), and I wonder what top-level domains is youtube using?
So far I know about:
.com (youtube.com)
.be (youtu.be)
Are there any others?
PS: for those looking for checking youtube/vimeo video particulary I would recommend to check how to check the valid Youtube url using jquery ...
At the moment, YouTube videos can be accessed by two kinds of link, either the usual URL generated by the UI itself, as you click from one video to the next:
https://www.youtube.com/watch?v=tRgl-78sDX2
Or through a sharing URL, created within the UI by clicking the "share" button:
https://youtu.be/6DzSAaNQHR8
Regarding the second part of your question, what domains are required for access to YouTube? Unfortunately this is a moving target, as the YouTube changes. At the time of writing (Mar 2015), that list is as follows:
*.youtube.com
*.googlevideo.com
*.ytimg.com
In addition to the core domains above, some ancillary domains are also needed to display the ads, etc on most YouTube pages:
apis.google.com
*.googleusercontent.com
*.gstatic.com
The bulk of the traffic itself comes from *.googlevideo.com. Beginning in late 2014, YouTube started progressively enabling SSL on the youtube.com and googlevideo.com domains. You'll need to be aware of this if you're using this information on a filtering or caching device.
Also note this was tested through a browser; native clients (iOS, Android, etc) may operate differently.
I have discovered these Youtube domains so far:
youtube.com // https://www.youtube.com/watch?v=dfGZ8NyGFS8
youtu.be // https://youtu.be/dfGZ8NyGFS8
youtube-nocookie.com // https://www.youtube-nocookie.com/embed/dfGZ8NyGFS8
also youtube.com has a mobile version on subdomain:
m.youtube.com // https://m.youtube.com/watch?v=dfGZ8NyGFS8
I use this list for PHP function parse_url. You may find this question usefull too.
Here are some of the main domains, subdomains and gTLDs related to youtube
1st Level (domains):
.googlevideo.com
.youtu.be
.youtube.com
.youtube.com.br
.youtube.co.nz
.youtube.de
.youtube.es
.youtube.it
.youtube.nl
.youtube-nocookie.com
.youtube.ru
.ytimg.com
2nd Level (subdomains):
.video-stats.l.google.com
.youtube.googleapis.com
.youtubei.googleapis.com
.ytimg.l.google.com
gTLD:
.youtube
gTLD subdomains:
.rewind.youtube
.blog.youtube
Here are all (147) top-level URLs which redirected to youtube.com that I could find (see below for how I got these URLs):
www.youtube.ae https://www.youtube.com/?gl=AE
www.youtube.at https://www.youtube.com/?gl=AT
www.youtube.az https://www.youtube.com/?gl=AZ
www.youtube.ba https://www.youtube.com/?gl=BA
www.youtube.be https://www.youtube.com/?gl=BE
www.youtube.bg https://www.youtube.com/?gl=BG
www.youtube.bh https://www.youtube.com/?gl=BH
www.youtube.bo https://www.youtube.com/?gl=BO
www.youtube.by https://www.youtube.com/?gl=BY
www.youtube.ca https://www.youtube.com/?gl=CA
www.youtube.cat https://www.youtube.com/?gl=ES
www.youtube.ch https://www.youtube.com/?gl=CH
www.youtube.cl https://www.youtube.com/?gl=CL
www.youtube.co https://www.youtube.com/?gl=CO
www.youtube.co.ae https://www.youtube.com/?gl=AE
www.youtube.co.at https://www.youtube.com/?gl=AT
www.youtube.co.cr https://www.youtube.com/?gl=CR
www.youtube.co.hu https://www.youtube.com/?gl=HU
www.youtube.co.id https://www.youtube.com/?gl=ID
www.youtube.co.il https://www.youtube.com/?gl=IL
www.youtube.co.in https://www.youtube.com/?gl=IN
www.youtube.co.jp https://www.youtube.com/?gl=JP
www.youtube.co.ke https://www.youtube.com/?gl=KE
www.youtube.co.kr https://www.youtube.com/?gl=KR
www.youtube.co.ma https://www.youtube.com/?gl=MA
www.youtube.co.nz https://www.youtube.com/?gl=NZ
www.youtube.co.th https://www.youtube.com/?gl=TH
www.youtube.co.tz https://www.youtube.com/?gl=TZ
www.youtube.co.ug https://www.youtube.com/?gl=UG
www.youtube.co.uk https://www.youtube.com/?gl=GB
www.youtube.co.ve https://www.youtube.com/?gl=VE
www.youtube.co.za https://www.youtube.com/?gl=ZA
www.youtube.co.zw https://www.youtube.com/?gl=ZW
www.youtube.com https://www.youtube.com/
www.youtube.com.ar https://www.youtube.com/?gl=AR
www.youtube.com.au https://www.youtube.com/?gl=AU
www.youtube.com.az https://www.youtube.com/?gl=AZ
www.youtube.com.bd https://www.youtube.com/?gl=BD
www.youtube.com.bh https://www.youtube.com/?gl=BH
www.youtube.com.bo https://www.youtube.com/?gl=BO
www.youtube.com.br https://www.youtube.com/?gl=BR
www.youtube.com.by https://www.youtube.com/?gl=BY
www.youtube.com.co https://www.youtube.com/?gl=CO
www.youtube.com.do https://www.youtube.com/?gl=DO
www.youtube.com.ec https://www.youtube.com/?gl=EC
www.youtube.com.ee https://www.youtube.com/?gl=EE
www.youtube.com.eg https://www.youtube.com/?gl=EG
www.youtube.com.es https://www.youtube.com/?gl=ES
www.youtube.com.gh https://www.youtube.com/?gl=GH
www.youtube.com.gr https://www.youtube.com/?gl=GR
www.youtube.com.gt https://www.youtube.com/?gl=GT
www.youtube.com.hk https://www.youtube.com/?gl=HK
www.youtube.com.hn https://www.youtube.com/?gl=HN
www.youtube.com.hr https://www.youtube.com/?gl=HR
www.youtube.com.jm https://www.youtube.com/?gl=JM
www.youtube.com.jo https://www.youtube.com/?gl=JO
www.youtube.com.kw https://www.youtube.com/?gl=KW
www.youtube.com.lb https://www.youtube.com/?gl=LB
www.youtube.com.lv https://www.youtube.com/?gl=LV
www.youtube.com.ly https://www.youtube.com/?gl=LY
www.youtube.com.mk https://www.youtube.com/?gl=MK
www.youtube.com.mt https://www.youtube.com/?gl=MT
www.youtube.com.mx https://www.youtube.com/?gl=MX
www.youtube.com.my https://www.youtube.com/?gl=MY
www.youtube.com.ng https://www.youtube.com/?gl=NG
www.youtube.com.ni https://www.youtube.com/?gl=NI
www.youtube.com.om https://www.youtube.com/?gl=OM
www.youtube.com.pa https://www.youtube.com/?gl=PA
www.youtube.com.pe https://www.youtube.com/?gl=PE
www.youtube.com.ph https://www.youtube.com/?gl=PH
www.youtube.com.pk https://www.youtube.com/?gl=PK
www.youtube.com.pt https://www.youtube.com/?gl=PT
www.youtube.com.py https://www.youtube.com/?gl=PY
www.youtube.com.qa https://www.youtube.com/?gl=QA
www.youtube.com.ro https://www.youtube.com/?gl=RO
www.youtube.com.sa https://www.youtube.com/?gl=SA
www.youtube.com.sg https://www.youtube.com/?gl=SG
www.youtube.com.sv https://www.youtube.com/?gl=SV
www.youtube.com.tn https://www.youtube.com/?gl=TN
www.youtube.com.tr https://www.youtube.com/?gl=TR
www.youtube.com.tw https://www.youtube.com/?gl=TW
www.youtube.com.ua https://www.youtube.com/?gl=UA
www.youtube.com.uy https://www.youtube.com/?gl=UY
www.youtube.com.ve https://www.youtube.com/?gl=VE
www.youtube.cr https://www.youtube.com/?gl=CR
www.youtube.cz https://www.youtube.com/?gl=CZ
www.youtube.de https://www.youtube.com/?gl=DE
www.youtube.dk https://www.youtube.com/?gl=DK
www.youtube.ee https://www.youtube.com/?gl=EE
www.youtube.es https://www.youtube.com/?gl=ES
www.youtube.fi https://www.youtube.com/?gl=FI
www.youtube.fr https://www.youtube.com/?gl=FR
www.youtube.ge https://www.youtube.com/?gl=GE
www.youtube.gr https://www.youtube.com/?gl=GR
www.youtube.gt https://www.youtube.com/?gl=GT
www.youtube.hk https://www.youtube.com/?gl=HK
www.youtube.hr https://www.youtube.com/?gl=HR
www.youtube.hu https://www.youtube.com/?gl=HU
www.youtube.ie https://www.youtube.com/?gl=IE
www.youtube.in https://www.youtube.com/?gl=IN
www.youtube.iq https://www.youtube.com/?gl=IQ
www.youtube.is https://www.youtube.com/?gl=IS
www.youtube.it https://www.youtube.com/?gl=IT
www.youtube.jo https://www.youtube.com/?gl=JO
www.youtube.jp https://www.youtube.com/?gl=JP
www.youtube.kr https://www.youtube.com/?gl=KR
www.youtube.kz https://www.youtube.com/?gl=KZ
www.youtube.lk https://www.youtube.com/?gl=LK
www.youtube.lt https://www.youtube.com/?gl=LT
www.youtube.lu https://www.youtube.com/?gl=LU
www.youtube.lv https://www.youtube.com/?gl=LV
www.youtube.ly https://www.youtube.com/?gl=LY
www.youtube.ma https://www.youtube.com/?gl=MA
www.youtube.me https://www.youtube.com/?gl=ME
www.youtube.mk https://www.youtube.com/?gl=MK
www.youtube.mx https://www.youtube.com/?gl=MX
www.youtube.my https://www.youtube.com/?gl=MY
www.youtube.net.in https://www.youtube.com/
www.youtube.ng https://www.youtube.com/?gl=NG
www.youtube.ni https://www.youtube.com/?gl=NI
www.youtube.nl https://www.youtube.com/?gl=NL
www.youtube.no https://www.youtube.com/?gl=NO
www.youtube.pa https://www.youtube.com/?gl=PA
www.youtube.pe https://www.youtube.com/?gl=PE
www.youtube.ph https://www.youtube.com/?gl=PH
www.youtube.pk https://www.youtube.com/?gl=PK
www.youtube.pl https://www.youtube.com/?gl=PL
www.youtube.pr https://www.youtube.com/?gl=PR
www.youtube.pt https://www.youtube.com/?gl=PT
www.youtube.qa https://www.youtube.com/?gl=QA
www.youtube.ro https://www.youtube.com/?gl=RO
www.youtube.rs https://www.youtube.com/?gl=RS
www.youtube.ru https://www.youtube.com/?gl=RU
www.youtube.sa https://www.youtube.com/?gl=SA
www.youtube.se https://www.youtube.com/?gl=SE
www.youtube.sg https://www.youtube.com/?gl=SG
www.youtube.si https://www.youtube.com/?gl=SI
www.youtube.sk https://www.youtube.com/?gl=SK
www.youtube.sn https://www.youtube.com/?gl=SN
www.youtube.sv https://www.youtube.com/?gl=SV
www.youtube.tn https://www.youtube.com/?gl=TN
www.youtube.tv https://tv.youtube.com/welcome/
www.youtube.ua https://www.youtube.com/?gl=UA
www.youtube.ug https://www.youtube.com/?gl=UG
www.youtube.uy https://www.youtube.com/?gl=UY
www.youtube.vn https://www.youtube.com/?gl=VN
www.youtube.voto https://www.youtube.com/
Here is the first column of the above table (suitable for copying and pasting):
www.youtube.ae
www.youtube.at
www.youtube.az
www.youtube.ba
www.youtube.be
www.youtube.bg
www.youtube.bh
www.youtube.bo
www.youtube.by
www.youtube.ca
www.youtube.cat
www.youtube.ch
www.youtube.cl
www.youtube.co
www.youtube.co.ae
www.youtube.co.at
www.youtube.co.cr
www.youtube.co.hu
www.youtube.co.id
www.youtube.co.il
www.youtube.co.in
www.youtube.co.jp
www.youtube.co.ke
www.youtube.co.kr
www.youtube.co.ma
www.youtube.co.nz
www.youtube.co.th
www.youtube.co.tz
www.youtube.co.ug
www.youtube.co.uk
www.youtube.co.ve
www.youtube.co.za
www.youtube.co.zw
www.youtube.com
www.youtube.com.ar
www.youtube.com.au
www.youtube.com.az
www.youtube.com.bd
www.youtube.com.bh
www.youtube.com.bo
www.youtube.com.br
www.youtube.com.by
www.youtube.com.co
www.youtube.com.do
www.youtube.com.ec
www.youtube.com.ee
www.youtube.com.eg
www.youtube.com.es
www.youtube.com.gh
www.youtube.com.gr
www.youtube.com.gt
www.youtube.com.hk
www.youtube.com.hn
www.youtube.com.hr
www.youtube.com.jm
www.youtube.com.jo
www.youtube.com.kw
www.youtube.com.lb
www.youtube.com.lv
www.youtube.com.ly
www.youtube.com.mk
www.youtube.com.mt
www.youtube.com.mx
www.youtube.com.my
www.youtube.com.ng
www.youtube.com.ni
www.youtube.com.om
www.youtube.com.pa
www.youtube.com.pe
www.youtube.com.ph
www.youtube.com.pk
www.youtube.com.pt
www.youtube.com.py
www.youtube.com.qa
www.youtube.com.ro
www.youtube.com.sa
www.youtube.com.sg
www.youtube.com.sv
www.youtube.com.tn
www.youtube.com.tr
www.youtube.com.tw
www.youtube.com.ua
www.youtube.com.uy
www.youtube.com.ve
www.youtube.cr
www.youtube.cz
www.youtube.de
www.youtube.dk
www.youtube.ee
www.youtube.es
www.youtube.fi
www.youtube.fr
www.youtube.ge
www.youtube.gr
www.youtube.gt
www.youtube.hk
www.youtube.hr
www.youtube.hu
www.youtube.ie
www.youtube.in
www.youtube.iq
www.youtube.is
www.youtube.it
www.youtube.jo
www.youtube.jp
www.youtube.kr
www.youtube.kz
www.youtube.lk
www.youtube.lt
www.youtube.lu
www.youtube.lv
www.youtube.ly
www.youtube.ma
www.youtube.me
www.youtube.mk
www.youtube.mx
www.youtube.my
www.youtube.net.in
www.youtube.ng
www.youtube.ni
www.youtube.nl
www.youtube.no
www.youtube.pa
www.youtube.pe
www.youtube.ph
www.youtube.pk
www.youtube.pl
www.youtube.pr
www.youtube.pt
www.youtube.qa
www.youtube.ro
www.youtube.rs
www.youtube.ru
www.youtube.sa
www.youtube.se
www.youtube.sg
www.youtube.si
www.youtube.sk
www.youtube.sn
www.youtube.sv
www.youtube.tn
www.youtube.tv
www.youtube.ua
www.youtube.ug
www.youtube.uy
www.youtube.vn
www.youtube.voto
There are also the following domains which are not included in the table:
m.youtube.com
m.youtube.<country code>
youtu.be (with and without the trailing dot)
<country code>.youtube.com (e.g. jp.youtube.com)
youtube.com. (with the trailing dot)
youtube.<country code>. (with the trailing dot)
domains like gaming.youtube.com in YouTube's sitemap https://www.youtube.com/sitemaps/sitemap.xml (see linked sub-sitemaps)
For comprehensiveness, I went through Wikipedia's 676 https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes, took all possible two-letter language codes and appended them to the URL as https://www.youtube.com/?gl=<language code>, retrieved the URL, and checked if the source code contained the text "gl":"<language code>". If the language code is not found (e.g. ZZZZZ), the source code will use the default language code of "gl":"CA". Additionally, I went through Mozilla's Public Suffix list https://wiki.mozilla.org/Public_Suffix_List and took all of those and tried every single one.
Here are the following URLs that I couldn't find a corresponding TLD for (it doesn't imply that there should be one, however) which had the TLD code in the URL also in the HTML source:
https://www.youtube.com/?gl=CY
https://www.youtube.com/?gl=LI
https://www.youtube.com/?gl=NP
https://www.youtube.com/?gl=PG
https://www.youtube.com/?gl=YE
Bash script to get URL redirects (sometimes this script doesn't follow the redirect; I am not sure if it is because of my flaky internet, a rate-limit, or because of the script):
while read line; do echo "$line -> "$(curl -Ls -o /dev/null -w %{url_effective} $line); done < youtube_urls.txt;
There's even .youtube, which is quite cool I think.
See https://lifeinaday.youtube/ for example.
If you are trying to catch youtube video links I am using this regexp part I don't know they use national domains for example if you go to youtube.mx they redirect you back to .com with Mexico localizations... I guess they figured out that they cannot host content on some national domains for example if you use domain of Saudi Arabia you have to accept all local laws of Saudi Arabia and if somebody put content that is in violation of their laws owner of website might be in serious legal trouble... so they are using .com for everything and keep content under jurisdiction and laws of US
https?:\/\/ # Either http or https
(?:[\w]+\.)* # Optional subdomains
(?: # Group host alternatives.
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com
| youtube-nocookie\.com # or youtube-nocookie.com
) # End Host Group
By this page http://data.iana.org/TLD/tlds-alpha-by-domain.txt
There is also a .youtube domain itself.
As in the domain https://rewind.youtube
This gave me an idea. Why not use these domains as channel redirects.
eg. My channel (WMTV) will have the domain https://wmtv.youtube
Of course, this would be limited to bigger creators, but as time goes on, this might be a great feature from an individual YouTuber´s marketing standpoint.

Downloading a YouTube video through Wget

I am trying to download YouTube videos through Wget. The first thing necessary is to capture the URL of the actual video resource. Suppose I want to download this video: video. Opening up the page in the Firebug console reveals something like this:
The link which I have encircled looks like the link to the resource, for there we see only the video: http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1. However, when I am trying to download this resource with Wget, a 4 KB file of name r-KBncrOggI#version=3&autohide=1 gets stored in my hard-drive, nothing else. What should I do to get the actual video?
And secondly, is there a way to capture different resources for videos of different resolutions, like 360px, 480px, etc.?
Here is one VERY simplified, yet functional version of the youtube-download utility I cited on my another answer:
#!/usr/bin/env perl
use strict;
use warnings;
# CPAN modules we depend on
use JSON::XS;
use LWP::UserAgent;
use URI::Escape;
# Initialize the User Agent
# YouTube servers are weird, so *don't* parse headers!
my $ua = LWP::UserAgent->new(parse_head => 0);
# fetch video page or abort
my $res = $ua->get($ARGV[0]);
die "bad HTTP response" unless $res->is_success;
# scrape video metadata
if ($res->content =~ /\byt\.playerConfig\s*=\s*({.+?});/sx) {
# parse as JSON or abort
my $json = eval { decode_json $1 };
die "bad JSON: $1" if $#;
# inside the JSON 'args' property, there's an encoded
# url_encoded_fmt_stream_map property which points
# to stream URLs and signatures
while ($json->{args}{url_encoded_fmt_stream_map} =~ /\burl=(http.+?)&sig=([0-9A-F\.]+)/gx) {
# decode URL and attach signature
my $url = uri_unescape($1) . "&signature=$2";
print $url, "\n";
}
}
Usage example (it returns several URLs to streams with different encoding/quality):
$ perl youtube.pl http://www.youtube.com/watch?v=r-KBncrOggI | head -n 1
http://r19---sn-bg07sner.c.youtube.com/videoplayback?fexp=923014%2C916623%2C920704%2C912806%2C922403%2C922405%2C929901%2C913605%2C925710%2C929104%2C929110%2C908493%2C920201%2C913302%2C919009%2C911116%2C926403%2C910221%2C901451&ms=au&mv=m&mt=1357996514&cp=U0hUTVBNUF9FUUNONF9IR1RCOk01RjRyaG4wTHdQ&id=afe2819dcace8202&ratebypass=yes&key=yt1&newshard=yes&expire=1358022107&ip=201.52.68.216&ipbits=8&upn=m-kyX9-4Tgc&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&itag=44&sver=3&source=youtube,quality=large&signature=A1E7E91DD087067ED59101EF2AE421A3503C7FED.87CBE6AE7FB8D9E2B67FEFA9449D0FA769AEA739
I'm afraid it's not that easy do get the right link for the video resource.
The link you got, http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1, points to the player rather than the video itself. There is one Perl utility, youtube-download, which is well-maintained and does the trick. This is how to get the HQ version (magic fmt=18) of that video:
stas#Stanislaws-MacBook-Pro:~$ youtube-download -o "{title}.{suffix}" --fmt 18 r-KBncrOggI
--> Working on r-KBncrOggI
Downloading `Sourav Ganguly in Farhan Akhtar's Show - Oye! It's Friday!.mp4`
75161060/75161060 (100.00%)
Download successful!
stas#Stanislaws-MacBook-Pro:~$
There might be better command-line YouTube Downloaders around. But sorry, one doesn't simply download a video using Firebug and wget any more :(
The only way I know to capture that URL manually is by watching the active downloads of the browser:
That largest data chunks are video data, so you can copy its URL:
http://s.youtube.com/s?lact=111116&uga=m30&volume=4.513679238953965&sd=BBE62AA4AHH1357937949850490&rendering=accelerated&fs=0&decoding=software&nsivbblmax=679542.000&hcbt=105.345&sendtmp=1&fmt=35&w=640&vtmp=1&referrer=None&hl=en_US&nsivbblmin=486355.000&nsivbblmean=603805.166&md=1&plid=AATTCZEEeM825vCx&ns=yt&ptk=youtube_none&csipt=watch7&rt=110.904&tsphab=1&nsiabblmax=129097.000&tspne=0&tpmt=110&nsiabblmin=123113.000&tspfdt=436&hbd=30900552&et=110.146&hbt=30.770&st=70.213&cfps=25&cr=BR&h=480&screenw=1440&nsiabblmean=125949.872&cpn=JlqV9j_oE1jzk7Zc&nsivbblc=343&nsiabblc=343&docid=r-KBncrOggI&len=1302.676&screenh=900&abd=1&pixel_ratio=1&bc=26131333&playerw=854&idpj=0&hcbd=25408143&playerh=510&ldpj=0&fexp=920704,919009,922403,916709,912806,929110,928008,920201,901451,909708,913605,925710,916623,929104,913302,910221,911116,914093,922405,929901&scoville=1&el=detailpage&bd=6676317&nsidf=1&vid=Yfg8gnutZoTD4G5SVKCxpsPvirbqG7pvR&bt=40.333&mos=0&vq=auto
However, for a large video, this will only return a part of the stream unless you figure out the URL query parameter responsible for stream range to be downloaded and adjust it.
A bonus: everything changes periodically as YouTube is constantly evolving. So, don't do that manually unless you carve pain.

How to extract closed caption transcript from YouTube video?

Is it possible to extract the closed caption transcript from YouTube videos?
We have over 200 webcasts on YouTube and each is at least one hour long. YouTube has closed caption for all videos but it seems users have no way to get it.
I tried the URL in this blog but it does not work with our videos.
http://googlesystem.blogspot.com/2010/10/download-youtube-captions.html
Here's how to get the transcript of a YouTube video (when available):
Go to YouTube and open the video of your choice.
Click on the "More actions" button (3 horizontal dots) located next to the Share button.
Click "Open transcript"
Although the syntax may be a little goofy this is a pretty good solution.
Source: http://ccm.net/faq/40644-youtube-how-to-get-the-transcript-of-a-video
Get timedtext file directly from YouTube
curl -s "$video_url"|grep -o '"baseUrl":"https://www.youtube.com/api/timedtext[^"]*lang=en'|cut -d \" -f4|sed 's/\\u0026/\&/g'|xargs curl -Ls|grep -o '<text[^<]*</text>'|sed -E 's/<text start="([^"]*)".*>(.*)<.*/\1 \2/'|sed 's/\xc2\xa0/ /g;s/&/\&/g'|recode xml|awk '{$1=sprintf("%02d:%02d:%02d",$1/3600,$1%3600/60,$1%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1
yt-dlp
yt-dlp supports saving the automatically generated closed captions in a JSON format:
cap()(printf %s\\n "${#-$(cat)}"|parallel -j10 -q yt-dlp -i --skip-download --write-auto-sub --sub-format json3 -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.json3;do jq -r '.events[]|select(.segs and .segs[0].utf8!="\n")|(.tStartMs|tostring)+" "+([.segs[]?.utf8]|join(""))' "$f"|awk '{x=$1/1e3;$1=sprintf("%02d:%02d:%02d",x/3600,x%3600/60,x%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.json3}";rm "$f";done)
You can also use the function above to download the captions for all videos on a channel or playlist if you give the ID or URL of the channel or playlist as an argument. When there is an error downloading a single video, the -i (--ignore-errors) option skips the video instead of exiting with an error.
Or this just gets the text without the timestamps:
yt-dlp --skip-download --write-auto-sub --sub-format json3 $youtube_url_or_id;jq -r '.events[]|select(.segs and.segs[0].utf8!="\n")|[.segs[].utf8]|join("")' *json3|paste -sd\ -|fold -sw60
youtube-dl
As of 2022, the format of the VTT and TTML downloaded by youtube-dl --write-auto-sub is messed up so that all subtitle texts are placed under a few long lines so that the timestamps of the subtitles are not visible. If you don't need the timestamps, then it shouldn't matter, but otherwise you can fix it by substituting yt-dlp for youtube-dl in the following commands. But with yt-dlp, you can also use a more convenient JSON format, so you don't need the following approach to deal with the VTT subtitle format.
This downloads the subtitles as VTT:
youtube-dl --skip-download --write-auto-sub $youtube_url
The other available formats are ttml, srv3, srv2, and srv1 (shown by --list-subs):
--write-sub
Write subtitle file
--write-auto-sub
Write automatically generated subtitle file (YouTube only)
--all-subs
Download all the available subtitles of the video
--list-subs
List all available subtitles for the video
--sub-format FORMAT
Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"
--sub-lang LANGS
Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags
You can use ffmpeg to convert the subtitle file to another format:
ffmpeg -i input.vtt output.srt
In the VTT subtitles, each subtitle text is repeated three times, and there is typically a new subtitle text every eighth line (but under some mysterious circumstances it's every 12th line instead):
WEBVTT
Kind: captions
Language: en
00:00:01.429 --> 00:00:04.249 align:start position:0%
ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>
00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
</c>
00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>
00:00:05.930 --> 00:00:05.940 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
</c>
00:00:05.940 --> 00:00:07.730 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>
00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice
00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>
00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
</c>
00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>
00:00:11.030 --> 00:00:11.040 align:start position:0%
<c.colorE5E5E5>the circumstances that's brought us
</c>
This converts the VTT subtitles to a simpler format:
sed '1,/^$/d' *.vtt| # remove the lines at the top of the file
sed 's/<[^>]*>//g'| # remove tags
awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3' | # print each new subtitle text and its start time without milliseconds
awk NF\>1 # remove lines with only one field
Output:
00:00:01 ladies and gentlemen I'd like to thank
00:00:04 you for coming tonight especially at
00:00:05 such short notice
00:00:07 I'm sure mr. Irving will fill you in on
00:00:09 the circumstances that's brought us
In maybe around 10% of videos that I tested with (like for example p9M3shEU-QM and aE05_REXnBc), there were one or more subtitle texts which came 12 and not 8 lines after the previous subtitle text. But a workaround is to print every fourth line but to then remove empty lines.
Function form:
cap()(printf %s\\n "${#-$(cat)}"|parallel -j10 -q youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.vtt;do sed '1,/^$/d' -- "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3'|awk 'NF>1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.vtt}";rm "$f";done)
Following document says only the owner of the channel can do this via standard youtube interface:
https://developers.google.com/youtube/2.0/developers_guide_protocol_captions?hl=en
Cheap fix:
You can click on the "interactive transscript" button - and copy the content this way.
Of course you lose the milliseconds this way.
Extremely cheap fix:
A shared youtube account -
so that multiple people can edit and upload caption files.
Challenging solution:
The youtube API allows downloading and uploading of caption files via HTTP...
You may write a youtube API application to provide a browser user interface for uploading or downloading for ANY user or particular users.
Here is an example project for this in java
http://apiblog.youtube.com/2011/01/youtube-captions-uploader-web-app.html
Here is very simple example of a working upload for everybody:
http://yt-captions-uploader.appspot.com/
You can view/copy/download a timecoded xml file of a youtube's closed captions file by accessing
http://video.google.com/timedtext?lang=[LANGUAGE]&v=[YOUTUBE VIDEO IDENTIFIER]
For example http://video.google.com/timedtext?lang=pt&v=WSVKbw7LC2w
NOTE: this method does not download autogenerated closed captions, even if you get the language right (maybe there's a special code for autogenerated languages).
You can download the streaming subtitles from YouTube with KeepSubs DownSub and SaveSubs.
You can choose from the Automatic Transcript or author supplied close captions. It also offers the possibility to automatically translate the English subtitles into other languages using Google Translate.
(Obligatory 'this is probably an internal youtube.com interface and might break at any time')
Instead of linking to another tool that does this, here's an answer to the question of "how to do this"
Use fiddler or your browser devtools (e.g.
Chrome) to inspect the youtube.com HTTP traffic, and there's a response from /api/timedtext that contains the closed caption info as XML.
It seems that a response like this:
<p t="0" d="5430" w="1">
<s p="2" ac="136">we've</s>
<s t="780" ac="252"> got</s>
</p>
<p t="2280" d="7170" w="1">
<s ac="243">we're</s>
<s t="810" ac="233"> going</s>
</p>
means at time 0 is the word we've and at time 0+780 is the word got and at time 2280+810 is the word going, etc. This time is in milliseconds so for time 3090 you'd want to append &t=3 to the URL.
You can use any tool to stitch together the XML into something readable, but here's my Power BI Desktop script to find words like "privilege":
let
Source = Xml.Tables(File.Contents("C:\Download\body.xml")),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:format", Int64.Type}}),
body = #"Changed Type"{0}[body],
p = body{0}[p],
#"Changed Type1" = Table.TransformColumnTypes(p,{{"Attribute:t", Int64.Type}, {"Attribute:d", Int64.Type}, {"Attribute:w", Int64.Type}, {"Attribute:a", Int64.Type}, {"Attribute:p", Int64.Type}}),
#"Expanded s" = Table.ExpandTableColumn(#"Changed Type1", "s", {"Attribute:ac", "Attribute:p", "Attribute:t", "Element:Text"}, {"s.Attribute:ac", "s.Attribute:p", "s.Attribute:t", "s.Element:Text"}),
#"Changed Type2" = Table.TransformColumnTypes(#"Expanded s",{{"s.Attribute:t", Int64.Type}}),
#"Removed Other Columns" = Table.SelectColumns(#"Changed Type2",{"s.Attribute:t", "s.Element:Text", "Attribute:t"}),
#"Replaced Value" = Table.ReplaceValue(#"Removed Other Columns",null,0,Replacer.ReplaceValue,{"s.Attribute:t"}),
#"Filtered Rows" = Table.SelectRows(#"Replaced Value", each [#"s.Element:Text"] <> null),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Time", each [#"Attribute:t"] + [#"s.Attribute:t"]),
#"Filtered Rows1" = Table.SelectRows(#"Added Custom", each ([#"s.Element:Text"] = " privilege" or [#"s.Element:Text"] = " privileged" or [#"s.Element:Text"] = " privileges" or [#"s.Element:Text"] = "privilege" or [#"s.Element:Text"] = "privileges"))
in
#"Filtered Rows1"
There is a free python tool called YouTube transcript API
You can use it in scripts or as a command line tool:
pip install youtube_transcript_api
With the YouTube video updated as of June 2020 it's very straight forward
select on the 3 dots next to like/dislike buttons to open further menu options
select "add translations"
select language
click autogenerate if needed
click Actions > Download
You will get and .sbv file
Choose Open Transcript from the ... dropdown to the right of the vote up/down and share links.
This will open a Transcript scrolling div on the right side.
You can then use Copy. Note that you cannot use Select All but need to click the top line, then scroll to the bottom using the scroll thumb, and then shift-click on the last line.
Note that you can also search within this text using the normal web page search.
I just got this easily done manually by opening the transcript at the beginning of the video and left-clicking and dragging at the time 00:00 marker with the shift key pressed over a few lines at the beginning.
I then advanced the video to near the end. When the video stopped, I clicked the end of the last sentence whilst holding down the shift key once more. With CTRL-C I copied the text to the clipboard and pasted it into an editor.
Done!
Caveat: Be sure to have no RDP-Windows sharing the clipboard or Software such as Teamviewer is running at the same time as this procedure will overflow their buffers where a large amount of text is copied.

Resources