How to extract closed caption transcript from YouTube video? - youtube

Is it possible to extract the closed caption transcript from YouTube videos?
We have over 200 webcasts on YouTube and each is at least one hour long. YouTube has closed caption for all videos but it seems users have no way to get it.
I tried the URL in this blog but it does not work with our videos.
http://googlesystem.blogspot.com/2010/10/download-youtube-captions.html

Here's how to get the transcript of a YouTube video (when available):
Go to YouTube and open the video of your choice.
Click on the "More actions" button (3 horizontal dots) located next to the Share button.
Click "Open transcript"
Although the syntax may be a little goofy this is a pretty good solution.
Source: http://ccm.net/faq/40644-youtube-how-to-get-the-transcript-of-a-video

Get timedtext file directly from YouTube
curl -s "$video_url"|grep -o '"baseUrl":"https://www.youtube.com/api/timedtext[^"]*lang=en'|cut -d \" -f4|sed 's/\\u0026/\&/g'|xargs curl -Ls|grep -o '<text[^<]*</text>'|sed -E 's/<text start="([^"]*)".*>(.*)<.*/\1 \2/'|sed 's/\xc2\xa0/ /g;s/&/\&/g'|recode xml|awk '{$1=sprintf("%02d:%02d:%02d",$1/3600,$1%3600/60,$1%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1
yt-dlp
yt-dlp supports saving the automatically generated closed captions in a JSON format:
cap()(printf %s\\n "${#-$(cat)}"|parallel -j10 -q yt-dlp -i --skip-download --write-auto-sub --sub-format json3 -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.json3;do jq -r '.events[]|select(.segs and .segs[0].utf8!="\n")|(.tStartMs|tostring)+" "+([.segs[]?.utf8]|join(""))' "$f"|awk '{x=$1/1e3;$1=sprintf("%02d:%02d:%02d",x/3600,x%3600/60,x%60)}1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.json3}";rm "$f";done)
You can also use the function above to download the captions for all videos on a channel or playlist if you give the ID or URL of the channel or playlist as an argument. When there is an error downloading a single video, the -i (--ignore-errors) option skips the video instead of exiting with an error.
Or this just gets the text without the timestamps:
yt-dlp --skip-download --write-auto-sub --sub-format json3 $youtube_url_or_id;jq -r '.events[]|select(.segs and .segs[0].utf8!="\n")|[.segs[].utf8]|join("")' *json3|paste -sd\ -|fold -sw60
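If you would rather drive yt-dlp from Python than from the shell, roughly the same result can be had through its embedded API. A minimal sketch (option names as used by yt-dlp's Python API; VIDEO_ID is a placeholder) that prints the same HH:MM:SS lines the jq filters above produce:
import glob
import json
from yt_dlp import YoutubeDL
ydl_opts = {
    "skip_download": True,
    "writeautomaticsub": True,   # auto-generated captions
    "subtitleslangs": ["en"],    # adjust as needed
    "subtitlesformat": "json3",
    "outtmpl": "%(id)s.%(ext)s",
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder
for path in glob.glob("*.json3"):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for event in data.get("events", []):
        segs = event.get("segs")
        if not segs or segs[0].get("utf8") == "\n":
            continue
        seconds = event.get("tStartMs", 0) // 1000
        timestamp = "%02d:%02d:%02d" % (seconds // 3600, seconds % 3600 // 60, seconds % 60)
        print(timestamp, "".join(s.get("utf8", "") for s in segs))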
youtube-dl
As of 2022, the VTT and TTML files downloaded by youtube-dl --write-auto-sub are mangled: all subtitle texts are crammed into a few long lines, so the individual timestamps are lost. If you don't need the timestamps, this shouldn't matter; otherwise you can fix it by substituting yt-dlp for youtube-dl in the following commands. With yt-dlp you can also use the more convenient JSON format shown above, in which case you don't need the approach below for dealing with the VTT subtitle format.
This downloads the subtitles as VTT:
youtube-dl --skip-download --write-auto-sub $youtube_url
The other available formats are ttml, srv3, srv2, and srv1 (shown by --list-subs):
--write-sub
Write subtitle file
--write-auto-sub
Write automatically generated subtitle file (YouTube only)
--all-subs
Download all the available subtitles of the video
--list-subs
List all available subtitles for the video
--sub-format FORMAT
Subtitle format, accepts formats preference, for example: "srt" or "ass/srt/best"
--sub-lang LANGS
Languages of the subtitles to download (optional) separated by commas, use --list-subs for available language tags
You can use ffmpeg to convert the subtitle file to another format:
ffmpeg -i input.vtt output.srt
In the VTT subtitles, each subtitle text is repeated three times, and there is typically a new subtitle text every eighth line (but under some mysterious circumstances it's every 12th line instead):
WEBVTT
Kind: captions
Language: en
00:00:01.429 --> 00:00:04.249 align:start position:0%
ladies<00:00:02.429><c> and</c><00:00:02.580><c> gentlemen</c><c.colorE5E5E5><00:00:02.879><c> I'd</c></c><c.colorCCCCCC><00:00:03.870><c> like</c></c><c.colorE5E5E5><00:00:04.020><c> to</c><00:00:04.110><c> thank</c></c>
00:00:04.249 --> 00:00:04.259 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
</c>
00:00:04.259 --> 00:00:05.930 align:start position:0%
ladies and gentlemen<c.colorE5E5E5> I'd</c><c.colorCCCCCC> like</c><c.colorE5E5E5> to thank
you<00:00:04.440><c> for</c><00:00:04.620><c> coming</c><00:00:05.069><c> tonight</c><00:00:05.190><c> especially</c></c><c.colorCCCCCC><00:00:05.609><c> at</c></c>
00:00:05.930 --> 00:00:05.940 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
</c>
00:00:05.940 --> 00:00:07.730 align:start position:0%
you<c.colorE5E5E5> for coming tonight especially</c><c.colorCCCCCC> at
such<00:00:06.180><c> short</c><00:00:06.690><c> notice</c></c>
00:00:07.730 --> 00:00:07.740 align:start position:0%
such short notice
00:00:07.740 --> 00:00:09.620 align:start position:0%
such short notice
I'm<00:00:08.370><c> sure</c><c.colorE5E5E5><00:00:08.580><c> mr.</c><00:00:08.820><c> Irving</c><00:00:09.000><c> will</c><00:00:09.120><c> fill</c><00:00:09.300><c> you</c><00:00:09.389><c> in</c><00:00:09.420><c> on</c></c>
00:00:09.620 --> 00:00:09.630 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
</c>
00:00:09.630 --> 00:00:11.030 align:start position:0%
I'm sure<c.colorE5E5E5> mr. Irving will fill you in on
the<00:00:09.750><c> circumstances</c><00:00:10.440><c> that's</c><00:00:10.620><c> brought</c><00:00:10.920><c> us</c></c>
00:00:11.030 --> 00:00:11.040 align:start position:0%
<c.colorE5E5E5>the circumstances that's brought us
</c>
This converts the VTT subtitles to a simpler format:
sed '1,/^$/d' *.vtt| # remove the lines at the top of the file
sed 's/<[^>]*>//g'| # remove tags
awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3' | # print each new subtitle text and its start time without milliseconds
awk NF\>1 # remove lines with only one field
Output:
00:00:01 ladies and gentlemen I'd like to thank
00:00:04 you for coming tonight especially at
00:00:05 such short notice
00:00:07 I'm sure mr. Irving will fill you in on
00:00:09 the circumstances that's brought us
In maybe around 10% of the videos I tested with (for example p9M3shEU-QM and aE05_REXnBc), there were one or more subtitle texts which came 12 and not 8 lines after the previous subtitle text. But a workaround is to print every fourth line and then remove the empty lines.
Function form:
cap()(printf %s\\n "${#-$(cat)}"|parallel -j10 -q youtube-dl -i --skip-download --write-auto-sub -o '%(upload_date)s.%(title)s.%(uploader)s.%(id)s.%(ext)s' --;for f in *.vtt;do sed '1,/^$/d' -- "$f"|sed 's/<[^>]*>//g'|awk -F. 'NR%4==1{printf"%s ",$1}NR%4==3'|awk 'NF>1'|awk 'NR%n==1{printf"%s ",$1}{sub(/^[^ ]* /,"");printf"%s"(NR%n?FS:RS),$0}' n=2|awk 1 >"${f%.vtt}";rm "$f";done)

The following document says only the owner of the channel can do this via the standard YouTube interface:
https://developers.google.com/youtube/2.0/developers_guide_protocol_captions?hl=en
Cheap fix:
You can click on the "interactive transcript" button and copy the content that way.
Of course you lose the milliseconds this way.
Extremely cheap fix:
A shared youtube account -
so that multiple people can edit and upload caption files.
Challenging solution:
The youtube API allows downloading and uploading of caption files via HTTP...
You may write a youtube API application to provide a browser user interface for uploading or downloading for ANY user or particular users.
Here is an example project for this in java
http://apiblog.youtube.com/2011/01/youtube-captions-uploader-web-app.html
Here is very simple example of a working upload for everybody:
http://yt-captions-uploader.appspot.com/

You can view/copy/download a timecoded xml file of a youtube's closed captions file by accessing
http://video.google.com/timedtext?lang=[LANGUAGE]&v=[YOUTUBE VIDEO IDENTIFIER]
For example http://video.google.com/timedtext?lang=pt&v=WSVKbw7LC2w
NOTE: this method does not download autogenerated closed captions, even if you get the language right (maybe there's a special code for autogenerated languages).

You can download the streaming subtitles from YouTube with KeepSubs, DownSub and SaveSubs.
You can choose between the automatic transcript or author-supplied closed captions. They also offer the possibility to automatically translate the English subtitles into other languages using Google Translate.

(Obligatory 'this is probably an internal youtube.com interface and might break at any time')
Instead of linking to another tool that does this, here's an answer to the question of how to do it yourself.
Use Fiddler or your browser's devtools (e.g. Chrome's) to inspect the youtube.com HTTP traffic; there's a response from /api/timedtext that contains the closed caption info as XML.
It seems that a response like this:
<p t="0" d="5430" w="1">
<s p="2" ac="136">we've</s>
<s t="780" ac="252"> got</s>
</p>
<p t="2280" d="7170" w="1">
<s ac="243">we're</s>
<s t="810" ac="233"> going</s>
</p>
means that at time 0 comes the word "we've", at time 0+780 the word "got", at time 2280+810 the word "going", and so on. These times are in milliseconds, so for time 3090 you would append &t=3 to the video URL.
You can use any tool to stitch together the XML into something readable, but here's my Power BI Desktop script to find words like "privilege":
let
Source = Xml.Tables(File.Contents("C:\Download\body.xml")),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:format", Int64.Type}}),
body = #"Changed Type"{0}[body],
p = body{0}[p],
#"Changed Type1" = Table.TransformColumnTypes(p,{{"Attribute:t", Int64.Type}, {"Attribute:d", Int64.Type}, {"Attribute:w", Int64.Type}, {"Attribute:a", Int64.Type}, {"Attribute:p", Int64.Type}}),
#"Expanded s" = Table.ExpandTableColumn(#"Changed Type1", "s", {"Attribute:ac", "Attribute:p", "Attribute:t", "Element:Text"}, {"s.Attribute:ac", "s.Attribute:p", "s.Attribute:t", "s.Element:Text"}),
#"Changed Type2" = Table.TransformColumnTypes(#"Expanded s",{{"s.Attribute:t", Int64.Type}}),
#"Removed Other Columns" = Table.SelectColumns(#"Changed Type2",{"s.Attribute:t", "s.Element:Text", "Attribute:t"}),
#"Replaced Value" = Table.ReplaceValue(#"Removed Other Columns",null,0,Replacer.ReplaceValue,{"s.Attribute:t"}),
#"Filtered Rows" = Table.SelectRows(#"Replaced Value", each [#"s.Element:Text"] <> null),
#"Added Custom" = Table.AddColumn(#"Filtered Rows", "Time", each [#"Attribute:t"] + [#"s.Attribute:t"]),
#"Filtered Rows1" = Table.SelectRows(#"Added Custom", each ([#"s.Element:Text"] = " privilege" or [#"s.Element:Text"] = " privileged" or [#"s.Element:Text"] = " privileges" or [#"s.Element:Text"] = "privilege" or [#"s.Element:Text"] = "privileges"))
in
#"Filtered Rows1"

There is a free Python tool called youtube-transcript-api.
You can use it in scripts or as a command line tool:
pip install youtube_transcript_api
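A minimal fetch-and-print sketch (the exact interface has shifted slightly between releases, and VIDEO_ID is a placeholder):
from youtube_transcript_api import YouTubeTranscriptApi
# print "start_seconds  text" for every caption line
for line in YouTubeTranscriptApi.get_transcript("VIDEO_ID"):
    print("%8.2f  %s" % (line["start"], line["text"]))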

With the YouTube interface as of June 2020 it's very straightforward:
Select the 3 dots next to the like/dislike buttons to open further menu options
select "add translations"
select language
click autogenerate if needed
click Actions > Download
You will get an .sbv file

Choose Open Transcript from the ... dropdown to the right of the vote up/down and share links.
This will open a Transcript scrolling div on the right side.
You can then use Copy. Note that you cannot use Select All but need to click the top line, then scroll to the bottom using the scroll thumb, and then shift-click on the last line.
Note that you can also search within this text using the normal web page search.

I just got this easily done manually by opening the transcript at the beginning of the video and left-clicking and dragging at the time 00:00 marker with the shift key pressed over a few lines at the beginning.
I then advanced the video to near the end. When the video stopped, I clicked the end of the last sentence whilst holding down the shift key once more. With CTRL-C I copied the text to the clipboard and pasted it into an editor.
Done!
Caveat: Be sure that no RDP session sharing the clipboard, and no software such as TeamViewer, is running at the same time, as copying such a large amount of text can overflow their clipboard buffers.

Related

Zebra Printer - Cut on last page

I have a Zebra ZT610 and I want to print a label, in PDF format, containing multiple pages and then have it cut on the last page. I've tried using the delayed cut mode and sending the ~JK command, but I'm using a self-written Java application to do the invocation of printing. I've also tried to add the string "${^XB}$" into the PDF document before each page break, except the last, and used the pass-through setting in the driver to inhibit the cut command, but that seems not to work either, as the Java print job renders such text as an image.
I've tried the official Zebra driver as well as the NiceLabel Zebra driver in the hope that they may have more "Custom Commands" options in the settings, but nothing has yet come to light.
After we had the same issue for several weeks and neither the vendor nor Google nor Zebra's own support came up with a FULL working solution, we worked out the following EASY 5-step solution for this (apparently pretty common) Zebra cutter problem:
Step 1:
Set Cutter-Mode to Tear-Off in the settings.
This will disable the auto-cutting after every single page.
Step 2: Go to Customer-Commands in the settings dialog (Allows ZPL coding).
Step 3: Set the first drop-down to "DOCUMENT".
Step 4: Set the Start-Section to "TEXT" and paste in
^XA^MMD^XZ^XA^JUS^XZ
^MMD puts the printer into delayed-cut mode; the ~JK command only takes effect in this mode, and many Zebra printers do not support the much simpler Cut-Now command.
^JUS saves the settings to the printer.
Step 5: Set the End-Section to "ANALYZED TEXT" and paste in
~JK~PS
~JK queues a cut for the end of the document; ~PS (Print Start) then resumes printing immediately. When everything looks as described above, hit "APPLY" and your Zebra printer will automatically cut after the end of each document you send to it. You just send your PDF using Sumatra or whatever you prefer. The cutter handling is now done automatically by the printer settings.
Alternatively, if you want to do this programmatically, use the start and end codes at the corresponding positions in your ZPL code instead. Note that ~ commands cannot be sent in combination with ^ commands, which is why there is no ^XA...^XZ block to reset any settings (that isn't necessary in this scenario anyway, since it only affects the print session and ~PS resumes printing).
I had a similar concern, but as the print server was CUPS, I wasn't able to use the Windows drivers and utilities (settings dialog). So basically, I did the following:
On the printer, set Cutter mode. This will cut after each printed label.
In my Java code, thanks to the Apache PDFBox lib, open the PDF and for each page, render it as a monochrome BufferedImage, get the byte array from it, and get its hex representation.
Write a few ZPL commands to download hex as graphic data, and add the ^XB command before the ^XZ one, in order to prevent a cut here, except for the last page, so that there is a cut only at the end of the document.
Send the generated ZPL code to the printer. In my case, I send it as a raw document through IPP, using application/vnd.cups-raw as mime-type, thanks to the great lib ipp-client-kotlin, but it is also possible to use Java native printing API with bytes.
Below is a snippet of Java code, for demo purposes:
public void printPdfStream(InputStream pdfStream) throws IOException {
    try (PDDocument pdDocument = PDDocument.load(pdfStream)) {
        PDFRenderer pdfRenderer = new PDFRenderer(pdDocument);
        StringBuilder builder = new StringBuilder();
        for (int pageIndex = 0; pageIndex < pdDocument.getNumberOfPages(); pageIndex++) {
            boolean isLastPage = pageIndex == pdDocument.getNumberOfPages() - 1;
            BufferedImage bufferedImage = pdfRenderer.renderImageWithDPI(pageIndex, 300, ImageType.BINARY);
            byte[] data = ((DataBufferByte) bufferedImage.getData().getDataBuffer()).getData();
            int length = data.length;
            // Invert bytes
            for (int i = 0; i < length; i++) {
                data[i] ^= 0xFF;
            }
            builder.append("~DGR:label,").append(length).append(",").append(length / bufferedImage.getHeight())
                    .append(",").append(Hex.getString(data));
            builder.append("^XA");
            builder.append("^FO0,0");
            builder.append("^XGR:label,1,1");
            builder.append("^FS");
            if (!isLastPage) {
                builder.append("^XB");
            }
            builder.append("^XZ");
        }
        IppPrinter ippPrinter = new IppPrinter("ipp://printserver/printers/myprinter");
        ippPrinter.printJob(new ByteArrayInputStream(builder.toString().getBytes()),
                documentFormat("application/vnd.cups-raw"));
    }
}
Important: hex data can (and should) be compressed, as mentioned in ZPL Programming Guide, section Alternative Data Compression Scheme for ~DG and ~DB Commands. Depending on the PDF content, it may drastically reduce the data size (by a factor 10 in my case!).
Note that Zebra's support provides a few more alternatives for controlling the cutter, but this one worked immediately.
Zebra Automatic Cut - Found another solution.
Create a file with the name: Delayed Cut Settings.txt
Insert the following code: ^XA^MMC,N^XZ
Send it to the printer
After you do the 3 steps above, all the documents you send to the printer will be cut automatically.
(To disable that function, send the 'Delayed Cut Settings.txt' again with the following code: ^XA^MMD^XZ )
In the first document you send to the printer, you need to ADD (just once) the command ^MMC,N before the ^XZ
My EXAMPLE TXT:
^XA
^FX Top section with logo, name and address.
^CF0,60
^FO50,50^GB100,100,100^FS
^FO75,75^FR^GB100,100,100^FS
^FO93,93^GB40,40,40^FS
^FO220,50^FDIntershipping, Inc.^FS
^CF0,30
^FO220,115^FD1000 Shipping Lane^FS
^FO220,155^FDShelbyville TN 38102^FS
^FO220,195^FDUnited States (USA)^FS
^FO50,250^GB700,3,3^FS
^FX Second section with recipient address and permit information.
^CFA,30
^FO50,300^FDJohn Doe^FS
^FO50,340^FD100 Main Street^FS
^FO50,380^FDSpringfield TN 39021^FS
^FO50,420^FDUnited States (USA)^FS
^CFA,15
^FO600,300^GB150,150,3^FS
^FO638,340^FDPermit^FS
^FO638,390^FD123456^FS
^FO50,500^GB700,3,3^FS
^FX Third section with bar code.
^BY5,2,270
^FO100,550^BC^FD12345678^FS
^FX Fourth section (the two boxes on the bottom).
^FO50,900^GB700,250,3^FS
^FO400,900^GB3,250,3^FS
^CF0,40
^FO100,960^FDCtr. X34B-1^FS
^FO100,1010^FDREF1 F00B47^FS
^FO100,1060^FDREF2 BL4H8^FS
^CF0,190
^FO470,955^FDCA^FS
^MMC,N
^XZ
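For the "Send it to the printer" step, one way to push such a raw ZPL file from a script is the printer's raw printing port. A minimal Python sketch, assuming the printer listens on the standard raw port 9100 (most networked Zebra printers do) and with a placeholder hostname:
import socket
PRINTER_HOST = "zebra-zt610.local"   # placeholder, use your printer's address
PRINTER_PORT = 9100                  # standard raw/JetDirect printing port
with open("Delayed Cut Settings.txt", "rb") as f:
    zpl = f.read()
with socket.create_connection((PRINTER_HOST, PRINTER_PORT), timeout=10) as conn:
    conn.sendall(zpl)                # send the ZPL bytes straight to the printer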

How do some sites download YouTube captions?

This is somewhat of a duplicate of Does YouTube API forbid to download video captions if you are not it's owner? and Get YouTube captions, which basically say it's not possible to download captions via the YouTube API unless you are the owner or third-party contributions are enabled; however, my question is how do sites like http://downsub.com/ or http://www.lilsubs.com/ have access to all captions?
In other words, when I access the YouTube API myself (even with the youtubepartner and youtube.force-ssl scopes), I can only download the captions of some videos; the others fail with 403: The permissions associated with the request are not sufficient to download the caption track. The request might not be properly authorized, or the video owner might not have enabled third-party contributions for this caption. Yet when I try the same videos on these other sites, it works fine. I'm assuming they are using the YouTube API to access the captions, but what special sauce are they using? Some special partner key? A different API version? Are they just scraping from the videos themselves or something?
Send a GET request on:
http://video.google.com/timedtext?lang={LANG}&v={VIDEOID}
Example for your video in comment: http://video.google.com/timedtext?lang=ko&v=0db1_qWZjRA
Let's look at another example of yours, i.e. https://www.youtube.com/watch?v=7068mw-6lmI (and I agree about the differentiation part in your comment).
There are multiple subtitles available for the video
English
Korean
Spanish
Korean (auto-generated), also called asr (automatic speech recognition)
These correspond to the subtitle name parameter (e.g., name=English).
lang stands for the language code.
In your example: https://www.youtube.com/api/timedtext?lang=es-MX&v=7068mw-6lmI&name=Spanish
If a subtitle track is available, it is possible to get a translation from it, using the tlang parameter.
https://www.youtube.com/api/timedtext?lang=en&v=7068mw-6lmI&name=English&tlang=lv
https://www.youtube.com/api/timedtext?lang=ko&v=7068mw-6lmI&name=Korean&tlang=lv
This would be my bet for what these sites are using, i.e. translation of the available subtitle track (you can confirm this by feeding one of these sites a video without any subtitle track).
As for asr tracks, a signature seems to always be needed, but as long as one of the subtitle tracks is available, you can use that for translation. E.g. in your OP comment example:
https://www.youtube.com/api/timedtext?lang=en&v=vx6NCUyg1NE&tlang=lv
It looks like the last example is special in that both of its subtitle tracks are asr (checked with Chrome -> Inspect -> Network), so you need to omit the subtitle name parameter. Unfortunately this difference is not visible in the YouTube video's settings wheel.
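A quick way to try such translated tracks programmatically (same caveat: this is an internal, undocumented endpoint, so the parameters may stop working at any time; values mirror the example URLs above):
import requests
params = {
    "v": "7068mw-6lmI",   # video ID from the example above
    "lang": "en",         # language of the existing track
    "name": "English",    # track name, when the video defines one
    "tlang": "lv",        # target translation language
}
response = requests.get("https://www.youtube.com/api/timedtext", params=params)
print(response.text)      # XML with <text start="..." dur="..."> elements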
A 2022 answer:
Option 1: Send a curl request to the video page, e.g. curl -L "https://youtu.be/YbJOTdZBX1g", and search for timedtext in the result to find a URL. Replace \u0026 with & and you have the link to the subtitles.
Option 2: Use the yt-dlp package:
# For installing see: https://github.com/yt-dlp/yt-dlp#with-pip
from yt_dlp import YoutubeDL

ydl_opts = {
    "skip_download": True,
    "writesubtitles": True,
    "subtitleslangs": ["all", "-live_chat"],
    # Looks like formats available are vtt, ttml, srv3, srv2, srv1, json3
    "subtitlesformat": "json3",
    # You can skip the following option
    "sleep_interval_subtitles": 1,
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["YbJOTdZBX1g"])
There is this unofficial API used by YouTube:
https://www.youtube.com/api/timedtext?lang={LANG}&v={VIDEO_ID}
LANG here is the two-letter ISO 639-1 language code. For your example it would be:
https://www.youtube.com/api/timedtext?lang=ko&v=0db1_qWZjRA
You can check it in the network tab while toggling the closed captions button:
I have used youtube-transcript-api successfully to retrieve transcripts. Below is a demo that dumps the transcript into HTML with links back to the timestamps in the video:
import sys
from youtube_transcript_api import YouTubeTranscriptApi

video_id = sys.argv[1]

# Retrieve the available transcripts
transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)

# Just use the first transcript, let it raise an exception if none exist.
transcript = next(iter(transcript_list))

print("<html><body>")
for line_map in transcript.fetch():
    st_sec = int(line_map['start'] / 60)
    st_msec = int(line_map['start'] - st_sec * 60)
    tstmp = f"{st_sec}:{st_msec}"
    link_to_tstmp = f"https://youtu.be/{video_id}?t={st_sec*60}"
    tstmp_str = ("%2d:%-2d" % (st_sec, st_msec)).replace(" ", "&nbsp;")
    #print(f"{st_sec}:{st_msec} {line_map['text']}")
    print("""<a href="%s">%s</a> %s<br/>""" % (link_to_tstmp, tstmp_str, line_map['text']))
print("</body></html>")
If there are multiple transcripts, the library provides API to search by language etc.
You can further tweak the logic to merge text so you only get one link every so many minutes. I got good results for a lecture by linking every 1 minute and formatting the lines into an HTML table.
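One way to do that merging, as a rough sketch on top of the same library (VIDEO_ID is a placeholder; lines are bucketed by minute and each bucket gets a single timestamp link):
from youtube_transcript_api import YouTubeTranscriptApi
video_id = "VIDEO_ID"  # placeholder
buckets = {}
for line in YouTubeTranscriptApi.get_transcript(video_id):
    minute = int(line["start"] // 60)
    buckets.setdefault(minute, []).append(line["text"])
print("<html><body>")
for minute in sorted(buckets):
    link = "https://youtu.be/%s?t=%d" % (video_id, minute * 60)
    print('<a href="%s">%02d:00</a> %s<br/>' % (link, minute, " ".join(buckets[minute])))
print("</body></html>")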

How to make TextEdit links clickable

This script is supposed to open a series of web pages in a new browser window, and then open TextEdit with predetermined text and links.
Safari does what it is supposed to.
TextEdit opens and pastes the text I want, but the links are not clickable.
I know I could just right click and choose Substitutions> Add Links> myself, but I am trying to automate the entire process.
I appreciate your time and efforts on my behalf! Thank you!
OpenWebPages()
OpenTextEditPage()
to OpenTextEditPage()
    -- Create a variable for text
    set docText to ""
    tell application "TextEdit"
        activate
        make new document
        -- Define the text to be pasted into TextEdit
        set docText to docText & "Some text to show in TextEdit." & linefeed & "
My favorite site about coding is http://stackoverflow.com/
My favorite site for paper modeling is http://www.ss42.com/toys.html
My favorite site for inventing is http://www.instructables.com/howto/bubble+machine/
" & linefeed & "Click the links above to improve your mind!" as string
        -- Paste the above text and links into TextEdit
        set the text of the front document to docText & "" as string
        tell application "System Events"
            tell process "TextEdit"
                -- highlight all text
                keystroke "a" using command down
                -- Think of a clever way to right click and choose Substitutions> Add Links>
                -- Or think of another clever way to turn all URLs into links please.
            end tell
        end tell
    end tell
end OpenTextEditPage
to OpenWebPages()
    -- Start new Safari window
    tell application "Safari"
        -- activate Safari and open the StackOverflow AppleScript page
        make new document with properties {URL:"http://stackoverflow.com/search?q=applescript"}
        -- Yoda is watching you
        open location "http://www.ss42.com/pt/Yoda/YodaGallery/yoda-gallery.html"
        -- Indoor boomerang
        open location "http://www.ss42.com/pt/paperang/paperang.html"
        -- Are you a Human ?
        open location "http://stackoverflow.com/nocaptcha?s=f5c92674-b080-4cea-9ff2-4fdf1d6d19de"
    end tell
end OpenWebPages
Based on this question I built this little handler. It takes a path to your RTF file, like makeURLsHyper((path to desktop folder as string) & "TestDoc.rtf"), and worked quite well in my little tests. But it doesn't take care of text formatting at this point.
on makeURLsHyper(pathOfRtfFile)
    -- converting the given path to a posix path (quoted for later use)
    set myRtfPosixPath to quoted form of (POSIX path of pathOfRtfFile)
    -- RTF Hyperlink start
    set rtfLinkStart to "{\\\\field{\\\\*\\\\fldinst HYPERLINK \""
    -- RTF Hyperlink middle
    set rtfMiddlePart to "\"}{\\\\fldrslt "
    -- RTF Hyperlink end
    set rtfLinkEnd to "}}"
    -- use sed to convert http-strings to rtf hyperlinks
    set newFileContent to (do shell script "sed -i bak -e 's#\\(http[^[:space:]]*\\)#" & rtfLinkStart & "\\1" & rtfMiddlePart & "\\1" & rtfLinkEnd & "#g' " & myRtfPosixPath)
end makeURLsHyper
Have a nice day, Michael / Hamburg

youtube-dl -citw downloads only one video instead of all

I used
user@ubuntu:~/Arbeitsfläche$ sudo youtube-dl -citw ytuser: raz\ malca
to download all videos from this channel. But it downloads only one:
user@ubuntu:~/Arbeitsfläche$ sudo youtube-dl -citw ytuser: raz\ malca
[generic] ytuser:: Requesting header
ERROR: Invalid URL protocol; please report this issue on https://yt-dl.org/bug . Be sure to call youtube-dl with the --verbose flag and include its complete output. Make sure you are using the latest version; type youtube-dl -U to update.
WARNING: Falling back to youtube search for raz malca . Set --default-search to "auto" to suppress this warning.
[youtube:search] query "raz malca": Downloading page 1
[download] Downloading playlist: raz malca
[youtube:search] playlist raz malca: Collected 1 video ids (downloading 1 of them)
[download] Downloading video #1 of 1
[youtube] Setting language
[youtube] R72vpqqNSDw: Downloading webpage
[youtube] R72vpqqNSDw: Downloading video info webpage
[youtube] R72vpqqNSDw: Extracting video information
I use the newest version of this program.
My system is Ubuntu 13.10 :)
Does anyone have an idea?
First of all, don't run youtube-dl with sudo - normal user rights suffice. There is also absolutely no need to pass in -ctw. But the main problem is that your command line contains a superfluous space after ytuser:. In any case, even if the space weren't there, raz malca is not a valid YouTube user ID. There is, however, a channel by that name. Simply pass that channel's URL to youtube-dl:
youtube-dl -i https://www.youtube.com/channel/UCnmQSqOPhkawAdndZgjfanA
To find out the channel's URL, inspect the user that created the channel. E.g. for the channel mix - o grande amor (Tom Jobim) - Solo Guitar by Chiba Kosei, right click (in Firefox) on Chiba Kosei and select Inspect Element. Notice that the channel is in the href part. So copy that and your final channel URL is https://youtube.com/channel/UC5q9SrhlCJ4YciPTLLNTR5w
<a href="/channel/UC5q9SrhlCJ4YciPTLLNTR5w" class="yt-uix-sessionlink g-hovercard spf-link " data-ytid="UC5q9SrhlCJ4YciPTLLNTR5w" data-sessionlink="itct=CDIQ4TkiEwjBvtSO7v3KAhWPnr4KHUYMDbko-B0">Chiba Kosei</a>

Downloading a YouTube video through Wget

I am trying to download YouTube videos through Wget. The first thing necessary is to capture the URL of the actual video resource. Suppose I want to download this video: video. Opening up the page in the Firebug console reveals something like this:
The link which I have encircled looks like the link to the resource, for there we see only the video: http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1. However, when I am trying to download this resource with Wget, a 4 KB file of name r-KBncrOggI#version=3&autohide=1 gets stored in my hard-drive, nothing else. What should I do to get the actual video?
And secondly, is there a way to capture different resources for videos of different resolutions, like 360px, 480px, etc.?
Here is one VERY simplified, yet functional version of the youtube-download utility I cited in my other answer:
#!/usr/bin/env perl
use strict;
use warnings;

# CPAN modules we depend on
use JSON::XS;
use LWP::UserAgent;
use URI::Escape;

# Initialize the User Agent
# YouTube servers are weird, so *don't* parse headers!
my $ua = LWP::UserAgent->new(parse_head => 0);

# fetch video page or abort
my $res = $ua->get($ARGV[0]);
die "bad HTTP response" unless $res->is_success;

# scrape video metadata
if ($res->content =~ /\byt\.playerConfig\s*=\s*({.+?});/sx) {
    # parse as JSON or abort
    my $json = eval { decode_json $1 };
    die "bad JSON: $1" if $@;

    # inside the JSON 'args' property, there's an encoded
    # url_encoded_fmt_stream_map property which points
    # to stream URLs and signatures
    while ($json->{args}{url_encoded_fmt_stream_map} =~ /\burl=(http.+?)&sig=([0-9A-F\.]+)/gx) {
        # decode URL and attach signature
        my $url = uri_unescape($1) . "&signature=$2";
        print $url, "\n";
    }
}
Usage example (it returns several URLs to streams with different encoding/quality):
$ perl youtube.pl http://www.youtube.com/watch?v=r-KBncrOggI | head -n 1
http://r19---sn-bg07sner.c.youtube.com/videoplayback?fexp=923014%2C916623%2C920704%2C912806%2C922403%2C922405%2C929901%2C913605%2C925710%2C929104%2C929110%2C908493%2C920201%2C913302%2C919009%2C911116%2C926403%2C910221%2C901451&ms=au&mv=m&mt=1357996514&cp=U0hUTVBNUF9FUUNONF9IR1RCOk01RjRyaG4wTHdQ&id=afe2819dcace8202&ratebypass=yes&key=yt1&newshard=yes&expire=1358022107&ip=201.52.68.216&ipbits=8&upn=m-kyX9-4Tgc&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&itag=44&sver=3&source=youtube,quality=large&signature=A1E7E91DD087067ED59101EF2AE421A3503C7FED.87CBE6AE7FB8D9E2B67FEFA9449D0FA769AEA739
I'm afraid it's not that easy to get the right link for the video resource.
The link you got, http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1, points to the player rather than the video itself. There is one Perl utility, youtube-download, which is well-maintained and does the trick. This is how to get the HQ version (magic fmt=18) of that video:
stas@Stanislaws-MacBook-Pro:~$ youtube-download -o "{title}.{suffix}" --fmt 18 r-KBncrOggI
--> Working on r-KBncrOggI
Downloading `Sourav Ganguly in Farhan Akhtar's Show - Oye! It's Friday!.mp4`
75161060/75161060 (100.00%)
Download successful!
stas@Stanislaws-MacBook-Pro:~$
There might be better command-line YouTube Downloaders around. But sorry, one doesn't simply download a video using Firebug and wget any more :(
The only way I know to capture that URL manually is by watching the active downloads of the browser:
The largest data chunks are the video data, so you can copy their URL:
http://s.youtube.com/s?lact=111116&uga=m30&volume=4.513679238953965&sd=BBE62AA4AHH1357937949850490&rendering=accelerated&fs=0&decoding=software&nsivbblmax=679542.000&hcbt=105.345&sendtmp=1&fmt=35&w=640&vtmp=1&referrer=None&hl=en_US&nsivbblmin=486355.000&nsivbblmean=603805.166&md=1&plid=AATTCZEEeM825vCx&ns=yt&ptk=youtube_none&csipt=watch7&rt=110.904&tsphab=1&nsiabblmax=129097.000&tspne=0&tpmt=110&nsiabblmin=123113.000&tspfdt=436&hbd=30900552&et=110.146&hbt=30.770&st=70.213&cfps=25&cr=BR&h=480&screenw=1440&nsiabblmean=125949.872&cpn=JlqV9j_oE1jzk7Zc&nsivbblc=343&nsiabblc=343&docid=r-KBncrOggI&len=1302.676&screenh=900&abd=1&pixel_ratio=1&bc=26131333&playerw=854&idpj=0&hcbd=25408143&playerh=510&ldpj=0&fexp=920704,919009,922403,916709,912806,929110,928008,920201,901451,909708,913605,925710,916623,929104,913302,910221,911116,914093,922405,929901&scoville=1&el=detailpage&bd=6676317&nsidf=1&vid=Yfg8gnutZoTD4G5SVKCxpsPvirbqG7pvR&bt=40.333&mos=0&vq=auto
However, for a large video, this will only return a part of the stream unless you figure out the URL query parameter responsible for stream range to be downloaded and adjust it.
A bonus: everything changes periodically as YouTube is constantly evolving. So, don't do that manually unless you crave pain.
