Getting YouTube auto-transcript from API? - youtube-api

Is it possible to pull the auto (non-user) generated video transcripts from any of the YouTube APIs?

As of Aug 2019, the following method lets you download transcripts:
Open in Browser
https://www.youtube.com/watch?v=[Video ID]
From Console type:
JSON.parse(ytplayer.config.args.player_response).captions.playerCaptionsTracklistRenderer.captionTracks[0].baseUrl
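The URL printed by that console expression points at a caption track. Here is a minimal sketch of turning such a track into plain text. It assumes the response is XML made of <text start="..." dur="..."> elements, which the answer above does not confirm. The sample string and the function name caption_xml_to_text are hypothetical, and fetching the URL is left out:

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical sample of a caption track; the real response layout may differ.
SAMPLE_XML = (
    '<transcript>'
    '<text start="0.0" dur="1.5">Hello &amp;#39;world&amp;#39;</text>'
    '<text start="1.5" dur="2.0">second line</text>'
    '</transcript>'
)

def caption_xml_to_text(xml_string):
    """Join the text of every <text> element into one string."""
    root = ET.fromstring(xml_string)
    parts = []
    for node in root.iter("text"):
        # Entities may be double-escaped, so unescape once more after parsing.
        parts.append(html.unescape(node.text or "").strip())
    return " ".join(part for part in parts if part)

print(caption_xml_to_text(SAMPLE_XML))
```

If the track comes back in a different layout, only the element name passed to root.iter should need to change.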

You may refer to this thread: How to get "transcript" in youtube-api v3
If you're authenticating with oAuth2, you could do a quick call to
this feed:
http://gdata.youtube.com/feeds/api/videos/[VIDEOID]/captiondata/[CAPTIONTRACKID]
to get the data you want. To retrieve a list of possible caption track
IDs with v2 of the API, you access this feed:
https://gdata.youtube.com/feeds/api/videos/[VIDEOID]/captions
That feed request also accepts some optional parameters, including
language, max-results, etc. For more details, along with a sample that
shows the returned format of the caption track list, see the
documentation at
https://developers.google.com/youtube/2.0/developers_guide_protocol_captions#Retrieve_Caption_Set
Also, here are some references which might help:
https://www.quora.com/Is-there-any-way-to-download-the-YouTube-transcripts-that-are-generated-automatically
http://ccm.net/faq/40644-how-to-get-the-transcript-of-a-youtube-video

1 Install youtube-transcript-api (https://github.com/jdepoix/youtube-transcript-api), e.g.:
pip3 install youtube_transcript_api
2 Create youtube_transcript_api-wrapper.py with the following code (based partially on https://stackoverflow.com/a/65325576/2585501):
from youtube_transcript_api import YouTubeTranscriptApi
#srt = YouTubeTranscriptApi.get_transcript(video_id)
videoListName = "youtubeVideoIDlist.txt"
with open(videoListName) as f:
    video_ids = f.read().splitlines()
transcript_list, unretrievable_videos = YouTubeTranscriptApi.get_transcripts(video_ids, continue_after_error=True)
for video_id in video_ids:
    if video_id in transcript_list.keys():
        print("\nvideo_id = ", video_id)
        #print(transcript)
        srt = transcript_list.get(video_id)
        text_list = []
        for i in srt:
            text_list.append(i['text'])
        text = ' '.join(text_list)
        print(text)
3 Create youtubeVideoIDlist.txt containing a list of video_ids
4 python3 youtube_transcript_api-wrapper.py
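The wrapper above keeps only the 'text' of each transcript entry. If timing is needed as well, the entries could be written out in SRT form instead. This is a rough sketch, assuming each entry also carries 'start' and 'duration' in seconds next to 'text'. The helper names and the example entries are made up for illustration:

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments):
    """Turn a list of {'text', 'start', 'duration'} dicts into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = seg["start"]
        end = start + seg["duration"]
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{seg['text']}\n"
        )
    return "\n".join(blocks)

# Hypothetical entries, shaped like the ones iterated over in step 2.
example = [
    {"text": "hello there", "start": 0.0, "duration": 1.5},
    {"text": "general kenobi", "start": 1.5, "duration": 2.25},
]
print(segments_to_srt(example))
```

In the wrapper, segments_to_srt(srt) could stand in for the ' '.join(...) step when a subtitle file is wanted rather than a single line of text.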

Related

How to get Twitter mentions id using academictwitteR package?

I am trying to create several network analyses from Twitter. To get the data, I used the academictwitteR package and their get_all_tweets command.
get_all_tweets(
  users = c("LegaSalvini"),
  start_tweets = "2007-01-01T00:00:00Z",
  end_tweets = "2022-07-01T00:00:00Z",
  file = "tweets_lega",
  data_path = "tweetslega/",
  bind_tweets = FALSE
)
## Binding JSON files into data.frame objects
tweets_bind_lega <- bind_tweets(data_path = "tweetslega/")
## Tidying
tweets_bind_lega_tidy <- bind_tweets(data_path = "tweetslega/", output_format = "tidy")
With this, I can easily access the IDs needed to create a retweet network and a reply network. However, the tidy format does not provide a column for the mentions; it drops them.
The mentions are still in my untidy df tweets_bind_lega, but stored as a list in tweets_bind_lega$entities$mentions. I would like to unnest this list and create a tidy df with a column that contains the mentioned Twitter user IDs.
Has anyone created a mention network with academictwitteR before and can help me out?
Thanks!

YouTube API: differentiating between Premiered and Livestream

I am using YouTube data API and trying to differentiate prior livestreams vs premiered content. The liveStreamingDetails in the video list is populated for both livestreams and premiered content. Is there a way I can differentiate between the two?
Below is my Python code for getting the livestream start time. If it's not populated, then I know the video is not a livestream. The problem is that this value is populated for premiered content as well.
vid_request = youtube.videos().list(part = 'contentDetails, statistics, snippet, liveStreamingDetails, status',id = ','.join(vid_ids))
vid_response = vid_request.execute()
for videoitem in vid_response['items']:
    try:
        livestreamStartTime = videoitem['liveStreamingDetails']['actualStartTime']
    except KeyError:
        # liveStreamingDetails or actualStartTime is missing for this video
        livestreamStartTime = ''
Any pointers on what could work would really help?

How do I find the Conversion Action ID for use in the Google Ads API?

I'm using the latest (v7) Google Ads API to upload offline conversions for Google Ads, using the Python Client Library. This is the standard code I'm using:
import os
from google.ads.googleads.client import GoogleAdsClient

client = GoogleAdsClient.load_from_env(version='v7')

def process_adwords_conversion(
    conversion_date_time,
    gclid,
    conversion_action_id,
    conversion_value
):
    conversion_date_time = convert_datetime(conversion_date_time)
    customer_id = os.environ['GOOGLE_ADS_LOGIN_CUSTOMER_ID']
    click_conversion = client.get_type("ClickConversion")
    conversion_action_service = client.get_service("ConversionActionService")
    click_conversion.conversion_action = (
        conversion_action_service.conversion_action_path(
            customer_id, conversion_action_id
        )
    )
    click_conversion.gclid = gclid
    click_conversion.conversion_value = float(conversion_value)
    click_conversion.conversion_date_time = conversion_date_time
    click_conversion.currency_code = "USD"
    conversion_upload_service = client.get_service("ConversionUploadService")
    request = client.get_type("UploadClickConversionsRequest")
    request.customer_id = customer_id
    request.conversions = [click_conversion]
    request.partial_failure = True
    conversion_upload_response = (
        conversion_upload_service.upload_click_conversions(
            request=request,
        )
    )
    uploaded_click_conversion = conversion_upload_response.results[0]
    print(conversion_upload_response)
    print(
        f"Uploaded conversion that occurred at "
        f'"{uploaded_click_conversion.conversion_date_time}" from '
        f'Google Click ID "{uploaded_click_conversion.gclid}" '
        f'to "{uploaded_click_conversion.conversion_action}"'
    )
    return False
I believe the code is fine, but I'm having problems locating the conversion_action_id value to use. In the Google Ads UI there's a screen listing the different Conversion Actions, with no sign of an ID anywhere. You can click on the name and get more details, but still no ID:
[Screenshot: the conversion action detail screen in the Google Ads UI]
I've tried the following:
Using the ocid, ctId, euid, __u, uscid, __c, subid URL parameters from this detail page as the conversion_action_id. That always gives an error:
partial_failure_error {
code: 3
message: "This customer does not have an import conversion action that matches the conversion action provided., at conversions[0].conversion_action"
details {
type_url: "type.googleapis.com/google.ads.googleads.v7.errors.GoogleAdsFailure"
value: "\n\305\001\n\003\370\006\t\022dThis customer does not have an import conversion action that matches the conversion action provided.\0320*.customers/9603123598/conversionActions/6095821\"&\022\017\n\013conversions\030\000\022\023\n\021conversion_action"
}
}
Using the standard Google answer:
https://support.google.com/google-ads/thread/1449693/where-can-we-find-google-ads-conversion-ids?hl=en
Google suggests creating a new Conversion Action and obtaining the ID in the process. Unfortunately their instructions don't correspond to the current UI version, at least for me. The sequence I follow is:
Click the + icon on the Conversion Actions page
Select "Import" as the kind of conversion I want
Select "Other data sources or CRMs" then "Track conversions from clicks"
Click "Create and Continue"
I then get the screen:
[Screenshot: the screen following Conversion Action creation]
The recommended answer says:
The conversion action is now created and you are ready to set up the tag to add it to your website. You have three options and the recommended answer in this thread is discussing the Google Tag Manager option, which is the only option that uses the Conversion ID and Conversion Label. If you do not click on the Google Tag Manager option you will not be presented with the Conversion ID and Conversion Label.
Not so! What three options? The first "Learn more" link mentions the Google Tag Manager, but in the context of collecting the GCLID, which I already have. The "three options" mentioned in the official answer have gone. Clicking "done" simply takes me back to the Conversion Actions listing.
Using the REST API
I've tried authenticating and interrogating the endpoint:
https://googleads.googleapis.com/v7/customers/9603123598/conversionActions/
hoping that would give a list of conversion actions, but it doesn't. It just gives a 404.
Does anybody know a way of getting the Conversion Action ID, either from the UI or programmatically (via client library, REST or some other method)?
Thanks!
If you're using Python, you can list your conversion actions with the following snippet:
ads = client.get_service('GoogleAdsService')
# search() returns an iterable of result rows
rows = ads.search(query="SELECT conversion_action.id, conversion_action.name FROM conversion_action", customer_id='YOUR_MCC_ID')
for row in rows:
    print(row.conversion_action)
You can also open the conversion action in the UI and look for &ctId= in the URL; that value is the conversion action ID.
I found this post explaining how to get the Conversion Action ID:
(…) I found out that conversionActionId can be found also in Google
Ads panel. When you go to conversion action details, in URL there is
parameter "ctId=123456789" which represent conversion action id.
By the way, I tried something similar and it's still not working, but with this Conversion Action ID I get a different "Partial Failure" message, at least.
At least in Google Ads API (REST) v10, it works if the conversionAction field is set to:
'customers/{customerId}/conversionActions/{ctId}'
customerId: the customer ID, without hyphens
ctId: the conversion action ID, as mentioned in the comments above, taken from the GET parameters in the Google Ads panel when a specific conversion is opened.
It can also be found programmatically through the API.
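The two pieces above (the ctId URL parameter and the customer ID without hyphens) can be combined mechanically. This sketch uses only the Python standard library. The function name, the example URL and the customer ID are hypothetical, and the URL path in particular is not taken from the Google Ads UI:

```python
from urllib.parse import urlparse, parse_qs

def conversion_action_resource_name(customer_id, ui_url):
    """Build 'customers/{customerId}/conversionActions/{ctId}' from a UI URL."""
    query = parse_qs(urlparse(ui_url).query)
    ct_ids = query.get("ctId")
    if not ct_ids:
        raise ValueError("no ctId parameter found in URL")
    # The resource name expects the customer ID without hyphens.
    clean_customer_id = customer_id.replace("-", "")
    return f"customers/{clean_customer_id}/conversionActions/{ct_ids[0]}"

# Hypothetical URL and customer ID, for illustration only.
url = "https://ads.google.com/aw/conversions/detail?ocid=111&ctId=123456789&euid=222"
print(conversion_action_resource_name("123-456-7890", url))
```

Raising an error when ctId is absent avoids silently building a resource name from one of the other URL parameters, which is what produced the "does not have an import conversion action" error in the question.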

Tweepy API: get full text of tweet (>140 characters)

I am unable to get the entire text of a tweet: every tweet is cut off at 140 characters and ends in "..."
I am already using full_text and tweet_mode='extended', but it still does not work:
tweets = tw.Cursor(api.search,q=search_words,lang="en",since=date_since,until=date_until, tweet_mode='extended').items(10)
users_locs = [[tweet.user.screen_name, tweet.user.location,tweet.full_text, tweet.created_at] for tweet in tweets]
tweet_text = pd.DataFrame(data=users_locs,columns=['user', 'location','text','date'])
Could you help me with this?
Check which version of Tweepy you are using and update to the latest.
I tried three different ways (including yours) and was able to get full_text with tweet_mode='extended' on Tweepy 3.9.0.
The methods I used are:
status_list = api.statuses_lookup(list_of_ids, trim_user=False, tweet_mode="extended")
tweet_status = api.get_status(single_id, tweet_mode="extended")
tweets = tw.Cursor(api.search,q=search_words,lang="en",since=date_since,until=date_until, tweet_mode='extended').items(10)
And got the full_text in every one of them.
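One case where the text may still look cut off even with tweet_mode='extended' is a retweet, where the untruncated text sits on the retweeted_status object instead. This is a small sketch of a helper that handles both cases. The helper name is made up, and the objects below are stand-ins built with SimpleNamespace, not real API responses:

```python
from types import SimpleNamespace

def full_text_of(status):
    """Return the longest available text for a Tweepy-style status object."""
    retweeted = getattr(status, "retweeted_status", None)
    if retweeted is not None:
        # For retweets, read the text from the original tweet.
        return full_text_of(retweeted)
    # full_text is only present when tweet_mode="extended" was requested.
    return getattr(status, "full_text", None) or getattr(status, "text", "")

# Stand-in objects instead of real API responses.
original = SimpleNamespace(full_text="a long original tweet, not cut off")
retweet = SimpleNamespace(
    full_text="RT @someone: a long original tw…",
    retweeted_status=original,
)
print(full_text_of(retweet))
```

In the list comprehension from the question, full_text_of(tweet) could replace tweet.full_text.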

Get length of videos calling youtube playlist API (v3)

$http.get("https://www.googleapis.com/youtube/v3/playlistItems?part=snippet&playlistId=PLFgquLnL59alCl_2TQvOiD5Vgm1hCaGSI&key={mykey}&maxResults=10")
I used playlistItems but couldn't get the statistics part, which contains the duration of the video. Do I need to make two calls, that is, get the video ID and then make another call? Or am I missing something?
For whatever reason, playlistItems do not include some things like statistics or category. You'll need to make a separate call using the video ID and https://developers.google.com/youtube/v3/docs/videos/list in order to get those fields.
This is how I do it (using Python but you can adapt it for whatever language you are using with http requests and JSON parsing)
import requests
# videoId and DEVELOPER_KEY are assumed to be defined elsewhere
url = ("https://www.googleapis.com/youtube/v3/videos?id=" + videoId
       + "&key=" + DEVELOPER_KEY + "&part=snippet,contentDetails")
r = requests.get(url)
metadata = r.json()["items"][0]
channelName = metadata["snippet"]["channelTitle"]
publishedTime = metadata["snippet"]["publishedAt"]
duration = metadata["contentDetails"]["duration"]
duration is an ISO 8601 duration string that looks like
PT4M11S
meaning 4 minutes 11 seconds. You will have to parse this.
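Parsing a value like PT4M11S can be done with a regular expression. This sketch covers only days, hours, minutes and seconds written as whole numbers. Other ISO 8601 forms (weeks, fractional values) are deliberately rejected, and the function name is made up:

```python
import re

_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)

def duration_to_seconds(duration):
    """Convert a string such as 'PT4M11S' to a number of seconds."""
    match = _DURATION_RE.match(duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    # Components that are absent from the string count as zero.
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )

print(duration_to_seconds("PT4M11S"))  # 251
```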
