I'm working on a website to load multiple YouTube channels' live streams. At first I was trying to figure out a way to do this without using YouTube's API, but I have decided to give in.
To find whether a channel is live streaming and to get the live stream links I've been using:
https://www.googleapis.com/youtube/v3/search?part=snippet&channelId={CHANNEL_ID}&eventType=live&maxResults=10&type=video&key={API_KEY}
However, with the minimum quota being 10,000 and each search costing 100 units, I'm only able to do about 100 searches before I exceed my quota limit, which doesn't help at all. I ended up exceeding the quota limit in about 10 minutes. :(
Does anyone know of a better way to figure out whether a channel is currently live streaming and what the live stream links are, using as few quota points as possible?
I want to reload YouTube data for each user every 3 minutes, save it into a database, and serve the information from my own API, to save both server resources and quota points.
Hopefully someone has a good solution to this problem!
If nothing can be done about the links, just determining whether the user is live without using 100 quota points each time would be a big help.
Since the question only specified that Search API quotas should not be used in finding out if the channel is streaming, I thought I would share a sort of work-around method. It might require a bit more work than a simple API call, but it reduces API quota use to practically nothing:
I used a simple Perl GET request to retrieve a YouTube channel's main page. Several unique elements are found in the HTML of a channel page that is streaming live: the live-viewer count, e.g. <li>753 watching</li>, and the LIVE NOW badge: <span class="yt-badge yt-badge-live" >Live now</span>.
To ascertain whether a channel is currently streaming live requires a simple match to see whether the unique HTML tag is contained in the GET request results, something like if ($get_results =~ /$unique_html/) in Perl. Then an API call can be made only for a channel ID that is actually streaming, in order to obtain the video ID of the stream.
The advantage of this is that you already know the channel is streaming, instead of using thousands of quota points to find out. My test script successfully identifies whether a channel is streaming by looking in the HTML for: <span class="yt-badge yt-badge-live" > (note the odd extra spaces in YouTube's code).
I don't know what language the OP is using, or I would help with a basic GET request in that language. I used Perl and included browser headers, a User-Agent, and cookies so the request looks like a normal browser visit.
YouTube's robots.txt doesn't seem to forbid crawling a channel's main page, only the community page of a channel.
Let me know what you think about the pros and cons of this method, and please comment with what might be improved rather than disliking if you find a flaw. Thanks, happy coding!
2020 UPDATE
The yt-badge-live class seems to have been deprecated; it no longer reliably shows whether the channel is streaming. Instead, I now check the HTML for this string:
{"text":" watching"}
If I get a match, it means the page is streaming. (Non-streaming channels don't contain this string.) Again, note the weird extra whitespace. I also escape all the quotation marks since I'm using Perl.
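For illustration, here is a rough equivalent of that check in Python rather than Perl (a minimal sketch assuming the requests library; the channel ID is a placeholder, and the indicator string may change whenever YouTube updates its markup):

import requests

# Placeholder channel ID, for illustration only.
CHANNEL_ID = "UC_x5XG1OV2P6uZZ5FSM9Ttw"

# Browser-like headers so the visit resembles a normal browser request.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.5",
}

def channel_is_streaming(channel_id):
    """Fetch the channel's main page and look for the live indicator string."""
    html = requests.get(
        "https://www.youtube.com/channel/{}".format(channel_id),
        headers=HEADERS,
        timeout=10,
    ).text
    # The indicator described above; note the space before "watching".
    return '{"text":" watching"}' in html

if __name__ == "__main__":
    print(channel_is_streaming(CHANNEL_ID))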
Here are my two suggestions:
Check my answer where I explain how you can retrieve videos from channels that are livestreaming.
Another option could be to use the following URL and make a request each time you want to check whether there's a livestream:
https://www.youtube.com/channel/<CHANNEL_ID>/live
where CHANNEL_ID is the ID of the channel you want to check.1
1 Just note that the URL might not work for all channels (it depends on the channel itself).
For example, the channel with ID UC7_YxT-KID8kRbqZo7MyscQ will show whether it is livestreaming at its /live URL, but the channel with ID UC4nprx9Vd84-ly7N-1Ce6Og - https://www.youtube.com/channel/UC4nprx9Vd84-ly7N-1Ce6Og/live - will show its main page instead.
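A rough sketch of how that check could be automated in Python (assuming the requests library; looking for a canonical watch URL on the /live page is my own assumption about the current markup and may break, and as noted above the /live URL does not work for every channel):

import re
import requests

HEADERS = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.5"}

def live_watch_url(channel_id):
    """Return the watch URL the /live page points at, or None if not found."""
    html = requests.get(
        "https://www.youtube.com/channel/{}/live".format(channel_id),
        headers=HEADERS,
        timeout=10,
    ).text
    # Heuristic: when a stream is up, the page carries a canonical link to
    # the watch page; otherwise no such link is present.
    match = re.search(
        r'<link rel="canonical" href="(https://www\.youtube\.com/watch\?v=[^"]+)"',
        html,
    )
    return match.group(1) if match else None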
Adding to the answer by Bman70, I tried to eliminate the need for a costly search request after learning that the channel is streaming live. I did this using two indicators in the HTML response from the channel page of a channel that is streaming live.
function findLiveStreamVideoId(channelId, cb){
    $.ajax({
        url: 'https://www.youtube.com/channel/' + channelId,
        type: "GET",
        headers: {
            'Access-Control-Allow-Origin': '*',
            'Accept-Language': 'en-US, en;q=0.5'
        }
    }).done(function(resp) {
        // One method to find the live video: the LIVE NOW badge metadata
        let n = resp.search(/\{"videoId[\sA-Za-z0-9:"\{\}\]\[,\-_]+BADGE_STYLE_TYPE_LIVE_NOW/i);
        // If found
        if (n >= 0) {
            let videoId = resp.slice(n + 1, resp.indexOf("}", n) - 1).split("\":\"")[1];
            return cb(videoId);
        }
        // If not found, try another method: the live thumbnail URL
        n = resp.search(/https:\/\/i.ytimg.com\/vi\/[A-Za-z0-9\-_]+\/hqdefault_live.jpg/i);
        if (n >= 0) {
            let videoId = resp.slice(n, resp.indexOf(".jpg", n) - 1).split("/")[4];
            return cb(videoId);
        }
        // No streams found
        return cb(null, "No live streams found");
    }).fail(function() {
        return cb(null, "CORS Request blocked");
    });
}
However, there's a tradeoff: this method confuses a recently ended stream with a currently live stream. A workaround for this issue is to get the status of the returned videoId from the YouTube API (which costs a single unit of your quota).
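A sketch of that cheap confirmation step, shown in Python with the requests library (the API key is a placeholder): videos.list with part=snippet costs a single quota unit and reports liveBroadcastContent for the video.

import requests

API_KEY = "YOUR_API_KEY"  # placeholder

def video_is_live(video_id):
    """Confirm the live status of one video for a single quota unit."""
    data = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "snippet", "id": video_id, "key": API_KEY},
        timeout=10,
    ).json()
    items = data.get("items", [])
    if not items:
        return False
    # liveBroadcastContent is "live", "upcoming", or "none".
    return items[0]["snippet"]["liveBroadcastContent"] == "live"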
I found the YouTube API to be very restrictive given the cost of the search operation. Apparently the accepted answer did not work for me, as I found the string on non-live streams as well. Web scraping with aiohttp and BeautifulSoup was not an option, since the better indicators required JavaScript support, hence I turned to Selenium. I looked for the CSS selector
#info-text
and then searched for the string Started streaming or watching now in it.
To reduce the load on my tiny server, which would otherwise have required a lot more resources, I moved this functionality to a Heroku dyno with a small Flask app.
# import flask dependencies
import os
from flask import Flask, request, make_response, jsonify
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

base = "https://www.youtube.com/watch?v={0}"
delay = 3

# initialize the flask app
app = Flask(__name__)

# default route
@app.route("/")
def index():
    return "Hello World!"

# create a route for the webhook
@app.route("/islive", methods=["GET", "POST"])
def is_live():
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--remote-debugging-port=9222')
    driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'),
                              chrome_options=chrome_options)

    # Accept either a full watch URL or a bare video ID.
    url = request.args.get("url")
    if "youtube.com" in url:
        video_id = url.split("?v=")[-1]
    else:
        video_id = url
    url = base.format(video_id)
    print(url)

    response = {"url": url, "is_live": False, "ok": False, "video_id": video_id}
    driver.get(url)
    try:
        element = WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#info-text")))
        # The info text contains "Started streaming ..." or "... watching now"
        # when the video is a live stream.
        if element.text.lower().find("started streaming") != -1:
            response["is_live"] = True
        elif element.text.lower().find("watching now") != -1:
            response["is_live"] = True
        response["ok"] = True
        return jsonify(response)
    except Exception as e:
        print(e)
        return jsonify(response)
    finally:
        driver.close()

# run the app
if __name__ == "__main__":
    app.run()
You'll however need to add the following buildpacks in Settings:
https://github.com/heroku/heroku-buildpack-google-chrome
https://github.com/heroku/heroku-buildpack-chromedriver
https://github.com/heroku/heroku-buildpack-python
and set the following config vars in Settings:
CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver
GOOGLE_CHROME_BIN=/app/.apt/usr/bin/google-chrome
You can find the supported Python runtimes here, but anything below Python 3.9 should be good, since Selenium had problems with improper use of the is operator.
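For reference, a client call against the /islive route above might look like this (the Heroku app name and video ID are placeholders):

import requests

resp = requests.get(
    "https://your-app.herokuapp.com/islive",   # placeholder app name
    params={"url": "dQw4w9WgXcQ"},             # a bare video ID is also accepted
    timeout=30,
)
print(resp.json())  # e.g. {"is_live": true, "ok": true, ...}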
I hope YouTube will provide better alternatives than these workarounds.
I know this is an old thread, but I thought I'd share my way of checking, which, for example, grabs a status code to use in an app.
This is for a single channel, but you could easily do a foreach with it.
<?php
#####
$ytchannelID = "UCd0BTXriKLvOs1ANx3puZ3Q";
#####
$ytliveurl = "https://www.youtube.com/channel/".$ytchannelID."/live";
$ytchannelLIVE = '{"text":" watching now"}';

$contents = file_get_contents($ytliveurl);
if (strpos($contents, $ytchannelLIVE) !== false) {
    http_response_code(200);
} else {
    http_response_code(201);
}
unset($ytliveurl);
?>
Adding onto the other answers here, I use a GET request to https://www.youtube.com/c/<CHANNEL_NAME>/live and then search for "isLive":true (rather than {"text":" watching"}).
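The same pattern as the earlier sketches, adjusted for this URL and indicator (Python with requests; the channel name is a placeholder, and the "isLive":true marker is an observation about the current page markup that may change):

import requests

HEADERS = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.5"}

def channel_is_live(channel_name):
    """Fetch the /live page for a channel name and look for the isLive marker."""
    html = requests.get(
        "https://www.youtube.com/c/{}/live".format(channel_name),
        headers=HEADERS,
        timeout=10,
    ).text
    return '"isLive":true' in html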
I have a C# application that downloads multiple tiny files from websites (torrents). Some sites restrict the number of downloads per IP per day.
I do an HttpWebRequest and, if the stream is a valid torrent, I save it to disk.
Is there a way for my C# application to spoof my IP when performing the HttpWebRequest, so that the download will not fail?
I spaced the downloads out to one per 10 minutes, but no luck; I still get blocked eventually.
I have heard that Tor can use different IPs, but I don't want the people using my desktop app to have to install the Tor Browser separately.
HttpWebResponse resp = null;
try
{
    var req = (HttpWebRequest)WebRequest.Create("http://www.example.com/test.torrent");
    req.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    req.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;
    req.Timeout = 30000;
    req.KeepAlive = true;
    resp = (HttpWebResponse)(req.GetResponse());
}
Any solutions?
To do so, you would need to manipulate TCP/IP packets. This means you would need to capture the outgoing packet created by HttpWebRequest and change its source IP to the spoofed one.
I found this forum post that seems to cover what you want to do; check it out: http://pcapdotnet.codeplex.com/discussions/349978
As far as I know, you can do it through the Pcap.Net or SharpPcap libraries.
I am working on a crawler and I have to extract data from 200-300 links on Google Scholar. I have a working parser that gets data from the pages (each page contains 1-10 people profiles as results of my query; I extract the proper links, go to the next page, and do it again). During a run of my program I spotted the error below:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=503, URL=https://ipv4.google.com/sorry/IndexRedirect?continue=https://scholar.google.pl/citations%3Fmauthors%3DAGH%2BUniversity%2Bof%2BScience%2Band%2BTechnology%26hl%3Dpl%26view_op%3Dsearch_authors&q=CGMSBFMKrI0YiJHfqgUiGQDxp4NLfGBv6zgPSjfyQ9LBi5F-K1EbGwQ
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
I know it is linked to Google's simple protection against robots. How can I improve my connection
Connection connection =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.followRedirects(true);
so as not to get a temporary ban? I know there is a way to check the response, like this:
Connection.Response response =
Jsoup.connect(url)
.userAgent("Mozilla/5.0")
.timeout(10000)
.execute();
int statusCode = response.statusCode();
if (statusCode == 200) { ... }
else if (statusCode == 503) { /* do reconnect magic */ }
But what should I do when I get a 503 error? Do I have to use a proxy? A random wait time between connections? I hope there is a better idea than saving my results to a file, doing a manual hard restart of my router, and trying again with a new IP :P
You have already provided your own answers...
Do I have to use a proxy?
Of course. You should already have set up a bunch of proxies for your crawling activity.
A random wait time between connections?
Yes. Use a random wait of between 3000 and 5000 ms, as in the sketch below.
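A rough, language-agnostic illustration of combining both ideas, sketched in Python for brevity (the proxy list and URL are placeholders; the question's Jsoup code would follow the same pattern):

import random
import time
import requests

PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]   # placeholder proxies
URL = "https://scholar.google.com/citations?view_op=search_authors"  # placeholder

def fetch_politely(url, max_attempts=5):
    """Rotate proxies and wait a random 3-5 s between attempts; retry on 503."""
    for attempt in range(max_attempts):
        if attempt > 0:
            time.sleep(random.uniform(3.0, 5.0))  # random wait between connections
        proxy = random.choice(PROXIES)
        resp = requests.get(
            url,
            headers={"User-Agent": "Mozilla/5.0"},
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        if resp.status_code != 503:
            return resp
    raise RuntimeError("Still getting 503 after {} attempts".format(max_attempts))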
Alternatively, you could use an online captcha-solving service when you hit the URL https://ipv4.google.com/sorry/IndexRedirect.... Don't hit it too often or you'll get banned.
Happy coding :)
I can't seem to make any progress with this one. My CI session settings are these:
$config['sess_cookie_name'] = 'ci_session';
$config['sess_expiration'] = 0;
$config['sess_expire_on_close'] = FALSE;
$config['sess_encrypt_cookie'] = FALSE;
$config['sess_use_database'] = TRUE;
$config['sess_table_name'] = 'ci_sessions';
$config['sess_match_ip'] = FALSE;
$config['sess_match_useragent'] = FALSE;
$config['sess_time_to_update'] = 7200;
$config['cookie_prefix'] = "";
$config['cookie_domain'] = "";
$config['cookie_path'] = "/";
$config['cookie_secure'] = FALSE;
The session library is loaded on autoload. I've commented out the sess_update function to prevent an AJAX bug that I found out about while reading the CI forum.
The ci_sessions table in the database has collation utf8_general_ci (there was a bug that lost the session after every redirect() call and it was linked to the fact that the collation was latin1_swedish_ci by default).
It always breaks after a user of my admin section tries to add a long article and clicks the save button. The save action looks like this:
function save($id = 0){
    if($this->my_model->save_article($id)){
        $this->session->set_flashdata('message', 'success!');
        redirect('admin/article_listing');
    } else {
        $this->session->set_flashdata('message', 'errors encountered');
        redirect('admin/article_add');
    }
}
If you spend more than 20 minutes on the form and then click save, the article will be added, but on redirect the user will be logged out.
I've also enabled logging, and sometimes when the error occurs I get the message "The session cookie data did not match what was expected. This could be a possible hacking attempt.", but only half of the time. The other half I get nothing: a message that I've placed at the end of the Session constructor is displayed and nothing else. In all cases, if I look at the cookie stored in my browser after the error, the cookie's first part doesn't match the hash.
Also, although I know CodeIgniter doesn't use native sessions, I've set session.gc_maxlifetime to 86400.
Another thing to mention is that I'm unable to reproduce the error on my computer, but on all the other computers I've tested, the bug appears following the same pattern mentioned above.
If you have any ideas on what to do next, I'd greatly appreciate them. Changing to a new version or using a native session class (the old one was for CI 1.7, will it still work?) are also options I'm willing to consider.
Edit: I've run a diff between the Session class in CI 2.0.3 and the latest CI Session class, and they're the same.
Here's how I solved it: the standards say that a browser shouldn't allow redirects after a POST request. CI's redirect() method sends a 302 redirect by default. The logical way would be to send a 307 redirect, which solved my problem but has the caveat of showing a confirmation dialog about the redirect. Other options are a 301 (moved permanently) redirect or, the solution I've chosen, a JavaScript redirect.
I've looked all over for some documentation on this, but haven't found it. Some posts reference a user-agent string:
http://groups.google.com/group/feedburner-services/browse_thread/thread/7aee14cf6a2432e7/49464335d2228e25?lnk=gst&q=aweber#49464335d2228e25
I had assumed there would be an API or something. More generally, how does ANY RSS feed reader/aggregator (like Bloglines, etc.) report subscriber numbers to Feedburner?
I'm working on developing a new app that would need this functionality.
Thanks for your help!
Brian
As you discovered in your link, you put the subscriber count in your user-agent, then you contact the Feedburner Support Group and tell them what format you will be using.
The consensus format is something like:
User-agent: Service Name (http://example.com/service/info/; ### subscribers ; [optional feed identifier] )
The optional feed identifier is typically used if you run several different services, and fetch the feed separately for each one; e.g. if you have a mail service and a web-based reader service, with different subscribers, then you might either use:
User-agent: SO Agg/1.3 (http://example.com/SOAgg ; 5000 subscribers ; feed-id=mail-134 )
on request for the mailer, and
User-agent: SO Agg/1.3 (http://example.com/SOAgg ; 2000 subscribers ; feed-id=web-134 )
on the request for the website; or use
User-agent: SO Agg/1.3 (http://example.com/SOAgg ; 7000 subscribers ; )
if your system makes only one request for both services...
You will usually need to specify what IP addresses are authorised to request the feed with that user-agent, as well.
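For example, a fetcher built on that convention might send the header like this (a sketch in Python with requests; the service name, counts, and feed URL are taken from the hypothetical examples above):

import requests

# Hypothetical values matching the example format above.
SUBSCRIBERS = 5000
FEED_ID = "mail-134"
USER_AGENT = (
    "SO Agg/1.3 (http://example.com/SOAgg ; "
    "{} subscribers ; feed-id={} )".format(SUBSCRIBERS, FEED_ID)
)

resp = requests.get(
    "http://feeds.feedburner.com/example-feed",   # placeholder feed URL
    headers={"User-Agent": USER_AGENT},
    timeout=10,
)
print(resp.status_code)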
Many major aggregators report user stats by including them as part of the user-agent string. Examples:
Bloglines reporting description in blog comment
Google Reader: Tips for Publishers
PostRank: Reporting Subscription Counts
There's no standard for this at this time.
To the best of my knowledge, folks will contact major feed analytics vendors like Feedburner directly to make sure their user-agent-based reporting is being counted.