How to catch the redirect with a webapp using Playwright

When you go to this link, the page runs some javascript and then automatically redirects to a pdf. I'm having a hard time getting that final url from Playwright.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://scnv.io/760y", wait_until="networkidle")
    print(page.url)
    page.close()
Is there a way to get that final url?

There are multiple ways to do it. One way is using page.expect_response:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # Catch any responses with '.pdf' at the end of the url
    with page.expect_response('**/*.pdf') as response:
        page.goto("https://scnv.io/760y")
    print(response.value.url)
    page.close()
Output
https://qcg-media.s3.amazonaws.com/media/uploads/72778/2022/06/20220622_663043_221.pdf
Check out this section of the documentation that details handling network traffic in playwright.
Also note that I did not include wait_until='networkidle', because it is not appropriate for this use case. For that event to trigger, the network must remain idle for at least 500 ms, which does not happen on this website while it is making the request to the pdf. If you were to include it, the code would be inconsistent at best at catching the request whose url we want.
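Another way, sketched below under the same assumption that the final resource is a .pdf, is to subscribe to the page's response events and record any response whose url ends in .pdf (the fixed wait at the end is just a generous allowance for the in-page javascript to fire, not part of the original answer):

from playwright.sync_api import sync_playwright

pdf_urls = []

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    # Record every response whose url ends in .pdf
    page.on("response", lambda r: pdf_urls.append(r.url) if r.url.endswith(".pdf") else None)
    page.goto("https://scnv.io/760y")
    # Give the page's javascript a moment to request the pdf
    page.wait_for_timeout(3000)
    print(pdf_urls)
    page.close()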

Related

Workbox redirect the clients page when resource is not cached and offline

Usually whenever I read a blog post about PWAs, the tutorial seems to just precache every single asset. But this seems to go against the app shell pattern a bit, which as I understand it is: cache the bare necessities (only the app shell), and runtime-cache as you go. (Please correct me if I understood this incorrectly.)
Imagine I have this single page application, it's a simple index.html with a web component: <my-app>. That <my-app> component sets up some routes which looks a little bit like this, I'm using Vaadin router and web components, but I imagine the problem would be the same using React with React Router or something similar.
router.setRoutes([
  {
    path: '/',
    component: 'app-main', // statically loaded
  },
  {
    path: '/posts',
    component: 'app-posts',
    action: () => { import('./app-posts.js'); } // dynamically loaded
  },
  /* many, many, many more routes */
  {
    path: '/offline', // redirect here when a resource is not cached and failed to get from network
    component: 'app-offline', // also statically loaded
  }
]);
My app may have many many routes, and may get very large. I don't want to precache all those resources straight away, but only cache the stuff I absolutely need, so in this case: my index.html, my-app.js, app-main.js, and app-offline.js. I want to cache app-posts.js at runtime, when it's requested.
Setting up runtime caching is simple enough, but my problem arises when my user visits one of the potentially many many routes that is not cached yet (because maybe the user hasn't visited that route before, so the js file may not have loaded/cached yet), and the user has no internet connection.
What I want to happen, in that case (when a route is not cached yet and there is no network), is for the user to be redirected to the /offline route, which is handled by my client side router. I could easily do something like: import('./app-posts.js').catch(() => /* redirect user to /offline */), but I'm wondering if there is a way to achieve this from workbox itself.
So in a nutshell:
When a js file hasn't been cached yet, and the user has no network, and so the request for the file fails: let workbox redirect the page to the /offline route.
Option 1 (not always useful):
As far as I can see, and according to this answer, you cannot open a new window or change the URL of the browser from within the service worker. However, you can open a new window only if clients.openWindow() is called from within the notificationclick event.
Option 2 (hardest):
You could use the WindowClient.navigate method within the activate event of the service worker; however, this is a bit trickier, as you still need to check whether the requested file exists in the cache or not.
Option 3 (easiest & hackiest):
Otherwise, you could respond with a new Request object to the offline page:
const cacheOnly = new workbox.strategies.CacheOnly();
const networkFirst = new workbox.strategies.NetworkFirst();

workbox.routing.registerRoute(
  /\/posts.|\/articles/,
  async args => {
    const offlineRequest = new Request('/offline.html');
    try {
      const response = await networkFirst.handle(args);
      return response || await cacheOnly.handle({request: offlineRequest});
    } catch (error) {
      return await cacheOnly.handle({request: offlineRequest});
    }
  }
);
and then rewrite the URL of the browser in your offline.html file:
<head>
  <script>
    window.history.replaceState({}, 'You are offline', '/offline');
  </script>
</head>
The above logic in Option 3 will respond to the requested URL using the network first. If the network is not available, it will fall back to the cache, and if the request is not found in the cache either, it will serve the offline.html file instead. Once the offline.html file is parsed, the browser URL will be replaced with /offline.

How to find if a youtube channel is currently live streaming without using search?

I'm working on a website to load multiple youtube channels' live streams. At first I was trying to figure out a way to do this without using youtube's api, but I have decided to give in.
To find whether a channel is live streaming and to get the live stream links I've been using:
https://www.googleapis.com/youtube/v3/search?part=snippet&channelId={CHANNEL_ID}&eventType=live&maxResults=10&type=video&key={API_KEY}
However, with the minimum quota being 10000 and each search being worth 100, I'm only able to do about 100 searches before I exceed my quota limit, which doesn't help at all. I ended up exceeding the quota limit in about 10 minutes. :(
Does anyone know of a better way to figure out if a channel is currently live streaming and what the live stream links are, using as minimal quota points as possible?
I want to reload youtube data for each user every 3 minutes, save it into a database, and display the information using my own api to save server resources as well as quota points.
Hopefully someone has a good solution to this problem!
If nothing can be done about links just determining if the user is live without using 100 quota points each time would be a big help.
Since the question only specified that Search API quotas should not be used in finding out if the channel is streaming, I thought I would share a sort of work-around method. It might require a bit more work than a simple API call, but it reduces API quota use to practically nothing:
I used a simple Perl GET request to retrieve a Youtube channel's main page. Several unique elements are found in the HTML of a channel page that is streaming live:
The number of live viewers tag, e.g. <li>753 watching</li>, and the LIVE NOW badge tag: <span class="yt-badge yt-badge-live" >Live now</span>.
To ascertain whether a channel is currently streaming live requires a simple match to see if the unique HTML tag is contained in the GET request results. Something like: if ($get_results =~ /$unique_html/) (Perl). Then, an API call can be made only to a channel ID that is actually streaming, in order to obtain the video ID of the stream.
The advantage of this is that you already know the channel is streaming, instead of using thousands of quota points to find out. My test script successfully identifies whether a channel is streaming, by looking in the HTML code for: <span class="yt-badge yt-badge-live" > (note the weird extra spaces in the code from Youtube).
I don't know what language OP is using, or I would help with a basic GET request in that language. I used Perl, and included browser headers, User Agent and cookies, to look like a normal computer visit.
Youtube's robots.txt doesn't seem to forbid crawling a channel's main page, only the community page of a channel.
Let me know what you think about the pros and cons of this method, and please comment with what might be improved rather than disliking if you find a flaw. Thanks, happy coding!
2020 UPDATE
The yt-badge-live badge seems to have been deprecated; it no longer reliably shows whether the channel is streaming. Instead, I now check the HTML for this string:
{"text":" watching"}
If I get a match, it means the page is streaming. (Non-streaming channels don't contain this string.) Again, note the weird extra whitespace. I also escape all the quotation marks since I'm using Perl.
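For anyone not using Perl, a rough Python equivalent of the same check might look like this (a sketch based on the description above, not the OP's actual script; the browser-like headers are just there to make the request look like a normal visit):

import requests

def channel_is_live(channel_id):
    # Fetch the channel's main page like a normal browser visit
    headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.5"}
    html = requests.get("https://www.youtube.com/channel/" + channel_id, headers=headers).text
    # A live channel's page contains this string (note the leading space before 'watching')
    return '{"text":" watching"}' in html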
Here are my two suggestions:
Check my answer, where I explain how you can retrieve videos from channels that are livestreaming.
Another option is to use the following URL and make a request each time to check whether the channel is livestreaming.
https://www.youtube.com/channel/<CHANNEL_ID>/live
Where CHANNEL_ID is the id of the channel you want to check for livestreaming1.
1 Just note that the URL might not work for all channels (it depends on the channel itself).
For example, the channel with id UC7_YxT-KID8kRbqZo7MyscQ will show whether it is livestreaming at its /live URL, but the channel with id UC4nprx9Vd84-ly7N-1Ce6Og (https://www.youtube.com/channel/UC4nprx9Vd84-ly7N-1Ce6Og/live) will show its main page instead.
Adding to the answer by Bman70, I tried eliminating the need to make a costly search request once you already know the channel is streaming live. I did this using two indicators in the HTML response from the page of a channel that is streaming live.
function findLiveStreamVideoId(channelId, cb){
  $.ajax({
    url: 'https://www.youtube.com/channel/' + channelId,
    type: "GET",
    headers: {
      'Access-Control-Allow-Origin': '*',
      'Accept-Language': 'en-US, en;q=0.5'
    }
  }).done(function(resp) {
    // one method to find live video: look for a videoId next to the LIVE NOW badge
    let n = resp.search(/\{"videoId[\sA-Za-z0-9:"\{\}\]\[,\-_]+BADGE_STYLE_TYPE_LIVE_NOW/i);
    // If found
    if (n >= 0) {
      let videoId = resp.slice(n + 1, resp.indexOf("}", n) - 1).split("\":\"")[1];
      return cb(videoId);
    }
    // If not found, then try another method: look for the live thumbnail URL
    n = resp.search(/https:\/\/i.ytimg.com\/vi\/[A-Za-z0-9\-_]+\/hqdefault_live.jpg/i);
    if (n >= 0) {
      let videoId = resp.slice(n, resp.indexOf(".jpg", n) - 1).split("/")[4];
      return cb(videoId);
    }
    // No streams found
    return cb(null, "No live streams found");
  }).fail(function() {
    return cb(null, "CORS request blocked");
  });
}
However, there's a tradeoff: this method can confuse a recently ended stream with a currently live stream. A workaround for this issue is to get the status of the returned videoId from the Youtube API (which costs a single unit of your quota).
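A hedged sketch of that workaround: given a candidate videoId, the videos.list endpoint (1 quota unit, unlike search.list at 100) returns snippet.liveBroadcastContent, which is "live" only while the stream is actually live:

import requests

def video_is_live(video_id, api_key):
    # videos.list costs 1 quota unit; search.list costs 100
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "snippet", "id": video_id, "key": api_key},
    )
    items = resp.json().get("items", [])
    # liveBroadcastContent is "live", "upcoming" or "none"
    return bool(items) and items[0]["snippet"]["liveBroadcastContent"] == "live"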
I found the youtube API to be very restrictive given the cost of the search operation. Apparently the accepted answer did not work for me, as I found the string on non-live streams as well. Web scraping with aiohttp and beautifulsoup was not an option, since the better indicators required javascript support. Hence I turned to selenium. I looked for the css selector
#info-text
and then searched for the string Started streaming or watching now in it.
To reduce load on my tiny server, which would otherwise have required a lot more resources, I moved this functionality to a heroku dyno with a small flask app.
# import flask dependencies
import os
from flask import Flask, request, make_response, jsonify
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

base = "https://www.youtube.com/watch?v={0}"
delay = 3

# initialize the flask app
app = Flask(__name__)

# default route
@app.route("/")
def index():
    return "Hello World!"

# create a route for the webhook
@app.route("/islive", methods=["GET", "POST"])
def is_live():
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--remote-debugging-port=9222')
    driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'),
                              chrome_options=chrome_options)

    # accept either a full watch URL or a bare video id
    url = request.args.get("url")
    if "youtube.com" in url:
        video_id = url.split("?v=")[-1]
    else:
        video_id = url
        url = base.format(video_id)
    print(url)

    response = {"url": url, "is_live": False, "ok": False, "video_id": video_id}
    driver.get(url)
    try:
        element = WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "#info-text")))
        result = element.text.lower().find("Started streaming".lower())
        if result != -1:
            response["is_live"] = True
        else:
            result = element.text.lower().find("watching now".lower())
            if result != -1:
                response["is_live"] = True
        response["ok"] = True
        return jsonify(response)
    except Exception as e:
        print(e)
        return jsonify(response)
    finally:
        driver.close()

# run the app
if __name__ == "__main__":
    app.run()
You'll however need to add the following buildpacks in settings
https://github.com/heroku/heroku-buildpack-google-chrome
https://github.com/heroku/heroku-buildpack-chromedriver
https://github.com/heroku/heroku-buildpack-python
Set the following Config Vars in settings
CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver
GOOGLE_CHROME_BIN=/app/.apt/usr/bin/google-chrome
You can find the supported Python runtimes here, but anything below Python 3.9 should be fine, since selenium had problems with improper use of the is operator.
I hope youtube will provide better alternatives than workarounds.
I know this is an old thread, but I thought I'd share my way of checking, which for example grabs a status code you can use in an app.
This is for a single channel, but you could easily do a foreach with it.
<?php
#####
$ytchannelID = "UCd0BTXriKLvOs1ANx3puZ3Q";
#####
$ytliveurl = "https://www.youtube.com/channel/" . $ytchannelID . "/live";
$ytchannelLIVE = '{"text":" watching now"}';
$contents = file_get_contents($ytliveurl);
if (strpos($contents, $ytchannelLIVE) !== false) {
    http_response_code(200);
} else {
    http_response_code(201);
}
unset($ytliveurl);
?>
Adding onto the other answers here, I use a GET request to https://www.youtube.com/c/<CHANNEL_NAME>/live and then search for "isLive":true (rather than {"text":" watching"})
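A rough Python sketch of that variant (the /live URL form and the "isLive":true marker come from the answers above, not from any official API, so treat them as assumptions that may break if Youtube changes its markup):

import requests

def channel_is_live_via_live_url(live_url):
    # live_url e.g. "https://www.youtube.com/c/<CHANNEL_NAME>/live"
    headers = {"User-Agent": "Mozilla/5.0", "Accept-Language": "en-US,en;q=0.5"}
    html = requests.get(live_url, headers=headers).text
    return '"isLive":true' in html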

Crystal-lang: How to Find End URL After a Redirect?

I'm just dipping a toe in the water with Crystal at the moment and, as an exercise, trying to port one of my Python scripts across.
The script in question downloads the 'latest' PDF from a URL which takes the form: "http://somesite.com/download/latest/". When visited that URL automatically redirects to the page for the latest download eg. "http://somesite.com/download/4563/"
I'm having difficulty working out how to implement this in Crystal so that I can grab the actual URL that the redirect ends up on.
In Python I do:
import urllib.request

currenturl = urllib.request.urlopen(latesturl)
# above will redirect to URL of format http://somesite.com/download/XXXXX/
# where XXXXX is the current d/load
endurl = currenturl.geturl()
...which gives me the end URL in the "endurl" variable.
But, reading the docs for Crystal's "http/client" I can't see any way to return the actual URL that a redirect ends up on. Is it possible?
Crystal's HTTP::Client currently can't automatically follow redirects.
Please note that you're reading an outdated version of the API docs, the current is at https://crystal-lang.org/api/latest/HTTP/Client.html (I don't think there have been relevant changes between 0.24.1 and 0.26.1 though).
But you can easily access the redirect URL from reading the Location header of an HTTP response:
response = HTTP::Client.get latesturl
endurl = response.headers["Location"]

Titanium: How to redirect from external to internal URL in webview

So I know how to access both external and internal URLs in the Titanium WebView. But I have no idea how to redirect from an external URL to an internal URL.
I've got a file called "index.html" in the root folder, so for the webview this should work to access it:
Ti.UI.createWebView({
    url: 'index.html'
});
External urls are pretty straightforward:
Ti.UI.createWebView({
    url: 'http://www.google.com'
});
However, when on the external url, how do I redirect to this local file? Neither local-path attempts nor the javascript variant work:
window.location = 'file:///index.html';
Any clues on how to do this?
What I discovered, in the end, were two possibilities to achieve this. But it can't be done through redirection.
One: Poll for a certain variable using the webview.evalJS() function
var my_data = $.login_webview.evalJS('global.data;');
Of course, it works only with strings, not with objects. So if you're passing JSON, make sure it is set as a string!
Two: Do an actual server-side redirection to another server-side page, monitor for the URL change, and then call evalJS() once again, with no need for polling:
$.login_webview.addEventListener('load', function(e){
    if (e.url.indexOf('mobile/redirect.php') > -1){
        var my_data = $.login_webview.evalJS('global.data');
    }
});
Just make sure, with option two, that you're actually setting the required data in Javascript using server-side technology.

Modify URL before loading page in firefox

I want to prefix URLs which match my patterns. When I open a new tab in Firefox and enter a matching URL, the page should not be loaded normally; the URL should first be modified, and only then should loading the page start.
Is it possible to modify a URL through a Mozilla Firefox Addon before the page starts loading?
Browsing the HTTPS Everywhere add-on suggests the following steps:
Register an observer for the "http-on-modify-request" observer topic with nsIObserverService
Proceed if the subject of your observer notification is an instance of nsIHttpChannel and subject.URI.spec (the URL) matches your criteria
Create a new nsIStandardURL
Create a new nsIHttpChannel
Replace the old channel with the new. The code for doing this in HTTPS Everywhere is quite dense and probably much more than you need. I'd suggest starting with chrome/content/IOUtils.js.
Note that you should register a single "http-on-modify-request" observer for your entire application, which means you should put it in an XPCOM component (see HTTPS Everywhere for an example).
The following articles do not solve your problem directly, but they do contain a lot of sample code that you might find helpful:
https://developer.mozilla.org/en/Setting_HTTP_request_headers
https://developer.mozilla.org/en/XUL_School/Intercepting_Page_Loads
Thanks to Iwburk, I have been able to do this.
We can do this by overriding the nsIHttpChannel with a new one. Doing this is slightly complicated, but luckily the add-on https-everywhere implements this to force an https connection.
https-everywhere's source code is available here
Most of the code needed for this is in the files
IOUtils.js
ChannelReplacement.js
We can work with the above files alone, provided we have the basic variables like Cc, Ci set up and the function xpcom_generateQI defined.
var httpRequestObserver =
{
    observe: function(subject, topic, data) {
        if (topic == "http-on-modify-request") {
            var httpChannel = subject.QueryInterface(Components.interfaces.nsIHttpChannel);
            var requestURL = subject.URI.spec;
            if (isToBeReplaced(requestURL)) {
                var newURL = getURL(requestURL);
                ChannelReplacement.runWhenPending(subject, function() {
                    // replace the pending channel with one pointing at the rewritten URL
                    var cr = new ChannelReplacement(subject, newURL);
                    cr.replace(true, null);
                    cr.open();
                });
            }
        }
    },

    get observerService() {
        return Components.classes["@mozilla.org/observer-service;1"]
            .getService(Components.interfaces.nsIObserverService);
    },

    register: function() {
        this.observerService.addObserver(this, "http-on-modify-request", false);
    },

    unregister: function() {
        this.observerService.removeObserver(this, "http-on-modify-request");
    }
};

httpRequestObserver.register();
The code will replace the request, not redirect it.
While I have tested the above code well enough, I am not sure about its implementation. As far as I can make out, it copies all the attributes of the requested channel and sets them on the replacement channel, after which the output requested by the original request is supplied using the new channel.
P.S. I had seen a SO post in which this approach was suggested.
You could listen for the page load event or maybe the DOMContentLoaded event instead. Or you can make an nsIURIContentListener but that's probably more complicated.
Is it possible to modify a URL through a Mozilla Firefox Addon before the page starts loading?
YES it is possible.
Use page-mod of the Addon-SDK by setting contentScriptWhen: "start"
Then, after completely preventing the document from getting parsed, you can either:
1. fetch a different document from the same domain and inject it into the page, or
2. after some document.URL processing, do a location.replace() call.
Here is an example of doing 1. https://stackoverflow.com/a/36097573/6085033
