Crystal-lang: How to Find End URL After a Redirect?

I'm just dipping a toe in the water with Crystal at the moment and, as an exercise, trying to port one of my Python scripts across.
The script in question downloads the 'latest' PDF from a URL which takes the form: "http://somesite.com/download/latest/". When visited, that URL automatically redirects to the page for the latest download, e.g. "http://somesite.com/download/4563/".
I'm having difficulty working out how to implement this in Crystal so that I can grab the actual URL that the redirect ends up on.
In Python I do:
import urllib.request
currenturl = urllib.request.urlopen(latesturl)
# The call above is redirected to a URL of the form http://somesite.com/download/XXXXX/
# where XXXXX is the current download.
endurl = currenturl.geturl()
...which gives me the end URL in the "endurl" variable.
But, reading the docs for Crystal's "http/client" I can't see any way to return the actual URL that a redirect ends up on. Is it possible?

Crystal's HTTP::Client currently can't automatically follow redirects.
Please note that you're reading an outdated version of the API docs; the current version is at https://crystal-lang.org/api/latest/HTTP/Client.html (I don't think there have been relevant changes between 0.24.1 and 0.26.1 though).
But you can easily get the redirect URL by reading the Location header of the HTTP response:
require "http/client"
response = HTTP::Client.get latesturl
endurl = response.headers["Location"]
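When a client does not follow redirects by itself, the Location header can be chased by hand. The following is only a rough sketch of that loop, written in Python for illustration rather than Crystal. The `fetch` callable, the `final_url` name and the hop limit are assumptions made for this example; in real code `fetch` would wrap whatever HTTP client is in use.

```python
from urllib.parse import urljoin

# Status codes that normally carry a Location header.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def final_url(start_url, fetch, max_hops=10):
    """Follow Location headers by hand and return the URL of the last hop.

    `fetch` is a hypothetical callable: fetch(url) -> (status_code, headers_dict).
    """
    url = start_url
    for _ in range(max_hops):
        status, headers = fetch(url)
        location = headers.get("Location")
        if status not in REDIRECT_STATUSES or location is None:
            return url
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")


def fake_fetch(url):
    """Stand-in for a real HTTP request, mimicking the site in the question."""
    if url.endswith("/latest/"):
        return 302, {"Location": "/download/4563/"}
    return 200, {}


print(final_url("http://somesite.com/download/latest/", fake_fetch))
# prints http://somesite.com/download/4563/
```

Passing the request function in keeps the loop independent of any particular HTTP library and makes it easy to try out without network access.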

Related

How to catch the redirect with a webapp using playwright

When you go to this link, the page will run some javascript and then automatically redirect to a pdf. I have a hard time getting that final url from Playwright.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://scnv.io/760y", wait_until="networkidle")
    print(page.url)
    page.close()
Is there a way to get that final url?
There are multiple ways to do it. One way is using page.expect_response:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # Catch any responses with '.pdf' at the end of the url
    with page.expect_response('**/*.pdf') as response:
        page.goto("https://scnv.io/760y")
    print(response.value.url)
    page.close()
Output
https://qcg-media.s3.amazonaws.com/media/uploads/72778/2022/06/20220622_663043_221.pdf
Check out this section of the documentation that details handling network traffic in playwright.
Also note that I did not include wait_until='networkidle' because that was not appropriate for this use case. For that event to trigger, the network must remain idle for at least 500 ms, which does not happen in the case of this website when it's making the request to the pdf. Therefore, if you were to include that, then the code will be inconsistent at best in catching the request we wanted the url of.

How to find if a youtube channel is currently live streaming without using search?

I'm working on a website to load multiple YouTube channels' live streams. At first I was trying to figure out a way to do this without utilizing YouTube's API but have decided to give in.
To find whether a channel is live streaming and to get the live stream links I've been using:
https://www.googleapis.com/youtube/v3/search?part=snippet&channelId={CHANNEL_ID}&eventType=live&maxResults=10&type=video&key={API_KEY}
However, with the minimum quota being 10000 and each search costing 100, I'm only able to do about 100 searches before I exceed my quota limit, which doesn't help at all. I ended up exceeding the quota limit in about 10 minutes. :(
Does anyone know of a better way to figure out if a channel is currently live streaming and what the live stream links are, using as minimal quota points as possible?
I want to reload youtube data for each user every 3 minutes, save it into a database, and display the information using my own api to save server resources as well as quota points.
Hopefully someone has a good solution to this problem!
If nothing can be done about links just determining if the user is live without using 100 quota points each time would be a big help.
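To put the numbers from the question in perspective, here is a small back-of-the-envelope calculation. The quota, the search cost and the 3-minute interval are taken from the question; the function names are made up for this sketch.

```python
DAILY_QUOTA = 10_000      # quota units per day, as stated in the question
SEARCH_COST = 100         # units per search request, as stated in the question
POLL_INTERVAL_MIN = 3     # desired refresh interval from the question
MINUTES_PER_DAY = 24 * 60


def searches_per_day(daily_quota=DAILY_QUOTA, cost=SEARCH_COST):
    """How many search calls fit into the daily quota."""
    return daily_quota // cost


def daily_cost(channels, interval_min=POLL_INTERVAL_MIN, cost=SEARCH_COST):
    """Quota units needed to poll `channels` channels all day at the given interval."""
    polls_per_channel = MINUTES_PER_DAY // interval_min
    return channels * polls_per_channel * cost


def min_interval_minutes(channels, daily_quota=DAILY_QUOTA, cost=SEARCH_COST):
    """Shortest polling interval (in minutes) that stays inside the quota."""
    return channels * MINUTES_PER_DAY / searches_per_day(daily_quota, cost)


print(searches_per_day())        # 100
print(daily_cost(channels=1))    # 48000
print(min_interval_minutes(1))   # 14.4
```

Polling even a single channel every 3 minutes with the search endpoint would need 48000 units a day, almost five times the quota, which is why the answers below avoid the search call for the "is it live?" check.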
Since the question only specified that Search API quotas should not be used in finding out if the channel is streaming, I thought I would share a sort of work-around method. It might require a bit more work than a simple API call, but it reduces API quota use to practically nothing:
I used a simple Perl GET request to retrieve a Youtube channel's main page. Several unique elements are found in the HTML of a channel page that is streaming live:
The number of live viewers tag, e.g. <li>753 watching</li>.
The LIVE NOW badge tag: <span class="yt-badge yt-badge-live" >Live now</span>.
To ascertain whether a channel is currently streaming live requires a simple match to see if the unique HTML tag is contained in the GET request results. Something like: if ($get_results =~ /$unique_html/) (Perl). Then, an API call can be made only to a channel ID that is actually streaming, in order to obtain the video ID of the stream.
The advantage of this is that you already know the channel is streaming, instead of using thousands of quota points to find out. My test script successfully identifies whether a channel is streaming, by looking in the HTML code for: <span class="yt-badge yt-badge-live" > (note the weird extra spaces in the code from Youtube).
I don't know what language OP is using, or I would help with a basic GET request in that language. I used Perl, and included browser headers, User Agent and cookies, to look like a normal computer visit.
Youtube's robots.txt doesn't seem to forbid crawling a channel's main page, only the community page of a channel.
Let me know what you think about the pros and cons of this method, and please comment with what might be improved rather than disliking if you find a flaw. Thanks, happy coding!
2020 UPDATE
The yt-badge-live badge seems to have been deprecated; it no longer reliably shows whether the channel is streaming. Instead, I now check the HTML for this string:
{"text":" watching"}
If I get a match, it means the channel is streaming. (Non-streaming channels don't contain this string.) Again, note the weird extra whitespace. I also escape all the quotation marks since I'm using Perl.
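For readers not using Perl, the same idea can be sketched in Python. The marker string is the one quoted above; the function names, the timeout and the User-Agent value are assumptions for illustration, and YouTube may change its markup at any time, so treat this as a sketch rather than a stable check.

```python
import urllib.request

# Marker quoted in the 2020 update above; it may stop working if YouTube changes its HTML.
LIVE_MARKER = '{"text":" watching"}'


def looks_live(html, marker=LIVE_MARKER):
    """Return True if the channel-page HTML contains the live marker."""
    return marker in html


def fetch_channel_html(channel_id, timeout=10):
    """Download a channel's main page (network call; the User-Agent is an arbitrary example)."""
    url = "https://www.youtube.com/channel/" + channel_id
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Offline demonstration with hand-written snippets instead of a real page.
print(looks_live('..."viewCount":{"runs":[{"text":"753"},{"text":" watching"}]}...'))  # True
print(looks_live("<html><body>No stream here</body></html>"))                         # False
```

Only channels for which `looks_live` returns True would then need an API call to obtain the video ID.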
Here are my two suggestions:
Check my answer where I explain how you can retrieve videos from channels that are livestreaming.
Another option could be to use the following URL and make a request each time you want to check whether there is a livestream.
https://www.youtube.com/channel/<CHANNEL_ID>/live
Where CHANNEL_ID is the ID of the channel you want to check for a livestream [1].
[1] Note that the URL may not work for all channels (that depends on the channel itself).
For example, with the channel ID UC7_YxT-KID8kRbqZo7MyscQ the /live URL shows whether the channel is livestreaming, but with the channel ID UC4nprx9Vd84-ly7N-1Ce6Og (https://www.youtube.com/channel/UC4nprx9Vd84-ly7N-1Ce6Og/live) it shows the channel's main page instead.
Adding to the answer by Bman70, I tried eliminating the need to make a costly search request after learning that the channel is streaming live. I did this using two indicators in the HTML response from the pages of channels that are streaming live.
function findLiveStreamVideoId(channelId, cb){
    $.ajax({
        url: 'https://www.youtube.com/channel/'+channelId,
        type: "GET",
        headers: {
            'Access-Control-Allow-Origin': '*',
            'Accept-Language': 'en-US, en;q=0.5'
        }}).done(function(resp) {
        //one method to find live video
        let n = resp.search(/\{"videoId[\sA-Za-z0-9:"\{\}\]\[,\-_]+BADGE_STYLE_TYPE_LIVE_NOW/i);
        //If found
        if(n>=0){
            let videoId = resp.slice(n+1, resp.indexOf("}",n)-1).split("\":\"")[1]
            return cb(videoId);
        }
        //If not found, then try another method to find live video
        n = resp.search(/https:\/\/i.ytimg.com\/vi\/[A-Za-z0-9\-_]+\/hqdefault_live.jpg/i);
        if (n >= 0){
            let videoId = resp.slice(n,resp.indexOf(".jpg",n)-1).split("/")[4]
            return cb(videoId);
        }
        //No streams found
        return cb(null, "No live streams found");
    }).fail(function() {
        return cb(null, "CORS Request blocked");
    });
}
However, there's a tradeoff. This method confuses a recently ended stream with a currently live stream. A workaround for this issue is to check the status of the returned videoId through the YouTube API (costs a single unit from your quota).
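The second indicator in the function above (the hqdefault_live.jpg thumbnail URL) is easier to read with a capture group. Here is a minimal Python sketch of just that step; the function name and the sample HTML are made up for illustration, and the thumbnail pattern is simply the one used in the JavaScript above, so it carries the same caveat about recently ended streams.

```python
import re

# Same thumbnail pattern as in the JavaScript answer, with the video ID captured.
LIVE_THUMBNAIL = re.compile(
    r"https://i\.ytimg\.com/vi/([A-Za-z0-9_-]+)/hqdefault_live\.jpg"
)


def find_live_video_id(html):
    """Return the video ID from the first live thumbnail URL, or None if there is none."""
    match = LIVE_THUMBNAIL.search(html)
    return match.group(1) if match else None


# Hand-written sample markup, not real YouTube output.
sample = '<img src="https://i.ytimg.com/vi/abc-DEF_123/hqdefault_live.jpg">'
print(find_live_video_id(sample))              # abc-DEF_123
print(find_live_video_id("<p>nothing</p>"))    # None
```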
I found the YouTube API to be very restrictive given the cost of the search operation. The accepted answer did not work for me, as I found the string on non-live streams as well. Web scraping with aiohttp and BeautifulSoup was not an option since the better indicators required JavaScript support. Hence I turned to Selenium. I looked up the CSS selector #info-text and then searched its text for the string "Started streaming" or "watching now".
To reduce the load on my tiny server, which would otherwise have required a lot more resources, I moved this check to a Heroku dyno with a small Flask app.
# import flask dependencies
import os
from flask import Flask, request, make_response, jsonify
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
base = "https://www.youtube.com/watch?v={0}"
delay = 3
# initialize the flask app
app = Flask(__name__)
# default route
@app.route("/")
def index():
    return "Hello World!"
# create a route for webhook
@app.route("/islive", methods=["GET", "POST"])
def is_live():
    chrome_options = Options()
    chrome_options.binary_location = os.environ.get('GOOGLE_CHROME_BIN')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--remote-debugging-port=9222')
    driver = webdriver.Chrome(executable_path=os.environ.get('CHROMEDRIVER_PATH'), chrome_options=chrome_options)
    url = request.args.get("url")
    if "youtube.com" in url:
        video_id = url.split("?v=")[-1]
    else:
        # a bare video ID was passed, so build the watch URL from it
        video_id = url
        url = base.format(url)
    print(url)
    response = { "url": url, "is_live": False, "ok": False, "video_id": video_id }
    driver.get(url)
    try:
        element = WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CSS_SELECTOR, "#info-text")))
        result = element.text.lower().find("Started streaming".lower())
        if result != -1:
            response["is_live"] = True
        else:
            result = element.text.lower().find("watching now".lower())
            if result != -1:
                response["is_live"] = True
        response["ok"] = True
        return jsonify(response)
    except Exception as e:
        print(e)
        return jsonify(response)
    finally:
        driver.close()
# run the app
if __name__ == "__main__":
    app.run()
You'll however need to add the following buildpacks in settings
https://github.com/heroku/heroku-buildpack-google-chrome
https://github.com/heroku/heroku-buildpack-chromedriver
https://github.com/heroku/heroku-buildpack-python
Set the following Config Vars in settings
CHROMEDRIVER_PATH=/app/.chromedriver/bin/chromedriver
GOOGLE_CHROME_BIN=/app/.apt/usr/bin/google-chrome
You can find the supported Python runtimes here, but anything below Python 3.9 should be good, since Selenium had problems with improper use of the is operator.
I hope youtube will provide better alternatives than workarounds.
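One fragile spot in the Flask handler above is `url.split("?v=")[-1]`, which keeps any trailing parameters (for example `&t=10s`) as part of the video ID. A possible alternative, sketched here with the standard library only, parses the query string instead. The function name `extract_video_id` is made up for this example and is not part of the original answer.

```python
from urllib.parse import urlsplit, parse_qs


def extract_video_id(url_or_id):
    """Return the `v` query parameter of a watch URL, or the input itself if it is a bare ID."""
    if "youtube.com" not in url_or_id:
        # Same convention as the handler above: anything else is treated as a video ID.
        return url_or_id
    query = parse_qs(urlsplit(url_or_id).query)
    values = query.get("v")
    return values[0] if values else None


print(extract_video_id("https://www.youtube.com/watch?v=abc123&t=10s"))  # abc123
print(extract_video_id("abc123"))                                        # abc123
```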
I know this is an old thread, but I thought I'd share my way of checking, for example to grab the status code to use in an app.
This is for a single channel, but you could easily do a foreach with it.
<?php
#####
$ytchannelID = "UCd0BTXriKLvOs1ANx3puZ3Q";
#####
$ytliveurl = "https://www.youtube.com/channel/".$ytchannelID."/live";
$ytchannelLIVE = '{"text":" watching now"}';
$contents = file_get_contents($ytliveurl);
if ( strpos($contents, $ytchannelLIVE) !== false ){http_response_code(200);} else {http_response_code(201);}
unset($ytliveurl);
?>
Adding onto the other answers here, I use a GET request to https://www.youtube.com/c/<CHANNEL_NAME>/live and then search for "isLive":true (rather than {"text":" watching"})

Getting statuses on Twitter via REST API doesn't always return media URLs

I can't seem to get the embedded URL in a status, for example, in id=780804331608109057 -
https://twitter.com/i/web/status/780804331608109057
When I retrieve this via GET /statuses/:id, with include_entities set to true, the response looks like this:
"text":"Here\u2019s WSJ \"An Underwhelming Trump-Clinton Debate\u201d trying to spin this as something other than a Clinton triumph\u2026 https:\/\/t.co\/dpkmphGI8k",
"truncated":true,
"entities":
{"hashtags":[],"symbols":[],"user_mentions":[],"urls":
[{"url":"https:\/\/t.co\/dpkmphGI8k",
"expanded_url":"https:\/\/twitter.com\/i\/web\/status\/780804331608109057",
"display_url":"twitter.com\/i\/web\/status\/7\u2026","indices":[114,137]}]},
"source":"\u003ca href=\"https:\/\/about.twitter.com\/products\/tweetdeck\"rel=\"nofollow\"\u003eTweetDeck\u003c\/a\u003e",....
When viewed on my web client, the status instead displays the link to WSJ (referred through t.co). What I would like is for one or both of these URLs to show up in my API response:
https://pbs.twimg.com/media/CtX5Sz8WIAAm4tq.jpg
what would be the short URL that looks like "t.co" followed by "/HJs4kbmTKz" (I have to break this up so SO doesn't complain.)
What am I doing wrong here?
The incredibly fast response from a staffer on the TwitterCommunity website was most gratifying:
You need to use tweet_mode=extended for the new longer Tweet format.
Ref: https://twittercommunity.com/t/missing-media-property-in-entities/70388/4
A search on this new parameter yields the appropriate documentation on dev.twitter.com; more documentation links on this mode probably exist out there. The most significant change appears to be that, in extended mode, the status text is no longer stored under the key text. Unless you turn on compatibility mode, you now have to read it from the key full_text.
https://dev.twitter.com/overview/api/upcoming-changes-to-tweets
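As a rough illustration of what that means for client code, the sketch below reads the text and media URLs from an already decoded status object. It assumes the response has been parsed into a Python dict; the function names and the sample payload are made up for this example (only the image URL is the one from the question), and the `extended_entities` and `media_url_https` keys are used here on the assumption that the response follows the usual v1.1 layout.

```python
def tweet_text(tweet):
    """Prefer `full_text` (tweet_mode=extended) and fall back to `text` (compatibility mode)."""
    return tweet.get("full_text", tweet.get("text", ""))


def media_urls(tweet):
    """Collect media URLs from `extended_entities` if present, otherwise from `entities`."""
    entities = tweet.get("extended_entities") or tweet.get("entities") or {}
    return [m["media_url_https"] for m in entities.get("media", []) if "media_url_https" in m]


# Hand-written sample shaped like an extended-mode response, not real API output.
sample = {
    "full_text": "Example status text",
    "entities": {
        "urls": [],
        "media": [{"media_url_https": "https://pbs.twimg.com/media/CtX5Sz8WIAAm4tq.jpg"}],
    },
}

print(tweet_text(sample))
print(media_urls(sample))
```

The fallback to `text` keeps the helper usable for responses that were fetched without tweet_mode=extended.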

Need to open a browser and launch a URL and extract some values from the URL

Initially I tried to get the source of the URL and extract the values I was looking for, as given below:
Dim URL : URL = "http://www.google.com"
Dim oHTTP : Set oHTTP = CreateObject("MSXML2.ServerXMLHTTP.6.0")
oHTTP.setproxy 2,"<Proxy server:Port>"
oHTTP.Open "GET", URL, False
oHTTP.Send
The values I am looking for in the URL are proxy specific: they vary based on the proxy server name, which is given explicitly in the snippet above.
I realised that giving the proxy server name explicitly is not the right approach, since PAC files are used. Giving the proxy name only lists the servers under that name.
So I thought of opening the URL from a browser, so that it takes the PAC details into account, and then extracting the necessary values from there.
Opening a URL in the default browser is achieved as given below. I wanted to know whether there is a way to extract the values from the browser.
CreateObject("WScript.Shell").Run("http://www.google.com")

How to construct/get Office Web App URL for sharepoint documents

I am trying to get the right redirection URL for my SharePoint documents, which I can then use to open documents in a WebView on iOS. Currently I am giving the absolute URL of the document, and the doc is rendered inside the WebView as a PDF (image/read-only), whereas I want to redirect to the Office Web App. My issue is that I don't know whether the URL for the Office Web App is something I can construct, e.g. by appending /_layouts/15/WopiFrame.aspx?sourcedoc=, or whether the URL is custom per installation and we need to call some SharePoint API that tells us the base URL of the WOPI service.
Currently I am passing URL like - https://.sharepoint.com/Shared%20Documents/demo/demo.docx
Whereas I want to pass URL like - https://.sharepoint.com/_layouts/15/WopiFrame.aspx?sourcedoc=/Shared%20Documents/demo/demo.docx
Looking forward for help.
Thanks in advance,
Vishwesh
File f = clientContext.Web.GetFileByServerRelativeUrl("/sites/ /Shared%20Documents/Title.docx");
clientContext.Load(f);
clientContext.ExecuteQuery();
ClientResult<String> result = f.ListItemAllFields.GetWOPIFrameUrl(SPWOPIFrameAction.Edit);
clientContext.Load(f.ListItemAllFields);
clientContext.ExecuteQuery();
result.Value contains a URL, something like this:
http://sharep.xxx:8080/sites/zxxx/_layouts/15/WopiFrame.aspx?sourcedoc=%2Fsites%2Fzxxx%2FShared%20Documents%2FTitle%2Edocx&action=edit
Also, you can extract the Office Web Apps URL from the above page if you don't want to hit SharePoint at all.
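To make the shape of the returned URL easier to see, the following Python sketch simply takes the example URL above apart with the standard library. The function name `describe_wopi_url` is made up for illustration; this only decodes a URL that the server returned and says nothing about whether a hand-built URL would work on a given installation.

```python
from urllib.parse import urlsplit, parse_qs


def describe_wopi_url(wopi_url):
    """Split a WopiFrame URL into site, frame path, decoded sourcedoc and action."""
    parts = urlsplit(wopi_url)
    query = parse_qs(parts.query)  # parse_qs also percent-decodes the values
    return {
        "site": f"{parts.scheme}://{parts.netloc}",
        "frame_path": parts.path,
        "sourcedoc": query.get("sourcedoc", [None])[0],
        "action": query.get("action", [None])[0],
    }


example = (
    "http://sharep.xxx:8080/sites/zxxx/_layouts/15/WopiFrame.aspx"
    "?sourcedoc=%2Fsites%2Fzxxx%2FShared%20Documents%2FTitle%2Edocx&action=edit"
)
for key, value in describe_wopi_url(example).items():
    print(key, "=", value)
```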
using Microsoft.SharePoint.Client;
using Microsoft.SharePoint.Client.Utilities;
// Assume we have these variables:
// ctx: A valid client context
// serverRelativeUrl: the URL of the document
File f = ctx.Web.GetFileByServerRelativeUrl(serverRelativeUrl);
ClientResult<string> result = f.ListItemAllFields.GetWOPIFrameUrl(SPWOPIFrameAction.Edit);
ctx.Load(f.ListItemAllFields);
ctx.ExecuteQuery();
This builds on the answer from @thebitlic, which was the silver bullet for sure! However, he or she is doing two calls to the server. Through the wonders of CSOM batching, it's possible to do it in one round trip, with no need to bring back the File object at all.

Resources