CBS has the TV show "The Challenge", and there are many seasons and many episodes for each season. They are on the CBS website here: https://www.cbs.com/shows/the-challenge/
I would like a list of all the video links, such as this (first 4 episodes of Season 11).
https://www.cbs.com/shows/the-challenge/video/C5NCzTv2qJp1GwhcV5KWxeKWN4p6_Mqt/the-challenge-throwing-down-the-gauntlet/
https://www.cbs.com/shows/the-challenge/video/kXdno68B36gd6s06OhdrUDUvAAYY4q_e/the-challenge-derrick-steps-it-up/
https://www.cbs.com/shows/the-challenge/video/RYA43Dqs2bRJsgAtcZIZhN8zVVQ1FIxf/the-challenge-we-can-work-it-out/
https://www.cbs.com/shows/the-challenge/video/lJvc_Lkns9Q2NYkDfmsNQmeajXP3QjRm/the-challenge-the-10-000-pyramid/
How can I automatically extract the video links for all of the episodes? I was able to "view page source", but it only showed 12/18 episodes for Season 11: (Open with chrome): view-source:https://www.cbs.com/shows/the-challenge/ , search for https://www.cbs.com/shows/the-challenge/video, should show 12 matches.
The page "hides" episodes and seasons inside the main page, so there is not a separate URL for other seasons. The solution I have now is to manually copy the link address for each of the videos...
This page is (partially) loaded dynamically using JavaScript. For example, the links for episodes 13-18 are loaded that way.
To capture those, you'll need to find the underlying request in the Network tab of your browser's developer tools (that's a long and complicated story; you can start reading about it here, for example).
Once you have that link, the response is JSON which, once parsed into a Python dictionary, yields the desired output.
So all together:
import requests
cookies = {
'CBS_ADV_VAL': 'c',
'CBS_ADV_SUBSES_VAL': '4',
'ovvuid': '9f064779-4c06-49f1-9cdd-7e64e653145e',
'OptanonConsent': 'isIABGlobal=false&datestamp=Wed+Sep+09+2020+15%3A44%3A13+GMT-0400+(Eastern+Daylight+Time)&version=6.5.0&hosts=&consentId=d1c945ba-78ea-46e6-ba6f-5329085e06d8&interactionCount=1&landingPath=https%3A%2F%2Fwww.cbs.com%2Fshows%2Fthe-challenge%2F&groups=1%3A1%2C2%3A1%2C3%3A1%2C4%3A1%2C5%3A1',
}
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0',
'Accept': 'application/json, text/javascript, */*; q=0.01',
'Accept-Language': 'en-US,en;q=0.5',
'X-Requested-With': 'XMLHttpRequest',
'Referer': 'https://www.cbs.com/shows/the-challenge/',
'Connection': 'keep-alive',
'Cache-Control': 'max-age=0',
'TE': 'Trailers',
}
response = requests.get('https://www.cbs.com/shows/the-challenge/xhr/episodes/page/0/size/18/xs/0/season/11/', headers=headers, cookies=cookies)
links = response.json()
for entry in links['result']['data']:
    print(entry['url'])
Output:
/shows/the-challenge/video/IBWXQxtaPVmI40RnAACOc_zo0u13Ups1/the-challenge-blind-panic/
/shows/the-challenge/video/uFv8wFmvUFRKfiM29HVT3K_gGCZ4IWYS/the-challenge-last-men-standing/
/shows/the-challenge/video/9GP_ASLg9U_MmFvFmXPHO9liRzjdHhwI/the-challenge-don-t-bet-on-it/
etc., all 18 episodes. You can then concatenate each of these links with the base url (https://www.cbs.com) to form the final links.
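One way to do that concatenation is with urljoin from Python's standard library. This is a minimal sketch using the three relative paths from the output above; the variable names are only illustrative:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.cbs.com"

# Relative paths as printed by the loop over links['result']['data'].
relative_links = [
    "/shows/the-challenge/video/IBWXQxtaPVmI40RnAACOc_zo0u13Ups1/the-challenge-blind-panic/",
    "/shows/the-challenge/video/uFv8wFmvUFRKfiM29HVT3K_gGCZ4IWYS/the-challenge-last-men-standing/",
    "/shows/the-challenge/video/9GP_ASLg9U_MmFvFmXPHO9liRzjdHhwI/the-challenge-don-t-bet-on-it/",
]

# urljoin resolves each path against the base, avoiding doubled or missing slashes.
full_links = [urljoin(BASE_URL, path) for path in relative_links]

for link in full_links:
    print(link)
```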
So, I was making a program that will search google and fetch all the results for a given keyword. I wanted to get all the URLs and print them out to the screen, and I decided to use BS4 for this and this is how I did it:
r = requests.get(f'https://www.google.com/search?q={dork}&start={page}',headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'})
soup = BeautifulSoup(r.text, "html.parser")
urls = soup.find_all('div', attrs={'class': 'BNeawe UPmit AP7Wnd'})
for url in urls:
    url = url.split('<div class="BNeawe UPmit AP7Wnd">')[1].split('</div>')[0]
    url = url.replace(' › ','/')
    print(f'{Fore.GREEN}{url}{Fore.WHITE}')
open(f'results/{timeLol}/urls.txt', "a")
But it did not return the complete URL: if the URL was long, it was cut off with ... partway through. Is there any way to get the complete URL, even without using BS4 and Requests?
Any search query example would be appreciated.
Since you didn't provide an example query, I can only suggest trying bs4 CSS selectors (css selectors reference):
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
# https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
# other URLs below...
Code and example in the online IDE that scrapes more:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'how to cook best corn on the cob'}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
---------
'''
https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
https://www.allrecipes.com/recipe/222352/jamies-sweet-and-easy-corn-on-the-cob/
https://www.delish.com/cooking/a22487458/corn-on-the-cob/
https://www.thekitchn.com/best-method-cook-corn-skills-showdown-23045869
https://natashaskitchen.com/15-minute-corn-on-the-cob/
https://www.thegunnysack.com/how-long-to-boil-corn-on-the-cob/
https://www.epicurious.com/recipes/food/views/basic-method-for-cooking-corn-on-the-cob-40047
https://houseofnasheats.com/the-best-boiled-corn-on-the-cob/
https://www.tasteofhome.com/article/perfect-corn-on-the-cob/
'''
Alternatively, you can do the same thing with the Google Search Results API from SerpApi. The parsing is already done for you, so all that remains is to iterate over the structured JSON.
It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
"api_key": os.getenv("API_KEY"),
"engine": "google",
"q": "how to cook best corn on the cob",
"hl": "en",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    link = result['link']
    print(link)
----------
'''
https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
https://www.allrecipes.com/recipe/222352/jamies-sweet-and-easy-corn-on-the-cob/
https://www.delish.com/cooking/a22487458/corn-on-the-cob/
https://www.thekitchn.com/best-method-cook-corn-skills-showdown-23045869
https://natashaskitchen.com/15-minute-corn-on-the-cob/
https://www.thegunnysack.com/how-long-to-boil-corn-on-the-cob/
https://www.epicurious.com/recipes/food/views/basic-method-for-cooking-corn-on-the-cob-40047
https://houseofnasheats.com/the-best-boiled-corn-on-the-cob/
https://www.tasteofhome.com/article/perfect-corn-on-the-cob/
'''
Disclaimer, I work for SerpApi.
I am trying to get the historical economic calendar data from this website - https://www.investing.com/economic-calendar/ from the following dates (1 Feb 2020 to 5 Feb 2020).
Today is 4 Feb 2020.
If I use the https://www.investing.com/economic-calendar/ url below, I am able to extract the table using beautifulsoup but I am unable to select any day except the current day. I get a table saved in my python script for (4 Feb 2020) which is today.
import requests
import pandas as pd
from bs4 import BeautifulSoup
payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
"dateFrom":"2020-02-01",
"dateTo":"2020-02-05",
"timeZone":"8",
"timeFilter":"timeRemain",
"currentTab":"custom",
"limit_from":"0"}
urlheader = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
url = "https://www.investing.com/economic-calendar/"
req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")
The table variable looks like this
I can see that it sends a post request to "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData" whenever I change the date range or filter settings.
Here is the request data I found.
Here is the POST link
So I use the following code instead, as I want to select the dates.
import requests
import pandas as pd
from bs4 import BeautifulSoup
payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
"dateFrom":"2020-02-01",
"dateTo":"2020-02-05",
"timeZone":"8",
"timeFilter":"timeRemain",
"currentTab":"custom",
"limit_from":"0"}
urlheader = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"
req = requests.post(url, data=payload, headers=urlheader)
print(req)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('table', id="economicCalendarData")
But this time, there is no economicCalendarData, so the table variable comes out empty.
The soup variable has data in it but there's no table data in it.
This is the table I'm trying to save.
Like I said earlier, if I use the url as https://www.investing.com/economic-calendar/, I can get the table data for the current day only (4 Feb 2020); no matter what dates I enter into the payload (dateFrom, dateTo).
For some reason, the table comes up empty when I post to https://www.investing.com/economic-calendar/Service/getCalendarFilteredData instead. The soup variable contains data, but not the data I requested. What am I doing wrong? How do I save the tables for the dates I select?
You were real close. If I understood your requirements, the following should get you there:
import requests
from bs4 import BeautifulSoup
url = "https://www.investing.com/economic-calendar/Service/getCalendarFilteredData"
payload = {"country[]":["25","32","6","37","72","22","17","39","14","10","35","43","56","36","110","11","26","12","4","5"],
"dateFrom":"2020-02-01",
"dateTo":"2020-02-05",
"timeZone":"8",
"timeFilter":"timeRemain",
"currentTab":"custom",
"limit_from":"0"}
req = requests.post(url, data=payload, headers={
"User-Agent":"Mozilla/5.0",
"X-Requested-With": "XMLHttpRequest"
})
soup = BeautifulSoup(req.json()['data'],"lxml")
for items in soup.select("tr"):
    data = [item.get_text(strip=True) for item in items.select("th,td")]
    print(data)
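The earlier attempts likely found no table because this endpoint answers with JSON, with the rows held as an HTML fragment under its 'data' key (that is what req.json()['data'] reads above). Below is a minimal offline sketch of that unwrapping, using only the standard library and a made-up one-row payload. The sample row and the names RowCollector and rows_from_response are illustrative assumptions, not real site output:

```python
import json
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_response(body):
    """Decode the JSON body, then parse the HTML fragment stored under 'data'."""
    fragment = json.loads(body)["data"]
    parser = RowCollector()
    parser.feed(fragment)
    parser.close()
    return parser.rows


# Hypothetical payload shaped like the response: JSON wrapping <tr> rows.
sample_body = json.dumps(
    {"data": "<tr><td>01:30</td><td>AUD</td><td>Building Approvals</td></tr>"}
)
rows = rows_from_response(sample_body)
print(rows)
```

Parsing the raw response text directly as HTML skips the JSON decoding step, which would explain why a lookup by table id comes back empty.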
I've been using Radview's Webload IDE tool for a couple of test simulation projects and it has worked well. But in one scenario, where I have a client web session for a login screen, it always fails with a 500 response error for a particular HTTP POST as the page loads.
When I load the page manually with a browser, it works fine with no issues.
During recording I cleared the browser cache and cookies, with no luck. I've also tried many configuration combinations from the "Recording and Script Generation Options: Post Data" settings.
/***** WLIDE - URL : http://192.168.2.2/ - ID:2 *****/
wlGlobals.GetFrames = false
wlGlobals.UserAgent = "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko"
wlHttp.Get("http://192.168.2.2/")
// END WLIDE
/***** WLIDE - URL : http://192.168.2.2/Api.ashx?c=Images&action=GetSettings - ID:3 *****/
wlHttp.Header["Referer"] = "http://192.168.2.2/"
wlHttp.FormdataEncodingType = 1
wlHttp.ContentType = "application/x-www-form-urlencoded"
wlHttp.FormData["c"] = "Images"
wlHttp.FormData["action"] = "GetSettings"
wlHttp.Post("http://192.168.2.2/Api.ashx"+"?c=Images&action=GetSettings")
// END WLIDE
Can anybody with experience with Radview's Webload give me some suggestions?
I noticed that commenting out the FormData "c" and "action" lines works. But later I hit a similar error on a request that requires a sessionID in the URL, so I'm not sure whether I can comment out the FormData "sessionID" line.
To call the API from Webload, you need to supply the authorization if it's secured.
Using wlHttp.FormData is not the same as adding a parameter to the URL for a POST request.
FormData is sent as part of the POST request body, while adding it to the URL sends it as a query string. Your server probably expects one form but not the other.
Contact RadView support if you can't get it to work and they'll help you.
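To make the difference concrete, here is a small Python sketch (not Webload script) that builds the same POST both ways without sending anything. The URL and parameter names come from the recorded script in the question; the variable names are illustrative:

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"c": "Images", "action": "GetSettings"}
base = "http://192.168.2.2/Api.ashx"

# Variant 1: parameters travel in the URL as a query string; the body is empty.
query_request = Request(base + "?" + urlencode(params), data=b"", method="POST")

# Variant 2: parameters travel in the request body as form data; the URL is bare.
body_request = Request(
    base,
    data=urlencode(params).encode("ascii"),
    method="POST",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(query_request.full_url, query_request.data)
print(body_request.full_url, body_request.data)
```

The recorded script does both at once (FormData lines plus the same parameters appended to the URL), which may be why removing the FormData lines changed the server's response.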
On my web server, when a user requests a URL with weird characters, I remove those characters and the system logs the case. When I checked the sanitized cases I found the ones below. I'm curious what the purpose of these URLs could be.
I checked the IPs: these are real people who use the website normally. But about 1 in 20 of their URL requests ends with these weird characters.
http://example.com/#%EF%BF%BD%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0,
http://example.com/%60E%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%60E%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/p%EF%BF%BD%1D%01?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%EF%BF%BDC%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%EF%BF%BDR%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD`%EF%BF%BD%EF%BF%BD%7F, agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
http://example.com/%EF%BF%BDe%EF%BF%BDv8%01%EF%BF%BD?o=3&g=P%01%EF%BF%BD&s=&z=%EF%BF%BD%EF%BF%BD%15%01%EF%BF%BD%EF%BF%BD, agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
http://en.wikipedia.org/wiki/Specials_(Unicode_block)
They are essentially malformed URLs. They can be generated by malware that is trying to exploit web site vulnerabilities, by a malfunctioning browser plugin or extension, or by a bug in a JS file (e.g. tracking with Google Analytics) in combination with a specific browser version/operating system. In any case, you can't control what requests come from a client and there's nothing you can do to stop them, so if your generated HTML/JS code is correct, you have done your work.
If you'd like to correct those URLs for any reason, you can enable URL rewriting and set a rule with a regular expression filter to transform them into valid URLs. I don't suggest doing that, though: the web server should respond with a 404 Not Found error, because that is the standard (it's a client error, after all), and in my opinion this is faster and safer than applying URL rewriting (the rewriting procedure may contain bugs that someone could try to exploit, and so on).
For the sake of curiosity, you can easily decode those URLs with an online URL decoder of your choice (e.g. this), but essentially you will discover what you already know: there are a lot of UTF-8 replacement characters in those URLs.
In fact, %EF%BF%BD is the url-encoded version of the hex representation of the 3 bytes (EF BF BD) of the UTF-8 replacement character. You can see that character also as � or EF BF BD or FFFD or ï ¿ ½, and so on, depending on the representation method you choose.
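The same round trip can be checked locally with Python's standard library. This is a small sketch; the two invalid bytes are an arbitrary example chosen for illustration:

```python
from urllib.parse import quote, unquote

# Percent-decoding %EF%BF%BD gives the single code point U+FFFD.
decoded = unquote("%EF%BF%BD")
print(repr(decoded), hex(ord(decoded)))

# Decoding bytes that are not valid UTF-8 with errors="replace" produces
# U+FFFD for each undecodable byte...
replaced = b"\xff\xfe".decode("utf-8", errors="replace")

# ...and percent-encoding the result yields the sequences seen in the logs.
encoded = quote(replaced)
print(encoded)
```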
Also, you can check for yourself how the client handles that character. Go here:
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%EF%BF%BD&mode=char
press the GO button and, using your browser developer tools, check what really happens: the browser is actually encoding the unknown character with %EF%BF%BD before sending it to the web server.
These look like corrupted URLs being inserted by a piece of Malware/Adware called "Adpeak".
Here are some details on Adpeak:
How to remove AdPeak lqw.me script from my web pages?
Adpeak has a client side component that sticks the following tag into web pages:
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="http://d.lqw.me/xuiow/?g=7FC3E74A-AFDA-0667-FB93-1C86261E6E1C&s=4150&z=1385998326"></script>
Adpeak also sometimes uses the host names "d.sitespeeds.com", "d.jazzedcdn.com", "d.deliversuper.com", "d.blazeapi.com", "d.quikcdn.com", probably others. Here are a few more examples:
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="http://d.deliversuper.com/xuiow/?o=3&g=823F0056-D574-7451-58CF-01151D4A9833&s=7B0A8368-1A6F-48A5-B236-8BD61816B3F9&z=1399243226"></script>
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="http://d.jazzedcdn.com/xuiow/?o=3&g=B43EA207-C6AC-E01B-7865-62634815F491&s=B021CBBD-E38E-4F8C-8E93-6624B0597A23&z=1407935653"></script>
<SCRIPT id=2f2a695a6afce2c2d833c706cd677a8e type=text/javascript src="http://d.lqw.me/xuiow/?o=3&g=87B35A3E-C25D-041E-0A0F-C3E8E473A019&s=BBA5481A-926B-4561-BD79-249F618495E6&z=1393532281"></SCRIPT>
<SCRIPT id=2f2a695a6afce2c2d833c706cd677a8e type=text/javascript src="http://d.lqw.me/xuiow/?o=2&g=0AD3E5F2-B632-382A-0473-4C994188DBBA&s=9D0EB5E9-CCC9-4360-B7CA-3E645650CC53&z=1387549919"></SCRIPT>
The "id" is consistent: it's always "2f2a695a6afce2c2d833c706cd677a8e" in the cases we've seen.
There's always a "g", "s", and "z" parameter, and sometimes a "o" parameter that has values of 2 or 3.
We've noticed that with our pages, a certain version of this script is 100% correlated with seeing corrupted characters in the DOM: if "o" is omitted or set to 2, we'll see a Unicode U+FFFD injected near the end of the page, or sometimes a U+000E character (SHIFT OUT), which blows up standard JSON/XML serialization libraries. That is why we've been researching these URLs. We've never seen a corruption for "o=3".
However, sometimes it looks like Adpeak gets confused, and inserts junk like this:
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="��?o=3&g=&s=&z=����������~?"></script>
Now, we don't know that this is Adpeak, because the URLs are mangled, but the "o=3", "g", "s", and "z" parameters are four smoking guns. The host is missing here, so the URL will resolve against our server, and these U+FFFD characters will get sent up as percent-encoded UTF-8 "%EF%BF%BD" sequences, which are identical to what people have been seeing above.
If you're curious about how common this is, for a particular customer with high traffic and a wide demographic, we see Adpeak URLs injected into about 1.09% of their web pages, counting both well-formed Adpeak URLs and URLs with U+FFFD characters. If you just look for Adpeak URLs with U+FFFD sequences, those appear in 0.053% of all web pages. And if you just look for Adpeak URLs that cause DOM corruptions (e.g., the valid URLs that contain "o=2" or no "o" parameter), that covers 0.20% of all web pages.
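If you want to count such requests in your own logs, a rough heuristic based on the pattern described above (the "g", "s", and "z" parameters plus at least one %EF%BF%BD sequence) could look like the sketch below. The function name and the exact rule are assumptions for illustration; this is not a confirmed Adpeak signature:

```python
from urllib.parse import parse_qs, urlsplit

REPLACEMENT = "%EF%BF%BD"  # percent-encoded U+FFFD


def looks_like_mangled_adpeak(url):
    """Heuristic: query has g, s and z keys and the URL contains %EF%BF%BD."""
    query = urlsplit(url).query
    # keep_blank_values=True so that empty parameters like "g=" still count.
    keys = set(parse_qs(query, keep_blank_values=True))
    return {"g", "s", "z"} <= keys and REPLACEMENT in url.upper()


suspicious = "http://example.com/%60E%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%3E?"
ordinary = "http://example.com/page?id=1"

print(looks_like_mangled_adpeak(suspicious))
print(looks_like_mangled_adpeak(ordinary))
```

Requests where the junk follows a "#" would not match, since everything after "#" is a fragment rather than a query string.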
Probably your site's character set is not set to UTF-8, but when a page on the site is requested, its content is treated as if it were UTF-8 encoded. Any character that cannot be decoded as UTF-8 is then replaced with the byte sequence EF BF BD (a placeholder character).
Make sure you use UTF-8 everywhere on your site by using <meta charset="UTF-8"> in every page.
Another example for this in a different situation: Whats going on with this byte array?
You have to use regular expression functions; search for them on the official PHP site or Google them.
URLs in languages other than English are causing this problem.
The meta charset UTF-8 setting will not affect the URL, so it won't help. Meta charset only helps you display text in other languages on your web page, not in your URL.
Using PHP regex you can show even Chinese text in a URL.
Hope it works.
Just uncheck the EnableBrowserLink option in Visual Studio. Everything will work out of the box.
Previously I was able to download YouTube videos as mp3 via youtube-mp3.org using this method:
http://www.youtube-mp3.org/api/pushItem/?item=http%3A//www.youtube.com/watch%3Fv%3D<VIDEOID>&xy=_
Then it returned the video id and they started converting the video on their servers. Then this request would return a JSON string with info about the video and the current conversion status:
http://www.youtube-mp3.org/api/itemInfo/?video_id=<VIDEOID>&adloc=
After repeating the request until the value for status was 'serving', I started the last request using the value for key h from the JSON response of the previous request, and this would download the mp3 file.
http://www.youtube-mp3.org/get?video_id=<VIDEOID>&h=<JSON string value for h>
Now the first request always returns nothing. The second and third requests only succeed if the requested video is cached on their servers (like popular music videos). If that's not the case, the second request returns nil, so the third request can't be started because the h value from the second request is missing. Could anybody help me get the website to start a conversion? Something must be wrong with the first URL, I just don't know what. Thanks
I just tested it. For the first request, you need to send with it a header of:
Accept-Location: *
Otherwise, it will return a 500 (Internal Server Error). But with that header, it will return a string of the youtube video id, and you can use the 2nd api for checking the progress.
Here's the C# code I used for testing:
HttpWebRequest wr = (HttpWebRequest)WebRequest.Create("FIRST_API_URL");
wr.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.75 Safari/535.7";
wr.Headers.Add("Accept-Location", "*");
string res = (new StreamReader(wr.GetResponse().GetResponseStream())).ReadToEnd();
Btw, you can keep track of the headers in the browser's Network (Chrome) debug tab.
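For reference, the polling step from the question (repeat the itemInfo request until status is 'serving', then read h) could be sketched in Python roughly as follows. The fetch_info callable is a hypothetical stand-in for whatever performs the HTTP request and parses the JSON; here it is faked with made-up replies so the sketch runs offline:

```python
import time


def wait_for_h(fetch_info, attempts=10, delay=0.0):
    """Call fetch_info() until it reports status 'serving'; return its 'h' value."""
    for _ in range(attempts):
        info = fetch_info()  # expected: a dict, or None when the reply is empty
        if info and info.get("status") == "serving":
            return info.get("h")
        time.sleep(delay)
    return None  # never reached 'serving' within the allowed attempts


# Fake replies standing in for successive itemInfo responses (made-up values).
fake_replies = iter([
    None,
    {"status": "converting"},
    {"status": "serving", "h": "abc123"},
])
h_value = wait_for_h(lambda: next(fake_replies))
print(h_value)
```

Capping the number of attempts keeps the loop from spinning forever when the first request never starts a conversion, which is the failure described in the question.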
Regards