Save description of a number of youtube videos - youtube

I need to save descriptions of a list of youtube videos. I want to feed the urls of videos (i.e. http://www.youtube.com/watch?v=sVtkQCz9Sx8) and then get the corresponding info of the "about" section of the youtube video. Is this possible for me to do without learning even basics of programming?

In python 3, something like this :
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib.request import urlopen
with open('links.txt') as f:
for link in f:
page = urlopen(link)
soup = BeautifulSoup(page.read())
description_tag = soup.find(id='eow-description')
upload_date_tag = soup.find(id='eow-date')
print(link)
print('Published on', upload_date_tag.text)
print(description_tag.text)
print()
Type your urls in links.txt (one url per lines).

Related

Trouble parsing CNN search results using Python 3 lxml

I am trying to parse the response from a search on the CNN site like so:
import requests
from lxml import html
from lxml import etree
r = requests.get('https://www.cnn.com/search?q=climate+change')
doc = etree.HTML(r.content)
for url in doc.xpath('//a[#href]'):
u = url.get('href')
print(u)
This gives a bunch of links, primarily to different sections on the site, but it gives no links at all to the actual stories returned by the search. What am I doing wrong?

Using python to parse twitter url

I am using the following code but I am not able to extract any information from the url.
from urllib.parse import urlparse
if __name__ == "__main__":
z = 5
url = 'https://twitter.com/isro/status/1170331318132957184'
df = urlparse(url)
print(df)
ParseResult(scheme='https', netloc='twitter.com', path='/isro/status/1170331318132957184', params='', query='', fragment='')
I was hoping to extract the tweet message, time of tweet and other information available from the link but the code above clearly doesn't achieve that. How do I go about it from here ?
print(df)
ParseResult(scheme='https', netloc='twitter.com', path='/isro/status/1170331318132957184', params='', query='', fragment='')
I think you may be misunderstanding the purpose of the urllib parseurl function. From the Python documentation:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment
From the result you are seeing in ParseResult, your code is working perfectly - it is breaking your URL up into the component parts.
It sounds as though you actually want to fetch the web content at that URL. In that case, I might take a look at urllib.request.urlopen instead.

Could anyone scrape an element with Jsoup?

I'm trying to scrape this link using Jsoup with Kotlin/Java. And I have problem in scrapping players part (under Current Squad). Could anyone parse it?
You can not access the information directly using only the response from that link.
You can make a JSON object with the http response from https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_team_squad/2817 and https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_teamplayer_facts/2817/42556.
As an example in python you can get the minutes played by each player as follows:
import urllib
import json
f=urllib.urlopen('https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_team_squad/2817')
f2=urllib.urlopen('https://stats.fn.sportradar.com/betsgi/en/America:Argentina:Buenos_Aires/gismo/stats_teamplayer_facts/2817/42556')
j=json.loads(f.read())
j2=json.loads(f2.read())
plrs=j['doc'][0]['data']['players']
for plr in plrs:
print '========================='
print plr['name']
try:
print 'minutes played:' +str(j2['doc'][0]['data'][str(plr['_id'])]['stats']['total']['minutes_played'])
except KeyError, e:
pass

Python code to display default page name for url

is there a way in python to display what the default initial page is for www.xyz.com when there isn't an index.html, default.html page?
For example I am trying to use Beautiful Soup to do the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.example.com/index.html")
bsObj = BeautifulSoup(html.read())
print(bsObj.h1)
But index.html and default.html do not exist.
How do I find out what the default page is when I go to www.example.com?
You can just use urlopen("http://www.example.com/"), which will open the default page served by the webserver.

Extracting an element from XML with Python3?

I am trying to write a Python 3 script where I am querying a web api and receiving an XML response. The response looks like this –
<?xml version="1.0" encoding="UTF-8"?>
<ipinfo>
<ip_address>4.2.2.2</ip_address>
<ip_type>Mapped</ip_type>
<anonymizer_status/>
<Network>
<organization>level 3 communications inc.</organization>
<OrganizationData>
<home>false</home>
<organization_type>Telecommunications</organization_type>
<naics_code>518219</naics_code>
<isic_code>J6311</isic_code>
</OrganizationData>
<carrier>level 3 communications</carrier>
<asn>3356</asn>
<connection_type>tx</connection_type>
<line_speed>high</line_speed>
<ip_routing_type>fixed</ip_routing_type>
<Domain>
<tld>net</tld>
<sld>bbnplanet</sld>
</Domain>
</Network>
<Location>
<continent>north america</continent>
<CountryData>
<country>united states</country>
<country_code>us</country_code>
<country_cf>99</country_cf>
</CountryData>
<region>southwest</region>
<StateData>
<state>california</state>
<state_code>ca</state_code>
<state_cf>88</state_cf>
</StateData>
<dma>803</dma>
<msa>31100</msa>
<CityData>
<city>san juan capistrano</city>
<postal_code>92675</postal_code>
<time_zone>-8</time_zone>
<area_code>949</area_code>
<city_cf>77</city_cf>
</CityData>
<latitude>33.499</latitude>
<longitude>-117.662</longitude>
</Location>
</ipinfo>
This is the code I have so far –
import urllib.request
import urllib.error
import sys
import xml.etree.ElementTree as etree
…
try:
xml = urllib.request.urlopen(targetURL, data=None)
except urllib.error.HTTPError as e:
print("HTTP error: " + str(e) + " URL: " + targetURL)
sys.exit()
tree = etree.parse(xml)
root = tree.getroot()
The API query works and through the debugger I can see all of the information inside the ‘root’ variable. My issue is that I have not been able to figure out how to extract something like the ASN (<asn></asn>) from the returned XML. I’ve been beating my head against this for a day with a whole wide variety of finds, findalls and all other sorts of methods but not been able to crack this. I think I have reached the point where I cannot see the wood for all the trees and every example I have found on the internet doesn’t seem to help. Can someone show me a code snippet which can extract the contents of a XML element from inside the tree structure?
Many thanks
Tim
I would recommend using Beautiful Soup.
It's a very powerful when it comes to extracting data from xml-code.
Example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(targetURL)
soup.find_all('asn') #Would return all the <asn></asn> tags found!

Resources