Trouble parsing CNN search results using Python 3 lxml

I am trying to parse the response from a search on the CNN site like so:

import requests
from lxml import etree

r = requests.get('https://www.cnn.com/search?q=climate+change')
doc = etree.HTML(r.content)
for url in doc.xpath('//a[@href]'):
    u = url.get('href')
    print(u)

This gives a bunch of links, primarily to different sections of the site, but no links at all to the actual stories returned by the search. What am I doing wrong?
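The story links are most likely missing from the HTML itself: CNN's search page builds its result list client-side with JavaScript, so the response requests downloads is just the page shell, full of navigation links but no stories. One way around this is to render the page in a real browser first. Below is a minimal sketch using Selenium; it assumes Chrome and a matching driver are available, and the fixed sleep is a simplification (an explicit wait on a results selector would be more robust):

import time
from lxml import html
from selenium import webdriver

driver = webdriver.Chrome()  # assumes a Chrome driver is installed
driver.get('https://www.cnn.com/search?q=climate+change')
time.sleep(5)  # crude: give the page's JavaScript time to inject the results

doc = html.fromstring(driver.page_source)
for url in doc.xpath('//a[@href]'):
    print(url.get('href'))

driver.quit()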

Related

Using Python to parse a Twitter URL

I am using the following code, but I am not able to extract any information from the URL.

from urllib.parse import urlparse

if __name__ == "__main__":
    url = 'https://twitter.com/isro/status/1170331318132957184'
    df = urlparse(url)
    print(df)

This prints:

ParseResult(scheme='https', netloc='twitter.com', path='/isro/status/1170331318132957184', params='', query='', fragment='')

I was hoping to extract the tweet message, the time of the tweet, and other information available from the link, but the code above clearly doesn't achieve that. How do I go about it from here?
I think you may be misunderstanding the purpose of urllib's urlparse function. From the Python documentation:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Parse a URL into six components, returning a 6-item named tuple. This
corresponds to the general structure of a URL:
scheme://netloc/path;parameters?query#fragment
From the result you are seeing in ParseResult, your code is working perfectly: it is breaking your URL up into its component parts.
It sounds as though you actually want to fetch the web content at that URL. In that case, take a look at urllib.request.urlopen instead.
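As a minimal sketch, reusing the URL from the question:

from urllib.request import urlopen

url = 'https://twitter.com/isro/status/1170331318132957184'
with urlopen(url) as response:
    raw_html = response.read().decode('utf-8')
print(raw_html[:200])

Bear in mind that twitter.com assembles its pages with JavaScript, so the raw HTML may not contain the tweet text; a Twitter API client is usually the more reliable route for tweet content and timestamps.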

lxml does not parse broken HTML: XMLSyntaxError when finding by XPath

I'm trying to extract a CSRF token from a login page.
I'm using the lxml library as the parser.

import requests
from lxml import etree

s = requests.Session()
login_html = etree.fromstring(
    s.get('https://www.uwkotinleuven.be/fr/login').text)
find = etree.XPath('//*[@id="login-form-2"]/input[3]')
print(find(login_html).value)
Here is the error:
lxml.etree.XMLSyntaxError: Opening and ending tag mismatch: link line 19 and head, line 46, column 24
I'm unsure whether the error is coming from the XPath finder or from broken HTML in the response.
Should I change the parser, or pass it different parameters? Is there a conventional way to parse broken HTML?
It looks like a broken HTML problem. See if this works for you:

import requests
import lxml.etree as etree
from io import StringIO

s = requests.Session()
dat = s.get('https://www.uwkotinleuven.be/fr/login')
parser = etree.HTMLParser()  # lenient parser that recovers from malformed markup
tree = etree.parse(StringIO(dat.text), parser)
find = tree.xpath('//*[@id="login-form-2"]/input[3]')
print(find[0].get('value'))  # the token sits in the input's value attribute
Output:
3pKL_AsLLBE07T6S-VY8eXJ4ooK_QH5kMgajPEwKSso
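Equivalently, the lxml.html module wraps the same forgiving HTML parser with less ceremony; a short sketch of the same extraction:

import requests
from lxml import html

dat = requests.get('https://www.uwkotinleuven.be/fr/login')
tree = html.fromstring(dat.text)  # tolerates broken HTML out of the box
inputs = tree.xpath('//*[@id="login-form-2"]/input[3]')
print(inputs[0].get('value'))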

LaTeX output from sympy does not display correctly in Google Colaboratory Jupyter notebooks

I am using Google's Colaboratory platform to run Python in a Jupyter notebook. In standard Jupyter notebooks, the output of sympy functions is correctly typeset LaTeX, but the Colaboratory notebook just outputs the raw LaTeX, as in the following code snippet:
import numpy as np
import sympy as sp
sp.init_printing(use_unicode=True)
x=sp.symbols('x')
a=sp.Integral(sp.sin(x)*sp.exp(x),x);a
results in LaTeX output like this:
$$\int e^{x} \sin{\left (x \right )}\, dx$$
The answer cited in the questions "Rendering LaTeX in output cells in Colaboratory" and "LaTeX equations do not render in google Colaboratory when using IPython.display.Latex" doesn't fix the problem. While it provides a method to display LaTeX expressions in the output of a code cell, it doesn't fix the output of the built-in sympy functions.
Any suggestions on how to get sympy output to render properly? Or is this a problem with the Colaboratory notebook?
I have just made this code snippet to make sympy work like a charm on colab.research.google.com!

import sympy
from sympy import init_printing

def custom_latex_printer(exp, **options):
    from google.colab.output._publish import javascript
    url = "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=default"
    javascript(url=url)
    return sympy.printing.latex(exp, **options)

init_printing(use_latex="mathjax", latex_printer=custom_latex_printer)

Put it after you import sympy.
This basically tells sympy to embed the MathJax library via the Colab API before it outputs any LaTeX.
You need to include the MathJax library before displaying. Set it up in a cell like this first.
from google.colab.output._publish import javascript
url = "https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/latest.js?config=default"
Later, you include javascript(url=url) before displaying:
x=sp.symbols('x')
a=sp.Integral(sp.sin(x)*sp.exp(x),x)
javascript(url=url)
a
Then, it will display correctly.
Using Colab's MathJax and setting the configuration to TeX-MML-AM_HTMLorMML worked for me. Below is the code:
from sympy import init_printing
from sympy.printing import latex

def colab_LaTeX_printer(exp, **options):
    from google.colab.output._publish import javascript
    url_ = "https://colab.research.google.com/static/mathjax/MathJax.js?"
    cfg_ = "config=TeX-MML-AM_HTMLorMML"  # "config=default"
    javascript(url=url_ + cfg_)
    return latex(exp, **options)
# end of def

init_printing(use_latex="mathjax", latex_printer=colab_LaTeX_printer)
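With either printer installed, the snippet from the question should then render as typeset math in the output cell rather than raw LaTeX:

import sympy as sp

x = sp.symbols('x')
sp.Integral(sp.sin(x) * sp.exp(x), x)  # now displays as a rendered integral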

Save descriptions of a number of YouTube videos

I need to save the descriptions of a list of YouTube videos. I want to feed in the URLs of videos (e.g. http://www.youtube.com/watch?v=sVtkQCz9Sx8) and then get the corresponding info from the "about" section of each video. Is this possible for me to do without learning even the basics of programming?
In Python 3, something like this:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
from urllib.request import urlopen

with open('links.txt') as f:
    for link in f:
        link = link.strip()  # drop the trailing newline
        page = urlopen(link)
        soup = BeautifulSoup(page.read(), 'html.parser')
        # note: the eow-* ids come from YouTube's older page markup
        description_tag = soup.find(id='eow-description')
        upload_date_tag = soup.find(id='eow-date')
        print(link)
        print('Published on', upload_date_tag.text)
        print(description_tag.text)
        print()

Type your URLs into links.txt (one URL per line).
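For example, links.txt could contain (the second line is a hypothetical placeholder):

http://www.youtube.com/watch?v=sVtkQCz9Sx8
https://www.youtube.com/watch?v=<another-video-id>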

Extracting an element from XML with Python 3?

I am trying to write a Python 3 script where I query a web API and receive an XML response. The response looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<ipinfo>
<ip_address>4.2.2.2</ip_address>
<ip_type>Mapped</ip_type>
<anonymizer_status/>
<Network>
<organization>level 3 communications inc.</organization>
<OrganizationData>
<home>false</home>
<organization_type>Telecommunications</organization_type>
<naics_code>518219</naics_code>
<isic_code>J6311</isic_code>
</OrganizationData>
<carrier>level 3 communications</carrier>
<asn>3356</asn>
<connection_type>tx</connection_type>
<line_speed>high</line_speed>
<ip_routing_type>fixed</ip_routing_type>
<Domain>
<tld>net</tld>
<sld>bbnplanet</sld>
</Domain>
</Network>
<Location>
<continent>north america</continent>
<CountryData>
<country>united states</country>
<country_code>us</country_code>
<country_cf>99</country_cf>
</CountryData>
<region>southwest</region>
<StateData>
<state>california</state>
<state_code>ca</state_code>
<state_cf>88</state_cf>
</StateData>
<dma>803</dma>
<msa>31100</msa>
<CityData>
<city>san juan capistrano</city>
<postal_code>92675</postal_code>
<time_zone>-8</time_zone>
<area_code>949</area_code>
<city_cf>77</city_cf>
</CityData>
<latitude>33.499</latitude>
<longitude>-117.662</longitude>
</Location>
</ipinfo>
This is the code I have so far:
import urllib.request
import urllib.error
import sys
import xml.etree.ElementTree as etree

…

try:
    xml = urllib.request.urlopen(targetURL, data=None)
except urllib.error.HTTPError as e:
    print("HTTP error: " + str(e) + " URL: " + targetURL)
    sys.exit()

tree = etree.parse(xml)
root = tree.getroot()
The API query works, and through the debugger I can see all of the information inside the 'root' variable. My issue is that I have not been able to figure out how to extract something like the ASN (<asn></asn>) from the returned XML. I've been beating my head against this for a day with a whole variety of find, findall and other methods, but I have not been able to crack it. I think I have reached the point where I cannot see the wood for the trees, and every example I have found on the internet doesn't seem to help. Can someone show me a code snippet which can extract the contents of an XML element from inside the tree structure?
Many thanks
Tim
I would recommend using Beautiful Soup.
It's very powerful when it comes to extracting data from XML.
Example:

from bs4 import BeautifulSoup
import urllib.request

xml = urllib.request.urlopen(targetURL, data=None)  # targetURL as in the question
soup = BeautifulSoup(xml.read(), 'xml')  # the 'xml' parser requires lxml to be installed
print(soup.find_all('asn'))  # returns all the <asn></asn> tags found
print(soup.find('asn').text)  # -> 3356
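For completeness, since the question already uses xml.etree.ElementTree, the same extraction works without any extra library; a minimal sketch against the response shown above:

# root is the Element returned by tree.getroot() in the question
asn = root.find('.//asn')  # './/' searches at any depth below the root
print(asn.text)  # -> 3356

# or, with a fallback when the element is missing:
print(root.findtext('.//asn', default='n/a'))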
