Cannot parse all the information of this website with a single loop

I am trying to scrape this website. Extracting the information manually is possible.
However, I cannot collect the text inside the <p>...</p> and <ul>...</ul> tags with one loop. Both kinds of tags sit in the same div, but my loop loses content whenever a ul follows a p or vice versa.
Is this possible with just one loop?
import requests
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
           "Accept-Encoding": "gzip, deflate, br",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
           "DNT": "1",
           "Connection": "close",
           "Upgrade-Insecure-Requests": "1"}

source = requests.get('https://insights.blackcoffer.com/how-small-business-can-survive-the-coronavirus-crisis/',
                      headers=headers)
page = source.content
soup = bs(page, 'html.parser')

information = ''
for section in soup.find('div', class_='td-post-content').find_all('p'):
    if information != '':
        information = information + '\n' + section.text
    else:
        information = section.text
print(information)

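Yes: find_all() accepts a list of tag names, so a single loop can visit both the paragraphs and the list items in document order: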
import requests
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
           "Accept-Encoding": "gzip, deflate, br",
           "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
           "DNT": "1",
           "Connection": "close",
           "Upgrade-Insecure-Requests": "1"}

source = requests.get('https://insights.blackcoffer.com/how-small-business-can-survive-the-coronavirus-crisis/',
                      headers=headers)
page = source.content
soup = bs(page, 'html.parser')

information = ''
# passing a list of tag names matches every <p> and <li> in the order they appear
for section in soup.find('div', class_='td-post-content').find_all(['p', 'li']):
    information += '\n\n' + section.text
print(information.strip())
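Alternatively, if you just want all the readable text of that div in document order, get_text() with a separator avoids the loop entirely. A minimal sketch reusing the soup object from above; note it collects text from every descendant, not only p and li tags:

# grabs all descendant text in document order, joined with newlines
article = soup.find('div', class_='td-post-content')
print(article.get_text('\n', strip=True))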

Related

How to get the complete URL by making a google search with BS4 and Requests

So, I was making a program that searches Google and fetches all the results for a given keyword. I wanted to get all the URLs and print them to the screen, and I decided to use BS4 for this. This is how I did it:
r = requests.get(f'https://www.google.com/search?q={dork}&start={page}', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'})
soup = BeautifulSoup(r.text, "html.parser")
urls = soup.find_all('div', attrs={'class': 'BNeawe UPmit AP7Wnd'})
for url in urls:
    url = url.split('<div class="BNeawe UPmit AP7Wnd">')[1].split('</div>')[0]
    url = url.replace(' › ', '/')
    print(f'{Fore.GREEN}{url}{Fore.WHITE}')
    open(f'results/{timeLol}/urls.txt', "a")
But it did not return the complete URL; instead, if the URL was long, it returned ... after part of it. Is there any way at all to get the complete URL, even if it means not using BS4 and Requests?
Any search query example would be appreciated.
While you don't provide a query example, you can try bs4 CSS selectors (see the CSS selectors reference):
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    # https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
    # other URLs below...
Code and a fuller example that scrapes more results:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'how to cook best corn on the cob'}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
---------
'''
https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
https://www.allrecipes.com/recipe/222352/jamies-sweet-and-easy-corn-on-the-cob/
https://www.delish.com/cooking/a22487458/corn-on-the-cob/
https://www.thekitchn.com/best-method-cook-corn-skills-showdown-23045869
https://natashaskitchen.com/15-minute-corn-on-the-cob/
https://www.thegunnysack.com/how-long-to-boil-corn-on-the-cob/
https://www.epicurious.com/recipes/food/views/basic-method-for-cooking-corn-on-the-cob-40047
https://houseofnasheats.com/the-best-boiled-corn-on-the-cob/
https://www.tasteofhome.com/article/perfect-corn-on-the-cob/
'''
Alternatively, you can do the same thing using the Google Search Results API from SerpApi, but without having to work out the parsing yourself, since it's already done for the end user. All you need to do is iterate over a structured JSON response.
It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "how to cook best corn on the cob",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
----------
'''
https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
https://www.allrecipes.com/recipe/222352/jamies-sweet-and-easy-corn-on-the-cob/
https://www.delish.com/cooking/a22487458/corn-on-the-cob/
https://www.thekitchn.com/best-method-cook-corn-skills-showdown-23045869
https://natashaskitchen.com/15-minute-corn-on-the-cob/
https://www.thegunnysack.com/how-long-to-boil-corn-on-the-cob/
https://www.epicurious.com/recipes/food/views/basic-method-for-cooking-corn-on-the-cob-40047
https://houseofnasheats.com/the-best-boiled-corn-on-the-cob/
https://www.tasteofhome.com/article/perfect-corn-on-the-cob/
'''
Disclaimer, I work for SerpApi.

How can I iterate over list of URL's to scrape the data in Scrapy?

import scrapy

class oneplus_spider(scrapy.Spider):
    name = 'one_plus'
    page_number = 0
    start_urls = [
        'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
    ]

    def parse(self, response):
        all_links = []
        total_links = []
        domain = 'https://www.amazon.com'
        href = []
        link_set = set()
        href = response.css('a.a-link-normal.a-text-normal').xpath('#href').extract()
        for x in href:
            link_set.add(domain + x)
        for x in link_set:
            next_page = x
            yield response.follow(next_page, callback=self.parse_page1)

    def parse_page1(self, response):
        title = response.css('span.a-size-large product-title-word-break::text').extract()
        print(title)
Error after running the code - (failed 2 times): 503 Service Unavailable.
I tried many ways but failed. Please help me. Thanks in advance!
Check the URL with curl first, like:
curl -I "https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3"
Then you can see the 503 response:
HTTP/2 503
In other words, your request is wrong; you have to build a proper request. Chrome DevTools will help you there.
A browser-like user-agent header seems to be required:
curl 'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3' \
    -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36' \
    --compressed
So the following may work:
import scrapy

class oneplus_spider(scrapy.Spider):
    name = 'one_plus'
    page_number = 0
    # Scrapy's UserAgentMiddleware picks up this attribute and sends it with every request
    user_agent = "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
    start_urls = [
        'https://www.amazon.com/s?k=samsung+mobile&page=3&qid=1600763713&ref=sr_pg_3'
    ]

    def parse(self, response):
        domain = 'https://www.amazon.com'
        link_set = set()
        # note: XPath attribute syntax is '@href', not '#href'
        href = response.css('a.a-link-normal.a-text-normal').xpath('@href').extract()
        for x in href:
            link_set.add(domain + x)
        for next_page in link_set:
            yield response.follow(next_page, callback=self.parse_page1)

    def parse_page1(self, response):
        # both classes belong to the same span, so chain them with dots
        title = response.css('span.a-size-large.product-title-word-break::text').extract()
        print(title)
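Alternatively, you can set the header project-wide instead of per spider. A minimal sketch, assuming a standard Scrapy project with a settings.py:

# settings.py -- the USER_AGENT setting applies to every spider in the project
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'

Keep in mind that Amazon rate-limits scrapers aggressively, so a browser-like User-Agent alone may not eliminate every 503.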

Splunk AWS ALB logs not properly parsing

I'm trying to ingest my AWS ALB logs into Splunk. I can now search my ALB logs in Splunk, but the events are still not parsing properly. Did anyone have a similar issue, or have any suggestions?
Here is my props.conf
[aws:alb:accesslogs]
SHOULD_LINEMERGE=false
FIELD_DELIMITER = whitespace
pulldown_type=true
FIELD_NAMES=type,timestamp,elb,client_ip,client_port,target,request_processing_time,target_processing_time,response_processing_time,elb_status_code,target_status_code,received_bytes,sent_bytes,request,user_agent,ssl_cipher,ssl_protocol,target_group_arn,trace_id
EXTRACT-elb = ^\s*(?P<type>[^\s]+)\s+(?P<timestamp>[^\s]+)\s+(?P<elb>[^\s]+)\s+(?P<client_ip>[0-9.]+):(?P<client_port>\d+)\s+(?P<target>[^\s]+)\s+(?P<request_processing_time>[^\s]+)\s+(?P<target_processing_time>[^\s]+)\s+(?P<response_processing_time>[^\s]+)\s+(?P<elb_status_code>[\d-]+)\s+(?P<target_status_code>[\d-]+)\s+(?P<received_bytes>\d+)\s+(?P<sent_bytes>\d+)\s+"(?P<request>.+)"\s+"(?P<user_agent>.+)"\s+(?P<ssl_cipher>[-\w]+)\s*(?P<ssl_protocol>[-\w\.]+)\s+(?P<target_group_arn>[^\s]+)\s+(?P<trace_id>[^\s]+)
EVAL-rtt = request_processing_time + target_processing_time + response_processing_time
Sample data
https 2020-08-20T12:40:00.274478Z app/my-aws-alb/e7538073dd1a6fd8 162.158.26.188:21098 172.0.51.37:80 0.000 0.004 0.000 405 405 974 424 "POST https://my-aws-alb-domain:443/api/ps/fpx/callback HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.2840.91 Safari/537.36" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:ap-southeast-1:111111111111:targetgroup/my-aws-target-group/41dbd234b301e3d84 "Root=1-5f3e6f20-3fdasdsfffdsf" "api.mydomain.com" "arn:aws:acm:ap-southeast-1:11111111111:certificate/be4344424-a40f-416e-8434c-88a8a3b072f5" 0 2020-08-20T12:40:00.270000Z "forward" "-" "-" "172.0.51.37:80" "405" "-" "-"
Using transforms is pretty straightforward. Start with a stanza in transforms.conf.
[elb]
REGEX = ^\s*(?P<type>[^\s]+)\s+(?P<timestamp>[^\s]+)\s+(?P<elb>[^\s]+)\s+(?P<client_ip>[0-9.]+):(?P<client_port>\d+)\s+(?P<target>[^\s]+)\s+(?P<request_processing_time>[^\s]+)\s+(?P<target_processing_time>[^\s]+)\s+(?P<response_processing_time>[^\s]+)\s+(?P<elb_status_code>[\d-]+)\s+(?P<target_status_code>[\d-]+)\s+(?P<received_bytes>\d+)\s+(?P<sent_bytes>\d+)\s+"(?P<request>.+)"\s+"(?P<user_agent>.+)"\s+(?P<ssl_cipher>[-\w]+)\s*(?P<ssl_protocol>[-\w\.]+)\s+(?P<target_group_arn>[^\s]+)\s+(?P<trace_id>[^\s]+)
Then refer to the transform in props.conf
[aws:alb:accesslogs]
TIME_PREFIX = https\s
TIME_FORMAT = %Y-%m-%dT%H:%M:%S.%6N%Z
MAX_TIMESTAMP_LOOKAHEAD = 32
SHOULD_LINEMERGE=false
NO_BINARY_CHECK=true
TRANSFORMS-elb = elb
EVAL-rtt = request_processing_time + target_processing_time + response_processing_time
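If you want to sanity-check the capture groups outside Splunk first, you can run the same pattern through Python's re module. A minimal sketch; Splunk uses PCRE, but the constructs in this pattern behave the same in both engines, and the regex and sample here are abridged to the first five fields for readability:

import re

# first five fields of the transform's pattern -- paste the full
# EXTRACT/TRANSFORMS regex and the full sample event when testing
ALB_REGEX = (r'^\s*(?P<type>[^\s]+)\s+(?P<timestamp>[^\s]+)\s+(?P<elb>[^\s]+)\s+'
             r'(?P<client_ip>[0-9.]+):(?P<client_port>\d+)')

sample = ('https 2020-08-20T12:40:00.274478Z app/my-aws-alb/e7538073dd1a6fd8 '
          '162.158.26.188:21098 172.0.51.37:80 0.000 0.004 0.000 405 405 974 424')

m = re.search(ALB_REGEX, sample)
if m:
    for field, value in m.groupdict().items():
        print(f'{field} = {value}')
else:
    print('no match -- compare the field list against the current ALB log format')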

How to detect Microsoft Chromium Edge (chredge , edgium) in Javascript

'Edge 75' will be (is?) the first Chromium-based Edge browser. How can I check whether the browser is Chromium-based Edge?
(What I really want to know is whether the browser fully supports data URIs - https://caniuse.com/#feat=datauri - so feature detection would be even better. If you know a way to do that, I can change the question.)
You could use the window.navigator userAgent to check whether the browser is Microsoft Chromium Edge or Chrome.
Code as below:
<script>
    var browser = (function (agent) {
        switch (true) {
            case agent.indexOf("edge") > -1: return "edge";
            case agent.indexOf("edg/") > -1: return "chromium based edge (dev or canary)"; // Match also / to avoid matching for the older Edge
            case agent.indexOf("opr") > -1 && !!window.opr: return "opera";
            case agent.indexOf("chrome") > -1 && !!window.chrome: return "chrome";
            case agent.indexOf("trident") > -1: return "ie";
            case agent.indexOf("firefox") > -1: return "firefox";
            case agent.indexOf("safari") > -1: return "safari";
            default: return "other";
        }
    })(window.navigator.userAgent.toLowerCase());

    document.body.innerHTML = window.navigator.userAgent.toLowerCase() + "<br>" + browser;
</script>
The Chrome browser userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml,
like gecko) chrome/74.0.3729.169 safari/537.36
The Edge browser userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml,
like gecko) chrome/64.0.3282.140 safari/537.36 edge/18.17763
The Microsoft Chromium Edge Dev userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml,
like gecko) chrome/76.0.3800.0 safari/537.36 edg/76.0.167.1
The Microsoft Chromium Edge Canary userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml,
like gecko) chrome/76.0.3800.0 safari/537.36 edg/76.0.167.1
As we can see, the Microsoft Chromium Edge userAgent contains the "edg" token, so we can use it to detect whether the browser is Chromium Edge or Chrome.
Using CanIUse, the most universal feature which is unsupported on old Edge (which used the EdgeHTML engine) but supported in Edge Chromium and everywhere else (except IE) is the reversed attribute on an OL list. This attribute has the advantage of having been supported for ages in everything else.
(This is the only one I can find which covers all other browsers including Opera Mini; if that's not a worry for you there are plenty of others.)
So, you can use simple feature detection to see if you're on Old Edge (or IE) -
var isOldEdgeOrIE = !('reversed' in document.createElement('ol'));
Since I found this question from the other side - how to check whether a pre-Chromium Edge is being used - here is the solution I found, wrapped as a function (IE checks included):
function isLegacyEdgeOrIE() {
    // Edge < 18
    if (window.navigator.userAgent.indexOf('Edge') !== -1) {
        return true;
    }
    // IE 11
    if (window.document.documentMode) {
        return true;
    }
    // IE 10
    if (navigator.appVersion.indexOf('MSIE 10') !== -1) {
        return true;
    }
    return false;
}

Google Authentication - User agent gives error on WebView (Nylas api)

We are using the Nylas API to get an access token for different types of email accounts, such as Gmail and Outlook, but we couldn't authenticate Gmail accounts.
let myURL = URL(string: getNylasAuthUrl())
let userAgent = getUserAgentParams()
webView.customUserAgent = userAgent
let myRequest = URLRequest(url: myURL!)
webView.load(myRequest)
Google's sign-in page returned an error.
Finally found a way: setting the User-Agent makes Gmail authentication work.
We tried the User-Agents below first, but they didn't help:
let userAgent = "Mozilla/5.0 (Apple \(Utils.getDeviceModel()) ) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
let userAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
let userAgent = "Mozilla/5.0 (Google) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
Finally, I found the working user-agent.
let userAgent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_2 like Mac OS X)
AppleWebKit/603.1.30 (KHTML, like Gecko) Mobile/14F89 Safari/602.1"
If you want to do Google auth via a WebView, use this User-Agent, especially for getting the access token.
