I am using Splash to render HTML and view website content. The script was working fine, but it suddenly stopped returning any content for https://www.galaxus.ch/search?q=5010533606001. The same script still works for other websites.
Note that I have used Splash on this same website before, and it worked fine at that time.
function main(splash, args)
  splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36")
  splash.private_mode_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(5))
  splash:set_viewport_full()
  return {
    splash:png(),
    splash:html()
  }
end
I am trying to scrape this website.
Collecting the information manually is possible.
However, I cannot get the text inside both the <p>...</p> and <ul>...</ul> tags with one loop. Both kinds of tag sit in the same div, but the loop breaks whenever a p appears where a ul was expected, or vice versa.
Is this possible with just one loop?
import requests
from bs4 import BeautifulSoup as bs
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
"Accept-Encoding":"gzip, deflate, br",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"DNT":"1",
"Connection":"close",
"Upgrade-Insecure-Requests":"1"}
source = requests.get('https://insights.blackcoffer.com/how-small-business-can-survive-the-coronavirus-crisis/',
headers=headers)
page = source.content
soup = bs(page, 'html.parser')
information = ''
for section in soup.find('div', class_='td-post-content').find_all('p'):
    if information != '':
        information = information + '\n' + section.text
    else:
        information = section.text
print(information)
import requests
from bs4 import BeautifulSoup as bs
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
"Accept-Encoding":"gzip, deflate, br",
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"DNT":"1",
"Connection":"close",
"Upgrade-Insecure-Requests":"1"}
source = requests.get('https://insights.blackcoffer.com/how-small-business-can-survive-the-coronavirus-crisis/',
headers=headers)
page = source.content
soup = bs(page, 'html.parser')
information = ''
# Passing a list to find_all matches either tag, in document order
for section in soup.find('div', class_='td-post-content').find_all(['p', 'li']):
    information += '\n\n' + section.text
print(information.strip())
'Edge 75' will be (is?) the first Chromium-based Edge browser. How can I check whether the browser is the Chromium-based Edge?
(What I really want to know is whether the browser fully supports data URIs - https://caniuse.com/#feat=datauri - so feature detection would be even better. If you know a way to do that, I can change the question.)
You could use window.navigator.userAgent to check whether the browser is Microsoft Chromium Edge or Chrome.
The code is as follows:
<script>
    var browser = (function (agent) {
        switch (true) {
            case agent.indexOf("edge") > -1: return "edge";
            case agent.indexOf("edg/") > -1: return "chromium based edge (dev or canary)"; // Match also / to avoid matching for the older Edge
            case agent.indexOf("opr") > -1 && !!window.opr: return "opera";
            case agent.indexOf("chrome") > -1 && !!window.chrome: return "chrome";
            case agent.indexOf("trident") > -1: return "ie";
            case agent.indexOf("firefox") > -1: return "firefox";
            case agent.indexOf("safari") > -1: return "safari";
            default: return "other";
        }
    })(window.navigator.userAgent.toLowerCase());
    document.body.innerHTML = window.navigator.userAgent.toLowerCase() + "<br>" + browser;
</script>
The Chrome browser userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/74.0.3729.169 safari/537.36
The Edge browser userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/64.0.3282.140 safari/537.36 edge/18.17763
The Microsoft Chromium Edge Dev userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/76.0.3800.0 safari/537.36 edg/76.0.167.1
The Microsoft Chromium Edge Canary userAgent:
mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/76.0.3800.0 safari/537.36 edg/76.0.167.1
As we can see, the Microsoft Chromium Edge userAgent contains the "edg/" keyword, so we can use it to tell the Chromium Edge browser apart from the Chrome browser.
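The same idea can be written as a pure function that takes the user-agent string as an argument, which makes it easy to try against the sample strings above. This is only a minimal sketch: the function name classifyUserAgent and the returned labels are made up for illustration, and it looks at the string alone (it does not check window.opr or window.chrome like the snippet above).

```javascript
// Hypothetical helper (name and labels are illustrative, not from any library).
// Classifies a browser from its user-agent string only.
function classifyUserAgent(userAgent) {
  const ua = String(userAgent).toLowerCase();

  // Order matters: both Edge variants also contain "chrome/" and "safari/",
  // so they must be tested before Chrome and Safari.
  if (ua.includes("edge/")) return "edge-legacy";   // EdgeHTML-based Edge
  if (ua.includes("edg/")) return "edge-chromium";  // Chromium-based Edge
  if (ua.includes("opr/")) return "opera";
  if (ua.includes("trident/") || ua.includes("msie ")) return "ie";
  if (ua.includes("firefox/")) return "firefox";
  if (ua.includes("chrome/")) return "chrome";
  if (ua.includes("safari/")) return "safari";
  return "other";
}

// Usage with two of the sample strings listed above.
const chromiumEdgeUa =
  "mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/76.0.3800.0 safari/537.36 edg/76.0.167.1";
const legacyEdgeUa =
  "mozilla/5.0 (windows nt 10.0; win64; x64) applewebkit/537.36 (khtml, like gecko) chrome/64.0.3282.140 safari/537.36 edge/18.17763";

console.log(classifyUserAgent(chromiumEdgeUa)); // edge-chromium
console.log(classifyUserAgent(legacyEdgeUa));   // edge-legacy
```

In a page, it would be called as classifyUserAgent(window.navigator.userAgent). User-agent strings can be spoofed or changed by the vendor, so feature detection remains the more robust option where it is available.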
According to CanIUse, the most universal feature that is unsupported on old Edge (which used the EdgeHTML engine) but supported in Edge Chromium and everywhere else (except IE) is the reversed attribute on an ol list. This attribute has the advantage of having been supported for ages in everything else.
(This is the only one I can find which covers all other browsers including Opera Mini; if that's not a worry for you there are plenty of others.)
So, you can use simple feature detection to see if you're on old Edge (or IE):
var isOldEdgeOrIE = !('reversed' in document.createElement('ol'));
I came to this question from the other side: how to check whether a pre-Chromium Edge is being used. I found the following solution (IE checks included):
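// Wrapped in a function (the name is illustrative) so the return statements are valid
function isLegacyEdgeOrIE() {
    // Legacy (EdgeHTML) Edge, up to version 18; Chromium Edge sends "Edg/" instead
    if (window.navigator.userAgent.indexOf('Edge') !== -1) {
        return true;
    }
    // IE 11
    if (window.document.documentMode) {
        return true;
    }
    // IE 10
    if (navigator.appVersion.indexOf('MSIE 10') !== -1) {
        return true;
    }
    return false;
}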
I am trying to use the Graph API upload feature (in my case from WinDev Mobile 21).
The files appear in the app folder. They are the right size and have the right extensions, but they cannot be opened.
sCd1 is ANSI string = "Authorization: Bearer"+" "+gsTokenAcc
HTTPCreateForm("driveEnq")
sContentType is ANSI string = "image/jpeg"
HTTPAddFile("driveEnq","File",sAd,sContentType)=False
sEnding is ANSI string
sHTTPRes is string
sUserAgent is ANSI string = "'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'"
IF HTTPSendForm("driveEnq",sEnding,httpPut,sUserAgent ,sCd1)=True THEN
bufResHTTP is Buffer = HTTPGetResult(httpResult)
I am convinced that this has something to do with the content type or the format in which the files are added.
The key to this appears to be to add the file (or fragment) via HTTPAddParameter with an empty parameter name:
HTTPAddParameter("form_name","",sFrag)
Sending the form as application/xml appears to help with the boundaries of each field.
Hope this helps
I want to crawl an e-commerce website using Html Agility Pack, but I have an issue: Html Agility Pack only gets the source of the front page. When I try to get the source of the inner pages or sub-items of the website, that markup is missing from the source I get back. When I click on items, I can see the code for the submenu items in Firebug, but not in the source I actually downloaded. Please tell me what to do.
using System.Net;
using HtmlAgilityPack;

string url = "";
WebClient client = new WebClient();
// Send a browser-like User-Agent with the request
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36";
string html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
With this code, Html Agility Pack only gives me the source of the first page.
I tested this; it gets the complete source of the page for a particular URL.
I am trying to decipher this information from the user-agent string on a Node.js server based on the Sails.js framework.
I have access to the user-agent in req.headers["user-agent"].
Currently I am using this function to distinguish iPhone, iPad and Android devices.
function findPlatform(userAgent){
    var iphoneReg = /\biphone\b/gi;
    var ipadReg = /\bipad\b/gi;
    var androidReg = /\bandroid\b/gi;
    if(!userAgent){
        sails.log.error("cant infer user agent");
        return "others";
    }
    if(userAgent.search(androidReg) > -1){
        return "android";
    }
    else if(userAgent.search(iphoneReg) > -1){
        return "iphone";
    }
    else if(userAgent.search(ipadReg) > -1){
        return "ipad";
    }
    else {
        return "others";
    }
}
However, I also need to distinguish between the mobile app and the mobile browser on both Android and iOS. I looked at some requests and could see that the user-agent from the mobile app looks like this:
"Mozilla/5.0 (Linux; Android 5.0; Lenovo A7000-a Build/LRX21M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/45.0.2454.95 Mobile Safari/537.36"
While from mobile browser, it looked like this:
"Mozilla/5.0 (Linux; Android 4.4.4; MI 4W Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36"
Can I use a regex to match the keyword "Version" to identify the request as coming from the app? What about iOS?
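One possible approach is sketched below, under stated assumptions rather than as a definitive answer. It assumes the app loads its pages in a system WebView. On Android, newer WebViews add a "wv" token inside the parentheses (as in the first sample above), and some older ones put "Version/x.y" directly before "Chrome/". On iOS, the check relies on the common heuristic that an embedded web view omits the "Safari/" token that Mobile Safari and other iOS browsers send. The function name classifyMobileClient and its labels are hypothetical. An app that sets its own user-agent, or that calls the server with a native HTTP client, will not match any of these patterns.

```javascript
// Hypothetical helper (name and labels are illustrative).
// Heuristically separates in-app WebViews from regular mobile browsers.
function classifyMobileClient(userAgent) {
  const ua = String(userAgent || "");

  if (/\bAndroid\b/i.test(ua)) {
    // Newer Android WebViews: "; wv)" at the end of the platform section.
    const hasWvToken = /;\s*wv\)/.test(ua);
    // Some older WebViews: "Version/4.0 Chrome/..." with no "wv" token.
    const hasVersionBeforeChrome = /\bVersion\/\d+(\.\d+)*\s+Chrome\//.test(ua);
    return (hasWvToken || hasVersionBeforeChrome) ? "android-webview" : "android-browser";
  }

  if (/\b(iPhone|iPad|iPod)\b/i.test(ua)) {
    // Assumption: iOS browsers include "Safari/", embedded web views usually do not.
    return /\bSafari\//.test(ua) ? "ios-browser" : "ios-webview";
  }

  return "other";
}

// Usage with the two Android samples from the question.
const appUa =
  "Mozilla/5.0 (Linux; Android 5.0; Lenovo A7000-a Build/LRX21M; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/45.0.2454.95 Mobile Safari/537.36";
const browserUa =
  "Mozilla/5.0 (Linux; Android 4.4.4; MI 4W Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36";

console.log(classifyMobileClient(appUa));     // android-webview
console.log(classifyMobileClient(browserUa)); // android-browser
```

If the app is under your control, appending a custom token to its user-agent (or sending a dedicated request header) is more reliable than pattern matching on strings that vendors can change.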