HTML Agility Pack is getting source of first web page only - parsing

I want to crawl an e-commerce website using Html Agility Pack, but I am having an issue: Html Agility Pack is only getting the source of the front page. When I try to get the source of the inner or sub-items of that website, that chunk of markup is not present in the source that Html Agility Pack returns. When I click on items I can see the code of the submenu items through Firebug, but not in the actual source that I have. So please guide me or tell me what to do.
string url="";
WebClient client = new WebClient();
client.Headers[HttpRequestHeader.UserAgent] = "Mozilla/45.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36";
html = client.DownloadString(url);
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
With this code, using Html Agility Pack, I can only get the code of the first web page.

I tested this; it gets the code of the whole website for the particular URL.


Splash: Unable to view web contents

I was trying to render HTML and view website content using Splash. The script was working fine, but suddenly I'm unable to view the content of https://www.galaxus.ch/search?q=5010533606001 using the same script. The script still works fine for other websites, but for this one I'm unable to view any content.
Note that I've used Splash for the same website before, and it worked fine at that time.
function main(splash, args)
    splash:set_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36")
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(5))
    splash:set_viewport_full()
    return {
        splash:png(),
        splash:html()
    }
end

How to get the complete URL by making a Google search with BS4 and Requests

So, I was making a program that searches Google and fetches all the results for a given keyword. I wanted to get all the URLs and print them to the screen, and I decided to use BS4 for this. This is how I did it:
import requests
from bs4 import BeautifulSoup
from colorama import Fore

r = requests.get(f'https://www.google.com/search?q={dork}&start={page}',
                 headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0'})
soup = BeautifulSoup(r.text, "html.parser")
urls = soup.find_all('div', attrs={'class': 'BNeawe UPmit AP7Wnd'})
for url in urls:
    # convert the Tag to a string before slicing out the displayed URL text
    url = str(url).split('<div class="BNeawe UPmit AP7Wnd">')[1].split('</div>')[0]
    url = url.replace(' › ', '/')
    print(f'{Fore.GREEN}{url}{Fore.WHITE}')
    open(f'results/{timeLol}/urls.txt', "a")
But it did not return the complete URL; instead, if the URL was long, it returned ... after part of the URL. Is there any way at all to get the complete URL, even if it means not using BS4 and Requests?
Any search query example would be appreciated.
Since you didn't provide a query example, you can try to use bs4 CSS selectors (CSS selectors reference):
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    # https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
    # other URLs below...
Code and example in the online IDE that scrapes more results:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {'q': 'how to cook best corn on the cob'}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
---------
'''
https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
https://www.allrecipes.com/recipe/222352/jamies-sweet-and-easy-corn-on-the-cob/
https://www.delish.com/cooking/a22487458/corn-on-the-cob/
https://www.thekitchn.com/best-method-cook-corn-skills-showdown-23045869
https://natashaskitchen.com/15-minute-corn-on-the-cob/
https://www.thegunnysack.com/how-long-to-boil-corn-on-the-cob/
https://www.epicurious.com/recipes/food/views/basic-method-for-cooking-corn-on-the-cob-40047
https://houseofnasheats.com/the-best-boiled-corn-on-the-cob/
https://www.tasteofhome.com/article/perfect-corn-on-the-cob/
'''
Alternatively, you can do the same thing using the Google Search Results API from SerpApi, without having to think about how to parse anything, since that is already done for the end user. All that needs to be done is to iterate over the structured JSON.
It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "how to cook best corn on the cob",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)
----------
'''
https://spicysouthernkitchen.com/best-way-to-cook-corn-on-the-cob/
https://www.allrecipes.com/recipe/222352/jamies-sweet-and-easy-corn-on-the-cob/
https://www.delish.com/cooking/a22487458/corn-on-the-cob/
https://www.thekitchn.com/best-method-cook-corn-skills-showdown-23045869
https://natashaskitchen.com/15-minute-corn-on-the-cob/
https://www.thegunnysack.com/how-long-to-boil-corn-on-the-cob/
https://www.epicurious.com/recipes/food/views/basic-method-for-cooking-corn-on-the-cob-40047
https://houseofnasheats.com/the-best-boiled-corn-on-the-cob/
https://www.tasteofhome.com/article/perfect-corn-on-the-cob/
'''
Disclaimer: I work for SerpApi.

Files cannot be read after WINDEV Mobile upload using Microsoft Graph API

I am trying to use the Graph API upload feature (in my case using WINDEV Mobile 21).
The files are appearing in the app folder. They are the right size and have the right extensions, but they cannot be opened.
sCd1 is ANSI string = "Authorization: Bearer" + " " + gsTokenAcc
HTTPCreateForm("driveEnq")
sContentType is ANSI string = "image/jpeg"
HTTPAddFile("driveEnq", "File", sAd, sContentType)=False
sEnding is ANSI string
sHTTPRes is string
sUserAgent is ANSI string = "'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'"
IF HTTPSendForm("driveEnq", sEnding, httpPut, sUserAgent, sCd1) = True THEN
    bufResHTTP is Buffer = HTTPGetResult(httpResult)
I am convinced that this is something to do with the content type or the format in which the files are added.
The key to this appears to be to add the file (or fragment) via an empty parameter name:
HTTPAddParameter("form_name", "", sFrag)
Adding the form as application/XML appears to help with the boundaries of each field.
Hope this helps.
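For comparison, here is a minimal sketch of the same kind of upload outside WINDEV (Python with requests, purely for illustration; the token, file name, and the use of the app-folder simple-upload endpoint for files under 4 MB are assumptions, not taken from the question). The point it illustrates is the one above: the file bytes go as the raw request body, so no multipart boundaries end up inside the stored file.
import requests

token = "<access-token>"   # hypothetical OAuth access token
file_path = "photo.jpg"    # hypothetical local file

with open(file_path, "rb") as f:
    data = f.read()

# Simple upload of the driveItem content into the application's app folder
resp = requests.put(
    "https://graph.microsoft.com/v1.0/me/drive/special/approot:/photo.jpg:/content",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "image/jpeg",
    },
    data=data,  # raw bytes as the body, no multipart form and no boundaries
)
resp.raise_for_status()
print(resp.json()["id"])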

Weird characters in URL

On my web server, when a user requests a URL with weird characters, I remove those characters, and the system logs these cases. When I checked the sanitized cases I found the requests below. I'm curious what the objective of these URLs would be.
I checked the IPs, and these are real people who use the website normally. But about one in every twenty URL requests from these people has these weird characters at the end.
http://example.com/#%EF%BF%BD%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0,
http://example.com/%60E%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%60E%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/p%EF%BF%BD%1D%01?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%EF%BF%BDC%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%3E?, agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0
http://example.com/%EF%BF%BDR%EF%BF%BD%02?o=3&g=&s=&z=%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD%EF%BF%BD`%EF%BF%BD%EF%BF%BD%7F, agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
http://example.com/%EF%BF%BDe%EF%BF%BDv8%01%EF%BF%BD?o=3&g=P%01%EF%BF%BD&s=&z=%EF%BF%BD%EF%BF%BD%15%01%EF%BF%BD%EF%BF%BD, agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
http://en.wikipedia.org/wiki/Specials_(Unicode_block)
They are essentially malformed URLs. They can be generated by a specific piece of malware that is trying to exploit website vulnerabilities, by a malfunctioning browser plugin or extension, or by a bug in a JS file (e.g. Google Analytics tracking) in combination with a specific browser version/operating system. In any case, you can't actually control what requests will come from a client, and there's nothing you can do to stop them, so if your generated HTML/JS code is correct, you have done your work.
If you would like to correct those URLs for any reason, you can enable URL rewriting and set a rule with a regular expression filter to transform them into valid URLs. However, I don't suggest doing that: the web server should respond with an error 404 Page Not Found, because that is the standard (it's a client error, after all), and in my opinion this is a faster and safer method than applying URL rewriting (the rewriting procedure may contain bugs that someone could try to exploit, etc.).
For the sake of curiosity, you can easily decode those URLs with an online URL decoder of your choice, but essentially you will discover what you already know: there are a lot of UTF-8 replacement characters in those URLs.
In fact, %EF%BF%BD is the URL-encoded version of the three bytes (EF BF BD) of the UTF-8 replacement character. You can also see that character written as �, EF BF BD, U+FFFD, or ï ¿ ½, and so on, depending on the representation method you choose.
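To see that round trip concretely, here is a small Python check (purely illustrative, not part of the original answer):
from urllib.parse import quote, unquote

# U+FFFD is the Unicode replacement character
print(unquote('%EF%BF%BD'))            # '\ufffd', displayed as �
print(quote('\ufffd'))                 # '%EF%BF%BD'
print('\ufffd'.encode('utf-8').hex())  # 'efbfbd'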
Also, you can check for yourself how the client handles that character. Go here:
http://www.cogsci.ed.ac.uk/~richard/utf-8.cgi?input=%EF%BF%BD&mode=char
Press the GO button and, using your browser's developer tools, check what really happens: the browser encodes the unknown character as %EF%BF%BD before sending it to the web server.
These look like corrupted URLs being inserted by a piece of malware/adware called "Adpeak".
Here are some details on Adpeak:
How to remove AdPeak lqw.me script from my web pages?
Adpeak has a client-side component that sticks the following tag into web pages:
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="http://d.lqw.me/xuiow/?g=7FC3E74A-AFDA-0667-FB93-1C86261E6E1C&s=4150&z=1385998326"></script>
Adpeak also sometimes uses the host names "d.sitespeeds.com", "d.jazzedcdn.com", "d.deliversuper.com", "d.blazeapi.com", "d.quikcdn.com", probably others. Here are a few more examples:
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="http://d.deliversuper.com/xuiow/?o=3&g=823F0056-D574-7451-58CF-01151D4A9833&s=7B0A8368-1A6F-48A5-B236-8BD61816B3F9&z=1399243226"></script>
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="http://d.jazzedcdn.com/xuiow/?o=3&g=B43EA207-C6AC-E01B-7865-62634815F491&s=B021CBBD-E38E-4F8C-8E93-6624B0597A23&z=1407935653"></script>
<SCRIPT id=2f2a695a6afce2c2d833c706cd677a8e type=text/javascript src="http://d.lqw.me/xuiow/?o=3&g=87B35A3E-C25D-041E-0A0F-C3E8E473A019&s=BBA5481A-926B-4561-BD79-249F618495E6&z=1393532281"></SCRIPT>
<SCRIPT id=2f2a695a6afce2c2d833c706cd677a8e type=text/javascript src="http://d.lqw.me/xuiow/?o=2&g=0AD3E5F2-B632-382A-0473-4C994188DBBA&s=9D0EB5E9-CCC9-4360-B7CA-3E645650CC53&z=1387549919"></SCRIPT>
The "id" is consistent: it's always "2f2a695a6afce2c2d833c706cd677a8e" in the cases we've seen.
There's always a "g", "s", and "z" parameter, and sometimes a "o" parameter that has values of 2 or 3.
We've noticed that with our pages, a certain version of this script is 100% correlated with seeing corrupted characters in the DOM: if "o" is omitted or set to 2, we'll see a U+FFFD injected near the end of the page, or sometimes a U+000E character (a.k.a. SHIFT OUT), which blows up standard JSON/XML serialization libraries; that is why we've been researching these URLs. We've never seen a corruption for "o=3".
However, sometimes it looks like Adpeak gets confused, and inserts junk like this:
<script type="text/javascript" id="2f2a695a6afce2c2d833c706cd677a8e" src="��?o=3&g=&s=&z=����������~?"></script>
Now, we can't be certain this is Adpeak, because the URLs are mangled, but the "o=3", "g", "s", and "z" parameters are four smoking guns. The host is missing here, so the URL resolves against our server, and those U+FFFDs get sent up as UTF-8 percent-encoded "%EF%BF%BD" sequences, which are identical to what people have been seeing above.
If you're curious about how common this is: for a particular customer with high traffic and a wide demographic, we see Adpeak URLs injected into about 1.09% of their web pages, both well-formed Adpeak URLs and URLs with U+FFFDs. If you just look for Adpeak URLs with U+FFFD sequences, those appear in 0.053% of all web pages. And if you just look for Adpeak URLs that cause DOM corruptions (e.g., the valid URLs that contain "o=2" or no "o" parameter), that covers 0.20% of all web pages.
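If you want to scan your own pages for these injections, here is a minimal heuristic sketch in Python (not from the original answer; the id value and host names are taken from the examples above, and the pattern for the mangled variant is an assumption):
import re

# The script id that has been consistent across observed Adpeak injections,
# plus the host names listed above.
ADPEAK_ID = "2f2a695a6afce2c2d833c706cd677a8e"
ADPEAK_HOSTS = ("d.lqw.me", "d.sitespeeds.com", "d.jazzedcdn.com",
                "d.deliversuper.com", "d.blazeapi.com", "d.quikcdn.com")

def looks_like_adpeak(html: str) -> bool:
    """Heuristic check for an injected Adpeak <script> tag in page HTML."""
    if ADPEAK_ID in html:
        return True
    if any(host in html for host in ADPEAK_HOSTS):
        return True
    # Mangled variant: a relative src carrying the o/g/s/z query parameters
    return re.search(r'src="[^"]*\?o=\d&g=[^"]*&s=[^"]*&z=', html) is not None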
Probably your site's character set is not initialized to UTF-8, but when a page on the site is requested, the characters are assumed to be UTF-8 encoded. When it "understands" that the characters are not in UTF-8 format, it replaces any character it doesn't recognize with the byte sequence EF BF BD (the replacement character, acting as a placeholder).
Make sure you use UTF-8 everywhere in your site by adding <meta charset="UTF-8"> to every page.
Another example of this in a different situation: What's going on with this byte array?
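As a quick illustration of that substitution (a Python sketch, not part of the original answer), decoding bytes that are not valid UTF-8 with a lenient decoder produces exactly those EF BF BD bytes:
# A Latin-1 encoded "é" (0xE9) is not valid UTF-8 on its own, so a lenient
# decoder substitutes the replacement character U+FFFD.
raw = "é".encode("latin-1")                # b'\xe9'
decoded = raw.decode("utf-8", errors="replace")
print(decoded)                             # '�'
print(decoded.encode("utf-8").hex())       # 'efbfbd'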
You have to use regular expression functions; search for them on the official PHP site or Google it.
URLs in languages other than English are what cause this problem.
A meta charset of UTF-8 will not affect the URL, so it won't help; the meta charset only helps you display text in other languages on your web page, not in your URL.
Using PHP regex you can show even Chinese text in a URL.
Hope it will work.
Just uncheck the Enable Browser Link option in Visual Studio. Everything will work out of the box.

How to get the user agent from a Thymeleaf HTML page

In an application I am using Spring + Thymeleaf. I want to get the user agent in order to include certain files.
<%
String browser = request.getHeader("User-Agent");
%>
I need to get this done in a Thymeleaf page. How can I do that? Any help will be appreciated.
SHANib's answer didn't work for me. getRequest on ServletRequest has no parameters, at least on my version. However,
<span th:text="${#httpServletRequest.getHeader('User-Agent')}">Agent</span>
worked just fine.
You can access the HttpServletRequest object with #httpServletRequest.
So, for example, you can print the user agent like this:
<span th:text="${#httpServletRequest.getRequest('User-Agent')}">Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/5.0)</span>
