Parser returns wrong URL

I'm parsing dialect words from http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna.
from urllib import request
from bs4 import BeautifulSoup
from nltk import corpus, word_tokenize, FreqDist, ConditionalFreqDist

url = 'http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?regione=Sardegna'
dialettando_tokens = []

while url:
    html = request.urlopen(url).read().decode('utf8')
    page = BeautifulSoup(html, 'html.parser')
    a_list = page.find_all('a')
    for a in a_list:
        try:
            a_str = str(a.contents[0])
            if a_str[:3] == '<b>' and a.contents[0].string:
                dialettando_tokens.append(a.contents[0].string.strip())
        except:
            pass
        if a.string == 'Simonelli Editore Srl':
            break
        elif a.string == 'PROSSIMI':
            link = a['href']
            url = 'http://www.dialettando.com/dizionario/' + link
            break
        else:
            url = ''
At the end of each iteration I need to parse the URL of the next page.
HTML of the link:
<a href="hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna">PROSSIMI</a>
And I need to get this link:
'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna'
BUT the parser returns:
'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto®ione=Sardegna'
This link doesn't work correctly and I can't understand what's wrong.

An href needs to have the ampersand character escaped (as &amp;); see this related question. It is possible the site you are scraping does not escape ampersands inside href values, hoping they never accidentally form an HTML entity reference, except in your case they did: &regione begins with &reg, the entity for ®. So you are dealing with buggy HTML, plus a parser that did not notice the semicolon was missing and performed the entity conversion anyway.
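One possible workaround, sketched here on the assumption that the only corruption is the &reg → ® substitution described above, is to undo that substitution before building the next URL:

from urllib.parse import urljoin

# The href as the parser hands it back, with "&reg" already turned into "®":
link = 'hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto\u00aeione=Sardegna'

# Put the literal "&reg" back, then resolve it against the base directory.
link = link.replace('\u00ae', '&reg')
print(urljoin('http://www.dialettando.com/dizionario/', link))
# http://www.dialettando.com/dizionario/hitlist_regioni_new.lasso?saltarec=20&ordina=parola_dialetto&regione=Sardegna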

Related

Weird URL Encoding/Decoding for non-English Characters

How and why does a non-English word get converted to weird characters, e.g. پاکستان to Ù¾Ø§Ú©Ø³ØªØ§Ù†, and is there any way back to get پاکستان from that? It happens both in code rendered in the browser and in requests received at the server.
Background:
I get a lot of requests at my non-English-content (Urdu) website with URLs in which the Urdu words (e.g. پاکستان) arrive as garbled characters like the sample string at the end of this post.
I tried to find out what that means, but search engines don't help. I tried searches like
Decode this 'mystring'
What encoding is this 'mystring'
I thought it might be a corrupted/spam URL, based on this link:
Weird characters in URL
Problem explanation/example
But when I viewed one of my JS files in the browser (while checking a working JS file), it showed me the same weird characters, even at localhost. The actual source code contains
'pakistan': {'eng': 'Pakistan', 'ur': 'پاکستان'},
but the browser renders that same line with the garbled characters instead.
My knowledge
The only encoding/decoding I know about seems unrelated here, to the best of my knowledge:
encodeURI and decodeURI in JS, or quote and unquote in Python, and the same for other languages. But all they do for me is turn
`پاکستان` into `%D9%BE%D8%A7%DA%A9%D8%B3%D8%AA%D8%A7%D9%86` and vice versa.
Why needed?
I don't want to miss the requests that arrive with those malformed URLs. There must be a way to undo the damage: all browsers (Chrome/Firefox/Edge) show those characters the same way, and if their conversion method and result are the same, then there should be a technique to reverse it as well.
Thanks to Giacomo Catenazzi and to the following answer:
How to decode cp1252 string?
Here is a very custom and still imperfect solution to my problem.
The algorithm needs improvement: I only found out by experiment that it works at all, and it fails when the string is long or includes hyphens (-).
So I adapted it to my requirements, and it works well enough that I can guess what the actual string was.
import re, itertools
from lxml.builder import unicode  # provides a `unicode` name that works on Python 3

def specific_my_required_processing(received_string):
    # Characters that only show up at the start of still-garbled segments.
    starting_characters_in_encoded_string_in_my_case = ['Ø', 'Ã', 'Ù', 'Ù', 'Ú']
    arr = received_string.split('-')
    res = []
    missed = []
    for string_item in arr:
        decoded_string = guess_decode_string_without_hyphens(string_item)
        if decoded_string and decoded_string[:1] not in starting_characters_in_encoded_string_in_my_case:
            res.append(decoded_string)
        else:
            missed.append({string_item: decoded_string})
    resulting_urdu_string = '-'.join(res)
    print('\n\nResult', resulting_urdu_string)
    print('\nCould not be decoded', missed)

def guess_decode_string_without_hyphens(s):
    # Try every encode/decode chain of 2, 4, 6 or 8 steps over these codecs
    # until the result looks like plain text.
    encodings = ['cp1251', 'cp1252', 'utf8']
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    r = r.encode(enc) if isinstance(r, unicode) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if re.match(u'^[\w\sа-яА-Я]+$', r):
                res = str(r)
                print('Encoding => ', encs, ' Conversion = ' + s + ' => ' + res)
                return res

sample_encoded_string = 'اسلام-آباد-Ûائیکورٹ-ای-ÙˆÛŒ-ایم-قانون-سازی-کالعدم-قرار-دینے-Ú©ÛŒ-درخواست-نامکمل-قرار'
specific_my_required_processing(sample_encoded_string)
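For reference, the single encode/decode round trip that the brute-force search above automates can be shown directly. A minimal sketch, assuming the common single-layer case where UTF-8 bytes were misread as cp1252 (the loop over encoding chains exists because the real strings appear to have gone through more than one such step):

# One mojibake round trip: text whose UTF-8 bytes were misread as cp1252 is
# re-encoded to the original bytes and then decoded as UTF-8 again.
garbled = 'Ù¾Ø§Ú©Ø³ØªØ§Ù†'
restored = garbled.encode('cp1252').decode('utf8')
print(restored)  # پاکستان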

Saving SEC 10-K annual report text to files (trouble with decoding)

I am trying to bulk-download the text visible to the "end user" from 10-K SEC EDGAR reports (I don't care about the tables) and save it to a text file. I found the code below on YouTube; however, I am facing 2 challenges:
1. I am not sure if I am capturing all the text, and when I print what was downloaded from the URL below, I get very weird output (special characters, e.g. at the very end of the print-out).
2. I can't seem to save the text to .txt files; I'm not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to the specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"

# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content, 'html5lib')
page_text = page_soup.html.body.get_text(' ', strip=True)

# normalize the text, remove characters. Additionally, restore missing windows-1252 characters.
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))

# print: this works, however it gives me weird special characters (e.g., at the very end)
print(page_text_norm)

# save to file: this only gives me an empty text file
with open('testfile.txt', 'w') as file:
    file.write(page_text_norm)
Try this. If you include an example of the data you expect, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
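If you would rather keep the requests/BeautifulSoup code from the question, the empty file is most likely a write-time encoding problem: open(..., 'w') uses the platform's default codec (for example cp1252 on Windows), which can fail on the normalized text and leave the file empty. A minimal sketch, assuming UTF-8 output is acceptable and reusing page_text_norm from the question's code:

# Write with an explicit codec so the platform default cannot reject characters.
with open('testfile.txt', 'w', encoding='utf-8', errors='replace') as f:
    f.write(page_text_norm)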

Applying a "Less-than or equal" filter to a URL

I bumped into an annoying problem today. I'm building an app that queries a homemade API and parses the answer, a very classic setup.
But due to the huge amount of data I'm receiving, I'd like to apply "<=" and ">=" filters to my request, and it doesn't work: the resulting URL object is nil.
Here's the code:
print(request.url) // prints the expected URL
var uerel = URL(string: request.url)
print(uerel) // prints 'nil'
Output:
https://XXXXXXX.eu/YYYYYY?my_filter_id=5b057e27443318329d694d64&date>=2016-01-01T08:00:00.000Z&date<=2016-01-01T20:00:00.000Z
nil
The intriguing thing is that if I remove the < and the >, it works like a charm.
I searched the official documentation for the URL type, but it doesn't seem to require any special encoding..?
I also took a look at RFC 1808, as mentioned in said official documentation, and these special characters are marked as punctuation, so I believed it was OK to put them in a URL.
Where does the problem come from?
You need to encode that URL, since the < and > signs are not valid in a URL and must be percent-encoded.
let unencodedUrlString = "https://XXXXXXX.eu/YYYYYY?my_filter_id=5b057e27443318329d694d64&date>=2016-01-01T08:00:00.000Z&date<=2016-01-01T20:00:00.000Z"
guard let encodedUrlString = unencodedUrlString.addingPercentEncoding(withAllowedCharacters: .urlQueryAllowed), let url = URL(string: encodedUrlString) else { return }
The encoded URL will not contain the < and > symbols as you can see:
https://XXXXXXX.eu/YYYYYY?my_filter_id=5b057e27443318329d694d64&date%3E=2016-01-01T08:00:00.000Z&date%3C=2016-01-01T20:00:00.000Z

How to concatenate API request URL safely

Let's imagine I have the following parts of a URL:
val url_start = "http://example.com"
val url_part_1 = "&fields[...]&" // This part of the URL can be in the middle of the URL or at the end
val url_part_2 = "&include..."
And then I try to concatenate the resulting URL like this:
val complete_url = url_start + url_part_2 + url_part_1
In this case I'd get http://example.com&include...&fields[...]& (don't worry about the exact syntax here), with a single & symbol between the URL parts, which means the concatenation was successful. BUT if I use a different concatenation order in another request, like this:
val complete_url = url_start + url_part_1 + url_part_2
I'd get http://example.com&fields[...]&&include..., i.e. && in this case. Is there a way to make the concatenation safer?
To keep your code clean, use an array or object to hold your params, and don't keep "?" or "&" as part of urlStart or the params. Add these at the end, e.g.:
var urlStart = "http://example.com"
var params=[]
params.push ('a=1')
params.push ('b=2')
params.push ('c=3', 'd=4')
url = urlStart + '?' + params.join('&')
console.log (url) // http://example.com?a=1&b=2&c=3&d=4
First, you should note that it is invalid to have query parameters directly after the domain name; it should be something like http://example.com/?include...&fields[...] (note the /? part; you can replace it with / to make them path parameters, but it's unlikely that the site's router supports parameters like that). Refer, for example, to this article: https://www.talisman.org/~erlkonig/misc/lunatech%5Ewhat-every-webdev-must-know-about-url-encoding/ to learn more about which URLs are valid.
For the simple abstract approach, you can use Kotlin's joinToString():
val query_part = arrayOf(
    "fields[...]",
    "include..."
).joinToString("&")
val whole_url = "http://example.com/?" + query_part
print(whole_url) // http://example.com/?fields[...]&include...
This approach is abstract because you can use joinToString() not only for URLs but for any strings you want. That also means that if one of the input strings itself contains an & symbol, it will turn into two parameters in the output string. This is not a problem when you, as the programmer, know which strings will be joined, but it can become one if the strings are provided by the user.
For a URL-aware approach, you can use URIBuilder from the Apache HttpComponents library, but you'll need to add that library as a dependency first.

Robots.txt flexibility with top level domains

So the only problem I have left for this web crawler is making it so that, when the top-level domain changes, say from imdb to youtube, it switches from following imdb's robots.txt disallow rules to youtube's. I believe it can all be fixed just by how the variables are declared at the beginning.
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re

re.IGNORECASE = True

# SourceUrl
url = "http://www.imdb.com"
urls = [url]
visited = [url]
robotsUrl = url + '/robots.txt'

while len(urls) < 250000:
    try:
        htmltext = urllib.request.urlopen(urls[0]).read()
        robots = urllib.request.urlopen(robotsUrl).read()
        disallowList = re.findall(b'Disallow\:\s*([a-zA-Z0-9\*\-\/\_\?\.\%\:\&]+)', robots)
    except:
        print(urls[0])
    sourceCode = BeautifulSoup(htmltext, "html.parser")
    urls.pop(0)
    print(len(urls))
    for link in sourceCode.findAll('a', href=True):
        if "http://" not in link['href']:
            link['href'] = urllib.parse.urljoin(url, link['href'])
        in_disallow = False
        for i in range(len(disallowList)):
            if (disallowList[i]).upper().decode() in link['href'].upper():
                in_disallow = True
                break
        if not in_disallow:
            if link['href'] not in visited:
                urls.append(link['href'])
                visited.append(link['href'])

print(visited)
As long as the domain names used inside your robots.txt match the one the robots.txt itself was fetched from, it is all fine. In other words, you can replace yoursite.imdb with yoursite.youtube in all URLs. That's fine.
Update
Say you have a sitemap declared in your robots.txt; then it should have the same TLD.
http://www.yoursite.imdb/robots.txt
should contain:
sitemap: http://www.yoursite.imdb/sitemap1.xml (not .youtube)
Otherwise, for directives such as allow or disallow, there is no impact, since the TLD does not appear in the paths.
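If you would rather not extract the Disallow rules with a regular expression at all, here is a rough sketch of the per-domain idea using the standard library's urllib.robotparser; the allowed() helper and the parser cache are illustrative names, not part of the original code:

import urllib.parse
import urllib.robotparser

# Cache one parser per scheme + host so a fresh robots.txt is fetched and obeyed
# whenever the crawler wanders onto a different domain.
robot_parsers = {}

def allowed(page_url, user_agent='*'):
    parts = urllib.parse.urlsplit(page_url)
    base = parts.scheme + '://' + parts.netloc
    rp = robot_parsers.get(base)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser(base + '/robots.txt')
        rp.read()  # downloads and parses robots.txt for this domain
        robot_parsers[base] = rp
    return rp.can_fetch(user_agent, page_url)

print(allowed('http://www.imdb.com/chart/top'))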
