Weird URL Encoding/Decoding for non English Characters - url

How and why is a non-English word such as پاکستان converted to weird characters like Ù¾Ø§Ú©Ø³ØªØ§Ù†, and is there any way to get پاکستان back from Ù¾Ø§Ú©Ø³ØªØ§Ù†? It happens both in code shown in the browser and in requests received at the server.
Background:
I get a lot of requests on my non-English (Urdu) website with URLs like
Ù¾Ø§Ú©Ø³ØªØ§Ù†
I tried to find out what that means, but search engines didn't help. I tried queries like
Decode this 'mystring'
What encoding is this 'mystring'
I thought it might be a corrupted/spam URL, based on this link
Weird characters in URL
Problem explanation/example
But when I viewed one of my JS files in the browser (while looking at a working JS file), it showed me the same weird characters, even on localhost:
'pakistan': {'eng': 'Pakistan', 'ur': 'Ù¾Ø§Ú©Ø³ØªØ§Ù†'},
//But the actual source code for the above line is the following
'pakistan': {'eng': 'Pakistan', 'ur': 'پاکستان'},
So the browser shows the garbled form for that same line.
My knowledge
I only know about URL encoding/decoding, which, to the best of my knowledge, seems unrelated here:
encodeURI and decodeURI in JS, quote and unquote in Python, and the equivalents in other languages. But all they do for me is convert
`پاکستان` to `%D9%BE%D8%A7%DA%A9%D8%B3%D8%AA%D8%A7%D9%86` and vice versa
Why needed?
I don't want to miss the requests received with those malformed URLs. There must be some way to undo this: all browsers (Chrome/Firefox/Edge) show the same characters, so if their conversion method and result are the same, there should be some technique available to reverse it as well.

Thanks to Giacomo Catenazzi, and I am also grateful for the following answer:
How to decode cp1252 string?
A very custom and still imperfect solution to my problem.
This algorithm needs improvement. Only by experiment did I find that it does not work for me when the string is long or includes - (hyphens).
So I made changes according to my requirements, and it works fairly well, well enough that I could guess what the actual string was.
import itertools
import re


def specific_my_required_processing(received_string):
    # Characters that a still-garbled result starts with (specific to my data).
    starting_characters_in_encoded_string_in_my_case = ['Ø', 'Ã', 'Ù', 'Ù', 'Ú']
    # The guesser fails on hyphenated strings, so decode each part on its own.
    arr = received_string.split('-')
    res = []
    missed = []
    for string_item in arr:
        decoded_string = guess_decode_string_without_hyphens(string_item)
        if decoded_string and decoded_string[:1] not in starting_characters_in_encoded_string_in_my_case:
            res.append(decoded_string)
        else:
            missed.append({string_item: decoded_string})
    resulting_urdu_string = '-'.join(res)
    print('\n\nResult', resulting_urdu_string)
    print('\nCould not be decoded', missed)


def guess_decode_string_without_hyphens(s):
    encodings = ['cp1251', 'cp1252', 'utf8']
    # Try chains of 2, 4, 6 and 8 alternating encode/decode steps.
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # A str is encoded to bytes; bytes are decoded back to str.
                    r = r.encode(enc) if isinstance(r, str) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            # Accept the first result made only of word characters and whitespace.
            if re.match(r'^[\w\sа-яА-Я]+$', r):
                res = str(r)
                print('Encoding => ', encs, ' Conversion = ' + s + ' => ' + res)
                return res
    # No chain produced a plausible result.
    return None
sample_encoded_string = 'اسلام-آباد-Ûائیکورٹ-ای-ÙˆÛŒ-ایم-قانون-سازی-کالعدم-قرار-دینے-Ú©ÛŒ-درخواست-نامکمل-قرار'
specific_my_required_processing(sample_encoded_string)
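For the simple case, the garbling in the question looks like UTF-8 bytes that were decoded as cp1252. That is an assumption based on the linked cp1252 answer, not something the thread confirms. Under that assumption, a minimal Python sketch of the corruption and its reversal could look like this (`garble` and `ungarble` are hypothetical helper names):

```python
from urllib.parse import quote

def garble(text: str) -> str:
    """Simulate the corruption: UTF-8 bytes misread as cp1252."""
    return text.encode("utf-8").decode("cp1252")

def ungarble(text: str) -> str:
    """Reverse it: recover the original bytes, then decode them as UTF-8."""
    return text.encode("cp1252").decode("utf-8")

original = "پاکستان"
broken = garble(original)
print(broken)                        # the "weird characters" form
print(ungarble(broken) == original)  # the round trip restores the word
print(quote(original))               # ordinary percent-encoding, for comparison
```

This only works when every byte survives the wrong decoding. Python's cp1252 codec leaves bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined, so text whose UTF-8 form contains them cannot be round-tripped this way, which may be why the guessing approach above misses some parts.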

Related

Lua - gsub xml characters to make xml responses visible in iOS Safari Browser

For some reason the iOS Safari browser does not let you see XML content returned by a server.
To try to get around this, I thought I'd replace the distinctive XML characters '>' and '<' with something else that is unlikely to be challenged, e.g. '~'.
I've tried a number of different ways, and while I can use the following to find/replace letters, I can't seem to get it to work with special characters.
Can anyone help?
local xmltest = "<XML Test>"
local t = {< = "~", > = "~"}
local result = string.gsub(xmltest, "<>", t)
print(result)
Many thanks
Here is the answer, thanks @lhf
local xmltest = "<XML Test>"
local t = {["<"] = "~", [">"] = "~"}
local result = string.gsub(xmltest, "[<>]", t)
print(result)

Bookmarks parsing issue

I have a LARGE number of bookmarks and wanted to export them and share them with a group I work with. The issue is that when I export them, there are ADD_DATE and LAST_MODIFIED fields added by the browser (Firefox). I was hoping to just use cut or awk to pull the fields I want but the lack of a space before the >(website_name) is making that difficult. And my regex skills are weak.
How do I add a single space before the second to last > at the end of the line so that I can use cut or awk to pull out the fields I want into a new file?
Ex: 123456">SecurityTrails would become 123456 >SecurityTrails
Please see below for examples of what I'm working with. Any help is greatly appreciated!
<DT>SecurityTrails
I use Firefox myself. It frequently also embeds favicons into the exported bookmarks.html file via base64 encoding. So, to account for scenarios other than just the one mentioned by the OP, maybe something like
{mawk/mawk2/gawk} 'BEGIN { FS = "\042" } $1 = $1'
then do whatever cutting you want. That assumes the OP wants to keep every bit of it and simply remove the quotation marks.
Now, if the objective is just to extract the URL and name,
{mawk/mawk2/gawk} 'BEGIN { DBLQT="\042"; FS = "(<A HREF=" DBLQT "|>)" } /<A HREF=/ {
url = substr($2, 1, index($2, DBLQT) - 1);
sitename = $(NF-1);
sub(/<\/A$/, "", sitename) ;
print url " > " sitename ; }' # or whatever way you want the output to be
I typed it out with extra verbosity to show what \042 means: the ASCII octal code for a double quote.
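If awk field splitting gets fiddly, the same URL + name extraction can also be sketched with Python's standard html.parser, which ignores attributes such as ADD_DATE and LAST_MODIFIED without any regex work. This is only a sketch: `BookmarkParser` is a hypothetical helper, and the sample line (URL and timestamps) is made up for illustration.

```python
from html.parser import HTMLParser

class BookmarkParser(HTMLParser):
    """Collect (url, name) pairs from <A HREF="...">name</A> entries."""

    def __init__(self):
        super().__init__()
        self.bookmarks = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.bookmarks.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical sample line; the URL and timestamps are invented.
sample = ('<DT><A HREF="https://example.com/" ADD_DATE="1600000000" '
          'LAST_MODIFIED="1600000001">Example Site</A>')

parser = BookmarkParser()
parser.feed(sample)
for url, name in parser.bookmarks:
    print(url, ">", name)
```

To process a real export, the file contents would be passed to `parser.feed()` instead of the sample string.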

Saving SEC 10-K annual report text to files (trouble with decoding)

I am trying to bulk-download the text visible to the "end user" from 10-K SEC EDGAR reports (I don't care about tables) and save it in a text file. I found the code below on YouTube, but I am facing two challenges:
I am not sure whether I am capturing all the text, and when I print the text from the URL below, I get very weird output (special characters, e.g. at the very end of the printout).
I can't seem to save the text to a .txt file; I am not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup


def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)


# define the url to specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"
# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content,'html5lib')
page_text = page_soup.html.body.get_text(' ',strip = True)
# normalize the text, remove characters. Additionally, restore missing window characters.
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))
# print: this works however gives me weird special characters in the print (e.g., at the very end)
print(page_text_norm)
# save to file: this only gives me an empty text file
with open('testfile.txt','w') as file:
    file.write(page_text_norm)
Try this. If you take the data you expect as an example, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
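One possible cause of the empty file, which is an assumption the thread does not confirm, is that open() without an encoding uses the platform default, and on some systems that default cannot represent every character in the page, so the write fails partway. A minimal sketch that writes with an explicit encoding (`save_text`, the sample text and the output path are hypothetical):

```python
import os
import tempfile

def save_text(path: str, text: str) -> None:
    """Write text as UTF-8 regardless of the platform's default encoding."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

# Hypothetical sample with characters outside ASCII (and one outside cp1252).
sample_text = "Item 1A. Risk Factors \u2013 caf\u00e9 \u2264 10%"
out_path = os.path.join(tempfile.gettempdir(), "testfile_utf8.txt")
save_text(out_path, sample_text)

# Read it back with the same encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8") as handle:
    round_trip = handle.read()
print(round_trip == sample_text)
```

In the question's script the equivalent change would be adding encoding='utf-8' to the existing open() call.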

URLConnection with arabic parameters

I'm trying to develop an Android application that contains Arabic data, and I've run into a problem:
URL twitter = new URL("http://10.0.2.2/WS/identi_el.php?id1="+nomm+"&id2="+pren+"&id3="+pa);
These parameters (nomm, pren and pa) are in Arabic, so the request doesn't return any result; however, when I put them in French it returns results. Can anyone help me make URLConnection support Arabic letters, please?
Non-alphanumeric characters other than -, _ and . are known to cause issues in URLs; I bet you'll run into the same problem if you use a French word with an accent.
So stay on the safe side and encode all parameters before using them in the query string.
I modified the URL from
URL twitter = new URL("http://10.0.2.2/WS/identi_el.php?id1="+nomm+"&id2="+pren+"&id3="+pa);
to
url = new URL("http://10.0.2.2/WS/identi_el.php?id1="+java.net.URLEncoder.encode(nomm,"UTF-8")+"&id2="+java.net.URLEncoder.encode(pren,"UTF-8")+"&id3="+java.net.URLEncoder.encode(pa,"UTF-8"));
=> I just added java.net.URLEncoder.encode(...,"UTF-8") around each parameter and it's working :)
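The same idea, percent-encoding each parameter as UTF-8 before it goes into the query string, is shown here in Python rather than Java, purely as an illustration. The parameter values are hypothetical; only the base URL and the parameter names come from the question.

```python
from urllib.parse import parse_qs, urlencode

base = "http://10.0.2.2/WS/identi_el.php"
# Hypothetical Arabic values standing in for nomm, pren and pa.
params = {"id1": "محمد", "id2": "علي", "id3": "تونس"}

# urlencode percent-encodes every value, so the final URL is plain ASCII.
url = base + "?" + urlencode(params, encoding="utf-8")
print(url)

# Decoding the query string gives the original values back.
decoded = parse_qs(url.split("?", 1)[1], encoding="utf-8")
print(decoded["id1"][0] == params["id1"])
```

On the server side, PHP normally decodes such percent-encoded values automatically, so the script should see the original Arabic strings.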

Int32.ParseInt throws FormatException after web post

Update
I've found the problem, the exception came from a 2nd field on the same form which indeed should have prompted it (because it was empty)... I was looking at an error which I thought came from trying to parse one string, when in fact it was from trying to parse another string... Sorry for wasting your time.
Original Question
I'm completely dumbfounded by this problem. I am basically running int.Parse("32") and it throws a FormatException. Here's the code in question:
private double BindGeo(string value)
{
    Regex r = new Regex(@"\D*(?<deg>\d+)\D*(?<min>\d+)\D*(?<sec>\d+(\.\d*))");
    Regex d = new Regex(@"(?<dir>[NSEW])");
    var numbers = r.Match(value);
    string degStr = numbers.Groups["deg"].ToString();
    string minStr = numbers.Groups["min"].ToString();
    string secStr = numbers.Groups["sec"].ToString();
    Debug.Assert(degStr == "32");
    var deg = int.Parse(degStr);
    var min = int.Parse(minStr);
    var sec = double.Parse(secStr);
    var direction = d.Match(value).Groups["dir"].ToString();
    var result = deg + (min / 60.0) + (sec / 3600.0);
    if (direction == "S" || direction == "W") result = -result;
    return result;
}
My input string is "32 19 17.25 N"
The above code runs on a .NET 4 web hosting service (aspspider) on an ASP.NET MVC 3 web application (with Razor as its view engine).
Note that the assertion degStr == "32" holds! Also, when I take the above code and run it in a console application it works just fine. I've scoured the web for an answer, but found nothing...
Any ideas?
UPDATE (stack trace)
[FormatException: Input string was not in a correct format.]
System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal) +9586043
System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info) +119
System.Int32.Parse(String s) +23
ParkIt.GeoModelBinder.BindGeo(String value) in C:\MyProjects\ParkIt\ParkIt\GeoBinder.cs:42
Line 42 is var deg = int.Parse(degStr); and note that the exception is in System.Int32.Parse (not in System.Double as was suggested).
You are wrongly thinking that it is the following line that is throwing the exception:
int.Parse("32")
This line is unlikely to ever throw an exception.
In fact it is the following line:
var sec = double.Parse(secStr);
In this case secStr = "17.25";.
The reason for that is that your hosting provider uses a different culture in which the . is not a decimal separator.
You have the possibility to specify the culture in your web.config file:
<globalization culture="en-US" uiCulture="en-US" />
If you don't do that, then auto is used. This means that the culture could be set based on the client browser preferences (which are sent with each request using the Accept-Language HTTP header).
Another possibility is to specify the culture when parsing:
var sec = double.Parse(secStr, CultureInfo.InvariantCulture);
This way you know for sure that . is the decimal separator for the invariant culture.
Testing this (via PowerShell):
PS [64] E:\dev #43> '32 19 17.25 N' -match "\D*(?<deg>\d+)\D*(?<min>\d+)\D*(?<sec>\d+(\.\d*))"
True
PS [64] E:\dev #44> $Matches
Name                           Value
----                           -----
sec                            17.25
deg                            32
min                            19
1                              .25
0                              32 19 17.25
So the regex is working with all three named captures getting a value, all of which will parse OK (ie. it isn't something like \d matching something like U+0660: ARABIC-INDIC DIGIT ZERO that Int32.Parse doesn't handle).
But you do not check that the regex actually makes a match.
Therefore I suspect that the value passed to the function is not the input you expect. Put a breakpoint (or logging) at the start of the function and get the actual value of value.
I think what is happening is:
Value isn't what you think it is.
The regex fails to match.
The captures are empty
Int32.Parse("") is throwing (just confirmed: it throws a FormatException "Input string was not in a correct format.")
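The failure chain above (unexpected value, no match, empty captures, parse error) is easier to spot if the match is checked before anything is parsed. Here is a rough Python rendering of the idea, not the original C#: `bind_geo` is a hypothetical port of BindGeo that keeps the question's pattern and raises a descriptive error when the input does not match.

```python
import re

# Same shape as the question's pattern: degrees, minutes, decimal seconds.
GEO_PATTERN = re.compile(r"\D*(?P<deg>\d+)\D*(?P<min>\d+)\D*(?P<sec>\d+(\.\d*))")
DIR_PATTERN = re.compile(r"(?P<dir>[NSEW])")

def bind_geo(value: str) -> float:
    match = GEO_PATTERN.search(value)
    if match is None:
        # Fail early with the offending input instead of parsing empty captures.
        raise ValueError(f"not a degrees/minutes/seconds string: {value!r}")
    deg = int(match.group("deg"))
    minutes = int(match.group("min"))
    sec = float(match.group("sec"))
    result = deg + minutes / 60.0 + sec / 3600.0
    direction = DIR_PATTERN.search(value)
    if direction and direction.group("dir") in ("S", "W"):
        result = -result
    return result

print(bind_geo("32 19 17.25 N"))
try:
    bind_geo("")  # e.g. an empty second form field
except ValueError as error:
    print(error)
```

With a check like this, an empty second form field reports its own value in the error message instead of looking like a failure to parse "32".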
Addendum: Just noted your comment on the assertion.
If things seem contradictory go back to basics: at least one of your assumptions is wrong eg. there could be an off by one in the exception's line number (an edit to the file before going to that line number: very easy to do).
Stepping through with a debugger in this case is by far the easiest approach. On every expression check everything.
If you cannot use a debugger then try to remove that restriction; if you can't, how about IntelliTrace? Otherwise use some kind of logging (if your app doesn't have it, add it, as you'll need it in the future for things like this).
Try removing non-ASCII (possibly invisible) characters from the string:
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);
edit
Also, try looking at its hex values to see where the exception occurs:
BitConverter.ToString(buffer);
this will show you the hex values so you can verify...
also paste its value so we can see it.
It turns out that this is a non-question. The problem was that the exception came from a 2nd field on the same form which indeed should have prompted it (because it was empty)... I was looking at an error which I thought came from trying to parse one string, when in fact it was from trying to parse another string...
Sorry for wasting your time.

Resources