Why does Network.URI (parseURI) not like the pipe character? - url

I'm using the parseURI function from the network-uri package to parse some URLs. Some of these URLs contain a pipe character, and parsing fails for them. For example:
Network.URI> parseURI "http://something.com/foo|bar"
Nothing
However, these URLs come from a real website and load correctly in a web browser, so there must be a correct way of dealing with them.
Why does parsing fail on URLs with a pipe character, and what can I do to make them parse correctly?

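You need to use escapeURIString before parsing. Parsing fails because the pipe is not among the characters that RFC 3986 allows to appear unescaped in a URI, and parseURI follows the RFC more strictly than browsers do. isUnescapedInURI will tell you whether a character is allowed unescaped in a URI, as mentioned in the documentation.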
λ> isUnescapedInURI '|'
False
So, to properly encode and parse it:
λ> parseURI $ escapeURIString isUnescapedInURI "http://something.com/foo|bar"
Just http://something.com/foo%7Cbar
In fact, this specific corner case is well explained in the Hackage docs.

Related

Apostrophe (valid char) is percent-encoded - but only sometimes

Use Google to find the Wikipedia article about De Morgan's laws.
Click the link, and see the URL. At least in Chrome, it will be
https://en.wikipedia.org/wiki/De_Morgan%27s_laws
' is percent-encoded as %27, even though it is a valid URL character (and if you manually change %27 to ' in the address bar, it still works). Why?
While the apostrophe may be a valid character, its URL-encoded version is equally valid!
I'm not sure there is a hard reason, so this is a somewhat "soft" answer: the apostrophe (and/or the double quote) needs to be escaped somehow if the URL is ever embedded in, for example, JSON or XML. URL-encoding them as part of sanitizing URLs solves this in one way, and protects against poor JSON/XML handling and programmer errors. It's just pragmatic.
Decoding these particular valid characters in HTTP response headers and the like (so the browser shows them "right") should be possible and maybe nice, but it is extra work and code. Note that there are also characters for which decoding would not be OK, so it would have to be selective! So, at least in this case, I guess it just wasn't done. If a character gets URL-encoded at any step in the chain of operations that loads the page, it stays that way.
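The equivalence of the two spellings can be checked with a small sketch. It uses Python's standard urllib.parse purely as an illustration; it says nothing about how Chrome or Wikipedia are actually implemented, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

encoded = "De_Morgan%27s_laws"
literal = "De_Morgan's_laws"

# Both spellings decode to the same path segment, which is why either URL works.
same_resource = unquote(encoded) == unquote(literal)

# A conservative encoder (here: quote() with its default settings) escapes the
# apostrophe even though it is allowed in a URL.
default_quoted = quote(literal)

# The apostrophe survives only if the caller explicitly marks it as safe.
lenient_quoted = quote(literal, safe="'")

print(same_resource, default_quoted, lenient_quoted)
```

Escaping by default is the cautious choice the answer describes: the encoded form is always safe to embed in quoted contexts, and it decodes back to the same resource.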

IBM Cast Iron studio unable to convert '&' to '&amp;'

Hello, I am constructing a URI from two different strings that come from a source.
String1 = 12345&67890
String2 = 78326832
URI = /api?invoice=String1&supplier=String2
After using the concat function available in the studio, this is the final URI:
/api?invoice=12345&67890&supplier=78326832
(The GET request fails because 67890 is treated as a separate query parameter.)
The expected output is:
/api?invoice=12345&amp;67890&supplier=78326832
How do I achieve this? Can I use XSLT to convert symbols to their HTML entities?
Your expected output /api?invoice=12345&amp;67890&supplier=78326832 is rather bizarre: there's no context where it makes sense to escape some ampersands (at the XML/HTML level) and leave others unescaped.
I think that what you really want is to use URI escaping (not XML escaping) for the first ampersand, that is you want /api?invoice=12345%2667890&supplier=78326832. If you're building the URI using XSLT 2.0 you can achieve this by passing the strings through encode-for-uri() before you concatenate them into the URI.
But you've given so little information about the context of your processing that it's hard to be sure exactly what you want.
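The effect of escaping each value before concatenation can be sketched as follows. This uses Python's standard urllib.parse only to illustrate the idea; in XSLT 2.0 the corresponding call is encode-for-uri(), and the variable names here are taken from the question or invented for the example.

```python
from urllib.parse import parse_qs, quote, urlencode

string1 = "12345&67890"
string2 = "78326832"

# Naive concatenation: the ampersand inside string1 splits the query.
naive = "/api?invoice=" + string1 + "&supplier=" + string2

# Escape each value on its own, then concatenate.
escaped = (
    "/api?invoice=" + quote(string1, safe="")
    + "&supplier=" + quote(string2, safe="")
)

# urlencode() performs the same escaping and joining in one step.
built = "/api?" + urlencode({"invoice": string1, "supplier": string2})

# What a receiver would see after parsing each query string.
naive_params = parse_qs(naive.split("?", 1)[1])
escaped_params = parse_qs(escaped.split("?", 1)[1])

print(escaped)
print(naive_params)
print(escaped_params)
```

With the naive URI the receiver gets only 12345 as the invoice value; with the escaped URI it recovers the original 12345&67890.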

how can I use colon instead of question mark in url query?

For example, take this image:
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg
Then I add a colon to send what looks like a query string:
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:large
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:small
From searching, I found that this is a Twitter image.
What programming language can achieve this?
PHP? Ruby on Rails?
Or an .htaccess rewrite rule?
Any.
It has nothing to do with programming languages, but with CGI: http://en.wikipedia.org/wiki/Common_Gateway_Interface
The colon, however, is not a separator defined by the CGI spec, so the ":large" suffix is simply part of the path, and the server receiving the request probably parses it in its own code.
Note, though, that the CGI spec defines '&' as the separator between variable/value pairs, which results in invalid (X)HTML when written unescaped in <a> tags. This is because '&' starts an entity reference, and what follows it is not a defined entity. To remedy this, at least in PHP, you can change the separator: http://www.php.net/manual/en/ini.core.php#ini.arg-separator.output
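As a rough illustration of "parsing it in code", the sketch below splits a colon suffix off the last path segment. It is written in Python only as an example; the function name split_variant is hypothetical, and this is a guess at how a server could handle such URLs, not a description of how Twitter does it.

```python
from urllib.parse import urlsplit

def split_variant(url):
    """Return (filename, variant), where variant is the text after ':' or None."""
    parts = urlsplit(url)
    # The colon suffix is not a query string: it arrives as part of the path.
    filename = parts.path.rsplit("/", 1)[-1]
    base, sep, variant = filename.partition(":")
    return base, (variant if sep else None)

large = split_variant("https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:large")
plain = split_variant("https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg")
query = urlsplit("https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:large").query

print(large, plain, repr(query))
```

Any server-side language or rewrite rule that can inspect the request path can do the same split, which is why the choice of language does not matter.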

lua reading chinese character

I have the following XML feeds that I would like to read:
chinese xml - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
korean xml - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently, I am trying to use LuaXML to parse XML that contains Chinese characters. However, when I print the result to the console, the Chinese characters are not printed correctly and show up as garbage.
Is there any way to parse Chinese or Korean characters into a Lua table?
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
I managed to convert the numeric character references (shown here already rendered as "中美") into Chinese characters.
It took one additional step: converting every such sequence in the string, using the method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595, before saving in XML format.
string.gsub(l,"&#([0-9]+);", function(c) return string.char(tonumber(c)) end)
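The substitution above replaces each decimal character reference with the character it names. A minimal sketch of the same technique, written in Python only for illustration (the function name decode_numeric_refs and the sample string are assumptions, not part of the original post):

```python
import html
import re

def decode_numeric_refs(text):
    """Replace decimal references such as &#20013; with the character they name."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)

sample = "&#20013;&#32654;"  # assumed input: two decimal character references
decoded = decode_numeric_refs(sample)

# The standard library decodes the same references (and named entities too).
stdlib_decoded = html.unescape(sample)

# Each of these characters takes three bytes once encoded as UTF-8.
utf8_bytes = decoded.encode("utf-8")

print(decoded, stdlib_decoded, len(utf8_bytes))
```

One caveat for the Lua version: string.char produces a single byte per argument and accepts only values from 0 to 255, so code points above that range have to be emitted as UTF-8 byte sequences, for example with utf8.char in Lua 5.3 or later.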
I also have a question about LuaXML. I came across this method: xml.registerCode(decoded,encoded)
Its description says:
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters, and how do I use it?

regular expression for emails NOT ending with replace script

I'm currently modifying my regex for this:
Extracting email addresses in an html block in ruby/rails
Basically, I'm making another obfuscator that uses ROT13 by parsing a block of text for all links that contain a mailto href (using hpricot). One use case this doesn't catch is when the user just typed in an email address (without turning it into a link via tinymce).
So here's the basic flow of my method:
1. parse a block of text for all tags with href="mailto:..."
2. replace each tag with a javascript function that changes this into ROT13 (using this script: http://unixmonkey.net/?p=20)
3. once all links are obfuscated, pass the resulting block of text into another function that parses for all emails (this one has an email regex that reverses the email address and then wraps it in a span, to reverse it back)
Step 3 is supposed to clean the block of text of any remaining emails that AREN'T in href attributes (meaning they weren't parsed by hpricot). The problem is that the emails that were converted to ROT13 are still found by my regex. What I want to catch are just the emails that WEREN'T CONVERTED to ROT13.
How do I do this? All emails that WERE CONVERTED have a trailing "'.replace" after them, so I need to get all emails WITHOUT that string. So far I have this regex:
/\b([A-Z0-9._%+-]+#[A-Z0-9.-]+.[A-Z]{2,4}('.replace))\b/i
But this gets all the emails with the trailing '.replace. I want the opposite, and I'm currently stumped. Any help from the regex gurus out there?
MORE INFO:
Here's the regex and the block of text I'm parsing:
http://www.rubular.com/r/NqXIHrNqjI
As you can see, the first two 'email addresses' are already obfuscated using ROT13. I need a regex that gets the emails ohhellzyeah#ribute.com and kaboom#yahoo.com
On negative lookaheads
You can use a negative lookahead to assert that a pattern doesn't match.
For example, the following regex matches every string that doesn't end with ".replace":
^(?!.*\.replace$).*$
As another example, this regex matches every string of the form a*b*, except aabb:
^(?!aabb$)a*b*$
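Both patterns can be tried out directly. The sketch below uses Python's re module, whose lookahead syntax is the same as Ruby's for these patterns; the helper names and test strings are made up for the example.

```python
import re

# Matches any string that does not end with ".replace".
not_ending_with_replace = re.compile(r"^(?!.*\.replace$).*$")

# Matches a*b*, except the exact string "aabb".
ab_except_aabb = re.compile(r"^(?!aabb$)a*b*$")

def matches(pattern, text):
    """True if the pattern matches the given text."""
    return pattern.match(text) is not None

results = {
    "foo.replace": matches(not_ending_with_replace, "foo.replace"),
    "foo.replaced": matches(not_ending_with_replace, "foo.replaced"),
    "aabb": matches(ab_except_aabb, "aabb"),
    "aab": matches(ab_except_aabb, "aab"),
    "aaabbb": matches(ab_except_aabb, "aaabbb"),
    "ba": matches(ab_except_aabb, "ba"),
}
print(results)
```

The lookahead consumes no characters: it only vetoes a match at the position where it is checked, and the rest of the pattern then matches as usual.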
See also
regular-expressions.info/Lookaheads and anchors
Flavor comparison - unfortunately, Ruby 1.8 doesn't support lookbehinds (Ruby 1.9 and later do)
Specific solution
The following regex works in this scenario:
/\b([A-Z0-9._%+-]+#(?![A-Z0-9.-]*'\.replace\b)[A-Z0-9.-]+\.[A-Z]{2,4})\b/i
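A quick check of this pattern, sketched in Python (the syntax used here behaves the same way in Python's re module). The sample text is invented for the test, rot13 is a hypothetical function name, and the addresses keep the '#' separator exactly as they appear in this thread.

```python
import re

# Same pattern as above; the lookahead rejects an address whose domain part
# is directly followed by '.replace (the mark left by the ROT13 script).
email_not_obfuscated = re.compile(
    r"\b([A-Z0-9._%+-]+#(?![A-Z0-9.-]*'\.replace\b)[A-Z0-9.-]+\.[A-Z]{2,4})\b",
    re.IGNORECASE,
)

# Hypothetical block of text: one obfuscated address inside a script call,
# followed by two plain addresses typed directly into the text.
sample = (
    "<script>document.write('xnobbz#lnubb.pbz'.replace(/[a-z]/gi, rot13))</script> "
    "write to ohhellzyeah#ribute.com or kaboom#yahoo.com"
)

found = email_not_obfuscated.findall(sample)
print(found)
```

The lookahead sits right after the separator, so it inspects the whole domain part and what follows it before any of the domain is consumed.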
