regular expression for emails NOT ending with replace script - ruby-on-rails

I'm currently modifying my regex for this:
Extracting email addresses in an html block in ruby/rails
Basically, I'm making another obfuscator that uses ROT13 by parsing a block of text for all links that contain a mailto href (using Hpricot). One use case this doesn't catch is when the user just types in an email address (without turning it into a link via TinyMCE).
So here's the basic flow of my method:
1. parse a block of text for all tags with href="mailto:..."
2. replace each tag with a javascript function that changes this into ROT13 (using this script: http://unixmonkey.net/?p=20)
3. once all links are obfuscated, pass the resulting block of text into another function that parses for all emails (this one has an email regex that reverses the email address and then wraps it in a span that reverses it back)
Step 3 is supposed to clean the block of text of any remaining emails that AREN'T in an href attribute (meaning they weren't parsed by Hpricot). The problem is that the emails that were converted to ROT13 are still found by my regex. What I want to catch are just the emails that WEREN'T CONVERTED to ROT13.
How do I do this? Well, all emails that WERE CONVERTED have a trailing "'.replace" after them, meaning I need to get all emails WITHOUT that string. So far I have this regex:
/\b([A-Z0-9._%+-]+#[A-Z0-9.-]+.[A-Z]{2,4}('.replace))\b/i
But this gets all the emails WITH the trailing '.replace. I want the opposite, and I'm currently stumped. Any help from the regex gurus out there?
MORE INFO:
Here's the regex plus the block of text I'm parsing:
http://www.rubular.com/r/NqXIHrNqjI
As you can see, the first two 'email addresses' are already obfuscated using ROT13. I need a regex that gets only the emails ohhellzyeah#ribute.com and kaboom#yahoo.com

On negative lookaheads
You can use a negative lookahead to assert that a pattern doesn't match.
For example, the following regex matches every string that doesn't end with ".replace":
^(?!.*\.replace$).*$
As another example, this regex matches all a*b*, except aabb:
^(?!aabb$)a*b*$
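Both patterns can be checked quickly in Ruby. This is a minimal sketch: the sample strings and constant names are made up for illustration. In Ruby, `^` and `$` anchor at line boundaries, which amounts to anchoring the whole string here because every sample is a single line.

```ruby
# Matches any single-line string that does NOT end with ".replace".
NOT_ENDING_WITH_REPLACE = /^(?!.*\.replace$).*$/

# Matches a*b* (a run of a's followed by a run of b's), except exactly "aabb".
AB_EXCEPT_AABB = /^(?!aabb$)a*b*$/

# Made-up sample strings, for illustration only.
KEPT = %w[foo.replace foo.replaced bar].grep(NOT_ENDING_WITH_REPLACE)
AB   = %w[ab aabb aaabb ba].grep(AB_EXCEPT_AABB)

p KEPT # => ["foo.replaced", "bar"]
p AB   # => ["ab", "aaabb"]
```

The lookahead runs at the start of the string, consumes nothing, and vetoes the match if the forbidden pattern would match from there.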
Ideally,
See also
regular-expressions.info/Lookaheads and anchors
Flavor comparison - unfortunately, Ruby 1.8 doesn't support lookbehinds (Ruby 1.9 and later do)
Specific solution
The following regex works in this scenario (see on rubular.com):
/\b([A-Z0-9._%+-]+#(?![A-Z0-9.-]*'\.replace\b)[A-Z0-9.-]+\.[A-Z]{2,4})\b/i
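A minimal sketch of how this regex behaves, assuming input shaped like the description in the question. The sample text below is hypothetical (it is not the Rubular sample), the `rot13` name inside it is only filler, and the addresses use "#" exactly as they are written in this post.

```ruby
# Regex from the answer above: the lookahead right after "#" rejects any
# address whose domain part is immediately followed by '.replace
PLAIN_EMAIL = /\b([A-Z0-9._%+-]+#(?![A-Z0-9.-]*'\.replace\b)[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

# Hypothetical input: two ROT13-obfuscated addresses followed by '.replace,
# then two plain addresses typed directly into the text.
SAMPLE_TEXT = <<~'TEXT'
  'grfg#tznvy.pbz'.replace(/[a-zA-Z]/g, rot13)
  'sbb#rknzcyr.bet'.replace(/[a-zA-Z]/g, rot13)
  Write to ohhellzyeah#ribute.com or kaboom#yahoo.com
TEXT

# scan returns one array per match because of the capture group, so flatten.
PLAIN_EMAILS = SAMPLE_TEXT.scan(PLAIN_EMAIL).flatten

p PLAIN_EMAILS # => ["ohhellzyeah#ribute.com", "kaboom#yahoo.com"]
```

The lookahead only rejects an address when '.replace follows it directly, so this relies on the obfuscation script always emitting that exact suffix.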

Related

Undo email wordwrap line breaks in Ruby

My Rails app processes incoming emails by splitting them into multiple lines. This is what I currently use on the plain text version of the body: lines = email.body.split("\n")
This works well unless the sentences are longer than ~74 characters as most email clients will automatically add a line break per RFC 2822.
Example email: https://gist.github.com/marckohlbrugge/39c17b928eb17d330d63
Looking at the plain text part there seems to be no way to discern between a line break added by the user versus the email client. You could ignore any line break happening at the 75th position, but I think there might be a chance of false positives. (I could be wrong.)
The HTML part has all the information we need, but I'm not sure about a universal way to process this. Is replacing every div and br with a newline and then stripping all other HTML elements enough? What about all the other block-element tags? What about inline elements styled as block-elements? What if an email doesn't have an HTML part?
I did find some interesting code examples in Convert HTML to plain text (with inclusion of <br>s), but replacing a list of HTML tags with newlines doesn't seem like a complete (exhaustive) solution.
Is it worth looking at something like this mail library as they've probably already thought about the edge cases? ;)
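For the plain-text part, the width-based idea mentioned above could be sketched roughly as follows. This is only a heuristic under stated assumptions: a break is treated as client-inserted when the first word of the next line would not have fit on the current line within the wrap width. The method name `unwrap_lines` and the default width of 74 are assumptions, and false positives remain possible.

```ruby
# Heuristic sketch: rejoin lines that were probably wrapped by the mail client.
# A break is considered "soft" when the next line's first word would not have
# fit on the previous physical line within +width+ characters.
def unwrap_lines(body, width: 74)
  result = []
  last_physical_length = nil

  body.split("\n").each do |line|
    first_word = line[/\S+/]
    soft_break = last_physical_length && first_word &&
                 last_physical_length.positive? &&
                 last_physical_length + 1 + first_word.length > width

    if soft_break
      # Glue the line onto the logical line collected so far.
      result[-1] = "#{result[-1]} #{line.lstrip}"
    else
      result << line
    end
    last_physical_length = line.length
  end

  result
end

# Small width so the example stays readable.
p unwrap_lines("Hello there my dear\nfriend.\nBye", width: 20)
# => ["Hello there my dear friend.", "Bye"]
```

Blank lines are always kept as separate entries, so paragraph breaks typed by the user survive.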

how can I use colon instead of question mark in url query?

for example this image:
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg
then I add a colon and a value to pass a parameter, as if it were a query string:
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:large
https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:small
After some googling, I found that these are Twitter image URLs.
what coding language can achieve this?
php? ruby on rails?
or any htaccess rewrite rule?
Any.
It has nothing to do with programming languages, but with CGI: http://en.wikipedia.org/wiki/Common_Gateway_Interface
The colon is however not a valid part of the CGI spec, so the server receiving the request will probably parse it in code.
Note though that the CGI spec defines '&' as separator between different variable/value pairs, which results in incorrect (X)HTML when used in <a> tags. This is because it doesn't define a valid entity. To remedy this, at least in PHP, you can change this separator: http://www.php.net/manual/en/ini.core.php#ini.arg-separator.output
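As a rough illustration of "parse it in code": the colon suffix is simply part of the URL path, so the receiving application can split it off itself. This is a sketch, not how Twitter actually does it; the method name `split_variant` is hypothetical.

```ruby
require "uri"

# Splits a trailing ":variant" suffix off the URL path.
# Returns [path, variant]; variant is nil when the path contains no colon.
def split_variant(url)
  path = URI.parse(url).path
  base, separator, variant = path.rpartition(":")
  separator.empty? ? [path, nil] : [base, variant]
end

p split_variant("https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg:large")
# => ["/media/BFmDUA5CcAAmcBl.jpg", "large"]
p split_variant("https://pbs.twimg.com/media/BFmDUA5CcAAmcBl.jpg")
# => ["/media/BFmDUA5CcAAmcBl.jpg", nil]
```

The same split could be done in any language or in a rewrite rule, since the server only ever sees a path that happens to contain a colon.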

What is the proper way to sanitize user input when using a Ruby system call?

I have a Ruby on Rails Application that is using the X virtual framebuffer along with another program to grab images from the web. I have structured my command as shown below:
xvfb-run --server-args=-screen 0 1024x768x24 /my/c++/app #{user_provided_url}
What is the best way to make this call in rails with the maximum amount of safety from user input?
You probably don't need to sanitize this input in rails. If it's a URL and it's in a string format then it already has properly escaped characters to be passed as a URL to a Net::HTTP call. That said, you could write a regular expression to check that the URL looks valid. You could also do the following to make sure that the URL is parse-able:
uri = URI.parse(user_provided_url)
You can then query the object for its relevant parts:
uri.path
uri.host
uri.port
Maybe I'm wrong, but why don't you just make sure that the given string really is a URL (URI::parse), surround it with single quotes, and escape any single quote (') character that appears inside?
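A minimal sketch combining the two suggestions above with one extra precaution: validate with URI.parse, then hand the command to `system` as separate arguments, which makes Ruby run the program directly without a shell, so nothing in the URL is interpreted as shell syntax. The helper names `parse_http_url` and `capture_command` are hypothetical, and passing the whole `--server-args=...` value as a single argument is an assumption about how the original command was meant to be quoted.

```ruby
require "uri"
require "shellwords"

# Returns the parsed URI when +raw+ is an absolute http(s) URL, otherwise nil.
def parse_http_url(raw)
  uri = URI.parse(raw)
  uri.is_a?(URI::HTTP) && uri.host ? uri : nil
rescue URI::InvalidURIError
  nil
end

# Builds the argument vector for the command from the question.
# Raises ArgumentError instead of running anything with a suspicious URL.
def capture_command(raw_url)
  uri = parse_http_url(raw_url)
  raise ArgumentError, "not an http(s) URL: #{raw_url.inspect}" unless uri

  ["xvfb-run", "--server-args=-screen 0 1024x768x24", "/my/c++/app", uri.to_s]
end

# Running it (not executed here): each array element becomes one argument.
#   system(*capture_command(user_provided_url))

# If a single shell string is unavoidable, Shellwords can quote each element:
#   Shellwords.join(capture_command(user_provided_url))
```

URI::HTTPS is a subclass of URI::HTTP, so the `is_a?` check accepts both schemes and rejects things like `file:` URLs.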

Extracting email addresses in an html block in ruby/rails

I am creating a parser that wards off spamming and harvesting of emails from a block of text that comes from TinyMCE (so it may or may not have HTML tags in it).
I've tried regexes and so far this has been successful:
/\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
The problem is, I need to ignore all email addresses inside mailto hrefs. For example:
<a href="mailto:test#mail.com">test#mail.com</a>
should only return the second email address.
To give some background on what I'm doing: I'm reversing the email addresses in a block, so the above example would look like this:
moc.liam#tset
The problem with my current regex is that it also replaces the one in the href. Is there a way for me to do this with a single regex, or do I have to check for one and then the other? Can I do this using just gsub, or do I have to use some Nokogiri/Hpricot magic to parse the mailtos? Thanks in advance!
Here are my references, by the way:
so.com/questions/504860/extract-email-addresses-from-a-block-of-text
so.com/questions/1376149/regexp-for-extracting-a-mailto-address
I'm also testing using this:
http://rubular.com/
edit
here's my current helper code:
def email_obfuscator(text)
  # Wrap every matched address, reversed, in an anti-spam span.
  text.gsub(/\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) { |m|
    "<span class='anti-spam'>#{m.reverse}</span>"
  }
end
which results in this:
<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg#tset</span>"><span class="anti-spam">moc.liamg#tset</span></a>
Another option if lookbehind doesn't work:
/\b(mailto:)?([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})\b/i
This would match all emails, then you can manually check if first captured group is "mailto:" then skip this match.
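The "check the first group, then skip" idea could look roughly like this with gsub and a block. It is a sketch only: the method name `obfuscate_plain_emails` and the sample markup are made up, and addresses use "#" as written in this post.

```ruby
# Optional "mailto:" prefix in group 1, the address itself in group 2.
EMAIL_OR_MAILTO = /\b(mailto:)?([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

# Reverses addresses that are NOT preceded by "mailto:" and leaves the
# mailto ones exactly as they were.
def obfuscate_plain_emails(text)
  text.gsub(EMAIL_OR_MAILTO) do |match|
    prefix = Regexp.last_match(1)
    email  = Regexp.last_match(2)
    prefix ? match : "<span class='anti-spam'>#{email.reverse}</span>"
  end
end

# Hypothetical input resembling the example in the question.
puts obfuscate_plain_emails(%q(<a href="mailto:test#mail.com">test#mail.com</a>))
# prints: <a href="mailto:test#mail.com"><span class='anti-spam'>moc.liam#tset</span></a>
```

Because the prefix is consumed as part of the match, this works even on Ruby versions without lookbehind support.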
Would this work?
/\b(?<!mailto:)[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
The (?<!mailto:) is a negative lookbehind, which will ignore any matches starting with mailto:
I don't have Ruby set up at work, unfortunately, but it worked with PHP when I tested it...
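For reference, a sketch of the lookbehind version in Ruby. It needs Ruby 1.9 or later, since the 1.8 regex engine has no lookbehind. The constant name, the method name `reverse_unlinked_emails`, and the sample markup are made up for illustration.

```ruby
# Requires Ruby 1.9+ (lookbehind support).
# The lookbehind rejects an address that starts right after "mailto:".
EMAIL_NOT_AFTER_MAILTO = /\b(?<!mailto:)[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

# Reverses every address that is not directly preceded by "mailto:".
def reverse_unlinked_emails(text)
  text.gsub(EMAIL_NOT_AFTER_MAILTO) do |email|
    "<span class='anti-spam'>#{email.reverse}</span>"
  end
end

# Hypothetical input resembling the example in the question.
puts reverse_unlinked_emails(%q(<a href="mailto:test#mail.com">test#mail.com</a>))
# prints: <a href="mailto:test#mail.com"><span class='anti-spam'>moc.liam#tset</span></a>
```

The leading `\b` matters here: without it the engine could skip the first character of the address and match the remainder, which is no longer directly preceded by "mailto:".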
Why not just store all the matched emails in an array and remove any duplicates? You can do this easily with the ruby standard library and (I imagine) it's probably quicker/more maintainable than adding more complexity to your regex.
emails = ["email_one#example.com", "email_one#example.com", "email_two#example.com"]
emails.uniq # => ["email_one#example.com", "email_two#example.com"]

what if html_escape would stop escaping '&'?

Is there any danger if the Rails html_escape function stopped escaping '&'? I tested a few cases and it doesn't seem to create any problems. Can you give me a counterexample? Thanks.
If you put an unescaped "&" into an HTML attribute, it would make your page invalid. For example:
Link
The page is now invalid as the & indicates an entity. This is true for any usage of an & on a page (for example, view source and hopefully you'll notice that Stack Overflow escapes the & signs in this post!)
The following would make the above example valid:
Link
Additional Note
& characters do need to be escaped in URLs if you want to validate your markup against the W3C validator. Example:
Line 9, Column 38: & did not start a character reference.
(& probably should have been escaped as &amp;.)
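Beyond validity, a concrete case where skipping '&' changes what the reader sees is text that already contains something that looks like an entity. This is a minimal sketch using ERB::Util.html_escape from Ruby's standard library (the same method name Rails exposes); `escape_without_amp` is a hypothetical variant written only for comparison.

```ruby
require "erb"

# Hypothetical variant that escapes <, > and " but leaves "&" untouched.
def escape_without_amp(text)
  text.gsub(/[<>"]/, "<" => "&lt;", ">" => "&gt;", '"' => "&quot;")
end

# A user explaining HTML entities in a comment (made-up input).
USER_TEXT = 'Type &lt;b&gt; for bold'

p ERB::Util.html_escape(USER_TEXT)
# => "Type &amp;lt;b&amp;gt; for bold"  (browser shows the text as typed)

p escape_without_amp(USER_TEXT)
# => "Type &lt;b&gt; for bold"          (browser shows: Type <b> for bold)
```

With '&' left alone, whatever entity the user typed is decoded by the browser, so the displayed text no longer matches the input.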
Example
change an url with adding some argument
