Extracting email addresses in an html block in ruby/rails - ruby-on-rails

I am creating a parser that wards off against spamming and harvesting of emails from a block of text that comes from tinyMCE (so it may or may not have html tags in it)
I've tried regexes and so far this has been successful:
/\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
problem is, i need to ignore all email addresses with mailto hrefs. for example:
test#mail.com
should only return the second email add.
To get a background of what im doing, im reversing the email addresses in a block so the above example would look like this:
moc.liam#tset
problem with my current regex is that it also replaces the one in href. Is there a way for me to do this with a single regex? Or do i have to check for one then the other? Is there a way for me to do this just by using gsub or do I have to use some nokogiri/hpricot magicks and whatnot to parse the mailtos? Thanks in advance!
Here were my references btw:
so.com/questions/504860/extract-email-addresses-from-a-block-of-text
so.com/questions/1376149/regexp-for-extracting-a-mailto-address
im also testing using this:
http://rubular.com/
edit
here's my current helper code:
def email_obfuscator(text)
text.gsub(/\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i) { |m|
m = "<span class='anti-spam'>#{m.reverse}</span>"
}
end
which results in this:
<a target="_self" href="mailto:<span class='anti-spam'>moc.liamg#tset</span>"><span class="anti-spam">moc.liamg#tset</span></a>

Another option if lookbehind doesn't work:
/\b(mailto:)?([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})\b/i
This would match all emails, then you can manually check if first captured group is "mailto:" then skip this match.

Would this work?
/\b(?<!mailto:)[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
The (?<!mailto:) is a negative lookbehind, which will ignore any matches starting with mailto:
I don't have Ruby set up at work, unfortunately, but it worked with PHP when I tested it...

Why not just store all the matched emails in an array and remove any duplicates? You can do this easily with the ruby standard library and (I imagine) it's probably quicker/more maintainable than adding more complexity to your regex.
emails = ["email_one#example.com", "email_one#example.com", "email_two#example.com"]
emails.uniq # => ["email_one#example.com", "email_two#example.com"]

Related

How can I show the name of the link without http://, https://, and everything that goes after .com and other similar domains?

In my view I'm displaying the link in a such way:
<%= #casino.play_now_link %>
So, #casino.play_now_link can be like this: https://www.spinstation.com/?page=blockedcountry&content=1 What I need, is to display only this part: www.spinstation.com. I tried gsub('http://', '').gsub('https://', ''), and it works, but how can I remove the part of url name after .com? Thanks in advance.
Don't use regexes at all for this sort of thing, use URI from the standard library:
URI.parse(#casino.play_now_link).hostname
or, for a more robust solution, use Addressable:
Addressable::URI.parse(#casino.play_now_link).hostname
Of course, this assumes that you've properly validated that your play_now_links are valid URIs. If you haven't then you can add validations that use URI or Addressable to do so and either clean up existing play_now_links that aren't valid URIs or wrap the parsing and hostname extraction in a method (which is a good idea anyway) with some error handling.
In a simple way one can use
.split('/')[2]
which is regex based and depends on the '/' in your url.
But as #mu is too short mentioned: URI is better for this.

Regex to normalize topic links in Discourse forum

I am using Discourse forum software. As in its current state, Discourse presents links to topic in two ways, with and without a post number at the end.
Example:
forum.domain.com/t/some-topic/23
forum.domain.com/t/some-topic/23/5
The first one is what I want and the second one I want to not be displayed in the forum at all.
I've written a post about it on Discourse forum but didn't receive an answer what Regex to put in the permalink normalization input field in the admin section.
I was told that there is an option to do it using permalink normalization like so (It's an example shown in the admin under the Regex input text, I didn't write it):
permalink normalizations
Apply the following regex before matching permalinks,
for example: /(topic.)\?./\1 will strip query strings from topic routes.
Format is regex+string use \1 etc. to access captures
I don't know what Regex I should use in order to remove the numerical value of the post number from links. I need it only for topic links.
This is the routes.rb routing library and this is the permalink.rb library (I think that the permalink library should help get a better clue how to achieve this). I have no idea how to approach this, because it seems that I need some knowledge of the Discourse routing to make it work. For example, I don't understand why (topic.) is part of the regex, what does it mean, so their example doesn't help me to find a solution.
In the admin I have an input field in which I nee to put the normalization regex code.
I need help with the Regex. I need the regex to work with all topics.
Things I've tried that didn't work out:
/(\/\d+)\/\d+$/\1
/(t/[^/]+/\d+).*/\1
/(\/\d+)\/[0-9]+$/\1
/(\/\d+)\/[0-9]+/\1
/(\/\d+)\/\d+$/\1/
/(forum.domain.com(\/\w+)*\/\d+)\/\d+(?=\s|$)/\1
Note: The Permalink Normalization input field treats the character | as a separator to separate between several Regex expressions.
I think this may be the expression you are looking for to put inside de settings field:
/(t\/.*\/\d+)(\/\d+)/\1
You can see it working on Rubular.
However, the code that generates the url is not using the normalization code, so the expression is being ignored.
You could try normalizing the permalink there:
def last_post_url
url = "#{Discourse.base_uri}/t/#{slug}/#{id}/#{posts_count}"
url = Permalink.normalize_url url
url
end
I didn't truly understand your question, but if I got it right, you are saying that you want links with /some-number at the end but don't what links with /some-number/some-number at the end. If that is the case, the regex is:
forum\.domain\.com\/t\/[^0-9\/]+\/\d{1,9}$
You can replace 'forum' with your forum name and 'domain' with your domain name.
This will remove trailing "/<digits>" after another "/<digits>":
/(forum.domain.com(\/\w+)*\/\d+)\/\d+(?=\s|$)/\1

regular expression for emails NOT ending with replace script

I'm currently modifying my regex for this:
Extracting email addresses in an html block in ruby/rails
basically, im making another obfuscator that uses ROT13 by parsing a block of text for all links that contain a mailto referrer(using hpricot). One use case this doesn't catch is that if the user just typed in an email address(without turning it into a link via tinymce)
So here's the basic flow of my method:
1. parse a block of text for all tags with href="mailto:..."
2. replace each tag with a javascript function that changes this into ROT13 (using this script: http://unixmonkey.net/?p=20)
3. once all links are obfuscated, pass the resulting block of text into another function that parses for all emails(this one has an email regex that reverses the email address and then adds a span to that email - to reverse it back)
step 3 is supposed to clean the block of text for remaining emails that AREN'T in a href tags(meaning it wasn't parsed by hpricot). Problem with this is that the emails that were converted to ROT13 are still found by my regex. What i want to catch are just emails that WEREN'T CONVERTED to ROT13.
How do i do this? well all emails the WERE CONVERTED have a trailing "'.replace" in them. meaning, i need to get all emails WITHOUT that string. so far i have this regex:
/\b([A-Z0-9._%+-]+#[A-Z0-9.-]+.[A-Z]{2,4}('.replace))\b/i
but this gets all the emails with the trailing '.replace i want to get the opposite and I'm currently stumped with this. any help from regex gurus out there?
MORE INFO:
Here's the regex + the block of text im parsing:
http://www.rubular.com/r/NqXIHrNqjI
as you can see, the first two 'email addresses' are already obfuscated using ROT13. I need a regex that gets the emails ohhellzyeah#ribute.com and kaboom#yahoo.com
On negative lookaheads
You can use a negative lookahead to assert that a pattern doesn't match.
For example, the following regex matches all strings that doesn't end with ".replace" string:
^(?!.*\.replace$).*$
As another example, this regex matches all a*b*, except aabb:
^(?!aabb$)a*b*$
Ideally,
See also
regular-expressions.info/Lookaheads and anchors
Flavor comparison - unfortunately, Ruby doesn't support lookbehinds
Specific solution
The following regex works in this scenario: (see on rubular.com):
/\b([A-Z0-9._%+-]+#(?![A-Z0-9.-]*'\.replace\b)[A-Z0-9.-]+\.[A-Z]{2,4})\b/i

I am creating a Twitter clone in Ruby On Rails, how do I code it so that the '#...''s in the 'tweets' turn into links?

I am somewhat of a Rails newbie so bear with me, I have most of the application figured out except for this one part.
def linkup_mentions_and_hashtags(text)
text.gsub!(/#([\w]+)(\W)?/, '#\1\2')
text.gsub!(/#([\w]+)(\W)?/, '#\1\2')
text
end
I found this example here: http://github.com/jnunemaker/twitter-app
The link to the helper method: http://github.com/jnunemaker/twitter-app/blob/master/app/helpers/statuses_helper.rb
Perhaps you could use Regular Expressions to look for "#..." and then replace the matches with the corresponding link?
You could use a regular expression to search for #sometext{whitespace_or_endofstring}
You can use regular expressions, i don't know ruby but the code should be almost exactly as my example:
Regex.Replace("this is an example #AlbertEin",
"(?<type>[##])(?<nick>\\w{1,}[^ ])",
"${type}${nick}");
This example would return
this is an example <a href="http://twitter.com/AlbertEin>#AlbertEin</a>
If you run it on .NET
The regex (?<type>[##])(?<nick>\\w{1,}[^ ]) means, capture and name it TYPE the text that starts with # or #, and then capture and name it NAME the text that follows that contains at least one text character until you fin a white space.
Perhaps you can use a regular expression to parse out the words starting with #, then update the string at that location with the proper link.
This regular expression will give you words starting with # symbols, but you might have to tweak it:
\#[\S]+\
You would use a regular expression to search for #username and then turn that to the corresponding link.
I use the following for the # in PHP:
$ret = preg_replace("#(^|[\n ])#([^ \"\t\n\r<]*)#ise",
"'\\1<a href=\"http://www.twitter.com/\\2\" >#\\2</a>'",
$ret);
I've also been working on this, I'm not sure that it's 100% perfect, but it seems to work:
def auto_link_twitter(txt, options = {:target => "_blank"})
txt.scan(/(^|\W|\s+)(#|#)(\w{1,25})/).each do |match|
if match[1] == "#"
txt.gsub!(/##{match.last}/, link_to("##{match.last}", "http://twitter.com/search/?q=##{match.last}", options))
elsif match[1] == "#"
txt.gsub!(/##{match.last}/, link_to("##{match.last}", "http://twitter.com/#{match.last}", options))
end
end
txt
end
I pieced it together with some google searching and some reading up on String.scan in the api docs.

How to use ruby to get string between HTML <cite> tags?

Greetings everyone:
I would love to get some information from a huge collection of Google Search Result pages.
The only thing I need is the URLs inside a bunch of <cite></cite> HTML tags.
I cannot get a solution in any other proper way to handle this problem so now I am moving to ruby.
This is so far what I have written:
require 'net/http'
require 'uri'
url=URI.parse('http://www.google.com.au')
res= Net::HTTP.start(url.host, url.port){|http|
http.get('/#hl=en&q=helloworld')}
puts res.body
Unfortunately I cannot use the recommended hpricot ruby gem (because it misses a make command or something?)
So I would like to stick with this approach.
Now that I can get the response body as a string, the only thing I need is to retrieve whatever is inside the ciite(remove an i to see the true name :)) HTML tags.
How should I do that? using regular expression? Can anyone give me an example?
Here's one way to do it using Nokogiri:
Nokogiri::HTML(res.body).css("cite").map {|cite| cite.content}
I think this will solve it:
res.scan(/<cite>([^<>]*)<\/cite>/imu).flatten
# This one to ignore empty tags:
res.scan(/<cite>([^<>]*)<\/cite>/imu).flatten.select{|x| !x.empty?}
If you're having problems with hpricot, you could also try nokogiri which is very similar, and allows you to do the same things.
Split the string on the tag you want. Assuming only one instance of tag (or specify only one split) you'll have two pieces I'll call head and tail. Take tail and split it on the closing tag (once), so you'll now have two pieces in your new array. The new head is what was between your tags, and the new tail is the remainder of the string, which you may process again if the tag could appear more than once.
An example that may not be exactly correct but you get the idea:
head1, tail1 = str.split('<tag>', 1) # finds the opening tag
head2, tail2 = tail1.split('</tag>', 1) # finds the closing tag

Resources