Extracting URLs from a String that do not contain 'http'


I have the following 3 strings...
a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"
Ruby's URI extract method only returns the URL in the third string, because it contains the http part.
URI.extract(a)
=> []
URI.extract(b)
=> []
URI.extract(c)
=> ["http://www.google.com"]
How can I create a method to detect and return the URL in all 3 instances?

Use a regular expression.
Here is a basic one that should work for most cases:
/(https?:\/\/)?\w*\.\w+(\.\w+)*(\/\w+)*(\.\w*)?/.match( a ).to_s
This fetches only the first URL in the string and returns it as a string (an empty string if nothing matches).
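To return every URL rather than only the first, one option is String#scan. The sketch below rewrites the groups of the pattern above as non-capturing ones (an adjustment made here so that scan returns whole matches); URL_PATTERN and extract_urls are illustrative names, not part of the original answer.

```ruby
# Same pattern as above, but with non-capturing groups so that
# String#scan yields whole matches instead of arrays of groups.
URL_PATTERN = /(?:https?:\/\/)?\w*\.\w+(?:\.\w+)*(?:\/\w+)*(?:\.\w*)?/

# Hypothetical helper: returns every URL-like substring in text.
def extract_urls(text)
  text.scan(URL_PATTERN)
end

a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"

p extract_urls(a) # => ["www.google.com"]
p extract_urls(b) # => ["google.com"]
p extract_urls(c) # => ["http://www.google.com"]
```

A sentence-ending period directly after a URL is captured as well, because the final optional group accepts a bare dot, so results may need trimming.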

There's no perfect solution to this problem: it's fraught with edge cases. However, you might be able to get tolerably good results using something like the regular expressions used by Twitter to extract URLs from tweets (stripping off the extra leading spaces is left as an exercise!):
require './regex.rb'
def extract_url(s)
s[Twitter::Regex[:valid_url]]
end
a = "The URL is www.google.com"
b = "The URL is google.com"
c = "The URL is http://www.google.com"
extract_url(a)
# => " www.google.com"
extract_url(b)
# => " google.com"
extract_url(c)
# => " http://www.google.com"

You seem to be satisfied with Sucrenoir's answer. The essence of Sucrenoir's answer is to identify a URL by assuming that it includes at least one period. If that is the case, Sucrenoir's regex can be simplified (not equivalently, but for the most part) to this:
string[/\S+\.\S+/]
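A quick check of the simplified pattern against the three strings from the question, plus one input that shows the kind of false positive it allows. The wrapper name first_dotted_token is made up for this illustration.

```ruby
# Hypothetical wrapper around the simplified pattern: returns the first
# whitespace-delimited token that contains a period, or nil.
def first_dotted_token(string)
  string[/\S+\.\S+/]
end

p first_dotted_token("The URL is www.google.com")        # => "www.google.com"
p first_dotted_token("The URL is google.com")            # => "google.com"
p first_dotted_token("The URL is http://www.google.com") # => "http://www.google.com"

# Anything with an inner period qualifies, URL or not.
p first_dotted_token("Version 1.5 is out")               # => "1.5"
p first_dotted_token("No address here")                  # => nil
```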

This is something I used a while ago; hopefully it helps:
validates :url, :format =>
{ :with => URI::regexp(%w(http https)), :message => "Not Valid URL" }
Pass it through that validation (I assume you're using a database).
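It may help to see what such a pattern accepts outside of a Rails model. The sketch below builds a comparable regexp through URI::RFC2396_Parser (URI::regexp is reported as obsolete by recent Ruby versions when warnings are enabled); the constant name HTTP_URL is an assumption for this example. Because the pattern requires an http or https scheme, it rejects the first two strings from the question.

```ruby
require 'uri'

# Matches absolute URIs whose scheme is http or https.
HTTP_URL = URI::RFC2396_Parser.new.make_regexp(%w[http https])

p HTTP_URL.match?("http://www.google.com") # => true
p HTTP_URL.match?("www.google.com")        # => false
p HTTP_URL.match?("google.com")            # => false
```

So this approach validates URLs that already carry a scheme; it does not detect bare host names such as google.com.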

Try this method; hopefully it will work for you. Note that it only detects words containing '.com':
def get_url(str)
  url = nil
  # Keep the last whitespace-separated word that contains '.com'
  str.split(' ').each { |word| url = word if word.include?('.com') }
  url
end
Here it is with your examples:
get_url("The URL is www.google.com") # => "www.google.com"
get_url("The URL is google.com") # => "google.com"
get_url("The URL is http://www.google.com") # => "http://www.google.com"

Related

Ruby (Rails) gsub: pass the captured string into a method

I'm trying to match a string as such:
text = "This is a #hastag"
raw(
h(text).gsub(/(?:\B#)(\w*[A-Z]+\w*)/i, embed_hashtag('\1'))
)
def embed_hashtag('data')
#... some code to turn the captured hashtag string into a link
#... return the variable that includes the final string
end
My problem is that when I pass '\1' in my embed_hashtag method that I call with gsub, it simply passes "\1" literally, rather than the first captured group from my regex. Is there an alternative?
FYI:
I'm wrapping text in h to escape strings, but then I'm embedding code into user inputted text (i.e. hashtags) which needs to be passed raw (hence raw).
It's important to keep the "#" symbol apart from the text, which is why I believe I need the capture group.
If you have a better way of doing this, don't hesitate to let me know, but I'd still like an answer for the sake of answering the question in case someone else has this question.

Use the block form gsub(regex){ $1 } instead of gsub(regex, '\1')
You can simplify the regex to /\B#(\w+)/i as well
Keep the h() helper: Rails escapes output by default, but raw bypasses that escaping, so user input must be escaped before the links are embedded
Specify method arguments as embed_hashtag(data) instead of embed_hashtag('data')
You need to define embed_hashtag before doing the substitution
To build a link, you can use link_to(text, url)
This should do the trick:
def embed_hashtag(tag)
  url = 'http://example.com'
  link_to tag, url
end
raw(
  h(text).gsub(/\B#(\w+)/i){ embed_hashtag($1) }
)
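Outside of a Rails view the same idea can be tried in plain Ruby. In this sketch ERB::Util.html_escape stands in for h, and the /tags/ path, embed_hashtag body and link_hashtags name are assumptions made for illustration. The key point is that the block runs once per match, whereas embed_hashtag('\1') is evaluated a single time, before gsub ever sees a match.

```ruby
require 'erb'

# Hypothetical helper: turns a captured tag name into an HTML link.
def embed_hashtag(tag)
  %(<a href="/tags/#{tag}">##{tag}</a>)
end

# Escape the user text first, then substitute each hashtag via the block form.
def link_hashtags(text)
  escaped = ERB::Util.html_escape(text)
  escaped.gsub(/\B#(\w+)/) { embed_hashtag(Regexp.last_match(1)) }
end

puts link_hashtags("This is a #hashtag")
puts link_hashtags("<b> #ruby")
```

One caveat when escaping first: an apostrophe becomes &#39;, which this pattern would also pick up as a hashtag, so a stricter pattern may be needed for real input.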

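The correct way is to use a block here, because gsub substitutes the block's return value.
Example:
def embed_hashtag(data)
  # Return the replacement string (puts would return nil and delete the match)
  "##{data}"
end
text = 'This is a #hashtag'
raw(
  h(text).gsub(/\B#(\S+)/) { embed_hashtag($1) }
)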

Try the last-match shortcut $~ inside the block:
>> 'zzzdzz'.gsub(/d/) { puts $~[0] }
d
=> "zzzzz"
(The d disappears from the result because puts returns nil.)

Extracting sublink in between two characters in Ruby

How would I extract a sub-link between two characters in a string?
For example, I'd like to extract the Video ID in a youtube URL:
http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u
I'd like the text between the "=" and the first "&" sign, which would be "UkzbRkPv4T4".

If you don't want to deal with regular expressions, you could rely on functionality from Ruby's Standard Library for parsing URLs:
require 'cgi'
require 'uri'
url = "http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u"
video_id = CGI.parse(URI.parse(url).query)['v'][0]

You just need a regular expression:
uri = 'http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u'
# Capture everything after "v=" up to the next "&" (or the end of the string)
m = uri.match(/[?&]v=(?<id>[^&]+)/)
if m
  puts m[:id]
end

Just to expand upon apneadiving's comment.
>> url = "http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u"
=> "http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u"
>> md = url.match(/v=(.*)&/)
=> #<MatchData "v=UkzbRkPv4T4&" 1:"UkzbRkPv4T4">
>> md[1]
=> "UkzbRkPv4T4"
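One thing to watch with this pattern: .* is greedy, so it runs to the last "&" in the string rather than the first. The URL below adds a hypothetical third parameter (t=10) to show the difference, alongside a bounded alternative that stops at the first "&".

```ruby
# Hypothetical URL with an extra parameter after "feature".
url = "http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u&t=10"

# Greedy: captures up to the LAST "&".
greedy = url[/v=(.*)&/, 1]
p greedy  # => "UkzbRkPv4T4&feature=g-all-u"

# Bounded: captures up to the first "&" (or the end of the string).
bounded = url[/v=([^&]*)/, 1]
p bounded # => "UkzbRkPv4T4"
```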

require 'uri'
uri = URI("http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u")
uri.query
# => "v=UkzbRkPv4T4&feature=g-all-u"
URI.decode_www_form(uri.query)
# => [["v", "UkzbRkPv4T4"], ["feature", "g-all-u"]]
URI.decode_www_form(uri.query).map(&:last)
# => ["UkzbRkPv4T4", "g-all-u"]
URI.decode_www_form(uri.query).assoc("v").last
# => "UkzbRkPv4T4"
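The calls above can be wrapped in a small method that also copes with a missing query string or parameter. This is only a sketch; the name query_param is made up here.

```ruby
require 'uri'

# Hypothetical helper: returns the value of the named query parameter,
# or nil when the URL has no query string or no such parameter.
def query_param(url, name)
  query = URI(url).query
  return nil if query.nil?

  pair = URI.decode_www_form(query).assoc(name)
  pair && pair.last
end

p query_param("http://www.youtube.com/watch?v=UkzbRkPv4T4&feature=g-all-u", "v")
# => "UkzbRkPv4T4"
```

Unlike a regex anchored on a trailing "&", this works regardless of where the parameter appears in the query string.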

restclient with ruby

Here I am trying to append an ID to the URL, but the ID is not appended...
def retrieve
url = "http://localhost:3000/branches/"
resource = RestClient::Resource.new url+$param["id"]
puts resource
end
I pass the ID on the command line like this:
ruby newrest.rb id="22"
and I get this error:
`+': can't convert nil into String (TypeError)
But all of this works with the Mozilla REST client. How can I fix this problem?

Like this:
RestClient.get 'http://localhost:3000/branches', {:params => {:id => 50, 'name' => 'value'}}

You can find the command line parameters in the global ARGV array.
If invoking it as ruby newrest.rb 22 will do, then just:
require 'rest-client'
id = ARGV[0]
response = RestClient.get "http://localhost:3000/branches/#{id}"
puts response.body
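If the id=22 form from the question should be kept, the arguments have to be split into keys and values by hand, since Ruby does not populate a $param hash on its own. A minimal sketch, assuming a made-up helper name parse_key_value_args (the shell strips the quotes, so id="22" arrives as id=22):

```ruby
# Hypothetical helper: turns ["id=22", "name=x"] into {"id"=>"22", "name"=>"x"}.
# Arguments without "=" are ignored.
def parse_key_value_args(argv)
  argv.each_with_object({}) do |arg, params|
    key, value = arg.split('=', 2)
    params[key] = value unless value.nil?
  end
end

params = parse_key_value_args(['id=22'])
url = "http://localhost:3000/branches/#{params.fetch('id')}"
puts url # => http://localhost:3000/branches/22
```

In the actual script, ARGV would be passed instead of the literal array, and the resulting URL handed to RestClient.get.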

Here are some examples from the documentation:
private_resource = RestClient::Resource.new 'https://example.com/private/resource', 'user', 'pass'
RestClient.post 'http://example.com/resource', :param1 => 'one', :nested => { :param2 => 'two' }
Just experiment with comma-separated parameters or with hashes to see what your URL gives you.

The line puts resource seems strange to me, but leaving it as it is, I'd suggest:
def retrieve
url = "http://localhost:3000/branches/"
resource = RestClient::Resource.new url
res_with_param = resource[$param["id"]]
puts res_with_param
end
I haven't tried it, so there may be syntax mistakes.
I'm a real newcomer to Ruby, but I hope the idea is good.
Greetings,
KAcper

Finding exact words in a string

I have a list of links to clothing websites that I am categorising by gender using keywords. Depending on what website they are for, they all have different URL structures, for example...
www.website1.com/shop/womens/tops/tshirt
www.website2.com/products/womens-tshirt
I cannot use the .include? method because regardless of whether it is .include?("mens") or .include?("womens"), it will return true. How can I have a method that will only return true for "womens" (and vice versa). I suspect it may have to be some sort of regex, but I am relatively inexperienced with these, and the different URL structures make it all the more tricky. Any help is much appreciated, thanks!

The canonical regex way of doing this is to search on word boundaries:
pry(main)> "foo/womens/bar".match(/\bwomens\b/)
=> #<MatchData "womens">
pry(main)> "foo/womens/bar".match(/\bmens\b/)
=> nil
pry(main)> "foo/mens/bar".match(/\bmens\b/)
=> #<MatchData "mens">
pry(main)> "foo/mens/bar".match(/\bwomens\b/)
=> nil
That said, either splitting, or searching with the leading "/", may be adequate.
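Wrapped into a method, the word-boundary check might look like the sketch below. The name gender_for and the "F"/"M" return values are assumptions for illustration.

```ruby
# Hypothetical helper: classifies a URL by whole-word keyword.
def gender_for(url)
  return 'F' if url.match?(/\bwomens\b/)
  return 'M' if url.match?(/\bmens\b/)

  nil
end

p gender_for("www.website1.com/shop/womens/tops/tshirt") # => "F"
p gender_for("www.website2.com/products/womens-tshirt")  # => "F"
p gender_for("www.website2.com/products/mens-tshirt")    # => "M"
p gender_for("www.website3.com/shop/kids")               # => nil
```

Note that \b treats the underscore as a word character, so a path such as womens_tops would not match either pattern.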

If you first check for women it should work:
# assumes str is not nil
def gender(str)
if str.include?("women")
"F"
elsif str.include?("men")
"M"
else
nil
end
end
If this is not what you are looking for, please explain your problem in more detail.

You could split with / and check for string equality on the component(s) you want -- no need for a regex there
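A sketch of the splitting approach. Splitting on "/" alone would leave womens-tshirt as a single component in the second URL, so this version also splits on hyphens and underscores; that choice, and the names url_tokens and category?, are assumptions made here.

```ruby
# Hypothetical helper: breaks a URL into tokens on "/", "-" and "_".
def url_tokens(url)
  url.split(/[\/\-_]+/)
end

# True only when the keyword appears as a complete token.
def category?(url, keyword)
  url_tokens(url).include?(keyword)
end

p category?("www.website1.com/shop/womens/tops/tshirt", "womens") # => true
p category?("www.website1.com/shop/womens/tops/tshirt", "mens")   # => false
p category?("www.website2.com/products/womens-tshirt", "womens")  # => true
p category?("www.website2.com/products/womens-tshirt", "mens")    # => false
```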

keyword = "women"
url = "www.website1.com/shop/womens/tops/tshirt"
/\/#{keyword}/ =~ url
=> 21
keyword = "men"
url = "www.website1.com/shop/womens/tops/tshirt"
/\/#{keyword}/ =~ url
=> nil
keyword = "women"
url = "www.website2.com/products/womens-tshirt"
/\/#{keyword}/ =~ url
=> 25
keyword = "men"
url = "www.website2.com/products/womens-tshirt"
/\/#{keyword}/ =~ url
=> nil
Then just apply !! to the result to get a boolean:
!!nil # => false
!!25 # => true

Regular expression to explode the URLs

When I try to extract the URL from a string, it does not return the actual URL. Here is the method I used:
def self.getUrlsFromString(str="")
url_regexp = /(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix
url = str.split.grep(url_regexp)
return url
rescue Exception => e
DooDooLogger.log(e.message,e)
return ""
end
When I call self.getUrlsFromString(" check this site...http://lnkd.in/HjUVii"), it returns
site...http://lnkd.in/HjUVii
Instead of
http://lnkd.in/HjUVii

It's because Array#grep returns every element for which pattern === element, i.e. the whole element rather than just the matched part, so
str.split.grep(/http/ix)
will return ["site...http://lnkd.in/HjUVii"] too.
You can try instead of
str.split.grep(url_regexp)
something like this:
url_regexp.match(str).to_s
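The difference can be seen side by side in the sketch below, which reuses the pattern from the question under the assumed constant name URL_REGEXP. Indexing each word with the regexp returns only the matched portion.

```ruby
# Pattern copied from the question.
URL_REGEXP = /(?:http|https):\/\/[a-z0-9]+(?:[\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(?:(?::[0-9]{1,5})?\/[^\s]*)?/ix

words = " check this site...http://lnkd.in/HjUVii".split

# grep keeps whole elements that contain a match.
p words.grep(URL_REGEXP)
# => ["site...http://lnkd.in/HjUVii"]

# String#[] with a regexp returns just the matched text (nil when absent).
p words.map { |word| word[URL_REGEXP] }.compact
# => ["http://lnkd.in/HjUVii"]
```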

Shouldn't you use a much simpler regex, like:
/((http|https):[^\s]+)/

If you want to find all occurrences in a string, you could use String#scan:
str = "check these...http://lnkd.in/HjUVii http://www.google.com/"
str.scan(url_regexp)
=> ["http://lnkd.in/HjUVii", "http://www.google.com/"]
