Pulling a URL with =VLOOKUP from a table - google-sheets

Trying to get the URL from this link: https://www.atanet.org/onlinedirectories/tsd_view.php?id=3856
I use the following formula: =VLOOKUP("Website", IMPORTXML(A1, "(//table[@id='tableTSDContent']//tr)"), 2, 0)
But unfortunately, it does not pull out the URL. I would really appreciate it if you could help me extract the URL in question.

I tried using the Apipheny add-on to import the data. After the <h2>Online Directories Listing</h2> heading, I saw a cell that said "Google bot blocked" or something to that effect.
I then went to the site's robots.txt file (https://www.atanet.org/robots.txt), which says:
User-agent: *
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
Disallow: /onlinedirectories/tsd_corp_listings/*
Disallow: /bin
Disallow: /division_calendar
User-agent: Googlebot
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: /*division_calendar*
Disallow: /*bin*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
User-agent: ITABot
Disallow: /onlinedirectories
I also think this means that the Google Sheets user agent is the same as the search engine's (Googlebot). If that is the case, then with Google Sheets you're out of luck here, because the tsd_view.php page you want is disallowed. Likely this rule was put there because they didn't want Google (or other search engines, for that matter) to index people's contact information. Of course, a malicious web crawler could ignore robots.txt, but Googlebot is a nice bot.
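If you want to check rules like these programmatically, Python's standard urllib.robotparser gives a quick sanity test. A minimal sketch, using a simplified copy of the relevant group (the trailing * is dropped because urllib.robotparser does plain prefix matching, for which a trailing wildcard is redundant anyway):

import urllib.robotparser

# Simplified copy of the Googlebot group from atanet.org's robots.txt.
ROBOTS = """\
User-agent: Googlebot
Disallow: /onlinedirectories/tsd_view.php
Disallow: /onlinedirectories/tsd_search.php
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS.splitlines())

url = "https://www.atanet.org/onlinedirectories/tsd_view.php?id=3856"
print(parser.can_fetch("Googlebot", url))                        # False: disallowed
print(parser.can_fetch("Googlebot", "https://www.atanet.org/"))  # True: no rule matches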

Related

How to block specific URLs by using robots.txt?

How can I block specific URLs using robots.txt? We do not want Google to crawl those URLs on our site. How do I define Disallow rules for them in a robots.txt file?
To exclude all robots from part of the server, add a Disallow rule for each path. For example, to exclude these three folders:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/

IBM Watson Natural Language Understanding is being blocked by robots.txt

I'm using the IBM Watson Natural Language Understanding API to scan specific webpages to determine keywords and categories.
But I've run into an issue with some sites that have their robots.txt set to block website scanners.
I'm working directly with these sites, and they added the Watson agent string "watson-url-fetcher" to their robots.txt file.
The result is that this works only some of the time.
This simplified robots.txt file works:
User-agent: *
Disallow: /
User-agent: watson-url-fetcher
Disallow: /manager/
But if the order is changed, Watson no longer works.
The reordered robots.txt fails:
User-agent: watson-url-fetcher
Disallow: /manager/
User-agent: *
Disallow: /
Watson then returns the error code:
{
  "error": "request to fetch blocked: fetch_failed",
  "code": 400
}
Is this an error in Watson, or do I need to instruct the websites to always put the User-agent: * group at the top of the robots.txt file?
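For what it's worth, a spec-compliant parser should treat the User-agent: * group as a fallback no matter where it appears in the file. A quick sanity check of both orderings with Python's standard urllib.robotparser, using a hypothetical example.com URL:

import urllib.robotparser

ORIGINAL = """\
User-agent: *
Disallow: /

User-agent: watson-url-fetcher
Disallow: /manager/
"""

# Same two groups, opposite order.
REORDERED = """\
User-agent: watson-url-fetcher
Disallow: /manager/

User-agent: *
Disallow: /
"""

def allowed(robots_txt, agent, url):
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

for name, txt in (("original", ORIGINAL), ("reordered", REORDERED)):
    print(name, allowed(txt, "watson-url-fetcher", "http://example.com/page"))
# Both print True: group order makes no difference to this parser,
# which suggests the order-sensitivity lies in Watson's fetcher.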

Should I include mobile site URLs in robots.txt?

My boss is having me look at various ways to improve our site's SEO, and I've been doing some research on it. I'm aware that search engines like mobile-friendly sites, and I used Google's Webmaster Tools, which considers our site to be mobile-friendly. However, we lack an adequate robots.txt file.
What we want to do is avoid getting the same page indexed twice (as desktop and mobile versions), and he recommended that I include our site's mobile URLs in the robots.txt file. However, will doing this damage our site's ranking? I gather that URLs listed in robots.txt shouldn't be indexed, which raises concerns about whether people will be able to see results for our site when they search for it on their phones.
First, I would not recommend having two different sets of URLs for the mobile and regular sites; the official Google blog recommends:
Sites that use responsive web design, i.e. sites that serve all devices on the same set of URLs, with each URL serving the same HTML to all devices and using just CSS to change how the page is rendered on the device. This is Google's recommended configuration.
http://googlewebmastercentral.blogspot.ca/2012/06/recommendations-for-building-smartphone.html
Having said that, since you already have mobile versions and would like to block Google's bots from indexing multiple versions of the same URL:
Blocking Googlebot-Mobile from desktop site
Desktop site: http://www.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Allow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Disallow: /
Mobile site: http://m.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Disallow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Allow: /
http://searchengineland.com/5-tips-for-optimal-mobile-site-indexing-107088
Robots.txt disallows crawling, not indexing.
So if you would block your mobile URLs, bots would never be able to even see that you have a mobile site, which is probably not what you want.
Alternative
Tell bots what the links are about. Based on this declaration, bots can decide what they want to do with these URLs.
You can do this by providing the link types alternate and canonical:
alternate (defined in the HTML5 spec), to denote that it’s an "alternate representation of the current document".
canonical (defined in RFC 6596), to denote that the pages are the same, or that they have only trivial differences (e.g., different HTML structure, a table sorted differently, etc.), or that one is a superset of the other.
So if you want to use the URLs from the desktop site as canonical, you would use "alternate canonical" to link from mobile to desktop, and "alternate" to link from desktop to mobile. You can see an example in my answer to the Webmasters question Linking desktop and mobile pages.
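As a concrete sketch, with hypothetical desktop URL http://www.example.com/page and mobile URL http://m.example.com/page (the media attribute is the pattern from Google's separate-mobile-URLs guidance):

<!-- On the desktop page, pointing to the mobile alternate -->
<link rel="alternate" media="only screen and (max-width: 640px)"
      href="http://m.example.com/page">

<!-- On the mobile page, pointing back to the canonical desktop URL -->
<link rel="canonical" href="http://www.example.com/page">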

Google indexed my domain anyway?

I have the robots.txt below, but Google has still indexed my domain. Basically, they've indexed mydomain.com but not mydomain.com/any_page.
UserAgent: *
Disallow: /
I mean, how can I go back further than /, which I thought was the root of the domain?
Note this domain is a work in progress, hence I don't want Google or any other search engine seeing it for the time being.
If you don't have one already, get a Google Webmaster Tools account. It includes a URL removal tool that may work for you.
This doesn't address the problem of search engines possibly ignoring or misinterpreting your robots.txt file, of course.
If you REALLY want your site to be off the air until it's launched, your best bet is to actually take it off the air: make the site inaccessible except by password. If you put HTTP Basic authentication on your document root, then no search engine will be able to index anything, but you'll have full access with a password.
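With Apache, for example, this is a minimal sketch of an .htaccess file in the document root (the AuthUserFile path is a placeholder; you would create that file with the htpasswd utility):

# Require a valid username/password for everything under this directory.
AuthType Basic
AuthName "Site under construction"
AuthUserFile /path/to/.htpasswd
Require valid-user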

URL shortening: how is this achieved? [duplicate]

Possible Duplicate:
How do short URLs services work?
I often see shortened URLs from bitly.com, such as http://bit.ly/abcd. How is this "bit.ly" realized on the server side? Is it some DNS trick?
Yes, if you go to https://bitly.com/ you will notice that it provides this URL-shortening service.
Going to http://bit.ly/abcd just redirects you to a URL of your choice. You can figure it out by looking at the HTTP request and response headers:
Request URL:http://bit.ly/abcd
Request Method:GET
Status Code:301 Moved
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Host:bit.ly
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1
Response Headers:
Cache-control:private; max-age=90
Connection:keep-alive
Content-Length:145
Content-Type:text/html; charset=utf-8
Date:Thu, 16 Jun 2011 21:14:04 GMT
Location:http://macthemes2.net/forum/viewtopic.php?id=16786044
MIME-Version:1.0
Server:nginx
Set-Cookie:_bit=4dfa721c-001f7-011f8-c8ac8fa8;domain=.bit.ly;expires=Tue Dec 13 16:14:04 2011;path=/; HttpOnly
http://www.w3.org/Protocols/HTTP/HTRESP.html talks about status codes, and 301 is what you should be looking for.
No, it's just an HTTP server that looks up abcd in a database, finds http://example.com/long/url, and sends an HTTP redirect answer, like
HTTP/1.1 301 Moved Permanently
Location: http://example.com/long/url
Have you gone to http://bit.ly/? The URL shortener stores the long URL in a database; when the short URL is requested, the service performs an HTTP redirect to the long URL.
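A minimal sketch of that lookup-and-redirect loop in Python's standard library, with an in-memory dict standing in for the database and a made-up short code and target URL:

from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping of short paths to long URLs; a real service
# would query a database here.
SHORT_URLS = {"/abcd": "http://example.com/long/url"}

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = SHORT_URLS.get(self.path)
        if target:
            # Permanent redirect to the stored long URL.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404, "Unknown short code")

if __name__ == "__main__":
    # Visiting http://localhost:8000/abcd now redirects to the long URL.
    HTTPServer(("localhost", 8000), Redirector).serve_forever()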
.ly is the top-level domain for Libya; the domain bit.ly is distinct from bitly.com.
bit.ly is just a domain like any other (with a TLD such as .com, .net, or .fr).
In this case, the .ly TLD belongs to Libya.
It looks like they use A-Za-z0-9 for generating their codes, and if my calculations are right, that 62-character alphabet means 6-character codes alone can map 62^6 = 56,800,235,584 codes onto long URLs. Assuming certain links can expire, or people can delete links they have made, it's safe to assume they won't run out of possibilities soon... and hey, if they do, just make the links up to 8 characters: then you get 62^8 = 218,340,105,584,896 possibilities. =P
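The code-space arithmetic, for reference:

# 62 case-sensitive alphanumeric characters per position.
ALPHABET = 62
print(ALPHABET ** 6)  # 56800235584 six-character codes
print(ALPHABET ** 8)  # 218340105584896 eight-character codes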
