IBM Watson Natural Language Understanding is being blocked by robots.txt

I'm using the IBM Watson Natural Language Understanding API to scan specific webpages to determine keywords and categories.
But I've run into an issue with some sites that have their robots.txt set to block website scanners.
I'm working directly with these sites, and they added the Watson user agent string "watson-url-fetcher" to their robots.txt file.
The result is that this works only some of the time.
This simplified robots.txt file works:
User-agent: *
Disallow: /
User-agent: watson-url-fetcher
Disallow: /manager/
But if the order is changed, Watson no longer works.
This reordered robots.txt fails:
User-agent: watson-url-fetcher
Disallow: /manager/
User-agent: *
Disallow: /
Watson then returns the error:
{
"error": "request to fetch blocked: fetch_failed",
"code": 400
}
Is this an error with Watson, or do I need to instruct the websites to always put the User-agent: * at the top of the robots.txt file?
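For reference, the robots.txt convention (now codified in RFC 9309) says a crawler should obey the single most specific User-agent group that matches it, wherever that group appears in the file, so the order should not matter. A minimal sketch using Python's urllib.robotparser (a spec-following parser, not Watson's fetcher; the page URL is made up) shows both orderings giving the same answer:

from urllib.robotparser import RobotFileParser

# robots.txt bodies copied from the question
wildcard_first = """\
User-agent: *
Disallow: /

User-agent: watson-url-fetcher
Disallow: /manager/
"""

watson_first = """\
User-agent: watson-url-fetcher
Disallow: /manager/

User-agent: *
Disallow: /
"""

for name, body in [("wildcard first", wildcard_first), ("watson first", watson_first)]:
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    allowed = parser.can_fetch("watson-url-fetcher", "https://example.com/page.html")
    print(name, "-> allowed:", allowed)  # both orderings print True

If Watson only honors its own group when it appears after the wildcard group, that points at its robots.txt parsing rather than at the sites, although asking the sites to keep the wildcard group first is a workable stopgap.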

Related

POST receives HTTP 404 on Server 2012 R2 Only

I'm trying to debug an issue calling the Twitter API (it works on my localhost, but doesn't work from a Server 2012 R2 build), but I don't think it's an issue on Twitter's side.
Anyway, to strip it down to a basic example, if I use Fiddler on my local Windows 10 desktop to POST to this endpoint: https://api.twitter.com/oauth/request_token
With this header data:
User-Agent: Fiddler
Authorization: OAuth oauth_consumer_key="xx",oauth_signature_method="HMAC-SHA1",oauth_timestamp="1654005734",oauth_nonce="yy",oauth_version="1.0",oauth_signature="zz"
Accept: */*
Host: api.twitter.com
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Content-Length: 0
I 'successfully' receive an HTTP 401, which is expected (because the info in my request is obviously inaccurate for this forum). Great.
But if I POST the exact same data using Fiddler on my Server 2012 R2 build, I get an HTTP 404.
Can anybody explain why this might be? I don't see any errors in Wireshark related to certificates or ciphers. I'm stumped...
UPDATE
I can reproduce the same issue with PowerShell like so:
Invoke-WebRequest -Headers @{ "Authorization" = 'OAuth oauth_consumer_key="xx",oauth_signature_method="HMAC-SHA1",oauth_timestamp="1654005734",oauth_nonce="yy",oauth_version="1.0",oauth_signature="zz"' } `
    -Method POST `
    -Uri https://api.twitter.com/oauth/request_token
I finally got it working after days of debugging. It has something to do with Schannel/TLS/cipher suites, but I'm not exactly sure what the specific fix was. I hope this helps somebody and saves a few days of pain.
https://www.alkanesolutions.co.uk/2022/06/07/twitter-api-giving-http-404-not-found-when-requesting-a-token/
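Since the root cause was Schannel/TLS related, one diagnostic sketch (plain Python, not the actual fix) is to see which TLS version and cipher suite the endpoint negotiates with a modern client, and compare that against what the Server 2012 R2 box offers in its ClientHello (visible in Wireshark); a gap there could explain the 404 behavior:

import socket
import ssl

HOST = "api.twitter.com"  # endpoint from the question

context = ssl.create_default_context()
with socket.create_connection((HOST, 443)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        # e.g. 'TLSv1.2' or 'TLSv1.3', plus the negotiated cipher suite
        print("protocol:", tls_sock.version())
        print("cipher:", tls_sock.cipher())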

Pulling URL with =VLOOKUP from the table

I'm trying to get the URL from this link: https://www.atanet.org/onlinedirectories/tsd_view.php?id=3856
I use the following formula: =VLOOKUP("Website",ImportXML(A1, "(//table[@id='tableTSDContent']//tr)"),2,0)
But unfortunately, it does not pull out the URL. I would really appreciate it if you could help me extract the URL in question.
I tried using the APIPheny add-on to import the data. After the <h2>Online Directories Listing</h2> heading, I saw a cell that said "Google bot blocked" or something to that effect.
I then went to the site's robots.txt file (https://www.atanet.org/robots.txt), which says:
User-agent: *
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
Disallow: /onlinedirectories/tsd_corp_listings/*
Disallow: /bin
Disallow: /division_calendar
User-agent: Googlebot
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: /*division_calendar*
Disallow: /*bin*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
User-agent: ITABot
Disallow: /onlinedirectories
I also think this means that the Google Sheets user agent is the same as the search engine's (Googlebot). If this is the case, then with Google Sheets you're out of luck here, because the tsd_view.php page you want is disallowed. Likely, this was put there because they didn't want Google (or other search engines, for that matter) to index people's contact information. Of course, if you're a malicious web crawler, you could ignore the robots.txt, but Googlebot is a nice bot.
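As a rough illustration of why that first Disallow line catches the page (a simplified sketch, not Google's actual matcher; it ignores '$' end anchors and longest-match precedence), a trailing '*' in a robots.txt pattern just extends the prefix match:

import re

def is_blocked(path, pattern):
    # Translate robots.txt-style '*' wildcards into a regex anchored at the start of the path
    regex = "^" + re.escape(pattern).replace(r"\*", ".*")
    return re.match(regex, path) is not None

print(is_blocked("/onlinedirectories/tsd_view.php?id=3856",
                 "/onlinedirectories/tsd_view.php*"))  # True: the listing page is disallowed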

How to block specific URLs by using robots.txt?

How can I block specific URLs by using robots.txt? We do not want Google to crawl our site. How can I define a Disallow rule for those URLs in a robots.txt file?
To exclude all robots from part of the server, for example these three folders, you can use the following (to target Google specifically, use "User-agent: Googlebot" instead of "*"):
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
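If you want to sanity-check a file like this before deploying it, one option is a small script against Python's urllib.robotparser (the host and paths below are just examples):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "https://example.com/cgi-bin/script.pl"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))         # True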

POST request does not work on external servers

I am trying to send data from an Arduino to a web server (LAMP) using the ESP8266 module. When I POST to a server on my local network, the server receives the data and returns 200. However, when I POST to an external server (hosting or Google Cloud), Apache logs a 400 error and returns nothing, while the same request sent from Postman works fine. Because of this, I don't know whether the fault is in how I build or execute the request, or whether the external servers block something, since the HTTP server on my own network works.
I'm using this library to work with the ESP8266: https://github.com/itead/ITEADLIB_Arduino_WeeESP8266
This is the request string:
POST /data/sensor_test.php HTTP/1.1
Host: xxxxxxxxx.com
Accept: */*
Content-Length: 188
Content-Type: application/x-www-form-urlencoded
Cache-Control: no-cache
temperatureAir1=19.70&humidityAir1=82.30&temperatureAir2=19.40&humidityAir2=78.60&externalTemperature=19.31&illumination05=898&illumination10=408&humiditySoilXD28=6&humiditySoilYL69=5
I found the problem: when I concatenated the strings that make up the request, I was terminating lines with \n. I switched to \r\n and it worked!
The byte count (Content-Length) is indeed still wrong, and I'm working on correcting that, but the good thing is that the request itself is now right.
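The line-ending detail is worth spelling out: HTTP/1.1 terminates each header line with CRLF ("\r\n"), and a strict server answers 400 Bad Request when it only receives "\n", even though lenient servers let it slide. A rough sketch in plain Python (not the ESP8266 code; the host is hypothetical, the path and fields are borrowed from the question) that frames the request correctly:

import socket

HOST = "example.com"  # hypothetical host standing in for the real server
BODY = "temperatureAir1=19.70&humidityAir1=82.30"

request = (
    f"POST /data/sensor_test.php HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    f"Content-Type: application/x-www-form-urlencoded\r\n"
    f"Content-Length: {len(BODY)}\r\n"  # computed from the body, not hard-coded
    f"Connection: close\r\n"
    f"\r\n"                             # blank CRLF line separates headers from the body
    f"{BODY}"
)

with socket.create_connection((HOST, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    print(sock.recv(4096).decode("ascii", errors="replace"))

Computing Content-Length from the actual body also avoids the byte-count mismatch mentioned above.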

Should I include mobile site URLs in robots.txt?

My boss is having me look at various ways to improve our site's SEO and I've been doing some research on it. I'm aware that search engines like mobile-friendly sites and I used Google's Webmaster Tools, finding that it considers our site to be mobile-friendly. However, we lack an adequate robots.txt file.
What we want to do is avoid getting the same page indexed twice (as desktop and mobile versions), and he recommended that I include our site's mobile URLs in the robots.txt file. However, will doing this damage our site's ranking? I get that files listed under robots.txt shouldn't be indexed, which raises concerns about whether or not people will be able to see results for our site when they search for it on their phones.
I would not recommend having two different files or URLs for the mobile and regular sites, as the official Google blog recommends:
Sites that use responsive web design, i.e. sites that serve all devices on the same set of URLs, with each URL serving the same HTML to all devices and using just CSS to change how the page is rendered on the device. This is Google’s recommended configuration.
http://googlewebmastercentral.blogspot.ca/2012/06/recommendations-for-building-smartphone.html
Having said that, since you already have mobile versions and would like to block Googlebot from indexing multiple versions of the same URL:
Blocking Googlebot-Mobile from desktop site
Desktop site: http://www.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Allow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Disallow: /
Mobile site: http://m.domain.com/robots.txt
User-agent: Googlebot
User-agent: Slurp
User-agent: bingbot
Disallow: /
User-agent: Googlebot-Mobile
User-Agent: YahooSeeker/M1A1-R2D2
User-Agent: MSNBOT_Mobile
Allow: /
http://searchengineland.com/5-tips-for-optimal-mobile-site-indexing-107088
Robots.txt disallows crawling, not indexing.
So if you blocked your mobile URLs, bots would never even be able to see that you have a mobile site, which is probably not what you want.
Alternative
Tell bots what the links are about. Based on this declaration, bots can decide what they want to do with these URLs.
You can do this by providing the link types alternate and canonical:
alternate (defined in the HTML5 spec), to denote that it’s an "alternate representation of the current document".
canonical (defined in RFC 6596), to denote that the pages are the same, or that they only have trivial differences (e.g., different HTML structure, table sorted differently etc.), or that one is the superset of the other.
So if you want to use the URLs from the desktop site as canonical, you would use "alternate canonical" to link from mobile to desktop, and "alternate" to link from desktop to mobile. You can see an example in my answer to the Webmasters question Linking desktop and mobile pages.
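A minimal sketch of that markup, using the example hosts from the first answer (the page paths are hypothetical):

<!-- On the desktop page, e.g. http://www.domain.com/page.html -->
<link rel="alternate" href="http://m.domain.com/page.html">

<!-- On the mobile page, e.g. http://m.domain.com/page.html -->
<link rel="alternate canonical" href="http://www.domain.com/page.html">

With this in place you don't need to disallow anything in robots.txt; bots can crawl both versions and still understand that the desktop URL is the one to index.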
