Ruby code to check if website has a sitemap or not - ruby-on-rails

I am developing a Rails application that needs to check whether a sitemap exists for a URL entered by the user. For example, if a user enters http://google.com, it should return "Sitemap present". From the solutions I have seen, websites usually have either /sitemap.xml or /sitemap at the end of their URL. So I tried checking for this with the typhoeus gem: I request the URL (like www.google.com/sitemap.xml or www.apple.com/sitemap) and check response.code; if it is 200 or 301, the sitemap exists, otherwise it does not. But I have found that some sites return a 301 even if they don't have a sitemap, because they redirect to their main page (for example, http://yournextleap.com/sitemap.xml), so I don't get a conclusive result. Any help would be really great.
Here is my sample code to check for a sitemap using typhoeus:
require "typhoeus"

# The request object
request = Typhoeus::Request.new("http://apple.com/sitemap")

request.on_complete do |response|
  if response.code == 301
    p "success 301" # hell yeah
  elsif response.code == 200
    p "Success 200"
  elsif response.code == 404
    puts "Could not get a sitemap, something's wrong."
  else
    p "check your input!!!!"
  end
end

# Run the request via Hydra.
hydra = Typhoeus::Hydra.new
hydra.queue(request)
hydra.run

The HTTP response status code 301 Moved Permanently is used for permanent redirection. This status code should be used with the Location header. RFC 2616 states that:
If a client has link-editing capabilities, it should update all references to the Request URI.
The response is cacheable.
Unless the request method was HEAD, the entity should contain a small hypertext note with a hyperlink to the new URI(s).
If the 301 status code is received in response to a request of any type other than GET or HEAD, the client must ask the user before redirecting.
I don't think it's fair for you to assume that a 301 response indicates that there was ever a sitemap. If you're checking for the existence of a sitemap.xml or a sitemap directory, then the correct response to expect is a 2XX.
If you insist on assuming that a 3XX response indicates a redirect to a sitemap, then follow the redirect and add logic to check the URL of the resulting page (to see if it's the homepage) or the content of the page (to see if it has XML structure).
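The content check suggested above could look roughly like the following. This is only a sketch in Python rather than Ruby, and it assumes the body has already been fetched (after following any redirect). The function name looks_like_sitemap is made up for illustration. It relies on the sitemap protocol's two root elements, urlset and sitemapindex.

```python
import xml.etree.ElementTree as ET

# Root element names defined by the sitemap protocol.
SITEMAP_ROOTS = {"urlset", "sitemapindex"}


def looks_like_sitemap(body):
    """Return True if body parses as XML with a sitemap root element.

    Hypothetical helper: `body` is the already-downloaded response text.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML (e.g. a typical HTML homepage).
        return False
    # Strip an optional "{namespace}" prefix from the tag name.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in SITEMAP_ROOTS
```

A homepage returned after a redirect would normally fail this check, either because it is not well-formed XML or because its root element is html.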

A sitemap may also be compressed as sitemap.xml.gz, so you may have to check for that filename too. Also, a site may have an index file that points to many other sub-sitemaps, which may be named differently.
For example, in my project I have:
sitemap_index.xml.gz
-> sitemap_en1.xml.gz (english version of links)
-> sitemap_pl1.xml.gz (polish version of links)
-> images_sitemap1.xml.gz (only images sitemap)
Websites ping search engines with those filenames, but sometimes they also may include them in the /robots.txt file, so you may try hunting for them in there. For example http://google.com has this at the end of their file:
(See how weird sitemaps' names can be!)
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
Sitemap: http://www.google.com/ventures/sitemap_ventures.xml
Sitemap: http://www.google.com/sitemaps_webmasters.xml
Sitemap: http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml
Sitemap: http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml
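Hunting for those lines in robots.txt can be sketched as below. This is an illustration in Python, assuming the robots.txt text has already been downloaded. The function name extract_sitemap_urls is hypothetical.

```python
def extract_sitemap_urls(robots_txt):
    """Collect the URLs from all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "://" stays intact.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls
```

An empty result does not prove that there is no sitemap; it only means the site did not advertise one in robots.txt.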
About 301: you may try spoofing as a Google Bot or other crawler. Maybe they redirect everyone except robots. But if they redirect everyone, there's nothing you can really do about it.


FastAPI RedirectResponse gets {"message": "Forbidden"} when redirecting to a different route

Please bear with me, as this is a question for which it's nearly impossible to create a reproducible example.
I have an API setup with FastAPI using Docker, Serverless and deployed on AWS API Gateway. All routes discussed are protected with an api-key that is passed into the header (x-api-key).
I'm trying to accomplish a simple redirect from one route to another using fastapi.responses.RedirectResponse. The redirect works perfectly fine locally (though, this is without api-key), and both routes work perfectly fine when deployed on AWS and connected to directly, but something is blocking the redirect from route one (abc/item) to route two (xyz/item) when I deploy to AWS. I'm not sure what could be the issue, because the logs in CloudWatch aren't giving me much to work with.
To illustrate my issue let's say we have route abc/item that looks like this:
@router.get("/abc/item")
async def get_item(item_id: int, request: Request, db: Session = Depends(get_db)):
    if False:
        redirect_url = f"/xyz/item?item_id={item_id}"
        logging.info(f"Redirecting to {redirect_url}")
        return RedirectResponse(redirect_url, headers=request.headers)
    else:
        execution = db.execute(text(items_query))
        return convert_to_json(execution)
So, we check if some value is True/False and if it's False we redirect from abc/item to xyz/item using RedirectResponse(). We pass the redirect_url, which is just the xyz/item route including query parameters and we pass request.headers (as suggested here and here), because I figured we need to pass along the x-api-key to the new route. In the second route we again try a query in a different table (other_items) and return some value.
I have also tried passing status_code=status.HTTP_303_SEE_OTHER and status_code=status.HTTP_307_TEMPORARY_REDIRECT to RedirectResponse() as suggested by some tangentially related questions I found on StackOverflow and the FastAPI discussions, but that didn't help either.
@router.get("/xyz/item")
async def get_item(item_id: int, db: Session = Depends(get_db)):
    execution = db.execute(text(other_items_query))
    return convert_to_json(execution)
Like I said, when deployed I can connect directly to abc/item and get a return value when the condition is True, and I can also connect to xyz/item directly and get a correct value from it. But when I pass a value to abc/item for which the condition is False (and thus it should redirect), I get {"message": "Forbidden"}.
In case it can be of any help, I try debugging this using a "curl" tool, and the headers I get returned give the following info:
Content-Type: application/json
Content-Length: 23
Connection: keep-alive
Date: Wed, 27 Jul 2022 08:43:06 GMT
x-amzn-RequestId: XXXXXXXXXXXXXXXXXXXX
x-amzn-ErrorType: ForbiddenException
x-amz-apigw-id: XXXXXXXXXXXXXXXX
X-Cache: Error from cloudfront
Via: 1.1 XXXXXXXXXXXXXXXXXXXXXXXXX.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: XXXXX
X-Amz-Cf-Id: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
So, this is hinting at a CloudFront error. Unfortunately I don't see anything slightly hinting at this API when I look into my CloudFront dashboard on AWS, there literally is nothing there (I do have permissions to view the contents though...)
The API logs in CloudWatch look like this:
2022-07-27T03:43:06.495-05:00 Redirecting to /xyz/item?item_id=1234...
2022-07-27T03:43:06.495-05:00 [INFO] 2022-07-27T08:43:06.495Z Redirecting to /xyz/item?item_id=1234...
2022-07-27T03:43:06.496-05:00 2022-07-27 08:43:06,496 INFO sqlalchemy.engine.Engine ROLLBACK
2022-07-27T03:43:06.496-05:00 [INFO] 2022-07-27T08:43:06.496Z ROLLBACK
2022-07-27T03:43:06.499-05:00 END RequestId: 6f449762-6a60189e4314
2022-07-27T03:43:06.499-05:00 REPORT RequestId: 6f449762-6a60189e4314 Duration: 85.62 ms Billed Duration: 86 ms Memory Size: 256 MB Max Memory Used: 204 MB
I have been wondering if my issue could be related to something I need to add to somewhere in my serverless.yml, perhaps in the functions: part. That currently looks like this for these two routes:
events:
  - http:
      path: abc/item
      method: get
      cors: true
      private: true
      request:
        parameters:
          querystrings:
            item_id: true
  - http:
      path: xyz/item
      method: get
      cors: true
      private: true
      request:
        parameters:
          querystrings:
            item_id: true
Finally, it's probably good to note that I have added custom middleware to FastAPI to handle the two different database connections I need for connecting to other_items and items tables, though I'm not sure how relevant this is, considering this functions fine when redirecting locally. For this I implemented the solution found here. This custom middleware is the reason for the redirect in the first place (we change connection URI based on route with that middleware), so I figured it's good to share this bit of info as well.
Thanks!
As noted here and here, it is impossible to redirect to a page with custom headers set. A redirection in the HTTP protocol doesn't support adding any headers to the target location. It is basically just a header in itself and only allows for a URL (a redirect response could also include body content if needed; see this answer). When you add the authorization header to the RedirectResponse, you only send that header back to the client.
As suggested here, you could use the Set-Cookie HTTP response header:
The Set-Cookie HTTP response header is used to send a cookie from the
server to the user agent (client), so that the user agent can send it back to
the server later.
In FastAPI—documentation can be found here and here—this can be done as follows:
from fastapi import FastAPI, Request
from fastapi.responses import RedirectResponse

app = FastAPI()

@app.get("/abc/item")
def get_item(request: Request):
    redirect_url = request.url_for('your_endpoints_function_name')  # e.g., 'get_item'
    response = RedirectResponse(redirect_url)
    response.set_cookie(key="fakesession", value="fake-cookie-session-value", httponly=True)
    return response
Inside the other endpoint, where you are redirecting the user to, you can extract that cookie to authenticate the user. The cookie can be found in request.cookies (which should return, for example, {'fakesession': 'fake-cookie-session-value'}), and you retrieve it using request.cookies.get('fakesession').
On a different note, request.url_for() function accepts only path parameters, not query parameters (such as item_id in your /abc/item and /xyz/item endpoints). Thus, you can either create the URL in the way you already do, or use the CustomURLProcessor suggested here, here and here, which allows you to pass both path and query parameters.
If the redirection takes place from one domain to another (e.g., from abc.com to xyz.com), please have a look at this answer.
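If you keep building the redirect URL by hand, as in the question, a small helper can take care of encoding the query parameters. The sketch below uses only the standard library. build_redirect_url is a hypothetical name, not part of FastAPI.

```python
from urllib.parse import urlencode


def build_redirect_url(path, **query_params):
    """Append URL-encoded query parameters to a path.

    Hypothetical helper for illustration; values are percent-encoded
    by urlencode, so spaces and '&' in values are safe.
    """
    if not query_params:
        return path
    return f"{path}?{urlencode(query_params)}"
```

For example, build_redirect_url("/xyz/item", item_id=1234) produces the same string the question builds with an f-string, and the result can be passed to RedirectResponse.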

does docpad secondary url redirect feature work?

Based on the documentation of the docpad primary URL, all requests to a document's secondary URLs should be redirected to the primary URL. But in practice, docpad responds with the expected page directly when any secondary URL is requested, without any redirection.
For example, you have a docpad document /src/documents/secondary-url.html.md like:
---
urls:
- '/my-secondary-urls1'
- '/my-secondary-urls2'
---
# primary url should be `secondary-url.html`
Then run command $ docpad run
It responds with status 200 when hitting either http://localhost:9778/my-secondary-urls1 or http://localhost:9778/my-secondary-urls2, while the expected result is a redirect with status code 301 to http://localhost:9778/secondary-url.html
Judging by this line of docpad code, the redirect seems to be an intended feature.
I'm curious whether this is a defect or a deprecated feature.
BTW: I have a simple fix here, which won't become a pull request until I read the contribution guide: https://github.com/shawnzhu/docpad/commit/731cdec43f9d9d155c8a8310494575d9746a065c
This was addressed in issue 850 of the docpad project and fixed in pull request 905, so docpad versions later than v6.70.1 will contain this fix.

Large number of likes but now realise it is to an invalid url

My site at www.kruaklaibaan.com (yes I know it's hideous) currently has 3.7 million likes but while working to build a proper site that doesn't use some flowery phpBB monstrosity I noticed that all those likes are registered against an invalid URL that doesn't actually link back to my site's URL at all. Instead the likes have all been registered against a URL-encoded version:
www.kruaklaibaan.com%2Fviewtopic.php%3Ff%3D42%26t%3D370
This is obviously incorrect. Since I already have so many likes I was hoping to either get those likes updated to the correct URL or get them to just point to the base url of www.kruaklaibaan.com
The correct url they SHOULD have been registered against is (not url-encoded):
www.kruaklaibaan.com/viewtopic.php?f=42&t=370
Is there someone at Facebook I can discuss this with? 3.7m likes is a little too many to start over with without a lot of heartache. It took 2 years to build those up.
Short of getting someone at Facebook to update the URL, the only option within your control that I could think of that would work is to create a custom 404 error page. I have tested such a page with your URL and the following works.
First you need to set the Apache directive for ErrorDocument (or equivalent in another server).
ErrorDocument 404 /path/to/404.php
This will cause any 404 pages to hit the script, which in turn will do the necessary check and redirect if appropriate.
I tested the following script and it works perfectly.
<?php
if ( $_SERVER['REQUEST_URI'] == '/%2Fviewtopic.php%3Ff%3D42%26t%3D370' ) {
    header("HTTP/1.1 301 Moved Permanently");
    header("Location: /viewtopic.php?f=42&t=370");
    exit();
} else {
    header('HTTP/1.0 404 Not Found');
}
?><html><body>
<h1>HTTP 404 Not Found</h1>
<?php echo $_SERVER['REQUEST_URI']; ?>
</body></html>
This is a semi-dirty way of achieving this. However, I tried several variations in Apache 2.2 using mod_alias's Redirect and mod_rewrite's RewriteRule, and I was not able to get either working with a URL containing percent-encoded characters. I suspect that nginx may offer a more graceful way to handle this in the server.
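The script above matches one hard-coded URL. If more than one topic was liked under an encoded URL, the same check could be generalized by percent-decoding the request URI. The following is only a sketch of that logic in Python, not a drop-in replacement for the PHP page. The function name redirect_target is made up for illustration.

```python
from urllib.parse import unquote


def redirect_target(request_uri):
    """Return the decoded path to redirect to, or None if nothing was encoded.

    Hypothetical helper: a real handler should also verify that the
    decoded target is a page that actually exists on the site.
    """
    decoded = unquote(request_uri)
    if decoded == request_uri:
        # Nothing was percent-encoded, so this is an ordinary 404.
        return None
    # "/%2Fviewtopic.php..." decodes to "//viewtopic.php...", so
    # collapse the leading slashes into a single one.
    return "/" + decoded.lstrip("/")
```

With the URI from the question, redirect_target('/%2Fviewtopic.php%3Ff%3D42%26t%3D370') gives '/viewtopic.php?f=42&t=370', which is the value the PHP script puts in the Location header.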

Google docs API: can't download a file, downloading documents works

I'm trying out http requests to download a pdf file from google docs using google document list API and OAuth 1.0. I'm not using any external api for oauth or google docs.
Following the documentation, I obtained download URL for the pdf which works fine when placed in a browser.
According to documentation I should send a request that looks like this:
GET https://doc-04-20-docs.googleusercontent.com/docs/secure/m7an0emtau/WJm12345/YzI2Y2ExYWVm?h=16655626&e=download&gd=true
However, the download URL has something funny going on with the parameters. It looks like this:
https://doc-00-00-docs.googleusercontent.com/docs/securesc/5ud8e...tMzQ?h=15287211447292764666&amp\;e=download&amp\;gd=true
(in the url '&amp\;' is actually without '\' but I put it here in the post to avoid escaping it as '&').
So what is the case here; do I have 3 parameters h,e,gd or do I have one parameter h with value 15287211447292764666&ae=download&gd=true, or maybe I have the following 3 param-value pairs: h = 15287211447292764666, amp;e = download, amp;gd = true (which I think is the case and it seems like a bug)?
In order to form a proper HTTP request I need to know exactly what the parameter names and values are, but the download URL I have is confusing. Moreover, if the parameter names are h, amp;e and amp;gd, is a request containing those parameters valid for obtaining the file content? (If not, it seems like a bug.)
I didn't have problems downloading and uploading documents (msword docs) and my scope for downloading a file is correct.
I experimented with different requests a lot. When I treat the 3 parameters (h,e,gd) separately I get Unauthorized 401. If I assume that I have only one parameter - h with value 15287211447292764666&ae=download&gd=true I get 500 Internal Server Error (google api states: 'An unexpected error has occurred in the API.','If the problem persists, please post in the forum.').
If I don't put any parameters at all or I put 3 parameters -h,amp;e,amp;gd, I get 302 Found. I tried following the redirections by sending more requests, but I still couldn't get the actual pdf content. I also experimented in OAuth Playground and it doesn't seem to work as it's supposed to there either. Sending a GET request in OAuth Playground with the download URL responds with 302 Found instead of responding with the PDF content.
What is going on here? How can I obtain the pdf content in a response? Please help.
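One possible reading of the download URL, assuming it was copied out of an XML or HTML response where & is escaped as &amp;, is that there are three parameters: h, e and gd. The sketch below shows that interpretation with Python's standard library. The URL in the usage example is a made-up stand-in, not the real download URL, and split_escaped_query is a hypothetical name.

```python
import html
from urllib.parse import parse_qs, urlsplit


def split_escaped_query(escaped_url):
    """Undo HTML/XML entity escaping, then parse the query string.

    Assumption: the URL came from escaped markup, so "&amp;" stands
    for a plain "&" separator between parameters.
    """
    url = html.unescape(escaped_url)
    return url, parse_qs(urlsplit(url).query)


# Hypothetical example URL (not the one from the question).
url, params = split_escaped_query(
    "https://example.com/docs/securesc/abc?h=123&amp;e=download&amp;gd=true"
)
print(url)     # the unescaped URL to actually request
print(params)  # parameter names mapped to lists of values
```

If that assumption holds, the unescaped URL is the one to request. It does not by itself explain the 401.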
I experienced the same issue with OAuth2 (error 401).
I solved it by sending the OAuth2 token in a request header instead of in the URL.
I replaced &access_token=<token> in the URL with setRequestHeader("Authorization", "Bearer <token>").
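The same idea (token in the Authorization header rather than in the query string) can be sketched with Python's standard library. The URL and token below are placeholders, and build_download_request is a hypothetical name. The request is only constructed here, not sent.

```python
import urllib.request


def build_download_request(url, token):
    """Build a GET request that carries the OAuth2 token in a header.

    Hypothetical helper: `url` must not contain an access_token
    query parameter, since the token travels in the header instead.
    """
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )


# Placeholder values for illustration only.
request = build_download_request("https://example.com/download?gd=true", "TOKEN")
print(request.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen would then perform the download.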

How to get 302 redirect location in Rails? (have tried HTTParty, Net/Http and RedirectFollower)

Hello,
I am trying to get the cover picture of a Facebook user's album.
As the API page says, it returns "An HTTP 302 with the URL of the album's cover picture" when getting:
https://graph.facebook.com/[album_id]/picture?access_token=blahblahblah...
Documentation is here: http://developers.facebook.com/docs/reference/api/album
I've tried HTTParty, Net::HTTP and also the RedirectFollower class.
HTTParty returns the picture image itself, with no "location" (URL) information anywhere.
Net::HTTP and RedirectFollower are a bit tricky:
if I don't use URI.encode when passing the URL into the get method, it causes a "bad uri" error,
but if I use URI.encode to pass the encoded URI, it causes EOFError (end of file reached).
What's amazing is that I can see the location URL when using apigee's FB API.
Should anything be modified, or is there an easier way to do this?
Thank you!
Here is the redirect method recommended in the Net::HTTP documentation:
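require 'net/http'

def self.fetch(uri_str, limit = 10)
  # Stop after too many redirects instead of recursing forever.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  response = Net::HTTP.get_response(URI.parse(uri_str))
  case response
  when Net::HTTPSuccess then response
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else
    response.error!
  end
end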
If you don't mind using a gem, curb will do this for you. It's all about using the follow_location parameter:
gem 'curb'
require 'curb'

# http://www.stackoverflow.com/ redirects to http://stackoverflow.com/
result = Curl::Easy.http_get("http://www.stackoverflow.com/") do |curl|
  curl.follow_location = true
end
puts result.body_str
This is not the only library with this feature, though.
As a note, many times you will get an invalid location in the header and it will have to be interpreted by the user agent to render it into something useful. A header like Location: / will need to be re-written before it can be fetched. Other times you will get a header like Location: uri=... and you'll have to pull out the location from there. It's really best to leave it to your library than re-write that yourself.
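To illustrate the kind of rewriting a library does for a header like Location: /, here is a small sketch in Python (not Ruby) that resolves a Location value against the URL that was requested. resolve_location is a hypothetical name, and real HTTP clients handle more edge cases than this.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Turn a possibly relative Location header into an absolute URL."""
    # urljoin leaves absolute URLs unchanged and resolves relative
    # ones against the URL of the request that produced the redirect.
    return urljoin(request_url, location)
```

For example, resolve_location("http://example.com/a/b", "/") gives "http://example.com/", while an absolute Location value is returned unchanged.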
Here is what I ended up with after some trial and error:
uri_str = URI.encode("https://graph.facebook.com/[album_id]/picture?access_token=blahblahblah...")
result = Curl::Easy.http_get(uri_str) do |curl|
  curl.follow_location = false
end
puts result.header_str.split('Location: ')[1].split(' ')[0]
The returned header_str looks like:
"HTTP blah blah blah Location: http://xxxxxxx/xxxx.jpg blah blah blah"
So I managed to get the URL by using two split() calls, and the final result is a clean URL.
Also, curl.follow_location should be false so that it doesn't return the body of that page.
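The two split() calls depend on the exact spelling "Location: " and on the header being present. A slightly more defensive way to pull the value out of a raw header string is sketched below in Python rather than Ruby. location_from_headers is a hypothetical name, and the header text in the usage example is invented.

```python
def location_from_headers(header_str):
    """Return the Location header value from raw response headers, or None."""
    for line in header_str.splitlines():
        # Header names are case-insensitive; split on the first colon only
        # so that the "://" inside the URL is preserved.
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "location":
            return value.strip()
    return None


# Invented header text for illustration.
raw = (
    "HTTP/1.1 302 Found\r\n"
    "Content-Type: image/jpeg\r\n"
    "Location: http://example.com/cover.jpg\r\n"
    "Connection: close\r\n\r\n"
)
print(location_from_headers(raw))
```

Returning None when the header is missing makes it easy to tell "no redirect" apart from a parsing failure.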
