Figure out if two URLs navigate to the same page

This is a hypothetical question; no code is needed.
Say you need to figure out whether two URLs navigate to the same page, and you have to do this programmatically. Just comparing the URL strings is not enough: one of them may have an anchor (fragment) that points to a different section of the same page, or there may be a ref=... parameter on the URL that the backend only uses to track referrers.
One possible solution is to request the contents of both URLs and compare the HTML, but that is costly if you have to do it many times for many different URLs. Is there a better solution?
Thanks

Sure, by reading the HTTP response headers, which the server sends back with every response.
When you're redirected, the server responds with a 3xx status code such as 302 (or 301 for a permanent redirect), and the Location header gives the target URL. 200 is the code when everything is OK, and the best known is 404, not found.
Check this Wikipedia page for a little more about the 302 status code and this one for the complete list.
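Before making any request at all, the cases mentioned in the question (an anchor, a ref=... parameter) can be ruled out by normalizing both URLs and comparing the results. The following is only a minimal sketch in Python under some assumptions: the names normalize_url, probably_same_page and TRACKING_PARAMS are hypothetical, the set of tracking parameters has to be chosen per site, and it assumes that the order of query parameters does not matter to the server. Two URLs that still differ after normalization may of course redirect to the same page, which is where the status codes above come in.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of query parameters that only carry tracking data.
TRACKING_PARAMS = {"ref"}


def normalize_url(url, tracking_params=TRACKING_PARAMS):
    """Return a comparable form of `url` without fragment or tracking parameters."""
    parts = urlsplit(url)
    # Drop tracking parameters and sort the rest (assumes order is irrelevant).
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracking_params
    )
    return urlunsplit((
        parts.scheme.lower(),   # scheme is case-insensitive
        parts.netloc.lower(),   # host is case-insensitive
        parts.path or "/",      # treat an empty path like "/"
        urlencode(query),
        "",                     # the fragment never reaches the server
    ))


def probably_same_page(url_a, url_b):
    """Cheap check that needs no network access."""
    return normalize_url(url_a) == normalize_url(url_b)
```

For example, probably_same_page("http://example.com/page#section", "http://example.com/page?ref=abc") holds under these assumptions, because both URLs normalize to http://example.com/page.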

Related

Remove multiple indexed URLs (duplicates) with redirect

I am managing a website that has only about 20-50 pages (articles, links, etc.). Somehow, Google indexed over 1000 links (duplicates: the same page with a different string in the URL). I found that those links contain ?date= in the URL. I have already blocked them by writing Disallow: *date* in robots.txt, and I made an XML sitemap (which I did not have before), placed it in the root folder and submitted it to Google Webmaster Tools. But the problem remains: the links are (and probably will be) in the search results. I would gladly remove the URLs in GWT, but it can only remove one link at a time, and removing >1000 one by one is not an option.
The question: Is it possible to make dynamic 301 redirects from every page that contains ?date= in the URL to the original one, and how? My thinking is that Google will re-crawl those pages, get redirected to the original ones, and drop the numerous duplicates from the search results.
Example:
bad page: www.website.com/article?date=1961-11-1 and n identical pages with a different "date"
good page: www.website.com/article
automatically redirect all bad pages to good ones.
I have spent a whole work day trying to solve this problem, so it would be nice to get some support. Thank you!
P.S. As far as I can tell, this is a coding question and Stack Overflow is the right place for it, but if I am wrong (forgive me), redirect me to the right place to ask it.
You're looking for the canonical link element; that's how Google suggests solving this problem (here's the Webmasters help page about it), and it's supported by most if not all search engines. When you place an element like
<link rel='canonical' href='http://www.website.com/article'>
in the <head> of the page, the URI in the href attribute will be considered the 'canonical' version of the page, the one to be indexed and so on.
For the record: if the duplicate content is not an HTML page (say, it's a dynamically generated image), and supposing you're using Apache, you can use .htaccess to redirect to the canonical version. Unfortunately the Redirect and RedirectMatch directives don't handle query strings (they only match the URL path), but you could use mod_rewrite to strip parts of the query string. See, for example, this answer for a way to do it.
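The logic of such a redirect (drop the date parameter, keep everything else, answer with a 301) can also be sketched outside Apache. The example below is not the mod_rewrite approach from the linked answer; it is a small Python WSGI middleware, shown only to illustrate the idea, and it assumes the site could be served through WSGI at all. The name strip_date_redirect is hypothetical, and the sketch ignores SCRIPT_NAME, so it assumes the application is mounted at the root.

```python
from urllib.parse import parse_qsl, urlencode


def strip_date_redirect(app):
    """Wrap a WSGI app so requests carrying a `date` parameter get a 301."""
    def middleware(environ, start_response):
        pairs = parse_qsl(environ.get("QUERY_STRING", ""), keep_blank_values=True)
        kept = [(key, value) for key, value in pairs if key != "date"]
        if len(kept) == len(pairs):
            # No `date` parameter: serve the page as usual.
            return app(environ, start_response)
        # Rebuild the URL without `date`, preserving any other parameters.
        target = environ.get("PATH_INFO", "/")
        if kept:
            target += "?" + urlencode(kept)
        start_response("301 Moved Permanently",
                       [("Location", target), ("Content-Length", "0")])
        return [b""]
    return middleware
```

With this wrapper, a request for /article?date=1961-11-1 would be answered with a 301 pointing to /article, while /article itself is passed through to the wrapped application unchanged.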

Forcing a page to POST

This may be a very unusual question, but basically there's a page on another domain (that I can view, but can't edit/change) that has a button. When that button is clicked it generates some unique keys.
I need to pull those unique keys with my web service (using ASP.NET MVC3). I can get the initial HTML of the page, but how can I force the page to "click" the button so that I can get the values after the POST?
Normally, I'd reuse the code to generate keys myself, but I don't have access to the logic.
I hope this makes sense.
Use a tool such as Firebug to see which POST parameters are sent with the form, and then make the same POST from your code.
For this you can use WebRequest or WebClient.
These SO questions show how to do it:
HTTP request with post
Send POST request in C# like a web page does?
How to simulate browser HTTP POST request and capture result in C#
Then just parse the response with the technology of your choice (I would use regular expressions, i.e. Regex, or LinqToXml if the response is well-formed XML).
Note: Keep in mind that your code will depend on a service you do not maintain. You can run into problems when the service is unavailable or discontinued, or when the format of the POSTed form or of the response changes.
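The same two steps (replay the form's POST, then parse the response) look roughly like this in Python, using only the standard library instead of the C# classes named above. Everything specific here is an assumption for illustration: the function names, the form field passed in, and especially the markup that extract_keys looks for, since the real page is unknown.

```python
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_post_request(url, fields):
    """Build the same form-encoded POST a browser would send for `fields`."""
    data = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def extract_keys(html):
    """Pull keys out of the response.

    Assumes hypothetical markup of the form <span class="key">VALUE</span>.
    """
    return re.findall(r'<span class="key">([^<]+)</span>', html)


def fetch_keys(url, fields, timeout=10):
    """Send the POST and return the keys found in the response body."""
    with urlopen(build_post_request(url, fields), timeout=timeout) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_keys(html)
```

The field names and values to pass as fields are whatever the browser's network inspector shows for the real form submission.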
This really depends on the technology on the targeted site.
If the page is a simple HTML form, you can easily send a POST yourself. You will need to send the data the form expects, and then you can parse the response.
If it's not so straightforward, you will need to look into ways to automate the click. Check out Selenium. You might also need to resort to scraping if the results page is a mess.

Google Analytics tracking some pageviews as just a forward slash

Really strange. One of my posts is being tracked half the time in Google Analytics as its correct permalink, while the other half of the pageviews are coming from a single forward slash that is attributed with the same Page Title.
Example:
Title of Page: Official iPhone Unlock
Correct URL of page: /official-iphone-unlock
Two URLs being tracked with that page title: /official-iphone-unlock/
So, needless to say, this is throwing off my numbers, as I'm getting pageviews for this page under both URLs, and it's really hard to figure out what the issue is. I'm using the ECWID shopping cart, and I suspect it's their way of tracking things, but I can't prove it. The issue did start around the time I enabled their tracking code.
Have you tried segmenting the traffic for these pages by browser?
First find the page:
Behavior > Site Content > All Pages (then search for your pages)
...then cross-drill by browser segment:
Secondary Dimension > Visitors > Browsers
One possibility that comes to mind is that some browsers may auto-append a slash to the end of URLs without a file extension, while others may not. For example, Chrome forwards a /foo URL to /foo/ for me. It may be that only specific versions of a browser exhibit this behavior, IE9 for example.
You can implement a filter to remove the trailing slashes; see https://www.petramanos.com/ecommerce-google-analytics/remove-slashes-end-urls-google-analytics
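The filter itself is configured in the Google Analytics admin interface, so the following is not its actual syntax. It is only a small Python sketch of the kind of pattern such a filter relies on: strip trailing slashes from a request path while leaving the bare root / alone. The function name strip_trailing_slash is hypothetical.

```python
import re

# Capture everything up to the trailing slashes; requires at least one
# character after the leading "/", so the root path "/" never matches.
_TRAILING_SLASHES = re.compile(r"^(/.+?)/+$")


def strip_trailing_slash(path):
    """Return `path` without trailing slashes, keeping "/" as it is."""
    return _TRAILING_SLASHES.sub(r"\1", path)
```

Applied to the paths from the question, both /official-iphone-unlock and /official-iphone-unlock/ end up as the same string, so their pageviews would be counted together.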

URL's that exist, but do not verify

I have a page that lists various URLs, and I am using verify_exists in my model. This worked splendidly except for one address. I can visit the link in question at the URL given to me, but each time I enter it into the admin, it tells me the URL does not exist. I had a hundred or so links that all worked except for this one.
Does anyone know why this could be happening? Could it be that the actual URL is forwarded, or something to that effect?
I am not sure about Django specifically, but many link checkers do not set a User-Agent header when they run a link check, and many websites block requests that have no user agent.
Wikipedia blocks requests with no User-Agent, so you can check whether that is the problem by trying to create a link to Wikipedia and seeing if you get an error.
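To test that theory outside Django, a link check that does send a User-Agent can be sketched with Python's standard library. This is not what verify_exists does internally; it is a stand-alone illustration, and the names build_check_request, url_exists and the USER_AGENT string are made up for the example. Some servers also reject HEAD requests, in which case a GET would be needed instead.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent string; any descriptive value will do.
USER_AGENT = "Mozilla/5.0 (compatible; LinkChecker/0.1)"


def build_check_request(url, user_agent=USER_AGENT):
    """Build a HEAD request that identifies itself with a User-Agent."""
    return Request(url, headers={"User-Agent": user_agent}, method="HEAD")


def url_exists(url, timeout=10):
    """Return True if the server answers with a 2xx or (followed) 3xx status."""
    try:
        with urlopen(build_check_request(url), timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and timeouts;
        # ValueError covers malformed URLs.
        return False
```

If url_exists succeeds for the problem address while the admin still rejects it, the missing User-Agent is a likely cause.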

URL Redirection for Coming Soon Page?

I have a site with over 100 pages. We need to go live with products that will be available soon; however, many site pages will not be ready at the time of release.
In order to move forward, I would like to reference a "coming soon" page with links to pages that are current and available.
Is there an easy way to forward a URL to a Coming Soon page?
Is this valid, or is there a better way?
Found this at:
http://www.web-source.net/html_redirect.htm
"Place the following HTML redirect code between the <head> and </head> tags of your HTML code."
<meta HTTP-EQUIV="REFRESH" content="0; url=http://www.yourdomain.com/index.html">
Does this negatively affect you if the search engines crawl through your site?
Thank you!
The code you listed will work. However, I would never do this:
You could just show the page you wanted to show immediately, without a redirect. This will be faster for the visitor, as they don't need to load two pages.
If you must use a redirect, why not create it programmatically, for example by instructing your web server (e.g. Apache) to redirect certain pages?
I would not link to pages that don't exist yet. Most visitors will dislike that: clicking on something only to be told "come back later" is a disappointment. We've all seen those coming soon pages where the content never arrives, or arrives only after months or even years. Either leave out those links (or perhaps put up a "work in progress" sign without a link), or add the items only after they've been finished.
Search engines should work well with redirect pages, although it is unlikely your "coming soon" page will show up anywhere near the top of the rankings anyway.
Perhaps a better or "more correct" way would be to do the redirection at the HTTP header level. Using PHP, you would call
<?php
header("Location: http://www.yourdomain.com/index.html");
exit; // stop the script so nothing else is sent after the redirect header
There are also ways to do this in Apache (assuming you are using it) with .htaccess files. See http://www.webweaver.nu/html-tips/web-redirection.shtml for more info about that.
