Will using multiple url redirects hurt my URL? - url

Currently I'm using this link forwarding structure:
bit.ly/{some_hash} > example.com/s/{ID} > example.com/blog/full-seo-optimized-url/
Because the id of the blog never changes but the url might change (e.g. spelling mistake), I'm forwarding my bit.ly short urls to a special subpage (/s/{UD}) that will eventually get the full url from the database and forward the visitor to the blog entry.
Pros:
If the URL changes, the bit.ly short link will still work and forward to the updated url
(Possible) Cons:
Might be seen as spammy method (hiding target link)?
Might be violating any terms of service?
... ?
Therefore I would know, if this is a good and proper way or if I should remove the step in the middle?

Those redirects will cause a slower user experience and when used will cause a loss of PageRank being sent to the destination.
I'd avoid doing it where possible.
There are URL shorteners out there that let you directly edit the destination which would avoid your need for the middle redirect.
You also want to avoid changing the destinations URL as other people will not use your fancy redirects and you will lose PageRank every time they change.

Related

Is it possible to detect uniquely-named files if you have the URL path?

Let's say someone has posted a resource at https://this-site.com/files/pdfs/some_file_name.pdf
Another resource is then posted at that URL, which we don't know the name of. However, the pathname is the same: https://this-site.com/files/pdfs/another_unique_resource98237219.pdf
Is it possible to detect when a new PDF is posted to this location? Or would we have to know more about the backend infrastructure? Keeping in mind that:
None of the other pieces of the URL are valid paths, in other words https://this-site.com/files/pdfs and https://this-site.com/files both return 404 errors.
The names of the files are unique and do not follow a specific pattern.
If this is not possible, what are other ways you might inspect the request/response infrastructure to look for resources posted to that URL?
my first suggestion is to look if there is another page that displays a list of the resource available on the website, of course assuming the website owner actually provides such page
the other method would be effectively brute forcing all the URLs under that path. you will need to collect some SOCKS to use with your crawlers to distribute your requests among multiple IP addresses otherwise the server will probably block your IP address. should you be able to distinguish the minimum and maximum number of characters in the file names (not pattern, just length) this operation can be drastically optimized.

Avoid robots from going into a www.domain.com/thishash when link posted to twitter, facebook

I'm building a service where people gets notified (mails) when they follow a link with the format www.domain.com/this_is_a_hash. The people that use this server can share this link on different places like, twitter, tumblr, facebook and more...
The main problem I'm having is that as soon as the link is shared on any of this platforms a lot of request to the www.domain.com/this_is_a_hash are coming to my server. The problem with this is that each time one of this requests hits my server a notification is sent to the owner of the this_is_a_hash, and of course this is not what I want. I just want to get notifications when real people is going into this resource.
I found a very interesting article here that talks about the huge amount of request a server receives when posting to twitter...
So what I need is to avoid search engines to hit the "resource" url... the www.mydomain.com/this_is_a_hash
Any idea? I'm using rails 3.
Thanks!
If you don’t want these pages to be indexed by search engines, you could use a robots.txt to block these URLs.
User-agent: *
Disallow: /
(That would block all URLs for all user-agents. You may want to add a folder to block only those URLs inside of it. Or you could add the forbidden URLs dynamically as they get created, however, some bots might cache the robots.txt for some time so they might not recognize that a new URL should be blocked, too.)
It would, of course, only hold back those bots that are polite enough to follow the rules of your robots.txt.
If your users would copy&paste HTML, you could make use of the nofollow link relationship type:
cute cat
However, this would not be very effective, as even some of those search engines that support this link type still visit the pages.
Alternatively, you could require JavaScript to be able to click the link, but that’s not very elegant, of course.
But I assume they only copy&paste the plain URL, so this wouldn’t work anyway.
So the only chance you have is to decide if it’s a bot or a human after the link got clicked.
You could check for user-agents. You could analyze the behaviour on the page (e.g. how long it takes for the first click). Or, if it’s really important to you, you could force the users to enter a CAPTCHA to be able to see the page content at all. Of course you can never catch all bots with such methods.
You could use analytics on the pages, like Piwik. They try to differentiate users from bots, so that only users show up in the statistics. I’m sure most analytics tools provide an API that would allow sending out mails for each registered visit.

Url with pseudo anchors and duplicate content / SEO

I have a product page with options in select list (ex : color of the product etc...).
You accede to my product with different urls :
www.mysite.com/product_1.html
www.mysite.com/product_1.html#/color-green
If you accede with the url www.mysite.com/product_1.html#/color-green, the option green of the select list is automatically selected (with javascript).
If i link my product page with those urls, is there a risk of duplicate content ? Is it good for my seo ?
thx
You need to use canonical urls in order to let the search engines know that you are aware that the content seems duplicated.
Basically using a canonical url on your page www.mysite.com/product_1.html#/color-green to go to www.mysite.com/product_1.html tells the search engine that whenever they see www.mysite.com/product_1.html#/color-green they should not scan this page but rather scan the page www.mysite.com/product_1.html
This is the suggested method to overcome duplicate content of this type.
See these pages:
SEO advice: url canonicalization
A rel=canonical corner case
At one time I saw Google indexing the odd #ed URL and showing them in results, but it didn't last long. I think it also required that there was an on page link to the anchor.
Google does support the concept of the hashbang (#!) as a specific way to do indexable anchors and support AJAX, which implies an anchor without the bang (!) will no longer be considered for indexing.
Either way, Google is not stupid. The the basic use of the anchor is to move to a place on a page, i.e. it is the same page (duplicate content) but a different spot. So Google will expect a #ed URL to contain the same content. Why would they punish you for doing what the # is for?
And what is "the risk of duplicate content". Generally, the only onsite risk from duplicate content is Google may waste it's time crawling duplicate pages instead of focusing on other valuable pages. As Google will assume # is the same page it is more likely to not event try the #ed URL.
If you're worried, implement the canonical tag, but do it right. I've seen more issues from implementing it badly than the supposed issues they are there to solve.
Both answers above are correct. Google has said they ignore hashtags unless you use hash-bang format (#!) -- and that really only addresses a certain use case, so don't add it just because you think it will help.
Using the canonical link tag is the right thing to do.
One additional point about dupe content: it's less about the risk than about a missed opportunity. In cases where there are dupes, Google chooses one. If 10 sites link to your site using www.example.com and 10 more link using just example.com you'll get the :link goodness" benefit of only 10 links. The complete solution to this involves ensuring that when users and Google arrive at the "wrong" on, the server responds with an HTTP 301 status and redirects the user to the "right" one. This is known as domain canonicalization and is a good thing for many, many reasons. Use this in addition to the "canonical" link tag and other techniques.

Hide website filenames in URL

I would like to hide the webpage name in the url and only display either the domain name or parts of it.
For example:
I have a website called "MyWebSite". The url is: localhost:8080/mywebsite/welcome.xhtml. I would like to display only the "localhost:8080/mywebsite/".
However if the page is at, for example, localhost:8080/mywebsite/restricted/restricted.xhtml then I would like to display localhost:8080/mywebsite/restricted/.
I believe this can be done in the web.xml file.
I believe that you want URL rewriting. Check out this link: http://en.wikipedia.org/wiki/Rewrite_engine - there are many approaches to URL rewriting, you need to decide what is appropriate for you. Some of the approaches do make use of the web.config file.
You can do this in several ways. The one I see most is to have a "front door" called a rewrite engine that parses the URL dynamically to internally redirect the request, without exposing details about how that might happen as you would see if you used simple query strings, etc. This allows the URL you specify to be digested into a request for a master page with specific content, instead of just looking up a physical page at that location to serve.
The StackExchange sites do this so that you can link to a question in a semi-permanent fashion (and thus can use search engines with crawlers that log these URLs) without them having to have a real page in the file system for every question that's ever been asked (we're up to 9,387,788 questions as of this one).

Why would I put ?src= in a link?

I feel dumb for not knowing this, but I see a lot of links in web pages and instead of this:
<a href="http://foo.com/">
...they use this:
<a href="http://foo.com/?src=bar.com">
Now I understand that the ?src= is telling something that this referral is coming from bar.com, but I don't understand why this needs to be called out explicitly. Can anyone shed some light on it for me? Is this something I need to include in my program generated links?
EDIT: Ok, sorry, I'm not being clear enough. I understand the GET syntax with a question mark and parameters separated by ampersands. I'm wondering what's this special src parameter? Why would one site link to another and tack an src parameter on the end even though there's no indication that the destination site uses this normally.
For example, on this page hover your mouse over the screenshot. The link URL is http://moms4mom.com/?src=stackexchangesites
But moms4mom.com is our site. Passing the src parameter does nothing, so why include it?
There are a few reasons that the src is being used explicitly. But in general, it is easier and more reliable to trust a query string to determine referer[sic] than it is to trust the referer, since the latter is often broken, deliberately or not. On the other hand, browsers almost never break the query string in a url, since this, unlike referers, is pretty important for pages to function. Besides, a referer is often done without any deliberate action on the part of the site doing the refering, which some users dislike.
The reason (I do it) is that popular analytics tools sometimes make it easier to filter on query strings than referrers.
There is no standard to the src parameter. Each site has its own and it's usually up to the site that gets the link to define how it wants to read it (as usually it's that site that's going to pay for the click).
The second is a dynamic link, it's a URL that another language(like ASP and PHP) interpret as something to do, like in those Google URLs, but i never used this site(foo.com), then i don't much things about this parameter.
Depending on how the site processes its URL, you may or may not need to include the ?... information.
This is passed to the website, and the server can process it just like form input. Some sites require this - and build their navigation off a single page, using nothing but the "extra" stuff passed afterwards. If you're generating a link to a site like that, it will be required.
In other cases, this is just used to pass extra, unrequired info (such as advertising, tracking info, etc)... In those cases, you can leave it off.
Unfortunately, there's no way to know without trying whether you can remove the "extra" bits from the URL.
After reading some of your comments - I'll also say:
There is nothing special about the "src" field in a query string. The server is free to use it any way it wishes. Unless you know specific info about the server, you cannot assume it can be left out.
The part after the ? is the query string. Different sites use it for different things, and it is usually used for passing information to the server side code for that URL, but can also be used in javascript.
For more info see Query String

Resources