What is the best URL strategy to handle multiple search parameters and operators? - url

Searching with Multiple Parameters
In my app I would like to allow the user to do complex searches based on several parameters, using a simple syntax similar to the GMail functionality where a user can search for "in:inbox is:unread" etc.
However, GMail does a POST with this information and I would like the form to be a GET so that the information is in the URL of the search results page.
Therefore I need the parameters to be formatted in the URL.
Requirements:
Keep the URL as clean as possible
Avoid the use of invalid URL chars such as square brackets
Allow lots of search functionality
Have the ability to add more functions later.
I know StackOverflow allows the user to search by multiple tags in this way:
https://stackoverflow.com/questions/tagged/c+sql
However, I'd like to also allow users to search with multiple additional parameters.
Initial Design
My current design is to use URLs such as these:
http://example.com/search/tagged/c+sql/searchterm/transactions
http://example.com/search/searchterm/transactions
http://example.com/search/tagged/c+sql
http://example.com/search/tagged/c+sql/not-tagged/java
http://example.com/search/tagged/c+sql/created/yesterday
http://example.com/search/created_by/user1234
I intend to parse the URL after the search parameter, then decide how to construct my search query.
Has anyone seen URL parameters like this implemented well on a website?
If so, which do it best?
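For illustration, the parsing step described above could look roughly like the following. This is only a minimal sketch under some assumptions: the function name `parse_search_path`, the `KNOWN_KEYS` and `LIST_KEYS` sets, and the rule that the path is a flat sequence of key/value segments are all hypothetical, not part of any existing library.

```python
from urllib.parse import unquote, urlsplit

# Hypothetical set of supported operators; extend it as functions are added.
KNOWN_KEYS = {"tagged", "not-tagged", "searchterm", "created", "created_by"}
# Keys whose value is a "+"-separated list rather than a single string.
LIST_KEYS = {"tagged", "not-tagged"}


def parse_search_path(url):
    """Turn /search/key/value/key/value... into a dict of search criteria."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if not segments or segments[0] != "search":
        raise ValueError("not a search URL")
    rest = segments[1:]
    if len(rest) % 2 != 0:
        raise ValueError("every key needs a value")
    criteria = {}
    for key, raw in zip(rest[0::2], rest[1::2]):
        if key not in KNOWN_KEYS:
            raise ValueError(f"unknown search key: {key}")
        if key in LIST_KEYS:
            # Split on the literal "+" first, then percent-decode each item,
            # so a tag such as "c++" can be sent as "c%2B%2B".
            criteria[key] = [unquote(item) for item in raw.split("+")]
        else:
            criteria[key] = unquote(raw)
    return criteria


if __name__ == "__main__":
    print(parse_search_path(
        "http://example.com/search/tagged/c+sql/not-tagged/java"))
```

Rejecting unknown keys up front keeps room for adding operators later without old URLs silently changing meaning.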

What you have here isn't a bad start.
One thing to keep in mind is that URLs have a practical length limit of roughly 2,000 characters in IE. Keep this in mind when weighing SEO and readability against brevity.
I'm not aware of any standards in this arena outside of common sense, which it appears you've captured.
Another thing to keep in mind is that most search engines use standard URL parameters, e.g. http://www.google.com/search?hl=en&source=hp&q=donkeys+for+sale&aq=f&aqi=g10&aql=&oq=&gs_rfai=
There is good reason for this, namely URL encoding and allowing non-traditional characters in the search bar.
So while pretty URLs are nice, they fall short here for a variety of reasons.
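As a rough sketch of the query-string alternative, Python's standard `urllib.parse` module handles the encoding and decoding of awkward characters. The parameter names (`tagged`, `not-tagged`, `q`) and the helper `build_search_url` are assumptions made up for this example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base, tagged=(), not_tagged=(), searchterm=None):
    """Build a GET search URL; repeated keys carry multiple values."""
    params = []
    for tag in tagged:
        params.append(("tagged", tag))
    for tag in not_tagged:
        params.append(("not-tagged", tag))
    if searchterm:
        params.append(("q", searchterm))
    # urlencode percent-encodes reserved characters such as "+" and "&".
    return base + "?" + urlencode(params)


def read_search_url(url):
    """Decode the query string back into a dict of value lists."""
    return parse_qs(urlsplit(url).query)


if __name__ == "__main__":
    url = build_search_url(
        "http://example.com/search",
        tagged=["c++", "sql"],
        searchterm="donkeys for sale & more",
    )
    print(url)
    print(read_search_url(url))
```

The round trip preserves values like `c++` and `&`, which a hand-built path format would have to escape on its own.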

Related

Canonical url and localization

In my application I have localized urls that look something like this:
http://examle.com/en/animals/elephant
http://examle.com/nl/dieren/olifant
http://examle.com/de/tiere/elefant
This question is mainly for Facebook Likes, but I guess I will hit similar problems when I start thinking about search engine crawlers.
What kind of URL would you expect as the canonical URL? I don't want to use the exact English URL, because I want people who click the link to be forwarded to their own language (based on browser settings or IP).
The IP lookup is not something that I want to do on every page hit. Besides that, I would need to incorporate more 'state' in my application, because I would have to check whether a user has already been forwarded to his own locale or is browsing the English version on purpose.
I guess it is going to be something like:
http://example.com/something/animals/elephant
or maybe without any language identifier at all:
http://example.com/animals/elephant
but that is a bit harder to implement, with a bigger chance of URL clashes in the future (in the rare case I get a category called en or de).
Summary
What kind of url would you expect as canonical url? Is there already a standard set for this?
I know this question is a bit old, but I was facing the same issue.
I found this:
Different language versions of a single page are considered duplicates only if the main content is in the same language (that is, if only the header, footer, and other non-critical text is translated, but the body remains the same, then the pages are considered to be duplicates).
That can be found here: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
From this I can conclude that we should add locales to canonicals.
I did find one resource that recommends not using the canonical tag with localized addresses. However, Google's documentation does not specify and only mentions subdomains in another context.
There is more than language that you need to think of.
It's typically a tuple of three: {region, language, property}
If you only have one website then you have {region, language} only.
Every piece of content can either be different in this three-dimensional space, or at least be presented differently. But it is still the same piece of content, so you'd like to centralize the management of editorial signals, promotions, tracking, etc. Think about search systems: you'd like page rank to be merged across all instances of the article, not spread thinly.
I think there is a standard solution: Canonical URL
Put language/region into the domain name
example.com
uk.example.com
fr.example.com
Now you have a choice of how to attach a cookie: to the subdomain (for language/region) or to the domain (for user tracking).
On every html page add a link to canonical URL
<link rel="canonical" href="http://example.com/awesome-article.html" />
Now you are done.
There certainly is no "standard" beyond the fact that it has to be a URL. What you do see on many commercial websites is exactly what you describe:
<protocol>://<server>/<language>/<more-path>
For the language tag you may follow the RFCs as well. I guess your two-letter abbreviation is quite fine.
I only disagree on the <more-path> part of the URL. If I understand you correctly, you are thinking about translating each page's path into the local language? I would not do that. Maybe I am not the standard user, but I personally like to manually monkey around in URLs. For example, if the URL shown is http://examle.com/de/tiere/elefant but I don't trust the content to be translated well, I would manually try http://examle.com/en/tiere/elefant, and that would not bring me to the expected page. And since I also dislike URLs like http://ex.com/with-the-whole-title-in-the-url-so-the-page-will-be-keyworded-by-search-engines, my favorite would be to exchange only the <language> part and use generic English (or any other single language) for <more-path>. E.g.:
http://examle.com/en/animals/elephant
http://examle.com/nl/animals/elephant
http://examle.com/de/animals/elephant
If your site is something like Wikipedia, then I would agree with your scheme of translating the <more-path> as well.
Maybe Google's guidelines can help with your issue: https://support.google.com/webmasters/answer/189077?hl=en
It says that many websites serve users (across the world) with content targeted to users in a certain region. It is advised to use the rel="alternate" hreflang="x" attributes to serve the correct language or regional URL in Search results.
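To make the rel="alternate" hreflang="x" idea concrete, here is a small sketch that renders one link element per language version. The helper name `alternate_links`, the example.com URLs, and the choice to emit an `x-default` entry are assumptions for illustration only.

```python
from html import escape


def alternate_links(localized_urls, default_lang=None):
    """Render one <link rel="alternate"> line per language version.

    localized_urls maps a language code to the URL of that version.
    If default_lang is given, an extra "x-default" entry points at it.
    """
    lines = []
    for lang, url in localized_urls.items():
        lines.append(
            f'<link rel="alternate" hreflang="{escape(lang)}" '
            f'href="{escape(url)}" />'
        )
    if default_lang is not None:
        default_url = localized_urls[default_lang]
        lines.append(
            f'<link rel="alternate" hreflang="x-default" '
            f'href="{escape(default_url)}" />'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(alternate_links(
        {
            "en": "http://example.com/en/animals/elephant",
            "nl": "http://example.com/nl/dieren/olifant",
            "de": "http://example.com/de/tiere/elefant",
        },
        default_lang="en",
    ))
```

Each language version of the page would include the same set of lines, so every version points at all the others.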

Search engine's viewpoint of a URL

Below are two links that point to the same page with the same content. I'm just trying to give this page the right URL.
I can give it one of the following URLs.
http://www.mywebsite.com/help/topic/2001
or
http://www.mywebsite.com/help?topic=2001
Now, when a search engine sees this, what is the effect on how the engine caches the page?
Do both links have the same effect, or can one of them improve the caching?
That depends on the search engine. For example, Google will automatically try to guess whether a URL parameter must be treated as identifying a unique web page, so in your scenario Google will guess that the value of the URL parameter "topic" defines a page. Other search engines may fail to guess this.
I think it's better to use a URL with no URL parameters, so it's absolutely clear that a URL points to a unique page, instead of relying on the guessing ability of a search engine.
Specifically, Google gives webmasters the ability to manually set how a parameter will be treated. https://support.google.com/webmasters/answer/1235687?hl=en

Url with pseudo anchors and duplicate content / SEO

I have a product page with options in a select list (e.g. the color of the product).
You can reach my product through different URLs:
www.mysite.com/product_1.html
www.mysite.com/product_1.html#/color-green
If you open the URL www.mysite.com/product_1.html#/color-green, the green option of the select list is automatically selected (with JavaScript).
If I link to my product page with those URLs, is there a risk of duplicate content? Is it good for my SEO?
thx
You need to use canonical URLs in order to let the search engines know that you are aware that the content seems duplicated.
Basically, setting a canonical URL on your page www.mysite.com/product_1.html#/color-green that points to www.mysite.com/product_1.html tells the search engine that whenever it sees www.mysite.com/product_1.html#/color-green it should not scan this page but rather scan the page www.mysite.com/product_1.html.
This is the suggested method to overcome duplicate content of this type.
See these pages:
SEO advice: url canonicalization
A rel=canonical corner case
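As a small illustration of the canonical target for the URLs in the question: the part after # is a fragment, which browsers handle client-side and do not send to the server, so the canonical URL is simply the URL with the fragment removed. The helper names below are hypothetical, and the http:// scheme is added only to make the example URLs complete.

```python
from html import escape
from urllib.parse import urldefrag


def canonical_without_fragment(url):
    """Drop the #fragment; the rest of the URL identifies the page."""
    base, _fragment = urldefrag(url)
    return base


def canonical_link_tag(url):
    """Render the canonical link element for the fragment-free URL."""
    href = escape(canonical_without_fragment(url))
    return f'<link rel="canonical" href="{href}" />'


if __name__ == "__main__":
    print(canonical_link_tag("http://www.mysite.com/product_1.html#/color-green"))
```

Both product URLs from the question produce the same tag, which is the point of declaring a canonical URL.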
At one time I saw Google indexing the odd #ed URL and showing them in results, but it didn't last long. I think it also required that there was an on page link to the anchor.
Google does support the concept of the hashbang (#!) as a specific way to do indexable anchors and support AJAX, which implies an anchor without the bang (!) will no longer be considered for indexing.
Either way, Google is not stupid. The basic use of the anchor is to move to a place on a page, i.e. it is the same page (duplicate content) but a different spot. So Google will expect a #ed URL to contain the same content. Why would they punish you for doing what the # is for?
And what is "the risk of duplicate content"? Generally, the only onsite risk from duplicate content is that Google may waste its time crawling duplicate pages instead of focusing on other valuable pages. As Google will assume # is the same page, it is more likely to not even try the #ed URL.
If you're worried, implement the canonical tag, but do it right. I've seen more issues from implementing it badly than the supposed issues they are there to solve.
Both answers above are correct. Google has said they ignore hash fragments unless you use the hash-bang format (#!) -- and that really only addresses a certain use case, so don't add it just because you think it will help.
Using the canonical link tag is the right thing to do.
One additional point about dupe content: it's less about the risk than about a missed opportunity. In cases where there are dupes, Google chooses one. If 10 sites link to your site using www.example.com and 10 more link using just example.com, you'll get the "link goodness" benefit of only 10 links. The complete solution to this involves ensuring that when users and Google arrive at the "wrong" one, the server responds with an HTTP 301 status and redirects the user to the "right" one. This is known as domain canonicalization and is a good thing for many, many reasons. Use this in addition to the "canonical" link tag and other techniques.
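The domain canonicalization described above could be sketched as a pure function that decides whether a request needs a 301. Which host counts as the "right" one is a site decision; `CANONICAL_HOST`, `ALIAS_HOSTS`, and `redirect_for` are assumed names, and in practice this logic usually lives in the web server configuration rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: www.example.com is the "right" host, example.com the "wrong" one.
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com"}


def redirect_for(url):
    """Return (301, location) for a non-canonical host, otherwise None."""
    parts = urlsplit(url)
    # .hostname is lower-cased and excludes any port; this sketch ignores ports.
    host = parts.hostname or ""
    if host in ALIAS_HOSTS:
        location = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
        )
        return 301, location
    return None


if __name__ == "__main__":
    print(redirect_for("http://example.com/products?page=2"))
    print(redirect_for("http://www.example.com/products?page=2"))
```

Keeping the path and query intact means every deep link on the alias host maps to its exact counterpart on the canonical host.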

Why would I put ?src= in a link?

I feel dumb for not knowing this, but I see a lot of links in web pages and instead of this:
<a href="http://foo.com/">
...they use this:
<a href="http://foo.com/?src=bar.com">
Now I understand that the ?src= is telling something that this referral is coming from bar.com, but I don't understand why this needs to be called out explicitly. Can anyone shed some light on it for me? Is this something I need to include in my program generated links?
EDIT: Ok, sorry, I'm not being clear enough. I understand the GET syntax with a question mark and parameters separated by ampersands. I'm wondering what's this special src parameter? Why would one site link to another and tack an src parameter on the end even though there's no indication that the destination site uses this normally.
For example, on this page hover your mouse over the screenshot. The link URL is http://moms4mom.com/?src=stackexchangesites
But moms4mom.com is our site. Passing the src parameter does nothing, so why include it?
There are a few reasons that the src is being used explicitly. But in general, it is easier and more reliable to trust a query string to determine the referer[sic] than it is to trust the referer header, since the latter is often broken, deliberately or not. On the other hand, browsers almost never break the query string in a URL, since this, unlike the referer, is pretty important for pages to function. Besides, a referer is often sent without any deliberate action on the part of the site doing the referring, which some users dislike.
The reason (I do it) is that popular analytics tools sometimes make it easier to filter on query strings than referrers.
There is no standard to the src parameter. Each site has its own and it's usually up to the site that gets the link to define how it wants to read it (as usually it's that site that's going to pay for the click).
The second is a dynamic link: a URL that a server-side language (like ASP or PHP) interprets as an instruction to do something, as in those Google URLs. I have never used this site (foo.com), so I don't know much about this particular parameter.
Depending on how the site processes its URL, you may or may not need to include the ?... information.
This is passed to the website, and the server can process it just like form input. Some sites require this - and build their navigation off a single page, using nothing but the "extra" stuff passed afterwards. If you're generating a link to a site like that, it will be required.
In other cases, this is just used to pass extra, unrequired info (such as advertising, tracking info, etc)... In those cases, you can leave it off.
Unfortunately, there's no way to know without trying whether you can remove the "extra" bits from the URL.
After reading some of your comments - I'll also say:
There is nothing special about the "src" field in a query string. The server is free to use it any way it wishes. Unless you know specific info about the server, you cannot assume it can be left out.
The part after the ? is the query string. Different sites use it for different things, and it is usually used for passing information to the server side code for that URL, but can also be used in javascript.
For more info see Query String
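As a small sketch of how a receiving site might read such a parameter, using only Python's standard library. The function name `referral_source` is made up for this example, and, as the answers note, a site is free to ignore the parameter entirely.

```python
from urllib.parse import parse_qs, urlsplit


def referral_source(url, param="src"):
    """Return the first value of the given query parameter, or None."""
    query = parse_qs(urlsplit(url).query)
    values = query.get(param)
    return values[0] if values else None


if __name__ == "__main__":
    print(referral_source("http://moms4mom.com/?src=stackexchangesites"))
    print(referral_source("http://foo.com/"))
```

An analytics tool or a server-side log processor could group visits by this value without relying on the referer header.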

Can an "SEO Friendly" url contain a unique ID?

I'd like to start using "SEO Friendly Urls" but the notion of generating and looking up large, unique text "ids" seems to be a significant performance challenge relative to simply looking up by an integer. Now, I know this isn't as "human friendly", but if I switched from
http://mysite.com/products/details?id=1000
to
http://mysite.com/products/spacelysprokets/sproket/id
I could still use the ID alone to quickly lookup the details, but the URL itself contains keywords that will display in that detail. Is that friendly enough for Google? I hope so as it seems a much easier process than generating something at the end that is both unique and meaningful.
Thanks!
James
Be careful with allowing a page to render using the same method as Stack overflow.
http://stackoverflow.com/questions/820493/random-text-can-cause-problems
Black hats can use this to cause a duplicate content penalty for long-tail competitors (trust me).
Here are two things you can do to protect yourself from this.
Issue an HTTP 301 redirect for any inbound URL whose ID matches but whose text does not, sending it to the URL with the correct text.
Example:
http://stackoverflow.com/questions/820493/random-text-can-cause-problems
301 ->
http://stackoverflow.com/questions/820493/can-an-seo-friendly-url-contain-a-unique-id
Use canonical URLs.
<link rel="canonical"
href="http://stackoverflow.com/questions/820493/can-an-seo-friendly-url-contain-a-unique-id"
/>
https://stackoverflow.com/questions/820493/can-an-seo-friendly-url-contain-a-unique-id
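The first of the two protections above can be sketched as follows. This is only an illustration, not how Stack Overflow implements it: the `SLUGS` dict stands in for a database lookup, and `QUESTION_PATH` and `resolve` are hypothetical names.

```python
import re

# Hypothetical stand-in for a database lookup: id -> canonical slug.
SLUGS = {820493: "can-an-seo-friendly-url-contain-a-unique-id"}

# /questions/<id>, optionally followed by /<any text> and a trailing slash.
QUESTION_PATH = re.compile(r"^/questions/(\d+)(?:/([^/]*))?/?$")


def resolve(path):
    """Return (status, canonical_path) for a request path."""
    match = QUESTION_PATH.match(path)
    if match is None:
        return 404, None
    question_id = int(match.group(1))
    slug = SLUGS.get(question_id)
    if slug is None:
        return 404, None
    canonical = f"/questions/{question_id}/{slug}"
    if path != canonical:
        # The ID is valid but the text is wrong or missing: redirect.
        return 301, canonical
    return 200, canonical


if __name__ == "__main__":
    print(resolve("/questions/820493/random-text-can-cause-problems"))
    print(resolve("/questions/820493/can-an-seo-friendly-url-contain-a-unique-id"))
```

Only the ID is used for the lookup, so it stays fast, while the redirect ensures that exactly one URL per page ever returns content.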
I'd say you're fine.
Have a look at the URLs that StackOverflow uses. They have a unique id, then they have the SEO-friendly stuff. You can omit the SEO-friendly stuff and the URL still works.
You are making a devil's bargain here: you are trading away business goals for technology goals.
If you were to ask "From a purely business and SEO perspective, is it better to include unique IDs in the URL or not?", the answer would clearly be to not use them.
The question then becomes: if you do use them, how much does it hurt you in the search engines? The answer is that it definitely has some negative impact. How much is yet to be determined.
In terms of "user friendly", no, they are definitely not user friendly.
In terms of Google, they state "Whenever possible, shorten URLs by trimming unnecessary parameters." See their URL structure document.
I'm not aware of any problems caused by adding an ID to a URL. In fact it can be extremely useful, as it allows the human/search engine friendly part of the URL to be changed without causing a broken link to a page that a search engine has already indexed. Using SO as an example, here's a link to your question:
https://stackoverflow.com/questions/820493/you-can-put-any-text-you-want-here
Nothing wrong with that. An increasing number of services have started to use a hybrid solution as Paul Tomblin already pointed out. In addition to SO, Tumblr uses this pattern too (maybe it was the first).
Furthermore, in certain services—like Google News—the URL must contain a unique numeric ID.
Getting rid of the parameterized URL will definitely help. From my experience, including the ID does not hurt or help, as long as there are no '?key=value' pairs in the url.
I have two seemingly contradictory points to make here:
Nobody looks at URLs! Experience has "trained" browser users to treat the contents of the "Address" box as invisible; they know the contents will be any two of 'unreadable', 'meaningless' and 'confusing', so they just ignore it completely.
Using a string which can be easily converted to an integer may offer a slight performance advantage over using a longer string which is slightly harder to convert into an integer (hash() vs. to_int()). However, in the context of the average web application any performance difference would be negligible.
My advice would be to stick with what you're comfortable with.
Use something like mod_rewrite to rewrite URLs before they reach your application code. So you could convert a slug like http://oorl.com/99942/My-Friendly-Text-For-Search-Engines/ into http://oorl.com/lookup.php?id=99942. This will also let you change the slug and keywords used to optimize certain links without damaging functionality.
Duplicate content causes more negative impact than an unfriendly URL does, so be careful about accepting arbitrary text alongside the ID; your competitors could misuse this.
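The mapping a rewrite rule performs here can be modeled in a few lines. This is a Python stand-in for the idea, not mod_rewrite syntax; `SLUG_URL` and `rewrite` are hypothetical names and `lookup.php` is taken from the example URL above.

```python
import re
from urllib.parse import urlsplit

# /<numeric id>, optionally followed by /<any text> and a trailing slash.
SLUG_URL = re.compile(r"^/(\d+)(?:/[^/]*)?/?$")


def rewrite(url):
    """Map a slug URL to the internal lookup URL, or None if no ID is found."""
    parts = urlsplit(url)
    match = SLUG_URL.match(parts.path)
    if match is None:
        return None
    return f"{parts.scheme}://{parts.netloc}/lookup.php?id={match.group(1)}"


if __name__ == "__main__":
    print(rewrite("http://oorl.com/99942/My-Friendly-Text-For-Search-Engines/"))
    print(rewrite("http://oorl.com/99942/Some-Other-Text"))
```

Because only the numeric part is captured, the descriptive text can change freely and both URLs resolve to the same internal lookup.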
Yes, and in fact it's more SEO friendly to include a number in your url as it implies to google that you are consistently updating your content.
I am fairly sure that it makes it much more difficult to get indexed in Google News if you don't have an incrementing number attached in some way to your URLs.

Resources