Let's say I want to refer to a restaurant page. I could use one of these two URLs, for example:
/restaurants/123
/restaurants/Pizzeria-Mamma
URL 1 has the advantage of being a quick match because of the ID, but it is not as descriptive as URL 2.
Does the URL matter to search engines? I read somewhere that it is good to put keywords in the URL, so URL 2 would be the way to go. Can someone confirm or deny this?
Ultimately, search engine algorithms are designed to reward good usability (although obviously in practice that is not always the case). As a user, it would semantically make more sense to have the pizzeria name in the URL, and you have the added advantage of it being easier to remember. Whilst kbrimington's comment is correct that page content is probably more important, SEOMoz, a search engine algorithm authority, puts keywords in the URL somewhere between the 9th and 11th most important ranking factor, depending on where they appear in the URL:
http://www.seomoz.org/article/search-ranking-factors#ranking-factors
At number 5 in the ranking factors, however, is anchor text; it's only an opinion, but I'm inclined to say that having the word "pizzeria" in the URL is more likely to encourage people to put "pizzeria" in the anchor text when they link to your site.
Suppose I have a URL like https://example.com?page=1 or https://example.com?text=1. What does ?page=1 or ?text=1 mean here? On some websites, like YouTube, I can see they use something like https://youtube.com?watch=zcDchec. What does it mean?
Can anyone explain? I need to know this.
That’s the query component (indicated by the first ?).
It’s a common convention to use key=value pairs, separated by &, in the query component. But it’s up to the author what to use. And it’s up to the author to decide what it should mean.
In practice, the query component ?page=1 will likely point to page 1 of something, while ?page=2 will point to page 2, etc. But there is nothing special about this. The author could as well have used the path component, e.g. /page/1 and /page/2.
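For example, with the standard URL API you can read those key=value pairs yourself; the parameter names below are just illustrations:

```typescript
// Reading the query component with the standard URL API.
// "page" and "text" are example parameter names, not anything special.
const url = new URL("https://example.com/articles?page=2&text=1");

console.log(url.search);                    // "?page=2&text=1" (the query component)
console.log(url.searchParams.get("page"));  // "2"
console.log(url.searchParams.get("text"));  // "1"

// The same information could instead live in the path, e.g. /articles/page/2;
// it is entirely up to the site's author which form to use and what it means.
```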
I discovered a use case for matching Request URLs using an expression that ignores part of a URL path (the path, not ignoreSearch).
The use case is for an image processing service used in a responsive design where the dimensions of the image are encoded in the url path. This is sort of common among these kinds of services (Cloudinary, Firesize, even Lorempixel).
I noticed that every once in a while, one of the dimension components in the request will be off by one pixel. The required dimensions are calculated on the client (the rounding there is the source of the error), but the service worker cache could be an elegant solution for this variation.
However, this rounding problem results in a cache miss because I can't specify that part of the URL path should be ignored.
Will URL expression matching ever become part of the spec?
In general, is it OK for the 'fetch with URL A, cache put/match with URL B' pattern to grow?
I get that the workaround for this is the same as the current workaround for ignoreSearch (until it is implemented): you fetch with one URL but cache with another. I'm just wondering whether URL path expression matching will ever become part of the spec, or whether a URL expression matching use case has been considered. I don't see any evidence of this in the authoritative spec.
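For reference, the workaround I have in mind looks roughly like this; it is only a sketch, and normalizeImageUrl plus the /<width>x<height>/ path layout are assumptions about my image service, not anything the Cache API defines:

```typescript
// Sketch of the "fetch with URL A, cache put/match with URL B" workaround,
// written for a service worker. The /<width>x<height>/ path segment is an
// assumption about the image service, not part of the Cache API.
function normalizeImageUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  // Snap dimensions to the nearest 10px bucket so an off-by-one request
  // (e.g. /301x200/) maps to the same cache key as /300x200/.
  url.pathname = url.pathname.replace(/\/(\d+)x(\d+)\//, (_m, w, h) => {
    const snap = (n: string) => Math.round(Number(n) / 10) * 10;
    return `/${snap(w)}x${snap(h)}/`;
  });
  return url.toString();
}

self.addEventListener("fetch", (event: any) => {
  const cacheKey = normalizeImageUrl(event.request.url); // URL B
  event.respondWith(
    caches.open("images").then(async (cache) => {
      const cached = await cache.match(cacheKey);
      if (cached) return cached;
      const response = await fetch(event.request);       // fetch with URL A
      await cache.put(cacheKey, response.clone());       // store under URL B
      return response;
    })
  );
});
```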
Thanks in advance for any words of insight.
It might be considered at some point in the future if it becomes a dominant pattern in many applications. Usually, if something is fairly common, it will eventually be included in the standard so it can be made faster and more reliable. I wouldn't count on it happening anytime soon, though, or before many libraries have support for it.
I am looking to write a basic profanity filter in a Rails-based application. It will use a simple search-and-replace mechanism whenever the appropriate attribute is submitted by a user. My question, for those who have written these before: is there a CSV file or some database out there with a list of profanity words that I can import into my database? We are supplying the replacement words ourselves. We more or less need a database of profanities, racial slurs, and anything else that isn't exactly PG-13 to trigger the filter.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc.). CleanSpeak is capable of filtering 20,000 messages per second on a low-end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of ongoing development, though.
There are a few things I tell everyone who is looking to tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes over the string. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle them. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this).
Be careful with your handling of punctuation, because it can be used to get around the filter (l.i.k.e th---is).
Understand how people use ASCII art and Unicode to replace characters (\/ = v; those are slashes). There are a lot of Unicode characters that look like English characters, and you will want to handle those appropriately (a rough sketch of this kind of normalization follows this list).
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
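To make those last few points concrete, here is a minimal sketch of the kind of normalization pass a naive filter needs before it even starts matching. The leet map, the placeholder word list, and the function names are all illustrative; they are not taken from CleanSpeak or any particular library:

```typescript
// Minimal normalization sketch: lowercase, undo common leet substitutions,
// strip punctuation, and collapse whitespace before matching.
// BAD_WORDS and LEET_MAP are placeholders, not a real filter list.
const LEET_MAP: Record<string, string> = {
  "4": "a", "@": "a", "3": "e", "1": "i", "!": "i", "0": "o", "5": "s", "$": "s", "7": "t",
};

const BAD_WORDS = ["badword", "worseword"];

function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[4@31!05$7]/g, (ch) => LEET_MAP[ch] ?? ch) // leet -> letters
    .replace(/[.\-_*'^~]/g, "")                          // l.i.k.e th---is
    .replace(/\s+/g, "");                                // b e c a u s e
}

function containsProfanity(text: string): boolean {
  const flat = normalize(text);
  return BAD_WORDS.some((word) => flat.includes(word));
}

// Caveat: collapsing whitespace and matching substrings is exactly how you run
// into the Scunthorpe problem, which is why whitelisting and real language
// intelligence are needed on top of this.
```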
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context and profane in another, so you'll have to write a context parser to avoid blacklisting clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And there are myriad ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long it would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, and s could be 5. And that's a very, very short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list based on several profanity/offensive lists available on the internet. After running the generator, and being a little over a quarter of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regexp::Assemble, was going to be ridiculous, especially since it would still be so easy to fool.
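To give a sense of why that generated list exploded, here is a rough sketch of the kind of variation generator I mean; the substitution table is just an example:

```typescript
// Rough sketch of a leet-speak variation generator, to show how quickly
// the word list grows. The substitution table is only an example.
const SUBS: Record<string, string[]> = {
  a: ["a", "4", "@"],
  e: ["e", "3"],
  i: ["i", "1", "!"],
  o: ["o", "0"],
  s: ["s", "5", "$"],
};

function variations(word: string): string[] {
  let results = [""];
  for (const ch of word) {
    const options = SUBS[ch] ?? [ch];
    results = results.flatMap((prefix) => options.map((opt) => prefix + opt));
  }
  return results;
}

// A single word already multiplies out to hundreds of variants:
console.log(variations("profanities").length); // 324
```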
I recommend you have a long talk with whoever requested this and ask whether they understand the programming issues involved, the low likelihood of accuracy and success (especially over the long term), and the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb
I was wondering about this: in order to get good SEO, you have to use natural language in your URLs.
Do you know the max size for a word or phrase in characters?
ex:
www.me.com/this-is-a-really-long-url.htm
I ask this because I once found that Google was banning some of my URLs because they were too long.
thanks a lot :D
Could you link to that information? I suspect the SEO limit is more about the total length than the individual parts.
When generating friendly URLs, try to keep below 15 characters for sections, 40 characters for the article title, and a couple more for the ID.
Also, try to use subdomains if available; you can link them in your Google panel anyway.
That is:
http://forums.en.mycompany.com/general/annoucements/5142-today-is-a-big-day.html
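A sketch of what enforcing those rough budgets might look like when generating slugs; the limits are just the rule of thumb above, not anything published by a search engine:

```typescript
// Sketch of slug generation with rough length budgets: ~15 characters for a
// section, ~40 for an article title. The numbers are only the rule of thumb
// above, not an official limit.
function slugify(text: string, maxLength: number): string {
  const slug = text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")  // non-alphanumerics become hyphens
    .replace(/^-+|-+$/g, "");     // trim leading/trailing hyphens
  if (slug.length <= maxLength) return slug;
  // Cut at the limit, dropping any trailing partial word.
  return slug.slice(0, maxLength).replace(/-[^-]*$/, "");
}

const section = slugify("General Announcements", 15); // "general"
const article = slugify("Today is a big day!", 40);   // "today-is-a-big-day"
console.log(`http://forums.example.com/${section}/5142-${article}.html`);
// http://forums.example.com/general/5142-today-is-a-big-day.html
```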
There is no limitation on URL length as of now. You can make it as long as you wish, but that does not bring you any positive effect or make it more SEO friendly.
So it is advisable not to exceed 90 characters.
In a now-migrated question about human-readable URLs, I allowed myself to elaborate on a little hobby-horse of mine:
When I encounter URLs like http://www.example.com/product/123/subpage/456.html, I always think that this is an attempt at creating meaningful, hierarchical URLs that are, however, not entirely hierarchical. What I mean is that you should be able to slice off one level at a time. In the above, the URL violates this principle in two ways:
/product/123 is one piece of information represented as two levels. It would be more correctly represented as /product:123 (or whatever delimiter you like).
/subpage is very likely not an entity in itself (i.e., you cannot go up one level from 456.html as http://www.example.com/product/123/subpage is "nothing").
Therefore, I find the following more correct:
http://www.example.com/product:123/456.html
Here, you can always navigate up one level at a time:
http://www.example.com/product:123/456.html — The subpage
http://www.example.com/product:123 — The product page
http://www.example.com/ — The root
Following the same philosophy, the following would make sense [and provide an additional link to the products listing]:
http://www.example.com/products/123/456.html
Where:
http://www.example.com/products/123/456.html — The subpage
http://www.example.com/products/123 — The product page
http://www.example.com/products — The list of products
http://www.example.com/ — The root
My primary motivation for this approach is that if every "path element" (delimited by /) is self-contained¹, you will always be able to navigate to the "parent" by simply removing the last element of the URL. This is what I (sometimes) do in my file explorer when I want to go to the parent directory. Following the same line of logic, the user (or a search engine / crawler) can do the same. Pretty smart, I think.
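As a trivial illustration of that "remove the last element" idea (a sketch only, reusing the example URLs from above):

```typescript
// Derive the "parent" URL by dropping the last path element.
function parentUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  const parts = url.pathname.split("/").filter(Boolean);
  parts.pop(); // remove the last path element
  url.pathname = "/" + parts.join("/");
  return url.toString();
}

console.log(parentUrl("http://www.example.com/product:123/456.html"));
// http://www.example.com/product:123
console.log(parentUrl("http://www.example.com/product:123"));
// http://www.example.com/
```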
On the other hand (and this is the important bit of the question): while I can never prevent a user from trying to access a URL he himself has amputated, am I wrong to assume (and cater for the possibility) that a search engine might do the same? I.e., is it reasonable to expect that no search engine (or really: Google) would try to access http://www.example.com/product/123/subpage (point 2, above)? (Or am I really only taking the human factor into account here?)
This is not a question about personal preference. It's a technical question about what I can expect of a crawler / indexer and to what extent I should take non-human URL manipulation into account when designing URLs.
Also, the structural "depth" of http://www.example.com/product/123/subpage/456.html is 4, whereas http://www.example.com/products/123/456.html is only 3. Rumour has it that this depth influences search engine ranking. At least, so I was told. (It is now evident that SEO is not what I know most about.) Is this (still?) true: does the hierarchical depth (number of directories) influence search ranking?
So, is my "hunch" technically sound or should I spend my time on something else?
Example: Doing it (almost) right
Good ol' SO gets this almost right. Case in point: profiles, e.g., http://stackoverflow.com/users/52162:
http://stackoverflow.com/users/52162 — Single profile
http://stackoverflow.com/users — List of users
http://stackoverflow.com/ — Root
However, the canonical URL for the profile is actually http://stackoverflow.com/users/52162/jensgram, which seems redundant (the same endpoint represented on two hierarchical levels). An alternative: http://stackoverflow.com/users/52162-jensgram (or any other delimiter, used consistently).
1) Carries a complete piece of information not dependent on "deeper" elements.
Hierarchical URLs of the kind http://www.example.com/product:123/456.html are as useless as http://www.example.com/product/123/subpage, because when users see your URLs, they don't care about identifiers from your database; they want meaningful paths. This is why Stack Overflow puts question titles into URLs: "http://stackoverflow.com/questions/4017365/human-readable-urls-preferably-hierarchical-too".
Google advises against the practice of replacing ordinary query strings like "http://www.example.com/?product=123&page=456", because when every site develops its own scheme, the crawler doesn't know what each part means or whether it is important. Google has developed sophisticated mechanisms to find the important parameters and ignore the unimportant ones, which means you'll get more pages into the index and there will be fewer duplicates. But these algorithms often fail when web developers invent their own schemes.
If you care about both users and crawlers, you should use URLs like these instead:
http://www.example.com/products/greatest-keyboard/benefits — the subpage
http://www.example.com/products/greatest-keyboard — the product page
http://www.example.com/products — the list of products
http://www.example.com/ — the root
Also, search engines give a higher rating to pages with keywords in the URL.