Descriptive URLs vs. Basic URLs

I have a website and I'm employing Clean URLs to all of the links. I'm wondering what the opinion is about short, basic URLs versus longer, descriptive URLs.
For instance, if my website was about Georgia Bulldog football news, which would be better for SEO purposes?
http://www.example.com/news
or
http://www.example.com/georgia-bulldog-football-news
I've read quite a bit, but I'm torn on the simple vs. descriptive factor. Can anyone give opinions based on SEO experience?

The descriptive format, as the search engine can pick up keywords inside the URL. Apart from that, I don't think there's much difference. I personally prefer the simple format, but I'm obsessed with URLs!

Think of it in terms of the end user.
I don't know how much Google really uses URLs in its rankings. It's something that can be so obviously spoofed (like keywords) that I suspect it's low in their algorithm. The heart of what they do is counting incoming links and trying to discern the meaning of the actual page contents.
But users appreciate readable URLs. It gives them a hint of what they will be getting. I know that a readable URL greatly increases the likelihood that I will click on something (in an email, say).

See here and here for lots of detail on this topic.

No one but Google knows for sure exactly how much this factors into rankings, but it helps, and Google recommends that you use hyphens (just as you demonstrated). This also tends to increase clickthroughs from search result pages. I found this article very useful:
http://searchengineland.com/supercharge-your-urls-for-maximum-seo-impact-14006

Readability is nice, and may help your rankings. In your example the exact domain is important e.g.
georgiabulldogs.com/news
and
southerncollegesports.com/news
will leave a user with very different expectations.
In some cases typability is also important, and long hyphenated or ID-ridden URLs are terrible any time you expect people to type in a URL.

I love the second one.
No. 1: From an SEO perspective, it's better to have keywords in your URL (better than nothing, though I'm not sure exactly how much this counts towards ranking) - and note that STACKOVERFLOW.COM is doing the same.
No. 2: From the visitors' standpoint, the second one is readable. It's good for end users, and there's no reason for Google not to count that!

Related

URL my-web-url.com vs myweburl.com in SEO

Can anyone explain the difference between these two kinds of domain from a search engine's point of view, and its effect? Although there are two separate words in the domain, most people prefer a domain without "-". As far as I know, "-" is treated as a space in a URL, while "_" joins the words into one, but both symbols are rarely used in domain names. Can anyone explain the difference between these two?
One should give priority to the domain name without '-', because a hyphenated name is awkward to say when telling someone your domain, and chances are high that people will forget the '-' when typing it, at least the first few times. Of course this will impact your business negatively.
Also, a domain with a hyphen doesn't produce a very good feeling in the customer either. I agree with what #chimpsarehungry said in the earlier answer.
Other than that, I don't think it matters much for SEO. It may even have a positive effect in some cases, as in long URLs - e.g. WordPress post URLs. URLs with '-' are search-engine friendly.
Take a look for yourself, based on 2011 data gathered by SEOmoz:
http://www.seomoz.org/article/search-ranking-factors#metrics
Not looking so good for dashes. Some of that is from correlation of spammers using such domains, but definitely not all of it. I apologize I don't have a reference to back this up, but there was a Matt Cutts QA where he said multiple dashes is indicative of spam and does indeed get a negative hit in overall rank score. I believe it was part of a big keynote speech so it'd be hard to find. You'll just have to take my word for it.
I don't think this will matter at all. But as a search engine user the sites with dashes in between them look spam-like to me. Name one popular website with a dash.

metaphone versus soundex versus NYSIIS

I'm trying to come up with an implicit spell checker that will map input words to some kind of more general phonetic representation to account for typos that might occur - basically for a search bar that will automatically correct your spelling to a degree. The algorithms I've been looking into are Metaphone, NYSIIS and Soundex, but I don't really know which would be better for this application.
I would prefer more matches over fewer, and I would like the matching to be a bit more general, so for that reason I was thinking of going with Soundex, which seems to be a more approximate mapping than the original Metaphone - but I don't really know how large the difference in vagueness is. I know that NYSIIS is pretty similar to Soundex, but I don't have a good idea of how similar they are, or how NYSIIS compares to Metaphone.
I am also looking for the solution that is quickest to execute. I know these phonetic mappers are usually pretty fast, but I'm not sure which would be fastest; since I'd like to check spelling without increasing search time, speed is a consideration. Thoughts?
I managed to find a wonderful article on this over here:
http://www.informit.com/articles/article.aspx?p=1848528
Not quite everything I was looking for, but a pretty large amount of it.
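If you want a hands-on feel for how loose each encoding is, here is a minimal sketch assuming Python and the jellyfish library (which implements all three); feed it the kinds of typos you expect and see which algorithm still maps them to the same code as the correct spelling.

```python
import jellyfish

# Misspellings paired with the intended word; an encoder "matches" a typo
# if it produces the same code for both spellings.
pairs = [("necessary", "neccessary"), ("receipt", "reciept"), ("search", "serch")]

for correct, typo in pairs:
    for name, encode in [("soundex", jellyfish.soundex),
                         ("nysiis", jellyfish.nysiis),
                         ("metaphone", jellyfish.metaphone)]:
        match = encode(correct) == encode(typo)
        print(f"{name:9s} {correct} vs {typo}: "
              f"{encode(correct)} / {encode(typo)}  match={match}")
```

In general you should see Soundex collapsing the most (more matches, more false positives) and Metaphone being the most discriminating, which is exactly the trade-off described in the question.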

Intelligently extracting tags from blogs and other web pages

I'm not talking about HTML tags, but tags used to describe blog posts, or youtube videos or questions on this site.
If I was crawling just a single website, I'd just use an xpath to extract the tag out, or even a regex if it's simple. But I'd like to be able to throw any web page at my extract_tags() function and get the tags listed.
I can imagine using some simple heuristics, like finding all HTML elements with id or class of 'tag', etc. However, this is pretty brittle and will probably fail for a huge number of web pages. What approach do you guys recommend for this problem?
Also, I'm aware of Zemanta and Open Calais, which both have ways to guess the tags for a piece of text, but that's not really the same as extracting tags real humans have already chosen. But I would still love to hear about any other services/APIs to guess the tags in a document.
EDIT: Just to be clear, a solution that already works for this would be great. But I'm guessing there's no open-source software that already does this, so I really just want to hear from people about possible approaches that could work for most cases. It need not be perfect.
EDIT2: For people suggesting a general solution that usually works is impossible, and that I must write custom scrapers for each website/engine, consider the arc90 readability tool. This tool is able to extract the article text for any given article on the web with surprising accuracy, using some sort of heuristic algorithm I believe. I have yet to dig into their approach, but it fits into a bookmarklet and does not seem too involved. I understand that extracting an article is probably simpler than extracting tags, but it should serve as an example of what's possible.
Systems like the arc90 example you give work by looking at things like tag/text ratios and other heuristics; there is sufficient difference between the text content of the pages and the surrounding ads/menus etc. Other examples include tools that scrape emails or addresses - in those cases there are patterns that can be detected and locations that can be recognized. In the case of tags, though, you don't have much to help you uniquely distinguish a tag from normal text; it's just a word or phrase like any other piece of text. A list of tags in a sidebar is very hard to distinguish from a navigation menu.
Some blogs, like Tumblr, do have tags whose URLs contain the word "tagged", which you could use. WordPress similarly has ".../tag/..."-style URLs for tags. Solutions like this would work for a large number of blogs independent of their individual page layout, but they won't work everywhere.
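As a rough sketch of that URL-pattern approach (Python with requests and BeautifulSoup assumed; the /tag/ and /tagged/ patterns are the only real heuristic here):

```python
import re
import requests
from bs4 import BeautifulSoup

TAG_HREF = re.compile(r"/(?:tag|tagged)/([^/?#]+)", re.IGNORECASE)

def tags_from_tag_urls(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    tags = set()
    for a in soup.find_all("a", href=True):
        m = TAG_HREF.search(a["href"])
        if m:
            # prefer the visible link text, fall back to the URL slug
            tags.add(a.get_text(strip=True) or m.group(1))
    return sorted(tags)
```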
If the sources expose their data as a feed (RSS/Atom) then you may be able to get the tags (or labels/categories/topics etc.) from this structured data.
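A hedged sketch of that, assuming Python and the feedparser library, which normalises RSS/Atom category elements into entry.tags:

```python
import feedparser

def tags_by_post(feed_url):
    feed = feedparser.parse(feed_url)
    # entry.tags only exists when the feed actually declares categories
    return {entry.link: [t.term for t in entry.get("tags", [])]
            for entry in feed.entries}

# e.g. tags_by_post("https://example.wordpress.com/feed/")
```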
Another option is to parse each web page and look for tags formatted according to the rel=tag microformat.
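A minimal sketch of that check, assuming BeautifulSoup (which exposes rel as a list because it is a space-separated attribute):

```python
from bs4 import BeautifulSoup

def rel_tag_links(html):
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True)
            for a in soup.find_all("a", rel=True)
            if "tag" in a["rel"]]

# rel_tag_links('<a rel="tag" href="/tag/python">python</a>')  ->  ['python']
```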
Damn, was just going to suggest Open Calais. There's going to be no "great" way to do this. If you have some target platforms in mind, you could sniff for Wordpress, then see their link structure, and again for Flickr...
I think your only option is to write custom scripts for each site. To make things easier, though, you could look at AlchemyAPI. They have similar entity extraction capabilities to OpenCalais, but they also have a "Structured Content Scraping" product which makes it a lot easier than writing XPaths, by using simple visual constraints to identify pieces of a web page.
This is impossible because there isn't a well-known, followed specification. Even different versions of the same engine can create different output - hey, with WordPress a user can create his own markup.
If you're really interested in doing something like this, you should know it's going to be a real time-consuming and ongoing project: you're going to create a lib that detects which "engine" is being used on a page and parses it. If you can't detect a page for some reason, you create new rules to parse it and move on.
I know this isn't the answer you're looking for, but I really can't see another option. I'm into Python, so I would use Scrapy for this since it's a full framework for scraping: it's complete, well documented and really extensible.
Try making a Yahoo Pipe and running the source pages through the Term Extractor module. It may or may not give great results, but it's worth a try. Note - enable the V2 engine.
Looking at arc90, it seems they are also asking publishers to use semantically meaningful mark-up [see https://www.readability.com/publishers/guidelines/#view-exampleGuidelines] so they can parse it rather easily. But presumably they must either have developed generic rules such as the tag/text ratios #dunelmtech suggested, which can work for article detection, or they might be using a combination of text-segmentation algorithms (from the Natural Language Processing field) such as TextTiling and C99, which could be quite useful for article detection - see http://morphadorner.northwestern.edu/morphadorner/textsegmenter/ and Google for more info on both [published in the academic literature - Google Scholar].
It seems, however, that detecting "tags" as you require is a difficult problem (for the reasons already mentioned in the comments above). One approach I would try would be to use one of the text-segmentation algorithms (C99 or TextTiling) to detect the article start/end, and then look for DIVs / SPANs / ULs with CLASS & ID attributes containing "tag" in them; since, in terms of page layout, tags tend to sit just underneath the article and just above the comment feed, this might work surprisingly well.
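A sketch of just the class/id part of that heuristic (BeautifulSoup assumed; in practice you would first narrow the search to the region below the detected article body rather than scanning the whole page):

```python
from bs4 import BeautifulSoup

def candidate_tag_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for el in soup.find_all(["div", "span", "ul"]):
        # class is multi-valued in bs4, id is a plain string
        attrs = " ".join(el.get("class", []) + [el.get("id") or ""])
        if "tag" in attrs.lower():
            links = [a.get_text(strip=True) for a in el.find_all("a")]
            if links:
                blocks.append(links)
    return blocks
```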
Anyway, would be interesting to see whether you got somewhere with the tag detection.
Martin
EDIT: I just found something that might really be helpful. The algorithm is called VIPS [see: http://www.zjucadcg.cn/dengcai/VIPS/VIPS.html] and stands for Vision-Based Page Segmentation. It is based on the idea that page content can be visually split into sections. Compared with DOM-based methods, the segments obtained by VIPS are much more semantically aggregated. Noisy information, such as navigation, advertisements and decoration, can easily be removed because it is often placed in certain positions on a page. This could help you detect the tag block quite accurately!
There is a term extractor module for Drupal (http://drupal.org/project/extractor), but it's only for Drupal 6.

Tool to parse text for possible Wikipedia links

Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?
For example, I'd like a tool that could turn something like:
The most popular search algorithm on a
sorted list is the binary search.
Into:
The most popular search algorithm on a
sorted list is the binary search.
It would be wonderful if Wikipedia had an API which would do this since they would be best equipped to determine what "words of interests" are.
In my example I simply linked all combinations which linked directly to an entry except for The and most.
There is a tool that does exactly what you're asking for.
http://wikify.appointment.at/
It's not perfect, but it works.
You have two separate problems to solve here:
1. Deciding which words should be linked
2. Determining if there's a suitable entry to link these words to
Now, (2) is simpler, though it's also somewhat problematic. Wikipedia seems to have an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation - sometimes you might not hit the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake and a couple of other things.
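For problem (2), a rough sketch against the public MediaWiki API (Python with requests assumed; the "disambiguation" page prop is how disambiguation pages are flagged, but treat the exact fields as something to verify rather than gospel):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def lookup(title):
    params = {"action": "query", "titles": title, "prop": "pageprops",
              "redirects": 1, "format": "json"}
    pages = requests.get(API, params=params, timeout=10).json()["query"]["pages"]
    page = next(iter(pages.values()))
    exists = "missing" not in page
    is_disambiguation = "disambiguation" in page.get("pageprops", {})
    return exists, is_disambiguation

# lookup("Python")        -> exists, but flagged as a disambiguation page
# lookup("Binary search") -> exists, a normal article
```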
(1) is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "friend", "word", "computer", etc.
But this would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and this depends a lot on the text itself. In an article for professional programmers, do you really want to link to "search algorithm" every time? But for beginners, perhaps you do.
To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at your hand, and something need-specific can be coded without too much effort.
Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining what entities are being mentioned in some piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website. Direct link: pdf.

Pros and cons of using DB id in the URL?

For example: http://stackoverflow.com/questions/396164/exposing-database-ids-security-risk and http://stackoverflow.com/questions/396164/blah-blah load the same question.
(I guess this is the DB id from the Questions table? Is this standard in ASP.NET?)
What are the pros and cons of using this type of scheme in your web app?
Well, for one, simple IDs are usually sequential, so it's quite easy to guess them and retrieve other data from your application.
(For example, this link, which has a made-up slug, still loads the question "Load JSON at runtime rather than dynamically via AJAX": https://stackoverflow.com/questions/395858/doesnt-matter-what-I-type-here)
Now, having said that, this might also be seen as a bonus, because nobody in their right mind would make their whole security hinge on the fact that you have to click on a link to get to your secure data, and thus easy discoverability of the data might be good.
However, one point is that at some point you're going to reindex your database, and having something that makes the old URLs invalid would be bad, if for no other reason than that search engines would still have the old links.
Also, here on SO it's quite normal to use links like this to other questions, so if at some point they want to reindex and thus renumber things (or move to GUIDs), they will still have to keep the old structure and IDs.
Now, is this likely to ever happen or be needed? Probably no.
I wouldn't worry too much about it, just build your security as though every entrypoint to your application is known and there should be no problems.
The database ID is used to look up the question in the database. It's numerical, which means: fast. If you left it out you would have to look up the title, which is a lot slower.
The question title itself is part of the URL to make it "search engine friendly". It'll be ranked higher by g**gle etc.
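A minimal sketch of that scheme (Flask and an in-memory stand-in for the database are assumptions purely for illustration): the numeric id drives the lookup, the slug is only there for readers and search engines, and a wrong or missing slug 301s to the canonical URL.

```python
import re
from flask import Flask, redirect, url_for

app = Flask(__name__)

# stand-in for the questions table: id -> title
QUESTIONS = {407120: "Pros and cons of using DB id in the URL?"}

def slugify(title):
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

@app.route("/questions/<int:qid>/")
@app.route("/questions/<int:qid>/<slug>")
def show_question(qid, slug=None):
    title = QUESTIONS[qid]                 # fast primary-key lookup
    canonical = slugify(title)
    if slug != canonical:
        # any (or no) slug still resolves, but redirect so search engines
        # only ever index one canonical, descriptive URL
        return redirect(url_for("show_question", qid=qid, slug=canonical), 301)
    return f"<h1>{title}</h1>"
```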
Pro:
Super easy to retrieve the page information. Take the ID, call the database, voila. Your table will (should) be indexed to make this lookup super fast.
Guaranteed unique URL.
Con:
IDs in your system are being publicly displayed. Not a problem in a publicly available system like SO. However, proper security measures on the back end can make this not a problem even on sensitive systems.
Ugly URLs. 6+ digit numbers are just hard to remember, and make it more difficult to distinguish pages if the number is all that identifies them. This can also have SEO consequences, as URLs with more relevant and well-structured information are generally ranked better. SO compensates by providing the post name in the URL as well. While I still can't rattle off a particular post to my buddy at lunch, I can still find it more easily in my browser history.
Slower lookups. Doing text searches on a database is generally slower.
But remember that in a community like this there is a higher (although still minimal) chance of the same question name being posted at the same time, which would break things, so some kind of unique identification needs to be applied. IDs are probably quite logical in the context that this particular web application was developed in.
I don't think it's bad practice, and it's fairly common to do it in ASP.NET and other frameworks. As #lassevk said, if your security depends on it, then you need some more checks in there (can user X get to record Y?), but it comes down more to the SEO-friendliness of the URLs for public sites.
For example, SO's URLs are fairly friendly:
Pros and cons of using DB id in the URL?
Google rates information at the START of the URL higher than at the end, so having it look like:
https://stackoverflow.com/pros-and-cons-of-using-db-id-in-the-url/q/407120
should get a higher ranking for "pros and cons of using db id in the url". It's not the only factor, but it is quite a major one - look at Amazon's format, they do it for a very good reason:
http://www.amazon.com/Maverick-Ricardo-Semler/dp/0712678867
http://server/book-name/dp/book-id
Wordpress does it like this:
http://server/yyyy/mm/dd/name-of-the-post
however, if you post two posts on the same day called "foo", you get:
http://server/yyyy/mm/dd/foo
http://server/yyyy/mm/dd/foo2
the slug (foo/foo2) isn't a PK, but it IS maintained as unique over the posts table.
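A sketch of how that uniqueness can be maintained when the slug isn't the primary key (the suffix style follows the "foo2" example above; existing_slugs stands in for a query against the posts table):

```python
import re

def slugify(title):
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def unique_slug(title, existing_slugs):
    base = slugify(title)
    slug, n = base, 2
    while slug in existing_slugs:
        slug = f"{base}{n}"   # foo -> foo2 -> foo3 ...
        n += 1
    return slug

# unique_slug("foo", {"foo"}) -> "foo2"
```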
I think putting the ID in the URL isn't a problem - unless the ID is a GUID! That's way too long, and hard to type. If it's an int, or some kind of short GUID (e.g. 6-8 chars), then it shouldn't be a problem.
