Algorithm to figure out which links to follow in Wikipedia - machine-learning

Starting at a given wiki page, I want to find the pages for the concepts that make up the original topic. For example, from "computer", I would like to follow the links to "Hard drive" or "CPU". I have a list of all the links on the page; some are relevant, others are not. How would I assign a probability of "goodness" to each link?

Related

Pros and Cons of using hierarchical URLs versus flat?

I'm building a large news site and we'll have many thousands of articles. So far we have over 20,000. We plan on having a main menu whose links display articles by category. Clicking "baking" will show all articles related to "baking", and "baking/cakes" will show everything related to cakes.
Right now, we're weighing whether or not to use hierarchical URLs for each article. If I'm on the "baking/cakes" page, and I click an article that says "Chocolate Raspberry Cake", would it be best to put that article at a specific, hierarchical URL like this:
website.com/baking/cakes/chocolate-raspberry-cake
or a generic, flat one like this:
website.com/articles/chocolate-raspberry-cake
What are the pros and cons of doing each? I can think of cases for each approach, but I'm wondering what you think.
Thanks!
It really depends on the structure of your site. There's no one correct answer for every site.
That being said, here's my recommendation for a news site: instead of embedding the category in the URL, embed the date. For example: website.com/article/2016/11/18/chocolate-raspberry-cake or even website.com/2016/11/18/chocolate-raspberry-cake. This allows you to write about Chocolate Raspberry Cake more than once, as long as you don't do it on the same day. When I'm browsing news I find it helpful to identify the date an article was written as quickly as possible; embedding it in the URL is very helpful.
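As a rough illustration of the date-based scheme, here is a minimal Python sketch. The helper names slugify and article_path are made up for this example and are not part of Drupal or any other CMS; the slug rule (lowercase, hyphen-joined alphanumeric runs) is also just an assumption.

```python
import re
from datetime import date


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def article_path(title: str, published: date) -> str:
    """Build a /YYYY/MM/DD/slug path (hypothetical layout for illustration)."""
    return f"/{published:%Y/%m/%d}/{slugify(title)}"


print(article_path("Chocolate Raspberry Cake", date(2016, 11, 18)))
```

Because the date is part of the path, the same title published on a different day yields a different URL.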
Hierarchical URLs based on categories lock you into a single category for each article, which may be too limiting. There may be articles which fit multiple categories. If you've set up your site to require each article to have a single primary category, then this may not be an issue for you.
Hierarchical URLs based on categories can also be problematic if any of the categories ever change. Categories change because of typos, changes to pluralization, a new term coming into vogue and replacing an existing term, or even just a change in wording (e.g. "baking" could become "baked goods"). The terms as they existed when you created the article will be forever immortalized in your URL structure, unless you retroactively change them all (invalidating old links, so make sure to use Drupal's Redirect module).
If embedding the date in the URL is not an option, then my second choice would be the flat URL structure because it will give you URLs which are shorter and easier to remember. I would recommend using "article" instead of "articles" in the URL because it saves you a character.

URL keyword vs URL readability

This question is about SEO in URL naming. Does SEO really outweigh user experience? Where would you draw the line before SEO starts ruining people's experience? For example, I have a page that contains information about art contests that are running or have run on my website.
Which URL is better?
example.com/contest/{contest-id}/{name-of-contest}
or
example.com/online-graphic-design-contest/{contest-id}/{name-of-contest}
Is stuffing the URL with keywords such as 'online', 'graphic', 'design' and 'contest' really more important for SEO than having a shorter, more readable URL like the first one?
The best way to think about SEO these days is through the perspective of the user, firstly, and then through the search engine perspective. I would argue that your second URL is much better for both cases. It's more descriptive to the user (we have an "online graphic design contest") and also to search engines.
Google has made it apparent that their focus is on providing content that is relevant to the user, and the best way to be relevant is with content that is descriptive and fits with what your users are searching for. I don't think you're keyword stuffing if you're using a single natural language phrase in the URL to describe the content of the page. That portion of the URL should also match your page title, and header tags on the page, etc., etc.
Here are some useful resources:
http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
http://linchpinseo.com/user-focused-seo-redefining-what-search-engine-optimization-is

Find out the various layouts used in a website

Is it possible to find out the total number of layouts (templates) used within a website?
For example, suppose I want to know how many types of layouts www.flipkart.com uses.
The answer would be something like:
Landing page or home page
Category page, e.g. http://www.flipkart.com/mobiles?_l=GIuT6NCRsZbfL9ID9ZKHNQ--&_r=hCno5y6eFUI8C0iWzaQbAg--&ref=cef19a11-4ebc-4f8e-a0dc-401c2d55de3e&_pop=brdcrumb
All category pages have the same layout; only the inner content differs.
Product page, e.g. http://www.flipkart.com/htc-sensation-mobile-phone/p/itmczbrsnwphgbnw?pid=MOBCYW9HXBUDYJPH&_l=sXQjsX87GxqrvKzhjuOrkw--&_r=n_2yuAC4xgh0SZTuulvAtw--&ref=9305103f-6fc1-497c-807a-8f30ee30c13c
All product pages share the same layout: they have a "buy now" option and multiple images. Is there an existing tool that can find this out?
I hope my question is clear. I just want to classify the site's pages into buckets.
Well, as far as I know there is no existing tool or algorithm for this, but you can write one. Identify some attributes of these pages and set them as benchmarks. Then, whenever you encounter a URL and want to identify its category, extract the same attributes and compare them against the benchmarks.
It's not generic, but it will work for specific websites :)
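A minimal Python sketch of the benchmark idea follows. It assumes the attributes are taken from the URL alone (path segments and query parameters); the rules are guesses based on the two Flipkart URLs in the question, and the function name classify is made up for this example. A real classifier would probably also look at the page markup, for instance whether a "buy now" button is present.

```python
from urllib.parse import parse_qs, urlparse


def classify(url: str) -> str:
    """Assign a URL to a layout bucket using site-specific, hand-written rules."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    query = parse_qs(parts.query)

    if not segments:
        return "home"
    # Assumed benchmark: product URLs contain a "p" path segment and a "pid" parameter.
    if "p" in segments and "pid" in query:
        return "product"
    # Assumed benchmark: category URLs have a single path segment, e.g. /mobiles.
    if len(segments) == 1:
        return "category"
    return "unknown"


for url in (
    "http://www.flipkart.com/",
    "http://www.flipkart.com/mobiles",
    "http://www.flipkart.com/htc-sensation-mobile-phone/p/itmczbrsnwphgbnw?pid=MOBCYW9HXBUDYJPH",
):
    print(classify(url), url)
```

Anything that lands in "unknown" is a hint that a benchmark is missing and a new bucket may be needed.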

Url with pseudo anchors and duplicate content / SEO

I have a product page with options in a select list (e.g. the color of the product).
You can access my product through different URLs:
www.mysite.com/product_1.html
www.mysite.com/product_1.html#/color-green
If you access the page with the URL www.mysite.com/product_1.html#/color-green, the green option in the select list is selected automatically (with JavaScript).
If I link to my product page with those URLs, is there a risk of duplicate content? Is it good for my SEO?
Thanks
You need to use canonical URLs to let the search engines know that you are aware that the content appears duplicated.
Basically, declaring www.mysite.com/product_1.html as the canonical URL of www.mysite.com/product_1.html#/color-green tells the search engine that whenever it sees www.mysite.com/product_1.html#/color-green it should not scan that page but rather scan www.mysite.com/product_1.html.
This is the suggested method to overcome duplicate content of this type.
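As a small illustration, the canonical URL for a page like this is simply the URL without its fragment. The Python sketch below builds the corresponding link tag; the function name canonical_link_tag is made up for this example, and in practice the tag would be emitted by your page template.

```python
from html import escape
from urllib.parse import urldefrag


def canonical_link_tag(url: str) -> str:
    """Return a rel=canonical link tag pointing at the URL without its #fragment."""
    base, _fragment = urldefrag(url)
    return f'<link rel="canonical" href="{escape(base, quote=True)}">'


print(canonical_link_tag("http://www.mysite.com/product_1.html#/color-green"))
```

The tag goes in the head of the page, so every color variant points back to the same product URL.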
See these pages:
SEO advice: url canonicalization
A rel=canonical corner case
At one time I saw Google indexing the odd #ed URL and showing them in results, but it didn't last long. I think it also required that there was an on page link to the anchor.
Google does support the concept of the hashbang (#!) as a specific way to do indexable anchors and support AJAX, which implies an anchor without the bang (!) will no longer be considered for indexing.
Either way, Google is not stupid. The basic use of the anchor is to move to a place on a page, i.e. it is the same page (duplicate content) but a different spot. So Google will expect a #ed URL to contain the same content. Why would they punish you for doing what the # is for?
And what is "the risk of duplicate content"? Generally, the only onsite risk from duplicate content is that Google may waste its time crawling duplicate pages instead of focusing on other valuable pages. As Google will assume # is the same page, it is more likely to not even try the #ed URL.
If you're worried, implement the canonical tag, but do it right. I've seen more issues from implementing it badly than the supposed issues they are there to solve.
Both answers above are correct. Google has said they ignore URL fragments (the part after the #) unless you use hash-bang format (#!) -- and that really only addresses a certain use case, so don't add it just because you think it will help.
Using the canonical link tag is the right thing to do.
One additional point about dupe content: it's less about the risk than about a missed opportunity. In cases where there are dupes, Google chooses one. If 10 sites link to your site using www.example.com and 10 more link using just example.com, you'll get the "link goodness" benefit of only 10 links. The complete solution to this involves ensuring that when users and Google arrive at the "wrong" one, the server responds with an HTTP 301 status and redirects the user to the "right" one. This is known as domain canonicalization and is a good thing for many, many reasons. Use this in addition to the "canonical" link tag and other techniques.
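A minimal Python sketch of the redirect decision, assuming www.example.com is the host you picked as the "right" one. The names CANONICAL_HOST and canonical_redirect are made up for this example; the function only computes the status code and Location value, and actually sending the response is left to your web server or framework. Note that this sketch drops any port number from the incoming URL.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumption: the host chosen as the "right" one


def canonical_redirect(url: str) -> Optional[Tuple[int, str]]:
    """Return (301, target) if the URL is on a non-canonical host, else None."""
    parts = urlsplit(url)
    if parts.hostname == CANONICAL_HOST:
        return None
    # Keep scheme, path and query; swap in the canonical host and drop the fragment.
    target = urlunsplit((parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, ""))
    return 301, target


print(canonical_redirect("http://example.com/page?x=1"))
print(canonical_redirect("http://www.example.com/page"))
```

In production this is usually a one-line rewrite rule in the web server configuration rather than application code.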

Rails - extract seo keywords from block of text

I need to generate SEO meta keyword tags based upon user-generated wiki content.
Say I have an article and a predefined list of keywords/phrases, is there some good method to grab matched article keywords? Keywords may not be of one word length and will be given a predefined weight as to which keywords are used first. Some implementation of Nokogiri seems the obvious choice but I wondered if there were something more complete for this exact scenario.
You could process your text with a semantic API; it will give you a list of potential keywords along with a score for each.
I've begun to develop this gem: https://github.com/apneadiving/SemExtractor
It still needs some improvements for error handling but it's fully operational to query the following engines:
Zemanta
Semantic Hacker from Textwise
Yahoo Boss
OpenCalais
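If the predefined, weighted list from the question is all you need, a plain phrase match may be enough before reaching for an external API. The sketch below is in Python rather than Ruby purely for illustration (the logic should port directly); the names normalize and match_keywords and the sample weights are made up for this example.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase the text and collapse it to single-space-separated words."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def match_keywords(article: str, weights: Dict[str, float], limit: int = 10) -> List[str]:
    """Return the predefined keywords/phrases found in the article, heaviest first."""
    haystack = f" {normalize(article)} "
    # Padding with spaces makes multi-word phrases match on whole-word boundaries only.
    found = [kw for kw in weights if normalize(kw) and f" {normalize(kw)} " in haystack]
    return sorted(found, key=lambda kw: weights[kw], reverse=True)[:limit]


weights = {"hard drive": 3.0, "CPU": 5.0, "graphics card": 1.0}
print(match_keywords("A computer has a CPU and a hard drive.", weights))
```

The limit parameter caps how many keywords end up in the tag, so the heaviest matches are used first.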
If you're only wanting to grab keywords for the meta keyword tag, that's not really worth your time. Google doesn't pay attention to those anymore.
Here's a good post about it, with a video of Matt Cutts from Google explaining that the meta keyword tag doesn't play a part in search engine rankings.
http://www.stepforth.com/blog/2010/meta-keyword-tag-dead-seo/
What is worth your time? Good title tags.
