Rails - extract seo keywords from block of text - ruby-on-rails

I need to generate seo meta keyword tags based upon user generated wiki content.
Say I have an article and a predefined list of keywords/phrases, is there some good method to grab matched article keywords? Keywords may not be of one word length and will be given a predefined weight as to which keywords are used first. Some implementation of Nokogiri seems the obvious choice but I wondered if there were something more complete for this exact scenario.
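For the direct approach described above, a plain-Ruby sketch (the keyword hash and its weights are made-up stand-ins for your predefined list; matching is naive substring matching with no word-boundary handling):

```ruby
# Predefined phrases with weights deciding which keywords are used first.
keywords = { "ruby on rails" => 10, "seo" => 8, "wiki" => 5 }

# Return the phrases found in the text, highest weight first, capped at limit.
def matched_keywords(text, keywords, limit: 5)
  text = text.downcase
  keywords
    .select  { |phrase, _weight| text.include?(phrase) }
    .sort_by { |_phrase, weight| -weight }
    .first(limit)
    .map(&:first)
end

matched_keywords("Building a wiki with Ruby on Rails", keywords)
# => ["ruby on rails", "wiki"]
```

The result can be joined with `", "` to produce the meta keyword tag's content.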

You could process your text with a semantic API; it will return a list of potential keywords along with an associated score for each.
I've begun to develop this gem: https://github.com/apneadiving/SemExtractor
It still needs some improvements for error handling but it's fully operational to query the following engines:
Zemanta
Semantic Hacker from Textwise
Yahoo Boss
OpenCalais

If you're only wanting to grab keywords for the meta keyword tag, that's not really worth your time. Google doesn't pay attention to those anymore.
Here's a good post about it, with a video of Matt Cutts from Google explaining that the meta keyword tag doesn't play a part in search engine rankings.
http://www.stepforth.com/blog/2010/meta-keyword-tag-dead-seo/
What is worth your time? Good title tags.
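To make that advice concrete, here is a minimal sketch of a Rails-style view helper for building descriptive title tags; the site name, separator, and helper name are assumptions, not part of the answer:

```ruby
module TitleHelper
  SITE_NAME = "MyWiki"

  # "Chocolate Raspberry Cake | MyWiki", or just the site name as a fallback.
  def page_title(title = nil)
    title.to_s.strip.empty? ? SITE_NAME : "#{title} | #{SITE_NAME}"
  end
end
```

In a layout you would render `<title><%= page_title(yield(:title)) %></title>` and set `content_for :title, @article.title` in each view.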

Related

Pros and Cons of using hierarchical URLs versus flat?

I'm building a large news site and we'll have several thousand articles. So far we have over 20,000. We plan on having a main menu containing category links that display articles matching those categories. Therefore, clicking "baking" will show all articles related to "baking", and "baking/cakes" will show everything related to cakes.
Right now, we're weighing whether or not to use hierarchical URLs for each article. If I'm on the "baking/cakes" page, and I click an article that says "Chocolate Raspberry Cake", would it be best to put that article at a specific, hierarchical URL like this:
website.com/baking/cakes/chocolate-raspberry-cake
or a generic, flat one like this:
website.com/articles/chocolate-raspberry-cake
What are the pros and cons of doing each? I can think of cases for each approach, but I'm wondering what you think.
Thanks!
It really depends on the structure of your site. There's no one correct answer for every site.
That being said, here's my recommendation for a news site: instead of embedding the category in the URL, embed the date. For example: website.com/article/2016/11/18/chocolate-raspberry-cake or even website.com/2016/11/18/chocolate-raspberry-cake. This allows you to write about Chocolate Raspberry Cake more than once, as long as you don't do it on the same day. When I'm browsing news I find it helpful to identify the date an article was written as quickly as possible; embedding it in the URL is very helpful.
Hierarchical URLs based on categories lock you into a single category for each article, which may be too limiting. There may be articles which fit multiple categories. If you've set up your site to require each article to have a single primary category, then this may not be an issue for you.
Hierarchical URLs based on categories can also be problematic if any of the categories ever change. For example, in the case of typos, changes to pluralization, a new term coming into vogue and replacing an existing term, or even just a change in wording (e.g. "baking" could become "baked goods"). The terms as they existed when you created the article will be forever immortalized in your URL structure, unless you retroactively change them all (invalidating old links, so make sure to use Drupal's Redirect module).
If embedding the date in the URL is not an option, then my second choice would be the flat URL structure because it will give you URLs which are shorter and easier to remember. I would recommend using "article" instead of "articles" in the URL because it saves you a character.
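A small sketch of the date-in-URL scheme recommended above; `Article` here is a hypothetical stand-in for your model, and the `/article/...` prefix is just the example path from the answer:

```ruby
require 'date'

Article = Struct.new(:slug, :published_on)

# Build the dated path from the article's publication date and slug.
def article_path(article)
  d = article.published_on
  format("/article/%04d/%02d/%02d/%s", d.year, d.month, d.day, article.slug)
end

article_path(Article.new("chocolate-raspberry-cake", Date.new(2016, 11, 18)))
# => "/article/2016/11/18/chocolate-raspberry-cake"
```

Because the date participates in the URL, two articles with the same slug only collide if published on the same day, which is the property the answer relies on.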

URL keyword vs URL readability

This question is about SEO in URL naming. I just want to know: does SEO really weigh that much more than user experience? Where would you draw the limit on how far SEO should go before it ruins people's experience? As an example, I have a page that contains information about art contests that are running or have run on my website.
Which URL is better?
example.com/contest/{contest-id}/{name-of-contest}
or
example.com/online-graphic-design-contest/{contest-id}/{name-of-contest}
Is stuffing keywords such as 'online', 'graphic', 'design' and 'contest' into the URL so much more important for SEO than having a shorter, more readable URL like the first one?
The best way to think about SEO these days is through the perspective of the user, firstly, and then through the search engine perspective. I would argue that your second URL is much better for both cases. It's more descriptive to the user (we have an "online graphic design contest") and also to search engines.
Google has made it apparent that their focus is on providing content that is relevant to the user, and the best way to be relevant is with content that is descriptive and fits with what your users are searching for. I don't think you're keyword stuffing if you're using a single natural language phrase in the URL to describe the content of the page. That portion of the URL should also match your page title, and header tags on the page, etc., etc.
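Deriving that URL segment from the natural-language phrase is a one-liner; here is a plain-Ruby sketch (Rails users could use ActiveSupport's `String#parameterize` instead):

```ruby
# Lowercase the phrase, collapse runs of non-alphanumerics into hyphens,
# and trim any leading/trailing hyphens.
def slugify(phrase)
  phrase.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
end

slugify("Online Graphic Design Contest")  # => "online-graphic-design-contest"
```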
Here are some useful resources:
http://static.googleusercontent.com/media/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
http://linchpinseo.com/user-focused-seo-redefining-what-search-engine-optimization-is

best way to extract info from the web delphi

I want to know if there is a better way of extracting info from a web page than parsing the HTML for what I'm searching for, e.g. extracting the movie rating from 'imdb.com'.
I'm currently using the IndyHttp components to get the page and i'm using strUtils to parse the text but the content is limited.
I found plain simple regexes to be highly intuitive and simple when dealing with well-structured websites, and IMDB is one of them.
For example the movie rating on the IMDB's movie HTML page is in a <DIV> with class="star-box-giga-star". That's VERY easy to extract using a regular expression. The following regular expression will extract the movie rating from the raw HTML into capture group 1:
star-box-giga-star[^>]*>([^<]*)<
It's not pretty, but it does the job. The regex looks for the "star-box-giga-star" class name, then it looks for the > that terminates the DIV, and then captures everything until the following <. To create a new regex like this you should use a web browser that allows inspecting elements (for example Chrome or Opera). With Chrome you can simply look at the web page, right-click on the element you want to capture and choose Inspect element, then look around for easily identifiable elements that can be used to create a good regex. In this case the "star-box-giga-star" class is obviously easy to identify. You'll usually have no problem finding such identifiable elements on good websites, because good websites use CSS, and CSS requires IDs or classes to style elements properly.
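The same capture translates directly to Ruby, for readers following along from the Rails side of the thread (the HTML snippet below is a made-up stand-in for the real IMDB page):

```ruby
html = '<div class="star-box-giga-star">8.7</div>'

# String#[] with a regex and a capture-group index returns that group.
rating = html[/star-box-giga-star[^>]*>([^<]*)</, 1]
rating  # => "8.7"
```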
Processing an RSS feed is more convenient.
As of the time of posting, the only RSS feeds available on the site are:
Born on this Date
Died on this Date
Daily Poll
Still, you can request that a new one be added by getting in touch with the help desk.
Resources on RSS feed processing:
Relevant post here on SO.
Super Object
Wikipedia.
When scraping websites, you cannot rely on the availability of the information. IMDB may detect your scraping and attempt to block you, or they may frequently change the format to make it more difficult.
Therefore, you should always try to use a supported API or RSS feed, or at least get permission from the website to aggregate their data, and ensure that you're abiding by their terms. Often, you will have to pay for this type of access. Scraping a website without permission may open you up to liability on a couple of legal fronts (denial of service and intellectual property).
Here's IMDB's statement:
You may not use data mining, robots, screen scraping, or similar
online data gathering and extraction tools on our website.
To answer your question, the better way is to use the method provided by the website. For non-commercial use, and if you abide by their terms, you can download the IMDB database directly and use the data from there instead of scraping their site. Simply update your database frequently, and it's a better solution than scraping the site. You could even wrap your own web API around it. Ratings are available as a standalone table.
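A sketch of reading that standalone ratings table once downloaded, assuming a tab-separated dump; the column names and the sample row here are hypothetical, so check them against the actual files you get from IMDB:

```ruby
require 'csv'

# Inline stand-in for the downloaded ratings file.
data = "tconst\taverageRating\tnumVotes\n" \
       "tt0111161\t9.3\t2500000\n"

# Index ratings by title identifier for fast lookup.
ratings = CSV.parse(data, col_sep: "\t", headers: true)
             .to_h { |row| [row["tconst"], row["averageRating"].to_f] }

ratings["tt0111161"]  # => 9.3
```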
Use HTML Tidy to convert any HTML to valid XML and then use an XML parser, maybe using XPATH or developing your own code (which is what I do).
All the answers posted cover your generic question well. I usually follow a strategy similar to the one detailed by Cosmin: I use WinINet and regexes for most of my web-extraction needs.
But let me add my two cents on the specific sub-question of extracting the IMDB rating. IMDBAPI.COM provides a query interface that returns JSON, which is very handy for this type of search.
So a very simple command-line program for getting an IMDB rating would be...
program imdbrating;

{$apptype console}

uses
  htmlutils;

// Extract the value of a top-level "parm":value pair from raw JSON text.
// Crude string slicing, not a real JSON parser.
function ExtractJsonParm(parm, h: string): string;
var
  r: integer;
begin
  r := pos('"' + Parm + '":', h);
  if r <> 0 then
    result := copy(h, r + length(Parm) + 4,
                   pos(',', copy(h, r + length(Parm) + 4, length(h))) - 2)
  else
    result := 'N/A';
end;

var
  h: string;
begin
  // Movie title is taken from the first command-line argument.
  h := HttpGet('http://www.imdbapi.com/?t=' + UrlEncode(ParamStr(1)));
  writeln(ExtractJsonParm('Rating', h));
end.
If the page you are crawling is valid XML, I use SimpleXML to extract info. Works pretty well.
Resource:
Download link.

Rails + MediaWiki API for Wikipedia data extraction

I am trying to use Rails to extract data from Wikipedia, based on a search term.
For example,
1) if I have the String "American Idol", I want to pass that to Wikipedia and get a list of the articles that relate to that. My goal will be to take the first 3 hyperlinks and display them on the website.
2) one step further would involve me extracting small pieces of data from Wikipedia - say the infobox, or the first few words of the wikipedia article.
Any tips?
Thanks!
You don't need to resort to screen-scraping, MediaWiki has a very comprehensive API for precisely this kind of thing. See https://github.com/jpatokal/mediawiki-gateway for a handy Ruby wrapper around it.
Alternatively, if you're only interested in data like infoboxes, see DBpedia for the database version of Wikipedia.
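For the first goal (a list of matching articles), MediaWiki's documented opensearch endpoint can also be queried directly without any gem; a minimal sketch:

```ruby
require 'uri'
require 'net/http'
require 'json'

# Build an opensearch query; limit: 3 matches the "first 3 hyperlinks" goal.
def wikipedia_search_uri(term, limit: 3)
  URI("https://en.wikipedia.org/w/api.php?" +
      URI.encode_www_form(action: "opensearch", search: term,
                          limit: limit, format: "json"))
end

# With network access, the JSON response is [term, titles, descriptions, urls]:
#   _term, titles, _descs, urls =
#     JSON.parse(Net::HTTP.get(wikipedia_search_uri("American Idol")))
```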
There is another gem that you can use: https://github.com/kenpratt/wikipedia-client
This gem seems to get just the first result of your search, but you can consult the documentation to be sure.
Regarding the content, once you get the page, the gem allows you to access the different content of the article, links, images and so on.
Use mechanize and nokogiri to do that. This is a great cheat sheet for that:
http://www.e-tobi.net/blog/files/ruby-mechanize-cheat-sheet.pdf
Mechanize is a toolbox for simulating website calls and Nokogiri is an HTML/XML parser. It should be simple to implement with those.

Can an "SEO Friendly" url contain a unique ID?

I'd like to start using "SEO Friendly Urls" but the notion of generating and looking up large, unique text "ids" seems to be a significant performance challenge relative to simply looking up by an integer. Now, I know this isn't as "human friendly", but if I switched from
http://mysite.com/products/details?id=1000
to
http://mysite.com/products/spacelysprokets/sproket/id
I could still use the ID alone to quickly lookup the details, but the URL itself contains keywords that will display in that detail. Is that friendly enough for Google? I hope so as it seems a much easier process than generating something at the end that is both unique and meaningful.
Thanks!
James
Be careful with allowing a page to render using the same method as Stack Overflow.
http://stackoverflow.com/questions/820493/random-text-can-cause-problems
Black hats can use this to cause a duplicate-content penalty for long-tail competitors (trust me).
Here are two things you can do to protect yourself from this.
HTTP 301 redirect any inbound display url that matches your ID but doesn't match the text to the correct text.
Example:
http://stackoverflow.com/questions/820493/random-text-can-cause-problems
301 ->
http://stackoverflow.com/questions/820493/can-an-seo-friendly-url-contain-a-unique-id
Use canonical URLs.
<link rel="canonical"
href="http://stackoverflow.com/questions/820493/can-an-seo-friendly-url-contain-a-unique-id"
/>
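A plain-Ruby sketch of the first protection; the record type and path scheme are hypothetical, and in Rails this logic would live in a before_action that calls redirect_to with status: :moved_permanently:

```ruby
Question = Struct.new(:id, :slug)

# Return a [status, location] pair when the inbound slug doesn't match the
# record's canonical slug, or nil when the URL is already canonical.
def canonical_redirect(record, requested_slug)
  return nil if requested_slug == record.slug
  [301, "/questions/#{record.id}/#{record.slug}"]
end

q = Question.new(820_493, "can-an-seo-friendly-url-contain-a-unique-id")
canonical_redirect(q, "random-text-can-cause-problems")
# => [301, "/questions/820493/can-an-seo-friendly-url-contain-a-unique-id"]
```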
I'd say you're fine.
Have a look at the URLs that StackOverflow uses. They have a unique id, then they have the SEO-friendly stuff. You can omit the SEO-friendly stuff and the URL still works.
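A sketch of that hybrid scheme in Ruby: the numeric ID leads and the slug is decorative. `Product` is a hypothetical stand-in; in Rails you would override `to_param` on your model the same way:

```ruby
Product = Struct.new(:id, :name) do
  # "1000-spacely-sproket": id first, then a slug derived from the name.
  def to_param
    "#{id}-#{name.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')}"
  end
end

Product.new(1000, "Spacely Sproket").to_param  # => "1000-spacely-sproket"

# Looking the record back up only needs the leading integer, because
# String#to_i stops at the first non-digit:
"1000-spacely-sproket".to_i  # => 1000
```

This is why omitting or mangling the SEO-friendly part still resolves: `Product.find(params[:id])` only ever sees the leading integer.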
You are making a devil's bargain here: you are trading away business goals for technology goals.
If you were to ask "From a purely business and SEO perspective, is it better to include unique IDs in the URL or not?", the answer would clearly be to not use them.
The question then becomes: if you do use them, how much does it hurt you in the search engines? The answer is that it definitely has some negative impact. How much is yet to be determined.
In terms of "user friendly", no, they are definitely not user friendly.
In terms of Google, they state "Whenever possible, shorten URLs by trimming unnecessary parameters." See their URL structure document.
I'm not aware of any problems caused by adding an ID to a URL. In fact it can be extremely useful, as it allows the human/search engine friendly part of the URL to be changed without causing a broken link to a page that a search engine has already indexed. Using SO as an example, here's a link to your question:
https://stackoverflow.com/questions/820493/you-can-put-any-text-you-want-here
Nothing wrong with that. An increasing number of services have started to use a hybrid solution as Paul Tomblin already pointed out. In addition to SO, Tumblr uses this pattern too (maybe it was the first).
Furthermore, in certain services—like Google News—the URL must contain a unique numeric ID.
Getting rid of the parameterized URL will definitely help. From my experience, including the ID does not hurt or help, as long as there are no '?key=value' pairs in the URL.
I have two seemingly contradictory points to make here:-
Nobody looks at URLs! Experience has "trained" browser users to treat the "Address" box contents as invisible; they know the contents will be any two of 'unreadable', 'meaningless' and 'confusing', so they just ignore it completely.
Using a string which can be easily converted to an integer may offer a slight performance advantage over using a longer string which is slightly harder (hash() vs. to_int()) to convert into an integer. However, in the context of the average web application, any performance difference would be negligible.
My advice would be to stick with what you're comfortable with.
Use something like mod_rewrite to rewrite URLs before they reach your application. That way you could map a slug like http://oorl.com/99942/My-Friendly-Text-For-Search-Engines/ to http://oorl.com/lookup.php?id=99942. This also lets you change the slug and keywords used to optimize certain links without breaking functionality.
Duplicate content causes more negative impact than an unfriendly URL. Be careful about allowing arbitrary text alongside the ID; your competitors could misuse this.
Yes, and in fact it can be more SEO friendly to include a number in your URL, as it implies to Google that you are consistently updating your content.
I am fairly sure that it makes it much more difficult to get indexed in Google News if you don't have an incrementing number attached in some way to your URLs.
