Wikipedia API request sometimes not returning results - ruby-on-rails

I want to make a request to the Wikipedia API to see if a given name has a Wikipedia page.
For example, let's say I make an API request to get the page for Justin Bieber:
require "open-uri"
require "json"

source = "https://en.wikipedia.org/w/api.php?action=query&titles=justin%20bieber&prop=revisions&rvprop=content&format=json"
data = URI.open(source).read
json = JSON.parse(data)
Then I get back a JSON response with the page's info. But why does it return no result for some less well-known names, even though they have Wikipedia pages? For example, the Brent Bolthouse page: https://en.wikipedia.org/wiki/Brent_Bolthouse. When I check the JSON, there's no indication that the page actually exists.
I basically just want to implement a simple check to see if there's a wiki page that matches the exact name.

Try capitalizing all parts of the name, e.g.:
"brent bolthouse".titleize
=> "Brent Bolthouse"
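I suggest this because the titles of Wikipedia's pages on persons almost always have that format. While your URL with the lowercase name as the query doesn't work, the URL with the capitalized name does. Note that titleize comes from ActiveSupport, so it is available in Rails but not in plain Ruby.

Ah, I found out MediaWiki page titles are case-sensitive (apart from the first character, which Wikipedia capitalizes automatically).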
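
The existence check the question asks for can be sketched as follows. This is a minimal sketch, assuming the default JSON layout of action=query, in which a title that has no page is listed under "pages" with a "missing" key; the helper name wiki_page_exists? is made up for illustration. It works on the already parsed response, so it can be tried without a network call.

```ruby
require "json"

# Hypothetical helper: true when the parsed action=query response
# describes at least one page that actually exists.
def wiki_page_exists?(response)
  pages = response.dig("query", "pages") || {}
  # Missing or invalid titles are flagged with a "missing" / "invalid" key.
  pages.values.any? { |page| !page.key?("missing") && !page.key?("invalid") }
end

# Sample response shaped like what the API returns for an unknown title.
sample = JSON.parse('{"query":{"pages":{"-1":{"ns":0,"title":"Brent bolthouse","missing":""}}}}')
puts wiki_page_exists?(sample)
```

Adding &redirects to the request makes the API follow redirect pages as well, which helps when only a differently capitalized redirect exists.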

Related

How can you get the canonical URL for a web page (Rails)?

I need to store a distinct URL for an external webpage in the database. I don't want to store the same page twice, so
I need to strip all fluff off the URL.
# if I have
url_1 = "http://scientificamerican.com/royal-baby/?utm_campaign=promo"
# and
url_2 = "http://scientificamerican.com/royal-baby/?utm_source=email"
# then they should map to:
url_canonical = "http://scientificamerican.com/royal-baby/"
...it's not as simple as just stripping query parameters though
In order to get a single canonical URL regardless of what was on it I tried stripping the query string. The problem is that there are still CMSs which use the query string.
e.g.
url_1 = "https://www.scientificamerican.com/article.cfm?id=obama-budget"
# strip the query string and it becomes
url_1 = "https://www.scientificamerican.com/article.cfm"
# which is obviously the same for all articles :(
Is there any Rails tool for getting a page's canonical URL?
This is obviously a problem that a number of people have had to solve, not least the search engines. How do you reduce the URL down such that all that remains is the data for the page?
You can't. There is no way to know which query parameters are needed to identify the page. There are obviously many parameters you can safely remove (e.g. utm_campaign), but not all.
Your best bet would be to load the HTML for the page and look for the canonical link element. If that exists, then you've got your canonical URL.
http://en.wikipedia.org/wiki/Canonical_link_element
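
Both ideas from this answer can be sketched in plain Ruby. This is a rough sketch, not a robust implementation: the helper names strip_tracking_params and canonical_from_html are made up, the "utm_" prefix rule is an assumption about which parameters count as tracking fluff, and the regex-based tag scan stands in for a real HTML parser such as Nokogiri, which would handle messy markup better.

```ruby
require "uri"

# Assumption: only parameters starting with "utm_" are tracking fluff.
TRACKING_PARAMS = /\Autm_/i

# Hypothetical helper: drop known tracking parameters, keep everything else.
def strip_tracking_params(url)
  uri = URI.parse(url)
  return url if uri.query.nil?
  kept = URI.decode_www_form(uri.query).reject { |key, _| key.match?(TRACKING_PARAMS) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

# Hypothetical helper: return the href of the first <link rel="canonical">
# tag in the HTML, or nil when the page does not declare one.
def canonical_from_html(html)
  html.scan(/<link\b[^>]*>/i).each do |tag|
    next unless tag =~ /\brel\s*=\s*["']?canonical["']?/i
    href = tag[/\bhref\s*=\s*["']([^"']+)["']/i, 1]
    return href if href
  end
  nil
end

puts strip_tracking_params("http://scientificamerican.com/royal-baby/?utm_campaign=promo")
```

A reasonable order is to prefer the canonical link when the page declares one and fall back to the stripped URL otherwise.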

Google reader public RSS get more than 9 items

We need to parse the data from a Google Reader public RSS feed. The problem is that the URL parameter n (the number of items to retrieve) only works up to n=9.
For example in our test url:
http://www.google.com/reader/shared/user%2F15926769355350523044%2Flabel%2FPublicas%20RSS?n=2
Retrieves 2 news items
http://www.google.com/reader/shared/user%2F15926769355350523044%2Flabel%2FPublicas%20RSS?n=20
Retrieves only 9 news items
How can we overcome this limitation? Is there another parameter for this case? Or another method?
We found that using this alternative url the n parameter works fine:
https://www.google.com/reader/api/0/stream/contents/feed/http://www.google.com/reader/public/atom/user%2F15926769355350523044%2Flabel%2FPublicas%20RSS?n=20
The only problem is that the output format is different this way, so if someone finds a better solution we will accept their answer.
It seems the results are cropped only when the URL is viewed in the browser. If you fetch the contents from code, it returns the correct item count. (By contrast, the alternative URL returns the right contents both ways: fetched from code and viewed in the browser.)
In Atom format (linked at the top right of both URLs in the original post):
http://www.google.com/reader/public/atom/user%2F15926769355350523044%2Flabel%2FPublicas%20RSS?n=20
The content with /api/ in the URL in the second post is in JSON format, slightly harder to parse than the Atom XML.
https://webapps.stackexchange.com/questions/26567/how-to-raise-google-reader-rss-feed-entry-limit
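
To check from code how many items a feed really returns, the Atom XML can be parsed directly. This is a minimal sketch using REXML, assuming the feed is standard Atom with entry elements directly under the root; the helper name atom_entry_titles and the sample feed are made up for illustration, and the actual download is left out.

```ruby
require "rexml/document"

# Hypothetical helper: return the title of every <entry> in an Atom document.
def atom_entry_titles(xml)
  doc = REXML::Document.new(xml)
  # Compare local names so the Atom default namespace does not get in the way.
  entries = doc.root.elements.to_a.select { |el| el.name == "entry" }
  entries.map do |entry|
    title = entry.elements.to_a.find { |el| el.name == "title" }
    title ? title.text : nil
  end
end

# Made-up sample feed with two entries.
sample = <<~XML
  <feed xmlns="http://www.w3.org/2005/Atom">
    <title>Sample</title>
    <entry><title>First</title></entry>
    <entry><title>Second</title></entry>
  </feed>
XML
puts atom_entry_titles(sample).length
```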

facebook graph api returns error 2500 when there are commas in the id url

I'm attempting to retrieve the "shares" graph data for a number of pages in JSON format. I suspect that the errors I am encountering stem from the fact that some of the URLs have commas in them, and are being parsed as an attempt to pass multiple ids.
Returns graph data.
https://graph.facebook.com/?ids=http://celebritybabies.people.com/2012/08/23/backstreet-boys-howie-dorough-expecting-second-son/
Returns error 2500 "Cannot specify an empty identifier"
https://graph.facebook.com/?ids=http://www.people.com/people/article/0,,20624518,00.html
Encode the commas, still returns 2500
https://graph.facebook.com/?ids=http://www.people.com.people.article/0%2C%2C20624518%2C00.html
There doesn't seem to be a way around it other than to use the normal inspection
http://graph.facebook.com/http://www.people.com/people/article/0,,20624518,00.html
You may have to file a bug at http://developers.facebook.com/bugs, though I suspect the answer would most likely be "Status: by design".
You could try using FQL instead, querying the link_stat table:
SELECT url, normalized_url, share_count, comments_fbid FROM link_stat
WHERE url = 'http://www.people.com/people/article/0,,20624518,00.html'
(See result in Graph API Explorer.) You can also use WHERE url IN ("…", "…", …) to check multiple URLs at once.
This also returns a comments_fbid of 10151022112466453, and that one you can look up via the API, https://graph.facebook.com/10151022112466453
Maybe this can work as a workaround, until Facebook fixes this problem.
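
The FQL query above, including the IN form for several URLs, can be assembled like this. This is only a sketch: the helper name link_stat_query is made up, percent-encoding single quotes is my own precaution so a URL cannot terminate the string literal, and how the encoded q parameter is sent to Facebook is left open because the answer does not show the endpoint.

```ruby
require "uri"

# Hypothetical helper: build the link_stat query for one or more URLs.
def link_stat_query(urls)
  # Assumption: encoding "'" as %27 keeps each URL inside its quoted literal.
  quoted = urls.map { |url| "'" + url.gsub("'", "%27") + "'" }
  "SELECT url, normalized_url, share_count, comments_fbid FROM link_stat " \
    "WHERE url IN (#{quoted.join(', ')})"
end

query = link_stat_query(["http://www.people.com/people/article/0,,20624518,00.html"])
# Form-encode the query so the commas and spaces survive as a URL parameter.
puts URI.encode_www_form(q: query)
```

Because the page URL travels inside the query text rather than in the ids parameter, its commas are no longer interpreted as an id separator.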

Getting the original link on Wikipedia?

I have some links to Wikipedia articles, for example: https://en.wikipedia.org/wiki/Steve_jobs. When you visit that link, you will see right under the article's title: (Redirected from Steve jobs). If you follow that link you will eventually reach a page with the same URL, except that Steve_jobs has a capital J for Jobs. So it would look like this: https://en.wikipedia.org/wiki/Steve_Jobs
Is there a way I can retrieve the latter link using the first one?
You can find out where a given title redirects to with an API query like:
http://en.wikipedia.org/w/api.php?action=query&titles=Steve%20jobs&redirects
If you want the result in XML, add &format=xml to the URL, or &format=json for JSON.
Just use the query API. With http://en.wikipedia.org/w/api.php?action=query&titles=Steve+jobs|steve+jobs|steve+Jobs|Steve+Jobs&redirects you will get your titles normalized and redirected. See the API documentation for further options, especially the output formats.
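
Reading the final title out of such a response can be sketched as follows. This is a minimal sketch, assuming the JSON format, in which the query object may carry "normalized" and "redirects" lists of from/to pairs; the helper names resolve_wiki_title and wiki_url are made up, and the sample data only mirrors the Steve Jobs example.

```ruby
# Hypothetical helper: follow the "normalized" and then the "redirects"
# mapping of a parsed action=query response to find the final title.
def resolve_wiki_title(response, title)
  query = response.fetch("query", {})
  %w[normalized redirects].each do |step|
    mapping = query.fetch(step, []).find { |m| m["from"] == title }
    title = mapping["to"] if mapping
  end
  title
end

# Hypothetical helper: article URL for a title (spaces become underscores).
def wiki_url(title)
  "https://en.wikipedia.org/wiki/" + title.tr(" ", "_")
end

# Sample data shaped like the response for titles=steve%20jobs&redirects.
sample = {
  "query" => {
    "normalized" => [{ "from" => "steve jobs", "to" => "Steve jobs" }],
    "redirects"  => [{ "from" => "Steve jobs", "to" => "Steve Jobs" }]
  }
}
puts wiki_url(resolve_wiki_title(sample, "steve jobs"))
```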

dynamic seo title for news articles

I have a news section where the pages resolve to urls like
newsArticle.php?id=210
What I would like to do is use the title from the database to create seo friendly titles like
newsArticle/joe-goes-to-town
Any ideas how I can achieve this?
Thanks,
R.
I suggest you actually include the ID in the URL, before the title part, and ignore the title itself when routing. So your URL might become
/news/210/joe-goes-to-town
That's exactly what Stack Overflow does, and it works well. It means that the title can change without links breaking.
Obviously the exact details will depend on what platform you're using - you haven't specified - but the basic steps will be:
When generating a link, take the article title and convert it into something URL-friendly; you probably want to remove all punctuation, and you should consider accented characters etc. Bear in mind that the title won't need to be unique, because you've got the ID as well.
When handling a request to anything starting with /news, take the next part of the path, parse it as an integer and load the appropriate article.
Assuming you are using PHP and can alter your source code (which is necessary in order to get the article's title), I'd do the following:
First, you'll need to have a function (or maybe a method in an object-oriented architecture) to generate the URLs for you in your code. You'd supply the function with the article object or the article ID and it returns the friendly URL with the ID and the friendly title.
Basically function url(Article $article) => URL.
You will also need some URL rewriting rules to remove the PHP script from the URL. For Apache, refer to the mod_rewrite documentation for details (RewriteEngine, RewriteRule, RewriteCond).
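
The two platform-neutral steps from the first answer (generate a URL-friendly link, then route on the ID alone) can be sketched like this. The sketch is in Ruby rather than PHP, and the helper names slugify, article_path and article_id_from_path as well as the /news/ID/slug layout are illustrative assumptions; the rewrite rules themselves are not covered.

```ruby
# Hypothetical helper: turn a title into a lowercase, hyphen-separated slug.
def slugify(title)
  title.unicode_normalize(:nfd)      # split accented letters into base + mark
       .gsub(/\p{Mn}/, "")           # drop the combining marks
       .downcase
       .gsub(/[^a-z0-9]+/, "-")      # collapse punctuation and spaces
       .gsub(/\A-+|-+\z/, "")        # trim leading and trailing hyphens
end

# Hypothetical helper: link for an article; the ID comes before the slug.
def article_path(id, title)
  "/news/#{id}/#{slugify(title)}"
end

# Hypothetical helper: routing looks only at the ID and ignores the slug,
# so a changed title does not break old links. Returns nil when no ID is found.
def article_id_from_path(path)
  match = path.match(%r{\A/news/(\d+)(?:/|\z)})
  match && match[1].to_i
end

puts article_path(210, "Joe Goes to Town!")
```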

Resources