Ruby on Rails 3 search external website source based on top google result - ruby-on-rails

I'm having a hard time finding out where to start with this one. I want to pull information from an external website and put some of the content on my page. I think I need two things:
1. A Google search that returns the URL of the top result, given the name of my current object.
2. A way to examine the source of that result and output the contents of a tag with a specific class.
To better explain this, I'll create a hypothetical situation: say I have a website that lists mattresses and gives reviews, and I want to add other websites' reviews. On one of those websites there's a tag containing a rating like 3.5/5. I want to display that review along with a link to the external page. Is there a way to run a search like "site:http://mattressreviewsite/ #{mattress.name}", pull the top URL from the results, then search that page's source for the element with class='rating' and display its contents in my view?
Thanks for any help or guidance. I'm using Rails 3.

You need an HTTP client (httparty, or the built-in net/http) for that, plus some parsing to extract the required results.
Go study Google's URL patterns (as far as I remember it is google.com/search?q=search_string) and use the HTTP client for the requests (GET/POST). Parse the result (there are plenty of HTML parser gems available, too) to get what you need and for any subsequent HTTP requests. And don't forget Google's 'I'm Feeling Lucky' feature, which returns only one result.
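For example, here is a rough sketch of that flow using httparty and nokogiri. The mattressreviewsite.com domain, the Google result selector, and the rating markup are all assumptions for the hypothetical example above, and scraping Google's result page may break or violate its terms (the Custom Search API is the supported route):
require "httparty"
require "nokogiri"
require "cgi"

mattress_name = "ExampleBrand Deluxe"   # in the app this would be mattress.name

# 1. Ask Google for results restricted to the (hypothetical) review site.
query = CGI.escape("site:mattressreviewsite.com #{mattress_name}")
google_html = HTTParty.get("https://www.google.com/search?q=#{query}").body

# The h3/a selector is an assumption about Google's result markup.
top_link = Nokogiri::HTML(google_html).at_css("h3 a")

if top_link
  # 2. Fetch the top result and pull out the element with class="rating".
  page   = Nokogiri::HTML(HTTParty.get(top_link["href"]).body)
  node   = page.at_css(".rating")
  rating = node ? node.text : nil      # e.g. "3.5/5" in the hypothetical markup
end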
All the best!

Related

Get all URLs indexed by Google for a website

I want a program that, given a website, gets all the URLs Google has indexed for it, with clean output: all the URLs, line by line. I also want the indexed URLs that are no longer linked on the website itself (a spider can already find the linked ones).
I have been searching and only finding sloppy options; what I want is accurate and simple: INPUT: URL. OUTPUT: ALL THE URLS.
I don't know of such an application, but I'll try to simplify your task by dividing it:
You need a list of your website's internal links. Any web crawler tool can do that.
You need a list of your website's pages indexed by Google. There are a lot of SE index checkers; you can google for one.
Compare the 2nd list to the 1st one and find all the links present in Google's index but missing from your website (a minimal comparison sketch follows below).
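For that last comparison step, once both lists are exported as plain text files with one URL per line (the file names below are just placeholders), the diff itself is trivial; a minimal Ruby sketch:
require "set"

# Placeholder file names: one URL per line in each file.
crawled = File.readlines("site_links.txt").map { |l| l.strip.chomp("/") }.to_set
indexed = File.readlines("google_index.txt").map { |l| l.strip.chomp("/") }

# URLs Google has indexed that no longer appear anywhere on the site.
orphaned = indexed.reject { |url| crawled.include?(url) }
puts orphaned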

LabVIEW to Google Spreadsheet information transfer

I have been using LabVIEW to collect measurement data, and I would like to know if it is possible for LabVIEW to send the results to a Google Spreadsheet. If so, where could I find resources to learn how to make LabVIEW transmit information to the Google Spreadsheet?
Thanks!
EDIT AND FOLLOW-UP: I used Jonathan's suggestion below and experimented with the LabVIEW HTTP Post.vi. It's very simple: all you need to do is enter the URL of the Google form (replacing the final "viewform" with "formResponse") and a string with the data you want to enter (with a rough syntax of field=value pairs). A big thanks for that answer, it was really helpful!
However, when I try to use this method for a Google form with more than one page, the data isn't read properly. The form is still sent, but every field not present on the first page of the form remains blank in the spreadsheet. I feel this is somehow linked to the fact that, in the Google form, the URLs of all the pages after page 1 are the same as the URL of page 1 with the final "viewform" replaced with "formResponse". Is this what is causing the error, or is it something else altogether, and how can I fix it?
I can think of two ways to do this:
You can create a form in Google Spreadsheets. The form appears as an HTML document with standard tags. From there, I would use LabVIEW's HTTP functionality to submit data to that form using a POST request (a rough sketch of such a POST follows after this list). This would be the easiest way to get data in there.
Using the Google Apps API, you can manipulate Google spreadsheets and dump data in there directly. This is going to be more complicated in terms of development time, but more configurable in the long run: https://developers.google.com/google-apps/spreadsheets/#what_can_this_api_do There are .NET and Java code examples throughout the documentation, so it would take some work to port this to LabVIEW, but it could be done.
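Not LabVIEW, but as a language-agnostic illustration of what the POST in the first option looks like (the form URL and the entry.N field names below are hypothetical; the real names can be read from the form's HTML source), here is a short Ruby sketch:
require "net/http"
require "uri"

# Hypothetical URL: the form's public "viewform" URL with the trailing
# "viewform" replaced by "formResponse".
uri = URI("https://docs.google.com/forms/d/e/FORM_ID/formResponse")

# Hypothetical field names; each question in the form has an entry.N name.
fields = {
  "entry.123456" => "25.3",            # e.g. a measured value
  "entry.654321" => "run 7, sensor A"  # e.g. a label
}

response = Net::HTTP.post_form(uri, fields)
puts response.code   # 200 on success; the row appears in the linked spreadsheet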

From a development perspective, how does the indeed.com URL structure and site work?

On the Webmasters Q&A site, I asked the following:
https://webmasters.stackexchange.com/questions/42730/how-does-indeed-com-make-it-to-the-top-of-every-single-search-for-every-single-c
But, I would like a little more information about this from a development perspective.
If you search Google for anything job related, for example, Gastonia Jobs (City + jobs), then, in addition to their search results dominating the first page of Google, you get a URL structure back that looks like this:
indeed.com/l-Gastonia,-NC-jobs.html
I am assuming that the l stands for location in the URL structure. If you do a search for an industry-related job, or a job with a specific company name, you will get back something like the following (Microsoft jobs):
indeed.com/q-Microsoft-jobs.html
With just over 40,000 cities in the USA, I thought: OK, maybe it's possible they looped through them and created a page for every single one; that would not be hard for a computer. But the site is obviously dynamic, since each of those pages has tens of thousands of results and is paginated by 10. The q above obviously stands for query. The locations I can understand, but they cannot possibly have created a web page for every single query combination, could they?
OK, it gets a tad weirder. I wanted to see if they had a sitemap, so I typed "indeed.com sitemap.xml" into Google and got the response:
indeed.com/q-Sitemap-xml-jobs.html
Again, I searched for "indeed.com url structure" and, as I mentioned in the other post on Webmasters, I got back:
indeed.com/q-change-url-structure-l-Arkansas.html
Is indeed.com somehow using programming to create a webpage on the fly based on my search input into Google? If not, how are they able to have a static page for the millions upon millions of possible query combinations, have them dynamically paginated, and then have all of those dominate Google's first page of results (albeit that last question may be best for the Webmasters Q&A)?
Does the JavaScript in the page somehow interact with the URL?
It's most likely not a bunch of pages. The "actual" page might be http://indeed.com/?referrer=google&searchterm=jobs%20in%20washington. The site then cleverly produces a human-readable URL using URL rewriting, fetches the jobs in the database that match the query, and voilà...
I could be dead wrong, of course. Truth be told, the technical aspect of it can probably be solved in a multitude of ways. Every time a job is added to the site, all the pages needed to match that job might be created, producing an enormous number of pages for Google to crawl.
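As a rough illustration of that kind of URL rewriting (a hypothetical Rails-style sketch, not Indeed's actual implementation; all names below are made up), every one of those static-looking URLs can be routed to a single dynamic action:
# config/routes.rb (hypothetical): /l-Gastonia,-NC-jobs.html and
# /q-Microsoft-jobs.html both hit one controller, not static files.
get "/l-:location-jobs.html", to: "jobs#by_location"
get "/q-:query-jobs.html",    to: "jobs#by_query"

# app/controllers/jobs_controller.rb (hypothetical)
class JobsController < ApplicationController
  def by_location
    location = params[:location].tr("-", " ")   # "Gastonia,-NC" -> "Gastonia, NC"
    @jobs = Job.where(location: location)
               .limit(10)                        # "paginated by 10"
               .offset(params[:page].to_i * 10)
  end
end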
This is a great question, but it remains unanswered, on the grounds that a basic Google search using
site:indeed.com
returns over 120MM results, and secondly a query such as "product manager new york" ranks #1 in results. These pages are obviously pre-generated, which is confirmed by the fact that the page cached by the search engine (sometimes several days before) has different results from a live query on the site.
Easy: when Google's search bot crawls the pages on Indeed or any other job search site, those pages are dynamically created. Here is another site I run that works similarly to Indeed: http://jobuzu.co.uk.
PHP is your friend in this, and Indeed don't just use standard databases; look into Sphinx and Solr, as they offer full-text search with better performance than MySQL etc.
They also make clever use of rel="canonical" and thorough internal linking:
http://www.indeed.com/find-jobs.jsp
Notice that all the pages that actually rank can be found from that direct internal link structure.

Making a website without hyperlinks

I am making a simple CMS to use in my own blog and I have been using the following code to display articles.
var xmlhttp = new XMLHttpRequest();
xmlhttp.onreadystatechange = function () {
  if (xmlhttp.readyState == 4 && xmlhttp.status == 200) {
    // Write the fetched article into the main viewing area
    document.getElementById("maincontent").innerHTML = xmlhttp.responseText;
  }
};
xmlhttp.open("GET", "getcontent.php?q=article+title", true);   // e.g. the clicked article
xmlhttp.send();
What it does is send a request to the server for the content associated with the article that was clicked on, then write that content to the main viewing area with ".innerHTML".
Thus I don't have actual links to other articles. I know that I can use PHP to output HTML so that it forms a link like:
<a href="getcontent.php?q=article+title">Article Title</a>
But being slightly OCD I wanted my output to be as neat as possible. Although search engine visibility is not a concern for my personal blog I intend to adapt this to a few other sites which have search engine optimization as a priority.
From what I understand, basically search engine robots follow links to index the web sites.
My question is:
Does this practice have any negative implications for search engine visibility? Also, are there other reasons for preferring one approach over the other, as I see that almost every site uses the 'link' method?
The link you've written will cause a full page reload. In order to leverage the standard AJAX code you've got at the top, you need to write the links as something along the lines of:
<a href="#" onclick="ajaxGet('article-title'); return false;">Article Title</a>
This assumes you have a JavaScript function called ajaxGet that takes the identifier of the article you're fetching as an argument.
If you were to write your entire site that way, search engines wouldn't be able to crawl you at all, since they don't execute JavaScript, so they couldn't get to anything beyond the front page. Also, even if they could follow the links, they'd have no way of referencing the page they landed on, since it doesn't have a unique URL. This is also annoying for users, since they can't get a link to an exact story to bookmark, send to a friend, etc.

How would I find all the short urls that link to a particular long url?

Basically I want to know how many people have tweeted a link to a URL, but since there are dozens of link shorteners out there, I don't see any way to do this without having access to all of their URL maps. I found a previous question here, but it was over a year old and didn't have any new answers.
So #1, does anyone know of a service/API for doing this?
And #2, can anyone think of a way to accomplish this task other than submitting the long url in question to all the popular link shortening sites?
ps- I'm also open to comments about why this is impossible or impractical.
You could perform a Google search (or the equivalent via an API) for any pages that link to your page. This is done with the link: keyword. So if you're trying to figure out how many people link to www.example.com (regardless of whether it's through a link-shortener URL), you would just do a Google search for link:www.example.com.
e.g.: http://www.google.com/search?q=link:www.example.com
Note that this will only find pages that have been indexed, so pages that haven't been crawled, or pages that get crawled infrequently, will not show up in the results until a later date (if at all).
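If you go the API route instead of the web UI, here is a rough sketch using Google's Custom Search JSON API (the API key and search-engine ID are placeholders, and whether that API honors the link: operator the same way web search does is an assumption):
require "net/http"
require "json"
require "uri"

# Placeholder credentials for the Custom Search JSON API.
uri = URI("https://www.googleapis.com/customsearch/v1")
uri.query = URI.encode_www_form(
  key: "YOUR_API_KEY",
  cx:  "YOUR_SEARCH_ENGINE_ID",
  q:   "link:www.example.com"
)

results = JSON.parse(Net::HTTP.get(uri))
puts results["searchInformation"]["totalResults"]   # count of indexed linking pages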
Since all these sites have different algorithms for shortening URLs, and they are separate sites that most likely do not share their data with each other, how can you hope to find all of them in a single query, or even a small number of queries?
All you can do is brute-force it, and even then it might not be any good if a site is content to create a new short value for the same long-form URL (especially if you send a different long-form URL that maps to the same place, like http://www.stackoverflow.com/ rather than http://stackoverflow.com/).
For this to really work, there would have to be a site that ALREADY collects all of this information automatically, and that every URL-shortening service voluntarily reports to. And even if you wrote such a site, that doesn't account for the URL-shortening sites already out there that already have data!
In short, I do not see how this is remotely possible, unless I'm wrong about there being such a database somewhere out there.
So, months after asking this question, I came across a solution to a similar question: how to tell how many times a link has been shared on Facebook. The solution is a simple new API call:
http://graph.facebook.com/http://stackoverflow.com
returns the following json data:
{
  "id": "http://stackoverflow.com",
  "shares": 1627
}
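For example, a minimal Ruby sketch of that call (assuming the unauthenticated Graph API endpoint shown above still responds this way):
require "net/http"
require "json"
require "uri"

uri  = URI("http://graph.facebook.com/http://stackoverflow.com")
data = JSON.parse(Net::HTTP.get(uri))

puts data["shares"]   # => 1627 at the time this answer was written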
