My problem is that I'm trying to check the content of a text area against the web, looking for plagiarized content. My high-level solution is to get the content of the text area, enclose it in "double quotes", and search for it on Google. I then want to return the top 5 sites that Google returns.
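The quoting step on its own could be sketched roughly like this. It only builds the exact-phrase query strings and a plain search URL; it does not call any gem or API. The method names, the `min_words` cutoff, and the URL shape are assumptions made for illustration.

```ruby
require "uri"

# Split the text into sentence-like chunks and wrap each one in double
# quotes so a search engine treats it as an exact phrase.
# Chunks shorter than min_words are skipped (assumed cutoff).
def exact_phrase_queries(text, min_words: 4)
  text.split(/(?<=[.!?])\s+/)
      .map(&:strip)
      .select { |sentence| sentence.split.size >= min_words }
      .map { |sentence| "\"#{sentence.delete('"')}\"" }
end

# Build a URL-encoded search URL for one quoted phrase.
# The base URL is an assumption; swap in whatever endpoint you actually use.
def search_url(query, base: "https://www.google.com/search")
  "#{base}?#{URI.encode_www_form(q: query)}"
end

queries = exact_phrase_queries("EVE Search - Heavy dict defence? Short one.")
queries.each { |q| puts search_url(q) }
```

Searching sentence by sentence rather than quoting the whole text area at once is a design choice here: a very long exact phrase is unlikely to match anything.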
To apply that pseudocode in my app, I installed the google-search gem, but when I run my string, the results returned by the gem have missing items. Take, for example, the search "EVE Search - Heavy dict defence?": if you run it on Google, it returns 1 site, but in my app it doesn't return anything.
Does anybody have any ideas?
I believe accessing search through Google's API is different from accessing it in your browser. Google tailors its search results dynamically based on things like browser history, your local cache, cookies, etc. Even if you searched for "EVE Search - Heavy dict defence?" from two different browsers, your results could vary.
Also, check out this answer: https://stackoverflow.com/a/654558/1481644
You can go with the ruby-web-search gem. It takes a query string to search for and returns the matching sites. For more info, check here: https://github.com/mattetti/ruby-web-search
I'm trying to do an advanced search to get tweets that MUST include one word from the following three word groups: html or css, developer or engineer, home or remote.
I've read Twitter's documentation, and the query I should be using is:
html OR css developer OR engineer home OR remote
And I also tried:
html OR css AND developer OR engineer AND home OR remote
https://twitter.com/search?f=tweets&vertical=default&q=html%20OR%20css%20developer%20OR%20engineer%20home%20OR%20remote&src=typd
I'm getting inaccurate results: it's showing tweets that don't have at least one word from each word group.
Where is the issue? I've contacted Twitter's support but they don't respond to individual reports :/
ATTENTION: I don't want results from the Top tab. The Top tab only shows popular tweets. The Live tab shows all the tweets, and that's what I want. https://support.twitter.com/articles/131209
Just put quotes around the words.
"html" OR "css" AND "developer" OR "engineer" AND "home" OR "remote" seems to do exactly what you want.
You need to put brackets around the terms you wish to group. For example:
html OR css AND developer OR engineer AND home OR remote
becomes
(html OR css) AND (developer OR engineer) AND (home OR remote)
Example
You need to put brackets around each group. Also "AND" should be replaced with a space. Try this:
(html OR css) (developer OR engineer) (home OR remote)
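If you build the query in code, the grouping can be generated from lists of words. This is a minimal sketch: the method name is made up, and the URL simply mirrors the `f=tweets` search URL from the question.

```ruby
require "uri"

# Each inner array is one word group from the question.
groups = [%w[html css], %w[developer engineer], %w[home remote]]

# Join the words of a group with OR, wrap the group in brackets,
# and separate groups with a space (which acts as AND).
def grouped_query(groups)
  groups.map { |words| "(#{words.join(' OR ')})" }.join(" ")
end

query = grouped_query(groups)
# URL-encode the query; the parameters follow the search URL in the question.
url = "https://twitter.com/search?#{URI.encode_www_form(f: 'tweets', q: query)}"

puts query
puts url
```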
According to Twitter's Search API:
"Before getting involved, it’s important to know that the Search API is focused on relevance and not completeness. This means that some Tweets and users may be missing from search results. If you want to match for completeness you should consider using a Streaming API instead."
So that's why the search results are "inaccurate". I ended up creating a Node.js script that uses the Streaming API and filters the tweets I want.
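The script itself was written in Node.js and isn't shown here. The filtering step alone, which is independent of how the tweets arrive, might look something like this in Ruby. The method name and the sample tweets are made up for illustration.

```ruby
# Word groups from the question; a tweet must contain at least one word from each.
groups = [%w[html css], %w[developer engineer], %w[home remote]]

# True when the text contains, as a whole word, at least one term
# from every group (case-insensitive).
def matches_all_groups?(text, groups)
  words = text.downcase.scan(/[a-z0-9]+/)
  groups.all? { |group| group.any? { |term| words.include?(term) } }
end

# Made-up sample tweets standing in for whatever the stream delivers.
tweets = [
  "Hiring a remote CSS engineer",
  "HTML tips for beginners",
  "Developer wanted, work from home, html required"
]

matching = tweets.select { |tweet| matches_all_groups?(tweet, groups) }
puts matching
```

Matching on whole words avoids false hits such as "homepage" counting as "home".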
I want a program that, given a website, gets all the URLs indexed for it and outputs them cleanly, one URL per line, including the URLs that are no longer linked from the website itself (a spider can already find the linked ones).
I have been searching and only finding sloppy options. What I want is accurate and simple: INPUT: URL, OUTPUT: ALL THE URLS.
I don't know of such an application right now, but I'll try to simplify your task by dividing it:
1. You need a list of your website's internal links. Any web crawler tool can do that.
2. You need a list of your website's pages indexed by Google. There are a lot of search engine index checkers; you can google for one.
3. Compare the 2nd list to the 1st one, and find all the links present in Google's index but missing from your website.
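The comparison in the last step is a set difference. A minimal sketch, assuming both lists are already available as arrays of URL strings (the sample URLs and method names are made up):

```ruby
require "set"
require "uri"

# Normalize a URL so trivial differences (scheme, host case, trailing slash,
# fragment) do not make two links to the same page look different.
def normalize(url)
  uri = URI.parse(url.strip)
  path = uri.path.to_s.sub(%r{/\z}, "")
  query = uri.query ? "?#{uri.query}" : ""
  "#{uri.host.to_s.downcase}#{path}#{query}"
end

# URLs that appear in the search index but were not found by the crawler.
def indexed_but_unlinked(crawled, indexed)
  linked = crawled.map { |url| normalize(url) }.to_set
  indexed.reject { |url| linked.include?(normalize(url)) }
end

crawled = ["http://example.com/", "http://example.com/about"]
indexed = ["http://example.com/about/", "http://Example.com/old-page", "http://example.com"]

puts indexed_but_unlinked(crawled, indexed)
```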
On the Webmasters Q&A site, I asked the following:
https://webmasters.stackexchange.com/questions/42730/how-does-indeed-com-make-it-to-the-top-of-every-single-search-for-every-single-c
But, I would like a little more information about this from a development perspective.
If you search Google for anything job related, for example, Gastonia Jobs (City + jobs), then, in addition to their search results dominating the first page of Google, you get a URL structure back that looks like this:
indeed.com/l-Gastonia,-NC-jobs.html
I am assuming that the l stands for location in the URL structure. If you search for an industry-related job, or a job with a specific company name, you will get back something like the following (Microsoft jobs):
indeed.com/q-Microsoft-jobs.html
With just over 40,000 cities in the USA I thought, ok, maybe it's possible they looped through them and created a page for every single one. That would not be hard for a computer. But then obviously the site is dynamic as each of those pages has 10000s of results and paginated by 10. The q above obviously stands for query. The locations I can understand, but they cannot possibly have created a web page for every single query combination, could they?
OK, it gets a tad weirder. I wanted to see if they had a sitemap, so I typed "indeed.com sitemap.xml" into Google and got the response:
indeed.com/q-Sitemap-xml-jobs.html
Again, I searched for "indeed.com url structure" and, as I mentioned in the other post on Webmasters, I got back:
indeed.com/q-change-url-structure-l-Arkansas.html
Is indeed.com somehow using programming to create a web page on the fly based on my search input into Google? If they are not, how are they able to have a static page for millions and millions of possible query combinations, have them dynamically paginate, and then have all of those dominate Google's first page of results (although that very last question may be best suited to the Webmasters Q&A)?
Does the JavaScript in the page somehow interact with the URL?
It's most likely not a bunch of static pages. The "actual" page might be http://indeed.com/?referrer=google&searchterm=jobs%20in%20washington. The site then cleverly exposes a human-readable URL using URL rewriting, fetches the jobs in the database that match the query, and voilà...
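To illustrate the rewriting idea, here is a rough sketch of mapping a pretty path back to search parameters. The pattern is only inferred from the URLs quoted in the question; how Indeed actually does it is not known, and in practice this mapping would normally live in the web server or framework routing rather than in a standalone method.

```ruby
# Map a human-readable path back to search parameters.
# Assumed shapes, taken from the URLs in the question:
#   /q-<query>-jobs.html, /l-<location>-jobs.html, /q-<query>-l-<location>.html
def parse_pretty_path(path)
  slug = path.sub(%r{\A/}, "").sub(/\.html\z/, "").sub(/-jobs\z/, "")
  params = {}
  if (m = slug.match(/\Aq-(.+?)(?:-l-(.+))?\z/))
    params[:q] = m[1]
    params[:l] = m[2] if m[2]
  elsif (m = slug.match(/\Al-(.+)\z/))
    params[:l] = m[1]
  end
  # Hyphens in the slug stand in for spaces.
  params.transform_values { |value| value.tr("-", " ").squeeze(" ").strip }
end

p parse_pretty_path("/q-Microsoft-jobs.html")
p parse_pretty_path("/l-Gastonia,-NC-jobs.html")
p parse_pretty_path("/q-change-url-structure-l-Arkansas.html")
```

With a mapping like this, any search phrase produces a valid page, which would explain why even "sitemap.xml" comes back as a jobs URL.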
I could be dead wrong, of course. Truth be told, the technical side of it can probably be solved in a multitude of ways. For instance, every time a job is added to the site, all the pages needed to match that job might be created, producing an enormous number of pages for Google to crawl.
This is a great question; however, it remains unanswered, on the grounds that, first, a basic Google search using
site:indeed.com
returns over 120MM results and, second, a query such as "product manager new york" ranks #1 in the results. These pages are obviously pre-generated, which is confirmed by the fact that the page cached by the search engine (sometimes several days earlier) has different results from a live query on the site.
It's easy: when Google's search bot crawls the pages on Indeed or any other job search site, those pages are dynamically created. Here is another site, which I run and which works similarly to Indeed: http://jobuzu.co.uk
PHP is your friend in this. Indeed doesn't just use standard databases; look into Sphinx and Solr, as they offer full-text search with better performance than MySQL etc.
They also make clever use of rel="canonical" and thorough internal linking:
http://www.indeed.com/find-jobs.jsp
Notice that all the pages that actually rank can be found from that direct internal link structure.
I am trying to add a search bar to my website like the one on Facebook. I want my users to be able to search through my products, my other users, and so on. I also want the results to be displayed in real time without pressing a button. I am currently looking at several options (thinking-sphinx, ferret, ...) but I am not sure which one to use, and that's why I would like to get advice from pros ;)
My requirements are:
Results displayed in real time in a box on the current page.
CSS customizable.
Able to search through several tables in my PostgreSQL DB.
I am currently using Heroku for production.
I want to choose the best one for my needs and that's why I am asking your opinion.
Thanks in advance !
Make sure to separate how you want data displayed (your first two requirements) from how you want it indexed (third) from how you want to deploy.
Let's start backwards. Heroku provides limited support for machine configuration; both options you mention require installation of a service that reads and writes files. Heroku has such an option in two ways: 1) PostgreSQL has a built-in full-text search capability, and 2) Heroku has made Flying Sphinx an option, as well as these other options documented on the Heroku site. The first two options may provide the easiest linkage to your database, but I haven't tried other options, so it's possible they do too. So now you have a search index, deployed.
Real-time "incremental" search is purely a matter of presentation ... and maybe performance. "Start typing and you start getting results" is nothing more than sending requests via AJAX to the search server, typically after a short pause in typing (maybe 50ms), and handling the display of the results. A couple of ways to make that simple are written up in this SO answer.
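On the server side, each of those AJAX requests carries a partially typed query. If you stay with PostgreSQL's built-in full-text search, one way to handle that is to treat the last word as a prefix using the `:*` syntax of `to_tsquery`. The sketch below only builds the query string; the method name, the `products` table, and the `name` column are assumptions for illustration.

```ruby
# Turn what the user has typed so far into a PostgreSQL tsquery string.
# Completed words must match exactly; the last word is matched as a prefix.
# Scanning for alphanumerics also strips any tsquery operators from the input.
def prefix_tsquery(input)
  terms = input.downcase.scan(/[[:alnum:]]+/)
  return nil if terms.empty?

  (terms[0...-1] + ["#{terms.last}:*"]).join(" & ")
end

# Hypothetical table and column names; $1 is bound to the tsquery string.
sql = <<~SQL
  SELECT id, name
  FROM products
  WHERE to_tsvector('simple', name) @@ to_tsquery('simple', $1)
  LIMIT 10
SQL

puts prefix_tsquery("Red sho")
puts sql
```

Returning `nil` for empty input lets the controller skip the database call entirely while the box is empty.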
I ended up keeping the PostgreSQL engine for now, but used Select2 as the jQuery plugin for the presentation and was able to get a Facebook-like search box:
Select2 with Rails and JSON
Thanks for your help!
I've tried Thinking Sphinx after being pointed in that direction, and simple filtering seems impossible. I've googled and asked questions for 2 days now, and it seems it can't be done, which is shocking because it's something commonly done when searching on websites.
All I would like to do is add filtering options to my search form, such as filtering by one or a combination of the options below.
When a user hits the browse page, all of the site's users are returned, showing 20 results per page.
Filtering options
in: location
who are: sexual preference
between the ages: age range
and located in: country
My search page works fine because all I require is 1 textfield a user uses for finding users by email, username or full name. My browse page is a different story because I'm using 1 form with multiple text fields and one or two select fields.
Example
Is there a gem that does this easily and performs well at the same time?
or would doing this manually via find methods be the only way?
Kind regards
Apart from Sphinx and Thinking Sphinx, you could also consider these gems: meta_where and meta_search.
However after reading your description I think Sphinx is the best choice here indeed.
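If you do end up going the manual route the question mentions, the core of it is turning the submitted form into a conditions hash and ignoring whatever was left blank. A minimal sketch in plain Ruby; the parameter keys are hypothetical names mirroring the filters in the question.

```ruby
# Build a conditions hash from the browse form, skipping blank filters.
# All keys are assumed names based on the filters listed in the question.
def filter_conditions(params)
  conditions = {}
  %i[location sexual_preference country].each do |key|
    value = params[key].to_s.strip
    conditions[key] = value unless value.empty?
  end

  # Integer(..., exception: false) returns nil for missing or non-numeric input.
  min = Integer(params[:min_age].to_s, exception: false)
  max = Integer(params[:max_age].to_s, exception: false)
  conditions[:age] = (min..max) if min && max && min <= max

  conditions
end

form = { location: "London", sexual_preference: "", country: "UK",
         min_age: "25", max_age: "35" }
p filter_conditions(form)
```

In a Rails app the resulting hash could be passed to something like `User.where(conditions)`, where a Range value becomes a BETWEEN clause; `User` is an assumed model name, and the 20-per-page paging would be added on top.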
You wrote that it seems impossible to apply simple filtering using Thinking Sphinx. Let me explain a bit of Thinking Sphinx within the post you mentioned under the link: Example
You can go for Elasticsearch (http://www.elasticsearch.org/). Ruby has the 'Tire' gem, which is a client for Elasticsearch.