Single search engine for multiple sites (MediaWiki, JIRA, etc.)

I have a problem I would like to discuss. My company is working on a project and we have a wiki (MediaWiki), a bug database (JIRA), and several other project-related sites. All the sites are hosted on the same web server, and the problem is that it's very inconvenient to search for a piece of information that may be found on any of these websites. All the sites run from Apache on Linux, if it matters.
I would like to know if there's a way to integrate Google internally (or any other quality search engine) so it would index all of the sites into one search engine, letting all the employees search in one place and find the correct reference (the way we all work when we're using the internet).
Thanks in advance

Check out Solr
http://lucene.apache.org/solr/
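Once a Solr core is up, the usual pattern is to pull content out of each system (the MediaWiki API, the JIRA REST API, or a plain crawl of the other sites) and push it all into that one index. Here is a rough, untested sketch using Solr's JSON update API; the core name "intranet", the host/port, and the field names are assumptions you would adapt to your own schema.

```javascript
// Sketch: push documents from several internal sites into one Solr core
// via Solr's JSON update endpoint. Assumes Node 18+ (global fetch) and a
// core named "intranet" whose schema accepts these fields.
const SOLR_UPDATE_URL = 'http://localhost:8983/solr/intranet/update?commit=true';

// In practice these come from the MediaWiki API, the JIRA REST API, or a
// crawl of the other sites; they are hard-coded here for illustration.
const docs = [
  { id: 'wiki-42',  source: 'mediawiki', title: 'Deployment checklist',   body: '...' },
  { id: 'jira-101', source: 'jira',      title: 'Login page returns 500', body: '...' },
];

async function indexDocs(documents) {
  const res = await fetch(SOLR_UPDATE_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(documents),
  });
  if (!res.ok) {
    throw new Error(`Solr update failed: ${res.status} ${await res.text()}`);
  }
}

indexDocs(docs).then(() => console.log('Indexed into Solr'));
```

With everything in one core, a single internal search page can query Solr's /select handler and use the "source" field to label which site each hit came from.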

Related

Integrating GCS on a staging Jekyll website

We currently have a staging website at an IP address like xx.xx.xxx.xxx, and we would like to integrate and test GCS on it before pushing it live. Is that possible?
Otherwise, is there any alternative to GCS for adding a search bar to a Jekyll blog without using a plugin?
Thanks!
PART I: Google
Google Custom Search cannot index an application that isn't available to the internet.
No, that's not entirely true. You can arrange something with Google (in theory; I've never done it), but it doesn't look easy. Or cheap.
You could set up a custom search for an unrelated site and embed those results in your local page, if you want to test out CSS prior to launch.
Remember, Google Custom Search also comes with ads, unless you pay. And the results tend to look like they came from Google.
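If all you want is to preview how the embedded results box looks and behaves on the staging page, the standard Custom Search element can be dropped in and pointed at a search engine configured for an already-public site. A small sketch follows; the cx value is a placeholder, and attaching the box to document.body is just for illustration.

```javascript
// Sketch: load the Google Custom Search element on a staging page to preview
// styling/behaviour. The cx below is a placeholder search engine ID that must
// belong to a CSE configured against a publicly reachable site.
var cx = 'YOUR_SEARCH_ENGINE_ID'; // placeholder

var script = document.createElement('script');
script.async = true;
script.src = 'https://cse.google.com/cse.js?cx=' + encodeURIComponent(cx);
document.head.appendChild(script);

// cse.js scans the page for elements with this class and renders the
// search box (and results) inside them. Run this after the DOM is ready.
var box = document.createElement('div');
box.className = 'gcse-search';
document.body.appendChild(box);
```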
PART II: Alternatives
I've looked into this extensively and I haven't come up with a good answer. Here are some not-so-good answers:
1) Tapir Search. This actually worked pretty well, but it appears to have died. They do have recent Twitter activity, however, so it may be worth checking back in a bit. It's basically a (free) front end for an Elasticsearch server, I think. A neat service, but obviously not super dependable.
2) Go JavaScript. Lunr, for example (see the sketch at the end of this answer). There are many, many similar solutions available. Sadly, they are client-side, and doing a full-text search on even a smallish blog-type site can be very slow. It works okay if you limit the search to titles, but then... you're only searching titles.
3) Build a search engine server. Maybe some breed of Lucene. Upside: very robust search while keeping the snappy response of a flat HTML site. Downside: building and maintaining a search engine server is difficult, expensive and probably overkill.
4) Hosted search engine. Algolia for example. They're basically doing 3) for you. Relatively expensive (~$50/month) but well worth the cost, because, seriously, search engine servers are finicky and prone to explosions. I've never gone this route with Jekyll because I've never had a Jekyll project I was quite that serious about, but I did consider it.
If anyone has anything to add, I'd love to hear it. This question has been irritating me for a while.
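For the JavaScript route in option 2, here is a minimal Lunr sketch. It assumes the Lunr script is already loaded on the page and that the site generates a /search.json file listing every post, which is a common Jekyll pattern rather than anything built in; the field names are illustrative.

```javascript
// Minimal client-side search with Lunr. Assumes the lunr script is already
// included on the page and that /search.json lists every post as
// { url, title, body } objects (generated at build time).
async function buildIndex() {
  const documents = await (await fetch('/search.json')).json();

  // Building the index in the browser is the part that gets slow as the site grows.
  const idx = lunr(function () {
    this.ref('url');
    this.field('title');
    this.field('body');
    documents.forEach(function (doc) { this.add(doc); }, this);
  });
  return idx;
}

buildIndex().then(idx => {
  // Each result carries the ref (here, the page URL) and a relevance score.
  idx.search('deployment').forEach(r => console.log(r.ref, r.score));
});
```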

Client-side search engine

I would like to get some suggestions on my current headache. I have been researching search engines that run client-side, in the browser. I am building a custom glossary project, and the search engine will be used for searching terms, keywords, or definitions.
Here are my requirements for this project:
no server-side support; totally client-side
intranet only
must work in browsers without HTML5 support
thousands of terms
Any suggestions or ideas on how to build a client-side-only search engine?
Thanks in advance
Barring a full desktop app, I don't see how you can do this without a server. Distribute the HTML files as a .ZIP file and browse them locally? If so, you probably need to make a honking big page with the whole database inside it (as JavaScript data structures) and search that way. Shouldn't be too hard with regexes and so on, but I doubt you'll end up with a fun user experience...
Really curious why a simple server-side solution won't work :-)
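To make the "whole database in one page" idea concrete, here is a sketch that ships the glossary as a plain script and filters it with basic string matching, so it needs no server and no HTML5 APIs. The entries and names are made up for illustration.

```javascript
// Sketch of the "whole database in one page" approach: the glossary is embedded
// as a plain JavaScript array and filtered with simple string matching, which
// works from a local file and in old, pre-HTML5 browsers.
var GLOSSARY = [
  { term: 'API',   definition: 'Application Programming Interface ...' },
  { term: 'Cache', definition: 'A store of recently used data kept close at hand ...' }
  // ...thousands more entries, generated from your source data
];

function searchGlossary(query) {
  var q = query.toLowerCase();
  var hits = [];
  for (var i = 0; i < GLOSSARY.length; i++) {
    var entry = GLOSSARY[i];
    if (entry.term.toLowerCase().indexOf(q) !== -1 ||
        entry.definition.toLowerCase().indexOf(q) !== -1) {
      hits.push(entry);
    }
  }
  return hits;
}

// Wire this to a text box's onkeyup handler and render the hits into a list.
```

A few thousand entries is small enough that a linear scan like this usually stays responsive, even in old browsers.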

How to extend an existing Ruby on Rails CMS to host multiple sites?

I am trying to build a CMS I can use to host multiple sites. I know I'm going to end up reinventing the wheel a million times with this project, so I'm thinking about extending an existing open source Ruby on Rails CMS to meet my needs.
One of those needs is to be able to run multiple sites, while using only one code-base. That way, when there's an update I want to make, I can update it in one place, and the change is reflected on all of the sites. I think that this will be able to scale by running multiple instances of the application.
I think that I can use the domain/subdomain to determine which data to display. For example, someone goes to subdomain1.mysite.com and the application looks in the database for the content for subdomain1.
The problem I see is that most pre-built CMS solutions, including the one I want to use, are designed to host only one site, so the database is structured to work with one site. However, I had the idea that I could overcome this by "creating a new database" for each site, then specifying which database to connect to based on the domain/subdomain as I mentioned above.
I'm thinking of hosting this on Heroku, so I'm wondering what my options for this might be. I'm not very familiar with Amazon S3 or Amazon SimpleDB, but I feel like there's some sort of "cloud database" that would make this solution a lot more realistic than creating a new MySQL database for each site.
What do you think? Am I thinking about this the wrong way? What advice do you have to offer in this area?
I've worked on a Rails app like this, and the way it was done there was name-based virtual hosts, with a database entry for each site that was running. Each record was scoped to a site if necessary (blog posts, etc.), while users had access to all sites running out of that database. Administrator permissions could be global or scoped to one or more sites.
You're absolutely correct when you say you'll reinvent the wheel a million times during the project. Plugins will likely require hacking on top of the CMS itself.
In my situation, it ended up being a waste of almost a million dollars of company money to build that codebase to run multiple sites while still being able to cater to the whims of each client site. It worked, but was not very maintainable due to the number of site-specific hacks that subsequently entered the codebase. You may be able to make it work if you don't have to worry about catering to specific client sites running on your platform.
In the end, you're going to need a layer of indirection to handle the different sites regardless of methodology. We ended up putting it in the database itself. If you go with the different-db-for-each-site method you mentioned, you'll put that layer in your code instead. I'm not sure which one is the better method.
I hope you're able to pull this off. I failed.
Also, as I learned today, Heroku offers PostgreSQL rather than MySQL for Rails apps.
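Whichever way you slice it, the layer of indirection mentioned above boils down to: resolve the site from the request's host name once, then scope every content query to that site. The question is about Rails, but the idea is framework-agnostic; here is a rough sketch in plain Node with an in-memory stand-in for the database, where all names are invented for illustration.

```javascript
// Framework-agnostic sketch of the shared-database approach: resolve the site
// from the subdomain, then scope content queries to it. In-memory objects
// stand in for the sites and posts tables.
const http = require('http');

const sites = {
  subdomain1: { id: 1, name: 'Site One' },
  subdomain2: { id: 2, name: 'Site Two' },
};
const posts = [
  { siteId: 1, title: 'Hello from site one' },
  { siteId: 2, title: 'Hello from site two' },
];

http.createServer((req, res) => {
  // e.g. "subdomain1.mysite.com" -> "subdomain1"
  const host = req.headers.host || '';
  const subdomain = host.split('.')[0];

  const site = sites[subdomain];
  if (!site) {
    res.writeHead(404);
    return res.end('Unknown site');
  }

  // Every content query is scoped to the resolved site.
  const sitePosts = posts.filter(p => p.siteId === site.id);
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ site: site.name, posts: sitePosts }));
}).listen(3000);
```

In Rails the equivalent lookup would typically live in a before_action that sets the current site from the subdomain, with every model holding site-specific content carrying a site reference.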
There's James Stewart's Theme Support Plugin for Rails 2.3, and lucasefe's themes_for_rails gem for Rails 3+.
I just started using the 2.3 version and it's working well so far.

Best way to add full web search to my site?

I need to add full web search to my site. I need something like Google Custom Search, but with no ads, and it has to be free. Any recommendation of a web service or open-source project that can index my site and allow me to search it would be helpful.
My site is made in Ruby on Rails, if that helps.
I'll make this question community-wiki so you can edit my bad English. I think many people can benefit from this question.
Check out Lucene. It's an open source search engine that will certainly be a fun learning experience to implement on your own site. It was originally designed by the Excite folks, I do believe.
Ferret is the Ruby port of Lucene. Check out the acts_as_ferret plugin.
It depends what you mean by full web search, really. If you want to search the whole web, then the answers above won't help you much, as they are really for indexing and searching the content of your own site. I would suggest using the Google AJAX Search API (just a 'powered by Google' attribution needed, no ads) or BOSS from Yahoo (might require ads, not sure).
http://code.google.com/apis/ajaxsearch/
http://developer.yahoo.com/search/boss/
People are going with acts_as_solr and Thinking Sphinx in the blogs I read:
http://acts-as-solr.rubyforge.org/
http://ts.freelancing-gods.com/
I've also been looking at tsearch in Postgres; it looks very capable:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
What do you mean by "full web search"?
There are good answers available for full-text search, where a search engine indexes and queries the model objects stored in your database.
If you mean something that indexes and queries your rendered HTML, Nutch is a popular option with a web-crawler, parser, indexer, and query interface.
I recommend acts_as_xapian. It's very easy to implement, it's fast enough, and it's got the features you'll normally need.

Where do search engines start crawling?

What do search engine bots use as a starting point? Is it DNS look-up, or do they start with some fixed list of well-known sites? Any guesses or suggestions?
Your question can be interpreted in two ways:
Are you asking where search engines start their crawl from in general, or where they start to crawl a particular site?
I don't know how the big players work, but if you were to make your own search engine you'd probably seed it with popular portal sites. DMOZ.org seems to be a popular starting point. Since the big players have so much more data than we do, they probably start their crawls from a variety of places.
If you're asking where a SE starts to crawl your particular site, it probably has a lot to do with which of your pages are the most popular. I imagine that if you have one super popular page that lots of other sites link to, then that would be the page the SEs enter from, because there are so many more entry points from other sites.
Note that I am not in SEO or anything; I just studied bot and SE traffic for a while for a project I was working on.
You can submit your site to search engines using their site submission forms - this will get you into their system. When you actually get crawled after that is impossible to say - from experience it's usually about a week or so for an initial crawl (homepage, couple of other pages 1-link deep from there). You can increase how many of your pages get crawled and indexed using clear semantic link structure and submitting a sitemap - these allow you to list all of your pages, and weight them relative to one another, which helps the search engines understand how important you view each part of site relative to the others.
If your site is linked from other crawled websites, then your site will also be crawled, starting with the linked page and eventually spreading to the rest of your site. This can take a long time and depends on the crawl frequency of the linking sites, so URL submission is the quickest way to let Google know about you!
One tool I can't recommend highly enough is Google Webmaster Tools. It allows you to see how often you've been crawled and any errors the googlebot has stumbled across (broken links, etc.), and it has a host of other useful tools in there.
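The sitemap mentioned above is just an XML file that lists your URLs, optionally with priority and change-frequency hints. A tiny generator sketch follows; the page list and weights are invented, and in practice you would pull them from your CMS or router before submitting the file through the search engines' webmaster tools.

```javascript
// Sketch: build a sitemap.xml string from a (made-up) list of pages with
// relative priorities, following the sitemaps.org protocol.
const pages = [
  { loc: 'https://www.example.com/',       priority: 1.0, changefreq: 'daily'  },
  { loc: 'https://www.example.com/docs/',  priority: 0.8, changefreq: 'weekly' },
  { loc: 'https://www.example.com/about/', priority: 0.3, changefreq: 'yearly' },
];

const entries = pages.map(p => [
  '  <url>',
  `    <loc>${p.loc}</loc>`,
  `    <changefreq>${p.changefreq}</changefreq>`,
  `    <priority>${p.priority.toFixed(1)}</priority>`,
  '  </url>',
].join('\n')).join('\n');

const sitemap =
  '<?xml version="1.0" encoding="UTF-8"?>\n' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
  entries + '\n' +
  '</urlset>\n';

console.log(sitemap); // serve this as /sitemap.xml, then submit it
```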
In principle they start with nothing. Only when somebody explicitly tells them to include their website can they start crawling that site and use the links on it to find more.
However, in practice the creator(s) of a search engine will put in some arbitrary sites they can think of. For example, their own blogs or the sites they have in their bookmarks.
In theory one could also just pick some random addresses and see if there is a website there. I doubt anyone does this, though; the above method works just fine and does not require extra coding just to bootstrap the search engine.
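The bootstrap described above is essentially a queue of seed URLs: fetch each page, pull out its links, and add any unseen ones back to the queue. Here is a toy sketch of that loop; it assumes Node 18+ for the global fetch, the seed list is made up, and the regex link extraction is deliberately crude.

```javascript
// Toy crawl frontier: start from a few seed URLs, fetch each page, extract
// links, and queue any we haven't seen yet. Capped so it stays a toy.
const seeds = ['https://www.example.com/']; // e.g. the creators' own blogs or bookmarks
const seen = new Set(seeds);
const queue = [...seeds];
const MAX_PAGES = 50;

async function crawl() {
  let fetched = 0;
  while (queue.length > 0 && fetched < MAX_PAGES) {
    const url = queue.shift();
    let html;
    try {
      html = await (await fetch(url)).text();
    } catch {
      continue; // skip pages that fail to load
    }
    fetched++;
    console.log('crawled', url);

    // Very rough link extraction; a real crawler would parse HTML properly,
    // respect robots.txt, and rate-limit per host.
    for (const match of html.matchAll(/href="(https?:\/\/[^"#]+)"/g)) {
      const link = match[1];
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
}

crawl();
```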
