Lucidworks Fusion - lots of apps OK?

For a search engine that will be used on lots of little sites, is there any downside to having a lot of apps, and linking to generic pipeline objects? They want to have site-specific synonyms and spelling suggestions.

If by "lots" you mean a few dozen, you are likely fine. Keep in mind that each app will likely have several collections, and once a cluster reaches around 1,000 collections, ZooKeeper can't keep up with the coordination tasks it needs to perform.

Related

How many Docker containers do I need, and how do I separate code between them?

I'm in the process of designing a web service hosted on Google App Engine that is composed of three parts: a client website (or more than one), a simple CMS I designed to edit and view the content of that website, and lastly a server component to communicate between these two services and the database. I am new to Docker and am currently doing research to figure out exactly how to set up my containers, along with the structure of my project.
I would like each of these to be a separate service, and therefore put each in its own container. From my research it seems perfectly possible to put them in separate containers and still have them communicate, but is this the optimal solution? Also, in the future I might want to scale up so that my back end can supply multiple different front ends, all managed from the same CMS.
tl;dr:
How should I structure my web service with Docker, assuming my back end supplies more than one front end managed from a single CMS?
Any suggestions for tools or design patterns that would make my life easier are welcome!
Personally, I don't like to design an application in terms of containers. Containers exist to serve the deployment process; that is their main goal.
If you keep your logic in separate components/services, you'll be able to combine them within containers in many different ways.
Once you have criteria for what suits your product requirements (performance, price, security, etc.), you can configure your Docker images however you prefer.
So my advice is to focus on the design of your application first. Start from the components you have, provide a Dockerfile for each one, and then see what you will have to change.

Designing a modern platform in Rails 3

I'm in the early stages of prototyping a Rails 3 application that will expose a public API. The site has three separate concerns which I am planning to split across three subdomains.
api.mysite.com
The publicly exposed API.
admin.mysite.com
The admin portal for creating blogs (using the public API).
x.mysite.com
The public blog site created at admin.mysite.com where x is the name of the blog. This too will make use of the public API.
All three will share domain objects. For example, you should be able to login to admin.mysite.com using an account you created on api.mysite.com or x.mysite.com.
Questions
Should I attempt to build one Rails application to handle all three concerns, or should I split this into multiple applications, each handling a specific concern?
What are the pros and cons of each?
Does anyone have any insight into how some of the larger sites (Basecamp, GitHub, Shopify) are organized?
Your question is fairly general, so I'll try to answer in general terms. The fact that you mention "larger sites" leads me to the conclusion that you're concerned about scaling.
In the beginning it is definitely going to be easier to build one application, especially since the domain is shared. You can do separate controllers for the various interfaces (API, HTML, etc.) but with a shared back end. This will reduce code duplication and the complexity of keeping three apps in sync. Also remember that you might change your mind about features based on user feedback, and you want to be nimble enough to respond quickly.
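To make that concrete, here is a rough sketch (with entirely hypothetical app, controller, and model names) of how a single Rails 3 app might route the three subdomains to separate controller namespaces while sharing the same models:

```ruby
# config/routes.rb -- a sketch only, not a drop-in config
MySite::Application.routes.draw do
  # api.mysite.com: the public API
  constraints :subdomain => 'api' do
    namespace :api do
      resources :blogs, :accounts
    end
  end

  # admin.mysite.com: the admin portal, built on the same models
  constraints :subdomain => 'admin' do
    namespace :admin do
      resources :blogs
    end
  end

  # any other subdomain (x.mysite.com) is treated as a public blog site
  constraints :subdomain => /\A(?!(www|api|admin)\z)\w+\z/ do
    root :to => 'public_blogs#show'
  end
end
```

Because all three namespaces sit on the same ActiveRecord models, an account created through the API works on admin.mysite.com without any syncing.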
The main benefit I can see of separating out three different deployables is that you can have an independent deploy schedule for each. For example, a bug fix in the api won't have to wait for admin to be ready to deploy. Or that you can have separate teams working in parallel.
If you're careful about what you keep in your session, you'll be able to deploy multiple instances of your application on multiple servers, pointing at the same database (a.k.a. horizontal scaling). Each of these instances is identical to the others, and a load balancer (either dedicated hardware or virtual) directs traffic between them. Eventually this approach runs out of steam when your database can't handle the load. At that point you can look at more caching, sharding, NoSQL and all sorts of clever scaling techniques.
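One concrete piece of "being careful with the session" in Rails 3 is keeping it in a signed cookie, or in a store every instance can reach, rather than in local server memory. The application name below is hypothetical:

```ruby
# config/initializers/session_store.rb -- sketch, not a drop-in file
# A cookie-based session keeps each app instance stateless, so the load
# balancer can send any request to any instance.
MyApp::Application.config.session_store :cookie_store, :key => '_myapp_session'

# If the session must hold more than a few small identifiers, use a shared
# store instead, e.g. the database-backed store that ships with Rails 3:
# MyApp::Application.config.session_store :active_record_store
```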
Most (but not all) larger sites end up doing some sort of horizontal scaling with some sharding of data.
All told, focus on getting a useful application to your users. If things take off, you can worry about scaling. More applications fail because the user experience is awful than because they couldn't scale.

Am I the only one that queries more than one database?

After much reading on Ruby on Rails and multiple database connections, it seems that I have found something that not that many folks do, at least not with RoR. I am used to querying many different databases and schemas and pulling back the information either for a report or for one seamless page, so a user doesn't have to log on to several different systems: I can create a page that brings all the systems together on one or two web pages.
Is that not a normal occurrence in the web and database driven design?
EDIT: Is this because almost all my original code is in Classic ASP?
I honestly think that most ORM designers don't take into account that users may want to access more than one database. This seems to be a pretty common limitation in the ORM universe.
Our client website runs across three databases, so I do this too. Actually, I'm condensing everything into views off of one central database, which then connects to the others.
I never considered this to be "normal" behavior though. I would guess that most of the time you would be designing for one system and working against that.
EDIT: Just to elaborate, we use Linq to SQL for our data layer and we define the objects against the database views. This way we keep reports and application code working off the same data model. There is some extra work setting up the Linq entities, because you have to manually define primary keys and set up associations; however, so far it has definitely proven worthwhile. We tried to do the same with Entity Framework, but had a lot of trouble getting the relationships set up appropriately and had to give up. The funny thing is I had thought Entity Framework was supposed to be designed for more advanced scenarios like ours...
It is not uncommon to hit multiple databases during a single part of an application's workflow. However, in every instance that I have done it, this has been performed through several web service calls, which among other things wrap the databases in question.
I have not, to my knowledge, ever had a need to hit multiple databases directly at once and merge results into a single report.
I've seen this kind of architecture in corporate portals, where lots of data is pulled in via different data sources. The whole point of a portal is to bring siloed systems together; users might not want to be using lots of systems in isolation (especially if they have to sign into each one). In that sort of scenario it is normal, particularly in a large company that has expanded rapidly and has a large number of heterogeneous systems.
In your case, whether this is the right thing to do depends on why you have these separate DBs.
With ORMs it may be a little difficult. However, it can be done. Pull the objects as needed from the various databases, then combine them into a composite object that is the one you actually want. If you can skip the ORM part of the process, you can query the databases directly and build your object from the results.
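In ActiveRecord terms, that usually means giving some models their own connection and composing a plain object from the results. A minimal sketch, with made-up database names and models:

```ruby
# config/database.yml would gain a second entry alongside the default, e.g.:
#   reporting:
#     adapter:  mysql2
#     database: legacy_reporting
#     host:     reports-db.internal

class ReportingRecord < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :reporting          # models below use the second database
end

class PageView < ReportingRecord           # lives in the reporting database
end

class User < ActiveRecord::Base            # lives in the app's own database
end

# Compose a plain object from both sources for one seamless page or report.
DashboardRow = Struct.new(:user_name, :view_count)

def dashboard_rows
  User.all.map do |user|
    DashboardRow.new(user.name, PageView.where(:user_id => user.id).count)
  end
end
```

As written this issues one count query per user, so a real report would aggregate on the reporting side; the connection wiring is the part that usually trips people up.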
Pulling data from two databases and compiling a report is not uncommon, but because cross-database queries cannot be optimized by the query engine of either database, OLTP systems typically use a single database, to keep the application performant.
If you are building the system from the ground up, it is not advisable to do it this way. If you are working with a system you didn't design, there is not much choice, and it is not uncommon (that is the difference between "organic" and "planned" growth).
Not counting master and various test instances, I hit nine databases on a regular basis. Yes, I inherited it, and yes, "Classic" ASP figures prominently. Of course, all the "brillant" designers of this mess are long gone. We're replacing it with things more sane as quickly as we safely can.
I would think that if you're building a new system and keep adding databases to the point of two or three, it's probably time to re-think your design. OTOH, if you're aggregating data from multiple, disparate systems, then no, it's not that strange. Depending on the timeliness you need, your budget for throwing hardware at the problem, and whether your data is mostly static, this could be a good scenario for a "reporting server" that pulls the data down from the live server periodically.

A/B testing and stats solutions

I've been looking for a good testing framework for months, not finding anything, so I've just been building my own.
This is what I want to do:
- track arbitrary behaviors (e.g. # of photos viewed, # of comments posted)
- track correlation between arbitrary variables and those behaviors
(e.g., how do different versions of this prompt affect the average # of photos viewed?)
This kind of thing should be a core part of agile development. What's out there? I know Google Website Optimizer is one of the answers, but you can only track behaviors that end in a single "success" page.
It'd be great to have a plugin that can work within your code (Rails in my case) and feed into a nice hosted service with pretty graphs...
You may want to partition your problem into analytics (impressions, actions, possibly in-page events, and reporting on your tests) and a framework for serving up your variations (how you prepare and manage x variations, whether you need to store variations for future reference, turning tests on and off, optimizing the effectiveness of a test, etc.). There is clearly an overlap; for example, Google Website Optimizer can turn off a bad variation as soon as it has data to support doing so. But by thinking about these as different problems, you may be able to reuse, say, the Google Analytics component.
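For the "serving up variations" half, a hand-rolled version can be as small as a deterministic bucketing helper. Everything below (class names, the commented track_event call) is hypothetical rather than part of any existing library:

```ruby
require 'digest/md5'

class Experiment
  def initialize(name, variants)
    @name     = name
    @variants = variants
  end

  # Deterministic assignment: the same visitor always gets the same variant,
  # which is what lets you correlate behaviors with variants later.
  def variant_for(visitor_id)
    bucket = Digest::MD5.hexdigest("#{@name}:#{visitor_id}").to_i(16)
    @variants[bucket % @variants.size]
  end
end

# Usage: pick a variant of the photo prompt for a visitor.
prompt_test = Experiment.new('photo_prompt', %w[original friendly urgent])
visitor_id  = 'abc123'                     # in Rails, likely a cookie or session id
variant     = prompt_test.variant_for(visitor_id)

# The analytics half is just an event log of (visitor, experiment, variant,
# behavior) rows, so "# of photos viewed per variant" becomes a GROUP BY later:
# track_event(visitor_id, 'photo_prompt', variant, 'photo_viewed')
```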
Yep, here:
http://github.com/paulmars/seven_minute_abs/tree/master

Ruby On Rails/Merb as a frontend for a billions of records app

I am looking for a backend solution for an application written in Ruby on Rails or Merb to handle data with several billion records. I have a feeling that I'm supposed to go with a distributed model, and at the moment I have looked at:
HBase with Hadoop
CouchDB
Problems with the HBase solution as I see them: Ruby support is not very strong, and CouchDB has not reached version 1.0 yet.
Do you have a suggestion for what you would use for such a large amount of data?
The data will require rather fast imports, sometimes 30-40 MB at once, but imports will come in chunks. So ~95% of the time the data will be read-only.
Depending on your actual data usage, MySQL or Postgres should be able to handle a couple of billion records on the right hardware. If you have a particularly high volume of requests, both of these databases can be replicated across multiple servers (and read replication is quite easy to set up compared to multi-master/write replication).
The big advantage of using an RDBMS with Rails or Merb is that you gain access to all of the excellent tool support for accessing these types of databases.
My advice is to actually profile your data in a couple of these systems and take it from there.
There are a number of different solutions people have used. In my experience it really depends more on your usage patterns for that data than on the sheer number of rows per table.
For example: how many inserts/updates per second are occurring? Questions like these will play into your decision about which back-end database solution to choose.
Take Google for example: There didn't really exist a storage/search solution that satisfied their needs, so they created their own based on a Map/Reduce model.
A word of warning about HBase and other projects of that nature (I don't know anything about CouchDB -- I think it's not really a DB at all, just a key-value store):
1. HBase is not tuned for speed; it's tuned for scalability. If response speed is at all an issue, run some proofs of concept before you commit to this path.
2. HBase does not support joins. If you are using ActiveRecord and have more than one relation... well, you can see where this is going.
The Hive project, also built on top of Hadoop, does support joins; so does Pig (but it's not really SQL). Point 1 applies to both. They are meant for heavy data processing tasks, not the type of processing you are likely to be doing with Rails.
If you want scalability for a web app, basically the only strategy that works is partitioning your data and doing as much as possible to ensure the partitions are isolated (don't need to talk to each other). This is a little tricky with Rails, as it assumes by default that there is one central database. There may have been improvements on that front since I looked at the issue about a year and a half ago. If you can partition your data, you can scale horizontally fairly wide. A single MySQL machine can deal with a few million rows (PostgreSQL can probably scale to a larger number of rows but might work a little slower).
Another strategy that works is having a master-slave setup, where all writes are done by the master and reads are shared among the slaves (and possibly the master). Obviously this has to be done fairly carefully! Assuming a high read/write ratio, this can scale pretty well.
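A crude sketch of that read/write split with ActiveRecord, assuming a 'slave' entry in database.yml; the model names are made up, and in practice you would probably reach for a library rather than hand-rolling this:

```ruby
class SlaveRecord < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :slave              # read-only replica defined in database.yml
end

class Article < ActiveRecord::Base         # writes always go to the master
end

class ArticleReadOnly < SlaveRecord        # reads can be served by the replica
  set_table_name 'articles'
end

# Write to the master, read back from the replica (mind the replication lag).
Article.create!(:title => 'hello')
recent = ArticleReadOnly.find(:all, :order => 'created_at DESC', :limit => 10)
```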
If your organization has deep pockets, check out what Vertica, AsterData, and Greenplum have to offer.
The backend will depend on the data and how the data will be accessed.
But for the ORM, I'd most likely use DataMapper and write a custom DataObjects adapter to get to whatever backend you choose.
I'm not sure what CouchDB not being at 1.0 has to do with it. I'd recommend doing some testing with it (just generate a billion random documents) and see if it'll hold up. I'd say it will, despite not having a specific version number.
CouchDB will help you a lot when it comes to partitioning/sharding your data and the like, and it seems like it might fit your project, especially if your data format might change in the future (adding or removing fields), since CouchDB databases have no schema.
There are plenty of optimizations in CouchDB for read-heavy apps as well, and based on my experience with it, that is where it really shines.
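If you do run that kind of test, CouchDB's _bulk_docs endpoint makes it cheap to pump in random documents. A rough Ruby sketch; the database name, document shape, and batch counts are made up, so scale them to whatever you want to prove:

```ruby
require 'net/http'
require 'json'
require 'securerandom'

couch = URI('http://localhost:5984/loadtest')

Net::HTTP.start(couch.host, couch.port) do |http|
  http.put(couch.path, '')                 # create the database (errors if it already exists)

  1_000.times do                           # 1,000 batches of 1,000 docs = 1M; raise as needed
    docs = Array.new(1_000) do
      { 'kind' => 'sample', 'value' => rand(1_000_000), 'tag' => SecureRandom.hex(8) }
    end
    request = Net::HTTP::Post.new("#{couch.path}/_bulk_docs",
                                  'Content-Type' => 'application/json')
    request.body = { 'docs' => docs }.to_json
    http.request(request)
  end
end
```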
