Autocomplete from 50k records - which DB solution to use? - ruby-on-rails

I want to add autocomplete to a Rails 5 app with PostgreSQL. The database will have around 50,000 records for the user to search through (all the neighborhoods of an entire country). I did some research on the web, but many tutorials are outdated and some of them recommend Redis as the best option for this case. Is there a more current approach I should follow these days? Thank you.

To summarize the comments under the question (writing it down, because users looking for answers usually gloss over comments... I know I do):
a 50k-record data set is not that big
the nature of the data set is that it will rarely be updated
therefore there will be a lot of reads, and almost no writes
So a PostgreSQL database should be more than enough, and Martin suggested a great read about trigram indexes, which are perfect for the task (a rough sketch is below).
If the day comes when a ton of users are hitting this autocomplete at once, you should be fine with simple replication.
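For the record, here is a minimal sketch of what the trigram approach could look like in a Rails 5 app; the table and column names (neighborhoods, name) are assumptions based on the question, not anything the asker specified:

    # Migration: enable pg_trgm and add a GIN trigram index on the column
    # the autocomplete searches against (names here are assumptions).
    class AddTrigramIndexToNeighborhoods < ActiveRecord::Migration[5.0]
      def up
        enable_extension "pg_trgm"
        # A GIN index with gin_trgm_ops makes ILIKE '%term%' lookups fast.
        execute <<-SQL
          CREATE INDEX index_neighborhoods_on_name_trgm
          ON neighborhoods USING gin (name gin_trgm_ops);
        SQL
      end

      def down
        execute "DROP INDEX index_neighborhoods_on_name_trgm;"
        disable_extension "pg_trgm"
      end
    end

    # Model: a simple finder the autocomplete endpoint could call.
    class Neighborhood < ApplicationRecord
      def self.suggest(term)
        # The parameter binding prevents SQL injection; escaping % and _
        # inside term is left out to keep the sketch short.
        where("name ILIKE ?", "%#{term}%").order(:name).limit(10)
      end
    end

A controller action can then render Neighborhood.suggest(params[:q]) as JSON for whatever autocomplete widget you use on the front end.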

Related

Strategy for regularly updating datasets from excel files

I have ~10 Excel files which are produced by a third party, updated each night, and available as a download. They contain ~10 fields (all short text/dates) and between ~10,000 and ~1m rows each.
I'm planning to create a simple web application to enable people to search the data. I'll host it on AWS or similar. Search load will be light, maybe ~1,000 searches/day.
I have to assume that all the records are unique each night and need to completely replace the online dataset.
It's relatively simple for me to convert the data from the Excel files into a database such as Postgres and create a simple search on top of it.
My question is how do I deal with the time it takes to do the database update each night? Should I create two databases and have my application alternate between them every other night?
What is a typical strategy for dealing with a situation like this?
My current skill set is Ruby/Rails/Postgres, building simple(ish) web apps. I've been intentionally vague about the technology because I'm open-minded about what to use, and I'm quite happy to learn something new to solve the problem.
If you do all the updates in one transaction, you don't need two DBs: the whole time you are updating the tables, people see the "old" version, and a moment after COMMIT they will see all the "new" data...
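A rough sketch of that approach in Rails terms; Record and rows_from_excel are placeholders, not names from the question:

    # Rebuild the table in one transaction: readers keep seeing the old
    # rows until COMMIT, thanks to PostgreSQL's MVCC. DELETE (rather than
    # TRUNCATE) is used because TRUNCATE takes a lock that would block
    # readers for the duration of the import.
    ActiveRecord::Base.transaction do
      Record.delete_all                  # old rows stay visible to readers...
      rows_from_excel.each do |attrs|    # ...until the block commits
        Record.create!(attrs)            # attrs is a hash of column values
      end
    end

Row-by-row create! will be slow for a million rows; a bulk insert inside the same transaction (for example via the activerecord-import gem or a raw COPY) keeps the nightly window short, but the visibility argument is the same.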

Re-indexing my Rails DB

I understand the basics of how indexing helps boost performance, and how to index my DB, but I'm confused about how often to re-index. Also, when I re-index my database, do I need to first remove the initial index, or can I just re-index as if I were indexing for the first time?
This is not a Rails question, it's a DBMS question. What, where, when, and how you re-index depends on your DBMS, but as a general rule re-indexing is rarely needed unless there is database corruption of some description, or you have a massive amount of changes to data that is included in the indexes. By changes I mean updates and deletes.
For example, if you use Postgres, then this link might help: http://www.postgresql.org/docs/9.1/static/routine-reindex.html
Also have a look through Stack Exchange. Questions and answers like https://dba.stackexchange.com/questions/1937/is-reindex-dangerous may enlighten you.
If you use MySQL, then this is a very good explanation: http://dev.mysql.com/doc/refman/5.0/en/rebuilding-tables.html
Look up whatever DBMS you are using and check the official documentation on how and when to re-index. The need to re-index is also likely to differ depending on the table types used: InnoDB and MyISAM in MySQL may well have different requirements, and CSV tables may well not have any indexing at all.
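If you do end up needing it on Postgres, the statement itself is a one-liner; from a Rails console or migration you could issue it through the connection (the index name here is made up):

    # Rebuild one specific index in PostgreSQL. Note that plain REINDEX
    # locks the index (and blocks writes to its table) while it runs,
    # so schedule it for a quiet period.
    ActiveRecord::Base.connection.execute("REINDEX INDEX index_users_on_email;")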

Suitability of Amazon SimpleDB for large temporal data sets eminating from thousands of separate devices

I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL but can happily change to doing it in code. SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search for solutions to this, I did come across the Stack Overflow question What is the best open source solution for storing time series data?, which was useful.
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it at the time.
I would recommend it for this sort of scenario, where you need a primary key plus what might be described as a secondary index/range key, e.g. timestamps. This gives you much greater confidence in searches like "show me all the data for device X between Monday and Friday" (a rough sketch follows below).
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
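For illustration only, a hash-key plus range-key query of that kind might look roughly like this with the Ruby AWS SDK; the table name, key names, and timestamp format are assumptions, not details from the original system:

    require "aws-sdk-dynamodb"  # modular v3 gem; older code would use require "aws-sdk"

    client = Aws::DynamoDB::Client.new(region: "us-east-1")

    # Assumed table layout mirroring the SQL row above: device_id is the
    # hash (partition) key, utc_timestamp (an ISO 8601 string) the range key.
    resp = client.query(
      table_name: "sensor_readings",
      key_condition_expression:
        "device_id = :device AND #ts BETWEEN :from AND :to",
      expression_attribute_names: { "#ts" => "utc_timestamp" },
      expression_attribute_values: {
        ":device" => "device-123",
        ":from"   => "2012-06-04T00:00:00Z",
        ":to"     => "2012-06-08T23:59:59Z"
      }
    )
    resp.items.each { |item| puts item["value1"] }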
In my opinion, Amazon SimpleDB, as well as Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do things that are an absolute non-issue on relational databases, like aggregates, you begin to run into trouble. So if you are going to do some heavy reporting, it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-variable data in such a way that the file size never grows beyond its initial setting. It's extremely cool and very useful for generating graphs and time-series information.
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store your data in a way that lets most of your queries be executed from a single domain, without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here
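As a toy illustration of such a partition strategy (the domain names and count are arbitrary), hashing the device id to a domain keeps each device's whole history in one domain, so per-device time-range queries never have to cross domains:

    require "zlib"

    DOMAIN_COUNT = 16  # arbitrary; sized to stay within SimpleDB's per-domain limits

    # Deterministically map a device to one SimpleDB domain. CRC32 is used
    # instead of Ruby's #hash because it is stable across processes.
    def domain_for(device_id)
      bucket = Zlib.crc32(device_id.to_s) % DOMAIN_COUNT
      format("sensor_data_%02d", bucket)
    end

    domain_for("device-123")  # => "sensor_data_05" (for example)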

Free data warehousing systems--specifically, for data storage

I am building out some reporting stuff for our website (a decent sized site that gets several million pageviews a day), and am wondering if there are any good free/open source data warehousing systems out there.
Specifically, I am only looking for something to store the data--I plan to build a custom front end/UI for it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for, if what you need (and I'm going by your descriptions above here) is somewhere to store large amounts of logged data with normalization, i.e. a visitor log.
Hypertable is based on Google's BigTable project.
See http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks.
You lose the relational capabilities of SQL-based DBs, but you gain a lot in performance. You could easily use Hypertable to store millions of rows per hour (hard drive space notwithstanding).
Hope that helps.
I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball's "The Data Warehouse Toolkit", you will find that all it takes for a basic DW is a plain-vanilla SQL database; in other words, you could build a decent DW with MySQL using MyISAM as the storage engine. The question is only the desired granularity of information -- what you want to keep and for how long. If your reports are mostly periodic, and you implement a report store or cache, then you don't need to store pre-calculated aggregations (no need for cubes). In other words, a Kimball star with cached reporting can provide decent performance in many cases.
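To make the "plain-vanilla SQL database" point concrete, a minimal star for pageview reporting could be as small as the sketch below (written as a Rails migration only because that is the stack mentioned elsewhere on this page; the table and column names are purely illustrative):

    # A tiny Kimball-style star: one fact table keyed to two dimensions.
    class CreatePageviewStar < ActiveRecord::Migration[5.0]
      def change
        create_table :dim_dates do |t|
          t.date    :calendar_date, null: false
          t.integer :year, :month, :day_of_week, null: false
        end

        create_table :dim_pages do |t|
          t.string :url_path, null: false
          t.string :section
        end

        # One row per page per day; a finer grain (e.g. hourly) just means
        # another dimension or column, at the cost of more rows.
        create_table :fact_pageviews do |t|
          t.references :dim_date, :dim_page, null: false
          t.integer    :view_count, null: false, default: 0
        end
        add_index :fact_pageviews, [:dim_date_id, :dim_page_id], unique: true
      end
    end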
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
Open source
Uses standard relational database
MDX (think pivot table)
ETL (via Kettle)
I use this.
In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine that for reporting you might find their MapReduce implementation useful as well.
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from one of the most revolutionary teams in database technology. They just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising them.
Perhaps you should have a look at TPC and see which of their test problem datasets best matches your case, and work from there.
Also consider the need for concurrency; it adds a big overhead to any kind of approach, and sometimes it is not really required. For example, you can pre-digest some summary or index data and only have that protected for high concurrency. Profiling your data queries is the next step.
About SQL: I don't like it either, but I don't think it's smart to rule out an engine just because of its front-end language.
I am facing a similar problem and am thinking of using plain MyISAM with http://www.jitterbit.com/ as the data access layer. Jitterbit (or another similar free tool) seems very nice for this sort of transformation.
Hope this helps a bit.
A lot of people just use MySQL or Postgres :)

MVC Implementation where a Search Engine is the Model

Maybe I am misstating the problem and conflating the answer with the question, but please hear me out. I would like to think (communally, with you) about a site that is based on any of the MVC frameworks (something PHP or ASP.NET MVC, whatever) and that would use a search engine (Lucene/Solr, FAST ESP, whatever) as the back end of the Model. That is to say, there is no database per se in the project, just a giant index of documents that are semi-structured content.
I am looking to understand - and keep in mind the site is primarily read-only - where I am likely to run into trouble. What are the things that make you think this is a bad idea from the get-go? Also, please assume that there will be a robust infrastructure with caching surrounding the search engine, so while performance comments are welcome, we feel they are not the major problem.
Thanks!
In general, I'd use a tool like Lucene for searching content, and a database for retrieving it. That doesn't mean that it won't work. It's more a question of why you don't want to use a database. Yes, it can work, and it probably will work (depending on the functional requirements of the site, read on), but that still doesn't make a tool like Lucene the right tool for the job per se.
That being said, it does depend on the kind of site, however. Is it really a site with just a whole bunch of searchable data and nothing else, or is it something much more than that? If the answer is the former, then good! If it is the latter, there are some issues I can think of:
Updates to the data can be troublesome. "Instant updates" are usually a no-go, as Lucene would have to rebuild its index, which is time-consuming. If there aren't many updates to the data, that's fine. You can just recreate the index a couple of times per day, or nightly, if that works.
Trying to stuff data into an index when it is not really suited to being indexed is usually not a good idea. If the site lets users register, then that user data should really go in a database. It's not impossible to store it in a Lucene index; it's just not the right tool for the job. Use the index as a bunch of indexed documents, but don't use it as a database as well.
