Large spreadsheet of static data: relational db or flat file? - ruby-on-rails

I have a spreadsheet, approximately 1500 rows x 1500 columns. The labels along the top and side are the same, and the data in the cells is a quantified similarity score for the two inputs. I'd like to make a Rails app allowing a user to enter the row and column values and retrieve the similarity score. The similarity scores were derived empirically, and can't be mathematically produced by the controller.
Some considerations: with every cell full, over half of the data is redundant; e.g., (row 34, column 985) holds the same value as (row 985, column 34). And row x will always be perfectly similar to column x. The data is static, and won't change for years.
Can this be done with one db table? Is there a better way? Can I skip the relational db entirely and somehow query the file directly?
All assistance and advice is much appreciated!

A database is always a safe place to store this, and a relational database is the straightforward, sensible choice. There are alternatives to consider, though. How often will this data be accessed: rarely or very frequently? If it's accessed rarely, just put it in the database and let your code handle searching and presentation; you can optimize with database indexes and the like.
A flat file is a reasonable idea, but reading and searching it at run time on every request is going to be too slow.
You could read all of the data (from the db or the file) at server startup, keep it in memory, and make sure your servers don't restart too often. That means each of your servers sits with the entire grid in memory, but lookups are going to be really fast. If you use REE and tune its garbage collection settings, you can also keep the server's startup time down to a large extent.
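As a rough sketch of that in-memory approach, assuming the spreadsheet is exported to a CSV file with the labels in the first row and first column (the file path, the module name, and the assumption that "perfectly similar" is stored as 1.0 are all placeholders, not details from the question):

    # Load the grid once at boot, e.g. from a Rails initializer.
    require 'csv'

    module SimilarityMatrix
      TABLE = {}

      def self.load(path)
        rows   = CSV.read(path)
        labels = rows.first.drop(1)               # header row holds the column labels
        rows.drop(1).each do |row|
          row_label = row.first
          row.drop(1).each_with_index do |score, i|
            # One canonical (sorted) key per pair, so the redundant half
            # of the grid collapses away.
            TABLE[[row_label, labels[i]].sort] = score.to_f
          end
        end
      end

      def self.lookup(a, b)
        return 1.0 if a == b                      # assumes self-similarity is stored as 1.0
        TABLE[[a, b].sort]
      end
    end

    SimilarityMatrix.load(Rails.root.join('db', 'similarity.csv'))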
Here's my final suggestion: just build your app in the simplest way you know, and start optimizing once you know how often and how heavily it is going to be used. You are fundamentally working with about 1,125,000 distinct cells once the redundant half of the 1500 x 1500 grid is dropped. That is not an unreasonably large dataset for a database to handle, and since the data will not change, you can go a long way with conventional caching techniques.
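If you do go with a single table, a minimal sketch might look like the following (table, column, and class names are placeholders, and the migration version in brackets should match your Rails version):

    # Migration: one row per unordered pair of labels.
    class CreateSimilarities < ActiveRecord::Migration[7.0]
      def change
        create_table :similarities do |t|
          t.string :label_a, null: false   # the lexically smaller label of the pair
          t.string :label_b, null: false
          t.float  :score,   null: false
        end
        # A unique composite index keeps the single-row lookup fast.
        add_index :similarities, [:label_a, :label_b], unique: true
      end
    end

    # Model: normalize the pair order on the way in and out, so only the
    # non-redundant half of the grid (about 1.1 million rows) is stored.
    class Similarity < ApplicationRecord
      def self.score_for(a, b)
        a, b = [a, b].sort
        find_by(label_a: a, label_b: b)&.score
      end
    end

Wrapping score_for in Rails.cache.fetch would then keep frequently requested pairs from hitting the database at all.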

Related

Which approach promises better performance - a megaquery or several targeted queries?

I am creating an SSRS report that returns data for several "Units", which are all to be displayed on a row, with Unit 1 first, to its right Unit 2 data, etc.
I can either get all this data using a Stored Proc that queries the database using an "IN" clause, or with multiple targeted ("Unit = Bla") queries.
So I'm thinking I can either filter each "Unit" segment with something like "=UNIT:[Unit1]", OR I can assign a different Dataset to each segment (with the targeted data).
Which way would be more "performant" - getting a big chunk of data, and then filtering the same thing in various locations, or getting several instances/datasets of targeted data?
My guess is the latter, but I don't know if maybe SSRS is smart enough to make the former approach work just as well or better by doing some optimizing "behind the scenes".
I think it really depends on how big the big chunk of data is. My experience has been that SSRS can process quite a large amount of data after it comes back from the database, and it does so quickly. If the report is going to aggregate the data in the end, I try to do as much of that as I can on the database end, because the database server usually has more resources to do that work. But if the detail is needed, and you can aggregate easily enough on the report server end, pull the 10K records and crunch them there.
I lean toward hitting the database as few times as possible, but sometimes it just makes sense to get the data I need with individual queries. I have built reports with over 20 datasets, each for very specific measures that just didn't union up well. Breaking it up like this took the report run time from 3 minutes to 20 seconds.
Not a great answer if you were looking for which exact solution to go with. It depends on the situation. Often, trial and error gets you to the answer for the report in question.
SSRS is not going to do any "optimizing", and the rendering requirements sound trivial, so you should probably treat this as a SQL query issue rather than an SSRS one.
I would expect the single SELECT with an IN clause to be faster, as it will require fewer I/Os on the database files. A stored procedure is not required; you can just write a SELECT statement.
A further benefit is that you will be left with N-times less code to maintain (where N = the number of Units), and can guarantee the consistency of the code/logic across Units.

How do I get the size of a Ruby object in MB in Rails?

I want to query an ActiveRecord model, modify it, and calculate the size of the new object in MB. How do I do this?
Unfortunately, neither the size of data rows in a database nor the in-memory size of Ruby objects is readily available.
While it is a bit easier to get a feeling for the object size in memory, you would still have to find all of the objects that make up your ActiveRecord object and should therefore be counted (which is not obvious). Even then, you would have to deal with non-obvious things like shared or cached data and class overhead, which may or may not need to be counted.
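For a rough lower bound you can at least ask Ruby for shallow object sizes via the objspace standard library; a sketch (Widget is a placeholder model, and the numbers are approximations, not exact sizes):

    require 'objspace'

    widget = Widget.first   # any ActiveRecord object; Widget is a placeholder

    # Shallow size of the object itself, in bytes. It does NOT include the
    # strings, hashes, or association objects it merely references.
    shallow_bytes = ObjectSpace.memsize_of(widget)

    # A still-approximate deeper estimate: add the shallow sizes of the
    # attribute values the record currently holds.
    deep_bytes = shallow_bytes +
      widget.attributes.values.sum { |v| ObjectSpace.memsize_of(v) }

    puts format('~%.3f MB', deep_bytes / (1024.0 * 1024.0))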
On the database side, it depends heavily on the storage engine used. From your database's documentation you can normally deduce the storage requirements for each of the columns you defined in your table (which can vary in the case of VARCHAR, TEXT, or BLOB columns). On top of this come shared resources such as indexes and general table overhead. For an estimate, though, the documented size requirements for the various columns in your table should be sufficient.
Generally, it is really hard to get a correct size for complex things like database rows or in-memory objects. The systems are not built to collect or provide this information.
Unless you absolutely, positively need an exact figure, you should err on the side of too much space. Generally, for databases it doesn't hurt to have too much disk space (in which case the database will generally run a little faster) or too much memory (which reduces memory pressure for Ruby, again making it faster).
The memory usage of Ruby processes is often not obvious. Thus, the best course of action is almost always to write your program and then test it with the desired amount of real data, checking its performance and memory requirements. That way you get the actual information you need, namely: how much memory does my program need when handling my required dataset?
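One way to observe that in practice is to measure the whole process before and after loading a realistic slice of data, for example with the third-party get_process_mem gem (again, Widget is a placeholder, and GC activity makes the delta approximate):

    require 'get_process_mem'   # gem install get_process_mem

    before_mb = GetProcessMem.new.mb.to_f

    records = Widget.limit(10_000).to_a   # load a realistic amount of real data

    after_mb = GetProcessMem.new.mb.to_f
    puts format('loading %d records grew the process by ~%.1f MB',
                records.size, after_mb - before_mb)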
The size of the record will be totally dependent on your database, which is independent of your Ruby on Rails application. It's going to be a challenge to figure out how to get the size, as you need to ask the DATABASE how big it is, and Rails (by design) shields you very much from the actual implementation details of your DB.
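If you do want to ask the database, you can drop down to raw SQL; for instance, on PostgreSQL (and only there; pg_total_relation_size is PostgreSQL-specific, and widgets is a placeholder table name):

    # Table size on disk including its indexes and TOAST data, in bytes.
    bytes = ActiveRecord::Base.connection
                              .select_value("SELECT pg_total_relation_size('widgets')")
    puts format('~%.2f MB on disk', bytes.to_f / (1024 * 1024))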
If you need to know the storage requirements in order to estimate how big a hard disk to buy, I'd do some basic math: estimate the size in memory, then multiply by 1.5 to give yourself some room.
If you REALLY need to know how much room it takes, try recording how much free space you have on disk, write a few thousand records, measure again, and then do the math.

Core Data Model Design - Attributes vs Entities

I've been developing a very basic core data application for over a year now (Toy Collector, http://bit.ly/tocapp), and I'm looking at doing a redesign so that I can build in iCloud support. I figured while I'm doing that, I might as well update my core data model (if needed), and I'm having a heck of a time tracking down "best practices" for the following:
Currently, I have 2 entities:
Toy, Keywords
Toy has all the information about the object: Name, Year, Set, imageName, Owned, Wanted, Manufacturer, etc. (18 attributes in all)
Keywords has the normalized words to help speed up the search
My question is whether or not there is any advantage to breaking out some of the Toy attributes into their own entities. For example, I could have a Manufacturer entity that stores the dozen or so manufacturers, instead of keeping that information in the Toy object. My gut tells me this could reduce the memory footprint (instead of 50,000 objects each storing a manufacturer string, there would simply be 12 manufacturer strings in an entity with a relationship to the main Toy entity). Does that kind of organization really matter? Am I trying to overcomplicate things? I just feel like my entity has a lot of attributes, and I'm not sure if taking the time to break it apart into multiple entities would make a difference.
Any advice or pointers would be appreciated!
Zack
Your question is pretty broad, since it addresses the topic of database design. Let me say upfront that it is almost impossible to give you any sensible suggestions, since I would need to know a lot more about your app, use cases, etc. than is possible through an S.O. question.
Coming to your concrete questions, I would say that you correctly identify one of the advantages of splitting a table into multiple ones; actually, the advantage of doing so is not just reducing the database footprint, but rather keeping data redundancy to a minimum. Redundancy not only affects the memory footprint but also the manageability and modifiability of your data, and uncontrolled redundancy can even cause anomalies or corruption. There is a whole topic of database theory, known as database normalisation, that addresses this kind of concern.
On the other hand, as is often the case, redundancy can help performance; this is what happens when you can fetch your data with a single simple query instead of multiple queries or table joins. There is a technique for improving database performance known as denormalisation, which is the exact opposite of normalisation. Your current scheme is fully denormalised.
Core Data is an object graph manager that usually runs on top of SQLite, a relational database manager, so you also have to take into account the fact that Core Data will automatically build your object graph and fetch the data into memory as you need it. This means that even if a smaller footprint on disk is a given, the same might not be true of the RAM footprint of your query results (at some point Core Data will, so to speak, "explode" your data from multiple tables into one object plus its attributes).
In your specific case, you should also possibly take into account the cost of migrating your existing user base (if the database is not read-only).
All in all, I would say that if your app does not have any database footprint issues at the moment, if you do not feel that creating new tables might be useful (e.g., in order to add new functionality, such as listing all manufacturers), and if you do not foresee tasks like renaming a manufacturer at some point, then refactoring your database may not add much benefit. But, as I say, without knowing your app in detail and your roadmap for it, it is difficult to say anything really on the spot. In any case, I hope these general considerations help you reach a decision.
EDIT:
If you want to investigate your Core Data performance and try to understand where the bottlenecks are, give the Core Data instrument in Instruments a try (Product/Profile menu). There are a lot of things that can go wrong.
On the other hand, it is really hard to help you further without more details about the type of searches your app allows. One thing that is not clear to me is whether your searches are slow only when they return a lot of results, or slow even when they return just a few.
Normalising might help performance if, after doing a search, you only use one normalised entity (e.g., to display the toy name in a table). In that case all of the attributes referring to other entities would be faults (hence would not occupy memory or require a fetch), and this might speed things up. But if you do a search and then display the information from the other tables as well, there might not be any advantage; quite the opposite, since the faults would have to be resolved immediately, producing more accesses to the database.
It is also true that, depending on how you use it, Core Data might not be the best way to handle your data. Have a look at this post by Brent Simmons relating his experience.

iOS App, 1000+ Data Entries: Core Data or simple Array?

I have about 1000+ data entries (Number, Name, Age, Color). On the first view the user can input a number, which will bring up the corresponding entries on the second view.
Do I have to work with Core Data, or is it a good solution to use a simple array to store the data?
Does the array use too much memory? Is the array too slow?
With 1000+ strings along with other data you are likely to run into memory problems rather soon. Also, if you are not using an incremental store such as Core Data provides, changing the smallest unit of your data would require rewriting the entire list.
Core Data would be my choice, but there is a somewhat steep learning curve. It is not really that difficult, but it may take some time to get used to. The NSFetchedResultsController will ensure good performance and a low memory footprint.
If, however, persistence is not really an issue and you are actually calculating the data (e.g. based on the initial number), you might want to use a different scheme. For example, in a standard UITableView you could calculate the data to be displayed ad hoc in the datasource methods based on the index path - you would then have to calculate only the data for the visible cells. Depending on the demands of the data model, you might have some performance issues when scrolling very fast - this is difficult to predict.
There are several considerations that determine whether you should use Core Data instead of a simple NSArray.
Ask yourself questions like:
Do you need to persist the data?
Do you need all the data at once for displaying?
Do you perform search queries on the data?
Depending on your answers you can choose Core Data or an NSArray. Keeping all the data in memory is expensive either way, and you should avoid it if you can. In my opinion I would try to go with Core Data, although it introduces some overhead up front.

Achieving better DB performance

I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored proc that returns order history is painfully slow due to the amount of data and the numerous joins that must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized, and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial, and then there's the issue of keeping the data in sync, so I was wondering if anyone knows of a shortcut. Perhaps an out-of-the-box solution that will create the DW schema and keep the data in sync (via triggers, perhaps). I've heard of Lucene, but it seems geared more toward text searches and document management. Does anyone have other suggestions?
How big is your database?
There aren't really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain, then identify your facts and the dimensions associated with those facts. Then you divide the dimensions into tables, arranged so that the dimensions grow only slowly over time. The choice of dimensions is completely practical and based on how the data behaves.
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to repopulate a reporting database from scratch several times a day (no history, just rebuilding from the 3NF schema into a different model of the same data). There are also realtime data warehousing techniques which apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
Materialized views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for, combined with fast access to aggregate data. Since you didn't mention any specifics of your platform (database, server specs, number of rows, hits per second, etc.), I can't really help much more than that.
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching in all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema just enough to serve your most common queries faster? That's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.