Lazily downloading fields from DB with Mongoid, or handling large documents - ruby-on-rails

We've designed our Mongo database to be highly denormalized, so many documents in our collections contain very large arrays in some of their fields. Naturally, this can make downloads from our DB take longer than necessary because the documents are just so large.
Whenever we need to grab some records from the DB, I have mitigated the performance impact by using .only to choose just the fields I want, but this means I have to download data I might need before I actually need it, and in general it's a lot more work for me to keep track of which fields end up being needed when I'm querying for the document(s).
Does Mongoid have a way that I can simply define particular fields in my model as ones that should be lazily loaded so that I grab them from the server just when they're first accessed? I searched through Mongoid's documentation to see if it had anything built in, but I'm not seeing any such thing. Perhaps there's a third party gem that adds this functionality to Mongoid?

Mongoid doesn't have support for lazily loading fields from the server, and I'm not aware of any plugins that add it either.
While technically you could add this to Mongoid, you would still be better off manually specifying .only so you load everything you need in one query. If you lazily loaded fields based on usage, you would have to pull data from MongoDB every time an unloaded field is first accessed.
That means if you accessed 5 different fields on top of the original document load, you would be sending 6 queries to MongoDB, each with its own round trip and processing, compared to just listing those fields in .only in the first place.
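For reference, a minimal sketch of the .only approach described above; the Article model and its fields are made up for illustration:

class Article
  include Mongoid::Document

  field :title, type: String
  field :body, type: String
  field :revision_history, type: Array  # the large, rarely needed field
end

# Load only the fields this code path needs; revision_history is never
# sent over the wire for these documents.
articles = Article.only(:title, :body).to_a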

Related

Is it better to do direct table loads in a high performance application?

I'm using PostgreSQL in a Rails 3.2 application that receives updates from a third party all day long. Sometimes this third party will throw over 2,000 requests a minute at my application, each update consisting of a large XML file.
Right now I am storing basic information from each XML file in a table. Then, a background process picks up big chunks of data from that table and copies them into another table using PostgreSQL's COPY feature.
Am I doing the right thing or the wrong thing here? This table that is the load target is also the major CRUD target of the UI. Does the COPY feature lock the entire table when the load happens, and should I be doing a bunch of inserts instead? I originally thought the inserts would be too expensive, but if the direct load locks the whole table then that's going to be a problem.
COPY is the lowest-level way to mass-insert records into PostgreSQL. I like your solution of post-processing the records in a background job.
Alternatively, if you need good performance while keeping some Rails/Ruby functionality, consider the activerecord-import gem. It performs mass insertions and lets ActiveRecord callbacks and validations be used as needed. Even if you only use it for post-processing of the bulk-COPYed records, it may gain you a significant performance increase.
Here is a good article for using activerecord-import:
http://ruby-journal.com/how-to-import-millions-records-via-activerecord-within-minutes-not-hours/
This is what the Postgres team recommends for optimal import performance: http://www.postgresql.org/docs/current/interactive/populate.html
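A rough sketch of what the activerecord-import approach could look like here; the Update model, the xml_payloads collection, and the column names are assumptions for illustration, not from the question:

require 'activerecord-import'

# Build one ActiveRecord object per parsed XML payload without saving yet.
updates = xml_payloads.map do |payload|
  Update.new(external_id: payload[:id], body: payload[:xml], received_at: Time.now)
end

# Insert them all in a single multi-row statement instead of one INSERT per
# record. validate: true still runs ActiveRecord validations; pass false for
# maximum speed.
Update.import(updates, validate: true)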

nosql dynamic fields in rails

I'm writing a Rails application that has a user document with about 20 different attributes. Each time an attribute is updated, I need to store the change in a transactions document that records who made the change, which attribute was changed, and the old and new values of the attribute.
Does it make sense to have a separate document to store transactions, or should I use a NoSQL DB like CouchDB that supports versioning by default, so I don't have to worry about creating a transactions document?
If I do decide to create a transactions document, then the keys of the document will be dynamic.
When I need to pull history, can I just pull out all versions of a document and figure out the changes dynamically?
I would not store all transactions for a given user in a single document. That document will become very large and take up a lot of memory every time you have to bring it in. It might also be difficult to query on various transactions (i.e. find all transactions for a given user that modified the name attribute).
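As a sketch of the separate-documents approach, one change record per update might look like this in Mongoid (the model and field names are purely illustrative, and Mongoid itself is an assumption here):

class ChangeRecord
  include Mongoid::Document

  field :attribute_name, type: String
  field :old_value
  field :new_value
  field :changed_by, type: String
  field :changed_at, type: Time, default: -> { Time.now }

  belongs_to :user

  # Supports queries like "all changes to :name for this user".
  index(user_id: 1, attribute_name: 1)
end

class User
  include Mongoid::Document
  has_many :change_records
end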
Versioning in CouchDB and similar NoSQL databases is a little bit misleading, and I fell into the same trap myself. Versioning here simply means optimistic concurrency: if you want to update a document, you need to provide the old version number with it, to be sure that nothing has been overwritten. So if you get a document and someone else changes it in the meantime, your version number is out of date (out of sync) and you need to read the document again and re-apply your changes before submitting it to the database. Some NoSQL stores let you ignore this versioning, while others (like CouchDB) enforce it.
Back to the topic: versioning won't do what you want. You are really looking for a log store that is written often and read seldom (I assume you won't read the history that often). Cassandra is perfect for this if you require high throughput, but any other NoSQL or SQL DB might do the job as well, depending on your performance requirements.

Entity Framework, dealing with a large number of records (> 35 million)

We have a rather large set of related tables with over 35 million related records each. I need to create a couple of WCF methods that query the database with some parameters (date ranges, type codes, etc.) and return related result sets (from 10 to 10,000 records).
The company is standardized on EF 4.0 but is open to 4.x. I might be able to make an argument for migrating to 5.0, but that's less likely.
What's the best approach to dealing with such a large number of records using Entity Framework? Should I create a set of stored procs and call them from Entity Framework, or is there something I can do within Entity Framework itself?
I do not have any control over the databases so I cannot split the tables or create some materialized views or partitioned tables.
Any input/idea/suggestion is greatly appreciated.
At my work I faced a similar situation. We had a database with many tables, most of which contained around 7-10 million records each. We used Entity Framework to display the data, but the page was very slow to render (around 90 to 100 seconds), and even sorting the grid took time. I was given the task of seeing whether it could be optimized, and after profiling it (with ANTS Profiler) I was able to get it down to under 7 seconds.
So the answer is yes, Entity Framework can handle millions of records, but some care must be taken:
Understand that the call to the database is made only when the actual records are required; all the preceding operations just build up the query (SQL). So try to fetch only the data you actually need rather than requesting a large number of records, and trim the fetch size as much as possible.
Yes, you not only should but must use stored procedures: import them into your model and create function imports for them. You can also call them directly with ExecuteStoreCommand() and ExecuteStoreQuery<>(). The same goes for functions and views, though EF has a really odd way of calling functions: "SELECT dbo.blah(@id)".
EF is slower when it has to populate an entity with a deep hierarchy, so be extremely careful with deeply nested entities.
When you are retrieving records and don't need to modify them, tell EF not to track property changes (AutoDetectChanges); record retrieval is much faster that way.
Database indexing is always good, but with EF it becomes very important: the columns you use for retrieval and sorting should be properly indexed.
When your model is large, the VS2010/VS2012 model designer becomes very unwieldy, so break your model into medium-sized models. One limitation is that entities from different models cannot be shared, even though they may point to the same table in the database.
When you have to make changes to the same entity in different places, pass the same entity instance around and send the changes only once, rather than having each place fetch a fresh copy, make its changes, and store it (a real performance-gain tip).
When you only need the data from one or two columns, try not to fetch the full entity; either execute your SQL directly or project into a smaller "mini" entity. You may also need to cache some frequently used data in your application.
Transactions are slow, so be careful with them.
If you keep these things in mind, EF should give you performance very close to plain ADO.NET, if not the same.
My experience with EF 4.1, code first: if you only need to read the records (i.e. you won't write them back), you will gain a performance boost by turning off change tracking for your context:
yourDbContext.Configuration.AutoDetectChangesEnabled = false;
Do this before loading any entities. If you need to update the loaded records, you can always call
yourDbContext.ChangeTracker.DetectChanges();
before calling SaveChanges().
The moment I hear statements like "The company is standardized on EF4, or EF5, or whatever," it sends cold shivers down my spine.
It is the equivalent of a car rental saying "We have standardized on a single car model for our entire fleet".
Or a carpenter saying "I have standardized on chisels as my entire toolkit. I will not have saws, drills etc..."
There is something called the right tool for the right job.
This statement only highlights that the person in charge of making key software architecture decisions has no clue about software architecture.
If you are dealing with over 100K records and the data models are complex (i.e. non-trivial), maybe EF6 is not the best option.
EF6 is based on the concepts of dynamic reflection and has design patterns similar to Castle Project ActiveRecord.
Do you need to load all of the 100K records into memory and perform operations on them? If yes, ask yourself whether you really need to do that, and whether executing a stored procedure across the 100K records would achieve the same thing. Do some analysis and see what the actual data usage pattern is. Maybe the user performs a search that returns 100K records but only navigates through the first 200. Think of a Google search: hardly anyone goes past page 3 of the millions of results.
If the answer is still yes, i.e. you really do need to load all 100K records into memory and operate on them, then maybe you need to consider something else, like a custom-built write-through cache with lightweight objects, perhaps with lazily loaded pointers for nested objects, and so on. One place I use something like this is large product catalogs for eCommerce sites where a very large number of searches is executed against the catalog. The reason is to provide custom behavior such as early-exit search, regex wildcard search using pre-compiled regexes, or custom Hashtable indexes into the product catalog.
There is no one-size-fits-all answer to this question. It all depends on the data usage scenarios and how the application works with the data. Consider gorilla vs. shark: who would win? It all depends on the environment and the context.
Maybe EF6 is perfect for one piece that would benefit from dynamic reflection, while NetTiers is better for another that needs static reflection and an extensible ORM, and low-level ADO is perhaps best for extremely high-performance pieces.

Rails CMS: static files or database records?

I'm trying to figure out the cut-off with respect to when a "text entry" should be stored in the database vs. as a static file. Are there any rules of thumb here? The text entries will be at the most several paragraphs and have links to images and tables (and hyperlinks to other text entries). Some criteria for the text entry:
I'm thinking of using DITA as the content format
The text should be searchable
If the text is revised, a new version will be created
thanks in advance, Chuck
The "rails way" would be using a database.
The solution will be more scalable, therefore faster and probably easier to develop with (using migration and so on). Using the file system, you will have to build lots of functions on your own, that are already implemented for database usage.
You could create a Model (e.g.) Document and easily use existing versioning systems, like paper_trail. When using an indexed search, you can just have an has_many relation enabling you to realise the depencies between the models (destroy a model means to destroy the search index).
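A minimal sketch of that idea, assuming the paper_trail gem; the Document model and its columns are illustrative:

# Gemfile
gem 'paper_trail'

# app/models/document.rb
class Document < ActiveRecord::Base
  has_paper_trail   # every create/update/destroy adds a row to the versions table
end

# Each revision becomes a new version:
doc = Document.create!(title: 'Install guide', body: '<dita>...</dita>')
doc.update_attributes!(body: '<dita>revised</dita>')
doc.versions.size        # => 2 with paper_trail's default configuration
doc.versions.last.reify  # => the Document as it was before the last update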
Rather than a cut-off, you could look at what databases provide and ask yourself if those features would be useful. Take Isolation (the I in ACID): if you have any worries that multiple people could be trying to edit an entry at the same time, a database would handle that well while you'd have to handle the locks yourself working with files. Or Atomicity: you might want to update two things at once (e.g. an index page and an entry page) and know they will either both succeed or both fail.
Databases do a number of things beyond ACID, such as taking advantage of multiple datatypes, making querying easier, and allowing for scaling. It's a question worth asking since most databases end up having data stored in a bunch of files on disk. Would you end up writing a mini-database if you used files yourself?
Besides, if you're using Rails you might as well take advantage of its ActiveRecord functionality and make it possible to use the many plugins that expect a database.
I'd use a database for even a small, single user rails app.

Should I keep a file as text or import it to a database?

I am constructing an anagram generator that started as a coding exercise, and it uses a word list that's about 633,000 lines long (one word per line). I originally wrote the program in plain Ruby, and I would like to modify it so it can be deployed online.
My hosting service supports Ruby on Rails as about the only Ruby-based option. I thought of hosting on my own machine and using a smaller framework, but I don't want to deal with the security issues at the moment.
I have only used RoR for database-driven (CRUD) apps. However, I have never populated an SQLite database this way, so this is a two-part question:
1) Should I import the list into a database? If so, what's the best method to do so? I would like to stick with SQLite to keep things simple if that's the case.
2) Is a 'flat file' better? I won't be doing any creating or updating, just checking against the list of words.
Thank you.
How about keeping it in memory? Storing that many words would take just a few megabytes of RAM, and otherwise you'd be accessing the file frequently so it'd probably be cached anyway. The advantage of keeping the word list in memory is that you can organize it in whatever data structure suits your needs best (I'm thinking a trie). If you can't spare that much memory, it might be to your advantage to use a database so you can efficiently load only the parts of the word list you need for any given query - of course, in that case you'd want to create some index columns (well at least one) so you can take advantage of the indexing capabilities of SQL.
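For illustration, a minimal sketch of the in-memory approach; the file name words.txt and the use of a Set (rather than a full trie) are assumptions:

require 'set'

# Load the list once at startup; ~633,000 short strings is only a few
# tens of megabytes of RAM at most.
WORDS = Set.new(File.foreach('words.txt').map { |line| line.chomp.downcase })

WORDS.include?('listen')   # => true if the list contains "listen"

# For anagram lookups specifically, a hash keyed by sorted letters is handy:
ANAGRAMS = Hash.new { |h, k| h[k] = [] }
WORDS.each { |w| ANAGRAMS[w.chars.sort.join] << w }

ANAGRAMS['listen'.chars.sort.join]   # => every word in the list that is an anagram of "listen"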
Assuming that what you're doing is looking up whether a word exists in your list, I would say that SQLite with an indexed column will likely be faster than scanning through the word list linearly. Now, if your current approach is fast enough for your purposes, then I see no reason to bother porting it over to a database; it's just an added headache for no gain as far as you're concerned. If you're seeing the search times become a burden, then dumping it into an indexed database would be a good idea.
You can create the table with the following schema:
CREATE TABLE words (
  word TEXT PRIMARY KEY
);
-- Optional: the PRIMARY KEY above already gives SQLite an index on word.
CREATE INDEX word_idx ON words(word);
And import your data with:
sqlite3 words.db < schema.sql
# Note: this spawns one sqlite3 process per word, which will be slow for
# 633,000 words; sqlite3's .import command is a much faster alternative.
while read word
do
  sqlite3 words.db "INSERT INTO words VALUES('$word');"
done < words.txt
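On the Ruby side, checking a word against that table could then look something like this; the sqlite3 gem is an assumption, and the database and table names follow the snippet above:

require 'sqlite3'

db = SQLite3::Database.new('words.db')

# Returns true if the word exists; with the index this is a single B-tree lookup.
def word_exists?(db, word)
  !db.get_first_value('SELECT 1 FROM words WHERE word = ?', [word]).nil?
end

word_exists?(db, 'listen')   # => true or false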
I would skip the database for the reasons listed above. A simple hash in memory will perform lookups about as fast as the database.
Even if the database was a bit faster for the lookup, you're still wasting time with the DB having to parse the query and create a plan for the lookup, then assemble the results and send them back to your program. Plus you can save yourself a dependency.
If you plan on moving other parts of your program to a persistent store, then go for it. But a hashmap should be sufficient for your use.
