Dump and restore selected models/object graph in Rails - ruby-on-rails

Given a MySQL database and a set of corresponding active record models similar to:
Test -< Categories -< Questions
I need a way to quickly dump the contents of Test #1 to a file, and then restore on a separate machine. When Test #1 is reinstantiated in the database, all of the relational data should be intact (all foreign keys are maintained, the Categories, Questions for the test are all restored). What's the best way to do this?

Try the Jailer subsetting tool. It's for dumping subsets of relational data, keeping referential integrity.

I'd try using yaml: http://www.yaml.org/
It's an easy way to save and load heirarchical data (in a human readable format), and there are a number of implementations for Ruby. They extend your classes, adding methods to save and load objects to and from yaml files.
I typically use it when I need to save and reload a "deep copy" of a large multi-level hash of objects.

There are options out there, replicate is outdates and known to have issues with Rails 4 and Ruby 2, activerecord-import looks good, but doesn't have a -dump couterpart.

Related

Strategies to speed up access to databases when working with columns containing massive amounts of data (spatial columns, etc)

First things first, I am an amateur, self-taught ruby programmer who came of age as a novice engineer in the age of super-fast computers where program efficiency was not an issue in the early stages of my primary GIS software development project. This technical debt is starting to tax my project and I want to speed up access to this lumbering GIS database.
Its a postgresql database with a postgis extension, controlled inside of rails, which immediately creates efficiency issues via the object-ification of database columns when accessing and/or manipulating database records with one or many columns containing text or spatial data easily in excess of 1 megabyte per column.
Its extremely slow now, and it didn't used to be like this.
One strategy: I'm considering building child tables of my large spatial data tables (state, county, census tract, etc) so that when I access the tables I don't have to load the massive spatial columns every time I access the objects. But then doing spatial queries might be difficult on a parent table's children. Not sure exactly how I would do that but I think its possible.
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes from tables I'm not currently using slow down my queries? How about having too many for one table?
These tables have a massive amount of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
PS:
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (rails) database performance?
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem from the database side if you have full table scans, and then you might consider moving these data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress on read. Override the getter and setter methods by using write_attribute and read_attribute, compressing and decompressing (respectively of course) the data.
Indexing. If you are using postgres to store such large chucks of data in single fields consider storing it as Array, JSON or Hstore fields. If you index it using the gin index types so you can search effectively within a given field.

Activerecord-import records with relationships to each other

My app needs to import hundreds to thousands of records at a time. The records are each nodes in a tree structure. I'm using activerecord-import to significantly speed up the import, and haven't yet settled on which of ancestry, closure_tree, acts_as_list or a custom solution to use for setting out the hierarchy.
The problem I'm grappling with is how to import all the data and relationships in one or just a few passes. My draft naive solution is:
creating each object in memory, and manually giving each object an id;
using those ids to manually giving each object the foreign_key that it needs (eg parent_id); and then
mass-importing the resulting array of objects using activerecord-import
This feels like a hack with obvious problems. For example, if the ids that I've chosen for my objects get used by the database while I'm still instantiating my objects, then the relationships I've manually set become useless/wrong.
Another major problem is that as I look into more advanced solutions for the tree data structure (eg closure_tree and ancestry), manually setting the fields required by those gems feels more and more like a hack.
So I guess my question is, is there a clean way to set up a tree structure of N nodes in a rails activerecord database while touching the database less than N times?
The master branch has a commit with this functionality. It will work only with Rails 4 and Postgres.
If you happen to have another configuration, you will need to:
Create a new array of Models/hashes
Model.import it
Retrieve the IDs of the rows you just inserted
Goto 1, setting in the new array the IDs (parent_id) you just retrieved

How to execute SQL statements on a dataset which didn't come from a database?

Suppose I have an application which fetches a custom XML packet from the server which represents a dataset. Then, suppose I wish to execute a SQL statement on that data via a dataset. What can I use to do this? I don't need to know the code necessarily, but just what to use to make this possible and a general explanation of how.
For example, I may fetch a list of customers in XML format from the server. Then, I can use any third-party parser to dump that XML data into some client dataset. Then, execute a query on that dataset, for example select * from customers where ZipCode = '12345' without fetching this data from the server again.
XML is not the only limitation, that's just an example. I might want to do the same to some application settings loaded from an INI file. Either way, the concept is that the original source of the data is unknown.
Whether the dataset stores its temporary data in the memory or on the disk doesn't matter, but it would be excellent if it could keep it in the disk.
TXQuery (http://code.google.com/p/txquery/) is a component that provides a local SQL engine for executing SQL queries against one or more TDataSets. The only issues I have had with it is updating data via a TDBGrid of a query joining multiple tables (TDataSets) - specifically which table is being updated.
AnyDac v6 (now FireDac) also has a local SQL engine. http://www.da-soft.com/anydac/docu/frames.html?frmname=topic&frmfile=Local_SQL.html
Edit: For the example SQL in your question, because it only involves a single table, you do this with just a Filter on the datatset. For example
ADataSet.Filtered := False;
ADataSet.Filter := 'ZipCode=' + QuotedStr('12345');
ADataSet.Filtered := True;
Such a feature can be done using a local database. You just insert the TDataSet result into a local in-memory (or file-based) stand-alone database, then you can use regular SQL queries on it, including JOIN.
You can for instance use SQLite3, or the free edition of NexusDB.
NexusDB embedded has the benefit of being a native Delphi database, so stick to the DB.pas TDataSet paradigm.
Another option is to use the so-called Virtual Table mechanism of SQLite3, which allows to expose any data (even from TDataSet, XML, JSON or in-memory objects) to the SQLite3 engine, just as regular tables. Then you can run SQL statements on those "virtual" tables, including JOINs. With this approach, you do not require to INSERT the data into regular tables, but the data remain in their original form. Of course, you will miss some performance features like indexes, which should be handled on the virtual table provider side. We use this feature as the database core of our mORMot ORM/SOA framework, and this is pretty powerful.
The general process that you want to perform is complicated by the difference in data representation. SQL data is stored in tables made up of distinguishable records. XML is a structured representation of data, but in tree form rather than table/row form.
Each of these data forms may be qualified by a schema that provides a context for the data.
You have two general paths that you can follow:
Take the XML, and based on the schema insert it into a set of interlinked tables, then perform the SQL query. - if you have the schema, you can use code generators to make a parser, and then based ont the parse tree, you can insert into a local db with tables constructed on the fly. You can set up my SQL pretty easily from https://dev.mysql.com/doc/refman/5.7/en/installing.html and then in your version of delphi make a connection to the database, first fill it in, then query. This would satisfy your desire to have the data stored on the disk. unless you purge the tables when done, the data are still available in the local machine db.
This seems like more work than:
Use Xpath or Xquery and work directly on the XML. For this, a package like saxon in your favorite environment, or expat in python would work nicely.
Let me know if either of these paths seems as if it may be fruitful.

What data type can I use for very large text fields that is database agnostic?

In my Rails app, I want to store the geographical bounds of places column fields in a database. E.g., the boundary of New York is represented as a polygon: an array of arrays.
I have declared my model to serialize the polygons, but I am unsure whether I should even store them like this. The size of these serialized polygons easily exceed 100,000 characters, and MySQL only can store about 65000 characters in a standard TEXT field.
Now I know MySQL also has a LONGTEXT field. But I really want my app to be database-agnostic. How does Rails handle this by itself? Will it switch automatically to LONGTEXT fields? What about when I start using PostgreSQL?
At this point I suggest you ask yourself - does this data need to be stored, or should be store in a database in this format?
I propose 2 possible solutions:
Store your polygons in the filesystem, and reference them from the database. Such large data items are of little use in a database - it's practically pointless to query against them as text. The filesystem is good at storing files - use it.
If you do need these polygons in the database, store them as normalised data. Have a table called polygon, and another called point, deserialize the polygons and store it in a way that reflects the way that databases are intended to be used.
Hope this is of help.
Postgresql has a library called PostGIS that my company uses to handle geometric locations and calculations that may be very helpful in this situation. I believe postgresql also has two data types that allow arrays and hashes. Arrays are declared, as an example, like text[] where text could be replaced with another data type. Hashes can be defined using the hstore module.
This question answers part of my question: Rails sets a default byte limit of 65535, and you can change it manually.
All in all, whether you will run into trouble after that depends on the database you're using. For MySQL, Rails will automatically switch to the appropriate *TEXT field. MySQL can store up to 1GB of text.
But like benzado and thomasfedb say, it is probably better to store the information in a file so that the database doesn't allocate a lot of memory that might not even be used.
Even though you can store this kind of stuff in the database, you should consider storing it externally, and just put a URL or some other identifier in the database.
If it's in the database, you may end up loading 64K of data into memory when you aren't going to use it, just because you access something in that table. And it's easier to scale a collection of read-only files (using something like Amazon S3) than a database table.

Using multiple key value stores

I am using Ruby on Rails and have a situation that I am wondering if is appropriate for using some sort of Key Value Store instead of MySQL. I have users that have_many lists and each list has_many words. Some lists have hundreds of words and I want users to be able to copy a list. This is a heavy MySQL task b/c it is going to have to create these hundreds of word objects at one time.
As an alternative, I am considering using some sort of key value store where the key would just be the word. A list of words could be stored in a text field in mysql. Each list could be a new key value db? It seems like it would be faster to copy a key value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a list table, a word table, and a table-words table relating the two. You are correct that there would be some overhead, but don't overestimate it; because table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field in less you have a very good reason, it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have it's own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would think would be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.

Resources