Using multiple key value stores - ruby-on-rails

I am using Ruby on Rails and have a situation that I am wondering if is appropriate for using some sort of Key Value Store instead of MySQL. I have users that have_many lists and each list has_many words. Some lists have hundreds of words and I want users to be able to copy a list. This is a heavy MySQL task b/c it is going to have to create these hundreds of word objects at one time.
As an alternative, I am considering using some sort of key value store where the key would just be the word. A list of words could be stored in a text field in mysql. Each list could be a new key value db? It seems like it would be faster to copy a key value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?

The general way to solve this using a relational database would be to have a list table, a word table, and a table-words table relating the two. You are correct that there would be some overhead, but don't overestimate it; because table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field in less you have a very good reason, it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have it's own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would think would be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.

Related

Storing text in Neo4j

We're using Neo4J and liking it. We do all sorts of graphy things in it. However, some of what we do is not graphy. For example, we keep a log of all changes to a certain type of node:
(n)-[:CHANGE]->(c1)-[:CHANGE]->(c2) etc etc
This list of changes can get to be 20 or 30 c1 nodes long. While it looks weird, I don't have a real problem with it. (Of course, I am smarter now, and since each :CHANGE relationship has a date in it, I could porcupine all the c1 nodes right from n. But whatever.)
But what if I wanted to store large amounts of text, or images. Is there a problem with storing large amounts of data in a node? I could use a different database for these things, but this just increases the skill set required to run the business. And of course, joining data in two disparate databases is always a PITA.
So I need to worry about storing large amounts of text in a single property? Do I need to avoid creating logs in the way I did above?
Creating linked lists of events, whatever they may be, is fine. It'd even say it is graphy! We use this approach a lot, in the case of ChangeFeed for something similar to what you're doing.
In Neo4j, such a linked list is actually better than a "porcupine", because you have to traverse all the relationships anyway, if you're looking for all changes. But in the linked list case, you don't have to look at properties to order them. In fact, the best approach is a hybrid approach, like the TimeTree.
As for storing large amounts of text or images, Neo4j is not the best place for it. Another database would be best, especially if the volume of these is large ( > hundreds of thousands).
But if you do want to store them in Neo4j, one thing to keep in mind is that Neo4j will load all the properties of a node/relationship once a single property is accessed. So in order to achieve good performance even with text/images in Neo4j, I would store them in a node of their own. That way, you can only load them if you really need them, but not during regular traversals.
For example:
CREATE (b:BlogPost {title:'Neo4j Rocks', author:'Tony Ennis', date:".."})-[:HAS_BODY]->(:BlogPostBody {content:'..'})

Find changes quickly in larger SQL database?

There is a Java Swing application which uses an Informix database. I have user rights granted for the Swing application (i.e. no source code), and read only access to a mirror of the database.
Sometimes I need to find a database column, which is backing a GUI element (TextBox, TableField, Label...). What would be best approach to find out which database column and table is holding the data shown e.g. in a TextBox?
My general approach is to capture the state of the database. Commit a change using the GUI and then capture the state of the database again. Then I need to examine the difference. I've already tried:
Use the nrows field of systables: Didn't work, because the number in nrows does not seem to be a realtime representation of the row count.
Create a script with SELECT COUNT(*) ... for all tables: didn't work because too many tables (> 5000). Also tried to optimize by removing empty tables, but there are still too many left.
Is there a simple solution that I'm missing?
Please look at the Change Data Capture API and check if this suits your needs
There probably isn't a simple solution.
You probably need to build yourself a map of the database, or a data dictionary for it. It sounds as though you can eliminate many of the tables from consideration since they're empty — at least for a preliminary pass. If you're dealing with information in a text box, the chances are it is some sort of character data; you can analyze which (non-empty) tables which contain longer character strings, and they'd be the primary targets of your searches. If the schema is badly designed with lots of VARCHAR(255) columns even though the columns normally only hold short strings, life is more difficult. Over time, you can begin to classify tables and columns so that you end up knowing where to look for parts of the application.
One problem to beware of: the tabid in informix.systables isn't necessarily as stable as you'd like. Your data dictionary needs to record its own dd_tabid for the table it describes, and can store the last known tabid from informix.systables, but it needs to be ready to find a new tabid value on occasion. You should probably only mark data in your dictionary for logical deletion.
To some extent, this assumes you can create a database in which to record this information. If you can't create an Informix database, you may have to use something else (MySQL, or SQLite, perhaps) to store the data dictionary. Alternatively, go to your DBA team and ask them for the information. Unless you're trying something self-evidently untoward, they're likely to help (but politics can get in the way — I've no idea how collegial your teams are).

What is the right way to store random numbers in my database?

I am working on an application which will generate unique random numbers and then store them into a database. I will check if a number exists through a HTTP request. Initially, for getting started, I would use around 10,000 numbers.
Is this the right approach?
Generate a random number, and, one by one, store them into an array and continue checking for array uniqueness, and when the array is complete, store the whole array to the database after sorting it.
Use the database and check to see if a number exists or not.
Which database should I use, as the application can scale up to 1 million numbers.
It may be more efficient, particularly if you want to generate 1000000 numbers, to make them one at a time and use validations in the model/database prevent duplicates.
As regards choosing a database, it will depend a little on you intended application. There is some info here: Which is the Best database for Rails application?
I can't comment on using a database directly from ruby without rails because I have not done that. One of the big pluses for rails for me is how easy it makes creating apps that use a database.
A couple thoughts:
If you are storing 10 or 10,000 "random" numbers, what difference does it make whether they are random going into the database, or if the database randomly picks one number of a range of 10,000 sequential numbers? Do you need doubly-random number selections? MySQL, PostgreSQL and other DBMs can generate random numbers, and you can use their random number generator to retrieve a row, so you could either have it return a value directly from its generator, or grab a row. Either way, you don't need to worry about Ruby creating a random value -- unless you really want "triplely"-random numbers. I'd just stick the values of a (1..10_000) range into the database and call that part done and work on a query to grab records randomly.
If you want truly random numbers, you can't guarantee uniqueness. If you're happy with pseudo-random, you still have a problem because you could end up returning duplicates from inside the range unless you track which numbers you've used previously for a particular session. How you track uniqueness across a bunch of sessions is going to be an interesting problem if your site gets popular.
If I was doing this, I'd reverse some of the process. I wouldn't store the "random" values in the database, I'd use Ruby's built-in random number generator, and then probably check the database to see if I'd previously generated that number for that particular session. Overall, fewer values would be stored in the database so lookups to determine uniqueness would happen faster.
That would still be an awkward system to code and would grow inefficient over time as the "unique" records for sessions grew.
To do this without a database I'd create the random/unique range using something like: array = (1..10_000).to_a.shuffle, then each time I needed a value I'd use pop to pull the last value from the randomized array. I'd be tempted to pull from that pool of values for all sessions until it was exhausted, then regenerate it. There'd be a possibility of duplicate "unique" values at that point, but there should be a pretty small chance of the same number reappearing twice in a row.

Trimming BOLD_CLOCKLOG table

I am doing some maintenance on a database for an application that uses the Bold for Delphi object persistence framework. This database has been been in production for several years and several of the tables have grown quite large. One of them is the BOLD_CLOCKLOG which has something to do with Bold's transaction management.
I want to trim this table (it is up to 1.2GB, with entries from Jan 2006).
Can anyone confirm the system does not need this old information?
From the bolds documentation:
BOLD_CLOCKLOG
To be able to map the transaction numbers used in the TimeStamp columns to the corresponding physical time (such as 2001-01-01 12:34) the persistence mapper will store a log with timestamps and times. Normally, this log is written for each database operation, but if the traffic to the database is very intensive, it is possible to restrict how often this log is written by setting the property ClockLogGranularity. The event OnGetCurrentTime should also be implemented to ensure that all clients have the same time.The usage of this table can be controlled with the tagged value: Model.UseClockLog
So I believe this is used for versioning Boldobjects, see Object Versioning Extension in bolds documentation. If your application don't need this you can drop this in the database.
In our Bold application we don't use that feature. Why don't simply test to turn off Bold_ClockLog in the model, drop that big table and try to use your application. I'm pretty sure if something is wrong then it say so at once.
I can also mention that we have an own custom objecthistoy. It is simply big string (as TStringList.DelimetedText) in a ObjectHistory class that have Time, user and a note about action. This suits our need better that Bolds builtin objecthistory. The disadvantage is of course that we need to add calls in the code when logging to history is done.
Bold_ClockLog is an optional table, it's purpose is to store mapping between integer timestamps and corresponding DateTime values.
This allows you to find out datetime of the last modification to any object.
If you don't need this feature feel free to empty the table, it won't cause any problems.
In addition to Bold_ClockLog, the Bold_XFiles is another optional table that tends to grow large. But unlike the Bold_ClockLog the Bold_XFiles can not be emptied.
Both of these tables can be turned on/off in the model tag values.

Composite primary keys versus unique object ID field

I inherited a database built with the idea that composite keys are much more ideal than using a unique object ID field and that when building a database, a single unique ID should never be used as a primary key. Because I was building a Rails front-end for this database, I ran into difficulties getting it to conform to the Rails conventions (though it was possible using custom views and a few additional gems to handle composite keys).
The reasoning behind this specific schema design from the person who wrote it had to do with how the database handles ID fields in a non-efficient manner and when it's building indexes, tree sorts are flawed. This explanation lacked any depth and I'm still trying to wrap my head around the concept (I'm familiar with using composite keys, but not 100% of the time).
Can anyone offer opinions or add any greater depth to this topic?
Most of the commonly used engines (MS SQL Server, Oracle, DB2, MySQL, etc.) would not experience noticeable issues using a surrogate key system. Some may even experience a performance boost from the use of a surrogate, but performance issues are highly platform-specific.
In general terms, the natural key (and by extension, composite key) verses surrogate key debate has a long history with no likely “right answer” in sight.
The arguments for natural keys (singular or composite) usually include some the following:
1) They are already available in the data model. Most entities being modeled already include one or more attributes or combinations of attributes that meet the needs of a key for the purposes of creating relations. Adding an additional attribute to each table incorporates an unnecessary redundancy.
2) They eliminate the need for certain joins. For example, if you have customers with customer codes, and invoices with invoice numbers (both of which are "natural" keys), and you want to retrieve all the invoice numbers for a specific customer code, you can simply use "SELECT InvoiceNumber FROM Invoice WHERE CustomerCode = 'XYZ123'". In the classic surrogate key approach, the SQL would look something like this: "SELECT Invoice.InvoiceNumber FROM Invoice INNER JOIN Customer ON Invoice.CustomerID = Customer.CustomerID WHERE Customer.CustomerCode = 'XYZ123'".
3) They contribute to a more universally-applicable approach to data modeling. With natural keys, the same design can be used largely unchanged between different SQL engines. Many surrogate key approaches use specific SQL engine techniques for key generation, thus requiring more specialization of the data model to implement on different platforms.
Arguments for surrogate keys tend to revolve around issues that are SQL engine specific:
1) They enable easier changes to attributes when business requirements/rules change. This is because they allow the data attributes to be isolated to a single table. This is primarily an issue for SQL engines that do not efficiently implement standard SQL constructs such as DOMAINs. When an attribute is defined by a DOMAIN statement, changes to the attribute can be performed schema-wide using an ALTER DOMAIN statement. Different SQL engines have different performance characteristics for altering a domain, and some SQL engines do not implement DOMAINS at all, so data modelers compensate for these situations by adding surrogate keys to improve the ability to make changes to attributes.
2) They enable easier implementations of concurrency than natural keys. In the natural key case, if two users are concurrently working with the same information set, such as a customer row, and one of the users modifies the natural key value, then an update by the second user will fail because the customer code they are updating no longer exists in the database. In the surrogate key case, the update will process successfully because immutable ID values are used to identify the rows in the database, not mutable customer codes. However, it is not always desirable to allow the second update – if the customer code changed it is possible that the second user should not be allowed to proceed with their change because the actual “identity” of the row has changed – the second user may be updating the wrong row. Neither surrogate keys nor natural keys, by themselves, address this issue. Comprehensive concurrency solutions have to be addressed outside of the implementation of the key.
3) They perform better than natural keys. Performance is most directly affected by the SQL engine. The same database schema implemented on the same hardware using different SQL engines will often have dramatically different performance characteristics, due to the SQL engines data storage and retrieval mechanisms. Some SQL engines closely approximate flat-file systems, where data is actually stored redundantly when the same attribute, such as a Customer Code, appears in multiple places in the database schema. This redundant storage by the SQL engine can cause performance issues when changes need to be made to the data or schema. Other SQL engines provide a better separation between the data model and the storage/retrieval system, allowing for quicker changes of data and schema.
4) Surrogate keys function better with certain data access libraries and GUI frameworks. Due to the homogeneous nature of most surrogate key designs (example: all relational keys are integers), data access libraries, ORMs, and GUI frameworks can work with the information without needing special knowledge of the data. Natural keys, due to their heterogeneous nature (different data types, size etc.), do not work as well with automated or semi-automated toolkits and libraries. For specialized scenarios, such as embedded SQL databases, designing the database with a specific toolkit in mind may be acceptable. In other scenarios, databases are enterprise information resources, accessed concurrently by multiple platforms, applications, report systems, and devices, and therefore do not function as well when designed with a focus on any particular library or framework. In addition, databases designed to work with specific toolkits become a liability when the next great toolkit is introduced.
I tend to fall on the side of natural keys (obviously), but I am not fanatical about it. Due to the environment I work in, where any given database I help design may be used by a variety of applications, I use natural keys for the majority of the data modeling, and rarely introduce surrogates. However, I don’t go out of my way to try to re-implement existing databases that use surrogates. Surrogate-key systems work just fine – no need to change something that is already functioning well.
There are some excellent resources discussing the merits of each approach:
http://www.google.com/search?q=natural+key+surrogate+key
http://www.agiledata.org/essays/keys.html
http://www.informationweek.com/news/software/bi/201806814
I've been developing database applications for 15 years and I have yet to come across a case where a non-surrogate key was a better choice than a surrogate key.
I'm not saying that such a case does not exist, I'm just saying when you factor in the practical issues of actually developing an application that accesses the database, usually the benefits of a surrogate key start to overwhelm the theoretical purity of non-surrogate keys.
the primary key should be constant and meaningless; non-surrogate keys usually fail one or both requirements, eventually
if the key is not constant, you have a future update issue that can get quite complicated
if the key is not meaningless, then it is more likely to change, i.e. not be constant; see above
take a simple, common example: a table of Inventory items. It may be tempting to make the item number (sku number, barcode, part code, or whatever) the primary key, but then a year later all the item numbers change and you're left with a very messy update-the-whole-database problem...
EDIT: there's an additional issue that is more practical than philosophical. In many cases you're going to find a particular row somehow, then later update it or find it again (or both). With composite keys there is more data to keep track of and more contraints in the WHERE clause for the re-find or update (or delete). It is also possible that one of the key segments may have changed in the meantime!. With a surrogate key, there is always only one value to retain (the surrogate ID) and by definition it cannot change, which simplifies the situation significantly.
It sounds like the person who created the database is on the natural keys side of the great natural keys vs. surrogate keys debate.
I've never heard of any problems with btrees on ID fields, but I also haven't studied it in any great depth...
I fall on the surrogate key side: You have less repetition when using a surrogate key, because you're only repeating a single value in the other tables. Since humans rarely join tables by hand, we don't care if it's a number or not. Also, since there's only one fixed-size column to look up in the index, it's safe to assume surrogates have a faster lookup time by primary key as well.
Using 'unique (object) ID' fields simplifies joins, but you should aim to have the other (possibly composite) key still unique -- do NOT relax the not-null constraints and DO maintain the unique constraint.
If the DBMS can't handle unique integers effectively, it has big problems. However, using both a 'unique (object) ID' and the other key does use more space (for the indexes) than just the other key, and has two indexes to update on each insert operation. So it isn't a freebie -- but as long as you maintain the original key, too, then you'll be OK. If you eliminate the other key, you are breaking the design of your system; all hell will break loose eventually (and you might or might not spot that hell broke loose).
I basically am a member of the surrogate key team, and even if I appreciate and understand arguments such as the ones presented here by JeremyDWill, I am still looking for the case where "natural" key is better than surrogate ...
Other posts dealing with this issue usually refer to relational database theory and database performance. Another interesting argument, always forgotten in this case, is related to table normalisation and code productivity:
each time I create a table, shall I
lose time
identifying its primary key and its
physical characteristics (type,
size)
remembering these characteristics
each time I want to refer to it in
my code?
explaining my PK choice to other
developers in the team?
My answer is no to all of these questions:
I have no time to lose trying to
identify "the best Primary Key" when
dealing with a list of persons.
I do not want to remember that the
Primary Key of my "computer" table
is a 64 characters long string (does
Windows accepts that many characters
for a computer name?).
I don't want to explain my choice to
other developers, where one of them
will finally say "Yeah man, but
consider that you have to manage
computers over different domains?
Does this 64 characters string allow
you to store the domain name + the
computer name?".
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, such as a 'ComputerUser' table, where the combination of 'id_Computer' and 'id_User' forms a very acceptable Primary Key, I prefer to create this 'id_ComputerUser' field being a uniqueIdentifier, just to stick to the rule.
The major advantage is that you don't have to care animore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
I am not sure that my rule is the best one. But it is a very efficient one!
A practical approach to developing a new architecture is one that utilizes surrogate keys for tables that will contain thousands of multi-column highly unique records and composite keys for short descriptionary tables. I usually find that the colleges dictate the use of surrogate keys while the real world programmers prefer composite keys. You really need to apply the right type of primary key to the table - not just one way or the other.
using natural keys makes a nightmare using any automatic ORM as persistence layer. Also, foreign keys on multiple column tend to overlap one another and this will give further problem when navigating and updating the relationship in a OO way.
Still you could transform the natural key in an unique constrain and add an auto generated id; this doesn't remove the problem with the foreign keys, though, those will have to be changed by hand; hopefully multiple columns and overlapping constraints will be a minority of all the relationship, so you could concentrate on refactoring where it matter most.
natural pk have their motivation and usages scenario and are not a bad thing(tm), they just tend to not get along well with ORM.
my feeling is that as any other concepts, natural keys and table normalization should be used when sensible and not as blind design constraints
I'm going to be short and sweet here: Composite primary keys are not good these days. Add in surrogate arbitrary keys if you can and maintain the current key schemes via unique constraints. ORM is happy, you're happy, original programmer not-so-happy but unless he's your boss then he can just deal with it.
Composite keys can be good - they may affect performance - but they are not the only answer, in much the same way that a unique (surrogate) key isn't the only answer.
What concerns me is the vagueness in the reasoning for choosing composite keys. More often than not vagueness about anything technical indicates a lack of understanding - maybe following someone else's guidelines, in a book or article....
There is nothing wrong with a single unique ID - infact if you've got an application connected to a database server and you can choose which database you're using it will all be good, and you can pretty much do anything with your keys and not really suffer too badly.
There has been, and will be, a lot written about this, because there is no single answer. There are methods and approaches that need to be applied carefully in a skilled manner.
I've had lots of problems with ID's being provided automatically by the database - and I avoid them wherever possible, but still use them occasionally.
... how the database handles ID fields in a non-efficient manner and when it's building indexes, tree sorts are flawed ...
This was almost certainly nonsense, but may have related to the issue of index block contention when assigning incrementing numbers to a PK at a high rate from different sessions. If so then the REVERSE KEY index is there to help, albeit at the expense of a larger index size due to a change in block-split algorithm. http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref998
Go synthetic, particularly if it aids more rapid development with your toolset.
I am not a experienced one but still i m in favor of Using primary key as id here is the explanation using an example..
The format of external data may change over time. For example, you might think that the ISBN of a book would make a good primary key in a table of books. After all, ISBNs are unique. But as this particular book is being written, the publishing industry in the United States is gearing up for a major change as additional digits are added to all ISBNs.
If we’d used the ISBN as the primary key in a table of books, we’d have to update each row to reflect this change. But then we’d have another problem. There’ll be other tables in the database that reference rows in the books table via the primary key. We can’t change the key in the books table unless we first go through and update all of these references. And that will involve dropping foreign key constraints, updating tables, updating the books table, and finally reestablishing the constraints. All in all, this is something of a pain.
The problems go away if we use our own internal value as a primary key. No third party can come along and arbitrarily tell us to change our schema—we control our own keyspace. And if something such as the ISBN does need to change, it can change without affecting any of the existing relationships in the database. In effect, we’ve decoupled the knitting together of rows from the external representation of data in those rows.
Although the explanation is quite bookish but i think it explains the things in a simpler way.
#JeremyDWill
Thank you for providing some much-needed balance to the debate. In particular, thanks for the info on DOMAINs.
I actually use surrogate keys system-wide for the sake of consistency, but there are tradeoffs involved. The most common cause for me to curse using surrogate keys is when I have a lookup table with a short list of canonical values—I'd use less space and all my queries would be shorter/easier/faster if I had just made the values the PK instead of having to join to the table.
You can do both - since any big company database is likely to be used by several applications, including human DBAs running one-off queries and data imports, designing it purely for the benefit of ORM systems is not always practical or desirable.
What I tend to do these days is to add a "RowID" property to each table - this field is a GUID, and so unique to each row. This is NOT the primary key - that is a natural key (if possible). However, any ORM layers working on top of this database can use the RowID to identify their derived objects.
Thus you might have:
CREATE TABLE dbo.Invoice (
CustomerId varchar(10),
CustomerOrderNo varchar(10),
InvoiceAmount money not null,
Comments nvarchar(4000),
RowId uniqueidentifier not null default(newid()),
primary key(CustomerId, CustomerOrderNo)
)
So your DBA is happy, your ORM architect is happy, and your database integrity is preserved!
I just wanted to add something here that I don't ever see covered when discussing auto-generated integer identity fields with relational databases (because I see them a lot), and that is, it's base type can an will overflow at some point.
Now I'm not trying to say this automatically makes composite ids the way to go, but it's just a matter of fact that even though more data could be logically added to a table (which is still unique), the single auto-generated integer identity could prevent this from happening.
Yes I realize that for most situations it's unlikely, and using a 64bit integer gives you lots of headroom, and realistically the database probably should have been designed differently if an overflow like this ever occurred.
But that doesn't prevent someone from doing it... a table using a single auto-generated 32bit integer as it's identity, which is expected to store all transactions at a global level for a particular fast-food company, is going fail as soon as it tries to insert it's 2,147,483,648th transaction (and that is a completely feasible scenario).
It's just something to note, that people tend to gloss over or just ignore entirely. If any table is going to be inserted into with regularity, considerations should be made to just how often and how much data will accumulate over time, and whether or not an integer based identifier should even be used.

Resources