Does setting column default values hurt in terms of DB space? - ruby-on-rails

This is a quick question. I was curious as to whether setting the default value for a column in Rails+PostgreSQL costs additional space (versus leaving the default as nil).
I am also curious if that naturally leads to concluding whether creating an empty record using rails (full of nils) takes the same space as a row full of data would.
I would suppose this highly depends on the DB implementation, but I am only interested in PostgreSQL databases (9.1+). But I got lost when I was searching for implementation details specific to that.

Setting the default is just an attribute in the the catalog table pg_attributes.
But actual values in table rows generally cost additional disk space as compared to NULL. NULL values are stored in a special structure called "NULL bitmap", and require a single bit per column. Therefore, a row with NULL values typically takes less space on disk than a row with (dummy) values.
It's actually more complicated. Find details and links in these related answers:
Do nullable columns occupy additional space in PostgreSQL?
Calculating and saving space in PostgreSQL

Related

How do I fix inconsistent types in InfluxDB?

In InfluxDB (1.5), I have a table where the fields have become inconsistently typed. Most rows in the table are Integer, however, some rows have become strings.
How is this possible? I thought, once a field's types were set (upon first insert), any insert into the table with incorrect typing would fail.
What do I do now? If I go back and attempt to overwrite the data in the inconsistent rows, I get errors saying the field is a string.
After some more research, here's what I've discovered:
Answer to Part 1:
InfluxDB uses a system they refer to as 'sharding' - while I don't know the specifics, I do know that data from the same measurement/table can be stored across multiple, different 'shards'.
According to the InfluxDB documentation, field types can differ between these shards, within the same field, on the same table.
Answer to Part 2:
In order to fix this, the currently-suggested answer is to make a new table, download all the data, and re-insert while ensuring the data that gets inserted is the proper types.
If you had a tag which changed type and became a field, this can be especially difficult to fix, the link above does not address that. To do selects only on tag or field, you can use tag_name::tag or field_name::field within a select statement.
The GROUP BY * clause suggested in the link is required in order to preserve tags, but seemed to cause issues when I used it.
My current solution is a PHP script that uses curl, downloads the points, chunks them, then re-inserts the points into the new table, ensuring each point that gets inserted is casted to the new, uniform type, and properly inserted.
The best way to stopping future issues, is simply not to have them. I went looking for how to lock field types in all cases, across all shards, for a particular measurement table.
Unfortunately, it seems impossible to guarantee 100% type consistency across all current and future shards. "Don't make mistakes because it's really difficult to clean up" seems to be InfluxDB's modus operandi.

Find max/min value of column in Mnesia in constant time

How can I in constant time (or closest possible) find the maximum or minimum value on an indexed column in an Mnesia table?
I would do it outside the Mnesia database. Keep an explicit-min and explicit-max by having a process which learns about these values whenever there is an insert into the table. This gives you awfully fast constant time lookup on the values.
If you can do with O(lg n) time then you can make the table an ordered_set. From there, first/1 and last/1 should give you what you want, given that the key contains the thing you are ordering by. But this also slows down other queries in general to O(lg n).
A third trick is to go by an approximate value. Once in a while you scan the table and note the max and min values. This then materializes into what you want, but the value might not be up to do date if it was a long time since you last scanned.
Good question but I don't think it is possible. A quick look through mnesia and qlc documentation did not give me any clues on the subject.
It seems for me that secondary keys facility in mnesia is incomplete and thus is very limited in features. Not to mention horrible mnesia startup times while loading indexed tables.
I think the most reliable solution in your case would be to do explicit indexing. E.g. creating and keeping in sync alongside table with ordering on primary keys which are in fact the values you have wanted to index by.

Assign Key Field Value Only If Corresponding Lookup Result value Exist

I have ten master tables and one Transaction table. In my transaction table (it is a memory table just like ClientDataSet) there are ten lookup fields pointing to my ten master tables.
Now i am trying to dynamically assigning key field values to all my lookup key field values (of the transaction table) from a different Server(data is coming as a soap xml). Before assigning these values i need to check whether the corresponding result value is valid in master tables or not. I am using a filter (eg status = 1 ) to check whether it is valid or not.
Currently how we are doing is, before assigning each key field value we are filtering the master tables using this filter and using the locate function to check whether it is there or not. and if located we will assign its key field value.
This will work fine if there is only few records in my master tables. Consider my master tables having fifty thousand records each (yeah, customer is having so much data), this will lead to big performance issue.
Could you please help me to handle this situation.
Thanks
Basil
The only way to know if it is slow, why, where, and what solution works best is to profile.
Don't make a priori assumptions.
That being said, minimizing round trips to the server and the amount of data transferred is often a good thing to try.
For instance, if your master tables are on the server (not 100% clear from your question), sending only 1 Query (or stored proc call) passing all the values to check at once as parameters and doing a bunch of "IF EXISTS..." and returning all the answers at once (either output params or a 1 record dataset) would be a good start.
And 50,000 records is not much, so, as I said initially, you may not even have a performance problem. Check it first!

Performance implications of a table with many fields

I have a table that is currently at 40 fields. A significant expansion of its capability now has it looking like something more like 100 fields.
What are the database and Rails performance implications of having a table with more fields? My understanding of relations is that they don't load the data until absolutely necessary, but would having so much more information slow down, say, a filtered index of these records (showing only the main 8-10 fields)?
The fields I'm specifically talking about adding are not relevant to any of my reports or most of my queries - they simply store data that is used on the back end.
Normalization is not a problem here (there are no fields like field1, field2, ..., for example). I know it's hard to answer these questions when posed in a qualitative manner, but is it likely better to build these 60 fields in this table, or should I create a separate 1-1 table for them?
Having a single table is not a big deal and make things easier when it comes to queries. So if it's relevant, no need to split.
Still, you should only query what you need in your views so use the ActiveRecord's select: doc here.
Yes, having a lot of fields will slow down access to the table, however, in general not significantly enough that it matters for average data sizes. Most SQL databases arrange tables row by row, so on the disk, first all 40 fields of row 1, then all 40 fields of row 2, and so on, are stored. This means, that if you are only interested in retrieving the first 2 fields, you will still read all other 38 fields and then jump to the next row that matches. This is not a big issue if you have only a few matching rows, but might be, if you would have many matches that are also consecutive.
That said, I would still heavily advice against a table with 40 fields, except when there is a very good reason to do so (which you might have, but you give to little details to answer this). In general, having that many fields indicates the use of some alternative design. Definitly, if what I wrote above starts becoming an issue, you should order the fields according to the access patterns (so if normally fields 1-10 and 20,24,25,30 are accessed together, put those groups into separate tables).

Can one rely on the auto-incrementing primary key in your database?

In my present Rails application, I am resolving scheduling conflicts by sorting the models by the "created_at" field. However, I realized that when inserting multiple models from a form that allows this, all of the created_at times are exactly the same!
This is more a question of best programming practices: Can your application rely on your ID column in your database to increment greater and greater with each INSERT to get their order of creation? To put it another way, can I sort a group of rows I pull out of my database by their ID column and be assured this is an accurate sort based on creation order? And is this a good practice in my application?
The generated identification numbers will be unique.
Regardless of whether you use Sequences, like in PostgreSQL and Oracle or if you use another mechanism like auto-increment of MySQL.
However, Sequences are most often acquired in bulks of, for example 20 numbers.
So with PostgreSQL you can not determine which field was inserted first. There might even be gaps in the id's of inserted records.
Therefore you shouldn't use a generated id field for a task like that in order to not rely on database implementation details.
Generating a created or updated field during command execution is much better for sorting by creation-, or update-time later on.
For example:
INSERT INTO A (data, created) VALUES (smething, DATE())
UPDATE A SET data=something, updated=DATE()
That depends on your database vendor.
MySQL I believe absolutely orders auto increment keys. SQL Server I don't know for sure that it does or not but I believe that it does.
Where you'll run into problems is with databases that don't support this functionality, most notably Oracle that uses sequences, which are roughly but not absolutely ordered.
An alternative might be to go for created time and then ID.
I believe the answer to your question is yes...if I read between the lines, I think you are concerned that the system may re-use ID's numbers that are 'missing' in the sequence, and therefore if you had used 1,2,3,5,6,7 as ID numbers, in all the implementations I know of, the next ID number will always be 8 (or possibly higher), but I don't know of any DB that would try and figure out that record Id #4 is missing, so attempt to re-use that ID number.
Though I am most familiar with SQL Server, I don't know why any vendor who try and fill the gaps in a sequence - think of the overhead of keeping that list of unused ID's, as opposed to just always keeping track of the last I number used, and adding 1.
I'd say you could safely rely on the next ID assigned number always being higher than the last - not just unique.
Yes the id will be unique and no, you can not and should not rely on it for sorting - it is there to guarantee row uniqueness only. The best approach is, as emktas indicated, to use a separate "updated" or "created" field for just this information.
For setting the creation time, you can just use a default value like this
CREATE TABLE foo (
id INTEGER UNSIGNED AUTO_INCREMENT NOT NULL;
created TIMESTAMP NOT NULL DEFAULT NOW();
updated TIMESTAMP;
PRIMARY KEY(id);
) engine=InnoDB; ## whatever :P
Now, that takes care of creation time. with update time I would suggest an AFTER UPDATE trigger like this one (of course you can do it in a separate query, but the trigger, in my opinion, is a better solution - more transparent):
DELIMITER $$
CREATE TRIGGER foo_a_upd AFTER UPDATE ON foo
FOR EACH ROW BEGIN
SET NEW.updated = NOW();
END;
$$
DELIMITER ;
And that should do it.
EDIT:
Woe is me. Foolishly I've not specified, that this is for mysql, there might be some differences in the function names (namely, 'NOW') and other subtle itty-bitty.
One caveat to EJB's answer:
SQL does not give any guarantee of ordering if you don't specify an order by column. E.g. if you delete some early rows, then insert 'em, the new ones may end up living in the same place in the db the old ones did (albeit with new IDs), and that's what it may use as its default sort.
FWIW, I typically use order by ID as an effective version of order by created_at. It's cheaper in that it doesn't require adding an index to a datetime field (which is bigger and therefore slower than a simple integer primary key index), guaranteed to be different, and I don't really care if a few rows that were added at about the same time sort in some slightly different order.
This is probably DB engine depended. I would check how your DB implements sequences and if there are no documented problems then I would decide to rely on ID.
E.g. Postgresql sequence is OK unless you play with the sequence cache parameters.
There is a possibility that other programmer will manually create or copy records from different DB with wrong ID column. However I would simplify the problem. Do not bother with low probability cases where someone will manually destroy data integrity. You cannot protect against everything.
My advice is to rely on sequence generated IDs and move your project forward.
In theory yes the highest id number is the last created. Remember though that databases do have the ability to temporaily turn off the insert of the autogenerated value , insert some records manaully and then turn it back on. These inserts are no typically used on a production system but can happen occasionally when moving a large chunk of data from another system.

Resources