Is the order of model IDs a reliable indication of the order the models were created in? - ruby-on-rails

The Scenario
Update: It was brought to my attention that ordering by created_at will actually compare a millisecond float that's of sufficient resolution (by far). However, while I feel a bit dumb now, my question still stands. My scenario is just irrelevant, so I removed it.
The Question
I know that the database knows precisely the order of creation by tracking a row's ID.
Are there any pitfalls in relying on latest ID to determine order?

A better solution is to replace the latest_post_at with something more precise than a second. Time.now.to_f instead of .to_i will give you sub-second precision (millisecond I think, the docs aren't clear). Should two posts happen to have the same millisecond timestamp, you could use the id as a tie-breaker.
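For illustration, a minimal SQL sketch of that ordering (the posts table and its columns are assumptions about your schema):
SELECT id, title
FROM posts
ORDER BY latest_post_at DESC, id DESC;  -- id breaks ties between identical timestamps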

If you're using whatever is the "natural" way of generating autoincrementing surrogate primary keys for your database, the only pitfall that comes to mind is that the order in which the database sequencer generated the IDs might not be the order in which the transactions that create the Post records start or finish. (Or however you define the time when a post is "created".)
Considering the transaction should normally take a fraction of a second to complete, this uncertainty might be irrelevant for your needs.
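To make the pitfall concrete, here is a hedged sketch (the posts table and title column are assumptions): session A allocates the lower id but commits last, so id order and commit order diverge.
-- session A starts first but commits last:
BEGIN;
INSERT INTO posts (title) VALUES ('first');        -- allocated the lower id
-- ...meanwhile session B runs to completion in another connection:
--   BEGIN;
--   INSERT INTO posts (title) VALUES ('second');  -- allocated the higher id
--   COMMIT;                                       -- this row becomes visible first
COMMIT;  -- the row with the lower id becomes visible last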

Related

update property in all edges, fast?

I want to update a property in every "edge" every n-cycles/seconds/minutes.
As you may suspect, this is time-consuming and probably won't work well.
One possible approach is to do it in chunks.
The question is what is the best way to do it.
Here is what a full sweep would look like:
match (n1)-[x:q]-(n2)
set x.decay = x.decay * exp(-rate)
So the idea is to decay the edges and remove them when they hit a specific value.
If I do it in chunks, how do I keep track of which ones I have already decayed so that I can skip them and make it faster and cheaper?
Sounds like you need a better approach.
For example, store the calculated expiration time (as a timestamp) in every relationship. A query that wants to use such a relationship can then test that it has not expired. This way, there is no need to update any relationship properties, and all queries will get the correct behavior (down to the millisecond).
Here is a sample snippet:
...
MATCH (foo)-[rel:REL]->(bar)
WHERE timestamp() < rel.expiration
You can also periodically remove expired relationships to clean up the DB and improve query performance.

Is it bad to change _id type in MongoDB to integer?

MongoDB uses ObjectId type for _id.
Will it be bad if I make _id an incrementing integer?
(With this gem, if you're interested)
No, it isn't bad at all. In fact, the built-in ObjectId is quite sizeable within the index, so if you believe you have something better you are more than welcome to change the default value of the _id field to whatever you like.
But, and this is a big but, there are some considerations when deciding to move away from the default ObjectId format, especially when using auto-incrementing _ids as shown here: https://docs.mongodb.com/v3.0/tutorial/create-an-auto-incrementing-field
Multi-threading isn't such a big problem because findAndModify and its atomic locks can take care of that, but then you run into your first problem: findAndModify is neither the fastest nor the lightest function, and significant performance drops have been noticed when using it regularly.
You also have to consider the overhead of doing this yourself anyway, even without findAndModify. For every insert you will need an extra query. Imagine having a unique id that you have to query the uniqueness of every time you want to insert. Eventually your insert rate will drop to a crawl and your lock time will build up.
Of course the ObjectId is really good at being unique without having to check or formulate its own uniqueness by touching the database prior to insertion, hence it doesn't have this overhead.
If you still feel an integer _id suits your scenario, then go for it, but bear in mind the overhead described above.
You can do it, but you are responsible to make sure that the integers are unique.
MongoDB doesn't support auto-increment fields like most SQL databases. When you have a distributed or multithreaded application which has multiple processes and/or threads which create new database entries, you have to make sure that they use the same counter. Otherwise it could happen that two threads try to store a document with the same _id in the database.
When that happens, one of them will fail. That means you have to wait for the database to return a success or error (by calling GetLastError or by setting the write concerns to acknowledged), which takes longer than just sending data in a fire-and-forget manner.
I had a use case for this: replacing _id with a 64 bit integer that represented a simhash of a document index for searching.
Since I intended to "get or create", providing the initial simhash and creating a new record if one didn't exist was perfect. Also, for anyone Googling, MongoDB support explained to me that simhashes are absolutely perfect for sharding and scaling, and even better than the more generic ObjectId, because they divide the data across shards evenly and intrinsically, and you effectively get the key stored at a space saving (a uint64 is much smaller than an ObjectId, and the simhash would need to be stored anyway).
Also, for you Googlers, replacing a MongoDB _id with something other than an ObjectId is absolutely simple: just create a document with the _id already defined; use an integer if you like. That's it: Mongo will simply use it. If you try to create a document with the same _id you'll get an error (E11000/duplicate key). So, like me, if you're using simhashing, this is ideal in all respects.

PostgreSQL always creates records with an even ID?

My PostgreSQL 9.2 database for some reason skips an ID with every record. Example:
User
1258930
1258932
1258934
1258936
What would cause this? Any pointer in the right direction to resolve this issue is appreciated. Thank you.
The comments have adequately covered possible reasons:
multiple nextval calls, say one by a default and one explicit;
The same sequence used by more than one table
Transactions being rolled back
Any pointer in the right direction to resolve this issue is appreciated
Your key mistake is viewing this as an issue. It's entirely normal for gaps to appear in generated sequences. If your DB crashes and restarts, a gap will appear in a sequence. If a transaction rolls back after allocating IDs, a gap will appear in the sequence.
Your application must be able to deal with this. It shouldn't care what an ID is, only that it's unique.
For details and for hints if you need truly gapless sequences, see this answer.
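As an illustration of the rollback case, a minimal sketch (the users table and its auto-generated id are assumptions):
BEGIN;
INSERT INTO users (name) VALUES ('alice');  -- consumes one value from the id sequence
ROLLBACK;                                   -- the consumed value is not handed back
INSERT INTO users (name) VALUES ('bob');    -- gets the next value, leaving a gap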
To add to Craig's points: the key culprit, I think you will find, is a DO ALSO rule inserting new.id into another table. This causes double incrementing because rules rewrite queries, so new.id ends up being "whatever we did to calculate new.id last time" (which means incrementing the sequence again).
If the skipping is consistent, the most likely cause IMO is someone creating DO ALSO rules without fully understanding the gotchas.
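For reference, a hedged sketch of the kind of DO ALSO rule being described (the table and column names are hypothetical). Because rules rewrite the query, referencing new.id in the rule's command re-evaluates the nextval() default, consuming a second sequence value per insert:
CREATE RULE users_audit AS ON INSERT TO users
    DO ALSO INSERT INTO users_audit (user_id, name)
    VALUES (new.id, new.name);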

Schema for storing "binary" values, such as Male/Female, in a database

Intro
I am trying to decide how best to set up my database schema for a (Rails) model. I have a model related to money which indicates whether the value is an income (positive cash value) or an expense (negative cash value).
I would like separate column(s) to indicate whether it is an income or an expense, rather than relying on whether the value stored is positive or negative.
Question:
How would you store these values, and why?
Have a single column, say Income, and store 1 if it's an income, 0 if it's an expense, null if not known.
Have two columns, Income and Expense, setting their values to 1 or 0 as appropriate.
Something else?
I figure the question is similar to storing a person's gender in a database (ignoring aliens/transgender/etc) hence my title.
My thoughts so far
Lookup might be easier with a single column, but there is a risk of mistaking 0 (false, expense) for null (unknown).
Having separate columns might be more difficult to maintain (what happens if we end up with a 1 in both columns?).
Maybe it's not that big a deal which way I go, but it would be great to have any concerns/thoughts raised before I get too far down the line and have to change my code-base because I missed something that should have been obvious!
Thanks,
Philip
How would you store these values, and why?
I would store them as a single column. Despite your desire to separate the data into multiple columns, anyone who understands accounting or bookkeeping will know that the dollar value of a transaction is one thing, not two separate things based on whether it's income or expense (or asset, liability, equity and so forth).
As someone who's actually written fully balanced double-entry accounting applications and less formal budgeting applications, I suggest you rethink your decision. It will make future work on this endeavour a lot easier.
I'm sorry, that's probably not what you want to hear and may well result in negative rep for me, but I can't, in all honesty, let this go without telling you what a mistake it would be.
Your "thoughts so far" are an indication of the problems already appearing.
1/ "Having seperate columns might be more difficult to maintain (what happens if we end up with a 1 in both columns?" - well, this shouldn't happen. Data is supposed to be internally consistent to the data model. You would be best advised preventing it with an insert/update trigger or, say, a single column that didn't allow it to happen :-)
2/ "Lookup might be easier with a single column, but there is a risk of mistaking 0 (false, expense) for null (unknown)." - no mistake possible if the sign is stored with the magnitude of the value. And the whole idea of not knowing whether an item is expense or income is abhorrent to accountants. That knowledge exists when the transaction is created, it's not something that is nebulous until some point after a transaction happens.
Sometimes I use a character. For example, I have a column gender in my database that stores m or f.
And I usually choose to have just one column.
I would typically implement a flag as an nchar(1) and use some meaningful abbreviations. I think that's the easiest thing to work with. You could use 'I' for income and 'E' for expense, for example.
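For example, a sketch of such a flag column with a constraint to keep the values meaningful (table and column names are assumptions):
CREATE TABLE cash_entries (
    id     INTEGER PRIMARY KEY,
    amount DECIMAL(12,2) NOT NULL,
    kind   NCHAR(1) NOT NULL CHECK (kind IN ('I', 'E'))  -- 'I' = income, 'E' = expense
);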
That said, I don't think that's a good way to do this system.
I would probably put incomes and expenses in separate tables, since they appear to be different sorts of things. The only advantages I can think of for putting them in the same table are lost once the meanings are differentiated by flags rather than positive and negative values.

Can one rely on the auto-incrementing primary key in your database?

In my present Rails application, I am resolving scheduling conflicts by sorting the models by the "created_at" field. However, I realized that when inserting multiple models from a form that allows this, all of the created_at times are exactly the same!
This is more a question of best programming practices: Can your application rely on your ID column in your database to increment greater and greater with each INSERT to get their order of creation? To put it another way, can I sort a group of rows I pull out of my database by their ID column and be assured this is an accurate sort based on creation order? And is this a good practice in my application?
The generated identification numbers will be unique.
This holds regardless of whether you use sequences, as in PostgreSQL and Oracle, or another mechanism such as MySQL's auto-increment.
However, sequences are often acquired in batches of, for example, 20 numbers.
So with PostgreSQL you cannot necessarily determine which row was inserted first, and there might even be gaps in the ids of inserted records.
Therefore you shouldn't use a generated id field for a task like that; it means relying on database implementation details.
Setting a created or updated field when the command executes is much better if you want to sort by creation or update time later on.
For example:
INSERT INTO A (data, created) VALUES ('something', CURRENT_TIMESTAMP);
UPDATE A SET data = 'something', updated = CURRENT_TIMESTAMP;
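On the sequence-caching point, a minimal sketch (the sequence name here is hypothetical): with a cache, each session pre-allocates a block of values, so the order in which ids are handed out across sessions need not match the order of the inserts.
ALTER SEQUENCE a_id_seq CACHE 20;  -- each session grabs 20 values at a time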
That depends on your database vendor.
MySQL, I believe, strictly orders auto-increment keys. I don't know for sure about SQL Server, but I believe it does as well.
Where you'll run into problems is with databases that don't support this functionality, most notably Oracle, which uses sequences that are roughly but not absolutely ordered.
An alternative might be to go for created time and then ID.
I believe the answer to your question is yes. Reading between the lines, I think you are concerned that the system may re-use ID numbers that are 'missing' in the sequence. If you had used 1, 2, 3, 5, 6, 7 as ID numbers, then in all the implementations I know of, the next ID number will always be 8 (or possibly higher); I don't know of any DB that would figure out that record ID #4 is missing and attempt to re-use it.
Though I am most familiar with SQL Server, I don't know of any vendor who would try to fill the gaps in a sequence: think of the overhead of keeping a list of unused IDs, as opposed to just keeping track of the last ID used and adding 1.
I'd say you can safely rely on the next assigned ID always being higher than the last, not just unique.
Yes, the id will be unique, and no, you cannot and should not rely on it for sorting; it is there to guarantee row uniqueness only. The best approach is, as emktas indicated, to use a separate "updated" or "created" field for just this information.
For setting the creation time, you can just use a default value like this
CREATE TABLE foo (
  id INTEGER UNSIGNED AUTO_INCREMENT NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated TIMESTAMP NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB; ## whatever :P
Now, that takes care of creation time. For update time I would suggest a BEFORE UPDATE trigger like this one (MySQL only lets you assign to NEW in a BEFORE trigger; of course you can do it in a separate query, but the trigger, in my opinion, is a better solution since it's more transparent):
DELIMITER $$
CREATE TRIGGER foo_b_upd BEFORE UPDATE ON foo
FOR EACH ROW BEGIN
  SET NEW.updated = NOW();
END;
$$
DELIMITER ;
And that should do it.
EDIT:
Woe is me. Foolishly, I didn't specify that this is for MySQL; in other databases there might be some differences in function names (namely, NOW) and other subtle itty-bitty details.
One caveat to EJB's answer:
SQL does not give any guarantee of ordering if you don't specify an ORDER BY column. E.g. if you delete some early rows and then insert new ones, the new rows may end up living in the same place in the DB as the old ones did (albeit with new IDs), and that physical order is what it may use as its default sort.
FWIW, I typically use order by ID as an effective version of order by created_at. It's cheaper in that it doesn't require adding an index on a datetime field (which is bigger and therefore slower than a simple integer primary key index), the values are guaranteed to be different, and I don't really care if a few rows that were added at about the same time sort in a slightly different order.
This is probably DB-engine dependent. I would check how your DB implements sequences, and if there are no documented problems I would decide to rely on the ID.
E.g. a PostgreSQL sequence is OK unless you play with the sequence cache parameters.
There is a possibility that another programmer will manually create or copy records from a different DB with the wrong ID column. However, I would simplify the problem: do not bother with low-probability cases where someone manually destroys data integrity. You cannot protect against everything.
My advice is to rely on sequence generated IDs and move your project forward.
In theory, yes, the highest id number is the last created. Remember, though, that databases have the ability to temporarily turn off the insertion of the autogenerated value, insert some records manually, and then turn it back on. Such inserts are not typically used on a production system but can happen occasionally when moving a large chunk of data from another system.
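One concrete example (a sketch only; the table and column names are hypothetical) is SQL Server's IDENTITY_INSERT setting, which allows explicit values, possibly lower than the current maximum, to be inserted into an identity column:
SET IDENTITY_INSERT dbo.Posts ON;
INSERT INTO dbo.Posts (Id, Title) VALUES (42, 'migrated row');  -- explicit id, bypassing the identity counter
SET IDENTITY_INSERT dbo.Posts OFF;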
