Big Integers and Custom Validation - ruby-on-rails

I'm somewhat new to Rails and I'm trying to learn about custom validations.
One common requirement in Brazil are CPF/CNPJ/RG fields. They are a type of identification number and follow a specific format.
For example:
CPFs are 11 digit numbers. They follow this pattern: xxx.xxx.xxx-xx
I'm trying to store them in an Integer field but I'm getting (Using Postgres):
PG::Error: ERROR: value "xxxxxxxxxxx" is out of range for type integer
What is the proper way to store this? Bigint (How?)? A string?
My second question is:
How can I specify a custom validation (a method) for this field that could be called somewhat like this:
class User < AR::Base
  validates :cpf, presence: true, uniqueness: true, cpf: true
end

Assuming performance is not critical, strings are fine. That way you can keep the dots and dashes. As mentioned by others in this thread, bigint or numeric may be far more performant if that's a concern.
If you keep the field a string, you can easily validate it with regex:
validates_format_of :cpf, with: /\A[0-9]{3}\.[0-9]{3}\.[0-9]{3}-[0-9]{2}\z/
(Use \A and \z rather than ^ and $: Rails rejects multiline anchors in format validations as insecure.)
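To get the cpf: true syntax from the question, here is a minimal sketch of a custom validator (the class name, file location, and error message are my own; the CPF check-digit math is left as a stub):

# app/validators/cpf_validator.rb -- picked up by Rails' autoloader
class CpfValidator < ActiveModel::EachValidator
  FORMAT = /\A[0-9]{3}\.[0-9]{3}\.[0-9]{3}-[0-9]{2}\z/

  def validate_each(record, attribute, value)
    unless value.to_s.match?(FORMAT)
      record.errors.add(attribute, options[:message] || 'is not a valid CPF')
    end
    # A real implementation would also verify the two CPF check digits here.
  end
end

With that in place, validates :cpf, presence: true, uniqueness: true, cpf: true works as written in the question.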

For small tables, just store as text to preserve the format.
For big tables, performance and storage size may be an issue. If your pattern is guaranteed, you may very well store the number as bigint and format it on retrieval with to_char():
Write:
SELECT translate('111.222.333-55', '.-', '')::bigint
This also serves as partial validation. Only digits, . and - are allowed in your string. The pattern might still be violated; you have to check explicitly with something like @Michael provided.
Read:
SELECT to_char(11122233355, 'FM000"."000"."000"-"00')
Returns:
111.222.333-55
Don't forget the leading FM in the pattern to remove the leading whitespace (where a negative sign might go for numbers).
A bigint occupies 8 bytes on disk and can easily store 11-digit numbers.
text (or varchar) needs 1 byte of overhead plus the actual string, which amounts to 15 bytes in your case.
Plus, processing bigint is generally a bit faster than processing text of equal length.
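If you go the bigint route from Rails, here is a rough sketch of the same strip-on-write / format-on-read idea in the model (assuming a bigint column named cpf):

class User < ApplicationRecord
  # Keep only the 11 digits when assigning.
  def cpf=(value)
    super(value.to_s.delete('.-'))
  end

  # Re-apply the xxx.xxx.xxx-xx mask, restoring leading zeros.
  def formatted_cpf
    digits = format('%011d', cpf)
    "#{digits[0, 3]}.#{digits[3, 3]}.#{digits[6, 3]}-#{digits[9, 2]}"
  end
end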

Personally I would always store these values as bigint and apply formatting on input/output (as Erwin suggests) or in the application.
The main reasons are storage efficiency (as Erwin mentions) and efficiency of comparison. When you compare 11111111112 to 11111111113 as text, PostgreSQL will use language-specific collation rules that are correct for text, but may not be what you want for numbers. Text comparisons are also slow: a recent question on SO reported a five-fold speed-up in text comparisons by using the COLLATE "C" option to force plain POSIX collation, and numeric comparisons are faster still.
Most of these standard numbers have their own internal check-sums, often a variant of the Luhn algorithm. Validating these in a CHECK constraint is likely to be a good idea. You can implement a Luhn algorithm check against integers pretty easily in PL/PgSQL or plain SQL; I wrote some samples on the PostgreSQL wiki.
Whatever you do, make sure you have a CHECK constraint on the column that validates the number on storage, so you don't get invalid and nonsensical values stored.
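As an illustration, a Luhn check is only a few lines of Ruby. Note that CPF actually uses a mod-11 check-digit scheme, so treat this as a sketch of the general idea rather than a CPF validator:

# Double every second digit from the right, subtract 9 from results
# over 9, sum everything, and require the total to be divisible by 10.
def luhn_valid?(number)
  digits = number.to_s.chars.map(&:to_i).reverse
  sum = digits.each_with_index.sum do |d, i|
    i.odd? ? (d * 2 > 9 ? d * 2 - 9 : d * 2) : d
  end
  (sum % 10).zero?
end

luhn_valid?(4539148803436467) # => true (a standard Luhn test number)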

Related

Is there ever a reason to use the string data type when using Rails 6 with a Postgres database?

I see here that the :text data type seems to perform exactly the same as :string.
This comment particularly:
PostgreSQL implementation prefers text. The only difference for pg string/text is constraint on length for string. No performance differences.
Is there any good reason to ever use :string when making a rails app using a postgresql database?
No difference in performance, difference in semantics.
Several libraries (for example SimpleForm) look at the data type of the field in the database and behave differently depending on it. SimpleForm will add a number input if it's a number, a checkbox if it's a boolean, and so on. For this case, it will add a single-line text field for string and a multiline text box for text.
Since there is no difference in performance, you can use either, but it's still useful to denote semantics.
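For example, in a migration the choice mostly signals intent to form builders like SimpleForm (the table and column names here are made up):

create_table :articles do |t|
  t.string :title  # rendered as a single-line text input
  t.text :body     # rendered as a multiline textarea
end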

Rails ActiveRecord validation: maximum length of a string?

I have these string / text fields in my database migration file:
t.string :author
t.string :title
t.string :summary
t.text :content
t.string :link
And these are my questions:
Every string / text attribute should have a maximum length validation, for two purposes: security (if you don't want to receive a few MB of text input) and the database (if string = varchar, MySQL has a 255-character limit). Is that right, or is there any reason not to have a maximum length validation on absolutely every string / text attribute in the database?
If I don't care about the exact length of author and title, as long as they are not too long to be stored as strings, should I set a maximum length of 255 for each of those?
If the maximum possible length of a URL is about 2000 characters, is it safe to store links as strings and not as texts? Should I be validating the maximum length of the link attribute if I am already validating its format using a regexp?
Should a content (text) attribute have a maximum length just to protect the database from the input of an unlimited length? For example, is setting a maximum length of a text field to 100,000 characters reasonable, or is this totally pointless and inefficient?
I understand that these questions might seem unimportant to some people, but still: that's validation of input, which is required for any application, and I think it's worth being rather paranoid here.
The question is great, and perhaps people with more knowledge of rails/mysql internals will be able to expand more.
1) Having any validation in the model depends on where you want the failure to happen when the input exceeds the limit. The model is the best option, since it will most likely cover most objects using the model.
The other alternative is simply limiting form fields using the maxlength attribute.
The first option does not work for optional fields.
2) I am not aware of any rule of thumb. Use whatever you know is the longest and make it a bit bigger.
3) My rule is that anything above 255 is text. You can find more info on this here.
4) If the column holds the same content - there might be value in that. Some use cases might have different maxlength depending on content type or user.
All of the above is also affected by how strict data validation requirements are in the project.
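As a sketch, length validations for the migration from the question might look like this (the model name and the exact limits are assumptions, not recommendations):

class Article < ApplicationRecord
  validates :author, :title, :summary, length: { maximum: 255 }
  validates :link, length: { maximum: 2048 }, allow_blank: true
  validates :content, length: { maximum: 100_000 }
end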

If you know the length of a string and apply a SHA1 hash to it, can you unhash it?

Just wondering if knowing the original string length means that you can better unhash a SHA1 encryption.
No, not in the general case: a hash function is not an encryption function and it is not designed to be reversible.
It is usually impossible to recover the original input for certain. This is because the domain of a hash function is larger than its range. For SHA-1 the domain is unbounded, but the range is 160 bits.
That means that, by the pigeonhole principle, multiple values in the domain map to the same value in the range. When two such values map to the same hash, it is called a hash collision.
However, for a specific limited set of inputs (where the domain of the inputs is much smaller than the range of the hash function), if a hash collision is found, such as through a brute-force search, it may be "acceptable" to assume that the input causing the hash was the original value. The above process is effectively a preimage attack. Note that this approach very quickly becomes infeasible, as demonstrated at the bottom. (There are likely some nice math formulas that can define "acceptable" in terms of the chance of collision for a given domain size, but I am not that savvy.)
The only way to know that a given input was the only one that mapped to the hash would be to perform an exhaustive search over all the values in the domain -- such as all strings with the given length -- and ensure that it was the only such input that resulted in the given hash value.
Do note, however, that in no case is the hash process "reversed". Even without the pigeonhole principle in effect, SHA-1 and other cryptographic hash functions are specifically designed to be infeasible to reverse -- that is, they are "one-way" hash functions. There are some advanced techniques that can be used to reduce the effective search space of various hashes; these are best left to Ph.D.s or people who specialize in cryptographic analysis :-)
Happy coding.
For fun, try creating a brute-force preimage attack on a string of 3 characters. Assuming only English letters (A-Z, a-z) and numbers (0-9) are allowed, there are "only" 62^3 (238,328) combinations in this case. Then try on a string of 4 characters (62^4 = 14,776,336 combinations) ... 5 characters (62^5 = 916,132,832 combinations) ... 6 characters (62^6 = 56,800,235,584 combinations) ...
Note how much larger the domain is for each additional character: this approach quickly becomes impractical (or "infeasible") and the hash function wins :-)
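A throwaway Ruby sketch of that exercise (the 3-character target here is made up):

require 'digest'

ALPHABET = [*'A'..'Z', *'a'..'z', *'0'..'9']
target = Digest::SHA1.hexdigest('xY9') # pretend only the digest is known

# 62^3 = 238,328 candidates: trivial. Each extra character multiplies the work by 62.
found = ALPHABET.repeated_permutation(3).lazy.map(&:join)
                .find { |s| Digest::SHA1.hexdigest(s) == target }
puts found # => "xY9"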
One way password crackers speed up preimage attacks is to use rainbow tables (which may only cover a small set of all values in the domain they are designed to attack), which is why passwords that use hashing (SHA-1 or otherwise) should always have a large random salt as well.
Hash functions are one-way functions. For a hash of a given size, there are many strings that may have produced that hash.
Now, if you know that the input size is fixed and small enough, let's say 10 bytes, and you know that each byte can have only certain values (for example ASCII's A-Za-z0-9), then you can use that information to precompute all the possible hashes and find which plain text produces the hash you have. This technique is the basis for rainbow tables.
If this were possible, SHA1 would not be that secure now, would it? So no, you cannot, unless you have considerable computing power [2^80 operations], in which case you don't need to know the length either.
One of the basic properties of a good cryptographic hash function, of which SHA1 happens to be one, is:
it is infeasible to generate a message that has a given hash
Theoretically, let's say the string was also known to be solely of ASCII characters, and it's of size n.
There are 95 characters in ASCII not including controls. We'll assume controls weren't used.
There are 95ⁿ possible such strings.
There are 1.461501×10⁴⁸ possible SHA-1 values (give or take), and at just n=25 there are 2.7739×10⁴⁹ possible ASCII-only strings without controls in them, which means guaranteed collisions (some such strings have the same SHA-1).
So we only need to get to n=25 before this becomes impossible even with infinite resources and time.
And remember, up until now I've been making it deliberately easy with my ASCII-only rule. Real-world modern text doesn't follow that.
Of course, only a subset of such strings would be anything likely to be real (if one says "hello my name is Jon" and the other says "fsdfw09r12esaf", then it was probably the first). Still though, up until now I was assuming infinite time and computing power. If we want to work it out sometime before the universe ends, we can't assume that.
Of course, the nature of the attack is also important. In some cases I want to find the original text, while in others I'll be happy with gibberish with the same hash (if I can input it into a system expecting a password).
Really though, the answer is no.
I posted this as an answer to another question, but I think it is applicable here:
SHA1 is a hashing algorithm. Hashing is one-way, which means that you can't recover the input from the output.
The classic illustration of hashing (a small hash table mapping names to two-digit buckets) demonstrates this: both John Smith and Sandra Dee are mapped to 02. This means that you can't recover which name was hashed given only 02.
Hashing is used basically due to this principle:
If hash(A) == hash(B), then there's a really good chance that A == B. Hashing maps large data sets (like a whole database) to a tiny output, like a 10-character string. If you move the database and the hash of both the input and the output are the same, then you can be pretty sure that the database is intact. It's much faster than comparing both databases byte-by-byte.
In that illustration, the long names are mapped to 2-digit numbers.
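That integrity-check use is a one-liner with Ruby's Digest (the file names here are hypothetical):

require 'digest'

original = Digest::SHA1.file('backup.sql').hexdigest
copy = Digest::SHA1.file('backup_copy.sql').hexdigest
puts(original == copy ? 'copy looks intact' : 'copy corrupted')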
To adapt this to your question: with a brute-force search, for a string of a given length (say length l) you will have to compute (dictionary size)^l different hashes.
If the dictionary consists of only alphanumeric case-sensitive characters, then you have (10 + 26 + 26)^l = 62^l hashes to compute. I'm not sure how many FLOPs are required to produce one hash (it depends on the input's length). Let's be super-unrealistic and say it takes 10 FLOPs to perform one hash.
For a 12-character password, that's 62^12 ~ 10^21. That's 10,000 seconds of computations on the fastest supercomputer to date.
Multiply that by a few thousand and you'll see that it is infeasible if I increase my dictionary size a little bit or make my password longer.

floating point precision in ruby on rails model validations

I am trying to validate a dollar amount using a regex:
^[0-9]+\.[0-9]{2}$
This works fine, but whenever a user submits the form and the dollar amount ends in 0 (zero), Ruby (or Rails?) chops the 0 off.
So 500.00 turns into 500.0 thus failing the regex validation.
Is there any way to make ruby/rails keep the format entered by the user, regardless of trailing zeros?
I presume your dollar amount is of decimal type. So, any value the user enters in the field is cast from a string to the appropriate type before saving to the database. Validation applies to values already converted to numeric types, so a regex is not really a suitable validation filter in your case.
You have a couple of possibilities to solve this, though:
Use validates_numericality_of. That way you leave the conversion completely to Rails, and just check whether the amount is within a given range.
Use validates_each and code your validation logic yourself (e.g. check whether the value has more than 2 decimal digits); both options are sketched at the end of this answer.
Validate the attribute before it's been typecasted:
This is especially useful in validation situations where the user might supply a string for an integer field and you want to display the original string back in an error message. Accessing the attribute normally would typecast the string to 0, which isn't what you want.
So, in your case, you should be able to use:
validates_format_of :amount_before_type_cast, :with => /\A[0-9]+\.[0-9]{2}\z/, :message => "must contain dollars and cents, separated by a period"
Note, however, that users might find it tedious to follow your rigid entry rules (I would really prefer being able to type 500 instead of 500.00, for example), and that in some locales a period is not the decimal separator (if you ever plan to internationalize your app).
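For completeness, a sketch of the first two options from the list above (model and attribute names are assumed):

class Payment < ApplicationRecord
  # Option 1: let Rails typecast, then check only the numeric range.
  validates :amount, numericality: { greater_than_or_equal_to: 0 }

  # Option 2: inspect the raw input for more than two decimal digits.
  validates_each :amount do |record, attr, _value|
    raw = record.read_attribute_before_type_cast(attr).to_s
    record.errors.add(attr, 'has too many decimal places') if raw =~ /\.\d{3,}\z/
  end
end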
In general if you wish to “remember” the decimal precision of a floating point value, you should use a decimal type, not a binary float.
On the other hand, I'm not certain why you would wish to force the string representation in such a strict manner… How about accepting any number and formatting it with e.g. number_to_currency?
Usually with money it's best to store it as an integer in cents (500 cents is $5.00). I use the Money gem to handle this.
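A minimal sketch with the Money gem (API from memory, so double-check against its README):

require 'money'

Money.locale_backend = :currency # spares us full I18n setup for this sketch
price = Money.new(500, 'USD')    # amounts are integer cents
price.format                     # => "$5.00"
price.cents                      # => 500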

How should I present a cost field to the user, and store it in the database?

Right now I have two fields for cost. One for dollars and one for cents. This works, but it is a bit ugly. It also doesn't allow the user to enter the term "free" or "no cost" if they want. But if I only have one field, I might have to make my parser a bit smarter. What do you think?
On the server side, I combine dollars and cents to store them as decimals in my database. Mainly so that I can gather statistics (cost averages, etc.) quickly.
Do you think it is better to store the cost as a string? Then whenever I actually use the cost for stats or other purposes, I would convert it to a decimal at that point. Or am I on the right track?
There is a rule in database design that states that "atomic data" should not be split. By this rule, a price or cost is an example of atomic data, and it should therefore never be split among multiple columns, just as you shouldn't split a phone number among multiple columns (unless you really have a very good reason for it, which is rare).
Use a DECIMAL data type. Something like DECIMAL(8,3) should work and it's supported by all ANSI SQL compliant database products!
You can consult Joe Celko's "Thinking In Sets" book for a discussion of this topic. See section 1.6.2, pages 21-22.
EDIT -
It seems from your question that you are also concerned with how to accept the user's input in a form that resembles the price (xxxx.xx), hence the two input boxes: one for the whole dollars, one for the pennies.
I recommend using a single input box and then doing input validation using regular expressions to match your format (something like [0-9]+(\.[0-9]{1,3})? would probably work, but could be improved). You could then parse the validated string to a decimal type in your language, or just pass it as a string into your database; SQL will know how to cast it to a DECIMAL type.
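For instance, in Ruby the validated string casts cleanly to an exact decimal (BigDecimal here; the input string is made up):

require 'bigdecimal'

raw = '1234.56' # user input, assumed already stripped of whitespace
if raw.match?(/\A[0-9]+(\.[0-9]{1,3})?\z/)
  cost = BigDecimal(raw) # exact decimal, safe to hand to the database
else
  # report the invalid format back to the user
end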
Keep the whole cost as a decimal. If it's free, then keep the cost as 0. In presentation, if the cost is zero, write "free" instead of 0.
I generally store the cost as the lowest unit (pennies) and then convert it to whole dollars later.
So a cost of $4.50 gets stored as 450. Free items would be -1 pennies. You could store free things as 0 pennies as well, this gives you the flexibility to use 0 and -1 to mean two slightly different things (free vs no sale?).
It also makes it easier to support countries that don't use cents if you choose to go that route.
As for presenting the data entry field, I personally don't like it when I have to keep switching fields for tiny things (like when they break up phone numbers into 3 fields, or IP addresses into 4). I'd present one field, and let the users type the decimal point in themselves. That way, your users don't have to tab (or click, if they are unfamiliar with tab) to the next field.
Use cents: store 450 for $4.50. This will save you from problems that arise very often from the fact that floating-point operations are not safe. Just try the following expression in irb: 0.4 - 0.3 == 0.1 returns false, purely because of floating-point representation inaccuracies.
In my models I'm always using:
# Expose the integer price column (stored in cents) as dollars.
def price_with_cents
  price / 100.0
end

# Round rather than truncate: 4.56 * 100 can come out as 455.9999... in floats.
def price_with_cents=(num)
  self.price = (num.to_f * 100).round
end
The column is just named price, with an integer type. I don't have much experience with decimal columns and their representation in Ruby (which can be a float, which is problematic, as I've shown at the beginning).
Don't allow garbage to make it into your database. If you're expecting a dollar amount in a field, then make sure it's valid before it gets in there. This will allow you to report better on the data and allows simpler formatting on output.
I suggest making this a single field with validation on update or insert.
if field != SPECIAL_FREE_TAG  # e.g. the literal string "free"
  begin
    value = BigDecimal(field) # accept the value
  rescue ArgumentError
    # conversion failed: report to the user
  end
end
Use try parse or regular expressions to help with the validation.
I would store the cost as decimal with the scale being no less than 2 and maybe even 3-5. If something is bought in bulk the unit cost could easily include fractions of a cent. Free items have a cost of 0. If the cost is unknown then allow null values also.
