How to search an encrypted field in attr_encrypted - ruby-on-rails

I have set up attr_encrypted in my app with individual 'iv's for every record.
# Fields
User.e_name
User.e_name_iv
I am trying to search the User table for a known name. I've tried:
User.find_by_name("Joe Bloggs") # undefined method "find_by_name" for Class
User.where("name = ?", "Joe Bloggs").first # column "name" does not exist
User.where(:e_name => User.encrypt_name("Joe Bloggs")) # must specify an iv
How can I find a record by its name?

While some searching is possible, it is not practical. You would potentially have to iterate through every record, decrypting each one with its own IV, until you find an exact match. With a large number of records this quickly becomes too slow.
Have you read the readme? https://github.com/attr-encrypted/attr_encrypted#things-to-consider-before-using-attr_encrypted
Searching, joining, etc
While choosing to encrypt at the attribute level is the most secure
solution, it is not without drawbacks. Namely, you cannot search the
encrypted data, and because you can't search it, you can't index it
either. You also can't use joins on the encrypted data. Data that is
securely encrypted is effectively noise. So any operations that rely
on the data not being noise will not work. If you need to do any of
the aforementioned operations, please consider using database and file
system encryption along with transport encryption as it moves through
your stack.
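To illustrate why the equality search fails and what the "iterate through every record" fallback looks like, here is a minimal sketch using Ruby's OpenSSL bindings directly. It is not attr_encrypted's API: the helper names, the AES-256-CBC cipher choice, and the in-memory rows (shaped like the question's e_name and e_name_iv columns) are all assumptions made for the example.

```ruby
require "openssl"

# Hypothetical helper: encrypts a value with an explicit key and per-record IV.
def encrypt_value(plaintext, key, iv)
  cipher = OpenSSL::Cipher.new("aes-256-cbc")
  cipher.encrypt
  cipher.key = key
  cipher.iv = iv
  cipher.update(plaintext) + cipher.final
end

# Hypothetical helper: reverses encrypt_value for the same key and IV.
def decrypt_value(ciphertext, key, iv)
  cipher = OpenSSL::Cipher.new("aes-256-cbc")
  cipher.decrypt
  cipher.key = key
  cipher.iv = iv
  (cipher.update(ciphertext) + cipher.final).force_encoding("UTF-8")
end

# Builds an in-memory row with a fresh random IV, like one record per user.
def build_row(name, key)
  iv = OpenSSL::Random.random_bytes(16)
  { e_name: encrypt_value(name, key, iv), e_name_iv: iv }
end

# Linear scan: decrypt every row with its own IV and compare the plaintext.
def find_by_name(rows, name, key)
  rows.find { |row| decrypt_value(row[:e_name], key, row[:e_name_iv]) == name }
end

key = OpenSSL::Random.random_bytes(32)
rows = ["Joe Bloggs", "Jane Doe", "Joe Bloggs"].map { |name| build_row(name, key) }

# The two "Joe Bloggs" rows hold different ciphertexts, so WHERE e_name = ? cannot match.
puts rows[0][:e_name] == rows[2][:e_name]
puts find_by_name(rows, "Jane Doe", key).equal?(rows[1])
```

Every lookup decrypts rows until it hits a match, which is why this approach stops being practical as the table grows.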

Related

Creating a grouped text log of changed model attributes

I've recently started looking into Ruby on Rails, and I've set up a basic system to scan and parse an XML data source, storing the elements in a MySQL database.
I'm intending to run the script as a rake task at set intervals, so want to track additions and updates, outputting the new, or changed, values to a text file.
I initially looked at using before_save to write self.changes to a file. The complexity arises because I'm retrieving data from two different pages and want to group the log output, as in the example below. Note that each pricing row is a different record in the same table; the variable names are only examples.
Item GUID
- Price US: #{old price} to #{new price}
- Price UK: #{old price} to #{new price}
The solution I'm currently looking to implement is adding a logged column to the table. If the data changes I can set it to changed, or to new if the record has been added, then query for records in which logged is not NULL and group them by GUID. However, because this runs after the object has been saved, I lose the previous values.
Is there a different approach I could take to achieve something like this?
Yes, there is a better way to do this. Take a look at these options you've got:
audited gem: https://github.com/collectiveidea/audited
paper_trail gem: https://github.com/airblade/paper_trail
espinita gem: https://github.com/continuum/espinita
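If a gem is more than you need, the grouping itself can be sketched in plain Ruby. The class below is hypothetical and not part of any of the gems above. It assumes that a before_save callback passes in the model's `changes` hash (attribute => [old, new], as provided by ActiveModel::Dirty) while the old values are still available, together with the item's GUID and a label such as the price region.

```ruby
# Hypothetical collector that groups attribute changes by item GUID.
class GroupedChangeLog
  def initialize
    # Each GUID maps to the list of change lines recorded for it.
    @entries = Hash.new { |hash, guid| hash[guid] = [] }
  end

  # changes has the shape { "price" => [old_value, new_value] }.
  def record(guid, label, changes)
    changes.each do |attribute, (old_value, new_value)|
      @entries[guid] << "- #{attribute} #{label}: #{old_value} to #{new_value}"
    end
  end

  # Renders one block per GUID, separated by blank lines.
  def to_text
    @entries.map { |guid, lines| ([guid] + lines).join("\n") }.join("\n\n")
  end
end

log = GroupedChangeLog.new
log.record("ITEM-1", "US", { "price" => [10, 12] })
log.record("ITEM-1", "UK", { "price" => [8, 9] })
puts log.to_text
```

The rake task could create one collector per run and write `to_text` to the log file once all records have been processed.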

Rails model.create set id

I wonder if it's possible to run Model.create() such that instead of taking the next free id integer it takes the lowest free integer.
For example, assume we have records for id=10..20 and no records for id=0..9. I want to create an instance of Model with an id starting from 0 (a normal Model.create() would create an instance starting from 21).
Preferably I want this to happen automatically. I don't want to change the id by explicitly defining it.
DB
You'll be best doing this at database-level (look at altering the auto-increment number)
Although I think you can do this in Rails, I would highly recommend using the DB functionality to make it happen. You can do something like this in phpMyAdmin (for MySQL):
If you set the Auto-Increment to the number you wish to start at, every time you save data into the DB, it will just save with that number. I think using any Rails-based method will just overcomplicate things unnecessarily.
I'd discourage it.
Those ids serve solely as unique identifiers for rows in a table, and it's the database's job to assign one. You can verify that the model doesn't require an id to be saved:
m = Model.new
# populate m with data
m.name = "Name"
# look at what m contains
m
# and save it
m.save
# now inspect it again and see it got its unique id
m
While it might be possible to modify ids, it's not good practice to give ids extra meaning. When each new record gets an id that has never been used before, it's easier to debug DB structure errors that might occur during development. Like, say, some associated objects suddenly show up in a new user's account. Weird enough, right? That can happen and, worst case, can show up in production, resulting in a severe security breach.
Keeping ids unique at all times eliminates this bug's effect. That seems much more important if the associated objects store confidential information and you care about keeping them safe. Encryption concerns aside.
So, to be safe in every situation, developers have adopted the practice of giving id no role other than uniquely identifying a row in a table. If you want it to do something else, consider adding another field for that purpose.
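As a sketch of the "another field" idea: the hypothetical helper below picks the lowest unused number for a separate column (say, a slot number) and leaves id to the database. The helper name and the idea of loading the used numbers into memory are assumptions for illustration. With concurrent writers two processes could pick the same number, so a unique constraint on that column would still be needed.

```ruby
require "set"

# Returns the smallest non-negative integer that does not appear in `used`.
def lowest_free_number(used)
  taken = used.to_set
  candidate = 0
  candidate += 1 while taken.include?(candidate)
  candidate
end

# With numbers 10..20 taken, the lowest free number is 0.
puts lowest_free_number((10..20).to_a)
```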

Rails 4 encrypt foreign key

I'm building an app that requires HIPAA compliance, which, to cut to the chase, means that I can't allow for certain connections to be freely viewable in the database (patients and recommendations for them).
These tables are connected through the patients_recommendations table in my app, which worked well until I added the encryption via attr_encrypted. In an effort to cut down on the amount of encrypting and decrypting (and associated overhead), I'd like to be able to simply encrypt the patient_id in the patients_recommendations table. However, upon changing the data type to string and the column name to encrypted_patient_id, the app breaks with the following error when I try to reseed my database:
can't write unknown attribute `patient_id'
I assume this is because the join is looking for the column directly and not by going through the model (makes sense, using the model is probably slower). Is there any way that I can make Rails go through the model (where attr_encrypted has added the necessary helper methods)?
Update:
In an effort to find a work-around, I've tried adding a before_save to the model like so:
before_save :encrypt_patient_id
...
private
def encrypt_patient_id
self.encrypted_patient_id = PatientRecommendation.encrypt(:patient_id, self.patient_id)
self.patient_id = nil
end
This doesn't work either, however, resulting in the same error of unknown attribute. Either solution would work for me (though the first would address the primary problem), any ideas why the before_save isn't being called when created through an association?
You should probably store the PII data and the PHI data in separate DBs. Encrypt the PII data (including any associations to a provider or provider location) and then hash out all of the PHI data to make it easier. As long as there are not direct associations between the two, it would be acceptable to not have the PHI data encrypted as it's anonymized.
Plan A
Don't set patient_id to nil in encrypt_patient_id: the column no longer exists, and removing that line might make the problem go away.
Also, in Rails 4 a before_* callback whose last expression evaluates to false halts the callback chain, so put an explicit true at the end of the method.
Plan B, rethink your options
There are more options - from database-level transparent encryption (which formally encrypts the data on disk), to encrypted filesystems for storing certain tablespaces, to flat out encryption of data in the columns.
Encrypting the join columns sounds like a road to unhappiness for a variety of reasons, ranging from reporting issues to performance issues when joining is necessary, which might be pretty severe.
The trouble you're currently experiencing with the seed looks like it's the first bump on what promises to be a bad road (in this case ActiveRecord seems to be confused about how to handle your association: it tries to set patient_id on initialize and breaks).
The overhead of encrypting restricted data might not be as high as you think. I'm not sure how things go for HIPAA, but for PCI you're not exactly encouraged to render the protected data on screen, so encryption incurs only a small overhead because it happens relatively rarely (business need-to-know etc.).
Also, memory is probably considered to be 'not at rest and not in transit', you could in theory cache some of the clear values for limited periods of time and thus save up on the decryption overhead.
Basically, encrypting data might not be that bad, and encrypting keys in the database might be worse than you think.
I suggest we talk directly, I'm doing PCI DSS compliance stuff and this topic interests me.
Option: 1-way hashes for primary/foreign keys
PatientRecommendation would have a hash of patient_id (call it patient_hash), and Patient would be capable of generating the same patient_hash from its id. I'd suggest storing patient_hash in both tables, though: for Patient it would be the primary key for the join and for PatientRecommendation the foreign key.
You then define the Rails relation in these terms, and Rails will no longer be confused about your relation scheme:
has_many :patient_recommendations, primary_key: :patient_hash, foreign_key: :patient_hash
and the result is cryptographically more robust and easy for the database to handle
If you're adamant about not storing the patient_hash in Patient, you could use a plain SQL statement to do the relation. It's less convenient but workable, something along the lines of this pseudo-SQL:
JOIN ON generate_hash(patient.id) = patient_recommendations.patient_hash
Oracle, for example, has an option to make functional indexes (think create index generate_hash(patient.id)) so this approach could be pretty efficient depending on your choice of database.
However, playing with join keys will complicate your life a lot, even with these measures.
I'll expand on this post later on with additional options
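A minimal sketch of how patient_hash could be generated, assuming a keyed hash (HMAC-SHA256) rather than a bare digest. That choice is mine, not the answer's: small integer ids are easy to brute-force through an unkeyed hash, so a secret kept outside the database is mixed in. The helper name and the secret are hypothetical.

```ruby
require "openssl"

# Hypothetical helper: derives a keyed one-way hash from a patient id.
# `secret` is assumed to live outside the database (for example in an env variable).
def patient_hash(patient_id, secret)
  OpenSSL::HMAC.hexdigest("SHA256", secret, patient_id.to_s)
end

secret = "example-secret-not-for-production"
# The same id always yields the same value, so both tables can store and join on it.
puts patient_hash(42, secret) == patient_hash(42, secret)
```

Both models would call the same helper when a record is created, so the stored patient_hash values line up for the has_many relation above.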

How to create a fact table using natural keys

We've got a data warehouse design with four dimension tables and one fact table:
dimUser id, email, firstName, lastName
dimAddress id, city
dimLanguage id, language
dimDate id, startDate, endDate
factStatistic id, dimUserId, dimAddressId, dimLanguageId, dimDate, loginCount, pageCalledCount
Our problem is: We want to build the fact table which includes calculating the statistics (depending on userId, date range) and filling the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs load(), we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know the surrogate keys which were generated
if we create meta data (including a set of dimension_name, surrogate_key, natural_key) this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have a set of source tables which track these things, which is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date, or even lower to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but also should have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys on a mapping table, or in the dimension as an attribute, then for each row to load in the fact, it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually has a surrogate key stored in a YYYYMMDD or other "natural" format, so you can just generate this from the date information on the source record that you're loading into the fact.
Don't force everything into a single query. Try loading the data in separate queries and combining the results in a separate layer.
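The lookup described above can be sketched in Ruby with in-memory rows. Everything here is an assumption for illustration: the natural_key attribute on the user dimension, the shape of the source rows, and the decision to skip rows whose user is unknown. In practice the lookup would usually be a join in the load query.

```ruby
require "date"

# Hypothetical loader: maps natural keys to surrogate keys and builds fact rows.
def build_fact_rows(dim_users, source_rows)
  # natural (business) key => surrogate key of the user dimension
  surrogate_by_natural = dim_users.to_h { |user| [user[:natural_key], user[:id]] }

  source_rows.filter_map do |row|
    user_key = surrogate_by_natural[row[:user_id]]
    next if user_key.nil? # unknown user; skipped here, a real job might log it

    {
      dim_user_id: user_key,
      # The date key is generated straight from the source date as YYYYMMDD.
      dim_date_id: row[:date].strftime("%Y%m%d").to_i,
      login_count: row[:logins],
      page_called_count: row[:pages]
    }
  end
end

dim_users = [{ id: 1, natural_key: 42 }]
source_rows = [{ user_id: 42, date: Date.new(2012, 3, 5), logins: 3, pages: 10 }]
p build_fact_rows(dim_users, source_rows)
```

Because the surrogate key is looked up by natural key after the dimension load, it no longer matters that INSERT IGNORE hides which surrogate keys were generated.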

How can I add an index on an attr_encrypted db field?

I have
attr_accessible :access_token
attr_encrypted :access_token, key: ENV['ENCRYPTION_KEY']
and I'm doing some User.find_by_access_token, so I'd like to index the field in the db.
However, no access_token exists, only encrypted_access_token.
Does indexing this do the same thing as indexing any other field?
The whole point of saving encrypted data is to prevent the cleartext from showing. Obviously you cannot search for it, or the whole concept would be flawed.
You can index the encrypted token with a plain index and search the table with the encrypted token - for which you obviously need the encryption key.
CREATE INDEX tbl_encrypted_access_token_idx ON tbl(encrypted_access_token);
SELECT *
FROM tbl
WHERE encrypted_access_token = <encrypted_token>;
If all your tokens can be decrypted with an IMMUTABLE Postgres function, you could use an index on an expression:
CREATE INDEX tbl_decrypted_token_idx
ON tbl(decrypt_func(encrypted_access_token));
SELECT *
FROM tbl
WHERE decrypt_func(encrypted_access_token) = <access_token>;
Note that the expression has to match the expression in the index for the index to be useful.
But that would pose a security hazard on multiple levels.
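The first option, searching by the encrypted value, only works when encryption is deterministic, meaning the same key and IV are used for every row. The sketch below shows that idea with Ruby's OpenSSL bindings and a Hash standing in for the database index. The helper names, the cipher, and the fixed IV are assumptions, and a fixed IV means equal tokens produce equal ciphertexts, which leaks that information.

```ruby
require "openssl"

# Hypothetical helper: same key + same IV => same ciphertext for the same token.
def deterministic_encrypt(plaintext, key, fixed_iv)
  cipher = OpenSSL::Cipher.new("aes-256-cbc")
  cipher.encrypt
  cipher.key = key
  cipher.iv = fixed_iv
  cipher.update(plaintext) + cipher.final
end

# Looks up a user id by encrypting the search token and matching on equality,
# which is what WHERE encrypted_access_token = ? does against a plain index.
def find_user_id(index, token, key, fixed_iv)
  index[deterministic_encrypt(token, key, fixed_iv)]
end

key = OpenSSL::Random.random_bytes(32)
fixed_iv = OpenSSL::Random.random_bytes(16)
index = { deterministic_encrypt("token-a", key, fixed_iv) => 1 }
puts find_user_id(index, "token-a", key, fixed_iv)
```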
I Googled and found a reference to Fast Search on Encrypted Field. A comment mentioned "deidentifying the data". Seems to be an accepted method.
Here's how I envision it working. In this example I separate the patient name from the rest of the Patient record.
Patient Row: [id=1, name_and_link=9843565346598789, …]
Patient_name Row: [id=1, name="John", patient_link=786345786375657]
The name_and_link field is an encrypted copy of two fields: the name and a link to the Patient_name. Having the name in both tables is redundant. I suggest it to provide faster access (no need to read from the Patient_name table). Also allows recreating the Patient_name table if necessary (e.g. if the two tables become out of sync).
The Patient_name table contains the unencrypted copy of the name value. The name row can be indexed for fast access. To search by name, locate matching names in the Patient_name table and then use the encrypted links back to the Patient table.
Note: In the example above I show long numbers as sample encrypted data. It's actually worse in real life. Depending on the encryption method, the minimum length of an encrypted value is about 67 bytes using Postgres' pgp_sym_encrypt() function. That means encrypting the letter "x" turns into 67 bytes. And I'm proposing two encrypted fields for each de-id'd field. That's why I suggest encrypting the name and the link together (as JSON tuple?) in the Patient table. Cuts the space overhead in half versus encrypting two fields separately.
Note: This requires breaking some fields into pieces. (e.g. phone numbers, SSN, addresses). Each part would need to be stored in a separate table. Even the street portion of an address would have to be further subdivided. This is becoming complicated. I'd like to see Postgres automate this.
Note: Just controlling access to the password is a tough issue itself.
