I have
attr_accessible :access_token
attr_encrypted :access_token, key: ENV['ENCRYPTION_KEY']
and I'm doing lookups like User.find_by_access_token, so I'd like to index the field in the database.
However, no access_token column exists, only encrypted_access_token.
Does indexing this do the same thing as indexing any other field?
The whole point of saving encrypted data is to prevent the cleartext from showing. Obviously you cannot search for the cleartext, or the whole concept would be flawed.
You can index the encrypted token with a plain index and search the table with the encrypted token - for which you obviously need the encryption key.
CREATE INDEX tbl_encrypted_access_token_idx ON tbl(encrypted_access_token);
SELECT *
FROM tbl
WHERE encrypted_access_token = <encrypted_token>;
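In Rails terms, a minimal sketch of the same idea - assuming the token is encrypted deterministically (static key, no per-record random IV), since otherwise equal plaintexts don't produce equal ciphertexts and the lookup can't match anything:
# In a migration: index the ciphertext column like any other string column.
add_index :users, :encrypted_access_token

# At lookup time: encrypt the search value with the same key, match on ciphertext.
# attr_encrypted defines a class-level encrypt_<attribute> helper for this.
token = "abc123"
user = User.where(encrypted_access_token: User.encrypt_access_token(token)).first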
If all your tokens can be decrypted with an IMMUTABLE Postgres function, you could use an index on an expression:
CREATE INDEX tbl_decrypted_token_idx
ON tbl(decrypt_func(encrypted_access_token));
SELECT *
FROM tbl
WHERE decrypt_func(encrypted_access_token) = <access_token>;
Note that the expression has to match the expression in the index for the index to be useful.
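For illustration only, a migration sketch with pgcrypto's pgp_sym_decrypt standing in for decrypt_func; note that the key ends up baked into the function body:
class AddDecryptedTokenIndex < ActiveRecord::Migration
  def up
    # Wrap the decryption call just to mark it IMMUTABLE for the index.
    execute <<-SQL
      CREATE FUNCTION decrypt_token(ciphertext bytea) RETURNS text
        LANGUAGE sql IMMUTABLE
        AS $$ SELECT pgp_sym_decrypt(ciphertext, 'my-secret-key') $$
    SQL
    execute <<-SQL
      CREATE INDEX tbl_decrypted_token_idx
        ON tbl (decrypt_token(encrypted_access_token))
    SQL
  end
end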
But that would pose a security hazard on multiple levels.
I Googled and found a reference to Fast Search on Encrypted Field. A comment mentioned "deidentifying the data". Seems to be an accepted method.
Here's how I envision it working. In this example I separate the patient name from the rest of the Patient record.
Patient Row: [id=1, name_and_link=9843565346598789, …]
Patient_name Row: [id=1, name="John", patient_link=786345786375657]
The name_and_link field is an encrypted copy of two fields: the name and a link to the Patient_name. Having the name in both tables is redundant. I suggest it to provide faster access (no need to read from the Patient_name table). Also allows recreating the Patient_name table if necessary (e.g. if the two tables become out of sync).
The Patient_name table contains the unencrypted copy of the name value. The name row can be indexed for fast access. To search by name, locate matching names in the Patient_name table and then use the encrypted links back to the Patient table.
Note: In the example above I show long numbers as sample encrypted data. It's actually worse in real life. Depending on the encryption method, the minimum length of an encrypted value is about 67 bytes using Postgres' pgp_sym_encrypt() function. That means encrypting the single letter "x" produces about 67 bytes. And I'm proposing two encrypted fields for each de-id'd field. That's why I suggest encrypting the name and the link together (as a JSON tuple?) in the Patient table: it cuts the space overhead in half versus encrypting two fields separately.
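A hedged Ruby sketch of the lookup path (Crypto.encrypt/Crypto.decrypt stand in for whatever cipher you use, e.g. pgp_sym_encrypt; all names here are illustrative, not an established pattern):
class Patient < ActiveRecord::Base
  # name_and_link holds the encrypted JSON tuple [name, patient_name_id]
end

class PatientName < ActiveRecord::Base
  # name is the cleartext copy (indexed); patient_link is the encrypted Patient id
end

# Search: hit the indexed cleartext name, then follow the encrypted link back.
def find_patients_by_name(name)
  PatientName.where(name: name).map do |row|
    Patient.find(Crypto.decrypt(row.patient_link).to_i)
  end
end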
Note: This requires breaking some fields into pieces. (e.g. phone numbers, SSN, addresses). Each part would need to be stored in a separate table. Even the street portion of an address would have to be further subdivided. This is becoming complicated. I'd like to see Postgres automate this.
Note: Just controlling access to the password is a tough issue itself.
I have set up attr_encrypted in my app with an individual IV for every record.
# Fields
User.e_name
User.e_name_iv
I am trying to search the User table for a known name. I've tried:
User.find_by_name("Joe Bloggs") # undefined method "find_by_name" for Class
User.where("name = ?", "Joe Bloggs").first # column "name" does not exist
User.where(:e_name => User.encrypt_name("Joe Bloggs")) # must specify an iv
How can I find a record by its name?
While it is possible to do some searching, it's not very practical: you would potentially have to iterate through every record, decrypting each one with its respective IV, until you find an exact match. Depending on how many records you have, that won't scale.
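For illustration, the brute-force scan looks something like this - attr_encrypted decrypts each record with its own IV when the name accessor is read:
# Workable only for small tables: decrypt every row until one matches.
def find_user_by_name(name)
  User.all.detect { |user| user.name == name }
end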
Have you read the readme? https://github.com/attr-encrypted/attr_encrypted#things-to-consider-before-using-attr_encrypted
Searching, joining, etc
While choosing to encrypt at the attribute level is the most secure solution, it is not without drawbacks. Namely, you cannot search the encrypted data, and because you can't search it, you can't index it either. You also can't use joins on the encrypted data. Data that is securely encrypted is effectively noise. So any operations that rely on the data not being noise will not work. If you need to do any of the aforementioned operations, please consider using database and file system encryption along with transport encryption as it moves through your stack.
I am looking for a key-value store database on iOS. It should be based on SQLite, so YapDatabase seems to be a good choice.
But I find YapDatabase only uses a single table; to quote the docs, "The main database table is named "database"": CREATE TABLE "database2" ("rowid" INTEGER PRIMARY KEY, "collection" CHAR NOT NULL, "key" CHAR NOT NULL, "data" BLOB, "metadata" BLOB). So I am concerned about storing objects of different types in the same column.
For example, I plan to use YapDatabase for my chat app, storing each message as |collection|key|object|metadata|. Each message has a unique id, which will be used as the key; the message content is normally an NSString, which will be used as the object; a timestamp with some other data is used as the metadata, just like the YapDatabase author answered here.
Sometimes pictures will be sent. The images are small, normally a couple hundred KB each, and I don't want to store them as files; I believe storing them as blobs is appropriate.
But if I use YapDatabase they are stored in the same table as my normal text messages. Then how can I run a query like "find all my text messages"?
Is my concern valid (storing objects of different types in the same column)? Do I need to store them in separate tables? If yes, how? If not, how do I find all my text messages easily?
The whole point of a key/value store is that it uses only the keys to identify the values. So if you want to store messages and pictures in the same store, you must ensure that they have different keys.
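To make the keying point concrete, here is a toy sketch with a plain Ruby hash standing in for the store; in YapDatabase itself the collection column gives you the same separation (e.g. a "texts" collection and an "images" collection):
msg_id = "f81d4fae"
store = {}
store["text:#{msg_id}"]  = "hello there"               # text message
store["image:#{msg_id}"] = File.binread("photo.jpg")   # picture blob
# "Find all my text messages" becomes a scan over the "text:" keys.
texts = store.select { |key, _| key.start_with?("text:") }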
If you want to store data in separate tables, use a database that actually has tables, like SQLite.
I'm building an app that requires HIPAA compliance, which, to cut to the chase, means that I can't allow for certain connections to be freely viewable in the database (patients and recommendations for them).
These tables are connected through the patients_recommendations table in my app, which worked well until I added the encryption via attr_encrypted. In an effort to cut down on the amount of encrypting and decrypting (and associated overhead), I'd like to be able to simply encrypt the patient_id in the patients_recommendations table. However, upon changing the data type to string and the column name to encrypted_patient_id, the app breaks with the following error when I try to reseed my database:
can't write unknown attribute `patient_id'
I assume this is because the join is looking for the column directly and not by going through the model (makes sense, using the model is probably slower). Is there any way that I can make Rails go through the model (where attr_encrypted has added the necessary helper methods)?
Update:
In an effort to find a work-around, I've tried adding a before_save to the model like so:
before_save :encrypt_patient_id
...
private

def encrypt_patient_id
  self.encrypted_patient_id = PatientRecommendation.encrypt(:patient_id, self.patient_id)
  self.patient_id = nil
end
This doesn't work either, however, resulting in the same error of unknown attribute. Either solution would work for me (though the first would address the primary problem), any ideas why the before_save isn't being called when created through an association?
You should probably store the PII data and the PHI data in separate DBs. Encrypt the PII data (including any associations to a provider or provider location) and hash out all of the PHI data to make things easier. As long as there are no direct associations between the two, it would be acceptable to leave the PHI data unencrypted, since it's anonymized.
Plan A
Don't set patient_id to nil in encrypt_patient_id; that column no longer exists, and the problem may go away once you stop writing to it.
Also, ending a callback with nil or false will halt the callback chain, so put an explicit true at the end of the method.
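Applied to the callback from the question, Plan A would look roughly like this (a sketch only; it assumes patient_id is still readable in memory, e.g. as attr_encrypted's virtual attribute):
def encrypt_patient_id
  self.encrypted_patient_id = PatientRecommendation.encrypt(:patient_id, patient_id)
  true  # explicit true so a falsy return doesn't halt the callback chain
end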
Plan B, rethink your options
There are more options - from database-level transparent encryption (which formally encrypts the data on disk), to encrypted filesystems for storing certain tablespaces, to flat out encryption of data in the columns.
Encrypting the join columns sounds like a road to unhappiness, for reasons ranging from reporting problems to potentially severe performance issues whenever a join is necessary.
The trouble you're currently experiencing with the seed looks like the first bump on what promises to be a bad road: in this case ActiveRecord seems confused about how to handle your association, tries to set patient_id on initialize, and breaks.
The overhead of encrypting restricted data might not be as high as you think. I'm not sure how things go for HIPAA, but for PCI you're not exactly encouraged to render the protected data on screen, so encryption incurs only a small overhead because it happens relatively rarely (business-need-to-know etc).
Also, memory is probably considered 'not at rest and not in transit', so you could in theory cache some of the clear values for limited periods of time and thus save on the decryption overhead.
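For example, a short-lived cache might look like this (assuming an in-memory cache store; whether memory counts as 'not at rest' is a question for your compliance assessor):
def cached_patient_name(patient)
  Rails.cache.fetch("patient-name/#{patient.id}", expires_in: 5.minutes) do
    patient.name  # decryption only happens on a cache miss
  end
end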
Basically, encrypting data might not be that bad, and encrypting keys in the database might be worse than you think.
I suggest we talk directly; I'm doing PCI DSS compliance work and this topic interests me.
Option: 1-way hashes for primary/foreign keys
PatientRecommendation would have a hash of patient_id - call it patient_hash - and Patient would be capable of generating the same patient_hash from its id. But I'd suggest storing the patient_hash in both tables: for Patient it would be the primary key for the join, and for PatientRecommendation the foreign key.
You then define the Rails relation in these terms, and Rails will no longer be confused about your relation scheme:
has_many :patient_recommendations, primary_key: :patient_hash, foreign_key: :patient_hash
The result is cryptographically more robust and easy for the database to handle.
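A sketch of how Patient could populate patient_hash - the HMAC and the PATIENT_HASH_KEY secret are my assumptions, chosen because an unkeyed digest of a sequential id would be trivially brute-forceable:
require "openssl"

class Patient < ActiveRecord::Base
  has_many :patient_recommendations, primary_key: :patient_hash, foreign_key: :patient_hash
  after_create :set_patient_hash

  private

  # Keyed hash of the id, written once the row (and its id) exists.
  def set_patient_hash
    digest = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), ENV.fetch("PATIENT_HASH_KEY"), id.to_s)
    update_column(:patient_hash, digest)
  end
end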
If you're adamant about not storing the patient_hash in Patient, you could use a plain SQL statement to do the relation - less convenient but workable - something along the lines of this pseudo-SQL:
JOIN ON generate_hash(patient.id) = patient_recommendations.patient_hash
Oracle, for example, has an option to create functional indexes (think CREATE INDEX on generate_hash(patient.id)), so this approach could be pretty efficient depending on your choice of database.
However, playing with join keys will complicate your life a lot, even with these measures.
I'll expand on this post later on with additional options
We've got a data warehouse design with four dimension tables and one fact table:
dimUser (id, email, firstName, lastName)
dimAddress (id, city)
dimLanguage (id, language)
dimDate (id, startDate, endDate)
factStatistic (id, dimUserId, dimAddressId, dimLanguageId, dimDateId, loginCount, pageCalledCount)
Our problem is: we want to build the fact table, which involves calculating the statistics (per userId and date range) and filling in the foreign keys.
But we don't know how, because we don't understand how to use natural keys (which seems to be the solution to our problem according to the literature we read).
I believe a natural key would be the userId, which is needed in all ETL jobs which calculate the dimension data.
But there are many difficulties:
in the ETL jobs' load() step, we do bulk inserts with INSERT IGNORE INTO to remove duplicates => we don't know which surrogate keys were generated
if we create metadata (a set of dimension_name, surrogate_key, natural_key), this will not work because of the duplicate elimination
The problem seems to be the duplicate elimination strategy. Is there a better approach?
We are using MySQL 5.1, if it makes any difference.
If your fact table is tracking logins and page calls per user, then you should have a set of source tables which track these things, and that is where you'll load your fact table data from. I would probably build the fact table at the grain of one row per user / login date - or even lower, to persist atomic data if at all possible.
Here you would then have a fact table with two dimensions - User and Date. You can persist address and language as dimensions on the fact as well, but these are really just attributes of user.
Your dimensions should have surrogate keys, but should also have the source "business" or "natural" key available - either as an attribute on the dimension itself, or through a mapping table as your colleague suggested. It's not "wrong" to use a mapping table - it does make things easier when there are multiple sources.
If you store the business keys in a mapping table, or in the dimension as an attribute, then for each row to load into the fact it's a simple lookup (usually via a join) against the dim or mapping table to get the surrogate key for the user (and then from the user to get the user's "current" address / language to persist on the fact). The date dimension usually has a surrogate key stored in YYYYMMDD or another "natural" format - you can just generate this from the date information on the source record you're loading into the fact.
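As an illustrative load - table and column names beyond those in the question (source_logins, userId as the natural key, the "current" address/language columns on dimUser) are assumptions:
require "mysql2"
client = Mysql2::Client.new(host: "localhost", username: "etl", database: "dwh")
# Resolve surrogate keys by joining source rows to dimUser on the business key;
# the YYYYMMDD date surrogate is derived directly from the source date.
client.query(<<-SQL)
  INSERT INTO factStatistic
    (dimUserId, dimAddressId, dimLanguageId, dimDateId, loginCount, pageCalledCount)
  SELECT du.id, du.currentAddressId, du.currentLanguageId,
         DATE_FORMAT(s.activityDate, '%Y%m%d'),
         SUM(s.logins), SUM(s.pageCalls)
  FROM source_logins s
  JOIN dimUser du ON du.userId = s.userId
  GROUP BY du.id, du.currentAddressId, du.currentLanguageId,
           DATE_FORMAT(s.activityDate, '%Y%m%d')
SQL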
Do not force it all into a single query; try to load the data in separate queries and merge it in some provider...
There might be a group of records in a table where only two fields vary from record to record while the rest of the fields remain the same. This calls for normalization: splitting off a table linked through a foreign-key association. But in Ruby on Rails, that would mean creating another model. So, is it still possible to lessen the use of disk space?
Maybe it is, since storing multiple values of one column in a single record would require the column to be an array of some type. But declaring the field to be of :array type results in an error. So, is there a way to work around it?
After generating a model, open the model's file. Insert one line for each field.
serialize :field_name
But ensure that the fields you are serializing are of a primitive type such as :text or :string. If a field is of another type, like :datetime, it will return an error.
This step is not complete on its own. You need one complementary step: deserialization, because at the storage level the value is kept as a YAML string starting with "---\n-", which is not suitable for array-type operations.
While reading data from the model, you need to perform the following step:
YAML.load(field_name)
where field_name refers to the field that was serialized.
The above step would return an array, on which you can perform normal array operations.
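Putting the answer together as a runnable sketch (model and field names are illustrative). One caveat worth hedging: when the attribute is declared with serialize, Rails performs this YAML round-trip automatically on read and write, so the manual YAML.load is only needed when you read the raw column value (e.g. via read_attribute_before_type_cast or raw SQL):
class Measurement < ActiveRecord::Base
  serialize :values  # requires a :text or :string column
end

m = Measurement.create!(values: [1, 2, 3])
m.reload.values  # => [1, 2, 3] - Rails deserializes the stored YAML string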