Ruby/Rails: Hashing array of participant phone numbers to uniquely identify a group/MMS text conversation? - ruby-on-rails

While I'm using Ruby/Rails to solve this particular problem, the specific issue is not unique to Ruby.
I'm building an app that can send group/mms messages to multiple people, and then processes those texts when the others reply.
The app will have a different number for each record, and each record can be involved in multiple group conversations.
For example, record_1 can be involved in a conversation with user_1, user_2, but can also be involved in a separate conversation with user_2, user_3, and record_2 can have a separate conversation with user_1, user_2.
When I send a message the fields resemble:
{
  from: "1234566789",
  to: [
    "1111111111",
    "2222222222",
    ...
  ],
  body: "..."
}
Where the from is my app number, and the to [] is an array of phone numbers for everyone else involved in the conversation.
When one of the other participants replies to the group message, I'll get a webhook from my text messaging provider that has the from as that person's phone number and the to [] would include my app number and everyone else's numbers.
The identifier for a conversation is the unique combination of the phone numbers involved.
However, having an array of ["1234567890", "1111111111", "2222222222"] is difficult to work with, and I would like a string representation that I can index in my database and quickly find.
If I have a to: ["1234567890", "1111111111", "2222222222"] array of the phone numbers, I'm thinking about using Digest::MD5.hexdigest(to.sort.to_s).
This would give me a unique identifier such as 49a5a960c5714c2e29dd1a7e7b950741, that I can index in my DB and use to uniquely reference conversations.
Are there any concerns with using the MD5 hash to solve my specific problem? Anytime I have the same numbers involved in a conversation, I want it to produce the same hash. Does MD5 guarantee the same result given the same ordered input?
Is there another approach to uniquely identify conversations by the participants?

Yes, MD5 does give you that guarantee, unless someone is trying to attack your system. It is possible to create colliding MD5 hashes but they will never happen by accident.
So if in your situation the hash will only ever be benign (i.e. created by your code, not created by someone trying to mount an attack of some kind), then using MD5 is fine.
Or you could switch to SHA256 instead of MD5, which doesn't carry this risk.
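Here is a minimal sketch of that approach in Ruby; the conversation_key helper name is just for illustration, and it uses SHA-256 as suggested above (swap in Digest::MD5 if you prefer):

require 'digest'

# Build a deterministic key from every participant number (the app number
# plus the "to" numbers). Sorting makes the key order-independent, and
# joining with a delimiter avoids ambiguity between e.g. ["12", "34"]
# and ["1", "234"].
def conversation_key(numbers)
  Digest::SHA256.hexdigest(numbers.sort.join(","))
end

conversation_key(["1234567890", "1111111111", "2222222222"])
conversation_key(["2222222222", "1234567890", "1111111111"]) # same digest

You can then store the digest in an indexed column and look conversations up by it.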

Related

What exactly does the Teams CallRecord version field mean?

I've set up receiving of Teams CallRecords into Splunk and now stuck in a process of understanding them. I thought that one CallRecord represents one unique Teams call (like Mr. A dialed Mr. B, Mr. B answered, they talked and eventually hung up - that's a CallRecord) and documentation suggests that it is so: "callRecord resource type represents a single peer-to-peer call or a group call between multiple participants", "id - String - Unique identifier for the call record. Read-only."
But what I see is many CallRecords with the same id but different versions (the "version" field). These records can have different start and end DateTimes and different lastModifiedDateTime values, and some versions have null organizer* and participant* fields. I've seen anywhere from 1 to 66 versions per id.
So here are my questions:
Does one CallRecord represent one unique conversation? If yes, what is its unique identifier: id+version? Then why are there records with the same id, different versions, and identical other data except lastModifiedDateTime (those records are roughly the same and would lead to double counting in the final report)? And why are there records with null organizer* fields?
Does the set of all CallRecords with the same id and different versions represent one call? If I merge all such records into one, I get multivalue fields for startDateTime, endDateTime, and the other DateTimes. Which values should I use for accounting: min(startDateTime) and max(endDateTime), or something else?
Is there any deep-dive Microsoft documentation on this versioning? Frankly, I'm completely lost here.
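Purely to illustrate the two interpretations the question is weighing (call_records standing in for an array of parsed CallRecord hashes is an assumption, not anything taken from the Graph documentation):

require 'time'

records_by_id = call_records.group_by { |r| r["id"] }

# Interpretation 1: the latest version supersedes the earlier ones.
latest_per_call = records_by_id.transform_values do |versions|
  versions.max_by { |r| r["version"].to_i }
end

# Interpretation 2: all versions together describe one call, so take the
# outermost start/end across versions.
spans_per_call = records_by_id.transform_values do |versions|
  starts = versions.map { |r| r["startDateTime"] }.compact.map { |t| Time.parse(t) }
  ends   = versions.map { |r| r["endDateTime"] }.compact.map { |t| Time.parse(t) }
  { start: starts.min, end: ends.max }
end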

Creating Unique Access Codes Per Email Address in Rails

I'm looking to create a system for a classified ads-type site that allows users to create ad postings without going through any kind of account registration process. I want to have a unique access code associated with each email address that users use to make posts. This access code will later be used by users to gain access to the set of posts that they've made in the past.
So these access codes should be not only unique but also secure / unguessable. Any suggestions for what I can look into for implementing this with Ruby on Rails? I haven't been able to find much in researching the topic - most related discussion seems to be around encrypting passwords, hashing, etc, so any general direction is appreciated.
Thanks!
From the Ruby documentation for SecureRandom.hex(n=nil):
::hex generates a random hexadecimal string.
The argument n specifies the length, in bytes, of the random number to be generated. The length of the resulting string is twice n.
If n is not specified, 16 is assumed. It may be larger in the future.
The result may contain 0-9 and a-f.
p SecureRandom.hex #=> "eb693ec8252cd630102fd0d0fb7c3485"
p SecureRandom.hex #=> "91dc3bfb4de5b11d029d376634589b61"
You can generate a hash with a salt to create an identifier for each email address. You should make sure two people cannot get the same hash.
It's worth mentioning that this random, unique access code will be much longer and harder to remember than a username and password.
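As a rough sketch of the SecureRandom idea, assuming an ActiveRecord Posting model with email and access_code columns (both names are made up for illustration):

require 'securerandom'

# Generate an access code that is unguessable and not already in use.
# A unique database index on access_code is still a good idea to guard
# against two requests generating the same code at the same time.
def generate_access_code
  loop do
    code = SecureRandom.hex(16) # 32 hex characters
    return code unless Posting.exists?(access_code: code)
  end
end

posting = Posting.create!(email: "user@example.com",
                          access_code: generate_access_code)

The user would then be shown (or emailed) the access code and could later present it to list all postings created under that email address.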

Searching an array of email addresses for similarity between any two address

I'm looking for a way to search through a database and find close similarities between email addresses. The only solution I can think of is O(N^2) and involves a nested loop: grab an email address, then check it against all the rest, over and over. This will be extremely time-consuming since I'm dealing with 100,000 email addresses in the database. If it makes a difference, this will be implemented as a background job for a Ruby on Rails app.
Is there any way to do this?
I'm really only looking for basic similarities. An example would be
docjohnson@gmail.com
docjohnson1@gmail.com
docjohnson333@gmail.com
docjohnson@hotmail.com
I would want those all marked similar to each other.
Thanks for the help!
EDIT: I'm using a Mongo database connected to ROR via Mongoid, if that changes the game at all.
Compute a "signature" for each email address; for instance, a signature might be the first five characters of the username part of the address. Sort all email addresses to bring together those with identical signatures; if your signature algorithm does a good job, each set of addresses sharing a signature should belong to the same person. You'll have to tune the signature algorithm based on your data and your definition of similarity.
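A rough Ruby sketch of the signature idea; the five-character prefix is just one possible signature function:

# Group addresses by a crude "signature": the first five characters of the
# local part, lowercased. Each group with more than one member is a
# candidate set of similar addresses. This is a single O(N) grouping pass
# instead of the O(N^2) nested loop.
def signature(email)
  email.split("@").first.to_s.downcase[0, 5]
end

emails = ["docjohnson@gmail.com", "docjohnson1@gmail.com",
          "docjohnson333@gmail.com", "docjohnson@hotmail.com",
          "someoneelse@gmail.com"]

similar_groups = emails.group_by { |e| signature(e) }
                       .values
                       .select { |group| group.size > 1 }
# => [["docjohnson@gmail.com", "docjohnson1@gmail.com",
#      "docjohnson333@gmail.com", "docjohnson@hotmail.com"]]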
I suggest that you start by "canonicalizing" the e-mails:
1. Strip trailing digits from the username part, e.g., john123 -> john.
2. Maybe drop some punctuation from the username, e.g., john.smith -> johnsmith.
3. Drop some hosts from the domain part, e.g., mail.foo.com -> foo.com; but not math.mit.edu -> mit.edu.
After you do 1 and 2, collect the original emails into a hash table mapping the canonical usernames to the original ones, so that when you are done, you only need to iterate over the canonical usernames.
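A sketch of steps 1 and 2 plus the hash-table collection step, assuming emails is the array of addresses from the earlier example (the domain-shortening step is left out because it needs a real list of which subdomains are safe to drop):

# Map each canonical form to the original addresses that produced it.
def canonicalize(email)
  local, domain = email.downcase.split("@", 2)
  local = local.sub(/\d+\z/, "")  # 1. strip trailing digits: john123 -> john
  local = local.delete(".")       # 2. drop punctuation: john.smith -> johnsmith
  "#{local}@#{domain}"
end

buckets = Hash.new { |h, k| h[k] = [] }
emails.each { |e| buckets[canonicalize(e)] << e }
# Each value of buckets is now a group of similar addresses.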

Ruby on Rails: Is generating obfuscated and unique identifiers for users on a website using SecureRandom safe and reasonable?

Internally, my website stores Users in a database indexed by an integer primary key.
However, I'd like to associate Users with a number of unique, difficult-to-guess identifiers that will be each used in various circumstance. Examples:
One for a user profile URL: So a User can be found and displayed by a URL that does not include their actual primary key, preventing the profiles from being scraped.
One for a no-login email unsubscribe form: So a user can change their email preferences by clicking through a link in the email without having to login, preventing other people from being able to easily guess the URL and tamper with their email preferences.
As I see it, the key characteristics I'll need for these identifiers is that they are not easily guessed, that they are unique, and that knowing the key or identifier will not make it easy to find the other.
In light of that, I was thinking about using SecureRandom::urlsafe_base64 to generate multiple random identifiers whenever a new user is created, one for each purpose. As they are random, I would need to do database checks before insertion in order to guarantee uniqueness.
Could anyone provide a sanity check and confirm that this is a reasonable approach?
The method you are using relies on a secure random generator, so guessing the next URL even knowing one of them will be hard. This is a key aspect to keep in mind when generating random sequences: non-secure random generators can become predictable, and having one value can help predict what the next one will be. You are probably OK on this one.
Also, urlsafe_base64 says in its documentation that the default random length is 16 bytes. This gives you 8^16 different possible values (2.81474977 × 10^14). This is not a huge number. For example, it means that a scraper making 10,000 requests a second would be able to try all possible identifiers in about 900 years. That seems acceptable for now, but computers keep getting faster, and depending on the scale of your application this could become a problem in the future. Just making the first parameter bigger solves this issue, though.
Lastly, something you should definitely consider: the possibility of your database being leaked. Even if your identifiers are bulletproof, your database might not be, and an attacker might be able to get a list of all identifiers. You should definitely hash the identifiers in the database with a secure hashing algorithm (with appropriate salts, the same as you would do for a password). Just to give you an idea of how important this is: with a recent GPU, SHA-1 can be brute-forced at a rate of 350,000,000 tries per second. A 16-byte key (the default for the method you are using) hashed with SHA-1 would be guessed in about 9 days.
In summary: the algorithm is good enough, but increase the length of keys and hash them in the database.
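To make that last point concrete, here is a hedged sketch of storing only a digest of the token; the unsubscribe_token_digest column name is an assumption, and the answer above suggests salting as you would a password, which this minimal version leaves out:

require 'digest'
require 'securerandom'

# user: an already-loaded User record.
# Generate the token once, put it in the emailed URL, and store only a digest.
raw_token = SecureRandom.urlsafe_base64(32)
user.update!(unsubscribe_token_digest: Digest::SHA256.hexdigest(raw_token))

# Later, in the controller handling the unsubscribe link:
user = User.find_by(unsubscribe_token_digest: Digest::SHA256.hexdigest(params[:token]))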
Because the generated ids are not related to any other data, they are going to be very hard (effectively impossible) to guess. To quickly validate their uniqueness and find users, you'll want to index them in the DB.
You'll also need a function that checks for uniqueness and returns an unused id, something like:
def generate_id(field_name)
  # Keep generating until we find a value no existing user already has.
  loop do
    rnd = SecureRandom.urlsafe_base64
    return rnd unless User.exists?(field_name => rnd)
  end
end
As a last security check, verify the correspondence between the identifier and the user's information, at least the email, before making any changes.
That said, it seems a good approach to me.

RegExp as table entries

I'm building an application that takes inputs from SMS text thru Twilio. I'd like to build a table the matches the incoming SMS body with the appropriate response.
For example, imagine I'm building an NFL text message thing.
Someone texts in 'Redskins' and we text back, "The Redskins play at FedEx field"
Someone texts in 'Colts' and we text back, "The Colts are the pride of Indiana."
Here's the tricky part:
Of course, our Rails app is going to need to interpret the incoming team names through Regular Expressions, as many people will text in: Redskins or REDSKINS or REDSKIN or Redskin or REDskin.....
With one or two teams, one could just hardcode the RegExp and response into the controller...but with 30 teams, that seems wrong. (And with 120 entries -- say all pro sports-- even worse).
Does any one have any tips on getting the team names from the input stage, thru the DB table stage with a 'RegExp' conversion in the middle?
Thanks in advance.
For a modest number of keywords, I recommend a two-table approach with Keywords and Aliases, always stored in lower case. Convert the input to lower case too. For each Keyword (say, redskins) you manually add 5-10 variations (including the correct one) to Aliases, all of which have Alias.keyword_id = the id of the keyword. Then you simply search Alias for the user input, and if you find a match you have the keyword_id of the keyword.
It has two advantages: it's fast and easy to extend. If you log the "no matches" you'll get a list of new aliases to add to the database once. MUCH easier and more reliable than trying to do this via regex.
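A rough Rails sketch of the two-table idea (the model and column names are assumptions):

# app/models/keyword.rb -- columns: name, response
class Keyword < ApplicationRecord
  has_many :aliases
end

# app/models/alias.rb -- columns: text, keyword_id
class Alias < ApplicationRecord
  belongs_to :keyword
end

# Look up the response for an incoming SMS body, logging misses so new
# aliases can be added to the table over time.
def response_for(sms_body)
  match = Alias.includes(:keyword).find_by(text: sms_body.strip.downcase)
  if match
    match.keyword.response
  else
    Rails.logger.info("No alias match for: #{sms_body}")
    nil
  end
end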
I don't think you want regexps here. What about spelling errors? For helpfulness (esp coming from a txt msg) I think you want to allow shortenings too.
Maybe a Soundex-based library or spelling correction thing would be best. You want a nearest match algorithm not a patterned match one.
If the text message is not too long, you should first chop that into words, and then take an intersection with the list of team names.
array_of_team_names = %w(Redskins Colts ... ) # keep it all capitalized
'cOLts blah blah'.scan(/\w+/).map{|word| word.capitalize} & array_of_team_names
# => ['Colts']
If you want to handle mistypes as suggested by drysdam, or if you want to handle larger text with more accuracy, you should use some library specific to that.
I think what you are asking is "how do I avoid hardcoding a regexp into my code, since I might have a lot of them, and they are really a data element"?
If you want to do the matching with regexps, note that you can create a regexp from a string, so you could easily have a table with a column of regexps in string form. You can then dynamically build the array of Regexp objects that you'd use to search the incoming string.
The trick is what to do when you have a match. You'll need to develop a set of rules (yet another table) that basically says which response to pick based on the incoming text. For example, if your rule is simply "match based on the team name and say where they play", that's pretty easy: each regexp you search for maps to exactly one action ("The Bears play in Chicago"). If your rules are more complicated (look for the Bears, and then also check whether the word "schedule" is there, as well as "first game(s)"), then you'd need another table that maps a collection of matches to a response.
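A small sketch of that table-driven approach, assuming a Response model with pattern and reply string columns (both names are illustrative):

# Each row stores a regexp source string and the reply to send on a match.
# Regexp.new compiles the stored string at runtime.
rules = Response.pluck(:pattern, :reply).map do |pattern, reply|
  [Regexp.new(pattern, Regexp::IGNORECASE), reply]
end

def reply_for(body, rules)
  _regexp, reply = rules.find { |regexp, _| body =~ regexp }
  reply # nil when nothing matches
end

reply_for("Do the REDSKINS play today?", rules)
# => "The Redskins play at FedEx Field", given a row whose pattern is "redskins?"

For the more complicated rules described above (team name plus "schedule", etc.), the mapping table would hold a set of pattern ids per response instead of a single pattern.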
