MongoID Query with Regex and escaping - ruby-on-rails

I want to know if it is necessary to escape regex in query calls with rails/mongoID ?
This is my current query:
#model.where(nice_id_string: /#{params[:nice_id_string]}/i)
I am now unsure if it is not secure enough, because of the regex.
Should i use this code below or does MongoID escape automatically query calls?
#model.where(nice_id_string: /#{Regexp.escape(params[:nice_id_string])}/i)

Of course you should escape the input. Consider params[:nice_id_string] being .*, your current query would be:
#model.where(nice_id_string: /.*/i)
whereas your second would be:
#model.where(nice_id_string: /\.\*/i)
Those do very different things, one of which you probably don't want. Someone with a sufficiently bad attitude could probably slip some catastrophic backtracking through your current version and I'm not sure what MongoDB/V8's regex engine will do with that.

Related

Parsing a date token

I would like to parse various SQL literals in ANTLR. Examples of the literal would be:
DATE '2020-01-01'
DATE '1992-11-23'
DATE '2014-01-01'
Would it be better to do the 'bare minimum' at the parsing stage and just put in something like:
date_literal
: 'DATE' STRING
;
Or, should I be doing any validation within the parser as well, for example something like:
date_literal
: 'DATE' DIG DIG DIG DIG '-' DIG DIG '-' DIG DIG
If I do the latter I'll still need to validate...even if I do a longer regex, I'll need to check things like the number of days in the month, leap years, a valid date range, etc.
What is usually the preferable method to do this? As in, how much 'validation' do you want to do in your grammar and how much in the listeners where the actual programming would be done? Additionally, are there any performance differences between doing (small) validations 'within the grammar' vs doing it in listeners/followers?
These are actually two slightly different syntaxes (the second does not specify that the date should be surrounded by 's)
Based on your example, that may be an oversight, so I'll assume you mean both to require the 's, and that your STRINGs are ' delimited.
It's a design choice, but a couple of factors to consider.
If you use the more specific grammar, then, if the user input doesn't match, you'll get the default ANTLR error message (which will be "pretty good for a generated tool", but probably a bit obtuse to your user).
As you say, you'll still have to perform further edits.
I lean toward keeping the grammar as simple as possible and doing more validation in a listener (maybe a visitor). This allows you to be as clear with your error messages as possible.
The only reason I see to not use the 'DATE' STRING rule would be if there is some other string content that would NOT be a date_literal, but would be some other, valid syntax in your language. It might be an invalid date literal, in which case, I'd use your simple rules and do the edit.

idiomatic way to do regular expression searches in rails models?

in my rails controller, i would like to do a regular expression search of my model. my googling seemed to indicate that i would have to write something like:
Model.find( :all, :condition => ["field REGEXP '?' " , regex_str] )
which is rather nasty as it implies MySQL syntax (i'm using Postgres).
is there a cleaner way of forcing rails (4 in my case) to do a regexp search on a field?
i also much prefer using using where() as it allows me to map my strong parameters (hash) directly to a query. so what i would like is something like:
Model.where( params, :match_by => { 'field': '~' } )
which would loosely translate to something like (if params['field'] = 'regex_str')
select * from models where field ~ regex_str
Unfortunately, there is no idiomatic way to do this. There's no built-in support for regular expressions in ActiveRecord. It'd be impossible to do efficiently unless each database adapter had a database-specific implementation, and not all databases support regular expression matches. Those that do don't all support the same syntax (for example, Postgres doesn't have the same regexp syntax as Ruby's Regexp class).
You'll have to roll your own using SQL, as you've noted in your question. There are alternatives, however.
For a Postgres-specific solution, check out pg_search, which uses Postgres's full text search capabilities. This is very fast and supports fuzzy searching and some pattern matching.
elasticsearch requires more setup, but is incredibly fast, with some nice gems to make your life easier. Here's a RailsCasts episode introducing it. It requires running a separate server, but it's not too hard to get started, and it's powerful. Still no regular expressions, but it's worth looking at.
If you're just doing a one-off regexp search against a single field, SQL is probably the way to go.

Regex: Matching on an exclusive either/or

I want a regex that will match for strings of RACE or RACE_1, but not RACE_2 and RACE_3. I've been on Rubular for a while now trying to figure it out, but can't seem to get all the conditions I need met. Help is appreciated.
/^RACE(_1)?$/
Rubular example here
RACE(_1)?\b
\b means the end of a word, and that prevents matching RACE in RACE_2.
You can use:
(\bRACE(_[1])?\b)
It requires the one copy of RACE, and then 0 -> N occurrences of the _[1]. In the square brackets you can include any number you want. EXAMPLE:
(\bRACE(_[12345])?\b) will match up to RACE_5. You can then customize it to even skip numbers if you want [1245] for RACE_1, RACE_2, RACE_4, RACE_5 but not RACE_3.
/RACE(?!_)|RACE_1/
Its a bit of a hack but might fit your needs
EDIT:
Here might be a more specific one that works better
/RACE(?!_\d)|RACE_1/
In both cases, you use negative lookahead to enforce that RACE cannot be followed by _ and a number, but then specifically allow it with the or statement following.
Also, if you plan on only searching for instances of said matches that are whole words, prepend/append with \b to designate word boundaries.
/\bRACE(?!_\d)|RACE_1\b/

Is there a faster way to parse hashtags than using Regular Expressions?

I am curious, is there a faster/better way to parse hashtags in a string, other than using Regular Expressions (mainly in Ruby)?
Edit
For example I want to parse the string This is a #hashtag, and this is #another one! and get the words #hashtag and #another. I am using #\S+ for my regex.
You don't show any code (which you should have) so we're guessing how you are using your regex.
#\S+ is as good of a pattern as you'll need, but scan is probably the best way to retrieve all occurrences in the string.
'This is a #hashtag, and this is #another one!'.scan(/#\S+/)
=> ["#hashtag,", "#another"]
Its should be /\B#\w+/, if you don't want to parse commas
Yes, I agree. /\B#\w+/ makes more sense.
Maybe
Hmm, ideas....
You could try s.split('#'), and then perhaps apply a regex only to actual hashtags
s.split('#').drop(1).map { |x| x[/\w+/] } --- it may or may not be any faster but it clearly is uglier
You could write a C extension that extracts hashtags
You could profile your program and see if it really needs any optimization for this case.

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or tow and have found myself at a road block with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function instead for the time being, but am curious if there is a pattern I could/should be using and i'm just missing something with patterns.
(a few edits b/c I forgot about stackoverflows formating)
(another edit to make a non-html example since it was leading to assumptions that I was attempting to parse html)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daises) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parenthesis. But it has its limits as well.
When those limits are exceeded, then I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammer for Lua, and was implemented by one of Lua's original authors so the adaptation to Lua is done quite well. A PEG allows specification of anything from simple patterns through complete language grammars to be written. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
you should NOT be trying to parse HTML with regular expressions, HTML and XML are NOT regular languages and can not be successfully manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.

Resources