I'm looking for a simple way to generate 'Did you mean ...' style search tips when a search over the title of a record doesn't hit on a substring match because of slightly different punctuation or phrasing for a Rails 3 app.
Most commonly, I want to generate hits for 'Alpha: Beta' when the user searches for 'Alpha Beta', 'Alpha & Beta' for 'Alpha and Beta' and 'Alpha Beta' for 'The Alpha Beta' e.g.. The same goes for the opposite direction for the first two examples, because my current substring searching will catch the latter case already. I would prefer to do this without specific logic for each of the above examples though, as there may be other variants I can't think of right now.
I'd also prefer to shy away from a solution that requires me to popular a hidden field of the record with alternate spellings as records are generated, which is then searched over instead of the publicly displayed one.
I'm guessing that a proper full text search like Sphinx/Thinking Sphinx would accomplish this, but I want to check if there's an easier solution for my limited scope problem. Ideally something that automatically generated this hidden field by striping out common words like 'the', 'and' and punctuation like '&' and ':' from both the record title and search term and the title field and then does the search. The actual order of the remaining words needn't necessarily have to match when juggle around ('Alpha Beta Gamma' can match 'Alpha, Beta, Gamma' but not 'Alpha, Gamma, Beta').
This solution doesn't meet all of your requirements, but I believe it's close enough to be worth mentioning - the excellent "scoped_search" gem, available at https://github.com/wvanbergen/scoped_search
It implements a simple query language where a search for 'alpha beta' matches results containing all those words, rather than the exact phrase - see the wiki at https://github.com/wvanbergen/scoped_search/wiki/query-language for more information on what it supports.
It generates SQL queries behind the scenes, so doesn't require a separate search daemon like Sphinx.
However, I don't believe it does anything similar to stripping out common words. Perhaps you could get some mileage by manually stripping out your common words, and then getting scoped_search to search for your revised term?
Related
I am having a hard time identifying the rules around this syntax, I see it and kind of understand it, but would like to find the documentation. I'm not sure how to google it though.
match (g:Group)<-[*1..2]-(s)
^^^^
^^^^ I would like to know more about this rule that limits path length
I understand that the above says "I only want to find paths that are between 1 and 2 edge traversals long", and it it really is that simple that's great; but I as I understand it the use of a star makes it a wildcard on which edges it can follow, while [:EdgeTypeIWant1..2] doesn't appear to be correct syntax. I probably have other questions as well that proper documentation (if I could find it) would be helpful with.
They are called variable length patterns.
The star is only an indicator that you're specifying a pattern of variable length, it's not a wildcard.
Your syntax would be: [:EdgeTypeIWant*1..2]
I am trying to "highlight" references to law statutes in some text I'm displaying. These references are of the form <number>-<number>-<number>(char)(char), where:
"number" may be whole numbers 18 or decimal numbers 12.5;
the parenthetical terms are entirely optional: zero or one or more;
if a parenthetical term does exist, there may or may not be a space between the last number and the first parenthesis, as in 18-1.3-401(8)(g) or 18-3-402 (2).
I am using the regex
((\d+(\.\d+)*-){2}(\d+(\.\d+)*))( ?(\([0-9a-zA-Z]+\))*)
to find the ranges of these strings and then highlight them in my text. This expression works perfectly, 100% of the time, in all of the cases I've tried (dozens), in BBEdit, and on regex101.com and regexr.com.
However, when I use that exact same expression in my code, on iOS 12.2, it is extremely hit-or-miss as to whether a string matching the regex is actually found. So hit-or-miss, in fact, that a string of the exact same form of two other matches in a specific bit of text is NOT found. E.g., in this one paragraph I have, there are five instances of xxx-x-xxx; the first and the last are matched, but the middle three are not matched. This makes no sense to me.
I'm using the String method func range(of:options:range:locale:) with options of .regularExpression (and nil locale) to do the matching. I see that iOS uses ICU-compatible regexes, whereas these other tools use PCRE (I think). But, from what I can tell, my expression should be compatible and valid for my case with the ICU parsing. But, something is definitely different, and I cannot figure out what it is.
Anyone? (I'm going to give NSRegularExpression a go and see if it behaves differently, but I'd still like to figure out what's going on here.)
I'm not hoping for clues on picking key words, there are guides about that already.
I'm hoping to get a decisive idea, with a reference to documentation or statements from Apple of:
The correct keyword list syntax.
How they are employed for matching in Apple's back end.
For example:
Should they be comma delineated: "ham,chips,beans"?
Or space delineated: "ham chips beans"?
Or comma and space: "ham, chips, beans"?
If customers might search for me by a phrase, such as "hungry cat", should I include "hungry, cat, hungry cat, hungry-cat"? Or is "hungry cat" sufficient?
I believe it's not necessary to add plural forms: "cats" isn't needed, provided I have "cat". But what about other stemming? If people search for "eating cats", is "eat, cat" in my search terms enough?
Thanks.
There are two votes to close stating this question is "opinion based". I've adjusted the question to make it clear that I am not looking for opinion, but statements or documentation from Apple.
I'm making a simple search engine, and as I go through the documents that are going to be indexed, I want to automatically identify the words that should be ignored (such as "and" and "the").
The only simple method I can think of is just ignore words of up to a certain length (if they're not lengthy enough, then they're considered stop words). Any other method would probably have to require data mining (I'm open to suggestions).
I would prefer a method that I can use as i go through the documents, but I'm open to the other suggestions. I just need a simple method.
Short answer is: don't. As in don't bother, but instead strip them from the query and/or weigh them appropriately by TF-IDF.
Quoting the Xapian manual: http://xapian.org/docs/stemming.html
It has been traditional in setting up IR systems to discard the very commonest words of a language - the stopwords - during indexing. A more modern approach is to index everything, which greatly assists searching for phrases for example. Stopwords can then still be eliminated from the query as an optional style of retrieval. In either case, a list of stopwords for a language is useful.
Getting a list of stopwords can be done by sorting a vocabulary of a text corpus for a language by frequency, and going down the list picking off words to be discarded.
I am looking to write a basic profanity filter in a Rails based application. This will use a simply search and replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before, is there a CSV file or some database out there where a list of profanity words can be imported into my database? We are submitting the words that we will replace the profanities with on our own. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone that is looking to try and tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make the slow your filter will be.
Understand the scunthrope and clbuttic problems and determine how you will handle these. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ascii art and unicode to replace characters (/ = v - those are slashes). There are a lot of unicode characters that look like English characters and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context, and profane in another so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And, there are myriads of ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regex::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved, and low-likelihood of accuracy and success, especially over the long-term, or the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb