Rails/Sphinx: search excerpts are also showing search conditions - ruby-on-rails

This gives the results I was expecting:
result = Content.search("minerva", :conditions => {:publication_code => "12345678"})
result.first.element_type #=> "chapter"
result.first.excerpts.text #=> "outdated practice, The Owl of <span class=\"match\">Minerva</span> talks about the “unrealistic ‘Cartesian … major premise The Owl of <span class=\"match\">Minerva</span> details the innumerable combinations possible … concepts?” See The Owl of <span class=\"match\">Minerva</span>, p. 319. “Of course, ideally"
However: if I'm including search conditions that are literally present in the text, for instance the word "section" (which is a content element type) this is what I'm getting:
result = Content.search("minerva", :conditions => {:publication_code => "12345678", :element_type => "section"})
result.first.element_type #=> "section"
result.first.excerpts.text #=> "November 2001. The Owl of <span class=\"match\">Minerva</span>, p. 107. provides as follows: … foreign diplomatic or consular property, <span class=\"match\">section</span> 177 would place the United … source of leverage. In addition, <span class=\"match\">section</span> 177 could seriously affect our"
"Section", literally, is now also considered a match. I'm not getting what's the cause of this response.
Update to illustrate the problem some more:
Here's a query that finds a search term ("certification") near the term I'm using in the search conditions ("section", to limit my search to element_types that are sections).
result = Content.search("certification", :conditions => {:publication_code => "12345678", :element_type => "section"})
The text that gets returned is this (shortened to match following excerpts, and bold text mine):
result.first.text
[…] and operation of section 10 and the section 10 certification process. He noted […]
[…] object of the certification procedure introduced by section 10(1)(b) was not to […]
[…] domestic court. The certification procedure provided for by section 10 is similarly […]
Calling result.first.excerpts.text gives me the following. As you can see, everywhere in the text where either the term 'certification' or 'section' is found, it's marked as a match.
" … and operation of <span class=\"match\">section</span> 10 and the <span class=\"match\">section</span> 10 <span class=\"match\">certification</span> process. He noted: … object of the <span class=\"match\">certification</span> procedure introduced by <span class=\"match\">section</span> 10(1)(b) was not to … domestic court. The <span class=\"match\">certification</span> procedure provided for by <span class=\"match\">section</span> 10 is similarly … "

The excerpts pane uses all query terms when generating output, which includes any supplied conditions, because they end up being part of the Sphinx query. Your second example, from Sphinx's perspective, is "minerva @publication_code 12345678 @element_type section".
An alternative is to have your own excerpter with just the query you want:
excerpter = ThinkingSphinx::Excerpter.new 'content_core', 'minerva', {}
excerpter.excerpt! result.first.text
The first argument when building the excerpter is the index name, the second is the search query to match against, and the third is options.
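To see why a condition term ends up highlighted, the behaviour described above can be imitated in plain Ruby. This is only a toy sketch, not how Sphinx or Thinking Sphinx works internally: the `highlight` helper, its regex, and the sample text are assumptions made for illustration, and the `@field` markers in the sample query follow Sphinx's extended query syntax.

```ruby
# Toy highlighter (hypothetical helper, not part of Thinking Sphinx).
# It wraps every query word found in the text and ignores field markers.
def highlight(text, query)
  # Tokens such as "@element_type" name a field; they are not search words.
  terms = query.split.reject { |token| token.start_with?('@') }
  pattern = /\b(?:#{terms.map { |t| Regexp.escape(t) }.join('|')})\b/i
  text.gsub(pattern) { |match| %(<span class="match">#{match}</span>) }
end

text = 'The Owl of Minerva, p. 107. In addition, section 177 could seriously affect our'

# The query once the conditions have been folded into it.
with_conditions = highlight(text, 'minerva @publication_code 12345678 @element_type section')
# The query containing only the word the user typed.
term_only = highlight(text, 'minerva')

puts with_conditions
puts term_only
```

With the conditions folded in, "section" is wrapped as a match; with the bare term it is not, which is the effect a separate excerpter built from the bare query is meant to achieve.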

I think this is just a coincidence.
Try with a dataset that doesn't have section in the text to see if this also happens.

Related

Rails array INCLUDE with only distinct words

I'm building a profanity search function which needs to find instances of an array of profane words in a long string of text.
One could do a simple include like:
if profane_words.any? {|word| self.name.downcase.include? word}
...
end
This results in a positive match if ANY of the array of profane words are present anywhere in the text.
However, if a word like 'hell' is considered profane, this would produce a positive match against "Hell's Angels" or "Hell's Kitchen", which is undesirable.
How can the above search be modified to only produce positive results against distinct words or phrases? For example, "Hell Angels" returns positive but "Hell's Angels" returns negative.
To be clear, this means a profane word should only count as a match when it is not immediately preceded or followed by another letter or an apostrophe.
What about using a regex ?
profane_words.any? { |word| self.name.downcase.match?(/#{Regexp.escape(word)}(?!')/) }
Examples:
"hell's angels".match?(/hell(?!')/) # => false
"hell angel".match?(/hell(?!')/) # => true
(?!') is a negative lookahead, meaning it won't match if the word has a ' right after it. If you'd like to exclude other characters you can add them to the list with pipes, e.g. (?!'|") won't match when the word is followed by ' or ".
See https://www.regular-expressions.info/lookaround.html for reference.
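And you could make it more performant by building a single regex. Note the non-capturing group, which makes the lookahead apply to every word rather than only to the last alternative:
self.name.downcase.match?(/(?:#{Regexp.union(profane_words).source})(?!')/)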
if profane_words.any? {|word| self.name.downcase.split(' ').include? word} ... end
You should definitely use a Regex containing all your profane words followed by a space or period. For example:
> "Hell's angels".match(/(hell|shit)[ .]/i)
=> nil
> "Hell angels".match(/(hell|shit)[ .]/i)
=> #<MatchData "Hell " 1:"Hell">
> "Hell's angels shit".match(/(hell|shit)[ .]/i)
=> nil
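The approaches above each miss a case (a word at the very end of the string, or a word embedded in a longer word). A minimal sketch that combines them, assuming a short sample word list and a hypothetical `profane?` helper, uses a lookbehind and a lookahead so that a match is rejected when a word character or an apostrophe sits on either side:

```ruby
# Sample list, for illustration only.
PROFANE_WORDS = %w[hell shit].freeze

# Reject a match when a letter, digit, underscore or apostrophe touches
# the word on either side. Using .source keeps the outer /i flag in effect.
PROFANITY = /(?<![\w'])(?:#{Regexp.union(PROFANE_WORDS).source})(?![\w'])/i

# Hypothetical helper name; returns true when a standalone profane word is present.
def profane?(text)
  text.match?(PROFANITY)
end

puts profane?("Hell's Angels")      # possessive form, not flagged
puts profane?("Hell Angels")        # standalone word, flagged
puts profane?("Hell's angels shit") # word at end of string, flagged
puts profane?("Seashells")          # embedded in a longer word, not flagged
```

Whether an apostrophe should always exempt a word is a policy choice; the character classes in the two lookarounds are the place to adjust it.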

Postgres sort alphanumeric only

Trying to sort ascending A->Z on some podcast titles. I only want titles starting with A-Z and 0-9 sorted first; everything else should come last:
.order('title ASC')
is giving me odd results at the start and end. The majority of the results in the middle are fine:
> ["\"Success Living\" - Dr. Leigh-Davis",
"\"The Real Deal\" with Dr. Leigh-Davis",
"#WeThePeople_Live",
"Alley Oop podcast",
"Always Listening: Podcast Reviews",
... ### everything here is fine ### ...
"Your Mom's House",
"Zen Dude Fitness",
"podCast411",
"talk2Cleo"]
(first three, last two are odd.)
Replace .order('title ASC') with this longer argument:
.order("
CASE WHEN lower(title) BETWEEN 'a' AND 'zzzzz'
OR title BETWEEN '0' AND '99999'
THEN lower(title)
ELSE concat('zzzzz', lower(title))
END")
This will sort case insensitive (lower); when values start with a digit or letter they are sorted normally, and all the other values will be sorted as if they were prefixed by 'zzzzz', forcing them to the end of the sort order.
Demo in SQL Fiddle
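The same ordering rule can be sketched in plain Ruby, which may help when checking the expected result against a small sample. This is an in-memory illustration only, not a replacement for the SQL; the `titles` sample is taken from the question and the two-part sort key is an assumption about how to express the rule.

```ruby
titles = [
  "\"Success Living\" - Dr. Leigh-Davis",
  "\"The Real Deal\" with Dr. Leigh-Davis",
  "#WeThePeople_Live",
  "Alley Oop podcast",
  "Always Listening: Podcast Reviews",
  "Your Mom's House",
  "Zen Dude Fitness",
  "podCast411",
  "talk2Cleo"
]

# Sort key: [0 or 1, lowercased title].
# 0 when the title starts with a letter or digit, 1 otherwise,
# so everything else is pushed to the end; ties are broken case-insensitively.
sorted = titles.sort_by do |title|
  [title.match?(/\A[[:alnum:]]/) ? 0 : 1, title.downcase]
end

puts sorted
```

The first element of the key plays the role of the 'zzzzz' prefix in the CASE expression, without relying on no title sorting after 'zzzzz'.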
With Regular Expression
This solution combines the above idea with the idea of PJSCopeland (to use a regular expression). Again the strings starting with non-alphanumerical characters are sorted after those that start with alphanumerical characters:
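.order("regexp_replace(lower(title), '([^[:alnum:] ])', 'zzz\\1', 'gi')")
The \1 back-references the non-alphanumerical character that was matched, so all of them get prefixed with zzz. It is written as \\1 here because the SQL sits inside a double-quoted Ruby string, where a single backslash would be consumed by Ruby before reaching Postgres.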
Demo in SQL Fiddle
Disclaimer: I haven't tested this. It comes from the documentation for Postgres 9.1.
I have an inexact solution - the difference being that punctuation will simply be ignored, and your entries will end up like this:
.order("regexp_replace(title, '\\W', '', 'gi')") # ASC is optional
=> ["Alley Oop podcast",
"Always Listening: Podcast Reviews",
"podCast411",
"\"Success Living\" - Dr. Leigh-Davis",
"talk2Cleo",
"\"The Real Deal\" with Dr. Leigh-Davis",
"Your Mom's House",
"#WeThePeople_Live",
"Zen Dude Fitness"]
regexp_replace is the Postgres equivalent of Ruby [g]sub.
\W means 'any non-word character' and will match anything other than A-Z, a-z, 0-9, and _.
If you want to ignore underscores as well, change \W to [\W_].
'' is what you replace the matches with: an empty string.
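the g flag means 'all matches', and the i flag makes the pattern match case-insensitively. Note that i only affects what the pattern matches, not the sort order; whether upper- and lower-case titles sort together depends on the column's collation, so wrap the expression in lower() if needed.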

Porting POSIX regex to Lua pattern - unexpected results

I have hard time porting POSIX regex to Lua string patterns.
I'm dealing with html response from which I would like to filter checkboxes
that are checked. Particularly I'm interested in value and name fields of
each checked checkbox:
Here are examples of checkboxes I'm interested in:
<input class="rid-2 form-checkbox" id="edit-2-access-comments" name="2[access comments]" value="access comments" checked="checked" type="checkbox">
<input class="rid-3 form-checkbox real-checkbox" id="edit-3-administer-comments" name="3[administer comments]" value="administer comments" checked="checked" type="checkbox">
as opposed I'm not interested in this (unchecked checkbox):
<input class="rid-2 form-checkbox" id="edit-2-access-printer-friendly-version" name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">
Using POSIX regex I've used following pattern in Python: pattern=r'name="(.*)" value="(.*)" checked="checked"' and it just worked.
My first approach in Lua was simply to use this: pattern = 'name="(.-)" value="(.-)" checked="checked"' but it gave strange results (first capture was as expected but the second one returned lots of unneeded html).
I've also tried following pattern:
pattern = 'name="(%d?%[.-%])" value="(.-)"%s?(c?).-="?c.-"%s?type="checkbox"'
This time, the second capture returned the content of value, but all
checkboxes were matched (not only those with the checked="checked" field).
For completeness, here's the Lua code (snippet from my Nmap NSE script) that
attempts to do this pattern matching:
pattern = 'name="(.-)" value="(.-)" checked="checked"'
data = {}
for name, value in string.gmatch(res.body, pattern) do
stdnse.debug(1, string.format("%s %s", name, value))
end
I've used following pattern in Python: pattern=r'name="(.*)" value="(.*)" checked="checked"' and it just worked.
Python re is not POSIX compliant and . matches any char but a newline char there (in POSIX and Lua, . matches any char including a newline).
If you want to match a string that has 3 attributes above one after another, you should use something like
local pattern = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'
Why not [^\r\n]-? Because in case there are two tags on one line with the first having the first and/or second attribute and the second having the second and third or just second (and even if there is a third tag with the third attribute while the first one contains the first two attributes), there will be match, as [^\r\n] matches < and > and can "overfire" across the tags.
Note that [^"]*, a negated bracket expression, will only match 0+ chars other than " thus restricting the matches within one tag.
See Lua demo:
local rx = 'name="([^"]*)"%s+value="([^"]*)"%s+checked="checked"'
local s = '<li name="n1"\nvalue="v1"><li name="n2"\nvalue="v1" checked="checked"><li name="n3"\nvalue="v3" checked="checked">'
for name, value in string.gmatch(s, rx) do
print(name, value)
end
Output:
n2 v1
n3 v3
(Updated based on comments) The pattern doesn't work when a line without checked="checked" comes before a line with checked="checked" in the input, because the .- expression captures unnecessary parts. There are several ways to avoid this; one suggested by @EgorSkriptunoff is to use ([^"]*) as the pattern; another is to exclude new lines ([^\r\n]-). The following example prints what you expect:
local s = [[
<input class="rid-2 form-checkbox" id="edit-2-access-comments" name="2[access comments]" value="access comments" checked="checked" type="checkbox">
<input class="rid-2 form-checkbox" id="edit-2-access-printer-friendly-version" name="2[access printer-friendly version]" value="access printer-friendly version" type="checkbox">
<input class="rid-3 form-checkbox real-checkbox" id="edit-3-administer-comments" name="3[administer comments]" value="administer comments" checked="checked" type="checkbox">
]]
local pattern = 'name="([^\r\n]-)" value="([^\r\n]-)" checked="checked"'
for name, value in string.gmatch(s, pattern) do
print(name, value)
end
The output:
2[access comments] access comments
3[administer comments] administer comments

Regular expression in Ruby - extracting from Gutenberg

I am fairly new to Ruby and I am struggling with a regular expression to seed a database from this text file: http://www.gutenberg.org/cache/epub/673/pg673.txt.
I want the <h1> tags as the words for the dictionary database, and the <def> tags as the definitions.
I could be quite off base here (I've only ever seeded a db with copy and paste ;):
require 'open-uri'
Dictionary.delete_all
g_text = open('http://www.gutenberg.org/cache/epub/673/pg673.txt')
y = g_text.read(/<h1>(.*?)<\/h1>/)
a = g_text.read(/<def>(.*?)<\/def>/)
Dictionary.create!(:word => y, :definition => a)
As you can see, there are often more than one <def> for each <h1>, which is fine, as I can just add columns to my table for definition1, definition2, etc.
But what would this regular expression look like to be sure that each definition is in the same row as the immediately preceding <h1> tag?
Thanks for any help!
Edit:
Okay, so this is what i am trying now:
doc.scan(Regexp.union(/<h1>(.*?)<\/h1>/, /<def>(.*?)<\/def>/)).map do |m, n|
p [m,n]
end
How do I get rid of all of the nil entries?
It seems like regular expression is the only way of making it through the whole document without stopping part way through when an error is encountered...at least after a couple attempts at other parsers.
what I came to (with a local extract for sandbox use):
require 'pp' # For SO to pretty print the hash at end
h1regex = "h1>(.+)<\/h1"    # Define the h1 regex (avoid empty tags)
defregex = "def>(.+)<\/def" # Define the def regex (avoid empty tags)
# Initialize vars
defhash = {}
key = nil
last = nil
File.open("./gut.txt") do |f|
  f.each_line do |l|
    newkey = l[/#{h1regex}/i, 1] # get the next key (or nil)
    if newkey != last && newkey != nil # if we changed key, update the hash (some redundant h1 entries with other defs)
      key = last = newkey # update current key
      defhash[key] = []   # init the new entry to empty array
    end
    definition = l[/#{defregex}/i, 1] # get the def on this line (or nil)
    # we did match a def, add it to the current key array (skip defs seen before any h1)
    defhash[key] << definition if key && definition
  end
end
pp defhash # print the result
Which gives this output:
{"A"=>
[" The first letter of the English and of many other alphabets. The capital A of the alphabets of Middle and Western Europe, as also the small letter (a), besides the forms in Italic, black letter, etc., are all descended from the old Latin A, which was borrowed from the Greek <spn>Alpha</spn>, of the same form; and this was made from the first letter (<i>Aleph</i>, and itself from the Egyptian origin. The <i>Aleph</i> was a consonant letter, with a guttural breath sound that was not an element of Greek articulation; and the Greeks took it to represent their vowel <i>Alpha</i> with the \\'84 sound, the Ph\\'d2nician alphabet having no vowel symbols.",
"The name of the sixth tone in the model major scale (that in C), or the first tone of the minor scale, which is named after it the scale in A minor. The second string of the violin is tuned to the A in the treble staff. -- A sharp (A#) is the name of a musical tone intermediate between A and B. -- A flat (A&flat;) is the name of a tone intermediate between A and G.",
"In each; to or for each; <as>as, \"twenty leagues <ex>a</ex> day\", \"a hundred pounds <ex>a</ex> year\", \"a dollar <ex>a</ex> yard\", etc.</as>",
"In; on; at; by.",
"In process of; in the act of; into; to; -- used with verbal substantives in <i>-ing</i> which begin with a consonant. This is a shortened form of the preposition <i>an</i> (which was used before the vowel sound); as in <i>a</i> hunting, <i>a</i> building, <i>a</i> begging. \"Jacob, when he was <i>a</i> dying\" <i>Heb. xi. 21</i>. \"We'll <i>a</i> birding together.\" \" It was <i>a</i> doing.\" <i>Shak.</i> \"He burst out <i>a</i> laughing.\" <i>Macaulay</i>. The hyphen may be used to connect <i>a</i> with the verbal substantive (as, <i>a</i>-hunting, <i>a</i>-building) or the words may be written separately. This form of expression is now for the most part obsolete, the <i>a</i> being omitted and the verbal substantive treated as a participle.",
"Of.",
" A barbarous corruption of <i>have</i>, of <i>he</i>, and sometimes of <i>it</i> and of <i>they</i>."],
"Abalone"=>
["A univalve mollusk of the genus <spn>Haliotis</spn>. The shell is lined with mother-of-pearl, and used for ornamental purposes; the sea-ear. Several large species are found on the coast of California, clinging closely to the rocks."],
"Aband"=>["To abandon.", "To banish; to expel."],
"Abandon"=>
["To cast or drive out; to banish; to expel; to reject.",
"To give up absolutely; to forsake entirely ; to renounce utterly; to relinquish all connection with or concern on; to desert, as a person to whom one owes allegiance or fidelity; to quit; to surrender.",
"Reflexively : To give (one's self) up without attempt at self-control ; to yield (one's self) unrestrainedly ; -- often in a bad sense.",
"To relinquish all claim to; -- used when an insured person gives up to underwriters all claim to the property covered by a policy, which may remain after loss or damage by a peril insured against."]}
Hope it can help.
Late edit: there's probably a better way, I'm not a ruby expert. I was just giving a usual advice while reviewing, but as it seems no one has answered this is how I would do it.
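As for the nil entries from the Regexp.union attempt in the question: each match fills only one of the two capture groups, so the other is always nil. A minimal sketch of one way to handle that, assuming a small inline sample instead of the real Gutenberg file and that both tags open and close without attributes, is to branch on which group matched and attach each definition to the most recent headword:

```ruby
# Inline sample standing in for the downloaded text (assumption for illustration).
doc = <<~TEXT
  <h1>Aband</h1>
  <def>To abandon.</def>
  <def>To banish; to expel.</def>
  <h1>Abalone</h1>
  <def>A univalve mollusk of the genus Haliotis.</def>
TEXT

entries = {}
current = nil

# One pass over the whole document; exactly one capture group is non-nil per match.
doc.scan(%r{<h1>(.+?)</h1>|<def>(.+?)</def>}m) do |word, definition|
  if word
    current = word
    entries[current] ||= []        # keep definitions if a headword repeats
  elsif current
    entries[current] << definition # attach to the preceding headword
  end
end

p entries
```

From there, each pair could be passed to something like Dictionary.create!, storing the definitions in whatever column layout the table uses.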

parsley.js telephone digits input validating with spaces

I have an input for telephone number.
I would like to enter numbers in this format: 0175 6565 6262 (with spaces). But if I type the number with spaces I get a validation error, and if I type it without spaces I get no error.
Here my HTML Input:
<input type="text" data-parsley-minlength="6" data-parsley-minlength-message="minlength six number" data-parsley-type="digits" data-parsley-type-message="only numbers" class="input_text" value="">
Hope someone can help me?
That's a great answer, but it's a bit too narrow for my needs. Input field should be tolerant of all potential inputs – periods, hyphens, parentheses, spaces in unexpected places, plus signs for international folk – and using this document from Microsoft detailing what numbers IE11 should accept, I've come up with this:
data-parsley-pattern="^[\d\+\-\.\(\)\/\s]*$"
Every number in that list passes the test with flying colours. Enjoy!
If you want your input to accept a string like "nnnn nnnn nnnn" you should use a regular expression.
For example, you can use the following HTML:
<input type="text" name="phone" value="" data-parsley-pattern="^\d{4} \d{4} \d{4}$" />
With this pattern the input will only be valid when you have fourdigits«space»fourdigits«space»fourdigits
You can test or tweak the regular expression and test it here: http://regexpal.com/
If you will use this pattern multiple times in your project I suggest you create a custom validator (see http://parsleyjs.org/doc/index.html#psly-validators-craft)
