Sphinx (via SphinxQL) match without asterisk, but not with asterisk - search-engine

I have an index in Sphinx, one of the words in this index is an article number. In this case 04.007.00964.
When I query my index like this:
SELECT * FROM myIndex WHERE MATCH('04.007.00964')
I have one result, this is as expected.
However, when I query it like this:
SELECT * FROM myIndex WHERE MATCH('*04.007.00964*')
I have zero results.
My index configuration is:
index myIndex
{
source = myIndex
path = D:\Tools\Sphinx\data\myIndex
morphology = none
min_word_len = 3
min_prefix_len = 0
min_infix_len = 2
enable_star = 1
}
I'm using v2.0.4-release
What am I doing wrong, or what dont I understand?

Because of
min_word_len = 3
The first query will be effectivly:
SELECT * FROM myIndex WHERE MATCH('007 00964')
So short words are ignored. (indexing and querying)
Edit to add: And "." is not in the default charset_table, which is why its used as a seperator.
However "*04" is not stripped, because it 3 chars,
but then there is nothing to match, because "04" will not be in the index (its shorter than the min_word_len)
... so its a unfortunate combination of word and infix lengths. Can easily fix it by making min_word_len = 2
Edit to add: or adding '.' to charset tables, so that its no longer used to separate words, therefore the whole article number is used - and is longer than both min_word_len and min_infix_len)

Related

selecting range of values based upon first few characters in spss?

I know that through
select cases if char.substr(variable_name,1,3)="I22".
I can select values based on the first # of characters but this is not exactly my question. I need to select RANGE OF values that start with few characters, here is an example of what I want:
if I have the following cases:
I22A33
I22B33
I22C33
I22D33
So I want to select I22B33 and I22C33 out of the above 4 values, so it's like a range of cases between b and c.
One way to flag any cases that meet your criteria is using INDEX and a series of OR conditions. Not particularly modular, but if you just have a couple of conditions you're searching for it could get you on your way.
Edit: These searches are case-insensitive (due to UPCASE) and search for matches at the start of the string. To search for matches anywhere within the string set the condition to > 0 (instead of = 1).
COMPUTE f_I22 = (INDEX(UPCASE(var_name),'I22B33') = 1)
OR (INDEX(UPCASE(var_name),'I22C33') = 1) .
EXE .
Assuming in this range of values that you want to select, all the values will start with either "I22B" or "I22C", you can simply use:
select cases if char.substr(variable_name,1,4)="I22B" or
char.substr(variable_name,1,4)="I22C".

How to optimize pulling multiple results of text comparison across multiple tabs in a Google Sheet? (QUERY vs FILTER vs other functions)

new as heck here and I trying to find the best way to further optimize a set of sheets functions.
The starting function was essentially 26 stacked filter functions being used to reference individual cells beneath containing names(strings), find the IDs associated with the names in these cells, pull the IDs from the "ref" sheet, and create a url that contains these IDs. The function, stripped of confidential data:
=HYPERLINK(CONCATENATE("https://url.com/stuff?ids=",if(isna(filter(ref!$B:$B,ref!$A:$A=B3))=true,"",filter(ref!$B:$B,ref!$A:$A=B3)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B4))=true,"",filter(ref!$B:$B,ref!$A:$A=B4)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B5))=true,"",filter(ref!$B:$B,ref!$A:$A=B5)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B6))=true,"",filter(ref!$B:$B,ref!$A:$A=B6)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B7))=true,"",filter(ref!$B:$B,ref!$A:$A=B7)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B8))=true,"",filter(ref!$B:$B,ref!$A:$A=B8)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B9))=true,"",filter(ref!$B:$B,ref!$A:$A=B9)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B10))=true,"",filter(ref!$B:$B,ref!$A:$A=B10)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B11))=true,"",filter(ref!$B:$B,ref!$A:$A=B11)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B12))=true,"",filter(ref!$B:$B,ref!$A:$A=B12)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B13))=true,"",filter(ref!$B:$B,ref!$A:$A=B13)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B14))=true,"",filter(ref!$B:$B,ref!$A:$A=B14)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B15))=true,"",filter(ref!$B:$B,ref!$A:$A=B15)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B16))=true,"",filter(ref!$B:$B,ref!$A:$A=B16)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B17))=true,"",filter(ref!$B:$B,ref!$A:$A=B17)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B18))=true,"",filter(ref!$B:$B,ref!$A:$A=B18)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B19))=true,"",filter(ref!$B:$B,ref!$A:$A=B19)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B20))=true,"",filter(ref!$B:$B,ref!$A:$A=B20)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B21))=true,"",filter(ref!$B:$B,ref!$A:$A=B21)),",",if(isna(filter(ref!$B:$B,ref!$A:$A=B22))=true,"",filter(ref!$B:$B,ref!$A:$A=B22)),"&morestuff=true"),"Group A")
The best I could do to optimize this was to compile the IDs in one cell (B1) with this QUERY function within a TEXTJOIN:
=TEXTJOIN(",",TRUE, QUERY(ref!$A$2:$B, "select B where A = '"&B3&"' or A = '"&B4&"' or A = '"&B5&"' or A = '"&B6&"' or A = '"&B7&"' or A = '"&B8&"' or A = '"&B9&"' or A = '"&B10&"' or A = '"&B11&"' or A = '"&B12&"' or A = '"&B13&"' or A = '"&B14&"' or A = '"&B15&"' or A = '"&B16&"' or A = '"&B17&"' or A = '"&B18&"' or A = '"&B19&"' or A = '"&B20&"' or A = '"&B21&"' or A = '"&B22&"' or A = '"&B23&"' or A = '"&B24&"' or A = '"&B25&"' or A = '"&B26&"' or A = '"&B27&"' or A = '"&B28&"' or A = '"&B29&"' "))
Then do the url generation in another cell (B2), referencing B1 for the IDs:
=HYPERLINK(CONCATENATE("https://url.com/stuff?ids=", ",", B1, "&morestuff=true"), "Group A")
I then copied these across Groups B-Z
I tried multiple variations of FILTER stacked in ARRAYFORMULA but I couldn't get it to work. I'm unsure if there is a syntax I should be using to better handle matching text that I am not quite figuring out. Simply matching by first letter is not an option as there are names that do not qualify for the groupings contained within the "ref" sheet for use in other data sets.
So my question here is really: is there an easier way to compile this where I don't have to reference each cell individually for matches on the "ref" sheet?
For an example of the sheet I am working with, this link should work:
https://docs.google.com/spreadsheets/d/1ykSldWyQnPcar9G21ljLOzm3dWC1i82kMaQJuSKHajc/edit?usp=sharing
Just use another textjoin to construct your query string as follows:
="select B"&if(counta(B3:B)=0,""," where A='")&textjoin("' or A='",true,B3:B)
Not only is this much shorter than manually typing each cell but it will also expand infinitely downward as required.

Configure Sphinx to index dash and search it with and without it

I have a record
Item id: 1, name: "wd-40"
How do I configure Sphinx to match this record on the following queries:
Item.search("wd40")
Item.search("wd-40")
To answer your title question, charset_table is what you want.
http://sphinxsearch.com/docs/current.html#charsets
But that doesnt actully solve the query of matching those two queries, indexing - wouldn't work, just be the inverse of indexing it.
Instead, you probably want ignore_chars
http://sphinxsearch.com/docs/current.html#conf-ignore-chars
First indexing:
By default, only ascii characters are indexed by Sphinx; the others are considered word separators. To fix that, you need to use the charset_table parameter to map the dash to the dash character.
Second searching:
AFAIK, it is not possible to make Sphinx to consider both searches like you are asking for. However, you can just use something like:
# in Python, but I believe is understandable
query = word
if '-' in word:
query += " | " + word.replace('-','')
Item.search(query) # if word = 'wd-40', query = 'wd-40 | wd40'

Splitting an Array in Ruby

I am fetching some data from the database as below in my ruby file:
#main1= $connection.execute("SELECT * FROM builds
WHERE platform_type LIKE 'TOTAL';")
#main2= $connection.execute("SELECT * FROM builds
WHERE platform_type NOT LIKE 'TOTAL';")
After doing this I am performing hashing and a bunch of other stuff on these results. To be clear, this does not return an array as such, but it returns some mysql2 type object. So I just map it to 2 arrays to be safe:
#arr1 = Array.new
#arr1 = #main1.map
#arr2 = Array.new
#arr2 = #main2.map
Is there any way to avoid executing 2 different queries and getting all the results in 2 different arrays by executing just one query. I basically want to split the results into 2 arrays, the first one having platform_type = TOTAL and everything else in the other one.
Also without getting into why you're doing what you're doing, I would use Enumerable#partition as such:
rows = $connection.execute('SELECT * FROM builds')
like_total, not_like_total = rows.partition { |row|
row['platform_type'] =~ /TOTAL/
}
Note that, IIRC, SQL LIKE 'TOTAL' isn't the same as Ruby's "string" =~ /TOTAL/ (which is more like LIKE '%TOTAL%' in SQL—am not sure what you need).
To answer your question without getting into why you're doing it like that:
Return them all in one query, with the extra criteria, then you can group them however you want with group_by:
all_results = $connection.execute("SELECT *, platform_type LIKE 'TOTAL' as is_like_total FROM builds").
This will give each of your results an 'is_like_total' "column" that you can group_by on.
http://ruby-doc.org/core-2.0/Enumerable.html#method-i-group_by

Thinking Sphinx : Relevance - infix vs. complete word

Using Thinking Sphinx in my rails app, I set it up to allow partial match with infix (for example, searching for "tray" would match "ashtray").
However, I'd like complete word match to have more weight (relevance) than infix match.
So, if my search for 'tray' returns these 3 results : "Silver Tray", "Ashtray" and "Some other tray" - I want the "Ashtray" to be the last result when sorting by relevance.
Is there a way to configure Sphinx to do that ?
You need to define your own ranker. Here's how the default ones look like:
SPH_RANK_PROXIMITY_BM25 = sum(lcs*user_weight)*1000+bm25
SPH_RANK_BM25 = bm25
SPH_RANK_NONE = 1
SPH_RANK_WORDCOUNT = sum(hit_count*user_weight)
SPH_RANK_PROXIMITY = sum(lcs*user_weight)
SPH_RANK_MATCHANY = sum((word_count+(lcs-1)*max_lcs)*user_weight)
SPH_RANK_FIELDMASK = field_mask
SPH_RANK_SPH04 = sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25
http://sphinxsearch.com/docs/2.0.6/weighting.html

Resources