Fuzzy search with Solr and sunspot - ruby-on-rails

I have installed Solr and the Sunspot gem for my Rails 3.0 app.
My goal is to do fuzzy search.
For example, I want the search term "Chatuea Marguxa" to match "Château Margaux".
Currently only exact word matches are found, so the fuzzy search isn't working at all.
My model:
searchable do
text :winery
end
My controller:
search = Wine.search do
fulltext 'Chatuea Marguxa'
end
The Solr schema I tried, with edge n-grams:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
</fieldType>
I also tried with double metaphone:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
</analyzer>
In both cases, I got 0 results (after reindexing, of course).
What did I do wrong?

Try adding the character '~' after each word in the query, like this: Chatuea~ Marguxa~. This is the fuzzy operator implemented in Lucene: http://lucene.apache.org/core/3_6_0/queryparsersyntax.html#Fuzzy%20Searches
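If you go through Sunspot, one way to try this (a sketch only, assuming the query parser you use honors the Lucene fuzzy syntax; raw_query and fuzzy_query are illustrative names) is to append the operator to each term before handing the string to fulltext:
# Sketch: append Lucene's fuzzy operator (~) to every term of the user's query.
# Whether Solr honors it depends on the query parser Sunspot is configured to use.
raw_query = 'Chatuea Marguxa'
fuzzy_query = raw_query.split.map { |term| "#{term}~" }.join(' ')
search = Wine.search do
  fulltext fuzzy_query
end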

Some searching around revealed the fuzzily gem. From its anecdotal benchmark:
Against our whole Geonames-derived table of locations (3.2M records, about 1GB of data), on my development machine (a 2011 MacBook Pro):
searching for the top 10 matching records takes 6ms ±1
preparing the index for all records takes about 10min
the DB query overhead when changing a record is at 3ms ±2
the memory overhead (footprint of the trigrams table index) is about 300MB
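A minimal sketch of wiring it up, based on the gem's README (the Wine model and winery column mirror the question above; method names such as bulk_update_fuzzy_winery follow the gem's naming convention and may differ by version):
# Gemfile
gem 'fuzzily'
# Migration creating the trigrams table the gem indexes into
class AddTrigramsModel < ActiveRecord::Migration
  extend Fuzzily::Migration
end
# Model: make the winery column fuzzily searchable
class Wine < ActiveRecord::Base
  fuzzily_searchable :winery
end
# One-off: build the trigram index for existing records
Wine.bulk_update_fuzzy_winery
# Search: returns the best-matching records first
Wine.find_by_fuzzy_winery('Chatuea Marguxa', :limit => 10)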

Related

How to check with Ant if files contain a string a OR a string b

I'm looking for a way to make this code snippet work with only one of the three contains elements needing to match, i.e. OR instead of AND.
<condition property="warningFilesFound">
<resourcecount when="greater" count="0">
<fileset id="warnings-fileset-id" dir="${target.dir}/xref" includes="**/*.warnings">
<contains text="requires "CHARACTER","RAW" or "COLUMN" for double-byte or multi-byte(UTF8) languages. (3619)"/>
<contains text="requires "CHARACTER" or "RAW" for double-byte languages. (2363)"/>
<contains text="requires "CHARACTER", "RAW", "COLUMN" or "FIXED" for double-byte or multi-byte(UTF8) languages. (3623)"/>
</fileset>
</resourcecount>
</condition>
Only one of the three contains needs to be true, but in this snippet all three must match. Do I need a regular expression?
You should be able to wrap an <or> selector around the contains elements, something like this:
<fileset id="warnings-fileset-id" dir="${target.dir}/xref" includes="**/*.warnings">
<or>
<contains text="requires "CHARACTER","RAW" or "COLUMN" for double-byte or multi-byte(UTF8) languages. (3619)"/>
<contains text="requires "CHARACTER" or "RAW" for double-byte languages. (2363)"/>
<contains text="requires "CHARACTER", "RAW", "COLUMN" or "FIXED" for double-byte or multi-byte(UTF8) languages. (3623)"/>
</or>
</fileset>

Fluentd regex filter removes other keys

I'm getting a message into fluentd with a few keys already populated from previous stages (fluent-bit on another host). I'm trying to parse the content of the log field as follows:
# Parse app_logs
<filter filter.app.backend.app_logs>
@type parser
key_name log
<parse>
@type regexp
expression /^(?<module>[^ ]*) *(?<time>[\d ,-:]*) (?<severity>[^ ]*) *(?<file>[\w\.]*):(?<function>[\w_]*) (?<message>.*)$/
time_format %Y-%m-%d %H:%M:%S,%L
</parse>
</filter>
It works (kind of), as it extracts the fields as expected. That said, it removes all the other fields that were there before.
Example message before the filter:
filter.app.backend.app_logs: {"docker.container_name":"intranet-worker","docker.container_id":"98b7784f27f93a056c05b4c5066c06cb5e23d7eeb436a6e4a66cdf8ff045d29f","time":"2022-06-10T17:00:00.248932151Z","log":"org-worker 2022-06-10 19:00:00,248 INFO briefings.py:check_expired_registrations Checking for expired registrations\n","docker.container_image":"registry.my-org.de/org-it-infrastructure/org-fastapi-backend/backend-worker:v0-7-11","stream":"stdout","docker.container_started":"2022-06-10T14:57:27.925959889Z"}
After the filter, the message looks like this (it's a slightly different message, but the same stream):
filter.app.backend.app_logs: {"module":"mksp-api","severity":"DEBUG","file":"authToken.py","function":"verify_token","message":"Token is valid, checking permission"}
So only the parsed fields are kept, the rest is removed. Can I somehow use that filter to add the fields to the message, instead of replacing it?
Actually, this scenario is described in the documentation; it's not part of the regexp documentation but of the corresponding parser filter documentation:
reserve_data
Keeps the original key-value pair in the parsed result.
Therefore, the following configuration works:
<filter filter.app.backend.app_logs>
@type parser
key_name log
reserve_data true
<parse>
@type regexp
expression /^(?<module>[^ ]*) *(?<time>[\d ,-:]*) (?<severity>[^ ]*) *(?<file>[\w\.]*):(?<function>[\w_]*) (?<message>.*)$/
time_format %Y-%m-%d %H:%M:%S,%L
</parse>
</filter>

Stackdriver custom multiline logging, time format

I've been trying to set up a custom multiline log parser to get logs into Stackdriver with some readable fields. Currently it looks like this:
<source>
type tail
read_from_head true
path /root/ansible.log
pos_file /var/lib/google-fluentd/pos/ansible.pos
time_format "%a %b %e %T %Z %Y"
format multiline
format_firstline /Started ansible run at/
format1 /Started ansible run at (?<timestart>[^\n]+)\n(?<body>.*)/
format2 /PLAY RECAP.*/
format3 /ok=(?<ok>\d+)\s+changed=(?<changed>\d+)\s+unreachable=(?<unreachable>\d+)\s+failed=(?<failed>\d+).*/
format4 /Finished ansible run at (?<timeend>[^\n]+)/
tag ansible
</source>
It's done to the specifications at http://docs.fluentd.org/v0.12/articles/parser_multiline, and it works. But it works without a proper timestamp - timestart and timeend are just simple fields in the JSON. So in this current state, the time_format setting is useless, because I don't have a time variable among the regexes. This does aggregate all the variables I need, logs show up in Stackdriver when I run the fluentd service, and all is almost happy.
However, when I change one of those time variables' name to time, trying to actually assign a Stackdriver timestamp to the entry, it doesn't work. The fluentd log on the machine says that the worker started and parsed everything, but logs don't show up in the Stackdriver console at all.
timestart and timeend look like Fri Jun 2 20:39:58 UTC 2017 or something along those lines. The time format specifications are at http://ruby-doc.org/stdlib-2.4.1/libdoc/time/rdoc/Time.html#method-c-strptime and I've checked and double checked them too many times and I can't figure out what I'm doing wrong.
EDIT: another detail: when I try to parse out the time variable, while the logs don't show up in the Stackdriver console, the appropriate tag (in this case ansible) shows up in the list of tags. It's just that the results are empty.
You're correct that the Stackdriver logging agent looks for the timestamp in the 'time' field, but it uses Ruby's Time.iso8601 to parse that value (falling back on Time.at on error). The string you quoted (Fri Jun 2 20:39:58 UTC 2017) is not in either of those formats, so it fails to parse it (you could probably see the error in /var/log/google-fluentd/google-fluentd.log).
You could add a record_transformer plugin to your config to change your parsed date to the right format (hint: enable_ruby is your friend). Something like:
<filter foo.bar>
@type record_transformer
enable_ruby
<record>
time ${Time.strptime(record['time'], '%a %b %d %T %Z %Y').iso8601}
</record>
</filter>
should work...

How to concat multiple files to a single file at the same time

I am trying to concatenate multiple files (say 15 txt files) into a single file at the same time using separate Ant calls.
Say there are 15 concat calls running at the same time.
However, the output file is not what I expected:
the data in the output file is corrupted.
Does anyone have an idea how to solve this problem?
Example:
Input 1:
a=1
b=2
c=3
Input 2:
d=4
e=5
f=6
Output:
a=1
b=2
d=4
e
c=3=5
f=6
You can do this with the concat task, which takes resource collections such as filesets as nested elements, allowing you to concatenate all the files in a single task call. Example:
<concat destfile="${build.dir}/output.txt">
<fileset file="${src.dir}/input1.txt" />
<fileset file="${src.dir}/input2.txt" />
</concat>

Solr with partial search not working when query is wrapped in quotes containing words separated by space

Here's my search query:
name_text_partial_all:"hello world"
The field has these words in the index for one document: hello world
Here's my schema definition for this type:
<fieldtype class="solr.TextField" name="text_partial_all" positionIncrementGap="100" omitNorms="false" stored="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^\d\sa-zA-Z]" replacement=""/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="30"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="[^\d\sa-zA-Z]" replacement=""/>
<filter class="solr.LengthFilterFactory" min="2" max="30" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldtype>
This is not finding the document. Any clue why?
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="30"/> will generate ngram tokens which would be at separate positions.
For e.g. Hello World when it goes through the NGramFilterFactory the tokens Hello and World would be at separate positions.
You can check on analysis for the Hello World, the tokens Hello is at position 10 and world is at position 20.
So a query looking for exact phrase name_text_partial_all:"hello world" would not work while name_text_partial_all:"hello world"~9 would work.
You need to either use slop or position filter to maintain the same positions.
