Unexpected result with apoc.text.sorensenDiceSimilarity? - neo4j

A bit confused regarding string similarity using Sorensen-Dice.
Apparently it makes a difference in which order the parameters are passed.
WITH
apoc.text.sorensenDiceSimilarity("+46xxxxx2260", "+46xxxxx2226") as score1,
apoc.text.sorensenDiceSimilarity("+46xxxxx2226", "+46xxxxx2260") as score2
RETURN
score1, score2
One of these scores (i.e. similarity coefficients) comes back as 1.0, the other as 0.909090...
Does not make sense to me, but perhaps there's something with the algorithm I'm not aware of?
Any insight is appreciated.
P.S. "Neo4j Kernel", "3.5.9", "community"

This is definitely a bug and a good catch!
As an alternative, you can use the query below, which uses the APOC functions apoc.coll.toSet and apoc.coll.intersection together with the text function split. The query uses a small hack, multiplying by 10^4 inside ROUND and then dividing by 10^4, to keep 4 decimal places. If you like my answer, please vote for and accept it. Thanks.
WITH apoc.coll.toSet(split("+46xxxxx2260","")) as set1, apoc.coll.toSet(split("+46xxxxx2226","")) as set2
WITH set1, set2, apoc.coll.intersection(set1, set2) as common
RETURN ROUND(2*size(common)*10^4/(size(set1)+size(set2)))/10^4 as sorensenDiceSimilarity
Result:
0.9091
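
For reference, the Sorensen-Dice coefficient is symmetric by construction: 2·|X ∩ Y| / (|X| + |Y|). For strings it is usually computed over character bigrams. A minimal Python sketch (the helper names are just illustrative) shows that swapping the arguments cannot change the score:

def bigrams(s):
    # Multiset of character bigrams of s.
    return [s[i:i + 2] for i in range(len(s) - 1)]

def dice(a, b):
    # Sorensen-Dice similarity: 2 * |common bigrams| / (|bigrams(a)| + |bigrams(b)|).
    xs, ys = bigrams(a), bigrams(b)
    pool = ys[:]
    overlap = 0
    for g in xs:
        if g in pool:        # multiset intersection
            pool.remove(g)
            overlap += 1
    return 2.0 * overlap / (len(xs) + len(ys))

print(dice("+46xxxxx2260", "+46xxxxx2226"))  # 0.9090..., regardless of order
print(dice("+46xxxxx2226", "+46xxxxx2260"))  # 0.9090...

Both calls print the same value, in line with the character-set query above, which supports the conclusion that the order-dependent APOC result is a bug.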

Related

How to recover a valuation from a satisfiable formula, a question about models

I'm using Z3 with the ML interface. I created a formula
f(x_i)
that is satisfiable, according to the solver
Solver.mk_simple_solver ctx.
The problem is: I can get a model, but it gives me values only for some of the variables in the formula, not all of them (some of my Model.get_const_interp_e calls return None).
How is it possible that the model gives me only some of the x_i? In my understanding, if the model works for one of the values, it means the formula was satisfiable (in my case, it is), and so values should be available for all of them...
There's something I'm not understanding..
Thanks for reading me!
You should always post full examples so people can help with actual coding issues; without seeing your code, it's impossible to know what the real reason might be.
Having said that, this sounds very much like the following question: Why Z3Py does not provide all possible solutions So, perhaps the answer given there will help you.
Long story short: Z3 models will only contain values for variables that matter for the model. For anything that is not explicitly assigned, any value will do. There are ways to get "full" models as explained in that answer of course; which I'm sure is also possible from the ML interface.
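
To illustrate that behaviour, here's a small sketch using z3py rather than the ML interface (the variable names are made up for the example): a variable that doesn't matter for satisfiability gets no interpretation unless you explicitly ask for model completion.

from z3 import Ints, Solver, sat

x, y = Ints('x y')
s = Solver()
s.add(x > 5)            # y does not appear in any constraint

assert s.check() == sat
m = s.model()
print(m[x])                               # some value satisfying x > 5, e.g. 6
print(m[y])                               # None: y is irrelevant to this model
print(m.eval(y, model_completion=True))   # forces Z3 to pick a value for y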

Maxima gives crazy answer for integrate(exp(x^3))

I'm trying to learn how to use Maxima. Something goes wrong with integrate:
(%i) integrate(exp(x^3),x,1,2);
(%o) (gamma_incomplete(1/3,-8)-gamma_incomplete(1/3,-1))/3
(%i) float(%);
(%o) .3333333333333333 (- 715.7985328824436 %i - 413.26647564521807)
(%i) expand(%);
(%o) - 238.59951096081454 %i - 137.75549188173935
What do you think?
Comparing Maxima's result to Wolfram Alpha, it looks like Maxima has assumed that -x/((-x^3)^(1/3)) = 1. After debugging this for a bit, I can't tell whether that term was originally in the result and got simplified away, or whether it was never there. With that term in place, and using the principal branch of the cube root, I get 275.510983 + (epsilon)*%i, which agrees with a numerical result, namely quad_qags(exp(x^3), x, 1, 2) => 275.510983.
For the record, this integral is handled as "Type 1a" in maxima/src/sin.lisp, in the function INTEGRATE-EXP-SPECIAL.
Mathematically, I don't think there's anything fundamentally wrong with a complex answer to an exponential integration. In general, if you integrate e^(x^n) you're going to run into strange functions like the incomplete gamma function, because the answer isn't expressible in conventional functions and so has no conventional real analytic solution.
However, I think that there's definitely some inaccuracy here. Mathematica gives a different answer, much closer to a real answer, and as I ask for more accuracy, the real part appears to tend to zero, as you would expect.
If you want to integrate numerically (and it sounds like you do), you could use a different function. integrate is for analytical integration, which is why it gave you a formula rather than a number. Look up quad_qags and its friends for some really clever numerical integration functions.
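
As a cross-check on the quad_qags value quoted above, the same definite integral can be evaluated numerically outside Maxima; a quick sketch with SciPy (assuming it is installed) gives the same ~275.51:

from math import exp
from scipy.integrate import quad

# Numerically integrate e^(x^3) from 1 to 2.
value, abs_err = quad(lambda x: exp(x**3), 1, 2)
print(value)     # ~275.510983, matching quad_qags(exp(x^3), x, 1, 2)
print(abs_err)   # estimated absolute error of the quadrature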

do record_info and tuple_to_list return the same key order in Erlang?

I.e, if I have a record
-record(one, {frag, left}).
Is record_info(fields, one) always going to return [frag, left]?
Is tl(tuple_to_list(#one{frag = "Frag", left = "Left"})) always going to be ["Frag", "Left"]?
Is this an implementation detail?
Thanks a lot!
The short answer is: yes, as of this writing it will work. The better answer is: it may not work that way in the future, and the nature of the question concerns me.
It's safe to use record_info/2, although relying on the order may be risky, and frankly I can't think of a situation where doing so makes sense, which suggests you may be solving the problem the wrong way. Can you share more details about what exactly you are trying to accomplish, so we can help you choose a better method? It could be that simple pattern matching is all you need.
As for the example with tuple_to_list/1, I'll quote from "Erlang Programming" by Cesarini and Thompson:
"... whatever you do, never, ever use the tuple representations of records in your programs. If you do, the authors of this book will disown you and deny any involvement in helping you learn Erlang."
There are several good reasons why, including:
Your code will become brittle - if you later change the number of fields or their order, your code will break.
There is no guarantee that the internal representation of records will continue to work this way in future versions of Erlang.
Yes, the order is always the same, because records are represented by tuples, for which order is an essential property. See also my other answer about records, with examples: Syntax Error while accessing a field in a record
Yes, in both cases Erlang will retain the 'original' order. And yes, it's an implementation detail, as it's not specifically addressed in the function spec or documentation, though it's a pretty safe bet it will stay like that.

Best practice for determining the probability that 2 strings match

I need to write code to determine if 2 strings match when one of the strings may contain a small deviation from the second string, e.g. "South Africa" v "South-Africa" or "England" v "Enlgand". At the moment, I am considering the following approach:
1. Determine the percentage of characters in string 1 that match those in string 2.
2. Determine the true probability of the match by combining the result of 1 with a comparison of the lengths of the 2 strings, e.g. although all the characters in "SA" are found in "South Africa", it is not a very likely match, since "SA" could be found in a range of other country names as well.
I would appreciate hearing what current best practice is for performing such string matching.
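
One quick way to prototype steps 1 and 2 above in Python is difflib.SequenceMatcher: its ratio() returns 2*M/T, the number of matching characters scaled by the combined length of the two strings, so the length comparison is already built in. A small sketch with the strings from the question:

from difflib import SequenceMatcher

def similarity(a, b):
    # ratio() = 2*M/T, where M is the number of matching characters
    # and T is the total length of both strings.
    return SequenceMatcher(None, a, b).ratio()

print(similarity("South Africa", "South-Africa"))  # high: only one character differs
print(similarity("England", "Enlgand"))            # still fairly high (transposed letters)
print(similarity("SA", "South Africa"))            # low: the length mismatch pulls it down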
You can look at Levenshtein distance. This is the edit distance between two strings: identical strings have distance 0, strings such as kitten and sitten have distance 1, and so on. The distance is the minimal number of simple operations (insertions, deletions and substitutions) that transform one string into the other.
More information and an algorithm in pseudo-code are given at the link.
I also remember that this topic was mentioned in Game programming gems: volume 6: Article 1.6 Closest-String Matching Algorithm
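
For reference, here is a minimal dynamic-programming sketch of Levenshtein distance in Python (the function name is just illustrative):

def levenshtein(a, b):
    # Minimal number of insertions, deletions and substitutions turning a into b.
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitten"))              # 1, as in the example above
print(levenshtein("England", "Enlgand"))            # 2: a transposed-letter typo costs two edits
print(levenshtein("South Africa", "South-Africa"))  # 1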
To make fuzzy string matching work well, it's important to know the context of the strings. When it's just about small typos, Levenshtein distance can be good enough. When it's about misheard sounds, you can use a phonetic algorithm like Soundex or Metaphone.
Most of the time, you need a combination of the following algorithms, plus some more specific hand-written logic.
Needleman-Wunsch
Soundex
Metaphone
Levenshtein distance
Bitap
Hamming distance
There is no single best fuzzy string matching algorithm. It all depends on the context it's used in, so you need to tell us what you want to use the string matching for.
Don't reinvent the wheel. Wikipedia describes the Levenshtein algorithm, which gives a metric for what you want to do.
http://en.wikipedia.org/wiki/Levenshtein_distance
There's also Soundex, but that might be too simplistic for your requirements.
Use of Soundex proved to work nicely for me:
With a small tweak or two to the implementation, Soundex matching can even check across languages whether two strings in different languages sound the same.
Objective-C Soundex implementation:
http://www.cocoadev.com/index.pl?NSStringSoundex
I've found an Objective-C implementation of the Levenshtein Distance Algorithm here. It works great for me.

infix to postfix conversion and evaluation

I have a complex problem: I am getting formulas from the database and I need to evaluate them. I chose to convert them to postfix and then evaluate them. The problem is that my formulas look like this:
roundoff(vd,2);
udV=lookup(uv*dse,erd);
ude=if(er>es)?sr:ss;
Can anyone suggest a solution for these types of conversions and evaluations?
No, not without some more clarification from you. Perhaps you could tell us what sort of technology you are using and what some, at least, of your functions mean. As it stands I recommend that you use Mathematica because it's probably powerful enough to tackle this type of problem. If you don't have access to Mathematica, perhaps you could hook in to Wolfram Alpha for evaluations.
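
That said, for the plain arithmetic and comparison parts of formulas like these, the usual route is the shunting-yard algorithm. Below is a minimal Python sketch of infix-to-postfix conversion plus postfix evaluation; the operator table and tokenizer are simplifying assumptions, and it does not handle the roundoff/lookup function calls or the ?: construct shown in the question.

import re

# Precedence table for the operators this sketch supports (a simplifying
# assumption, not the full grammar of the formulas above).
PRECEDENCE = {'>': 1, '<': 1, '+': 2, '-': 2, '*': 3, '/': 3}

def to_postfix(expr):
    # Shunting-yard: convert a simple infix expression to a postfix token list.
    tokens = re.findall(r'\w+|[()<>+\-*/]', expr)
    output, stack = [], []
    for tok in tokens:
        if tok in PRECEDENCE:
            while stack and stack[-1] in PRECEDENCE and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]:
                output.append(stack.pop())
            stack.append(tok)
        elif tok == '(':
            stack.append(tok)
        elif tok == ')':
            while stack and stack[-1] != '(':
                output.append(stack.pop())
            stack.pop()                      # discard the '('
        else:
            output.append(tok)               # operand: variable name or number
    while stack:
        output.append(stack.pop())
    return output

def evaluate(postfix, variables):
    # Evaluate a postfix token list against a dict of variable values.
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a / b,
           '>': lambda a, b: a > b, '<': lambda a, b: a < b}
    stack = []
    for tok in postfix:
        if tok in ops:
            b, a = stack.pop(), stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(variables[tok] if tok in variables else float(tok))
    return stack[0]

print(to_postfix("uv * dse + 2"))                                 # ['uv', 'dse', '*', '2', '+']
print(evaluate(to_postfix("uv * dse + 2"), {'uv': 3, 'dse': 4}))  # 14.0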
