N-gram split function for string similarity comparison

N-gram split function for string similarity comparison - f#

As part of excersise to better understand F# which I am currently learning , I wrote function to
split given string into n-grams.
1) I would like to receive feedback about my function : can this be written simpler or in more efficient way?
2) My overall goal is to write function that returns string similarity (on 0.0 .. 1.0 scale) based on n-gram similarity; Does this approach works well for short strings comparisons , or can this method reliably be used to compare large strings (like articles for example).
3) I am aware of the fact that n-gram comparisons ignore context of two strings. What method would you suggest to accomplish my goal?
//s:string - target string to split into n-grams
//n:int - n-gram size to split string into
let ngram_split (s:string, n:int) =
let ngram_count = s.Length - (s.Length % n)
let ngram_list = List.init ngram_count (fun i ->
if( i + n >= s.Length ) then
s.Substring(i,s.Length - i) + String.init ((i + n) - s.Length)
(fun i -> "#")
else
s.Substring(i,n)
)
let ngram_array_unique = ngram_list
|> Seq.ofList
|> Seq.distinct
|> Array.ofSeq
//produce tuples of ngrams (ngram string,how much occurrences in original string)
Seq.init ngram_array_unique.Length (fun i -> (ngram_array_unique.[i],
ngram_list |> List.filter(fun item -> item = ngram_array_unique.[i])
|> List.length)
)

I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help to make your implementation simpler.
Many of the operations that you need to do are already available in some F# library function for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:
open System
let ngramSplit n (s:string) =
let ngrams = Seq.windowed n s
let grouped = Seq.groupBy id ngrams
Seq.map (fun (ngram, occurrences) ->
String(ngram), Seq.length occurrences) grouped
The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram is itself the key (and so we get groups, where each group contains the same n-grams). Then we just convert n-gram to string and count the number of elements in the group.
Alternatively, you can write the entire function as a single processing pipeline like this:
let ngramSplit n (s:string) =
s |> Seq.windowed n
|> Seq.groupBy id
|> Seq.map (fun (ngram, occurrences) ->
String(ngram), Seq.length occurrences)

Your code looks OK to me. Since ngram extraction and similarity comparison are used very often. You should consider some efficiency issues here.
The MapReduce pattern is very suitable for your frequency counting problem:
get a string, emit (word, 1) pairs out
do a grouping of the words and adds all the counting together.
let wordCntReducer (wseq: seq<int*int>) =
wseq
|> Seq.groupBy (fun (id,cnt) -> id)
|> Seq.map (fun (id, idseq) ->
(id, idseq |> Seq.sumBy(fun (id,cnt) -> cnt)))
(* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2;] *)
You also need to maintain a <word,int> map during your ngram building for a set of strings. As it is much more efficient to handle integers rather than strings during later processing.
(2) to compare the distance between two short strings. A common practice is to use Edit Distance using a simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use TFIDF feature representation. Actuallym the code above is for term frequency counting, extracted from my data mining library.
(3) There are complex NLP methods, e.g. tree kernels based on the parse tree, to in-cooperate the context information in.

I think you have some good answers for question (1).
Question (2):
You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:
let cos a b =
let dot = Seq.sum (Seq.map2 ( * ) a b)
let magnitude v = Math.Sqrt (Seq.sum (Seq.map2 ( * ) v v))
dot / (magnitude a * magnitude b)
For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:
let values map = map |> Map.toSeq |> Seq.map snd
let desparse map1 map2 = Map.filter (fun k _ -> Map.containsKey k map2) map1
let distance textA textB =
let a = ngramSplit 3 textA |> Map.ofSeq
let b = ngramSplit 3 textB |> Map.ofSeq
let aValues = desparse a b |> values
let bValues = desparse b a |> values
cos aValues bValues
With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging. Then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licencing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good the accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.
Question (3):
The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, eg for comparing DNA strings.
The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Maromolecules". It's pretty old (1983), but gives good examples of how to adapt the basic algorithm to a number of applications.

Question 3:
My reference book is Computing Patterns in Strings by Bill Smyth

Related

How to combine two differently typed sequences into tuples in f#?

I'm a bit new to F#, I have mostly c# background.
I'm working with two lists/sequences that represent the same thing, but from different datasources (one is a local file, the other is a group of items in an online system.
I need to report the mismatches between the two datasets.
So far I have filtered down the two lists to contain only items that aren't have mismatches in the other dataset.
Now I want to pair them up into tuples (or anything else really) based on one of the properties so I can log the differences.
So far what I've tried is this:
let printDiff (offlinePackages: oP seq) (onLinePackages: onP seq) =
Seq.map2(fun x y -> (x, y)) offlinePackages onLinePackages
This does pair up my data, however I'd still need some logic to do the pairup based on their matching property value.
I'm looking for something like:
if offLinePackage.Label = OnlinePackage.Label then /do the matchup/
else /don't do anything/
I know I'm still stuck in my object oriented thinking, that's also why I'm asking.
Thanks in advance!

Matching sequence elements based on some equivalency function - that's called "join".
The most "straightforward" way to do a join is to get a Cartesian product and filter it down, like this:
let matchup seq1 seq2 =
Seq.allPairs seq1 seq2
|> Seq.filter (fun (x, y) -> x.someProp = y.someProp)
Or in a computation expression form:
let matchup seq1 seq2
seq {
for x in seq1 do
for y in seq2 do
if x.someProp = y.someProp then yield (x,y)
}
But this is a bit inefficient. The complexity would be O(n*m), because it iterates over all possible pairs. Fine if the lists are short, but will bite you the more you scale.
To do this more efficiently, you can use the query computation builder and its operation join, which does a hash-join (i.e. it builds a hashtable first and then matches the elements based on that):
let matchup seq1 seq2 =
query {
for x in seq1 do
join y in seq2 on (x.someProp = y.someProp)
select (x,y)
}

unexpected return type from list comprehension

I am teaching myself a bit of F# by doing a bit of simple matrix mathematics. I decided to write a set of simple functions for combining two matrices as I thought that this would be a good way of learning list comprehensions. However when I compile it my unit tests produce a type mismatch exception.
//return a column from the matrix as a list
let getColumn(matrix: list<list<double>>, column:int) =
[for row in matrix do yield row.Item(column)]
//return a row from the matrix as a list
let getRow(matrix: list<list<double>>, column:int) =
matrix.Item(column)
//find the minimum width of the matrices in order to avoid index out of range exceptions
let minWidth(matrix1: list<list<double>>,matrix2: list<list<double>>) =
let width1 = [for row in matrix1 do yield row.Length] |> List.min
let width2 = [for row in matrix2 do yield row.Length] |> List.min
if width1 > width2 then width2 else width1
//find the minimum height of the matrices in order to avoid index out of range exceptions
let minHeight(matrix1: list<list<double>>,matrix2: list<list<double>>) =
let height1 = matrix1.Length
let height2 = matrix2.Length
if height1 > height2 then height2 else height1
//combine the two matrices
let concat(matrix1: list<list<double>>,matrix2: list<list<double>>) =
let width = minWidth(matrix1, matrix2)
let height = minHeight(matrix1, matrix2)
[for y in 0 .. height do yield [for x in 0 .. width do yield (List.fold2 (fun acc a b -> acc + (a*b)), getRow(matrix1, y), getColumn(matrix2, x))]]
I was expecting the function to return a list of lists of type
double list list
However what it actually returns looks more like some kind of lambda expression
((int -> int list -> int list -> int) * double list * double list) list list
Can somebody tell me what is being returned, and how to force it to be evaluated into the list of lists that I originally expected?

There's a short answer and a long answer to your question.
The short answer
The short version is that F# functions (like List.fold2) take multiple parameters not with commas the way you think they do, but with spaces in between. I.e., you should NOT call List.fold2 like this:
List.fold2 (function, list1, list2)
but rather like this:
List.fold2 function list1 list2
Now, if you just remove the commas in your List.fold2 call, you'll see that the compiler complains about your getRow(matrix1, y) call, and tells you to put parentheses around them. (And the outer pair of parentheses around List.fold2 isn't actually needed). So this:
(List.fold2 (fun acc a b -> acc + (a*b)), getRow(matrix1, y), getColumn(matrix2, x))
Needs to turn into this:
List.fold2 (fun acc a b -> acc + (a*b)) (getRow(matrix1, y)) (getColumn(matrix2, x))
The long answer
The way F# functions take multiple parameters is actually very different from most other languages such as C#. In fact, all F# functions take exactly one parameter! "But wait," you're probably thinking right now, "you just now showed me the syntax for F# functions taking multiple parameters!" Yes, I did. What's going on under the hood is a combination of currying and partial application. I'd write a long explanation, but Scott Wlaschin has already written one, that's much better than I could have written, so I'll just point you to the https://fsharpforfunandprofit.com/series/thinking-functionally.html series to help you understand what's going on here. (The sections on currying and partial application are the ones you want, but I'd recommend reading the series in order because the later parts build on concepts introduced in earlier parts).
And yes, this "long" answer appears shorter than the "short" answer, but if you go read that series (and then the rest of Scott Wlaschin's excellent site), you'll find that it's much longer than the short answer. :-)
If you have more questions, I'll be happy to try to explain.

Apply several aggregate functions with one enumeration

Let's assume I have a series of functions that work on a sequence, and I want to use them together in the following fashion:
let meanAndStandardDeviation data =
let m = mean data
let sd = standardDeviation data
(m, sd)
The code above is going to enumerate the sequence twice. I am interested in a function that will give the same result but enumerate the sequence only once. This function will be something like this:
magicFunction (mean, standardDeviation) data
where the input is a tuple of functions and a sequence and the ouput is the same with the function above.
Is this possible if the functions mean and stadardDeviation are black boxes and I cannot change their implementation?
If I wrote mean and standardDeviation myself, is there a way to make them work together? Maybe somehow making them keep yielding the input to the next function and hand over the result when they are done?

The only way to do this using just a single iteration when the functions are black boxes is to use the Seq.cache function (which evaluates the sequence once and stores the results in memory) or to convert the sequence to other in-memory representation.
When a function takes seq<T> as an argument, you don't even have a guarantee that it will evaluate it just once - and usual implementations of standard deviation would first calculate the average and then iterate over the sequence again to calculate the squares of errors.
I'm not sure if you can calculate standard deviation with just a single pass. However, it is possible to do that if the functions are expressed using fold. For example, calculating maximum and average using two passes looks like this:
let maxv = Seq.fold max Int32.MinValue input
let minv = Seq.fold min Int32.MaxValue input
You can do that using a single pass like this:
Seq.fold (fun (s1, s2) v ->
(max s1 v, min s2 v)) (Int32.MinValue, Int32.MaxValue) input
The lambda function is a bit ugly, but you can define a combinator to compose two functions:
let par f g (i, j) v = (f i v, g j v)
Seq.fold (par max min) (Int32.MinValue, Int32.MaxValue) input
This approach works for functions that can be defined using fold, which means that they consist of some initial value (Int32.MinValue in the first example) and then some function that is used to update the initial (previous) state when it gets the next value (and then possibly some post-processing of the result). In general, it should be possible to rewrite single-pass functions in this style, but I'm not sure if this can be done for standard deviation. It can be definitely done for mean:
let (count, sum) = Seq.fold (fun (count, sum) v ->
(count + 1.0, sum + v)) (0.0, 0.0) input
let mean = sum / count

What we're talking about here is a function with the following signature:
(seq<'a> -> 'b) * (seq<'a> -> 'c) -> seq<'a> -> ('b * 'c)
There is no straightforward way that I can think of that will achieve the above using a single iteration of the sequence if that is the signature of the functions. Well, no way that is more efficient than:
let magicFunc (f1:seq<'a>->'b, f2:seq<'a>->'c) (s:seq<'a>) =
let cached = s |> Seq.cache
(f1 cached, f2 cached)
That ensures a single iteration of the sequence itself (perhaps there are side effects, or it's slow), but does so by essentially caching the results. The cache is still iterated another time. Is there anything wrong with that? What are you trying to achieve?

f# iterating over two arrays, using function from a c# library

I have a list of words and a list of associated part of speech tags. I want to iterate over both, simultaneously (matched index) using each indexed tuple as input to a .NET function. Is this the best way (it works, but doesn't feel natural to me):
let taggingModel = SeqLabeler.loadModel(lthPath +
"models\penn_00_18_split_dict.model");
let lemmatizer = new Lemmatizer(lthPath + "v_n_a.txt")
let input = "the rain in spain falls on the plain"
let words = Preprocessor.tokenizeSentence( input )
let tags = SeqLabeler.tagSentence( taggingModel, words )
let lemmas = Array.map2 (fun x y -> lemmatizer.lookup(x,y)) words tags

Your code looks quite good to me - most of it deals with some loading and initialization, so there isn't much you could do to simplify that part. Alternatively to Array.map2, you could use Seq.zip combined with Seq.map - the zip function combines two sequences into a single one that contains pairs of elements with matching indices:
let lemmas = Seq.zip words tags
|> Seq.map (fun (x, y) -> lemmatizer.lookup (x, y))
Since lookup function takes a tuple that you got as an argument, you could write:
// standard syntax using the pipelining operator
let lemmas = Seq.zip words tags |> Seq.map lemmatizer.lookup
// .. an alternative syntax doing exactly the same thing
let lemmas = (words, tags) ||> Seq.zip |> Seq.map lemmatizer.lookup
The ||> operator used in the second version takes a tuple containing two values and passes them to the function on the right side as two arguments, meaning that (a, b) ||> f means f a b. The |> operator takes only a single value on the left, so (a, b) |> f would mean f (a, b) (which would work if the function f expected tuple instead of two, space separated, parameters).
If you need lemmas to be an array at the end, you'll need to add Array.ofSeq to the end of the processing pipeline (all Seq functions work with sequences, which correspond to IEnumerable<T>)
One more alternative is to use sequence expressions (you can use [| .. |] to construct an array directly if that's what you need):
let lemmas = [| for wt in Seq.zip words tags do // wt is tuple (string * string)
yield lemmatizer.lookup wt |]
Whether to use sequence expressions or not - that's just a personal preference. The first option seems to be more succinct in this case, but sequence expressions may be more readable for people less familiar with things like partial function application (in the shorter version using Seq.map)

How To Apply a Function to an Array of float Arrays?

Let's suppose I have n arrays, where n is a variable (some number greater than 2, usually less than 10).
Each array has k elements.
I also have an array of length n that contains a set of weights that dictate how I would like to linearly combine all the arrays.
I am trying to create a high performance higher order function to combine these arrays in F#.
How can I do this, so that I get a function that takes an array of arrays (arrs is a sample), a weights array (weights), and then computed a weighted sum based on the weights?
let weights = [|.6;;.3;.1|]
let arrs = [| [|.0453;.065345;.07566;1.562;356.6|] ;
[|.0873;.075565;.07666;1.562222;3.66|] ;
[|.06753;.075675;.04566;1.452;3.4556|] |]
thanks for any ideas.

Here's one solution:
let combine weights arrs =
Array.map2 (fun w -> Array.map ((*) w)) weights arrs
|> Array.reduce (Array.map2 (+))
EDIT
Here's some (much needed) explanation of how this works. Logically, we want to do the following:
Apply each weight to its corresponding row.
Add together the weight-adjusted rows.
The two lines above do just that.
We use the Array.map2 function to combine corresponding weights and rows; the way that we combine them is to multiply each element in the row by the weight, which is accomplished via the inner Array.map.
Now we have an array of weighted rows and need to add them together. We can do this one step at a time by keeping a running sum, adding each array in turn. The way we sum two arrays pointwise is to use Array.map2 again, using (+) as the function for combining the elements from each. We wrap this in an Array.reduce to apply this addition function to each row in turn, starting with the first row.
Hopefully this is a reasonably elegant approach to the problem, though the point-free style admittedly makes it a bit tricky to follow. However, note that it's not especially performant; doing in-place updates rather than creating new arrays with each application of map, map2, and reduce would be more efficient. Unfortunately, the standard library doesn't contain nice analogues of these operations which work in-place. It would be relatively easy to create such analogues, though, and they could be used in almost exactly the same way as I've done here.

Something like this did it for me:
let weights = [|0.6;0.3;0.1|]
let arrs = [| [|0.0453;0.065345;0.07566;1.562;356.6|] ;
[|0.0873;0.075565;0.07666;1.562222;3.66|] ;
[|0.06753;0.075675;0.04566;1.452;3.4556|] |]
let applyWeight x y = x * y
let rotate (arr:'a[][]) =
Array.map (fun y -> (Array.map (fun x -> arr.[x].[y])) [|0..arr.Length - 1|]) [|0..arr.[0].Length - 1|]
let weightedarray = Array.map (fun x -> Array.map(applyWeight (fst x)) (snd x)) (Array.zip weights arrs)
let newarrs = Array.map Array.sum (rotate weightedarray)
printfn "%A" newarrs
By the way.. the 0 preceding a float value is necessary.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

N-gram split function for string similarity comparison - f#

Question 3: My reference book is Computing Patterns in Strings by Bill Smyth

Related

How to combine two differently typed sequences into tuples in f#?

unexpected return type from list comprehension

Apply several aggregate functions with one enumeration

f# iterating over two arrays, using function from a c# library

How To Apply a Function to an Array of float Arrays?

Categories

Resources