How to find the smallest n values in a distributed dask array

I have a distributed dask array with shape (2400,2400) and chunksize (100,100). I thought I could use topk(-n) to find the smallest n values. However, it returns an array of shape (2400,n), so it appears to find the smallest n in each row. Is there a way to use topk to get the smallest n values across the entire array?
One idea is to call topk twice, once for each axis.
>>> dist
dask.array<pow, shape=(2400, 2400), dtype=float64, chunksize=(100, 100)>
>>> dist.topk(-5,axis=0).topk(-5,axis=1).compute()
array([[ 0. , 2620.09503644, 2842.15200157, 2955.08409356,
3163.49458669],
[3660.67698657, 3670.4457495 , 3700.09837707, 3717.09052889,
4002.86497399],
[4125.89820524, 4139.44658137, 4250.50420539, 4331.01304547,
4402.14606754],
[4328.22966119, 4378.25193428, 4507.94409903, 4522.4913488 ,
4555.06860541],
[4441.58755402, 4560.95625938, 4576.39333974, 4682.06215251,
4765.11531865]])

One idea is to call topk twice, once for each axis.
Sounds good to me!
You might consider flattening the array first, but I can't see an advantage to this over what you've already found.
x.flatten().topk(...)
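A small NumPy stand-in can make the difference between the two approaches concrete. The sketch below is not dask code: it mimics topk(-k, axis=...) with np.sort on a tiny in-memory array, and the helper name smallest_k and the sample values are made up for illustration. It suggests that calling topk once per axis leaves a k-by-k block that still contains the k smallest values (here spread over two rows), while flattening first returns exactly k values.

```python
import numpy as np

def smallest_k(a, k, axis):
    """Stand-in for dask's a.topk(-k, axis=axis): the k smallest along axis, ascending."""
    return np.take(np.sort(a, axis=axis), np.arange(k), axis=axis)

# Made-up data: the two smallest values (0 and 1) share a column.
a = np.array([[0.0, 9.0, 9.0],
              [1.0, 9.0, 9.0],
              [9.0, 5.0, 9.0]])
k = 2

# One call per axis: a (k, k) block, not k values.
per_axis = smallest_k(smallest_k(a, k, axis=0), k, axis=1)

# Flatten first: exactly the k smallest values of the whole array.
flat = smallest_k(a.flatten(), k, axis=0)

# The (k, k) block can be reduced to the same answer with one more pass.
reduced = smallest_k(per_axis.flatten(), k, axis=0)

print(per_axis)
print(flat, reduced)
```

With the dask array from the question, the corresponding calls would presumably be dist.flatten().topk(-5), or one more flatten().topk(-5) applied to the 5-by-5 result.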

Related

Understanding FastRP vs scaleProperties

I am trying to understand the difference between these two steps, and the error I am receiving. I followed this tutorial to practice KNN with my own data (https://towardsdatascience.com/create-a-similarity-graph-from-node-properties-with-neo4j-2d26bb9d829e)
During the process we project our graph of interest; mine contains three node properties: bd_load, weight, and length of organisms. In the example we use the code below to create scaledProperties embeddings from the 3 variables.
Project graph
//(5) project graph of interest
CALL gds.graph.project('bd_graph',
'node_sim',
'*',
{nodeProperties:['bd_load', 'weight', 'length']})
Scale variables of interest between 0-1 for future Euclidean distance calculation
//(6) add scalar 0-1
CALL gds.alpha.scaleProperties.mutate('bd_graph',
{nodeProperties:['bd_load', 'weight', 'length'],
scaler:'MinMax',
mutateProperty:'scaledProperties'})
YIELD nodePropertiesWritten
We then can run KNN based on euclidean distance
//(8) project relationship to graph
CALL gds.knn.mutate("bd_graph",
{nodeProperties: {scaledProperties: "EUCLIDEAN"},
topK: 15,
mutateRelationshipType: "IS_SIMILAR",
mutateProperty: "similarity",
similarityCutoff: 0.6409912109375,
sampleRate:1,
randomSeed:42,
concurrency:1}
)
However, as I continue up the learning curve with Neo4j and FastRP, I am trying to understand the difference between scaleProperties and FastRP. Today I tried to create graph embeddings for my 3 variables using FastRP with 8 dimensions on my projected graph, without running the scaled property embeddings. My thought was that increasing the dimensions would be better for finding similarities between nodes. The code below runs fine and produces an embedding vector with 8 elements.
FastRP
CALL gds.fastRP.mutate(
'bd_graph',
{
embeddingDimension: 8,
mutateProperty: 'fastrp-embedding',
featureProperties: ['bd_load', 'weight', 'length']
}
)
YIELD nodePropertiesWritten
But when I run the code below
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
I receive an error:
Invalid input '{': expected "+" or "-" (line 4, column 22 (offset: 97))
nodeProperties:{fastrp-embedding:"EUCLIDEAN"},
Does the embedding length have to match the number of variables in the node? Am I using FastRP correctly, and is my understanding right that I create embeddings within nodes and then calculate Euclidean distance for a similarity score?

I am glad you are finding the tutorial helpful and getting into GDS!
Map keys in Cypher must be strings. https://neo4j.com/docs/cypher-manual/current/syntax/maps/
The - in your property name fastrp-embedding is not valid in an unquoted key; the parser reads it as a minus sign. If you enclose the property name in back ticks, the special character is treated as part of the map key. This should work for you.
CALL gds.knn.stats("bd_graph",
{
nodeProperties:{`fastrp-embedding`:"EUCLIDEAN"},
topK:10,
sampleRate:1,
randomSeed:42,
concurrency:1
}
) YIELD similarityDistribution
RETURN similarityDistribution
The recommended format for Neo4j property names is camel case. If you name your property fastrpEmbedding instead of fastrp-embedding, you would not need to use the back ticks.
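If the KNN query is assembled from client code, the quoting rule can be applied automatically. The helper below is a hypothetical sketch: the names cypher_key and node_properties_map, and the ASCII-only pattern for "plain" names, are assumptions and not part of Neo4j or GDS. It leaves plain names alone and wraps anything else in back ticks, doubling any back tick inside the name.

```python
import re

# Assumption: treat only ASCII letters, digits and underscores as "plain" names.
_PLAIN_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def cypher_key(name):
    """Return name unchanged if it is a plain identifier, else wrap it in back ticks."""
    if _PLAIN_NAME.fullmatch(name):
        return name
    # A back tick inside a quoted name is escaped by doubling it.
    return "`" + name.replace("`", "``") + "`"

def node_properties_map(prop, metric):
    """Build the nodeProperties map fragment used in the KNN call."""
    return "{" + cypher_key(prop) + ': "' + metric + '"}'

fragment = node_properties_map("fastrp-embedding", "EUCLIDEAN")
print(fragment)
```

A camel-case name such as fastrpEmbedding passes through unquoted, which matches the naming recommendation above.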

Using cv.matchTemplate to find multiple best matches

I am using the function cv.matchTemplate to try to find template matches.
result = cv.matchTemplate(img, templ, match_method)
After I run the function I have a bunch of values in result. I want to filter them to find the best n matches. The data in result is just a large array of numbers, so I don't know what criteria to filter on. Using extremes = cv.minMaxLoc(result, None) filters the result in an undesired way before converting the values to locations.
The match_method is cv.TM_SQDIFF. I want to:
Filter the results down to the best matches
Use the results to obtain the locations
How can I achieve this?

You can threshold the result of matchTemplate to find locations with a sufficient match. This tutorial should get you started. Read at the bottom of the page for finding multiple matches.
import cv2 as cv
import numpy as np
h, w = templ.shape[:2]  # template height and width
threshold = 0.2  # choose to suit the matching method and the scale of the scores
loc = np.where(result <= threshold)  # (rows, cols) of the scores that pass the filter
for pt in zip(*loc[::-1]):  # pt is the (x, y) top-left corner of a match
    cv.rectangle(img, pt, (pt[0] + w, pt[1] + h), (0, 0, 255), 2)
Keep in mind that the method you use determines how you filter. cv.TM_SQDIFF tends to zero as the match quality increases, so setting the threshold closer to zero filters out worse matches. The opposite is true for the cv.TM_CCORR, cv.TM_CCORR_NORMED, cv.TM_CCOEFF and cv.TM_CCOEFF_NORMED matching methods, where a higher score is better (the normalized variants tend to 1).

The above answer does not find the best N matches as the question asked. It filters by a threshold, leaving open the (likely) possibility that more than N results, or none at all, beat the threshold.
To find the N 'best matches' we're looking for the N highest numbers in a 2d array and retrieving their indexes so we know the location. (For cv.TM_SQDIFF, where lower is better, take the N lowest instead.) We can use numpy.argpartition to find the indexes of the N highest values in a 1d array, and numpy.ndarray.flatten with numpy.unravel_index to go back and forth between a 2d and 1d array, like so:
import numpy as np
find_num = 5
result = cv.matchTemplate(img, templ, match_method)
# indexes of the find_num highest values in the flattened array (unordered)
idx_1d = np.argpartition(result.flatten(), -find_num)[-find_num:]
# convert back to 2d: a tuple of (row indexes, column indexes)
idx_2d = np.unravel_index(idx_1d, result.shape)
From here you have the locations of the top 5 matches: idx_2d[0] holds the row (y) values and idx_2d[1] the column (x) values.
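Since the question uses cv.TM_SQDIFF, where lower scores are better, the same trick applies to the N lowest values. The sketch below uses a small made-up score array in place of a real matchTemplate result, and the function name best_n_locations is an assumption. It returns (x, y) points sorted from best to worst.

```python
import numpy as np

def best_n_locations(result, n, lower_is_better=True):
    """Return the n best (x, y) locations in a 2-D score map, best first."""
    scores = result.ravel()
    if not lower_is_better:
        scores = -scores  # turn "highest wins" into "lowest wins"
    # argpartition moves the n smallest scores to the front, in no particular order.
    idx = np.argpartition(scores, n - 1)[:n]
    # Sort just those n candidates from best to worst.
    idx = idx[np.argsort(scores[idx])]
    rows, cols = np.unravel_index(idx, result.shape)
    return [(int(x), int(y)) for x, y in zip(cols, rows)]

# Made-up stand-in for cv.matchTemplate(img, templ, cv.TM_SQDIFF)
result = np.array([[9.0, 8.0, 7.0, 0.5],
                   [6.0, 0.1, 5.0, 4.0],
                   [3.0, 2.0, 0.3, 1.0]])

points = best_n_locations(result, 3)
print(points)
```

Each point is the top-left corner of a match, so it can be passed to cv.rectangle in the same way as pt in the thresholding loop. For the correlation-based methods, pass lower_is_better=False.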

if (freq) x$counts else x$density length > 1 and only the first element will be used

For my thesis I have to calculate the number of workers at risk of substitution by machines. I have calculated the probability of substitution (X) and the number of employees at risk (Y) for each occupation category. I have a dataset like this:
X Y
1 0.1300 0
2 0.1000 0
3 0.0841 1513
4 0.0221 287
5 0.1175 3641
....
700 0.9875 4000
I tried to plot a histogram with this command:
hist(dataset1$X,dataset1$Y,xlim=c(0,1),ylim=c(0,30000),breaks=100,main="Distribution",xlab="Probability",ylab="Number of employee")
But I get this error:
In if (freq) x$counts else x$density
length > 1 and only the first element will be used
Can someone tell me what the problem is and give me the right command?
Thank you!

It is worth pointing out that the message displayed is a warning, and should not prevent the results being plotted. However, it does indicate there are some issues with the call and the data.
Without the full dataset, it is not 100% obvious what the problem may be. One likely cause is the call itself: hist() takes a single vector of values, and because breaks is passed by name, the second positional argument dataset1$Y is matched to freq, which expects a single logical value. Beyond that, the data is not in the right format for a histogram, with two potential issues. Firstly, some rows have a Y of 0, and these won't contribute to the histogram. Secondly, the observations appear to be inconsistently spaced.
Histograms are best built from one of two datasets:
A dataframe which has been aggregated into consistently sized bins.
A list of values X, with one entry for each observation in the data.
I prefer the second technique. As originally shown here, the expandRows() function in the package splitstackshape can be used to repeat each row of the dataframe by its number of observations:
library(splitstackshape)
set.seed(123)
# Example data: X is the probability, Y the (integer) number of observations
dataset1 <- data.frame(X = runif(900, 0, 1), Y = round(runif(900, 0, 1000)))
# Second technique: repeat each row Y times, then plot the expanded X values
dataset2 <- expandRows(dataset1, "Y")
hist(dataset2$X, xlim = c(0, 1))
# First technique: assign each X to one of 100 equal-width bins
dataset1$bins <- cut(dataset1$X, breaks = seq(0, 1, 0.01), labels = FALSE)
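The first option, aggregating into equal-width bins, amounts to summing Y over each bin of X. As a language-neutral illustration, here is a minimal Python sketch using the five rows shown in the question. The function name weighted_bins and the choice of 10 bins are assumptions for the example.

```python
def weighted_bins(xs, weights, n_bins, lo=0.0, hi=1.0):
    """Sum the weights of all x values falling into each of n_bins equal-width bins."""
    totals = [0] * n_bins
    width = (hi - lo) / n_bins
    for x, w in zip(xs, weights):
        # Clamp so that x == hi lands in the last bin instead of overflowing.
        i = min(int((x - lo) / width), n_bins - 1)
        totals[i] += w
    return totals

# The five rows shown in the question: probability (X) and employees at risk (Y).
probability = [0.1300, 0.1000, 0.0841, 0.0221, 0.1175]
employees = [0, 0, 1513, 287, 3641]

totals = weighted_bins(probability, employees, n_bins=10)
print(totals)
```

Each total is the height of one bar, so the result can be drawn as a bar chart with one bar per probability bin.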

Histogram calculation in julia-lang

According to the julia-lang documentation:
hist(v[, n]) → e, counts
Compute the histogram of v, optionally using approximately n bins. The return values are a range e, which correspond to the edges of the bins, and counts containing the number of elements of v in each bin. Note: Julia does not ignore NaN values in the computation.
I chose a sample range of data
testdata=0:1:10;
then used the hist function to calculate the histogram for 1 to 5 bins
hist(testdata,1) # => (-10.0:10.0:10.0,[1,10])
hist(testdata,2) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,3) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,4) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,5) # => (-2.0:2.0:10.0,[1,2,2,2,2,2])
As you can see, when I ask for 1 bin it calculates 2 bins, and when I ask for 2 bins it calculates 3.
Why does this happen?

As the person who wrote the underlying function: the aim is to get bin widths that are "nice" in terms of a base-10 counting system (i.e. 10^k, 2×10^k, 5×10^k). If you want more control you can also specify the exact bin edges.

The key word in the doc is approximately. You can check what hist is actually doing for yourself in Julia's base module here.
When you do hist(testdata,3), you're actually calling
hist(v::AbstractVector, n::Integer) = hist(v,histrange(v,n))
That is, in a first step the n argument is converted into a FloatRange by the histrange function, the code of which can be found here. As you can see, the calculation of these steps is not entirely straightforward, so you should play around with this function a bit to figure out how it is constructing the range that forms the basis of the histogram.
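To experiment with the idea outside Julia, the rounding can be mimicked in a few lines of Python. This is a sketch written from the descriptions above, not a copy of Julia's histrange, and the function names nice_edges and count_bins are made up. It rounds the raw bin width up to 1, 2, 5 or 10 times a power of ten, starts at the largest multiple of that width below the minimum, and counts values in right-closed bins. For the question's data it reproduces the edges and counts listed there.

```python
import math

def nice_edges(values, n):
    """Bin edges for about n bins, with a width of 1, 2, 5 or 10 times a power of ten."""
    lo, hi = min(values), max(values)
    if hi == lo:
        step = 1.0
    else:
        raw = (hi - lo) / n  # the width if n were taken literally
        power = 10.0 ** math.floor(math.log10(raw))
        ratio = raw / power
        if ratio <= 1:
            step = power
        elif ratio <= 2:
            step = 2 * power
        elif ratio <= 5:
            step = 5 * power
        else:
            step = 10 * power
    # Largest multiple of step strictly below lo, so the minimum falls inside a bin.
    start = step * (math.ceil(lo / step) - 1)
    count = math.ceil((hi - start) / step)
    return [start + i * step for i in range(count + 1)]

def count_bins(values, edges):
    """Count the values in each right-closed bin (edges[i], edges[i+1]]."""
    return [sum(1 for v in values if a < v <= b) for a, b in zip(edges, edges[1:])]

testdata = list(range(0, 11))
for n in (1, 2, 3, 4, 5):
    edges = nice_edges(testdata, n)
    print(n, edges, count_bins(testdata, edges))
```

The extra bin in the question comes from the starting edge: because the bins are right-closed, the first edge has to sit below the minimum value 0, which adds one bin holding only that value.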

How to calculate the mean of Mean Squared Errors?

I have an array A where each element is a Mean Squared Error. How can I calculate the mean of A?
If I simply take the mean of the elements of A (which gives me a mean of means), is that a correct operation? If not, why not, and what is a solution?
Note: The elements in A are real numbers in the range from 0 to 1.

If you're after the total mean squared error you'll need the number of values that contributed to each element, n[i][j]. You can then compute
total_err2 = (Σ (n[i][j] * err2[i][j])) / (Σ n[i][j])
where Σ is the sum over all of the elements.
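The formula above can be sketched in Python as follows, using flat lists for simplicity. The numbers are made up for illustration, and the function name pooled_mse is an assumption. The sketch also shows when the plain mean of A is fine: if every element was computed from the same number of values, the weighted and unweighted means coincide.

```python
def pooled_mse(mse_values, counts):
    """Weighted mean of per-group MSEs, where counts[i] values went into mse_values[i]."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must sum to a positive number")
    return sum(n * e for n, e in zip(counts, mse_values)) / total

# Made-up per-group errors, each in the range 0 to 1 as in the question.
mse_values = [0.25, 0.5, 0.75]

unequal = pooled_mse(mse_values, [10, 30, 60])  # weighted toward the larger groups
equal = pooled_mse(mse_values, [20, 20, 20])    # equal group sizes
plain_mean = sum(mse_values) / len(mse_values)

print(unequal, equal, plain_mean)
```

Here the unequal group sizes give 0.625, while equal sizes give 0.5, the same as the plain mean.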
