I want to compute the Pearson correlation in InfluxDB. I found that there is a function for that in Flux.
I am using the Influx Query Language (InfluxQL). Is it possible to compute the Pearson correlation in InfluxQL?
Unfortunately, it seems there is no current plan to support this function natively. There has been an open issue for years, and Flux has obviously been given more priority than InfluxQL these days, with constant iterations.
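Since InfluxQL cannot do this server-side, a common workaround is to query both series with plain InfluxQL and compute the correlation client-side. Here is a minimal sketch using the Python `influxdb` client and numpy; the database, measurement, and field names (`mydb`, `metrics`, `cpu`, `mem`) are hypothetical placeholders.

```python
import numpy as np
from influxdb import InfluxDBClient

# Connect and pull both fields with plain InfluxQL (names are placeholders).
client = InfluxDBClient(host="localhost", port=8086, database="mydb")
query = 'SELECT "cpu", "mem" FROM "metrics" WHERE time > now() - 1h'
points = list(client.query(query).get_points())

# Compute the Pearson correlation on the client
# (assumes both fields are present on every returned point).
cpu = np.array([p["cpu"] for p in points], dtype=float)
mem = np.array([p["mem"] for p in points], dtype=float)
print(np.corrcoef(cpu, mem)[0, 1])
```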
Can we predict the growth percentage in sales of an item, given the change in discount (a positive or negative number) from the previous year as a predictor variable? There seems to be no correlation between these. How can this problem be solved using machine learning?
You are on the wrong track with this question.
Correlation belongs to statistics. Please look at Pearson's correlation coefficient / Spearman's correlation coefficient to find the correlation between the discount changes and the sales growth.
In machine learning, we seldom compare two percentage series; instead, we compare the actual sales/discount values. A simple ML approach is linear regression (most ML is used in multiple dimensions, whereas your case is one-x one-y data: a single input column and a single output). Please refer to related information online and solve it with Excel or Python code.
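For illustration, a minimal sketch of that suggestion in Python with scipy; the discount/sales numbers below are made up.

```python
import numpy as np
from scipy import stats

# Hypothetical yearly data: change in discount vs. sales growth.
discount_change = np.array([-5.0, -2.0, 0.0, 1.0, 3.0, 4.0, 6.0])
sales_growth = np.array([-1.2, 0.3, 0.5, 1.1, 2.0, 2.4, 3.9])

# Step 1: check the correlation first.
r, p = stats.pearsonr(discount_change, sales_growth)
print(f"Pearson r = {r:.3f}, p-value = {p:.3f}")

# Step 2: fit a simple one-variable linear regression.
res = stats.linregress(discount_change, sales_growth)
print(f"growth ~ {res.slope:.3f} * discount_change + {res.intercept:.3f}")
```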
K-means clustering in sklearn; the number of clusters is known in advance (it is 2).
There are multiple features. Feature values initially have no weights assigned, i.e. they are treated as equally weighted. However, the task is to assign custom weights to each feature in order to get the best possible cluster separation.
How can I determine the optimum weight for each feature (note that sklearn's sample_weight parameter weights samples, not features) in order to get the best possible separation of the two clusters?
If this is not possible for k-means, or for sklearn, I am interested in any alternative clustering solution; the point is that I need a method for automatically determining appropriate weights for multivariate features, in order to maximize cluster separation.
In the meantime, I have implemented the following: clustering by each component (feature) separately, then calculating the silhouette score, Calinski-Harabasz score, Dunn score, and inverse Davies-Bouldin score for each component separately; then scaling those scores to the same magnitude and reducing them with PCA to one feature. This produced a weight for each component, and the approach seems to give reasonable results. I suppose a better approach would be a full factorial experiment (DOE), but this simple approach appears to produce satisfactory results as well.
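For what it's worth, a minimal sketch of that idea, with a plain mean standing in for the PCA reduction step and synthetic data in place of the real features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)
from sklearn.preprocessing import minmax_scale

rng = np.random.default_rng(0)
# Hypothetical data: two clusters, three features of varying usefulness.
X = np.vstack([rng.normal([0, 0, 0], [1, 1, 5], (100, 3)),
               rng.normal([4, 1, 0], [1, 1, 5], (100, 3))])

# Cluster each feature separately and score the resulting partition.
score_rows = []
for j in range(X.shape[1]):
    col = X[:, [j]]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(col)
    score_rows.append([silhouette_score(col, labels),
                       calinski_harabasz_score(col, labels),
                       1.0 / davies_bouldin_score(col, labels)])

# Scale each score type across features to a common [0, 1] range, then
# average (the answer reduces the scores with PCA; a mean is a simpler
# stand-in for this sketch).
weights = minmax_scale(np.asarray(score_rows), axis=0).mean(axis=1)

# Re-cluster on the weighted features.
final_labels = KMeans(n_clusters=2, n_init=10,
                      random_state=0).fit_predict(X * weights)
```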
I have personID and VaccinationsID plotted on the x and y axes.
I want to group the personIDs that have the most similar selection of vaccinations. I am trying to use a clustering machine learning algorithm, but I am not sure whether I should use clustering or user-based collaborative filtering.
My aim is to use the Jaccard index, that is, to find the intersection or similarity between tens of thousands of persons, form clusters, and label them. Based on the degree of similarity, I need to group the personIDs. Could anyone tell me which is an efficient approach? Also, is it feasible to use clustering for millions of records?
I have added a screenshot of the graph.
The number of vaccinations is an integer.
Just partition your data by this value; no need for clustering.
Everybody that has 7 vaccinations goes into list 7.
After a lot of analysis, I used the K-modes clustering algorithm, which forms clusters based on dissimilarity. Below is a link to a video of how the K-modes algorithm works, followed by a small code sketch.
[https://www.youtube.com/watch?v=b39_vipRkUo]
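A minimal sketch of K-modes on a binary person-by-vaccination matrix, assuming the third-party kmodes package (pip install kmodes); the data and the number of clusters are hypothetical:

```python
import numpy as np
from kmodes.kmodes import KModes

# Hypothetical binary matrix: rows are persons, columns are vaccinations,
# 1 means the person received that vaccination.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 12))

km = KModes(n_clusters=5, init="Huang", n_init=5, random_state=0)
labels = km.fit_predict(X)    # cluster label for each person
print(km.cluster_centroids_)  # modal vaccination pattern per cluster
```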
I would like to know what SPSS does when it computes the UICI and LICI (upper and lower individual confidence intervals). I am asking because when we compute the same prediction interval for a given individual "by hand", using the output tables from a simple linear regression, we get a slightly different interval (up to a 0.005 difference).
I couldn't find out online how to get the code used for this command, in order to look more closely at what SPSS does when we check the boxes for mean and individual prediction intervals.
Thanks for your help,
The SPSS Algorithms manual, accessible from the Help menu, will give you the formulas. Note that a confidence interval is not the same as a prediction interval.
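To cross-check SPSS's output without redoing the algebra by hand, one option is to reproduce both intervals in Python with statsmodels. A sketch with made-up data (SPSS's exact formulas are in the Algorithms manual):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data for a simple linear regression.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 50)

model = sm.OLS(y, sm.add_constant(x)).fit()

# Predict at x = 5 (design row: [intercept, x]).
frame = model.get_prediction(np.array([[1.0, 5.0]])).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper"]])  # CI for the mean response
print(frame[["obs_ci_lower", "obs_ci_upper"]])    # individual PI (LICI/UICI)
```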
I want to calculate the precision and recall of a web service ranking algorithm. I have different web services in a database.
A customer specifies some conditions in his/her search. According to the customer's requirements, my algorithm should assign a score to each web service in the database and retrieve the web services with the highest scores.
I have searched the net and have read all the questions on this site about this topic, and I know about precision and recall, but I don't know how to calculate them in my case. The most relevant result was this link:
http://ijcsi.org/papers/IJCSI-8-3-2-452-460.pdf
According to this article,
Precision = Highest rank score / Total rank score of all services
Recall= Highest rank score / Score of 2nd highest service
But I think this is not correct. Can you help me, please?
Thanks a lot.
There is no such thing as "precision and recall for ranking". Precision and recall are defined for binary classification tasks and extended to multi-label tasks. Ranking requires different measures, as it is a much more complex problem. There are numerous ways to compute something similar to precision and recall; I will summarize some basic approaches for precision (recall goes similarly):
Limit the search algorithm to some K best results and count a true positive for each query whose desired result is among those K results. Precision is then the fraction of queries for which a relevant result appears in the K best outputs.
A very strict variation of the above: set K=1, meaning the result has to come out "the best of all".
Assign weights to each position: for example, give 1/T of a "true positive" to each query whose valid result came T'th. In other words, if the valid result was not returned you assign 1/inf = 0; if it was first on the list, 1/1 = 1; if second, 1/2; and so on. Precision is then simply the mean of these scores, which is essentially the mean reciprocal rank (a sketch of all three variants follows).
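A minimal sketch of the three variants, assuming each query has a single relevant item and `ranked` is the algorithm's ordered output; the query data is made up.

```python
def precision_at_k(ranked, relevant, k):
    """1.0 if the relevant item appears in the top-k results, else 0.0."""
    return 1.0 if relevant in ranked[:k] else 0.0

def reciprocal_rank(ranked, relevant):
    """1/T where the relevant item came T'th; 0 if it was not returned."""
    return 1.0 / (ranked.index(relevant) + 1) if relevant in ranked else 0.0

# Hypothetical output: (ranked results, the single relevant item) per query.
queries = [(["b", "a", "c"], "a"), (["a", "b"], "a"), (["c", "b"], "a")]

n = len(queries)
print(sum(precision_at_k(r, t, 2) for r, t in queries) / n)  # variant 1, K=2
print(sum(precision_at_k(r, t, 1) for r, t in queries) / n)  # variant 2, K=1
print(sum(reciprocal_rank(r, t) for r, t in queries) / n)    # variant 3 (MRR)
```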
As lejlot pointed out, "precision and recall for ranking" is a weird way to measure ranking performance. The definitions of "precision" and "recall" are very "customized" in the paper you referenced:
"It is a measure of the tradeoff between the precision and recall of the particular ranking algorithm. Precision is the accuracy of the ranks i.e. how well the algorithm has ranked the services according to the user preferences. Recall is the deviation between the top ranked service and the next relevant service in the list. Both these metrics are used together to arrive at the f-measure which then tests the algorithm efficiency."
Probably the original author had some specific motivation to use such a definition. Some usual metrics for evaluating ranking algorithms include:
Normalized discounted cumulative gain, or nDCG (used in a lot of Kaggle competitions; a small sketch is below)
Precision@K, Recall@K
This paper also lists a few common ranking measures.
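As a quick illustration, scikit-learn ships an ndcg_score; the relevance grades and ranker scores below are made up.

```python
import numpy as np
from sklearn.metrics import ndcg_score

true_relevance = np.asarray([[3, 2, 0, 1]])         # graded relevance per document
ranker_scores = np.asarray([[0.9, 0.1, 0.4, 0.7]])  # scores the ranker produced
print(ndcg_score(true_relevance, ranker_scores, k=3))  # nDCG@3
```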
This is what I could think of:
Recall could be the fraction of queries for which the user clicks one of the top 5 results, and precision could be the fraction of queries for which the user clicks the very first result. I don't know; it seems very vague to speak about precision and recall in such a scenario.