Neo4j K-means algorithm

Hello Stack Overflow community,
I really need some help with something.
I want to apply a community detection algorithm to a graph that contains distances between people (a social network).
Does the Neo4j k-means community detection algorithm work with this type of graph?

K-means in Graph Data Science (GDS) is in the alpha tier, meaning it is at an early stage of development and may still undergo major changes.
Here is the documentation: https://neo4j.com/docs/graph-data-science/current/algorithms/alpha/kmeans/
Enjoy reading it!
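If you want to experiment with it anyway, a minimal sketch along these lines may help. It uses the official Neo4j Python driver; the graph name persons, the node property coordinates, and the connection details are all placeholders, and the alpha procedure name may differ between GDS versions. Note that GDS k-means clusters on node properties, not on relationships, so distances stored on relationships would need to be turned into per-node features first.

    from neo4j import GraphDatabase

    # Placeholder connection details -- adjust for your own instance.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Project an in-memory graph; 'coordinates' is assumed to be a
        # list-of-floats node property, since GDS k-means clusters on node
        # properties rather than on relationships.
        session.run(
            "CALL gds.graph.project('persons', 'Person', '*', "
            "{nodeProperties: ['coordinates']})"
        )
        # Alpha-tier k-means; the procedure name and config keys may change
        # between GDS versions, so check the documentation linked above.
        result = session.run(
            "CALL gds.alpha.kmeans.stream('persons', "
            "{nodeProperty: 'coordinates', k: 3}) "
            "YIELD nodeId, communityId "
            "RETURN gds.util.asNode(nodeId).name AS name, communityId"
        )
        for record in result:
            print(record["name"], record["communityId"])

    driver.close()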

Related

Louvain Community Detection Algorithm: how to identify the number of clusters?

I want to use Louvain for clustering a large-scale network. How can I identify the number of required clusters, given that there is no parameter that can be configured for this purpose in the algorithm built into the Neo4j Graph Data Science (GDS) library?
Update 1: According to this [Ref], k-means can be used to group items based on similar properties rather than relationships (i.e., nodes without relationships between them). Since I have a complete network topology, I think k-means doesn't work in this scenario.
Update 2: Any suggestion of another algorithm (or algorithms) that can perform clustering and allows specifying the number of clusters is welcome :)
The aim of the clustering is to create multiple network domains that distribute the traffic load in a large-scale SDN network, so I thought of using a community detection algorithm to perform the clustering and thereby determine the required number of SDN controllers to deploy.
Louvain optimizes modularity by combining smaller communities into larger groups until some end state is reached, so the final number of clusters isn't under user control.
K-Means (available in alpha) allows you to pre-set the number of clusters, if that helps.
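As a rough sketch of the difference (using the Neo4j Python driver against a hypothetical projected graph called myGraph; the alpha procedure name and the node property features are assumptions, so check your GDS version's docs):

    from neo4j import GraphDatabase

    # Placeholder connection details and projected graph name.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Louvain has no parameter for the number of communities; it stops
        # when modularity no longer improves, so you can only count what it found.
        louvain = session.run(
            "CALL gds.louvain.stream('myGraph') "
            "YIELD nodeId, communityId "
            "RETURN count(DISTINCT communityId) AS communities"
        )
        print("Louvain found", louvain.single()["communities"], "communities")

        # Alpha-tier k-means, by contrast, takes the desired cluster count
        # directly via the `k` config key.
        session.run(
            "CALL gds.alpha.kmeans.stream('myGraph', "
            "{nodeProperty: 'features', k: 5})"
        )

    driver.close()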
You might also edit your question to explain why Louvain is the method you'd like to go with, so people can offer suggestions that support your use case. :)

Neo4j Community detection

Hello Stack Overflow community,
I am working on an academic project using a Neo4j database, and I need help from members who have worked with Neo4j GDS before in order to find a solution to my problem.
I want to apply a community detection algorithm called "Newman-Girvan", but no algorithm with this name exists in the Neo4j GDS library. I found an algorithm called "Modularity Optimization"; is it the Newman-Girvan algorithm under a different name, or is it a different algorithm?
Thanks in advance.
I've not used the Newman-Girvan algorithm, but the fact that it's a hierarchical algorithm with a dendrogram output suggests you can use comparable GDS algorithms, specifically Louvain, or the newest, Leiden. Leiden has the advantage of enforcing the generation of intermediate communities. I've used both algorithms with multigraphs; I believe this capability was just introduced with GDS v2.x.
The documentation on the algorithms is at:
https://neo4j.com/docs/graph-data-science/current/
https://neo4j.com/docs/graph-data-science/current/algorithms/alpha/leiden/
Multigraph projection:
https://neo4j.com/docs/graph-data-science/current/graph-project-cypher-aggregation/
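If it helps, here is a minimal sketch of calling Leiden with intermediate communities via the Neo4j Python driver; the projected graph name myGraph is a placeholder, and the alpha procedure name and config key may differ between GDS versions, so verify them against the docs above.

    from neo4j import GraphDatabase

    # Placeholder connection details and projected graph name.
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Leiden (alpha tier in the linked docs) with the hierarchy of
        # intermediate communities included in the output.
        result = session.run(
            "CALL gds.alpha.leiden.stream('myGraph', "
            "{includeIntermediateCommunities: true}) "
            "YIELD nodeId, communityId, intermediateCommunityIds "
            "RETURN nodeId, communityId, intermediateCommunityIds "
            "LIMIT 10"
        )
        for record in result:
            print(record["nodeId"], record["communityId"],
                  record["intermediateCommunityIds"])

    driver.close()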

Anomaly detection algorithm for a univariate time series dataset

I have univariate time series data and I need to run anomaly detection algorithm on the same. Can anyone suggest any standard algorithm for anomaly detection which works in most cases?
There is no algorithm "which works in most cases". The task heavily depends on the specifics of your case, e.g. whether you need local anomalies, where a point differs from the points near it, or global ones, where a point does not look similar to any other point in the dataset.
A very good review of anomaly detection algorithms can be found here.
Perhaps you can easily try a one-class SVM, which is available in many libraries and programming languages. For instance, in Python you can use scikit-learn.
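A minimal scikit-learn sketch could look like the following; the toy series is just a stand-in for your own data, and how you window or featurize the series (single points vs. lagged windows) is up to you.

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Toy univariate series with a few obvious outliers appended.
    series = np.concatenate([np.random.normal(0.0, 1.0, 500), [8.0, -9.0, 10.0]])

    # scikit-learn expects a 2-D feature matrix, so reshape the 1-D series;
    # a sliding window of lagged values is a common alternative to single points.
    X = series.reshape(-1, 1)

    # `nu` roughly bounds the fraction of points treated as outliers; tune it.
    model = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")
    labels = model.fit_predict(X)  # +1 = inlier, -1 = anomaly

    print("Anomalous indices:", np.where(labels == -1)[0])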

Training the algorithm for better image recognition

This is a research question not a direct programming question.
I am working on a symbol recognition algorithm. What the software currently does is take an image, divide it into contours (blobs), and match each contour against a list of predefined templates. Then, for each contour, it takes the template with the highest match rate.
The algorithm is doing fairly well; however, I need to train it better. What I mean is this:
I want to use a machine learning algorithm that will train the recognizer to match better. So let's take an example:
I run the recognition on a symbol, the algorithm finds that this symbol is a car, and then I have to confirm that result (maybe by clicking "Yes" or "No"); the algorithm should learn from that. So if I click NO, the algorithm should learn that this is not a car and give a better result next time (maybe by trying to match something else), while if I click YES it will know it was correct and will perform better the next time it searches for a car.
This is the concept I am trying to research. I need documents or algorithms that can achieve this sort of thing. I am not looking for implementations or programming, just concepts or research.
I have done a lot of research and read a lot about machine learning, neural networks, decision trees... but I was not able to figure out how I could use any of them in my scenario.
I hope I was clear and that this type of question is allowed on Stack Overflow; if not, I'm sorry.
Thanks a lot for any help or tips.
Image recognition is still a challenge in the community. What you described, manually clicking yes/no, is just creating labeled data. Since this is a very broad area, I will just point you to a few links that might be useful.
To get started, you might want to use some existing image databases instead of creating your own, which saves you a lot of effort, e.g., this car dataset in the UIUC image DB.
Since you already have a background in machine learning, you can take a look at some survey papers that match your project's interests, e.g., search for "object recognition survey paper" or "feature extraction car" on Google.
Then you can dive into some good papers and see whether they are suitable for your project. For example, you can check the two papers below, which are linked with the UIUC image DB.
Shivani Agarwal, Aatif Awan, and Dan Roth,
Learning to detect objects in images via a sparse, part-based representation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1475-1490, 2004.
Shivani Agarwal and Dan Roth,
Learning a sparse representation for object detection.
In Proceedings of the Seventh European Conference on Computer Vision, Part IV, pages 113-130, Copenhagen, Denmark, 2002.
Also check for some implemented software instead of starting from scratch; in your case, OpenCV should be a good one to start with.
For image recognition, feature extraction is one of the most important steps. You might want to check some state-of-the-art algorithms in the community (SIFT, mean-shift, Haar features, etc.).
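If you do end up experimenting, a minimal OpenCV sketch of SIFT extraction and matching might look like this (the file names are placeholders, and SIFT_create requires a reasonably recent OpenCV build):

    import cv2

    # Placeholder file names -- substitute your template and query images.
    template = cv2.imread("car_template.png", cv2.IMREAD_GRAYSCALE)
    query = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

    # Detect SIFT keypoints and compute descriptors for both images.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(template, None)
    kp2, des2 = sift.detectAndCompute(query, None)

    # Brute-force matching with Lowe's ratio test to keep distinctive matches.
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    print(len(good), "good SIFT matches out of", len(matches))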
Boosting algorithms might also be useful when you reach the classification step. I see a lot of scholars mention this in the image recognition community.
As #nickbar suggests, discuss more at https://stats.stackexchange.com/

Map Reduce Algorithms on Terabytes of Data?

This question does not have a single "right" answer.
I'm interested in running Map Reduce algorithms, on a cluster, on Terabytes of data.
I want to learn more about the running time of said algorithms.
What books should I read?
I'm not interested in setting up Map Reduce clusters, or running standard algorithms. I want rigorous theoretical treatments of running time.
EDIT: The issue is not that map reduce changes running time. The issue is that most algorithms do not distribute well to map reduce frameworks. I'm interested in algorithms that run on the map reduce framework.
Technically, there's no real difference in the runtime analysis of MapReduce compared to "standard" algorithms - MapReduce is still an algorithm just like any other (or specifically, a class of algorithms that occur in multiple steps, with a certain interaction between those steps).
The runtime of a MapReduce job is still going to scale the way normal algorithmic analysis would predict, once you factor in the division of tasks across multiple machines and take the maximum individual machine time required for each step.
That is, if you have a task which requires M map operations and R reduce operations, running on N machines, and you expect the average map operation to take time m and the average reduce operation time r, then you'll have an expected runtime of ceil(M/N)*m + ceil(R/N)*r to complete all of the tasks in question.
Predicting the values of M, R, m, and r is something that can be accomplished with normal analysis of whatever algorithm you're plugging into MapReduce.
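As a toy illustration of that back-of-the-envelope estimate (all the numbers below are made up):

    import math

    def mapreduce_runtime(num_maps, num_reduces, num_machines, map_time, reduce_time):
        # Rough expected runtime: ceil(M/N)*m + ceil(R/N)*r
        return (math.ceil(num_maps / num_machines) * map_time
                + math.ceil(num_reduces / num_machines) * reduce_time)

    # Hypothetical job: 10,000 map tasks, 500 reduce tasks, 100 machines,
    # 30 s per average map and 120 s per average reduce.
    print(mapreduce_runtime(10_000, 500, 100, 30, 120), "seconds")  # 3600 seconds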
There are only two books that I know of that are published, but there are more in the works:
Pro Hadoop and Hadoop: The Definitive Guide
Of these, Pro Hadoop is more of a beginner's book, whilst The Definitive Guide is for those that know what Hadoop actually is.
I own The Definitive Guide and think it's an excellent book. It provides good technical details on how HDFS works, as well as covering a range of related topics such as MapReduce, Pig, Hive, HBase etc. It should also be noted that this book was written by Tom White, who has been involved with the development of Hadoop for a good while and now works at Cloudera.
As far as the analysis of algorithms goes on Hadoop, you could take a look at the TeraByte sort benchmarks. Yahoo have done a write-up of how Hadoop performs for this particular benchmark: TeraByte Sort on Apache Hadoop. This paper was written in 2008.
More details about the 2009 results can be found here.
There is a great book about Data Mining algorithms applied to the MapReduce model.
It was written by two Stanford professors and it is available for free:
http://infolab.stanford.edu/~ullman/mmds.html
