I want to use Louvain for clustering a large-scale network. How can I control the number of resulting clusters, given that the algorithm built into the Neo4j Graph Data Science (GDS) library has no parameter that can be configured for this purpose?
Update 1: According to this [Ref], k-means can be used to group items based on similar properties instead of relationships (i.e., nodes without the relationships between them). Since I have a complete network topology, I think k-means doesn't work in this scenario.
Update 2: Any suggestions for other algorithms that can perform clustering and allow specifying the number of clusters are welcome :)
The aim of the clustering is to create multiple network domains that distribute the traffic load in a large-scale SDN network, so I thought of using a community detection algorithm to perform the clustering and thereby determine the required number of SDN controllers to deploy.
Louvain optimizes modularity by merging smaller communities into larger groups until modularity can no longer be improved. So the final number of clusters isn't under user control.
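You can see what Louvain gives you by streaming the community assignments and counting them. A minimal sketch using the official neo4j Python driver, assuming you have already projected an in-memory graph (the graph name 'myGraph' and the credentials are placeholders):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # stream one (nodeId, communityId) row per node and aggregate by community
    records = session.run("""
        CALL gds.louvain.stream('myGraph')
        YIELD nodeId, communityId
        RETURN communityId, count(*) AS members
        ORDER BY members DESC
    """)
    communities = [(r["communityId"], r["members"]) for r in records]

print(f"Louvain found {len(communities)} communities")
driver.close()
```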
K-Means (available in alpha) allows you to pre-set the number of clusters, if that helps.
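Note that K-Means (in GDS too) clusters on per-node property vectors rather than on relationships. As a neutral illustration of presetting the cluster count, here is scikit-learn's KMeans on a placeholder feature matrix standing in for whatever node properties (or embeddings) you would export:

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(100, 4)  # placeholder: one property vector per node

kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)  # k is fixed up front
labels = kmeans.fit_predict(features)
print(np.bincount(labels))  # sizes of the 5 requested clusters
```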
You might also edit your question to explain why Louvain is the method you'd like to go with, so people can offer suggestions that support your use case. :)
I am looking to implement machine learning for problems built on small data sets related to approvals of expenses in a specific supply chain domain. Typically, labelled data is unavailable.
I was looking to build a model on one data set for which I have labelled data, and then use that model in similar contexts where the feature set is very similar but not identical. The expectation is that this provides a starting point for recommendations while we gather labelled data in the new context.
I understand this is the essence of transfer learning. Most of the examples I have read in this domain deal with image data sets. Any guidance on how this can be leveraged for small data sets using standard tree-based classification algorithms?
I can’t really speak to tree-based algos; I don’t know how to do transfer learning with them. But for deep learning models, the customary method for transfer learning is to load up a pretrained model, retrain the last layer of the network using your new data, and then fine-tune the rest of the network.
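As an illustration of that recipe, here is a minimal Keras sketch; the base model (MobileNetV2), input size, and the 3-class head are arbitrary choices for the example, not anything specific to your problem:

```python
import tensorflow as tf

# step 1: load a pretrained base and freeze it
base = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet")
base.trainable = False

# step 2: attach and train a fresh last layer on the new data
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(new_images, new_labels, epochs=10)

# step 3: unfreeze and fine-tune the whole network at a low learning rate
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy")
# model.fit(new_images, new_labels, epochs=5)
```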
If you don’t have much data to go on, you might look into creating synthetic data.
raghu, I believe you are looking for a kernel method when you say "abstraction layer" in deep learning. There are several ML algorithms that support kernel functions. With kernel functions you might be able to do it, but using them might be more complex than solving your original problem. I would lean toward Tdoggo's suggestion of using a decision tree.
Sorry, I want to add a comment, but they won't allow me, so I posted a new answer.
Ok, with tree-based algos you can do just what you said: train the tree on one dataset and apply it to another, similar dataset. All you would need to do is change the terms/nodes of the second tree.
For instance, let’s say you have a decision tree trained for filtering expenses for a construction company. You will outright deny any reimbursements for workboots, because workers should provide those themselves.
You want to use the trained tree at your accounting firm, so instead of workboots you change that term to laptops, because accountants should be buying their own.
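In scikit-learn terms, that amounts to remapping the analogous column of the new domain onto the column name the fitted tree expects. A toy sketch (all column names and values are made up):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# train on the construction company's expenses (1 = approve, 0 = deny)
construction = pd.DataFrame(
    {"amount": [120, 80, 300, 45], "is_workboots": [1, 0, 0, 1]})
tree = DecisionTreeClassifier(max_depth=3).fit(construction, [0, 1, 1, 0])

# at the accounting firm, laptops play the role workboots played before
accounting = pd.DataFrame({"amount": [900, 40], "is_laptop": [1, 0]})
remapped = accounting.rename(columns={"is_laptop": "is_workboots"})
print(tree.predict(remapped))  # reuse the old tree's decision structure
```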
Does that make sense, and is that helpful to you?
After some research, we decided to proceed with random forest models, with the intuition that the trees in the original model that rely on common features will form the starting point for decisions.
As we gain more labelled data in the new context, we will start replacing the original trees with new trees that comprise (a) only new features and (b) combinations of old and new features.
This has provided reasonable results in initial trials.
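One simple way to operationalize that gradual replacement is a weighted soft vote between the old-context and new-context forests, shifting the weight toward the new model as labels accumulate. A sketch under the assumption that both fitted forests share the same class labels:

```python
import numpy as np

def blended_predict(rf_old, rf_new, X, new_weight):
    """Weighted soft vote between the old-context and new-context forests.

    Assumes both fitted forests expose the same classes_ ordering.
    """
    proba = ((1.0 - new_weight) * rf_old.predict_proba(X)
             + new_weight * rf_new.predict_proba(X))
    return rf_old.classes_[np.argmax(proba, axis=1)]

# new_weight starts near 0 and grows toward 1 as labelled data accumulates
```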
Let us suppose that we are trying to rank the importance of each feature of the dataset for a given cluster, in a clustering task. What characteristics should we measure in a feature to consider it good for characterizing a given cluster?
I am looking for a more analytical characterization of these features. For example, if a feature f has a high standard deviation in the whole dataset but a small standard deviation within a cluster c, does this mean that this feature is important for distinguishing the cluster c?
There are two approaches you could use here:
A feature selection approach would be to remove the feature in question, redo the clustering, and see whether that had a strong effect; if not, you could say this feature is unnecessary for the clustering task. The downside of this approach is the time it would take to rerun the clustering process for each subset of features in the dataset.
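A minimal sketch of that ablation loop with scikit-learn, using the adjusted Rand index to compare the clustering with and without each feature (the data and k are placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(200, 5)  # placeholder for your real feature matrix
base = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for f in range(X.shape[1]):
    ablated = np.delete(X, f, axis=1)  # drop one feature and recluster
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(ablated)
    # an ARI near 1.0 means dropping the feature barely changed the clustering
    print(f"feature {f}: ARI = {adjusted_rand_score(base, labels):.3f}")
```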
A statistical approach would be to split the data into two groups: the samples from the cluster and the rest of the samples. Then you ask how different the feature values are between the two populations. Depending on the distribution of the feature, you could pick a test such as the KS test, t-test, chi-squared test, or any other test for comparing the distributions of two samples.
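For instance, with the KS test from scipy (the data and cluster labels here are random placeholders):

```python
import numpy as np
from scipy.stats import ks_2samp

X = np.random.rand(200, 5)              # placeholder feature matrix
labels = np.random.randint(0, 3, 200)   # placeholder cluster assignments

cluster_id, feature = 0, 2
in_cluster = X[labels == cluster_id, feature]
rest = X[labels != cluster_id, feature]

stat, p_value = ks_2samp(in_cluster, rest)
# a small p-value suggests this feature distinguishes the cluster from the rest
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```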
I am a beginner and have just started studying machine learning and neural networks and have just understood the very basics of this vast and interesting domain.
From my basic knowledge, I know that a model/classifier can be used to classify an image as something. But I was curious whether there is a way to detect multiple instances of the same object and count them.
Basically, I wanted to calculate the density of traffic at a red light to dynamically control the flow of traffic, so I was curious whether there is a way to detect and count multiple cars at a red light by training a ConvNet on images of cars (and whether there is a way to implement this using TensorFlow).
You might consider using an off-the-shelf object detector, e.g., the TensorFlow Object Detection API (github.com/tensorflow/models/tree/master/object_detection), to first detect cars and then count them.
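A rough sketch of the detect-then-count idea using a pretrained COCO detector from TensorFlow Hub; the exact hub handle, output keys, and the 0.5 score threshold are assumptions you would adapt (in COCO, class id 3 is "car"):

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

detector = hub.load("https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2")

image = tf.image.decode_jpeg(tf.io.read_file("red_light.jpg"))  # uint8 HxWx3
result = detector(tf.expand_dims(image, axis=0))                # add batch dim

classes = result["detection_classes"][0].numpy()
scores = result["detection_scores"][0].numpy()
num_cars = int(np.sum((classes == 3) & (scores > 0.5)))  # COCO id 3 = car
print("cars detected:", num_cars)
```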
A CNN is one branch of machine learning. It can be trained to classify different cars as a single class, just like many other techniques applied in machine learning.
My understanding of your question is: you want to count the number of cars at the red light and adjust the traffic dynamically. So I would separate your question into two parts:
Count the number of cars
Optimize the traffic flow
For question 1, which is the one you are actually interested in, I would suggest you have a look at:
Counting the number of vehicles from an image with machine learning
I hope this can be helpful.
I've heard of a max-flow min-cut method for sharding or segmenting a graph database. Does someone have a sample Cypher query that can do that, say, against the MovieLens dataset? Basically, I want to segment users into different shards/clusters based on what they like, so maybe the min cuts can naturally find clusters of users around genres such as Horror or Drama, or maybe it will create non-intuitive clusters/segments like hipster/romantic and conservative/comedy/horror groups.
My short answer is no; sorry, I don't know how you would express that.
My longer answer is that even if this were possible (which it very well may be), I would advise against it.
Multiple algorithms 'do' min-cut max-flow, and they all have different performance characteristics; because clustering is computationally expensive, I'd guess you want control over the specific algorithm implementation used.
Cypher is a declarative language: you specify what you're looking for, not how to compute it. It will be difficult to specify such a complex problem in a way that lets the Cypher engine figure out what you're trying to do, and that will make it hard for Cypher (or any declarative language engine) to produce an efficient query plan.
My suggestion is to find the specific algorithm you wish to use and implement it using the Neo4j Java API.
If you're running Neo4j in embedded mode, you're done at that point. If you're running Neo4j server, you'll then just have to run that code as an Unmanaged Server Extension.
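If a standalone prototype is acceptable before committing to a Java implementation, one alternative (my suggestion, not something the Cypher engine does for you) is to pull the user graph out of Neo4j and run a global minimum cut in networkx; the Cypher schema below is hypothetical:

```python
import networkx as nx
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

G = nx.Graph()
with driver.session() as session:
    # hypothetical MovieLens-style schema: users connected by shared likes
    rows = session.run(
        "MATCH (a:User)-[:LIKES]->(g:Genre)<-[:LIKES]-(b:User) "
        "WHERE id(a) < id(b) "
        "RETURN id(a) AS a, id(b) AS b, count(g) AS w")
    for r in rows:
        G.add_edge(r["a"], r["b"], weight=r["w"])
driver.close()

# Stoer-Wagner global minimum cut: splits the graph into two segments
cut_value, (segment_a, segment_b) = nx.stoer_wagner(G)
print(cut_value, len(segment_a), len(segment_b))
```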
AFAIK you're after 'community detection' algorithms. There are non-overlapping variants (communities do not overlap) and overlapping variants, where non-overlapping is generally easier to implement and understand. Common algorithms are:
Non-overlapping: Louvain, Label Propagation Algorithm (LPA)
Overlapping: OSLOM, and extensions of LPA that make it overlapping
Here are a few C++ code examples for the algorithms: Louvain, OSLOM (overlapping), LPA (non-overlapping), and Infomap.
And if you want something bleeding-edge, I was recommended the SCD algorithm:
Academic paper: "High Quality, Scalable and Parallel Community Detection for Large Real Graphs"
C++ implementation
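If you want to experiment quickly before wiring up any of the C++ implementations, recent versions of networkx ship both Louvain and LPA; a minimal sketch on a toy graph:

```python
import networkx as nx
from networkx.algorithms.community import (asyn_lpa_communities,
                                           louvain_communities)

G = nx.karate_club_graph()  # toy stand-in for your real graph

louvain = louvain_communities(G, seed=42)      # non-overlapping
lpa = list(asyn_lpa_communities(G, seed=42))   # non-overlapping LPA

print("Louvain communities:", len(louvain))
print("LPA communities:", len(lpa))
```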
I came across a way to calculate the influence score of a person on a Twitter network. Here is a sample reference: http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/
On similar lines, are there any other algorithms that calculate the influence score of a subscriber in a telecom network using his/her CDR (call detail record) data?
Please check out Magnusson's thesis:
http://uu.diva-portal.org/smash/record.jsf?pid=diva2:509757
This thesis aims at investigating the usefulness of social network analysis in telecommunication networks. As these networks can be very large, the methods used to study them must scale linearly when the network size increases. Thus, an integral part of the study is to determine which social network analysis algorithms have this scalability. Moreover, comparisons of software solutions are performed to find products suitable for these specific tasks.
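In the same spirit as the Twitter PageRank analog above, a minimal sketch of the usual starting point: build a directed, weighted call graph from CDRs and run PageRank (the CDR rows here are made up):

```python
import networkx as nx

# hypothetical CDR rows: (caller, callee, number_of_calls)
cdrs = [("A", "B", 10), ("B", "C", 3), ("C", "A", 7), ("A", "C", 1)]

G = nx.DiGraph()
for caller, callee, calls in cdrs:
    G.add_edge(caller, callee, weight=calls)

# PageRank scales roughly linearly in the number of edges per iteration
influence = nx.pagerank(G, weight="weight")
print(sorted(influence.items(), key=lambda kv: -kv[1]))
```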