edmonds algorithm for bipartite maximum matching - bipartite

I want to find the maximum matching in a bipartite matching using edmonds algorithm. Unfortunately I am unable to get the pseudo-code. Can anyone help me?

This is a late post for the benefit of future visitors. There is pseudo code available in wikipidia page
Edmonds Algorithm is applicable to general graphs and not just
bipartite graphs. The wiki page shows how the general algorithm can be
used for bipartite graphs.

Related

Text corpus clustering

I have 27000 free text elements, each of around 2-3 sentences. I need to cluster these by similarity. So far, I have pretty limited success. I have tried the following:
I used Python Natural Language Toolkit to remove stop words, lemmatize and tokenize, then generated semantically similar words for each word in the sentence before inserting them into a Neo4j graph database. I then tried querying that using the TF counts for each word and related word. That didn't work very well and only resulted in being able to easily calculate the similarity between two text items.
I then looked at Graphawares NLP library to annotate, enrich and calculate the cosine similarity between each text item. After 4 days of processing similarity I checked the log to find that it would take 1.5 years to process. Apparently the community version of the plugin isn't optimised, so I guess it's not appropriate for this kind of volume of data.
I then wrote a custom implementation that took the same approach as the Graphaware plugin, but in much simpler form. I used scikitlearn's TfidfVectorizer to calculate the cosine similarity between each text item and every other text item and saved those as relationships between the Neo4j nodes. However, with 27000 text items that creates 27000 * 27000 = 729000000 relationships! The intention was to take the graph into Grephi selecting relationships of over X threshold of similarity and use modularity clustering to extract clusters. Processing for this is around 4 days which is much better. Processing is incomplete and is currently running. However, I believe that Grephi has a max edge count of 1M, so I expect this to restrict what I can do.
So I turned to more conventional ML techniques using scikitlearn's KMeans, DBSCAN, and MeanShift algorithms. I am getting clustering, but when it's plotted on a scatter chart there is no separation (I can show code if that would help). Here is what I get with DBSCAN:
I get similar results with KMeans. These algorithms run within a few seconds, which obviously makes life easier, but the results seem poor.
So my questions are:
Is there a better approach to this?
Can I expect to find distinct clusters at all in free text?
What should my next move be?
Thank you very much.
I think your question is too general to be a good fit for Stack Overflow, but to give you some pointers...
What is your data? You discuss your methods in detail but not your data. What sort of clusters are you expecting?
Example useful description:
I have a bunch of short product reviews. I expect to be able to separate reviews of shoes, hats, and refrigerators.
Have you tried topic modelling? It's not fancy but it's a traditional method of sorting textual documents into clusters. Start with LDA if you're not familiar with anything.
Are you looking for duplicates? If you're looking for plagiarism or bot-generated spam, look into MinHash, SimHash, and the FuzzyWuzzy library for Python.

How to measure similarity between code snippets written in programming language

I am trying to solve the following problem.
Given a particular code snippet I need to give back the top review comments for the code snippet, here we want to give all the comments that were given to similar code snippets.
I am trying to form it as a machine learning problem.I think we can use KNN algorithm, but here I am not sure how should I measure the similarity between two code snippet ? Is there any pre-existing similarity measure for it ? I tried to search in google but not found any useful link
Kindly help
Edit distance between the two strings containing the considered comments could be a useful measure of similarity.
Also, n-gram cosine distance could be useful, that is, you extract the n-grams (e.g. 3-char-segments), build the vectors counting these n-grams and calculate cosine distance.
Another one would be Jaccard similarity between n-gram vectors (as above).

Clustering of news articles

My scenario is pretty straightforwrd: I have a bunch of news articles (~1k at the moment) for which I know that some cover the same story/topic. I now would like to group these articles based on shared story/topic, i.e., based on their similarity.
What I did so far is to apply basic NLP techniques including stopword removal and stemming. I also calculated the tf-idf vector for each article, and with this can also calculate the, e.g., cosine similarity based on these tf-idf-vectors. But now with the grouping of the articles I struggles a bit. I see two principle ways -- probably related -- to do it:
1) Machine Learning / Clustering: I already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are also not intuitive to specify (for me that is).
2) Graph algorithms: I can represent my data as a graph with the articles being the nodes and weighted adges representing the pairwise (cosine) similarity between the articles. With that, for example, I can first remove all edges that fall below a certain threshold and then might apply graph algorithms to look for strongly-connected subgraphs.
In short, I'm not sure where best to go from here -- I'm still pretty new in this area. I wonder if there some best practices for that, or some kind of guidelines which methods / algorithms can (not) be applied in certain scenarios.
(EDIT: forgot to link to related question of mine)
Try the class of Hierarchical Agglomerative Clustering HAC algorithms with Single and Complete linkage.
These algorithms do not need the number of clusters as input.
The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stop based on a threshold criteria. A closely related class is the Divisive clustering algorithms which first builds up the minimal spanning tree and then prunes off a branch of the tree based on inter-cluster similarity ratios.
You can also try a canopy variation on k-means to create a relatively quick estimate for the number of clusters (k).
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Will you be recomputing over time or do you only care about a static set of news? I ask because your k may change a bit over time.
Since you can model your dataset as a graph you could apply stochastic clustering based on markov models. Here are link for resources on MCL algorithm:
Official thesis description and code base
Gephi plugin for MCL (to experiment and evaluate the method)

Will all TSP algorithms give the same optimum route?

I was just wondering if all algorithms for the TSP will give the same optimum routes? I thought that this would be the case but ive implemented branch and bound and A* and they both give very different results to the same input, I was just wondering if this is normal?
The route my differ, but the cost of all optimal solution should be the same.
If your A* solution is more expensive, than your heuristic is wrong.
Have a look at wikipedia A* algorithm for proofs that it always finds an optimal solution.
No. Provided more than one optimal route exists, different algorithms will not necesarily find the same path. It will depend on the implementation, and I assume it will also depend on how you label the graph, so that different labelings will make the same algorithm find different routes.

What algorithm would you use for clustering based on people attributes?

I'm pretty new in the field of machine learning (even if I find it extremely interesting), and I wanted to start a small project where I'd be able to apply some stuff.
Let's say I have a dataset of persons, where each person has N different attributes (only discrete values, each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA since we can have an arbitrary number of dimensions, that could be useful to reduce it. K-Means? I'm not sure in this case. Any ideas on what would be most adapted to this situation?
I do know how to code all those algorithms, but I'm truly missing some real world experience to know what to apply in which case.
K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.
The first step to pretty much any clustering algorithm is to find a suitable distance function. Many algorithms such as DBSCAN can be parameterized with this distance function then (at least in a decent implementation. Some of course only support Euclidean distance ...).
So start with considering how to measure object similarity!
In my opinion you should also try expectation-maximization algorithm (also called EM). On the other hand, you must be careful while using PCA because this algorithm may reduce the dimensions relevant to clustering.

Resources