How to minimize the maximum cost of a path in a vertex disjoint path cover? - graph-algorithm

Given a directed weighted graph G and a number n, where n is the number of paths to be used to cover all the vertices of G, how can I minimize the cost of the most expensive path? (Assume that a solution always exists in this graph.)

For n = 1, this obviously becomes a Travelling Salesman Problem - which is NP-hard. Thus, I wouldn't look for exact algorithms in your case.
My guess would be that a good solution for small n would be to use one of the abundant algorithms for the Travelling Salesman Problem (which usually approximate optimal solutions quite well) and then remove the (n-1) heaviest edges from the found path. That way you end up with n paths.
The Wikipedia Article on TSP actually lists some pretty easy algorithmic techniques which should give you a reasonably good approximation.
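A minimal sketch of that idea, assuming the graph is given as a complete cost matrix `cost[u][v]` (an assumption, the question does not say the graph is complete) and using the nearest-neighbour heuristic as a stand-in for "one of the TSP algorithms". The function names are made up for illustration.

```python
def nearest_neighbor_path(cost, start=0):
    """Greedy Hamiltonian path: always move to the cheapest unvisited vertex."""
    path = [start]
    unvisited = set(range(len(cost))) - {start}
    while unvisited:
        last = path[-1]
        # sorted() only makes tie-breaking deterministic
        nxt = min(sorted(unvisited), key=lambda v: cost[last][v])
        path.append(nxt)
        unvisited.remove(nxt)
    return path


def split_into_paths(path, cost, n_paths):
    """Cut the (n_paths - 1) heaviest edges, leaving n_paths vertex-disjoint paths."""
    edges = [(cost[u][v], i) for i, (u, v) in enumerate(zip(path, path[1:]))]
    cuts = sorted(i for _, i in sorted(edges, reverse=True)[:n_paths - 1])
    pieces, begin = [], 0
    for i in cuts:
        pieces.append(path[begin:i + 1])  # edge i joins path[i] and path[i + 1]
        begin = i + 1
    pieces.append(path[begin:])
    return pieces


def path_cost(piece, cost):
    """Total edge cost along one piece (a single vertex costs 0)."""
    return sum(cost[u][v] for u, v in zip(piece, piece[1:]))
```

The value to report would then be `max(path_cost(p, cost) for p in pieces)`. Cutting the heaviest edges is only the heuristic suggested above; it is not guaranteed to give the smallest possible maximum.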

Related

K-Nearest Neighbor - how many reference points/features?

I want to use KNN to create a training model (I will use other ML models as well), but I'm just wondering...
I have around 6 features, with a total of let's say 60.000 (60 thousand) reference points (so, I have around 10.000 reference points per feature).
I know that this is, from a computational point of view, not ideal (for an algorithm like KNN), so should I use for example KD-Trees (or is KNN okay for this number of features/reference points)?
Because if I have to calculate the distance between my test point and all the reference points (with, for example, Euclidean distance for a multi-dimensional model), I can imagine that it will take quite some time.
I know that other (supervised) ML algorithms are maybe more efficient, but KNN is only one of the algorithms I will use.
The time complexity of (naive) KNN would be O(kdn) where d is the dimensionality which is 6 in your case, and n is the number of points, which is 60,000 in your case.
Meanwhile, building a KD tree from n points is O(dn log n), with subsequent nearest-neighbor lookups taking O(k log n) time. This is definitely much better: you sacrifice a little bit of time upfront to build the KD tree, but each KNN lookup later is much faster.
This is all under the assumption that your points are distributed in a "nice" way (see: https://en.wikipedia.org/wiki/K-d_tree#Degradation_in_performance_when_the_query_point_is_far_from_points_in_the_k-d_tree for more details). If they aren't distributed in a "nice" way, then KNN in general might not be the way to go.
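To make the comparison concrete, here is a rough pure-Python sketch of both approaches: a brute-force scan over all points and a simple KD tree with a k-nearest lookup. It is an illustration only; the build step re-sorts at every level for brevity (so it is slower than the O(dn log n) quoted above), and all function names are invented for this example.

```python
import heapq


def sq_dist(a, b):
    """Squared Euclidean distance (the square root is not needed for ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(points, query, k):
    """Scan every reference point and keep the k closest."""
    return heapq.nsmallest(k, points, key=lambda p: sq_dist(p, query))


def build_kdtree(points, depth=0):
    """Node = (point, axis, left, right); splits on the median, cycling axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def kdtree_knn(tree, query, k):
    """Return the k nearest points, closest first."""
    heap = []  # max-heap via negated distances: (-sq_dist, point)

    def search(node):
        if node is None:
            return
        point, axis, left, right = node
        d = sq_dist(point, query)
        if len(heap) < k:
            heapq.heappush(heap, (-d, point))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, point))
        diff = query[axis] - point[axis]
        near, far = (left, right) if diff < 0 else (right, left)
        search(near)
        # Visit the far side only if the splitting plane is closer than
        # the current k-th best distance (or we still need more points).
        if len(heap) < k or diff * diff < -heap[0][0]:
            search(far)

    search(tree)
    return [p for _, p in sorted(heap, reverse=True)]
```

With 6 features and 60,000 points you would build the tree once and reuse it for every query. In practice an existing library implementation is probably the better choice than hand-rolled code like this.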

Can I use Breadth-First-Search on weighted graphs if I modify it?

I am having a discussion with a friend about whether the following will work:
We recently learned in a lecture about Breadth-First-Search. I know that it is a special case of Dijkstra where each edge weight is set to one. Assume now we are given a graph where the edges have integer weights of more than one. Then I would modify this graph by introducing additional vertices and connecting them by edges with weight one, e.g. assume we have an edge of weight 3 connecting the vertices u and v, then I would introduce dummy-vertices d1, d2, remove the edge connecting u and v and instead add edges {u, d1}, {d1, d2}, {d2,v} of weight one.
If I modify my whole graph this way and then apply breadth-first search starting from one of the original vertices, wouldn't this work as well?
Thank you very much in advance!
Since BFS is guaranteed to return an optimal path on unweighted graphs, and you've created the unweighted equivalent of your original graph, you'll be guaranteed to get the shortest path.
What you lose by doing this over Dijkstra's algorithm is runtime optimality. Now the runtime of your algorithm is dependent on the edge weights, whereas Dijkstra's is only dependent on the number of edges.
This sort of thought experiment is a great way to understand how Dijkstra's algorithm works (e.g. how would you modify your algorithm to not require creating a new graph? Or to not take 100 steps for an edge with weight 100?). In fact, this is probably how Dijkstra discovered the algorithm to begin with.
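For illustration, the construction from the question can be sketched as follows, assuming an undirected graph given as a list of `(u, v, weight)` tuples with positive integer weights. The dummy-vertex naming is made up for this example.

```python
from collections import deque


def expand_to_unit_edges(edges):
    """Replace each edge of weight w by a chain of w unit edges via w - 1 dummy vertices."""
    adj = {}
    for i, (u, v, w) in enumerate(edges):
        # Dummy vertices are tuples so they cannot clash with original names.
        chain = [u] + [("dummy", i, j) for j in range(1, w)] + [v]
        for a, b in zip(chain, chain[1:]):
            adj.setdefault(a, []).append(b)
            adj.setdefault(b, []).append(a)
    return adj


def bfs_distances(adj, source):
    """Plain BFS: number of unit edges from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist
```

The BFS distance to an original vertex equals its weighted shortest-path distance in the original graph. The number of vertices created grows with the sum of the edge weights, which is exactly the runtime cost described above.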

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
Since the data is highly skewed, out of 73,000 instances, 64,000 instances are bad buy and only 9,000 instances are good buy. Since building a decision tree would overfit the data, I chose to use kNN - K nearest neighbors.
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure, but there are other attributes which are categorical. To aptly use these features, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to merge 2 distance measures so that each feature gets equal weight.
Is there a way to find a good approximation for the value of k? I understand that this depends a lot on the use-case and varies per problem. But, if I am taking a simple vote from each neighbor, what value of k should I use? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but these are not specifically helpful -
a) Metric for nearest neighbor, which says that finding out your own distance measure is equivalent to 'kernelizing', but couldn't make much sense from it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize distances so that they have the same weight. By normalize I mean force to be between 0 and 1. To normalize something, subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value of K with the highest accuracy. If a "good" value of K is fine, then you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e. 50, 51, 52, ...), but this may not be optimal.
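As a rough sketch of the normalization and K-selection points above: numeric columns are min-max scaled to [0, 1], a categorical mismatch counts as 1 (Hamming-style), and both are combined so that every feature contributes at most 1 to the squared distance. Combining them this way is one possible choice, not the only one, and the function names and the leave-one-out evaluation are assumptions made for this example.

```python
from collections import Counter


def min_max_scale(rows, numeric_idx):
    """Rescale each numeric column to [0, 1]: (x - min) / (max - min)."""
    scaled = [list(r) for r in rows]
    for j in numeric_idx:
        col = [r[j] for r in rows]
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # constant column: avoid dividing by zero
        for r in scaled:
            r[j] = (r[j] - lo) / span
    return scaled


def mixed_distance(a, b, numeric_idx, categorical_idx):
    """Euclidean part on scaled numerics plus 0/1 mismatches on categoricals."""
    num = sum((a[j] - b[j]) ** 2 for j in numeric_idx)
    cat = sum(1 for j in categorical_idx if a[j] != b[j])
    return (num + cat) ** 0.5


def knn_predict(train_x, train_y, query, k, numeric_idx, categorical_idx):
    """Simple majority vote among the k nearest training rows."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: mixed_distance(train_x[i], query, numeric_idx, categorical_idx),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(x, y, k, numeric_idx, categorical_idx):
    """Leave-one-out cross-validation accuracy for a given k."""
    hits = 0
    for i in range(len(x)):
        rest_x, rest_y = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        if knn_predict(rest_x, rest_y, x[i], k, numeric_idx, categorical_idx) == y[i]:
            hits += 1
    return hits / len(x)
```

Choosing K then amounts to something like `max(candidate_ks, key=lambda k: loo_accuracy(x, y, k, num_idx, cat_idx))`. On 73,000 rows leave-one-out with a full sort per query would be slow, so a k-fold split on a sample is likely more practical.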
I'm looking at the exact same problem.
Regarding the choice of k, it's recommended to use an odd value to avoid getting "tie votes".
I hope to expand this answer in the future.

A*: Finding a better solution for 15-square puzzle with one given solution

Given that there is a 15-square puzzle and we will solve the puzzle using a-star search. The heuristic function is Manhattan distance.
Now a solution is provided by someone with cost T and we are not sure if this solution is optimal. With this information provided,
Is it possible to find a better solution with cost < T?
Is it possible to optimize the performance of searching algorithm?
For this question, I have considered several approaches.
h(x) = MAX_INT if g(x) >= T. That is, the f(x) value will be maximum if the solution is larger than T.
Change the search node as CLOSED state if g(x) >= T.
Is it possible to find a better solution?
You need to know if T is the optimal solution. If you do not know the optimal solution, use the average cost; a good path is better than the average. If T is already better than average, you don't need to find a new path.
Is it possible to optimize the performance of the searching algorithm?
Yes. Heuristics are assumptions that help algorithms to make good decisions. The A* algorithm makes the following assumptions:
The best path costs the least (Dijkstra's Algorithm - stay near origin of search)
The best path is the most direct path (Greedy Search - minimize distance to goal)
Good heuristics vastly improve performance (A* is useful for this reason). Bad heuristics lead the search away from good solutions and obliterate performance. My advice is to know the game you are searching; in chess, it's generally best to avoid losing a queen, so that may be a good heuristic to use.
Heuristics will have the largest impact on performance, especially in the case of a 15x15 search space. In larger search spaces (2000x2000), good use of high efficiency data structures like arrays and integers may improve performance.
Potential solutions
Both the solutions you provide are effectively the same; if the path isn't as good as the other paths you have, ignore them. Search algorithms like A* do this for you, as j_random_hacker has said in a roundabout manner.
The OPEN list is a set of possible moves; select the best and ignore the rest. The CLOSED list is the set of moves that have already been selected, not the ones you wish to ignore.
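For what it's worth, the known cost T from the question can be folded into an ordinary A* as a pruning test. Since Manhattan distance never overestimates the remaining moves in a sliding puzzle, a node whose g + h is already >= T cannot lead to a solution cheaper than T and can be dropped. The sketch below assumes the goal layout 1, 2, ..., with the blank (0) last and a cost of 1 per move; the function names are invented for this example, and it is written for any board size (size=4 for the 15-puzzle).

```python
import heapq


def manhattan(state, size):
    """Sum of Manhattan distances of all tiles from their goal cells (blank ignored)."""
    total = 0
    for idx, tile in enumerate(state):
        if tile:
            goal = tile - 1
            total += abs(idx // size - goal // size) + abs(idx % size - goal % size)
    return total


def neighbors(state, size):
    """States reachable by sliding one tile into the blank."""
    blank = state.index(0)
    row, col = divmod(blank, size)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + dr, col + dc
        if 0 <= r < size and 0 <= c < size:
            nxt = list(state)
            nxt[blank], nxt[r * size + c] = nxt[r * size + c], nxt[blank]
            yield tuple(nxt)


def astar_below_bound(start, size, bound):
    """Return the optimal cost if it is strictly below bound (T), else None."""
    goal = tuple(list(range(1, size * size)) + [0])
    if manhattan(start, size) >= bound:
        return None
    best_g = {start: 0}
    heap = [(manhattan(start, size), 0, start)]
    while heap:
        _, g, state = heapq.heappop(heap)
        if state == goal:
            return g
        if g > best_g[state]:
            continue  # stale heap entry
        for nxt in neighbors(state, size):
            ng = g + 1
            nf = ng + manhattan(nxt, size)
            if nf >= bound:
                continue  # cannot beat the known solution of cost T
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(heap, (nf, ng, nxt))
    return None
```

A return value of `None` means no solution cheaper than T exists, i.e. the given solution was already optimal. This only returns the cost; recovering the moves would need parent pointers.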
(1) d(x) = Dijkstra's Algorithm (the cost so far)
(2) g(x) = Greedy Search (the estimated distance to the goal)
(3) a*(x) = A* Algorithm = d(x) + g(x)
To make your A* more greedy (prefer suboptimal but fast solutions), multiply g(x) by a factor greater than 1 to favour the greedy term: (4) a*(x) = d(x) + 1.1 * g(x)
I actually tested this in a search space of 1500x2000. (3), a standard A*, took about 5 seconds to find the goal on the opposite side. (4) took only milliseconds to find the goal, demonstrating the value of using heuristics well.
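A minimal sketch of variants (3) and (4) on a 4-connected grid, with the Manhattan distance as the greedy term and a `weight` parameter that multiplies it. The grid format ('#' marks a wall), the unit move cost and the function name are assumptions for this example, not the code used for the timing above.

```python
import heapq


def weighted_astar(grid, start, goal, weight=1.0):
    """Return (path cost, nodes expanded); cost is None if the goal is unreachable.

    weight=1.0 is standard A*; weight > 1 makes the search greedier.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}
    heap = [(weight * h(start), 0, start)]
    expanded = 0
    while heap:
        _, cost, cell = heapq.heappop(heap)
        if cost > best[cell]:
            continue  # stale heap entry
        if cell == goal:
            return cost, expanded
        expanded += 1
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    priority = new_cost + weight * h((nr, nc))
                    heapq.heappush(heap, (priority, new_cost, (nr, nc)))
    return None, expanded
```

With weight > 1 the returned path may be longer than the optimum, but by no more than that factor when the unweighted heuristic is admissible. Comparing the `expanded` counts for weight=1.0 and weight=1.1 on your own map is a cheap way to see the trade-off.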
You may also add other heuristics to A*, such as:
Depth-first search (prefer a greater number of moves)
Breadth-first search (prefer a smaller number of moves)
Stick to Roads (if terrain determines movement speed, increase the cost of choosing bad terrain)
Stay out of enemy territory (if you want to avoid losing units, don't put them in harm's way)

Levenshtein Distance Algorithm better than O(n*m)?

I have been looking for an advanced Levenshtein distance algorithm, and the best I have found so far is O(n*m), where n and m are the lengths of the two strings. The algorithm is at this scale because of space, not time: it creates a matrix over the two strings.
Is there a publicly-available Levenshtein algorithm which is better than O(n*m)? I am not averse to looking at advanced computer science papers & research, but haven't been able to find anything. I have found one company, Exorbyte, which supposedly has built a super-advanced and super-fast Levenshtein algorithm, but of course that is a trade secret. I am building an iPhone app in which I would like to use the Levenshtein distance calculation. There is an Objective-C implementation available, but with the limited amount of memory on iPods and iPhones, I'd like to find a better algorithm if possible.
Are you interested in reducing the time complexity or the space complexity? The average time complexity can be reduced to O(n + d^2), where n is the length of the longer string and d is the edit distance. If you are only interested in the edit distance and not in reconstructing the edit sequence, you only need to keep the last two rows of the matrix in memory, so the space will be O(n).
If you can afford to approximate, there are poly-logarithmic approximations.
For the O(n + d^2) algorithm, look for Ukkonen's optimization or its enhancement, Enhanced Ukkonen. The best approximation that I know of is this one by Andoni, Krauthgamer, Onak
If you only want the threshold function, e.g. to test whether the distance is under a certain threshold, you can reduce the time and space complexity by only calculating the n values either side of the main diagonal in the array (n being the threshold here). You can also use Levenshtein automata to evaluate many words against a single base word in O(n) time, and the construction of the automata can be done in O(m) time, too.
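A sketch of the threshold idea, assuming you only need a yes/no answer for "is the distance at most k": only cells within k of the main diagonal are computed, and everything outside the band is treated as "more than k". The function name is invented for this example, and rows are stored as dicts purely to keep the band bookkeeping short.

```python
def within_distance(a, b, k):
    """True if the Levenshtein distance between a and b is at most k."""
    if abs(len(a) - len(b)) > k:
        return False  # the length difference alone already exceeds k
    inf = k + 1  # stands for "more than k"; exact larger values are not needed
    previous = {j: j for j in range(min(len(b), k) + 1)}
    for i in range(1, len(a) + 1):
        current = {}
        if i <= k:
            current[0] = i  # deleting the first i characters of a
        for j in range(max(1, i - k), min(len(b), i + k) + 1):
            best = previous.get(j - 1, inf) + (a[i - 1] != b[j - 1])  # substitute/match
            best = min(best,
                       previous.get(j, inf) + 1,      # delete from a
                       current.get(j - 1, inf) + 1)   # insert into a
            current[j] = min(best, inf)
        if min(current.values()) > k:
            return False  # every cell in this row already exceeds k
        previous = current
    return previous.get(len(b), inf) <= k
```

Each row touches at most 2k + 1 cells, so the work is roughly O(k * len(a)) instead of O(n*m).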
Look in Wiki - they have some ideas to improve this algorithm to better space complexity:
Wiki-Link: Levenshtein distance
Quoting:
We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
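In Python, the two-row version described in that quote could look roughly like this (a sketch, not the Wikipedia or Wikibooks code; the function name is made up). Only the previous and the current row are kept, and the shorter string is used for the row to keep memory small.

```python
def levenshtein(a, b):
    """Edit distance using two rows: O(len(a) * len(b)) time, O(min(len)) space."""
    if len(a) < len(b):
        a, b = b, a  # make b the shorter string so the rows stay short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        previous = current
    return previous[-1]
```

This gives only the distance. Reconstructing the actual edit sequence still needs the full matrix or a divide-and-conquer scheme.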
I found another optimization that claims to be O(max(m, n)):
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C
(the second C implementation)

Resources