How can we efficiently implement maximum coverage with a fixed number of arcs? - graph-algorithm

I am working on solving the following problem and implementing the solution in C++.
Let us assume that we have a directed weighted graph G = (V, A, w) and a set of persons P.
We receive a number of queries, where each query gives a person p and two vertices s and d, and asks us to compute the minimum-weight path between s and d for person p. One person can have multiple paths.
After all the queries have been processed, I am given a number k <= |A| and I should return k arcs such that the number of persons using at least one of the k arcs is maximal (this is a maximum coverage problem).
To solve the first part I implemented Dijkstra's algorithm using a priority_queue, computing the minimal weight between s and d. (Is this a good way to do it?)
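For reference, a minimal sketch of this approach, assuming an adjacency-list representation (shown in Python for brevity; the same structure maps onto std::priority_queue in C++, which is a max-heap by default, so negate the weights or use std::greater):

import heapq

def dijkstra_min_weight(adj, s, d):
    # adj[u] is a list of (v, w) arcs leaving u.
    # Returns the minimum path weight from s to d, or None if unreachable.
    # Lazy-deletion variant: outdated heap entries are skipped when popped.
    dist = {s: 0}
    pq = [(0, s)]
    while pq:
        du, u = heapq.heappop(pq)
        if du > dist.get(u, float('inf')):
            continue  # stale entry
        if u == d:
            return du
        for v, w in adj.get(u, ()):
            nd = du + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return None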
To solve the second part, I store for every arc the set of persons that use this arc, and I use a greedy algorithm to compute the set of arcs (at each stage, I choose the arc used by the largest number of still-uncovered persons). (Is this a good way to do it?)
Finally, if my algorithms are good, how can I implement them efficiently in C++?
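Regarding the second part: greedy is indeed the standard approach, since maximum coverage is NP-hard and the greedy rule achieves the best possible polynomial-time approximation ratio of 1 - 1/e (unless P = NP). A minimal sketch, assuming arc_users maps each arc to the set of persons whose computed paths use it:

def greedy_max_coverage(arc_users, k):
    # Pick up to k arcs; at each step take the arc covering the most
    # still-uncovered persons.
    covered = set()
    chosen = []
    candidates = dict(arc_users)
    for _ in range(k):
        best = None
        best_gain = 0
        for arc, users in candidates.items():
            gain = len(users - covered)
            if gain > best_gain:
                best, best_gain = arc, gain
        if best is None:  # no remaining arc covers any new person
            break
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered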

Decision tree split implementation

I am doing this as part of my university assignment, but I can't find any resources online on how to implement this correctly.
I have read tons of material on the metrics that define an optimal split (like entropy, Gini and others), so I understand how to choose an optimal feature value to split the learning set into left and right nodes.
What I totally don't get is the complexity of the implementation: since we also have to choose the optimal feature, computing the optimal value at each node naively takes O(n^2) per feature, which is bad considering real ML datasets are shaped about 10^2 x 10^6 - really big in terms of computation cost.
Am I missing some kind of approach that could be used here to help reduce the complexity?
I currently have this baseline implementation for choosing the best feature and value to split on, but I really want to make it better:
threshold, feature_idx, tr_G = None, None, None
for f_idx in range(X_subset.shape[1]):
    sorted_values = X_subset.iloc[:, f_idx].sort_values()
    for v in sorted_values[self.min_samples_split - 1 : -self.min_samples_split + 1]:
        y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
        G = calc_g(y_subset, y_left, y_right)
        # keep the split with the lowest criterion value seen so far
        if tr_G is None or G < tr_G:
            threshold = v
            feature_idx = f_idx
            tr_G = G
return feature_idx, threshold
So, since no one answered, here is some stuff I found out.
Firstly, yes, this task is very computationally intensive. However, several tricks may be used to reduce the number of splits you need to evaluate to "grow a tree".
This is especially important, since you don't really want a giant overfitted tree - it just doesn't have any value; what is more important is to get a weak model which can be combined with others in some sort of ensembling technique.
As for the regularization tricks, here are a couple I used myself:
limit the maximum depth of the tree
limit the minimal number of items in a node
limit the maximum number of leaves in the tree
require a minimum quality change in the split criterion for a split to be performed
For the algorithmic part, there is a way to build a tree in a smart way. If you do it as in the code I posted earlier, the time complexity will be around O(h * N^2 * D), where h is the height of the tree. To work around this, there are several approaches, which I didn't personally code, but read about:
Use dynamic programming to accumulate statistics per feature, so you don't have to recalculate them at every split (see the sketch after this list)
Use data binning and bucket sort for O(n) sorting
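To make the first trick concrete, here is a minimal sketch of the accumulated-statistics sweep for one feature, assuming binary labels and the Gini criterion: sort once, then slide the split point while updating running class counts, so all thresholds of the feature are evaluated in O(n log n) total instead of O(n^2):

import numpy as np

def best_split_one_feature(x, y, min_samples=1):
    # x: one feature column, y: binary labels in {0, 1}.
    # Returns (best_weighted_gini, threshold) for this feature.
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(y)
    total_pos = y.sum()
    left_pos = 0
    best_gini, best_thr = np.inf, None
    for i in range(1, n):  # candidate split between positions i-1 and i
        left_pos += y[i - 1]
        if x[i] == x[i - 1]:
            continue  # equal values cannot be separated by a threshold
        if i < min_samples or n - i < min_samples:
            continue
        p_l = left_pos / i
        p_r = (total_pos - left_pos) / (n - i)
        # weighted Gini impurity of the two children
        gini = (i * 2 * p_l * (1 - p_l) + (n - i) * 2 * p_r * (1 - p_r)) / n
        if gini < best_gini:
            best_gini, best_thr = gini, (x[i] + x[i - 1]) / 2
    return best_gini, best_thr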
Source of info: https://ml-handbook.ru/chapters/decision_tree/intro
(use Google Translate, since the website is in Russian)

Consider three mutually independent classifiers, A, B, C, with equal error probabilities:

Here's the problem:
Consider three mutually independent classifiers, A, B, C, with equal error probabilities:
Pr(errA) = Pr(errB) = Pr(errC) = t
Let D be another classifier that takes the majority vote of A, B, and C.
• What is Pr(errD)?
• Plot Pr(errD) as a function of t.
• For what values of t is the performance of D better than that of the other three classifiers?
My questions are:
(1) I couldn't figure out the error probability of D. I thought it would be 1 minus alpha (1 - α), but I am not sure.
(2) How do I plot Pr(errD) as a function of t? I assume I cannot plot it without first finding Pr(errD).
(3) Here as well, I couldn't figure it out. Comparatively, how should I determine whether the performance of D is better?
If I understand well, your problem can be formulated in simple terms without any ensemble learning.
Given that D is the result of a majority vote by the 3 classifiers, D is wrong if and only if at most one of them is right.
A,B,C are independent, so:
the probability of none being right is t^3
the probability of exactly one being right while the other two are wrong is 3(1-t)t^2 (the factor 3 is because there are three ways to choose which one is right)
So P(errD) = t^3 + 3(1-t)t^2 = -2t^3 + 3t^2
You should be able to plot this as a function of t in the interval [0:1] without too many difficulties.
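A minimal plotting sketch, assuming matplotlib (the dashed line P(errA) = t is added as a reference for the third question):

import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 200)
err_D = 3 * t**2 - 2 * t**3  # P(errD) = t^3 + 3(1-t)t^2
plt.plot(t, err_D, label='P(errD)')
plt.plot(t, t, '--', label='P(errA) = t')  # single-classifier reference
plt.xlabel('t')
plt.ylabel('error probability')
plt.legend()
plt.show()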
As for your third question, just solve P(errA) - P(errD) > 0 (this means that the error probability of D is smaller than for A, and so its performance is better). Since P(errA) - P(errD) = t - 3t^2 + 2t^3 = t(1-t)(1-2t), the difference is positive exactly for 0 < t < 0.5, so the condition is t < 0.5.
To come back to ensemble learning, note that the assumption of independence between your estimators is usually not verified in practice.

How to combine various distance functions into one given the following dataset?

I have a few distance functions which return the distance between two images, and I want to combine these distances into a single distance using weighted scoring, e.g. ax1 + bx2 + cx3 + dx4, etc. I want to learn these weights automatically such that my test error is minimised.
For this purpose I have a labeled dataset which contains various triplets of images (a,b,c) such that a has a smaller distance to b than it has to c,
i.e. d(a,b) < d(a,c).
I want to learn weights such that this ordering of the triplets is preserved as accurately as possible (i.e. the weighted linear score is smaller for a & b and larger for a & c).
What sort of machine learning algorithm can be used for this task, and how can it be achieved?
Hopefully I understand your question correctly, but it seems that this could be solved more easily with constrained optimization directly, rather than with classical machine learning (whose algorithms are often implemented via constrained optimization anyway; see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{ d(I_p,I_r) - d(I_p,I_q), 0 } <= e_j
for the jth triplet (p,q,r) in T, whose label asserts d(I_p,I_r) <= d(I_p,I_q); here I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e is measuring how far your chosen weights are from satisfying all the chosen triplets in the training set. If the weights preserve ordering of label j, then d(I_p,I_r)-d(I_p,I_q) < 0 and so e_j = 0. If they don't, then e_j will measure the amount of violation of training label j. Solving the optimization problem would give the best w; i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
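As an illustration, a minimal sketch of this formulation with cvxpy (the function and variable names are mine; deltas is assumed to precompute d_i(I_p,I_r) - d_i(I_p,I_q) for every base distance and every training triplet):

import cvxpy as cp

def learn_weights(deltas, lam=0.0):
    # deltas: (n_triplets, n_distances) array; row j stacks
    # d_i(I_p, I_r) - d_i(I_p, I_q) over the base distances d_i,
    # where the label of triplet j says I_r should be the closer image.
    n, m = deltas.shape
    w = cp.Variable(m, nonneg=True)  # nonnegative weights keep d distance-like
    e = cp.Variable(n, nonneg=True)  # slack: violation of each triplet
    constraints = [deltas @ w <= e]  # enforces e_j >= max{delta_j . w, 0}
    objective = cp.Minimize(cp.norm(e, 2) + lam * cp.norm(w, 2))
    cp.Problem(objective, constraints).solve()
    return w.value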
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.

Fast Exact Solvers for Chromatic Number

Finding the chromatic number of a graph is an NP-hard problem, so there isn't a fast solver 'in theory'. Is there any publicly available software that can compute the exact chromatic number of a graph quickly?
I'm writing a Python script that computes the chromatic number of many graphs, but it is taking too long even for small graphs. The graphs I am working with vary widely: they can be sparse or dense, but usually have fewer than 10,000 nodes. I formulated the problem as an integer program and passed it to Gurobi to solve. Do you have recommendations for software, different IP formulations, or different Gurobi settings to speed this up?
import networkx as nx
from gurobipy import Model, GRB, quicksum

# create test graph
n = 50
p = 0.5
G = nx.erdos_renyi_graph(n, p)

# compute chromatic number -- ILP solve
m = Model('chrom_num')

# upper bound on the number of colors needed (max degree + 1 always suffices)
k = max(dict(G.degree()).values()) + 1

# create k binary variables, y_0 ... y_{k-1}, to indicate whether color j is used
y = [m.addVar(vtype=GRB.BINARY, name='y_%d' % j) for j in range(k)]

# create n * k binary variables, x_{l,j} that is 1 if node l is colored with j
x = [[m.addVar(vtype=GRB.BINARY, name='x_%d_%d' % (l, j)) for j in range(k)]
     for l in range(n)]

# objective function: minimize colors used --> sum of y_0 ... y_{k-1}
m.setObjective(quicksum(y), GRB.MINIMIZE)
m.update()

# add constraint -- each node gets exactly one color
for u in range(n):
    m.addConstr(quicksum(x[u]) == 1, name='NC_%d' % u)

# add constraint -- y_j is forced high if any node uses color j
for u in range(n):
    for j in range(k):
        m.addConstr(x[u][j] <= y[j], name='SH_%d_%d' % (u, j))

# add constraint -- adjacent nodes have different colors
for u in range(n):
    for v in G[u]:
        if v > u:
            for j in range(k):
                m.addConstr(x[u][j] + x[v][j] <= 1,
                            name='ADJ_%d_%d_COL_%d' % (u, v, j))

# update model, solve, read off the chromatic number
m.update()
m.optimize()
chrom_num = int(round(m.objVal))
I am looking to compute exact chromatic numbers, although I would be interested in algorithms that compute approximate chromatic numbers if they have reasonable theoretical guarantees, such as a constant-factor approximation.
You might want to try a SAT solver or a Max-SAT solver. I expect that they will work better than a reduction to an integer program, since I think colorability is closer to satisfiability.
SAT solvers receive a propositional Boolean formula in Conjunctive Normal Form and output whether the formula is satisfiable. The following problem COL_k is in NP:
Input: a graph G and a natural number k.
Output: whether G is k-colorable.
To solve COL_k you encode it as a propositional Boolean formula with one propositional variable for each pair (u,c) consisting of a vertex u and a color 1 <= c <= k. You need clauses which ensure that every vertex is colored by at least one color, and clauses which ensure that the two endpoints of each edge receive different colors.
Then you just do a binary search to find the value of k such that G is k-colorable but not (k-1)-colorable.
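A minimal sketch of this encoding, writing the standard DIMACS CNF format that these solvers all accept (the indexing scheme var(u, c) is my own choice):

def color_cnf(n, edges, k):
    # One variable per (vertex, color) pair; var(u, c) is true iff
    # vertex u (0-based) receives color c (1-based).
    def var(u, c):
        return u * k + c  # 1-based DIMACS variable index
    clauses = []
    for u in range(n):
        clauses.append([var(u, c) for c in range(1, k + 1)])  # >= one color
    for u, v in edges:
        for c in range(1, k + 1):
            clauses.append([-var(u, c), -var(v, c)])  # endpoints differ
    return n * k, clauses

def write_dimacs(path, n_vars, clauses):
    with open(path, 'w') as f:
        f.write('p cnf %d %d\n' % (n_vars, len(clauses)))
        for clause in clauses:
            f.write(' '.join(map(str, clause)) + ' 0\n')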
There are various free SAT solvers. I have used Lingeling successfully, but you can find many others on the SAT competition website. They all use the same input and output format. Google "MiniSAT User Guide: How to use the MiniSAT SAT Solver" for an explanation of this format.
You can also use a Max-SAT solver; again, consult the Max-SAT competition website. These solve the Partial Max-SAT problem, in which clauses are partitioned into hard clauses and soft clauses. Here, the solver finds the maximal number of soft clauses which can be satisfied while also satisfying all of the hard clauses; see the input format on the Max-SAT competition website (under rules -> details).
You can formulate the chromatic number problem as a single Max-SAT problem (as opposed to several SAT problems as above). In this sense, Max-SAT is a better fit. On the other hand, I have the impression that SAT solvers generally perform better than Max-SAT solvers. I don't have any experience with this kind of solver, so I cannot say anything more.
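For concreteness, a sketch of the single Max-SAT formulation in the WCNF format, reusing the variables above plus one indicator y_c per color (again, the indexing is my own; hard clauses get the reserved 'top' weight, and each falsified soft clause -y_c corresponds to one used color):

def chromatic_wcnf(path, n, edges, k):
    def x(u, c):
        return u * k + c          # x_{u,c}: vertex u gets color c
    def y(c):
        return n * k + c          # y_c: color c is used somewhere
    top = k + 1                   # any weight > sum of soft weights works
    hard, soft = [], []
    for u in range(n):
        hard.append([x(u, c) for c in range(1, k + 1)])   # >= one color
        for c in range(1, k + 1):
            hard.append([-x(u, c), y(c)])                 # x_{u,c} -> y_c
    for u, v in edges:
        for c in range(1, k + 1):
            hard.append([-x(u, c), -x(v, c)])             # proper coloring
    for c in range(1, k + 1):
        soft.append([-y(c)])                              # prefer unused colors
    with open(path, 'w') as f:
        f.write('p wcnf %d %d %d\n' % (n * k + k, len(hard) + len(soft), top))
        for cl in hard:
            f.write('%d %s 0\n' % (top, ' '.join(map(str, cl))))
        for cl in soft:
            f.write('1 %s 0\n' % ' '.join(map(str, cl)))

Provided k is at least the chromatic number, the optimum cost reported by the solver (the number of falsified soft clauses) is exactly the number of colors used, i.e. the chromatic number.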

A new edge is inserted into a minimum spanning tree

I am trying to find an algorithm for the following question, with one difference:
the edge weights are not necessarily distinct.
Give an efficient algorithm to test if T remains the minimum-cost spanning tree when the new edge is added to G.
In this link there is a solution, but it does not cover the difference I stated above (the edge weights are not necessarily distinct):
Updating a Minimum spanning tree when a new edge is inserted
Does someone have an idea?
Well, the naive approach of just using Prim or Kruskal to find the min-cost spanning tree of the new graph and then seeing which one has a lower total cost isn't too bad at O(|E| log |E|).
But we don't need to look at the whole graph.
Suppose your new edge connects vertices A and B. Let C be the parent of A. If B is not a descendant of A, and A-B has a lower cost than A-C, then T is no longer the MST and B should be the new parent of the subtree rooted at A.
If B is a descendant of A, and A-B is cheaper than some branch in T along the path from A to B, then T is no longer the MST; the highest-cost edge along that path should be removed, making B the root of the newly disconnected component, which should then be added as a child of A.
I believe you may need to check these things a second time, reversing the roles of A and B. The complexity of this is O(log |V|), where the base of the log is the average number of children per node of T. In the case of T being a straight line, it's O(|V|), but otherwise, I think you could say it is O(log |V|).
First find an MST using one of the existing efficient algorithms.
Now adding an edge (v,w) creates a cycle in the MST. If the newly added edge has the maximum cost among the edges on the cycle then the MST remains as it is. If some other edge on the cycle has the maximum cost, then that's the edge to be removed to get a tree with lower cost.
So we need an efficient way to find the edge with the maximum value on the cycle. You can climb from v and w until you reach LCA(v, w) (the least common ancestor of v and w) to get the edge with the max cost. This takes linear time in the worst case.
If you are going to answer multiple such queries then pre-processing the MST is probably better. You can pre-process the MST to get a sparse table data structure in O(N lg N) time and then use this data structure to answer max queries in O(lg N) time in the worst case.
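A minimal sketch of the linear-time check, assuming the MST has been rooted and parent, depth, and weight arrays precomputed (weight[u] is the cost of the edge from u to parent[u]); a sparse table over the same rooted tree gives the O(lg N) query version:

def max_edge_on_cycle(parent, depth, weight, v, w):
    # Walk v and w up toward LCA(v, w), tracking the heaviest tree edge
    # on the v-w path; the loop always lifts the deeper endpoint.
    best = float('-inf')
    while v != w:
        if depth[v] < depth[w]:
            v, w = w, v
        best = max(best, weight[v])
        v = parent[v]
    return best

# T stays optimal iff the new edge is no cheaper than this maximum:
# keep T when new_cost >= max_edge_on_cycle(parent, depth, weight, v, w)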
