Time series clustering of activity of machines - machine-learning

I have an N x M matrix where N is the number of time intervals and M is the number of nodes in a graph.
Each cell indicates whether the corresponding node was active in that time interval.
Now I need to find groups of nodes that always appear together across the time series. Is there some approach I can use to cluster these nodes based on their time-series activity?

In R you could do this:
# hierarchical clustering
library(dendextend) # contains color_branches()
dist_ts <- dist(t(mydata)) # distances between nodes (transpose so that the columns/nodes become rows)
hc_dist <- hclust(dist_ts)
dend_ts <- as.dendrogram(hc_dist)
# set some value for h (height within the dendrogram) here that makes sense for you
dend_100 <- color_branches(dend_ts, h = 100)
plot(dend_100)
This creates a dendrogram with colored branches.
You could do much better visualizations, but your post is pretty generic (somewhat unclear what you're asking) and you didn't indicate whether you like R at all.

As the sets may overlap, most clustering methods will not produce optimal results.
Instead, treat each time point as a transaction containing all active nodes as items, then run frequent itemset mining to find frequently co-active sets of machines.
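A minimal sketch of this in R with the arules package, assuming the activity matrix mydata from the question is binary (rows = time intervals, columns = nodes); the 30% support threshold is an arbitrary placeholder to tune:

library(arules)
# each time interval becomes a transaction whose items are the nodes active in it
trans <- as(as.matrix(mydata) == 1, "transactions")
# mine sets of nodes that are active together in at least 30% of the intervals
freq_sets <- apriori(trans,
                     parameter = list(supp = 0.3, minlen = 2,
                                      target = "frequent itemsets"))
inspect(head(sort(freq_sets, by = "support"), 10))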

Related

Clustering "access-time" data sequences

I have many sequences of data looking like this:
s1 = t11, t12, ..., t1m_1
s2 = t21, t22, ..., t2m_2
...
si = ti1, ti2, ..., tim_i
si means the i-th sequence, and tij means the time of the j-th access of the i-th sequence.
Each sequence has a different length (m_1 may not equal m_2), and the data of sequence si means that si was accessed at times ti1, ti2, ..., tim_i.
My goal is to cluster similar access-time sequences.
I'm not sure whether I can translate this problem into a time-series problem.
In my understanding, time-series data means that each sequence holds a value at each point in time (like stock data), but my sequences' values are the times at which the sequence was accessed.
Even if it can be translated into a time-series problem, there is another issue: the access times are very sparse (a sequence may be accessed at 1 s, 1000 s, 2000 s), so the time-series representation would be very large, and I don't think I can cluster it with an algorithm like DTW, whose time complexity would be too high.
As you pointed out, DTW would be quite slow, since comparing the first two series takes k * m_1 * m_2 operations.
To avoid this, and to more easily compare your sequences, you might somehow hammer them into the same format (thereby also losing information).
Here are some ideas:
1. Differentiate to obtain times-between-accesses, and build histograms with fixed bins across all data.
2. Count the number of accesses during each minute of the week (and divide by the number of times that minute-of-week appears in each series). Adapt this to the timescales of interest.
3. Count the "number of accesses up until now". So, instead of having data points only when an access was made ("sparse"), you'd get a data point for every timestamp ("dense") showing the number of accesses up to the current one.
#3 would be similar to an "integral image" in computer vision. After this, new summarization techniques open up, like moving averages, or even direct comparison (if the recordings happen in parallel).
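A tiny sketch of #3 in R; the timestamps and the one-minute sampling grid are made up for illustration:

s1 <- c(1, 950, 1000, 2000, 2050)                  # access times of one sequence, in seconds
grid <- seq(0, 2100, by = 60)                      # one sample point per minute
dense1 <- sapply(grid, function(t) sum(s1 <= t))   # "accesses up until now" at each minute
dense1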
In order to pick a more useful representation, you need to think about what is meaningful in your application.
After you get a uniform-length representation, you can use cheaper similarity measures. A typical one is cosine similarity (but be sure to normalize first).
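For instance, here is a sketch of idea #1 followed by cosine similarity in R; the two toy sequences and the bin edges are assumptions, not values from your data:

s1 <- c(1, 950, 1000, 2000, 2050)
s2 <- c(5, 1005, 2010, 2500)
breaks  <- c(0, 10, 100, 1000, Inf)                           # fixed bins for the times between accesses
to_hist <- function(s) as.numeric(table(cut(diff(s), breaks)))
H <- rbind(to_hist(s1), to_hist(s2))                          # one fixed-length vector per sequence
H <- H / sqrt(rowSums(H^2))                                   # L2-normalize each row first
H %*% t(H)                                                    # pairwise cosine similarities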

Are data dependencies relevant when preparing data for neural network?

Data: I have N rows of data like this: (x, y, z), where logically f(x, y) = z, that is, z depends on x and y; in my case the triple is (setting1, setting2, signal). Different x's and y's can lead to the same z, but those z's wouldn't mean the same thing.
There are 30 unique setting1 values, 30 unique setting2 values, and 1 signal for each (setting1, setting2) pairing, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make each data set into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3 x 900 columns.
Question:
Is it correct to keep all the duplicate setting1, setting2 values in the data set? Or should I remove them and include each unique value only once, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900, 3].
You are flattening it and getting an input layer of 3 * 900.
Some of those values are the result of a function of others.
It matters which function, because if it is a linear function, the NN might not work well:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards said variables.
E.g., if you are running LMS on [x1, x2, x3, average(x1,x2)] to predict y, you basically assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include the derived function of them.
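To see the linear-dependence problem concretely, here is a small least-squares example in R with simulated (purely hypothetical) data; the extra column is an exact linear combination of x1 and x2, so the fit cannot assign it a separate, meaningful weight:

set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100); x3 <- rnorm(100)
y  <- x1 + x2 + x3 + rnorm(100, sd = 0.1)
x4 <- (x1 + x2) / 2                  # redundant feature: the average of x1 and x2
coef(lm(y ~ x1 + x2 + x3 + x4))      # x4 comes back NA: it is linearly dependent on x1 and x2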
I was not able to find a link to support this, but my intuition is that you might want to shrink your input layer further, in addition to omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimum set of values that is 'reasonable' for making the prediction.
'Reasonable' is vague, but I understand it like this: if you are trying to predict the price of a house, include the square footage, the area quality, and the distance from a major hub; do not include the average sunspot activity on the day of the open house, even though you happen to have that data.
I would remove the duplicates, and I would also look for any other data that can be omitted, maybe by running PCA over the full set of N x [900, 3] data sets.
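A rough sketch of the 30 + 30 + 900 flattening followed by PCA in R; here 'datasets' (a list of [900, 3] matrices with columns setting1, setting2, signal) is a hypothetical name for your collection of samples:

# keep each unique setting value once plus all 900 signals -> 960 features per data set
flatten_one <- function(d) c(unique(d[, 1]), unique(d[, 2]), d[, 3])
X <- t(sapply(datasets, flatten_one))
pca <- prcomp(X, center = TRUE)      # inspect the variance explained to decide how many components to keep
summary(pca)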

A new edge is inserted into a minimum spanning tree

I am trying to find an algorithm for the following question, with one difference:
the edge weights are not distinct.
Give an efficient algorithm to test if T remains the minimum-cost spanning tree when the new edge is added to G.
In this link there is a solution, but it does not cover the difference I described above:
the edge weights are not necessarily distinct.
Updating a Minimum spanning tree when a new edge is inserted
Does someone have an idea?
Well, the naive approach of just using Prim or Kruskal to find the min cost spanning tree of the new graph and then see which one has a lower total cost isn't too bad at O(|E|log|E|).
But we don't need to look at the whole graph.
Suppose your new edge connects vertices A and B. If B is not a descendant of A, the new edge closes a cycle that runs from A up to the lowest common ancestor of A and B and back down to B; if A-B is cheaper than the most expensive tree edge on that path, T is no longer the MST, and that most expensive edge should be swapped out for A-B.
If B is a descendant of A, then if A-B is cheaper than the most expensive edge in T along the path from A to B, T is no longer the MST: the highest-cost edge along that path should be removed, B becomes the root of the newly disconnected component, and it should be re-attached as a child of A.
You may want to sanity-check these conditions with the roles of A and B reversed, although the cycle formulation is symmetric. The cost of walking the tree path is proportional to the height of T, roughly log|V| where the base of the log is the average number of children per node: O(|V|) in the case of T being a straight line, but otherwise you could say it is O(log|V|).
First find an MST using one of the existing efficient algorithms.
Now adding an edge (v, w) creates a cycle in the MST. If the newly added edge costs at least as much as every other edge on the cycle (ties are fine, since the weights need not be distinct), then T remains a minimum spanning tree. If some other edge on the cycle has a strictly larger cost, then that is the edge to remove to get a tree with lower cost.
So we need an efficient way to find the edge with the maximum value on the cycle. You can climb from v and w until you reach LCA(v, w) (the least common ancestor of v and w) to get the edge with the max cost. This takes linear time in the worst case.
If you are going to answer multiple such queries then pre-processing the MST is probably better. You can pre-process the MST to get a sparse table data structure in O(N lg N) time and then use this data structure to answer max queries in O(lg N) time in the worst case.
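A small sketch of that cycle check with igraph in R; tree (the current MST T, with a 'weight' edge attribute), v, w, and c_new (the new edge's cost) are assumed inputs:

library(igraph)

remains_mst <- function(tree, v, w, c_new) {
  # the unique tree path from v to w is the rest of the cycle closed by the new edge (v, w)
  path_edges <- shortest_paths(tree, from = v, to = w, output = "epath")$epath[[1]]
  # T stays optimal iff the new edge is at least as expensive as the costliest edge on that path
  c_new >= max(path_edges$weight)
}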

Shuffling on Spark cartesian product

Assume a problem where I have an RDD X. I calculate the mean m on a single worker node, and then I want to compute X - m, e.g. to calculate standard deviations. I want this to happen in the cluster, not on the driver node, i.e. I want m to be distributed. I thought of implementing it as a cartesian product of those two RDDs, so that essentially, as soon as m gets calculated, it propagates to all workers and they compute X - m. My fear is that Spark will shuffle the X's to where m lives and do the subtraction there. Is there any guarantee about which side gets shuffled in the case of X.cartesian(m)?
The mean/stdev problem above is for illustration purposes - I know it's not a great fit, but it's simple enough.

Synonym chains - Efficient routing algorithm for iOS/sqlite

A synonym chain is a series of closely related words that span two anchors. For example, the English words "black" and "white" can be connected as:
black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white
Or, here's "true" and "false":
true-just=fair=beautiful=pretty-artful-artificial-sham-false
I'm working on a thesaurus iOS app, and I would like to display synonym chains also. The goal is to return a chain from within a weighted graph of word relations. My source is a very large thesaurus with weighted data, where the weights measure similarity between words. (e.g., "outlaw" is closely related to "bandit", but more distantly related to "rogue.") Our actual values range from 0.001 to ~50, but you can assume any weight range.
What optimization strategies do you recommend to make this realistic, e.g., within 5 seconds of processing on a typical iOS device? Assume the thesaurus has half a million terms, each with 20 associations. I'm sure there's a ton of prior research on these kinds of problems, and I'd appreciate pointers on what might be applied to this.
My current algorithm involves recursively descending a few levels from the start and end words, and then looking for intersecting words, but that becomes too slow with thousands of sqlite (or Realm) selects.
Since you said your source is a large thesaurus with weighted data, I'm assuming if you pick any word, you will have the weight to its successor in the similarity graph. I will always use the sequence below, when I'm giving any example:
black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white
Let's think of the words as nodes in a graph; each similarity relationship a word has with another word is an edge in that graph. Each edge is weighted with a cost, which is the weight you have in the source file. So a good way to find a path from one word to another is A* (A-star) pathfinding.
I'm taking the minimum "cost" to travel from a word to its successor to be 1; you can adjust it accordingly. First you will need a good heuristic function, since this is a greedy algorithm. The heuristic function returns the "greedy" distance between any two words, and you must respect the fact that the "distance" it returns can never be bigger than the real distance between the two words. Since I don't know of any general relationship between words in a thesaurus, my heuristic function will always return the minimum cost 1. In other words, it will always say a word is the most similar word to any other; for example, my heuristic function tells me that 'black' is the best synonym for 'white'.
You should tune the heuristic function if you can, so that it returns more accurate distances, making the algorithm run faster. That's the tricky part, I guess.
You can see the pseudo-code for the algorithm on the Wikipedia article I sent. But here it is for a faster explanation:
function A*(start, goal)
    closedset := the empty set    -- The set of nodes already evaluated.
    openset := {start}            -- The set of tentative nodes to be evaluated, initially containing the start node.
    came_from := the empty map    -- The map of navigated nodes.

    g_score[start] := 0           -- Cost from start along best known path.
    -- Estimated total cost from start to goal through y.
    f_score[start] := g_score[start] + heuristic_cost_estimate(start, goal)

    while openset is not empty
        current := the node in openset having the lowest f_score[] value
        if current = goal
            return reconstruct_path(came_from, goal)

        remove current from openset
        add current to closedset
        for each neighbor in neighbor_nodes(current)
            if neighbor in closedset
                continue
            tentative_g_score := g_score[current] + dist_between(current, neighbor)

            if neighbor not in openset or tentative_g_score < g_score[neighbor]
                came_from[neighbor] := current
                g_score[neighbor] := tentative_g_score
                f_score[neighbor] := g_score[neighbor] + heuristic_cost_estimate(neighbor, goal)
                if neighbor not in openset
                    add neighbor to openset

    return failure

function reconstruct_path(came_from, current)
    total_path := [current]
    while current in came_from:
        current := came_from[current]
        total_path.append(current)
    return total_path
Now, for the algorithm you'll have 2 arrays of nodes, the ones you are going to visit (opened) and the ones you already visited (closed). You will also have two arrays of distances for each node, that you will be completing as you travel through the graph.
One array (g_score) will tell you the real lowest traveled distance between the starting node and the specified node. For example, g_score["hidden"] will return the lowest weighted cost to travel from 'black' to 'hidden'.
The other array (f_score) gives, for each node, the estimated total cost of a path from the start to the goal passing through that node: the real cost traveled so far plus the heuristic's estimate of the remaining distance. For example, f_score["snug"] would be the cost already paid to travel from "black" to "snug" plus the heuristic's guess of the cost from "snug" to "white". Remember, the heuristic part will always be less than or equal to the real remaining cost, since our heuristic function needs to respect the aforementioned rule.
As the algorithm runs, you travel from node to node starting at the start word, saving the nodes you traveled through and the costs you "paid" to travel. You update g_score whenever you find a cheaper way to reach a node, and you use f_score to decide which of the 'unvisited' (open) nodes is best to visit next. It's best to keep the open nodes in a min-heap ordered by f_score.
You end the algorithm when you reach the goal node, and then you reconstruct the minimum path using the came_from map you kept updating at each iteration. The other way the algorithm stops is when it has visited all reachable nodes without finding the goal; when that happens, you can say there is no chain from the start node to the goal.
This algorithm is widely used in games to find good paths between two objects in a 3D world. To improve it, you just need a better heuristic function, one that lets the algorithm pick more promising nodes to visit first, leading it to the goal faster.
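For completeness, here is a compact translation of that pseudocode into R, only as a sketch: the adjacency-list format (a named list of named numeric weight vectors, where e.g. graph[["black"]]["dark"] holds the weight between "black" and "dark") and the constant heuristic are assumptions taken from this answer, and a plain linear scan stands in for the min-heap, so it shows the bookkeeping rather than something tuned for half a million terms:

a_star <- function(graph, start, goal, h = function(a, b) 1) {
  g <- setNames(0, start)                        # cheapest known cost from start to each node
  f <- setNames(h(start, goal), start)           # g plus heuristic estimate of the remaining cost
  came_from <- list()
  openset   <- start
  closedset <- character(0)

  while (length(openset) > 0) {
    current <- openset[which.min(f[openset])]    # open node with the lowest f_score
    if (current == goal) {                       # goal reached: rebuild the chain
      path <- current
      while (!is.null(came_from[[current]])) {
        current <- came_from[[current]]
        path <- c(current, path)
      }
      return(path)
    }
    openset   <- setdiff(openset, current)
    closedset <- c(closedset, current)

    for (nb in names(graph[[current]])) {
      if (nb %in% closedset) next
      tentative <- g[[current]] + graph[[current]][[nb]]
      if (!(nb %in% openset) || tentative < g[[nb]]) {
        came_from[[nb]] <- current
        g[nb] <- tentative
        f[nb] <- tentative + h(nb, goal)
        if (!(nb %in% openset)) openset <- c(openset, nb)
      }
    }
  }
  NULL                                           # no chain between the two words
}

# e.g. a_star(graph, "black", "white") would return the chain of words, if one exists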
Here's a closely related question and answer: Algorithm to find multiple short paths
There you can see comments about Dijkstra's algorithm, A*, and Dinic's algorithm, but more broadly also the ideas of maximum flow and minimum-cost flow.
