I want to remove some lines from a large text file, but I want to do this without allocating more memory than is required for holding the original string. So far, I can only manage the following:
s
|> StringIO.open()
|> then(fn {:ok, device} ->
IO.binstream(device, :line)
end)
|> Stream.reject(&Regex.match?(~r{<date_of_creation>.*</date_of_creation>\n}, &1))
|> Enum.join()
But this ends up doubling the memory required for the original string, because of the join at the end. Is there a better way to do this with just Elixir/Erlang?
Depending on what you want to do with the result, a way to considerably reduce the memory usage is to avoid building the string in the first place and use IO data.
In your example, this could be achieved by removing the call to Enum.join/1 and replacing Stream.reject/2 with Enum.reject/2:
|> Enum.reject(&Regex.match?(~r{<date_of_creation>.*</date_of_creation>\n}, &1))
This will return an IO list which can be used directly for I/O, and you might not need the string at all. This is how Phoenix is able to render templates/JSON efficiently: avoiding expensive string concatenation in the first place.
Assuming you had to join with a separator (Enum.join/2), Enum.intersperse/2 could be used to build an IO list.
I have an array of objects and two stacks. I go over the array, and for every element I push the previous objects (pop from stack 1 and push to stack 2) and then restore them (pop from stack 2 and push to stack 1).
I wonder what time complexity it is: o(n) or o(n^2),
because every stack operation (push, pop) is o(1).
for (int i = 0; i < buildingsHeight.Length; i++)
{
    while (!BuildingStack.IsEmpty() && !didWeFoundHigerBuilding)
    {
        BuildingStackNode buildingInLine = BuildingStack.Pop();
        helperStack.Push(BuildingStack.Pop());
    }
    // restore data
    while (!helperStack.IsEmpty())
    {
        BuildingStack.Push(helperStack.Pop());
    }
}
First of all, I suppose you mean "O" and not little "o" (which has a different meaning)? Secondly, saying something like O(n) doesn't make any sense without defining "n". So what is n?
The code you showed is a bit strange in that you are looping over the array buildingsHeight but not doing anything with the array inside the loop. Anyway, the time complexity of this piece of code depends on two parameters: the length a of buildingsHeight and the number b of items in BuildingStack (assuming that helperStack is initially empty).
The while loops each take O(b) — at most b pops/pushes to drain BuildingStack and another b to restore it — and the for loop iterates a times, so the overall complexity is O(ab).
Could someone provide step-by-step pseudocode that uses BFS to search for a cycle in a directed/undirected graph?
Can it achieve O(|V|+|E|) complexity?
I have only seen DFS implementations so far.
You can take a non-recursive DFS algorithm for detecting cycles in undirected graphs and replace the stack for the nodes with a queue to get a BFS algorithm. It's straightforward:
q <- new queue                    // for DFS you would use a stack instead
append the start node n to q
while q is not empty do
    n <- remove first element of q
    if n is visited then
        output 'Hurray, I found a cycle'
    mark n as visited
    for each edge (n, m) in E do
        append m to q
Since BFS visits each node and each edge only once, you have a complexity of O(|V|+|E|).
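If it helps to see it concretely, here is a minimal Erlang sketch of the same traversal, assuming the graph is given as a map from vertex to its list of neighbours (the module name bfs_cycle and the cycle/acyclic return values are my own choices, not part of the pseudocode):

-module(bfs_cycle).
-export([has_cycle/2]).

%% Graph is assumed to be a map of the form Vertex => [Neighbour].
%% Example: bfs_cycle:has_cycle(#{a => [b], b => [a]}, a) returns cycle.
has_cycle(Graph, Start) ->
    bfs(Graph, queue:in(Start, queue:new()), sets:new()).

bfs(Graph, Q0, Visited) ->
    case queue:out(Q0) of
        {empty, _} ->
            acyclic;
        {{value, N}, Q1} ->
            case sets:is_element(N, Visited) of
                true ->
                    cycle;                          %N was reached a second time
                false ->
                    Neighbours = maps:get(N, Graph, []),
                    Q2 = lists:foldl(fun queue:in/2, Q1, Neighbours),
                    bfs(Graph, Q2, sets:add_element(N, Visited))
            end
    end.

As in the pseudocode, the edge leading back to the vertex we just came from is not excluded, so for undirected graphs (where every edge appears in both adjacency lists) a practical version would also skip that edge.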
I find the BFS algorithm perfect for that.
The time complexity is the same.
You want something like this (adapted from Wikipedia):
Cycle-With-Breadth-First-Search(Graph g, Vertex root):
    create empty set S
    create empty queue Q
    root.parent = null
    add root to S
    Q.enqueue(root)
    while Q is not empty:
        current = Q.dequeue()
        for each node n that is adjacent to current:
            if n is not in S:
                add n to S
                n.parent = current
                Q.enqueue(n)
            else              // We found a cycle
                return n and current
I only added the else (we found a cycle) block for the cycle detection and removed the original "if reached target" block used for target detection. In total it's the same algorithm.
To find the exact cycle you will have to find a common ancestor of n and current. The lowest one can be found in O(n) time by walking up the parent pointers. The cycle is then known: it runs from that ancestor down to n and down to current, and is closed by the edge connecting current and n.
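A rough Erlang sketch of that ancestor walk, assuming the parent pointers set during the BFS are kept in a map Parents (vertex => parent, with the root mapped to null); the function names are my own:

cycle_through(N, Current, Parents) ->
    AncestorsOfN = path_to_root(N, Parents),              %[N, parent(N), ..., Root]
    Common = first_common(Current, Parents, AncestorsOfN),
    UpFromN = lists:takewhile(fun(V) -> V =/= Common end, AncestorsOfN),
    UpFromCurrent = lists:takewhile(fun(V) -> V =/= Common end,
                                    path_to_root(Current, Parents)),
    %% N .. Common .. Current, closed by the edge between Current and N
    UpFromN ++ [Common | lists:reverse(UpFromCurrent)].

path_to_root(null, _Parents) -> [];
path_to_root(V, Parents) -> [V | path_to_root(maps:get(V, Parents), Parents)].

first_common(V, Parents, Ancestors) ->
    case lists:member(V, Ancestors) of
        true  -> V;
        false -> first_common(maps:get(V, Parents), Parents, Ancestors)
    end.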
For more info about cycles and BFS read this link https://stackoverflow.com/a/4464388/6782134
Disclaimer: The author is a newbie in Erlang.
Imagine we have a graph consisting of 1M nodes, where each node has 0-4 neighbours (the edges emanate from each node to those neighbours, so the graph is directed and connected).
Here is my choice of data structures:
To store the graph I use digraph, which is based on ETS tables. This allows fast (O(1)) access to the neighbours of a node.
For the list of unvisited nodes, I use gb_sets:take_smallest (the set is kept sorted, and the smallest node is deleted as it is fetched).
For the list of predecessors I use the dict structure, which lets me store the predecessors as pairs: {Node1,Node1_predecessor}, {Node2,Node2_predecessor}.
For the list of visited nodes I use a simple list.
Problems:
The code becomes very hard to read and maintain when I try to update the weight of a node both in the digraph structure and in the Unvisited_nodes structure. It doesn't seem right to keep one 'object' whose 'fields' need to be updated in two data structures simultaneously. What is the right way to do that?
The same question applies to the predecessor list. Where should I store the predecessor 'field' of a node 'object'? Maybe in the Graph (digraph) structure?
Or maybe I should rethink the whole Dijkstra's algorithm in terms of processes and messages instead of objects (nodes and edges) and their fields (weights)?
UPD:
Here is the code based on the recommendations of Antonakos:
dijkstra(Graph,Start_node_name) ->
    io:format("dijkstra/2: start~n"),
    Paths = dict:new(),
    io:format("dijkstra/2: initialized empty Paths~n"),
    Unvisited = gb_sets:new(),
    io:format("dijkstra/2: initialized empty Unvisited nodes priority queue~n"),
    Unvisited_nodes = gb_sets:insert({0,Start_node_name,root},Unvisited),
    io:format("dijkstra/2: Added start node ~w with the weight 0 to the Unvisited nodes: ~w~n", [Start_node_name, Unvisited_nodes]),
    Paths_updated = loop_through_nodes(Graph,Paths,Unvisited_nodes),
    io:format("dijkstra/2: Finished searching for shortest paths: ~w~n", [Paths_updated]).
loop_through_nodes(Graph,Paths,Unvisited_nodes) ->
    %% We need this condition to stop looping through the Unvisited nodes if it is empty
    case gb_sets:is_empty(Unvisited_nodes) of
        false ->
            {{Current_weight,Current_name,Previous_node}, Unvisited_nodes_updated} = gb_sets:take_smallest(Unvisited_nodes),
            case dict:is_key(Current_name,Paths) of
                false ->
                    io:format("loop_through_nodes: Found a new smallest unvisited node ~w~n",[Current_name]),
                    Paths_updated = dict:store(Current_name,{Previous_node,Current_weight},Paths),
                    io:format("loop_through_nodes: Updated Paths: ~w~n",[Paths_updated]),
                    Out_edges = digraph:out_edges(Graph,Current_name),
                    io:format("loop_through_nodes: Ready to iterate through the out edges of node ~w: ~w~n",[Current_name,Out_edges]),
                    Unvisited_nodes_updated_2 = loop_through_edges(Graph,Out_edges,Paths_updated,Unvisited_nodes_updated,Current_weight),
                    io:format("loop_through_nodes: Looped through out edges of the node ~w and updated Unvisited nodes: ~w~n",[Current_name,Unvisited_nodes_updated_2]),
                    loop_through_nodes(Graph,Paths_updated,Unvisited_nodes_updated_2);
                true ->
                    loop_through_nodes(Graph,Paths,Unvisited_nodes_updated)
            end;
        true ->
            Paths
    end.
loop_through_edges(_Graph,[],_Paths,Unvisited_nodes,_Current_weight) ->
    io:format("loop_through_edges: No more out edges ~n"),
    Unvisited_nodes;
loop_through_edges(Graph,Edges,Paths,Unvisited_nodes,Current_weight) ->
    io:format("loop_through_edges: Start ~n"),
    [Current_edge|Rest_edges] = Edges,
    {Current_edge,Current_node,Neighbour_node,Edge_weight} = digraph:edge(Graph,Current_edge),
    case dict:is_key(Neighbour_node,Paths) of
        false ->
            io:format("loop_through_edges: Inserting new neighbour node ~w into Unvisited nodes~n",[Neighbour_node]),
            Unvisited_nodes_updated = gb_sets:insert({Current_weight+Edge_weight,Neighbour_node,Current_node},Unvisited_nodes),
            io:format("loop_through_edges: The unvisited nodes are: ~w~n",[Unvisited_nodes_updated]),
            loop_through_edges(Graph,Rest_edges,Paths,Unvisited_nodes_updated,Current_weight);
        true ->
            loop_through_edges(Graph,Rest_edges,Paths,Unvisited_nodes,Current_weight)
    end.
Your choice of data structures looks OK, so it is mostly a question of what to insert in them and how to keep them up to date. I'd suggest the following (I have changed the names a bit):
Result: A dict mapping Node to {Cost, Prev}, where Cost is the total cost of the path to Node and Prev is its predecessor on the path.
Open: A gb_sets structure of {Cost, Node, Prev}.
A graph with edges of the form {EdgeCost, NextNode}.
The result of the search is represented by Result and the graph isn't updated at all. There is no multiprocessing or message passing.
The algorithm goes as follows:
Insert {0, StartNode, Nil} in Open, where Nil is something that marks the end of the path.
Let {{Cost, Node, Prev}, Open1} = gb_sets:take_smallest(Open). If Node is already in Result then do nothing; otherwise add {Cost, Node, Prev} to Result, and for every edge {EdgeCost, NextNode} of Node add {Cost + EdgeCost, NextNode, Node} to Open1, but only if NextNode isn't already in Result. Continue with Open1 until the set is empty.
Dijkstra's algorithm asks for a decrease_key() operation on the Open set. Since this isn't readily supported by gb_sets we have used the workaround of inserting a tuple for NextNode even if NextNode might be present in Open already. That's why we check if the node extracted from Open is already in Result.
Extended discussion of the use of the priority queue
There are several ways of using a priority queue with Dijkstra's algorithm.
In the standard version on Wikipedia, a node v is inserted only once, but the position of v is updated when the cost and predecessor of v change:
alt := dist[u] + dist_between(u, v)
if alt < dist[v]:
    dist[v] := alt
    previous[v] := u
    decrease-key v in Q
Implementations often simplify by replacing decrease-key v in Q with add v to Q. This means that v can be added more than once, and the algorithm must therefore check that a node u extracted from the queue hasn't already been added to the result.
In my version I am replacing the entire block above with add v to Q. The queue will therefore contain even more entries, but since they are always extracted in order it doesn't affect the correctness of the algorithm. If you don't want these extra entries, you can use a dictionary to keep track of the minimum cost for each node.
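A sketch of that bookkeeping, using the same {Cost, Node, Prev} entries as above and a dict Best mapping each node to the lowest cost enqueued for it so far (the function name maybe_enqueue is mine):

maybe_enqueue(Cost, Node, Prev, Open, Best) ->
    case dict:find(Node, Best) of
        {ok, OldCost} when OldCost =< Cost ->
            {Open, Best};                                 %no improvement, skip it
        _ ->
            {gb_sets:insert({Cost, Node, Prev}, Open),    %enqueue the cheaper entry
             dict:store(Node, Cost, Best)}
    end.

Entries that were enqueued before a cheaper one arrived still end up in Open, but they are discarded by the existing check against Result when they are extracted.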
How does one set up a queue that holds N values, so that when N is reached the oldest item is removed as a new value is added?
Should this be done with an if statement?
I also want to run a calculation over the values in the queue as each new item is added, e.g. add up all of the values in the queue.
I assume from your question that you want both to cap the length of the queue and to get the sum of all the values in it.
To answer your easiest question first: Erlang queues, however you wish to represent them, are normal Erlang data structures so there are no problems in storing them in a dictionary.
The OTP queue module is actually very simple, but the plethora of interfaces easily makes it confusing to use. @Nathon's enqueue function can be made much more efficient by not using the queue data structure directly but by defining your own data structure which includes the queue and its current length, {Length,Queue}. If the sum is important, you could include it as well.
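For example, a minimal sketch of that idea, assuming numeric values and a fixed maximum length Max (the function names are mine):

new() -> {0, 0, queue:new()}.                    %Length, Sum and Queue

enqueue(Value, {Len, Sum, Q}, Max) when Len < Max ->
    {Len + 1, Sum + Value, queue:in(Value, Q)};
enqueue(Value, {Max, Sum, Q}, Max) ->
    {{value, Oldest}, Q1} = queue:out(Q),        %full, so drop the oldest element
    {Max, Sum - Oldest + Value, queue:in(Value, Q1)}.

sum({_Len, Sum, _Q}) -> Sum.

Neither queue:len/1 nor summing the whole queue is then needed on each call.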
The queue representations are very simple so it is very easy to write your own specialised form of it.
The simplest way is to keep the queue in a list, taking elements from the head and adding new elements to the end. So:
new(Max) when is_integer(Max), Max > 0 -> {0,Max,[]}.   %Length, Max and Queue list

take({L,M,[H|T]}) -> {H,{L-1,M,T}}.

add(E, {L,M,Q}) when L < M ->
    {L+1,M,Q ++ [E]};                    %Add element to end of list
add(E, {M,M,[_|T]}) ->
    {M,M,T ++ [E]}.                      %Drop the oldest, add element to end of list
When the queue becomes full the oldest member, which is at the front of the queue, is dropped. An empty queue generates an error. This is a very simple structure but it is inefficient as the queue is copied every time a new element is added. Reversing the list does not help as then the list is copied every time an element is removed from it. But it is simple, and it does work.
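A quick usage sketch, assuming the functions above are compiled into a module (I have called it bqueue here):

Q0 = bqueue:new(3),                                           %{0,3,[]}
Q1 = bqueue:add(d, bqueue:add(c, bqueue:add(b, bqueue:add(a, Q0)))),
%% the queue was full when d was added, so the oldest element, a, was dropped:
%% Q1 = {3,3,[b,c,d]}
{Oldest, _Q2} = bqueue:take(Q1).                              %Oldest = b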
A much more efficient structure is to split the queue into two lists, the front end of the queue and the rear end of the queue. The rear end is reversed and becomes the new front when the front is empty. So:
new(Max) when is_integer(Max), Max > 0 ->
    {0,Max,[],[]}.                          %Length, Max, Rear and Front

take({L,M,R,[H|T]}) -> {H,{L-1,M,R,T}};
take({L,M,R,[]}) when L > 0 ->
    take({L,M,[],lists:reverse(R)}).        %Move the rear to the front

add(E, {L,M,R,F}) when L < M ->
    {L+1,M,[E|R],F};                        %Add element to rear
add(E, {M,M,R,[_|T]}) ->
    {M,M,[E|R],T};                          %Add to rear, drop the oldest from the front
add(E, {M,M,R,[]}) ->
    add(E, {M,M,[],lists:reverse(R)}).      %Move the rear to the front
Again when the queue becomes full the oldest member, which is at the front of the queue, is dropped and an empty queue generates an error. This is the data structure used in the queue module.
It would be very easy to add the current sum of the elements to the structure and manage it directly.
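For example, a sketch of that, assuming numeric elements and extending the structure above to {Length,Max,Sum,Rear,Front}:

take({L,M,S,R,[H|T]}) -> {H,{L-1,M,S-H,R,T}};
take({L,M,S,R,[]}) when L > 0 ->
    take({L,M,S,[],lists:reverse(R)}).       %Move the rear to the front

add(E, {L,M,S,R,F}) when L < M ->
    {L+1,M,S+E,[E|R],F};                     %Add element to rear
add(E, {M,M,S,R,[H|T]}) ->
    {M,M,S+E-H,[E|R],T};                     %New element in, oldest (H) out of the sum
add(E, {M,M,S,R,[]}) ->
    add(E, {M,M,S,[],lists:reverse(R)}).     %Move the rear to the front

new/1 would then return {0,Max,0,[],[]}, and the current sum is always at hand.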
Often, when working on simple data structures like this, it is just as easy to roll your own module as it is to use a provided one.
Given the comments, this will do it:
enqueue(Value, Queue) ->
    Pushed = queue:in(Value, Queue),
    Sum = fun (Q) -> lists:sum(queue:to_list(Q)) end,
    case queue:len(Pushed) of
        Len when Len > 10 ->
            Popped = queue:drop(Pushed),
            {Popped, Sum(Popped)};
        _ ->
            {Pushed, Sum(Pushed)}
    end.
If you don't actually want to sum the items, you can use lists:foldl instead, or just write a function to do the operation directly on a queue.
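For example (a sketch; the function names are mine), either fold over the list form of the queue or walk the queue directly:

product(Q) ->                                             %via lists:foldl/3
    lists:foldl(fun (X, Acc) -> X * Acc end, 1, queue:to_list(Q)).

fold_queue(F, Acc, Q) ->                                  %directly on the queue
    case queue:out(Q) of
        {empty, _}         -> Acc;
        {{value, X}, Rest} -> fold_queue(F, F(X, Acc), Rest)
    end.

fold_queue(fun erlang:max/2, 0, Q), for instance, would give the largest value (for non-negative values) without building an intermediate list.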