Efficient way to get last value with Flux (InfluxDB) - influxdb

I'm changing from the old Influx query to the new Flux language and I'm wondering how to efficiently get the last value of something without knowing when this last value was. So far I can only get the last value by defining a range start time. See code:
from(bucket: "my_bucket")
|> range(start: -<some_value>s)
|> filter(fn: (r) => ...
|> keep(columns:["_time", "_value",])
|> last()
But the problem is that I don't know a priori when the last value was. So if I make <some_value> large it slows down the query for things that had many values in this time range and when I give it a too small value, it won't find the last value when it was too long ago. So my question is how to find the last value in the most efficient way, similar to SELECT LAST(value) in the old syntax.
Thanks for the help!
I can't find an example that doesn't define the range.start parameter.

You can use start parameter with value 0 to start at the very beginning of "all time".
If you experience query slow down with going too much back in time, I suggest to call keep() after last(), as keep() is not a push down op. Please have a look at performance hints in Optimizing Flux Performance. The list of push down functions and combinations is listed in Start queries with pushdowns

Related

aggregateWindow aligned to specified timezone

Given the following query in Influxdb's Flux language:
from(bucket: "some-great-metrics")
|> range(start: v.timeRangeStart, stop: v.timeRangeStop)
|> aggregateWindow(every: 1mo, fn: sum)
|> yield()
Assuming my current timezone is PST. How to ensure that aggregateWindow respect the beginning and end of the 1mo duration in this specific timezone (PST)?
Searching in the documentation does not bring so much light to me, so far.
Seems like Influx 2.1 comes with a new Flux timezone package
You can try to upgrade to 2.1 and add this before your query:
import "timezone"
option location = timezone.location(name: "America/Los_Angeles")
Things are getting more and more interesting with Flux. Besides the option location that allows you to assign one time zone to an entire query (applicable to the original question, but not applicable in general), many functions that operate on time (including aggregateWindow()) now offer a location argument that allows you to operate with multiple time zones in a single streams of table. Unfortunately some of these are still undocumented.
In case, I've written an extensive review here.

working on sequences with lagged operators

A machine is turning on and off.
seqStartStop is a seq<DateTime*DateTime> that collects the start and end time of the execution of the machine tasks.
I would like to produce the sequence of periods where the machine is idle. In order to do so, I would like to build a sequence of tuples (beginIdle, endIdle).
beginIdle corresponds to the stopping time of the machine during
the previous cycle.
endIdle corresponds to the start time of the current production
cycle.
In practice, I have to build (beginIdle, endIdle) by taking the second element of the tuple for i-1 and the fist element of the following tuple i
I was wondering how I could get this task done without converting seqStartStop to an array and then looping through the array in an imperative fashion.
Another idea creating two copies of seqStartStop: one where the head is tail is removed, one where the head is removed (shifting backwards the elements); and then appying map2.
I could use skipand take as described here
All these seem rather cumbersome. Is there something more straightforward
In general, I was wondering how to execute calculations on elements with different lags in a sequence.
You can implement this pretty easily with Seq.pairwise and Seq.map:
let idleTimes (startStopTimes : seq<DateTime * DateTime>) =
startStopTimes
|> Seq.pairwise
|> Seq.map (fun (_, stop) (start, _) ->
stop, start)
As for the more general question of executing on sequences with different lag periods, you could implement that by using Seq.skip and Seq.zip to produce a combined sequence with whatever lag period you require.
The idea of using map2 with two copies of the sequence, one slightly shifted by taking the tail of the original sequence, is quite a standard one in functional programming, so I would recommend that route.
The Seq.map2 function is fine with working with lists with different lengths - it just stops when you reach the end of the shorter list - so you don't need to chop the last element of the original copy.
One thing to be careful of is how your original seq<DateTime*DateTime> is calculated. It will be recalculated each time it is enumerated, so with the map2 idea it will be calculated twice. If it's cheap to calculate and doesn't involve side-effects, this is fine. Otherwise, convert it to a list first with List.ofSeq.
You can still use Seq.map2 on lists as a list is an IEnumerable (i.e. a seq). Don't use List.map2 unless the lists are the same length though as it is more picky than Seq.map2.

How determine which algorithm has a run-time of log n

The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n. However, we may run into a case with LinkedList where we may be easily misled. For instance, we may have an algorithm to find the maximum of the list by starting from either the head or the tail in order to have a run-time of less than n. Logically, we may think that the run time would be log n, but it's not. Why is that? And how do you determine that?
Starting from the front or back (or alternating) does not change the basis of the search for the greatest value. All it does is reorder the search strategy.
If you have a sequential, ordered list and you do a binary search, each comparison reduces the possible locations for a match by 1/2.
If you look at one element of the linked list, each comparison reduces the possible locations for a match by 1 element.
That is a crucial difference.

Splitting and runtime of log n

Sorry, I made a mistake in my earlier question. Because of that I didn't get the answer I wanted.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n. However, we may run into a case with LinkedList where we may be easily misled. For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n. Logically, we may think that the run time would be log n, but it's not. Why is that? And how do you determine that?
Do we need to absolutely have splitting to get a run-time of log n? I don't think it makes any logical sense to say the run-time of n when the maximum run-time of the loop is n/2.
I think some concepts need a bit of refining here, because the time complexity is only related to algorithm, not to the size of the data structure you're operating on.
The teacher told us that every time you divide something by 2, the run-time is likely to be log n. For instance, if we divide an array into two, each time we traverse one of the array, the run-time would be log n.
Now, traversing an array, like
for (int i = 0; i < array.size; i++) {
variable = array[i];
}
runs in O(n): the time needed to perform such an operation varies linearly with the size of the array. You will have O(log n) for operations like a binary search on an array, but you cannot generalize this concept to all array operations, and especially not to those who need to iterate over the array.
Now, this sentence
For instance, we may have an algorithm to set the nth element of the list to something else by starting from either the head or the tail in order to have a run-time of less than n.
leads me to believe that you think that the n as used in big O and what you call the "nth element" are directly related. They aren't. On a linked list your only option to go to element n is to go to the start of the list and follow the links down the element you're looking for (or in the case of a double linked list, go to the start or end depending on the position of the element you're looking for), so this operation has a time complexity of O(n), ie linearly related to the length of the collection.

Speeding up Erlang indexation function

So following on from this question:
Erlang lists:index_of function?
I have the following code which works just fine:
-module(test_index_of).
-compile(export_all).
index_of(Q)->
N=length(Q),
Qs=lists:zip(lists:sort(Q), lists:seq(1, N)),
IndexFn=fun(X)->
{_, {_, I}}=lists:keysearch(X, 1, Qs),
I
end,
[IndexFn(X) || X <- Q].
test()->
Q=[random:uniform() || _X <- lists:seq(1, 20)],
{T1, _}=timer:tc(test_index_of, index_of, [Q]),
io:format("~p~n", [T1]).
Problem is, I need to run the index_of function a very large number of times [10,000] on lists of length 20-30 characters; the index_of function is the performance bottleneck in my code. So although it looks to be implemented reasonably efficiently to me, I'm not convinced it's the fastest solution.
Can anyone out there improve [performance-wise] on the current implementation of index_of ? [Zed mentioned gb_trees]
Thanks!
You are optimizing an operation on the wrong data type.
If you are going to make 10 000 lookups on the same list of 20-30 items, then it really pays off to do pre-computation to speed up those lookups. For example, lets make a tuple sorted on the key in a tuples of {key, index}.
1> Ls = [x,y,z,f,o,o].
[x,y,z,f,o,o]
2> Ls2 = lists:zip(Ls, lists:seq(1, length(Ls))).
[{x,1},{y,2},{z,3},{f,4},{o,5},{o,6}]
3> Ts = list_to_tuple(lists:keysort(1, Ls2)).
{{f,4},{o,5},{o,6},{x,1},{y,2},{z,3}}
A recursive binary search for a key on this tuple will very quickly home in on the right index.
Use proplists:normalize to remove duplicates, that is, if it is wrong to return 6 when looking up 'o' instead of 5. Or use folding and sets to implement your own filter that removes duplicates.
Try building a dict with dict:from_list/1 and make lookups on that dict instead.
But this still begs the question: Why do you want the index into a list of something? Lookups with lists:nth/2 has O(n) complexity.
Not sure if I understand this completely, but if the above is your actual usecase, then...
First of all, you could generate Q as the following, and you already save the zipping part.
Q=[{N,random:uniform()} || N <- lists:seq(1, 20)]
Taking this further on, you could generate a tree indexed by the values from the beginning:
Tree = lists:foldl(
fun(T, N) -> gb_trees:enter(uniform:random(), N, T) end,
gb_trees:empty(),
lists:seq(1, 20)
).
Then looking up your index becomes:
index_of(Item, Tree) ->
case gb_trees:lookup(Item, Tree) of
{value, Index} -> Index;
_ -> not_found
end.
I think you need custom sort function which record permutations it makes to input list. For example you can use lists:sort source. This should give you O(N*log N) performance.
Just one question: WTF are you trying do?
I just can't found what is practical purpose of this function. I think you do something odd. It seems that you just improved from O(NM^2) to O(NM*logM) but it is still very bad.
EDIT:
When I synthesize what is goal, It seems that you are trying use Monte Carlo method to determine probabilities of team's 'finishing positions' in English Premiere League. But I'm still not sure. You can determine most probable position [1,1,2] -> 1 or as fractional number as some sort of average 1.33 - for example this last one can be achieve with less effort than others.
In functional programing languages data structures are more important that in procedural or OO ones. They are more about work-flow. You will do this and than this and than ... In functional language as Erlang you should think in manner, I have this input and I want that output. Required output I can determine from this and this and so. There may be not necessary have list of things as you used to be in procedural approaches.
In procedural approaches you are used to use arrays for storage with constant random access. List is not that such thing. There are not arrays in Erlang where you can write (even array module which is balanced tree in reality). You can use tuple or binary for read only array but no one read write. I can write a lot about that there doesn't exist data structure with constant access time at all (from RAM, through arrays in procedural languages to HASH maps) but there is not enough space to explain it in detail here (from RAM technology, through L{1,2,3} CPU caches to necessity increase HASH length when number of keys increase and key HASH computation dependency of key length).
List is data structure which have O(N) random access time. It is best structure for store data which you want take one by one in same order as stored in list. For small N it can be capable structure for random access for small N when corresponding constant is small. For example when N is number of teams (20) in your problem it can be faster than O(logN) access to some sort of tree. But you must take care how big your constant is.
One of common component of algorithms are Key-Value lookups. There can be used arrays as supporting data structure in procedural world in some circumstances. Key must be integer number, space of possible key must not be to sparse and so. List doesn't serve as its substitution well for this purpose except for very small N here. I learn that best way how write functional code is avoid Key-Value lookups where is unnecessary. It often needs rearrange work-flow or refactoring data structures and so. Sometimes it looks like flip over problem solution like glove.
If I ignore that your probability model is wrong. From information you provide it seems that in your model team's season points are independent random events which is not true of course. There is impossible that all teams have some high amount of point, 82 for example just because there is some limit of points taken by all teams in one season. So forgot for this for now. Then I will simulate one 'path' - season and take result in form [{78,'Liverpool'}, {81,'Man Utd'}, ...], then I can sort it using lists:sort without loosing information which team is where. Results I would collect using iteration by path. For each path I would iterate over sorted simulation result and collect it in dictionary where team is key (constant and very cheap hash computation from atom and constant storage because key set is fixed, there is possibility to use tuples/records but seems like premature optimization). Value can be tuple of size 20 and position in tuple is final position and value is count of it.
Something like:
% Process simulation results.
% Results = [[{Points, Team}]]
process(Results) ->
lists:foldl(fun process_path/2,
dict:from_list([{Team, {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}} ||
Team <- ['Liverpool', 'Man Utd', ...]]),
Results).
% process simulation path result
process_path(R, D) ->
process_path(lists:reverse(lists:sort(R)), D, 1).
process_path([], _, D) -> D;
process_path([{_, Team}|R], D, Pos) ->
process_path(R, update_team(Team, Pos, D), Pos + 1).
% update team position count in dictionary
update_team(Team, Pos, D) ->
dict:update(Team, fun(T) -> add_count(T, Pos) end, D).
% Add final position Pos to tuple T of counts
add_count(T, P) -> setelement(P, T, element(P, T) + 1).
Notice that there is nothing like lists:index_of or lists:nth function. Resulting complexity will look like O(NM) or O(NMlogM) for small number M of Teams, but real complexity is O(NM^2) for O(M) setelement/3 in add_count/2. For bigger M you should change add_count/2 to some more reasonable.

Resources