I have a really big dataset for which I'd like to fetch a diagnostic sample. Until now I've been fetching all the data and sampling on my own machine, but that now causes both InfluxDB and my app to run out of memory.
Is there a way to keep the entire dataset at the DB level and downsample in a query?
Let's say that I'm interested in 1% of the entire measurement data. What would that query look like?
e.g. I want to get 1% of all values of a measurement.
Example case:
Measurement X
time val1 val2
---- ---- ----
0 A1 0.5
1 A2 0.7
2 A1 1.0
3 A3 1.5
4 A4 0.7
5 A3 0.5
6 A7 1.0
7 A1 0.5
8 A10 0.7
9 A2 0.1
Magic Query - 10%
time val1 val2
---- ---- ----
5 A3 0.5
Magic Query - 20%
time val1 val2
---- ---- ----
9 A2 0.1
4 A4 0.7
I wish to print this data in a table with the columns aligned. I tried with format, but the columns were not aligned. Does anyone know how to do it? Thank you.
(("tiscali" 10000 2.31 0.84 -14700.0 "none")
("atlantia" 50 22.65 22.68 1.5 "none")
("bper-banca" 1000 1.59 2.01 423.0 "none")
("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
("tesmec" 10000 0.12 0.14 150.0 "none")
("cover-50" 120 8.95 9.6 78.0 "none")
("ovs" 1000 1.71 1.93 217.0 "none")
("credito-emiliano" 200 5.7 6.26 112.0 "none"))
I tried to align the columns with the ~T directive, but had no luck. Is there a piece of code that prints table data nicely?
Let's break this down.
First, let's give your data a nice name:
(defparameter *data*
  '(("tiscali" 10000 2.31 0.84 -14700.0 "none")
    ("atlantia" 50 22.65 22.68 1.5 "none")
    ("bper-banca" 1000 1.59 2.01 423.0 "none")
    ("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
    ("tesmec" 10000 0.12 0.14 150.0 "none")
    ("cover-50" 120 8.95 9.6 78.0 "none")
    ("ovs" 1000 1.71 1.93 217.0 "none")
    ("credito-emiliano" 200 5.7 6.26 112.0 "none")))
Now, come up with a way to print each line using format and destructuring-bind. Widths of various fields are hard-coded in.
(defun print-line (line)
  (destructuring-bind (a b c d e f) line
    (format T "~20a ~5d ~6,2f ~6,2f ~10,2f ~4a~%" a b c d e f)))
Once you know you can print a line, you just need to do that for each line.
(mapcar 'print-line *data*)
Result:
tiscali              10000   2.31   0.84  -14700.00 none
atlantia                50  22.65  22.68       1.50 none
bper-banca            1000   1.59   2.01     423.00 none
alerion-cleanpower      30  44.14  36.45    -230.70 none
tesmec               10000   0.12   0.14     150.00 none
cover-50               120   8.95   9.60      78.00 none
ovs                   1000   1.71   1.93     217.00 none
credito-emiliano       200   5.70   6.26     112.00 none
I have something like this in my personal code, that I reproduced here in a simplified way:
(defpackage :tabular (:use :cl))
(in-package :tabular)
I have a function that turns any object into a list of values (a row), here the usage is for a list of values, so it is already in the correct shape.
(defgeneric columnize (object)
  (:documentation "Representation of object as a list of fields")
  (:method ((o list)) o))
I also define a transpose function that works with lists of various sizes:
(defun transpose (lists)
  (when (notany #'null lists)
    (cons
     (mapcar #'first lists)
     (transpose (mapcar #'cdr lists)))))
Here is your data, as defined by Chris:
(defparameter *data*
  '(("tiscali" 10000 2.31 0.84 -14700.0 "none")
    ("atlantia" 50 22.65 22.68 1.5 "none")
    ("bper-banca" 1000 1.59 2.01 423.0 "none")
    ("alerion-cleanpower" 30 44.14 36.45 -230.7 "none")
    ("tesmec" 10000 0.12 0.14 150.0 "none")
    ("cover-50" 120 8.95 9.6 78.0 "none")
    ("ovs" 1000 1.71 1.93 217.0 "none")
    ("credito-emiliano" 200 5.7 6.26 112.0 "none")))
And finally, a function that prints a list of objects in a tabular way.
Basically, I convert each object to a list of values, convert those values to strings, and compute their lengths. This gives a matrix of lengths that I transpose to get, for each column, the list of lengths in that column. The width of each column is then the maximum length of the actual data in it.
In practice, I also allow the generic function to add indicators such as how to justify (left/right), etc.
(defun tabulate (stream objects)
  (loop
    for n from 0
    for o in objects
    for row = (mapcar #'princ-to-string (columnize o))
    collect row into rows
    collect (mapcar #'length row) into row-widths
    finally
       (flet ((build-format-arguments (max-width row)
                (when (> max-width 0)
                  (list max-width #\space row))))
         (loop
           with number-width = (ceiling (log n 10))
           with col-widths = (transpose row-widths)
           with max-col-widths = (mapcar (lambda (s) (reduce #'max s)) col-widths)
           for index from 0
           for row in rows
           for entries = (mapcan #'build-format-arguments max-col-widths row)
           do (format stream
                      "~v,'0d. ~{~v,,,va~^ ~}~%"
                      number-width index entries)))))
For example:
(fresh-line)
(tabulate *standard-output* *data*)
Gives:
0. tiscali            10000 2.31  0.84  -14700.0 none
1. atlantia           50    22.65 22.68 1.5      none
2. bper-banca         1000  1.59  2.01  423.0    none
3. alerion-cleanpower 30    44.14 36.45 -230.7   none
4. tesmec             10000 0.12  0.14  150.0    none
5. cover-50           120   8.95  9.6   78.0     none
6. ovs                1000  1.71  1.93  217.0    none
7. credito-emiliano   200   5.7   6.26  112.0    none
As you can see, there are some adjustments that could be made to format floating-point values so that they align on the dot, but this is already quite useful.
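For readers who don't use Lisp, here is a minimal sketch of the same width computation in Python. It is not a translation of the code above: the function name `tabulate_rows` and the three sample rows are only illustrative, and numbers are simply right-justified, which is a rough stand-in for aligning on the dot rather than a real implementation of it.

```python
# Illustrative sample: three rows taken from the data above.
rows = [
    ("tiscali", 10000, 2.31, 0.84, -14700.0, "none"),
    ("ovs", 1000, 1.71, 1.93, 217.0, "none"),
    ("cover-50", 120, 8.95, 9.6, 78.0, "none"),
]

def tabulate_rows(rows):
    # Convert every cell to text, remembering which cells were numbers.
    cells = [[(str(v), isinstance(v, (int, float))) for v in row] for row in rows]
    # zip(*cells) transposes rows into columns, so each width is a column maximum.
    widths = [max(len(text) for text, _ in col) for col in zip(*cells)]
    lines = []
    for row in cells:
        parts = [
            text.rjust(w) if is_num else text.ljust(w)  # numbers right, text left
            for (text, is_num), w in zip(row, widths)
        ]
        lines.append(" ".join(parts))
    return lines

for line in tabulate_rows(rows):
    print(line)
```

Right-justifying lines up the last digit of each number, so values with different numbers of decimals (9.6 next to 8.95) still won't share a dot column; that would need splitting each number at the dot and padding both halves.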
I want to predict daily sales. I have a daily time series covering 15 months, plus an additional feature that states whether the store was closed on that day. If the store was closed, sales are zero. Hence, my data looks like this:
y = sales
x1 = sales yesterday
x2 = sales before yesterday
x3 = store closed?
y x1 x2 x3
4 - - 0
2 4 - 0
5 2 4 0
0 5 2 1
4 0 5 0
I am experimenting with tree regressors such as Random Forest and Extremely Randomized Trees. Intuitively, the first node should be store_closed == 1 and if this is true, the prediction should be zero. But somehow none of the algorithms works that way.
I don't understand why the zeros are not predicted correctly since it seems "easy" for me. Any ideas?
I'm doing a dynamic Monte Carlo simulation in Google Sheets, using the COUNTIF formula for the simulation. Something is not working the way I thought it would, but I cannot put my finger on it. I have two columns that I'm comparing, and I need to count the rows where the value in one column is bigger than the value in the other column. If I do this explicitly by propagating an IF comparison formula down a column, I obtain the correct result. However, if I do it with
=countif( A4:A, ">" & B4:B )
I do not obtain the correct result. My example is at this sheet: the number in cell C4 is the malfunctioning COUNTIF, which equals 2 in the example, while the number in cell E4 is 5, the correct count, obtained by propagating the comparison down column F and summing the results in E4.
p1 p2 n
0.5 0.51 10
Monte Carlo
0.50 0.60 2 5 0
0.90 0.50 1
0.60 0.30 1
0.50 0.60 0
0.40 0.30 1
0.40 0.50 0
0.60 0.70 0
0.60 0.30 1
0.70 0.50 1
0.10 0.30 0
There are two scenarios with countif:
(1) As a non-array formula, =countif( A4:A, ">" & B4:B ) would give you the same result as =countif( A4:A, ">" & B4 ) i.e. it would count only values of A greater than .60, giving the answer 2.
(2) As an array formula, =sum(countif( A4:A, ">" & B4:B )) would give you a separate result for each value of B (2+5+9+2...) giving the answer 56.
If you wanted to use countif, you would need to do something like this:
=ArrayFormula(countif(A4:A-B4:B,">"&0))
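To see where the three numbers come from, here is a small Python sketch (not Sheets code) that mimics each formula on the ten example rows. The lists `a` and `b` are stand-ins for A4:A13 and B4:B13, and the variable names are just for illustration.

```python
# Columns A and B from the example sheet (rows 4 to 13).
a = [0.5, 0.9, 0.6, 0.5, 0.4, 0.4, 0.6, 0.6, 0.7, 0.1]
b = [0.6, 0.5, 0.3, 0.6, 0.3, 0.5, 0.7, 0.3, 0.5, 0.3]

# (1) Non-array COUNTIF: only the first criterion (B4) is used.
non_array = sum(1 for x in a if x > b[0])

# (2) SUM over array COUNTIF: every A value is tested against every B value.
per_criterion = [sum(1 for x in a if x > y) for y in b]
array_sum = sum(per_criterion)

# Row-by-row comparison, which is what the question actually asks for.
pairwise = sum(1 for x, y in zip(a, b) if x > y)

print(non_array, array_sum, pairwise)  # 2 56 5
```

The first two counts compare the whole of column A against one or all values of B, which is why neither matches the row-by-row count of 5.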
try:
=INDEX(SUM(IF(A4:A>B4:B, 1)))
Is it possible to perform an arbitrary calculation (eg. A2*B2) on a set of rows and obtain the cumulative sum along the way using ARRAYFORMULA? For example, in the following sheet we have numbers (column A), multipliers (column B), the result of multiplying them (column C), and a cumulative tally (column D):
  | A        B            C        D            E               F
----------------------------------------------------------------------------------
1 | number   multiplier   result   cumulative   array formula   array formula sum?
2 | 3        4            12       12           12
3 | 2        4            8        20           8
4 | 10       1            10       30           10
5 | 7        9            63       93           63
I can use ARRAYFORMULA in cell E2 (specifically, ARRAYFORMULA(A2:A5*B2:B5)) to do the multiplication. Is it possible to use ARRAYFORMULA (or alternative tool) in cell F2 to show the cumulative total?
use:
=ARRAYFORMULA(IF(A2:A="",,MMULT(TRANSPOSE((ROW(A2:A)<=
TRANSPOSE(ROW(A2:A)))*A2:A*B2:B), SIGN(B2:B))))
Calculate the cumulative sum with the SCAN and LAMBDA functions:
=SCAN(0, F5:F, LAMBDA(accumulated_value, cell_value, accumulated_value + cell_value))
This will run faster as it runs with linear complexity (O(N)) compared to the ARRAYFORMULA solution, which runs in quadratic time (O(N**2)).
Where:
0 is the initial value of the cumulative sum
F5:F is the range to sum over
LAMBDA(accumulated_value, cell_value, accumulated_value + cell_value) is the function that calculates the sum at each cell
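The difference between the two approaches can be sketched in Python (an illustration only, not Sheets code), using the four rows from the question. The variable names are made up for the example.

```python
from itertools import accumulate

numbers = [3, 2, 10, 7]
multipliers = [4, 4, 1, 9]
products = [n * m for n, m in zip(numbers, multipliers)]

# Linear: carry the running total forward, like SCAN with an accumulator.
running_linear = list(accumulate(products))

# Quadratic: for each row i, re-sum every product at or above it,
# which is what the row-comparison mask in the MMULT formula does.
running_quadratic = [
    sum(p for j, p in enumerate(products) if j <= i)
    for i in range(len(products))
]

print(running_linear)  # [12, 20, 30, 93]
```

Both lists hold the same totals; the second one just does N passes over the data to get there.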
Sample File
I performed classification on a small 65x9 data set using decision trees (Random Forest and Random Tree). I have four classes, 8 attributes and 65 instances.
My application is in assistive robotics. I'm extracting parameters from my sensor data that I think are relevant for classifying users' runs while they perform a task. The movement data comes from a sensor package deployed on the wheelchair. I classify a certain action, such as turning 180 degrees, and give the user a mark (from 1 to 4). From the sensor package and the software I extracted parameters such as velocity, distance, time and the standard deviation of the velocity that are relevant for classifying the user's run. So my data are all numeric.
When I ran the decision tree classifier I got these results:
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 4 random features.
Out of bag error: 0.5231
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 64 98.4615 %
Incorrectly Classified Instances 1 1.5385 %
Kappa statistic 0.9791
Mean absolute error 0.0715
Root mean squared error 0.1243
Relative absolute error 19.4396 %
Root relative squared error 29.0038 %
Total Number of Instances 65
=== Detailed Accuracy By Class ===
               TP Rate   FP Rate   Precision   Recall    F-Measure   ROC Area   Class
                 1         0         1           1         1           1          c1
                 1         0         1           1         1           1          c2
                 0.952     0         1           0.952     0.976       1          c3
                 1         0.019     0.917       1         0.957       1          c4
Weighted Avg.    0.985     0.003     0.986       0.985     0.985       1
=== Confusion Matrix ===
a b c d <-- classified as
14 0 0 0 | a = c1
0 19 0 0 | b = c2
0 0 20 1 | c = c3
0 0 0 11 | d = c4
This is too good. Am I doing something wrong?