SQL: Advanced time slice in vertica - time-series

Hey folks: I have the following table in a Vertica DB:
+-----+------+----------+
| Tid | item | time_sec |
+-----+------+----------+
| 1 | A | 1 |
| 1 | B | 2 |
| 1 | C | 4 |
| 1 | D | 5 |
| 1 | E | 6 |
| 2 | A | 5 |
| 2 | E | 5 |
+-----+------+----------+
My goal is to create new item groups that lie within a time window deltaT, meaning that the difference between the timestamps of the first and last item in a group is less than or equal to deltaT. Example: if deltaT = 2 sec, we would get the new table:
+-----+------+
| Tid | item |
+-----+------+
| 11 | A |
| 11 | B |
| 12 | B |
| 12 | C |
| 13 | C |
| 13 | D |
| 13 | E |
| 14 | D |
| 14 | E |
| 15 | E |
| 21 | A |
| 21 | E |
+-----+------+
Here is a walk-through of the table:
First we inspect all items with Tid 1 and create subgroups with Tid 1n, where n is a counter.
Our first subgroup, with Tid 11, consists of items A and B, since deltaT between the last and first item is <= 2. The next group has Tid 12 with items B and C. The group after that has Tid 13 and items C, D, E, since all of them are within a time span of 2 seconds. This goes on until the last item with Tid 1. Then we start over with the group that has Tid 2.
The new Tid numbering for the subgroups can be continuous (1...6); I just chose this kind of numbering to show the relation to the original table.
I am looking at the Vertica functions LAG and TIME_SLICE but cannot figure out how to handle such a problem elegantly.

This is how far I got. It does not really answer your question, but it could give you a few pointers:
WITH
-- your input
input(Tid,item,time_sec) AS (
SELECT 1,'A',1
UNION ALL SELECT 1,'B',2
UNION ALL SELECT 1,'C',4
UNION ALL SELECT 1,'D',5
UNION ALL SELECT 1,'E',6
UNION ALL SELECT 2,'A',5
UNION ALL SELECT 2,'E',5
)
-- end of your input, start your "real" WITH clause here
,
input_w_ts AS (
SELECT
*
, TIMESTAMPADD('SECOND',time_sec-1,TIMESTAMP '2000-01-01 00:00:00') AS ts
FROM input
)
SELECT
TS_LAST_VALUE(Tid) AS Tid
, item
, TS_LAST_VALUE(time_sec) AS time_sec
, tsr
FROM input_w_ts
TIMESERIES tsr AS '2 SECONDS' OVER (PARTITION BY item ORDER BY ts)
ORDER BY 1,4
;
Output:
Tid|item|time_sec|tsr
1|A | 1|2000-01-01 00:00:00
1|B | 2|2000-01-01 00:00:00
1|A | 1|2000-01-01 00:00:02
1|C | 4|2000-01-01 00:00:02
1|D | 5|2000-01-01 00:00:04
1|E | 6|2000-01-01 00:00:04
2|A | 5|2000-01-01 00:00:04
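The grouping the question asks for can also be sketched procedurally, outside Vertica. The following is a minimal Python sketch, not a Vertica solution; the function name window_groups is hypothetical, and it assumes one window is opened per distinct start time within each Tid (which is what the expected table in the question implies for Tid 2, where A and E share the same timestamp).

```python
from collections import defaultdict

def window_groups(rows, delta_t):
    """Return (new_tid, item) pairs for every window of width delta_t.

    rows: iterable of (tid, item, time_sec).
    Assumption: one window per distinct start time inside each tid.
    """
    by_tid = defaultdict(list)
    for tid, item, t in rows:
        by_tid[tid].append((t, item))

    out = []
    for tid in sorted(by_tid):
        events = sorted(by_tid[tid])               # order by time, then item
        starts = sorted({t for t, _ in events})    # distinct window starts
        for n, start in enumerate(starts, 1):
            new_tid = f"{tid}{n}"                  # e.g. Tid 1, window 3 -> "13"
            for t, item in events:
                if start <= t <= start + delta_t:
                    out.append((new_tid, item))
    return out

rows = [
    (1, "A", 1), (1, "B", 2), (1, "C", 4), (1, "D", 5), (1, "E", 6),
    (2, "A", 5), (2, "E", 5),
]
for new_tid, item in window_groups(rows, delta_t=2):
    print(new_tid, item)
```

With deltaT = 2 this reproduces the twelve rows of the desired table. In SQL, the same idea would roughly correspond to a self-join of the table on Tid with a range condition on time_sec, but that is not something the answer above shows.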

Related

Cross join of two tables

I have two tables x and y. I want to join on column b such that I get z in output.
x:([a:1 2 1 3]; b:`a`a`b`b)
q) a | b
-----
1 | a
2 | a
1 | b
3 | b
y:([b:`a`a`a`b]; c:7 8 9 10)
q) b | c
------
a | 7
a | 8
a | 9
b | 10
Desired output:
q) a | b | c
-----------
1 | a | 7
1 | a | 8
1 | a | 9
2 | a | 7
2 | a | 8
2 | a | 9
1 | b | 10
3 | b | 10
Is this some sort of cross join?
An equi join (ej) will produce the result you want:
q)ej[`b;x;y]
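For readers who do not use q, here is a rough Python sketch of what an equi join on column b does with the sample tables: every row of x is paired with every row of y that has the same b. The helper name equi_join is hypothetical, and the sketch is only meant to illustrate the semantics, not how q implements ej.

```python
def equi_join(key, left, right):
    """Pair each left row with every right row that has the same key value."""
    joined = []
    for lrow in left:
        for rrow in right:
            if lrow[key] == rrow[key]:
                # Merge the two rows; the shared key column appears once.
                joined.append({**lrow, **rrow})
    return joined

x = [{"a": 1, "b": "a"}, {"a": 2, "b": "a"},
     {"a": 1, "b": "b"}, {"a": 3, "b": "b"}]
y = [{"b": "a", "c": 7}, {"b": "a", "c": 8},
     {"b": "a", "c": 9}, {"b": "b", "c": 10}]

z = equi_join("b", x, y)
for row in z:
    print(row["a"], row["b"], row["c"])
```

So it is not a full cross join: rows are only combined when their b values match, which is why the result has 8 rows rather than 16.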

Counting if range matches ranged criteria 1:1

I have an ongoing scoreboard with a friend for a game we play. It looks like this:
A B C D E F
+-----------------------------+-------+------+--------+--------+------------+
1 | Through the Ages Scoreboard | | | | | |
+-----------------------------+-------+------+--------+--------+------------+
2 | Game title | Kevin | M | First? | Winner | Difference |
+-----------------------------+-------+------+--------+--------+------------+
3 | thekoalaz's Game | 174 | 213 | Kevin | M | 39 |
4 | Game #0 | 242 | 126 | Kevin | Kevin | 116 |
5 | Game #1 | 105 | 146 | Kevin | M | 41 |
6 | Game #2 | 158 | 135 | Kevin | Kevin | 23 |
7 | Game #3 | 149 | 145 | M | Kevin | 4 |
8 | Game #4 | 91 | 145 | Kevin | M | 54 |
9 | Game #5 | 211 | 187 | M | Kevin | 24 |
10 | Game #6 | 160 | 158 | M | Kevin | 2 |
11 | Game #7 | 154 | 215 | Kevin | M | 61 |
12 | Game #8 | 169 | 177 | M | M | 8 |
13 | Game #9 | 135 | 129 | M | Kevin | 6 |
14 | Game #10 | 156 | 262 | Kevin | M | 106 |
15 | Game #11 | 205 | 171 | M | Kevin | 34 |
16 | Game #12 (2) | 186 | 203 | Kevin | M | 17 |
17 | | | | | | |
+-----------------------------+-------+------+--------+--------+------------+
Where there's space at the end of the board to add scores for future games.
How do I count how many times the player who goes first wins? In this case it should be 3: D4 = E4, D6 = E6, D12 = E12. Is this possible to do in a single formula? And I'd like to make adding future game scores "just work" with this as well.
Here, first is {K;K;K;K;M;K;M;M;K;M;M;K;M;K}
And winner is {M;K;M;K;K;M;K;K;M;M;K;M;K;M}
I tried =COUNTIF($E$3:$E, $D$3:$D), but this gives me 7, which I presume is the same as =COUNTIF($E$3:$E, $D$3), without the ranged criteria.
Other ranged criteria questions didn't seem to focus on this 1:1 necessity (or maybe I don't know how to word it).
Here's what I used:
=SUMPRODUCT(D3:D=E3:E, E3:E<>"")
Let's break it down.
D3:D=E3:E (also expressible as EQ(D3:D, E3:E)) - equality. I tried to figure out the concept of testing equality of ranges, but the best thing I could find was Microsoft's tutorial on array formulas. What I can say is if you just put =D3:D=E3:E in your Google sheet, it will just be one of the results--the one that matches the row. It requires =ArrayFormula(D3:D=E3:E) to enter as the array of equality results.
SUMPRODUCT - Sums the products of corresponding elements across multiple arrays. For example, SUMPRODUCT({1,3}, {2,4}) = 1*2 + 3*4 = 14. If used with one array, it just sums that array's values. TRUE=1 and FALSE=0, so when applied to the array formula above, it counts how many times D3:D=E3:E is true. Ranges work as arrays, so maybe that's why wrapping the equality with ArrayFormula(...) isn't necessary.
E3:E<>"" - Another array formula, testing whether the E cell is not empty (<> is the "not equals" sign). I want this to work automatically for any new entries, but D3:D=E3:E also evaluates to true for empty rows (empty=empty), so those have to be excluded. Multiplying these two array formulas together is effectively an AND operator: "sum this if Dn=En AND En is not empty". To convince you, here are the truth tables:
+-----+---+---+ +------+---+---+
| AND | T | F | | MULT | 1 | 0 |
+-----+---+---+ +------+---+---+
| T | T | F | | 1 | 1 | 0 |
| F | F | F | | 0 | 0 | 0 |
+-----+---+---+ +------+---+---+
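The same logic can be checked outside the spreadsheet. The following is a small Python sketch, using the First? and Winner columns listed in the question plus two empty trailing rows to mimic the open-ended range; the sumproduct helper is a hypothetical stand-in for the spreadsheet function and only covers equal-length lists.

```python
def sumproduct(*arrays):
    """Multiply corresponding elements and sum the products (TRUE=1, FALSE=0)."""
    total = 0
    for values in zip(*arrays):
        product = 1
        for v in values:
            product *= int(v)
        total += product
    return total

# Columns D and E from the scoreboard, plus two still-empty rows.
first = list("KKKKMKMMKMMKMK") + ["", ""]
winner = list("MKMKKMKKMMKMKM") + ["", ""]

matches = [f == w for f, w in zip(first, winner)]   # D3:D=E3:E
filled = [w != "" for w in winner]                  # E3:E<>""

print(sumproduct([1, 3], [2, 4]))    # 1*2 + 3*4
print(sumproduct(matches))           # also counts the empty rows
print(sumproduct(matches, filled))   # only rows that actually have a winner
```

Without the second array the two empty rows count as matches; multiplying by the non-empty test removes them.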

Calculate a bunch of data to display on stacked bar

I'm struggling with creating my first chart.
I have a dataset of ordinal-scaled data from a survey.
It contains several questions, each with possible answers from 1 to 5.
I have around 110 answers from different people, which I want to collect and show in a stacked bar chart.
The data looks like this:
| taste | region | brand | price |
| 1 | 3 | 4 | 2 |
| 1 | 1 | 5 | 1 |
| 1 | 3 | 4 | 3 |
| 2 | 2 | 5 | 1 |
| 1 | 1 | 4 | 5 |
| 5 | 3 | 5 | 2 |
| 1 | 5 | 5 | 2 |
| 2 | 4 | 1 | 3 |
| 1 | 3 | 5 | 4 |
| 1 | 4 | 4 | 5 |
...
To display that in a stacked bar chart, I need to count how often each answer occurs.
So I know that in the end it needs to be summarized like this:
| | taste | region | brand | price |
| 1 | 60 | 20 | 32 | 12 |
| 2 | 23 | 32 | 54 | 22 |
| 3 | 24 | 66 | 36 | 65 |
| 4 | 55 | 68 | 28 | 54 |
| 5 | 10 | 10 | 12 | 22 |
(This is just to demonstrate; the values are not correct.)
Or maybe there is already a function for this in SPSS, but I have no idea where or how.
Any advice on how to do that?
I can't think of a single command but there are many ways to get to where you want. Here's one:
first recreating your sample data:
data list list/ taste region brand price .
begin data
1 3 4 2
1 1 5 1
1 3 4 3
2 2 5 1
1 1 4 5
5 3 5 2
1 5 5 2
2 4 1 3
1 3 5 4
1 4 4 5
end data.
Now counting the values for each row:
vector t(5) r(5) b(5) p(5).
* the vector command is only necessary so that the new variables are ordered conveniently for the following steps.
do repeat vl= 1 to 5/t=t1 to t5/r=r1 to r5/b=b1 to b5/p=p1 to p5.
compute t=(taste=vl).
compute r=(region=vl).
compute b=(brand=vl).
compute p=(price=vl).
end repeat.
Now we can aggregate and restructure to arrive at the exact data structure you specified:
aggregate /outfile=* /break= /t1 to t5 r1 to r5 b1 to b5 p1 to p5 = sum(t1 to p5).
varstocases /make taste from t1 to t5 /make region from r1 to r5
/make brand from b1 to b5/ make price from p1 to p5/index=val(taste).
compute val = char.substr(val,2,1).
alter type val(f1).
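To make the target structure concrete, here is a short Python sketch (not SPSS syntax) that computes the same kind of summary for the ten sample rows: for each question, how often each answer 1 to 5 was given. The variable names are illustrative only.

```python
from collections import Counter

columns = ["taste", "region", "brand", "price"]
data = [
    (1, 3, 4, 2), (1, 1, 5, 1), (1, 3, 4, 3), (2, 2, 5, 1), (1, 1, 4, 5),
    (5, 3, 5, 2), (1, 5, 5, 2), (2, 4, 1, 3), (1, 3, 5, 4), (1, 4, 4, 5),
]

# One Counter per question: answer value -> number of respondents.
counts = {
    name: Counter(row[i] for row in data)
    for i, name in enumerate(columns)
}

# Print one line per answer value, one column per question.
print("val", *columns)
for val in range(1, 6):
    print(val, *(counts[name][val] for name in columns))
```

Each printed line corresponds to one row of the summary table the question asks for, which is the shape a stacked bar chart needs.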

sqlite join query iPad

I have three tables
Personal_video
+------------------------------+
|presonal_video_id | title |
----------------------------
1 | test1|
2 | test2|
3 | test3|
4 | test4|
personal_video_tags
+------------------------------+
|tag_id | tag_title |
----------------------------
1 | august|
2 | 2016 |
3 | 2015 |
4 | 2014 |
personal_video_tag_mapping
+------------------------------+
|tag_id | presonal_video_id |
----------------------------
1 | 1 |
2 | 2 |
3 | 3 |
4 | 1 |
Now I want to write a query that returns videos on the basis of common tags: if the user selects the tags "August" and "2014", the query should return the videos that are connected to both tags.
Currently my query is:
SELECT presonal_video_id,title
FROM personal_video
WHERE presonal_video_id IN
(
SELECT personal_video_id AS PID
FROM personal_video_tag_mapping
WHERE tag_id IN ("1","2") AND privacy_level != 2
GROUP BY personal_video_id
HAVING COUNT( PID ) > 1
)
It gives me the right result, but when there is a lot of data it takes a long time. Can someone tell me the correct way to write this query?
Thank you in advance.
Try this query:
SELECT t1.presonal_video_id, t1.title
FROM personal_video AS t1
JOIN personal_video_tag_mapping AS t2
ON t1.presonal_video_id = t2.presonal_video_id
JOIN personal_video_tags AS t3
ON t2.tag_id = t3.tag_id
WHERE t3.tag_title IN ('august', '2014')
GROUP BY t1.presonal_video_id, t1.title
HAVING COUNT(*) = 2
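The query can be tried against the sample data with Python's built-in sqlite3 module. This is only a sketch: the table definitions are guessed from the question (column names, including the presonal_video_id spelling, are copied from it), and the index on the mapping table is an assumption about what might help with larger data, not something the answer states. COUNT(*) equals the number of selected tags only if the mapping table has no duplicate (tag, video) pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE personal_video (presonal_video_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE personal_video_tags (tag_id INTEGER PRIMARY KEY, tag_title TEXT);
CREATE TABLE personal_video_tag_mapping (tag_id INTEGER, presonal_video_id INTEGER);
INSERT INTO personal_video VALUES (1,'test1'),(2,'test2'),(3,'test3'),(4,'test4');
INSERT INTO personal_video_tags VALUES (1,'august'),(2,'2016'),(3,'2015'),(4,'2014');
INSERT INTO personal_video_tag_mapping VALUES (1,1),(2,2),(3,3),(4,1);
-- Assumed index, not part of the original answer.
CREATE INDEX idx_mapping ON personal_video_tag_mapping (tag_id, presonal_video_id);
""")

def videos_with_all_tags(conn, tags):
    """Return videos that are mapped to every tag title in `tags`."""
    placeholders = ",".join("?" for _ in tags)
    sql = f"""
        SELECT t1.presonal_video_id, t1.title
        FROM personal_video AS t1
        JOIN personal_video_tag_mapping AS t2
          ON t1.presonal_video_id = t2.presonal_video_id
        JOIN personal_video_tags AS t3
          ON t2.tag_id = t3.tag_id
        WHERE t3.tag_title IN ({placeholders})
        GROUP BY t1.presonal_video_id, t1.title
        HAVING COUNT(*) = ?
    """
    return conn.execute(sql, [*tags, len(tags)]).fetchall()

print(videos_with_all_tags(conn, ["august", "2014"]))
```

Passing the number of tags as a parameter keeps the HAVING clause in step with the IN list when the user selects more than two tags.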

Count occurrences of words from multiple columns

I have a spreadsheet like this, where the values A-E are the same options coming from a form:
+------+------+------+
| Opt1 | Opt2 | Opt3 |
+------+------+------+
| A | A | B |
| B | C | A |
| C | C | B |
| A | E | C |
| D | B | E |
| B | E | D |
+------+------+------+
I want to make a ranking, showing the most chosen options for each option. I already have this, where Rank is the ranking of the option and number is the count of the option:
+------+------+------+
| Rank | Opt1 | Numb |
+------+------+------+
| 1 | A | 2 |
| 1 | B | 2 |
| 3 | C | 1 |
| 3 | D | 1 |
+------+------+------+ (I have 3 of these, one for each option)
Now I want to make a summary of the 3 options, with the same kind of ranking but combining the options. It would be something like:
+------+------+------+
| Rank |Opt123| Numb |
+------+------+------+
| 1 | B | 5 |
| 2 | A | 4 |
| 2 | C | 4 |
| 4 | E | 3 |
| 5 | D | 2 |
+------+------+------+
Would the easiest way to do this be to get the data from the three ranking tables, or from the original three data columns?
And how would I do this?
I already have the formulas to get the names of the options, the count, and the ranking, but I don't know how to make them work with multiple columns.
What I have (the F column is one of the data columns):
Column B on another sheet:
=SORT(UNIQUE(FILTER('Form Responses'!F2:F;NOT(ISBLANK('Form Responses'!F2:F)))); RANK(COUNTIF('Form Responses'!F2:F; UNIQUE(FILTER('Form Responses'!F2:F;NOT(ISBLANK('Form Responses'!F2:F))))); COUNTIF('Form Responses'!F2:F; UNIQUE(FILTER('Form Responses'!F2:F;NOT(ISBLANK('Form Responses'!F2:F))))); TRUE); FALSE)
Column C:
=ArrayFormula(COUNTIF('Form Responses'!F2:F; FILTER(B2:B;NOT(ISBLANK(B2:B)))))
Column A:
=ARRAYFORMULA(SORT(RANK(FILTER(C2:C;NOT(ISBLANK(C2:C))); FILTER(C2:C;NOT(ISBLANK(C2:C))))))
Edited:
Merge cols:
=TRANSPOSE(split(join(",",D2:D,E2:E),","))
This merges 2 columns. It is not very clean, but it works. (Same approach as in "Stacking multiple columns on to one?")
Full formula:
=SORT(UNIQUE(FILTER(TRANSPOSE(split(join(",",D2:D,E2:E),","));NOT(ISBLANK(TRANSPOSE(split(join(",",D2:D,E2:E),",")))))); RANK(COUNTIF(TRANSPOSE(split(join(",",D2:D,E2:E),",")); UNIQUE(FILTER(TRANSPOSE(split(join(",",D2:D,E2:E),","));NOT(ISBLANK(TRANSPOSE(split(join(",",D2:D,E2:E),","))))))); COUNTIF(TRANSPOSE(split(join(",",D2:D,E2:E),",")); UNIQUE(FILTER(TRANSPOSE(split(join(",",D2:D,E2:E),","));NOT(ISBLANK(TRANSPOSE(split(join(",",D2:D,E2:E),","))))))); TRUE); FALSE)
The transpose could be done after the sort.
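As a cross-check of the expected summary, the merge, count, and rank steps can be sketched in Python with the sample data from the question. This is not a spreadsheet formula; the ranking rule used here (rank = 1 + number of options with a strictly higher count, so ties share a rank) is an assumption chosen to match the desired table.

```python
from collections import Counter

rows = [
    ("A", "A", "B"),
    ("B", "C", "A"),
    ("C", "C", "B"),
    ("A", "E", "C"),
    ("D", "B", "E"),
    ("B", "E", "D"),
]

# Merge the three columns into one list, skipping blanks, then count.
merged = [value for row in rows for value in row if value]
counts = Counter(merged)

# Sort by count descending, then by option name for a stable order.
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Ties share a rank: 1 + number of options with a strictly higher count.
ranking = [
    (1 + sum(1 for c in counts.values() if c > n), option, n)
    for option, n in ordered
]

for rank, option, n in ranking:
    print(rank, option, n)
```

Working from the original data columns, as here, avoids having to re-add counts from the three separate ranking tables.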

Resources