I'm working with the following:
Rails 3.2.9
Highcharts
State Machine
I've got an irregular set of data that represents the state changes of hundreds of Linux boxes. Each box checks into a central ping server every two minutes.
Every time a device heartbeats, the ping server checks whether the device's current state is offline; if so, it changes the state to online, sets the heartbeat table's online column to true, and records the time this happened.
On the ping server, we have a cron job that runs a rake task every 5 minutes. It finds all devices whose last heartbeat is older than 5 minutes.
If it discovers a device is offline, it sets the device's state to offline and writes a row to the heartbeat table with the time of the last heartbeat and online set to 0.
We've been doing this for a while and it seems like an efficient way to store the uptime data without creating a row for 500 devices every 5 minutes.
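Roughly, the rake task does something like this (a simplified sketch; last_heartbeat_at and the model names are approximations of our real schema):

# lib/tasks/heartbeat.rake (simplified sketch)
task check_offline: :environment do
  cutoff = 5.minutes.ago
  Device.where(state: "online").where("last_heartbeat_at < ?", cutoff).find_each do |device|
    device.update_attributes(state: "offline")            # in reality this is a state_machine transition
    Heartbeat.create(dev_id: device.id,
                     online: false,
                     created_at: device.last_heartbeat_at) # the time it was last seen
  end
end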
The table looks a little like this:
+---------------------+--------+--------+
| created_at | dev_id | online |
+---------------------+--------+--------+
| 2012-10-08 16:29:16 | 2345 | 0 |
| 2012-11-21 16:40:22 | 2345 | 1 |
| 2012-11-03 19:15:00 | 2345 | 0 |
| 2012-11-08 09:15:01 | 2345 | 1 |
| 2012-11-08 09:18:03 | 2345 | 0 |
| 2012-11-09 17:57:22 | 2345 | 1 |
| 2012-12-09 13:57:23 | 2345 | 0 |
| 2012-12-09 14:57:25 | 2345 | 1 |
| 2012-12-09 15:00:30 | 2345 | 0 |
| 2012-12-09 15:57:31 | 2345 | 1 |
| 2012-12-09 16:07:35 | 2345 | 0 |
| 2012-12-09 16:37:38 | 2345 | 1 |
| 2012-12-09 17:57:40 | 2345 | 0 |
+---------------------+--------+--------+
Following Ryan Bates's fantastic Railscast on Highcharts, I can create a line graph of this data with irregular intervals.
The chart and data series
Following this example:
http://www.highcharts.com/demo/spline-irregular-time
And using a data series something like this:
= @devices.heartbeats.map { |o| o.online ? 1 : 0 }
It was plotting the line graph pretty nicely.
Where I'm stuck
The graph finishes at the last time the device checked in, and I need the graph to show a point at Now. In Ryan's example, he maps a zero to a date if there's no value; I can't translate this part.
I'm trying to achieve a graph like this stacked bar chart, but can't get the data into the right shape:
http://www.highcharts.com/demo/bar-stacked
How can I format my query so that I get the data up until Now, as well as each individual point, so I can create such a graph?
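To make it concrete, the series I think I'm after would be built something like this (a rough sketch only; it assumes the heartbeats association from above and the epoch-millisecond x values Highcharts expects):

data = @devices.heartbeats.order(:created_at).map do |hb|
  [hb.created_at.to_i * 1000, hb.online ? 1 : 0]   # [x in ms, y]
end
# repeat the last known state at Now so the line runs to the right-hand edge
data << [Time.now.to_i * 1000, data.last ? data.last[1] : 0]

but I can't work out how to express that as a query/mapping.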
I need to SUM multiple columns (sum for each column, not a total range sum) with a single formula. So the output would look something like this:
+-------+-------+------------+-----------+------------+
| 2019 | 2018 | 2017 | 2016 | 2015 |
+-------+-------+------------+-----------+------------+
| $0.00 | $0.00 | $4,341.00 | $0.00 | $5,281.00 |
| $0.00 | 0 | 0 | 0 | 0 |
| $0.00 | 0 | $10,805.00 | $2,865.00 | $8,295.00 |
| $0.00 | 0 | 0 | 0 | $233.00 |
+-------+-------+------------+-----------+------------+
| $0.00 | $0.00 | $15,146.00 | $2,865.00 | $13,809.00 |
+-------+-------+------------+-----------+------------+
I've tried several approaches (SUM, SUMIF, SUMIFS, MMULT) but can't seem to get it right. The closest I've come is this formula from a website I found:
=ArrayFormula(MMULT(B2:F5,(transpose(COLUMN(B1:F1)^0))))
I would also prefer to avoid the need for a 0 value, as shown in the MMULT attempt above. But if that's what it takes to make it work, so be it; a blank value would just be preferred. Am I attempting the impossible or just looking in exactly the wrong direction?
My sheet
=ARRAYFORMULA(TRANSPOSE(MMULT(TRANSPOSE(IF(B2:5<>"", B2:5, 0)), ROW(B2:5)^0)))
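Roughly how it works, assuming the data sits in B2:F5 as in the example: IF(B2:5<>"", B2:5, 0) swaps blanks for zeros so MMULT has numbers to multiply, ROW(B2:5)^0 builds a column of 1s (one per data row), MMULT of the transposed data with that column adds up each year's column, and the outer TRANSPOSE turns the result back into a row. It's the array equivalent of filling a plain per-column sum across the bottom, e.g.:

=SUM(B2:B5)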
I have not found an example or a way of building a dimension that contains schedule attributes. For example, in my scenario I'm building a data warehouse that will help to gather analytics on podcast/radio show episodes.
We have the following:
dim_episode
dim_podcast_show
dim_date
fact_user_daily_activity
And I'm trying to add another dimension that contains schedule attributes about the podcast_show; for example, some shows air their episodes every day, others on Tuesdays and Thursdays, others only on Saturdays.
dim_show_schedule (Option 1)
| schedule_key | show_key | time | sunday_flag | monday_flag | tuesday_flag | wednesday_flag | thursday_flag | friday_flag | saturday_flag |
|--------------|----------|-------|-------------|-------------|--------------|----------------|---------------|-------------|---------------|
| 1 | 0 | 00:30 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1 | 12:30 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 3 | 2 | 21:00 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
However, would it be better to have a bridge table with something like:
bridge_show_schedule (Option 2)
| show_key | day_key |
|----------|---------|
| 0 | 2 |
| 0 | 4 |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
dim_show_schedule (Option 3) (suggested by @nsousa)
| schedule_key | show_key | time | day |
|--------------|----------|-------|-------------|
| 1 | 0 | 00:30 | tuesday |
| 1 | 0 | 00:30 | thursday |
| 2 | 1 | 12:30 | monday |
| 2 | 1 | 12:30 | tuesday |
| 2 | 1 | 12:30 | wednesday |
| 2 | 1 | 12:30 | thursday |
| 2 | 1 | 12:30 | friday |
| 3 | 2 | 21:00 | saturday |
I've searched Kimball's Data Warehouse Lifecycle Toolkit and could not find an example of this use case.
Any thoughts?
If you keep a dimension with a string attribute saying which days it's on, e.g. "M,W,F", the most entries you can have is 2^7 = 128. A bridge table is an unnecessary complication.
Option 1
You can create a schedule dimension that has a unique record for every possible schedule (128 day-of-week combinations) combined with every reasonable start time. Even at 5-minute intervals (288 start times per day) that is 128 × 288 = 36,864 rows, which is trivial for a dimension.
Option 2
If you want to leverage a date dimension instead, create a "Scheduled" fact that relates the show dimension to the date dimension for each future air date. This relationship would be mapped in your ETL process. Your date dimension should already have the week and day-of-week logic included. You could also leverage your show duration attribute to create a semi-additive calculated measure, allowing you to easily get the total programming for a period.
I would opt for Option 2 as it provides many more possibilities for analytics.
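A minimal sketch of what that "Scheduled" fact could look like (the table and column names below are placeholders rather than part of the original design, and it assumes dim_date carries a week attribute):

-- one row per show per date it is scheduled to air
CREATE TABLE fact_show_scheduled (
    date_key       INT  NOT NULL REFERENCES dim_date (date_key),
    show_key       INT  NOT NULL REFERENCES dim_podcast_show (show_key),
    scheduled_time TIME NOT NULL,  -- air time on that date
    duration_min   INT  NOT NULL   -- planned length, for total-programming measures
);

-- e.g. total scheduled programming per week
SELECT d.week_of_year, SUM(f.duration_min) AS scheduled_minutes
FROM fact_show_scheduled f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.week_of_year;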
I need to design a star schema to track order processing. The progress of an order looks like this:
Customer C places an order for item I with quantity 100
Factory F1 takes the order partially, with quantity 30
Factory F2 takes the order partially, with quantity 20
Buy 50 items from the market
F1 delivers 20 items
F1 delivers 7 items
F1 cancels the contract (we need to buy 3 more items from the market)
F2 delivers 20 items
Buy 3 items from the market
Complete the order
How can I design a fact table in this case, given that the number of steps is not fixed and the events are not all of the same type?
I'm sorry for my bad English.
The definition of an Accumulating Snapshot Fact table according to Kimball is:
summarizes the measurement events occurring at predictable steps between the beginning and the end of a process.
For this particular use case I would go with a Transaction Fact Table, since the events (steps) are unpredictable; it is more like an event fact table, similar to logs or audits.
| order_key | date_key | full_datetime | entity_key (customer, factory, etc. varchar) | entity_type | state | quantity |
|-----------|----------|---------------------|----------------------------------------------|-------------|----------|----------|
| 1 | 20190602 | 2019-06-02 04:30:00 | C1 | customer | request | 100 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F1 | factory | receive | 30 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F2 | factory | receive | 20 |
| 1 | 20190602 | 2019-06-02 05:40:00 | Company? | company | buy | 50 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | deliver | 20 |
| 1 | 20190603 | 2019-06-03 02:40:00 | F1 | factory | deliver | 7 |
| 1 | 20190603 | 2019-06-03 04:40:00 | F1 | factory | deliver | 3 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | cancel | |
| 1 | 20190604 | 2019-06-04 07:40:00 | F2 | factory | deliver | 20 |
| 1 | 20190604 | 2019-06-04 07:40:00 | Company? | company | buy | 3 |
| 1 | 20190604 | 2019-06-04 09:40:00 | Company? | company | complete | 100 |
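A rough DDL sketch of that fact table (I've called it fact_order_event here; the column names mirror the example above, and the types and sizes are just illustrative):

CREATE TABLE fact_order_event (
    order_key     INT         NOT NULL,
    date_key      INT         NOT NULL,  -- FK to the date dimension
    full_datetime DATETIME    NOT NULL,
    entity_key    VARCHAR(20) NOT NULL,  -- customer, factory, company, ...
    entity_type   VARCHAR(20) NOT NULL,
    state         VARCHAR(20) NOT NULL,  -- request / receive / buy / deliver / cancel / complete
    quantity      INT         NULL       -- NULL for events with no quantity (e.g. cancel)
);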
I'm not sure about your reporting needs as they were not specified, but assuming you need to measure the lag/duration of unpredictable steps, you could PIVOT and use dynamic SQL to create the required view:
SQL Server dynamic PIVOT query?
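For a fixed pair of steps you don't even need the dynamic PIVOT; a conditional-aggregation query along these lines (against the hypothetical fact_order_event above, SQL Server date functions assumed) would give the request-to-complete lag per order:

SELECT order_key,
       MIN(CASE WHEN state = 'request'  THEN full_datetime END) AS requested_at,
       MAX(CASE WHEN state = 'complete' THEN full_datetime END) AS completed_at,
       DATEDIFF(HOUR,
                MIN(CASE WHEN state = 'request'  THEN full_datetime END),
                MAX(CASE WHEN state = 'complete' THEN full_datetime END)) AS hours_to_complete
FROM fact_order_event
GROUP BY order_key;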
Let me know if you come up with something different, as I'm interested in this particular use case. Good luck!
I have an InfluxDB measurement currently set up with the following "schema":
+----+-------------+-----------+
| ts | cost(field) | type(tag) |
+----+-------------+-----------+
| 1 | 10 | 'a' |
| 1 | 20 | 'b' |
| 2 | 12 | 'a' |
| 2 | 18 | 'b' |
| 2 | 22 | 'c' |
+----+-------------+-----------+
I am trying to write a query that groups my data by timestamp and gives me the delta between the field values of two different tags. If I want the delta between tag 'a' and tag 'b', it would give me the following result (note that I ignore tag 'c'):
+----+-----------+------------+
| ts | type(tag) | delta_cost |
+----+-----------+------------+
| 1 | 'a' | 10 |
| 2 | 'b' | 6 |
+----+-----------+------------+
Is it something Influx can do or am I using the wrong tool?
Just managed to answer my own question. While one of the obvious approaches would be a self-join, Influx does not support joins anymore. We can, however, use nested selects in the following format:
SELECT MEAN(cost_a) - MEAN(cost_b) AS delta_cost
FROM
  (SELECT cost AS cost_a FROM tablename WHERE "type" = 'a'),
  (SELECT cost AS cost_b FROM tablename WHERE "type" = 'b')
GROUP BY time(60s)
Since I am getting my data every 60 seconds anyway, and I have a guarantee of just one point per tag per 60 seconds, I can use GROUP BY and take MEAN without any problems.
I am trying to evaluate Neo4j (using the community version).
I am importing some data (1 million rows) using the LOAD CSV process. It needs to match previously imported nodes to create a relationship between them.
Here is my query:
//Query #3
//create edges between Tr and Ad nodes
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'
//find appropriate tx and ad
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
//create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)
//set properties on the edge
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)
I have indices on:
Indexes
ON :Ad(p58) ONLINE (for uniqueness constraint)
ON :Tr(txid) ONLINE
ON :Tr(h) ONLINE (for uniqueness constraint)
This query has been running for 5 days now and it has so far created 270K relationships (out of 1M).
Java heap is 4g
Machine has 32G of RAM and an SSD for a drive, only running Linux and Neo4j.
Any hints to speed this process up would be highly appreciated.
Should I try the enterprise edition?
Query Plan:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
If a part of a query contains multiple disconnected patterns,
this will build a cartesian product between all those parts.
This may produce a large amount of data and slow down query processing.
While occasionally intended,
it may often be possible to reformulate the query that avoids the use of this cross product,
perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (ad))
20 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+---------------------------------+----------------+---------------------+----------------------------+
| Operator | Estimated Rows | Variables | Other |
+---------------------------------+----------------+---------------------+----------------------------+
| +ProduceResults | 1 | | |
| | +----------------+---------------------+----------------------------+
| +EmptyResult | | | |
| | +----------------+---------------------+----------------------------+
| +Apply | 1 | line -- ad, out, tx | |
| |\ +----------------+---------------------+----------------------------+
| | +SetRelationshipProperty(4) | 1 | ad, out, tx | |
| | | +----------------+---------------------+----------------------------+
| | +CreateRelationship | 1 | out -- ad, tx | |
| | | +----------------+---------------------+----------------------------+
| | +ValueHashJoin | 1 | ad -- tx | ad.p58; line.p58 |
| | |\ +----------------+---------------------+----------------------------+
| | | +NodeIndexSeek | 1 | tx | :Tr(txid) |
| | | +----------------+---------------------+----------------------------+
| | +NodeUniqueIndexSeek(Locking) | 1 | ad | :Ad(p58) |
| | +----------------+---------------------+----------------------------+
| +LoadCSV | 1 | line | |
+---------------------------------+----------------+---------------------+----------------------------+
OK, so splitting the MATCH statement into two sped up the query immensely. Thanks @William Lyon for pointing me to the plan; I noticed the warning.
Old MATCH statement:
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
split into two:
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
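In context, the full statement now looks like this (identical to the original apart from the split MATCH):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
CREATE (tx)-[out:OUT_TO]->(ad)
SET out.id = TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)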
On 750K relationships the query took 83 seconds.
Next up: the 9 million row CSV LOAD.