Generic VS Specific model in Cassandra? - datastax-enterprise

This is quite a long post, so let's start with some context:
Weather data plays a central role in our architecture. A weather record is mainly composed of five values:
Temperature
Rain
Global Radiation
Wind (direction, speed)
Relative Humidity
But we could also have some more custom values.
Our specific constraints are:
Missing values: Not all five values are always available from a given weather station. Sometimes we need to take missing values from the nearest weather stations (e.g. global radiation).
Sampling rate: For a given weather station, sampling rates can differ between the five values.
Virtual stations: We also have special “virtual” weather stations, which are composed of individual weather sensors taken from real weather stations.
In all cases, at the end of the acquisition process, for each event in a weather station (real or virtual) we need to calculate some higher-level indices from these five values. Some of these five values or higher-level indices are aggregated daily.
We plan to use Spark for data processing.
Which of these three models is the most relevant without depriving us of Cassandra's benefits?
How should we manage the relations between sensors and weather_stations (missing data and virtual stations)?
Sensors model - one table for all data
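CREATE TABLE sensor_data (
sensor_id uuid,
day text,
timestamp timestamp,
sensor_type text,
value double,
weather_station_id uuid, -- type missing in the original; uuid assumed to match weather_data
PRIMARY KEY ((sensor_id, day), timestamp)
);
CREATE TABLE weather_data (
weather_station_id uuid,
day date,
timestamp timestamp,
sensor_data seq<sensor_data>, -- pseudocode: CQL has no seq type; a list<frozen<...>> of a user-defined type would be the closest equivalent
PRIMARY KEY ((weather_station_id, day), timestamp)
);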
Measurement model - one table by data type
CREATE TABLE weather_temperature (
sensor_id uuid,
day text,
timestamp timestamp,
value double,
weather_station_id uuid, -- type missing in the original; uuid assumed to match weather_data
PRIMARY KEY ((sensor_id, day), timestamp)
);
CREATE TABLE weather_rain (
...
);
The same applies to every kind of measurement. From these tables we then need to process the data to aggregate everything, filling in missing values and repeating values that have a lower sampling rate.
CREATE TABLE weather_data (
weather_station_id uuid,
day date,
timestamp timestamp,
...
PRIMARY KEY ((weather_station_id, day), timestamp)
);
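The processing step described for this model (repeating values that have a lower sampling rate) could look roughly like the following plain-Python sketch. It is only an illustration: the readings, the minute-based timestamps and the helper name last_value_at are hypothetical, and in the real pipeline this logic would presumably live in a Spark job.

```python
from bisect import bisect_right


def last_value_at(samples, t):
    """Return the most recent value at or before time t, or None.

    `samples` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in samples]
    i = bisect_right(times, t)
    return samples[i - 1][1] if i > 0 else None


# Hypothetical readings: temperature every 10 minutes, radiation every 30.
temperature = [(0, 12.1), (10, 12.4), (20, 12.9), (30, 13.3)]
radiation = [(0, 410.0), (30, 455.0)]

# One output row per temperature event; the slower radiation series is
# repeated until a newer radiation reading is available.
rows = [
    {
        "minute": t,
        "temperature": temp,
        "global_radiation": last_value_at(radiation, t),
    }
    for t, temp in temperature
]
```

Each row carries the latest known radiation value, so the event at minute 10 reuses the reading taken at minute 0.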
Weather station model - one table with all data
CREATE TABLE weather_data (
weather_station_id text,
day date,
timestamp timestamp,
temperature float,
rain float,
global_radiation float,
relative_humidity float,
wind_speed float,
wind_direction float,
PRIMARY KEY ((weather_station_id, day), timestamp)
);
Then we fill a weather_data_processed table, handling virtual stations, filling in missing values and repeating values that have a lower sampling rate.

To answer this question, it's important to understand how you plan to query the data with CQL. Can you provide a bit more information about that?
To solve the problem of 'virtual weather stations' you could go with something like this:
CREATE TABLE weather_data_station_sensor (
weather_station_id text,
sensor_id text,
day date,
timestamp timestamp,
temperature float,
rain float,
global_radiation float,
relative_humidity float,
wind_speed float,
wind_direction float,
PRIMARY KEY ((weather_station_id, day), sensor_id, timestamp)
);
And use the same table for real and virtual stations. When you get a reading from a sensor that belongs to both a real station and a virtual station, you do two (or more) updates using a BATCH. For example:
BEGIN BATCH
INSERT INTO weather_data_station_sensor (weather_station_id, sensor_id ...etc) VALUES ('station_1', 'id_1' ... etc);
INSERT INTO weather_data_station_sensor (weather_station_id, sensor_id ...etc) VALUES ('station_2', 'id_1' ... etc);
APPLY BATCH;
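On the application side, the fan-out described here could be sketched as below. This is only an illustration in plain Python: the SENSOR_STATIONS mapping, the fan_out helper and the sample values are hypothetical, and each resulting row would become one INSERT inside the batch.

```python
# Hypothetical mapping: which stations (real or virtual) each sensor feeds.
SENSOR_STATIONS = {
    "id_1": ["station_1", "station_2"],  # part of a real and a virtual station
    "id_2": ["station_1"],
}


def fan_out(sensor_id, day, timestamp, values):
    """Build one row per station that the sensor belongs to."""
    return [
        {
            "weather_station_id": station,
            "sensor_id": sensor_id,
            "day": day,
            "timestamp": timestamp,
            **values,
        }
        for station in SENSOR_STATIONS.get(sensor_id, [])
    ]


rows = fan_out("id_1", "2017-10-01", "2017-10-01T09:00:00Z", {"temperature": 21.5})
```

Here rows holds two entries, one for station_1 and one for station_2, both carrying the same reading.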

Related

Should we compare null value with known value?

I have a binary classification problem and need to prepare the data for model training. There are two classes, duplicate and nonduplicate. Assume two records in the data look like this:
Id | Name | Phone | Email | City
A1 | Mick | 12345 | m#m.com | London
A2 | Mick | 12345 | null | London
It seems that these two records are duplicates. I need to turn them into one record and assign each feature a binary value: 1 if the values match, otherwise 0, as follows:
Id1 | Id2 | Name | Phone | Email | City | Label
A1 | A2 | 1 | 1 | ? | 1 | 1
As the first table shows, we have a missing value for the email in the second row. I know I cannot compare a known value with a missing one. The question is, what is the best practice in this case?
Note: The number of missing values is high in my dataset, and I cannot drop them.
I tried to put 0, but I know it introduces bias in the dataset.
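To make the "?" cell concrete, here is a minimal sketch of one possible encoding that keeps "unknown" distinct from "mismatch" by using a third value (None) when either side is missing. This is an assumption for illustration, not an established best practice from this thread; whether a given model can consume such a value (or needs a separate "missing" indicator column) depends on the learner. The compare helper and FIELDS list are hypothetical names.

```python
def compare(a, b):
    """Return 1 if equal, 0 if different, None if either value is missing."""
    if a is None or b is None:
        return None
    return 1 if a == b else 0


FIELDS = ["Name", "Phone", "Email", "City"]

a1 = {"Id": "A1", "Name": "Mick", "Phone": "12345", "Email": "m#m.com", "City": "London"}
a2 = {"Id": "A2", "Name": "Mick", "Phone": "12345", "Email": None, "City": "London"}

# Build the pair record: ids first, then one comparison result per field.
pair = {"Id1": a1["Id"], "Id2": a2["Id"]}
pair.update({field: compare(a1[field], a2[field]) for field in FIELDS})
```

With these two records, the Email comparison stays None instead of being forced to 0.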
You can drop the records with null values. To do this, use the pandas dropna() method.

is there a function to get mean values of a column for every unique date in date column?

[Screenshot: Jupyter notebook showing all columns in the dataset]
I have an AQI (Air Quality Index) dataset with various columns such as O3, SO2, PM2.5, etc. and a datetime column that holds timestamps like 20-Sep-2017 - 01:00 and 20-Sep-2017 - 00:00. I want to get the mean value of each column for every unique date; for example, O3 has several values on 20-Sep-2017, but I only want their mean for that date. I've tried regex and many other things but did not get the desired output.
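The grouping being asked for can be sketched with the standard library alone: parse each timestamp, keep only its calendar date, and average the values per date. The sample rows below are hypothetical and only the O3 column is shown; in pandas, the same idea would be a groupby on the parsed date.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical sample rows: (timestamp string, O3 reading).
rows = [
    ("20-Sep-2017 - 00:00", 10.0),
    ("20-Sep-2017 - 01:00", 20.0),
    ("21-Sep-2017 - 00:00", 30.0),
]

# Collect the readings under their calendar date.
by_date = defaultdict(list)
for stamp, o3 in rows:
    day = datetime.strptime(stamp, "%d-%b-%Y - %H:%M").date()
    by_date[day].append(o3)

# One mean per unique date.
daily_mean = {day: mean(values) for day, values in by_date.items()}
```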

Is it possible to get percentile on aggregated data in Influxdb?

Say, my data is
db,label1=value1 measure1_count=20 measure1_mean=0.8 140000000000
db,label1=value1 measure1_count=8 measure1_mean=0.9 140000001000
db,label1=value1 measure1_count=15 measure1_mean=0.4 140000002000
Is it possible to compute a percentile on the data above in InfluxDB 1.x or 2.x?
InfluxDB provides the MEDIAN() aggregate function for calculating the median.
select MEDIAN(Value) from ProcessData group by TagName
Note: MEDIAN() is nearly equivalent to PERCENTILE(field_key, 50), except MEDIAN() returns the average of the two middle field values if the field contains an even number of values.
https://docs.influxdata.com/influxdb/v1.8/query_language/functions/#:~:text=Note%3A%20MEDIAN()%20is%20nearly,an%20even%20number%20of%20values.
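The difference mentioned in the note can be illustrated in plain Python. The nearest-rank rule used below is an assumption chosen for illustration and is not necessarily the exact selection rule InfluxDB applies; the point is only that a percentile picks an existing value, while a median over an even number of values averages the two middle ones.

```python
import math
from statistics import median


def nearest_rank_percentile(values, p):
    """Return the value at the p-th percentile using the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


values = [10, 20, 30, 40]  # hypothetical field values, even count

median_value = median(values)                       # average of 20 and 30
percentile_50 = nearest_rank_percentile(values, 50)  # an actual field value
```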

What is the best way to store sensor data in Clickhouse?

We have a set of devices, all of which have sensors. All devices share a common set of sensors, but some devices have additional ones. Every sensor has a different discretization level, and some sensors can change very quickly at times and then not change at all for a while.
For example, we have DeviceA and a stream of packets in the following form (NULL means that the value doesn't change):
Timestamp, Temp, Latitude, Longitude, Speed...
111, 20, 54.111, 23.111, 10
112, 20, NULL, NULL, 13
113, 20, NULL, 23.112, 15
And DeviceB:
Timestamp, Temp, Latitude, Longitude, Speed..., AdditionalSensor
111, 24, 54.111, 23.121, 10 ... 1
112, 23, 55.111, 23.121, 13 ... 2
113, 23, 55.111, 23.122, 15 ... 1
After some time, new sensors could be added to some devices.
Every sensor could be of any numeric type (Int32, UInt8, Float32).
After that, the data will be used to calculate DAU, MAU, retention, GPS coordinate clustering and so on.
We could simply create some table:
CREATE TABLE Sensors
(
Date Date,
Id FixedString(16),
DeviceNumber FixedString(16),
TimeUtc DateTime,
DeviceTime DateTime,
Version Int32,
Mileage Int32,
Longitude Float64,
Latitude Float64,
AccelX Float64,
AccelY Float64,
AccelZ Float64
...
) ENGINE = MergeTree(Date, (DeviceNumber, TimeUtc), 8192);
But there are two problems here: there is no support for a different set of sensors per device, and sometimes we have null values for some sensors when nothing has changed, in which case it would be great to see the last non-null value before a given timestamp.
We could solve the first problem by creating a table with the fields SensorName, Timestamp, Date, Value. But how do we choose the correct type? Should we use different tables for different types?
We probably need to use the Graphite engine; unfortunately, I have no experience with it, so any help is really appreciated. It would be great to be able to keep only the changed values of each sensor.
Update
I found a way to deal with null values. We could use the "anyLast" function to request the last received value for a column:
SELECT anyLast(Lights) FROM test where TimeUtc <= toDateTime('2017-11-07 11:13:59');
Unfortunately, we can't fill all missing values using some kind of overlapping window function (ClickHouse has no support for them). With a nullable field, an aggregate function will use only the non-null values; with a non-nullable field, all values including zeros will be used. Both results are incorrect. A workaround is to fill the null values before insert, using a select with anyLast values for every null value in a row.
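The fill-before-insert workaround could be sketched on the client side as follows, using the DeviceA packets from above with None standing in for NULL. This is only an illustration in plain Python; the fill_forward helper is a hypothetical name, and in practice the last known values for a device would have to be loaded first (for example with the anyLast query).

```python
def fill_forward(rows):
    """Replace None in each row with the last non-None value of that column."""
    last = {}
    filled = []
    for row in rows:
        current = {}
        for key, value in row.items():
            if value is not None:
                last[key] = value  # remember the newest known value
            current[key] = last.get(key)
        filled.append(current)
    return filled


packets = [
    {"Timestamp": 111, "Temp": 20, "Latitude": 54.111, "Longitude": 23.111, "Speed": 10},
    {"Timestamp": 112, "Temp": 20, "Latitude": None, "Longitude": None, "Speed": 13},
    {"Timestamp": 113, "Temp": 20, "Latitude": None, "Longitude": 23.112, "Speed": 15},
]

filled = fill_forward(packets)
```

After this step, the packet at timestamp 112 carries the coordinates from 111, and the packet at 113 keeps the old latitude with the new longitude.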
You can use Clickhouse like a time-series database.
Your table definition is restricting you from having dynamic metrics. That's why you are trying to deal with NULL values.
You can use this table for sensor values:
CREATE TABLE ts1(
entity String,
ts UInt64, -- timestamp, milliseconds from January 1 1970
s Array(String), -- names of the sensors
v Array(Float32), -- sensor values
d Date MATERIALIZED toDate(round(ts/1000)), -- auto generate date from ts column
dt DateTime MATERIALIZED toDateTime(round(ts/1000)) -- auto generate date time from ts column
) ENGINE = MergeTree(d, entity, 8192)
Here we are loading sensor values of device A:
INSERT INTO ts1(entity, ts, s, v)
VALUES ('deviceA', 1509232010254, ['temp','lat','long','speed'], [24, 54.111, 23.121, 11])
Querying deviceA temp data:
SELECT
entity,
dt,
ts,
v[indexOf(s, 'temp')] AS temp
FROM ts1
WHERE entity = 'deviceA'
┌─entity──┬──────────────────dt─┬────────────ts─┬─temp─┐
│ deviceA │ 2017-10-28 23:06:50 │ 1509232010254 │   24 │
└─────────┴─────────────────────┴───────────────┴──────┘
Check this full answer to get a detailed usage.

DAX: Create table with LATEST entries

I have a number of sensors in my home, and I want to use PowerBI to display a graph of the temperatures in the different rooms as well as gauge the current/most recent values.
I am having the hardest time writing this in dax:
The data comes to PowerBI from an Azure Table named "DeviceReadings" in the form:
Location (Text, Partition Key)
RowKey (Numeric value based on date stored, not used)
Date (DateTime, TimeStamp)
Temperature (Decimal)
Humidity (Decimal)
What I would like is: (PSEUDOCODE)
Select Location, Last(Date), Temperature, Humidity FROM DeviceReadings
GROUP BY Location
Expected/Wanted outcome is:
"Mediaroom", "01.10.2017 09:00", 26, 17
"Office", "01.10.2017 09:03", 28, 23
"Livingroom", "01.10.2017 09:13", 22, 32
Obviously, I'm trying to create a calculated DAX table based on these readings. The idea is to create a calculated table that at all times contain the most recent temperature/humidity so that I can display those values in gauge-style visuals.
I have been trying to set table = SUMMARIZECOLUMNS, grouping by Location and then adding named columns: "LastSampled" as MAX(DeviceReadings[Date]), and then "Temperature";LASTNONBLANK(DeviceReadings[Temperature];DeviceReadings[Temperature]). That does not return the temperature reading "connected" to that date, but something else.
Ideally, I want to group by location, then take the max date per location, and then display the raw temperature and humidity values.
I simply want a "Most recent temperature reading" by location displayed on my PowerBI dashboard. (I'm using PowerBI desktop to write all the queries and make the reports, and have not yet uploaded to PowerBI portal)
My DAX skills are fairly low, so I need help in writing the calculated query.
Try using this expression:
Table = SELECTCOLUMNS (
    FILTER (
        CROSSJOIN (
            SELECTCOLUMNS (
                GROUPBY (
                    DeviceReadings,
                    DeviceReadings[Location],
                    "LASTDATE", MAXX ( CURRENTGROUP (), DeviceReadings[Date] )
                ),
                "Location", DeviceReadings[Location],
                "Date", [LASTDATE]
            ),
            SELECTCOLUMNS (
                DeviceReadings,
                "DateJoin", DeviceReadings[Date],
                "LocationJoin", DeviceReadings[Location],
                "Humidity", DeviceReadings[Humidity],
                "Temperature", DeviceReadings[Temperature]
            )
        ),
        [Date] = [DateJoin]
            && [Location] = [LocationJoin]
    ),
    "Location", [Location],
    "LastDate", [Date],
    "Temperature", [Temperature],
    "Humidity", [Humidity]
)
I am pretty sure there is an easier way to achieve what you are after, but I cannot figure it out right now.
Let me know if this helps.
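For readers less familiar with DAX, the logic of the expression (find the latest date per location, then keep the full reading for that date) can be illustrated in plain Python. This is only a sketch of the idea, not a replacement for the DAX table; the older "Office" reading is hypothetical sample data added to show that it gets discarded.

```python
from datetime import datetime

readings = [
    # Hypothetical older reading, superseded by the next one.
    {"Location": "Office", "Date": datetime(2017, 10, 1, 8, 50), "Temperature": 27, "Humidity": 24},
    {"Location": "Office", "Date": datetime(2017, 10, 1, 9, 3), "Temperature": 28, "Humidity": 23},
    {"Location": "Mediaroom", "Date": datetime(2017, 10, 1, 9, 0), "Temperature": 26, "Humidity": 17},
]

# Keep, for each location, the whole reading with the greatest Date.
latest = {}
for reading in readings:
    best = latest.get(reading["Location"])
    if best is None or reading["Date"] > best["Date"]:
        latest[reading["Location"]] = reading
```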
