I'm currently using (abusing?) Postgres to store time-series data, and given the amount of data now being consumed I need to migrate to a true time-series database that supports continuous queries and retention policies. I'm looking at InfluxDB, and the only thing holding me back is the ability to define custom aggregates. I currently have a custom aggregate defined in Postgres that calculates a weighted average for data in a table, with the weights determined by the value in one of the columns. For example, I have a score column with values between 1 and 9. When calculating the weighted average score, I multiply the score by 10 if it's between 4 and 6, and by 1,000 if it's between 7 and 9, before calculating the average.
Is it possible to do something similar in InfluxDB?
At the moment InfluxDB doesn't allow user-defined functions within InfluxQL.
You could use InfluxDB and Kapacitor together to achieve what you're trying to do. Specifically, you'd use Kapacitor's User Defined Function (UDF) node to define a weighted-average UDF and then write the results of the UDF back into InfluxDB.
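For reference, a minimal sketch of the weighting rule described in the question (plain Python rather than Kapacitor's UDF agent API; the function name and sample scores are illustrative):

```python
# Sketch of the aggregation logic such a UDF would implement, assuming the rule
# from the question: scores 4-6 are multiplied by 10, scores 7-9 by 1,000,
# and the average is taken over the adjusted values.
def weighted_average(scores):
    def adjust(score):
        if 7 <= score <= 9:
            return score * 1000
        if 4 <= score <= 6:
            return score * 10
        return score

    adjusted = [adjust(s) for s in scores]
    return sum(adjusted) / len(adjusted)


print(weighted_average([2, 5, 8]))  # (2 + 50 + 8000) / 3
```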
Related
I have an interesting question about time series forecasting. Suppose someone has temporal data from multiple sensors, each dataset covering, e.g., 2010 to 2015, and they want to train a forecasting model using the data from all of those sensors. How should the data be organized? If one simply stacks the datasets, the result would be sensorDataset1 (2010–2015) followed by sensorDataset2 (2010–2015), and the time range would start over again for sensors 3, 4, and so on up to n. Is this a problem with time series data or not?
If yes, what is the proper way to handle this?
I tried stacking up all the data and training the model anyway, and it actually has a good error, but I wonder whether that approach is actually valid.
Try resampling your individual sensor datasets to the same period.
For example, suppose sensor 1 has a data entry every 5 minutes and sensor 2 has an entry every 10 minutes. Resample your data to a common period across all sensors. Each data point you show to your model will then be of better quality, which should improve the performance of your model.
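A rough sketch of that resampling with pandas (the file names, column names, and 10-minute period are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sensor files with a "timestamp" and a "value" column.
sensor1 = pd.read_csv("sensor1.csv", parse_dates=["timestamp"], index_col="timestamp")
sensor2 = pd.read_csv("sensor2.csv", parse_dates=["timestamp"], index_col="timestamp")

# Resample both sensors to a common 10-minute period (mean within each bin),
# so every row the model sees covers the same amount of time.
common = pd.concat(
    {
        "sensor1": sensor1["value"].resample("10min").mean(),
        "sensor2": sensor2["value"].resample("10min").mean(),
    },
    axis=1,
)
print(common.head())
```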
How much this influences your error depends on what you're trying to forecast and on the relationships that exist between the variables in your data.
I have a time-series dataset that contains 10 features in addition to the "timestamp" column (the index of the data frame).
After scaling the features' values and applying k-means clustering, I got the results as an np.array.
My problem is that I need to know the timestamp of each sample in a cluster. How can I keep the timestamp index while clustering, without using it as a feature?
A naive and easy, but in my opinion perfectly fine, solution would be to add a new index to the original dataframe (just a row number) and then split the dataframe into two separate ones: one with the timestamp and one with the features. You can then easily reassign the results to the timestamps, since fit_predict keeps the order.
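A minimal sketch of that approach with pandas and scikit-learn (the file name, column layout, and number of clusters are assumptions):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Original frame: timestamp index plus the 10 feature columns.
df = pd.read_csv("data.csv", parse_dates=["timestamp"], index_col="timestamp")

X = StandardScaler().fit_transform(df.values)                # scale the features only
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)

# fit_predict preserves row order, so the labels line up with the original index.
clustered = df.assign(cluster=labels)
print(clustered.groupby("cluster").size())
print(clustered.loc[clustered["cluster"] == 0].index[:5])    # timestamps in cluster 0
```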
I have a very high cardinality time-series database. Suppose I have 4 columns in my time-series database (A, B, C, and D) whose individual cardinalities are 10, 100, 50, and 10,000,000. So in total I have a database of 10 * 100 * 50 * 10,000,000 cardinality. I have the following questions:
1. Which alerting system should I use to monitor a high-cardinality database (say, 5 million cardinality in the last hour of data)?
2. What is the best way to handle it if one column in a time-series database has very high cardinality?
I'm assuming you want to use some sort of monitoring system where, upon certain events, the system is triggered to alarm about a certain service, right? Like an anomaly detection system.
So my question to you is: are you looking for a monitoring tool just to have reports over the features, or do you want to use the time series for machine learning, for example?
I'll answer this as if it were oriented to machine learning. I'm sorry if this is not your intention:
==> In ML, features with high cardinality are usually handled through binning if you need to use them as dummy variables. In other words, for each level of the (binned) feature a new binary column is created. (Example: HTTP codes 200, 200, 201, 404, 409, 500 ==> 2xx, 4xx, 5xx.)
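A tiny illustration of that binning idea with pandas (the column of HTTP status codes is hypothetical):

```python
import pandas as pd

# Bin each status code into its class (2xx, 4xx, ...) and one-hot encode the
# binned classes instead of the raw high-cardinality values.
codes = pd.Series([200, 200, 201, 404, 409, 500], name="http_code")
code_class = (codes // 100).astype(str) + "xx"
print(pd.get_dummies(code_class, prefix="http"))
```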
==> However, if you are using tree-based algorithms, there is no need for dummy variables to handle the cardinality.
Many more approaches can be used, but I need to know if this is what you are looking for in order for me to deepen the answer.
I have a question. I have a lot of different items, different articles of a company (26,000), and I have the sales quantity for the 52 weeks of 2017. I need to build a forecasting model for the future, so I decided to cluster the items.
The goal is to group the items that were sold in similar quantities during 2017; for the new collection of items I would then do a classification based on the clusters and build a specific forecasting model per cluster. It's my first time using machine learning, so I need help.
Do I need to do a correlation analysis before I do the clustering?
Can I create a correlation-based metric and use it as the distance metric in my clustering function?
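If you do try a correlation-based distance, a minimal sketch with SciPy could look like the following (the sales matrix and the number of clusters are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical matrix: one row per item, 52 weekly sales quantities per row.
rng = np.random.default_rng(0)
sales = rng.poisson(lam=20, size=(100, 52)).astype(float)

# 'correlation' distance = 1 - Pearson correlation, so items whose weekly
# patterns move together end up close even if their volumes differ.
dist = pdist(sales, metric="correlation")
clusters = fcluster(linkage(dist, method="average"), t=10, criterion="maxclust")
print(np.bincount(clusters))
```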
Clustering time series data on the raw values usually does not yield good results.
Time series data is about trends, not absolute values.
Try transforming your data so that it reflects the trends, and then do the clustering.
For example, suppose your data is 5, 10, 45, 23.
Transform it to 0, 1, 1, 0 (1 means an increase over the previous value). By doing so, you can cluster the items that increase or decrease together.
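A small sketch of that transformation followed by clustering, using NumPy and scikit-learn (the sales matrix and the number of clusters are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical weekly sales matrix: one row per item, one column per week.
rng = np.random.default_rng(1)
sales = rng.integers(0, 50, size=(100, 52))

# Up/down indicators: 1 if the value increased versus the previous week,
# 0 otherwise (the first week has no previous value, so it stays 0),
# e.g. 5, 10, 45, 23 -> 0, 1, 1, 0.
trend = np.zeros_like(sales)
trend[:, 1:] = (np.diff(sales, axis=1) > 0).astype(int)

labels = KMeans(n_clusters=5, random_state=0).fit_predict(trend)
print(np.bincount(labels))
```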
This is just an opinion; you will have to try out various transformations and see what works on your data. https://datascience.stackexchange.com/ is a relevant place to ask such questions.
I am building an automated cleaning process that cleans null values from a dataset. I discovered a few functions, like mode, median, and mean, which could be used to fill NaN values in the given data. But which one should I select? If the data is categorical, it has to be either the mode or the median, while for continuous data it has to be the mean or the median. So, to determine whether a column is categorical or continuous, I decided to build a machine learning classification model.
I took a few features, like:
1) standard deviation of the data
2) number of unique values in the data
3) total number of rows of data
4) ratio of unique values to total rows
5) minimum value of the data
6) maximum value of the data
7) number of data points between the median and the 75th percentile
8) number of data points between the 25th percentile and the median
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker
With these 12 features and around 55 training samples, I used a logistic regression model on the normalized data to predict label 1 (continuous) or 0 (categorical).
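For reference, a sketch of how a few of those per-column features could be computed and fed to logistic regression (the helper name and toy columns are illustrative, not the original code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def column_features(col: pd.Series) -> list:
    """A subset of the 12 features described above, computed for one column."""
    q1, med, q3 = col.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    return [
        col.std(),                          # 1) standard deviation
        col.nunique(),                      # 2) number of unique values
        len(col),                           # 3) total number of rows
        col.nunique() / len(col),           # 4) ratio of unique values to total rows
        col.min(),                          # 5) minimum
        col.max(),                          # 6) maximum
        ((col > med) & (col <= q3)).sum(),  # 7) between median and 75th percentile
        ((col >= q1) & (col < med)).sum(),  # 8) between 25th percentile and median
        (col > q3 + 1.5 * iqr).sum(),       # 11) above upper whisker
        (col < q1 - 1.5 * iqr).sum(),       # 12) below lower whisker
    ]

# One feature row per labelled column (about 55 in the question); the two toy
# columns below just stand in for that labelled training set.
X = np.vstack([
    column_features(pd.Series(np.random.rand(200))),            # continuous -> label 1
    column_features(pd.Series(np.random.randint(0, 3, 200))),   # categorical -> label 0
])
y = np.array([1, 0])

clf = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
```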
The fun part is: it worked!
But did I do it the right way? Is this a correct method to predict the nature of the data? Please advise me if I could improve it further.
The data analysis seems awesome. As for the part
"But which one should I select?"
The mean has always been the winner as far as I have tested. For every dataset, I test all the cases and compare accuracy.
There is a better, though more time-consuming, approach. If you want to take this system forward, it can help.
For each column with missing data, find its nearest neighbor and replace the missing value with that neighbor's value. Suppose you have N columns excluding the target: for each column with missing values, treat it as the dependent variable and the other N-1 columns as independent variables, find the nearest neighbor of the affected row, and use that neighbor's value of the dependent variable as the replacement for the missing attribute.
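For what it's worth, scikit-learn's KNNImputer implements a closely related idea: each missing value is filled from the nearest rows, measured on the columns that are not missing. A tiny sketch with made-up data:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each NaN is replaced with the value from the single nearest row,
# where distance is computed on the non-missing columns.
print(KNNImputer(n_neighbors=1).fit_transform(X))
```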
"But which one should I select? If the data is categorical, it has to be either the mode or the median, while for continuous data it has to be the mean or the median."
Usually, for categorical data the mode is used; for continuous data, the mean. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns with NaN, you can include a mean-replacement column, a median-replacement column, and also a boolean 'is NaN' indicator column. But it is better not to use linear models in this case, since you can run into correlation between those columns.
Besides this, there are many other methods to replace NaN, for example the MICE algorithm.
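A small sketch of both suggestions, the mean/median/indicator columns and a MICE-style imputer, using pandas and scikit-learn's IterativeImputer (the toy frames are made up):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Mean fill, median fill, and an 'is NaN' indicator as extra columns.
df = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0]})
df["x_mean"] = df["x"].fillna(df["x"].mean())
df["x_median"] = df["x"].fillna(df["x"].median())
df["x_isnan"] = df["x"].isna().astype(int)
print(df)

# MICE-style alternative: IterativeImputer models each feature with missing
# values as a function of the other features.
full = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [2.0, np.nan, 6.0, 8.0]})
print(IterativeImputer(random_state=0).fit_transform(full))
```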
Regarding the features you use: they are OK, but I'd advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to a Gaussian distribution (and other distributions)
the number of 1D Gaussians needed to fit your column (via a GMM; this won't perform well for 55 rows)
All of these you can compute on the original data as well as on transformed data (log, exp).
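A sketch of how those distribution features could be computed with SciPy and scikit-learn (the column is synthetic; a normality-test p-value and a BIC-based component count are one possible way to encode the last two items):

```python
import numpy as np
from scipy import stats
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
col = rng.normal(size=500)  # hypothetical numeric column

skewness = stats.skew(col)
kurt = stats.kurtosis(col)
# Similarity to a Gaussian: p-value of a normality test (higher = more Gaussian-like).
_, normal_p = stats.normaltest(col)

# Number of 1D Gaussians needed to fit the column: pick the GMM with the lowest BIC.
X = col.reshape(-1, 1)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X) for k in range(1, 5)]
n_gaussians = int(np.argmin(bics)) + 1

print(skewness, kurt, normal_p, n_gaussians)
```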
To explain: you can have a column with many categories inside, and with the old approach it may simply look like a numerical column even though it is not numerical. A distribution-matching algorithm may help here.
You can also use different normalization. RobustScaler from sklearn may work well (it can help in cases where categories have levels very similar to outlier values).
And one last piece of advice: you can use a random forest model for this and look at the important columns. That list may give some direction for feature engineering/generation.
And, of course, taking a look at the misclassification matrix and at where the errors happen is also a good thing!
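A minimal sketch of those last two suggestions, random forest feature importances and a cross-validated confusion matrix (X and y below are placeholders for the roughly 55 labelled columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(55, 12))      # placeholder for the 12 features per column
y = rng.integers(0, 2, size=55)    # placeholder labels: 1 = continuous, 0 = categorical

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Most useful features first; a hint for feature engineering/generation.
print(sorted(zip(clf.feature_importances_, range(12)), reverse=True))

# Misclassification (confusion) matrix from cross-validated predictions.
y_pred = cross_val_predict(RandomForestClassifier(random_state=0), X, y, cv=5)
print(confusion_matrix(y, y_pred))
```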