I have a time-series dataset with 10 features plus a "timestamp" column (the index of the DataFrame).
After scaling the features' values and running k-means clustering, I got the results as an np.array.
My problem is that I need to know the timestamp of each sample in a cluster. How can I keep the timestamp index while clustering, without using it as a feature?
A naive and easy, but in my opinion perfectly fine, solution is to give the original DataFrame a new index (just the row number) and then split it into two separate frames: one with the timestamps and one with the features. You can then easily reassign the results to the timestamps, since fit_predict preserves row order.
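For illustration, a minimal sketch of this, assuming `df` is the original DataFrame with the timestamp index and the 10 feature columns (the cluster count is just a placeholder):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = df.reset_index()                      # row-number index; "timestamp" becomes a column
timestamps = df["timestamp"]               # frame 1: timestamps only
features = df.drop(columns=["timestamp"])  # frame 2: features only

X = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# fit_predict preserves row order, so the labels line up with the timestamps
result = pd.DataFrame({"timestamp": timestamps, "cluster": labels})
```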
I have a dataset with the attributes (Date, Value, Variable-1, Variable-2, Variable-3, Variable-4, Variable-5) and 100k+ rows. I want to predict the future "Value" based on the 5 variables, trained in a time-series manner; there will be seasonal trends and low and high scores in "Value". Can someone suggest a statistical or machine learning/deep learning solution for this?
Here is a screenshot of the dataset; I want to forecast the Value variable.
This is a very interesting problem, and you can use the vector autoregression (VAR) method to solve it. Packages are available in both R and Python.
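A minimal sketch with Python's statsmodels, assuming `df` is your DataFrame indexed by Date with the Value column and the five variables (note that VAR expects stationary series, so you may need to difference your data first):

```python
import pandas as pd
from statsmodels.tsa.api import VAR

# df: columns = ["Value", "Variable-1", ..., "Variable-5"], indexed by Date
model = VAR(df)
results = model.fit(maxlags=15, ic="aic")   # lag order selected by AIC

lag = results.k_ar
fcast = results.forecast(df.values[-lag:], steps=12)  # 12 periods ahead
fcast_df = pd.DataFrame(fcast, columns=df.columns)
print(fcast_df["Value"])                    # forecast for the Value column
```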
I used a K-means clustering model for anomaly detection with BigQuery ML.
Dataset information

date          DATE
trade_id      INT
trade_name    STRING
agent_id      INT
agent_name    STRING
total_item    INT
Mapping: one trade has multiple agents, depending on the date.
The model was trained on the fields below, aggregated by SUM(total_item):
trade_id
trade_name
agent_id
agent_name
Number of clusters: 4
I need to find the anomalies for each trade and agent, based on date.
The model is trained on the given data and distance_from_closest_centroid is calculated; the distance is then retrieved for each trade and agent based on date, and the largest distance is considered an anomaly. Using this information:
Questions
1. How do I find the number of clusters to use for the model (e.g., the elbow method for selecting the minimal cluster number)?
2. How do I build the model when trade data is added on a daily basis? Is it possible to build the model incrementally, day by day?
As the question was updated, I will sum up our discussion as an answer to further contribute to the community.
Regarding your first question, "How do I find the number of clusters to use for the model (e.g., the elbow method for selecting the minimal cluster number)?":
According to the documentation, if you omit the num_clusters option, BigQuery ML will choose a reasonable default based on the total number of rows in the training data. However, if you want the most optimal number, you can perform hyperparameter tuning, which is the process of selecting one (or a set of) optimal hyperparameters for a learning algorithm, in your case K-means within BigQuery ML. To determine the ideal number of clusters, run the CREATE MODEL query for different values of num_clusters, compute the error measure for each, and select the point where it is at its minimum. You can find the error measures in the Evaluation tab of the training results, which shows the Davies–Bouldin index and the mean squared distance.
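For illustration, a hedged sketch of that scan from Python (the dataset, table, and column selection are placeholders, not from your question):

```python
from google.cloud import bigquery

client = bigquery.Client()
for k in range(2, 11):                                   # candidate num_clusters values
    client.query(f"""
        CREATE OR REPLACE MODEL `my_dataset.trades_kmeans_{k}`
        OPTIONS(model_type='kmeans', num_clusters={k}) AS
        SELECT trade_id, agent_id, SUM(total_item) AS total_item
        FROM `my_dataset.trades`
        GROUP BY trade_id, agent_id
    """).result()
    # ML.EVALUATE reports davies_bouldin_index and mean_squared_distance
    for row in client.query(
        f"SELECT * FROM ML.EVALUATE(MODEL `my_dataset.trades_kmeans_{k}`)"
    ).result():
        print(k, dict(row))
```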
Your second question was, "How do I build the model when trade data is added on a daily basis? Is it possible to build the model incrementally, day by day?"
K-means is an unsupervised learning algorithm, so you train your model with your current data and then store it in a dataset. This trained model can certainly be used with new data, via ML.PREDICT, which uses the model to predict which cluster each new row belongs to.
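A sketch of that scoring step, with placeholder model and table names:

```python
from google.cloud import bigquery

rows = bigquery.Client().query("""
    SELECT centroid_id, nearest_centroids_distance
    FROM ML.PREDICT(MODEL `my_dataset.trades_kmeans`,
                    (SELECT trade_id, agent_id, total_item
                     FROM `my_dataset.new_daily_trades`))
""").result()
for row in rows:
    print(dict(row))   # cluster assignment + centroid distances per new row
```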
As a bonus, I would like to share this link to the documentation, which explains how K-means in BigQuery ML can be used to detect data anomalies.
UPDATE:
Regarding your question about retraining the model:
Question: "I want to rebuild the model because new trade information has to be updated in my existing model. In this case is this possible to append the model with only two month of data or should we need to rebuild the entire model?"
Answer: You would have to retrain the whole model whenever relevant new data arrives; it is not possible to append the model with only two months of new data. However, I should mention that you can and should use warm_start to retrain your already existing model, as described here.
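A sketch of such a warm-started retrain (names are placeholders); note that it still runs over the full, updated training data:

```python
from google.cloud import bigquery

bigquery.Client().query("""
    CREATE OR REPLACE MODEL `my_dataset.trades_kmeans`
    OPTIONS(model_type='kmeans', num_clusters=4, warm_start=true) AS
    SELECT trade_id, agent_id, SUM(total_item) AS total_item
    FROM `my_dataset.trades`   -- full updated data, not just the new two months
    GROUP BY trade_id, agent_id
""").result()
```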
I am working on setting up data for an unsupervised learning algorithm. The goal of the project is to group (cluster) different customers based on their behavior on the website. Obviously, some sort of clustering algorithm is best for discovering patterns in the data that we can't see as humans.
However, the database contains multiple rows per customer (in chronological order), one for each action the customer took on the website during that visit. For example, the customer with ID 123 clicked page 1 at time X, and that is one row in the database; the same customer then clicked another page at time Y, which is another row.
My question is what algorithm or approach you would use for clustering in this scenario. K-means is really popular for this type of problem, but I don't know if it's possible to use it here because of the grouping. Is it somehow possible to do a cluster analysis around one specific ID that spans multiple rows?
Any help/direction of unsupervised learning I should take is appreciated.
In short:
1. Learn a fixed-length embedding (representation) of each event;
2. Learn a way to combine a sequence of such embeddings into a single representation per customer, then use your favorite unsupervised methods.
For (1), you can do it either manually or with an encoder/decoder.
For (2), there is a range of options, from simply averaging the embeddings of each event to training an encoder-decoder to reconstruct the original sequence of events and taking the intermediate representation (the one the decoder uses to reconstruct the original sequence).
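As a minimal sketch of the simplest end of (2), averaging per-event features into one vector per customer and then clustering (the `events` DataFrame and its columns are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# events: one row per action, with a customer_id plus numeric event features
per_customer = events.groupby("customer_id").mean()   # one fixed-length vector per customer
X = StandardScaler().fit_transform(per_customer)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
clusters = pd.Series(labels, index=per_customer.index, name="cluster")
```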
A good read on this topic (though a bit old; you now also have the option of Transformer networks):
Representations for Language: From Word Embeddings to Sentence Meanings
I have a question. I have many different items, articles of a company (26,000), and I have the sales quantities for the 52 weeks of 2017. I need to build a forecasting model for the future, so I decided to cluster the items.
The goal is to group the items that sold in similar quantities during 2017; for the new collection of items, I would then classify each item into a cluster and build a specific forecasting model per cluster. It's my first time using machine learning, so I need help.
Do I need to do a correlation analysis before clustering?
Can I create a correlation-based metric and pass it to my clustering function as the distance metric?
Clustering time-series data on the raw values will not yield good results.
Time-series data is about trends, not absolute values.
Try transforming your data to reflect the trends, then do the clustering.
For example, suppose your data is 5, 10, 45, 23.
Transform it to 0, 1, 1, 0 (1 means an increase over the previous value). This way you can cluster together the items that increase or decrease together.
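A minimal sketch of that transform, assuming `sales` is a DataFrame with one row per item and one column per week (the cluster count is a placeholder):

```python
from sklearn.cluster import KMeans

# 1 where a week's quantity rose versus the previous week, else 0;
# the first week has no predecessor, so drop its column
trend = (sales.diff(axis=1) > 0).astype(int).iloc[:, 1:]
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(trend.values)
```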
This is just an opinion; you will have to try various transformations and see what works on your data. https://datascience.stackexchange.com/ is a relevant place to ask such questions.
I am building an automated cleaning process that fills null values in a dataset. I discovered a few functions, such as mode, median, and mean, which can be used to fill NaN values in the data. But which one should I select? If the data is categorical, it has to be either mode or median, while for continuous data it has to be mean or median. So, to determine whether a column is categorical or continuous, I decided to build a machine learning classification model.
I took a few features:
1) standard deviation of data
2) Number of unique values in data
3) total number of rows of data
4) ratio of unique values to total rows
5) minimum value of data
6) maximum value of data
7) number of data points between the median and the 75th percentile
8) number of data points between the median and the 25th percentile
9) number of data points between the 75th percentile and the upper whisker
10) number of data points between the 25th percentile and the lower whisker
11) number of data points above the upper whisker
12) number of data points below the lower whisker
With these 12 features and around 55 training samples, I used a logistic regression model on the normalized data to predict label 1 (continuous) or 0 (categorical).
The fun part is that it worked!!
But did I do it the right way? Is this a correct method for predicting the nature of the data? Please advise me on how I could improve it further.
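For reference, a condensed sketch of this setup with a few of the 12 features (the `training_columns` list of (series, label) pairs is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def column_features(s):
    """A subset of the 12 meta-features, computed on a pandas Series."""
    q25, q50, q75 = np.percentile(s, [25, 50, 75])
    return [
        s.std(),                          # 1) standard deviation
        s.nunique(),                      # 2) number of unique values
        len(s),                           # 3) total number of rows
        s.nunique() / len(s),             # 4) unique-to-total ratio
        s.min(),                          # 5) minimum
        s.max(),                          # 6) maximum
        ((s > q50) & (s <= q75)).sum(),   # 7) between median and 75th percentile
        ((s >= q25) & (s < q50)).sum(),   # 8) between median and 25th percentile
    ]

X = np.array([column_features(s) for s, _ in training_columns])
y = np.array([label for _, label in training_columns])  # 1 = continuous, 0 = categorical
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
```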
The data analysis seems awesome. As for the part
But which one should I select?
In my tests, the mean has always been the winner. For every dataset, I try all the cases and compare accuracy.
There is a better, though more time-consuming, approach. If you want to take this system further, it can help.
For each column with missing data, find its nearest neighbor and use that value as the replacement. Suppose you have N columns excluding the target: for each column with missing values, treat it as the dependent variable and the remaining N-1 columns as independent variables, find the nearest neighbor on those, and take its value of the dependent variable as the fill-in for the missing attribute.
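This per-column nearest-neighbour idea is close to what scikit-learn's KNNImputer provides out of the box; a minimal sketch:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])
# each NaN is replaced using the nearest row(s), measured on the non-missing columns
X_filled = KNNImputer(n_neighbors=1).fit_transform(X)
```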
But which one should I select? If the data is categorical, it has to be either mode or median, while for continuous data it has to be mean or median.
Usually the mode is used for categorical data and the mean for continuous data. But I recently saw an article where the geometric mean was used for categorical values.
If you build a model that uses columns containing NaN, you can include columns with mean replacement, median replacement, and also a boolean "is NaN" indicator column. But it is better not to use linear models in this case, as you may run into collinearity.
Besides that, there are many other methods to replace NaN, for example the MICE algorithm.
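A sketch of mean replacement plus the "is NaN" indicator column, using scikit-learn's SimpleImputer:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])
imp = SimpleImputer(strategy="mean", add_indicator=True)
# imputed columns first, then one missing-indicator column per feature that had NaNs
X_out = imp.fit_transform(X)
```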
Regarding the features you use: they are OK, but I would advise adding some more features related to the distribution, for example:
skewness
kurtosis
similarity to Gaussian Distribution (and other distributions)
the number of 1D Gaussians you need to fit your column (via a GMM; this won't perform well for 55 rows)
You can compute all of these on both the raw data and transformed data (log, exp).
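A sketch of the first few of these with scipy (using a normality-test p-value as a rough "similarity to Gaussian" score is my choice for illustration, not a standard):

```python
from scipy import stats

def distribution_features(values):
    """Distribution-shape features for a 1-D numeric array."""
    return {
        "skewness": stats.skew(values),
        "kurtosis": stats.kurtosis(values),
        "normality_p": stats.normaltest(values).pvalue,  # higher = more Gaussian-like
    }
```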
To explain: a column can contain many categories and may simply look like a numerical column under the old approach, even though it is not numerical. A distribution-matching algorithm may help here.
You can also try different normalizations. RobustScaler from sklearn will probably work well (it may help in cases where categories have levels very similar to outlier values).
And a last piece of advice: you can use a random forest model for this and get the important columns. That list may give some direction for feature engineering/generation.
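A sketch of that, reusing X, y, and a feature-name list from the setup above (names assumed):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, rf.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked:
    print(name, round(importance, 3))   # most informative meta-features first
```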
And, of course, taking a look at the confusion matrix and at which features the errors happen is also a good idea!