I have streaming customer location data that I need to analyze: for each event I have to check whether the location is one of the customer's usually visited locations, and generate an alert in real time if it is not.
I was looking at various clustering algorithms but couldn't find a good one that does it in 'real time'.
K-means is too rigid about the number of centroids, and DBSCAN is heavyweight; I'm not sure it is fast enough to respond in real time.
Can you suggest one that suits real-time stream processing?
I believe DBSCAN is suitable enough. Its worst-case complexity is O(n²), which is decent compared to other traditional algorithms such as hierarchical clustering. As for k-means, I believe it is applicable if you use the ST_Centroid function from a spatial database such as SpatiaLite or PostGIS (assuming you are working with geographic data).
Between k-means and DBSCAN, I would choose DBSCAN, because I think a density-based approach is the right answer to your problem with real-time data.
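To make the idea concrete, here is a rough sketch (not production code; the coordinates, parameters and helper names are invented for illustration) of one way to do it with scikit-learn's DBSCAN: cluster a customer's historical locations, then flag a new event that does not fall within eps of any clustered point.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0

# Placeholder history of (lat, lon) visits, converted to radians for the haversine metric.
history = np.radians([
    [40.7128, -74.0060],
    [40.7130, -74.0055],
    [40.7589, -73.9851],
])

# eps = 0.5 km expressed as an angle on the unit sphere.
db = DBSCAN(eps=0.5 / EARTH_RADIUS_KM, min_samples=2,
            metric="haversine", algorithm="ball_tree")
labels = db.fit_predict(history)

def is_usual_location(event_latlon, history, labels, eps_km=0.5):
    """Return True if the new event lies within eps_km of any clustered (non-noise) point."""
    event = np.radians(event_latlon)
    core = history[labels != -1]          # drop DBSCAN noise points
    if len(core) == 0:
        return False
    # Haversine distance from the event to every clustered historical point.
    dlat = core[:, 0] - event[0]
    dlon = core[:, 1] - event[1]
    a = np.sin(dlat / 2) ** 2 + np.cos(event[0]) * np.cos(core[:, 0]) * np.sin(dlon / 2) ** 2
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return bool((dist_km <= eps_km).any())

print(is_usual_location([40.7129, -74.0058], history, labels))   # True: near an existing cluster
print(is_usual_location([34.0522, -118.2437], history, labels))  # False: unusual location, raise an alert
```

In a streaming setting you would re-fit the clusters periodically (or keep a per-customer sliding window of recent points) rather than re-clustering on every event.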
I applied a discrete wavelet transform to horizontal wind speed data to obtain the plot below. I'm basically trying to use the information from the detail coefficients (the turbulent flow) for further analysis, but I'm not sure of the best direction to go in. I don't have much experience with wavelet transforms, so forgive me if there are obvious options, but the examples I've seen usually discard the higher-frequency information since it is the noise of the signal. Is there anything further I can do with this discrete wavelet transform, like statistical analysis or forecasting?
The path to pursue really depends on the question that you are trying to answer.
First of all, I would suggest double checking that your DWT is actually doing what you expect it to do. The plot that you shared suggests that it is successful in separating the low frequency coherent (laminar?) flow from the high frequency turbulent flow, but it would be helpful to figure out which frequencies are present in the high frequency component in order to confirm that the processing parameters (e.g. decomposition level) were properly chosen.
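As a minimal sketch of that sanity check, assuming PyWavelets, an invented sampling rate and a placeholder signal (none of which come from your actual setup), you could reconstruct only the finest detail band and inspect its spectrum:

```python
import numpy as np
import pywt

fs = 20.0                           # assumed sampling rate [Hz]
u = np.random.randn(4096)           # placeholder for the horizontal wind speed series

level = 4
coeffs = pywt.wavedec(u, "db4", level=level)

# Reconstruct only the finest detail band (cD1) to see what it actually contains.
detail_only = [np.zeros_like(c) for c in coeffs]
detail_only[-1] = coeffs[-1]
d1 = pywt.waverec(detail_only, "db4")[: len(u)]

# The spectrum of cD1 should sit roughly in the fs/4 .. fs/2 band; if the energy
# shows up elsewhere, revisit the decomposition level or the wavelet choice.
spectrum = np.abs(np.fft.rfft(d1))
freqs = np.fft.rfftfreq(len(d1), d=1.0 / fs)
print("peak frequency in the detail band: %.2f Hz" % freqs[spectrum.argmax()])
```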
Once convinced that your wavelet decomposition provides you with useful information about the turbulent flow, what should you do with these high pass filtered data?
I suggest computing their variance over 1 hour long intervals. This is a measure of the "energy" of the signal over the chosen interval. If you are dealing with large amounts of data this would allow you to boil down your time series into a single sample per hour. Maybe you will be able to spot diurnal variations in the turbulent flow (e.g. maybe turbulent flow is higher at dawn). If you have multiple stations it would be interesting to study if the turbulence variations share the same behavior.
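A minimal sketch of that hourly aggregation, assuming pandas and a made-up 20 Hz sampling rate and start date (the real timestamps would come from your data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
detail_signal = rng.standard_normal(720_000)   # placeholder for the high-pass filtered wind data

# Assumed 20 Hz sampling => one sample every 50 ms, starting at an arbitrary date.
index = pd.date_range("2023-01-01", periods=len(detail_signal), freq="50ms")
detail = pd.Series(detail_signal, index=index)

# One "turbulent energy" value per hour; easy to plot against time of day
# or to compare across stations.
hourly_energy = detail.resample("1H").var()
print(hourly_energy.head())
```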
Before venturing into time series forecasting, I would really take a closer look at your data and try to identify trends or nail down possible outliers.
Last but not least, I would suggest posting your question on Physics Stack Exchange (e.g. https://physics.stackexchange.com/) rather than on SO.
What is the exact definition of spatial and temporal? I have seen these two terms used in many places, e.g., spatial vector, temporal vector, temporal factor, spatial location.
I was searching on Stack Overflow and found this one: what's the difference between spatial and temporal characterization in terms of image processing?
What I understood so far is that the term spatial is related to space and the term temporal is related to time. Still, it is quite abstract to me. Again, I am also not sure about the uses of these two. So, like the person in the link above, I want to ask the same question: what do these two terms mean and why do we care about them?
Spatial data have to do with location-aware information; in other words, data that have coordinates (x, y). A typical example of spatial data is latitude and longitude in geographic datasets. Spatial analyses are the techniques involved in analyzing spatial data. This is a significant component of GIS (Geographic Information Systems/Science).
Temporal data is time-series data. In other words, this is data that is collected as time progresses. Temporal analysis is also known as Time-Series analysis. These are the techniques for analyzing data units that change with time.
I hope this makes these concepts less abstract and more concrete.
Adding to Ekaba's answer, spatial data doesn't necessarily need to be two-dimensional either. I'm going to take an example from the medical domain that has both spatial and temporal elements of data.
If you consider magnetic resonance imaging, it is essentially a 3D volumetric view of an organ (let's say the brain, for clarity). So if you analyse a traditional MRI, it is spatial analysis, and you have three dimensions since it is 3D. There's another MRI modality called DCE-MRI, which is essentially a sequence of MRI volumes captured over time. This is a typical example of a temporal sequence. Let's say a DCE-MRI sequence has 40 MRI volumes captured 20 s apart. If you consider just one volume out of these 40 and analyse it, you are analyzing it spatially, whereas if you consider all 40 (or a subset) of these volumes at the same time, you are analyzing it spatially as well as temporally.
Hope that clarifies things.
Another similar medical example is ultrasound imaging of a beating heart (2D echocardiography), where the ultrasound image shows the opening and closing movement of the heart valves in real time and the volumetric movement of the heart chambers. With high temporal resolution (~30 frames per second) it is easy to follow the valves opening and closing accurately. With high spatial resolution it is also easy to differentiate the borders of the heart chambers to provide accurate volumetric blood flow data.
Machine Learning (ML) can do two things with a vibration/acoustic signal for Condition Based Monitoring (CBM):
1. Feature extraction, and
2. Classification
But if we look through the research/process, why are signal processing techniques used for pre-processing and ML for the rest, i.e. classification?
We could use only ML for all of this, but I have seen models merging the two techniques: the conventional signal processing approach and ML.
I want to know the specific reason for that. Why do researchers use both when they could do it with ML only?
Yes you can do so. However, the task becomes more complicated.
An FFT, for example, transforms the input space into a more meaningful representation. If you have rotating equipment, you would expect the spectrum to be concentrated mainly at the frequency of rotation. However, if there is a problem, the spectrum changes. This can often be detected by, for example, SVMs.
If you don't do the FFT but only give them the raw signal, SVMs have a hard time.
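A toy sketch of that pipeline (the signal, the fault frequency and the model settings are all invented) showing how the FFT turns raw vibration windows into spectra that an SVM can separate:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
fs, n = 1000, 1024                       # assumed sampling rate [Hz] and window length

def make_window(fault):
    """Synthetic vibration window: a rotation tone plus an extra tone when faulty."""
    t = np.arange(n) / fs
    base = np.sin(2 * np.pi * 25 * t)              # rotation frequency
    extra = 0.5 * np.sin(2 * np.pi * 120 * t) if fault else 0.0
    return base + extra + 0.1 * rng.standard_normal(n)

windows = np.array([make_window(fault) for fault in [False] * 50 + [True] * 50])
labels = np.array([0] * 50 + [1] * 50)

# FFT preprocessing: the spectrum makes the fault frequency an explicit feature.
spectra = np.abs(np.fft.rfft(windows, axis=1))

clf = SVC(kernel="rbf").fit(spectra, labels)
print(clf.score(spectra, labels))        # training accuracy, for illustration only
```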
Nevertheless, I've seen recent practical examples using deep convolutional networks which have learned to predict problems from raw vibration data. The disadvantage, however, is that you need more data. More data is not a problem in general, but if you take, for example, a wind turbine, getting more failure data obviously is -- or hopefully ;-) -- a problem.
The other thing is that the ConvNet learned the FFT all by itself. But why not use prior knowledge if you have it?
I have univariate time series data and I need to run an anomaly detection algorithm on it. Can anyone suggest a standard algorithm for anomaly detection that works in most cases?
There is no such algorithm "which works in most cases". The task heavily depends on the specifics of your case, e.g. whether you need local anomalies, where a point differs from other points near it, or global ones, where a point does not look similar to any other point in the dataset.
A very good review of anomaly detection algorithms can be found here
Perhaps you can easily try a one-class SVM, which is available in many libraries and programming languages. For instance, in Python you can use scikit-learn.
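A minimal sketch with scikit-learn's OneClassSVM on a univariate series; the window size and the nu/gamma values are placeholders rather than recommendations:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 40, 2000)) + 0.1 * rng.standard_normal(2000)
series[1500] += 3.0                        # injected anomaly for illustration

# Turn the univariate series into overlapping windows (one row per window).
w = 20
windows = np.lib.stride_tricks.sliding_window_view(series, w)

model = OneClassSVM(nu=0.01, gamma="scale").fit(windows)
flags = model.predict(windows)             # -1 marks suspected anomalous windows
print(np.where(flags == -1)[0][:10])
```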
This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that these collected data is not reliable due to many misstatements. Despite these misstatements, we have some features in the dataset which are certainly reliable. But these certain features are just a little part of information for each household wealth."
Unreliable data means that households tell lies to government. These households misstate their income and wealth in order to unfairly get more governmental services. Therefore, these fraudulent statements in original data will lead to incorrect results and patterns.
Now, I have the following questions:
How should we deal with unreliable data in data science?
Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
How can we evaluate our errors in this study? Since we have an unlabeled dataset, should I look for labeling techniques? Or should I use unsupervised methods? Or should I work with semi-supervised learning methods?
Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
Please point me to any ideas or references that can help with this issue.
Thanks in advance.
Q: How should we deal with unreliable data in data science?
A: Use feature engineering to fix the unreliable data (apply transformations that make it reliable) or drop it completely - bad features can significantly decrease the quality of the model.
Q: Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
A: ML algorithms are not magic wands; they can't figure out anything unless you tell them what you are looking for. Can you describe what 'unreliable' means? If so, you can, as I mentioned, use feature engineering or write code that fixes the data. Otherwise, no ML algorithm will be able to help you without a description of what exactly you want to achieve.
Q: Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
A: I don't think so, simply because the question itself is too open-ended. What does 'the quality of the data' mean?
Generally, here are a couple of things for you to consider:
1) Spend some time googling feature engineering guides. They cover how to prepare your data for your ML algorithms, refine it, and fix it. Good data with good features dramatically improve the results.
2) You don't need to use all of the features from the original data. Some features of the original dataset are meaningless, and you don't need to use them. Try running a gradient boosting machine or a random forest classifier from scikit-learn on your dataset to perform classification (or regression, if regression is your task). These algorithms also evaluate the importance of each feature of the original dataset. Some of your features will have extremely low importance for classification, so you may wish to drop them completely or try to combine unimportant features somehow to produce something more important; see the sketch below.
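A hedged sketch of that feature-importance step with scikit-learn; the column names, values and the 'top_decile' label below are invented purely for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Placeholder survey-like data; in practice this would be the real household dataset.
df = pd.DataFrame({
    "reported_income": [20, 35, 50, 15, 80, 60, 25, 90],
    "household_size":  [4, 2, 3, 5, 1, 2, 4, 1],
    "region_code":     [1, 2, 1, 3, 2, 1, 3, 2],
    "top_decile":      [0, 0, 0, 0, 1, 1, 0, 1],   # assumed label
})

X, y = df.drop(columns="top_decile"), df["top_decile"]
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Features with very low importance are candidates to drop or to combine.
for name, importance in sorted(zip(X.columns, clf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {importance:.3f}")
```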