Autocorrelation and heteroskedasticity in VAR models - time-series

I am building a VAR(X) model to estimate the effects of advertising expenditures in different channels on the Google Trends Search Volume Index for a specific brand and its competitors, using daily time-series data.
However, when I check for residual autocorrelation, the null hypothesis of no autocorrelation is rejected for a high number of lags. I have read contradictory information on whether autocorrelation is a big issue. Could you please advise me on the best way to deal with the autocorrelation? I am working in EViews.
Another issue concerns heteroskedasticity of the residuals: that assumption is also violated. I cannot log-transform the data because I have a lot of zero values.
I hope somebody could help me with these modelling issues.
KR,
Larissa Komen

Serial autocorrelation (autocorrelation at a high number of lags) is usually a result of misspecification. Most likely you used non-stationary time series. If that is the case, you should not fit a VAR model but a vector error correction model, or at least difference the data.
If your data are stationary, try varying the number of lags; that usually helps.
One more possible cause: your data may have structural breaks or outliers. In that case, try adding dummy variables.
Hope this helps.
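The asker is working in EViews, so this is not their workflow, but as an illustration of the steps described above (unit-root check, differencing, lag selection, residual autocorrelation test), here is a rough sketch in Python using statsmodels. The file name series.csv, the 0.05 threshold, and maxlags=14 are placeholder assumptions.

```python
# Sketch of the VAR diagnostic workflow described above (statsmodels assumed).
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR
from statsmodels.tsa.stattools import adfuller
from statsmodels.stats.diagnostic import acorr_ljungbox

# Placeholder: daily series (ad spend per channel, search volume indices).
df = pd.read_csv("series.csv", index_col=0, parse_dates=True)

# 1. Check each series for a unit root; difference the non-stationary ones.
for col in df.columns:
    pvalue = adfuller(df[col].dropna())[1]
    if pvalue > 0.05:
        df[col] = df[col].diff()
df = df.dropna()

# 2. Let information criteria pick the lag order, then fit.
model = VAR(df)
order = model.select_order(maxlags=14)   # daily data: allow up to two weeks of lags
results = model.fit(order.aic)

# 3. Test each equation's residuals for remaining serial correlation.
resid = np.asarray(results.resid)
for i, col in enumerate(df.columns):
    print(col, acorr_ljungbox(resid[:, i], lags=[7, 14]))
```

If the series turn out to be non-stationary and cointegrated, statsmodels also provides a VECM class (statsmodels.tsa.vector_ar.vecm.VECM), matching the error-correction suggestion above.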

Related

which clustering algorithm is more likely to give the expected clustering result

I am given a set of 2-dimensional data in the format of Figure 1. The layout and the expected clustering results (in two different colors and symbols) are shown in Figure 2. Among the common clustering methods, which one(s) is/are more likely to give the expected clustering result? Why? Thanks.
Figure 1
Figure 2
This question is rather vague. What exactly do you mean by "the common clustering methods"?
I'll give it a try anyway:
At first glance, I would guess that plenty of good clustering algorithms would have no trouble with your data, for the obvious reason that it is well separated.
Another thing to keep in mind is whether you know the number of clusters you expect in your data. You don't really state this, but it strongly influences the approach you would take (or whether you would add some metric of clustering quality in order to find a suitable number of clusters, e.g. the elbow method or some entropy measure).
Here are a few clustering approaches that could work for you:
k-means
Region growing
I hope this gives you a starting point for what to look into.
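To make the k-means suggestion concrete, here is a minimal sketch using scikit-learn, combined with the elbow heuristic mentioned above for picking the number of clusters. The array X is a stand-in for the 2-D points in the figures, and k=2 is only assumed because the expected result shows two groups.

```python
# Minimal k-means sketch with an elbow check (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # stand-in for the 2-D points in the figures

# Elbow heuristic: watch the within-cluster sum of squares (inertia) as k grows.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# Fit with the chosen k (here assumed to be 2, as in the expected result).
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```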

Is a Detrended Correspondence Analysis over time possible?

I have a dataset with presence-absence species data measured at several different sites. The data were collected over a span of 10 years. At many sites, measurements were taken several times within a year. The frequency of measurement is not constant, and not all sites were measured several times; some were measured only once.
I know that a classical Detrended Correspondence Analysis is not helpful here, since it does not consider time as a cofactor. Is there any way to include all sampling points, or any other correspondence analysis method that would be useful here?
Thanks a lot for any help!
If you want to estimate the time effect or partial it out, yes, but not in vegan. Canoco has detrended canonical correspondence analysis (DCCA), the constrained form of DCA, but vegan doesn't and is unlikely to ever have it.
There's nothing stopping you from throwing all samples into a DCA; you just can't remove the temporal effects.
Alternatively, choose a suitable dissimilarity coefficient and use NMDS via vegan's wrapper metaMDS(). This will give you a DCA-like analysis. If you want to account for the temporal effects, then, using the same dissimilarity, look at dbrda() as one option.
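The answer above refers to R's vegan package; purely as a rough Python analogue of the "NMDS on a chosen dissimilarity" idea (scikit-learn and SciPy assumed, with a made-up presence-absence matrix), a sketch could look like the following. It is not a substitute for metaMDS(), which adds several refinements such as multiple random starts and rotation of the solution.

```python
# Rough Python analogue of NMDS on a chosen dissimilarity (not vegan's metaMDS()).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

sites = np.random.randint(0, 2, size=(30, 50))  # placeholder presence-absence matrix

# Bray-Curtis (equivalent to Sørensen on presence-absence data) as the dissimilarity.
d = squareform(pdist(sites, metric="braycurtis"))

# Non-metric MDS on the precomputed dissimilarity matrix.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           n_init=10, random_state=0)
scores = nmds.fit_transform(d)
print(nmds.stress_)
```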

Do you have any suggestions for a Machine Learning method that may actually learn to distinguish these two classes?

I have a dataset in which the two classes overlap a lot. So far my results with SVM are not good. Do you have any recommendations for a model that may be able to distinguish between these two classes?
Scatter plot from both classes
It is easy to fit the dataset by interpolating one of the classes and predicting the other class everywhere else. The problem with this approach, though, is that it will not generalize well. The question you have to ask yourself is whether you can predict the class of a point given its attributes. If not, then every ML algorithm will also fail to do so.
Then the only reasonable thing you can do is collect more data and more attributes for every point. Maybe by adding a third dimension you can separate the data more easily.
If the data overlap this much in these features, the points would effectively belong to the same class, but we know they do not. So there must be some feature(s) or variable(s) that separate these data points into two classes. Try adding more features to the data.
And sometimes, just transforming the data onto a different scale can help.
The two classes need not be equally distributed; a skewed class distribution can be handled separately.
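As a quick way to try the "different scale" suggestion, here is a small sketch with scikit-learn. The Yeo-Johnson power transform is just one assumed choice (it also tolerates zero values); X is a placeholder for the plotted features.

```python
# Sketch: rescale or power-transform the features before refitting a model.
import numpy as np
from sklearn.preprocessing import PowerTransformer, StandardScaler

X = np.random.rand(300, 2) * 100   # placeholder for the plotted features

X_std = StandardScaler().fit_transform(X)                        # zero mean, unit variance
X_pow = PowerTransformer(method="yeo-johnson").fit_transform(X)  # reshapes skewed features
```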
First of all, what is your criterion for "good results"? What kind of SVM did you use? A simple linear kernel will certainly fail for most notions of "good", but a heavily tuned Gaussian kernel might dredge something out of the handfuls of contiguous points in the upper regions of the plot.
I suggest that you run some basic statistics on the data you've presented, to see whether the classes are actually as separable as you'd want. I suggest a t-test for starters.
If you have other dimensions, I strongly recommend that you use them. Start with the greatest amount of input you can handle, and reduce from there (principal component analysis). Until we know the full shape and distribution of the data, there's not much hope of identifying a useful algorithm.
That said, I'll make a pre-emptive suggestion that you look into spectral clustering algorithms when you add the other dimensions. Some are good with density, some with connectivity, while others key on gaps.
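To make the Gaussian-kernel suggestion concrete, a sketch along these lines (scikit-learn assumed; X and y are placeholders for the plotted data) cross-validates an RBF SVM over a grid of C and gamma, which is usually worth trying before concluding the classes are inseparable in the available features.

```python
# Sketch: grid-search an RBF-kernel SVM with cross-validation (scikit-learn assumed).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(300, 2)             # placeholder for the plotted features
y = np.random.randint(0, 2, size=300)  # placeholder class labels

pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, grid, cv=5, scoring="balanced_accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```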

What should I do when the training set contains some erroneous data in supervised classification?

I am working on a project that performs automatic text classification. I have a lot of data, in a format like the one below:
Text | CategoryName
xxxxx... | AA
yyyyy... | BB
zzzzz... | AA
I will then use the above data set to train a classifier; when new text comes in, the classifier should label it with the correct CategoryName.
(The text is natural language, with size between 10 and 10,000.)
Now, the problem is that the original data set contains some incorrect labels (e.g. a text that should be labeled as Category AA is accidentally labeled as Category BB), because the data were classified manually. I don't know which labels are wrong, or what percentage of them is wrong, because I can't review all the data manually...
So my question is, what should I do?
Can I find the wrong labels in some automatic way?
How can I increase precision and recall as new data comes in?
How can I evaluate the impact of the wrong data, since I don't know what percentage of the data is wrong?
Any other suggestions?
Obviously, there is no easy way to solve your problem - after all, why build a classifier if you already have a system that can detect wrong classifications?
Do you know how much the erroneous classifications affect your learning? If there are only a small percentage of them, they should not hurt the performance much. (Edit. Ah, apparently you don't. Anyway, I suggest you try it out - at least if you can identify a false result when you see one.)
Of course, you could always first train your system and then have it suggest classifications for the training data. This might help you identify (and correct) your faulty training data. This obviously depends on how much training data you have, and if it is sufficiently broad to allow your system to learn correct classification despite the faulty data.
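One simple way to implement the "have the trained system suggest classifications for the training data" idea, without scoring the model on examples it has already memorized, is to use out-of-fold predictions. A sketch with scikit-learn (the TF-IDF + logistic regression pipeline and the toy texts are placeholder assumptions):

```python
# Sketch: flag training examples whose out-of-fold prediction disagrees with the label.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

texts = ["xxxxx ...", "yyyyy ...", "zzzzz ..."] * 100    # placeholder corpus
labels = np.array(["AA", "BB", "AA"] * 100)              # placeholder CategoryName labels

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pred = cross_val_predict(pipe, texts, labels, cv=5)

suspects = np.flatnonzero(pred != labels)  # candidates for manual review
print(len(suspects), "examples where the model disagrees with the given label")
```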
Can you review any of the data manually to find some mislabeled examples? If so, you might be able to train a second classifier to identify mislabeled data, assuming there is some kind of pattern to the mislabeling. It would be useful for you to know if mislabeling is a purely random process (it is just noise in the training data) or if mislabeling correlates with particular features of the data.
You can't evaluate the impact of mislabeled data on your specific data set if you have no estimate regarding what fraction of your training set is actually mislabeled. You mention in a comment that you have ~5M records. If you can correctly manually label a few hundred, you could train your classifier on that data set, then see how the classifier performs after introducing random mislabeling. You could do this multiple times with varying percentages of mislabeled data to see the impact on your classifier.
Qualitatively, having a significant quantity of mislabeled samples will increase the impact of overfitting, so it is even more important that you do not overfit your classifier to the data set. If you have a test data set (assuming it also suffers from mislabeling), then you might consider training your classifier to less-than-maximal classification accuracy on the test data set.
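The flip-labels-and-measure experiment described above could be sketched like this, assuming a small hand-verified set and any scikit-learn classifier (the random features, the binary labels, and the noise fractions are all placeholders):

```python
# Sketch: measure how accuracy degrades as a growing fraction of labels is flipped.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))     # placeholder for the hand-verified records
y = rng.integers(0, 2, size=500)   # placeholder correct labels

for frac in [0.0, 0.05, 0.1, 0.2, 0.3]:
    y_noisy = y.copy()
    flip = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y_noisy[flip] = 1 - y_noisy[flip]   # flip a random subset of labels
    score = cross_val_score(LogisticRegression(max_iter=1000), X, y_noisy, cv=5).mean()
    print(f"{frac:.0%} labels flipped -> CV accuracy {score:.3f}")
```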
People usually deal with the problem you are describing by having multiple annotators and computing their agreement (e.g. Fleiss' kappa). This is often seen as an upper bound on the performance of any classifier. If three people give you three different answers, you know the task is quite hard and your classifier stands no chance.
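If you do collect labels from several annotators, their agreement can be computed directly. A sketch using statsmodels (assumed), where each row is one text and each column one annotator's category code:

```python
# Sketch: inter-annotator agreement via Fleiss' kappa (statsmodels assumed).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Placeholder codes: 0 = category "AA", 1 = category "BB".
# Rows are texts, columns are annotators.
ratings = np.array([
    [0, 0, 0],
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 1],
])
table, _ = aggregate_raters(ratings)   # per-text counts for each category
print(fleiss_kappa(table))
```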
As a side note:
If you do not know how many of your records have been labelled incorrectly, you do not understand one of the key properties of the problem. Select 1000 records at random and spend the day reviewing their labels to get an idea. It really is time well spent. For example, I found I can easily review 500 labelled tweets per hour. Health warning: it is very tedious, but a morning spent reviewing gives me a good idea of how distracted my annotators were. If 5% of the records are incorrect, it is not such a problem. If 50% are incorrect, you should go back to your boss and tell them it can't be done.
As another side note:
Someone mentioned active learning. I think it is worth looking into options from the literature, keeping in mind that labels might have to change. You said yourself that this is hard.

What algorithm would you use for clustering based on people attributes?

I'm pretty new to the field of machine learning (even though I find it extremely interesting), and I wanted to start a small project where I'd be able to apply some of it.
Let's say I have a dataset of persons, where each person has N different attributes (only discrete values, each attribute can be pretty much anything).
I want to find clusters of people who exhibit the same behavior, i.e. who have a similar pattern in their attributes ("look-alikes").
How would you go about this? Any thoughts to get me started?
I was thinking about using PCA, since we can have an arbitrary number of dimensions and it could be useful to reduce them. K-means? I'm not sure it fits this case. Any ideas on what would be best suited to this situation?
I know how to code all of those algorithms, but I'm missing the real-world experience to know what to apply in which case.
K-means using the n-dimensional attribute vectors is a reasonable way to get started. You may want to play with your distance metric to see how it affects the results.
The first step in pretty much any clustering algorithm is to find a suitable distance function. Many algorithms, such as DBSCAN, can then be parameterized with this distance function (at least in a decent implementation; some, of course, only support Euclidean distance...).
So start with considering how to measure object similarity!
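For purely discrete attributes, one concrete way to start from the distance function is to use a simple matching (Hamming) distance and feed it to an algorithm that accepts precomputed distances. A sketch with SciPy and scikit-learn; the integer attribute codes, eps, and min_samples are placeholder assumptions to tune on real data:

```python
# Sketch: cluster people with discrete attributes via a Hamming distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

# Placeholder: each row is a person, each column a discrete attribute coded as an integer.
people = np.random.randint(0, 5, size=(100, 8))

# Hamming distance = fraction of attributes on which two people differ.
d = squareform(pdist(people, metric="hamming"))

labels = DBSCAN(eps=0.4, min_samples=5, metric="precomputed").fit_predict(d)
print(np.unique(labels, return_counts=True))  # -1 marks points labelled as noise
```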
In my opinion you should also try the expectation-maximization (EM) algorithm. On the other hand, you must be careful when using PCA, because it may remove dimensions that are relevant to the clustering.
