Down-to-earth introduction to time-series for a programmer - time-series

I'm a programmer who is interested in processing and analyzing time-series data. I know basic statistics and math, but I'm afraid that's all.
Can you please recommend good books and/or articles that do not require a Ph.D. to understand?
As for my concrete tasks - I want to be able to spot trends, eliminate outliers, make predictions and calculate stats over a range of values. We have quite a lot of events coming off our systems.
I started reading "Introduction to Time Series and Forecasting" by Brockwell and Davis - and I'm completely lost in math.
Update on outliers: by outliers I mean data points that don't necessarily make sense, e.g. the exchange rate is $1.50 (+-10 cents) for a pound on average, but a guy around the corner offers $1.09 and says he's completely legit.

I've found the NIST Engineering Statistics Handbook's chapter on time series to be a simple and clear introduction to basic time series modeling. It discusses exponential smoothing, auto-regressive, moving average, and eventually ARMA time series modeling. These can be used for trend analysis and possibly prediction, subject to validation.
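To give a feel for how small the simplest of these models is, here is a minimal sketch of simple exponential smoothing in plain Python. It uses one common formulation (the first smoothed value is the first observation); the function name, the sample data and the choice alpha=0.5 are made up for illustration, and indexing conventions differ between textbooks.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing.

    s[0] = x[0]
    s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not values:
        return []
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


series = [10.0, 12.0, 11.0, 15.0, 14.0]  # made-up sample data
smoothed = exponential_smoothing(series, alpha=0.5)
# The last smoothed value can serve as a naive one-step-ahead forecast.
forecast = smoothed[-1]
```

A larger alpha follows the data more closely, a smaller one smooths more heavily; which value is appropriate has to be validated on your own data.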
Outlier/anomaly detection is a much different task; the NIST book doesn't have much on this. It would be helpful to know what kind of outliers you are trying to detect.
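For the kind of outlier described in the question's update (one quote far away from an otherwise stable rate), a very rough first filter is a modified z-score based on the median and the median absolute deviation (MAD). The sketch below is only that, a sketch: the function name and the sample rates are invented, the 3.5 cut-off is a commonly quoted rule of thumb rather than anything derived from your data, and the method ignores the time ordering entirely.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the indices of points whose modified z-score exceeds threshold."""
    med = median(values)
    mad = median(abs(x - med) for x in values)
    if mad == 0:
        return []  # no spread at all, so nothing can be scored
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data.
    return [
        i for i, x in enumerate(values)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


rates = [1.48, 1.52, 1.50, 1.47, 1.09, 1.53, 1.49, 1.51]  # made-up quotes
suspicious = mad_outliers(rates)
```

With these sample rates only the 1.09 quote (index 4) is flagged. For data with a trend or seasonality you would apply something like this to a sliding window or to the residuals of a fitted model instead of the raw values.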

I've gone through numerous books and articles and here are my findings. Maybe they will help others like me.
Regarding theory - I found the article "An Introductory Study on Time Series Modeling and Forecasting" very well written. That doesn't mean I understood all of its contents, but it's a really good overview of the available time series models.
If you're like me and like to see some actual code - there's an article series on QuantStart. Examples are in R, but I guess many of them are portable to Python.
I can highly recommend the QuantStart blog by Michael Halls-Moore. I found the articles easy to read, and the author has done a great job of not overwhelming the reader with math. I also read Michael's first book and it's a good one for a beginner in the space like me.
Textbooks on the topic are extremely hard for me to read. I tried Time Series Analysis by Hamilton, but haven't gotten far.
Regarding the outlier detection I mentioned - I've found this question on SO and its stats counterpart. By the looks of it, it's not something you can study and implement in a couple of evenings, at least not for me.

Related

Tips for writing an algorithm for paraphrasing sentences (machine learning)

I am doing a project at the university and I need to train an algorithm to rephrase sentences. What can you advise for the implementation? Is it possible to use a translator to translate into another language and back, so that the end result is a paraphrased sentence? Also, I want to use Word2Vec; or is that a bad idea?
This kind of broad-advice question – and about a very tough problem, paraphrasing text, that is still a very active research problem – would be better answered by surveying the research literature.
A great site for searching relevant papers – and then finding other related papers once you've set some positive examples – is http://www.arxiv-sanity.com/.
Searching for [paraphrasing] or [summarization] would give you a running start in seeing major techniques & their limitations. And, once you start bookmarking papers by the little 'disk' icon, it can autosuggest important related papers... so even if your 1st few finds are tangential or far-from-usefulness, it can lead you to the seminal papers, & prevailing cutting-edge algorithms/libraries, pretty quickly.

How to give a logical reason for choosing a model

I used machine learning to train a classifier on depression-related sentences, and LinearSVC performed best. Besides LinearSVC, I experimented with MultinomialNB and LogisticRegression, and I chose the model with the highest accuracy among the three. What I really want is to be able to judge in advance which model will fit, as with the ml_map provided by scikit-learn. Where can I get this information? I searched a few papers, but couldn't find anything more detailed than the statement that SVM is suitable for text classification. How do I study to get prior knowledge like this ml_map?
How do I study to get prior knowledge like this ml_map?
Try working with different example datasets and data types, using different algorithms. There are hundreds to be explored. Once you get a good grasp of how they work, it will become clearer. And do not forget to try googling something like "advantages of algorithm X"; it helps a lot.
And here are my thoughts (I used to ask such questions myself, and I hope this helps if you are struggling): the more you work with different Machine Learning models on a specific problem, the sooner you will realize that data and feature engineering play a more important part than the algorithms themselves. The road map provided by scikit-learn gives you a good view of which group of algorithms to use for certain types of data, and that is a good start. The boundaries between them, however, are rather subtle. In other words, one problem can be solved by different approaches depending on how you organize and engineer your data.
To sum up, in order to achieve good out-of-sample performance (i.e., good generalization) while solving a problem, it is mandatory to look at the training/testing process with different setting combinations and to be mindful of your data (for example, answer this question: does it cover most samples in terms of distribution in the wild, or just a portion of it?).
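One way to make "look at the training/testing process" concrete is k-fold cross-validation: every candidate model is scored only on data it was not trained on, and the averaged scores are compared. The sketch below is plain Python with hypothetical helper names (k_fold_indices, cross_val_accuracy, fit_majority); a real comparison of LinearSVC, MultinomialNB and LogisticRegression would normally use the cross-validation utilities that ship with scikit-learn instead.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and yield (train_idx, test_idx) pairs for k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def cross_val_accuracy(fit, X, y, k=5):
    """Average held-out accuracy; fit(X, y) must return a predict(x) function."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_majority(X, y):
    """Baseline 'model': always predict the most common training label."""
    most_common = max(set(y), key=y.count)
    return lambda x: most_common
```

The majority-class baseline is worth keeping in the comparison: if a real model's cross-validated accuracy is not clearly above it, a high accuracy figure mostly reflects class imbalance rather than anything the model learned.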

Increasing the efficiency of equipment using Amazon Machine Learning

The problem statement is kind of vague, but I am looking for directions; because of our privacy policy I can't share exact details, so please help out.
We have a problem at hand where we need to increase the efficiency of equipment, or in other words decide at which values across multiple parameters the machines should operate to produce optimal outputs.
My query is whether it is possible to come up with such numbers using Linear Regression or Multinomial Logistic Regression algorithms; if not, can you please specify which algorithms would be more suitable? Also, can you please point me to some active research on this kind of problem that is available in the public domain?
Does the type of problem I am asking about fall within the area of Machine Learning?
Lots of unknowns here but I’ll make some assumptions.
What you are attempting to do could probably be achieved with multiple linear regression. I have zero familiarity with the Amazon service (I didn’t even know it existed until you brought this up, it’s not available in Europe). However, a read of the documentation suggests that the Amazon service would be capable of doing this for you. The problem you will perhaps have is that it’s geared to people unfamiliar with this field and a lot of its functionality might be removed or clumped together to prevent confusion. I am under the impression that you have turned to this service because you too are somewhat unfamiliar with this field.
Something that may suit your needs better is Response Surface Methodology (RSM), which I have applied to industrial optimisation problems that I think are similar to what you suggest. RSM works best if you can obtain your data through an experimental design such as a Central Composite Design or Box-Behnken design. I suggest you spend some time Googling these terms to get your head around them; I don’t think it’s an unmanageable burden to learn how to apply these with no prior experience in this area. Because your question is vague, only you can determine if this really is suitable. If you already have the data in an unstructured format, you can still generate an RSM but it is less robust. There are plenty of open-access articles using these techniques but Science Direct is conveniently down at the moment!
Minitab is a software package that will do all the regression and RSM for you. Its strength is that it has a robust GUI and partially reflects Excel, so it is far less daunting to get into than something like R. It also has plenty of guides online. They offer a 30-day free trial, so it might be worth doing some background reading, collecting the tutorials you need and developing a plan of action before downloading the trial.
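To show what a Central Composite Design looks like in practice, here is a small sketch that lists the experimental runs in coded units (-1 and +1 for the low and high setting of each factor, 0 for the midpoint). The function name is hypothetical, and the axial distance uses the usual "rotatable" choice, the fourth root of the number of factorial runs; other choices of that distance are also used.

```python
from itertools import product


def central_composite_design(k, n_center=1):
    """Coded design points for k factors: factorial, axial (star) and center runs."""
    # Rotatable design: alpha = (number of factorial runs) ** (1/4)
    alpha = (2 ** k) ** 0.25
    # Full two-level factorial: every combination of low (-1) and high (+1).
    factorial = [list(p) for p in product([-1.0, 1.0], repeat=k)]
    # Axial points: one factor at -alpha or +alpha, all others at the center.
    axial = []
    for i in range(k):
        for level in (-alpha, alpha):
            point = [0.0] * k
            point[i] = level
            axial.append(point)
    # Center points: all factors at their midpoint.
    center = [[0.0] * k for _ in range(n_center)]
    return factorial + axial + center


design = central_composite_design(2)  # e.g. two machine parameters
```

For two factors this gives 4 factorial, 4 axial and 1 center run. Each coded value is then mapped back to a real machine setting, the runs are carried out, and a second-order (quadratic) regression is fitted to the measured outputs to locate the optimum.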
Hope that is some help.

Clustering of Twitter Feeds

I am new to clustering, just implemented a couple of algorithms before.
I need to cluster tweets according to their similarity.
One way is to use only hash tags, but I don't think it would be that informative. So complete tweets should be analyzed.
Moreover I was searching the web for the algorithms for clustering feeds.
One I encountered is TF-IDF. I want to know whether there are better algorithms that can be implemented in a few hours. Also, I would be interested in some informative sources about the clustering of Twitter feeds.
PS: No. of tweets : 10^5
As Anony Mousse pointed out in his comment above, TF/IDF is only a normalization measure to make sure words that are overly popular among all documents don't gain too much importance.
For data preparation, I'd recommend reading this and the second part of it too (linked via the above link), if you haven't already done so. It is very important to get a vector of numbers from each tweet. In general, in machine learning, it is important to get a feature vector because that way you can apply mathematical algorithms to your data.
Now that you have a feature vector for each tweet in your collection, things get a bit simpler. There are two clustering algorithms that come to my mind that you can whip up in a couple of hours each, with maybe extensive testing taking a weekend.
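As an illustration of that weighting, here is a minimal TF-IDF sketch in plain Python using the basic, unsmoothed definition (term frequency times the log of the inverse document frequency). The function name and the sample tweets are made up, tokenization is a bare split on whitespace, and library implementations typically use smoothed variants that give slightly different numbers.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


tweets = [
    "the match was great".split(),
    "the food was great".split(),
    "the match was boring".split(),
]
weights = tf_idf(tweets)
```

Words that occur in every tweet ("the", "was") get weight 0, while a word unique to one tweet ("food") gets the largest weight in it. The resulting dictionaries are the sparse feature vectors a clustering algorithm can then work on.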
K-Means Clustering
Hierarchical Clustering With Single Linkage
With 100,000 tweets only, you should actually be able to implement these algorithms on a single computer (i.e. this is not big data -- no need for cluster computing), using your favorite language (C++, Java, Python, MATLAB, etc.). Personally, I think it's easier to implement K-Means Clustering (which I have done before) compared to Hierarchical Clustering (which I have also done before).
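To back up the "couple of hours" estimate, here is a compact sketch of K-Means (Lloyd's algorithm) in plain Python. It assumes dense feature vectors given as lists of floats, picks the initial centroids at random from the data, and runs a fixed number of iterations instead of checking convergence; the names and the sample points are illustrative only.

```python
import random


def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on lists of floats; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centroids, labels = k_means(points, k=2)
```

The result depends on the random initialization, so it is common to run the algorithm several times with different seeds and keep the clustering with the smallest total within-cluster distance. For 10^5 sparse tweet vectors you would also want a sparse representation rather than dense lists.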
EDIT: Please follow the comments below only if you have labeled training data, i.e. you have tweets with, say, labeled sentiments (happy-user, ok-ok, bad product, angry-user, abusive-user) and the question you want to answer is: given a new tweet, what is its sentiment?
Here is one very good resource you should look at, to get a better understanding of K-Nearest Neighbors:
Laszlo Kozma's Slides
In general, for the other two algorithms, there are ample resources, with Wikipedia articles the best way to start. Personally, I feel K-Nearest Neighbors (shorthand k-NN) is the easiest of the three to implement and will give you quick results.
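For the labeled case, a k-NN classifier really is only a few lines. The sketch below assumes each tweet has already been turned into a numeric feature vector; the vectors are invented toy data, and the labels are borrowed from the sentiment examples above.

```python
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    # Sort training indices by squared Euclidean distance to the query.
    by_distance = sorted(
        range(len(train_vectors)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_vectors[i], query)),
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


train_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.8, 0.2],  # toy "happy" tweets
    [0.0, 1.0], [0.1, 0.9], [0.2, 0.8],  # toy "angry" tweets
]
train_labels = ["happy-user"] * 3 + ["angry-user"] * 3
prediction = knn_predict(train_vectors, train_labels, [0.95, 0.05])
```

This brute-force version compares the query against every training vector, which is fine for experiments but slow at 10^5 tweets with many queries; an index structure or an approximate nearest-neighbour search would be the next step.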

Any ideas on how to implement flickr's tags clustering system? (preferably in Rails)

I'm mainly just looking for a discussion of approaches on how to go from decentralized, non-normalized, completely open user-submitted tags, to start making sense of all of it through combining them into those semantic groups they called "clusters".
Does it take actual people to figure out what people actually mean by the tags used, or can it be done simply by automatically analyzing how often the tags go together?
That kind of stuff. Feel free to elaborate wildly :) (Also, if this has been discussed elsewhere, I'd love to hear about it).
Read this article: Automated Tag Clustering. It provides a good overview of the existing approaches and describes the algorithms for tag clustering.
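The starting point for the automatic route raised in the question is usually a co-occurrence count: how often do two tags appear on the same item? The sketch below is in Python rather than Ruby and is not a description of how Flickr does it; the function name and the sample photos are made up.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(tagged_items):
    """Count how often each unordered pair of tags appears on the same item."""
    pairs = Counter()
    for tags in tagged_items:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs


photos = [
    ["jaguar", "cat", "zoo"],
    ["jaguar", "car", "classic"],
    ["jaguar", "cat", "wildlife"],
    ["car", "classic", "racing"],
]
counts = co_occurrence(photos)
```

Here "cat" and "jaguar" co-occur twice, as do "car" and "classic", while "car" and "cat" never do. Treating these counts as edge weights in a graph of tags, and then grouping strongly connected tags, is one way an ambiguous tag like "jaguar" can end up in separate animal and automobile clusters.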
Algorithms of the Intelligent Web (Manning) (esp. Chapter 4) and a book with a similar title from O'Reilly cover clustering algorithms. The Manning book starts with naive SQL approaches and moves to K-means, ROCK, and DBSCAN. It's more generalized than just focusing on tags, but easy to apply in that context. Code is presented in Java but is easily adapted to Ruby (sometimes more easily than adapting the Java code to your problem).
Chapter 5 covers classifications, which is about building topologies, and discusses Bayesian algorithms.

Resources