I was digging deeper into anomaly detection algorithms and found many applications in various domains, including seismic activity, intrusion detection systems (IDS), etc. However, I could not find a single paper on Google Scholar or Semantic Scholar on their application to stock markets, although there are endless papers on stock market prediction.
I just found these two sites, which briefly discuss it: SliceMatrix and Intro to AD
How come? Is it not as interesting as predicting stock prices? Or is there an additional complication with stock markets that I am not aware of?
I would be very glad if anyone could point me to some resources on this topic.
Kind Regards
Related
Imagine a clustering problem in which educational researchers would like to find clusters (groups) of students who have similar correlation patterns between their GPA and their parents' income. You are hired as the data scientist to do the job.
What kind of Objective Function would you design?
Why?
Please use your own reasoning and explain in detail.
In recent research, particle swarm optimization (PSO) was used to classify students into an unknown number of groups. I think that specific paper is all you need.
The paper is: Forming automatic groups of learners using particle swarm optimization for applications of differentiated instruction
I have been looking into applications of unsupervised learning, but have only found hypothetical ones on the internet; for example, that unsupervised learning could be used for fraud detection. Supervised learning, by contrast, has applications such as the instant physician, which is being implemented in the real world. Are applications of unsupervised learning actually being implemented, or are they just hypothetical?
There are lots of applications of unsupervised learning, and various techniques that help achieve them. I will briefly describe some of them.
1) Image segmentation, where you divide an image into different regions and cluster them to identify objects.
2) Netflix's movie recommendation system, where the movies watched by a user are put into one cluster. Here unsupervised learning plays an important role in grouping those movies and then recommending similar movies to that user.
3) Amazon shopping, where similar users are put into one cluster based on their shopping history, the amount paid for items, and the types of items they visit. This helps these giants take those factors into account and recommend only relevant things to you.
To achieve these and other applications, one needs unsupervised learning techniques. Techniques such as k-means, hierarchical clustering, and density-based clustering are used extensively in real-world applications.
These are just a few of the many applications of unsupervised learning.
Hope this clarifies.
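To make the clustering idea in the shopping example a bit more concrete, here is a minimal k-means sketch in plain Python. The shopper data and the two features (orders per month, average basket value) are made up for illustration, and this is not a claim about how Netflix or Amazon actually implement their systems.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical shoppers: (orders per month, average basket value).
shoppers = [(1, 20), (2, 25), (1, 22), (9, 210), (10, 190), (11, 205)]
labels, centers = kmeans(shoppers, k=2)
print(labels)
```

No labels are given to the algorithm; the two groups (occasional small-basket shoppers and frequent large-basket shoppers) come out of the data alone, which is what makes this unsupervised.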
Friends,
We are trying to work on a problem where we have a dump of reviews only, with no ratings, in a .csv file. Each row in the .csv is one review given by a customer of a particular product, let's say a TV.
I want to classify that text into the following predefined categories, given by the domain expert for that product:
Quality
Customer
Support
Positive Feedback
Price
Technology
Some reviews are as below:
Bought this product recently, feeling a great product in the market.
Was waiting for this product since long, but disappointed
The built quality is not that great
LED screen is picture perfect. Love this product
Damm! bought this TV 2 months ago, guess what, screen showing a straight line, poor quality LED screen
This has very complicated options, documentation of this TV is not so user-friendly
I cannot use my smart device to connect to this TV. Simply does not work
Customer support is very poor. I don't recommend this
Works great. Great product
Now, with the above 10 reviews by 10 different customers, how do I categorize them into the given buckets (you can call it multilabel classification, named entity recognition, information extraction with sentiment analysis, or anything else)?
I tried all the NLP word-frequency counting approaches (in R) and looked at StanfordNLP (https://nlp.stanford.edu/software/CRF-NER.shtml) and many more, but could not get a concrete solution.
Can anybody please guide me on how to tackle this problem? Thanks!
Most NLP frameworks handle multi-class classification. Word counts by themselves in R are not likely to be very accurate. A Python library you can explore is spaCy. Commercial APIs from Google, AWS, and Microsoft can also be used. You will need quite a few examples per category for training. Feel free to post your code and the problem or performance gap you see for further help.
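To illustrate what "examples per category for training" means in practice, here is a rough sketch of a multinomial Naive Bayes classifier in plain Python. It is not spaCy and not what any of the commercial APIs do internally. The labels on the training reviews are my own guesses (some reviews are taken from the question, some are invented), and it assigns a single category per review rather than multiple labels.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing, single label per text."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter(labels)        # label -> number of reviews
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hand-labeled examples; the labels are assumptions made for this sketch.
training = [
    ("The built quality is not that great", "Quality"),
    ("Poor quality LED screen showing a straight line", "Quality"),
    ("Customer support is very poor", "Support"),
    ("Support never answered my call", "Support"),
    ("Works great. Great product", "Positive Feedback"),
    ("LED screen is picture perfect. Love this product", "Positive Feedback"),
    ("Too expensive for what it offers", "Price"),
    ("Good price for this size", "Price"),
    ("I cannot use my smart device to connect to this TV", "Technology"),
    ("This has very complicated options", "Technology"),
]
model = NaiveBayes().fit(*zip(*training))
print(model.predict("Customer support never called back"))
```

With two examples per category this is only a toy. A real model needs many more labeled reviews per category, and a true multi-label setup would train one yes/no classifier per category instead.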
I'm trying to cluster the Twitter stream. I want to put each tweet into a cluster of tweets that talk about the same topic. I tried to cluster the stream using an online clustering algorithm with tf/idf and cosine similarity, but I found that the results are quite bad.
The main disadvantage of using tf/idf is that it clusters documents that share keywords, so it's only good for identifying near-identical documents. For example, consider the following sentences:
1- The website Stackoverflow is a nice place.
2- Stackoverflow is a website.
The previous two sentences will likely be clustered together with a reasonable threshold value, since they share a lot of keywords. But now consider the following two sentences:
1- The website Stackoverflow is a nice place.
2- I visit Stackoverflow regularly.
Now, using tf/idf, the clustering algorithm will fail miserably because the sentences share only one keyword, even though they both talk about the same topic.
My question: are there better techniques to cluster documents?
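The keyword-overlap effect described above can be reproduced with a small tf/idf and cosine similarity sketch in plain Python. The smoothed idf formula used here is one common variant and an assumption on my part; the exact numbers depend on which variant you use, but the gap between the two pairs does not.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse {token: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf (an assumption; several variants exist).
    idf = {tok: math.log((1 + n) / (1 + count)) + 1 for tok, count in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = [
    "The website Stackoverflow is a nice place.",
    "Stackoverflow is a website.",
    "I visit Stackoverflow regularly.",
]
a, b, c = tfidf_vectors(docs)
print("first vs. second:", round(cosine(a, b), 3))
print("first vs. third: ", round(cosine(a, c), 3))
```

The first pair scores much higher than the second, because the only term shared by the second pair ("Stackoverflow") appears in every document and therefore gets the lowest idf weight.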
In my experience, cosine similarity on latent semantic analysis (LSA/LSI) vectors works a lot better than raw tf-idf for text clustering, though I admit I haven't tried it on Twitter data. In particular, it tends to take care of the sparsity problem that you're encountering, where the documents just don't contain enough common terms.
Topic models such as LDA might work even better.
As mentioned in other comments and answers, using LDA can give good tweet->topic weights.
If these weights alone do not cluster the tweets well enough for your needs, you could cluster the topic distributions themselves using a clustering algorithm.
While it depends on the training set, LDA could easily bundle tweets with stackoverflow, stack-overflow and stack overflow into the same topic. However, "my stack of boxes is about to overflow" might instead go into another topic about boxes.
Another example: A tweet with the word Apple could go into a number of different topics (the company, the fruit, New York and others). LDA would look at the other words in the tweet to determine the applicable topics.
"Steve Jobs was the CEO at Apple" is clearly about the company
"I'm eating the most delicious apple" is clearly about the fruit
"I'm going to the big apple when I travel to the USA" is most likely about visiting New York
Long answer:
TFxIDF is currently one of the best-known search methods. What you need is some preprocessing from natural language processing (NLP). There are a lot of resources that can help you for English (for example, the 'nltk' library in Python).
You must apply the NLP analysis both to your queries (questions) and to your documents before indexing.
The point is: while TFxIDF (or TFxIDF^2, as in Lucene) is good, you should use it on resources annotated with meta-linguistic information. That can be hard and requires extensive knowledge of your core search engine, of grammar analysis (syntax), and of the domain of the documents.
Short answer: the better technique is to use TFxIDF with light grammatical NLP annotations, and to rewrite both the queries and the indexing.
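As a rough illustration of running the same preprocessing on both the documents and the query, here is a small plain-Python sketch. Note that it swaps in a much simpler step than real grammatical annotation: it only lowercases, drops a few stopwords, and strips a trailing "s". The stopword list, the sample documents, and the scoring function are all assumptions for the example; a library such as nltk would give you proper stemming or lemmatization.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "i", "to", "of", "and"}  # illustrative, not exhaustive

def normalize(text):
    """Lowercase, drop stopwords, strip a trailing 's' (a crude stand-in for stemming)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [
        t[:-1] if t.endswith("s") and len(t) > 3 else t
        for t in tokens
        if t not in STOPWORDS
    ]

def build_index(docs):
    """Index time: normalize every document and compute idf weights."""
    bags = [Counter(normalize(d)) for d in docs]
    df = Counter(tok for bag in bags for tok in bag)
    idf = {tok: math.log(len(docs) / n) + 1 for tok, n in df.items()}
    return bags, idf

def search(query, bags, idf):
    """Query time: run the SAME normalize() before scoring with tf * idf."""
    q = normalize(query)
    scores = [sum(bag[t] * idf.get(t, 0.0) for t in q) for bag in bags]
    return sorted(range(len(bags)), key=lambda i: scores[i], reverse=True)

docs = [
    "The website Stackoverflow is a nice place.",
    "I visit websites regularly.",
    "My stack of boxes is about to overflow.",
]
bags, idf = build_index(docs)
print(search("nice websites", bags, idf))  # document indices, best match first
```

Because "websites" in the query and "website" in the first document pass through the same normalization, they match. If only one side were normalized, the query term would miss that document entirely.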
I'm looking for educational material on the subject of scalability analysis. I'm not simply looking for Big-O analysis, but for material on approaches and techniques for analysis of the scalability of large scale transactional systems. Amazon's orders and payment systems might be good examples of the sort of systems I'm referring to.
I have a preference for online materials, including text and video, in that they tend to be easily accessible but I'm open to book suggestions, too.
The High Scalability blog, for real-life issues.