I'm studying classification based on sparse coding and dictionary learning. I've read many documents but couldn't find one that is easy to understand. As I understand it, the method is based on an optimization problem:
What's the meaning of the subscript (2)? And I guess the sign ||a|| means the amplitude of the vector?
And could you please suggest a good tutorial/introduction document for sparse coding? Thank you. (I tagged "image processing" and "machine learning" because I read somewhere that these fields use sparse coding. If not true, please comment and I'll remove the tags).
It's just the definition of mean squared error.
The subscript 2 means it's an l2 (Euclidean) norm. See here for a detailed explanation.
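To make the notation concrete, here is a minimal sketch of the l2 norm and of a squared-error term. The formula from the question is not reproduced here, so this assumes the term being asked about is the usual squared l2 norm of a residual; the function names are just for illustration.

```python
import math

def l2_norm(vector):
    """Euclidean (l2) norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for x in vector))

def squared_error(x, x_hat):
    """Squared l2 norm of the residual x - x_hat (assumed form of the error term)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

a = [3.0, 4.0]
print(l2_norm(a))                              # 5.0
print(squared_error([1.0, 2.0], [0.0, 0.0]))   # 5.0
```

So ||a|| with subscript 2 is indeed the length (amplitude) of the vector, and squaring it gives the sum of squared errors.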
This is a high-level overview of sparse coding,
What does it mean to provide weights to each sample for classification? How does a classification algorithm like logistic regression or an SVM use weights to emphasize certain examples more than others? I would love to go into detail and unpack how these algorithms leverage sample weights.
If you look at the sklearn documentation for logistic regression, you can see that the fit function has an optional sample_weight parameter which is defined as an array of weights assigned to individual samples.
This option is meant for imbalanced datasets. Take an example: I have a lot of data, and some of it is just noise, but other points are really important to me, and I'd like my algorithm to give them much more consideration than the rest. So I assign those points a larger weight to make sure they are handled properly.
It changes the way the loss is calculated. Each point's error (residual) is multiplied by that point's weight, so the minimum of the objective function shifts. I hope that's clear enough. I don't know whether you're familiar with the math behind it, so here is a short introduction to have everything at hand (apologies if this was not needed):
https://perso.telecom-paristech.fr/rgower/pdf/M2_statistique_optimisation/Intro-ML-expanded.pdf
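As a rough illustration of the point above, here is a sketch of a weighted log loss for a one-dimensional logistic regression. This is not scikit-learn's actual implementation; the function names, the toy data, and the fixed parameters w and b are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_log_loss(w, b, xs, ys, sample_weight):
    """Sum of per-sample log losses, each multiplied by its sample weight."""
    total = 0.0
    for x, y, s in zip(xs, ys, sample_weight):
        p = sigmoid(w * x + b)  # predicted probability of class 1
        per_sample = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += s * per_sample
    return total

# Toy data (made up): the last point is badly fit by w=1, b=0.
xs = [-1.0, 1.0, 2.0]
ys = [0, 1, 0]

uniform = weighted_log_loss(1.0, 0.0, xs, ys, [1.0, 1.0, 1.0])
heavy = weighted_log_loss(1.0, 0.0, xs, ys, [1.0, 1.0, 5.0])
print(uniform, heavy)
```

With the larger weight, the badly fit point dominates the objective, so an optimizer would move w and b further toward fitting it, at the expense of the other points.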
See a good explanation here: https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html .
I want to implement a sentence similarity algorithm. Is it possible to implement it using a sequence prediction algorithm? If so, what kind of approach should I take, or is there another method that is more suitable for sentence similarity? Please share your views.
You could try to treat your sentences as separate documents and then use traditional approach for finding similarity between documents. It was answered here using sklearn:
Similarity between two text documents
If you want, you could try and implement the same code in tensorflow.
I also strongly recommend reading this answer, which covers more sophisticated approaches: https://stackoverflow.com/a/15173821/3633250
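For reference, the "sentences as documents" idea can be sketched without any library: build TF-IDF vectors and compare them with cosine similarity. This is only a minimal sketch; the whitespace tokenizer, the smoothed IDF formula, and the example sentences are assumptions, and it is not the code from the linked answers.

```python
import math
from collections import Counter

def tokenize(sentence):
    return sentence.lower().split()

def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (dict term -> weight) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: number of sentences containing each term.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed inverse document frequency (an assumed choice).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

sentences = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock prices fell sharply today",
]
vecs = tfidf_vectors(sentences)
sim_close = cosine(vecs[0], vecs[1])  # sentences sharing most words
sim_far = cosine(vecs[0], vecs[2])    # sentences sharing no words
print(sim_close > sim_far)            # True
```

This only captures word overlap, so paraphrases with different vocabulary will score low; that is where the more sophisticated approaches come in.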
You could consider using Doc2Vec. Each sentence (document) is mapped to an n-dimensional space. To find the most similar document,
model.most_similar("documentID")
Reference
I'm new to topic models, classification, etc. I have been working on a project for a while now and have read a lot of research papers. My dataset consists of short messages that are human-labeled. This is what I have come up with so far:
Since my data is short, I read about Latent Dirichlet Allocation (and all its variants), which is useful for detecting latent words in a document.
Based on this I found a Java implementation of JGibbLDA http://jgibblda.sourceforge.net but since my data is labeled, there is an improvement of this called JGibbLabeledLDA https://github.com/myleott/JGibbLabeledLDA
In most of the research papers, I read good reviews about Weka so I messed around with this on my dataset
However, again, my dataset is labeled and therefore I found an extension of Weka called Meka http://sourceforge.net/projects/meka/ that had implementations for Multi-labeled data
Reading about multi-labeled data, I know most used approaches such as one-vs-all and chain classifiers...
The reason I'm here is that I hope to get answers to the following questions:
Is LDA a good approach for my problem?
Should LDA be used together with a classifier (NB, SVM, Binary Relevance, Logistic Regression, …) or is LDA 'enough' to function as a classifier/estimator for new, unseen data?
How should I interpret the output of JGibbLDA / JGibbLabeledLDA? How do I get from these files to something that tells me which words/labels are assigned to the WHOLE message (not just to each word)?
How can I use Weka/Meka to get what I want in the previous question (in case LDA is not what I'm looking for)?
I hope someone, or more than one person, can help me figure out how I need to do this. The general idea of all is not the issue here, I just don't know how to go from literature to practice. Most of the papers don't give enough description of how they perform their experiments OR are too technical for my background about the topics.
Thanks!
I am very new to the field of quantitative finance, but I was wondering whether matrices can be used to identify arbitrage opportunities in multi-currency conversions. It would be a sort of shortest-path or minimum-cost problem, like those that come up in other problem sets.
This algorithms book explains (or hints, since it's an exercise), how to do it using logarithms then a classic shortest path. It was a fun problem.
For the question "are matrices useful to identify arbitrage opportunities available in multi-currency conversions?", the answer is yes. You would use a matrix to store each conversion rate from currency i to currency j in cell (i,j).
For the question "would an algorithm that finds such opportunities be similar to a shortest path finding problem?", the answer is also yes. Given the matrix for a problem, you would apply an algorithm that only resembles the Floyd-Warshall algorithm.
For a full explanation have a look here.
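To tie the two answers together, here is a small sketch of the logarithm trick combined with a plain Floyd-Warshall pass. It assumes all rates are positive and ignores transaction costs; the function name and the example rates are made up, and the linked explanation may describe a different variant.

```python
import math

def has_arbitrage(rates, eps=1e-9):
    """rates[i][j] = units of currency j received for one unit of currency i."""
    n = len(rates)
    # Multiplying rates along a path becomes adding -log(rate) edge weights,
    # so a cycle whose rate product exceeds 1 becomes a negative-weight cycle.
    dist = [[-math.log(rates[i][j]) for j in range(n)] for i in range(n)]
    # Standard Floyd-Warshall relaxation over all intermediate currencies k.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # A negative distance from a currency back to itself means a profitable loop.
    # eps guards against floating-point noise on exactly consistent rates.
    return any(dist[i][i] < -eps for i in range(n))

# Hypothetical rates for three currencies (rows/columns: A, B, C).
fair = [
    [1.0, 0.5, 0.25],
    [2.0, 1.0, 0.5],
    [4.0, 2.0, 1.0],
]
mispriced = [
    [1.0, 0.5, 0.25],
    [2.0, 1.0, 0.6],   # B -> C is too generous: A -> B -> C -> A yields 1.2
    [4.0, 2.0, 1.0],
]
print(has_arbitrage(fair))       # False
print(has_arbitrage(mispriced))  # True
```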
I've been reading a lot of articles that explain the need for an initial set of texts that are classified as either 'positive' or 'negative' before a sentiment analysis system will really work.
My question is: Has anyone attempted just doing a rudimentary check of 'positive' adjectives vs 'negative' adjectives, taking into account any simple negators to avoid classing 'not happy' as positive? If so, are there any articles that discuss just why this strategy isn't realistic?
A classic paper by Peter Turney (2002) explains a method to do unsupervised sentiment analysis (positive/negative classification) using only the words excellent and poor as a seed set. Turney uses the mutual information of other words with these two adjectives to achieve an accuracy of 74%.
I haven't tried doing untrained sentiment analysis such as you are describing, but off the top of my head I'd say you're oversimplifying the problem. Simply analyzing adjectives is not enough to get a good grasp of the sentiment of a text; for example, consider the word 'stupid.' Alone, you would classify that as negative, but if a product review were to have '... [x] product makes their competitors look stupid for not thinking of this feature first...' then the sentiment in there would definitely be positive. The greater context in which words appear definitely matters in something like this. This is why an untrained bag-of-words approach alone (let alone an even more limited bag-of-adjectives) is not enough to tackle this problem adequately.
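To show what I mean, here is a minimal sketch of the kind of adjective check with a simple negator that the question describes. The word lists are tiny and made up, and the negator only flips the word immediately after it, so treat this as an illustration rather than a usable classifier.

```python
# Tiny, made-up word lists for illustration only.
POSITIVE = {"good", "great", "happy", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "stupid"}
NEGATORS = {"not", "never", "no"}

def naive_sentiment(text):
    """Count positive vs negative words; a negator flips only the next word."""
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?'\"")
        if word in NEGATORS:
            negate = True
            continue
        polarity = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1 or 0
        if polarity:
            score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(naive_sentiment("I am not happy"))  # negative
review = ("this product makes their competitors look stupid "
          "for not thinking of this feature first")
print(naive_sentiment(review))            # negative, although the review is praise
```

The negator handles 'not happy' correctly, but the review is still labeled negative because 'stupid' is scored without regard to whom it describes.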
The pre-classified data ('training data') helps in that the problem shifts from trying to determine whether a text is of positive or negative sentiment from scratch, to trying to determine if the text is more similar to positive texts or negative texts, and classify it that way. The other big point is that textual analyses such as sentiment analysis are often affected greatly by the differences of the characteristics of texts depending on domain. This is why having a good set of data to train on (that is, accurate data from within the domain in which you are working, and is hopefully representative of the texts you are going to have to classify) is as important as building a good system to classify with.
Not exactly an article, but hope that helps.
The paper by Turney (2002) mentioned by larsmans is a good basic one. In more recent research, Li and He [2009] introduce an approach using Latent Dirichlet Allocation (LDA) to train a model that can classify an article's overall sentiment and topic simultaneously in a totally unsupervised manner. The accuracy they achieve is 84.6%.
I tried several methods of Sentiment Analysis for opinion mining in Reviews.
What worked best for me is the method described in Liu's book: http://www.cs.uic.edu/~liub/WebMiningBook.html In this book, Liu and others compare many strategies and discuss different papers on sentiment analysis and opinion mining.
Although my main goal was to extract features from the opinions, I implemented a sentiment classifier to detect the positive or negative polarity of these features.
I used NLTK for the pre-processing (word tokenization, POS tagging) and for creating the trigrams. I also used the Bayesian classifiers in this toolkit to compare against other strategies Liu pinpoints.
One of the methods relies on tagging every trigram expressing this information as pos/neg, and then using some classifier on this data.
The other method I tried, which worked better (around 85% accuracy on my dataset), was calculating the sum of the PMI (pointwise mutual information) scores between every word in the sentence and the words excellent/poor as seeds of the pos/neg classes.
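A rough sketch of the PMI idea, not my actual implementation: a word's orientation is PMI(word, excellent) minus PMI(word, poor), and a sentence's score is the sum over its words. Here co-occurrence means "appears in the same sentence of a small corpus"; the toy corpus, the smoothing constant, and the function names are all assumptions for the example.

```python
import math

def so_pmi(word, sentences, pos_seed="excellent", neg_seed="poor", smoothing=0.01):
    """Semantic orientation: PMI(word, pos_seed) - PMI(word, neg_seed)."""
    token_sets = [set(s.lower().split()) for s in sentences]

    def count(*terms):
        # Number of sentences that contain all the given terms.
        return sum(all(t in tokens for t in terms) for tokens in token_sets)

    near_pos = count(word, pos_seed) + smoothing
    near_neg = count(word, neg_seed) + smoothing
    n_pos = count(pos_seed) + smoothing
    n_neg = count(neg_seed) + smoothing
    # The word's own frequency cancels out in the difference of the two PMIs.
    return math.log2((near_pos * n_neg) / (near_neg * n_pos))

def sentence_score(sentence, corpus):
    """Sum of orientations of the distinct words in the sentence."""
    return sum(so_pmi(w, corpus) for w in set(sentence.lower().split()))

# Made-up toy corpus; a real one would be far larger.
corpus = [
    "excellent camera with sharp photos",
    "sharp screen and excellent battery",
    "poor battery and blurry photos",
    "blurry screen and poor sound",
]
print(so_pmi("sharp", corpus) > 0)                # True
print(so_pmi("blurry", corpus) < 0)               # True
print(sentence_score("blurry sound", corpus) < 0) # True
```

A positive total suggests the pos class and a negative total the neg class; words that co-occur equally with both seeds contribute roughly zero.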
I tried spotting keywords using a dictionary of affect to predict the sentiment label at the sentence level. Given the generality of the vocabulary (not domain dependent), the accuracy was only about 61%. The paper is available on my homepage.
In a somewhat improved version, negation adverbs were considered. The whole system, named EmoLib, is available for demo:
http://dtminredis.housing.salle.url.edu:8080/EmoLib/
Regards,
David,
I'm not sure if this helps, but you may want to look into Jacob Perkins' blog post on using NLTK for sentiment analysis.
There are no magic "shortcuts" in sentiment analysis, as with any other sort of text analysis that seeks to discover the underlying "aboutness," of a chunk of text. Attempting to short cut proven text analysis methods through simplistic "adjective" checking or similar approaches leads to ambiguity, incorrect classification, etc., that at the end of the day give you a poor accuracy read on sentiment. The more terse the source (e.g. Twitter), the more difficult the problem.