Sentiment Analysis for local languages (Nepali) - localization

I would like to do sentiment analysis at the document level, but I am trying to do it for Nepali, so I don't have any resources. I can't use a Naive Bayes classifier as I don't have any labelled data, and I can't go via WordNet as no Nepali WordNet exists. The papers I have read generally rely on labelled data or a SentiWordNet for other languages.
I would like to know these things:
Which approach should I use in the above case for sentiment analysis?
Is there any method for me to dynamically generate labels for the data?

Since you don't have any labelled data, have a look at this GitHub repo; feel free to fork it.
It has code for a neural network for handwriting recognition in Java. Jeff Heaton has made it easy for us; with its nice UI, you can train this model to recognize Nepali.
For sentiment analysis, you can try OpenNLP, which has some good support; see this blog for beginners.
DL4J is also a good deep learning library for Java that can be used for sentiment analysis. It has a good Word2Vec implementation and a lot of support.
These resources will help you; if you have any further doubts, feel free to comment.
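On dynamically generating labels: one common workaround when no labelled corpus exists is distant supervision, i.e. labelling documents automatically from a small seed lexicon and then training a classifier on those noisy labels. Below is a minimal sketch in Python with scikit-learn; the seed words and example documents are invented placeholders, and a real Nepali lexicon would have to be curated by hand:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical seed lexicon -- replace with curated Nepali sentiment words.
    POSITIVE_SEEDS = {"ramro", "khusi"}
    NEGATIVE_SEEDS = {"naramro", "dukha"}

    def seed_label(doc):
        """Assign a noisy label from seed-word counts; None if undecided."""
        tokens = doc.lower().split()
        pos = sum(t in POSITIVE_SEEDS for t in tokens)
        neg = sum(t in NEGATIVE_SEEDS for t in tokens)
        if pos != neg:
            return "positive" if pos > neg else "negative"
        return None

    # Placeholder documents standing in for an unlabelled Nepali corpus.
    docs = ["yo film ramro chha", "yo kitab naramro thiyo"]
    train = [(d, seed_label(d)) for d in docs if seed_label(d) is not None]

    vec = TfidfVectorizer()
    X = vec.fit_transform(d for d, _ in train)
    clf = MultinomialNB().fit(X, [y for _, y in train])
    # The trained classifier can now label documents the seed lexicon missed.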

Related

How to determine whether to use machine learning algorithms or data mining techniques for a given scenario?

I have been reading many articles on machine learning and data mining over the past few weeks: articles about the differences between ML and DM, their similarities, and so on. But I still have one question, and it may seem like a silly one:
When should we use ML algorithms and when should we use DM techniques?
I have done some practical DM work using Weka on time series analysis (future population prediction, sales prediction) and text mining using R/Python. The same tasks can also be done with ML algorithms, such as future population prediction using linear regression.
So how do I determine, for a given problem, whether ML or DM is the better fit?
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is the distinction between supervised learning and unsupervised methods.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.
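To make the rule of thumb concrete, here is a minimal Python sketch with scikit-learn and toy data: labelled examples call for a supervised learner, while unlabelled data calls for exploratory clustering. The data values are arbitrary placeholders:

    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    # "ML" in the question's sense: features paired with known labels.
    X_train = [[1.0], [2.0], [9.0], [10.0]]
    y_train = [0, 0, 1, 1]
    clf = LogisticRegression().fit(X_train, y_train)
    print(clf.predict([[1.5], [9.5]]))  # -> [0 1]

    # "DM" in the question's sense: no labels, so discover structure instead.
    X = [[1.0], [2.0], [9.0], [10.0]]
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)  # two discovered groups, e.g. [0 0 1 1]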

Classifying research papers on the basis of their titles

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million titles of research papers, so I want to know how I should start. I have tried tf-idf but could not get useful results. Does someone know of a library that makes this task easy? Kindly suggest one. I shall be thankful.
If you don't know the categories in advance, then it's not classification but clustering. Basically, you need to do the following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that fits your case best.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't predict the number of clusters even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see the discussion here).
As for features, words themselves normally work fine for clustering by topic. Just tokenize your text, then normalize and vectorize the words (see this if you don't know what all of that means).
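As an illustration, here is a minimal sketch of that tokenize/vectorize/cluster pipeline using scikit-learn; the titles are made-up placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    titles = [
        "A survey of computer network protocols",
        "Routing in wireless computer networks",
        "Deep learning for image classification",
        "Convolutional networks for vision tasks",
    ]

    # Tokenize, normalize, and vectorize in one step.
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(titles)

    # Cluster into a guessed number of topics.
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)  # e.g. [0 0 1 1]: networking vs. vision titles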
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you prefer another language, you will most probably be able to find similar libraries for it too.
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you can try a simple word-frequency model (bag of words) and later move on to more complex feature-extraction methods (string kernels). You can then use SVMs (Support Vector Machines) to classify the data via LibSVM, a widely used SVM package.
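As a rough sketch of that bag-of-words-plus-SVM pipeline in Python, here is an example using scikit-learn's LinearSVC (which wraps the related LIBLINEAR library); the labelled titles are invented placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Placeholder training data: (title, field) pairs.
    titles = [
        "Congestion control in computer networks",
        "A computer network topology survey",
        "Gene expression analysis in yeast",
        "Protein folding simulation methods",
    ]
    labels = ["networking", "networking", "biology", "biology"]

    # Bag-of-words features.
    vec = CountVectorizer()
    X = vec.fit_transform(titles)

    clf = LinearSVC().fit(X, labels)
    print(clf.predict(vec.transform(["Wireless network routing"])))  # -> ['networking']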
Since you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.

Sentiment Analysis with Ruby

Does anyone have sentiment analysis experience with the LIBLINEAR algorithm? Has anyone used the liblinear-ruby-swig gem?
Please suggest something for me to start with.
I have used LIBLINEAR a lot for other classification tasks, though not for sentiment analysis.
Are you interested in using LIBLINEAR specifically, or in doing sentiment analysis in general?
For simple sentiment analysis, have a look at
https://chrismaclellan.com/blog/sentiment-analysis-of-tweets-using-ruby
The sad_panda gem (https://rubygems.org/gems/sad_panda) is similar to an R library I have used in the past. It has tools for both polarity and emotion classification of text (as "sadness", "anger", "joy", and a few others).
There is not much work in Ruby for sentiment analysis, or machine learning in general. One of the best machine learning libraries is Weka, so you could consider using it with JRuby.
That said, I have created an entry-level gem, and I am planning to enhance it by porting some of the Weka algorithms to Ruby.

Classifying website type from webpages

Are there any reliable/deployed approaches, algorithms, or tools for tagging a website's type by parsing some of its webpages?
For example: forums, blogs, press-release sites, news, e-commerce, etc.
I am looking for some well-defined characteristics (static rules) from which this can be determined. If none exist, then I hope a machine learning model may help.
Suggestions/ideas?
If you approach this from a machine learning standpoint, a Naive Bayes classifier probably has the greatest work/payoff ratio. A version of it is used in Winnow to categorize news articles.
You will need a collection of pages, each tagged with its proper category. Then you extract words or other relevant elements from each page and use them as features.
Dr. Dobb's has an article on implementing Naive Bayes.
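As a rough Python sketch of that pipeline (using scikit-learn rather than a from-scratch implementation; the page texts and categories are invented placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Placeholder corpus: extracted page text tagged with its site type.
    pages = [
        "reply quote thread posted by member forum",
        "add to cart checkout free shipping product",
        "posted in category comments leave a reply blog",
    ]
    types = ["forum", "e-commerce", "blog"]

    vec = CountVectorizer()
    X = vec.fit_transform(pages)
    clf = MultinomialNB().fit(X, types)

    new_page = "view cart secure checkout buy now"
    print(clf.predict(vec.transform([new_page])))  # -> ['e-commerce']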
If you're interested in pursuing the naive Bayes approach (there are other machine learning options, after all), then I suggest the following document, which follows the coverage of this subject in "Data Mining: Practical Machine Learning Tools and Techniques" by Witten and Frank:
http://www.coli.uni-sb.de/~crocker/Teaching/Connectionist/lecture10_4up.pdf

Naive Bayesian for Topic detection using "Bag of Words" approach

I am trying to implement a naive Bayesian approach to find the topic of a given document or stream of words. Is there a Naive Bayesian approach that I might be able to look up for this?
Also, I am trying to improve my dictionary as I go along. Initially, I have a bunch of words that map to topics (hard-coded). Depending on the occurrences of words other than the ones already mapped, I want to add those new words to the mappings, thereby learning about new words that map to topics, and also adjust the probabilities of the existing words.
How should I go about doing this? Is my approach the right one?
Which programming language would be best suited for the implementation?
Existing Implementations of Naive Bayes
You would probably be better off just using one of the existing packages that support document classification using naive Bayes, e.g.:
Python - To do this using the Python-based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book; a minimal sketch follows after this list.
Ruby - If Ruby is more your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not funny.
Perl - Perl has the Algorithm::NaiveBayes module, complete with a sample usage snippet in the package synopsis.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J. You can see a training and scoring code snippet here.
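For the Python/NLTK route mentioned above, here is a minimal sketch of the kind of naive Bayes classifier the NLTK book describes; the documents and topics are invented placeholders:

    import nltk

    def features(text):
        """Bag-of-words features: each word becomes a boolean feature."""
        return {f"contains({w})": True for w in text.lower().split()}

    # Placeholder training data: (document, topic) pairs.
    train = [
        ("the match ended in a draw", "sports"),
        ("the striker scored twice", "sports"),
        ("parliament passed the budget", "politics"),
        ("the senator proposed a bill", "politics"),
    ]

    train_set = [(features(doc), label) for doc, label in train]
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(classifier.classify(features("the striker scored a goal")))  # -> 'sports'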
Bootstrapping Classification from Keywords
It sounds like you want to start with a set of keywords that are known to cue for certain topics and then use those keywords to bootstrap a classifier.
This is a reasonably clever idea. Take a look at the paper Text Classification by Bootstrapping with Keywords, EM and Shrinkage by McCallum and Nigam (1999). By following this approach, they were able to improve classification accuracy from the 45% they got by using hard-coded keywords alone to 66% with a bootstrapped Naive Bayes classifier. For their data, the latter is close to human levels of agreement, as people agreed with each other about document labels 72% of the time.
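To illustrate the bootstrapping idea, here is a simplified Python sketch with scikit-learn; it is not the paper's exact EM algorithm, just the core label-by-keywords-then-retrain loop, and all keywords and documents are invented placeholders:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hard-coded seed keywords per topic (placeholders).
    SEEDS = {"sports": {"match", "goal"}, "politics": {"senate", "election"}}

    docs = [
        "the goal decided the match",
        "election results reached the senate",
        "the striker trained before the final",  # no seed word: unlabelled
        "a new bill was debated by lawmakers",   # no seed word: unlabelled
    ]

    def keyword_label(doc):
        """Label a document by seed-keyword hits; None when nothing matches."""
        words = set(doc.split())
        hits = {topic: len(words & kws) for topic, kws in SEEDS.items()}
        best = max(hits, key=hits.get)
        return best if hits[best] > 0 else None

    labels = [keyword_label(d) for d in docs]
    vec = CountVectorizer()
    X = vec.fit_transform(docs)

    # Step 1: train only on the documents the keywords could label.
    seed_idx = [i for i, y in enumerate(labels) if y is not None]
    clf = MultinomialNB().fit(X[seed_idx], [labels[i] for i in seed_idx])

    # Step 2 (one bootstrapping round): relabel everything with the model
    # and retrain; the paper iterates this EM-style until labels stabilize.
    clf = MultinomialNB().fit(X, clf.predict(X))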
