Keyword extraction from short Dutch texts

I would like to extract keywords from short Dutch texts. Is there an API or library I could use for this?
If none are available for Dutch, any tips on how to extract them myself are also appreciated. I already tried running the texts through a part-of-speech tagger and lemmatizer, but from there I find it quite difficult to extract decent keywords. TF-IDF is not useful since the texts are too short to get good results.
I prefer Java, but implementations in any other language are also very welcome.

Here is my video series on text mining with RapidMiner. It shows how to easily get the TF-IDF and more:
http://vancouverdata.blogspot.ca/2010/11/text-analytics-with-rapidminer-loading.html
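As a quick illustration of the POS-filter-and-count approach the question already describes, here is a minimal Python sketch (a similar pipeline could be built in Java on top of any Dutch POS tagger). It assumes spaCy with its Dutch model nl_core_news_sm is installed; the model name and the choice of POS tags to keep are assumptions to experiment with, not a recommendation of a specific tool.

    # Minimal keyword-extraction sketch: keep content words (nouns, proper
    # nouns, adjectives), lemmatize them, and rank lemmas by frequency.
    # Assumes: pip install spacy && python -m spacy download nl_core_news_sm
    from collections import Counter

    import spacy

    nlp = spacy.load("nl_core_news_sm")  # assumed Dutch model

    def extract_keywords(text, top_n=5):
        doc = nlp(text)
        lemmas = [
            tok.lemma_.lower()
            for tok in doc
            if tok.pos_ in {"NOUN", "PROPN", "ADJ"} and not tok.is_stop
        ]
        return [lemma for lemma, count in Counter(lemmas).most_common(top_n)]

    print(extract_keywords("De nieuwe fiets van Jan is gisteren in Amsterdam gestolen."))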

Related

Learning word alignment from nltk

I have a parallel English-German corpus. Is there a way to extract a word alignment table from this corpus using NLTK? I don't know whether nltk.align is supposed to do this; I am unable to figure it out from the documentation.
Look at the source of the modules in the nltk.translate package (previously known as nltk.align); you'll find descriptions of the available algorithms and references to the research literature that explains them in more detail.
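For example, IBM Model 1 is one of those algorithms; here is a minimal sketch on a toy English-German bitext, adapted from the usage shown in the nltk.translate.ibm1 module itself:

    # Word alignment with IBM Model 1 from nltk.translate (formerly nltk.align).
    from nltk.translate import AlignedSent, IBMModel1

    # Toy bitext; in practice, build AlignedSent objects from your parallel corpus.
    bitext = [
        AlignedSent(["klein", "ist", "das", "haus"], ["the", "house", "is", "small"]),
        AlignedSent(["das", "haus", "ist", "ja", "gross"], ["the", "house", "is", "big"]),
        AlignedSent(["das", "buch", "ist", "ja", "klein"], ["the", "book", "is", "small"]),
        AlignedSent(["das", "haus"], ["the", "house"]),
        AlignedSent(["das", "buch"], ["the", "book"]),
    ]

    ibm1 = IBMModel1(bitext, 5)  # 5 EM iterations

    # Learned translation probabilities form the alignment table ...
    print(ibm1.translation_table["buch"]["book"])
    # ... and each sentence pair gets an alignment attribute after training.
    print(bitext[0].alignment)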

Accent detection API?

I've been doing some research on the feasibility of building a mobile/web app that lets users say a phrase and detects the user's accent (Boston, New York, Canadian, etc.). There will be about 5 to 10 predefined phrases a user can say. I'm familiar with some of the speech-to-text APIs that are available (Nuance, Bing, Google, etc.), but none seem to offer this additional functionality. The closest examples I've found are Google Now and Microsoft's Speaker Recognition API:
http://www.androidauthority.com/google-now-accents-515684/
https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api
Because there will be 5-10 predefined phrases, I'm thinking of using machine learning software like TensorFlow or Wekinator. I'd have initial audio recorded in each accent to use as the initial data. Before I dig deeper into this path, I just wanted to get some feedback on this approach, or on whether there are better approaches out there. Let me know if I need to clarify anything.
There is no public API for such a rare task.
Accent detection, like language detection, is commonly implemented with i-vectors. A tutorial is here. An implementation is available in Kaldi.
You need a significant amount of data to train the system even if your sentences are fixed. It might be easier to collect accented speech without focusing on the specific sentences you have.
An end-to-end TensorFlow implementation is also possible, but it would probably require too much data, since you need to separate speaker-intrinsic traits from accent-intrinsic ones (basically performing the factorization that i-vectors do). You can find descriptions of similar work, like this and this one.
You could also use (this is just an idea; you will need to experiment a lot) a neural network with one output per accent, a softmax output layer, and a cross-entropy cost function.
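As a hedged sketch of that idea in Keras: feature extraction is assumed to happen elsewhere (e.g. averaged MFCCs per recording), and the shapes, layer sizes, and random stand-in data below are placeholders, not a tested recipe.

    # Softmax classifier over fixed-length audio features with a
    # cross-entropy loss; one output per accent.
    import numpy as np
    import tensorflow as tf

    NUM_ACCENTS = 5    # e.g. Boston, New York, Canadian, ...
    FEATURE_DIM = 40   # assumed length of the per-recording feature vector

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(FEATURE_DIM,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_ACCENTS, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # cross-entropy cost
                  metrics=["accuracy"])

    # Random stand-in data; replace with features from real accented recordings.
    x_train = np.random.rand(200, FEATURE_DIM).astype("float32")
    y_train = np.random.randint(0, NUM_ACCENTS, size=200)
    model.fit(x_train, y_train, epochs=5, batch_size=32)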

Is this a use case for nlp?

I have a list of problems and resolutions. The current search for existing problems is just a keyword search. To improve searching for existing solutions to problems:
Classify documents based on their semantic meaning using NLP. The user enters a search term, and documents that closely match this search are displayed alongside possible solutions.
Since the search will be based on semantic meaning, is this a use case for NLP?
NLP stands for Natural Language Processing, so everything that involves processing natural language can be described as NLP. It is not a strict science; it is just an umbrella term for everything related to this field. Even if you just map your text to a vector space and forget about any language and semantics, you are still processing natural language, and thus doing NLP.
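A minimal sketch of that vector-space idea, assuming scikit-learn: index the existing problem texts as TF-IDF vectors and rank them against a user query by cosine similarity. The example problems and query below are made up.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Existing problem texts (stand-ins for the real problem/resolution list).
    problems = [
        "Printer does not respond after Windows update",
        "Application crashes when opening large files",
        "Cannot connect to the office VPN from home",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    problem_vectors = vectorizer.fit_transform(problems)

    # Vectorize the user's query in the same space and rank by similarity.
    query_vector = vectorizer.transform(["app crash when opening big file"])
    scores = cosine_similarity(query_vector, problem_vectors).ravel()
    best = scores.argmax()
    print(problems[best], scores[best])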

Arabic OCR in .Net

I used Tesseract and trained it with complete words as characters, the way Chinese OCR is done. But this approach is killing me: making my own fonts is time-consuming and slow. It works for some scenarios, but I want to train Tesseract on Arabic characters instead.
Alternatively, please suggest something that can help me develop my own Arabic OCR, with or without Tesseract.
I have looked into OpenCV, but it didn't go well.
I would highly appreciate a quick response.
Tesseract has pre-trained files for a lot of languages; here is the Arabic one.
This is a very old question, but for whoever is looking for the same: Tesseract 4 now comes with pre-trained Arabic data alongside many other languages, which can be found here.
And here is a demo of Arabic OCR based on Tesseract 4; you can see how accurate it has become.
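As a small illustration of using that pre-trained data, here is a Python sketch via pytesseract (rather than .NET, so treat it as an example of the traineddata usage, not a drop-in solution). The image path is hypothetical, and the Tesseract binary plus the "ara" language file are assumed to be installed.

    from PIL import Image
    import pytesseract

    # OCR an image with the Arabic traineddata ("ara").
    text = pytesseract.image_to_string(Image.open("arabic_page.png"), lang="ara")
    print(text)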

Research papers classification on the basis of title of the research paper

Dear all, I am working on a project in which I have to categorize research papers into their appropriate fields using the titles of the papers. For example, if the phrase "computer network" occurs somewhere in the title, then the paper should be tagged as related to the concept "computer network". I have 3 million research paper titles, so I want to know how I should start. I have tried TF-IDF but could not get useful results. Does someone know of a library that makes this task easy? Kindly suggest one; I shall be thankful.
If you don't know the categories in advance, then it's not classification but clustering. Basically, you need to do the following:
Select algorithm.
Select and extract features.
Apply algorithm to features.
Quite simple. You only need to choose the combination of algorithm and features that best fits your case.
When talking about clustering, there are several popular choices. K-means is considered one of the best and has an enormous number of implementations, even in libraries not specialized in ML. Another popular choice is the Expectation-Maximization (EM) algorithm. Both of them, however, require an initial guess about the number of clusters. If you can't predict the number of clusters even approximately, other algorithms, such as hierarchical clustering or DBSCAN, may work better for you (see the discussion here).
As for features, words themselves normally work fine for clustering by topic. Just tokenize your text, normalize and vectorize words (see this if you don't know what it all means).
Some useful links:
Clustering text documents using k-means
NLTK clustering package
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Note: all links in this answer are about Python, since it has really powerful and convenient tools for this kind of task, but if you prefer another language, you will most probably be able to find similar libraries for it too.
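As a minimal sketch of that pipeline with scikit-learn: the titles below are toy examples, and the number of clusters is a guess you would have to tune (or swap K-means for DBSCAN/hierarchical clustering, as discussed above).

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    titles = [
        "A survey of computer network protocols",
        "Routing algorithms in wireless computer networks",
        "Deep learning for image classification",
        "Convolutional neural networks for vision tasks",
    ]

    # Tokenize, normalize and vectorize the titles ...
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(titles)

    # ... then cluster the vectors.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    print(list(zip(titles, labels)))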
For Python, I would recommend NLTK (Natural Language Toolkit), as it has some great tools for converting your raw documents into features you can feed to a machine learning algorithm. To start out, you could try a simple word-frequency model (bag of words) and later move on to more complex feature extraction methods (string kernels). You can then classify the data with SVMs (Support Vector Machines) using LibSVM (a widely used SVM package).
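If you do have labeled examples, here is a hedged sketch of that bag-of-words plus SVM route, using scikit-learn (whose SVC is built on LibSVM) rather than calling LibSVM directly; the titles and labels below are made up.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    titles = [
        "A survey of computer network protocols",
        "Routing in wireless computer networks",
        "Convolutional neural networks for vision",
        "Deep learning for image classification",
    ]
    labels = ["computer network", "computer network",
              "machine learning", "machine learning"]

    # Bag-of-words features fed to a linear-kernel SVM.
    clf = make_pipeline(CountVectorizer(stop_words="english"), SVC(kernel="linear"))
    clf.fit(titles, labels)
    print(clf.predict(["a new protocol for network routing"]))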
Since you do not know the number of categories in advance, you could use a tool called OntoGen. The tool basically takes a set of texts, does some text mining, and tries to discover clusters of documents. It is a semi-supervised tool, so you must guide the process a little, but it does wonders. The final product of the process is an ontology of topics.
I encourage you to give it a try.
