I want to write a learning algorithm that can automatically create summaries of articles.
For example, suppose there are some fiction novels (one category, treating it as a filter) in PDF format. I want to build an automated process for creating their summaries.
We can provide some sample data so it can be implemented as a supervised learning approach.
Kindly suggest how I can implement this properly.
I am a beginner. I am taking Andrew Ng's course and am aware of some common algorithms (linear regression, logistic regression, neural networks), plus the Udacity statistics courses, and I am ready to dive deeper into NLP, deep learning, etc., but the motive is to solve this. :)
Thanks in advance
The keyword is Automatic Summarization.
Generally, there are two approaches to automatic summarization: extraction and abstraction.
Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary.
Abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might generate.
Abstractive summarization is a lot more difficult. An interesting approach is described in A Neural Attention Model for Abstractive Sentence Summarization by Alexander M. Rush, Sumit Chopra, and Jason Weston (source code based on the paper is available here).
A "simple" approach is used in Word (the AutoSummarize tool):
AutoSummarize determines key points by analyzing the document and assigning a score to each sentence. Sentences that contain words used frequently in the document are given a higher score. You then choose a percentage of the highest-scoring sentences to display in the summary.
You can select whether to highlight key points in a document, insert an executive summary or abstract at the top of a document, create a new document and put the summary there, or hide everything but the summary.
If you choose to highlight key points or hide everything but the summary, you can switch between displaying only the key points in a document (the rest of the document is hidden) and highlighting them in the document. As you read, you can also change the level of detail at any time.
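The frequency-scoring idea behind AutoSummarize can be sketched in a few lines of Python. This is a simplification, not Word's actual implementation; the tokenisation and the sentence splitter are naive placeholders:

```python
import re
from collections import Counter

def summarize(text, ratio=0.3):
    """Score sentences by the average frequency of their words; keep the top ones."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    scored = []
    for i, s in enumerate(sentences):
        s_words = re.findall(r'[a-z]+', s.lower())
        # Sentences containing frequently used words get a higher score.
        score = sum(freq[w] for w in s_words) / (len(s_words) or 1)
        scored.append((score, i, s))
    keep = max(1, int(len(sentences) * ratio))
    top = sorted(scored, reverse=True)[:keep]
    # Re-emit the selected sentences in document order.
    return ' '.join(s for _, _, s in sorted(top, key=lambda t: t[1]))
```

Calling `summarize(document, ratio=0.25)` then corresponds to "choose a percentage of the highest-scoring sentences to display in the summary".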
Anyway, automatic data (text) summarization is an active area of machine learning / data mining with much ongoing research. You should start by reading some good overviews:
Summarization evaluation: an overview by Inderjeet Mani.
A Survey on Automatic Text Summarization by Dipanjan Das and André F. T. Martins (emphasizes extractive approaches to summarization using statistical methods).
Related
I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features, one is a string containing a few words about the purpose of the product (often abbreviations, name of the application it will be a part of and so forth). I have previously not used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model, perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking:
1) Cleaning & tokenizing the text
2) TF-IDF
3) Clustering
But after thinking more about it, is it a bad idea? Because the clustering was based on the old data, if new words are introduced in the new data this will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean that I would have to retrain the entire model (k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
I understand the urge to first use an unsupervised clustering algorithm to see for yourself which clusters are found. And of course you can try whether such an approach helps your task.
But since you have labeled data, you can pass the product description in without an intermediate clustering step. Your supervised algorithm will then learn for itself if and how this feature helps with your task (of course preprocessing such as stop-word removal, cleaning, tokenizing, and feature extraction still needs to be done).
Depending on your text descriptions, I could also imagine that some simple sequence embeddings could work as feature extraction. An embedding is a vector of, for example, 300 dimensions, which describes the words in such a way that "hp office printer" and "canon ink jet" end up close to each other, while "nice leather bag" is farther away from the other two phrases. For example, fastText word embeddings are already pretrained for English. To get a single embedding for a sequence like "hp office printer", one can take the average of the three word vectors (there are more ways to get an embedding for a whole sequence, for example doc2vec).
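The averaging step can be sketched as below. The 3-dimensional toy vectors stand in for real pretrained embeddings (fastText vectors are typically 300-dimensional); the values are invented purely for illustration:

```python
# Toy word vectors standing in for pretrained fastText embeddings (normally 300-d).
word_vectors = {
    "hp":      [0.9, 0.1, 0.0],
    "office":  [0.8, 0.2, 0.1],
    "printer": [0.9, 0.0, 0.1],
    "nice":    [0.1, 0.8, 0.2],
    "leather": [0.0, 0.9, 0.1],
    "bag":     [0.1, 0.9, 0.0],
}

def sequence_embedding(text, vectors):
    """Average the word vectors of the tokens; skip out-of-vocabulary words."""
    vecs = [vectors[w] for w in text.lower().split() if w in vectors]
    if not vecs:
        return None  # no known words: no embedding
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

The resulting fixed-length vector can be fed directly to the supervised model alongside the other features.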
But in the end you need to run tests to choose your features and methods!
I've been working on a research project. I have a database of Wikipedia descriptions of a large number of entities, including sportspersons, politicians, actors, etc. The aim is to determine the type of entity using the descriptions. I have access to some data with the predicted type of entity which is quite accurate. This will be my training data. What I would like to do is train a model to predict the dominant type of entity for rest of the data.
What I've done till now:
Extracted the first paragraph, H1, H2 headers of Wiki description of the entity.
Extracted the category list of the entity on the wiki page (the bottom 'Categories' section present on any page, like here).
Finding the type of entity can be difficult for entities that are associated with two or more concepts, like an actor who later became a politician.
I want to ask as to how I create a model out of the raw data that I have? What are the variables that I should use to train the model?
Also are there any Natural Language Processing techniques that can be helpful for this purpose? I know POS taggers can be helpful in this case.
My search over the internet has not been very successful. I've stumbled across research papers and blog posts like this one, but none of them had relevant information for this purpose. Any ideas would be appreciated. Thanks in advance!
EDIT 1:
The input data is the first paragraph of the Wikipedia page of the entity. For example, for this page, my input would be:
Alan Stuart Franken (born May 21, 1951) is an American comedian, writer, producer, author, and politician who served as a United States Senator from Minnesota from 2009 to 2018. He became well known in the 1970s and 1980s as a performer on the television comedy show Saturday Night Live (SNL). After decades as a comedic actor and writer, he became a prominent liberal political activist, hosting The Al Franken Show on Air America Radio.
My extracted information is, the first paragraph of the page, the string of all the 'Categories' (bottom part of the page), and all the headers of the page.
From what I gather you would like to have a classifier which takes text input and predicts from a list of predefined categories.
I am not sure what your level of expertise is, so I will give a high level overview if additional people would like to know about the subject.
Like all NLP tasks which use ML, you are going to have to transform your textual domain to a numerical domain by a process of featurization.
Process the text and labels
Determine the relevant features
Create numerical representation of features
Train and Test on a Classifier
Process the text and labels
The text might have some strange markers or artifacts that need to be removed to make it more "clean". This is a standard text normalisation step.
Then you will have to keep the associated categories as labels for the texts.
It will end up being something like the following:
For each wiki article:
Normalise wiki article text
Save associated categories labels with text for training
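A minimal sketch of the normalisation step, assuming only that stray markup and punctuation should be stripped (what counts as "clean" depends on your corpus):

```python
import re

def normalise(text):
    """Lower-case, strip markup remnants and punctuation, collapse whitespace."""
    text = re.sub(r'<[^>]+>', ' ', text)            # drop stray HTML tags
    text = re.sub(r'[^a-z0-9\s]', ' ', text.lower())  # keep letters and digits
    return re.sub(r'\s+', ' ', text).strip()
```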
Determine the relevant features
Some features you seem to have mentioned are:
Dominant field (actor, politician)
Header information
Syntactic information (POS Tags) are local (token level), but can be used to extract specific features such as if words are proper nouns or not.
Create numerical representation of features
Luckily, there are methods such as doc2vec which can build a document vector from the text automatically. Then you can add additional bespoke features that seem relevant.
You will then have a vector representation of features relevant to this text as well as the labels (categories).
This will become your training data.
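As a stand-in for doc2vec (which needs a trained model), a plain bag-of-words count vector over a fixed vocabulary illustrates what "numerical representation of features" means; the vocabulary here is a toy example:

```python
def bow_vector(text, vocabulary):
    """Map a document to word counts over a fixed vocabulary (a stand-in for doc2vec)."""
    tokens = text.lower().split()
    return [tokens.count(w) for w in vocabulary]

# Toy vocabulary; in practice it would be built from the training corpus.
vocab = ["senator", "comedian", "actor", "film"]
```

Each document then becomes one fixed-length row of your training matrix, with the category labels alongside.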
Train and Test on a Classifier
Now train and test on a classifier of your choice.
Your data is one-to-many, as you will try to predict multiple labels.
Try something simple first, just to see if things work as you expect.
You should evaluate your results with a cross-validation routine such as k-fold validation, using standard metrics (precision, recall, F1).
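For reference, the per-label metrics mentioned above reduce to a few counts; a minimal sketch:

```python
def prf1(y_true, y_pred, positive):
    """Precision, recall and F1 for one label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

In a k-fold routine you would compute these on each held-out fold and average.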
Clarification
Just to help clarify: this task is not really a named entity recognition task. It is a kind of multi-label classification task, where the labels are the categories defined on the Wikipedia pages.
Named entity recognition is finding meaningful named entities in a document, such as people and places; usually something noun-like. It is usually done on a token level, whereas your task seems to be on a document level.
I have data that represents comments from the operator on various activities performed on an industrial device. A comment could reflect either a routine maintenance/replacement activity, or indicate that some damage occurred and had to be repaired.
I have a set of 200,000 sentences that need to be classified into two buckets: Repair or Scheduled Maintenance (or undetermined). They have no labels, hence I am looking for an unsupervised-learning-based solution.
Some sample data is as shown below:
"Motor coil damaged .Replaced motor"
"Belt cracks seen. Installed new belt"
"Occasional start up issues. Replaced switches"
"Replaced belts"
"Oiling and cleaning done".
"Did a preventive maintainence schedule"
The first three sentences should be labeled as Repair and the last three as Scheduled Maintenance.
What would be a good approach to this problem? Though I have some exposure to machine learning, I am new to NLP-based machine learning.
I see many papers related to this, e.g.
https://pdfs.semanticscholar.org/a408/d3b5b37caefb93629273fa3d0c192668d63c.pdf
https://arxiv.org/abs/1611.07897
but I wanted to understand whether there is any standard approach to such problems.
Seems like you could use some reliable keywords (verbs, it seems, in this case) to create training samples for an NLP classifier. Or you could use KMeans or KMedoids clustering with K = 2, which would do a pretty good job of separating the set. If you want to get really involved, you could use something like Latent Dirichlet Allocation, which is a form of unsupervised topic modeling. However, for a problem like this, on the small amount of data you have, the fancier you get, the more frustrated with the results you will become, IMO.
Both OpenNLP and StanfordNLP have text classifiers for this, so I recommend the following if you want to go the classification route:
- Use key word searches to produce a few thousand examples of your two categories
- Put those sentences in a file with a label based on the OpenNLP format (label |space| sentence | newline )
- Train a classifier with the OpenNLP DocumentClassifier, and I recommend stemming for one of your feature generators
- After you have the model, use it in Java and classify each sentence.
- Keep track of the scores, and quarantine low scores (you will have ambiguous classes I'm sure)
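The keyword-bootstrapping step can be sketched as below. The keyword lists are hypothetical; in practice you would curate them by skimming your own comments:

```python
# Hypothetical keyword lists; curate these from the actual operator comments.
REPAIR_WORDS = {"damaged", "cracks", "broken", "issues", "failure", "leak"}
MAINTENANCE_WORDS = {"oiling", "cleaning", "preventive", "scheduled", "routine"}

def bootstrap_label(sentence):
    """Assign a weak label from keyword hits; None when neither list matches."""
    tokens = set(sentence.lower().replace(".", " ").split())
    if tokens & REPAIR_WORDS:
        return "Repair"
    if tokens & MAINTENANCE_WORDS:
        return "Maintenance"
    return None  # ambiguous: leave for the trained classifier
```

Sentences that get a label this way become the few thousand training examples; the unlabeled remainder is what the trained classifier is for.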
If you don't want to go that route, I recommend using a text-indexing technology du jour like SOLR or ElasticSearch, or your favorite RDBMS's text indexing, to perform a "more like this" type of function, so you don't have to play the continuous model-updating game.
Document summarization can be done by text extraction from the source document or you can employ learning algorithms to decipher what's conveyed by the document, and then generate the summary using language generation techniques (much like a human does).
Are there algorithms or existing research work for the latter method? In general, what are some good resources to learn about document summarization techniques?
The topic you are looking for is called Automatic Summarization in computer science community.
Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
Methods of automatic summarization include extraction-based, abstraction-based, maximum entropy-based, and aided summarization.
Here is a good survey paper on this topic. You might want to take a look at two other papers: 1 and 2 as well.
Hope it helps.
Automatic text summarisation is generally of two types: abstractive and extractive. The abstractive approach is a bit more complicated than the extractive one. In the abstractive approach, important features and key information are extracted from the sentences; then, using natural language generation techniques, new sentences are generated from those features.
In the extractive approach, all the sentences are ranked using methods like lexical ranking, lexical chaining, etc. Similar sentences are clustered using approaches like cosine similarity, fuzzy matching, etc. The most important sentences of the clusters are used to generate a summary of the given document.
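The cosine similarity used to cluster sentences is simple to state in code; a minimal sketch over term-count vectors (how the vectors are built, e.g. TF-IDF, is a separate choice):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two sentences with similar word distributions score near 1; sentences sharing no terms score 0.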
Some existing Automatic document text summarisation work and techniques compiled from various sources:
Semantria method of lexical chaining
MEAD
http://dl.acm.org/citation.cfm?id=81789
https://www.cs.cmu.edu/~afm/Home_files/Das_Martins_survey_summarization.pdf
http://www.upf.edu/pdi/iula/iria.dacunha/docums/cicling2013LNCS.pdf
I have read different documents on how CRFs (conditional random fields) work, but all the papers give only the formulas. Can anyone point me to a paper that describes CRFs with examples? For instance, suppose we have the sentence
"Mr. Smith was born in New York. He has been working for the last 20 years in Microsoft company."
If the above sentence is given as input for training, how does the model work during training, taking the CRF formula into consideration?
Smith is tagged as "PER", New York as "LOC", and Microsoft Company as "ORG".
Moges.A
Here is a link to a set of slides made by Sasha Rush, a PhD student who is currently working on NLP at Google. One of the reasons I really like the slides is that they contain concrete examples and walk you through executions of important algorithms.
It is not a paper, but there is a whole free online course available on probabilistic graphical models; CRFs are among the topics covered.
It is very thorough, and you'll get an intuitive level of understanding after completing it.
I don't think anybody will write such a tutorial. You can check an HMM tutorial instead, which is easier to understand and can be explained by example. The problem with CRFs is that they perform global optimization with many dependencies, so it is very hard to show step by step how the parameters are optimized and how the labels are predicted. But the idea is very simple: maximization over the dependency (clique) graph, exploiting sparsity...
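To make the question's example concrete: a linear-chain CRF trainer consumes the sentence as a token sequence paired with a label sequence, and it scores both per-token features and label-to-label transitions. A sketch of that representation, assuming a BIO tag scheme (the exact scheme is a convention, not required by CRFs):

```python
# One training instance for a linear-chain CRF: tokens paired with BIO labels.
tokens = ["Mr.", "Smith", "was", "born", "in", "New", "York", "."]
labels = ["O", "B-PER", "O", "O", "O", "B-LOC", "I-LOC", "O"]

def transitions(labels):
    """The label bigrams a CRF scores alongside per-token emission features."""
    return list(zip(labels, labels[1:]))
```

During training, the CRF adjusts weights so that the correct label sequence (including transitions like B-LOC followed by I-LOC for "New York") scores higher than every alternative sequence.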