Extracting information from web pages using NER - machine-learning

My task is to extract information from various web pages of a particular site. The information to be extracted can take the form of a product name, product id, price, etc., and is given as natural-language text. I have also been asked to extract that information using a machine learning algorithm. I thought of using NER (Named Entity Recognition) and training it on custom training data, which I can prepare from the scraped data by manually labeling the numbers/fields as required. I wanted to know whether the model can even work this way.
Also, let me know if I can improve this question further.

You say a particular site, so I am assuming you have a fair idea of the structure of the web pages: whether the data is in table form or free text, and how the website generally looks. In that case, a simple regex (for prices, ids, etc.) supported by a POS tagger to extract product names is enough. A supervised approach is definitely overkill and might underperform the simpler regex.
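For illustration, here is a minimal sketch of the regex-plus-POS-tagger idea in Python; the price pattern, the sample text, and the use of spaCy's small English model are my own assumptions, not something from the original answer:

```python
# Minimal sketch: regex for rigidly formatted fields (price, id),
# POS tags for the free-text product name. Sample text, patterns,
# and model choice are hypothetical.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

text = "The Acme Widget Pro (id: AW-1234) is on sale for $19.99."

price = re.search(r"\$\d+(?:\.\d{2})?", text)            # e.g. $19.99
product_id = re.search(r"\bid:\s*([A-Z]{2}-\d+)", text)  # e.g. AW-1234

# consecutive proper nouns are a rough proxy for the product name
doc = nlp(text)
name = " ".join(t.text for t in doc if t.pos_ == "PROPN")

print(name, product_id.group(1), price.group())
```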

Related

NER for predefined entities

I'm developing an application to categorize the requirements in a requirement specification into categories like database, front end, back end, etc. The requirement specification is a single document, and I want to see the underlying categories in it. Can I use NER to get the categories? Sentences are divided into categories if they contain certain words that match a particular category.
Example
data should be stored in a secured database.
If we consider the sentence above to be a requirement, it should be categorized into the database category based on the words it contains (database, data).
To the best of my knowledge, pre-built NER tools would not help here, but I recommend spaCy. It is an NER tool with state-of-the-art accuracy, and it supports retraining and customizing its deep learning model. Hope this helps!
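As an illustration of that retraining, here is a minimal sketch of fine-tuning a blank spaCy (v3.x) NER pipeline on the example sentence; the label name and the one-sentence training set are hypothetical placeholders, and a real model would need far more data:

```python
# Minimal sketch: custom NER training in spaCy 3.x.
# Label name and single training sentence are hypothetical.
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("DATABASE")

TRAIN_DATA = [
    # "database" spans characters 35-43 of the sentence
    ("data should be stored in a secured database.",
     {"entities": [(35, 43, "DATABASE")]}),
]

optimizer = nlp.initialize()
for _ in range(30):
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("user records must live in an encrypted database.")
print([(ent.text, ent.label_) for ent in doc.ents])
```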

Using NLP or machine learning to extract keywords off a sentence

I'm new to the ML/NLP field, so my question is: what technology would be most appropriate to achieve the following goal:
We have a short sentence - "Where to go for dinner?" or "What's your favorite bar?" or "What's your favorite cheap bar?"
Is there a technology that would enable me to train it providing the following data sets:
"Where to go for dinner?" -> Dinner
"What's your favorite bar?" -> Bar
"What's your favorite cheap restaurant?" -> Cheap, Restaurant
so that next time we have a similar question about an unknown activity, say, "What is your favorite expensive [whatever]?", it would be able to extract "expensive" and [whatever]?
The goal is to train it with hundreds (or thousands) of variations of the question and the relevant expected output, so that it can work with everyday language.
I know how to make it even without NLP/ML if we have a dictionary of expected terms like Bar, Restaurant, Pool, etc., but we also want it to work with unknown terms.
I've seen examples with RAKE and scikit-learn for classification of "things", but I'm not sure how I would feed text into them, and all those examples had predefined outputs for training.
I've also tried Google's NLP API, Amazon Lex and Wit to see how good they are at extracting entities, but the results are disappointing to say the least.
Reading about summarization techniques, I'm left with the impression it won't work with small, single-sentence texts, so I haven't delved into it.
As #polm23 mentioned, for simple cases you can use POS tagging to do the extraction. The services you mentioned, like LUIS, Dialogflow, etc., use what is called Natural Language Understanding. They make use of intents & entities (a detailed explanation with examples can be found here). If you are concerned about your data going online, or you sometimes have to work offline, you can always go for RASA.
Things you can do with RASA:
Entity extraction and sentence classification. You specify which terms should be extracted by tagging the word positions across a variety of sentences; if a new word then appears in a position where a tagged word appeared in the training set, it will still be detected.
It uses rule-based learning and also a Keras LSTM for detection.
One downside compared with the online services is that you have to manually tag the position numbers in the JSON file used for training, as opposed to the click-and-tag features of the online services (a sketch of such a file appears after the example below).
You can find the tutorial here.
E.g., I have trained RASA with a variety of sentences to identify body part and symptom (I limited it to 2 entities, but you can add more). When an unknown sentence such as "I am having pain in my leg." appears, it will correctly identify "pain" as "symptom" and "leg" as "body part".
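For illustration, here is a minimal sketch of what one hand-tagged training example could look like in the legacy Rasa NLU JSON format; the intent and entity names are hypothetical:

```python
# Minimal sketch: building one legacy-style Rasa NLU training example
# with hand-tagged entity positions. Intent/entity names are hypothetical.
import json

example = {
    "text": "I am having pain in my leg",
    "intent": "report_symptom",
    "entities": [
        {"start": 12, "end": 16, "value": "pain", "entity": "symptom"},
        {"start": 23, "end": 26, "value": "leg", "entity": "body_part"},
    ],
}

print(json.dumps({"rasa_nlu_data": {"common_examples": [example]}}, indent=2))
```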
Hope this answers your question!
Since "hundreds to thousands" sound like you have very little data for training a model from scratch. You might want to consider training (technically fine-tuning) a DialogFlow Agent to match sentences ("Where to go for dinner?") to intents ("Dinner"), then integrating via API calls.
Alternatively, you can invest time in fine-tuning a small pre-trained model like the DistilBERT classifier from HuggingFace, as you won't need the hundreds of thousands to billions of samples required to train a production-worthy model from scratch. This can also be run offline and will equip you to solve other NLP problems in the future without much low-level understanding of the underlying statistics.
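For illustration, a minimal sketch of fine-tuning DistilBERT for intent classification with the HuggingFace transformers library; the two-example inline dataset and label mapping are hypothetical placeholders:

```python
# Minimal sketch: fine-tuning DistilBERT as a sentence/intent classifier.
# The tiny inline dataset and label mapping are hypothetical.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["Where to go for dinner?", "What's your favorite bar?"]
labels = [0, 1]  # 0 = Dinner, 1 = Bar

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

class IntentDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="intent-model", num_train_epochs=3),
    train_dataset=IntentDataset(texts, labels),
)
trainer.train()
```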

Find startup's industry from its description

I am using the AngelList DB to categorize startups based on their industries, since these startups are currently categorized based on community input, which is misleading most of the time.
My business objective is to extract keywords that indicate which industry a specific startup belongs to, and then map it to one of the industries in LinkedIn's list: https://developer.linkedin.com/docs/reference/industry-codes
I experimented with Azure Machine Learning: I pushed 300 startup descriptions through it, and the keyword extraction was pretty bad, not even close to what I am trying to achieve.
I would like to know how data scientists would approach this problem. Where should I look, and where should I not? Are keyword analysis tools (like the Google AdWords Keyword Planner) a viable option?
Using Text Classification...
To be able to treat this as a classification problem, you need a training set: a set of AngelList entries labeled with the correct LinkedIn categories. This can be done manually, or you can hire Mechanical Turk workers to do the job for you.
Since you have ~150 categories, I'd imagine you need at least 20-30* AngelList entries for each of them. Your training set will then be {input: angellist_description, result: linkedin_id}.
After that, you need to dig through text classification techniques to try and optimize the accuracy/precision of your results. The book "Taming Text" has a full chapter on text classification. And a good tool to implement a text-based classifier would be Apache Solr or Apache Lucene.
* 20-30 is a quick personal estimate, not based on any scientific method. You can look up methods online for a proper estimate.
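For illustration, a minimal sketch of the classification route in Python with scikit-learn (a swap-in for the Solr/Lucene tooling mentioned above); the toy descriptions and industry ids are hypothetical:

```python
# Minimal sketch: TF-IDF + logistic regression over
# {input: angellist_description, result: linkedin_id} pairs.
# Descriptions and industry ids below are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_descriptions = [
    "We build mobile games for casual players",
    "A marketplace connecting farmers with restaurants",
]
train_linkedin_ids = [109, 34]  # hypothetical LinkedIn industry codes

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_descriptions, train_linkedin_ids)

print(clf.predict(["An indie studio making puzzle games"]))
```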
Using Text Clustering.
Step #1
Use text clustering to extract the main 'topics' from all the descriptions (Carrot2 can be helpful here).
Input: the corpus of all descriptions.
Process: text clustering using Carrot2.
Output: each document labeled with a topic.
Step #2
Manually map the extracted topics into LinkedIn's categories.
Step #3
Use the output of the first two steps to traverse from company -> extracted topic -> LinkedIn category.

Twitter data topical classification

So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done, I need a training data set to show RapidMiner how I want things classified, and given the categories I want to classify things into, I might have to create my own.
So my questions are these:
1) Is there a publicly available training data set for Twitter data that focuses on the topic of the tweets, as opposed to sentiment analysis?
2) If there isn't one publicly available, how can I create my own? My idea was to go through the tweets and associate the tokens and n-grams with the categories I want. One concern is that I won't be able to manually classify enough tweets to build a training data set comprehensive enough to give the automatic classifier a good accuracy rate.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topics or categories are in order to identify one. The fact that this is Twitter may be relevant, but the data source is likely to matter far less for classification accuracy than the topics do. So the infamous 20 Newsgroups data set is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs. dogs or Android vs. iPhone, you need to find a data set for that.
In most cases you will have to create the initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler, such as a keyword search, to create subsets of tweets that you know deal with a particular category. Then you build the model on top of that and hope it generalizes, identifying the same categories even when the original keywords do not occur.
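For illustration, a minimal sketch of that keyword bootstrap in Python; the categories and seed keywords are hypothetical placeholders:

```python
# Minimal sketch: label tweets by simple keyword match to build an
# initial training subset. Categories and keywords are hypothetical.
SEED_KEYWORDS = {
    "sports": ["game", "score", "season"],
    "politics": ["election", "senate", "vote"],
}

def bootstrap_label(tweet: str):
    text = tweet.lower()
    for category, words in SEED_KEYWORDS.items():
        if any(w in text for w in words):
            return category
    return None  # leave unlabeled; not used for training

print(bootstrap_label("Final score: 3-1 after a tense second half"))  # sports
```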
Alternatively, depending on your application (and if you actually want to build an application), you may as well start with only a small data set and accept that the classification will be poor at first. Then you generate classifications, show them to the users of your app, and collect some form of explicit or implicit feedback (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify them, you may want to try clustering instead of classification. Just create a few clusters and look at the top words of each one; they may well be quite dissimilar and describe what the respective clusters are about.
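For illustration, a minimal sketch of that clustering idea in Python with scikit-learn rather than RapidMiner; the tweet texts are hypothetical placeholders:

```python
# Minimal sketch: TF-IDF + k-means, then the top words per cluster
# as a rough description of each discovered topic.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "new android phone released today",
    "iphone camera review and comparison",
    "puppy adoption event this weekend",
    "the city opened two new dog parks",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(tweets)

km = KMeans(n_clusters=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for i, center in enumerate(km.cluster_centers_):
    top = center.argsort()[::-1][:3]
    print(f"cluster {i}:", [terms[j] for j in top])
```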
I believe your third question may be a bit broad for Stack Overflow and is probably better answered by a textbook.

Information Extraction - business documents

I'm currently trying to extract information, e.g. the sender or recipient, from business documents like bills. The documents were processed with OCR software into XML files, so they are annotated with formatting characteristics. I want to extract specific information from a new document after manually annotating one similar document with features like sender and recipient.
So my question is whether there is a learning or matching algorithm that can extract specific data by comparison with only one or two examples of similar documents. If so, is there a Java framework capable of that?
Yours thankfully
maggu
If the XML structure is always the same (using the same template):
Just save the XML parent nodes of the selected nodes where the information is located, so you know the path to the information. That shouldn't be a problem; it's a trivial task.
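For illustration, a minimal sketch of that path-reuse idea in Python with lxml (rather than a Java framework); the XML layout and tag names are hypothetical:

```python
# Minimal sketch: save the XPath to a field once, reuse it on every new
# document produced from the same template. Tag names are hypothetical.
from lxml import etree

SENDER_XPATH = "//header/sender/name/text()"  # recorded from the annotated doc

xml = b"""
<bill>
  <header><sender><name>ACME GmbH</name></sender></header>
</bill>
"""

doc = etree.fromstring(xml)
print(doc.xpath(SENDER_XPATH))  # ['ACME GmbH']
```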
If you have to search for the information:
It could work by creating certain feature extraction rules and then using those features to train a Support Vector Machine to detect the areas where the information is located.
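For illustration, a minimal sketch of that SVM idea in Python with scikit-learn; the hand-crafted features and labels are hypothetical placeholders:

```python
# Minimal sketch: classify OCR text regions as sender vs. other using
# hand-crafted layout features. Features and labels are hypothetical:
# [vertical position on page (0 = top), share of capitalized tokens,
#  contains-postal-code flag]
from sklearn.svm import SVC

X = [
    [0.05, 0.8, 1],  # near top, mostly capitalized, has postal code
    [0.50, 0.1, 0],  # body text
    [0.08, 0.7, 1],
    [0.90, 0.2, 0],
]
y = ["sender", "other", "sender", "other"]

clf = SVC(kernel="linear").fit(X, y)
print(clf.predict([[0.06, 0.75, 1]]))  # likely ['sender']
```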
I once asked a similar question: Algorithm to match natural text in mail.
But that is far from trivial, and definitely needs more than one or two training documents.
