I asked something similar in my previous question, but I still don't know how to begin this task.
I have to extract phrases from datasheets. As I have a ton of training data (PDFs and the corresponding extracted information), I want to handle this with deep learning: every datasheet could be different, but the structure is always similar.
But how do I start? What should I google? What is the term for what I'm planning?
I hope someone can point me to a concrete approach.
Is it necessary to repeat similar template data? The meaning and context are the same and only the smaller details vary. If I remove these redundancies, the dataset is very small (a few hundred samples), but if I keep them, it easily runs into the thousands. Which is the right approach?
SAMPLE DATA
This is actually not a question suited for Stack Overflow, but I'll answer anyway:
You have to think about what the emails (or whatever your data is) will look like in real-life usage: do you want to detect any kind of spam, or only spam similar to what your sample data shows? In the first case, your dataset is simply not suited to the problem, since the samples are not varied enough. When you think about it, all of the sentences are effectively the same, because the company name isn't really valuable information and will probably not be learned as a feature by your RNN. So every sample carries almost the same information. And since every input sample already runs through the network multiple times (once per epoch), it doesn't really help to have almost the same sample multiple times.
So you shouldn't have one kind of almost identical data sample dominating your dataset.
But as I said: if you primarily want to filter out "Dear customer, we wish you a ..." messages, you can try it with this dataset, but you wouldn't really need an RNN to detect that. If you want to detect all kinds of spam, you should look for a new dataset, since ~100 unique samples are not enough. I hope that was helpful!
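If you want to check how many of your samples are really distinct, here is a minimal sketch of one way to collapse near-identical template sentences. It is only an illustration: the sample sentences, the company names, the function names and the 0.8 threshold are all made up for this example, and word-level similarity from Python's `difflib` is just one of several possible similarity measures.

```python
from difflib import SequenceMatcher


def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True when two texts are nearly identical at the word level."""
    ratio = SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()
    return ratio >= threshold


def collapse_near_duplicates(samples, threshold=0.8):
    """Keep only the first sample of every group of near-identical texts."""
    kept = []
    for text in samples:
        if not any(is_near_duplicate(text, k, threshold) for k in kept):
            kept.append(text)
    return kept


# Hypothetical samples: the same template with only the company name changed.
samples = [
    "Dear customer, we wish you a happy new year from Acme",
    "Dear customer, we wish you a happy new year from Globex",
    "Dear customer, we wish you a happy new year from Initech",
    "Your invoice for March is attached to this message",
]

unique = collapse_near_duplicates(samples)
print(len(samples), "samples,", len(unique), "after collapsing near-duplicates")
```

The number of samples left after collapsing is a more honest estimate of how much variety the network actually gets to see. The pairwise comparison is quadratic, which is fine for a few thousand samples but would need something smarter for much larger datasets.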
I know this is a fairly broad topic, but I am looking for a list of NLP algorithms and when to use each of them, or a resource I can refer to. For example, an **RNN** might be a good fit for a question-and-answer use case, while a simple dense network might work quite well for binary classification of documents, or for separating sarcastic comments from useful news.
It would be great if we could add to the list below with whatever anyone knows. Of course, the list is not a hard and fast rule, and more often than not we will have to try different approaches for a given use case, but this is an effort to build an exhaustive list of NLP algorithms.
Dense layer - useful for document classification (e.g. separating sarcastic comments from useful news)
RNN (LSTM) - good for a question-and-answer API
You can find research publications here covering a range of applications in natural language processing.
You can also find some resources on Google Scholar.
I’m working on a web app where users can ask questions. These questions should be categorized by criteria based on question content, title, user data, region and so on. The questions should then be processed as follows: for some, a request for additional information should be sent; others should be deleted or marked as spam; and some should be sent directly to a specialist.
The problem is that users can’t choose the right category themselves: it’s a pretty complex matter, and users can cheat.
Are there any approaches for doing this automatically? For now, a few people do this job by filtering questions manually. Perhaps some ready-made solutions exist.
This is a really complex task. You should take a look at supervised machine learning classification algorithms. You can try something similar to a spam filtering algorithm (https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering):
Gather a set of questions that have already been categorized (labeled examples).
Gather a set of words (a vocabulary) used to classify questions (to identify the group).
Process the question text by removing “stop words” and replacing words with their stems.
Map the question text, title, user data and so on to numbers (a question vector).
Use an algorithm like SVM to train and apply a classifier (model).
But this is only a very general approach to look at. It’s hard to say anything more specific without additional details. I don’t think you will find a ready-made solution, as it’s a pretty specific task. But of course you can use any of the many machine-learning frameworks.
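To make the steps above a bit more concrete, here is a minimal sketch using only the Python standard library. Note the assumptions: it uses a Naive Bayes classifier (as in the spam filtering link) rather than an SVM, because that fits in a few lines without extra libraries; the stop word list, the crude suffix-stripping "stemmer", the category names and the example questions are all made up for illustration; and it only looks at the question text, not at title, user data or region.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "my", "i", "to", "for", "of", "and", "this"}


def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split into words, drop stop words, stem the rest."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over word counts with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled examples; real training data would be far larger.
texts = [
    "Win free money now, click this link",
    "Cheap pills, free offer, click now",
    "My invoice shows the wrong billing amount",
    "I was charged twice on my billing statement",
]
labels = ["spam", "spam", "billing", "billing"]

classifier = NaiveBayesClassifier().fit(texts, labels)
print(classifier.predict("Free money offer, click here"))
print(classifier.predict("Wrong amount charged on my invoice"))
```

With a machine-learning framework you would normally replace the hand-written parts with the library's vectorizer and classifier, and swapping in an SVM then becomes a one-line change.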
My question is related to a project I've just started working on: a chatbot.
The bot I want to build has a pretty simple task. It has to automate the process of purchasing movie tickets. This is a pretty closed domain, and the bot has all the required access to the cinema database. Of course it is okay for the bot to answer “I don’t know” if the user's message is not related to ordering movie tickets.
I already created a simple demo just to show it to a few people and see if they are interested in such a product. The demo uses a simple DFA approach and some easy text matching with stemming. I hacked it together in a day, and it turned out that users were impressed that they were able to successfully order the tickets they wanted. (The demo uses a connection to the cinema database to provide users with all the information they need to order tickets.)
My current goal is to create the next, more advanced version, especially in terms of Natural Language Understanding. For example, the demo version asks users to provide only one piece of information per message, and doesn’t recognize when they provide more relevant information (movie title and time, for example). I read that a useful technique here is called "Frame and slot semantics", and it seems promising, but I haven’t found any details about how to use this approach.
Moreover, I don’t know which approach is the best for improving Natural Language Understanding. For the most part, I consider:
Use “standard” NLP techniques in order to understand user messages better. For example: synonym databases, spelling correction, part-of-speech tags, training statistical classifiers to capture similarities and other relations between words (or between whole sentences, if that’s possible?), etc.
Use AIML to model the conversation flow. I’m not sure if it’s a good idea to use AIML in such a closed domain. I’ve never used it, which is why I’m asking.
Use a more “modern” approach and train a neural network to classify user messages. It might, however, require a lot of labeled data.
Any other method I don’t know about?
Which approach is the most suitable for my goal?
Do you know where I can find more resources about how “Frame and slot semantics” works in detail? I'm referring to this PDF from Stanford when talking about the frame and slot approach.
The question is pretty broad, but here are some thoughts and practical advice, based on experience with NLP and text-based machine learning in similar problem domains.
I'm assuming that although this is a "more advanced" version of your chatbot, the scope of work which can feasibly go into it is quite limited. In my opinion this is a very important factor as different methods widely differ in the amount and type of manual effort needed to make them work, and state-of-the-art techniques might be largely out of reach here.
Generally the two main approaches to consider would be rule-based and statistical. The first is traditionally more focused on pattern matching, and in the setting you describe (limited effort can be invested) would involve manually writing rules and/or patterns. An example of this approach would be using a closed (but large) set of templates to match against user input (e.g. using regular expressions). This approach often has a "glass ceiling" in terms of performance, but can lead to pretty good results relatively quickly.
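As a rough illustration of the template idea, here is a small sketch that pulls several pieces of information out of a single message using regular expressions with named groups. Everything in it is an assumption made for the example: the slot names (`movie`, `time`, `count`), the patterns, and the simplification that the movie title is written in double quotes (a real bot would more likely match against the titles in the cinema database).

```python
import re

# Hypothetical templates: each named group fills one slot of the order.
TEMPLATES = [
    re.compile(r'(?P<count>\d+)\s+tickets?', re.IGNORECASE),
    re.compile(r'\bfor\s+"(?P<movie>[^"]+)"', re.IGNORECASE),
    re.compile(r'\bat\s+(?P<time>\d{1,2}(?::\d{2})?\s*(?:am|pm))', re.IGNORECASE),
]

REQUIRED_SLOTS = ("movie", "time", "count")


def extract_slots(message):
    """Run every template against the message and collect the filled slots."""
    slots = {}
    for pattern in TEMPLATES:
        match = pattern.search(message)
        if match:
            slots.update(match.groupdict())
    return slots


def missing_slots(slots):
    """List the slots the bot still has to ask the user about."""
    return [name for name in REQUIRED_SLOTS if name not in slots]


full = extract_slots('I want 2 tickets for "Inception" at 7:30 pm')
partial = extract_slots("3 tickets please")

print(full, missing_slots(full))
print(partial, missing_slots(partial))
```

The point of keeping the slots in a dictionary is that the bot only has to ask follow-up questions for whatever `missing_slots` returns, so a user who gives the title and the time in one message is not asked for them again.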
The statistical approach is more about giving some ML algorithm a bunch of data and letting it extract regularities from it, focusing the manual effort on collecting and labeling a good training set. In my opinion, in order to get "good enough" results the amount of data you'll need might be prohibitively large, unless you can come up with a way to easily collect large amounts of at least partially labeled data.
Practically, I would suggest considering a hybrid approach here. Use some general-purpose, ML-based statistical tools to extract information from user input, then apply manually built rules/templates. For instance, you could use Google's Parsey McParseface to do syntactic parsing, then apply a rule engine to the results, e.g. match the verb against a list of possible actions like "buy", use the extracted grammatical relationships to find candidates for movie names, etc. This should get you to pretty good results quickly, as the strength of the syntactic parser would allow "understanding" even elaborate and potentially confusing sentences.
I would also suggest postponing some of the elements you are thinking about, like spelling correction, and even stemming and a synonym DB. Since the problem is relatively closed, you'll probably get better ROI from investing in a rule/template framework and manual rule creation. This advice also applies to explicit modeling of conversation flow.
Let's say I want to build a search engine that goes through a text and finds sentences or paragraphs that could be turned into an image, video or 3D animation, that is, sentences that contain information that could be expressed visually.
Ideally, this search engine would get better over time.
Is there already a search engine that can do that?
If not, what kinds of things would I need to look at or consider? I don't really know much about machine learning and search engines. I am trying to get a feeling for which areas of machine learning, information retrieval and so forth I would need to look at.
I don't expect long answers here, just things like "well, take a look at this type of machine learning" or "this part of information retrieval theory may be relevant".
Just to get a broad overview of what I would need to look at.
Natural Language Understanding
I don't know about any existing search engine doing that. But this can be done with the help of Natural Language Understanding and Semantic Parsing.
Have a look at Stanford's Natural Language Understanding course (discussion of the text-to-scene problem can be found here) for further details.
The way semantic search works is that it analyzes the data and places it in a vector space. Once that is done, with the help of big data and a knowledge graph, the algorithm tries to find data points that connect to the article, the authority of the author, website relevance, and a couple of other factors. Once these factors are taken into account, it tries to correlate the data to create a layer of interconnected information. Once this information is gathered, it is used to decide how relevant the data is.
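The vector-space part of that description can be illustrated with a very small sketch. This is only a toy under stated assumptions: it represents texts as plain word-count vectors and ranks them by cosine similarity, whereas real semantic search systems typically use learned embeddings plus the extra signals mentioned above. The function names and example documents are made up.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse vector of word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = to_vector(query)
    scored = [(cosine_similarity(q, to_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical documents, one of them clearly "visual" and close to the query.
documents = [
    "A red car drives along a mountain road",
    "The committee postponed the decision until next quarter",
    "A blue boat sails across the lake",
]

results = rank("red car on a road", documents)
for score, doc in results:
    print(round(score, 3), doc)
```

Word counts only capture shared vocabulary, so a sentence with the same meaning but different words would score low here; that gap is what learned embeddings and knowledge graphs are meant to close.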