Creating domain-expertise features as a feature-engineering method - machine-learning

Can you explain to me what the method of "creating domain expertise features" is? I have read a paper that mentions this as a method of feature engineering. This is the link to the paper:
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7937700&casa_token=4-z8UtRkp1UAAAAA:nsm1HOViwJA6W0lsQz-vTN05OV308R1VvO0c0sjhoAnsspR0ryqrjwApjG0ayUT7IdY4WU1E&tag=1

The paper you linked is not accessible (only the intro), but it sounds like prior knowledge about the problem domain was used to build the features. These are sometimes called expert knowledge features.
For example, say I want to predict an apple pie taste score, and I know that in the apple pie/desserts domain, people who score the taste like sugar. I would use this knowledge to build features around sugar levels, as in the sketch below.
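To make that concrete, here is a minimal sketch of turning such knowledge into features with pandas (the column names and the sweetness heuristics are invented for illustration):

```python
import pandas as pd

# Hypothetical recipe data; the columns are invented for this example.
recipes = pd.DataFrame({
    "sugar_g": [30, 55, 10],
    "apples_g": [400, 300, 500],
    "total_g": [900, 850, 950],
})

# Domain knowledge: tasters respond to sweetness relative to the whole pie,
# so encode sugar as a proportion rather than an absolute amount.
recipes["sugar_ratio"] = recipes["sugar_g"] / recipes["total_g"]
recipes["sugar_per_apple_g"] = recipes["sugar_g"] / recipes["apples_g"]
```

The raw columns alone would also work, but the ratio features encode the expert's intuition directly, which usually helps smaller models.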

Related

Using NLP or other algorithms to match two strings

The goal of my project is to correctly assign medications. I have a large catalog at my disposal for this purpose. However, the medications do not appear there with exactly the same spelling: additional information may have been added, or parts of the prescription may have been abbreviated.
I was already able to implement a possible algorithm using the Levenshtein distance (token_set_ratio).
Because of the sometimes long additional information, this algorithm assigns the wrong medication, so I wanted to ask whether there are better algorithms for comparing strings. For example, does it make sense to use machine learning algorithms or NLP techniques? This is a relatively new area for me, and I would appreciate any ideas or inspiration.
This sounds like a classic deduplication task. For example, have a look at dedupe. This tool lets you annotate training examples and learns when two items refer to the same thing. It can be used with as few as 10 training samples and implements an active learning approach.
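For reference, the token_set_ratio baseline the asker describes might look like the sketch below, using the rapidfuzz library (the catalog entries are invented); dedupe would then replace this single hand-tuned scorer with a learned model:

```python
from rapidfuzz import fuzz, process

# Invented catalog entries; in practice this is the full medication catalog.
catalog = [
    "Ibuprofen 400 mg film-coated tablets",
    "Paracetamol 500 mg tablets",
    "Amoxicillin 250 mg capsules",
]

query = "Ibuprofen 400mg (take with food, max 3 per day)"

# token_set_ratio ignores word order and duplicated tokens, which helps
# when one side carries long extra annotations.
best_match, score, index = process.extractOne(
    query, catalog, scorer=fuzz.token_set_ratio
)
print(best_match, score)
```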

Advice on classifying users in machine learning scenario

I'm looking for some advice on the problem of classifying users into various groups based on their answers during a sign-up process.
The idea is that these classifications will group people with similar travel habits, e.g. adventurous, relaxing, foodie, etc. The classification shouldn't be known to the user, so it isn't as simple as just asking what sort of holidays they like (the point is to remove user bias and the problem of not really knowing where to place yourself).
The way I see it working is asking questions about things such as the apps they use and the accounts they interact with on social media (gopro, restaurants, etc.), or giving some scenarios and asking which sounds best; these would be chosen from a set provided to them, so we have control over the variables. The main problem I have is how to associate numerical values with each of these.
I've looked into various machine learning algorithms and have realised this is most likely a clustering problem, but I can't seem to figure out how to use this style of question to assign a value to each dimension that will actually give a useful categorisation.
Another question I have is whether there are resources where I could find information on the sort of questions to ask users to gain information that would allow a classification like this.
The sort of process I envision is one similar to https://www.thread.com/signup/introduction if anyone is familiar with it.
Any advice welcomed.
The problem you have at hand is that you want to calculate a similarity measure based on categorical variables, namely the choice of apps, accounts, etc. Unless you measure the similarity of these apps with respect to an attribute, such as how food-oriented an app is, the problem is hard to specify. You would also need to know all the possible states a categorical variable can assume to create a similarity measure like this.
If the final objective is to recommend something that similar people (based on app selection or social media account selection) have liked or enjoyed, you should look into collaborative filtering.
If your feature space is well defined and static (known apps, known accounts, a limited set with few missing values), then look into content-based recommendation systems; something as simple as Market Basket Analysis can give you a reasonable working model.
Otherwise, if you really want to model a system whose features can assume arbitrary states, this can be done with multivariate probabilistic models; if the structure (the relationships and influences between features) is well defined, you could benefit from probabilistic graphical models such as Bayesian networks.
You really do need to define your problem better before you start solving it though.
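As one deliberately simplified way to get numerical values out of such choices, you can one-hot encode the selections and cluster the resulting binary vectors; the sketch below uses invented app names and scikit-learn's KMeans:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented example: each user's answers as a set of selected options.
users = {
    "alice": {"gopro", "hiking_app", "hostel_booking"},
    "bob":   {"restaurant_guide", "wine_app"},
    "carol": {"gopro", "surf_forecast"},
}

# One-hot encode the categorical choices: each option becomes a 0/1 dimension.
all_choices = sorted(set.union(*users.values()))
matrix = pd.DataFrame(
    [[int(c in chosen) for c in all_choices] for chosen in users.values()],
    index=list(users), columns=all_choices,
)

# Cluster users in the binary feature space; the number of clusters is a guess.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
```

Whether those clusters correspond to "adventurous" or "foodie" still depends entirely on how informative the chosen options are, which is the modeling problem described above.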
You can use prime numbers. If each choice on the list of all possible choices is assigned a different prime, and the user's selection is saved as the product of the chosen primes, then you can always tell whether the user made a particular choice by checking whether selection mod choice is 0. The beauty of prime numbers, voila!
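A minimal sketch of that encoding (choice names invented):

```python
# Each possible choice gets a distinct prime.
primes = {"gopro": 2, "restaurant_guide": 3, "hiking_app": 5, "wine_app": 7}

# A user's selection is stored as the product of the primes of their choices.
selection = primes["gopro"] * primes["hiking_app"]  # 2 * 5 = 10

# Membership test: selection is divisible by a prime iff that choice was made.
print(selection % primes["gopro"] == 0)     # True
print(selection % primes["wine_app"] == 0)  # False
```

Note that this is a compact set encoding, not a similarity measure; for clustering you would still need to expand it back into features like the one-hot matrix above.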

Ordering movie tickets with ChatBot

My question is related to a project I've just started working on: a ChatBot.
The bot I want to build has a pretty simple task: it has to automate the process of purchasing movie tickets. This is a pretty closed domain, and the bot has all the required access to the cinema database. Of course, it is okay for the bot to answer “I don’t know” if the user's message is not related to the process of ordering movie tickets.
I already created a simple demo just to show it to a few people and see if they would be interested in such a product. The demo uses a simple DFA approach and some easy text matching with stemming. I hacked it together in a day, and it turned out that users were impressed that they were able to successfully order the tickets they wanted. (The demo uses a connection to the cinema database to provide users with all the information needed to order the tickets they desire.)
My current goal is to create the next version, a more advanced one, especially in terms of Natural Language Understanding. For example, the demo version asks users to provide only one piece of information per message, and doesn't recognize when they provide more relevant information (a movie title and a time, for example). I read that a useful technique here is called "frame and slot semantics", and it seems promising, but I haven't found any details about how to use this approach.
Moreover, I don’t know which approach is best for improving Natural Language Understanding. For the most part, I am considering:
Using “standard” NLP techniques to understand user messages better: for example, synonym databases, spelling correction, part-of-speech tagging, or training statistical classifiers to capture similarities and other relations between words (or between whole sentences, if that's possible), etc.
Using AIML to model the conversation flow. I'm not sure if it's a good idea to use AIML in such a closed domain; I've never used it, so that's why I'm asking.
Using a more “modern” approach and training a neural network classifier for user messages. It might, however, require a lot of labeled data.
Any other method I don’t know about?
Which approach is the most suitable for my goal?
Do you know where I can find more resources about how "frame and slot semantics" works in detail? I'm referring to this PDF from Stanford when talking about the frame-and-slot approach.
The question is pretty broad, but here are some thoughts and practical advice, based on experience with NLP and text-based machine learning in similar problem domains.
I'm assuming that although this is a "more advanced" version of your chatbot, the scope of work which can feasibly go into it is quite limited. In my opinion this is a very important factor, as different methods differ widely in the amount and type of manual effort needed to make them work, and state-of-the-art techniques might be largely out of reach here.
Generally, the two main approaches to consider would be rule-based and statistical. The first is traditionally more focused on pattern matching, and in the setting you describe (limited effort can be invested), it would involve manually crafting rules and/or patterns. An example of this approach would be using a closed (but large) set of templates to match against user input (e.g. using regular expressions). This approach often has a "glass ceiling" in terms of performance, but can lead to pretty good results relatively quickly.
The statistical approach is more about giving some ML algorithm a bunch of data and letting it extract regularities from it, focusing the manual effort on collecting and labeling a good training set. In my opinion, in order to get "good enough" results, the amount of data you'll need might be prohibitively large, unless you can come up with a way to easily collect large amounts of at least partially labeled data.
Practically, I would suggest considering a hybrid approach here: use some ML-based statistical general-purpose tools to extract information from user input, then apply manually built rules/templates. For instance, you could use Google's Parsey McParseface to do syntactic parsing, then apply a rule engine to the results, e.g. match the verb against a list of possible actions like "buy", use the extracted grammatical relationships to find candidates for movie names, etc. This should get you to pretty good results quickly, as the strength of the syntactic parser would allow "understanding" even elaborate and potentially confusing sentences.
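To illustrate the frame-and-slot idea in its simplest rule-based form, here is a toy sketch (the patterns and the example sentence are invented; a real system would apply such rules on top of the parser output described above):

```python
import re

# A frame is just a set of named slots the bot needs to fill before ordering.
frame = {"action": None, "movie": None, "time": None, "count": None}

message = "I'd like to buy 2 tickets for Inception at 8pm"

if re.search(r"\b(buy|order|book)\b", message, re.IGNORECASE):
    frame["action"] = "buy"
m = re.search(r"(\d+)\s+tickets?", message)
if m:
    frame["count"] = int(m.group(1))
m = re.search(r"for\s+(.+?)\s+at\s+(\d{1,2}(?::\d{2})?\s*(?:am|pm)?)",
              message, re.IGNORECASE)
if m:
    frame["movie"], frame["time"] = m.group(1), m.group(2)

print(frame)  # {'action': 'buy', 'movie': 'Inception', 'time': '8pm', 'count': 2}
```

The point of the frame is that one message can fill several slots at once, and the dialogue manager only asks about slots that are still empty.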
I would also suggest postponing some of the elements you are thinking of doing, like spell-correction, and even stemming and a synonym DB; since the problem is relatively closed, you'll probably get better ROI from investing in a rule/template framework and manual rule creation. This advice also applies to explicit modeling of conversation flow.

Feature Selection: Hybrid vs Embedded approaches

I have been doing research on feature selection and I'm failing to understand the difference between these two approaches.
According to most authors in the literature, feature selection algorithms fall into three categories. The first two, filter and wrapper, are easy to understand and there is general agreement on them. On the last category, however, there seems to be some disagreement. Some authors, such as H. Liu, name the last category hybrid. In contrast, V. Kumar names it embedded. In addition, there are cases where authors define four categories, including both embedded and hybrid algorithms, as is the case with P. Abinaya.
Authors explain hybrid algorithms as a combination of a filter algorithm and a wrapper approach. The main idea behind these algorithms is to use a filter approach to reduce the search space for a wrapper approach.
On the other hand, the definition of embedded algorithms in the literature varies considerably depending on the source. Some use almost the same definition as for hybrid algorithms, as is the case on the Wikipedia page. Others give more abstract definitions, such as: methods that perform feature selection during the learning of optimal parameters, and methods that incorporate knowledge about the specific structure of the class of functions used by a certain learning machine.
So I would appreciate it if anyone could explain to me the difference between these two approaches, or give a less abstract definition of embedded methods.
Thanks.
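To make the abstract definition more concrete: the canonical embedded methods are regularized models such as the Lasso, where the L1 penalty drives uninformative coefficients to zero while the model is being fit, so selection happens inside the learning step itself rather than before it (filter) or around it (wrapper). A minimal scikit-learn sketch on synthetic data, not tied to any of the papers cited above:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       random_state=0)

# Embedded selection: fitting the Lasso zeroes out most coefficients,
# and SelectFromModel keeps only the features with non-zero weight.
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
print(selector.get_support())  # boolean mask of the selected features
```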

What subjects, topics does a computer science graduate need to learn to apply available machine learning frameworks, esp. SVMs

I want to teach myself enough machine learning so that, to begin with, I can put available open source ML frameworks to use for things like:
Go through the HTML source of pages from a certain site and "understand" which sections form the content, which the advertisements, and which the metadata (neither the content nor the ads, e.g. TOC, author bio, etc.)
Go through the HTML source of pages from disparate sites and "classify" whether the site belongs to a predefined category or not (the list of categories will be supplied beforehand)
... similar classification tasks on text and pages.
As you can see, my immediate requirements are to do with classification on disparate data sources and large amounts of data.
As far as my limited understanding goes, would taking the neural net approach require a lot more training and maintenance than putting SVMs to use?
I understand that SVMs are well suited to (binary) classification tasks like mine, and that open source frameworks like libSVM are fairly mature?
In that case, what subjects and topics does a computer science graduate need to learn right now, so that the above requirements can be solved, putting these frameworks to use?
I would like to stay away from Java, if possible, and I have no other language preferences. I am willing to learn and put in as much effort as I possibly can.
My intent is not to write code from scratch, but, to begin with putting the various frameworks available to use ( I do not know enough to decide which though ), and I should be able to fix things should they go wrong.
Recommendations on learning specific portions of statistics and probability theory are nothing unexpected from my side, so make them if required!
I will modify this question if needed, depending on all your suggestions and feedback.
"Understanding" in machine learn is the equivalent of having a model. The model can be for example a collection of support vectors, the layout and weights of a neural network, a decision tree, or more. Which of these methods work best really depends on the subject you're learning from and on the quality of your training data.
In your case, learning from a collection of HTML sites, you will want to preprocess the data first; this step is also called "feature extraction". That is, you extract information from the page you're looking at. This is a difficult step, because it requires domain knowledge, and you'll have to extract useful information, or your classifiers will not be able to make good distinctions. Feature extraction gives you a dataset (a matrix with features for each row) from which you'll be able to create your model.
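For instance, a toy feature extractor for HTML pages might look like the sketch below, using BeautifulSoup (the features are invented for illustration; choosing good ones is exactly the domain-knowledge step described above):

```python
from bs4 import BeautifulSoup

def extract_features(html: str) -> dict:
    """Turn one HTML page into a numeric feature vector (toy features)."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)
    links = soup.find_all("a")
    return {
        "text_length": len(text),
        "num_links": len(links),
        # Link density is a classic cue separating ads/navigation from content.
        "link_density": len(links) / max(len(text.split()), 1),
        "num_images": len(soup.find_all("img")),
    }
```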
Generally in machine learning it is advised to also keep a "test set" that you do not train your models on, but that you use at the end to decide which method is best. It is of extreme importance that you keep the test set hidden until the very end of your modeling step! The test data basically gives you a hint about the "generalization error" your model is making. Any model with enough complexity and learning time tends to learn exactly the information you train it with; machine learners say the model "overfits" the training data. Such overfitted models seem good, but that is just memorization.
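To make the train/test discipline concrete, here is a minimal scikit-learn sketch (synthetic data stands in for the extracted HTML features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Hold out a test set up front and touch it only once, at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = SVC(kernel="rbf").fit(X_train, y_train)
print(model.score(X_test, y_test))  # one-shot estimate of generalization
```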
While software support for preprocessing data is very sparse and highly domain-dependent, as adam mentioned, Weka is a good free tool for applying different methods once you have your dataset. I would also recommend reading several books. Vladimir Vapnik, the inventor of SVMs, wrote "The Nature of Statistical Learning Theory". You should get familiar with the process of modeling, so a book on machine learning is definitely very useful. I also hope that some of this terminology helps you find your way around.
This seems like a pretty complicated task to me; step 2, classification, is "easy", but step 1 looks like a structure learning task. You might want to simplify it to classification on parts of HTML trees, perhaps preselected by some heuristic.
The most widely used general machine learning library (freely) available is probably WEKA. They have a book that introduces some ML concepts and covers how to use their software. Unfortunately for you, it is written entirely in Java.
I am not really a Python person, but it would surprise me if there weren't a lot of tools available for it as well.
For text-based classification right now, Naive Bayes, decision trees (J48 in particular, I think), and SVM approaches give the best results. However, each is better suited to slightly different applications. Off the top of my head, I'm not sure which would suit you best. With a tool like WEKA you could try all three approaches on some example data without writing a line of code and see for yourself.
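If you end up in Python rather than WEKA, the same try-all-three experiment is only a few lines with scikit-learn (the corpus below is invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Tiny invented corpus; in practice use your own labeled pages.
texts = ["cheap pills buy now", "read the full article here",
         "limited offer click today", "author biography and contents"] * 10
labels = [1, 0, 1, 0] * 10

# Compare Naive Bayes, a decision tree, and a linear SVM on identical features.
for clf in (MultinomialNB(), DecisionTreeClassifier(), LinearSVC()):
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=3)
    print(type(clf).__name__, scores.mean())
```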
I tend to shy away from neural networks simply because they can get very complicated very quickly. Then again, I haven't tried a large project with them, mostly because they have that reputation in academia.
Probability and statistics knowledge is only required if you are using probabilistic algorithms (like Naive Bayes). SVMs are generally not used in a probabilistic manner.
From the sound of it, you may want to invest in an actual pattern classification textbook or take a class on it in order to find exactly what you are looking for. For custom/non-standard data sets it can be tricky to get good results without having a survey of existing techniques.
It seems to me that you are just now entering the machine learning field, so I'd really like to suggest having a look at this book: not only does it provide a deep and broad overview of the most common machine learning approaches and algorithms (and their variations), but it also provides a very good set of exercises and links to scientific papers. All of this is wrapped in insightful language, together with a minimal yet useful compendium on statistics and probability.