How do I identify most related parameters in statistical modeling

How do I identify most related parameters in statistical modeling - machine-learning

I have data about car mechanic company which allow mechanics to apply for there garrage on freelance basis.
I have previous mechanic job history and based on this historical data, I want to recommend best possible location to mechanics so that he can get good job and company gets maximum acceptance.
I manually checked various parameters like location_ID, lang, lat of the job location, mechanic_Exp_years, open_position, mechanic_specialization etc.
Also tried to see relation using chart like this
https://imgur.com/a/jxmTXty
I am adding link because can not upload image due to less then 10 point
Is there any standards technique available which can statistically says that out of this 100 parameters this paramteres are good to be considered for prediction/training?
Any reference link or library much appreciated. I did checked many articles but no luck

There are many methods to do that. If you're using python, I would recommend scikit-learn's FeatureSelection module. There are many methods listed, but my pick would be Recursive Feature Elimination or short RFE. The RFE works by recursively removing attributes and building a model on those attributes that remain. It uses the model accuracy to identify which attributes (and combination of attributes) contribute the most to predicting the target attribute.
Other then that, you can also try to use PCA (Principal Component Analysis) to reduce your features to only useful ones that bring some information to your model.

Related

Using NLP or other algorithms to mach two strings

My goal of a project is to correctly assign medications. I have a large catalog at my disposal for this purpose. However, the medications do not appear there in exactly the same spelling. Possibly additional information was added or possible parts of the prescription were abbreviated.
I was already able to implement a possible algorithm using the Levensthein distance (token_set_ratio).
Because of the sometimes long additional information this algorithm assigns wrong medications, I wanted to ask if there are better algorithms for comparing strings. For example, does it make sense to implement machine learning algorithms or NLP technology? This is a relatively new area for me. I would appreciate any ideas or inspiration.

This sounds like a classic Deduplication task. For example, have a look at dedupe. This tool lets you annotate training examples and learns when two items refer to the same thing. It can be used with as few as 10 training sanples and has an active learning approach implemented.

In a regression task, how do I find which independent variables are to be ignored or are not important?

In the regression problem I'm working with, there are five independent columns and one dependent column. I cannot share the data set details directly due to privacy, but one of the independent variables is an ID field which is unique for each example.
I feel like I should not be using ID field in estimating the dependent variable. But this is just a gut feeling. I have no strong reason to do this.
What shall I do? Is there any way I decide which variables to use and which to ignore?

Well, I agree with #desertnaut. Id attribute does not seem relevant when creating a model and provides no help in prediction.
The term you are looking for is feature selection. Since it's a comprehensive section so I would just tell you the methods that are mostly used by data scientists.
As for regression problems you can try correlation heatmap to find the features that are highly correlated with the target.
sns.heatmap(df.corr())
There are several other ways too like PCA,using tree inbuilt feature selection methods to find the right features for your model.
You can also try James Phillips method. This approach is limited since model time complexity will increase linearly with the features. But in your case where you've only four features to compare you can try it out. You can compare the regression model trained with all the four features with the model trained with only three features by dropping one of the four features recursively. This would mean training four regression models and comparing them.

According to you, the ID variable is unique for each example. So the model won't be able to learn anything from this variable as with every example, you get a new ID and hence no general patterns to learn as every ID only occurs once.
Regarding feature elimination, it depends. If you have domain knowledge, based on that alone you can engineer/ remove features as needed. If you don't know much about the domain, you can try out some basic techniques like Backward Selection, Forward Selection, etc via Cross Validation to get the model with the best value of the metric that you're working with.

How to build target variable for supervised machine learning project

I am quite new to machine learning with small experience and I did some projects.
Now I have a project relates to insurance. So I have databases about clients that I will merge to get all possible information about the clients and I have one database for the claims. I need to build a model to identify how risky the client based on ranks.
My question: I need to build my target variable that ranks the clients based on how risky they are, counting on the claims. I could have different strategies to do that, but I am confused about how I will deal with the following:
- Shall I do a specific type of analysis before building the ranks such as clustering, or I need to have a strong theoretical assumption matching with the project provider vision.
- If I use some variables in the claims database to build up the ranks, how shall I deal with them later. In other words, shall I remove them from the final data set for training, to avoid correlation with target variable, or I can treat them in a different way and keep them.
- If I will keep them, is there a special treatment for them depending on whether they are categorical or continuous variables.

Every machine learning project's starting point is EDA. First create some feature, like how often do they get bad claims or how many do they get. Then do some EDA to find which features are more useful. Secondly, the problem looks like classification. Clustering is usually harder to evaluate.

In data sciences when you make a business model, EDA exploratory data analytics play a major role which includes data cleaning, feature engineering, filtering data. As mentioned how to build target variable, it all depends on the attributes you have and what model do you want to apply say linear regression or logistic or make a decision tree. You need to use those algorithms. But most importantly you need to find out the impacting variable. That's probably the core elation between the output and the given input and priority must be given accordingly. Also attributes which add no value must be removed as that would contribute to overfitting.
You can do clustering too. And interesting thing is any unspervisoned learning could be converted to a form of supervised learning. Probably you can try to do logistic regression or do linear regression etc... And find out which model fits best to your project.

Advice on classifying users in machine learning scenario

I'm looking for some advice in the problem of classifying users into various groups based on there answers to a sign up process.
The idea is that these classifications will group people with similar travel habits, i.e. adventurous, relaxing, foodie etc. This shouldn't be a classification known to the user, so isn't as simple as just asking what sort of holidays they like ( The point is to remove user bias/not really knowing where to place yourself).
The way I see it working is asking questions such as apps they use, accounts they interact with on social media (gopro, restaurants etc) , giving some scenarios and asking which sounds best, these would be chosen from a set provided to them, hence we have control over the variables. The main problem I have is how to get numerical values associated to each of these.
I've looked into various Machine learning algorithms and have realised this is most likely a clustering problem but I cant seem to figure out how to use this style of question to assign a value to each dimension that will actually give a useful categorisation.
Another question I have is whether there is some resources where I could find information on the sort of questions to ask users to gain information that'd allow classification like this.
The sort of process I envision is one similar to https://www.thread.com/signup/introduction if anyone is familiar with it.
Any advice welcomed.

The problem you have at hand is that you want to calculate a similarity measure based on categorical variables, which is the choice of their apps, accounts etc. Unless you measure the similarity of these apps with respect to an attribute such as how foodie is the app, it would be a hard problem to specify. Also, you would need to know all the possible states a categorical variable can assume to create a similarity measure like this.
If the final objective is to recommend something that similar people (based on app selection or social media account selection) have liked or enjoyed, you should look into collaborative filtering.
If your feature space is well defined and static (known apps, known accounts, limited set with few missing values) then look into content based recommendation systems, something as simple as Market Basket Analysis can give you a reasonable working model.
Else if you really want to model the system with a bunch of features that can assume random states, this could be done with multivariate probabilistic models, if the structure (relationships and influences between features) is well defined, you could benefit from Probabilistic Graphical Models, such as Bayesian Networks.
You really do need to define your problem better before you start solving it though.

You can use prime numbers. If each choice on the list of all possible choices is assigned a different prime, and the user's selection is saved as a product, then you will always know if the user has made a particular choice if the modulo of selection/choice is 0. Beauty of prime numbers, voila!

Ordering movie tickets with ChatBot

My question is related to the project I've just started working on, and it's a ChatBot.
The bot I want to build has a pretty simple task. It has to automatize the process of purchasing movie tickets. This is pretty close domain and the bot has all the required access to the cinema database. Of course it is okay for the bot to answer like “I don’t know” if user message is not related to the process of ordering movie tickets.
I already created a simple demo just to show it to a few people and see if they are interested in such a product. The demo uses simple DFA approach and some easy text matching with stemming. I hacked it in a day and it turned out that users were impressed that they are able to successfully order tickets they want. (The demo uses a connection to the cinema database to provide users all needed information to order tickets they desire).
My current goal is to create the next version, a more advanced one, especially in terms of Natural Language Understanding. For example, the demo version asks users to provide only one information in a single message, and doesn’t recognize if they provided more relevant information (movie title and time for example). I read that an useful technique here is called "Frame and slot semantics", and it seems to be promising, but I haven’t found any details about how to use this approach.
Moreover, I don’t know which approach is the best for improving Natural Language Understanding. For the most part, I consider:
Using “standard” NLP techniques in order to understand user messages better. For example, synonym databases, spelling correction, part of speech tags, train some statistical based classifiers to capture similarities and other relations between words (or between the whole sentences if it’s possible?) etc.
Use AIML to model the conversation flow. I’m not sure if it’s a good idea to use AIML in such a closed domain. I’ve never used it, so that’s the reason I’m asking.
Use a more “modern” approach and use neural networks to train a classifier for user messages classification. It might, however, require a lot of labeled data
Any other method I don’t know about?
Which approach is the most suitable for my goal?
Do you know where I can find more resources about how does “Frame and slot semantics” work in details? I'm referring to this PDF from Stanford when talking about frame and slot approach.

The question is pretty broad, but here are some thoughts and practical advice, based on experience with NLP and text-based machine learning in similar problem domains.
I'm assuming that although this is a "more advanced" version of your chatbot, the scope of work which can feasibly go into it is quite limited. In my opinion this is a very important factor as different methods widely differ in the amount and type of manual effort needed to make them work, and state-of-the-art techniques might be largely out of reach here.
Generally the two main approaches to consider would be rule-based and statistical. The first is traditionally more focused around pattern matching, and in the setting you describe (limited effort can be invested), would involve manually dealing with rules and/or patterns. An example for this approach would be using a closed- (but large) set of templates to match against user input (e.g. using regular expressions). This approach often has a "glass ceiling" in terms of performance, but can lead to pretty good results relatively quickly.
The statistical approach is more about giving some ML algorithm a bunch of data and letting it extract regularities from it, focusing the manual effort in collecting and labeling a good training set. In my opinion, in order to get "good enough" results the amount of data you'll need might be prohibitively large, unless you can come up with a way to easily collect large amounts of at least partially labeled data.
Practically I would suggest considering a hybrid approach here. Use some ML-based statistical general tools to extract information from user input, then apply manually built rules/ templates. For instance, you could use Google's Parsey McParseface to do syntactic parsing, then apply some rule engine on the results, e.g. match the verb against a list of possible actions like "buy", use the extracted grammatical relationships to find candidates for movie names, etc. This should get you to pretty good results quickly, as the strength of the syntactic parser would allow "understanding" even elaborate and potentially confusing sentences.
I would also suggest postponing some of the elements you think about doing, like spell-correction, and even stemming and synonyms DB - since the problem is relatively closed, you'll probably have better ROI from investing in a rule/template-framework and manual rule creation. This advice also applies to explicit modeling of conversation flow.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart