Matching user profiles with employment opportunities - machine-learning

I am currently working on software that connects users to jobs based on their profiles. I ran text analytics on the job descriptions and derived important keywords from them. I have also collected user information from their profiles. Matching the jobs to the user profiles seems to be a challenging task. Are there any machine-learning algorithms that can be used for this kind of matchmaking?

OK, so basically you have keywords for each job description, and you have some text data (user profiles) that you want to match those keywords against.
Since your training data (user profiles) is not labeled, supervised learning will not help you here. Unsupervised learning (clustering) might help you find patterns (keywords) across a large number of user profiles, but you would need to experiment with different techniques (such as Gaussian mixture models) and see which patterns emerge.
A simpler approach is to extract keywords for each user profile as well (in other words, to identify how many of your job keywords also appear in the user profile) and then compare the two using cosine similarity. You would then only need to determine the angle threshold below which a profile counts as a match. This would be a parameter to play with. Of course, you would need to vectorize your text data using bigrams or similar (if you use Python, scikit-learn already provides feature extraction for this). You could also use a tf-idf vectorizer on both the job descriptions and the user profiles, with a carefully chosen stop-word list.
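As a rough sketch of the tf-idf plus cosine similarity idea, here is a version that uses only the Python standard library rather than scikit-learn's vectorizers. The job keywords, the profile texts, and the 0.2 threshold are made up for illustration, and there is no stop-word list or bigram handling yet.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())

def idf_weights(documents):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(documents)
    doc_freq = Counter(tok for doc in documents for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse tf-idf vector (token -> weight); tokens outside the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {tok: count * idf[tok] for tok, count in counts.items() if tok in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical data: keywords per job and free text per user profile.
jobs = {
    "data_engineer": "python sql etl spark pipelines",
    "frontend_dev": "javascript react css html",
}
profiles = {
    "alice": "I build ETL pipelines in Python and SQL on Spark",
    "bob": "React and CSS developer who enjoys JavaScript",
}

idf = idf_weights(list(jobs.values()) + list(profiles.values()))
job_vectors = {name: tfidf_vector(text, idf) for name, text in jobs.items()}

THRESHOLD = 0.2  # assumed cut-off; this is the parameter to tune

def match_jobs(profile_text):
    """Return (job, similarity) pairs at or above the threshold, best first."""
    vec = tfidf_vector(profile_text, idf)
    scored = [(cosine(vec, job_vec), name) for name, job_vec in job_vectors.items()]
    return [(name, round(score, 3))
            for score, name in sorted(scored, reverse=True)
            if score >= THRESHOLD]

for user, text in profiles.items():
    print(user, match_jobs(text))
```

Raising or lowering THRESHOLD trades precision for recall, so it is worth checking against a handful of matches you have judged by hand.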

Related

How do I identify most related parameters in statistical modeling

I have data about a car mechanic company that allows mechanics to apply to its garages on a freelance basis.
I have the mechanics' previous job history, and based on this historical data I want to recommend the best possible location to each mechanic, so that the mechanic gets a good job and the company gets the maximum acceptance rate.
I manually checked various parameters such as location_ID, lang, lat of the job location, mechanic_Exp_years, open_position, mechanic_specialization, etc.
I also tried to look at the relationships using a chart like this:
https://imgur.com/a/jxmTXty
I am adding a link because I cannot upload an image with fewer than 10 points.
Is there any standard technique that can tell statistically which of these 100 parameters are worth considering for prediction/training?
Any reference link or library would be much appreciated. I checked many articles but had no luck.
There are many methods to do that. If you're using Python, I would recommend scikit-learn's feature_selection module. There are many methods listed, but my pick would be Recursive Feature Elimination, or RFE for short. RFE works by recursively removing attributes and rebuilding a model on the attributes that remain. In scikit-learn it ranks attributes by the fitted model's coefficients or feature importances to identify which attributes contribute the most to predicting the target attribute.
Other than that, you can also try PCA (Principal Component Analysis) to reduce your features to a smaller set of components that carry most of the information. Note that these components are combinations of the original features, not a subset of them.
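A minimal RFE sketch, assuming scikit-learn is installed. The data is synthetic (generated with make_classification) and stands in for your real parameters. Logistic regression is just one possible estimator, and keeping 3 features is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data: 200 rows, 10 features,
# of which only 3 actually carry signal about the target.
X, y = make_classification(
    n_samples=200,
    n_features=10,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)

# Repeatedly fit the model and drop the weakest feature until 3 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
print("ranking (1 = kept):", selector.ranking_.tolist())
```

With real data you would pass your own feature matrix and target (for example, whether a job offer was accepted) in place of X and y.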

Advice on classifying users in machine learning scenario

I'm looking for some advice on the problem of classifying users into various groups based on their answers during a sign-up process.
The idea is that these classifications will group people with similar travel habits, e.g. adventurous, relaxing, foodie, etc. This shouldn't be a classification known to the user, so it isn't as simple as just asking what sort of holidays they like (the point is to remove user bias and the problem of not really knowing where to place yourself).
The way I see it working is by asking questions such as which apps they use and which accounts they interact with on social media (GoPro, restaurants, etc.), and by giving some scenarios and asking which sounds best. The answers would be chosen from a set provided to them, so we have control over the variables. The main problem I have is how to get numerical values associated with each of these.
I've looked into various machine learning algorithms and have realised this is most likely a clustering problem, but I can't seem to figure out how to use this style of question to assign a value to each dimension that will actually give a useful categorisation.
Another question I have is whether there are any resources where I could find information on the sort of questions to ask users to gain information that would allow classification like this.
The sort of process I envision is one similar to https://www.thread.com/signup/introduction if anyone is familiar with it.
Any advice welcomed.
The problem you have at hand is that you want to calculate a similarity measure based on categorical variables, namely the users' choices of apps, accounts, etc. Unless you measure the similarity of these apps with respect to an attribute, such as how "foodie" an app is, the problem is hard to specify. You would also need to know all the possible states a categorical variable can assume in order to create a similarity measure like this.
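To make the "score each option against an attribute" idea concrete, here is a small sketch. The option names and every score in the table are invented for illustration. In practice you would have to assign or learn them yourself.

```python
# Traits we want to measure; these become the numeric dimensions.
TRAITS = ("adventurous", "relaxing", "foodie")

# Hypothetical hand-assigned scores in [0, 1] for each selectable option.
OPTION_SCORES = {
    "action_camera_brand":    (0.9, 0.1, 0.0),
    "hiking_app":             (0.8, 0.3, 0.1),
    "meditation_app":         (0.1, 0.9, 0.1),
    "restaurant_booking_app": (0.1, 0.3, 0.9),
}

def user_vector(choices):
    """Average the trait scores of the options a user picked."""
    rows = [OPTION_SCORES[c] for c in choices if c in OPTION_SCORES]
    if not rows:
        return tuple(0.0 for _ in TRAITS)
    return tuple(sum(column) / len(rows) for column in zip(*rows))

def dominant_trait(choices):
    """Name of the trait with the highest average score."""
    vec = user_vector(choices)
    return TRAITS[vec.index(max(vec))]

print(user_vector(["action_camera_brand", "hiking_app"]))
print(dominant_trait(["meditation_app", "restaurant_booking_app", "restaurant_booking_app"]))
```

The resulting vectors are ordinary numeric points, so they can be fed to a clustering algorithm instead of being reduced to a single dominant trait.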
If the final objective is to recommend something that similar people (based on app selection or social media account selection) have liked or enjoyed, you should look into collaborative filtering.
If your feature space is well defined and static (known apps, known accounts, limited set with few missing values), then look into content-based recommendation systems; something as simple as Market Basket Analysis can give you a reasonable working model.
Otherwise, if you really want to model the system with a bunch of features that can assume random states, you could use multivariate probabilistic models. If the structure (relationships and influences between features) is well defined, you could benefit from Probabilistic Graphical Models, such as Bayesian Networks.
You really do need to define your problem better before you start solving it though.
You can use prime numbers. If each choice on the list of all possible choices is assigned a different prime, and the user's selection is saved as the product of the chosen primes, then you can always tell whether the user made a particular choice: the selection modulo that choice's prime is 0. Beauty of prime numbers, voila!
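A quick sketch of the prime encoding, with a made-up list of choices:

```python
# Hypothetical choices, each assigned a distinct prime.
CHOICE_PRIMES = {"hiking": 2, "beach": 3, "street_food": 5, "museums": 7}

def encode(choices):
    """Store a selection as the product of the primes of the chosen options."""
    product = 1
    for choice in set(choices):  # set() so a repeated choice is counted once
        product *= CHOICE_PRIMES[choice]
    return product

def has_choice(selection, choice):
    """A choice was made exactly when its prime divides the stored product."""
    return selection % CHOICE_PRIMES[choice] == 0

def decode(selection):
    """Recover the full set of choices from the stored product."""
    return sorted(c for c in CHOICE_PRIMES if has_choice(selection, c))

selection = encode(["hiking", "street_food"])
print(selection, has_choice(selection, "beach"), decode(selection))
```

Keep in mind that the product grows quickly as the list of possible choices gets longer, and that this only records which options were picked. It does not by itself give you a similarity measure between users.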

How do I choose training data set for job recommendation using linear regression model?

I have two kinds of profiles in my database: candidate profiles and job profiles posted by recruiters.
Both kinds of profile share three common fields: location, skill, and experience.
I know the algorithm, but I am having trouble creating the training data set. My input features will be location, skill, and salary taken from the candidate profile, but I don't see how to choose the output (the relevant job profile).
As far as I know, the output can only be a single variable, so how do I use the relevant job profile as the output in my training set?
Or should I choose some other method? Another thought is clustering.
As I understand it, you want to predict a job profile given a candidate profile using some prediction algorithm.
Well, if you want to use regression you need historical data (which candidates were given which jobs); you can then build a model from it. If you don't have such training data, you need a different approach. For example, you could treat location, skill, and experience as features in a three-dimensional space and use clustering or nearest neighbors to find the candidate profile closest to a job profile.
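Here is one way the nearest-neighbor idea could look, using only the standard library. The profiles are invented, and the way each field is turned into a distance (exact match for location, set overlap for skills, a 10-year cap on the experience gap) is an assumption you would want to adapt.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Profile:
    name: str
    location: str
    skills: frozenset
    experience: float  # years

def distance(candidate, job):
    """Euclidean distance over three per-field distances, each in [0, 1]."""
    # Location: 0 if identical, 1 otherwise (an assumed, very coarse measure).
    loc = 0.0 if candidate.location == job.location else 1.0
    # Skills: Jaccard distance between the two skill sets.
    union = candidate.skills | job.skills
    skill = 1.0 - len(candidate.skills & job.skills) / len(union) if union else 0.0
    # Experience: absolute gap in years, capped at 10 and scaled to [0, 1].
    exp = min(abs(candidate.experience - job.experience) / 10.0, 1.0)
    return (loc ** 2 + skill ** 2 + exp ** 2) ** 0.5

def nearest_jobs(candidate, jobs, k=2):
    """Return the k job profiles closest to the candidate."""
    return sorted(jobs, key=lambda job: distance(candidate, job))[:k]

# Hypothetical profiles.
candidate = Profile("candidate_1", "Berlin", frozenset({"python", "sql"}), 4)
jobs = [
    Profile("frontend_paris", "Paris", frozenset({"javascript", "css"}), 3),
    Profile("backend_berlin", "Berlin", frozenset({"python", "sql", "docker"}), 5),
    Profile("data_munich", "Munich", frozenset({"python", "sql"}), 4),
]

print([job.name for job in nearest_jobs(candidate, jobs)])
```

No training data is needed for this, which is the point of the suggestion. The price is that the weighting of the three fields is a design decision rather than something learned.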
You could look at "recommender systems"; they may be an answer to your problem.
Starting with a content-based algorithm (you will have to find a way to label the jobs automatically, or label them manually), you can later move to a hybrid recommender by recording which jobs your users were actually interested in.

Find startup's industry from its description

I am using the AngelList DB to categorize startups by industry, since the existing categories come from community input, which is misleading most of the time.
My business objective is to extract keywords that indicate which industry a given startup belongs to, then map it to one of the industries specified in the LinkedIn sheet https://developer.linkedin.com/docs/reference/industry-codes
I experimented with Azure Machine Learning, where I submitted 300 startup descriptions for analysis. The keyword extraction was pretty bad and was not even close to what I am trying to achieve.
I would like to know how data scientists would approach this problem. Where should I look, and where should I not? Are keyword analysis tools (like Google AdWords Keyword Planner) a viable option?
Using Text Classification...
To be able to treat this as a classification problem, you need a training set, which is a set of AngelList entries that are labeled with the correct LinkedIn categories. This can be done manually, or you can hire workers on Amazon Mechanical Turk to do the job for you.
Since you have ~150 categories, I'd imagine you need at least 20-30* AngelList entries for each of them. So your training set will look like {input: angellist_description, result: linkedin_id}
After that, you need to dig through text classification techniques to optimize the accuracy/precision of your results. The book "Taming Text" has a full chapter on text classification. A good tool for implementing a text-based classifier would be Apache Solr or Apache Lucene.
* 20-30 is a quick personal estimate and not based on a scientific method. You can look online for more principled ways to estimate the required sample size.
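To show what training on {input: angellist_description, result: linkedin_id} pairs could look like, here is a tiny multinomial Naive Bayes classifier in plain Python. This is a swapped-in illustration, not Solr or Lucene, and the four descriptions and the two labels are invented placeholders rather than real LinkedIn industry codes.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled training set; labels are placeholders.
training = [
    ("mobile app platform for developers building cloud software", "software"),
    ("saas analytics dashboard and api for engineering teams", "software"),
    ("telemedicine service connecting patients with doctors", "healthcare"),
    ("wearable device that monitors patient heart health", "healthcare"),
]
texts, labels = zip(*training)
model = NaiveBayesText().fit(texts, labels)

print(model.predict("cloud api for developers"))
print(model.predict("doctors monitoring patients"))
```

With ~150 categories and only a few dozen examples each, it is worth holding out part of the labeled set to measure accuracy before trusting the predictions.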
Using Text Clustering.
Step #1
Use text clustering to extract main 'topics' from all the descriptions. (Carrot2 can be helpful here)
Input: corpus of all descriptions
Process: text clustering using Carrot2
Output: each document will be labeled with a topic
Step #2
Manually map the extracted topics into LinkedIn's categories.
Step #3
Use the output of the first two steps to traverse from company -> extracted topic -> linkedin category
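The traversal in step #3 is just two lookups. In the sketch below the clustering output from step #1 is assumed to already exist as a plain mapping (Carrot2 itself is not called), and all company names, topics, and category labels are invented placeholders.

```python
# Step #1 output (assumed): the topic the clustering tool gave each company.
company_topic = {
    "acme_health": "telemedicine",
    "pixelforge": "mobile games",
    "ledgerly": "invoicing",
}

# Step #2: manual mapping from extracted topics to target categories.
# Category names are placeholders, not real LinkedIn industry codes.
topic_category = {
    "telemedicine": "category_health",
    "mobile games": "category_games",
}

def categorize(company):
    """Step #3: company -> extracted topic -> category (None if unmapped)."""
    topic = company_topic.get(company)
    return topic_category.get(topic)

# Topics that still need a manual mapping decision.
unmapped_topics = sorted(set(company_topic.values()) - set(topic_category))

print({name: categorize(name) for name in company_topic})
print("topics without a category:", unmapped_topics)
```

Listing the unmapped topics makes it easy to see how much manual mapping work is left after each clustering run.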

Machine Learning Algorithm Suggestion?

I'm a novice in ML. I'm short on time and need to choose an algorithm for the following task:
A traveler visits my website. I have them fill in a form, which gives me all the necessary signals (attributes), such as whether they have booked a flight, whether the email is genuine, whether a phone number is given, whether the trip date is fixed, and whether the destination is fixed.
However, I also have many visitors who don't fill in the form completely or who just use a fake phone number.
To reiterate, I have a lot of signals available, and I need to filter out the travelers who are certain to travel so that I can contact them personally. I also need a score on a scale of 10.
Which ML algorithm is best suited for this job, and why?
I have previously worked with WEKA.
You'll need to create an ensemble model (composition of many different algorithms).
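To illustrate only the combining step of an ensemble, here is a sketch that averages the outputs of several scorers and rescales the result to 0-10. The three scorers are hand-written stand-ins with invented numbers. In a real ensemble each would be a model trained on leads whose outcome you already know.

```python
SIGNALS = (
    "booked_flight",
    "email_genuine",
    "phone_given",
    "trip_date_fixed",
    "destination_fixed",
)

def completeness_model(lead):
    """Share of signals that are present and true; missing fields count as false."""
    return sum(bool(lead.get(s)) for s in SIGNALS) / len(SIGNALS)

def commitment_model(lead):
    """Assumed rule: a booked flight or a fixed date and destination signal intent."""
    if lead.get("booked_flight"):
        return 0.9
    if lead.get("trip_date_fixed") and lead.get("destination_fixed"):
        return 0.6
    return 0.2

def contact_model(lead):
    """Assumed rule: reachable leads have both a genuine email and a phone number."""
    return 0.8 if lead.get("email_genuine") and lead.get("phone_given") else 0.3

MODELS = (completeness_model, commitment_model, contact_model)

def lead_score(lead):
    """Average the model outputs (soft voting) and rescale to a 0-10 score."""
    average = sum(model(lead) for model in MODELS) / len(MODELS)
    return round(10 * average, 1)

full_lead = {signal: True for signal in SIGNALS}
partial_lead = {"email_genuine": True, "trip_date_fixed": True}

print(lead_score(full_lead), lead_score(partial_lead), lead_score({}))
```

Since you have used WEKA before, the same averaging idea is what its voting and bagging meta-classifiers do for you once you have labeled examples.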

Resources