New Stack Exchange: Climate Models

I often ask about climate models, but I thought it would be good to start a new Stack Exchange site that brings together computer scientists and climate scientists so that they can create very sophisticated climate models. If you are interested in participating, please have a look at the proposal:
https://area51.stackexchange.com/proposals/127131/clima-models

Related

Extract relevant keywords from job advertisements [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
My friend asked me if I could write a program capable of identifying relevant keywords in job adverts given three variables: the industry, the job title, and the job posting text (example below).
The problem we are trying to address, from a job seeker's point of view, revolves around having the correct keywords in your resume for each job application, thereby increasing your chances of being shortlisted for an interview. This is especially important when the first-stage screening is done by bots scanning for keywords.
Initially I was considering a relational database containing all industries, all job titles, and their related keywords. This, however, is an enormous task, and the data in fast-moving fields like information technology and biotechnology would quickly become stale.
It seems machine learning and natural language processing are unavoidable.
Consider the job advert below for a bank seeking a teller:

Are you an experienced Bank Teller seeking that perfect work life balance? If you’re looking for Casual Hours and have an absolute passion for customer service then this is the role for you!
Our client services Queensland Public Servants (particularly Queensland Police), and is currently seeking a Bank Teller to join their Brisbane CBD team to start ASAP.
The successful candidate will be required to work from 9:30am to 2:30pm, Monday to Friday, therefore 25 hours per week. Based on experience, the successful candidate will be paid (approximately) $25 - $27 + superannuation per hour.
This position is casual/temporary with the potential for a permanent placement (based on performance/length of assignment etc.).
DUTIES & RESPONSIBILITIES:
As a Bank Teller you will be required to: attend to customers in an exceptional, professional and efficient manner; process basic transactions such as deposits and withdrawals; complete complex transactions such as loans and mortgages; pass referrals onto the sales team (NO SALES); handle large amounts of cash; and ensure high attention to detail is at the top of your list!
SKILLS & EXPERIENCE:
The successful candidate will have the following: previous teller experience (within the last 5 years) IDEAL; previous customer service experience (within finance) IDEAL; ability to work in a fast paced and time pressured environment; excellent presentation and attitude; exceptional attention to detail; ability to quickly ‘master’ multiple software packages; and strong time management skills and the ability to work autonomously. If you have fantastic customer service skills, a professional manner, and preferably teller experience, we would LOVE to hear from you!
If I were the hiring manager (or a bot), I would probably look for these keywords in the resume:
teller, transactions, deposits, withdrawals, loans, mortgages, customer
service, time management
How would you attack this problem?
If you have access to lots of advertisements, group them by job title and then run a topic modelling algorithm such as Latent Dirichlet Allocation (LDA) on each group. This will produce the keywords.
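Running full LDA requires a library such as gensim, but the group-then-extract idea can be illustrated with a plain TF-IDF sketch: words that are frequent within one job title's adverts but rare across the others make good keyword candidates. The job titles and advert text below are made up for illustration:

```python
import math
from collections import Counter

# Toy corpus: advert text grouped by job title (hypothetical data).
groups = {
    "teller": "teller deposits withdrawals transactions customer service cash",
    "developer": "python developer software code testing customer",
}

docs = {title: text.split() for title, text in groups.items()}
n = len(docs)
# Document frequency: in how many groups does each word appear?
df = Counter(w for words in docs.values() for w in set(words))

def keywords(title, topn=3):
    # TF-IDF: frequent in this group, rare in the others.
    tf = Counter(docs[title])
    scores = {w: c * math.log(n / df[w]) for w, c in tf.items()}
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:topn]]

print(keywords("teller"))
```

With real data you would replace the toy groups with the advert text per title and swap the scoring for a topic model such as gensim's LdaModel; the pipeline shape stays the same.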
For more information see Relink, which does exactly what you are trying to do. They provide an outline of the process here:
The Science Behind Relink - Organizing Job Postings
Here is a paper that may help: Modeling Career Path Trajectories.
For a technical paper on just LDA see Latent Dirichlet Allocation.
For an article with sample Python code using the gensim library see Experiments on the English Wikipedia. This is an interesting article as it deals with a huge corpus, a dump of the entire Wikipedia database, and talks about ways of improving execution times using distributed LDA on a cluster of computers. The sample code also shows how to apply Latent Semantic Analysis and compares the results with LDA.
The following article and sample code by Jordan Barber, Latent Dirichlet Allocation (LDA) with Python, uses NLTK to create a corpus and gensim for LDA. This code is more adaptable to other applications than the Wikipedia code.

Recommendation rules for sorting a list based on a profile

I'm working on a site that needs to present a set of options that have no particular order. I need to sort this list based on the customer viewing it. I thought of doing this by generating recommendation rules and sorting the list, putting the options the customer is most likely to like at the top. Furthermore, I think it would be cool if, when the confidence in a recommendation is high, I could tell the customer why I'm recommending it.
For example, let's say we have an ice cream shop that has a website where customers can register and make orders online. The customer information contains basic info like gender, DOB, address, etc. My goal is to mine previous orders made by customers to generate rules of the form
feature -> flavor
where feature would be either information in the profile or in the order itself (like, for example, we might ask how many people are you expecting to serve, their ages, etc).
I would then pull the rules that apply to the current customer and put the ones with higher confidence at the top of the list.
My question: what's the best standard algorithm to solve this? I have some experience with apriori, and initially I thought of using it, but since I'm interested in having only one consequent, I now think other alternatives might be better suited. In any case, I'm not that knowledgeable about machine learning, so I'd appreciate any help and references.
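For the single-consequent case described here, rule confidence is just a conditional frequency, so it can be computed directly from the order history without a full apriori pass. A toy sketch (the feature encoding and data are hypothetical):

```python
from collections import Counter

# Hypothetical order history: each order pairs a profile feature
# with the flavor chosen.
orders = [
    ("gender=F", "vanilla"), ("gender=F", "vanilla"),
    ("gender=F", "chocolate"), ("gender=M", "chocolate"),
    ("gender=M", "chocolate"),
]

# Confidence of the rule "feature -> flavor" is P(flavor | feature).
feature_counts = Counter(f for f, _ in orders)
pair_counts = Counter(orders)

def confidence(feature, flavor):
    return pair_counts[(feature, flavor)] / feature_counts[feature]

# Rank all observed rules by confidence, highest first.
rules = sorted(
    ((f, fl, confidence(f, fl)) for (f, fl) in pair_counts),
    key=lambda r: -r[2],
)
print(rules[0])
```

In practice you would also filter by support (how often the feature occurs at all) so that a rule seen once does not get confidence 1.0 by accident.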
This is a recommendation problem.
First, the apriori algorithm is no longer the state of the art in recommendation systems (a related discussion is here: Using the apriori algorithm for recommendations).
Check out Chapter 9, Recommendation Systems, of the book Mining of Massive Datasets below. It's a good tutorial to start with.
http://infolab.stanford.edu/~ullman/mmds.html
Basically you have two different approaches: content-based and collaborative filtering. The latter can be done with an item-based or a user-based approach. There are also methods that combine the approaches to get better recommendations.
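To give a feel for the item-based variant: each flavor is represented by the set of customers who ordered it, and candidate flavors are scored by cosine similarity to the flavors the current customer already ordered. A minimal sketch on made-up orders:

```python
import math
from collections import defaultdict

# Toy order history (hypothetical): customer -> set of flavors ordered.
orders = {
    "alice": {"vanilla", "chocolate"},
    "bob": {"vanilla", "strawberry"},
    "carol": {"chocolate", "vanilla"},
}

# Invert: flavor -> set of customers who ordered it.
who = defaultdict(set)
for cust, flavors in orders.items():
    for f in flavors:
        who[f].add(cust)

def cosine(a, b):
    # Cosine similarity between the binary "who ordered it" vectors.
    inter = len(who[a] & who[b])
    return inter / math.sqrt(len(who[a]) * len(who[b])) if who[a] and who[b] else 0.0

def recommend(customer):
    owned = orders[customer]
    candidates = set(who) - owned
    # Score each candidate by its best similarity to something already ordered.
    return sorted(candidates, key=lambda f: -max(cosine(f, o) for o in owned))

print(recommend("bob"))
```

The item-based flavor of collaborative filtering tends to scale better than user-based when there are far more customers than items, which is likely for an ice cream menu.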
Some further readings that might be useful:
A recent survey paper on recommendation systems:
http://arxiv.org/abs/1006.5278
Amazon item-to-item collaborative filtering: http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
Matrix factorization techniques: http://research.yahoo4.akadns.net/files/ieeecomputer.pdf
Netflix challenge: http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
Google news personalization: http://videolectures.net/google_datar_gnp/
Some related stackoverflow topics:
How to create my own recommendation engine?
Where can I learn about recommendation systems?
How do I adapt my recommendation engine to cold starts?
Web page recommender system

What's the difference between incremental software process model, evolutionary model, and the spiral model?

Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I am studying Software Engineering this year and I am a little confused about the question in the title.
Both my professor and the reference ("Software Engineering: A Practitioner's Approach") treat the three as different models. However, I can't see an obvious difference: their methodologies look the same to me, just defined with different statements.
I feel that practically they all represent the same process model.
Can anybody explain the different models better?
Craig Larman wrote extensively on this topic and I suggest his famous paper Iterative and Incremental Development: A Brief History (PDF) and his book Agile and Iterative Development: A Manager's Guide.
Here is how I would summarize things:
Incremental Development
Incremental Development is a practice where the system functionalities are sliced into increments (small portions). In each increment, a vertical slice of functionality is delivered by going through all the activities of the software development process, from the requirements to the deployment.
Incremental Development (adding) is often used together with Iterative Development (redo) in software development. This is referred to as Iterative and Incremental Development (IID).
Evolutionary method
The terms evolution and evolutionary were introduced by Tom Gilb in his book Software Metrics, published in 1976, where he wrote about EVO, his practice of IID (perhaps the oldest). Evolutionary development focuses on early delivery of high value to stakeholders and on obtaining and utilizing feedback from stakeholders.
In Software Development: Iterative & Evolutionary, Craig Larman puts it like this:
Evolutionary iterative development implies that the requirements, plan, estimates, and solution evolve or are refined over the course of the iterations, rather than fully defined and “frozen” in a major up-front specification effort before the development iterations begin. Evolutionary methods are consistent with the pattern of unpredictable discovery and change in new product development.
And then discusses further evolutionary requirements, evolutionary and adaptive planning, evolutionary delivery. Check the link.
Spiral model
The Spiral Model is another IID approach, formalized by Barry Boehm in the mid-1980s as an extension of the Waterfall model to better support iterative development; it puts a special emphasis on risk management (through iterative risk analysis).
Quoting Iterative and Incremental Development: A Brief History:
A 1985 landmark in IID publications was Barry Boehm's "A Spiral Model of Software Development and Enhancement" (although the more frequent citation date is 1986). The spiral model was arguably not the first case in which a team prioritized development cycles by risk: Gilb and IBM FSD had previously applied or advocated variations of this idea, for example. However, the spiral model did formalize and make prominent the risk-driven-iterations concept and the need to use a discrete step of risk assessment in each iteration.
What now?
Agile Methods are a subset of IID and evolutionary methods and are preferred nowadays.
References
Iterative and Incremental Development: A Brief History - Craig Larman, Victor R. Basili (June 2003)
Software Development: Iterative & Evolutionary - Craig Larman
Incremental versus iterative development - Alistair Cockburn
Iterative and incremental development
Software development process
T. Gilb, Software Metrics, Little, Brown, and Co., 1976 (out of print).
B. Boehm, “A Spiral Model of Software Development and Enhancement,” Proc. Int’l Workshop Software Process and Software Environments, ACM Press, 1985; also in ACM Software Eng. Notes, Aug. 1986, pp. 22-42.
These concepts are usually poorly explained.
Incremental is a property of the work products (documents, models, source code, etc.), and it means that they are created little by little rather than in a single go. For example, you create a first version of your class model during requirements analysis, then augment it after UI modelling, and then you even extend it more during detailed design.
Evolutionary is a property of deliverables, i.e. work products that are delivered to the users, and in this regard it is a particular kind of "incremental". It means that whatever is delivered is delivered as early as possible in an initial, not fully functional form, and then re-delivered every so often, each time with more functionality. This often implies an iterative lifecycle.
[An iterative lifecycle, by the way, refers to the tasks that you carry out (as opposed to "incremental", which refers to the products; this is the view adopted by SEMAT), and it means that you perform tasks of the same type over and over. For example, in an iterative lifecycle you would find yourself doing design, then coding, then unit testing, then release, and then the same things again, over and over. Please note that iterative and incremental do not imply each other; any combination of the two is possible.]
The spiral model for lifecycles is a model proposed by Barry Boehm that combines aspects of waterfall with innovative advances such as an iterative approach and built-in quality control.
For the concepts of "work product", "task", "lifecycle", etc. please see ISO/IEC 24744.
Hope this helps.
This is the verbatim definition from ISO 24748-1:2016 (Systems and Software Engineering Life Cycle Management):
There are many different development strategies that can be applied to system and software projects. Three of these strategies are summarized below:
a) Once-through. The “once-through” strategy, also called “waterfall,” consists of performing the development process a single time. Simplistically: determine user needs, define requirements, design the system, implement the system, test, fix and deliver.
b) Incremental. The “incremental” strategy determines user needs and defines the system requirements, then performs the rest of the development in a sequence of builds. The first build incorporates part of the planned capabilities, the next build adds more capabilities, and so on, until the system is complete.
c) Evolutionary. The “evolutionary” strategy also develops a system in builds but differs from the incremental strategy in acknowledging that the user need is not fully understood and all requirements cannot be defined up front. In this strategy, user needs and system requirements are partially defined up front, and then are refined in each succeeding build.
Hope this helps. Tati

Machine learning/information retrieval project

I’m reading towards an M.Sc. in Computer Science and have just completed the first year of the course (it is a two-year course). Soon I have to submit a proposal for the M.Sc. project. I have selected the following topic:
“Suitability of machine learning for document ranking in information retrieval systems”. Researchers have been using various machine learning algorithms for ranking documents. So as the first phase of the project I will do a complete literature survey and identify the advantages/disadvantages of current approaches. In the second phase of the project I will propose a new (modified) algorithm to overcome the limitations of current approaches.
My question is whether this type of project is suitable as an M.Sc. project. Moreover, if somebody has an interesting idea in the information retrieval field, would you be willing to share it with me?
Thanks
Ranking is always the hardest part of any information retrieval system. I think it is a very good topic, but you have to take care to define the scope of the work as soon as possible. You will probably not be able to develop a new IR engine, but rather build a prototype based on, e.g., Apache Lucene.
Currently there are many datasets, including the Stack Overflow data dump, which provide all the information you need to define a rich feature vector (number of points, time, topics mined from previous questions, popularity of a tag) for your machine learning ranking algorithm. In this part of the work you could, e.g., classify types of features (e.g., user-specific features, semantic features such as a software name in the title) and perform a series of experiments to learn which features are most important for a given dataset and which are not.
The second direction for such a project is how to perform the learning efficiently. The reason is the quantity of data on the web and in community forums, and the rate of change in a forum (important if you use community-specific features), e.g., changes in technologies, new software releases, etc.
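To make the feature-experiment idea concrete, here is a toy pointwise ranking sketch (all feature names, vectors, and relevance labels are invented): a perceptron-style update learns weights over the feature vector, and the learned weight magnitudes give a first hint at which features matter:

```python
# Hypothetical features for each candidate document in a result list.
features = ["votes", "answerer_reputation", "tag_popularity"]
data = [  # (feature vector, relevant?)
    ([5.0, 3.0, 1.0], 1),
    ([0.0, 1.0, 2.0], 0),
    ([4.0, 2.0, 0.0], 1),
    ([1.0, 0.0, 3.0], 0),
]

w = [0.0] * len(features)
for _ in range(20):  # a few passes over the toy set
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        pred = 1 if score > 0 else 0
        if pred != y:  # perceptron update on mistakes only
            for i in range(len(w)):
                w[i] += (y - pred) * x[i]

# Rank documents by learned score; relevant ones should come first.
ranking = sorted(data, key=lambda d: -sum(wi * xi for wi, xi in zip(w, d[0])))
print([y for _, y in ranking])
```

A real project would use a proper learning-to-rank method (pairwise or listwise) and a ranking metric such as NDCG, but the experiment loop, swapping features in and out and re-measuring, has the same shape.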
There are many other topics related to search and machine learning. The best idea is to search on scholar.google.com for the recent survey papers on ranking, machine learning, and search to learn what is the state-of-the-art. The very next step would be to talk with your MSc supervisor.
Good luck!
Everything you said is good and should be done, but you forgot the most important part:
Prove that your algorithm is better and/or faster than other algorithms, with good experiments and maybe some statistics (p-value, confidence interval).
If you do that and convince people that your algorithm is useful you surely will not fail :)
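As a sketch of that statistics step, a bootstrap confidence interval over per-query differences is easy to compute even without a stats library. The per-query metric values below are invented for illustration:

```python
import random

random.seed(0)  # reproducible resampling
# Hypothetical per-query metric (e.g. NDCG) for a baseline and the new algorithm.
baseline = [0.61, 0.55, 0.70, 0.48, 0.66, 0.59, 0.62, 0.50]
new_algo = [0.65, 0.60, 0.71, 0.55, 0.64, 0.66, 0.63, 0.58]
diffs = [n - b for n, b in zip(new_algo, baseline)]

# Bootstrap a 95% confidence interval for the mean improvement.
means = []
for _ in range(10_000):
    sample = [random.choice(diffs) for _ in diffs]
    means.append(sum(sample) / len(sample))
means.sort()
lo, hi = means[249], means[9749]
print(f"mean improvement {sum(diffs) / len(diffs):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, you have reasonable evidence the improvement is real; with only a handful of queries, as here, expect a wide interval.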

Useful Entry-Level Resources for Machine Learning [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form.
Closed 11 years ago.
I am looking for some entry level posts about machine learning. Can anyone suggest anything for somebody new to this subject?
By 'posts' I'll assume you mean any resource available online.
I recommend two groups of resources:
First, find machine learning blogs in which the blogger's preferred language is the same as yours. In my experience, reading a blog post on a single subject (e.g., SVM) while reading through the author's source code supplied with the post is about the best way for a programmer to learn ML. A couple of excellent examples are the blogs Smell the Data (Python) and Igvita (Ruby). Both contain (at least) several posts describing, tutorial-style, specific ML techniques, including close walk-throughs of the posted source code. Igvita, in particular, has excellent tutorials with working Ruby code on Support Vector Machines, Decision Trees, Singular Value Decomposition, and Ensemble Methods; as with the other blog I mentioned, an upper-level undergraduate course could be taught based solely on its ML posts.
Second, I highly recommend VideoLectures.net.
This is by far the best source, whether free or paid, that I have found for very high-quality (both in video quality and in presentation content) video lectures and tutorials on machine learning. The target audience ranges from beginner (some lectures are specifically tagged as "tutorials") to expert; most seem to be somewhere in the middle.
All of the lectures and tutorials are taught by highly experienced professionals and academics, and in many instances the lecturer is the leading authority on the topic he or she is lecturing on. The site is also 100% free.
The one disadvantage is that you cannot download the lectures and store them in, e.g., iTunes; however, nearly every lecture has a set of slides which you can download (or, conveniently, view online as you watch the presentation).
A few that I've watched and can recommend highly:
Semi-Supervised Learning Approaches
Introduction to Machine Learning
Gaussian Process Basics
Graphical Models
k-Nearest Neighbor Models
Introduction to Kernel Methods
Machine learning is such a broad topic. I would start with Wikipedia and focus in on areas that you find interesting.
Also, you could visit the Stack Exchange site for machine learning.
Stanford published videos and materials from a set of engineering courses at http://see.stanford.edu
One course by Andrew Ng focuses on Machine Learning techniques
http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
The course is also available on iTunes U
It's a really good course from someone who obviously knows the field well, but he spends a lot of the time deriving mathematical results, so if you're rusty in linear algebra or probability/statistics, you might need a refresher first.
I think the best that I know of are:
Stanford's Lectures on Machine Learning
Books: (In decreasing order of ease of understanding - IMHO)
Machine Learning: An Algorithmic Perspective by Stephen Marsland
Pattern Recognition and Machine Learning by Christopher Bishop
Introduction to Machine Learning - Ethem Alpaydin
