SPSS Statistics vs Modeler

I have a problem statement: predict loan defaulters given 5 to 10 years of historical data. I was told that we could use SPSS, but I'm new to this. I also see that there are two products, Statistics and Modeler. Are they two different tools? Which one should I explore for my problem? Can someone help?

They are different tools from the same company. Modeler is more intuitive, but Statistics is great software as well. Both can make the predictions you need.
You can try time series models or multiple regression to do that. Also take a look at the official manuals from IBM to help with your decision.
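Outside of either SPSS product, here is a minimal sketch of the same idea in Python with scikit-learn, using logistic regression (the usual regression variant for a binary default/no-default outcome). The file name and feature columns are hypothetical placeholders, not part of the original question:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Hypothetical file: one row per loan, "defaulted" is a 0/1 label
df = pd.read_csv("loans_history.csv")
X = df[["income", "loan_amount", "term_months", "age"]]  # assumed columns
y = df["defaulted"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out loans before trusting the predictions
print(classification_report(y_test, model.predict(X_test)))
```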

Related

Using NLP or other algorithms to match two strings

My goal in a project is to correctly assign medications. I have a large catalog at my disposal for this purpose. However, the medications do not appear there in exactly the same spelling: additional information may have been added, or parts of the prescription may have been abbreviated.
I was already able to implement a possible algorithm using the Levenshtein distance (token_set_ratio).
Because of the sometimes long additional information, this algorithm assigns wrong medications, so I wanted to ask if there are better algorithms for comparing strings. For example, does it make sense to use machine learning algorithms or NLP techniques? This is a relatively new area for me, and I would appreciate any ideas or inspiration.
This sounds like a classic deduplication task. For example, have a look at dedupe. This tool lets you annotate training examples and learns when two items refer to the same thing. It can be used with as few as 10 training samples and has an active-learning approach built in.
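As a rough sketch of the dedupe workflow (this assumes the 2.x API; the records and the "name" field are invented examples, and in practice you would load the full catalog and prescriptions):

```python
import dedupe

# Invented records keyed by an id
data = {
    0: {"name": "Ibuprofen 400 mg film-coated tablets"},
    1: {"name": "ibuprofen 400mg tabl. (20 pcs)"},
    2: {"name": "Paracetamol 500 mg"},
}

fields = [{"field": "name", "type": "String"}]
deduper = dedupe.Dedupe(fields)

deduper.prepare_training(data)
dedupe.console_label(deduper)  # interactively label ~10+ candidate pairs
deduper.train()

# Group records the model believes refer to the same medication
clusters = deduper.partition(data, 0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```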

Data prediction from previous history using AI/ML

I am looking for solutions where I can automatically approve or disapprove different supplier invoices based on historical data.
Let's say I get an invoice from an HP laptop supplier; based on the previous data, I have to approve or reject that invoice.
Basically, I want to make a decision or prediction based on the historical data already available, using artificial intelligence, machine learning, or any other cloud service.
This isn't a direct question, but you can start by looking into various methods of classification. There is a huge amount of material available online. Try reading about k-Nearest Neighbors, Naive Bayes, k-means, etc. to get an idea of how algorithms in the machine learning domain work. Once you understand what is written in the documentation, start implementing them. You will face a lot of problems, which you can search for online; I'm sure you will find most of them answered on this site.
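To make that concrete, here is a minimal sketch of one of those methods (k-Nearest Neighbors via scikit-learn) applied to the invoice example; the features and numbers are invented purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Invented historical invoices: [amount, days_to_due, supplier_past_rejections]
X = np.array([
    [1200.0, 30, 0],
    [1150.0, 30, 0],
    [9800.0,  7, 3],
    [8700.0,  5, 2],
])
y = np.array([1, 1, 0, 0])  # 1 = approved, 0 = rejected

# k-NN is distance-based, so scale the features first
scaler = StandardScaler().fit(X)
clf = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)

new_invoice = np.array([[1300.0, 28, 0]])
print("approve" if clf.predict(scaler.transform(new_invoice))[0] == 1 else "reject")
```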

Finding items which are similar

I have a big database of many items from a retail company. If I would like to find the items that are similar to any particular item, can I use Pearson correlation in Spark ML to do that? Is there any better algorithm for it? And how do I make sure the machine also learns as it evolves?
Edit: I implemented a MapReduce program to find the distance between various features. But how can I turn it into a machine learning solution? Suppose I let the program identify the correct neighbor; how can the program make use of this learning next time?
Using the Azure ML recommendation model, it is easy to perform tasks such as finding "related purchase items"; it would be a quick and easy start.
https://gallery.cortanaintelligence.com/MachineLearningAPI/Recommendations-2
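On the original Spark ML question: Pearson correlation can indeed serve as an item-item similarity measure if you arrange the data so that each vector entry corresponds to an item (e.g., one row per customer). A minimal sketch with pyspark.ml.stat.Correlation, using made-up purchase counts:

```python
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("item-similarity").getOrCreate()

# Made-up data: one row per customer, one vector entry per item
# (entries are purchase counts for item A, item B, item C)
rows = [
    (Vectors.dense([2.0, 0.0, 1.0]),),
    (Vectors.dense([3.0, 0.0, 2.0]),),
    (Vectors.dense([0.0, 4.0, 0.0]),),
    (Vectors.dense([0.0, 5.0, 1.0]),),
]
df = spark.createDataFrame(rows, ["features"])

# Pearson correlation between columns = item-item similarity matrix
matrix = Correlation.corr(df, "features", method="pearson").head()[0]
print(matrix)
```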

Should I use Mahout for this?

I want to recommend items that are tagged and are categorized into three price categories (cheap, regular, and expensive). I know that recommendation can be achieved with Mahout, but here's why I don't know how to use it:
Mahout is based on other users' opinions, but all of the new items that I want to recommend are new ones that don't have any preferences set yet.
Is Mahout the right tool for this? Is this content-based (which Mahout doesn't support yet?), or should I use classification?
Thanks!
Since I've never built any recommender system, do not take this answer very seriously (no one has answered yet, so I'll try):
A recommendation system has to be built on some already known (or at least partially known) data. If you have only new (unseen) items, the only possibility is to use some clustering algorithm to build clusters first.
If those clusters turn out to be reasonable, they can be used to train a recommendation system.
Mahout is just a tool that implements various ML methods. You can use other tools like Weka, R, ...
If you have no data at all about a new user, there's really nothing you can do to make recommendations, no matter what you do. There is zero input that would differentiate the person from anyone else.
Good systems should however be able to do something reasonable after the first input is available.
This is not a classification problem by nature, no. It is also not a clustering problem, other answers notwithstanding.
The price categories are not core to any recommendation process you would use. You presumably have other data; what is it? That's what is important.
Finally, whether or not to use Mahout is a matter of taste. You would use it if you want to use Java and Hadoop, and in turn you would only consider Hadoop if you had very large input; few people have that much data (at least ~10M data points).
(Well, not quite -- my recommender pieces in Mahout pre-date Hadoop and are for on-line, smaller-scale applications. You might indeed be interested in this, if you are working in Java.)
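Since the items in the question are tagged, a content-based fallback for the cold-start case could compare items by their tags alone. A minimal sketch with scikit-learn (TF-IDF over tag strings plus cosine similarity; the items and tags are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented items: each described by its tags and a price category
items = [
    {"name": "Item A", "tags": "outdoor camping tent", "price": "cheap"},
    {"name": "Item B", "tags": "outdoor hiking boots", "price": "regular"},
    {"name": "Item C", "tags": "kitchen knife steel", "price": "expensive"},
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(item["tags"] for item in items)

# Pairwise cosine similarity between item tag vectors
sim = cosine_similarity(matrix)

# Most similar item to Item A (excluding itself)
scores = sim[0].copy()
scores[0] = -1.0
print("Closest to Item A:", items[scores.argmax()]["name"])
```

In line with the answer above, the price categories would work better as a post-filter on the ranked results than as a similarity feature.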

Machine learning/information retrieval project

I'm studying towards an M.Sc. in Computer Science and have just completed the first year of the course (it is a two-year course). Soon I have to submit a proposal for the M.Sc. project, and I have selected the following topic:
"Suitability of machine learning for document ranking in information retrieval systems." Researchers have been using various machine learning algorithms for ranking documents, so as the first phase of the project I will do a complete literature survey and identify the advantages/disadvantages of current approaches. In the second phase of the project I will propose a new (modified) algorithm to overcome the limitations of current approaches.
My question is whether this type of project is suitable as an M.Sc. project. Moreover, if somebody has an interesting idea in the information retrieval field, would you be willing to share it with me?
Thanks
Ranking is always the hardest part of any information retrieval system. I think it is a very good topic, but you have to take care to define the scope of the work as soon as possible. You will probably not be able to develop a new IR engine, but rather build a prototype based on, e.g., Apache Lucene.
Currently there are a lot of datasets, including the Stack Overflow data dump, which provide all the information you need to define a rich feature vector (number of points, time, topics mined from previous questions, popularity of a tag) for your machine learning ranking algorithm. In this part of the work you could, e.g., classify types of features (user-specific features, semantic features such as a software name in the title) and perform a series of experiments to learn which features are most important for a given dataset and which are not.
The second direction for such a project could be how to perform the learning efficiently. The reason is the quantity of data on the web and in community forums, as well as changes within the forum (important if you use community-specific features), e.g., changes in technologies, new software releases, etc.
There are many other topics related to search and machine learning. The best idea is to search on scholar.google.com for recent survey papers on ranking, machine learning, and search to learn the state of the art. The very next step would be to talk with your M.Sc. supervisor.
Good luck!
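As an illustration of what a feature vector for a machine learning ranking algorithm can look like, here is a minimal pointwise learning-to-rank sketch with scikit-learn; the features and relevance labels are invented, and a real project would use pairwise or listwise methods and proper IR metrics:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Invented features per (query, document) pair:
# [BM25 score, answer points, tag popularity, document age in days]
X = np.array([
    [12.3, 45, 0.9, 120],
    [10.1,  3, 0.2, 900],
    [ 8.7, 12, 0.7,  30],
    [ 3.2,  0, 0.1, 400],
])
y = np.array([3.0, 1.0, 2.0, 0.0])  # invented graded relevance labels

# Pointwise LTR: regress relevance, then sort documents by prediction
model = GradientBoostingRegressor().fit(X, y)
ranking = np.argsort(-model.predict(X))
print("Ranked document indices:", ranking)
```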
Everything you said is good and should be done, but you forgot the most important part:
Prove that your algorithm is better and/or faster than other algorithms, with good experiments and maybe some statistics (p-values, confidence intervals).
If you do that and convince people that your algorithm is useful, you surely will not fail :)
