Publicly Available Spam Filter Training Set [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
I'm new to machine learning, and for my first project I'd like to write a naive Bayes spam filter. I was wondering if there are any publicly available training sets of labeled spam/not spam emails, preferably in plain text and not a dump of a relational database (unless they pretty-print those?).
I know such a publicly available database exists for other kinds of text classification, specifically news article text. I just haven't been able to find the same sort of thing for emails.

Here is what I was looking for: http://untroubled.org/spam/
This archive has around a gigabyte of compressed spam messages accumulated from 1998 to 2011. Now I just need non-spam email, so I'll pull that from my own Gmail account using the getmail program and the tutorial at mattcutts.com.
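To turn those two sources into training data, a bit of file plumbing is enough. Below is a minimal sketch in Python, assuming the spam archive has been unpacked into a spam/ directory and the Gmail messages saved into a ham/ directory (both paths are my own placeholders), which collects (text, label) pairs ready for a naive Bayes learner.

```python
# Minimal sketch: build a labelled (text, label) corpus from two directories
# of raw email files. The "spam/" and "ham/" paths are assumptions about
# where the downloaded messages were unpacked, not part of either archive.
import email
from email import policy
from pathlib import Path

def load_messages(directory, label):
    examples = []
    for path in Path(directory).rglob("*"):
        if not path.is_file():
            continue
        try:
            with open(path, "rb") as fh:
                msg = email.message_from_binary_file(fh, policy=policy.default)
            body = msg.get_body(preferencelist=("plain",))
            text = body.get_content() if body else ""
        except Exception:
            continue  # spam is often malformed; skip unparsable messages
        examples.append((text, label))
    return examples

corpus = load_messages("spam", "spam") + load_messages("ham", "ham")
print(len(corpus), "labelled messages")
```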

Sure, there's Spambase, which is, as far as I'm aware, the most widely cited spam data set in the machine learning literature.
I have used this data set many times; each time I am impressed by how much effort has been put into its formatting and documentation.
A few characteristics of the Spambase set:
4601 data points, all complete
each comprised of 58 features (attributes)
each data point is labelled 'spam' or 'no spam'
approx. 40% are labelled spam
all of the features are continuous (vs. discrete)
a representative feature: average length of an uninterrupted sequence of capital letters
Spambase is archived in the UCI Machine Learning Repository; it's also available on the website for the excellent ML/statistical learning text, The Elements of Statistical Learning by Hastie et al.
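To give an idea of how little glue code the raw file needs, here is a minimal sketch (assuming spambase.data has been downloaded from the UCI repository into the working directory) that feeds the 57 continuous features and the label in the last column to a naive Bayes classifier via scikit-learn:

```python
# Minimal sketch: train a naive Bayes classifier on Spambase.
# Assumes spambase.data (comma-separated, 58 columns, label last)
# has been downloaded from the UCI repository.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = np.loadtxt("spambase.data", delimiter=",")
X, y = data[:, :-1], data[:, -1]          # 57 features, 0/1 label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = GaussianNB().fit(X_train, y_train)  # Gaussian NB because the features are continuous
print("held-out accuracy:", clf.score(X_test, y_test))
```

Gaussian naive Bayes is used here simply because all the features are continuous; it is a baseline choice, not the only sensible one.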

SpamAssassin has a public corpus of both spam and non-spam messages, although it hasn't been updated in a few years. Read the readme.html file to learn what's there.

You might consider taking a look at the TREC spam/ham corpus (which I think is the collection of emails from Enron that was made public from the court case). TREC generally runs a bunch of competitive text processing tasks, so it might give you some references for comparison.
The downside is that they're stored in raw mbox format, though there are parsers available in many languages (Apache Tika is a good example).
The webpage below isn't TREC's own, but it seems to be a good overview of the task, with links to the data: http://plg.uwaterloo.ca/~gvcormac/spam/
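If you would rather avoid Tika, Python's standard-library mailbox module is enough to iterate over an mbox file; a minimal sketch (the file name is a placeholder for whichever corpus file you download):

```python
# Minimal sketch: iterate over an mbox file with the standard library.
# "trec07p.mbox" is a placeholder name for whichever mbox you downloaded.
import mailbox

mbox = mailbox.mbox("trec07p.mbox")
for msg in mbox:
    subject = msg["subject"] or ""
    # get_payload(decode=True) returns bytes for single-part messages;
    # multipart messages have to be walked part by part.
    if msg.is_multipart():
        parts = [p.get_payload(decode=True) or b""
                 for p in msg.walk() if p.get_content_type() == "text/plain"]
        body = b"\n".join(parts)
    else:
        body = msg.get_payload(decode=True) or b""
    print(subject, len(body))
```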

A more modern spam training set can be found on Kaggle. Moreover, you can test the accuracy of your classifier on their website by uploading your results.

I also have an answer: here you can find a daily refreshed Bayesian database for initial training, as well as a daily archive of captured spam. You will find instructions on how to use it on the site.

Related

Fine-tuning GPT-2/3 on new data [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I'm trying to wrap my head around training OpenAI's language models on new data sets. Is there anyone here with experience in that regard?
My idea is to feed either GPT-2 or GPT-3 (though I don't have API access to 3) a textbook, train it on that text, and be able to "discuss" the content of the book with the language model afterwards. I don't think I'd have to change any of the hyperparameters; I just need more data in the model.
Is it possible?
Thanks a lot for any (also conceptual) help!
Presently GPT-3 has no way to be fine-tuned as we can do with GPT-2 or GPT-Neo / NeoX, because the model is kept on OpenAI's servers and requests have to be made via the API. A Hacker News post says that fine-tuning GPT-3 is planned or under construction.
Having said that, OpenAI's GPT-3 provides an Answers API which you can supply with context documents (up to 200 files / 1 GB). The API can then be used as a way to hold a discussion with it.
EDIT:
OpenAI has recently introduced a fine-tuning beta:
https://beta.openai.com/docs/guides/fine-tuning
Following the documentation at that link is now the best answer to this question.
You can definitely retrain GPT-2. Are you only looking to train it for language generation purposes, or do you have a specific downstream task you would like to adapt GPT-2 to?
Both these tasks are possible and not too difficult. If you want to train the model for language generation, i.e. have it generate text on a particular topic, you can train the model exactly as it was trained during the pre-training phase. This means training it on a next-token prediction task with a cross-entropy loss function. As long as you have a dataset and decent compute power, this is not too hard to implement.
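As an illustration only, here is a minimal sketch of that next-token-prediction fine-tuning using the Hugging Face transformers library; the file name book.txt, the block size and the training arguments are my own assumptions, not something prescribed by this answer:

```python
# Minimal sketch: fine-tune GPT-2 on a plain-text book as a causal LM.
# "book.txt", block_size and the TrainingArguments values are placeholders.
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# TextDataset chunks the file into fixed-length blocks of token ids.
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="book.txt",
                            block_size=512)

# mlm=False -> causal language modelling: labels are the inputs shifted by one
# token, so the Trainer minimises the usual cross-entropy next-token loss.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(output_dir="gpt2-book",
                         num_train_epochs=3,
                         per_device_train_batch_size=2)

Trainer(model=model, args=args,
        data_collator=collator,
        train_dataset=train_dataset).train()
```

After training, model.generate() can be used to sample continuations on the book's topics, though as noted below that is still text generation rather than a true question-answering dialogue about the book.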
When you say, 'discuss' the content of the book, it seems to me that you are looking for a dialogue model/chatbot. Chatbots are trained in a different way and if you are indeed looking for a dialogue model, you can look at DialoGPT and other models. They can be trained to become task-oriented dialog agents.

Use machine learning to analyze sports bets [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 5 years ago.
Let's say I have a database with over 1 million bets (all kinds of sports) made by a couple of thousand users over a period of 2 years (and still growing).
These data are just lying around doing nothing, so I wondered whether it would be possible to use something like https://www.tensorflow.org/, do a bit of tinkering, and have it analyze all the bets in the database and learn some patterns from them: what's good and what's not.
The point is that we don't have the resources to employ dozens of people for who knows how long to write some complicated software from the ground up. So I was thinking we could use some module from TensorFlow and go from there.
I would then feed the network the open bets currently in the system (bets on matches that are about to be played) and it would pick what I should bet on; for example, there is a 90% chance this bet will win, because 10 very successful players made this bet and they have a very high success rate when betting on this particular sport.
We have lots of experienced users, and they make lots of money from betting. So the system could be trained on the data we have, and then it would know, for example, that if user A bets on this league/team, it's very likely he will win.
The question is, where do we go from here? Can anybody point us in the right direction? Or is this just too difficult for 2 people to do in a few months? Can we use some pre-programmed solutions, like TensorFlow?
Without having a look at the data it is impossible to suggest what direction your next steps should take, but in any case your first step should be to explore your data thoroughly, create a model on a small subset of the data, and test your hypotheses.
Overall you can try to:
Use Python or R to load and clean the data
Take a random subset of the data (some 10,000 rows) and create a simple model using an SVM or a Random Forest, framing it as a Win/Lose classification problem (as sketched at the end of this answer)
Test your results and verify your hypotheses on some held-out data
Explore your data to see if you can generate better features
Design a small neural network first, and only then think about a deep neural network using TensorFlow or Keras, etc.
Have a look at this: https://hackernoon.com/how-to-create-your-own-machine-learning-predictive-system-in-the-nba-using-python-7189d964a371
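To make the subset-and-classify step concrete, here is the sketch referred to in the list above, using scikit-learn; the file name and the column names (odds, stake, user_win_rate, won) are hypothetical placeholders for whatever the bets table actually contains:

```python
# Minimal sketch: Win/Lose classification on a small random subset of bets.
# The CSV file and column names are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

bets = pd.read_csv("bets.csv").sample(n=10_000, random_state=0)

features = ["odds", "stake", "user_win_rate"]   # assumed numeric columns
X, y = bets[features], bets["won"]              # "won" = 1 if the bet paid out

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```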
Yes, this is possible but can be more difficult than it appears.
Consider Microsoft's Cortana, which (while only picking whether a game will be won outright, not against the spread) is only approx. 63% accurate; that is quite good, but not exactly the 90% you mention in your question (1).
The size of your database should be great for ANN models. It would be a very interesting project for sure!
To your question "where I go from here..." my answer is to simply explore the data in RStudio or using a cloud service such as Microsoft's Azure ML Studio (2) or Amazon's Machine Learning services (3).
Good luck!
Ref. 1: http://www.businessinsider.com/nfl-picks-microsoft-cortana-elo-week-5-2017-10
Ref. 2: https://studio.azureml.net/
Ref. 3: https://aws.amazon.com/amazon-ai/

Extract relevent keywords from job advertisements [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
My friend asked me if I could write a program capable of identifying relevant keywords from job adverts, given 3 variables: industry, job title and the job posting text (example below).
The problem we are trying to address, from a job seeker's point of view, revolves around having the correct keywords in your resume for each job application, thereby increasing your chances of getting shortlisted for an interview. This is especially important when the first-stage screening is done by bots scanning for keywords.
Initially I was considering a relational database containing all industries, all job titles and their related keywords. This however is an enormous task and the data in progressive fields like information and bio technology would quickly become stale.
It seems machine learning and natural language processing is unavoidable.
Consider below job advert for a bank seeking a teller:
Are you an experienced Bank Teller seeking that perfect work life
balance? If you’re looking for Casual Hours and have an absolute
passion for customer service then this is the role for you!
Our client services Queensland Public Servants (particularly
Queensland Police); and is currently seeking a Bank Teller to join
their Brisbane CBD team to start ASAP.
The successful candidate will be required to work from 9:30am to
2:30pm, Monday to Friday therefore 25 hours per week. Based on
experience the successful candidate will be paid (approximately) $25 -
$27 + superannuation per hour.
This position is casual/temporary with the potential to for a
permanent placement (based on performance/length of assignment etc.).
DUTIES & RESPONSIBILITIES:
As a Bank Teller your will be required to:
Attend to customers in a exceptional professional and efficient
manner; Processing basic transactions such as deposits and
withdrawals; Complete complex transactions such as loans and
mortgages; Pass referrals onto sales team (NO SALES); Large amounts of
cash handling; and Ensuring high attention to detail is at the top of
your list! SKILLS & EXPERIENCED:
The successful candidate will have the following:
Previous teller experience (within last 5 years) IDEAL; Previous
customer service experience (within finance) IDEAL; Ability to work in
a fast paced and time pressured environment; Excellent presentation
and attitude; Exceptional attention to detail; Ability to quickly
‘master’ multiple software packages; and Strong time management skills
and ability to work autonomously. If you boast to have fantastic
customer service skills, a professional manner, and preferrably teller
experience we would LOVE to hear from you!
If I was the hiring manager (or a bot) I would probably look for these keywords in the resume:
teller, transactions, deposits, withdrawals, loans, mortgages, customer
service, time management
How would you attack this problem?
If you have access to lots of advertisements, group them by job title and then run a topic modelling algorithm such as Latent Dirichlet Allocation (LDA) on each group. This will produce the keywords.
For more information see Relink who does exactly what you are trying to do. They provide an outline of the process here:
The Science Behind Relink - Organizing Job Postings
Here is a paper that may help: Modeling Career Path Trajectories.
For a technical paper on just LDA see Latent Dirichlet Allocation.
For an article with sample Python code using the gensim library see Experiments on the English Wikipedia. This is an interesting article as it deals with a huge corpus, a dump of the entire Wikipedia database, and talks about ways of improving execution times using distributed LDA on a cluster of computers. The sample code also shows how to apply Latent Semantic Analysis and compares the results with LDA.
The following article and sample code by Jordan Barber, Latent Dirichlet Allocation (LDA) with Python, uses NLTK to create a corpus and gensim for LDA. This code is more adaptable to other applications than the Wikipedia code.
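To make the LDA step concrete, here is a minimal sketch with gensim in the spirit of the articles above; the ads list, the pre-processing and the number of topics are all placeholders:

```python
# Minimal sketch: LDA over one group of job ads with gensim.
# `ads` stands in for the advertisement texts of a single job title;
# the tokenisation here is deliberately crude.
from gensim import corpora, models

ads = [
    "bank teller processing deposits withdrawals customer service",
    "teller casual hours loans mortgages cash handling",
]  # hypothetical, already-cleaned ad texts

texts = [ad.lower().split() for ad in ads]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)
for topic_id, words in lda.print_topics(num_words=8):
    print(topic_id, words)   # top words per topic ~ candidate keywords
```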

Which datamining tool to use? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Can somebody explain to me the main pros and cons of the best-known open-source data mining tools?
Everywhere I read that RapidMiner, Weka, Orange and KNIME are the best ones (see, for example, this blog post).
Can somebody give a quick technical comparison in a small bullet list?
My needs are the following:
It should support classification algorithms (Naive Bayes, SVM, C4.5, kNN).
It should be easy to use from Java.
It should have understandable documentation.
It should have reference production projects or use cases it is being used in.
Some additional benchmark comparison would be welcome, if possible.
Thanks!
Firstly, there are pros and cons for each of the tools on your list. From my personal experience, however, I would suggest Weka: it is incredibly simple to embed in your own Java application via the weka jar file, and it has its own self-contained tools for data mining.
RapidMiner seems to be a commercial product offering an end-to-end solution; however, most of the notable examples of external implementations built on RapidMiner are in Python or R script rather than Java.
Orange offers tools that seem to be targeted primarily at people with less need for custom integration into their own software but who want an easier time with user interaction; it is written in Python, the source is available, and user add-ons are supported.
KNIME is another commercial platform offering end-to-end solutions for data mining and analysis, providing all the tools required. It has various good reviews around the internet, but I haven't used it enough to advise you or anyone else on its pros and cons.
See here for KNIME vs Weka:
Best data mining tools
As I said, Weka is my personal favorite as a software developer, but I'm sure other people have varying reasons and opinions on why to choose one over the other. I hope you find the right solution for you.
Also, per your requirements, Weka supports the following:
Naive Bayes
SVM
C4.5 (as the J48 implementation)
kNN
I have tried Orange and Weka with a 15K-record database and ran into memory-management problems with Weka: it needed more than 16 GB of RAM, while Orange could manage the database without using nearly that much. Once Weka reaches the maximum amount of memory, it crashes, even if you set more memory in the ini file telling the Java virtual machine to use more.
I recently evaluated many open-source projects, comparing and contrasting them with regard to the decision-tree machine learning algorithm. Weka and KNIME were included in that evaluation. I covered the differences in algorithm, UX, accuracy and model inspection. You might choose one or the other depending on which features you value most.
I have had positive experience with RapidMiner:
a large set of machine learning algorithms
machine learning tools - feature selection, parameter grid search, data partitioning, cross validation, metrics
a large set of data manipulation algorithms - input, transformation, output
applicable to many domains - finance, web crawling and scraping, NLP, images (very basic)
extensible - one can send data to and receive data from other technologies: R, Python, Groovy, shell
portable - can be run as a Java process
developer friendly (to some extent, could use some improvements) - logging, debugging, breakpoints, macros
I would have liked to see something like RapidMiner in terms of user experience, but with the underlying engine based on Python technologies: pandas, scikit-learn, spaCy etc. Preferably something that would allow moving back and forth between GUI and code.

Looking for training data for music accompaniment [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I am building a system that uses machine learning to generate an accompanying melody in real time as a leading melody is being played. It uses a type of recurrent neural network, and at every step it tries to predict the next note on the accompanying track. At this point I am satisfied with just working with MIDI files.
I am having serious trouble finding training data. My original idea was to just download MIDI files from sites such as mididb and convert them to CSV, but the problem is that it's hard to come up with a way to distinguish between the leading melody and the accompanying melody. Sometimes this is possible, but then again I would prefer the accompanying tracks to always come from the same (or similar) instrument(s), because different instruments are used differently (the duration and pitch of notes are very different from one instrument to another etc.) and that would just get the network really confused.
I found the Bach Chorales on the UCI Machine Learning Repository. The problem with this dataset, though, is that it only has melodies with 1 voice. I want datasets with 2 voices, one of which is the leading melody and the other the accompanying melody.
I understand that this is difficult, so any advice on how to approach the problem would be very appreciated. I have tools that convert midi files to csv format, and if you can think of certain types/genres of songs, for which it would be easy to distinguish between leading and accompanying tracks (programmatically or manually), please let me know. Any advice will be greatly appreciated.
Exciting topic. There aren't many other databases out there for data mining besides the set you mentioned, so you'll need to get a bit creative.
Have you read Jürgen Schmidhuber's approach to music composition using LSTM Recurrent Neural Networks? If not, you should definitely do so:
A First Look at Music Composition using LSTM Recurrent Neural Networks
Finding Temporal Structure in Music: Blues Improvisation with LSTM Recurrent Networks
You can browse through his work on his site
Now, the first paper created its own dataset, so you might try asking the authors. The training set of the latter paper can be seen on the webpage for that study.
The best approach, I think, is to generate your own dataset:
1) Note that they used sheet music (PDF) and audio (not only MIDI but also wav/mp3), so you might want to think about extracting chords from wav files and labeling them with possible melody harmonies manually.
2) You can search directly for individual scores instead of mining datasets, e.g. www.free-scores.com to find specific scores. You can edit them, import them into Sibelius or Finale, and convert them to MIDI in these programs. The easiest way would be to find scores written in Sibelius/Finale itself so you can export them to MIDI right away.
Edit:
One more comment on your chord/melody structure. Try to keep it simple at the beginning and maintain a format like the one in the "A First Look at..." paper: melody plus chord structure, no instruments. After this is working, you can try to reach the same results by building this representation from multiple-instrument scores. If that works, try to build the multiple-instrument scores from MIDI. If that works, start with real audio files.
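If you do end up building that representation from MIDI yourself, here is a minimal sketch using the pretty_midi library; the file name and the rule "the first two non-drum tracks are lead and accompaniment" are assumptions for illustration only:

```python
# Minimal sketch: dump two MIDI tracks (assumed lead + accompaniment)
# to CSV rows of (track, instrument, start, end, pitch). The file name
# and the track-selection rule are placeholders.
import csv
import pretty_midi

midi = pretty_midi.PrettyMIDI("song.mid")
tracks = [inst for inst in midi.instruments if not inst.is_drum][:2]

with open("song_notes.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["track", "instrument", "start", "end", "pitch"])
    for idx, inst in enumerate(tracks):
        name = pretty_midi.program_to_instrument_name(inst.program)
        for note in sorted(inst.notes, key=lambda n: n.start):
            writer.writerow([idx, name, note.start, note.end, note.pitch])
```

Deciding which track really is the lead still has to be done per file, which is exactly the labelling problem raised in the question.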
