Recommending downloads using Google Prediction

I run a download portal, and after a user downloads a file I would like to recommend other related categories. I'm thinking of using Google Prediction for this, but I'm not sure how to structure the training data. I'm considering something like:
category of the file downloaded (label), geo, gender, age
However, that seems incomplete, because the data doesn't contain any information about the file itself. I'd appreciate some advice; I'm new to ML.

Here is a suggestion that might work.
For your training data, assuming you have logs of downloads per user, create the following dataset:
download2 (serves as label), download1, features of the user, features of download1
Then train a classifier to predict this class given a download and a user; the output classes and their corresponding scores represent the downloads to recommend.
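To make the dataset construction concrete, here is a minimal sketch of turning per-user download logs into the suggested (download2, download1, user features) rows. All field names and values are hypothetical; adapt them to your own schema.

```python
from collections import defaultdict

# Raw logs: one (user_id, category) pair per download, in chronological order.
logs = [
    ("u1", "games"), ("u1", "music"), ("u1", "video"),
    ("u2", "books"), ("u2", "music"),
]

# Hypothetical per-user features: (geo, gender, age).
user_features = {
    "u1": ("US", "m", 25),
    "u2": ("DE", "f", 31),
}

def build_training_rows(logs, user_features):
    """For each consecutive pair of downloads by the same user, emit one row:
    (label=next download, previous download, user features...)."""
    history = defaultdict(list)
    for user, category in logs:
        history[user].append(category)
    rows = []
    for user, downloads in history.items():
        for prev, nxt in zip(downloads, downloads[1:]):
            rows.append((nxt, prev) + user_features[user])
    return rows

rows = build_training_rows(logs, user_features)
```

Each row is one training example; the first column is the label the classifier learns to predict, so at serving time its top-scoring classes are the recommendations.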

Related

Too much data to label

I'm working on a personal data science project where I try to flag bots on Instagram. I have already collected public data on 80k users, labelled ~4k of them, and managed to label 3k more thanks to similarities (e.g. same comment, same profile picture, same scammy website in the bio). This last step got me more bots, but it also changed the bot/legit-user distribution in the labelled dataset.
I've heard about semi-supervised learning, but since this is my first ML project and I'm still very new to data science, I don't feel confident using it.
What are my options? Can I balance the labelled data and stop labelling at some point? Should I label everything? What would you advise?

OpenFace: how to import pictures of people and compare a new picture later on

I am trying to understand OpenFace.
I have installed the Docker container, run the demos, and read the docs.
What I am missing is how to start using it correctly.
Let me explain my goal:
I have an app on a Raspberry Pi with a webcam. When I start the app, it takes a picture of the person in front of it.
It should then send this picture to my OpenFace app and check whether the face is known. Known in this context means that I have already added pictures of this person to OpenFace before.
My questions are:
Do I need to train OpenFace beforehand, or can I just put the images of the persons in a directory or something and compare the webcam picture against these directories on the fly?
Do I compare with images or with a generated .pkl?
Do I need to train OpenFace for each new person?
It feels like I am missing something big that would make the required workflow clearer to me.
Just for the record: with the help of the link I mentioned, I could figure it out.
Do I need to train OpenFace beforehand, or can I just put the images of the persons in a directory or something and compare the webcam picture against these directories on the fly?
Yes, training is required before any images can be compared.
Do I compare with images or with a generated .pkl?
Images are compared against the generated classifier .pkl.
The .pkl file is generated when training OpenFace:
./demos/classifier.py train ./generated-embeddings/
This generates a new file, ./generated-embeddings/classifier.pkl. It contains the SVM model you'll use to recognize new faces.
Do I need to train OpenFace for each new person?
For this question I don't have an answer yet, but only because I haven't looked deeper into this topic.
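For a rough picture of what happens at compare time: OpenFace maps each face to a 128-dimensional embedding, and recognition boils down to comparing embeddings in that space (the classifier.pkl wraps a model trained on such embeddings). Here is a minimal sketch using plain distance comparison, with random vectors standing in for real embeddings; names and the threshold value are made up for illustration.

```python
import numpy as np

# Synthetic stand-ins for OpenFace's 128-dimensional face embeddings.
rng = np.random.default_rng(0)
known = {
    "alice": rng.normal(0.0, 0.1, 128),
    "bob":   rng.normal(1.0, 0.1, 128),
}

def identify(query, known, threshold=0.9):
    """Return the closest known person, or None if no stored embedding is
    within the squared-L2 threshold (i.e. the face is unknown)."""
    best_name, best_dist = None, float("inf")
    for name, emb in known.items():
        dist = float(np.sum((query - emb) ** 2))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None

# A query embedding near alice's stored embedding is identified as alice.
query = known["alice"] + rng.normal(0.0, 0.01, 128)
print(identify(query, known))
```

In this picture, "adding a person" means storing their embeddings (or retraining the classifier on them), which is why the webcam picture cannot simply be compared against raw image directories on the fly.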

How do I choose a training data set for job recommendation using a linear regression model?

I have two kinds of profiles in the database: one is a candidate profile, the other is a job profile posted by a recruiter. Both profiles share three fields: location, skill, and experience.
I know the algorithm, but I am having trouble creating the training data set. My input features will be the location, skill, and experience taken from the candidate profile, but I don't see how to choose the output (the relevant job profile).
As far as I know, the output can only be a single variable, so how do I use a relevant job profile as the output in my training set? Or should I choose some other method? Another thought is clustering.
As I understand it, you want to predict a job profile given a candidate profile using some prediction algorithm.
Well, if you want to use regression, you need some historical data -- which candidates were given which jobs -- and then you can build a model on that history. If you don't have such training data, you need some other approach. For example, you could treat location, skill, and experience as features in a 3-dimensional space and use clustering or nearest neighbours to find the candidate profile closest to a job profile.
You could also look at "recommender systems"; they may be an answer to your problem.
Starting with a content-based algorithm (you will have to find a way to automate the job labels, or assign them manually), you can improve it into a hybrid recommender by gathering which jobs your users were actually interested in.

How do you add new categories and training to a pretrained Inception v3 model in TensorFlow?

I'm trying to utilize a pre-trained model like Inception v3 (trained on the 2012 ImageNet data set) and extend it with several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and the examples like transfer learning on flowers are working great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means it can now identify 5 species of flowers, but can no longer identify pandas, for example. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and add training for those 5 new flower categories) so that I have 1,005 categories that a test image can be classified as? In other words, be able to identify both those pandas and sunflowers?
I understand one option would be to download the entire ImageNet training set and the flowers example set and to train from scratch, but given my current computing power, it would take a very long time, and wouldn't allow me to add, say, 100 more categories down the line.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data , but I'm not sure how to proceed, and not sure if that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning and working in deep learning professionally for a few years now, here is a more complete answer:
The best way to add categories to an existing model (e.g. Inception trained on the ImageNet LSVRC 1000-class dataset) is to perform transfer learning on a pre-trained model.
If you are just trying to adapt the model to your own data set (e.g. 100 different kinds of automobiles), simply perform retraining/fine tuning by following the myriad online tutorials for transfer learning, including the official one for Tensorflow.
While the resulting model can perform well, keep in mind that the tutorial classifier code is highly unoptimized (perhaps intentionally); you can improve performance several times over by improving the code before deploying to production.
However, if you're trying to build a general-purpose classifier that includes the default LSVRC data set (1,000 categories of everyday images) and expand it with your own additional categories, you'll need access to the existing 1,000 LSVRC categories of images and append your own data set to them. You can download the ImageNet dataset online, but access is getting spottier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have that LSVRC dataset, perform transfer learning as above, but include the 1,000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions. Be aware that distortions dramatically increase retraining time, especially without a GPU, because the bottleneck files cannot be reused for each distortion. (Personally, I think this is pretty lame; there's no reason distortions couldn't also be cached as bottleneck files, but that's a different discussion, and it can be added to your code manually.)
Using these methods and incorporating error analysis, we've trained general-purpose classifiers on 4,000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model design to overcome the limitations of existing models, but transfer learning is a highly legitimate way to get good results and has even made its way into natural language processing via BERT and other designs.
Hopefully, this helps.
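The core of the "1,000 to 1,005 categories" idea is just widening the final layer: keep the trained weight columns for the original classes and append freshly initialized columns for the new ones, then fine-tune on the combined dataset. A numpy sketch of that surgery (the 2048-d input matches Inception v3's final pooling layer; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the trained final-layer parameters: 2048-d features
# mapped to 1,000 ImageNet classes.
W_old = rng.normal(size=(2048, 1000))
b_old = np.zeros(1000)

def widen_final_layer(W, b, n_new, rng):
    """Append n_new freshly initialized output columns, keeping the trained
    columns for the original classes intact."""
    W_extra = rng.normal(scale=0.01, size=(W.shape[0], n_new))
    return np.hstack([W, W_extra]), np.concatenate([b, np.zeros(n_new)])

# Widen from 1,000 to 1,005 outputs for 5 new flower categories.
W, b = widen_final_layer(W_old, b_old, 5, rng)
```

During retraining you would then fine-tune at least the 5 new columns (and ideally the whole layer) on the combined 1,000-class plus 5-class dataset, which is why you still need the original LSVRC images.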
Unfortunately, you cannot add categories to an existing graph; you'll basically have to save a checkpoint and train that graph from that checkpoint onward.

Twitter data topical classification

So I have a data set which consists of tweets from various news organizations. I've loaded it into RapidMiner, tokenized it, and produced some n-grams of it. Now I want to be able to have RapidMiner automatically classify my data into various categories based on the topic of the tweets.
I'm pretty sure RapidMiner can do this, but according to the research I've done into it, I need a training data set to be able to show RapidMiner how I want things classified. So I need a training data set, though given the categories I wanted to classify things into, I might have to create my own.
So my questions are these:
1) Is there a publicly available training data set for Twitter data that focuses on the topic of the tweet rather than on sentiment analysis?
2) If there isn't one, how can I create my own? My idea was to go through the tweets themselves and associate the tokens and n-grams with the categories I want. One concern is that I won't be able to manually classify enough tweets to build a training set comprehensive enough for the automatic classifier to reach a good accuracy.
3) Any general advice for topical classification of text data would be great. This is the first time that I've done a project like this, and I'm sure there are things I could improve on. :)
There may be training corpora that work for you, but you need to say what your topics or categories are in order to identify one. The fact that this is Twitter may be relevant, but the data source is likely to matter much less for classification accuracy than the topic does. So if you take the infamous 20 Newsgroups data set, it is likely to work on Twitter as well, but only if the categories you are after are the 20 categories from that data set. If you want to classify cats vs. dogs or Android vs. iPhone, you need to find a data set for that.
In most cases you will have to create initial labels manually, which is, as you say, a lot of work. One workaround might be to start with something simpler like a keyword search to create subsets of your tweets for which you know they deal with a particular category. Then you create the model on top of that and hope that it generalizes to identify the same categories even though the original keywords do not occur.
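The keyword-bootstrapping workaround can be sketched as follows: keyword rules produce an initial (noisy) labelled subset, which then becomes the training set for a proper classifier. Categories and keywords below are made up for illustration.

```python
# Hypothetical seed rules: category -> keywords that signal it.
seed_keywords = {
    "sports": {"match", "goal", "league"},
    "politics": {"election", "senate", "vote"},
}

def weak_label(tweet, seed_keywords):
    """Return the first category whose keywords appear in the tweet,
    or None if no rule fires."""
    tokens = set(tweet.lower().split())
    for category, keywords in seed_keywords.items():
        if tokens & keywords:
            return category
    return None

tweets = [
    "Late goal decides the match",
    "Senate vote scheduled for Tuesday",
    "Just had a great coffee",
]

# Tweets with a label become the initial training set; the hope is that a
# model trained on them generalizes beyond the seed keywords.
labelled = [(t, weak_label(t, seed_keywords)) for t in tweets]
train_set = [(t, y) for t, y in labelled if y is not None]
```

Tweets that no rule matches are simply left unlabelled; the model trained on the matched subset is what you then apply to them.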
Alternatively, depending on your application (and if you actually want to build one), you may as well start with only a small data set and accept poor initial classification. Then you generate classifications, show them to the users of your app, and collect some form of explicit or implicit feedback (e.g. users can flag tweets as incorrectly classified). This way you improve your training corpus and periodically update your model.
Finally, if you do not know what your topics are and you want RapidMiner to identify the topics, you may want to try clustering as opposed to classification. Just create a few clusters and look at the top words for each cluster. They may well be quite dissimilar and describe what the respective clusters are about.
I believe your third question may be a bit broad for Stack Overflow and is probably better answered by a textbook.
