How to predict edges in a social network? - machine-learning

I need to build a recommendation system that predicts friends for users in a social graph. The number of users is around 1,500,000. I thought of creating all possible pairs of users and then computing metrics such as the Jaccard distance for each pair, but doing this for a 1,500,000 × 1,500,000 matrix seems to be an impossible task. What approaches exist to handle this many nodes?
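One common way to keep the Jaccard computation tractable is to score each user only against friends-of-friends instead of all pairs. A minimal sketch under an assumed toy graph (the adjacency and ids are illustrative, not from the question):

```python
from collections import defaultdict

# adjacency: user id -> set of friend ids (toy data)
friends = {
    1: {2, 3},
    2: {1, 3, 4},
    3: {1, 2},
    4: {2},
}

def jaccard(a, b):
    """Jaccard similarity of two friend sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_scores(user):
    """Score only friends-of-friends that are not already friends."""
    scores = {}
    for friend in friends[user]:
        for fof in friends[friend]:
            if fof != user and fof not in friends[user]:
                scores[fof] = jaccard(friends[user], friends[fof])
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(candidate_scores(1))  # e.g. [(4, 0.5)]
```

Because the candidate set per user is bounded by the 2-hop neighbourhood, the total work grows with the number of edges rather than with the square of the number of users.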

Related

Recommender system result curation

I want to ask if there's some sort of curation algorithm that arranges/sends results from a recommender system to a user.
For example, how does Twitter recommend feeds to users? Is there some sort of algorithm that does that, or does Twitter just sort by the highest number of interactions with a tweet (also taking the posting time into account)?
No, there is nothing like that.
The recommendation model is typically built so that it ranks content using content-based filtering or collaborative filtering according to the user's viewing statistics.
Some approaches compute the correlation between the user's viewing statistics and the content on Twitter, and then recommend accordingly.
Cosine similarity (or cosine distance) can also be used to measure how close a user's viewing statistics are to a piece of content; a minimal sketch follows below.
You should also explore other recommendation approaches based on algorithms such as Pearson correlation, weighted averages, etc.
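A minimal sketch of the cosine-similarity idea mentioned above, comparing a user's view-stats vector with a content vector; the vectors are made-up toy data, not anything from the answer:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity; cosine distance is 1 - similarity."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

user_views = np.array([3.0, 0.0, 1.0, 2.0])    # how much the user engaged with 4 topics
tweet_topics = np.array([1.0, 0.0, 0.5, 1.0])  # topic profile of a candidate tweet

print(cosine_similarity(user_views, tweet_topics))
```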

Options on evaluating Recommender System

I've created a recommender system that works this way:
- each user selects some filters and, based on those filters, a score is generated
- each user is clustered using k-means based on those scores
- whenever a user receives a recommendation, I use Pearson's correlation to see which user from the same cluster correlates best with them (sketched below)
My problem is that I'm not really sure what the best way to evaluate this system would be. I've seen that one way to do it is by hiding some values of the dataset, but that doesn't apply to me because I'm not predicting scores.
Are there any metrics or anything else that I could use?
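For reference, a minimal sketch of the pipeline described above (filter-based scores, k-means clusters, Pearson correlation within the user's cluster); the score matrix and the number of clusters are illustrative assumptions, and the evaluation question itself is not addressed here:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
scores = rng.random((100, 5))          # 100 users x 5 filter-based scores (toy data)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scores)

def most_similar_in_cluster(user_idx):
    """Index of the most Pearson-correlated user within the same cluster."""
    label = kmeans.labels_[user_idx]
    peers = [i for i in np.where(kmeans.labels_ == label)[0] if i != user_idx]
    return max(peers, key=lambda i: pearsonr(scores[user_idx], scores[i])[0])

print(most_similar_in_cluster(0))
```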

Algorithm to classify instances from a dataset similar to another smaller dataset, where this smaller dataset represents a single class

I have a dataset of instances from a binary classification problem. The twist is that I only have instances from the positive class and none from the negative one. Or rather, I want to extract those negatives that are closest to the positives.
To get more concrete let's say we have data of people who bought from our store and asked for a loyalty card at the moment or later of their own volition. Privacy concerns aside (it's just an example) we have different attributes like age, postcode, etc.
The other set of clients, following our example, are clients who did not apply for the card.
What we want is to find a subset of those that are most similar to the ones that applied for the loyalty card in the first group, so that we can send them an offer to apply for the loyalty program.
It's not exactly a classification problem because we are trying to get instances from within the group of "negatives".
It's not exactly clustering, which is typically unsupervised, because we already know a cluster (the loyalty card clients).
I thought about using kNN, but I don't really know what my options are here.
I would also like to know how, if possible, this can be achieved with Weka or another Java library, and whether I should normalize all the attributes.
You could use anomaly detection algorithms. These algorithms tell you whether your new client belongs to the group of clients who got a loyalty card or not (in which case they would be an anomaly).
There are two basic ideas (coming from the article I linked below):
You transform the feature vectors of your positively labelled data (clients with a card) into a vector space with lower dimensionality (e.g. using PCA). Then you can estimate the probability distribution of the transformed data and check whether a new client belongs to the same statistical distribution or not. You can also compute the distance of a new client to the centroid of the transformed data and use the standard deviation of the distribution to decide whether it is still close enough.
The machine learning approach: you train an auto-encoder network on the clients-with-card data. An auto-encoder has a bottleneck in its architecture: it compresses the input data into a lower-dimensional feature vector and afterwards tries to reconstruct the input from that compressed vector. If the training is done correctly, the reconstruction error for input data similar to the clients-with-card dataset should be smaller than for input data which is not similar to it (hopefully clients who do not want a card).
Have a look at this tutorial for a start: https://towardsdatascience.com/how-to-use-machine-learning-for-anomaly-detection-and-condition-monitoring-6742f82900d7
Both methods would require standardizing the attributes first; a minimal sketch of the first idea follows.
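This is only a sketch under assumed toy data; the feature dimensions, the number of PCA components, and the 3-sigma distance threshold are illustrative assumptions, not part of the answer:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
card_holders = rng.normal(size=(500, 10))   # clients who applied for the card (toy data)

scaler = StandardScaler().fit(card_holders)
pca = PCA(n_components=3).fit(scaler.transform(card_holders))

projected = pca.transform(scaler.transform(card_holders))
centroid = projected.mean(axis=0)
distances = np.linalg.norm(projected - centroid, axis=1)
threshold = distances.mean() + 3 * distances.std()   # assumed 3-sigma cut-off

def looks_like_card_holder(client):
    """True if the client falls inside the card-holder distribution."""
    z = pca.transform(scaler.transform(client.reshape(1, -1)))
    return np.linalg.norm(z - centroid) <= threshold

print(looks_like_card_holder(rng.normal(size=10)))
```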
Or try a one-class support vector machine.
This approach tries to model the class boundary and gives you a binary decision on whether a point belongs to the class or not. It can be seen as a simple form of density estimation. The main benefit is that the set of support vectors will be much smaller than the training data.
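A minimal sketch of the one-class SVM suggestion using scikit-learn's OneClassSVM; the toy data and the nu value (expected outlier fraction among the card holders) are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
card_holders = rng.normal(size=(500, 10))   # toy data

scaler = StandardScaler().fit(card_holders)
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(scaler.transform(card_holders))

new_clients = rng.normal(size=(5, 10))
# +1 = inside the learned boundary (similar to card holders), -1 = outside
print(ocsvm.predict(scaler.transform(new_clients)))
```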
Or simply use nearest-neighbor distances to rank the clients.
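And a minimal sketch of that nearest-neighbour ranking idea, assuming toy data and the distance to the single closest card holder as the score:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
card_holders = rng.normal(size=(500, 10))   # toy data
other_clients = rng.normal(size=(200, 10))  # clients without a card (toy data)

nn = NearestNeighbors(n_neighbors=1).fit(card_holders)
distances, _ = nn.kneighbors(other_clients)

ranking = np.argsort(distances.ravel())     # most card-holder-like first
print(ranking[:10])
```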

Classification of industry based on tags

I have a dataset (1M entries) on companies where all companies are tagged based on what they do.
For example, Amazon might be tagged with "Retail;E-Commerce;SaaS;Cloud Computing" and Google would have tags like "Search Engine;Advertising;Cloud Computing".
So now I want to analyze a cluster of companies, e.g. all online marketplaces like Amazon, eBay, Etsy, and the like. There is no single tag I can look for; instead, I have to use a set of tags to quantify the likelihood that a company is a marketplace.
For example tags like "Retail", "Shopping", "E-Commerce" are good tags, but then there might be some small consulting agencies or software development firms that consult / build software for online marketplaces and have tags like "consulting;retail;e-commerce" or "software development;e-commerce;e-commerce tools", which I want to exclude as they are not online marketplaces.
I'm wondering what the best way is to identify all online marketplaces in my dataset. Which machine learning algorithm is suited to selecting the maximum number of companies in the industry I'm looking for while excluding the ones that are obviously not part of it?
I thought about supervised learning, but I'm not sure because of a few issues:
Labelling is needed, which means I would have to go through thousands of companies and flag them for multiple industries (marketplace, finance, fashion, ...), as I'm interested in 20-30 industries overall.
There are more than 1,000 tags associated with the companies. How would I define my features? One feature per tag would lead to massive dimensionality.
Are there any best practices for such cases?
UPDATE:
It should be possible to assign companies to multiple clusters, e.g. Amazon should be identified as "Marketplace", but also as "Cloud Computing" or "Online Streaming".
I used tf-idf and k-means to identify tags that form clusters, but I don't know how to assign likelihoods/scores to companies that indicate how well a company fits into a cluster based on its tags (see the sketch below).
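For reference, a minimal sketch of the tf-idf + k-means step described above; treating each semicolon-separated tag as a token and scoring companies by their distance to each cluster centroid are assumptions on my part, not something stated in the post:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

companies = [
    "Retail;E-Commerce;SaaS;Cloud Computing",
    "Search Engine;Advertising;Cloud Computing",
    "consulting;retail;e-commerce",
    "software development;e-commerce;e-commerce tools",
]

# treat each semicolon-separated tag as one token
vectorizer = TfidfVectorizer(tokenizer=lambda s: s.split(";"), token_pattern=None)
X = vectorizer.fit_transform(companies)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# distance of every company to every cluster centroid; smaller = better fit
distances = kmeans.transform(X)
print(kmeans.labels_)
print(distances.round(2))
```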
UPDATE:
While tf-idf in combination with k-means delivered pretty neat clusters (meaning the companies within a cluster were actually similar), I also tried to calculate probabilities of belonging to a cluster with Gaussian Mixture Models (GMMs), which led to completely messed-up results where the companies within a cluster were more or less random or came from a handful of different industries.
No idea why this happened though...
UPDATE:
Found the error: I applied PCA before the GMM to reduce dimensionality, and this apparently caused the random results. Removing the PCA improved the results significantly.
However, the resulting posterior probabilities of my GMM are exactly 0 or 1 about 99.9% of the time. Is there a parameter (I'm using sklearn's BayesianGaussianMixture) that needs to be adjusted to get more useful probabilities that are a bit less saturated? Right now everything with a posterior < 1.0 is no longer part of a cluster, but there are also a few outliers that get a posterior of 1.0 and are thus assigned to an industry. For example, a company tagged "Baby;Consumer" gets assigned to the "Consumer Electronics" cluster, even though only 1 out of 2 tags may suggest this. I'd like such a company to get a probability < 1 so that I can define a threshold based on some cross-validation.
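A minimal sketch of the BayesianGaussianMixture step on toy data; the covariance_type and reg_covar values shown are illustrative knobs that commonly affect how saturated the posteriors come out, not a verified fix for this dataset:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 5)),
               rng.normal(3, 1, size=(200, 5))])   # toy data with two groups

gmm = BayesianGaussianMixture(n_components=5,
                              covariance_type="diag",   # assumed setting
                              reg_covar=1e-3,           # assumed setting
                              random_state=0).fit(X)

posteriors = gmm.predict_proba(X)
print(posteriors.max(axis=1)[:10])   # inspect how saturated the posteriors are
```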

Calculating the influence of a user in a telecom cdr data

I came across a way to calculate the influence score of a person on a twitter network. Here is a sample reference: http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/
On similar lines, are there any other algorithms that calculate the influence score of a subscriber on a telecom network using his/her CDR data?
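For reference, a minimal sketch of a PageRank-style influence score on a call graph built from aggregated CDRs, in the spirit of the linked Twitter analogue; the edge tuples and the use of networkx are assumptions for illustration:

```python
import networkx as nx

# (caller, callee, number_of_calls) aggregated from CDR data (toy values)
cdr_edges = [
    ("A", "B", 10),
    ("B", "C", 4),
    ("C", "A", 2),
    ("A", "C", 7),
]

G = nx.DiGraph()
G.add_weighted_edges_from(cdr_edges)

# PageRank weighted by call volume as a simple influence score
influence = nx.pagerank(G, alpha=0.85, weight="weight")
print(sorted(influence.items(), key=lambda kv: kv[1], reverse=True))
```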
Please check out Magnusson's thesis:
http://uu.diva-portal.org/smash/record.jsf?pid=diva2:509757
This thesis aims at investigating the usefulness of social network analysis in telecommunication networks. As these networks can be very large, the methods used to study them must scale linearly as the network size increases. Thus, an integral part of the study is to determine which social network analysis algorithms have this scalability. Moreover, comparisons of software solutions are performed to find products suitable for these specific tasks.
