Calculating the influence of a user in telecom CDR data - machine-learning

I came across a way to calculate the influence score of a person on a Twitter network. Here is a sample reference: http://thenoisychannel.com/2009/01/13/a-twitter-analog-to-pagerank/
On similar lines, are there any other algorithms that calculate the influence score of a subscriber on a telecom network using his/her CDR data?

Please check out Magnusson's thesis:
http://uu.diva-portal.org/smash/record.jsf?pid=diva2:509757
This thesis aims at investigating the usefulness of social network analysis in telecommunication networks. As these networks can be very large the methods used to study them must scale linearly when the network size increases. Thus, an integral part of the study is to determine which social network analysis algorithms that have this scalability. Moreover, comparisons of software solutions are performed to find product suitable for these specific tasks.
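As an illustration of the PageRank-style idea the question refers to, here is a minimal sketch that scores subscribers on a directed call graph. It is not taken from the thesis. The `(caller, callee, call_count)` record layout, the function name `influence_scores`, and the damping value are all assumptions made for the example.

```python
from collections import defaultdict


def influence_scores(cdr_records, damping=0.85, iterations=50):
    """PageRank-style scores on a directed, weighted call graph.

    cdr_records: iterable of (caller, callee, call_count) tuples.
    This record layout is hypothetical; real CDR data needs preprocessing.
    """
    edges = defaultdict(float)       # (caller, callee) -> total call count
    out_weight = defaultdict(float)  # caller -> total outgoing call count
    nodes = set()
    for caller, callee, count in cdr_records:
        edges[(caller, callee)] += count
        out_weight[caller] += count
        nodes.update((caller, callee))

    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Subscribers with no outgoing calls spread their score uniformly.
        dangling = sum(scores[v] for v in nodes if out_weight[v] == 0)
        new_scores = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        # Each caller passes score to callees in proportion to call counts.
        for (caller, callee), weight in edges.items():
            new_scores[callee] += damping * scores[caller] * weight / out_weight[caller]
        scores = new_scores
    return scores


if __name__ == "__main__":
    records = [("a", "c", 5), ("b", "c", 3), ("c", "a", 1)]
    for subscriber, score in sorted(influence_scores(records).items()):
        print(subscriber, round(score, 3))
```

Each iteration touches every edge once, so the cost per iteration grows linearly with the number of call relationships, which is the kind of scalability the thesis abstract asks for.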

Related

Converting "Implicit" user interactions to "Explicit" user ratings for recommender systems

I'm currently in the process of building a recommendation system with implicit data (e.g. clicks, views, purchases). However, much of the research I've looked at seems to skip the step of "aggregating implicit data". For example, how do you aggregate multiple clicks and purchases over time into a single user rating (as is required for a standard matrix factorization model)?
I've been experimenting with several Matrix Factorization based methods, including Neural Collaborative Filtering, Deep Factorization Machines, LightFM, and Variational Autoencoders for Collaborative Filtering. None of these papers seem to address the issue of aggregating implicit data. They also do not discuss how to weight different types of user events (e.g. clicks vs purchase) when calculating a score.
For now I've been using a confidence score approach (the confidence score corresponds to the count of events) as outlined in this paper: http://yifanhu.net/PUB/cf.pdf. However, this approach doesn't address incorporating other types of user events (other than clicks), nor does it address negative implicit feedback (e.g. a ton of impressions with zero clicks).
Anyway, I'd love some insight on this topic! Any thoughts at all would be hugely appreciated!
One method for building a recommendation system from such data is Bayesian personalized ranking (BPR) from implicit feedback. I also wrote an article on how it can be implemented using TensorFlow.
There's no "right" answer to the question of how to convert implicit feedback into explicit ratings. The answer will depend on business requirements. If the task is to increase the click rate, you should try to use the clicks. If the task is to increase conversion, you need to work with purchases.
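One possible way to aggregate several event types is sketched below. It extends the confidence formulation from the paper linked in the question (binary preference, confidence = 1 + alpha * r) by summing weighted events into r. The per-event weights, the `aggregate_events` name, and the `alpha` value are assumptions for illustration only; as noted above, the weights should follow from the business goal.

```python
from collections import defaultdict

# Hypothetical weights: raise "purchase" when optimizing conversion,
# raise "click" when optimizing click rate.
EVENT_WEIGHTS = {"view": 1.0, "click": 2.0, "purchase": 10.0}


def aggregate_events(events, alpha=2.0):
    """Collapse raw events into one (preference, confidence) per user-item pair.

    events: iterable of (user, item, event_type) tuples.
    alpha: confidence scaling constant; the default is only a placeholder.
    """
    strength = defaultdict(float)
    for user, item, event_type in events:
        # Unknown event types contribute nothing.
        strength[(user, item)] += EVENT_WEIGHTS.get(event_type, 0.0)

    return {
        pair: {
            "preference": 1 if r > 0 else 0,  # did the user interact at all?
            "confidence": 1.0 + alpha * r,    # how strongly we trust that signal
        }
        for pair, r in strength.items()
    }


if __name__ == "__main__":
    log = [
        ("u1", "i1", "click"),
        ("u1", "i1", "click"),
        ("u1", "i1", "purchase"),
        ("u2", "i1", "view"),
    ]
    print(aggregate_events(log))
```

This sketch does not handle negative implicit feedback such as impressions without clicks; that would need an additional modelling decision.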

Feature engineering for fraud detection

I'm doing some research into fraud detection for academic purposes.
I'd like to know specifically about techniques for feature selection/engineering from a transactional dataset.
In more detail, given a dataset of transactions (credit card, for example), what kind of features are selected for use in the model, and how are they engineered?
All the papers I've come across focus on the model itself (SVM, NN, ...) not really touching on this subject.
Also, if anyone knows of public datasets that are not anonymized - that would also help.
Thanks
Having a good understanding of feature selection/ranking can be a great asset for a data scientist or machine learning practitioner. A good grasp of these methods leads to better performing models, better understanding of the underlying structure and characteristics of the data and leads to better intuition about the algorithms that underlie many machine learning models.
There are in general two reasons why feature selection is used:
1. Reducing the number of features, to reduce overfitting and improve the generalization of models.
2. To gain a better understanding of the features and their relationship to the response variables.
Possible methods:
Univariate feature selection:
  Pearson correlation
  Mutual information and maximal information coefficient (MIC)
  Distance correlation
  Model-based ranking
Tree-based methods:
  Random forest feature importance (mean decrease impurity, mean decrease accuracy)
Others:
  Stability selection
  RFE (recursive feature elimination)
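To make the simplest entry in the list concrete, here is a small sketch of univariate ranking by absolute Pearson correlation with the target. The feature names and toy values are invented for the example, and a linear correlation will miss non-linear relationships that mutual information or tree-based importances can pick up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Rank features by |Pearson correlation| with the target, strongest first.

    features: dict mapping feature name -> list of values (one per transaction).
    target: list of labels, e.g. 1 for fraud and 0 for legitimate.
    """
    scored = {name: abs(pearson(values, target)) for name, values in features.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data; the column names are hypothetical.
    toy_features = {
        "amount": [1, 2, 3, 4],
        "alternating": [1, -1, 1, -1],
        "constant": [5, 5, 5, 5],
    }
    toy_target = [0, 0, 1, 1]
    for name, score in rank_features(toy_features, toy_target):
        print(name, round(score, 3))
```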

Comma.ai self-driving car neural network using client/server architecture in TensorFlow, why?

In comma.ai's self-driving car software they use a client/server architecture. Two processes are started separately, server.py and train_steering_model.py.
server.py sends data to train_steering_model.py via http and sockets.
Why do they use this technique? Isn't this a complicated way of sending data? Wouldn't it be easier to have train_steering_model.py load the data set by itself?
The document DriveSim.md in the repository links to a paper titled Learning a Driving Simulator. In the paper, they state:
Due to the problem complexity we decided to learn video prediction with separable networks.
They also mention the frame rate they used is 5 Hz.
While that sentence is the only one that addresses your question, and it isn't exactly crystal clear, let's break down the task in question:
Grab an image from a camera
Preprocess/downsample/normalize the image pixels
Pass the image through an autoencoder to extract a representative feature vector
Pass the output of the autoencoder on to an RNN that will predict the proper steering angle
The "problem complexity" refers to the fact that they're dealing with a long sequence of large images that are (as they say in the paper) "highly uncorrelated." There are lots of different tasks that are going on, so the network approach is more modular - in addition to allowing them to work in parallel, it also allows scaling up the components without getting bottlenecked by a single piece of hardware reaching its threshold computational abilities. (And just think: this is only the steering aspect. The Logs.md file lists other components of the vehicle to worry about that aren't addressed by this neural network - gas, brakes, blinkers, acceleration, etc.).
Now let's fast forward to the practical implementation in a self-driving vehicle. There will definitely be more than one neural network operating onboard the vehicle, and each will need to be limited in size - microcomputers or embedded hardware, with limited computational power. So, there's a natural ceiling to how much work one component can do.
Tying all of this together is the fact that cars already operate using a network architecture - a CAN bus is literally a computer network inside of a vehicle. So, this work simply plans to farm out pieces of an enormously complex task to a number of distributed components (which will be limited in capability) using a network that's already in place.

Should the neurons in a neural network be asynchronous?

I am designing a neural network and am trying to determine if I should write it in such a way that each neuron is its own 'process' in Erlang, or if I should just go with C++ and run a network in one thread (I would still use all my cores by running an instance of each network in its own thread).
Is there a good reason to give up the speed of C++ for the asynchronous neurons that Erlang offers?
I'm not sure I understand what you're trying to do. An artificial neural network is essentially represented by the weight of the connections between nodes. The nodes themselves don't exist in isolation; their values are only calculated (at least in feed-forward networks) through the forward-propagation algorithm, when it is given input.
The backpropagation algorithm for updating weights is definitely parallelizable, but that doesn't seem to be what you're describing.
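To show what "values are only calculated through forward propagation" means in practice, here is a minimal feed-forward pass in plain Python. The layer layout (a list of weight rows plus a bias per neuron), the sigmoid activation, and the function names are assumptions for the sketch; a real implementation would normally use a linear algebra library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, inputs):
    """Run one forward pass through a fully connected network.

    layers: list of (weights, biases) pairs, where weights holds one row
            of incoming connection weights per neuron in that layer.
    inputs: list of input values.
    """
    activations = inputs
    for weights, biases in layers:
        # A neuron's value exists only as a function of the previous layer:
        # weighted sum of incoming activations, plus bias, through sigmoid.
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


if __name__ == "__main__":
    network = [
        ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer, 2 neurons
        ([[1.0, 1.0]], [-1.0]),                  # output layer, 1 neuron
    ]
    print(forward(network, [3.0, 4.0]))
```

Nothing in this computation is tied to a per-neuron process: a whole layer is evaluated in one step from the previous layer's output.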
The point of having neurons in a Neural Network (NN) is to have a multi-dimensional matrix whose coefficients you want to handle (to train them, to change them, to adapt them little by little, so that they fit the problem you want to solve). On this matrix you can apply numerical methods (proven and efficient) to find an acceptable solution in an acceptable time.
IMHO, with NN (namely with the back-propagation training method), the goal is to have a matrix which is efficient both at run-time/predict-time and at training time.
I don't grasp the point of having asynchronous neurons. What would it offer? What issue would it solve?
Maybe you could explain clearly what problem you would solve by making them asynchronous?
I am indeed inverting your question: what do you want to gain from asynchronicity compared with traditional NN techniques?
It would depend upon your use case: the neural network computational model and your execution environment. Here is a recent paper (2014) by Plotnikova et al, that uses "Erlang and platform Erlang/OTP with predefined base implementation of actor model functions" and a new model developed by the authors that they describe as “one neuron—one process” using "Gravitation Search Algorithm" for training:
http://link.springer.com/chapter/10.1007%2F978-3-319-06764-3_52
To briefly cite their abstract, "The paper develops asynchronous distributed modification of this algorithm and presents the results of experiments. The proposed architecture shows the performance increase for distributed systems with different environment parameters (high-performance cluster and local network with a slow interconnection bus)."
Also, most other answers here reference a computational model that uses matrix operations as the basis of training and simulation. The authors of this paper contrast their approach with that model, saying, "this case neural network model [ie matrix operations based] becomes fully mathematical and its original nature (from neural networks biological prototypes) gets lost"
The tests were run on three types of systems:
1. IBM cluster, represented as 15 virtual machines.
2. Distributed system deployed to the local network, represented as 15 physical machines.
3. Hybrid system, based on system 2 but with four processor cores on each physical machine.
They provide the following concrete results, "The presented results evidence a good distribution ability of gravitation search, especially for large networks (801 and more neurons). Acceleration depends on the node count almost linearly. If we use 15 nodes we can get about eight times acceleration of the training process."
Finally, they conclude regarding their model, "The model includes three abstraction levels: NNET, MLP and NEURON. Such architecture allows encapsulating some general features on general levels and some specific for the considered neural networks features on special levels. Asynchronous message passing between levels allow to differentiate synchronous and asynchronous parts of training and simulation algorithms and, as a result, to improve the use of resources."
It depends what you are after.
2nd Generation Neural Networks are synchronous. They perform computations on an input-output basis without a delay, and can be trained either through reinforcement or back-propagation. This is the prevailing type of ANN at the moment and the easiest to get started with if you are trying to solve a problem via machine learning, with lots of literature and examples available.
3rd Generation Neural Networks (so-called "Spiking Neural Networks") are asynchronous. Signals propagate internally through the network as a chain reaction of spiking events, and can create interesting patterns and oscillations depending on the shape of the network. While they model biological brains more closely, they are also harder to make use of in a practical setting.
I think that async computation for NNs might prove beneficial for the (recognition) performance. In fact, the result might be similar (maybe less pronounced) to using dropout.
But a straightforward implementation of async NNs would be much slower, because for synchronous NNs you can use linear algebra libraries, which make good use of vectorization or GPUs.

Information retrieval (IR) vs data mining vs Machine Learning (ML)

People often throw around the terms IR, ML, and data mining, but I have noticed a lot of overlap between them.
From people with experience in these fields, what exactly draws the line between these?
This is just the view of one person (formally trained in ML); others might see things quite differently.
Machine Learning is probably the most homogeneous of these three terms, and the most consistently applied--it's limited to the pattern-extraction (or pattern-matching) algorithms themselves.
Of the terms you mentioned, "Machine Learning" is the one most used by academic departments to describe their curricula and their research programs, as well as the term most used in academic journals and conference proceedings. ML is clearly the least context-dependent of the terms you mentioned.
Information Retrieval and Data Mining are much closer to describing complete commercial processes--i.e., from user query to retrieval/delivery of relevant results. ML algorithms might be somewhere in that process flow, and in the more sophisticated applications often are, but that's not a formal requirement. In addition, the term Data Mining seems usually to refer to application of some process flow on big data (i.e., > 2 GB) and therefore usually includes a distributed processing (map-reduce) component near the front of that workflow.
So Information Retrieval (IR) and Data Mining (DM) are related to Machine Learning (ML) in an Infrastructure-Algorithm kind of way. In other words, Machine Learning is one source of tools used to solve problems in Information Retrieval, but it's only one source, and IR doesn't depend on ML. For instance, a particular IR project might be storage and rapid retrieval of fully-indexed data responsive to a user's search query, the crux of which is optimizing performance of the data flow, i.e., the round-trip from query to delivering the search results to the user. Prediction or pattern matching might not be useful here. Likewise, a DM project might use an ML algorithm for the predictive engine, yet a DM project is more likely to also be concerned with the entire processing flow--for instance, parallel computation techniques for efficient input of an enormous data volume (TB perhaps), which delivers a proto-result to a processing engine for computation of descriptive statistics (mean, standard deviation, distribution, etc.) on the variables (columns).
Lastly, consider the Netflix Prize. This competition was directed solely to Machine Learning--the focus was on the prediction algorithm, as evidenced by the fact that there was a single success criterion: accuracy of the predictions returned by the algorithm. Imagine if the 'Netflix Prize' were rebranded as a Data Mining competition. The success criteria would almost certainly be expanded to more accurately assess the algorithm's performance in the actual commercial setting--so for instance overall execution speed (how quickly the recommendations are delivered to the user) would probably be considered along with accuracy.
The terms "Information Retrieval" and "Data Mining" are now in mainstream use, though for a while I only saw these terms in my job description or in vendor literature (usually next to the word "solution.") At my employer, we recently hired a "Data Mining" analyst. I don't know what he does exactly, but he wears a tie to work every day.
I'd try to draw the line as follows:
Information retrieval is about finding something that already is part of your data, as fast as possible.
Machine learning is about techniques to generalize existing knowledge to new data, as accurately as possible.
Data mining is primarily about discovering something hidden in your data, that you did not know before, as "new" as possible.
They intersect and often use techniques of one another. DM and IR both use index structures to accelerate processes. DM uses a lot of ML techniques; for example, a pattern in the data set that is useful for generalization might be new knowledge.
They are often hard to separate. Do yourself a favor and don't just go for the buzzwords. In my opinion the best way of distinguishing them is by their intention, as given above: find data, generalize to new data, find new properties of existing data.
You can also add pattern recognition and (computational?) statistics as another couple of areas that overlap with the three you mentioned.
I'd say there is no well-defined line between them. What separates them is their history and their emphases. Statistics emphasizes mathematical rigor, data mining emphasizes scaling to large datasets, ML is somewhere in between.
Data mining is about discovering hidden patterns or unknown knowledge, which can be used for decision making by people.
Machine learning is about learning a model to classify new objects.

Resources