What is the best visualization tool to analyze evolving communities in graph data? - time-series

I'm trying to analyze graphs from social media. The graphs contain time information, so it's possible to do some time series analysis. For each time point, I can run a community detection algorithm (e.g. the Louvain method) to detect the communities at that time. I can see that the communities are evolving: nodes in smaller communities sometimes merge into a bigger community, and sometimes communities split up. However, I have failed to find a comprehensive visualization tool to analyze and demonstrate the evolution of the communities.
Can anyone recommend a tool that serves this purpose? Thank you.
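Whatever tool ends up drawing the picture, it needs to know which community at one time point corresponds to which community at the next. The following is a minimal sketch of that matching step, assuming each snapshot is already available as a mapping from a community label to a set of node ids (for example, the output of a Louvain run). The function names and the 0.3 overlap threshold are illustrative assumptions, not part of any particular library.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of node ids."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_communities(prev, curr, threshold=0.3):
    """Link communities in consecutive snapshots whose overlap is high enough.

    prev, curr: dicts mapping a community label to a set of node ids.
    Returns a list of (prev_label, curr_label, similarity) links.
    """
    links = []
    for p_label, p_nodes in prev.items():
        for c_label, c_nodes in curr.items():
            sim = jaccard(p_nodes, c_nodes)
            if sim >= threshold:
                links.append((p_label, c_label, sim))
    return links


def classify_events(links):
    """Find merges (several predecessors) and splits (several successors)."""
    successors, predecessors = {}, {}
    for p_label, c_label, _ in links:
        successors.setdefault(p_label, []).append(c_label)
        predecessors.setdefault(c_label, []).append(p_label)
    merges = {c: ps for c, ps in predecessors.items() if len(ps) > 1}
    splits = {p: cs for p, cs in successors.items() if len(cs) > 1}
    return merges, splits


# Toy snapshots: communities A and B at time t merge into X at time t+1.
snapshot_t = {"A": {1, 2, 3}, "B": {4, 5}}
snapshot_t1 = {"X": {1, 2, 3, 4, 5}}
links = match_communities(snapshot_t, snapshot_t1)
merges, splits = classify_events(links)
```

The resulting links (source community, target community, overlap) are the kind of flow data that an alluvial or Sankey-style diagram takes as input.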


Formal Concept Analysis data-sets

I am currently completing a postgrad degree in Information Systems Management and have been given a thesis topic that relates to Formal Concept Analysis.
The objective is to compare open-source software that can read data and represent it as a lattice diagram (an example application is Concept Explorer). Additionally, the performance of these tools needs to be compared across varying data-set sizes etc.
The main problem I'm experiencing is finding data sets that are compatible with these tools and big enough to test the limits of each application, as well as figuring out how to accurately measure things such as the CPU time taken to draw the lattice diagram. Data for formal contexts generally follows a binary relation, such as a cross table that shows which objects have which attributes.
As such, my question is: where would I find such data, and how could I manipulate it to be usable with software like Concept Explorer?
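One way to get contexts of arbitrary size is to generate them. The sketch below builds a random binary cross table and serialises it in what I understand to be the Burmeister (.cxt) layout that Concept Explorer can import; the exact layout is an assumption and should be checked against the tool's documentation. The object and attribute names, the density and the sizes are all made up for illustration.

```python
import random
import time


def random_context(n_objects, n_attributes, density=0.2, seed=0):
    """Build a random binary cross table as a list of rows of booleans."""
    rng = random.Random(seed)
    return [[rng.random() < density for _ in range(n_attributes)]
            for _ in range(n_objects)]


def to_cxt(table):
    """Serialise a cross table in the (assumed) Burmeister .cxt layout."""
    n_objects = len(table)
    n_attributes = len(table[0]) if table else 0
    # Header: format marker, empty context name, the two sizes, blank line.
    lines = ["B", "", str(n_objects), str(n_attributes), ""]
    lines += [f"obj{i}" for i in range(n_objects)]
    lines += [f"attr{j}" for j in range(n_attributes)]
    # One row per object: X where the object has the attribute, . otherwise.
    lines += ["".join("X" if cell else "." for cell in row) for row in table]
    return "\n".join(lines) + "\n"


# CPU time used by this Python process to build and serialise the context.
start = time.process_time()
table = random_context(1000, 50)
text = to_cxt(table)
elapsed = time.process_time() - start
```

Note that `time.process_time()` only measures the CPU time of the Python process itself. Timing a separate application such as Concept Explorer would need operating-system level measurement of that application's process.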
P.S. I am new here, so not sure if this is posted in the right place!

Web page recommender system

I am trying to build a recommender system that recommends web pages to a user based on their actions (Google searches, clicks, and explicit ratings of web pages). To give an idea, consider the way Google News does it: it displays news articles from the web on a particular topic. In technical terms that is clustering, but my aim is similar. It will be content-based recommendation driven by the user's actions.
So my questions are:
How can I crawl the internet to find related web pages?
What algorithm should I use to extract data from a web page? Are textual analysis and word frequency the only way to do it?
Lastly, what platform is best suited for this problem? I have heard of Apache Mahout, which comes with some reusable algorithms. Does it sound like a good fit?
As Thomas Jungblut said, one could write several books on your questions ;-)
I will try to give you a list of brief pointers - but be aware there will be no ready-to-use off-the-shelf solution ...
Crawling the internet: There are plenty of toolkits for doing this, like Scrapy for Python, crawler4j and Heritrix for Java, or WWW::Robot for Perl. For extracting the actual content from web pages, have a look at boilerpipe.
First of all, often you can use collaborative filtering instead of content-based approaches. But if you want to have good coverage, especially in the long tail, there will be no way around analyzing the text. One thing to look at is topic modelling, e.g. LDA. Several LDA approaches are implemented in Mallet, Apache Mahout, and Vowpal Wabbit.
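To make the word-frequency idea from the question concrete, here is a minimal sketch of content-based similarity using TF-IDF weights and cosine similarity, written in plain Python. It is a simpler technique than the LDA topic models mentioned above and is not how Mallet, Mahout or Vowpal Wabbit implement anything; the example pages and the smoothed IDF formula are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF so a term present in every document keeps some weight.
    idf = {t: math.log((1 + n) / (1 + df_t)) + 1.0 for t, df_t in df.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


pages = [
    "python web crawler tutorial",
    "building a web crawler in python",
    "chocolate cake recipe",
]
vectors = tfidf_vectors(pages)
# Rank the other pages by similarity to the page the user just liked (page 0).
ranked = sorted(range(1, len(pages)),
                key=lambda i: cosine(vectors[0], vectors[i]), reverse=True)
```

For real page collections, a search library such as Lucene does this kind of term weighting far more efficiently than a pure-Python loop.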
For indexing, search, and text processing, have a look at Lucene. It is an awesome, mature piece of software.
Besides Apache Mahout which also contains things like LDA (see above), clustering, and text processing, there are also other toolkits available if you want to focus on collaborative filtering: LensKit, which is also implemented in Java, and MyMediaLite (disclaimer: I am the main author), which is implemented in C#, but also has a Java port.
This should be a good read: Google news personalization: scalable online collaborative filtering
It's focused on collaborative filtering rather than content based recommendations, but it touches some very interesting points like scalability, item churn, algorithms, system setup and evaluation.
Mahout has very good collaborative filtering techniques, which is what you describe as using the behaviour of the users (clicks, reads, etc.), and you could introduce some content-based logic using the rescorer classes.
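The combination of behaviour-based scores with a content-based adjustment can be sketched as follows. This is plain Python for illustration only: the co-occurrence scoring is a deliberately simple stand-in for Mahout's recommenders, and the `rescore` callback is a hypothetical hook that only mimics the idea of a rescorer, not Mahout's actual interface. User names and page ids are made up.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(user_items):
    """Count how often two items were clicked by the same user."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_items.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user, user_items, rescore=None, top_n=3):
    """Score unseen items by co-occurrence with the user's own items.

    rescore: optional callable (item, score) -> score that applies a
    content-based adjustment (hypothetical hook, not Mahout's API).
    """
    counts = cooccurrence(user_items)
    seen = user_items[user]
    scores = defaultdict(float)
    for item in seen:
        for other, c in counts[item].items():
            if other not in seen:
                scores[other] += c
    if rescore is not None:
        scores = {item: rescore(item, s) for item, s in scores.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


clicks = {
    "alice": {"page_a", "page_b"},
    "bob": {"page_a", "page_b", "page_c"},
    "carol": {"page_a", "page_d"},
}
plain = recommend("alice", clicks)
# Example content rule: boost page_d, e.g. because it matches a liked topic.
boosted = recommend("alice", clicks,
                    rescore=lambda item, s: s * 5 if item == "page_d" else s)
```

Without the content rule, page_c ranks first because it co-occurs with both of Alice's pages; with the boost, page_d overtakes it.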
You might also want to have a look at Myrrix, which is in some ways the evolution of the Taste (aka recommendations) portion of Mahout. It also allows applying content-based logic on top of collaborative filtering using the rescorer classes.
If you are interested in Mahout, the Mahout in Action book would be the best place to start.

Machine learning/information retrieval project

I’m reading towards an M.Sc. in Computer Science and have just completed the first year of the course (it is a two-year course). Soon I have to submit a proposal for the M.Sc. project. I have selected the following topic.
“Suitability of machine learning for document ranking in information retrieval system”. Researchers have been using various machine learning algorithms for ranking documents. So as the first phase of the project I will be doing a complete literature survey and finding out advantages/disadvantages of current approaches. In the second phase of the project I will be proposing a new (modified) algorithm in order to overcome the limitations of current approaches.
Actually, my question is whether this type of project is suitable as an M.Sc. project. Moreover, if somebody has an interesting idea in the information retrieval field, would you be willing to share it with me?
Ranking is always the hardest part of any information retrieval system. I think it is a very good topic, but you have to take care to define the scope of the work as soon as possible. You will probably not be able to develop a new IR engine, but you can build a prototype based on, e.g., Apache Lucene.
Currently there are a lot of datasets, including the Stack Overflow data dump, which provide all the information you need to define a rich feature vector (number of points, time, topics mined from previous questions, popularity of a tag, etc.) for your machine learning ranking algorithm. In this part of the work you could, e.g., classify types of features (e.g., user-specific features, or semantic features such as a software name in the title) and perform a series of experiments to learn which features are most important and which are not for a given dataset.
The second direction of such a project can be how to perform learning efficiently. The reason is the quantity of data on the web or in community forums and the changes in a forum over time (this would be important if you use community-specific features), e.g., changes in technologies, new software releases, etc.
There are many other topics related to search and machine learning. The best idea is to search scholar.google.com for recent survey papers on ranking, machine learning, and search to learn what the state of the art is. The very next step would be to talk with your M.Sc. supervisor.
Good luck!
Everything you said is good and should be done, but you forgot the most important part:
Prove that your algorithm is better and/or faster than other algorithms, with good experiments and maybe some statistics (p-value, confidence interval).
If you do that and convince people that your algorithm is useful you surely will not fail :)
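As one possible way to attach a p-value to such a comparison, here is a minimal sketch of a paired randomisation (sign-flip) test on per-query scores. It assumes both rankers were evaluated on the same queries with the same metric; the scores below are made-up numbers, and the function name and number of resamples are illustrative choices.

```python
import random
from statistics import mean


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired randomisation test on the mean per-query difference.

    Returns the observed mean difference and an estimated p-value.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = mean(diffs)
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_resamples + 1)


# Made-up per-query scores for a new ranker and a baseline on 8 queries.
new_ranker = [0.62, 0.71, 0.55, 0.80, 0.67, 0.73, 0.59, 0.66]
baseline = [0.58, 0.65, 0.54, 0.72, 0.61, 0.70, 0.57, 0.60]
diff, p_value = paired_permutation_test(new_ranker, baseline)
```

A small p-value here only says the difference is unlikely to be chance on these queries; it says nothing about whether the difference is large enough to matter in practice.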

Map Reduce Algorithms on Terabytes of Data?

This question does not have a single "right" answer.
I'm interested in running Map Reduce algorithms, on a cluster, on Terabytes of data.
I want to learn more about the running time of said algorithms.
What books should I read?
I'm not interested in setting up Map Reduce clusters, or running standard algorithms. I want rigorous theoretical treatments of running time.
EDIT: The issue is not that map reduce changes running time. The issue is -- most algorithms do not distribute well to map reduce frameworks. I'm interested in algorithms that run on the map reduce framework.
Technically, there's no real difference in the runtime analysis of MapReduce in comparison to "standard" algorithms - MapReduce is still an algorithm just like any other (or specifically, a class of algorithms that occur in multiple steps, with a certain interaction between those steps).
The runtime of a MapReduce job is still going to scale how normal algorithmic analysis would predict, when you factor in division of tasks across multiple machines and then find the maximum individual machine time required for each step.
That is, if you have a task which requires M map operations, and R reduce operations, running on N machines, and you expect that the average map operation will take m time and the average reduce operation r time, then you'll have an expected runtime of ceil(M/N)*m + ceil(R/N)*r time to complete all of the tasks in question.
Predicting the values of M, R, m, and r can be accomplished with normal analysis of whatever algorithm you're plugging into MapReduce.
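The estimate above can be written as a small function. This is only a sketch of that formula: it assumes the reduce phase starts after all maps have finished, that every machine runs one task at a time, and it ignores shuffle, scheduling and straggler overheads. The task counts and timings in the example are made up.

```python
import math


def expected_runtime(num_maps, num_reduces, machines, map_time, reduce_time):
    """Estimate ceil(M/N)*m + ceil(R/N)*r for a single MapReduce job."""
    map_waves = math.ceil(num_maps / machines)        # rounds of map tasks
    reduce_waves = math.ceil(num_reduces / machines)  # rounds of reduce tasks
    return map_waves * map_time + reduce_waves * reduce_time


# 1000 map tasks of 30 s and 50 reduce tasks of 120 s on 20 machines:
# ceil(1000/20) * 30 + ceil(50/20) * 120 = 50 * 30 + 3 * 120 = 1860 seconds.
estimate = expected_runtime(1000, 50, 20, 30, 120)
```

The ceilings matter: 50 reduce tasks on 20 machines need three rounds, not 2.5, so the last round leaves half the cluster idle.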
There are only two published books that I know of, but there are more in the works:
Pro Hadoop and Hadoop: The Definitive Guide
Of these, Pro Hadoop is more of a beginner's book, whilst The Definitive Guide is for those who know what Hadoop actually is.
I own The Definitive Guide and think it's an excellent book. It provides good technical detail on how HDFS works, and covers a range of related topics such as MapReduce, Pig, Hive, HBase etc. It should also be noted that this book was written by Tom White, who has been involved with the development of Hadoop for a good while and now works at Cloudera.
As far as the analysis of algorithms on Hadoop goes, you could take a look at the TeraByte sort benchmarks. Yahoo have done a write-up of how Hadoop performs for this particular benchmark: TeraByte Sort on Apache Hadoop. This paper was written in 2008.
More details about the 2009 results can be found here.
There is a great book about data mining algorithms applied to the MapReduce model.
It was written by two Stanford professors and it is available for free:

Writing an image processing application for analysis of satellite imagery

I have to start work on an application for analysis of satellite imagery to identify some man-made structures. I would like to use C or Java for this.
For satellite imagery I am planning to use Google Maps data.
I have three questions here:
What is the best source for GIS data besides Google Maps/Earth?
What is the best language to write such an application in, considering I will have to use third-party APIs?
Is there an open image processing engine available that identifies man-made structures?
That's a lot of questions, but I hope the smarter guys here can help me.
Overly processed imagery such as Google or Bing maps is a horrible source of imagery for performing feature extraction or feature recognition. Usually, you want the most unprocessed, raw form possible with camera models... of course, if you don't have access to this sort of data, then you have to work with what you have.
A more important consideration with Google Maps/Earth imagery is that you may run afoul of their License Agreement. I suggest you check it before you decide on their data as your imagery source. In particular, if you bypass their APIs, you've violated their license agreement.
As far as libraries and languages go, there are dozens of machine vision libraries available. I can't recommend one over the other as I've only been a down-stream consumer of their results. My understanding of the problem is that the biggest concern is how you build the "models" to compare against... i.e. how do you give the system an "example" of what you're looking for.
Once you've found a library, then you can make a decision on the language. Generally, a high-level language like Python or Matlab is used for this kind of prototyping. Once a method has been found, then conversion to a "higher performance" language is done--if necessary.
Personally, I'd probably use Python because (1) it's freely available, (2) has a significant community in the scientific and research worlds, and (3) can interop with a wide variety of languages and platforms.
Specifically, check out Glovis: http://glovis.usgs.gov/
You can browse the earth, and download maps from several different satellites and sensors. Even though you have to go through a bogus "ordering" process, the imagery is free.
You may find the USGS (United States Geological Survey) website helpful. They provide both GIS information and a wide range of data sets.
I agree with James Schek. Google gives you RGB images - not the most helpful for your task. Most imagery will have a couple of additional channels that may be better suited for you. Different channels show different features: water, urban areas, types of foliage etc. For example, an infrared channel could be used to pick out buildings in a cool climate. If you contact several data providers, they may be able to recommend the best channels to use in their data.
Aerial imagery can be huge: numerous terabytes for a detailed world database. Carefully consider how much information you need to process. If you are only doing a few square miles, performance is not an issue. If you are processing thousands of square miles, performance becomes an issue. If you are processing millions, performance is mission critical and must be considered from day one.
Knowing the number of channels you need to process, your performance requirements and the file format of your data, look around for libraries that fulfil all your requirements. Many of them are written in C/C++, so using a language that interoperates with both could be helpful.
Take a look at this demo: Finding Vegetation in a Multispectral Image, part of the Image Processing Toolbox in MATLAB. It is related to your problem of analysing satellite images to find specific patterns.
I believe it's an excellent example of the sort of things you can achieve easily with MATLAB using very little code.
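For readers without MATLAB, the core idea of using extra channels to separate vegetation from everything else can be sketched in a few lines of Python. This example uses the normalised difference vegetation index (NDVI), a common index computed from the near-infrared and red bands; whether the demo uses exactly this index is an assumption on my part. The reflectance values and the 0.4 threshold are made up for illustration.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index for one pixel."""
    total = nir + red
    return (nir - red) / total if total else 0.0


def vegetation_mask(nir_band, red_band, threshold=0.4):
    """Mark pixels whose NDVI exceeds a threshold.

    nir_band, red_band: 2-D lists of reflectance values of equal shape.
    """
    return [[ndvi(n, r) > threshold for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir_band, red_band)]


# Made-up 2x2 reflectance values: the left column behaves like vegetation.
nir = [[0.8, 0.3], [0.7, 0.2]]
red = [[0.1, 0.25], [0.2, 0.2]]
mask = vegetation_mask(nir, red)
```

Pixels that fall outside the vegetation mask are only candidates for man-made structures; bare soil, rock and water also have low NDVI, so further features would be needed to tell them apart.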
