I'm working on a social network mining project, and I'm looking for a "real social network dataset" (comments, comments on comments, likes, friendships, interests, feelings, places, liked pages, published photos, videos, posts, hashtags; anything more is a plus).
I searched a lot, but all the available networks consist of just nodes and edges (like "A follows B"). For example:
http://snap.stanford.edu/
I searched Twitter, but its data is not open because of its privacy terms.
http://an.kaist.ac.kr/traces/WWW2010.html
Does anyone have a suggestion for a dataset?
I found this tutorial, which shows step by step how to stream Twitter data into Hadoop:
https://github.com/DataDanSandler/YouTubeREADME#video-5-configure-flume
Just trying to figure out how to keep the software engineers at my company trained. How are you getting trained, given working from home and/or tech conferences being cancelled for the foreseeable future?
I would consider tech conferences of minimal value when it comes to training a software engineer. There are plenty of courses you can take online in terms of training; you could ask your engineers to complete online courses. Pluralsight, Safari, YouTube, and now LinkedIn all have training courses.
Friends,
We are trying to work on a problem where we have a dump of reviews only, with no ratings, in a .csv file. Each row in the .csv is one review given by a customer of a particular product, let's say a TV.
Here, I want to classify that text into the pre-defined categories given by the domain expert for those products:
Quality
Customer Support
Positive Feedback
Price
Technology
Some reviews are as below:
Bought this product recently, feeling a great product in the market.
Was waiting for this product since long, but disappointed
The built quality is not that great
LED screen is picture perfect. Love this product
Damm! bought this TV 2 months ago, guess what, screen showing a straight line, poor quality LED screen
This has very complicated options, documentation of this TV is not so user-friendly
I cannot use my smart device to connect to this TV. Simply does not work
Customer support is very poor. I don't recommend this
Works great. Great product
Now, with the above reviews from different customers, how do I categorize them into the given buckets (call it multi-label classification, named-entity recognition, information extraction with sentiment analysis, or anything else)?
I tried all the word-frequency-counting NLP approaches (in R) and looked at StanfordNLP (https://nlp.stanford.edu/software/CRF-NER.shtml) and many more, but could not get a concrete solution.
Can anybody please guide me on how to tackle this problem? Thanks!
Most NLP frameworks will handle multi-class classification. Word counts by themselves in R will likely not be very accurate. A Python library you can explore is spaCy. Commercial APIs from Google, AWS, and Microsoft can also be used. You will need quite a few examples per category for training. Feel free to post your code and the problem or performance gap you see for further help.
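To make this concrete, here is a minimal multi-class sketch using scikit-learn (rather than spaCy). The training texts and labels below are invented for illustration; in practice you would need many more labeled examples per category:

```python
# Toy multi-class review classifier: TF-IDF features + logistic regression.
# The labeled examples here are made up; real training data is required.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "screen showing a straight line, poor quality LED screen",
    "the built quality is not that great",
    "customer support is very poor, nobody answers",
    "support never responded to my calls",
    "love this product, works great",
    "was waiting for this product, very happy with it",
    "too expensive for what it offers",
    "great price for a smart TV",
    "very complicated options, documentation not user-friendly",
    "cannot connect my smart device to this TV",
]
train_labels = [
    "Quality", "Quality",
    "Customer Support", "Customer Support",
    "Positive Feedback", "Positive Feedback",
    "Price", "Price",
    "Technology", "Technology",
]

# Pipeline: turn each review into TF-IDF n-gram features, then classify.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# Predict the category of a new, unlabeled review.
print(model.predict(["LED screen broke after two months"])[0])
```

With only two examples per class this is just a shape of the solution; the same pipeline scales to a realistically sized labeled set, and you can swap in other classifiers without changing the surrounding code.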
I've been searching the web and found that media outlets such as CNN and NPR provide links to their transcripts. Obtaining them requires writing something like a crawler, which is not so convenient. The reason is that I'm trying to use transcripts of TV shows, interviews, radio, and movies as training data for my natural language processing projects. So I'm wondering whether there is any collection or database freely available on the web so that I can download all of them at once without writing a crawler myself?
I would recommend the British National Corpus. I would also mention the American National Corpus, but the transcripts there are only of phone calls or face-to-face conversations; no news, TV shows, etc.
You also mentioned CNN and NPR. There are transcripts from 1996 available as an LDC corpus here.
I know how to communicate with Twitter and how to retrieve tweets, but I am looking to do further work with those tweets.
I have two categories, food and sports. Now I want to categorize tweets into food and sports. Can anyone please suggest how to categorize them algorithmically?
Regards,
Gaurav
I've been doing some work recently with Latent Dirichlet Allocation. The general idea is that documents contain words generated from topics. What you could try is loading a corpus of documents known to be about the topics you are interested in, updating the model with the tweets of interest, and then selecting tweets that have strong probabilities for the same topics as your known documents.
I use R for LDA (package:topicmodels and package:lda), but I think there are some prebuilt Python tools for this too. I would probably steer away from writing your own unless you have a solid grounding in Bayesian statistics.
Here's the documentation for the topicmodels package: http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf
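On the Python side, the same workflow can be sketched with scikit-learn's LatentDirichletAllocation (gensim is another common choice). The corpus below is a toy stand-in for documents known to be about food and sports, and the number of topics is fixed at two:

```python
# Toy LDA workflow: fit topics on a small seed corpus, then infer the
# topic mixture of a new tweet. The documents are invented examples.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "pizza pasta restaurant dinner recipe cooking",
    "recipe ingredients dinner chef restaurant food",
    "football goal match team score league",
    "basketball team player score match game",
]

# LDA works on raw word counts, not TF-IDF.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Infer the per-topic probabilities for a new tweet; keep tweets whose
# distribution is strongly concentrated on the topic you care about.
tweet = ["great goal by the team in the match"]
theta = lda.transform(vectorizer.transform(tweet))[0]
print(theta)
```

Note that with a corpus this small the topics are not meaningful; the point is only the shape of the pipeline (fit on known documents, then `transform` new tweets and threshold on the dominant topic probability).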
I doubt that a set of algorithms could categorize tweets in an open domain. In other words, I don't think a set of rules can categorize open-domain tweets. You would need to parse tweets into a semantic representation customized for the categorization task.
I would like to know how StumbleUpon recommends articles to its users.
Is it using a neural network or some other machine-learning algorithm, is it recommending articles based on what the user 'liked', or is it simply recommending articles based on the tags in the interests area? By tags I mean using something like item-based collaborative filtering, etc.
First, I have no inside knowledge of S/U's recommendation engine. What I do know, I've learned from following this topic for the last few years, from studying the publicly available sources (including StumbleUpon's own posts on its company site and blog), and of course from being a user of StumbleUpon.
I haven't found a single source, authoritative or otherwise, that comes anywhere close to saying "here's how the S/U recommendation engine works." Still, this is arguably the most successful recommendation engine ever; the statistics are insane. S/U accounts for over half of all referrals on the Internet, substantially more than Facebook, despite having a fraction of Facebook's registered users (Facebook's 800 million versus S/U's 15 million). What's more, S/U is not really a site with a recommendation engine, like, say, Amazon.com; instead, the site itself is a recommendation engine. There is a substantial volume of discussion and gossip among the fairly small group of people who build recommendation engines, and if you sift through it, I think it's possible to reliably discern the types of algorithms used, the data sources supplied to them, and how these are connected in a working data flow.
The description below refers to my diagram at the bottom. Each step in the data flow is indicated by a Roman numeral. My description proceeds backwards, beginning with the point at which the URL is delivered to the user; hence, in actual use, step I occurs last and step V occurs first.
salmon-colored ovals => data sources
light blue rectangles => predictive algorithms
I. A web page recommended to an S/U user is the last step in a multi-step flow.
II. The StumbleUpon recommendation engine is supplied with data (web pages) from three distinct sources:
web pages tagged with topic tags matching your pre-determined Interests (topics a user has indicated as interests, which are available to view/revise by clicking the "Settings" tab in the upper right-hand corner of the logged-in user page);
socially endorsed pages (*pages liked by this user's friends*); and
peer-endorsed pages (*pages liked by similar users*).
III. Those sources, in turn, are results returned by StumbleUpon's predictive algorithms (Similar Users refers to users in the same cluster, as determined by a clustering algorithm, perhaps k-means).
IV. The data fed to the clustering engine to train it consists of web pages annotated with user ratings.
V. This data set (web pages rated by StumbleUpon users) is also used to train a supervised classifier (e.g., a multi-layer perceptron or a support-vector machine). The output of this supervised classifier is a class label applied to web pages not yet rated by a user.
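To make the clustering step concrete, here is a toy sketch of grouping users by their page ratings with k-means. The rating matrix is entirely invented, and nothing here reflects StumbleUpon's actual implementation; it only illustrates how "similar users" could fall out of a clustering algorithm:

```python
# Toy "similar users" clustering: rows are users, columns are web pages,
# values are thumb-up (1), unrated (0), or thumb-down (-1). All invented.
import numpy as np
from sklearn.cluster import KMeans

ratings = np.array([
    [ 1,  1,  0, -1,  0],   # user 0
    [ 1,  1,  1,  0, -1],   # user 1 (rates like user 0)
    [ 0, -1,  1,  1,  1],   # user 2
    [-1,  0,  1,  1,  1],   # user 3 (rates like user 2)
])

# Users landing in the same cluster are treated as "similar users",
# whose liked pages become peer-endorsed recommendations.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ratings)
print(km.labels_)
```

In a real system the rating matrix would be huge and sparse, so it would typically be reduced (e.g., via matrix factorization) before clustering, but the idea of deriving user clusters from rating behavior is the same.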
The single best source I have found that discusses S/U's recommendation engine in the context of other recommender systems is this BetaBeat post.