Apache Storm, Twitter

I am processing Twitter tweets (via twitter4j, configured through twitter4j.properties) in Storm bolts. My topology looks like this:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("KafkaSpout", new KafkaSpout(kafkaConfig), 2).setNumTasks(4);
builder.setBolt("Preprocessing", new preprocessBolt2(), 2).setNumTasks(4)
        .shuffleGrouping("KafkaSpout");
builder.setBolt("AvgScoreAnalysis", new AvgScoringBolt(), 4).setNumTasks(8)
        .fieldsGrouping("Preprocessing", new Fields("tweetId"));
builder.setBolt("PrinterBolt", new LocalFile(), 6).setNumTasks(4)
        .shuffleGrouping("AvgScoreAnalysis");
I take tweets from the KafkaSpout and send them to a bolt for pre-processing. My problem is in the AvgScoreAnalysis bolt, where I call S3: I keep a CSV file there for each user and calculate a score per user for every single tweet. Since I have 100 users, the bolt has to compute an average score for each tweet across all of the users' files in S3. It is pretty slow. How can I improve the performance of this bolt? Also, there are many duplicates in the output file; how can I remove the duplicates?

Calling S3 from the AvgScoringBolt is not a great idea if you want high performance: unless you are filtering the tweets by some criteria first, there is no way to open a connection to S3 for every tweet and still process thousands of tweets per second.
Since there are only 100 users, you could download the users' CSV files once when the application starts, do the computation inside the bolt against the downloaded copies (no S3 connection per tuple), and periodically upload the updated CSVs back to S3 to keep it loosely synchronized. I don't know whether this scenario fits your requirements.
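As a rough sketch of that pattern (not the original poster's code): S3CsvStore and UserScores below are hypothetical stand-ins for the actual S3 client and CSV parsing, the "tweet" field name is assumed, and the imports assume a recent Apache Storm release. The seen-ID set is one simple way to drop the duplicate tweets mentioned in the question:
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class AvgScoringBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, UserScores> userScores;           // userId -> data parsed from CSV
    private Set<String> seenTweetIds;                     // used to drop duplicate tweets
    private long lastSync;
    private static final long SYNC_INTERVAL_MS = 60_000;  // push to S3 at most once a minute

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // One bulk download at start-up instead of one S3 round trip per tuple
        this.userScores = S3CsvStore.downloadAllUserCsvs("my-bucket", "users/");
        this.seenTweetIds = new HashSet<>();
        this.lastSync = System.currentTimeMillis();
    }

    @Override
    public void execute(Tuple tuple) {
        String tweetId = tuple.getStringByField("tweetId");
        if (!seenTweetIds.add(tweetId)) {  // already processed: skip the duplicate
            collector.ack(tuple);
            return;
        }
        String text = tuple.getStringByField("tweet");
        // Average the per-user scores from the in-memory copies; no network I/O here
        double sum = 0;
        for (UserScores user : userScores.values()) {
            sum += user.score(text);
        }
        collector.emit(tuple, new Values(tweetId, sum / userScores.size()));
        collector.ack(tuple);

        // Loose synchronization: write the updated CSVs back to S3 periodically
        if (System.currentTimeMillis() - lastSync > SYNC_INTERVAL_MS) {
            S3CsvStore.uploadAll("my-bucket", "users/", userScores);
            lastSync = System.currentTimeMillis();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("tweetId", "avgScore"));
    }
}
Because the topology uses fieldsGrouping on tweetId, all copies of a given tweet land on the same task, so a per-task set is enough to catch duplicates; in a long-running topology you would want to bound its size (for example with an LRU set).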

Related

YouTube data API cost per request or per object?

I understand that a cost is incurred when I use the YouTube API service, but what I would like to know is whether the cost is per request or per object.
For example, when I query the metadata of 3 videos, would the cost be tripled for that one request, or would the cost be the same as if I queried the metadata for 1 video?
I assume you are talking about quota in the YouTube API v3. I suggest you visit this quota calculator: https://developers.google.com/youtube/v3/determine_quota_cost
This tool lets you estimate the quota cost for an API query. All API requests, including invalid requests, incur a quota cost of at least one point.
would the cost be tripled for that one request, or would the cost be the same as if I query the meta data for 1 video?
We can assume "the cost is the same as if I query the metadata for 1 video", because the documentation speaks about cost per "request", like this:
GET https://www.googleapis.com/youtube/v3/videos?part=snippet (quota cost: 3)
The request for multiple videos looks like this:
GET https://www.googleapis.com/youtube/v3/videos?part=snippet&id=zMbIipvQL0c%2CLOcKckBLouM&key={YOUR_API_KEY}
That is still a single request, so it also costs 3!
The real catch is when you have multiple pages of results:
Note: If your application calls a method, such as search.list, that returns multiple pages of results, each request to retrieve an additional page of results will incur the estimated quota cost.
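To make the per-request accounting concrete, here is a hedged sketch (plain Java 11+ HTTP, not an official client) that fetches both of the video IDs above in a single videos.list call; YOUR_API_KEY is a placeholder, and the whole request still costs 3 quota points for part=snippet:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BatchedVideoLookup {
    public static void main(String[] args) throws Exception {
        // Two video IDs, one request: the quota cost is charged once
        // (3 points for part=snippet), not once per video.
        URI uri = URI.create("https://www.googleapis.com/youtube/v3/videos"
                + "?part=snippet&id=zMbIipvQL0c%2CLOcKckBLouM&key=YOUR_API_KEY");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(),
                      HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // both videos' snippets in one payload
    }
}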

Getting a sum from Parse with Parse Cloud Code (for iOS app)

I'm new to Parse Cloud Code and am struggling with a seemingly simple task.
I'm working on a small iOS game where the users can choose from a list of characters to play -- imagine mario or luigi. In addition to tracking user scores in the game, I'm tracking total points for each character in Parse, so I can display a "mario" total and a "luigi" total (from all users.)
There could be multiple users playing at once (I hope), so I don't have Parse saving to just one mario and one luigi counter. Instead, each user gets a running count of their own mario and luigi scores.
So how do I pull the total marioPoints and total luigiPoints?
Parse doesn't have SQL-style querying, so I've been looking at Parse Cloud Code, and the "average stars" example in their guide (https://parse.com/docs/cloudcode/guide#cloud-code) looked kind of close at first glance.
But I can't get it sorted. And even if I could, it's limited to 1,000 responses, which wouldn't be enough. (I'm optimistic.)
Thanks!
Your best option is to keep a running total whenever an individual user's score is saved. Do that with a save hook (afterSave) and the increment(attr, amount) function.
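A minimal sketch of that hook (Cloud Code is JavaScript), assuming each play is saved as a hypothetical "PlaySession" row carrying the points earned in that session, and that a single pre-created "Totals" object holds the running sums; all class and column names here are illustrative:
Parse.Cloud.afterSave("PlaySession", function(request) {
  var query = new Parse.Query("Totals");
  query.first().then(function(totals) {
    // increment() is an atomic server-side counter update, so concurrent
    // saves from many players don't overwrite each other's totals.
    totals.increment("marioPoints", request.object.get("marioPoints") || 0);
    totals.increment("luigiPoints", request.object.get("luigiPoints") || 0);
    return totals.save();
  });
});
Reading back the one small Totals object from the iOS app is then a single cheap query, and it sidesteps the 1,000-result limit entirely because nothing is aggregated at read time.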

Twython Stream: did you reach the volume limit or not?

I am using Twython with the streaming API to retrieve specific tweets. There is a limit of 1% of the total volume of tweets that can be gathered. If a stream reaches this limit (https://dev.twitter.com/discussions/2907), there should be a notice in the JSON stream like
{"limit":{"track":1234}}
I tried to reach the limit by tracking every tweet that contains 'the', and every tweet in the whole world with location='-180.,-90.,180,90.', but I never received this limit/track message.
Could you tell me whether I can get access to it with Twython?
Best regards,
F.
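For reference, a limit notice arrives in-stream like any other message, so it can be observed from a TwythonStreamer subclass. A minimal sketch, assuming Twython hands every parsed stream message to on_success (the credentials are placeholders):
from twython import TwythonStreamer

class LimitAwareStreamer(TwythonStreamer):
    def on_success(self, data):
        if 'limit' in data:
            # Count of tweets withheld since the connection opened
            print('limit notice, undelivered tweets:', data['limit']['track'])
        elif 'text' in data:
            print(data['text'])

    def on_error(self, status_code, data):
        print('stream error:', status_code)

stream = LimitAwareStreamer('APP_KEY', 'APP_SECRET',
                            'OAUTH_TOKEN', 'OAUTH_TOKEN_SECRET')
stream.statuses.filter(track='the')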

Using Yahoo Pipes to filter tweets

I am trying to create a Yahoo pipe that ideally takes all tweets tweeted at any point in time and filters them down by a number of attributes to then display a filtered feed.
Basically in order this is what I want to happen:
Get a feed of all tweets at any one time.
Filter tweets by geolocation origin, i.e. UK,
Filter by a number of different combinations of keywords.
Output as an RSS feed (though this isn't really the crucial stage as Yahoo Pipes takes care of this anyway)
Disclaimer: of course I understand that there are limits to the number of tweets that can come through, but I would like to cast the input net as wide as possible.
I have managed to get stages 3 and 4 working correctly, and for the time being I am not really worrying about stage 2 (although if you have any suggestions, I am all ears), but stage 1 is where I am struggling. What I have attempted is using a Fetch Feed module with the URL http://search.twitter.com/search.atom?q=lang:en, but it seems that this only pulls 15 tweets. Is there any way I can pull more than 15 tweets each time the pipe is run? Otherwise I think this may all be in vain.
FYI, here is the link to the pipe as it stands - http://pipes.yahoo.com/ludus247/182ef4a83885698428d57865da5cf85b
Thanks in advance!

Collecting follower/friend IDs of a large number of users - Twitter4j

I'm working on a research project which analyses closure patterns in social networks.
Part of my requirement is to collect the follower and followee IDs of thousands of users under scrutiny.
My problem is that I exceed the rate limit of 350 requests/hour.
After just 4-5 requests my limit is exceeded, i.e., as soon as the number of followers I have collected passes the 350 mark.
For example, if I have 7 members, each with 50 followers, then collecting the follower details of just those 7 members exceeds my rate limit (7 * 50 = 350).
I found a related question on Stack Overflow: What is the most effective way to get a list of followers using Twitter4j?
The resolution mentioned there was to use the lookupUsers(long[] ids) method, which returns a list of User objects. But I can find no way in the API to get the screen names of the friends/followers of a particular User object. Am I missing something here? Is there a way to collect the friends/followers of thousands of users efficiently?
(Right now I'm using the standard approach: OAuth authentication (to get 350 requests/hour) followed by a call to twitter.getFollowersIDs.)
It's fairly straightforward to do this with a small number of API calls: one call fetches a batch of follower IDs, and batched users/lookup calls turn those IDs into full profiles.
Let's say you want to get all of my followers:
https://api.twitter.com/1/followers/ids.json?screen_name=edent
That will return up to 5,000 user IDs.
You do not need 5,000 calls to look them up!
You simply POST those IDs to users/lookup, up to 100 IDs per call.
You will then get back the full profile of every user following me, including the screen name.
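In Twitter4j, those two endpoints map onto getFollowersIDs and lookupUsers. A sketch, assuming OAuth credentials are configured in twitter4j.properties as in the question ("edent" is the screen name from the example above):
import java.util.Arrays;
import twitter4j.IDs;
import twitter4j.ResponseList;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;
import twitter4j.User;

public class FollowerScreenNames {
    public static void main(String[] args) throws Exception {
        Twitter twitter = TwitterFactory.getSingleton();
        // One call returns up to 5,000 follower IDs; pass ids.getNextCursor()
        // back in to page through accounts with more followers than that.
        IDs ids = twitter.getFollowersIDs("edent", -1);
        long[] followerIds = ids.getIDs();
        // users/lookup hydrates up to 100 IDs per call, so 5,000 followers
        // cost 1 + 50 requests instead of 5,000.
        for (int i = 0; i < followerIds.length; i += 100) {
            long[] batch = Arrays.copyOfRange(followerIds, i,
                    Math.min(i + 100, followerIds.length));
            ResponseList<User> users = twitter.lookupUsers(batch);
            for (User user : users) {
                System.out.println(user.getScreenName());
            }
        }
    }
}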
