About data mining using Twitter data

I plan to write a thesis about using sentiment information to improve the predictive power of a financial trading model for currencies.
The sentiment data would be Twitter posts containing some keyword, like "EUR.USD". I will then filter for sentiment words to identify the sentiment. Simple idea. Then we try to see whether there is any relation between the degree of sentiment and the movement of EUR.USD.
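Something like this minimal sketch is what I have in mind (the word lists are made up; a real thesis would use a proper sentiment lexicon):

    # Hypothetical word lists for illustration only.
    POSITIVE = {"bullish", "up", "gain", "strong"}
    NEGATIVE = {"bearish", "down", "loss", "weak"}

    def sentiment_score(tweets):
        """Net sentiment of a day's tweets: positive minus negative word counts."""
        score = 0
        for text in tweets:
            words = text.lower().split()
            score += sum(w in POSITIVE for w in words)
            score -= sum(w in NEGATIVE for w in words)
        return score

    # Example: score for one day's tweets containing "EUR.USD".
    print(sentiment_score(["EUR.USD looking bullish today", "EUR.USD down, weak euro"]))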
My big concern is the Twitter data. As we all know, Twitter limits access to historical data: you can only search back about 5 days. That is not enough, since our strategy is based on daily sentiment.
I noticed that Google has a fantastic timeline feature for Twitter updates: http://www.readwriteweb.com/archives/googles_twitter_timeline_lets_you_explore_the_past.php
But first of all, I am in Switzerland, and it seems I don't have this feature; Google is smart enough to identify my location and may block US-only features like this one. Secondly, even if I could see the fancy interactive Google timeline control in my Firefox, how could I dig the data out of my query and save it? Does Google supply such an API?

The Google service you mentioned has shut down recently so you won't be able to use it. (http://www.searchenginejournal.com/google-realtime-shuts-down-as-twitter-deal-expires/31007/)
If you need a longer timespan of data to analyze, I see the following options:
pay for historical data :) (https://dev.twitter.com/docs/twitter-data-providers)
if you don't want to pay, you need to fetch tweets containing EUR/USD or whatever else you need (you could use the streaming API for this, as in the sketch below) and store them somewhere. Run this collector for a while (if possible) and you'll have more than just 5 days of data.
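A minimal sketch of that collection step using the tweepy library (pre-4.0 interface); the credentials and output file name are placeholders:

    import json
    import tweepy

    # Placeholder credentials from a Twitter developer account.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    class Collector(tweepy.StreamListener):
        def on_status(self, status):
            # Append each matching tweet as one JSON line for later analysis.
            with open("eurusd_tweets.jsonl", "a") as f:
                f.write(json.dumps(status._json) + "\n")

    # Track a few keyword variants; runs until interrupted.
    tweepy.Stream(auth, Collector()).filter(track=["EURUSD", "EUR/USD", "EUR.USD"])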

Related

Can a group of 3 researchers share/pool Twitter API tokens to accelerate/improve data collection on a sentiment analysis project?

Our group is working on a sentiment analysis research project. We are trying to use the Twitter API to collect tweets. Our target dataset involves a lot of query terms and filters. However, since each of us has a developer account, we were wondering if we can pool API access tokens to accelerate the data collection. For example, we would make an app that reads a configuration file containing a list of our access tokens, and the app would try each of them when searching for a tweet. This app would be run on our local computer. Since the app uses our individual access tokens, we believe we are not actually bypassing or changing any Twitter limit, as the record is kept for each access token. Are there any problems, legal or technical, that may arise from this methodology? Thank you! =D
Here is pseudocode for what we are trying to do:
1. Define a list of search terms such as 'apple', 'banana' and 'oranges' (we have 100 of these search terms; we are okay with the 100 limit per tweet).
2. Define a list of frequent emotional adjectives such as 'happy', 'sad', 'crazy', etc. (we have 100 of these), selected using TF-IDF.
3. Take the product of the search terms and emotional adjectives; in total we have 10,000 query terms, and we have computed from the rate-limit rules that we would need at least 55 runs of 15-minute sessions at 180 requests per 15-minute window. 55 * 15 = 825 minutes, or ~14 hours, to collect this amount of tweets.
4. We were thinking of improving the data collection by pooling access tokens so that we can trim the collection time down from 14 hours to ~4 hours, e.g. by dividing the query items into subsets and letting a specific access token work on a subset (see the sketch after this list).
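A rough sketch of steps 3 and 4 (the term lists are illustrative and the tokens are placeholders; the real lists have ~100 entries each):

    from itertools import product

    search_terms = ["apple", "banana", "oranges"]   # ~100 terms in practice
    adjectives = ["happy", "sad", "crazy"]          # ~100 adjectives in practice

    # Step 3: cross product -> ~10,000 combined query strings like "apple happy".
    queries = ["%s %s" % (term, adj) for term, adj in product(search_terms, adjectives)]

    # Step 4 (the pooling idea in question): split the queries into one subset
    # per access token so each token only works through its own share.
    tokens = ["TOKEN_A", "TOKEN_B", "TOKEN_C"]      # placeholder access tokens
    subsets = {tok: queries[i::len(tokens)] for i, tok in enumerate(tokens)}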
We are pushing for this because we think it would be efficient if it is possible and permitted, and it might help future research as well.
The question is: are we actually breaking any Twitter rules or policies by doing this? By each of us three sharing one access token and creating an app named as a clone of the research project, we believe that in turn we are also losing something, namely the headroom for one more app that we fully control.
I can't find a specific rule about this in Twitter's documentation so far. Our concern is that we will publish a paper as well as the app we program and use, for documentation purposes. Disclaimer: only the app's source code will be published, not the dataset, because of Twitter's explicit rules about datasets.
This is absolutely not allowed under the Twitter Developer Policy and Agreement.
Twitter developer policy 5a:
Do not do any of the following:
Use a single application API key for multiple use cases or multiple application API keys for the same use case.
Feel free to check with Twitter directly via the developer forums. StackOverflow is not really the best place for this question since it is not specifically a coding question.

Is it possible to receive keyword-level Google Display Network ad cost via the API?

I've skimmed through the Keywords Performance Report section of the API documentation and couldn't work out whether I can use this report to determine daily keyword costs.
What I want, basically, is to look up a keyword from an API request result and get the cost associated with it. Is such a thing possible? Am I looking in the right place?
Apparently, it's not possible to do so: all costs for Display Network items are reported under a special ID (3000000), which is meant to capture all GDN displays.
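For reference, a minimal sketch of pulling that report with the googleads Python client (the API version string and output file are illustrative); on the Display Network the rows come back under the aggregate ID mentioned above rather than per keyword:

    from googleads import adwords

    client = adwords.AdWordsClient.LoadFromStorage()             # reads googleads.yaml
    downloader = client.GetReportDownloader(version="v201809")   # version is illustrative

    # Daily keyword costs; Cost is returned in micros (millionths of the account currency).
    query = ("SELECT Id, Criteria, Cost, Date "
             "FROM KEYWORDS_PERFORMANCE_REPORT DURING LAST_7_DAYS")

    with open("keyword_costs.csv", "w") as output:
        downloader.DownloadReportWithAwql(
            query, "CSV", output,
            skip_report_header=True, skip_report_summary=True)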

How to get the greatest flow of Tweets?

I am working on a project using the Twitter API, doing data analysis on Twitter data. I have to stream the greatest number of Tweets possible to run my algorithm on them. Everything works well, but I am quite disappointed by the number of Tweets I can actually collect.
Using the STREAMING API, I only have access to around 4,500 Tweets per hour in an area like London. The more Tweets I can collect, the better my analysis. I read somewhere that someone was collecting more than 100,000 per hour...
Do you think my authentication might be rate limited?
Does SEARCH allow us to collect more Tweets than STREAMING?
Thank you.
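For what it's worth, here is roughly how I measure the hourly volume (tweepy, pre-4.0 interface; the credentials and the London bounding box are placeholders):

    import time
    import tweepy

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    class HourlyCounter(tweepy.StreamListener):
        def __init__(self):
            super(HourlyCounter, self).__init__()
            self.count, self.start = 0, time.time()

        def on_status(self, status):
            self.count += 1
            if time.time() - self.start >= 3600:
                print("tweets in the last hour: %d" % self.count)
                self.count, self.start = 0, time.time()

    # Rough bounding box for Greater London (SW lon, SW lat, NE lon, NE lat).
    tweepy.Stream(auth, HourlyCounter()).filter(locations=[-0.51, 51.28, 0.33, 51.69])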

Collecting all or at least a good sample of tweets related to a particular event

I'm interested in using Tweets in a complex networks research project, for which I have to collect a good amount of tweets (an unbiased sample) related to a particular event, for example the Boston bombing. I need to collect the tweets related to this event from the event date up to the current date.
I tried the Search API and learned that I can't retrieve tweets older than a week. I have a small doubt regarding the Streaming API: does the Streaming API allow you to get older tweets rather than only real-time tweets? Also, I don't need tweets from one particular user.
Is there any way I can collect the necessary tweets? If not, can you point me to any other archiving sources I could use?
Thanks a lot.
The streaming API is for accessing what is happening right now; there is no option to "replay" the stream from some arbitrary point in the past. You can access past tweets from individual users' timelines (up to about 3,200 per user, as in the sketch below), but this would require knowing who had tweeted about your event and, as you say, you can only search the past week.
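A minimal sketch of that per-user lookup with tweepy (pre-4.0 interface); the screen name and credentials are placeholders, and the ~3,200-tweet cap is enforced by Twitter, not by the code:

    import tweepy

    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Page backwards through one user's timeline, 200 tweets per request,
    # until Twitter stops returning older tweets (roughly the last 3,200).
    tweets = [status for status in
              tweepy.Cursor(api.user_timeline,
                            screen_name="some_witness_account",
                            count=200).items()]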
If you are prepared to pay for it, you can get tweets since the start of 2010 via DataSift.
Historical data going back to the very first Tweet ever is only available from Gnip.

Is there any way to "backdate" requests to google server-side analytics?

I have an iOS app which can be used offline. I need to do anonymous page view tracking, so our customers can tell which pages people are most interested in (to drive future investments). So when the user is offline, we save a timestamped page view list, and if the user happens to be online when they use the app, we send these historic records up, and also do real-time tracking.
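A rough Python analogue of that offline queue (the app itself is iOS, so this is only to illustrate the flow; the file name and the send_hit callback are hypothetical):

    import json
    import time

    QUEUE_FILE = "pending_pageviews.json"   # hypothetical local store on the device

    def record_page_view(page_path):
        """Append a timestamped page view while offline."""
        try:
            with open(QUEUE_FILE) as f:
                queue = json.load(f)
        except (IOError, ValueError):
            queue = []
        queue.append({"path": page_path, "ts": int(time.time())})
        with open(QUEUE_FILE, "w") as f:
            json.dump(queue, f)

    def flush_queue(send_hit):
        """Replay queued views once online. `send_hit` is the server-side tracking
        call; the original `ts` is not passed on, which is exactly why GA sees all
        of these views as happening at send time."""
        with open(QUEUE_FILE) as f:
            queue = json.load(f)
        for view in queue:
            send_hit(view["path"])
        with open(QUEUE_FILE, "w") as f:
            json.dump([], f)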
I'm keeping some summary statistics in my GAE app, so I can report the page views with historic accuracy. However, I'm also feeding these views into google analytics, using some python code I ported from google's server-side samples.
That all works great (except for language tracking, which I may have solved thanks to a separate question here on SO). However, I'd love for google analytics to be able to understand the historical hits in context. Right now, if I connect up after looking at several pages offline, GA thinks I just popped through a bunch of pages over the course of a couple seconds.
There is no documented utm variable for timestamping. The google analytics SDK for iOS (which I'm not using) has this ominous note:
Known Issues
Possible inaccurate timestamps: timestamps are recorded at the time the application dispatches to Google Analytics, so if a user experiences long periods of offline use, the timestamps may not be 100% accurate.
That seems like a bit of an understatement. Wouldn't offline timestamps be 100% inaccurate?
Anyway, the fact that the SDK doesn't handle this right makes me think I'm not going to be able to solve this. But I figured some SO wizard might have an idea...
In fact, the timestamp is "relative" (client-side) information that Analytics uses to compute things like "time on page".
The "absolute" date and time at which a page view is recorded is always the time you send the request.
