I am trying to reconstruct the actual path of Hurricane Sandy from Twitter data. My approach is as follows:
I collect all the tweets tagged with the hashtag "Hurricane Sandy" between October 28, 2012 and October 31, 2012 (Hurricane Sandy made landfall on October 29, 2012 near Brigantine, New Jersey, and affected many neighboring states over the next 2 days). I sort the collected tweets by time and divide them into fixed-size time windows. Then, in each time window, I identify the relevant tweets, i.e. the tweets that point to the position of the hurricane track. Finally, I take the locations from which the relevant tweets originated and connect them to get the hurricane track.
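The windowing step of this approach could be sketched roughly as follows. This is only an illustration: the `(timestamp, lat, lon)` tuple layout and the function names are assumptions, and averaging the positions in each window is a stand-in for the still-open question of which tweets are relevant.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_into_windows(tweets, start, window):
    """Group (timestamp, lat, lon) tuples into fixed-size time windows.

    Returns a dict mapping window index -> list of (lat, lon) points.
    """
    bins = defaultdict(list)
    for ts, lat, lon in sorted(tweets):
        index = (ts - start) // window  # timedelta // timedelta gives an int
        bins[index].append((lat, lon))
    return bins


def window_centroids(bins):
    """Return one mean (lat, lon) per window, ordered by time.

    Assumption for illustration: every tweet in the window counts as relevant.
    """
    track = []
    for index in sorted(bins):
        points = bins[index]
        lat = sum(p[0] for p in points) / len(points)
        lon = sum(p[1] for p in points) / len(points)
        track.append((index, lat, lon))
    return track
```

Connecting the returned points in index order gives a candidate track; a relevance filter would be applied to each window's points before averaging.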
The problem I am facing is how to determine the relevance of a tweet to the track taken by the hurricane, i.e. how to determine whether a tweet originated from an area that lies on the hurricane's track. What features or algorithms could I use to do this?
Have you had a look at the data?
Twitter data is 99% mess, and 1% signal.
I doubt you can achieve your goals from this data. In particular, the network may have been down where the real hurricane was...
I'm improving my website's performance and testing page speed with Google PSI. I've made as many corrections as I can and I'm getting a 90+ score in PSI. All Lab Data attributes are green, but the Field Data is still not being updated. How long does Google PageSpeed Insights take to update the Field Data? Also, once the Field Data updates, will it be the same as the Lab Data?
[Screenshot: PageSpeed Insights report]
The data displayed is aggregated daily, so the data should change day to day.
However, the reason you do not instantly see the results of any improvements you make is that the data is taken over a rolling 28-day period.
After about 7 days you should start to see your score improve; in 28 days the report data will fully reflect the changes you made today.
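To see why the improvement shows up gradually, here is a small sketch of a rolling 28-day window. It assumes a simple mean of one hypothetical value per day (4.0 s before a fix, 2.0 s after); the real field data is aggregated differently, so treat the numbers as illustrative only.

```python
def rolling_mean(daily_values, window=28):
    """Mean of the last `window` daily values, for each day with a full window."""
    return [
        sum(daily_values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(daily_values))
    ]


# Hypothetical daily metric: 28 days at 4.0 s, then 28 days at 2.0 s after a fix.
before = [4.0] * 28
after = [2.0] * 28
series = rolling_mean(before + after)

# series[k] is the reported value k days after the fix shipped.
print(series[0], series[7], series[28])
```

One week after the fix only a quarter of the window holds the new values, so the reported number has moved a quarter of the way; it reaches the new level once all 28 days in the window postdate the change.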
Because of this delay, I would recommend collecting your own metrics using the Web Vitals library so you can see the results in real time. This also lets you verify that the changes you made work at all screen sizes and across browsers.
Either that or you could query the CrUX data set yourself on shorter timescales.
I have an application where I need to get complete, real-time search results from Twitter (preferably polling every 500ms or less). Based on my understanding, doing this with the search API will run into rate limits very quickly. However, the streaming API doesn't seem to support complete results for anything (only a 5% sample).
More specifically, I have a search query term which typically comes up with <20 matching tweets per hour, and I would like to be informed of these new tweets within 1-2 seconds, and it is considered a failure if I am not notified within 5 seconds. Due to the relatively low frequency of posting, missing even one tweet is very undesirable.
Is there any way I can realistically do this using the Twitter API, or is my only choice to write a browser extension that repeatedly refreshes the search page?
The answer is "yes". Although the streaming API is rate limited (the limit is closer to 1% than 5%), that limit is only a cutoff applied to your query results. Very roughly, you can stream about 60 tweets per second at most. You say you expect under 20 tweets per hour, so you should have no problem getting all of those tweets.
You also require a latency less than 5 seconds. In my experience latency has always been a second or two. I think you should be fine.
Every 30 minutes, I store Twitter's trending topics for a country Y in a database. No problem with that.
Now, I want to get as many tweets as possible matching those trending topics for research purposes.
Since I would like to study the patterns of the trends, I would like continuous tweet data covering at least 3 days centered on the day the trend's peak was detected, for every trending topic. To achieve that, I thought of doing the following:
Suppose today is day X. I could retrieve the unique trends of day X-2 and, for every trend, look for matching tweets in the interval [X-3, X-1], that is, 3 days. However, the problem here is Twitter's rate limits. If I have 100 trending topics on day X-2 and I make 20 GET search requests per trend, I would end up making a total of 2,000 requests, which exceeds Twitter's 350 hourly rate limit. If I make 300 requests/hour, it would take more than 6 hours to get the data for only one day...
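The arithmetic above can be checked with a few lines; the function name is made up for illustration, and the figures are the ones from the question.

```python
def request_budget(trends, requests_per_trend, requests_per_hour):
    """Return (total requests, hours needed at the given hourly request rate)."""
    total = trends * requests_per_trend
    return total, total / requests_per_hour


total, hours = request_budget(trends=100, requests_per_trend=20, requests_per_hour=300)
print(total, round(hours, 2))
```

At 300 requests/hour the 2,000 requests take about 6.67 hours, which matches the "more than 6 hours" estimate.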
Does anybody know any other (better) way for getting tweets associated with trends?
Thanks in advance
Twitter Streaming API?
The Twitter Streaming API doesn't deliver any past tweets. You only receive tweets from the time the server connection is established. The search API will, in theory, return tweets matching the current query up to 7 days old, but that depends entirely on Twitter’s current load. (Note: at times this interval has been as short as 24 hours. In addition, you can only receive up to 1,500 tweets, regardless of how old they are.)
Is there any way to get more tweets from the streaming?
None that I know of. But do refer to the information below if you are considering switching between the search and streaming APIs.
Please choose your case:
If you need real-time data and your number of requests is high:
Go for Streaming API
The streaming API requires that you keep the connection active. This requires a server process with an infinite loop, to get the latest tweets.
Advantage
1) Lag in retrieving results: Tweets delivered with this method are basically real-time, with a lag of a second or two at most between the time the tweet is posted and the time it is received from the API.
2) Not rate limited.
If you need aggregate data regardless of its time range and your number of requests is not high:
Go for Search API
The search API is the easier of the two methods to implement, but it is rate limited. Each request will return up to 100 tweets, and you can use a page parameter to request up to 15 pages, giving you a theoretical maximum of 1,500 tweets for a single query.
Advantage
1) Finding tweets in the past: The search API wins by default in this area, because the streaming API doesn’t deliver any past tweets.
2) Easier to implement
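As a rough sketch of the paging behaviour described for the search API (up to 100 tweets per request, up to 15 pages), the loop below collects results until the pages run out. `fetch_page` is a hypothetical callable that you would supply to wrap the actual HTTP request; it is not part of any Twitter library.

```python
def collect_search_results(fetch_page, query, per_page=100, max_pages=15):
    """Page through search results until they run dry or the page cap is hit.

    `fetch_page(query, page, per_page)` is a hypothetical callable that
    returns a list of tweets for the given 1-based page number.
    """
    results = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(query, page, per_page)
        if not batch:
            break  # no more results
        results.extend(batch)
        if len(batch) < per_page:
            break  # a short page means this was the last one
    return results
```

With the default arguments the loop can never return more than 100 x 15 = 1,500 tweets for one query, which is the theoretical maximum mentioned above.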
For a research project, I need to download the top 100 most used words on Twitter, multiple times per hour. However, as far as I can tell, the Twitter API only supports downloading the top 10 most used words ("trends").
My questions therefore are:
Am I missing something in the API? Is there another way to fetch more than 10 trends?
If there isn't, does anybody know of a workaround for this problem?
Put ?count=50 at the end of the URL to get the top 50. I haven't been able to get more than 50.
http://api.twitter.com/1/trends/current.json?count=50
You should monitor your timeline: get all tweets, save them in a database, analyze them with NLP, and save the words (for example, person names). Then aggregate and get the counts, for example "Obama 50 times, Java 10 times, Linux 5 times".
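The aggregation step could look something like the sketch below. It uses a plain regular-expression tokenizer instead of real NLP, and the function name and `stopwords` parameter are assumptions made for illustration.

```python
import re
from collections import Counter


def top_words(tweets, n=100, stopwords=frozenset()):
    """Count lowercase word tokens across tweet texts and return the n most common."""
    counts = Counter()
    for text in tweets:
        words = re.findall(r"[a-z0-9_#@']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts.most_common(n)


sample = ["Obama speaks on Java", "obama and linux", "Obama Java"]
print(top_words(sample, n=2))
```

Entity extraction (for example, keeping only person names) would replace the tokenizer line; the counting itself stays the same.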
I have a site with several pages for each company and I want to show how their page is performing in terms of number of people coming to this profile.
We have already made sure that bots are excluded.
Currently, we are recording each hit in a DB with either an insert (for the first request to a profile in a day) or an update (for subsequent requests to that profile in the same day). But, given that requests have gone from a few thousand per day to tens of thousands per day, these inserts/updates are causing major performance issues.
Assuming no JS solution, what will be the best way to handle this?
I am using Ruby on Rails, MySQL, Memcache, Apache, and HAProxy to run the whole show.
Any help will be much appreciated.
Thx
http://www.scribd.com/doc/49575/Scaling-Rails-Presentation-From-Scribd-Launch
You should start reading from slide 17.
I think performance isn't a problem if it's possible to build a solution like this for a website as big as Scribd.
Here are 4 ways to address this, from easy estimates to complex and accurate:
Track only a percentage (10% or 1%) of users, then multiply to get an estimate of the count.
After the first 50 counts for a given page, start updating the count 1/13th of the time by a count of 13. This helps if a few pages account for many counts, while keeping small counts accurate. (Use 13 because it makes it hard to notice that the incr isn't 1.)
Save exact counts in a cache layer like memcache or local server memory and save them all to disk when they hit 10 counts or have been in the cache for a certain amount of time.
Build a separate counting layer that 1) always has the current count available in memory, 2) persists the count to its own tables/database, and 3) has calls that adjust both places.
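Two of the options above (the 1/13th increment and the cached counts that are written out in batches) can be sketched as below. This is a minimal, language-agnostic illustration written in Python rather than Rails; the class names, the `flush` callback, and the injectable `rng` are all assumptions, and the time-based flush is left to whatever calls `flush_all`.

```python
import random


class SampledCounter:
    """Exact up to `threshold`; after that, add `step` with probability 1/step."""

    def __init__(self, threshold=50, step=13, rng=random.random):
        self.count = 0
        self.threshold = threshold
        self.step = step
        self.rng = rng  # injectable so the behaviour can be tested

    def hit(self):
        if self.count < self.threshold:
            self.count += 1
        elif self.rng() < 1 / self.step:
            self.count += self.step


class BufferedCounter:
    """Accumulate hits in memory and write them out in batches."""

    def __init__(self, flush, batch_size=10):
        self.flush = flush  # hypothetical callable(page_id, delta) that writes to the DB
        self.batch_size = batch_size
        self.pending = {}

    def hit(self, page_id):
        self.pending[page_id] = self.pending.get(page_id, 0) + 1
        if self.pending[page_id] >= self.batch_size:
            self.flush(page_id, self.pending.pop(page_id))

    def flush_all(self):
        """Write out every pending count, e.g. from a periodic timer."""
        for page_id, delta in list(self.pending.items()):
            self.flush(page_id, delta)
        self.pending.clear()
```

With a batch size of 10, the buffered version turns ten row updates into one, at the cost of losing up to nine hits per page if the process dies before a flush.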