Is there any way I can retrieve only English tweets using Twitter's live Streaming API?
It seems that using "sample" or "filter" returns around 60-70 percent non-English tweets.
Thanks
Joel
I haven't found a good solution to this, but I've worked around it as follows:
1) Filter by the lang attribute equal to "en".
2) I found that tweets in several non-English languages are still labelled as English. So I downloaded Spanish, Dutch, and Indonesian word lists and counted the non-English word occurrences in each tweet. If there is more than one, I discard the tweet as non-English.
3) I think I need to filter for Portuguese as well; I need to investigate this.
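Steps 1 and 2 above could be sketched roughly as follows. This is only an illustration, not the exact code I run: the tiny word lists, the looks_english name, and the tweet-as-dict shape (with "lang" and "text" keys) are all assumptions made for the example. Real word lists would be much larger.

```python
import re

# Tiny illustrative word lists (assumption); real lists would be downloaded per language.
NON_ENGLISH_WORDS = {
    "es": {"que", "para", "pero", "porque", "gracias"},
    "nl": {"niet", "een", "het", "maar", "zijn"},
    "id": {"yang", "dan", "tidak", "aku", "kamu"},
}
ALL_NON_ENGLISH = set().union(*NON_ENGLISH_WORDS.values())


def looks_english(tweet, max_foreign=1):
    """Return True if the tweet is labelled "en" and contains at most
    max_foreign words found in the non-English word lists."""
    # Step 1: trust the lang attribute only as a first filter.
    if tweet.get("lang") != "en":
        return False
    # Step 2: count words that appear in any non-English list.
    words = re.findall(r"[^\W\d_]+", tweet.get("text", "").lower())
    foreign = sum(1 for word in words if word in ALL_NON_ENGLISH)
    return foreign <= max_foreign
```

A tweet with more than one listed word is discarded, which matches the "more than 1" rule in step 2.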
Filtering only English-language messages from the Twitter stream is an active research area. You could use an off-the-shelf language identification system to process the stream locally and select only messages in English. One such system is langid.py. Full disclosure: I am the author of langid.py.
Another system I know of is ldig by Nakatani Shuyo. I haven't had a chance to experiment with it yet, but it is made specifically for language identification of Twitter messages.
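As a rough sketch of the local filtering idea: the helper below keeps only the messages that a classifier labels as English. The english_only and stub_classify names are made up for this example. The classifier is passed in as a function returning a (language_code, score) pair, which is the shape langid.classify returns; the stub is only there so the sketch runs without langid installed.

```python
def english_only(texts, classify):
    """Yield the texts that classify() labels as English.

    classify must return a (language_code, score) pair,
    the same shape that langid.classify returns.
    """
    for text in texts:
        lang, _score = classify(text)
        if lang == "en":
            yield text


def stub_classify(text):
    # Hypothetical stand-in classifier: treats pure-ASCII text as English.
    return ("en", 1.0) if text.isascii() else ("ja", 1.0)


messages = ["hello world", "こんにちは"]
english = list(english_only(messages, stub_classify))
```

With langid installed, the same helper should work as english_only(messages, langid.classify).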
Twitter will soon be releasing a new (or updated) attribute just for this purpose! See their blog post, Introducing new metadata for Tweets
The new lang attribute specifies the language the Tweet was written in, as identified by Twitter's machine language detection algorithms.
At the time of this writing, the lang attribute and language parameter haven't yet appeared; check the Calendar of API changes to see when they plan to release them (it currently just specifies "2013").
Update 3/30/2013:
The lang attribute was added to the Streaming API on March 26, 2013. In addition, it was also made available on the REST API on March 6, 2013.
For use in the Twitter Streaming API, language is now a request parameter:
https://dev.twitter.com/docs/streaming-apis/parameters#language
So for English you'd add 'language=en' to your request parameter string.
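For illustration, here is one way to build such a parameter string. The endpoint URL and the build_stream_url helper are assumptions for this sketch (check the current docs for the real endpoint), and an actual request would also need OAuth signing, which is not shown.

```python
from urllib.parse import urlencode

# Assumed endpoint for the statuses/filter stream; verify against the docs.
STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"


def build_stream_url(track, language="en"):
    """Return the stream URL with track and language query parameters."""
    params = {"track": track, "language": language}
    return STREAM_URL + "?" + urlencode(params)


url = build_stream_url("twitter")
```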
Twitter just finished it!!
See the API calendar:
https://dev.twitter.com/calendar
March 26, 2013: lang attribute & language parameter appears on streaming (Blog post, Streaming API).
The twitter API rocks!!
Related
I am trying to filter by URLs, but no results are being returned.
The following doc shows that it is possible: https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/premium-operators
but I think it's a premium feature. Is this true? If so, is there any other way to filter by URLs without using the premium feature?
The standard Twitter streaming API provides the 'track' parameter (see the doc). It matches Tweets by phrases as well as by URLs. A common use case, according to the doc:
a common use case where you may want to track all mentions of a particular domain name (i.e., regardless of subdomain or path), you should use “example com” as the track parameter for “example.com”
This parameter value will match:
example.com
www.example.com
foo.example.com
foo.example.com/bar
I hope my startup isn’t merely another example of a dot com boom!
I tested this option using the twitter-hbc library for Java. It works as expected!
To avoid confusion, please note:
The text of the Tweet and some entity fields are considered for matches. Specifically, the text attribute of the Tweet, expanded_url and display_url for links and media, text for hashtags, and screen_name for user mentions are checked for matches.
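To show why "example com" catches those Tweets, here is a simplified approximation of the matching, not Twitter's actual tokenizer. It assumes that commas separate alternative phrases (OR), that spaces separate terms which must all be present (AND), and that text is split on punctuation and compared case-insensitively. The function names are made up for this sketch.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens, breaking on punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def matches_track(track, text):
    """Return True if any comma-separated phrase in track has all of
    its space-separated terms present among the tokens of text."""
    text_tokens = tokens(text)
    for phrase in track.split(","):
        terms = phrase.lower().split()
        if terms and all(term in text_tokens for term in terms):
            return True
    return False
```

Under these assumptions, "www.example.com" is split into the tokens www, example and com, so both terms of "example com" are found.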
I would like to know if there is a Java library or API that can extract tweets of interest to me. For example, I want to know which tweets report a strike in Islamabad in the upcoming days, or a bomb blast that recently occurred in my city, etc. I know there are other libraries out there, but they only tell whether a specific tweet is positive, negative, or neutral. Thanks
Every library for integrating tweets into your application is based on the Twitter APIs.
For your specific example, you can try the Search API.
The process is really simple: try your keywords from here to determine which ones best suit your need, and then use the actual API like this (returns tweets with the keyword Islamabad) to get the tweets you need in JSON format.
NOTE:
Version 1.1 of the API uses OAuth authentication (I have not tried it yet, so I cannot provide more details :( ).
As for Java libraries (frameworks) that simplify this process, the only one I know is Spring Social. But if you are not familiar with the Spring framework in general, the best thing is to just read the JSON from the URL generated by the API and unmarshal it to get your results.
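The unmarshalling step could look roughly like this. The sketch is in Python rather than Java for brevity, and the sample payload is made up: it only assumes that the response has a "statuses" list whose items carry a "text" field. The tweets_mentioning helper is hypothetical.

```python
import json

# Made-up sample payload (assumption) standing in for a search response.
SAMPLE_RESPONSE = """
{"statuses": [
  {"id": 1, "lang": "en", "text": "Strike announced in Islamabad tomorrow"},
  {"id": 2, "lang": "en", "text": "Lovely weather in Lahore"}
]}
"""


def tweets_mentioning(raw_json, keyword):
    """Parse the JSON response and return the text of each status
    that contains keyword (case-insensitive)."""
    payload = json.loads(raw_json)
    keyword = keyword.lower()
    return [
        status["text"]
        for status in payload.get("statuses", [])
        if keyword in status.get("text", "").lower()
    ]


hits = tweets_mentioning(SAMPLE_RESPONSE, "islamabad")
```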
I'm planning to release a community website whose primary audience is not English-speaking. This means that URLs pointing to /profile, /forums and so on would be in English and not in the users' native language. I'm not concerned about users accessing URL paths in English while using the website, but I am wondering: if I used non-English URLs, would search engines pick up the pages on the website better or worse?
Anyone care to share their opinions?
In my opinion, it would be better to have URLs that reflect the primary language of your users, as this would make it easier for them to find your website on search engines (supposing they search in their primary language). From an SEO perspective, if possible, also try to include in your URLs the relevant search terms you think your audience would use. If you have a forum, for example, include the full thread title in the thread URLs if possible, and so on.
Sources: my own experience with building and managing powershell.it and sqlserver.it, two of the most important Italian technology-related communities.
The best place to start on this issue would be Google's Webmaster Central section on Internationalization.
If you will have versions of the same URL in multiple languages, you can connect them using the rel="alternate" mechanism, which is explained at Google's Webmaster Tools page.
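As a small sketch of that mechanism, the helper below generates one link element per language version, using the hreflang attribute to name the language. The alternate_links name and the example.com URLs are placeholders, not anything from Google's documentation.

```python
from html import escape


def alternate_links(urls_by_lang):
    """Return one <link rel="alternate"> tag per language version,
    sorted by language code."""
    return [
        '<link rel="alternate" hreflang="{}" href="{}" />'.format(
            escape(lang, quote=True), escape(url, quote=True)
        )
        for lang, url in sorted(urls_by_lang.items())
    ]


# Placeholder URLs for two language versions of the same page.
links = alternate_links({
    "it": "https://example.com/it/forum",
    "en": "https://example.com/en/forums",
})
```

Each language version of the page would carry the full set of tags in its head section.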
1. Summary
Using non-English URLs for non-English websites is fine.
2. Argumentation
Google Senior Webmaster Trends Analyst John Mueller said in a recent SEO snippets video that using non-English URLs for non-English websites is fine and that Google is able to crawl, index and rank them.
This includes non-Latin characters in your URLs. John Mueller said “as long as URLs are valid and unique, that’s fine.” He added, “So to sum it up, yes, non-English words and URLs are fine, and we recommend using them for non-English websites.”
Read full article here.
3. Disclaimer
The information in this answer was current as of March 2018 and may become obsolete in the future.
What is the best way to determine the language of Twitter posts?
There is the language parameter that comes with the streaming API, but it doesn't seem to be very accurate. Even many Japanese posts are labelled as English.
What have others done to sort out the languages?
I've had very good results with this PHP package:
http://pear.php.net/package/Text_LanguageDetect/
It is fast and open source. We use it to select English-only posts for a site we run at http://2012twit.com.
Google has language detection within its Translate API, if using evil external services is an option.
http://code.google.com/apis/language/translate/v1/reference.html#detectResult
Does anyone know where to find the RSS feeds in the new Twitter? I cannot find the RSS icon, and the source of the page just points to "Your Twitter Favorites" even though I am on the page of the user I want to get an RSS feed from...
Simple, I know, but it's bugging me to no end!
2014 edit:
It looks like Twitter has retired RSS feeds, and now only exports data as JSON:
What output formats will API v1.1 support?
API v1.1 will support JSON only. We’ve been hinting at this for some
time now, first dropping XML support on the Streaming API and more
recently on the trends API. XML, Atom, and RSS are infrequently used
today, and we’ve chosen to throw our support behind the JSON format
shared across the platform. Consequently, we’ve decided to discontinue
support for these other formats. For historical context, when we
originally built the API all major languages did not have performant,
well vetted libraries supporting JSON - today they do.
Original 2010 answer:
Here are the various feed URLs (using the account "Twitter" for these examples):
http://twitter.com/statuses/public_timeline.rss
http://twitter.com/statuses/user_timeline/Twitter.rss
http://twitter.com/favorites/Twitter.rss
http://search.twitter.com/search.rss?q=Twitter
The new Twitter layout isn't very RSS-friendly, unfortunately.
You won't be able to find it because Twitter stopped support for RSS :(
I needed this myself, so I built a Twitter-to-RSS converter. It works on hashtags, searches and lists. I've now opened it up, totally free, for anyone else who needs a solution.
Get it here - Twitter RSS Feed Generator
I recommend the free website twitrss.me: put the user id (for example, ahejlsberg) into the textbox next to #, then click the "Fetch RSS" button.
You then get the RSS feed URL: https://twitrss.me/twitter_user_to_rss/?user=ahejlsberg.
I found that this works for particular users (I had been trying to figure out their ids, which is how RSS used to work, but this works fine):
[Updated]
http://api.twitter.com/1/statuses/user_timeline.rss?screen_name=johnpiper
[Updated Sept 2014]
no longer works again...
[Alternative Solution: May 2015]
I have since discovered http://www.queryfeed.net
http://www.queryfeed.net/twitter?q=from%3Ajohnpiper
See the home page for further documentation about how to structure other queries. The service does not seem to return all tweets.
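If you want to generate these query URLs for several users, a small sketch like the one below takes care of the percent-encoding (the ":" in "from:johnpiper" becomes %3A). The queryfeed_url helper is just an illustration, and it assumes the service keeps the URL pattern shown above.

```python
from urllib.parse import urlencode


def queryfeed_url(screen_name):
    """Build a queryfeed.net URL for tweets from the given user."""
    query = urlencode({"q": "from:" + screen_name})
    return "http://www.queryfeed.net/twitter?" + query


feed = queryfeed_url("johnpiper")
```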