Training data for phishing or spam tweets - twitter

I want to do phishing/spam detection on Twitter.
I collected about 500,000 tweets through Twitter's Streaming API. I then extracted the URLs that appear in these tweets and submitted them to two blacklists, Google Safe Browsing and PhishTank, to get a basic judgment of whether each one is a phishing link. The problem is that, according to my experimental results, I can't get enough samples of phishing tweets.
Is there any existing tweet data that has already been labeled as malicious/normal so that I can carry on with my work?

The URL blacklists did not work well because there is latency in blacklisting. You can use suspended accounts as the label, but note that not all suspended accounts are phishing accounts.
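As a rough illustration of the suspended-account idea, the sketch below pulls URLs out of tweet text with a simple regular expression and attaches a weak label based on whether the author was later suspended. The dictionary keys (`id`, `user_id`, `text`), the label names, and the `suspended_user_ids` set are assumptions made for this example, not part of any Twitter API; how you obtain the set of suspended accounts is left open.

```python
import re

# Simple pattern for http(s) links; an assumption for this sketch, not a full URL parser.
URL_PATTERN = re.compile(r"https?://[^\s]+")

def extract_urls(text):
    """Return all http(s) URLs found in a tweet's text, minus trailing punctuation."""
    return [u.rstrip(".,;:!?)\"'") for u in URL_PATTERN.findall(text)]

def label_tweets(tweets, suspended_user_ids):
    """Attach a weak label to each tweet that contains at least one URL.

    tweets: iterable of dicts with hypothetical keys "id", "user_id", "text".
    suspended_user_ids: set of user IDs later found to be suspended.
    """
    labeled = []
    for tweet in tweets:
        urls = extract_urls(tweet["text"])
        if not urls:
            continue  # no link, so not a phishing-link candidate
        # Suspension is only a noisy signal: not every suspended account is a phisher.
        label = "suspect" if tweet["user_id"] in suspended_user_ids else "normal"
        labeled.append({"id": tweet["id"], "urls": urls, "label": label})
    return labeled
```

Because suspension is a noisy signal, the "suspect" rows would probably still need manual review or a second check before being used as training data.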

Related

How can we publish a Twitter dataset?

We are going to collect roughly 80M records from Twitter, but we do not know whether it is legal to publish them somewhere like GitHub.
I saw that users mostly publish the tweet IDs rather than the whole tweet data (text, username, and so on). How is it possible to publish Twitter data?
"I saw that users mostly publish the tweet IDs rather than the whole tweet data"
That's correct, and that's aligned with the Twitter Developer Policy that you agree to when using the API, which states:
If you provide Twitter Content to third parties, including downloadable datasets or via an API, you may only distribute Tweet IDs, Direct Message IDs, and/or User IDs (except as described below). We also grant special permissions to academic researchers sharing Tweet IDs and User IDs for non-commercial research purposes.
[... - ...]
Academic researchers are permitted to distribute an unlimited number of Tweet IDs and/or User IDs if they are doing so on behalf of an academic institution and for the sole purpose of non-commercial research. For example, you are permitted to share an unlimited number of Tweet IDs for the purpose of enabling peer review or validation of your research. If you have questions about whether your use case qualifies under this category please submit a request via the API Policy Support form.
Basically, if you are in any doubt you should ask Twitter directly via the form linked above, but the policy is pretty clear that you should only be sharing Tweet IDs. You should also have stated your intent when applying for API access.
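In practice, sharing only IDs can be as simple as stripping every other field before writing the file. The following is a minimal sketch; it assumes each collected record is a dictionary with an `id` key, and the function name is made up for this example.

```python
def write_tweet_ids(tweets, out_file):
    """Write one tweet ID per line, dropping text, usernames and other fields.

    tweets: iterable of dicts with a hypothetical "id" key.
    out_file: any writable text file object.
    Returns the number of IDs written.
    """
    count = 0
    for tweet in tweets:
        out_file.write(f"{tweet['id']}\n")
        count += 1
    return count
```

Anyone who wants the full tweets can then re-fetch them from the API by ID, under their own API agreement.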

Is it possible to access the live tweets of a user using tweepy?

I am trying to capture a user's tweets live, as soon as he posts them. So all I want is something that continuously monitors a user account and captures each new tweet. The tweets are random, so I cannot use any keyword filters.
If, for security reasons, I cannot access other users' tweets, can I do it on my own account?
With tweepy you can connect to the REST API or the Streaming API.
Using the Streaming API you can use the filter endpoint to select the users you wish to follow with that streaming connection and you will receive updates as they get published.
Twitter's documentation: https://dev.twitter.com/docs/api/1.1/post/statuses/filter, tweepy's code: https://github.com/tweepy/tweepy/blob/master/tweepy/streaming.py
Tweepy's documentation doesn't give examples of the streaming functions, but you can find sample code by searching GitHub or Stack Overflow for "tweepy filter follow".
Tweepy talks to the Twitter REST API, and the REST API doesn't have any way to react to someone posting a tweet.
HOWEVER...
You could certainly write an application that retrieves the tweets of a particular user and looks for any tweets that weren't there the last time you checked.
You'd want to be cautious about how often you check so you don't run afoul of the API rate limits.
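The polling approach could look roughly like the sketch below. It deliberately does not call tweepy directly: `fetch_timeline` is a hypothetical callable that you would implement yourself (for example, as a wrapper around a timeline request) and that takes the last seen tweet ID and returns a list of dictionaries with an `id` key. The interval and the `max_polls` parameter are also assumptions for illustration.

```python
import time

def poll_new_tweets(fetch_timeline, handle_tweet, interval_seconds=60,
                    max_polls=None, sleep=time.sleep):
    """Repeatedly fetch a user's timeline and report tweets not seen before.

    fetch_timeline(last_seen_id) is a hypothetical callable returning a list
    of dicts with an "id" key; handle_tweet(tweet) is called once per new tweet.
    Note that the first poll treats everything it receives as new.
    """
    last_seen_id = None
    polls = 0
    while max_polls is None or polls < max_polls:
        tweets = fetch_timeline(last_seen_id)
        # Tweet IDs grow over time, so sorting by ID gives oldest-first order.
        for tweet in sorted(tweets, key=lambda t: t["id"]):
            if last_seen_id is None or tweet["id"] > last_seen_id:
                handle_tweet(tweet)
                last_seen_id = tweet["id"]
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)  # wait between checks to respect rate limits
    return last_seen_id
```

Choosing `interval_seconds` is the trade-off mentioned above: a shorter interval means fresher results but more requests counted against the rate limit.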

Google Analytics referrals: What's the difference between t.co and twitter.com?

In Google Analytics' Pages report, I'm looking at a particular page on my site, then, under secondary dimensions, pulling up "Source/Medium." I'm seeing rows for both of these:
t.co / referral
twitter.com / social
What's the difference between these two? I understand that links shared on Twitter get wrapped in t.co URLs, but then what are the visits coming from twitter.com?
t.co is a link shortener similar to bitly, but with some assurance from Twitter that the link isn't harmful (they say they scan the page before giving out the shortened URL). People referred to your site via t.co are getting there from a page that was shared on Twitter, and the people coming from twitter.com came from a Twitter post.
So, basically:
t.co/referral implies Twitter -> other page -> your site
twitter.com/social implies Twitter -> your site
I'm not sure that #j-a-streich is right about that. From what I've seen dealing with Google Analytics, you can't 100% predict why Google ends up with the info it gets; there are a lot of edge cases and gotchas. Here is what I have uncovered:
Twitter's official stance on how they handle referrer information: https://dev.twitter.com/docs/tco-redirection-behavior
t.co currently handles redirections by context and known user agents. We've taken care to preserve original referrers in all contexts where they are reliably provided.
But that doesn't actually help much, because they don't differentiate in that doc between t.co and twitter.com.
I did some testing, and it appears that whenever someone comes to your site directly by clicking a t.co-wrapped link on twitter.com, the referrer is the t.co link. That includes links in tweets and profile bios. It is also true for URLs that were pasted directly into Twitter and for those that were pre-shortened by another shortener.
I also checked links from a client (old tweetdeck) and that showed up as t.co.
I wasn't actually able to find a valid source of twitter.com referrers.
My main problem with what #j-a-streich says is that he appears to say that the referrer will be maintained if someone goes from Twitter to another site, then to you, and that won't happen. The referrer will be the other site.
I deal with this issue in my analytics as well, but I've just come to accept that referral data is not perfect, and I treat it as such. I currently see direct, t.co and twitter.com referrals in GA and generally associate them all with Twitter. It's obviously hard to make a distinction with the direct visits, but there isn't much to do about that.
I think mobile device referrals play a part in it too, because of the application's built-in browser. I've noticed that clicks in the Android app show up as direct visits instead of t.co or twitter.com.
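If you export the report and want to lump the two Twitter sources together, as described above, something like the following sketch would do it. It assumes rows of ("source / medium", visits) pairs taken from an exported report; the function names and the `twitter` label are made up for this example. Direct visits are left alone, since there is no way to tell from the source string which of them came from Twitter apps.

```python
# Hosts treated as Twitter traffic; an assumption based on the sources discussed above.
TWITTER_HOSTS = {"t.co", "twitter.com"}

def normalize_source(source_medium):
    """Map a GA "source / medium" string to a coarse channel label."""
    source = source_medium.split("/")[0].strip().lower()
    if source in TWITTER_HOSTS or source.endswith(".twitter.com"):
        return "twitter"
    return source

def count_by_channel(rows):
    """Sum visits per channel for (source_medium, visits) pairs."""
    totals = {}
    for source_medium, visits in rows:
        channel = normalize_source(source_medium)
        totals[channel] = totals.get(channel, 0) + visits
    return totals
```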

How to verify twitter account?

Let's say I am making a sign-up form in which I ask for the user's Twitter ID. How do I verify that the ID entered by the user belongs to him/her? When verifying an email address, we simply send a verification link that the user has to click, so how do I verify a Twitter ID? I have never used Twitter before.
The only reliable and practical way to verify that Twitter account X belongs to user Y is to do full-on “3 legged” OAuth authentication. That being said, you may want to consider whether you might be OK with just taking the user at their word on it.
Getting OAuth to work and securely storing the resulting tokens is much easier nowadays than it once was, but is still non-trivial.
Reasons to verify the twitter account, in increasing reasonableness:
You will be making enough server side requests, on behalf of multiple users, that you run up against Twitter’s API Rate Limiting. (Having multiple auth-tokens will allow for a higher API rate)
You need to automagically send tweets and/or follow accounts on the user’s behalf
N.B. do this as opt-in and be ultra clear about when/why you will be doing this, or you will face the justified fury of scorned users
Don’t verify the account if you’re looking to do these things:
You need to send tweets and/or follow accounts on the user’s behalf, and the user will be able to perform a browser based confirmation workflow for each of those actions; use Twitter’s Web Intents for this.
If you just want to pull in real time data for user’s avatar, bio, or recent Tweets Twitter supplies some prefab widgets for you.
All of the authenticated Twitter API Calls can be done client side with JavaScript. Twitter has a js framework, which does not require you to handle and store tokens on your server, to help you with that.
An alternate contact method for password resets, notifications, etc.
Private communication between users on twitter requires mutual following, many users probably never check their Direct Messages (or even know what a DM is), and any messages would be limited to 140 characters. Just use email for all that kind of nonsense.
If you’re just gathering this info to display it on a user’s profile page, in an “other places on the web” kind of way, integrating and maintaining all the server side OAuth pieces is likely too much bother. Just make sure you have a reasonable and clear TOS and an obvious way for 3rd parties to report any of your users who may be claiming a twitter account that is not their own.
If you’re still interested in OAuth, Twitter's Dev page has plenty of resources, including a nice overview of a generic “Sign In with Twitter” “3 legged” OAuth work flow.

Twitter follower demographics?

Any ideas on how I can get location, age and gender of my followers on Twitter? I know that Twitter can only give me location (not always), by looking at either time-zone or location. But I have seen other applications presenting the demographics of my followers and I have no idea how they do it? Is it fake?
Twitter does not provide this information. Some third-party sites use techniques like analyzing tweets to try to predict demographics based on content. Others have deals with third parties that connect Twitter accounts with other profiles, such as Facebook accounts. There is no surefire way to do it, and all options will have reliability issues.
