How do I obtain the data to train my ML model? - machine-learning

I am building a machine learning model that would suggest attractions in a specific location.
I have most of the details worked out. However, I still need to collect the data of the attractions to train my model.
Is there somewhere I could find a dataset for this (I already checked Kaggle)? If not, which websites should I scrape?

If you want to scrape data, Twitter is probably the easiest place to start. You can use the Twitter API to get any tweet that contains a specific keyword or hashtag: use your desired location as the keyword and fetch the tweets with Tweepy. I would suggest pulling from specific accounts, such as influencers or travel blogs, to get data about attractions.
Applying for Twitter API access might take several days, and you can only retrieve tweets from roughly the past week. For anything older you need to sign up for their premium subscription.
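A minimal sketch of the keyword-plus-account search described above, assuming Tweepy 4.x and the v2 recent-search endpoint (which covers about the last seven days). The helper names `build_query`, `fetch_tweets` and `main`, the account name, and the bearer token are placeholders for illustration, not part of the original answer.

```python
def build_query(location, accounts=(), lang="en"):
    """Build a v2 search query: location keyword, optional accounts, no retweets."""
    # Quote multi-word locations so they are matched as a phrase.
    parts = [f'"{location}"' if " " in location else location]
    if accounts:
        parts.append("(" + " OR ".join(f"from:{a}" for a in accounts) + ")")
    parts.append("-is:retweet")
    if lang:
        parts.append(f"lang:{lang}")
    return " ".join(parts)


def fetch_tweets(client, query, max_results=50):
    """Run one recent-search request and return (id, text) pairs."""
    response = client.search_recent_tweets(
        query=query,
        max_results=max_results,  # the endpoint accepts 10 to 100 per request
        tweet_fields=["created_at", "public_metrics"],
    )
    # response.data is None when nothing matched.
    return [(t.id, t.text) for t in (response.data or [])]


def main():
    import tweepy  # imported here so the helpers above work without it

    client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder
    query = build_query("Lisbon", accounts=["example_travel_blog"])  # placeholder account
    for tweet_id, text in fetch_tweets(client, query):
        print(tweet_id, text)
```

Calling `main()` with a real bearer token would print one page of matching tweets; collecting more would need pagination, which is left out here.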

Related

How can I retrieve the N most popular tweets for a country using the Twitter API?

TL;DR: I want to be able to retrieve the N most popular tweets for any arbitrary country within the last X hours (up to 24 hours)
More detail
I want to show the details of the most popular tweets by geographic region (country) over the past few hours (adjustable up to 24 hours). How can I use the Twitter REST API to achieve this (v1.1 or v2)?
There are endpoints for querying tweets and filtering by popularity, but they require a search string (e.g. "NASA") and return the most popular tweets matching that search string. I am not interested in the contents of the tweets, I just want to know what is most popular.
I plan on using this functionality to show a world map (using Leaflet) to summarise the most popular tweets by country for the past day.
I am using Twit in NodeJS but not looking for answers specific to Node, rather how to leverage the capabilities of the API.
I am not aware of a way that this can be done directly through the API itself (V1 or V2). I also do not think that this is going to be a trivial task at all.
What I would suggest is using the search endpoint...
V1: Reference
V2: Reference Note that to use geolocation search parameters (see below) you'll need academic access.
... in conjunction with one of the geolocation search parameters. For example, you could pull some subset of tweets from within a country (you will not be able to download all tweets within a single country on any given day, not to mention all countries). After you get this data, you'll need to do some of your own data processing based on how you want to define "popular" (e.g. retweets, likes, etc.) and then go from there.
As I said earlier, this seems like a very large project and not something that can be solved simply with the Twitter API.
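The "do your own data processing" step could look roughly like the sketch below. It assumes the tweets have already been downloaded for one country and converted to dicts with a timezone-aware `created_at` datetime and a v2-style `public_metrics` mapping. The score (retweets plus likes) is only one possible definition of "popular", and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def popularity(tweet):
    """Illustrative score: retweets plus likes. Swap in any other definition."""
    metrics = tweet["public_metrics"]
    return metrics["retweet_count"] + metrics["like_count"]


def top_tweets(tweets, n, hours, now=None):
    """Return the n highest-scoring tweets created within the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    recent = [t for t in tweets if t["created_at"] >= cutoff]
    return sorted(recent, key=popularity, reverse=True)[:n]
```

Running `top_tweets` once per country on that country's sample gives the per-country lists the map would display. The result is only as representative as the sample that was pulled.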

Not able to see time zone, place or geolocation of any tweets

I am following two tutorials right now and both are up and running and I've gotten plenty of tweets/sentiment scores from them:
1) Twitter Stream Analytics on Azure https://azure.microsoft.com/en-us/documentation/articles/stream-analytics-twitter-sentiment-analysis-trends/
2) Twitter Analysis with Spark Streaming http://ampcamp.berkeley.edu/3/exercises/realtime-processing-with-spark-streaming.html
I am using the free oauth tool provided from apps.twitter.com.
Problem
I've tried getPlace and getGeoLocation in the Spark Streaming app, and every tweet I get has a null value for those two fields. I have tried filtering for tweets that have values for getPlace or getGeoLocation, and I get null for both (I ran the app for almost 20 minutes).
I've also tried getting TimeZone in the Azure app (so I can get some sort of geography data) and even then I kept getting null values for TimeZone.
Possible Obstacles
1) Does the free Twitter API filter out the place/geoLocation information, so that I would have to buy a subscription to a better API?
2) Do I need to explicitly search for tweets that have geoLocation/Places, rather than getting all tweets and then filtering for the ones that have them? If so, can I execute this search in Spark Streaming?
This is the code that I have in Spark Streaming:
val stream = TwitterUtils.createStream(ssc, None, filters)
val hashTags = stream.map(status => Tweet(status.getPlace().getName(), classifyTweet(status.getText())))
Thank you for the help!
I've personally used the free Twitter API to get locations and publish them on a map in Power BI, so you can rule out the first obstacle.
One thing to note is that the location field is only available if the user specifically allows the application to access their location, which makes it quite rare. In my sample data, about 8% of tweets had a location.
I don't have an answer for the Spark side; I just wanted to help you rule out the first possibility.
Hope this helps.
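Since most tweets carry no location, any consumer has to treat the place and coordinates fields as optional and check them before reading anything off them. The sketch below shows that check on v1.1-style tweet dicts (`coordinates` and `place` keys, both nullable). It is written in Python rather than against the Spark code in the question, and the function names are hypothetical.

```python
def has_location(tweet):
    """True if the tweet carries exact coordinates or a tagged place."""
    return tweet.get("coordinates") is not None or tweet.get("place") is not None


def location_name(tweet):
    """Return the tagged place's full name, or None when no place is attached."""
    place = tweet.get("place")
    return place.get("full_name") if place else None


def located_share(tweets):
    """Fraction of tweets that carry any location information."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    return sum(has_location(t) for t in tweets) / len(tweets)
```

Measuring `located_share` on a sample first gives a quick estimate of how long a stream has to run before enough located tweets accumulate.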

finding twitter accounts associated with a topic

I was asked to find Twitter accounts associated with the Dominican Republic (the project had to do with voting). This was a strange request: although some Twitter accounts have geospatial data associated with them, we have no idea whether it is accurate.
I wound up searching by hand for keywords that I knew were related: #dominican, #washingtonheights and I hopped along their friends and followers and I found the people I was looking for.
More generally:
How do I search for Twitter accounts associated with a given topic? How might it be possible to train a bot to identify hashtags relevant to a given topic? And then we can search for those keywords.
#Moderators: This is not really a coding question. If you can think of a better StackExchange, please migrate this!
Since you already have a given topic, I would suggest the following:
Get a couple of accounts by hand using the hashtags you already mentioned.
Retrieve X tweets for these accounts.
Do some natural language processing on these tweets to get new ideas for keywords.
Some things I have used in this or similar contexts:
tf-idf + NMF to get topics, then sort by components to retrieve the topics a user is talking about (a user can have multiple topics).
some sort of clustering (your biggest problem here will be the high sparsity of the data, so PCA could be an option)
use WordNet etc. to collect similar keywords
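As a rough illustration of the keyword-discovery idea, the sketch below covers only the tf-idf part, using the standard library: it ranks each account's terms by tf-idf so that words distinctive to that account surface as candidate keywords. It does not perform NMF, clustering, or WordNet expansion. In practice a library such as scikit-learn (`TfidfVectorizer`, `NMF`) would likely replace it. The tokenizer and function names are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Lowercase words and hashtags of at least two characters.
TOKEN = re.compile(r"#?[a-z][a-z0-9_]+")


def tokenize(text):
    return TOKEN.findall(text.lower())


def top_keywords(docs, k=5):
    """docs maps account name -> concatenated tweet text.

    Returns account name -> up to k terms ranked by tf-idf (ties broken alphabetically).
    """
    tokens = {name: tokenize(text) for name, text in docs.items()}
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))  # count each term once per account
    n_docs = len(docs)
    result = {}
    for name, words in tokens.items():
        term_freq = Counter(words)
        scores = {
            w: (count / len(words)) * math.log(n_docs / doc_freq[w])
            for w, count in term_freq.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = ranked[:k]
    return result
```

Terms used by every account score zero and sink to the bottom, which is the behaviour you want when looking for new, topic-specific hashtags to search on next.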

Tweets, Location, Keywords and Data

I'm trying to do some analysis on locations where people go during winter. The approach I'm following is to get tweets from a specific city (say, New York) with the keyword Foursquare, then use Foursquare data for that user to see his/her check-ins and try to trace a pattern.
So, I'm stuck in the first phase. How do I get those tweets from ONE city and with the keyword FOURSQUARE? I'm not sure I understood how to use the streaming API correctly, and the REST API isn't working (it shows NOT AUTHORISED).
Could you describe a detailed procedure that a rookie could follow to do the above? Also, let me know if you have a better approach for analysing trends in check-ins.
Thanks
You want to read these:
https://dev.twitter.com/docs/api/1/get/search
https://dev.twitter.com/docs/platform-objects/places
You can give Twitter a latitude/longitude coordinate and a radius, or you can use the "place" field as a filter. Either way, expect to fine-tune this a bit to fit your needs. You also need to take into account that a lot of people might tweet without location services enabled.
If you want to use the REST API, you need to get an API key from twitter.
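The latitude/longitude/radius filter can be sketched as follows, assuming the v1.1 `search/tweets` parameter names (`q`, `geocode`, `count`, `result_type`) and its `"lat,lon,radius"` string format with a `km` or `mi` suffix. The helper names are hypothetical, and the request itself (which needs an authenticated client) is not shown.

```python
def geocode_param(lat, lon, radius, unit="km"):
    """Format a point and radius as 'lat,lon,<radius><unit>'."""
    if unit not in ("km", "mi"):
        raise ValueError("unit must be 'km' or 'mi'")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude/longitude out of range")
    return f"{lat:.6f},{lon:.6f},{radius:g}{unit}"


def search_params(keyword, lat, lon, radius, unit="km", count=100):
    """Query parameters for a keyword search restricted to a circle."""
    return {
        "q": keyword,
        "geocode": geocode_param(lat, lon, radius, unit),
        "count": count,  # v1.1 search returns at most 100 tweets per request
        "result_type": "recent",
    }
```

For the question above, `search_params("foursquare", 40.7128, -74.0060, 25)` would describe a 25 km circle around New York. The radius is the main thing to tune.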

Training data for a recommender system of a location based social network

I'm currently developing a location-based social network in Ruby on Rails. I also want to include a recommendation system. For testing the recommendation algorithms I need some real, anonymous training data. I've found the data from the Netflix Prize, but they are only including .
I'm searching for data that includes
users
friendships
locations or venues
check-ins (like in foursquare)
Does anybody know a good source for such data? Or a proven algorithm for generating this data? Or any other idea?
Search for random graph generation algorithms (more precisely, "social graph generation") to simulate a social graph. Try retrieving some test geolocation data from the Google Maps API or similar services. Unfortunately, I don't know what "check-ins (like in foursquare)" are.
Also see Free Social Graph Data
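One way to simulate the social graph is preferential attachment (in the style of the Barabási–Albert model), where each new user befriends a few existing users with probability roughly proportional to how many friends they already have. The sketch below is purely synthetic: venue coordinates and check-ins are drawn uniformly at random as placeholders, so the output exercises a pipeline but says nothing about real behaviour. All names and parameters are assumptions made for this example.

```python
import random


def friendship_graph(n_users, m, rng):
    """Return a set of (older_user, newer_user) friendship pairs."""
    if not 1 <= m < n_users:
        raise ValueError("need 1 <= m < n_users")
    edges = set()
    # Each user appears in the pool once per friendship; the first m users
    # are seeded with one entry each so that they can be picked at all.
    pool = list(range(m))
    for new in range(m, n_users):
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        for old in sorted(chosen):
            edges.add((old, new))
            pool.append(old)
        pool.extend([new] * m)
    return edges


def synthetic_checkins(n_users, n_venues, n_checkins, rng):
    """Return (venues, checkins) with uniformly random placeholder values."""
    venues = [
        (venue_id, rng.uniform(-90, 90), rng.uniform(-180, 180))
        for venue_id in range(n_venues)
    ]
    checkins = [
        (rng.randrange(n_users), rng.randrange(n_venues))
        for _ in range(n_checkins)
    ]
    return venues, checkins
```

Passing a seeded `random.Random` makes runs reproducible. A more realistic generator would bias check-ins toward venues near a user's friends, which this sketch does not attempt.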
I've finally solved it by using the Gowalla API. It gives you a lot of information about users without asking them to permit access to their data (kind of strange, but it works).
check it out: http://gowalla.com/api/explorer#/users/sco
