How to create an automatic news site? - parsing

I've seen sites like this (http://www.tradename.net/) on the web that seem to be nothing more than a collection of news articles pulled in from different places, all seemingly automated. I would like to know how I can create something like this that:
(a) either automatically, on its own, pulls data from different news feeds and creates these articles/news content, OR
(b) I run a program periodically to update all its content
I am looking for ready-to-run software or a module that I can take, put in either keywords or links to news feeds, and get working. I'm not interested in one of those paid template sites.
Another example: http://www.limitedliability.org/

You can just make your own website like that. Use RSS feeds from the topics or news websites that you want to show your users, and build the site the way you want it in one of the scripting languages. It's not very hard to loop through all the news items in an RSS feed and show them to your users.
You can use PHP
Or .NET
Or JavaScript
And obviously there are more ways to do this. Take a good look around and pick the scripting language you feel most comfortable with.
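As a rough illustration of looping through the items in a feed, here is a minimal Python sketch using only the standard library. The sample feed, its URLs, and the function name `rss_items` are made up for the example; a real site would download the feed text first.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a downloaded feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def rss_items(xml_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in rss_items(SAMPLE_RSS):
        print(f"{title}: {link}")
```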

Create a script that parses the RSS feeds from the news sites and stores only the items you are interested in.
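One simple way to decide what counts as "interesting" is a keyword match on the title. The sketch below assumes the feed has already been parsed into dictionaries with `title` and `link` keys; that shape and the name `filter_items` are assumptions for illustration only.

```python
def filter_items(items, keywords):
    """Keep only items whose title contains at least one keyword (case-insensitive)."""
    wanted = [keyword.lower() for keyword in keywords]
    return [
        item for item in items
        if any(keyword in item["title"].lower() for keyword in wanted)
    ]


if __name__ == "__main__":
    parsed = [
        {"title": "Python 3 released", "link": "http://example.com/a"},
        {"title": "Local weather report", "link": "http://example.com/b"},
    ]
    # Only the first item mentions a keyword, so only it would be stored.
    print(filter_items(parsed, ["python", "ruby"]))
```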

Or just create your own Google News feed and add it to your site. There are free feeds for non-commercial use.
Available Google News Feeds
RSS Feeds: Incorporate feeds onto my site


Displaying tweets from multiple users (similar to Embedded Timelines) without twitter-side user lists

I am new to Twitter and need some tips.
I need to display tweet feed from multiple users on some webpage.
The first thing I stumbled upon is Embedded Timelines. It can display tweets from a list of users, but the gotcha is that those lists have to be maintained on the Twitter side (i.e. I cannot specify #qwe and #asd only on my side and get a timeline without adding those users to a list on Twitter).
The thing is that the list of users to include in the timeline is dynamic, and managing those lists through the Twitter API will probably be painful. Not to mention that my website will probably generate tons of those lists, and I feel that I will hit some API quota sooner or later.
So, my question is: am I stuck with Embedded Timelines that refer to a user list on the Twitter side, managing those lists through, say, the Twitter REST API, or is there a simpler way to do what I want?
It's pretty simple to display tweets for multiple users.
Links to start with
This post explains some of the search queries you can make
This post is a simple library to make requests to the twitter API that 'just works'
Your Query
Okay, so you want multiple users. The endpoint you're looking at using is the search/tweets one: https://api.twitter.com/1.1/search/tweets.json.
The query string uses the from: operator, and you can combine several from: terms with OR.
An example query for the GET request:
?q=from:user1+OR+from:user2
Read more about the search API queries here.
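A minimal sketch of building that query in Python follows. The helper name `build_search_url` is hypothetical, and the sketch only constructs the URL; the authentication that the API requires is not shown. Note that `urlencode` percent-encodes the colon, which is equivalent to the literal form in the example above.

```python
from urllib.parse import urlencode

# Endpoint quoted in the answer above.
SEARCH_ENDPOINT = "https://api.twitter.com/1.1/search/tweets.json"


def build_search_url(usernames):
    """Build a search URL that matches tweets from any of the given users."""
    query = " OR ".join(f"from:{name}" for name in usernames)
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_search_url(["user1", "user2"]))
```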
Your "over-the-quota" issue
This is something you're going to need to figure out yourself. Depending on the number of requests you expect to make and the limits Twitter imposes, you may want some sort of caching: save the results you fetch, and serve only from the cache while you are at your limit.
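One possible shape for such a cache, sketched in Python: results are kept for a fixed number of seconds, and the real request is only made when the cached copy is missing or stale. The class name `TimedCache` and the injectable `clock` parameter are illustrative choices, not part of any Twitter library.

```python
import time


class TimedCache:
    """Cache values per key and refetch only after ttl_seconds have passed."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the cache can be tested without sleeping
        self._store = {}        # key -> (time stored, value)

    def get_or_fetch(self, key, fetch):
        """Return the cached value for key, calling fetch() only when it is missing or stale."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = fetch()
        self._store[key] = (now, value)
        return value
```

With a TTL of, say, 60 seconds, any number of page views within that minute cost a single API request per distinct query.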

Receive notification when site server adds page

I've been doing some programming off and on for my brother, who is a stock trader. I'm wondering if it is possible to receive a push notification when a site server adds a page. For example, the site smallcapfortunes.com frequently adds pages that are simple extensions off the main URL. For example, the site regularly adds pages under URLs such as /neca/, /stev/, etc.
Are there existing methods to execute this? Or is this something I need to write myself? Has anyone here written anything like that?
I know there are existing sites to track basic updates to a single page. In my research, though, I haven't found anything like this.
Please let me know if there are any other details I need to provide.
Generally you can only get a push notification if a specific website offers that service.
Some websites publish a structured (XML) site map. If the one you're interested in does that, you could pull that sitemap on a regular basis and look for differences.
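A minimal sketch of the sitemap comparison in Python, assuming the site publishes a standard sitemaps.org XML file. The example.com URLs and the function names are placeholders; a real script would download the sitemap on a schedule and keep the previous copy on disk.

```python
import xml.etree.ElementTree as ET

# XML namespace used by standard sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of page URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def new_urls(old_xml, new_xml):
    """Return URLs present in the new sitemap but not in the old one, sorted."""
    return sorted(sitemap_urls(new_xml) - sitemap_urls(old_xml))


if __name__ == "__main__":
    old = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://example.com/neca/</loc></url>
    </urlset>"""
    new = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://example.com/neca/</loc></url>
      <url><loc>http://example.com/stev/</loc></url>
    </urlset>"""
    print(new_urls(old, new))
```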
You're most likely going to want to use http://scrapy.org/ to crawl the site and find new URLs such as /neca/ and /stev/, then just run the script every so often.

Display a list of twitter accounts with twitter controls/info

I have some lists of twitter accounts that I would like to recommend on my website (e.g. follow these great crafting bloggers). If I have the twitter ID for each of these people, is it possible to create a list of items that show their twitter info (pic, number of tweets/followers, etc.) as well as controls that allow the user to follow each (or multiple) twitter account? I'd like to be able to do it dynamically based on the list of accounts so that I can update the list and not have to redesign the page. I feel like I've seen this around the web before, but I don't see any widgets for doing it and I'm wondering how it's done.
(I would like to use javascript/jquery, but am pretty flexible here)
Thanks!
Jeff
I would first look at the Twitter API. You will find more information on how Twitter works, and you may find information applicable to what you want to accomplish on your website. It's a start, and there's no better place to start than the source itself.

where can I find a machine-readable list of the most popular users on twitter?

I'm interested in building a simple demo and need a list of top Twitter users. Is there a web site that offers a JSON or RSS feed (or an otherwise easily parseable list) of the top 1000 Twitter users by number of followers? (I know I can scrape one of the many sites like Twitaholic, but I'd rather not bother with that if there is an easier alternative.)
Twitter Counter. They also have a nice REST API that I like. Lady Gaga is #1, of course.
Edit based on comment
Here is a Yahoo Pipe for Top5 which can probably be edited for more
http://pipes.yahoo.com/pipes/pipe.info?_id=10ba4ad51d85cbf06d97236a2a291ac6
http://twittercounter.com/
http://twittercounter.com/pages/api?ref=footer

Ruby Rss parser and event trigger

I'm using the RSS library to parse Atom and RSS in Ruby and Rails and store the results in a model.
I've looked at the standard RSS library, but is there a library that will auto-detect that there is a new RSS feed so I can update my database?
What is the best practice for triggering the code that stores the new RSS feed?
Should I use threads to handle this? Is it going to be slow?
Thank you for your help.
OK, here's the deal.
If you want a really fast feed parser, go for Feedzirra. It does not work on Windows. http://github.com/pauldix/feedzirra
Autodiscovery?
- There's truffle-hog if you don't want to do GET redirects. http://github.com/pauldix/truffle-hog
- There's feedbag if you want to do GET redirects to find feeds from given URLs. This is slower, though. http://github.com/damog/feedbag
Feedzirra is the best bet if you want to poll your feeds for new entries. If you want a non-polling solution, I would suggest going through the PubSubHubbub spec. While parsing your feeds, check whether they are PubSubHubbub enabled by looking for the hub link tag. If it points to pubsubhubbub.appspot.com or any other PubSubHubbub-enabled hub, subscribe to the feed by sending a subscription request to the hub. You can then define an endpoint in your app that receives pings with updated entries for your subscription from the hub. Just read the raw POST data and store it in your database. Stats are that 95% of Blogger blogs are PubSubHubbub enabled. That is a lot of data in your hands already. :)
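The answer is about Ruby, but the two steps (find the hub link, then build the subscription request) are language-independent; a rough Python sketch is below. The sample feed, the callback URL, and the function names are assumptions. The form fields `hub.mode`, `hub.topic`, and `hub.callback` come from the PubSubHubbub spec, and some versions of the spec require further fields, so check the version your hub implements.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hub links are declared with Atom <link> elements, in Atom and RSS feeds alike.
ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed used only for this example.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link rel="hub" href="http://pubsubhubbub.appspot.com/"/>
  <link rel="self" href="http://example.com/feed.atom"/>
</feed>"""


def find_hub(feed_xml):
    """Return the hub URL advertised by the feed, or None if there is none."""
    root = ET.fromstring(feed_xml)
    for link in root.iter(ATOM_NS + "link"):
        if link.get("rel") == "hub":
            return link.get("href")
    return None


def subscription_body(topic_url, callback_url):
    """Build the form-encoded body of a subscription request to POST to the hub."""
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed you want updates for
        "hub.callback": callback_url,  # the endpoint in your app that the hub will call
    })


if __name__ == "__main__":
    print(find_hub(SAMPLE_ATOM))
    print(subscription_body("http://example.com/feed.atom", "http://myapp.example/push"))
```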
If you are polling for changes, check the Last-Modified or ETag value from the response headers rather than parsing the entire feed again. That saves you from wasting resources. Feedzirra takes care of this for you.
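For readers not using Feedzirra, this is roughly what such a conditional request looks like when done by hand, sketched with Python's standard library. The function names are hypothetical. The idea is to send back the ETag and Last-Modified values saved from the previous fetch and treat a 304 response as "nothing new".

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved after the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None when the feed is unchanged."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the validators we already have
            return None, etag, last_modified
        raise
```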
I am not sure what you mean by "auto-detect" a new feed?
Are you looking for code that can discover when someone creates a new feed on a site? Or, do you mean discover when an existing feed has a new article?
The first is tough because your code needs to know what site to look at, so it needs some sort of auto-discovery of sites with new feeds. Searching Google for "new rss feeds" doesn't return anything that looks useful, at least not on the first page. If you, or your users, know of a new site, then you can have an interface for adding new sites to search. Then you grab the page at that URL, look for the RSS/Atom auto-discovery links, and go from there. Auto-discovery links can open a can of worms: the same content may be served in different formats (RDF, RSS and Atom), so you have to decide which to use, and a page may also list multiple feeds with alternate content.
If you mean you want to discover when an existing feed has new articles, then you have to keep track of the last time your code looked at the feed, and the last article that was seen, then retrieve the feed and see if any articles were not in your list of previously seen articles. Your code needs to be sensitive to the time-to-live information in a lot of feeds too. Hitting the feed every fifteen minutes when they update once a week is bad form. Most aggregation code can do those things already but you might need to configure a database and tell the code how to find it.
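The "previously seen articles" check can be sketched in a few lines of Python. The sketch assumes each parsed entry carries a stable identifier under an `id` key (for example the feed's guid or link); that shape and the function name are assumptions, and in practice the seen set would live in a database rather than in memory.

```python
def unseen_entries(entries, seen_ids):
    """Return entries whose id has not been seen before, and record them as seen."""
    fresh = [entry for entry in entries if entry["id"] not in seen_ids]
    seen_ids.update(entry["id"] for entry in fresh)
    return fresh


if __name__ == "__main__":
    seen = {"post-1"}
    feed = [{"id": "post-1", "title": "Old"}, {"id": "post-2", "title": "New"}]
    print(unseen_entries(feed, seen))  # only post-2 is new
    print(unseen_entries(feed, seen))  # nothing is new on the second pass
```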
Generally, for this sort of task I set up a crontab entry on a production Linux or Unix system and fire off the job periodically, looking in the database for feeds whose last-run-time plus the stored time-to-live value is in the past.
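The selection rule described here (last run time plus time-to-live is in the past) might look like the following in Python. The field names `last_run` and `ttl_minutes` are hypothetical stand-ins for whatever columns the database actually stores.

```python
from datetime import datetime, timedelta


def feeds_due(feeds, now):
    """Return the feeds whose last run plus stored time-to-live is not in the future."""
    return [
        feed for feed in feeds
        if feed["last_run"] + timedelta(minutes=feed["ttl_minutes"]) <= now
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    feeds = [
        {"name": "a", "last_run": datetime(2024, 1, 1, 11, 0), "ttl_minutes": 30},
        {"name": "b", "last_run": datetime(2024, 1, 1, 11, 45), "ttl_minutes": 60},
    ]
    print([feed["name"] for feed in feeds_due(feeds, now)])
```

A cron job can then call this once per run and fetch only the feeds it returns.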
Does that help any?
A very easy solution is to use dynamic attribute-based finders.
When you are filling your model with RSS feed data, use Model.find_or_create_by_column(value, :other_column => other_value) instead of Model.create(...).
You can use a date or the RSS message title as the unique value (whatever you want).
I think this is pretty easy. You can set up a cron task to fill your model once per hour, for example. Only new entries will be added.
There is no way to get an "event" when an RSS feed is updated without downloading the whole feed again.
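The same "insert only if it is not there yet" idea can be shown outside Rails. The Python sketch below swaps the dynamic finder for a unique key plus SQLite's `INSERT OR IGNORE`, which is a different technique with the same effect; the table layout and the function name are made up for the example.

```python
import sqlite3


def store_entries(conn, entries):
    """Insert (guid, title) pairs, skipping guids already stored; return how many were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (guid TEXT PRIMARY KEY, title TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
    # The primary key on guid makes duplicate rows conflict; OR IGNORE skips them.
    conn.executemany(
        "INSERT OR IGNORE INTO entries (guid, title) VALUES (?, ?)", entries
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
    return after - before


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(store_entries(connection, [("a", "First"), ("b", "Second")]))
    print(store_entries(connection, [("b", "Second"), ("c", "Third")]))
```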
