Is there any news feed (event/activity stream) engine? - scalability

I'm looking for an open source news feed engine to use in an app I'm developing.
The engine needs to be able to aggregate news (items) from multiple sources a user is following and also optionally group them by news source or news type. A scalable solution in Java or with Java interface would be great.
I have already developed a very simple one, but I would prefer to use a robust and reliable solution instead.
Do you have any suggestion?

I created a backend for this in Java, building on neo4j. Retrieval time is independent of the number of users and of the number of news sources one follows; it depends only linearly on the number of items you want to display.
You can find an explanation of how it works, and benchmarks on social networks with up to 2 million users, at http://www.rene-pickhardt.de/graphity-an-efficient-graph-model-for-retrieving-the-top-k-news-feeds-for-users-in-social-networks/
The source code is also available: https://github.com/renepickhardt/graphity-evaluation
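To make the "top-k items from followed sources" idea concrete, here is a minimal Python sketch. It is not the Graphity algorithm from the link above, just a plain k-way merge, and it assumes each source already keeps its items sorted newest-first. The function names (top_k_feed, group_by_source) and the sample data are made up for illustration.

```python
import heapq
from itertools import islice

def top_k_feed(sources, followed, k):
    """Return the k newest (timestamp, source, title) items across followed sources.

    `sources` maps a source name to a list of (timestamp, title) tuples
    that is already sorted newest-first (an assumption of this sketch).
    """
    streams = (
        [(ts, name, title) for ts, title in sources.get(name, [])]
        for name in followed
    )
    # heapq.merge yields items lazily; with reverse=True it expects
    # each input to be sorted in descending order of the key.
    merged = heapq.merge(*streams, key=lambda item: item[0], reverse=True)
    return list(islice(merged, k))

def group_by_source(items):
    """Optionally group the merged items by their news source."""
    grouped = {}
    for ts, name, title in items:
        grouped.setdefault(name, []).append((ts, title))
    return grouped

# Hypothetical sample data: timestamps are plain integers here.
sources = {
    "blog": [(50, "post C"), (20, "post B"), (5, "post A")],
    "wire": [(40, "story Y"), (30, "story X")],
    "misc": [(60, "not followed")],
}
feed = top_k_feed(sources, ["blog", "wire"], 3)
print(feed)
print(group_by_source(feed))
```

Because only k items are pulled from the merged stream, the work grows with the number of items displayed and the number of followed sources, not with the total number of users.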

Check out Rome
http://rometools.org/
In case you also use .NET, Argotic Syndication Framework is for sure the best
http://argotic.codeplex.com/

Yahoo Pipes is a very good RSS tool that lets you create your own feed aggregator with custom filters.
Note: Pipe2py is a Python version of Yahoo Pipes.
Another offering from Yahoo is Dapper.
There are a few more online tools for creating custom feeds which you might want to look at.

FeedDistiller is a free service for aggregating news feeds by subject.

Related

Downloading Twitter corpus

I am working on a data mining system, and one of the requirements is that it can perform the analysis without using the API. Is there a way to download the Twitter database (or a big part of it, at least) and work with it locally?
There is a paper about creating corpora from Twitter. It is called “TWORPUS – An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora”. I recommend reading it because it also covers licensing issues etc. They also provide their code on GitHub.
In fact, you cannot download Twitter data dumps directly. You can download single tweets and store them in a corpus, but you are not allowed to share that data. Therefore, the authors built the Tworpus client to create private Twitter corpora.
The APIs are the official way of getting Twitter data and they work really well, so it is hard to understand why you do not want to use them. Web scraping is a workaround, but it is not recommended; besides, you want a big part of the data, so I do not think you will be satisfied with it. You can also buy the data from Gnip.

Filter Products based on Product attribute

I need to create a filter like the one at the link below:
https://paytm.com/shop/g/paytm-home/incredible-offers/smartphones-flat-20-cashback
When I click "smartphone" on the landing page, the filters shown are specific to smartphones,
such as camera, color, SIM, internal memory, external memory, etc.
Currently I have a list of productViewmodel objects which contain only the product and product variant.
Please guide me.
Thanks in advance :)
The search term you are looking for is faceted search.
One option for implementing it is to use a faceted search engine, such as Bobo-Browse.Net (which is implemented as an extension to the Lucene.Net search engine). It is a .NET port of the Java version, meaning it is a 100% .NET solution.
See the faceted search prototype and car demo for some examples of how to implement it in MVC.
Full Disclosure: I am a major contributor to the Bobo-Browse.Net project.
Another option is to use Solr, which runs as a separate process from the web site that uses it. It is a Java-based solution.
Either way, the best solution for a web site is to use AJAX so the drill-down happens without reloading the entire page.
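To show what a faceted search engine computes, here is a minimal in-memory Python sketch. It is not how Bobo-Browse.Net or Solr are implemented or called; the product structure, attribute names and function names (facet_counts, apply_filters) are assumptions made up for the example.

```python
from collections import Counter, defaultdict

def facet_counts(products, facet_names):
    """Count how many products carry each value of each facet."""
    counts = defaultdict(Counter)
    for product in products:
        for name in facet_names:
            value = product["attributes"].get(name)
            if value is not None:
                counts[name][value] += 1
    return {name: dict(counter) for name, counter in counts.items()}

def apply_filters(products, selected):
    """Keep products matching every selected facet.

    `selected` maps a facet name to the set of accepted values.
    """
    return [
        p for p in products
        if all(p["attributes"].get(name) in values
               for name, values in selected.items())
    ]

# Hypothetical catalogue entries for the "smartphone" category.
products = [
    {"name": "Phone A", "attributes": {"color": "black", "sim": "dual", "internal_memory": "16GB"}},
    {"name": "Phone B", "attributes": {"color": "white", "sim": "single", "internal_memory": "16GB"}},
    {"name": "Phone C", "attributes": {"color": "black", "sim": "dual", "internal_memory": "32GB"}},
]

# The user ticks "black"; recompute the remaining facets for the drill-down.
black = apply_filters(products, {"color": {"black"}})
facets = facet_counts(black, ["sim", "internal_memory"])
print([p["name"] for p in black])
print(facets)
```

A dedicated engine does the same counting against an index rather than a list, which is what keeps it fast for large catalogues.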

Determining language of twitter posts

What is the best way to determine the language of Twitter posts?
There is the language parameter that comes with the streaming API, but it doesn't really seem to be very accurate. Even many Japanese posts are labelled as English.
What have others done to sort out the languages?
I've had very good results with this PHP package:
http://pear.php.net/package/Text_LanguageDetect/
It is fast and open source. We use it to select English only posts for a site we run at http://2012twit.com.
Google has language detection within its Translate API, if using evil external services is an option:
http://code.google.com/apis/language/translate/v1/reference.html#detectResult
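For the specific problem of Japanese posts being labelled as English, a cheap sanity check on the script can catch many cases before reaching for a full detector. The Python sketch below is only an assumption-laden illustration, not what Text_LanguageDetect or the Translate API do: it treats any hiragana or katakana character (Unicode range U+3040 to U+30FF) as evidence of Japanese. Kanji-only text is not caught, since those characters are shared with Chinese. The function names are made up.

```python
def contains_kana(text):
    """True if the text contains any hiragana or katakana character."""
    return any("\u3040" <= ch <= "\u30ff" for ch in text)

def correct_language(text, api_lang):
    """Override the API's language label when kana is present."""
    if contains_kana(text):
        return "ja"
    return api_lang

print(correct_language("こんにちは world", "en"))
print(correct_language("hello world", "en"))
```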

Search Engine without crawling?

Is there a way to collect web content for use in a search engine without going through the web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (wikipedia, stackoverflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you use the APIs of other search engines and use THEIR search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of their search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at datasift as an example. There are lots more resources you could cleverly use and avoid/minimize crawling.
If you want to be kept up to date with the latest content on pages, you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use paid services like Superfeedr that make use of the same protocol.
Directly or indirectly, you have to crawl the web in order to get the content.
Well if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with title, description and tags). So a collaborative link collection can be built.
To avoid spam, a +/- system can be added, so users can vote useful sites or tags up and useless ones down.
To stop spammers from mass-voting the SERPs, you can weight votes by user reputation.
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns,
while also taking other abuse patterns into account.
Well, you got the point, I think.
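As a rough Python sketch of the reputation weighting described above: each vote counts in proportion to the voter's reputation, so a handful of throwaway accounts cannot outvote one trusted user. The function name, the default reputation of 1.0 and the sample numbers are assumptions for illustration only.

```python
def weighted_score(votes, reputation, default_reputation=1.0):
    """Sum votes weighted by each voter's reputation.

    `votes` is a list of (user, direction) pairs with direction +1 or -1.
    `reputation` maps a user to a weight; unknown users get the default.
    """
    return sum(
        direction * reputation.get(user, default_reputation)
        for user, direction in votes
    )

# Hypothetical data: one trusted user upvotes, three new accounts downvote.
reputation = {"alice": 10.0}
votes = [("alice", +1), ("spam1", -1), ("spam2", -1), ("spam3", -1)]
print(weighted_score(votes, reputation))  # 10.0 - 3 * 1.0
```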
As spammers gradually discover weaknesses of traditional search engines (see Google bomb, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold start effect, and when the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange are not spammed to useless levels so far...
PS: http://xkcd.com/810/

How to create an automatic news site?

I've seen sites like this (http://www.tradename.net/) on the web that seem to be nothing more than a collection of news articles pulled in from different places - all seemingly automated... I would like to know how can I create something like this that:
(a) either automatically, on its own, pulls data from different news feeds and creates these articles/news content, OR
(b) I run a program periodically to update all its content
I am looking for a ready-to-run software / module that I can take and put in either the keywords or links to news feeds and get it to work... I'm not interested in one of those paid template sites.
Another example: http://www.limitedliability.org/
You can just make your own website like that. Just use RSS feeds from the topics or news websites that you would like to show your users. Customize your website however you want using one of the scripting languages. It's not very hard to loop through all the news items in an RSS feed and show them to your users.
You can use PHP
Or .NET
Or Javascript
And obviously there are more ways to do this. Just take a good look around and check which scripting language you feel most comfortable with.
Create a script that parses the rss-feeds from the news sites, and only store the ones that you are interested in.
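A minimal Python sketch of such a script, using only the standard library. It assumes a plain RSS 2.0 document with item/title/link/description elements; the sample feed, the example.com URLs and the function name interesting_items are made up, and fetching the feed over HTTP and storing the results are left out.

```python
import xml.etree.ElementTree as ET

def interesting_items(rss_xml, keywords):
    """Return title/link pairs for items that mention any keyword."""
    root = ET.fromstring(rss_xml)
    wanted = [k.lower() for k in keywords]
    kept = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        description = item.findtext("description", default="")
        haystack = (title + " " + description).lower()
        if any(k in haystack for k in wanted):
            kept.append({"title": title, "link": link})
    return kept

# Hypothetical feed content standing in for a downloaded RSS document.
SAMPLE = """<rss version="2.0"><channel>
<title>Example news</title>
<item><title>Tech firm launches phone</title>
<link>http://example.com/1</link>
<description>A new smartphone was announced.</description></item>
<item><title>Local weather</title>
<link>http://example.com/2</link>
<description>Rain expected.</description></item>
</channel></rss>"""

result = interesting_items(SAMPLE, ["smartphone"])
print(result)
```

Run periodically (for example from cron), this covers option (b) from the question; the kept items can then be written to a database or rendered straight into a page.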
Or just create your own Google News feed and add it to your site. There are free feeds for non-commercial use.
Available Google News Feeds
RSS Feeds: Incorporate feeds onto my site

Resources