I'm working on an RSS reader app and I'd like to provide a consistent article experience for my users (e.g. like the new Safari Reader). Is there an API that would apply here? I'm aware of Readability but I'm not sure if that is what I need.
I was just doing some research for my own app; here's what I've found. I couldn't post all the links because I'm new, but they are easy to find with a search.
Read, Clear Read API: http://readapp.net/pub.html
Instapaper itself. Simple and Full API
Readability
RTCOOL
Feeds api
Boilerpipe
Goose
An overview of text extraction algorithms: http://www.readwriteweb.com/hack/2011/03/text-extraction.php
best of luck!
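To give a feel for what these extraction tools do, here is a very crude sketch in Python. It is not the algorithm used by Readability, Boilerpipe, or any of the services listed above; it only assumes that article text tends to sit in longer `<p>` elements while navigation and captions are short. The names `ParagraphCollector`, `extract_article_text`, and the `min_words` threshold are made up for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element, skipping script/style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buf = None       # text pieces of the <p> currently open, or None
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._buf = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buf is not None and not self._skip_depth:
            self._buf.append(data)

    def _flush(self):
        if self._buf is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None


def extract_article_text(html, min_words=8):
    """Return paragraphs with at least min_words words, in document order."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return [p for p in parser.paragraphs if len(p.split()) >= min_words]
```

Real pages will defeat a heuristic this simple fairly often, which is why the dedicated libraries and hosted APIs exist.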
I am working on a data mining system and one of the requirements is it being able to perform the analysis without the use of API. Is there a way to download the Twitter database (or a big part of it, at least) and work with it locally?
There is a paper about creating corpora from Twitter: “TWORPUS – An Easy-to-Use Tool for the Creation of Tailored Twitter Corpora”. I recommend reading it because it also covers licensing issues. The authors also provide their code on GitHub.
You cannot download Twitter data dumps directly. You can download individual tweets and store them in a corpus, but you are not allowed to share that data. That is why the authors built the Tworpus client, which creates private Twitter corpora.
APIs are the official way of getting Twitter data and they work really well, so it is not clear why you do not want to use them. Web scraping is a workaround but is not recommended; besides, since you want a large part of the data, I do not think you would be satisfied with it. You can also buy the data from Gnip.
I am browsing the reddit source code and would like to know specifically where the filter code is. How do reddit's filters learn to tell spam from non-spam?
From the open-sourcing announcement post:
There are a few portions of the code that we're keeping to ourselves, mostly related to anti-cheating/spam protection.
Alright — this sort of question shows my naïveté but I am asking it nonetheless so I don't venture down the wrong rabbit hole while trying out this app.
I'm making what amounts to a news app. Imagine taking a Wordpress blog and fitting it to iOS. Now, here's my question — what sort of feed / architecture should I be using to push information from my Wordpress server to my app? I would assume RSS using AFNetworking, but that seems to cause some rough edges, and all tutorials that I see end up pushing to a web view instead of a scrollview with nice, rendered text. Plus, none of the same tutorials seem to have anything further than the initial feed (loading more than the initial 10 stories given, for example).
I've already committed a few hours to trying the RSS / AFNetworking approach, but is there a significantly better alternative that I just haven't come across? (Note that I do have access to the back end of my WordPress site, i.e. it isn't somebody else's.)
If you are building an iOS app that connects to WordPress, I suggest accessing the website data through an API instead of a feed; you can then call the API from your app and manipulate the data however you want.
If you have access to the WordPress backend, check out the Thermal API, a plugin that will probably solve your problem.
Cheers,
I would suggest looking at https://wordpress.org/plugins/json-rest-api/. It is to be added to the core of WordPress, so I think that is the way to go.
By the way I am working on the same type of thing as you.
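For the "loading more than the initial 10 stories" part of the question, a JSON API usually gives you explicit paging. Below is a rough Python sketch of the idea (the same shape carries over to an iOS client). It assumes the `/wp-json/wp/v2/posts` route and the `page` / `per_page` query parameters of the REST API that ships in current WordPress core; the plugin version linked above may expose different paths, so check your own site. The function names `posts_url` and `parse_posts` are made up for illustration, and no network call is made here.

```python
import json
from urllib.parse import urlencode


def posts_url(site, page=1, per_page=10):
    """Build the URL for one page of posts (assumes the core wp/v2 routes)."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{site.rstrip('/')}/wp-json/wp/v2/posts?{query}"


def parse_posts(body):
    """Reduce a JSON posts response to the fields a list screen needs."""
    return [
        {
            "id": post["id"],
            "title": post["title"]["rendered"],   # title as rendered HTML
            "html": post["content"]["rendered"],  # article body as HTML
        }
        for post in json.loads(body)
    ]
```

To load the next batch when the user scrolls to the bottom, request `posts_url(site, page=2)`, then `page=3`, and so on. The body still arrives as HTML, so you will need to render or convert it on the device.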
I think the most popular WordPress API is the one that comes with Jetpack. You can find its documentation here: https://developer.wordpress.com/docs/api/
If you just want read access, then I think the easiest way is to use: https://github.com/evermeer/AlamofireJsonToObjects/blob/master/AlamofireJsonToObjectsTests/WordpressTest.swift
If you also want write access, you have to implement OAuth2. For that you can pick a library from: https://cocoapods.org/?q=oauth
How could I access a website and turn components of the website into strings, for example taking information from Facebook posts? I have done a little searching but can't find any good tutorials or anything useful.
Try looking at this tutorial. It should get you more familiar with the subject and start you off on the right track.
As it states at the beginning of the tutorial...
How to Parse HTML on iOS
Let’s say you want to find some information inside a web page and
display it in a custom way in your app. This technique is called
“scraping.” Let’s also assume you’ve thought through alternatives to
scraping web pages from inside your app, and are pretty sure that’s
what you want to do. Well then you get to the question – how can you
programmatically dig through the HTML and find the part you’re looking
for, in the most robust way possible? Believe it or not, regular
expressions won’t cut it! Well, in this tutorial you’ll find out how!
You’ll get hands-on experience with parsing HTML into an Objective-C
data model that your apps can use.
http://www.raywenderlich.com/14172/how-to-parse-html-on-ios
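The tutorial itself is in Objective-C, but the idea of using a real HTML parser instead of regular expressions is the same in any language. As a rough illustration only, here is a Python sketch that uses the standard library's `html.parser` to turn every link on a page into a pair of strings. The names `LinkExtractor` and `extract_links` are made up for this example, and the HTML is assumed to be already downloaded.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")  # None for <a> without href
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Nested markup inside a link (bold text, spans) is handled because the parser reports the text pieces separately and they are joined at the closing tag.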
Is there a way to collect web content for use in a search engine without going through the web crawling phase? Is there any alternative to web crawling?
Thanks
No, to collect the content you have to...collect the content. :-)
Yes (and sort-of no).
:)
You can download existing data dumps from various websites (wikipedia, stackoverflow, etc.) and construct a partial index that way. It obviously won't be a complete index of the internet.
You could also use meta-search to construct your search engine. This is where you call the APIs of other search engines and use their search results as the basis of your index. Examples include citosearch and opensearch. DuckDuckGo uses Yahoo's BOSS API (and now Yahoo uses Bing...) as part of its search engine.
There are also real-time streaming APIs that you could use instead of crawling the web. Look at datasift as an example. There are lots more resources you could cleverly use and avoid/minimize crawling.
If you want to stay up to date with the latest content on pages, you can use something like the PubSubHubbub protocol to get push notifications for subscribed links.
Or use a paid service like Superfeedr that makes use of the same protocol.
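To make the push idea concrete: in PubSubHubbub a subscriber sends a form-encoded POST to the hub with `hub.mode`, `hub.topic`, and `hub.callback`, and the hub then verifies the request by calling the callback with a `hub.challenge` value that must be echoed back. Here is a minimal Python sketch of those two pieces. The function names and example URLs are made up, no HTTP is actually performed, and a real subscriber would also need a web server behind the callback URL.

```python
from urllib.parse import urlencode, parse_qs


def subscription_body(topic_url, callback_url, lease_seconds=None):
    """Form-encoded body for a subscribe request sent to a hub."""
    fields = {
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the feed you want updates for
        "hub.callback": callback_url,  # where the hub should push updates
    }
    if lease_seconds is not None:
        fields["hub.lease_seconds"] = str(lease_seconds)
    return urlencode(fields)


def verification_response(query_string, expected_topic):
    """Return the body to echo for the hub's verification GET, or None to reject."""
    params = parse_qs(query_string)
    if params.get("hub.topic", [None])[0] != expected_topic:
        return None  # not a subscription we asked for
    return params.get("hub.challenge", [None])[0]
```

After verification succeeds, the hub POSTs new content to the callback whenever the topic changes, so you never poll or crawl that feed yourself.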
Directly or indirectly, you have to crawl the web in order to get the content.
Well if you don't want to crawl, you can follow a wiki-like approach, where users can submit links to sites (with title, description and tags). So a collaborative link collection can be built.
To avoid spam, a +/- voting system can be used to vote useful sites or tags up and useless ones down.
To keep spammers from mass-voting the SERPs, you can weight votes by user reputation.
User reputation can be gained by submitting useful sites, or by somehow tracing usage patterns.
Other abuse patterns need to be considered too.
Well, you got the point, I think.
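Weighting votes by reputation could look something like this rough Python sketch. How reputation is earned is left open, as above; the function name `weighted_score`, the `default_rep` value for unknown users, and the one-vote-per-user rule are assumptions made for illustration.

```python
def weighted_score(votes, reputation, default_rep=1.0):
    """Sum +1/-1 votes, each scaled by the voter's reputation.

    votes: iterable of (user, direction) pairs, direction is +1 or -1.
    reputation: dict mapping user -> non-negative weight.
    """
    seen = set()
    score = 0.0
    for user, direction in votes:
        if user in seen:   # count only a user's first vote on this link
            continue
        seen.add(user)
        score += direction * reputation.get(user, default_rep)
    return score
```

With weights like these, a handful of throwaway accounts cannot outvote one long-standing contributor, though the reputation numbers themselves then become the thing to protect.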
As spammers gradually discover weaknesses of traditional search engines (see Google bomb, content scraper sites, etc.), a community-based approach may work. But it would suffer severely from the cold start effect, and when the community is small the system is easy to abuse and poison...
At least Wikipedia and Stack Exchange are not spammed to useless levels so far...
PS: http://xkcd.com/810/