How on earth does Google Reader parse RSS?

I'm pulling my hair out; I might pull a tooth out next, that's how frustrated I am.
I have deleted (to prove a point) ALL the RSS files in my WordPress site:
http://baked-beans.tv
No matter what I edit, Google Reader shows what it wants, i.e. the posts and all their content!
So how on earth am I supposed to edit the content that most of my RSS subscribers will view (since Google Reader is very popular)?
If you look here: http://baked-beans.tv/feed/
There is NO content!
And yet if I add this URL to Google reader, it generates full posts in the feed.
Furthermore!
If I edit, say, wp-includes/feed-rss2.php, I can see those changes in the RSS views of Safari, Firefox, etc., but Google again shows the same thing: the entire post.
This really isn't on. If you go to Google Reader and click "Show Details", it says "Feed URL: http://baked-beans.tv/feed/", which is just a total lie.
I really need to control how people see posts. The posts contain hefty video and a lot of images, and Google Reader renders them in a really unattractive way.
Thanks in advance,
Marc

I'm pretty sure Google is using a cached result, because your feed is completely empty. An empty response is invalid RSS, which is probably interpreted as an error condition, as if the feed were down.
Try serving a feed that is valid but empty. That should get Google to pick up the change sooner or later.
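For reference, a minimal RSS 2.0 document that is valid yet contains no items looks like this (the channel's title, link and description are required, items are not; the wording below is just a placeholder):

    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0">
      <channel>
        <title>Baked Beans</title>
        <link>http://baked-beans.tv</link>
        <description>Feed intentionally left empty</description>
      </channel>
    </rss>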

If you'd like to edit the contents of posts that were already crawled by Reader, you'll need to republish them with the same GUID (if using RSS) or ID (if using Atom). Reader keeps copies of posts indefinitely (so that it can show historical data for feeds), and it keys things off of the ID. If it sees a post with the same ID as one it already has, it'll update the content of its copy with the new crawled content (more details here).
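For example, a republished item that keeps its original WordPress-style GUID (all values below are made up) will overwrite Reader's stored copy instead of showing up as a new post:

    <item>
      <title>Edited post title</title>
      <link>http://baked-beans.tv/some-post/</link>
      <description>A trimmed summary instead of the full post body.</description>
      <guid isPermaLink="false">http://baked-beans.tv/?p=123</guid>
    </item>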

Related

Why does the iTunes Store Reviews RSS feed sometimes return no results?

I'm trying to import reviews for certain apps on the iTunes App Store via the public reviews RSS feed. Most of the time the feed returns a list of 50 reviews per page and gives me links for up to 10 pages, but for some apps, some or all of those pages have 0 reviews, and I can't tell why.
At the time of writing, the feed for Instagram (link below) returns no reviews, despite reporting that there are 10 pages of reviews available.
https://itunes.apple.com/us/rss/customerreviews/page=1/id=389801252/sortBy=mostrecent/xml
Even more confusing, I noticed last night that page 2 had 50 reviews but none of the other pages had any. This morning, page 2 is empty again.
If I remove the sortBy=mostrecent portion of the URL above, I actually do get 50 results back, but none of the other pages have any results.
Finally, it appears that the JSON version of this feed (link below) actually returns results more reliably than the XML version. Unfortunately, the JSON version omits the review date, so I can't use it.
https://itunes.apple.com/us/rss/customerreviews/page=1/id=389801252/sortBy=mostrecent/json
Can anyone explain this? Is Apple's XML feed API just extremely unreliable? Am I forming a bad URL?
While this answer isn't very satisfying, it's the best I could work out after many trials. The XML feed really is unreliable and shouldn't be used for real-world purposes. Furthermore, the public JSON feed omits certain fields, such as the review date. Neither feed reports developer responses.
It's also clear that Apple doesn't use these feeds for iTunes (desktop) or the App Store (iOS). I ultimately reverse-engineered the way iTunes requests review data and found that making a request the same way, taking care to match its User-Agent and version, would return the data I needed. These requests seem to be rate-limited to a certain extent, and the data comes back as a mix of HTML and JSON that requires a lot of parsing. Moreover, because they're private calls, Apple could shut the door at any moment.
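For what it's worth, a short Ruby sketch using the feed URL pattern from the question makes the flakiness easy to observe; it counts the entries each page returns, and running it at different times should reproduce the shifting per-page counts described above:

    require 'net/http'
    require 'uri'

    # Fetch pages 1-10 of the public XML reviews feed for the app id from
    # the question, and count the <entry> elements on each page. A simple
    # string scan stands in for a full XML parse here.
    (1..10).each do |page|
      url  = URI("https://itunes.apple.com/us/rss/customerreviews/page=#{page}/id=389801252/sortBy=mostrecent/xml")
      body = Net::HTTP.get(url)
      puts "page #{page}: #{body.scan(/<entry[\s>]/).size} entries"
    end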

Query the Facebook Graph API for objects containing a given string

I'm looking to get back photos from a specific album on Facebook whose titles contain a given string. Can this be done? The only alternative I know of is to download ALL the photos in the album (there are about 500 at the moment, and the number will grow over time), parse them all, and then filter. That could become (and already kind of is) an extremely costly operation, which I'm looking to avoid.
I have looked all over and have yet to find anything even somewhat related, so if there's already an answer to this question, please link me to it. Thanks!
You do have to go through the album; there is no Search API for that.
1. Get /album-id/photos
2. Check for the string in the title
3. Use paging to get the next batch of photos
4. Repeat from step 2 until there is no next batch
You can also cache the IDs/titles for later, so you don't have to hit the API for all the photos on every new search, but that's completely up to you. There is no other way. A minimal sketch of the loop follows.
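In this Ruby sketch the token, album id, and search string are placeholders, and the photo's title is read from the Graph API photo object's name field:

    require 'net/http'
    require 'json'
    require 'uri'

    token  = 'ACCESS_TOKEN'    # placeholder
    needle = 'birthday'        # placeholder: the string to look for in titles
    url    = URI("https://graph.facebook.com/ALBUM_ID/photos?fields=id,name&access_token=#{token}")
    matches = []

    while url
      page = JSON.parse(Net::HTTP.get(url))            # 1. get a batch of photos
      (page['data'] || []).each do |photo|
        matches << photo['id'] if photo['name'].to_s.include?(needle)   # 2. check titles
      end
      next_url = page['paging'] && page['paging']['next']               # 3. follow paging
      url = next_url && URI(next_url)                  # 4. nil means no next batch: stop
    end

    puts matches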

Asana: convert user #-tag to API object

I'm parsing the descriptions of tasks for user links (#-tags) that we use to identify different roles on an item. I noticed something weird about the IDs, though.
In the notes of a task returned from the API, the #-tags are converted to links of the form https://app.asana.com/0/<int_id>/<int_id>, which, when visited in the browser, show the user's tasks. But when I use that ID to query the API, as in https://app.asana.com/api/1.0/users/<int_id>, I get a 403 with this response: {"errors":[{"message":"user: Not the correct type"}]}. Further investigation showed that the IDs used in the #-tags are different from those used in the API for the same user, even though both lead to the same page.
My question is: are these IDs meant to be opaque, or is there a way to convert them to the correct corresponding API IDs (short of browser scraping)?
Unfortunately, it's not possible at this time. In comments and notes (basically, anywhere in Asana that Rich Text is possible), we represent users as the URI to their "My Tasks" page, which is different from their User ID (as you noticed).
We are exploring ways to close this gap, but don't have anything to share at this time. I know that's not super helpful, but I hope it at least helps to have a definitive answer :-(
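If all you need is to pull the embedded IDs out of the notes (knowing, per the answer above, that they cannot be fed to the /users/ endpoint), a small Ruby extraction is enough; the note text here is hypothetical:

    notes = 'Reviewed by https://app.asana.com/0/12345/67890'
    # Capture both numeric components of each "My Tasks" style link.
    ids = notes.scan(%r{https://app\.asana\.com/0/(\d+)/(\d+)})
    # => [["12345", "67890"]]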

SEO: How to get rid of the webpage titles below the main link URL on Google

Recently I changed my website, which is now hosted on a different server (the previous server, hosted by another company, is no longer available).
Everything is different on my new website, including the content, the layout, the design and, most important here, the format of the URLs.
The only thing I kept is my domain name, which has been redirected to the new server.
Keeping the same domain name is the issue:
The problem is that when I search for my website on Google, the main link displayed is OK, but below it there are 4 titles corresponding to 4 sections of my previous website.
Clicking on them leads to old URLs that no longer exist.
You get a kind of cached result with no CSS, and users are complaining a lot about that.
I opened an account on Google Webmaster Tools, declared a brand-new sitemap.xml, and asked for a fresh Googlebot crawl (three times already).
It's been a week now and the titles below the main link on Google remain the same.
How can I get rid of them?
In Google Webmaster Tools I tried to "ban" the URLs behind these titles. It kind of works, but not as I expected: the titles remain, but there is no longer a description below them (which doesn't solve my issue, it just makes things uglier).
Another difference is that one of these links finally disappeared… but another outdated section link took its place. It could go on like this forever, as there are too many possible links to ban and no certainty of results.
What I would like is to keep just the main link on Google and get rid of these "sub" titles. At least the old ones.
PS: I never asked for these titles in the first place. They just appeared a long time ago.
I don't mind the new sections showing up there, but certainly not the old ones.
Thank you for your help.
First of all, I would block the old site's content with robots.txt to prevent further crawls driven by cached copies of your old sitemap.xml.
If your site gets a decent amount of SEO traffic, I'd also create .htaccess 301 redirects for the most important old URLs.
De-indexing can take some time to start; I waited about two weeks.
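For illustration, with made-up paths, the two pieces could look like this:

    # robots.txt - stop crawling of the old site's sections
    User-agent: *
    Disallow: /old-section/

    # .htaccess - 301-redirect the most important old URLs to the new ones
    Redirect 301 /old-section/page.html http://www.example.com/new-section/page/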

Ruby RSS parser and event trigger

I'm using the RSS library so I can parse Atom and RSS in Ruby and Rails and store the feeds in a model.
I've looked at the standard RSS library, but is there one library that will auto-detect that there is a new RSS feed, so I can update my database?
What is the best practice for triggering an instruction in order to store the new RSS feed?
Should I use threads to handle this? Is it going to be slow?
Thank you for your help.
OK, here's the deal.
If you want a really fast feed parser, go for Feedzirra. It does not work on Windows. http://github.com/pauldix/feedzirra
Autodiscovery?
- There's truffle-hog if you don't want to do GET redirects. http://github.com/pauldix/truffle-hog
- There's feedbag if you want to do GET redirects to find feeds from given URLs. This is slower, though. http://github.com/damog/feedbag
Feedzirra is the best bet if you want to poll for new entries in your feed. But if you want a non-polling solution to your problem, then I would suggest going through the PubSubHubbub spec. While parsing your feeds, make sure they are PubSubHubbub-enabled: check for the link tag. If it points to pubsubhubbub.appspot.com or any other PubSubHubbub-enabled hub, just subscribe to the feed by sending a subscription request to the hub. You can then define an endpoint in your app which will receive updated-entry pings for your feed subscription from the hub. Just read the raw POST data and store it in your database. Reportedly, 95% of Blogger blogs are PubSubHubbub-enabled; that is a lot of data in your hands already. :)
If you are polling for changes, you should check the Last-Modified or ETag headers rather than parse the entire feed again; it saves you from wasting resources. Feedzirra takes care of this for you.
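A minimal polling sketch with Feedzirra (the feed URL is a placeholder), using the fetch/update API from its README:

    require 'feedzirra'

    # Initial fetch and parse.
    feed = Feedzirra::Feed.fetch_and_parse("http://example.com/feed.xml")
    feed.entries.each { |entry| puts entry.title }

    # Later: re-poll. Feedzirra sends the stored ETag/Last-Modified for you
    # and exposes only the entries it hasn't seen before.
    updated = Feedzirra::Feed.update(feed)
    updated.new_entries.each { |entry| puts "new: #{entry.title}" } if updated.updated?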
I'm not sure what you mean by "auto-detect" a new feed.
Are you looking for code that can discover when someone creates a new feed on a site? Or do you mean discover when an existing feed has a new article?
The first is tough, because your code needs to know what site to look at, so it needs some sort of auto-discovery of sites with new feeds. Searching Google for "new rss feeds" doesn't return anything that looks useful, at least not on the first page. If you, or your users, know of a new site, then you can have an interface for adding new sites to search. Then you grab the page at that URL, look for the RSS/Atom auto-discovery links, and go from there. Auto-discovery links can open a can of worms because of duplicate content being served using different protocols (RDF, RSS and Atom), so you have to determine which to use, or handle multiple feeds with alternate content.
If you mean you want to discover when an existing feed has new articles, then you have to keep track of the last time your code looked at the feed, and the last article that was seen, then retrieve the feed and see if any articles were not in your list of previously seen articles. Your code needs to be sensitive to the time-to-live information in a lot of feeds too. Hitting the feed every fifteen minutes when they update once a week is bad form. Most aggregation code can do those things already but you might need to configure a database and tell the code how to find it.
Generally, for this sort of task I set up a crontab entry on a production Linux or Unix system and fire off the job periodically, looking in the database for feeds whose last-run-time plus the stored time-to-live value is in the past.
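For example, a crontab entry like this (the paths and script name are hypothetical) fires every thirty minutes and lets the script decide, from the stored time-to-live values, which feeds are actually due:

    # m h dom mon dow  command
    */30 * * * * cd /var/www/app && ruby script/poll_feeds.rb >> log/poll.log 2>&1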
Does that help any?
A very easy solution is to use dynamic attribute-based finders.
When you are filling your model with RSS feed data, instead of Model.create(...) use Model.find_or_create_by_column(value, :other_column => other_value).
You can use a date or the RSS item title as the unique value (whatever you want).
I think this is pretty easy. You can set up a cron task to fill your model once per hour, for example; only new items will be added.
There is no way to get an "event" when an RSS feed is updated without downloading the whole feed again.
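A minimal sketch of that hourly cron task using the standard RSS library, where FeedEntry is a hypothetical ActiveRecord model with guid and title columns:

    require 'rss'
    require 'open-uri'

    # Parse the feed and store only the items we haven't seen, keyed on GUID.
    feed = RSS::Parser.parse(open('http://example.com/feed').read, false)
    feed.items.each do |item|
      FeedEntry.find_or_create_by_guid(item.guid.content, :title => item.title)
    end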
