Storing Media RSS and iTunes podcast RSS feeds in the database - ruby-on-rails

I want to be able to store Media RSS and iTunes podcast RSS feeds in the database. The requirement is that I don't want to miss out on ANY element or attribute in the feed. It would make sense to identify the most common elements in a feed and store them in the database as separate columns. The catch is that there can be feed-specific elements that aren't standard. I want to capture those too, but since I don't know in advance what they might be, I can't give them dedicated columns.
Currently I have two tables called feeds and feed_entries. For RSS 2.0 tags like enclosures and categories, I have separate tables associated with feeds/feed_entries. I am using Feedzirra to parse the feeds. Feedzirra requires us to declare which elements we want to parse, so we have no way of knowing whether a feed contains elements beyond what Feedzirra understands.
What would be the best way to store these feeds in the database without missing a single bit of information? (Dumping the whole feed into the database as-is won't work, since we want to query most of the attributes.) Which parser would be the best fit? Feedzirra was chosen for performance, but getting all of the feed's data into the database is the priority.
Update
I'm using MySQL as the database.

I modeled my database on feeds and entries too, and cross-mapped the fields for RSS, RDF and Atom so I could capture the required data fields as a starting point. Then I added a few more for tagging and my own internal summarizations of the feed, plus some housekeeping and maintenance fields.
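For illustration, here is a rough sketch of the kind of feeds/entries schema I mean; the columns shown are examples, not the full cross-mapping:

    class CreateFeedsAndEntries < ActiveRecord::Migration
      def change
        create_table :feeds do |t|
          t.string   :title, :url, :feed_format      # RSS 2.0, RDF or Atom
          t.text     :description
          t.datetime :last_fetched_at                # housekeeping: when we last polled
          t.timestamps
        end

        create_table :entries do |t|
          t.references :feed
          t.string   :title, :url, :guid
          t.text     :summary, :content
          t.datetime :published_at                   # normalized across the three formats' date fields
          t.timestamps
        end
      end
    end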
If you move away from Feedzirra, I'd recommend temporarily storing the raw feed XML in a staging table so you can post-process it with Nokogiri at your leisure. That way your HTTP process isn't bogged down parsing text; it just retrieves content, files it away, and updates the record's processing timestamp so you know when to check again. The post-processor extracts the feed information you want from the stored XML, stores it in the database, and deletes the staged record. You end up with one process pulling in feeds periodically as quickly as it can, and another chugging away in the background (sketched below).
Also, both Typhoeus/Hydra and HTTPClient can handle multiple HTTP requests nicely and are easy to set up.
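Here is a minimal sketch of that staging-table idea; the RawFeed model and process_staged_feeds method are placeholders I've made up, not part of any gem:

    require 'nokogiri'

    # Staging record holding the raw feed XML until the background pass parses it.
    class RawFeed < ActiveRecord::Base
      # columns: url (string), body (text), fetched_at (datetime)
    end

    # One pass of the background processor: parse with Nokogiri, walk every item,
    # capture all child elements (standard or not), then discard the staged XML.
    def process_staged_feeds
      RawFeed.find_each do |raw|
        doc = Nokogiri::XML(raw.body)
        doc.remove_namespaces!                       # simplifies mixed RSS/Atom/iTunes tags
        doc.xpath('//item | //entry').each do |item|
          attrs = item.elements.each_with_object({}) do |el, h|
            h[el.name] = el.text                     # non-standard elements are kept too
          end
          # persist attrs into feed_entries / key-value tables here
        end
        raw.destroy
      end
    end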

Store the XML as a CLOB; most databases have XML processing extensions that let you include XPath-style queries as part of a SELECT statement.
Otherwise, if your DBMS does not support XML querying, use your language's XPath implementation to query the CLOB. You will probably still want to extract certain elements into table columns for speedy querying.
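For example, a rough sketch of querying the stored XML with Nokogiri's XPath when the database can't do it for you; feed_record.raw_xml below is a made-up column holding the CLOB, and the namespace URI is the standard iTunes podcast one:

    require 'nokogiri'

    xml = feed_record.raw_xml                        # the text/CLOB column with the stored feed
    doc = Nokogiri::XML(xml)

    # Pull the iTunes author of every item, binding the itunes namespace prefix.
    authors = doc.xpath(
      '//item/itunes:author',
      'itunes' => 'http://www.itunes.com/dtds/podcast-1.0.dtd'
    ).map(&:text)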

Related

(Swift iOS) Storing Article Content (image/paragraphs) locally

I'm planning out a sort of reference book application. Each topic will have a page with an image and text. I don't want to create new views in Xcode for each page since there are 100+ topics; I'd rather find the easiest way to store the items in a database and then load the content into a single view template when the user selects a topic from a list. After searching around I see that this could be done with Core Data or SQLite, or maybe even JSON, but I haven't found a clear answer.
What's the best way to handle this sort of data?
You should create a database in Core Data and, where you'd like to store images, follow the approach from this tutorial (conversion to Swift is left as an exercise for the reader) and store the fileName as a string.
Don't use JSON to store 100+ items; it will be very slow. SQL is quite fast, even on a mobile device.

Rails: Cache multiple result sets during a session

I have a Rails app that searches a static set of documents, and I need to figure out the best way to cache result sets. I can't just use the native ActiveRecord caching.
My situation:
I'm using the will_paginate gem, and at the moment, the query is running every time the user changes pages. Some queries are too complex for this to be responsive, so I need to cache the results, at least during an individual session. Also, the user may have multiple tabs open, running separate queries simultaneously. The document set is static; the size of the set is on the order of tens of thousands of documents.
Why straight-forward ActiveRecord caching won't work:
The user can search the contents of the documents, or search based on metadata restrictions (like a date range), or both. The metadata for each document is stored in an ActiveRecord, so those criteria are applied with an ActiveRecord query.
But if they add a search term for the document content, I run that search using a separate FastCGI application, because I'm doing some specialized search logic. So I pass the term and the winnowed-down document list to the FastCGI application, which responds with the final result list. Then I do another ActiveRecord query: where("id IN (?)", returnedIds)
By the way, it's these FastCGI searches that are sometimes complex enough to be unresponsive.
My thoughts:
There's the obvious-to-a-newbie approach: use the metadata restrictions plus the search term as a key (they're already stored in a hash), paired with the returnedIds array. This guide at RubyOnRails.org mentions the cache stores that are available, but it's not clear to me which store is best, and I'm also assuming there's a gem better suited to this.
I found the memcached gem, but it's not clear to me whether it would work for caching the results of my FastCGI request.
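For what it's worth, here is a minimal sketch of that key-plus-returnedIds idea using Rails.cache in a Rails app (which can be backed by memcached via a gem like dalli); Document and fastcgi_search are stand-ins for your model and your FastCGI call:

    # Build a stable key from the metadata restrictions plus the search term,
    # then cache the array of matching ids so page changes don't re-run the search.
    def cached_result_ids(criteria, term)
      key = "search/#{Digest::SHA1.hexdigest([criteria.sort, term].to_json)}"
      Rails.cache.fetch(key, :expires_in => 30.minutes) do
        scoped_ids = Document.where(criteria).pluck(:id)   # metadata restrictions
        fastcgi_search(term, scoped_ids)                   # stand-in for the FastCGI request
      end
    end

    # Paginate against the cached ids without touching the FastCGI app again.
    criteria = { :doc_type => params[:doc_type] }          # whatever metadata restrictions apply
    ids      = cached_result_ids(criteria, params[:q])
    @results = Document.where(:id => ids).paginate(:page => params[:page])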

How should I store scraped HTML in my webapp?

I'm a newbie to web development (and development in general) and I'm building a Rails app that scrapes data from a third-party website. I'm using Nokogiri to parse out the specific HTML elements I'm interested in, and those elements are stored in a database.
However, I'd also like to save the HTML of the whole page I'm scraping as a back-up, in case I change my mind about what kind of information I want, or in case the website takes the page down (or updates it).
What's the best practice for storing the archived html?
Should I extract it as a string and put it in a database, write it to a log or text file, or what?
Edit:
I should have clarified a bit. I am crawling on the order of 10K websites a week and anticipate only needing to access the back-ups on a once-off basis, if I redefine the type of data I want.
So, as an example: if I were crawling UN data on country populations and had originally been looking at age distributions, but later realized I wanted gender distributions as well, I'd want to go back through all my HTML archives and pull that data out. I don't anticipate this happening much (maybe 1-3 times a month), but when it does I'll want to retrieve it across 10K-100K listings. The task should only take a few hours for around 10K records, so I figure each page fetch should take at most a second. I don't need any versioning capability. Hope this clarifies things.
I'm not sure what the "best practice" for this case is (it will vary by the specifics of your project), but as a starting point I'd suggest creating a model with a string field for the URL and a text field for the HTML itself, and save the pages there. You might add a uniqueness validator for the URL, to make sure you don't store the same HTML twice.
You could then optionally add model methods to build a Nokogiri document from the HTML text, using the HTML string as the "master" record (in the DB) and generating the Nokogiri document on the fly when needed. But again, as @dave-newton points out, a lot of this depends on what you're going to do with the HTML.
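As a rough sketch of that model (the names are placeholders, and the CSS selector in the example is made up):

    require 'nokogiri'

    class ScrapedPage < ActiveRecord::Base
      # columns: url (string), html (text; consider MEDIUMTEXT/LONGTEXT for large pages)
      validates :url, :presence => true, :uniqueness => true

      # Build a Nokogiri document on demand; the raw html column stays the master copy.
      def doc
        @doc ||= Nokogiri::HTML(html)
      end
    end

    # Later, when you redefine what you want to extract:
    ScrapedPage.find_each do |page|
      cells = page.doc.css('td.gender')              # hypothetical selector for the new data
      # re-extract and store whatever you now need from cells
    end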
I would strongly suggest saving it into a table in the same DB as the data you are scraping. Why change what works? Keep it all as you normally would, or write it to a separate database entirely, and keep some form of reference linking the scraped data to its backup, just in case.

RoR: Use Feedzirra to pull different feeds and display as one

I can successfully pull different feeds using the Feedzirra gem and get feed updates. However, each feed I'd like to pull has different content (e.g. the GitHub public feed, Last.fm recently played tracks, etc.).
What is the best way to go about combining all of these feeds into one? Right now I have different models for the different types of feeds, and some feeds use different timestamps than others.
You could add multiple extra fields to hold each of the unique attributes in an uber-feed object, filling in only the ones that come from each particular feed at processing time. (It's a bit like the NoSQL model in that way, though not quite, since you have to define the fields ahead of time; still, you can add any arbitrary field as a data holder.)
This is how you add a new field to all instances of a feed...
Feedzirra::Feed.add_common_feed_entry(:my_custom_field)
You'll find a little more dialog about this here...
https://groups.google.com/forum/?fromgroups#!msg/feedzirra/_h4y8_vwDGc/N8sjym6NouEJ
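And here's a hedged sketch of flattening several Feedzirra feeds into one list ordered by timestamp; published is Feedzirra's normalized entry date, and the URLs are placeholders for your real feeds:

    require 'feedzirra'

    urls = ['https://example.com/github-public.atom',
            'https://example.com/lastfm-recent.rss']      # placeholders for your real feed URLs

    feeds = Feedzirra::Feed.fetch_and_parse(urls)          # returns a hash of url => feed

    combined = feeds.values.flat_map(&:entries)
    combined = combined.sort_by { |entry| entry.published || Time.at(0) }.reverse   # newest first

    combined.each do |entry|
      puts "#{entry.published} - #{entry.title} (#{entry.url})"
    end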
It sounds like you are creating an activity feed -- here are several gems you can research for building activity feeds: https://www.ruby-toolbox.com/categories/Rails_Activity_Feeds

Cache and Paginate RSS Feed in Rails

I'm working on a life-streaming type app in Rails and am constantly parsing several different RSS feeds using Feedzirra. To support scaling the app, I have to implement some kind of caching while still allowing the cached results to be paginated in the view. The cache can expire as infrequently as once a day.
Being a novice at caching in Rails, what types of caching would be recommended for this? Also, I'm doing most of the feed-parsing operations in modules in my lib/ directory; I'm not sure whether that affects caching or makes it less than ideal. Ideally I'd like to cache the array of results the RSS feeds return and do some post-processing on them before sending them to the view.
Any help would be much appreciated!
I suggest you use a gem to run a scheduled task via cron, collect all the desired results from your RSS feeds, and save them to an XML file or even a table.
On subsequent requests, load the results from that XML file or table and generate static cached pages (HTML files).
Every time the scheduled task runs, erase the previously saved files so that stale results aren't displayed.
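A rough sketch of that approach, swapping the static HTML files for Rails.cache and using the whenever gem for the cron entry; FEED_URLS, the cache key, and the task name are placeholders:

    # config/schedule.rb (whenever gem)
    every 1.day do
      rake "feeds:refresh"
    end

    # lib/tasks/feeds.rake
    namespace :feeds do
      task :refresh => :environment do
        entries = FEED_URLS.flat_map do |url|
          Feedzirra::Feed.fetch_and_parse(url).entries
        end
        # do your post-processing here, then overwrite the previously cached results
        Rails.cache.write('combined_feed_entries', entries, :expires_in => 1.day)
      end
    end

    # in a controller, page through the cached array (with will_paginate, require 'will_paginate/array')
    entries  = Rails.cache.read('combined_feed_entries') || []
    @entries = entries.paginate(:page => params[:page])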
