How should I store scraped HTML in my webapp? - ruby-on-rails

I'm a newbie to web development (and development in general) and I'm building out a Rails app which scrapes data from a third-party website. I'm using Nokogiri to parse out the specific HTML elements I'm interested in, and those elements are stored in a database.
However, I'd like to save the HTML of the whole page I'm scraping as a back-up, in case I change my mind about what type of information I want, or in case the website removes the page (or updates it).
What's the best practice for storing the archived html?
Should I extract it as a string and put it in a database, write it to a log or text file, or what?
Edit:
I should have clarified a bit. I am crawling on the order of 10K websites a week and anticipate only needing to access the back-ups on a once-off basis if I redefine the type of data I want.
So as an example: if I were crawling UN data on country populations and originally looked at age distributions, but later realized I wanted the gender distributions as well, I'd want to go back to all my HTML archives and pull that data out. I don't anticipate this happening much (maybe 1-3 times a month), but when it does I'll want to retrieve it across 10K-100K listings. The task should only take a few hours for around 10K records, so I guess each fetch should take at most a second. I don't need any versioning capability. Hope this clarifies.

I'm not sure what the "best practice" for this case is (it will vary with the specifics of your project), but as a starting point I'd suggest creating a model with a string field for the URL and a text field for the HTML itself, and saving the pages there. You might add a uniqueness validator on the URL to make sure you don't store the same HTML twice.
You could then optionally add model methods to instantiate a Nokogiri document from the HTML text, thus using the HTML string as the "master" record (in the DB) and generating the Nokogiri document on the fly when needed. But again, as Dave Newton points out, a lot of this will depend on what you're going to do with this HTML.
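As a minimal sketch of that starting point (the ArchivedPage name and columns are just illustrative, not prescribed by this answer):

# Generated with: rails g model ArchivedPage url:string html:text
class ArchivedPage < ActiveRecord::Base
  validates :url, presence: true, uniqueness: true

  # Keep the raw HTML in the DB as the "master" record and build the
  # Nokogiri document on the fly when needed.
  def doc
    @doc ||= Nokogiri::HTML(html)
  end
end

# Usage sketch:
#   page = ArchivedPage.create!(url: some_url, html: fetched_html)
#   page.doc.css("table td").each { |td| puts td.text }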

I would strongly suggest saving it into a table in the same DB as the data you are scraping. Why change what works? Keep it all as you normally would, or write it all to a separate database entirely and keep some form of reference linking the scraped data to the backups, just in case.

Related

ASP.NET MVC get list of cities by country without Database.

I need to do something like:
Show list of countries >> Select country >> Show list of cities.
So I have to get a list of cities by country, but without using data in a database.
Can anyone please suggest a solution? I really appreciate your help.
You could use an API, but the problem is that you will make a request every time your page loads. Not every API provider will allow this. For example, here is an API that gets countries/cities.
Another solution, as you are using .NET technologies, is to use a LocalDB. A LocalDB is in fact a database, but within your app. Have a look at the definition on MSDN:
It is very easy to install and requires no management, yet it offers the same T-SQL language, programming surface and client-side providers as the regular SQL Server Express. If the simplicity (and limitations) of LocalDB fit the needs of the target application environment, developers can continue using it in production, as LocalDB makes a pretty good embedded database too.
Finally, the last solution that comes to mind, if you can't use XML or JSON files nor a LocalDB, is to have your lists in classes. In my opinion you should avoid this solution: it will simply keep everything in RAM until your application stops, and since disk costs less than RAM, I really think the better option is to use XML or JSON files in your app.
You can store the info into a text file or even into a static class in your code (not exactly a great idea, but doable).
Then you just need to get the info from the container and build two SelectList items, one for countries and one for cities.
Use JavaScript to link the change event of the countries SelectList to a filtered reload of the cities SelectList.
Assuming you have a preset list of cities by country, and you really cannot use any sort of database, then perhaps just use text files? One text file for the list of countries and then one file per country with the list of cities. Read in the text file and display as needed.

Store ruby Mail (from gem) object in ActiveRecord

I'm currently implementing a very basic IMAP client in an application I'm building in Rails. I'm using the Mail gem, which supplies lots of useful ways of parsing the IMAP data.
I'd like to store the Mail object that it's generating in the database. Is that possible?
i.e.
email = Email.new
email.uid = id
email.mail = Mail.new(imap.fetch(id, "RFC822")[0].attr["RFC822"])
email.save
It's a convenience thing where I don't want to have to download the object again unless I have to since performance on the IMAP call is slow, but I'd like to be able to have it there to look back on (and do any breaking down I needed to later).
I could then call
Email.find(x).mail.body
and various other useful things without having to build out that functionality in my own email model.
Q1: How would I set up the active record model?
Q1a: Would I be better off doing something that excluded the attachments to make it an easier object to store? (is that even possible?)
Appreciate your help,
Several database schemata have been developed to store mail. I've worked on one, and there are others. Believe me, it's hard work. The result can be very useful, but since your question doesn't focus on the result I suspect it's not worthwhile in your case.
You might find it easier to use a JSON library to write your object graph to a file, with an automatically inferred structure, as most JSON libraries seem to support these days. That won't let you do as much, but it's very much easier, and it lets you store both completely and incompletely retrieved messages. If you haven't fetched a particular body part, the JSON library will just write a null for that field.
It depends on what you want to do with the stored mails. If you only need specific parts of the mail to be easily accessible through the database, you won't need a complex setup like Archiveopteryx, which basically maps a complete representation of emails to relational database tables. In most cases you won't need that much detail, and a simple data model will be perfectly adequate.
A1: rails g model Email from to subject date:datetime message_id body. These are just the basic parts; it should get you started.
A1a: You don't need to store the attachments if you don't want to. If you need them, you'll probably be better off not storing them in the database itself. Attachments are just like uploads so there are plenty of gems that can help you do that (https://www.ruby-toolbox.com/categories/rails_file_uploads).
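As a rough sketch of how a Mail object could be mapped onto those columns (the build_from_raw helper here is hypothetical, not part of the Mail gem):

class Email < ActiveRecord::Base
  # Assumes the columns from A1 (from, to, subject, date, message_id, body)
  # plus the uid column from the question.
  def self.build_from_raw(uid, raw_rfc822)
    mail = Mail.new(raw_rfc822)
    new(uid:        uid,
        from:       Array(mail.from).join(", "),
        to:         Array(mail.to).join(", "),
        subject:    mail.subject,
        date:       mail.date,
        message_id: mail.message_id,
        body:       mail.multipart? ? mail.text_part&.decoded : mail.body.decoded)
  end
end

# email = Email.build_from_raw(id, imap.fetch(id, "RFC822")[0].attr["RFC822"])
# email.save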
Using Postgres jsonb columns, you can store the email as JSON. In my case I disregard the attachments (I store references to them and retrieve them as and when required).
This works pretty well with the Mail gem.
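A small sketch of that approach, assuming Rails on Postgres with a jsonb payload column on the emails table (the column and field names are illustrative):

# Migration sketch: add_column :emails, :payload, :jsonb, default: {}
mail = Mail.new(raw_rfc822)   # raw message fetched over IMAP beforehand

Email.create!(
  uid: uid,
  payload: {
    from:       mail.from,
    to:         mail.to,
    subject:    mail.subject,
    date:       mail.date,
    message_id: mail.message_id,
    # attachments deliberately left out; store references to them separately
    body:       mail.multipart? ? mail.text_part&.decoded : mail.body.decoded
  }
)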

Rails - Store unique data for each open tab/window

I have an application that has different data sets depending on which company the user has currently selected (dropdown box on sidebar currently used to set a session variable).
My client has expressed a desire to have the ability to work on multiple different data sets from a single browser simultaneously. Hence, sessions no longer cut it.
Googling seems to imply that passing GET or POST data along with every request is the way to go, which was my first guess. Is there a better/easier/Rails way to achieve this?
You have a few options here, but as you point out, the session system won't work for you since it is global across all instances of the same browser.
The standard approach is to add something to the URL that identifies the context in which to execute. This could be as simple as a prefix like /companyx/users instead of /users where you're fetching the company slug and using that as a scope. Generally you do this by having a controller base class that does this work for you, then inherit from that for all other controllers that will be affected the same way.
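A rough sketch of that path-prefix approach (the Company model, slug column and controller names are assumptions for illustration):

# config/routes.rb
scope ":company_slug" do
  resources :users
end

# A base controller that resolves the company for every scoped request.
class ScopedController < ApplicationController
  before_action :set_current_company   # before_filter in older Rails

  private

  def set_current_company
    @current_company = Company.find_by!(slug: params[:company_slug])
  end
end

# Controllers that need the scope inherit from it.
class UsersController < ScopedController
  def index
    @users = @current_company.users
  end
end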
Another approach is to move the company identifying component from the URL to the host name. This is common amongst software-as-a-service providers because it makes sharding your application much easier. Instead of myapp.com/companyx/users you'd have companyx.myapp.com/users. This has the advantage of preserving the existing URL structure, and when you have large amounts of data, you can partition your app by customer into different databases without a lot of headache.
The answer you found, tagging all the URLs with a GET token or a POST field, is not going to work very well. For one, it's messy, and secondly, a site where every link is a POST is very annoying to work with, as it makes navigating with the back button or forcing a reload troublesome. The reason it has seen use is that out of the box PHP and ASP do not have support for routes, so people have had to make do.
You can create a temporary database table, or use a key-value store, and keep all the data you need in it. The unique key can be used as a window id. Furthermore, you have to add this window id to each link, so you can retrieve the corresponding data for each browser tab from the database and load it into the session, an object, etc.
If you have an object, let's say @data, you can store it in the database using Marshal.dump and get it back with Marshal.load.
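A tiny sketch of that idea (WindowState here is an assumed model with a string window_id column and a binary payload column):

require "securerandom"

# When a new tab is opened, hand it a fresh id and persist its state.
window_id = SecureRandom.hex(8)
@data = { company_id: 42, filters: { active: true } }
WindowState.create!(window_id: window_id, payload: Marshal.dump(@data))

# On later requests that carry ?window_id=..., load the state back.
state = WindowState.find_by!(window_id: params[:window_id])
@data = Marshal.load(state.payload)
# Note: only Marshal.load data your own app wrote; never untrusted input.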

Storing Media RSS and iTunes podcast RSS feeds in the database

I want to be able to store Media RSS and iTunes podcast RSS feeds in the database. The requirement here is that I don't want to miss out on ANY element or its attributes in the feed. It would make sense to find all the most common elements in the feed and store them in the database as separate columns. The catch is that there can be feed-specific elements that are not standard. I want to capture them too. Since I don't know what they might be, I won't have a dedicated column for them.
Currently I have 2 tables called feeds and feed_entries. For RSS 2.0 tags like enclosures and categories, I have separate tables that have associations with feeds/feed_entries. I am using Feedzirra for parsing the feeds. Feedzirra requires us to know which elements in the feed we want to parse, and hence we would not know if a feed contains elements beyond what Feedzirra can understand.
What would be the best way to store these feeds in the database without missing a single bit of information? (Dumping the whole feed into the database as-is won't work, as we want to query most of the attributes.) What parser would be the best fit? Feedzirra was chosen for performance; however, getting all the data in the feed into the database is a priority.
Update
I'm using MySQL as the database.
I modeled my database on feeds and entries also, and cross-mapped the fields for RSS, RDF and Atom, so I could capture the required data fields as a starting point. Then I added a few others for tagging and my own internal-summarizations of the feed, plus some housekeeping and maintenance fields.
If you move from Feedzirra I'd recommend temporarily storing the actual feed XML in a staging table so you can post-process it with Nokogiri at your leisure. That way your HTTP process isn't bogged down parsing the text; it's just retrieving content, filing it away, and updating the records with the processing time so you know when to check again. The post-process can extract the feed information you want from the stored XML, store it in the database, then delete the staging record. That means there's one process pulling in feeds periodically as quickly as it can, and another that basically runs in the background chugging away.
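A sketch of that split, with RawFeed standing in for the assumed staging table (url, xml, fetched_at columns) and Entry for the real feed_entries model:

require "nokogiri"

# Fetcher: just file the XML away as quickly as possible.
def store_feed(url, xml)
  RawFeed.create!(url: url, xml: xml, fetched_at: Time.now)
end

# Background post-processor: parse at leisure, then drop the staging row.
def process_raw_feeds
  RawFeed.find_each do |raw|
    doc = Nokogiri::XML(raw.xml)
    doc.xpath("//item").each do |item|
      Entry.create!(
        feed_url: raw.url,
        title:    item.at_xpath("title")&.text,
        guid:     item.at_xpath("guid")&.text
      )
    end
    raw.destroy
  end
end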
Also, both Typhoeus/Hydra and HTTPClient can handle multiple HTTP requests nicely and are easy to set up.
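For the fetching side, a minimal Typhoeus/Hydra sketch (feed_urls and the store_feed helper above are assumptions):

require "typhoeus"

hydra = Typhoeus::Hydra.new(max_concurrency: 20)

feed_urls.each do |url|
  request = Typhoeus::Request.new(url, followlocation: true)
  request.on_complete do |response|
    store_feed(url, response.body) if response.success?
  end
  hydra.queue(request)
end

hydra.run   # runs all queued requests concurrently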
Store the XML as a CLOB; most databases have XML processing extensions that allow you to include XPath-type queries as part of a SELECT statement.
Otherwise, if your DBMS does not support XML querying, use your language's XPath implementation to query the CLOB. You will probably need to extract certain elements into table columns for speedy querying.
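For example, querying stored feed XML with Nokogiri's XPath support (the Feed model and xml column are assumptions; note the iTunes namespace has to be passed explicitly):

require "nokogiri"

xml = Feed.find(feed_id).xml                 # the stored CLOB/text column
doc = Nokogiri::XML(xml)
ns  = { "itunes" => "http://www.itunes.com/dtds/podcast-1.0.dtd" }

doc.xpath("//item").each do |item|
  title    = item.at_xpath("title")&.text
  duration = item.at_xpath("itunes:duration", ns)&.text
  puts "#{title} (#{duration})"
end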

Why would Google Search use client-side URL parameters?

Yesterday morning I noticed Google Search was using hash parameters:
http://www.google.com/#q=Client-side+URL+parameters
which seems to be the same as the more usual search (with search?q=Client-side+URL+parameters). (It seems they are no longer using it by default when doing a search using their form.)
Why would they do that?
More generally, I see hash parameters cropping up on a lot of web sites. Is it a good thing? Is it a hack? Is it a departure from REST principles? I'm wondering if I should use this technique in web applications, and when.
There's a discussion by the W3C of different use cases, but I don't see which one would apply to the example above. They also seem undecided about recommendations.
Google has many live experimental features that are turned on/off based on your preferences, location and other factors (probably random selection as well). I'm pretty sure the one you mention is one of those.
What happens in the background when a hash is used instead of a query string parameter is that it queries the "real" URL (http://www.google.com/search?q=hello) using JavaScript, then modifies the existing page with the content. This appears much more responsive to the user, since the page does not have to reload entirely. The reason for the hash is so that browser history and state are maintained. If you go to http://www.google.com/#q=hello you'll find that you actually get the search results for "hello" (even though your browser is really only requesting http://www.google.com/). With JavaScript turned off, however, it wouldn't work, and you'd just get the Google front page.
Hashes are appearing more and more as dynamic web sites are becoming the norm. Hashes are maintained entirely on the client and therefore do not incur a server request when changed. This makes them excellent candidates for maintaining unique addresses to different states of the web application, while still being on the exact same page.
I have been using them myself more and more lately, and you can find one example here: http://blixt.org/js -- If you have a look at the "Hash" library on that page, you'll see my implementation of supporting hashes across browsers.
Here's a little guide for using hashes for storing state:
How?
Maintaining state in hashes implies that your application (I'll call it application since you generally only use hashes for state in more advanced web solutions) relies on JavaScript. Without JavaScript, the only function of hashes would be to tell the browser to find content somewhere on the page.
Once you have implemented some JavaScript to detect changes to the hash, the next step would be to parse the hash into meaningful data (just as you would with query string parameters.)
Why?
Once you've got the state in the hash, it can be modified by your code (or your user) to represent the current state in your application. There are many reasons for why you would want to do this.
One common case is when only a small part of a page changes based on a variable, and it would be inefficient to reload the entire page to reflect that change (Example: You've got a box with tabs. The active tab can be identified in the hash.)
Other cases are when you load content dynamically in JavaScript, and you want to tell the client what content to load (Example: http://beta.multifarce.com/#?state=7001, will take you to a specific point in the text adventure.)
When?
If you have a look at my "JavaScript realm" you'll see a borderline-overkill case. I did it simply because I wanted to cram as much JavaScript dynamics into that page as possible. In a normal project I would be conservative about when to do this, and only do it when you will see positive changes in one or more of the following areas:
User interactivity
Usually the user won't see much difference, but the URLs can be confusing
Remember loading indicators! Loading content dynamically can be frustrating to the user if it takes time.
Responsiveness (time from one state to another)
Performance (bandwidth, server CPU)
No JavaScript?
Here comes a big deterrent. While you can safely rely on 99% of your users to have a browser capable of using your page with hashes for state, there are still many cases where you simply can't rely on this. Search engine crawlers, for example. While Google is constantly working to make their crawler work with the latest web technologies (did you know that they index Flash applications?), it still isn't a person and can't make sense of some things.
Basically, you're at a crossroads between compatibility and user experience.
But you can always build a road in between, which of course requires more work. In less metaphorical terms: implement both solutions, so that there is a server-side URL for every client-side URL that outputs relevant content. For compatible clients it would redirect them to the hash URL. This way, Google can index the "hard" URLs, and when users click them, they get the dynamic state stuff!
Recently Google also stopped serving direct links in search results, offering redirects instead.
I believe both have to do with gathering usage statistics: what searches were performed by the same user, in what sequence, which of the search results the user followed, etc.
P.S. Now, that's interesting: direct links are back. I absolutely remember seeing only redirects there in the last couple of weeks. They are definitely experimenting with something.
