I'm wondering if there's an easy way to parse an XML document in rails without loading it all into ram.
I've been using (depending on the XML) a combination of Nokogiri and the standard Hash.from_xml to get the contents of the XML.
That is all well and good when I'm dealing with (attempting to import) 100 or even 1000 products. However, when the XML doc has 16,000 or 40,000 products in it, my dyno really starts to feel it.
So I'm wondering if there's a way to walk the XML without pulling it all into memory.
Sorry I don't have code.... I'm attempting to avoid writing anything new. I mean who wants to write their own XML parser eh?

I came to this...
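reader = Nokogiri::XML::Reader(File.open('test.xml'))
reader.each do |node|
  # Handle only opening <Product> tags; the reader also yields closing nodes
  if node.name == 'Product' && node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    hash = Hash.from_xml(node.outer_xml).values.first
  end
end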
I watched my memory load while I ran this across a 60 meg file. It accomplished my goal. I'd love to see other answers. Perhaps something even lighter.

Because XML is hierarchical, the parser needs to know the whole structure to parse it correctly. You could feed well-formed fragments to Nokogiri::XML::Document.parse, but you'd need to get those fragments out some other way.
Let's say you have a huge xml document:
... and so on
The actual products are enveloped within <products>. Strip out the envelope part, then use string splitting to get an array of each <product> and its contents, and parse each of these as an XML fragment. Just a thought.
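The fragment idea above can be sketched as follows. This is an illustration in Python rather than Ruby, and it reads the file in chunks so only one fragment is held at a time. The names iter_fragments and iter_products are hypothetical, and the sketch assumes <product> elements are never nested, never self-closing, and never mentioned inside CDATA or comments.

```python
import re
import xml.etree.ElementTree as ET


def iter_fragments(stream, tag="product", chunk_size=64 * 1024):
    """Yield each <tag>...</tag> fragment as a string, reading in chunks."""
    # "<product" must be followed by whitespace or ">" so the <products>
    # envelope is not mistaken for a product.
    open_re = re.compile(r"<%s[\s>]" % re.escape(tag))
    close_tag = "</%s>" % tag
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        while True:
            end = buffer.find(close_tag)
            if end == -1:
                break  # no complete fragment buffered yet
            end += len(close_tag)
            start = open_re.search(buffer, 0, end)
            if start:
                yield buffer[start.start():end]
            buffer = buffer[end:]  # drop what has been consumed
        if not chunk:
            break  # end of input


def iter_products(path):
    """Parse each fragment on its own; the full document is never built."""
    with open(path, encoding="utf-8") as stream:
        for fragment in iter_fragments(stream):
            yield ET.fromstring(fragment)
```

String splitting like this is fragile compared with a real streaming parser, so it is probably best treated as a fallback.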
This might help, although I've never used it:


How to get HTML tag text by id using lua

There is a webpage parser that takes a page containing several tags in a certain structure, where the divs are badly nested. I need to extract a certain div element and copy it, with all its content, to a new html file.
Since I am new to lua, I may need basic clarification for things that might seem simple.
The ease of extraction of data is going to largely depend on the page itself. If the page uses the exact same tag information throughout its entirety, it'll be much more difficult to extract than it would if it has named tags.
If you're able to find a version of the page that returns json format, then you're that much better off. Here's a snippet of code on something I wrote to grab definitions from a webpage that did not have json format:
local actualword, definition = string.match(wayup,"<html.-<td class='word'>%c(.-)%c</td>.-<div class=\"definition\">(.-)</div>")
Essentially, this code searched down the page until it found the class "word", and took the word after it (%c is the pattern for control characters). It continued on to "definition" and captured that, as well.
As you can see, it's a bit convoluted, but I had the luck of having specifically named tags for what I wanted.
This is edited to fit your comment. As a side note that I should have mentioned before, if you're familiar with regular expressions, you can use its model to capture what you need. In this case, it's capturing the string in its totality:
local data = string.match(page, "(<div id=\"aa\"><div>.-</div>.-</div>)")
It's rarely the fault of the language, but rather the webpage itself, that makes it hard to data mine anything. Since webpages can have hundreds of lines of code, it's hard to pinpoint exactly what you want without coming across garbage information. That's why I prefer a simplified result such as json: JSON modules that can encode/decode are available for Lua, and you can get your precise information.
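For comparison, the same non-greedy capture can be sketched in Python, where Lua's `.-` corresponds to `.*?` with the DOTALL flag. The helper names extract_div and write_fragment are hypothetical, and the sketch shares the limitation of the Lua pattern: it only works when the target div contains exactly the structure the pattern spells out.

```python
import re

# Same shape as the Lua pattern: the div with id "aa", its first inner div,
# and everything up to the next closing </div>.
DIV_PATTERN = re.compile(r'(<div id="aa"><div>.*?</div>.*?</div>)', re.DOTALL)


def extract_div(page):
    """Return the matched div and its content, or None if nothing matches."""
    match = DIV_PATTERN.search(page)
    return match.group(1) if match else None


def write_fragment(path, fragment):
    """Copy the extracted fragment into a new HTML file."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("<html><body>\n%s\n</body></html>\n" % fragment)
```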

Time Cost Comparison: Haml render_to_string vs. string addition

I'm working on something that will render a small bit of xml (10 lines) every second.
I like the ease of constructing Xml with Haml, but I was wondering if anyone knew any details about the server cost of using render_to_string with haml versus building a string with String addition.
Haml is meant to be used when
you want to generate pretty XML,
the structure of the XML is hand-edited.
If you are generating small XML documents for machine consumption use faster libraries like Nokogiri or Builder.
Please do not use string interpolation. Most of the time you will end up creating malformed documents, because the input data will be slightly different from the data you used to test your app. This is true whenever you handle user-generated data. String interpolation is also a nice way to introduce security bugs. Just don't do it.
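To illustrate the malformed-document risk, here is a minimal sketch in Python, not Ruby. ElementTree stands in for a builder library such as Builder or Nokogiri, and the function names are made up for the example. The interpolated version breaks as soon as the data contains `&` or `<`, while the builder escapes them.

```python
import xml.etree.ElementTree as ET


def product_xml_interpolated(name, price):
    # Naive interpolation: special characters are copied into the markup as-is.
    return "<product><name>%s</name><price>%s</price></product>" % (name, price)


def product_xml_built(name, price):
    # A builder escapes text content when it serializes the tree.
    product = ET.Element("product")
    ET.SubElement(product, "name").text = name
    ET.SubElement(product, "price").text = str(price)
    return ET.tostring(product, encoding="unicode")


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

With a name like "Fish & Chips <large>", the interpolated output fails to parse, while the built output parses and returns the original name unchanged.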

Create csv from html pages

There is a website that displays a lot of data in html tables. They have paged the data so there are around 500 pages.
What is the most convenient (easy) way of getting the data in those tables and downloading it as a CSV, on Windows?
Basically I need to write a script that does something like this, but it is overkill to write it in C#, and I am looking for other solutions that people with web experience use:
for(i=1 to 500)
load page from http://x/page_i.html;
parse the source and get the data in table with id='data'
save results in csv
I was doing a screen-scraping application once and found BeautifulSoup to be very useful. You could easily plop that into a Python script and parse across all the tags with the specific id you're looking for.
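A rough sketch of such a Python script follows. To stay self-contained it uses the standard library's html.parser instead of BeautifulSoup, which would likely make the parsing part shorter. The class and function names are hypothetical, the URL pattern is the placeholder from the question, and the sketch assumes the table with id 'data' contains no nested tables and the pages are UTF-8.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect the rows of the table whose id attribute equals table_id."""

    def __init__(self, table_id="data"):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.rows = []
        self.row = None   # cells of the row being read, or None
        self.cell = None  # text pieces of the cell being read, or None

    def _end_cell(self):
        if self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None

    def _end_row(self):
        self._end_cell()
        if self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if dict(attrs).get("id") == self.table_id:
                self.in_table = True
        elif not self.in_table:
            return
        elif tag == "tr":
            self._end_row()  # tolerate a missing </tr>
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self._end_cell()  # tolerate a missing </td>
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()
        elif tag == "table":
            self._end_row()
            self.in_table = False


def table_rows(html, table_id="data"):
    parser = TableExtractor(table_id)
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape(out_path, pages=500, url_pattern="http://x/page_%d.html"):
    """Download every page and append its table rows to one CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for i in range(1, pages + 1):
            with urlopen(url_pattern % i) as response:
                html = response.read().decode("utf-8", errors="replace")
            writer.writerows(table_rows(html))
```

Calling scrape("data.csv") would walk pages 1 to 500. If every page repeats the header row, it would need to be skipped after the first page.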
The easiest non-C# way I can think of is to use Wget to download the page, then run HTMLTidy to convert it to XML/XHTML and then transform the resulting XML to CSV with an XSLT (run with MSXSL.exe)
You will have to write some simple batch files and an XSLT with a basic XPath selector.
If you feel it would be easier to just do it in C#, you can use SgmlReader to read the HTML DOM and do an XPath query to extract the data. It should not take more than about 20 lines of code.

how to obtain URLs from Dmoz ODP

I want to use a database of the URLs present in DMOZ ODP for my application (an array of URL strings, or a file containing the same). Is there any way of obtaining it, other than manual copy-paste?
Is there any script or code to parse the RDF file?
Take a look at, you'll need to find a way to parse the RDF into your database.
I did this the other day using the odp2db scripts from Steve's Software. They're old, but the format hasn't changed significantly so they work fine.
I found I didn't need to do the iconv and steps suggested in the readme, just uncompressed the dumps and ran the and scripts. You'll need to create the database tables manually (see the SQL at top of script for that) and modify the connection details in the scripts before you start.
With the mid-January 2009 dump I used, there are 756,962 categories and 4,436,796 websites. It took a while to run through them all, but not excessively long, though I did dispense with the site descriptions as I didn't need them. It may also be worth adding database indices after creating the tables to speed up access later. The raw structure and content files were 75MB and 300MB compressed, and 848MB and 2GB uncompressed, respectively.
I've actually done this in java. I just used the SAX API to read through the RDF files. It was pretty straight forward. In my case I wanted to pull out every URL that was in a topic with "Weblogs" in the topic name.
Basically what I did was implement an org.xml.sax.helpers.DefaultHandler
Then to setup the code you do:
InputSource is = new InputSource(new FileInputStream("filename.rdf"));
XMLReader r = XMLReaderFactory.createXMLReader();
r.setContentHandler(new MyHandlerClass());
// Start the parse; the reader calls back into the handler as it goes
r.parse(is);
and that's pretty much it. In my handler class I had to implement:
startElement(String uri, String localName, String qName, Attributes attributes) then I had an if statement to see if it was an "ExternalPage" tag, in which case I went to another state to look for "topic","Title" and "Description". I had another
characters(char[] ch, int start, int length) where I read in the topic, title, and description text depending on which one had been most recently sent to startElement
endElement(String uri, String localName, String qName) where I checked to see which element was ending, and if it was ExternalPage, that meant the end of the current element.
The whole thing was 80-90 lines of code for the basic parsing. So pretty easy to write. It was able to chew through the multi-gigabyte files in... I don't remember, maybe a minute or two? If you just want to query out some specific data, it might be easier just to write the code to do that in your handler, rather than trying to load it into a DB.
If you find a tool that works well, that's obviously better than writing your own code. But writing your own code isn't hard! RDF is just an XML format, and it's not nested or anything. A simple SAX parser is easily doable in a day or so.
You could always pay one of the corrupt editors there and they will help you out :)
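The state machine described above can be sketched with Python's xml.sax, whose ContentHandler has the same three callbacks as the Java handler. This is not the original 80-90 line handler. The class name is hypothetical, and the sketch assumes each page URL sits in an `about` attribute on ExternalPage and that element names may carry a prefix such as `d:Title`.

```python
import xml.sax


class ExternalPageHandler(xml.sax.ContentHandler):
    """Collect URLs of ExternalPage entries whose topic contains a keyword."""

    FIELDS = ("topic", "Title", "Description")

    def __init__(self, keyword="Weblogs"):
        super().__init__()
        self.keyword = keyword
        self.urls = []
        self.page = None   # dict for the ExternalPage being read, or None
        self.field = None  # name of the field whose text is being read

    @staticmethod
    def _local(name):
        # Drop a namespace prefix such as "d:" (assumed dump layout).
        return name.rpartition(":")[2]

    def startElement(self, name, attrs):
        local = self._local(name)
        if local == "ExternalPage":
            self.page = {"url": attrs.get("about", "")}
        elif self.page is not None and local in self.FIELDS:
            self.field = local
            self.page[local] = ""

    def characters(self, content):
        # Text may arrive in several pieces, so append rather than assign.
        if self.page is not None and self.field is not None:
            self.page[self.field] += content

    def endElement(self, name):
        local = self._local(name)
        if local == self.field:
            self.field = None
        elif local == "ExternalPage" and self.page is not None:
            if self.keyword in self.page.get("topic", ""):
                self.urls.append(self.page["url"])
            self.page = None


def weblog_urls(path, keyword="Weblogs"):
    """Stream through an RDF dump file and return the matching URLs."""
    handler = ExternalPageHandler(keyword)
    xml.sax.parse(path, handler)
    return handler.urls
```

Because the handler keeps only the current page in memory, the size of the dump should not matter much.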

How to parse a .xfa file

Hoping that someone has some info on how to parse an xfa file. I can parse csv or xml files just fine, but an xfa one has come along and I'm not familiar with the format. Looks like a tab-delimited body with column metadata at the top.
Anyone dealt with these before or can give me a steer on how to parse them?
I use but the language of any solution isn't too relevant.
Much appreciated.
Mmm, looks like nobody has a clue. The problem is that .xfa doesn't look like a "standard" extension: after all, anybody can create their own extension names, from .xyz to .something...
I looked around a bit, found, unsurprisingly (the 'x') an XML format with this extension, not much more.
Indicating where this kind of file comes from, and what kind of data it holds, might help. Or not.
You describe the file as being a simple TSV (tab separated values) with a header. It is quite trivial to parse, with a tokenizer or some regex, so I am not sure where you are stuck.
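As a minimal sketch of that, assuming the "column metadata" is a single header line of tab-separated column names (the real .xfa layout is unknown, and read_tsv is a hypothetical name), Python's csv module can do the tokenizing:

```python
import csv


def read_tsv(stream):
    """Yield one dict per data row, keyed by the header line's column names."""
    reader = csv.DictReader(stream, delimiter="\t")
    for row in reader:
        yield dict(row)


def read_tsv_file(path):
    """Read a whole tab-separated file into a list of row dicts."""
    with open(path, newline="", encoding="utf-8") as stream:
        return list(read_tsv(stream))
```

If the metadata spans several lines, they would have to be consumed from the stream before handing it to the reader.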
I think you might be talking about this:
This seemed to be a page that was designed to deal with that template:
That information should be enough to get the ball rolling. If that fails then you can always analyse the file itself for patterns and go from there. I don't see it being too tricky.
Anyway, I hope that helps.
P.S. If you could provide a link to that .xfa we could probably give you more help.
The original post says the content looks like "tab delimited body with column metadata at the top". An XFA form doesn't look anything like that - XFA forms typically use a *.xdp extension and are XML.
Check out the Adobe page:
(Adobe XML Forms Architecture, currently 1400 pages)
Let LiveCycle/Acrobat parse it for you.
