Time Cost Comparison: Haml render_to_string vs. string addition

I'm working on something that will render a small bit of xml (10 lines) every second.
I like the ease of constructing XML with Haml, but I was wondering if anyone knew any details about the server cost of using render_to_string with Haml versus building a string with String addition.

Haml is meant to be used when you want to generate pretty XML and when the structure of the XML is hand-edited. If you are generating small XML documents for machine consumption, use faster libraries such as Nokogiri or Builder.
Please do not use string interpolation: most of the time you will end up creating malformed documents, because the real input data will be slightly different from the data you used to test your app. This is true whenever you handle user-generated data. String interpolation is also a nice way to introduce security bugs. Just don't do it.
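For a sense of what the Builder route looks like, here is a minimal sketch; the status_xml method, the element names and the reading hash are all made up for illustration:

require 'builder'

# Hypothetical ten-line status document; element names are placeholders.
def status_xml(reading)
  xml = Builder::XmlMarkup.new(indent: 2)
  xml.instruct!
  xml.status do
    xml.timestamp reading[:timestamp].iso8601
    xml.value     reading[:value]
  end
  xml.target! # the finished XML as a String
end

puts status_xml(timestamp: Time.now, value: 42)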

Related

Are There Any Rails Modules or Classes Which Provide Frozen HTML Content Type Strings?

I've been searching through the source for a while, and it appears to me that there are no built-in Rails tools for retrieving the String representation of the various HTML content types. I've also found this to be a very difficult concept to search for in general.
What I want is something like this:
Mime::Mimes::CONTENT_TYPE_JSON = 'application/json'.freeze
or, Mime::Mimes::CONTENT_TYPES[:json] etc.
...because I want to do a lot of things like some_value == 'application/json' or some_value = 'application/json' etc.
I want to use the expression "application/json" often, and I don't want to create new String instances for something that is pretty well within the domain of web application development. I've thought of creating my own app consts or vars so I don't have to allocate HTML content type strings more than once, but I also feel this should just be available in any web application framework (at least, those written in languages where every string literal is a new memory allocation).
Is there a better tool or resource within the Rails 5 source that I am missing that allows easy retrieval of content type strings? Do I have to get a gem / create my own for this?
Note: I'm aware of how heavy an "optimization" this may appear to be. Let's entertain this query from a position of being pragmatic about organizational style, for a project that requires eliminating any duplication of domain-specific string literals and keeping them symbolized or as frozen constants. Let's pretend it's a personal project for the sheer joy of experimenting with such a style!
There is a shorthand for it:
Mime[:json]
Mime#[] -
https://github.com/rails/rails/blob/e2efc667dea886e71c33e3837048e34b7a1fe470/actionpack/lib/action_dispatch/http/mime_type.rb#L41
which uses
Mime::Type#lookup_by_extension -
https://github.com/rails/rails/blob/e2efc667dea886e71c33e3837048e34b7a1fe470/actionpack/lib/action_dispatch/http/mime_type.rb#L149
If you want the actual content type string, you might need to call #to_s on it:
Mime[:json].to_s
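In a Rails 5 console that looks roughly like this (return values abbreviated):

Mime[:json]                        # => #<Mime::Type ... @string="application/json">
Mime[:json].to_s                   # => "application/json"
Mime[:json] == 'application/json'  # => true; Mime::Type#== compares the string form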
Creating a new module to facilitate simple storage and retrieval using the ActionPack Mime::Type system would work as follows:
# Build hash of type name to value, e.g. { xml: "application/xml" }
CONTENT_TYPES = {}.tap do |simple_content_types_hash|
  # EXTENSION_LOOKUP maps each registered extension string to its Mime::Type
  Mime::EXTENSION_LOOKUP.each do |extension, mime_type|
    simple_content_types_hash[extension.to_sym] = mime_type.to_s.freeze
  end
end.freeze
Note: the above is untested; it's just a generalization of what I am looking for. Thanks to @Fire-Dragon-DoL for the tip.
This could be added via an initializer, patched into an existing module, or into a new helper module.
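Usage would then look something like this (assuming the constant above is defined at boot, e.g. in a hypothetical config/initializers/content_types.rb):

CONTENT_TYPES[:json]                 # => "application/json"
some_value == CONTENT_TYPES[:json]   # compare against the frozen string, no new allocation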

How to get http tag text by id using lua

There is a webpage parser which takes a page that contains several tags, in a certain structure, where divs are badly nested. I need to extract a certain div element and copy it and all its content to a new HTML file.
Since I am new to lua, I may need basic clarification for things might seem simple.
Thanks,
How easily you can extract the data is going to depend largely on the page itself. If the page uses the exact same tag information throughout its entirety, it'll be much more difficult to extract what you need than it would be if it had named tags.
If you're able to find a version of the page that returns JSON format, then you're that much better off. Here's a snippet of code from something I wrote to grab definitions from a webpage that did not have JSON format:
local actualword, definition = string.match(wayup,"<html.-<td class='word'>%c(.-)%c</td>.-<div class=\"definition\">(.-)</div>")
Essentially, this code searched down the page until it found the class "word", and took the word after it (%c is the pattern for control characters). It continued on to "definition" and captured that, as well.
As you can see, it's a bit convoluted, but I had the luck of having specifically named tags for what I wanted.
This is edited to fit your comment. As a side note that I should have mentioned before: if you're familiar with regular expressions, you can use the same model here to capture what you need. In this case, the pattern captures the string in its totality:
local data = string.match(page, "(<div id=\"aa\"><div>.-</div>.-</div>)")
It's rarely the fault of the language, but rather the webpage itself, that makes it hard to data-mine anything. Since a webpage can literally have hundreds of lines of code, it's hard to pinpoint exactly what you want without coming across garbage information. That's why I prefer a simplified result such as JSON, since Lua has a json module that can encode/decode it, and you can get your precise information.

Rails XML generation like Active Model Serializer

Is there a way to generate XML from the configuration/programming used by the Rails ActiveModelSerializers gem? AMS seems to only generate customized JSON; XML comes out in a default format.
I've seen references to ActiveModel::Serialization and that it supports JSON and XML, but the configuration, while similar, is different. What is the story with the difference between them? Is one going away? How do they compare in real use (other than format capability)?
As you can see in various places around the web, XML is slowly disappearing. There are a couple of reasons for that: (1) JSON payloads are smaller, (2) JSON is the de facto format for most client-side JavaScript libraries, and (3) fashion, people like it.
You can still use ActiveModel to serialize XML if you wish:
http://api.rubyonrails.org/classes/ActiveModel/Serializers/Xml.html
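A minimal sketch of that approach; note that in Rails 5 the XML serializer was extracted into the separate activemodel-serializers-xml gem, and the Widget model below is made up:

# Gemfile (Rails 5+): gem 'activemodel-serializers-xml'
class Widget
  include ActiveModel::Model
  include ActiveModel::Serializers::Xml

  attr_accessor :id, :name

  # Serializers::Xml builds the document from this hash
  def attributes
    { 'id' => id, 'name' => name }
  end
end

Widget.new(id: 1, name: 'sprocket').to_xml
# => "<?xml ...?>\n<widget>\n  <id type=\"integer\">1</id>\n  <name>sprocket</name>\n</widget>" (roughly)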
Hope it helps.

Ruby/Rails parse XML without loading it all into memory

I'm wondering if there's an easy way to parse an XML document in Rails without loading it all into RAM.
I've been using (depending on the XML) a combination of Nokogiri and the standard Hash.from_xml to pull the contents out of the XML.
That is all well and good when I'm dealing with (attempting to import) 100 or even 1,000 products. When, however, the XML doc has 16,000 or 40,000 products in it... well, my Dino starts to really, really feel it.
So I'm wondering if there's a way to walk the XML without pulling it all into memory.
Sorry I don't have code.... I'm attempting to avoid writing anything new. I mean who wants to write their own XML parser eh?
I came to this...
require 'nokogiri'

reader = Nokogiri::XML::Reader(File.open('test.xml'))
reader.each do |node|
  if node.name == 'Product'
    hash = Hash.from_xml(node.outer_xml).values.first
    break # stops after the first Product; drop this to walk them all
  end
end
I watched my memory load while I ran this across a 60 meg file. It accomplished my goal. I'd love to see other answers. Perhaps something even lighter.
Because XML is hierarchical the parser needs to know the whole structure to parse it correctly. You could feed well formed fragments to Nokogiri::HTML::Document.parse but you'd need to get those fragments out some other way.
Let's say you have a huge xml document:
<products>
  <product>stuff</product>
  <product>...</product>
  ... and so on
</products>
The actual products are enveloped within <products>. Strip out the envelope part, then use string splitting to get each <product> and its contents, and parse each of those as an XML fragment, as in the rough sketch below. Just a thought.
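An untested sketch of that idea, assuming the file is named products.xml and each <product> closes on its own line:

require 'active_support/core_ext/hash/conversions' # Hash.from_xml; already loaded inside Rails

buffer = ''
File.foreach('products.xml') do |line|      # streams the file line by line
  next if line =~ %r{</?products>}          # skip the envelope tags
  buffer << line
  next unless line.include?('</product>')
  hash = Hash.from_xml(buffer).values.first # same trick as the Reader example above
  # ... import the product here ...
  buffer = ''
end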
This might help, although I've never used it: https://github.com/soulcutter/saxerator

how to obtain URLs from Dmoz ODP

I want to use the database of URLs present in the DMOZ ODP for my application (an array of URL strings, OR a file containing the same). Is there any way of obtaining it, other than manual copy-paste?
EDIT :
Is there any script / code to parse the RDF file?
Take a look at http://rdf.dmoz.org/, you'll need to find a way to parse the RDF into your database.
I did this the other day using the odp2db scripts from Steve's Software. They're old, but the format hasn't changed significantly so they work fine.
I found I didn't need to do the iconv and xmlclean.pl steps suggested in the readme, just uncompressed the dumps and ran the structure2db.pl and content2db.pl scripts. You'll need to create the database tables manually (see the SQL at top of script for that) and modify the connection details in the scripts before you start.
With the mid-January 2009 dump I used, there are 756,962 categories and 4,436,796 websites. It took a while to run through them all, but not excessively long, though I did dispense with the site descriptions as I didn't need them. Also, it may be worth adding database indices after creating the tables to speed up access later. The structure and content files were 75MB and 300MB compressed respectively, and 848MB and 2GB uncompressed.
I've actually done this in Java. I just used the SAX API to read through the RDF files. It was pretty straightforward. In my case I wanted to pull out every URL that was in a topic with "Weblogs" in the topic name.
Basically what I did was implement an org.xml.sax.helpers.DefaultHandler.
Then, to set up the parser, you do:
InputSource is = new InputSource(new FileInputStream("filename.rdf"));
XMLReader r = XMLReaderFactory.createXMLReader();
r.setContentHandler(new MyHandlerClass());
r.parse(is);
and that's pretty much it. In my handler class I had to implement:
startElement(String uri, String localName, String qName, Attributes attributes), where I had an if statement to see whether it was an "ExternalPage" tag, in which case I went into another state to look for "topic", "Title" and "Description". I had another,
characters(char[] ch, int start, int length), where I read in the topic, title, and description text, depending on which one had most recently been sent to startElement, and
endElement(String uri, String localName, String qName), where I checked which element was ending; if it was ExternalPage, that meant the end of the current element.
The whole thing was 80-90 lines of code for the basic parsing, so it was pretty easy to write. It was able to chew through the multi-gigabyte files in... I don't remember, maybe a minute or two? If you just want to query out some specific data, it might be easier to write the code to do that in your handler rather than trying to load it into a DB.
If you find a tool that works well, that's obviously better than writing your own code. But writing your own code isn't hard! RDF is just an XML format, and it's not deeply nested or anything. A simple SAX parser is easily doable in a day or so.
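As a rough illustration of how small such a parser can be, here is the same idea sketched in Ruby with Nokogiri's SAX API (untested; the element and attribute names follow the ODP content dump, and the handler just prints each URL):

require 'nokogiri'

class OdpHandler < Nokogiri::XML::SAX::Document
  def start_element(name, attrs = [])
    @element = name
    # each page in the content dump is an <ExternalPage about="http://..."> element
    @current_url = Hash[attrs]['about'] if name == 'ExternalPage'
  end

  def characters(text)
    # accumulate Title / Description / topic text here, keyed on @element, if needed
  end

  def end_element(name)
    if name == 'ExternalPage' && @current_url
      puts @current_url   # or write it to your database / array
      @current_url = nil
    end
  end
end

Nokogiri::XML::SAX::Parser.new(OdpHandler.new).parse(File.open('content.rdf.u8'))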
You could always pay one of the corrupt editors there and they will help you out :)
