Dealing with invalid characters from web scraping - ruby-on-rails

I've written a web scraper to extract a large amount of information from a website using Nokogiri and Mechanize, which outputs a database seed file. Unfortunately, I've discovered there are a lot of invalid characters in the text on the source website, things like keppnisÃ¦find, ScÃ©mario and KlÃ¤tiring, which are preventing the seed file from running. The seed file is too large to go through with search and replace, so how can I go about dealing with this issue?

I think those are HTML characters; all you need to do is write a function that will clean the characters. This depends on the programming platform.
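A minimal sketch of such a cleaning step in Ruby, assuming the problem is byte sequences that are not valid UTF-8 (if the text is really mis-decoded UTF-8, as the next answer suggests, you would want to repair it rather than strip it):
    # Replace byte sequences that are not valid in the string's encoding;
    # pass "" instead of "?" to drop them outright.
    def clean(str)
      str.scrub("?")
    end

    garbled = "keppnis\xE6find".force_encoding("UTF-8")
    puts clean(garbled)   # => keppnis?find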

Those are almost certainly UTF-8 characters; the words should look like keppnisæfind, Scémario and Klätiring. The web sites in question might be sending UTF-8 but not declaring that as their encoding, in which case you will have to force Mechanize to use UTF-8 for sites with no declared encoding. However, that might complicate matters if you encounter other web sites without a declared encoding and they send something besides UTF-8.
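A minimal sketch of that, assuming a Mechanize version that exposes default_encoding and force_default_encoding (check the names against your installed version):
    require 'mechanize'

    agent = Mechanize.new
    # Assume UTF-8 for any page that does not declare its encoding...
    agent.default_encoding = "UTF-8"
    # ...or ignore declared encodings entirely (use with care if you also
    # scrape sites that genuinely send something other than UTF-8).
    agent.force_default_encoding = true

    page = agent.get("http://example.com/")
    puts page.encoding

    # Text that was already scraped with the wrong encoding (UTF-8 bytes
    # read as Latin-1) can usually be repaired after the fact:
    # "KlÃ¤ttring".encode("ISO-8859-1").force_encoding("UTF-8")  # => "Klättring"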

Related

Rails: what security issues come with extracting text from user-submitted files?

If users of an app are able to submit flat text files, and these files have data pulled from them by a program using a regex (which is then returned to the user), how can this be abused?
I know there are concerns with executable files or unsanitized filenames when they're being saved, but I don't know what the risks are with just opening and parsing a file that lasts temporarily in memory.
Thanks.
It depends very much on the implementation of this theoretical system. The big two vulnerabilities are:
SQL Injection. If you are committing this data to a database and do so in an improper manner, you could expose your database to whatever maliciously-formatted data the user uploads.
Cross-Site Scripting. If you're rendering the results of the upload as HTML, you potentially allow an XSS vulnerability if the results aren't properly escaped.
Proper handling of user input can reduce these problems. Generally, though, much depends on the actual implementation details of your code. If you're eval-ing user input, obviously, that's also an enormous security flaw... but it's not something we can see at this level of detail.
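To make the first two points concrete in Ruby (extracted_text and Record are hypothetical names, not from the question; this is a sketch, not a complete defence):
    require 'erb'

    extracted_text = "<script>alert('xss')</script>"

    # Cross-site scripting: escape user-derived text before rendering it as HTML.
    safe_html = ERB::Util.html_escape(extracted_text)
    puts safe_html   # => &lt;script&gt;alert(&#39;xss&#39;)&lt;/script&gt;

    # SQL injection: never interpolate user text into SQL yourself; let the
    # ORM or driver parameterize it. With ActiveRecord, for example:
    #   Record.where("title = '#{extracted_text}'")   # vulnerable
    #   Record.where(title: extracted_text)           # parameterized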

Is there any way to avoid showing "xn--" for IDN domains?

If I use a domain such as www.äöü.com, is there any way to avoid it being displayed as www.xn--4ca0bs.com in users’ browsers?
Domains such as www.xn--4ca0bs.com cause a lot of confusion with average internet users, I guess.
This is entirely up to the browser. In fact, IDNs are pretty much a browser-only technology. Domain names cannot contain non-ASCII characters, so the actual domain name is always the Punycode encoded xn--... form. It's up to the browser to prettify this, but many choose to not do so to avoid domain name spoofing using lookalike Unicode characters.
From a security perspective, Unicode domains can be problematic because many Unicode characters are difficult to distinguish from common ASCII characters (or indeed other Unicode characters).
It is possible to register domains such as "xn--pple-43d.com", which is equivalent to "аpple.com". It may not be obvious at first glance, but "аpple.com" uses the Cyrillic "а" (U+0430) rather than the ASCII "a" (U+0061). This is known as a homograph attack.
Fortunately, modern browsers have mechanisms in place to limit IDN homograph attacks. Chrome's IDN Policy page highlights the conditions under which an IDN is displayed in its native Unicode form. Generally speaking, the Unicode form will be hidden if a domain label contains characters from multiple different languages. The "аpple.com" domain as described above will appear in its Punycode form as "xn--pple-43d.com" to limit confusion with the real "apple.com".
For more information see this blog post by Xudong Zheng.
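Incidentally, the Punycode form is easy to compute programmatically; a minimal Ruby sketch using the simpleidn gem (one IDNA library among several, named here as an illustration):
    require 'simpleidn'

    SimpleIDN.to_ascii("äöü.com")          # => "xn--4ca0bs.com"
    SimpleIDN.to_unicode("xn--4ca0bs.com") # => "äöü.com"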
Internet Explorer 8.0 on Windows 7 displays your UTF-8 domain just fine.
Google Chrome 19 on the other hand doesn't.
Read more here: An Introduction to Multilingual Web Addresses #phishing.
Different browsers do things differently, possibly because some use the system codepage/locale/encoding/whatever, while others use their own settings or a list of allowed characters.
Read that article carefully; it explains how each browser decides.
If you are targeting a specific language, you can get away with it and make it work.

Best markup format for future-proofing large text chunks?

I have a number of records (<= 100) that contain sizeable chunks of text that require marking up (semantically: lists, headings, tables, links, quotations, etc.) before storing in a re-usable file format.
When stored, it is likely to remain more or less unchanged for as many years into the future as possible.
It contains some non-ASCII, so UTF-8 is required. I started using HTML, then considered Markdown... but would like to know what people think is the most future-proof markup format for long-term storage. The content is initially for a (mostly static) website, but may be used as content for other outputs.
Finally, opinions on the choice of storage for long-term use - database, separate documents...? Changes to records will be infrequent and edited by only 1-3 people, and read access should increase over time.
Update:
I've finally chosen the common features (e.g. for tables) between MultiMarkdown, PHP Markdown Extra and Kramdown as the text format (plain Markdown omits too many HTML tags), and am converting the resulting files to HTML with Kramdown. Now I'm trying out iOS Markdown editors that can handle an extended Markdown and sync via Dropbox to my desktop/laptop.
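For reference, the Kramdown conversion step is a one-liner; a minimal sketch (the file names are illustrative):
    require 'kramdown'

    source = File.read("article.md", encoding: "UTF-8")
    File.write("article.html", Kramdown::Document.new(source).to_html)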
Any storage not designed for long-term archiving will break.
It is not so much a question of database vs. filesystem, but how to ensure that no (silent) data corruption happens, and how to migrate data. I can give you no definitive answers, because it depends on a lot of factors (incl. costs), but here are a few resources:
Building Better Long-Term Archival Storage System, Talk by Miller/Storer at the Library of Congress
The Digital Dilemma, Book aimed at movie archiving, but highlights some of the issues in long term archiving.
Project Honeycomb, a project by SUN for open source long-term archiving, but discontinued.
I have no real answer for the format question, but I think HTML + UTF-8 should still be readable decades from now; whatever you choose, document it.

UTF-8 uses and alternatives

Under what circumstances would you recommend using UTF-8? Is there an alternative to it that will serve the same purpose?
UTF-8 is being used for i18n?
Since you tagged this with web design, I assume you need to optimize the code size to be as small as possible to transfer files quickly.
The alternatives to UTF-8 would be the other Unicode encodings, since there is no alternative to using Unicode (for regular computer systems at least).
If you look at how UTF-8 is specified, you'll see that all code points up to U+007F will require one octet, and code points up to U+07FF will require two octets, up to U+FFFF three and four octets for code points up to U+10FFFF.
For UTF-16, you will need two octets up to U+FFFF (mostly), and four octets for values up to U+10FFFF.
For UTF-32, you need four octets for all Unicode code points.
In other words, scripts that lie under U+07FF will have some size benefit from using UTF-8 compared to UTF-16, while scripts above that will have some size penalty.
However, since the domain is web design, it might be worth noting that all the characters used in HTML markup and Javascript lie in the one-octet range of UTF-8, so that size penalty is less pronounced for pages with lots of markup and script relative to the amount of actual "text".
Scripts under U+07FF include Latin (except some extensions such as tone marks), Greek, Cyrillic, Hebrew and probably some more. Wikipedia has pretty good coverage on Unicode issues, and on the Unicode Consortium you can get even more details.
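The octet counts above are easy to verify in Ruby, for what it's worth:
    "a".encode("UTF-8").bytesize     # => 1  (U+0061)
    "é".encode("UTF-8").bytesize     # => 2  (U+00E9)
    "€".encode("UTF-8").bytesize     # => 3  (U+20AC)
    "😀".encode("UTF-8").bytesize    # => 4  (U+1F600)
    "a".encode("UTF-16BE").bytesize  # => 2
    "€".encode("UTF-16BE").bytesize  # => 2
    "😀".encode("UTF-16BE").bytesize # => 4  (surrogate pair)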
Since you are asking for recommendations: I recommend using it in all circumstances, all the time, i.e. for HTML files and textual resources. For an English-only application it doesn't change a thing, but when you need to actually localize it, having UTF-8 in the first place will be a benefit (you won't need to revisit your code and change it; one less source of defects).
As for the other encodings in the Unicode family (especially UTF-16), I would not recommend using them for web applications. Although bandwidth consumption might actually be higher for e.g. Chinese characters (three octets each in UTF-8 rather than two in UTF-16), you'll avoid problems with transmission and browser interpretation (yeah, I know that in theory it should all work the same; unfortunately, in practice it tends to break).
Use UTF-8 all the way. No excuses.
Use UTF-8 for Latin-based languages, UTF-16 for every other language.

Parsing web pages

I have a question about parsing HTML pages, specifically forums.
I want to parse a forum or thread containing certain post criteria. I haven't defined the
algorithm yet, since I have only parsed structured text formats before.
A use case may be to copy and paste each thread into the program by hand, or to insert a URL like
http://www.forums.com/forum/showthread.php?t=46875&page=3 and let the program parse the pages.
Given all this I would like to know:
Is it possible to parse a forum thread on an HTML page?
What would be the best/fastest/easiest language for doing this?
If I prefer Java, what tools/libraries do I need for this?
Any other thing I should consider?
1 / yes
2 / Use some compact language like Python or Ruby for prototyping.
For Python there is a neat library for HTML/XML parsing called BeautifulSoup.
For Ruby, you could try: nokogiri or hpricot.
3 / A Java tool to consider: htmlparser
4 / If you are interested only in some particular text or some special classes, a regular expression might be sufficient. But as soon as you want to dig deeper into the structure of the content, you'll need some kind of model to hold your data, and hence a parser, which, in the best case, can cope with the inconsistencies of real-world HTML.
You might want to look into some sort of HTML parsing library, rather than using regular expressions to do this. There are some really good HTML parsers for Ruby and Python, and a quick Google shows there to be a number of parsers for Java as well. The benefit of these libraries is that you don't have to handle every edge case with regular expressions, they handle malformed HTML (both of which can be impossible with regexes, depending on what you want to do), and they give you a much nicer way of dealing with the data (for example, Beautiful Soup lets you grab all elements that belong to a specific class, or use some other CSS selector to limit which page elements you want to deal with).
Personally, I would, at least for the beginning, start in Ruby or Python, as the libraries are known and there is a lot of info about using them for this purpose. Also, I find it easier to quickly prototype these types of things in Ruby or Python than on the JVM. You could even later bring that code onto the JVM with JRuby or Jython, if it becomes necessary.
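To make that concrete, here is a minimal Nokogiri sketch along those lines (the .post, .username and .post-body selectors are hypothetical; substitute whatever classes the target forum actually uses):
    require 'nokogiri'
    require 'open-uri'

    url = "http://www.forums.com/forum/showthread.php?t=46875&page=3"
    doc = Nokogiri::HTML(URI.open(url))

    # Grab every post on the page by its CSS class and pull out the parts we need.
    doc.css(".post").each do |post|
      author = post.at_css(".username")&.text&.strip
      body   = post.at_css(".post-body")&.text&.strip
      puts "#{author}: #{body}"
    end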
Yes.
Regular expressions, any flavor.
Probably the ones with regex.
There are tools out there that will do this for you.
