Best markup format for future-proofing large text chunks? - storage

I have a number of records (<= 100) that contain sizeable chunks of text that require marking up (semantically: lists, headings, tables, links, quotations, etc.) before being stored in a reusable file format.
When stored, it is likely to remain more or less unchanged for as many years into the future as possible.
It contains some non-ASCII, so UTF-8 is required. I started using HTML, then considered Markdown... but I would like to know what people think is the most future-proof markup format for long-term storage. The content is initially for a (mostly static) website, but may be used as content for other outputs.
Finally, opinions on the choice of storage for long-term use - database, separate documents...? Changes to records will be infrequent and edited by only 1-3 people, and read access should increase over time.
Update:
I've finally chosen the features common to MultiMarkdown, PHP Markdown Extra and Kramdown (e.g. for tables) as the text format (plain Markdown omits too many HTML tags), and am converting the resulting files to HTML with Kramdown. Now I'm trying out iOS Markdown editors that can handle an extended Markdown and sync via Dropbox to my desktop/laptop.

Any storage not designed for long-term archiving will break.
It is not so much a question of database vs. filesystem, but of how to ensure that no (silent) data corruption happens, and how to migrate data. I can give you no definitive answer, because it depends on a lot of factors (incl. costs), but here are a few resources:
Building Better Long-Term Archival Storage System, a talk by Miller/Storer at the Library of Congress
The Digital Dilemma, a book aimed at movie archiving, but it highlights some of the issues in long-term archiving.
Project Honeycomb, a project by Sun for open-source long-term archiving, now discontinued.
I have no real answer for the format question, but I think HTML + UTF-8 should be readable even decades from now; just be sure to document your choice.

Is there a known workaround for the max token limit on the input to GPT-3?

For a bit of context, I recently started working on a personal project that accepts the URL of some recipe web page, pulls the HTML, converts the HTML to simplified markdown (this is the GPT-3 part), then sends that markdown to a thermal receipt printer in my kitchen, which prints it out.
Recipe web pages have a wide variety of structures, and they are notorious for including long and often irrelevant articles before the recipe, for the sake of SEO.
My plan was to use the fine-tuning API for davinci2, feeding it a bunch of straight-up recipe HTML as input and cleaned, recipe-only markdown as output. I noticed, though, that the maximum input token count for both training and inference is 4096. The HTML for a web page can be much larger than that, like 20k tokens.
I am wondering if anyone has found a workaround for training and driving GPT-3 with more tokens than 4096.
I'm open to other suggestions as well. For instance, I've considered passing just the visible text on the page, rather than the full HTML tree, but there is much less context present in that form, and the model seems more easily confused by all of the links and other navigational elements present in the page. I have also considered only allowing this project to accept "printer-friendly" versions of recipes, which tend to be much smaller and would easily come in under the 4096-token limit, but not all sites offer a printer-friendly version, and I don't want this to be a limitation.
I don't know of any workarounds, but have you thought of filtering the HTML elements out based on some basic rules? You could include only paragraph elements, or elements that have certain characteristics, like containing a list, which is something most recipes have.
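As a rough illustration of that kind of rule-based filtering (a sketch only, assuming BeautifulSoup is available; the tag names and rules are examples, not a tested recipe extractor):

# Rule-based pre-filtering of recipe HTML before sending it to the model.
# Assumes `pip install beautifulsoup4`; the selection rules are illustrative only.
from bs4 import BeautifulSoup

def filter_recipe_html(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements that rarely contain recipe content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    kept = []
    # Keep paragraphs, headings, and lists, since ingredient lists and
    # numbered steps are what most recipe pages have in common.
    for element in soup.find_all(["p", "h1", "h2", "h3", "ul", "ol"]):
        text = element.get_text(" ", strip=True)
        if text:
            kept.append(text)
    return "\n".join(kept)

This can cut the token count considerably before the markdown-conversion step ever sees the page.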
this framework might be useful to you: https://github.com/Xpitfire/symbolicai
The basic idea is:
You could stream over your input data and build up a stack of chunks on the side.
Next, in your training procedure, you need to account for having loosely connected chunks of data. You could handle this by indexing or clustering the chunks before designing your prompts.
This means that if you want to create a query for a question related to your long data stream, you can search through your indexes and retrieve the related information.
Now you need to piece together a few-shot learning prompt that has one "section" relating to your query and another for the facts you want to include.
Finally, you can feed that into your model and provide examples of what you want the model to be tuned to.
I know this is explained at a fairly high level, but if you follow the link I provided, things should become clearer.
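As a rough Python sketch of that chunk-index-retrieve idea (the fixed word-count chunking and keyword-overlap scoring below are simplifications for illustration, not the linked framework's API; real systems would typically index chunks with embeddings):

# Split a long document into chunks, score chunks against a query by naive
# keyword overlap, and assemble a prompt from the best-matching chunks.
def chunk_text(text: str, max_words: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def score(chunk: str, query: str) -> int:
    # Naive relevance score: how many query words appear in the chunk.
    chunk_words = set(chunk.lower().split())
    return sum(1 for w in query.lower().split() if w in chunk_words)

def build_prompt(document: str, query: str, top_k: int = 3) -> str:
    chunks = chunk_text(document)
    best = sorted(chunks, key=lambda c: score(c, query), reverse=True)[:top_k]
    context = "\n---\n".join(best)
    return f"Context:\n{context}\n\nTask: {query}\n"

The assembled prompt then stays under the 4096-token limit while still carrying the most relevant pieces of the long input.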

How to extract entities from html using natural language processing or other technique

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing, and entity extraction, but I am not sure if I am heading down the wrong path, so I am asking here.
I haven't started implementing anything yet, so if certain open source libraries are only suitable for a specific language, that is ok.
A lot of times the data would not be found in sentences, but instead in html structures like lists (e.g. 2013-02-01 - Name of Event - Arena Name).
The structure of the webpages will be vastly different (some might use lists, some might put them in a table, etc.).
What topics can I research to learn more about how to achieve this?
Are there any open source libraries that take into account the structure of html when doing entity extraction?
Would extracting these (name, time, place) entities from HTML be better (or even possible) with machine vision, where the CSS styling might make it easier to differentiate the important parts (name, time, location) of the unstructured text?
Any guidance on topics/open source projects that I can research would help, I think.
Many programming languages have external libraries that generate canonical date-stamps from various formats (e.g. in Java, using SimpleDateFormat). As you say, the structure of the web pages will be vastly different, but dates can be expressed in only a small number of variations, so writing down the regular expressions for a few (say, half a dozen) formats will enable extraction of dates from most, if not all, HTML pages.
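As a hedged illustration of that regex approach (shown in Python; the patterns and formats below are assumptions covering only a few common cases and would need extending for the pages you actually see):

# Extract dates from page text by trying a handful of known formats.
import re
from datetime import datetime

DATE_PATTERNS = [
    (re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"), "%Y-%m-%d"),    # 2013-02-01
    (re.compile(r"\b(\d{2}/\d{2}/\d{4})\b"), "%d/%m/%Y"),    # 01/02/2013
    (re.compile(r"\b(\d{1,2} \w+ \d{4})\b"), "%d %B %Y"),    # 1 February 2013
]

def extract_dates(text: str) -> list[datetime]:
    found = []
    for pattern, fmt in DATE_PATTERNS:
        for match in pattern.findall(text):
            try:
                found.append(datetime.strptime(match, fmt))
            except ValueError:
                pass  # matched the shape but not a valid date in this format
    return found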
Extraction of places and names is harder, however. This is where natural language processing will have to come in. What you are looking for is a Named Entity Recognition (NER) system. One of the best open-source NER systems is the Stanford NER. Before using it, you should check out their online demo. The demo has three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.
Note that an NER system performs well when the places and names you extract occur in sentences. If they are going to occur in HTML labels, this approach is probably not going to be very helpful.
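For completeness, a sketch of driving Stanford NER from Python through NLTK's StanfordNERTagger wrapper (the jar and model paths are placeholders for wherever you unpacked the Stanford NER download, Java must be installed, and newer NLTK releases deprecate this wrapper in favour of the CoreNLP server):

# Tag names and places with Stanford NER via NLTK's wrapper.
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",  # placeholder path
    "stanford-ner/stanford-ner.jar",                                   # placeholder path
)

tokens = "The event takes place at Madison Square Garden in New York".split()
print(tagger.tag(tokens))
# Expected shape: [('The', 'O'), ..., ('Madison', 'LOCATION'), ('Square', 'LOCATION'), ...]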

Flat file in delphi

In my application I want to use files for storing data. I don't want to use a database or a plain text file; the goal is to save double and integer values along with a string just to identify the name of the record. I simply need to save data on disk for generating reports. Files can grow even to a gigabyte. What format do you suggest? Binary? If so, what VCL component/library do you know of that is good to use? My goal is to create an application which creates and updates the files, while another tool will "eat" those files,
producing nice PDF reports for the user on demand. What do you think? Any ideas or suggestions?
Thanks in advance.
If you don't want to reinvent the wheel, you may find all the Open Source tools needed for your task on our side:
Synopse Big Table to store huge amounts of data - see in particular the TSynBigTableRecord class to store an unlimited number of records with fields, including indexes if needed - it will definitely be faster and use less disk space than any regular SQL DB
Synopse SQLite3 Framework if you would rather use a standard SQLite engine for the storage - it comes with a full Client/Server ORM
Reporting from code, including pdf file generation
With full Source code, working from Delphi 6 up to XE.
I've just updated the documentation of the framework. More than 600 pages, with details of every class method, and a new, enhanced general introduction. See the SAD document.
Update: If you plan to use SQLite, you should first think about how the data will be stored, which indexes are to be created, and how a SQL query may speed up your requests. It's a bad idea to read all the file content for every request: you should structure your data so that a single SQL query is able to return the expected results. Sometimes, adding derived values (like temporary sums or means) to the data is a good idea. Also consider using the RTree virtual table of SQLite3, which is dedicated to speeding up access to double min/max multi-dimensional data: it may speed up your requests a lot.
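To make the R*Tree suggestion concrete, here is the plain SQL involved, shown through Python's sqlite3 module purely for illustration; the same statements apply from any Delphi SQLite wrapper, and they assume your SQLite build was compiled with the R*Tree module:

# An SQLite R*Tree virtual table for fast min/max range queries.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE samples_index USING rtree(id, min_value, max_value)")
con.execute("INSERT INTO samples_index VALUES (1, 10.5, 42.0)")
con.execute("INSERT INTO samples_index VALUES (2, 100.0, 250.0)")

# Range query: all records whose [min, max] interval overlaps [5, 50].
rows = con.execute(
    "SELECT id FROM samples_index WHERE max_value >= ? AND min_value <= ?",
    (5.0, 50.0),
).fetchall()
print(rows)  # [(1,)]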
You don't want to use a full SQL database, and you think that a plain text file is too simple.
Points in between those include:
Something that isn't a full SQL database, but more of a key-value store, would technically not be a flat file, but it does provide a single "key+value" list, that is quickly searchable on a single primary key. Such as BSDDB. It has the letter D and B in the name. Does that make it a database, in your view? Because it's not a relational database, and doesn't do SQL. It's just a binary key-value (hashtable) blob storage mechanism, using a well-understood binary file format. Personally, I wouldn't start a new project and use anything in this category.
Recommended: Something that uses SQL but isn't as large as a standalone SQL database server. For example, you could use SQLite and a Delphi wrapper. It is well tested, used in lots of C/C++ and Delphi applications, and can be trusted more than anything you could roll yourself. It is a very light embedded database, and is trusted by many.
Roll your own ISAM, or VLIR, which will eventually morph over time into your own in-house DBMS. There are multiple files involved, and there are indexes, so you can look up data fast without loading everything into memory. Not recommended.
The most flat of flat binary fixed-record-length files. You originally mentioned PowerBASIC in your question, which has something called Random Access files, and then deleted that from your question. This is probably what you are looking for, especially with append-only writes as the primary operation: roll your own TurboPascal-era "file of record". If you use the "FILE OF RECORD" type, you hit the 2 GB limit, and there are problems with Unicode, so use TStream instead, like this. Binary file formats have a lot of strikes against them, especially since it is difficult to grow and expand your binary file format over time without breaking your ability to read old files. This is a key reason why I would recommend you start out with what might at first seem like overkill (SQLite) instead of rolling your own binary solution.
(Update 2: After updating the question to mention PDFs and what sounds like a reporting-system requirement, I think you really should be using a real database, perhaps a small and easy-to-use one like Firebird or InterBase.)
I would suggest using TClientDataSet, using its SaveToFile() / SaveToStream() methods in the generating program, and the LoadFromFile() / LoadFromStream() methods in the program that will "consume" the data. That way, you can still have indexed records without connecting to any external database, all while keeping the interchange data in a single file.
Define an API to work with your flat file, so that it can be implemented by a separate data layer in many ways.
Implement the API using a standard embedded SQL database (e.g. SQLite or Firebird).
Only if there is something wrong with the standard solution should you think of rolling your own.
I use kbmMemTable - see http://www.components4developers.com/ - fast, reliable, been around a long time - supports binary and CSV streaming in and out of files, as well as indexing, filters, and lots of other goodies - TClientDataSet will not do well with large datasets.

What is the best way to store multiple language versions of a website?

My web site (on Linux servers) needs to support multiple languages.
What is the best practice for having/storing multiple language versions of the same site?
Some I can think of:
store in DB
different view file for each language
gettext
hard-coded words in PHP files (like in phpBB)
With web sites, you really have several categories of content to consider for localization:
The article-type content elements that you would in many cases create, edit and publish in a CMS.
The smaller content blocks that are common to every page (or a sub-group of pages), such as tagline, blurb, text around a contact form, but also imported content such as a news ticker or ads and affiliate links. Some of these may only appear for one language (for example, if you don't offer some services in some regions, or don't have, say, language-appropriate imported content for a particular language: it can be better to remove an element rather than offering English to people who may not speak it).
The purely functional elements, like "Click here to comment", "More...", high-level navigation, etc., which are sometimes part of your template. Some of these may be inside images.
For 1. the main decision is using a CMS or not. If yes, you absolutely need to choose one that supports multiple languages. I'm not up-to-date with recent developments in PHP CMS's, but several of the Django CMS apps (Django-CMS-2, FeinCMS) support multi-language content. Don't forget that date stamps, for example, need to be localized, too (or you can get around this by choosing ISO dates, though that may not always be possible). If you don't use a CMS, and everything is in your HTML files, then gettext is the way to go, and keep the .mo files (and your offline .po files) in folders by language.
For 2. if you have a CMS with good multi-lingual support, get as much as possible inside the CMS. The reason is that these bits do change, and you want to edit your template as little as possible. If you write code yourself, think of ways of exporting all in-CMS strings per language, to hand them to translators. Otherwise, again, gettext. The main issue is that these elements may require hard-coding language-selection code (if $language = X display content1 ...)
For 3., if it's in your template, use gettext. For images, the per-language folders will come in handy, and for heaven's sake choose images whose generation can be automated, or you (or your graphic artist) will go mad editing hundreds of custom images with strings in languages you don't understand.
For both 2. and 3., abstracting away the language selection may help with selecting the appropriate blocks or content directory (where localized images or .mo files are kept).
What you definitely want to avoid is keeping a pile of HTML files with extensive text content in them that would be a nightmare to maintain.
EDIT: Everything about gettext, .po and .mo files is in the GNU gettext manual (more than you ever wanted to know) or a slightly dated but friendlier tutorial. For PHP, there are the PHP gettext functions, and also the Zend Locale documentation.
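The gettext workflow above is not PHP-specific; as a rough illustration of the per-language .mo folder layout, here is the equivalent call in Python (the locale directory, the text domain "site" and the language code are assumptions for the example):

# Minimal gettext usage with per-language .mo folders, e.g.:
#   locale/de/LC_MESSAGES/site.mo
#   locale/fr/LC_MESSAGES/site.mo
# PHP's gettext functions and Zend_Translate follow the same model.
import gettext

translation = gettext.translation("site", localedir="locale", languages=["de"], fallback=True)
_ = translation.gettext

print(_("Click here to comment"))  # prints the German string if site.mo provides one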
I recommend using Zend_Translate's Gettext adapter which parses mo files. Very efficient + caching. Your calls would be like
echo $translation->_("Hello World");
Which would find the locale specific key for that specified string.
Check out i18n support for php: http://php-flp.sourceforge.net/getting_started_english.htm

Parsing text file (100+ MB) and sending data over network

I have a requirement to parse a huge text file and send parts of this file to be added as separate rows in Content Manager. What is the best way of parsing it and then updating the DB?
I also need to identify certain tokens within this text file.
Please suggest what language I should use to code this requirement.
Thanks
All widely used programming languages can do that, though scripting languages (especially Perl) may be better suited to the task than others. However, your personal experience is a bigger factor: using the language you're most familiar with would probably be best, unless you have specific reasons not to use it, or to use a different language.
A classic problem when working with large files is just reading them in the first place. A lot of standard libraries tend to want to read the entire file into memory / an array. However, for really large files this is usually not practical.
For whatever language you end up choosing, look over the file I/O libraries carefully and select a method that will allow you to read the file in chunks. Then run your parsing logic over the chunks and, when you reach the end of a chunk, read in the next. Be careful with the parsing logic; it can sometimes be tricky to handle a chunk that ends in a place your parsing is not expecting.
Additionally, a double-buffer system sometimes works well. Process one chunk and, when you get near the end, fill the other buffer with the next chunk. If your parsing is CPU-intensive, you might even look at filling a buffer on another thread to overlap the file I/O with the parsing. However, I wouldn't do this first. Start by getting the logic working before any performance optimizations.
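A minimal sketch of the chunked-reading approach, in Python purely for illustration since the question is language-agnostic; the file name and downstream call are hypothetical, and the partial line at the end of each chunk is carried over so records are never split:

# Read a large file in fixed-size chunks, yielding complete lines.
def iter_lines_in_chunks(path: str, chunk_size: int = 1024 * 1024):
    remainder = ""
    with open(path, "r", encoding="utf-8") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            chunk = remainder + chunk
            lines = chunk.split("\n")
            remainder = lines.pop()  # possibly incomplete last line
            for line in lines:
                yield line
    if remainder:
        yield remainder

# for line in iter_lines_in_chunks("huge_export.txt"):
#     parse_and_send(line)  # hypothetical downstream step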
Without more detailed requirements it's difficult to suggest a particular language. Certainly no language is going to magically solve the problem of parsing such a big file. Depending on the format of the file, there might be a parsing library particularly suited to the job, which might guide your choice of language.
If by "Content Manager" you mean Microsoft Content Manager Server I guess one of the Microsoft languages such as C# or VB.Net might be a better choice.
So my answer would be to pick one of the languages you already know, probably the one you know best.
