Database design for book structure (table of contents) and content - iOS

I have a list of entries, which can be thought of as paragraphs from a book, stored as separate objects of the same class. These objects have a ‘num’ property, along with the actual text, so that I know their order and can later display them as a list in the correct order (1, 2, 3, …).
Now I want to take this one step further and be able to ‘record’ the structure of the book, like the table of contents. In other words, say the book is divided into chapters, and each chapter is further divided into sections. The first few paragraphs are found under Ch. 1 Sec. 1, then Ch. 1 Sec. 2, and so on all the way to Ch. n, Sec. m. What I’m not sure about is a good way to record this information. I've been told that I should use a database with SQL, but I'm not sure where to begin.
The implementation must allow me to ‘quickly’ determine the following two things at any point: (1) given a chapter and section number, what paragraphs are contained within this section? (2) given a paragraph number, which chapter and section is it under? It must also be flexible enough that I could reuse the same design in the future with few edits if the structure (depth-wise) of the book changes (e.g. sections are divided into subsections, etc.). Finally, it should be able to handle optional divisions (i.e. some sections have subsections while others do not).
This is for an iOS app and my code is written in Objective-C so far.

SQL would certainly be one possibility. If you follow this route, there is a certain trade-off between flexibility and ease of coding, which impacts maintainability. For example, if you build a fixed structure, say with some additional levels attempting to cater for the future, such as:
Book
Chapter
Section
Sub-section
Paragraph
you will have code with unambiguous references, such as section.fk_chapter, paragraph.fk_subSection, etc. This will make it easier to troubleshoot and build queries. However, you have the problem of having to refactor your code a fair amount if you wanted to add, say, sub-paragraphs or sub-sub-sections. Your UI will be simpler to code in this approach, as you always know which "level" you are working at. Alternatively, you can go for a hierarchical approach:
Book
Chapter
Content Item
Content Item
Content Item
....
where the contentItem table has a self-referencing foreign key. This has the big advantage of allowing any number of levels. An attribute on the Content Item could tell you the name and "type" of the level you are at, if needed. It is definitely much more flexible, but comes with some complexity in implementation and UI presentation: a column called contentItem.fk_contentItem referring to the parent level does not tell the coder where they are in the hierarchy, queries will be a bit more difficult to write, and the UI will have to cater for "any" number of levels. But on the other hand, these problems are not insurmountable, and many have gone before you on this route. A sketch of this approach follows.
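To make that concrete, here is a minimal sketch, assuming SQLite (which ships with iOS; recursive queries need SQLite 3.8.3 or later). All table and column names are invented for illustration:

    CREATE TABLE book (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );

    CREATE TABLE content_item (
        id         INTEGER PRIMARY KEY,
        book_id    INTEGER NOT NULL REFERENCES book(id),
        parent_id  INTEGER REFERENCES content_item(id), -- NULL at the top level
        level_name TEXT NOT NULL,    -- e.g. 'chapter', 'section', 'paragraph'
        num        INTEGER NOT NULL, -- order among siblings
        body       TEXT              -- paragraph text; NULL for structural nodes
    );

    -- (1) Paragraphs under a given chapter/section, at any depth:
    WITH RECURSIVE descendants(id) AS (
        SELECT s.id
        FROM content_item ch
        JOIN content_item s ON s.parent_id = ch.id
        WHERE ch.level_name = 'chapter' AND ch.num = :chapter_num
          AND s.level_name  = 'section' AND s.num  = :section_num
        UNION ALL
        SELECT c.id
        FROM content_item c
        JOIN descendants d ON c.parent_id = d.id
    )
    SELECT ci.num, ci.body
    FROM content_item ci
    JOIN descendants d ON ci.id = d.id
    WHERE ci.level_name = 'paragraph'
    ORDER BY ci.num;

    -- (2) Chapter and section for a given paragraph: walk up the ancestry.
    WITH RECURSIVE ancestors(id, parent_id, level_name, num) AS (
        SELECT id, parent_id, level_name, num
        FROM content_item
        WHERE id = :paragraph_id
        UNION ALL
        SELECT c.id, c.parent_id, c.level_name, c.num
        FROM content_item c
        JOIN ancestors a ON c.id = a.parent_id
    )
    SELECT level_name, num
    FROM ancestors
    WHERE level_name IN ('chapter', 'section');

Note that both queries keep working unchanged if sub-sections are later inserted between sections and paragraphs, which is the flexibility the question asks for.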
Your question is quite broad, so opinions will vary on the approach and the above is admittedly very general.

Related

Strategy for Requirements Traceability Matrix in DOORS

I want to create a traceability matrix in a new module that shows the object ID & text at the top level, then, in columns moving to the right, the object ID & text for the source of the first in-link, and then its in-links to the right, etc. If there is more than one in-link, the next source object will be shown on the next line (as a new object), with the higher-level object ID & text just repeated in the columns to the left. Basically, it is the recursive trace analysis layout DXL, but I want to spread the information out over separate columns.
My question is about best practices for the approach. Is it best to create a new module and write several DXL layout scripts (one per column), pulling info from all of the various modules, and then later converting it to text (so it isn't too heavy)? Or is it necessary (or easier) to actually create DXL attributes within each requirements module, and then pull information from there into my RTM module?
I'm likely over-complicating it, but any tips would be appreciated!
Well, one of our assets contains something that looks like your approach:
a script creates new modules that contain only trace information: a "report" module, which does not contain any links to the original "data" modules
there are two or three columns for each requirement level (high-level reqs at the left, low-level reqs at the right)
the advantage of this approach is that one can easily use standard DOORS filtering mechanisms to find "holes" in the matrix (requirements which have not been implemented, design elements without a requirement, etc.). Plus, as every report run creates a new report module with the date/time in its name, project progress can be made visible over time, and reports can be exported to Excel.
On the other hand the implementation took several weeks. So, I don't know if this approach would be feasible for you.

Database design for dictionary of words

(my reason for asking this question is based on having read this answer, which made me rethink my current setup)
I am currently developing a Ruby on Rails application in which there are many languages, each of which has a dictionary of base words attached to it, as well as a list of the words that map to each base word. The way I currently have it set up, there is a base_words table that contains the base_word as a string, along with the language_id as a foreign key. There is also a words table, each row of which contains a word string, along with the base_word_id as a foreign key. There is an indexed language_id on the words table as well, although I'm almost positive that this is superfluous given the language_id on base_words, so I'm planning to take it off (although this could be a bad assumption on my part).
In sum, contrary to the answer I mentioned at the beginning, the tables are not separated by language, because I've reasoned that I can simply pull out a language's words programmatically when the time comes. However, my application will also have translation(s) associated with each base word (as did the answer I referenced), and so I'm doubting my structure, due to the realization that each translation will actually be a base_word in the same table as itself, which means the translation would actually be just the id of another base word in that table. This may be completely fine, or it may not be - I have no clue (this is my first ever programming project).
Is this ok? Do I need to separate my base_words into separate tables for each language, or can I leave it all in one table?
Another example: I also need to store many phrases for each language, along with their translations. Should I have one table where each row has the appropriate translation of the phrase, or one table where each row contains simply one phrase and a language_id, or multiple tables (one for each language)?
Regards,
Michael
As in the other scenario, you'll have a translations table. There is no technical reason it couldn't have multiple foreign keys to base_words (a source_word_id and target_word_id, perhaps). So yes, you can absolutely store all your words in one table. There are some minor side effects involved with translations being directional relationships: it becomes possible to have translations which only work one way, and there will be many pairs of entries with opposite source and target. Neither of these is much of a worry: the first is even potentially desirable in order to represent words with double meanings in one language but not the other, and as for the second, space is cheap and indexing is easy.
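Sketched in plain SQL (Rails migrations would generate something equivalent; the types and constraints here are assumptions for illustration), that single-table design with directional translations might look like:

    CREATE TABLE languages (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(64) NOT NULL
    );

    CREATE TABLE base_words (
        id          INTEGER PRIMARY KEY,
        base_word   VARCHAR(255) NOT NULL,
        language_id INTEGER NOT NULL REFERENCES languages(id)
    );

    CREATE TABLE words (
        id           INTEGER PRIMARY KEY,
        word         VARCHAR(255) NOT NULL,
        base_word_id INTEGER NOT NULL REFERENCES base_words(id)
        -- no language_id: it is reachable through base_words
    );

    -- Directional source -> target pairs; store both directions explicitly.
    CREATE TABLE translations (
        source_word_id INTEGER NOT NULL REFERENCES base_words(id),
        target_word_id INTEGER NOT NULL REFERENCES base_words(id),
        PRIMARY KEY (source_word_id, target_word_id)
    );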
You are correct that you do not need words.language_id, so long as you always join base_words when you're querying words and the language matters. This obviously changes if you have a use case where it makes sense to leave base_words out, but that scenario sounds unlikely based on what you describe.
As for phrases: why should they be handled any differently than base_words?

Detecting HTML table orientation based only on table data

Given an HTML table with none of its cells marked up as <th> (header) cells, I want to automatically detect whether the table is a "vertical" table or a "horizontal" table.
For example:
[sample horizontal table: attribute names across the first row, one record per row]
[sample vertical table: attribute names down the first column, one record per column]
Of course, keep in mind that bold text, shading, and any other styling properties will not be available at classification time.
I was thinking of approaching this by statistical means: I can hand-write a couple of features like "if the first row has numbers but the first column doesn't, that's probably a vertical table", give each feature a score, and combine the scores to decide the table's orientation class.
Is that how you would approach such a problem? I haven't used any statistics-based algorithm before and I am not sure what would be optimal for such a problem.
This question is a bit confusing. You are asking about an ML method, but it seems you have not created training/cross-validation/test sets yet. Without a data preprocessing step, any discussion about the ML method is useless.
If I'm right and you haven't created datasets yet, give us more info on the data (when you look at one example, how do you know whether the table is vertical or horizontal? How much data do you have? Are you always sure whether a table is vertical or horizontal? ...).
If you have already created training/cross-validation/test sets, give us more details on what the training set looks like (what the features are, the number of examples, whether you need a white-box solution where you can see why an ML model gives a particular result, ...).
How general is the domain for the tables? I know some web table schema identification algorithms use types, properties, and instance data from a general knowledge base such as Freebase to attempt to identify the property associated with a column. You might try leveraging that knowledge in a classifier.
If you want to do this without any external information, you'll need a bunch of hand labelled horizontal and vertical examples.
You say "of course" the font information isn't available, but I wouldn't be so quick to dismiss this since it's potentially a source of very useful information. Are you sure you can't get your data from a little bit further back in the pipeline so that you can get access to this info?

Performance implications of a table with many fields

I have a table that is currently at 40 fields. A significant expansion of its capability now has it looking more like 100 fields.
What are the database and Rails performance implications of having a table with more fields? My understanding of relations is that they don't load the data until absolutely necessary, but would having so much more information slow down, say, a filtered index of these records (showing only the main 8-10 fields)?
The fields I'm specifically talking about adding are not relevant to any of my reports or most of my queries - they simply store data that is used on the back end.
Normalization is not a problem here (there are no fields like field1, field2, ..., for example). I know it's hard to answer these questions when posed in a qualitative manner, but is it likely better to build these 60 fields in this table, or should I create a separate 1-1 table for them?
Having a single table is not a big deal and makes things easier when it comes to queries. So if it's relevant, no need to split.
Still, you should only query what you need in your views, so use ActiveRecord's select (see the Rails documentation).
Yes, having a lot of fields will slow down access to the table; however, in general, not significantly enough that it matters for average data sizes. Most SQL databases arrange tables row by row, so on disk, first all 40 fields of row 1 are stored, then all 40 fields of row 2, and so on. This means that if you are only interested in retrieving the first 2 fields, you still read the other 38 fields and then jump to the next row that matches. This is not a big issue if you have only a few matching rows, but it might be if you have many matches that are also consecutive.
That said, I would still strongly advise against a table with 40 fields, except when there is a very good reason to do so (which you might have, but you give too few details to judge). In general, having that many fields indicates that some alternative design is called for. Definitely, if what I wrote above starts becoming an issue, you should group the fields according to the access patterns (so if fields 1-10 and 20, 24, 25, and 30 are normally accessed together, put those groups into separate tables). A sketch of such a 1-1 split follows.
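For instance (all names invented; a sketch of the idea, not a prescription for your schema):

    CREATE TABLE records (
        id     INTEGER PRIMARY KEY,
        name   VARCHAR(100) NOT NULL,
        status VARCHAR(16)  NOT NULL
        -- ... the 8-10 fields shown in the filtered index
    );

    CREATE TABLE record_details (
        record_id INTEGER PRIMARY KEY REFERENCES records(id),
        payload   TEXT,
        notes     TEXT
        -- ... the ~60 back-end-only fields
    );

    -- The filtered listing touches only the narrow table:
    SELECT id, name, status FROM records WHERE status = 'open';

    -- The wide data is joined in only when a single record is needed:
    SELECT r.*, d.*
    FROM records r
    JOIN record_details d ON d.record_id = r.id
    WHERE r.id = 42;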

Would like to Understand 6NF with an Example

I have just read #PerformanceDBA's arguments re: 6NF and E-A-V. I am intrigued. I had previously been skeptical of 6NF as it was presented as "merely" sticking some timestamp columns on tables.
I have always worked with a data dictionary and do not need to be convinced to use one, or to generate SQL code. So I expect an answer that would require a dictionary (or catalog) that is used to generate code.
So I would like to know how 6NF would deal with an extremely simple example. A table of items, descriptions and prices. The prices change over time.
So anyway, what does the Items table look like when converted to 6NF? What is the "explosion of tables" that happens here?
If the example does not work with a table this simple, feel free to add what is necessary to get the point across.
I actually started putting an answer together, but I ran into complications, because you (quite understandably) want a simple example. The problem is manifold.
First, I don't have a good idea of your level of actual expertise re Relational Databases and 5NF; I don't have a starting point from which to take up and then discuss the specifics of 6NF.
Second, just like any of the other NFs, it is variegated. You can just barely step into it; you can implement 6NF for certain tables; you can go the whole hog on every table; etc. Sure, there is an explosion of tables, but then you Normalise that, and kill the explosion; that's an advanced or mature implementation of 6NF. There is no use providing the full or partial levels of 6NF when you are asking for the simplest, most straightforward example.
I trust you understand that some tables can be "in 5NF" while others are "in 6NF".
So I put one together for you. But even that needs explanation.
Now SQL barely supports 5NF, it does not support 6NF at all (I think dportas says the same thing in different words). Now I implement 6NF at a deep level, for performance reasons, simplified pivoting (of entire tables; any and all columns, not the silly PIVOT function in MS), columnar access, etc. For that you need a full catalogue, which is an extension to the SQL catalogue, to support the 6NF that SQL does not support, and maintain data Integrity and business Rules. So, you really do not want to implement 6NF for fun, you only do that if you have a need, because you have to implement a catalogue. (This is what the EAV crowd do not do, and this is why most EAV systems have data integrity problems. Most of them do not use the declarative Referential & Data Integrity that SQL does have.)
But most people who implement 6NF don't implement the deeper level, with a full catalogue. They have simpler needs, and thus implement a shallower level of 6NF. So, let's take that, to provide a simple example for you. Let's start with an ordinary Product table that is declared to be in 5NF (and let's not argue about what 5NF is). The company sells various different kinds of Products, half the columns are mandatory, and the other half are optional, meaning that, depending on the Product Type, certain columns may be Null. While they may have done a good job with the database, the Nulls are now a big problem: columns that should be Not Null for certain ProductTypes are Null, because the declaration states NULL, and their app code is only as good as the next guy's.
So they decide to go with 6NF to fix that problem, because the subtitle of 6NF states that it eliminates The Null Problem. Sixth Normal Form is the irreducible Normal Form, there will be no further NFs after this, because the data cannot be Normalised further. The rows have been Normalised to the utmost degree. The definition of 6NF is:
a table is in 6NF when the row contains the Primary Key and, at most, one attribute.
Notice that by that definition, millions of tables across the planet are already in 6NF, without having had that intent. Eg. typical Reference or Look-up tables, with just a PK and Description.
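For instance, a table like the following (names invented) already satisfies that definition, being the key plus a single attribute:

    CREATE TABLE ProductType (
        ProductTypeCode CHAR(2)     NOT NULL PRIMARY KEY,
        Description     VARCHAR(50) NOT NULL
    );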
Right. Well, our friends look at their Product table, which has eight non-key attributes, so if they make the Product table 6NF, they will have eight sub-Product tables. Then there is the issue that some columns are Foreign Keys to other tables, and that leads to more complications. And they note the fact that SQL does not support what they are doing, and they have to build a small catalogue. Eight tables are correct, but not sensible. Their purpose was to get rid of Nulls, not to write a little subsystem around each table.
Simple 6NF Example
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find IDEF1X Notation useful in order to interpret the symbols in the example.
So typically, the Product Table retains all the Mandatory columns, especially the FKs, and each Optional column, each Nullable column, is placed in a separate sub-Product table. That is the simplest form I have seen. Five tables instead of eight. In the Model, the four sub-Product tables are "in 6NF"; the main Product table is "in 5NF".
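A hypothetical rendering in SQL (the optional attributes here, Weight and Voltage, are invented; the original answer's model is a diagram, not this DDL):

    CREATE TABLE Product (
        ProductCode     CHAR(8)     NOT NULL PRIMARY KEY,
        ProductTypeCode CHAR(2)     NOT NULL REFERENCES ProductType(ProductTypeCode),
        Name            VARCHAR(50) NOT NULL
        -- ... the other Mandatory columns, including the FKs
    );

    -- One sub-Product table per Optional column. The column itself is
    -- NOT NULL; optionality is expressed by the row's existence.
    CREATE TABLE ProductWeight (
        ProductCode CHAR(8)      NOT NULL PRIMARY KEY REFERENCES Product(ProductCode),
        Weight      DECIMAL(7,3) NOT NULL
    );

    CREATE TABLE ProductVoltage (
        ProductCode CHAR(8)  NOT NULL PRIMARY KEY REFERENCES Product(ProductCode),
        Voltage     SMALLINT NOT NULL
    );

    -- ... two more sub-Product tables along the same lines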
Now we really do not need every code segment that SELECTs from Product to have to figure out what columns it should construct, based on the ProductType, etc, so we supply a View, which essentially provides the 5NF "view" of the Product table cluster.
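Continuing the invented names from the sketch above, such a view might look like:

    CREATE VIEW Product_v AS
    SELECT  p.ProductCode,
            p.ProductTypeCode,
            p.Name,
            w.Weight,   -- NULL when no ProductWeight row exists
            v.Voltage   -- NULL when no ProductVoltage row exists
    FROM    Product p
    LEFT JOIN ProductWeight  w ON w.ProductCode = p.ProductCode
    LEFT JOIN ProductVoltage v ON v.ProductCode = p.ProductCode;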
The next thing we need is the basic rudiments of an extension to the SQL catalog, so that we can ensure that the rules (data integrity) for the various ProductTypes are maintained in one place, in the database, and not dependent on app code. The simplest catalogue you can get away with. That is driven off ProductType, so ProductType now forms part of that Metadata. You can implement that simple structure without a catalogue, but I would not recommend it.
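As a purely hypothetical illustration of "the simplest catalogue you can get away with" (not the structure the answer's author actually uses), a single metadata table keyed on ProductType could record which optional attributes each type requires; generated constraints or validation code then read this instead of hard-coding the rules in the app:

    CREATE TABLE ProductTypeAttribute (
        ProductTypeCode CHAR(2)     NOT NULL REFERENCES ProductType(ProductTypeCode),
        AttributeTable  VARCHAR(30) NOT NULL, -- e.g. 'ProductWeight'
        IsRequired      CHAR(1)     NOT NULL CHECK (IsRequired IN ('Y', 'N')),
        PRIMARY KEY (ProductTypeCode, AttributeTable)
    );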
Update
It is important to note that I implement all Business Rules in the database. Otherwise it is not a database (the notion of implementing rules "in application code" is hilarious in the extreme, especially nowadays, when we have florists working as "developers"). Therefore all rules, etc are first and foremost implemented as SQL declarations, CHECK constraints, functions, etc. That preserves all Declarative Referential Integrity, and declarative Data Integrity. The extension to the SQL catalog covers the area that SQL does not have declarations for, and they are then implemented as SQL. Being a good data dictionary, it does much more. Eg. I do not write Views every time I change the tables or add or change columns or their characteristics, they are created directly from the catalog+extension using a simple code generator.
One more very important note. You cannot implement 6NF (or EAV properly, for that matter), without completing a full and faithful Normalisation exercise, to 5NF. The problem I see at every site is, they don't have a genuine 5NF state, they have a mish-mash of partial normalisation or no normalisation at all, but they are very attached to that. Creating either 6NF or EAV from that is a disaster. Creating EAV or 6NF from that without all business rules implemented in declarative SQL is a nuclear disaster, burning for years. You get what you pay for.
End update.
Finally, yes, there are at least four further levels of Normalisation (Normalisation is a Principle, not a mere reference to a Normal Form) that can be applied to that simple 6NF Product cluster, providing more control, fewer tables, etc. The deeper we go, the more extensive the catalogue. And higher levels of performance. When you are ready, just ask; I have already erected the models and posted details in other answers.
In a nutshell, 6NF means that every relation consists of a candidate key plus no more than one other (key or non-key) attribute. To take up your example, if an "item" is identified by a ProductCode and the other attributes are Description and Price then a 6NF schema would consist of two relations (* denotes the key in each):
ItemDesc {ProductCode*, Description}
ItemPrice {ProductCode*, Price}
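Declared as SQL tables (the data types are assumptions), a sketch:

    CREATE TABLE ItemDesc (
        ProductCode CHAR(8)      NOT NULL PRIMARY KEY,
        Description VARCHAR(100) NOT NULL
    );

    CREATE TABLE ItemPrice (
        ProductCode CHAR(8)       NOT NULL PRIMARY KEY,
        Price       DECIMAL(10,2) NOT NULL
    );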
This is potentially a very flexible approach because it minimises the dependencies. That's also its main disadvantage however, especially in a SQL database. SQL makes it hard or impossible to enforce many multi-table constraints. Using the above schema, in most cases it will not be possible to enforce a business rule that every product must always have a description AND a price. Similarly, you may not be able to enforce some compound keys that ought to apply (because their attributes could be split over multiple tables).
So in considering 6NF you have to weigh up which dependencies and integrity rules are important to you. In many cases you may find it more practical and useful to stick to 5NF and normalize no further than that.
I had previously been skeptical of 6NF as it was presented as "merely" sticking some timestamp columns on tables.
I'm not quite sure where this apparent misconception comes from. Perhaps the fact that 6NF was introduced in the book "Temporal Data and the Relational Model" by Date, Darwen and Lorentzos? Anyhow, I hope the other answers here have clarified that 6NF is not limited to temporal databases.
The point I wanted to make is, although 6NF is "academically respectable" and always achievable, it may not necessarily lead to the optimal design in every case (and not just when considering implementation using SQL either). Even the aforementioned discoverers and proponents of 6NF seem to agree e.g.
Chris Date: "For practical purposes, stick to 5NF (and 6NF)."
Hugh Darwen: "the 6NF decomposition around Date [not the person!] would be overkill... an optimal design for the soccer club is... 5-and-a-bit-NF!"
Hugh Darwen: "we are in 5NF but not in 6NF, and again 5NF is sufficient" (several similar examples).
Then again, I can also find evidence to the contrary:
Chris Date: "Darwen and I have both felt for some time that all base relvars should be in 6NF".
On a practical note, I recently extended the SQL schema of one of our products to add a minor feature. I adopted 6NF to avoid nullable columns and ended up with six new tables where most (all?) of my colleagues would have used one table (or perhaps extended an existing table) with nullable columns. Despite me providing several 'helper' stored procs and a 'denormalized' VIEW with INSTEAD OF triggers, every coder that has had to work with this feature at the SQL level has gone out of their way to curse me :)
These guys have it down: Anchor Modeling. Great academic papers on the subject, combined with practical examples. Their writings have finally pushed me over the edge to consider building a DW in 6NF on an upcoming project. The POC work I have done has validated (for me, at least) that the enormous benefits of 6NF do outweigh the costs.
