Are supercolumns of a row all from the same node - storage

I have a super column family, and each row will have thousands of supercolumns. My question is: if I do a query against a row, will all supercolumns of the row be returned from the same node? So in general, it's more a question of whether the whole data of any given row of a Cassandra column family is stored as a whole. I do understand that different rows of the column family might come from different nodes.

"All data for a single row must fit (on disk) on a single machine in the cluster. Because row keys alone are used to determine the nodes responsible for replicating their data, the amount of data associated with a single key has this upper bound."
http://wiki.apache.org/cassandra/CassandraLimitations

Super column families work the same as column families: if you have a replication factor of N, then eventually N machines will each store an entire copy of that row, which means all of its columns and super columns.

Related

Alasql Best Practices

I'm new to alasql (which is amazing). While the documentation shows you how, it doesn't provide a lot of information on best practices.
To date I have simply been running queries against an array of arrays (of JS objects). I haven't created a database object or table objects.
Are there performance (speed, memory, other) benefits of using database and table objects over an array of arrays?
Here is a real world example. I have 2 sets of data that I am loading: Employees (10 columns) and Employee Sales (5 columns), which are joined on an EmployeeID column. Employees will be relatively small (say, 100 rows), whereas Employee Sales will have 10,000 records. My current approach is to simply run a query where I join those 2 sets of data together and end up with one big result set: 10,000 rows of data with 14 columns per row (repeating every column in the Employee data set), which I then pull data from using dynamic filters, interactivity, etc.
This big data set is stored in memory the whole time, but it has the advantage that I don't need to rerun the query over and over. Alternatively, I could simply run the join against the 2 data sets each time I need it, then remove the result from memory afterwards.
Also, if I am joining together multiple tables, can I create indexes on the join columns to speed up performance? I see examples where indexes are created, but there is nothing else in the documentation. (Nothing on this page: https://github.com/agershun/alasql/wiki/Sql). What is the memory impact of indexes? What are the performance impacts of insertions?
Primary keys are supported, but there is no documentation. Does this create an index?
Are there performance (speed, memory, other) benefits of using database and table objects over an array of arrays?
If you put indexes on your tables, then yes, you get performance benefits. How much depends on your data.
if I am joining together multiple tables, can I create indexes on the join columns to speed up performance?
Yes. The same goes for any other column you put into a WHERE condition.
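To make the comparison concrete, here is a rough sketch of the "database and table objects" approach, written as the SQL you would pass to alasql(). The table and column names (Employees, EmployeeSales, EmployeeID) follow the example in the question; the exact type names and index behaviour should be checked against the alasql wiki.

-- Sketch only: statements as they would be passed to alasql('...').
CREATE DATABASE hr;          -- hypothetical database name
USE hr;
CREATE TABLE Employees     (EmployeeID INT PRIMARY KEY, Name STRING);     -- ~100 rows (10 columns in practice)
CREATE TABLE EmployeeSales (SaleID INT, EmployeeID INT, Amount NUMBER);   -- ~10,000 rows (5 columns in practice)

-- Index the join/filter column; this is where the speed benefit comes from.
CREATE INDEX idx_sales_employee ON EmployeeSales(EmployeeID);

-- Run the join on demand instead of holding the 14-column result set in memory.
SELECT e.*, s.SaleID, s.Amount
  FROM Employees e
  JOIN EmployeeSales s ON s.EmployeeID = e.EmployeeID;

With indexed tables, rerunning the join when needed may be cheap enough that you can drop the big joined result set instead of keeping it in memory the whole time.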

Fact table design guidance for 100s of facts

I'm trying to create a datamart for a healthcare application. The facts in the datamart are basically going to be measurements and findings related to the heart, and we have hundreds of them, starting from 1,000 and going up to as many as 20,000 per exam type.
I'm wondering what my design choices for the fact tables are:
Grain: 1 row per patient per exam type.
Some of the choices that I can think of -
1) A big wide fact table with 1000 or more columns.
2) EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
4) any other ideas?
Any inputs would be appreciated.
1. A big wide fact table with 1000 or more columns.
One very wide fact table gives the end user maximum flexibility if queries are executed directly in the data warehouse. However, some considerations should be taken into account, as you might hit limits depending on the platform.
SQL Server 2014 limits are as follows:
Bytes per row: 8,060. Row-overflow storage might be a solution; however, it supports only a few column types that are typically not of a fact nature, i.e. varchar, nvarchar, varbinary, sql_variant, and it is not supported in In-Memory OLTP. https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx
Columns per non-wide table: 1,024. Wide tables and sparse columns are a solution (sketched below), as the limit for columns per wide table is 30,000. However, the same bytes-per-row limit applies. https://technet.microsoft.com/en-us/library/cc280604(v=sql.120).aspx
Columns per SELECT/INSERT/UPDATE statement: 4,096
Non-clustered indexes per table: 999
https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx
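As an illustration of the wide-table route only, here is a hedged sketch of a wide fact table that uses SQL Server sparse columns and a column set; the table and column names are hypothetical.

-- Hypothetical wide fact table using sparse columns (sketch, not a recommendation).
CREATE TABLE FactExamWide
(
    PatientKey      INT NOT NULL,
    ExamTypeKey     INT NOT NULL,
    ExamDateKey     INT NOT NULL,
    Measurement0001 DECIMAL(9,3) SPARSE NULL,   -- sparsely populated measures
    Measurement0002 DECIMAL(9,3) SPARSE NULL,
    -- ... one column per measurement, up to the 30,000 columns-per-wide-table limit;
    -- the bytes-per-row limit still applies to the values actually populated.
    AllMeasurements XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);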
2. EAV based design - A separate Measure dimension table. This foreign key will go into the fact table and the measure value will be in fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
According to Kimball, the EAV design is called Fact Normalization. It may make sense when the number of measurements is extremely large but sparsely populated for a given fact, and no computations are made between facts.
Because the facts are normalized:
Extensibility is very easy, i.e. it's easy to add new measurements without the need to amend the data structure.
It works well for extracting all measurements for one exam and presenting the measurements as rows on the screen.
It's hard to extract/aggregate/compute across several measurements (e.g. the average HDL to CHOL ratio) and present measurements/aggregates/computations as columns, i.e. it requires complex WHERE/PIVOTING or multi-joins. SQL makes it difficult to perform computations between facts in different rows.
If the primary end-user platform is an OLAP cube, then Fact Normalization makes sense; cubes allow computations across any dimension.
Data importing could be an issue if the data arrives in a flat CSV format.
This question is also discussed here: Should I use EAV model?
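For comparison, here is a hedged sketch of the normalized (EAV-style) fact table and the kind of conditional-aggregation/pivot query that the computation point above refers to. All table names, column names, and key values are hypothetical.

-- Normalized fact: 1 row per patient per exam type per measurement (sketch).
CREATE TABLE FactMeasurement
(
    PatientKey   INT NOT NULL,
    ExamTypeKey  INT NOT NULL,
    MeasureKey   INT NOT NULL,        -- FK to a Measure dimension (name, unit, group/subgroup)
    MeasureValue DECIMAL(9,3) NULL
);

-- Computing across two measurements now requires pivoting rows into columns first,
-- e.g. the HDL to CHOL ratio mentioned above (keys 101/102 are made up):
SELECT PatientKey,
       MAX(CASE WHEN MeasureKey = 101 THEN MeasureValue END)
     / MAX(CASE WHEN MeasureKey = 102 THEN MeasureValue END) AS HdlToCholRatio
FROM   FactMeasurement
GROUP  BY PatientKey;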
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
In some scenarios, multiple smaller fact tables make perfect sense. One reason is hitting a physical limit set by the platform, e.g. bytes per row.
The facts could be grouped either by subject area, e.g. measurement group/subgroup, or by frequency of usage. Each table could be placed on a separate filegroup and drive to maximize I/O.
Further, you could duplicate measurements across different fact tables to reduce the need for fact-table joins, i.e. put one measurement both in a specific measurement-subgroup fact table and in a frequently-used-measurements fact table.
However, some considerations should be taken into account if there are specific requirements for data loading. For example, if a record errors out in your ETL to one fact table, you might want to make sure that the corresponding records in the other fact tables are deleted and staged to your error table so you don't end up with any bogus information. This is especially true if end users have their own calculations in the front-end tool.
If you use OLAP cubes, then each of the multiple fact tables simply becomes the source of its own measure group.
In terms of fact-to-fact joins, you (or the BI application) should never issue SQL that joins two fact tables together across the fact tables' foreign keys. Instead, the technique of Drilling Across two fact tables should be used, where the answer sets from two or more fact tables are separately created, and the results are sort-merged on the common row header attribute values to produce the correct result.
More on this topic: http://www.kimballgroup.com/2003/04/the-soul-of-the-data-warehouse-part-two-drilling-across/
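A hedged sketch of what drilling across looks like in SQL: each fact table is aggregated separately to the common row header (here a hypothetical PatientKey), and only the small answer sets are merged. The fact tables and measures (FactEcho, FactLab, EjectionFraction, Cholesterol) are made-up names for illustration.

-- Drill-across sketch: build each answer set separately at the common grain,
-- then merge on the shared row header (never join the fact rows directly).
WITH EchoAnswerSet AS
(
    SELECT PatientKey, AVG(EjectionFraction) AS AvgEF
    FROM   FactEcho
    GROUP  BY PatientKey
),
LabAnswerSet AS
(
    SELECT PatientKey, AVG(Cholesterol) AS AvgChol
    FROM   FactLab
    GROUP  BY PatientKey
)
SELECT COALESCE(e.PatientKey, l.PatientKey) AS PatientKey,
       e.AvgEF,
       l.AvgChol
FROM      EchoAnswerSet e
FULL JOIN LabAnswerSet  l ON l.PatientKey = e.PatientKey;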
4) any other ideas?
SQL XML or some kind of NoSQL could be an option, but the same querying/aggregation/computation/presentation issues exist.

Table Normalization with no Domain values

There is a debate between our ETL team and a Data Modeler on whether a table should be normalized or not, and I was hoping to get some perspective from the online community.
Currently the tables are set up as follows:
MainTable            LookupTable
  PrimaryKey (PK)      Code (PK)
  Code (FK)            Name
  OtherColumns
Both tables are only being populated by a periodic file (from a 3rd party) through an ETL job.
A single record in the file contains all attributes for both tables for a single row.
The file populating these tables is a delta (only rows with some change in them are in the file).
One change to one attribute for one record (again, only by the 3rd party) will result in all the data for that record being in the file.
The Domain Values for Code and Name are not known.
Question: Should the LookupTable be denormalized into MainTable?
ETL team: Yes. With this setup, every row from the file will first have to check the 2nd table to see if its FK is in there (insert if it is not), then add the MainTable row. More Code, Worse Performance, and yes slightly more space. However, regardless of a change to a LookupTable.Name from the 3rd party, the periodic file will reflect every row affected, and we will still have to parse through each row. If lumped into MainTable, all it is is a simple update or insert.
Data Modeler: This is standard good database design.
Any thoughts?
Build prototypes. Make measurements.
You started with this, which your data modeler says is a standard good database design.
MainTable            LookupTable
  PrimaryKey (PK)      Code (PK)
  Code (FK)            Name
  OtherColumns
He's right. But this, too, is a good database design.
MainTable
  PrimaryKey (PK)
  Name
  OtherColumns
If all updates to these tables come only from the ETL job, you don't need to be terribly concerned about enforcing data integrity through foreign keys. The ETL job would add new names to the lookup table anyway, regardless of what their values happen to be. Data integrity depends mainly on the system the data is extracted from. (And the quality of the ETL job.)
With this setup, every row from the file will first have to check the 2nd table to see if its FK is in there (insert if it is not), then add the MainTable row.
If they're doing row-by-row processing, hire new ETL guys. Seriously.
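For reference, a hedged sketch of the set-based alternative, assuming the delta file is first bulk-loaded into a staging table (StagingTable is a made-up name) and that the DBMS supports MERGE (e.g. SQL Server):

-- Apply the whole delta file in two set-based statements (sketch).

-- 1. Add any codes/names that are not yet in the lookup table.
INSERT INTO LookupTable (Code, Name)
SELECT DISTINCT s.Code, s.Name
FROM   StagingTable s
WHERE  NOT EXISTS (SELECT 1 FROM LookupTable l WHERE l.Code = s.Code);

-- 2. Upsert the main rows.
MERGE MainTable AS m
USING StagingTable AS s
   ON m.PrimaryKey = s.PrimaryKey
WHEN MATCHED THEN
    UPDATE SET m.Code = s.Code             -- plus the other columns
WHEN NOT MATCHED THEN
    INSERT (PrimaryKey, Code)              -- plus the other columns
    VALUES (s.PrimaryKey, s.Code);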
More Code, Worse Performance, and yes slightly more space.
They'll need a little more code to update two tables instead of one. How long does it take to write the SQL statements? How long to run them? (How long each way?)
Worse performance? Maybe. Maybe not. If you use a fixed-width code, like an integer or char(3), updates to the codes won't affect the width of the row. And since the codes are shorter than the names, more rows might fit in a page. (It doesn't make any sense to use a code that is longer than the name.) More rows per page usually means less I/O.
Less space, surely. Because you're storing a short code instead of a long name in every row of "MainTable".
For example, the average length of a country name is about 11.4 characters. If you used 3-character ISO country codes, you'd save an average of 8.4 bytes per row in "MainTable". For 100 million rows, you save about 840 million bytes. The size of that lookup table is negligible, about 6k.
And you don't usually need a join to get the full name; country codes are intended to be human-readable without expansion.

Is there a way to union two Google Fusion tables into one view, or insert rows from one table to another?

I have multiple tables that are quite large, and are updated in bulk. It would be extremely useful if I could work in smaller chunks at my end, then combine them at the Google end.
A view with a union would solve this, as would the ability to insert from another table into a common table. Do such functions exist?
Unfortunately, no. Here's the feature request for it: https://code.google.com/p/fusion-tables/issues/detail?id=930
I guess it's worth trying a hack like:
Create a master table "MasterTable" with a column "Group" and values A, B, C. That's 3 rows in total.
Have table MySubsetA containing a subset of 20 rows of your data. Add a column "MasterGroup" to it and set the value in all the rows to be 'A'.
Have table MySubsetB containing a different subset of 20 rows of your data. Add a column "MasterGroup" to it and set the value in all the rows to be 'B'.
Same for Table MySubsetC with 20 rows of 'C'.
Then merge all the tables, starting with MasterTable.
I'd be interested to know if the behaviour would be to end up with 60 rows due to the duplicate matches on 20 As, 20 Bs and 20 Cs.
Even if that worked, it only allows for a very simplistic union of the tables. Without being able to manipulate the column names from the source table to the target view, there are very few use cases that will be covered.
In any case, you're almost certainly best off maintaining your data in a single table and building your own tools (in JavaScript, most probably) to help you manipulate the data as if it were a group of subsets.
If you're using Fusion Tables just to get entries on a Google Map, you could consider unioning at the map layer and having multiple Fusion Table layers. This is the wrong conceptual layer to union at, of course, but if it fits your use case nicely it's an option.
Subject to the columns in the data sources being similar, you can achieve this by importing more rows into your fusion table (File->Import More Rows). This works, although, unlike SQL, it may not offer the ability to limit duplicates, etc.
In the Fusion Tables UI there is a menu item named "Merge" which allows you to merge two fusion tables into one.

Mnesia table replication/sharing

Assume that we have N Erlang nodes running the same application. I want to share an mnesia table T1 with all N nodes, which I see no problem with. However, I want to share another mnesia table T2 between pairs of nodes; I mean the contents of T2 will be identical and replicated only within each sharing pair. In other words, I want N/2 different contents for the T2 table. Is this possible with mnesia, without renaming T2 for each distinct pair of nodes?
It's possible to do this with mnesia's table fragmentation, if one makes use of the mnesia_frag_hash callback behaviour. This allows you to control the distribution of keys, and it would be possible to construct the keys such that the callback is able to determine which node pair (and thus, which fragment) should be used.
Whether or not this works in your particular case depends on your access patterns and data set. Chances are that it's a pretty convoluted approach, and that you'd be better served by simply using different table names instead.
One table is always one table, no matter how many nodes you share it with. If you want pairs of nodes sharing a table, you would have to create a unique table for each pair of nodes.
You can use the same settings (records etc.) for all those tables though, so it shouldn't be much more work to get it done.
