I am looking to implement a Pub/Sub and Dataflow connection from my App Engine app to BigQuery, and I am trying to understand how exactly to define it.
My problem is that I have a daily table on BigQuery, i.e. a new table that is created once a day.
When I try to set up the Dataflow job, it only gives me the option to choose a single table.
Based on what you said in the comment section, you could use Daily Sharded tables or Time/day Partitioned tables.
According to the documentation, you can stream into both types. However, I must point out some differences you would have to consider.
Time/Day Partitioned Tables:
These tables are internally divided into segments/partitions, which are easier to manage and improve query performance. You can find more information about it here.
There are quotas, such as the maximum number of partitions per table, which you should check to make sure they meet your needs.
When querying day/time partitioned tables you can use the pseudo columns _PARTITIONTIME or _PARTITIONDATE; each one has its own format, which you can read more about here (see the sketch after this list).
You can stream individual rows using insertAll requests.
According to the documentation, partitioned tables perform better than sharded tables because BigQuery does not need to maintain a copy of the schema and metadata, or verify permissions, for each table.
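As a rough illustration of the last two points, here is a minimal Python sketch using the google-cloud-bigquery client (assuming a reasonably recent version of the library): it streams a row into an ingestion-time partitioned table and then queries it via the _PARTITIONTIME pseudo column. The project, dataset, table and field names are made up for the example.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.daily_events"  # hypothetical partitioned table

# Stream a single row (insertAll under the hood); returns a list of per-row errors.
errors = client.insert_rows_json(table_id, [{"user_id": "u1", "amount": 10.0}])
if errors:
    print("insert errors:", errors)

# Query only yesterday's partition using the pseudo column.
sql = """
    SELECT user_id, SUM(amount) AS total
    FROM `my-project.my_dataset.daily_events`
    WHERE _PARTITIONTIME = TIMESTAMP(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    GROUP BY user_id
"""
for row in client.query(sql).result():
    print(row.user_id, row.total)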
Daily Sharded Tables:
There is no pseudo column you can use to manage/query your database.
There is no limit on the number of tables you can create; you can read more about the quotas here.
You create daily tables using templates, such as <targeted_table_name> + <templateSuffix>, all with the same schema.
If you choose partitioned tables, you could create a date-partitioned table and stream into it. If you prefer sharded tables, you can use a template table to create the daily tables, as in the sketch below.
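If you go the sharded route, the streaming API's templateSuffix option creates the daily table from a template automatically. Below is a hedged Python sketch; the template_suffix argument is how the Python client exposes that option (verify it against the client version you use), and the table and field names are hypothetical.

import datetime
from google.cloud import bigquery

client = bigquery.Client()
base_table = "my-project.my_dataset.daily_events"  # hypothetical template/base table
suffix = datetime.date.today().strftime("_%Y%m%d")

# Streams into daily_events_YYYYMMDD, creating it from the base table's schema if needed.
errors = client.insert_rows_json(
    base_table,
    [{"user_id": "u1", "amount": 10.0}],
    template_suffix=suffix,
)
if errors:
    print("insert errors:", errors)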
In addition, I would encourage you to read more about the differences and characteristics of each one here.
Related
I have to implement a system where a tenant can store multiple key-value stores. One key-value store can have a million records, and there will be multiple columns in one store.
[Edited] I have to store tabular data (list with multiple columns) like Excel where column headers will be unique and have no defined schema.
This will be a kind of static data (eventually updated).
We will provide a UI to handle those updates.
Every tenant would like to store multiple table-structured datasets, which they have to reference in different applications, and the contract will be JSON only.
For example, an organization/tenant wants to store their employee list or country-state list, and there are some custom lists that are specific to the product; this data runs into millions of records.
A simple solution is to use SQL, but here the schema is not defined up front, it is user-defined. Although I have handled this in SQL, there are some performance issues, so I want to choose a NoSQL DB that suits this requirement better.
Design Constraints:
Get API latency should be minimal.
We can assume the Pareto rule (80:20): 80% read calls and 20% writes, so it is a read-heavy application.
Users can update a single record or a single column.
Users can query based on some column value, so we need to implement indexes on multiple columns.
It's schema-less, so we can simply assume it is NoSQL. SQL also supports JSON, but it is very hard to update a single row, and we cannot define indexes on dynamic columns.
I want to segregate key-value stores per tenant; no list will be shared between tenants.
One Key-Value Store:
Another key value store example: https://datahub.io/core/country-list
I am thinking of Cassandra or another wide-column database. We could also consider a document database (MongoDB), where every collection can be a key-value store, or Amazon DynamoDB.
Cassandra allows you to partition data by partition key, but in my use case I may want to fetch data by different columns; in Cassandra that means querying all partitions, which will be expensive.
Your example data shows duplicate items, which is not something NoSQL databases can store, since keys must be unique.
DynamoDB can handle this scenario quite efficiently; it's well suited for high read activity and delivers consistent single-digit-millisecond latency at any scale. One caveat of DynamoDB compared to the others you mention is the 400 KB item size limit.
In order to get top performance from DynamoDB, you have to utilize the Partition key as much as possible, because it provides you with hash-based access (super fast).
It's obvious that a unique identifier for the user (username?) should be present in the PK, but if there is another field that you always have at request time, like the country for example, you should include it in the PK as well.
Like so:
PK                                      SK
Username#S2#Country#US#State#Georgia    Address#A1
It might be worth storing a mapping for the countries alone so you can retrieve them before executing the heavy query. You cannot have more than 20 global secondary indexes per table (the default quota), so keep that in mind and reuse/overload indexes and keys as much as possible.
Stick to a single-table design to make the most of this.
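To make the partition-key idea concrete, here is a small boto3 sketch of a single-table query using a composite key like the one above. The table name, key attribute names and key values are assumptions for the example, not a prescription.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("TenantData")  # hypothetical single table

# Hash-based access on the partition key, then narrow by sort-key prefix.
resp = table.query(
    KeyConditionExpression=(
        Key("PK").eq("Username#S2#Country#US#State#Georgia")
        & Key("SK").begins_with("Address#")
    )
)
for item in resp["Items"]:
    print(item)

Because the partition key is hashed, this read stays fast regardless of how many other tenants or lists live in the same table.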
As mentioned by Lee Hannigan, duplicated elements are not supported; all keys (including those of the indexes) must be unique pairs.
I'm trying to create a data mart for a healthcare application. The facts in the data mart are basically going to be measurements and findings related to the heart, and we have hundreds of them, starting from 1,000 and going up to as many as 20,000 per exam type.
I'm wondering what my design choices for the fact tables are:
Grain: 1 row per patient per exam type.
Some of the choices that I can think of -
1) A big wide fact table with 1000 or more columns.
2) EAV-based design - a separate Measure dimension table. Its foreign key will go into the fact table, and the measure value will be in the fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
4) any other ideas?
Any inputs would be appreciated.
1. A big wide fact table with 1000 or more columns.
One very wide fact table gives the end user maximum flexibility if queries are executed directly against the data warehouse. However, some considerations should be taken into account, as you might hit limits depending on the platform.
SQL Server 2014 limits are as follows:
Bytes per row: 8,060. Row-overflow storage might be a solution; however, it supports only a few column types that are typically not of a fact nature, i.e. varchar, nvarchar, varbinary, sql_variant. It is also not supported in In-Memory OLTP. https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx
Columns per non-wide table: 1,024. Wide tables and sparse columns are a solution, as the limit for columns per wide table is 30,000. However, the same bytes-per-row limit applies. https://technet.microsoft.com/en-us/library/cc280604(v=sql.120).aspx
Columns per SELECT/INSERT/UPDATE statement: 4,096
Non-clustered indexes per table: 999
https://technet.microsoft.com/en-us/library/ms143432(v=sql.120).aspx
2. EAV-based design - a separate Measure dimension table. Its foreign key will go into the fact table, and the measure value will be in the fact table. So the grain of the fact table will be changed to 1 row per patient per exam type per measurement.
According to Kimball, this EAV design is called fact normalization. It may make sense when the list of measurements is extremely long but sparsely populated for a given fact, and when no computations are made between facts.
Because the facts are normalized:
Extensibility is very easy, i.e. it's easy to add new measurements without amending the data structure.
It's easy to extract all measurements for one exam and present them as rows on the screen.
It's hard to extract/aggregate/compute across several measurements (e.g. the average HDL to CHOL ratio) and present them as columns, i.e. it requires complex WHERE/PIVOT logic or multiple joins. SQL makes it difficult to compute between facts in different rows.
If the primary end-user platform is an OLAP cube, then fact normalization makes sense. Cubes allow computations across any dimension.
Data import could be an issue if the source data comes in a flat CSV format.
This question is also discussed here: Should I use EAV model?
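To make the row-to-column problem concrete, here is a small pandas sketch (my own illustration, not from the sources above) that pivots EAV-style fact rows back into columns and computes the HDL to CHOL ratio mentioned earlier. The column names are hypothetical.

import pandas as pd

# One row per patient per exam type per measurement (the normalized grain).
fact = pd.DataFrame({
    "patient_id":    [1, 1, 2, 2],
    "exam_type":     ["lipid", "lipid", "lipid", "lipid"],
    "measure_name":  ["HDL", "CHOL", "HDL", "CHOL"],
    "measure_value": [50.0, 180.0, 40.0, 200.0],
})

# Pivot measurements into columns so cross-measurement computations become simple.
wide = (fact.pivot_table(index=["patient_id", "exam_type"],
                         columns="measure_name",
                         values="measure_value")
            .reset_index())
wide["HDL_to_CHOL"] = wide["HDL"] / wide["CHOL"]
print(wide)

In SQL the same reshaping requires PIVOT or self-joins, which is exactly the friction described above.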
3) Create smaller multiple fact tables per exam type per some other criteria like subgroup. But the end user is going to query across subgroups for that exam type and fact-fact join is not recommended.
In some scenarios multiple smaller fact tables make perfect sense. One reason is hitting a physical limit set by the platform, e.g. bytes per row.
The facts could be grouped either by subject area, e.g. measurement group/subgroup, or by frequency of usage. Each table could be placed on a separate file group and drive to maximize I/O.
Further, you could duplicate measurements across different fact tables to reduce the need for fact-table joins, i.e. put one measurement both in a specific measurement-subgroup fact table and in a frequently used measurements fact table.
However some considerations should be taken into account if there are some specific requirements for data loading. For example, if a record errors out in your ETL to one fact table, you might want to make sure that the corresponding records in the other fact tables are deleted and staged to your error table so you don't end up with any bogus information. This is especially true if end users have their own calculations in the front end tool.
If you use OLAP cubes, then with multiple fact tables each fact table simply becomes the source of its own measure group.
In terms of fact-to-fact join, you (BI application) should never issue SQL that joins two fact tables together across the fact table’s foreign keys. Instead, the technique of Drilling Across two fact tables should be used, where the answer sets from two or more fact tables are separately created, and the results sort-merged on the common row header attribute values to produce the correct result.
More on this topic: http://www.kimballgroup.com/2003/04/the-soul-of-the-data-warehouse-part-two-drilling-across/
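As a rough sketch of drilling across (the names and numbers are invented for illustration), each fact table is aggregated separately to a conformed row header and only the small result sets are merged:

import pandas as pd

# Two fact tables that share the conformed patient dimension.
lipid_facts = pd.DataFrame({"patient_key": [1, 1, 2], "hdl": [50, 55, 40]})
echo_facts = pd.DataFrame({"patient_key": [1, 2], "ejection_fraction": [60, 55]})

# Separate answer sets, one per fact table, at the same grain.
lipid_summary = lipid_facts.groupby("patient_key", as_index=False)["hdl"].mean()
echo_summary = echo_facts.groupby("patient_key", as_index=False)["ejection_fraction"].mean()

# Sort-merge on the common row header instead of joining the fact tables directly.
report = lipid_summary.merge(echo_summary, on="patient_key", how="outer")
print(report)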
4) any other ideas?
SQL XML or some kind of NoSQL store could be an option, but the same querying/aggregation/computation/presentation issues remain.
First things first, I am an amateur, self-taught ruby programmer who came of age as a novice engineer in the age of super-fast computers where program efficiency was not an issue in the early stages of my primary GIS software development project. This technical debt is starting to tax my project and I want to speed up access to this lumbering GIS database.
It's a PostgreSQL database with the PostGIS extension, controlled inside of Rails, which immediately creates efficiency issues via the object-ification of database columns when accessing and/or manipulating records with one or many columns containing text or spatial data easily in excess of 1 megabyte per column.
It's extremely slow now, and it didn't use to be like this.
One strategy: I'm considering building child tables of my large spatial data tables (state, county, census tract, etc.) so that when I access the tables I don't have to load the massive spatial columns every time I access the objects. But then doing spatial queries might be difficult on a parent table's children. Not sure exactly how I would do that, but I think it's possible.
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes from tables I'm not currently using slow down my queries? How about having too many for one table?
These tables have a massive number of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
PS:
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (rails) database performance?
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem from the database side if you have full table scans, and then you might consider moving these data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress on read. Override the getter and setter methods by using write_attribute and read_attribute, compressing and decompressing (respectively of course) the data.
Indexing: if you are using Postgres to store such large chunks of data in single fields, consider storing them as array, JSON, or hstore fields. Index them using the GIN index type so you can search effectively within a given field.
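As a minimal sketch of that suggestion (the table, column and connection details are made up), a jsonb column plus a GIN index lets Postgres answer containment queries without scanning the large rows:

import json
import psycopg2

conn = psycopg2.connect("dbname=gis_app")  # hypothetical connection string
cur = conn.cursor()

# One-off DDL: add a jsonb attributes column and index it with GIN.
cur.execute("ALTER TABLE counties ADD COLUMN attrs jsonb")
cur.execute("CREATE INDEX idx_counties_attrs ON counties USING gin (attrs)")
conn.commit()

# Containment query served by the GIN index.
cur.execute("SELECT id FROM counties WHERE attrs @> %s::jsonb",
            (json.dumps({"state": "GA"}),))
print(cur.fetchall())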
Everywhere I read, people say you shouldn't use Riak's MapReduce over an entire bucket and that there are other ways of achieving your goals. I'm not sure how, though. I'm also not clear on why using an entire bucket is slow, if you only have one bucket in the entire system, so either way, you need to go over all the entries.
I have a list of 500K+ documents that represent sales data. I need to view this data in different ways: for example, how much revenue was made in each month the business was operating? How much revenue did each product raise? How many of each product were sold in a given month? I always thought MapReduce was supposed to be good at solving these types of aggregate problems, so I'm confused what use MapReduce is if you already have all the keys (you have to have searched for them, somehow, right?).
My documents are all in a bucket named 'sales' and they are records with the following fields: {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": "1.00", "created_at": 1365931758}.
Let's take the example where I need to report the total revenue for each product in each month over the past 4 years (that's basically the entire bucket), how does one use Riak's MapReduce to do that efficiently? Even just trying to use an identity map operation on the data I get a timeout after ~30 seconds, which MySQL handles in milliseconds.
I'm doing this in Erlang (using the protocol buffers client), but any language is fine for an explanation.
The equivalent SQL (MySQL) would be:
SELECT SUM(price) AS revenue,
FROM_UNIXTIME(created_at, '%Y-%m') AS month,
product_key
FROM sales
GROUP BY month, product_key
ORDER BY month ASC;
(Ordering not important right now).
You are correct: MapReduce in any KV store will not make it behave like a SQL database. There are several things that may help your use case:
- Use more than one bucket. Instead of just a sales bucket, you could break the data down by product, region, or month so it is already split by one of your common reporting criteria.
- Consider adding a secondary index to each document for each field. Your month query could then be a range query on the created_at index (see the sketch below).
- If your id field is sequentially increasing and you need to pull monthly data, store the beginning and ending id for each month in a separate key (not easy to do once the data is written, I know).
- You may also consider breaking each document into a series of keys. Instead of storing an id key with a JSON document as the value, store a key for each field, like id-productid, id-createdat, id-price. This will minimize the amount of data that must be read from disk and held in RAM in order to process your MapReduce.
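As a hedged sketch of the secondary-index suggestion, here is roughly what it looks like with the official Python client (the riak package); the index name, port and date range are assumptions, and the exact API may differ between client versions:

import riak

client = riak.RiakClient(protocol="pbc", pb_port=8087)
sales = client.bucket("sales")

# When writing, attach an integer secondary index for the timestamp.
obj = sales.new("1", data={"id": 1, "product_key": "cyber-pet-toy",
                           "price": "10.00", "tax": "1.00",
                           "created_at": 1365931758})
obj.add_index("created_at_int", 1365931758)
obj.store()

# Later: a 2i range query returns only the April 2013 keys, a much smaller
# input set to hand to a MapReduce job than the whole bucket.
april_keys = list(sales.get_index("created_at_int", 1364774400, 1367366399))
print(len(april_keys))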
To put this in perspective, consider the following (very sarcastic) hypothetical: I have 500K documents in a MySQL database, each document consists of a json string. My database consists of a single table named Sales, with a single column named Data which stores my documents as binary blobs. How can I write a fast, efficient SQL statement that will select only the documents that contain a date and group them by month?
The point I am making is that you must design the structure of your data objects according to the strengths of the data store you choose to use. Riak is not particularly efficient at handling JSON unless you are using its Solr-like search, but there are probably ways to restructure your data so that it can handle it. Or perhaps this means that another data store would better fit your needs.
Currently, I create secondary indexes for document attributes that I need to search frequently, and use this much smaller subset of keys as the input to a MapReduce job.
http://docs.basho.com/riak/latest/tutorials/Secondary-Indexes---Examples/
I do agree that it seems very expensive to run a big MapReduce job like this, compared to other systems I've used.
I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings.
In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).
There are currently 10,390 records but this figure is expected to grow.
I want to use Type 2 (slowly changing dimension) ETL, whereby if a record has changed in the OLTP database, a new record is added to the DW.
What is the best way to scan through 10,000 records in the DW and then compare the results with the results in several tables contained in the OLTP?
I'm thinking of creating a "snapshot" using a temporary table of the OLTP data and then comparing the results row by row with the data in the Dimension table in the DW.
I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?
Introduce LastUpdated into source system (OLTP) tables. This way you have less to extract using:
WHERE LastUpdated >= some_time_here
You seem to be using SQL Server, so you may also try the rowversion type (an 8-byte counter that is unique within the database).
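A minimal sketch of the incremental extract (using pyodbc; the DSN, table and column names are placeholders, and in practice the watermark would come from an ETL control table rather than a literal):

from datetime import datetime
import pyodbc

conn = pyodbc.connect("DSN=OLTP")  # hypothetical ODBC data source
cur = conn.cursor()

last_extract = datetime(2013, 1, 1)  # watermark from the previous ETL run
cur.execute(
    """
    SELECT s.StudentID, s.Email, a.AddressLine1, a.City
    FROM Students s
    JOIN Addresses a ON a.StudentID = s.StudentID
    WHERE s.LastUpdated >= ? OR a.LastUpdated >= ?
    """,
    last_extract, last_extract,
)
changed_rows = cur.fetchall()
print(len(changed_rows), "rows to compare against DimStudents")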
When importing your data into the DW, use an ETL tool (SSIS, Pentaho, Talend). They all have a component (block, transformation) to handle SCD2 (slowly changing dimension type 2). For an SSIS example see here. The transformation does exactly what you are trying to do -- all you have to do is specify which columns to monitor and what to do when it detects a change.
It sounds like you are approaching this sort of backwards. The typical way of performing ETL (Extract, Transform, Load) is:
"Extract" data from your OLTP database
Compare ("Test") your extracted data against the dimensional data to determine if there are changes or whatever other validation needs to be performed
Insert the data ("Load") in to your dimension table.
Effectively, in step #1 you'll create a physical record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine whether a modification was made. This is the standard way of doing things. In addition, 10,000 rows is pretty insignificant as far as volume goes; any RDBMS and ETL process should be able to work through that in no more than a few seconds. I know SQL Server has DTS (renamed SSIS in more recent versions), which is the perfect tool for doing something like this.
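For what the compare step might look like outside of a dedicated tool, here is a small pandas sketch (the frames, key and tracked columns are invented for illustration): it flags brand-new business keys and changed attribute values, which are the rows that would get a new Type 2 version in the dimension.

from datetime import date
import pandas as pd

# Current (active) dimension rows and today's extracted OLTP snapshot.
dim = pd.DataFrame({"StudentID": [1, 2],
                    "Email": ["a@x.com", "b@x.com"],
                    "City": ["Leeds", "York"]})
snapshot = pd.DataFrame({"StudentID": [1, 2, 3],
                         "Email": ["a@x.com", "b@new.com"],
                         "City": ["Leeds", "York", "Bath"]})

tracked = ["Email", "City"]
merged = snapshot.merge(dim, on="StudentID", how="left",
                        suffixes=("", "_dim"), indicator=True)

is_new = merged["_merge"] == "left_only"
changed = ~is_new & merged[tracked].ne(
    merged[[c + "_dim" for c in tracked]].values).any(axis=1)

# These rows become new versions in DimStudents; the old versions get end-dated.
new_versions = merged.loc[is_new | changed, ["StudentID"] + tracked].assign(
    ValidFrom=date.today(), IsCurrent=True)
print(new_versions)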
Does your OLTP database have an audit trail?
If so, then you can query the audit trail for just the records that have been touched since the last ETL.