Does anyone know when Snowflake Serverless Tasks might come out of preview and be generally available?
I've seen a couple of articles about it being in preview with some Snowflake customers, but I can't find any further information.
As of August 2021 this feature is still in preview.
The documentation is available at
Introduction to Tasks
Compute Resources
Tasks require compute resources to execute SQL code. Either of the following compute models can be chosen for individual tasks:
Snowflake-managed
Also referred to as the serverless compute model, this option relies on compute resources managed by Snowflake. These resources are automatically resized and scaled up or down by Snowflake as required for each workload.
Billing for Task Runs
Snowflake-managed resources (i.e. serverless compute model)
Snowflake bills your account based on the actual compute resource usage; in contrast with customer-managed virtual warehouses, which consume credits when active, and may sit idle or be overutilized. Charges are calculated based on total usage of the resources (including cloud service usage) measured in compute-hours credit usage.
The CREATE TASK syntax was also updated:
CREATE [ OR REPLACE ] TASK [ IF NOT EXISTS ] <name>
[ WAREHOUSE = <string> ]
The warehouse name becomes an optional parameter.
I am new to Dataflow and pub-sub tools in GCP.
I need to migrate our current on-prem process to GCP.
The current process is as follows:
We have two types of data feeds
Full feed: an ad hoc job. The full XML is ~100 GB (a single, very complex XML file containing the complete data). An ETL job processes this XML and loads it into ~60 tables.
Separate ETL jobs process the full feed. They create load-ready files, and all tables are truncated and reloaded.
Delta feed: every 30 minutes we need to process delta files (XML files that contain only the changes from the last 30 minutes).
The source system pushes XML files every 30 minutes (more than one; each file has a timestamp). A scheduled ETL process picks up all the files produced by the source system, processes them, and creates three load-ready files (insert, delete, and update) for each table.
Schedule: the ETL jobs are scheduled to run every 5 minutes. If the current run takes longer than 5 minutes, the next run is not triggered until the current one completes.
The order of file processing is very important (the ETL job takes care of this). All files must be processed in sequence.
At the end of the ETL process, the load-ready files are loaded into tables (mainframe).
I was asked to propose a design to migrate this to GCP. We need two processes in GCP as well, full and delta, and my proposed solution should be suitable for both feeds.
Initially I thought of the design below.
Pub/sub -> DataFlow -> mySQL/BigQuery
Then I learned that Pub/Sub does not guarantee that the files are processed in sequence. After some research I found that Google recently introduced the ordering key concept for Pub/Sub, which makes sure that messages are processed in order. The Google Cloud docs mention that this feature is in Beta.
I have two questions:
Has anyone used the ordering key concept in Pub/Sub in a production environment? If yes, did you face any challenges while implementing it?
Is this design suitable for the above requirement, or is there a better solution in GCP?
Is there any alternative to Dataflow?
I also learned that Pub/Sub can handle messages of at most 10 MB, while each of our XML files is larger than ~5 GB.
As @guillaume blaquiere mentioned, the Beta launch phase brings some restrictions, but they mostly relate to product support:
At beta, products or features are ready for broader customer testing
and use. Betas are often publicly announced. There are no SLAs or
technical support obligations in a beta release unless otherwise
specified in product terms or the terms of a particular beta program.
The average beta phase lasts about six months.
In general, the Cloud Pub/Sub message ordering feature works as intended. If you find something that needs the developers' attention, a report via the Google Issue Tracker is highly appreciated.
I have a use case in which records are published from an on-premises system to a Pub/Sub topic. I want to make sure that all published records are read by the Apache Beam job and that they are all correctly written to BigQuery.
I have two questions regarding this:
1) How do I make sure that there is no data loss in the entire process?
2) I need to maintain an audit table somewhere to make sure that if 'n' records were published, each one of them was written successfully. How can I keep track of the records?
Thank You.
Google Cloud Dataflow guarantees exactly-once data processing, with transactional logic built into its sources and sinks. You can read more about exactly-once guarantees in the blog article: After Lambda: Exactly-once processing in Cloud Dataflow, Part 3 (sources and sinks).
For your question about an audit table: can you describe more about what you'd like to accomplish? Dataflow has built-in Elements Added counters available in the UI and API which will show exactly how many elements have been processed. You could match this up with the number of published Pub/Sub messages.
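The matching step described above could be sketched roughly as follows. This is a minimal illustration, not a Dataflow or BigQuery feature: it assumes each record carries a unique ID assigned by the publisher, and that you can export the list of published IDs and the list of IDs that landed in the target table. The `reconcile` function and the sample IDs are hypothetical.

```python
from collections import Counter

def reconcile(published_ids, written_ids):
    """Compare the IDs that were published with the IDs that were written."""
    published = Counter(published_ids)
    written = Counter(written_ids)
    return {
        "published": sum(published.values()),
        "written": sum(written.values()),
        # Published but never written: potential data loss.
        "missing": sorted(set(published) - set(written)),
        # Written but never published: unexpected rows.
        "unexpected": sorted(set(written) - set(published)),
        # Written more than once: duplicates in the target.
        "duplicated": sorted(k for k, n in written.items() if n > 1),
    }

# Hypothetical sample: record "c" was lost and record "b" was written twice.
report = reconcile(["a", "b", "c", "d"], ["a", "b", "b", "d"])
print(report)
```

In this sample the two totals are equal even though one record is missing and one is duplicated, so an audit table that stores only counts would not catch the problem; comparing IDs would.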
When submitting many jobs, I get an error similar to
Project <my project> has insufficient quota(s) to execute this workflow
Since this is a batch job, why is my job not held until resources are available?
Holding it until the resources are available isn't always the best solution -- that may never happen depending on your total quota, behavior of other workloads, etc.
But having an option to do so seems like it could be a useful feature -- will note your request.
I am working on data warehouse testing. The scope includes newly created dimensions and facts that need to be validated. Based on my own knowledge and what I found by browsing, I plan to cover the following:
Schema validation of Facts and Dimension tables as per spec
Data duplicate check for Facts and Dimension table
Look-up validation for dimension table
Is there anything else that I can verify here?
In addition, I'm curious how I can check whether data is correctly populated in the fact table, including row counts, correct surrogate keys, and so on. From the developers' point of view, are they using DML scripts to load the data?
Testing the Database
The database is tested in the following three ways:
1. Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running, and management of test database.
2. Testing database features - Here is the list of features that we have to test:
-Querying in parallel
-Create index in parallel
-Data load in parallel
3. Testing database performance - Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.
http://www.tutorialspoint.com/dwh/dwh_testing.htm
Also you can use ETL testing (Extract, Transform, and Load).
ETL Testing Techniques:
1) Verify that data is transformed correctly according to various business requirements and rules.
2) Make sure that all projected data is loaded into the data warehouse without any data loss and truncation.
3) Make sure that ETL application appropriately rejects, replaces with default values and reports invalid data.
4) Make sure that data is loaded in data warehouse within prescribed and expected time frames to confirm improved performance and scalability.
Apart from these 4 main ETL testing methods, other testing methods like integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.
You can also test the schedule, backup recovery, the operational environment, the application, and the logistics of the test.
For more information about ETL Testing / Data Warehouse Testing please visit http://www.softwaretestinghelp.com/etl-testing-data-warehouse-testing/
UPD:
Creating Indexes in Parallel
Indexes on the fact table can be partitioned or non-partitioned. Local partitioned indexes provide the simplest administration. The only disadvantage is that a search of a local non-prefixed index requires searching all index partitions.
The considerations for creating index tablespaces are similar to those for creating other tablespaces. Operating system striping with a small stripe width is often a good choice, but to simplify administration it is best to use a separate tablespace for each index. If it is a local index you may want to place it into the same tablespace as the partition to which it corresponds. If each partition is striped over a number of disks, the individual index partitions can be rebuilt in parallel for recovery. Alternatively, operating system mirroring can be used. For these reasons the NOLOGGING option of the index creation statement may be attractive for a data warehouse.
Tablespaces for partitioned indexes should be created in parallel in the same manner as tablespaces for partitioned tables.
Partitioned indexes are created in parallel using partition granules, so the maximum DOP possible is the number of granules. Local index creation has less inherent parallelism than global index creation, and so may run faster if a higher DOP is used. The following statement could be used to create a local index on the fact table:
CREATE INDEX I on fact(dim_1,dim_2,dim_3) LOCAL
PARTITION jan95 TABLESPACE Tsidx1,
PARTITION feb95 TABLESPACE Tsidx2,
...
PARALLEL(DEGREE 12) NOLOGGING;
To backup or restore January data, you only need to manage tablespace Tsidx1.
Parallel Query Tuning
The parallel query feature is useful for queries that access a large amount of data by way of large table scans, large joins, the creation of large indexes, bulk loads, aggregation, or copying. It benefits systems with all of the following characteristics:
-symmetric multiprocessors (SMP), clusters, or massively parallel systems
-high I/O bandwidth (that is, many datafiles on many different disk drives)
-underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
-sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers
If any one of these conditions is not true for your system, the parallel query feature may not significantly help performance. In fact, on over-utilized systems or systems with small I/O bandwidth, the parallel query feature can impede system performance.
Here you can read about this in more detail: http://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/pqo.htm#1559
I hope these resources will be helpful for you:
https://wiki.postgresql.org/wiki/Parallel_Query_Execution
https://technet.microsoft.com/en-us/library/ms178065%28v=sql.105%29.aspx
http://www.csee.umbc.edu/portal/help/oracle8/server.815/a67775/ch24_pex.htm#1978
I am an ETL tester.
For data validation and data quality testing in a data warehouse, follow the checks below:
1) Metadata testing - testing the structure of the underlying tables (as per the design document).
2) Data validation - in data validation you test the mapping transformations using SQL and PL/SQL.
We generally test it using source and target table counts, Source minus Target, Source intersect Target, and Target minus Source.
3) Duplicate check - to ensure there is no redundancy in the data warehouse.
4) Loading strategy check - to check whether your target table is SCD or delete-and-reload (depends on requirements).
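The count, minus, intersect, and duplicate checks listed above can be sketched as follows. This is only an illustration under assumptions: it uses Python's built-in `sqlite3` module with two made-up tables, `src` and `tgt`, and SQLite spells the set difference `EXCEPT` where Oracle uses `MINUS`. In a real warehouse the same queries would run against the actual source and target tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, amount INTEGER);
    CREATE TABLE tgt (id INTEGER, amount INTEGER);
    INSERT INTO src VALUES (1, 10), (2, 20), (3, 30);
    -- Hypothetical load defect: row 3 was never loaded, row 2 was loaded twice.
    INSERT INTO tgt VALUES (1, 10), (2, 20), (2, 20);
""")

def rows(sql):
    """Run a query and return all result rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# Source and target table counts.
source_count = rows("SELECT COUNT(*) FROM src")[0][0]
target_count = rows("SELECT COUNT(*) FROM tgt")[0][0]

# Source minus Target: rows that were not loaded.
missing_in_target = rows(
    "SELECT id, amount FROM src EXCEPT SELECT id, amount FROM tgt")

# Target minus Source: rows that should not be in the target.
extra_in_target = rows(
    "SELECT id, amount FROM tgt EXCEPT SELECT id, amount FROM src")

# Source intersect Target: rows present on both sides.
common = rows(
    "SELECT id, amount FROM src INTERSECT SELECT id, amount FROM tgt ORDER BY id")

# Duplicate check on the target key.
duplicates = rows(
    "SELECT id, COUNT(*) FROM tgt GROUP BY id HAVING COUNT(*) > 1")

print(source_count, target_count, missing_in_target, extra_in_target,
      common, duplicates)
```

In this sample both counts are 3, yet Source minus Target and the duplicate check each report a problem, which is why the count comparison is usually combined with the set-based checks.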
I'm working in the DevOps space and currently supporting an overly complicated CI system. Its purpose is to test and certify multiple Java artifacts, running tests for every single artifact and testing artifacts against each other. We have multiple Jenkins instances and complicated custom workflows, but they share the same limitation: a lack of resource control. We ended up with a bunch of purely technical Jenkins jobs to deal with those limitations, but they aren't perfect and the initial workflow became too bloated.
I'd like your expert opinion on the applicability of the Activiti BPM engine to a CI process.
We have the following issues with the current process:
Cloud nodes can be handed over from one Jenkins job to another. If a workflow is terminated in the middle (let's say functional tests failed on a newly built artifact), then we have to free those nodes.
Jobs can consume multiple resources by themselves: databases, environments of multiple nodes, etc. Those resources must be freed when the workflow finishes.
Ideally, we should be able to define workflow steps in some DSL and bind resources to those steps. Later on, during workflow execution, the workflow engine could determine when each resource will first be required and request it just before that step (according to the resource type) from the appropriate pool / provider.
After each step finishes, the workflow engine would perform what I call "garbage collection" over resources. It could calculate (based on the provided DSL) the list of steps that are still reachable from the current state and the list of resources bound to those steps. From that it could construct a list (currently allocated resources MINUS future required resources). That list would go to garbage collection.
With such "garbage collection" I'm trying to avoid the overly complicated logic of manual resource lifecycle control, which would otherwise be embedded into the workflow definition and bloat it. I want to have clear, easily understandable (and easily supported) workflows.
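To make the idea concrete, here is a minimal sketch of the calculation in plain Python. It is not Activiti code. The step names, the resource names, and the `transitions` / `step_resources` dictionaries are hypothetical stand-ins for what the DSL would provide, and the "current" step is taken to be the step that is about to run.

```python
from collections import deque

# Hypothetical workflow definition: each step lists its successor steps.
transitions = {
    "build": ["unit_tests"],
    "unit_tests": ["functional_tests"],
    "functional_tests": ["publish"],
    "publish": [],
}

# Hypothetical binding of resources to the steps that need them.
step_resources = {
    "build": {"build_node"},
    "unit_tests": {"build_node"},
    "functional_tests": {"test_env", "database"},
    "publish": {"artifact_repo"},
}

def reachable_steps(current):
    """Return the current step plus every step reachable from it (BFS)."""
    seen = {current}
    queue = deque([current])
    while queue:
        step = queue.popleft()
        for nxt in transitions[step]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def collectable(allocated, current):
    """Currently allocated resources MINUS resources still required later."""
    needed = set()
    for step in reachable_steps(current):
        needed |= step_resources.get(step, set())
    return allocated - needed

# Unit tests have finished; functional tests are about to start.
allocated = {"build_node", "test_env", "database"}
freed = collectable(allocated, "functional_tests")
print(freed)
```

Here only `build_node` is released, because no remaining step needs it. A terminated workflow could be handled the same way by treating its set of reachable steps as empty, so that everything still allocated is collected.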
Do you think that it can be done easily with Activiti or any other BPM engine?
Andrev,
this can be implemented with limited effort. We have created similar workflows for QA environments using the open source BPMS Eclipse Stardust http://www.eclipse.org/stardust/
Best regards
Rob