I'm wondering how Cloud Functions cost scales with data writes/deletes. On the pricing page, under Cloud Functions, I can only find the number of invocations. Let's say I want to periodically (twice a day) delete certain nodes under the "users" node, and there are many users. Would this cost extra compared with deleting a single node?
This question is really about the data schema. I have a program which has a bunch of discrete events, and I want to get beautiful graphs out.
From my knowledge, I understand that I should really keep a counter of the number of events that have occurred, and on a regular interval, transfer that cumulative counter to the TSDB (as part of a cron job or similar).
What I currently have though is a system where the monitor, on a regular interval, tells the TSDB how many events occurred during that interval (a fixed, hard-coded value!).
Which of these two design patterns is better? What are the factors that affect that decision? Do I have a counter value here or is it just a measurement?
I have various concerns, including but not limited to the efficiency of the monitoring tool.
You tagged the question with InfluxDB but it seems like what you are really asking about is the collection agent. For that I would look at Telegraf.
StatsD is also a really great, lightweight API that is available for most major languages now, from which you can efficiently emit different types of stats (counters, timings, etc.), either for every event or at a sample rate you define.
I implemented a solution that gathers metrics emitted from my app using StatsD, metrics that were pulled (JMX queries), and basic host-level stats you get for free with Telegraf. Every host (30+) runs a single Telegraf instance which delivers its stats to a centralized InfluxDB server on some interval (e.g. 30 seconds).
So with an approach like that you get a good balance of performance and data precision.
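For illustration, emitting a counter over the StatsD wire protocol is just a small plain-text UDP datagram; in Java it looks roughly like this (the host, port, metric name, and sample rate are example values):

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class StatsdCounterExample {
  public static void main(String[] args) throws Exception {
    // StatsD listens for small plain-text UDP datagrams, by default on port 8125.
    // "events.processed:1|c" increments the counter "events.processed" by 1;
    // the optional "|@0.25" suffix tells StatsD the metric was sampled at a 25% rate.
    byte[] payload = "events.processed:1|c|@0.25".getBytes(StandardCharsets.UTF_8);
    try (DatagramSocket socket = new DatagramSocket()) {
      socket.send(new DatagramPacket(payload, payload.length,
          InetAddress.getByName("localhost"), 8125));
    }
  }
}

In practice you would use one of the existing StatsD client libraries rather than raw sockets, but the protocol really is that simple.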
I am using Google Dataflow at work.
While using Dataflow, I need to set the number of workers dynamically while a Dataflow batch job is running.
That's mainly because of Cloud Bigtable QPS.
We are using a 3-node Bigtable cluster, and it can't handle receiving all the traffic from 500 workers at once.
So I need to change the number of workers (from 500 to 25) just before inserting all the processed data into Bigtable.
Is there any way to achieve this goal?
Dataflow does not provide the ability to manually change the resource allocation of a batch job while it is running, however:
1) We plan to incorporate throttling into our autoscaling algorithms, so Dataflow would detect that it needs to downsize while writing to your Bigtable. I don't have a concrete ETA, but this is definitely on our roadmap.
2) Meanwhile, you can try to artificially limit the parallelism of your pipeline with a trick like this:
Take your PCollection<Something> (Something being the data type you're writing to bigtable)
Pipe it through a sequence of transforms: ParDo(pair with a random key in 0..25), GroupByKey, ParDo(ungroup and remove random key). You get, again, a PCollection<Something>
Write this collection to Bigtable.
The trick here is that there is no parallelization within a single key after a GroupByKey, so the result of the GroupByKey is a collection of 25 key-value pairs (where each value is an Iterable<Something>) that can't be processed by more than 25 workers in parallel. The ParDos following it will likely get fused with the write to Bigtable and will thus have a parallelism of 25.
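Here is a rough sketch of that sequence of transforms using the Beam/Dataflow Java SDK; the helper below is my own, not an SDK transform, and depending on your element type you may need to set coders on the intermediate collections explicitly:

import java.util.concurrent.ThreadLocalRandom;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class LimitParallelism {

  // Reshuffles `input` through `maxParallelism` random keys so that the steps
  // fused after the GroupByKey run with at most `maxParallelism` parallel bundles.
  public static <T> PCollection<T> limit(PCollection<T> input, final int maxParallelism) {
    return input
        // Pair every element with a pseudo-random key in [0, maxParallelism).
        .apply("AddRandomKey", ParDo.of(new DoFn<T, KV<Integer, T>>() {
          @ProcessElement
          public void process(ProcessContext c) {
            c.output(KV.of(ThreadLocalRandom.current().nextInt(maxParallelism), c.element()));
          }
        }))
        // A single key is never processed by more than one worker at a time.
        .apply(GroupByKey.<Integer, T>create())
        // Drop the key again and re-emit the original elements.
        .apply("Ungroup", ParDo.of(new DoFn<KV<Integer, Iterable<T>>, T>() {
          @ProcessElement
          public void process(ProcessContext c) {
            for (T element : c.element().getValue()) {
              c.output(element);
            }
          }
        }));
  }
}

You would then write the returned collection to Bigtable as before; that write should get fused with the ungrouping ParDo and so run with parallelism of at most 25 (or whatever limit you pass in).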
The caveat is that Dataflow is within its right to materialize any intermediate collections if it predicts that this will improve performance of the pipeline. It may even do this just for the sake of increasing the degree of parallelism (which goes explicitly against your goal in this example). But if you have an urgent job to run, I believe right now this will probably do what you want.
Meanwhile the only long-term solution I can suggest, until we have throttling, is to use a smaller limit on the number of workers, or use a larger Bigtable cluster, or both.
There's a lot of relevant information in the "DATA & ANALYTICS: Analyzing 25 billion stock market events in an hour with NoOps on GCP" talk from GCP/Next.
FWIW, you can increase the number of nodes of Bigtable before your batch job, give Bigtable a few minutes to adjust, and then start your job. You can turn down the Bigtable cluster when you're done with the batch job.
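For example, assuming the Cloud Bigtable admin client for Java (the project, instance, and cluster ids and the node counts below are placeholders), a programmatic resize looks roughly like this; the console or gcloud works just as well:

import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;

public class ResizeBigtableCluster {
  public static void main(String[] args) throws Exception {
    try (BigtableInstanceAdminClient admin = BigtableInstanceAdminClient.create("my-project")) {
      // Scale up before starting the batch job and give Bigtable a few minutes to adjust...
      admin.resizeCluster("my-instance", "my-cluster", 10);
      // ...then, once the Dataflow job is done, scale back down.
      admin.resizeCluster("my-instance", "my-cluster", 3);
    }
  }
}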
I am working in the testing space on data warehousing. In scope, I have newly created dimensions and facts which should be validated. Based on my knowledge and on information I found while browsing, I have decided to cover the following:
Schema validation of Facts and Dimension tables as per spec
Data duplicate check for Facts and Dimension table
Look-up validation for dimension table
Is there anything else that I can verify here?
In addition, I'm curious how I can check whether data was correctly populated into the fact table (row counts, correct surrogate keys, etc.). From a developer's point of view, are they using DML scripts to load the data?
Testing the Database
The database is tested in the following three ways:
1. Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running, and management of the test database.
2. Testing database features - Here is the list of features that we have to test:
- Querying in parallel
- Create index in parallel
- Data load in parallel
3. Testing database performance - Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly, and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.
http://www.tutorialspoint.com/dwh/dwh_testing.htm
You can also do ETL (Extract, Transform, and Load) testing.
ETL Testing Techniques:
1) Verify that data is transformed correctly according to various business requirements and rules.
2) Make sure that all projected data is loaded into the data warehouse without any data loss and truncation.
3) Make sure that ETL application appropriately rejects, replaces with default values and reports invalid data.
4) Make sure that data is loaded in data warehouse within prescribed and expected time frames to confirm improved performance and scalability.
Apart from these four main ETL testing methods, other testing methods like integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.
You can also test the schedule, backup recovery, the operational environment, the application, and the logistics of the test.
For more information about ETL Testing / Data Warehouse Testing please visit http://www.softwaretestinghelp.com/etl-testing-data-warehouse-testing/
UPD:
Creating Indexes in Parallel
Indexes on the fact table can be partitioned or non-partitioned. Local partitioned indexes provide the simplest administration. The only disadvantage is that a search of a local non-prefixed index requires searching all index partitions.
The considerations for creating index tablespaces are similar to those for creating other tablespaces. Operating system striping with a small stripe width is often a good choice, but to simplify administration it is best to use a separate tablespace for each index. If it is a local index you may want to place it into the same tablespace as the partition to which it corresponds. If each partition is striped over a number of disks, the individual index partitions can be rebuilt in parallel for recovery. Alternatively, operating system mirroring can be used. For these reasons the NOLOGGING option of the index creation statement may be attractive for a data warehouse.
Tablespaces for partitioned indexes should be created in parallel in the same manner as tablespaces for partitioned tables.
Partitioned indexes are created in parallel using partition granules, so the maximum DOP possible is the number of granules. Local index creation has less inherent parallelism than global index creation, and so may run faster if a higher DOP is used. The following statement could be used to create a local index on the fact table:
CREATE INDEX I on fact(dim_1,dim_2,dim_3) LOCAL
PARTITION jan95 TABLESPACE Tsidx1,
PARTITION feb95 TABLESPACE Tsidx2,
...
PARALLEL(DEGREE 12) NOLOGGING;
To backup or restore January data, you only need to manage tablespace Tsidx1.
Parallel Query Tuning
The parallel query feature is useful for queries that access a large amount of data by way of large table scans, large joins, the creation of large indexes, bulk loads, aggregation, or copying. It benefits systems with all of the following characteristics:
- symmetric multiprocessors (SMP), clusters, or massively parallel systems
- high I/O bandwidth (that is, many datafiles on many different disk drives)
- underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
- sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers
If any one of these conditions is not true for your system, the parallel query feature may not significantly help performance. In fact, on over-utilized systems or systems with small I/O bandwidth, the parallel query feature can impede system performance.
Here you can read about this in more detail: http://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/pqo.htm#1559
I hope these resources will be helpful for you:
https://wiki.postgresql.org/wiki/Parallel_Query_Execution
https://technet.microsoft.com/en-us/library/ms178065%28v=sql.105%29.aspx
http://www.csee.umbc.edu/portal/help/oracle8/server.815/a67775/ch24_pex.htm#1978
I am an ETL tester.
For data validation and data quality testing in a data warehouse, follow the checks below:
1) Metadata testing - testing the structure of the underlying tables (as per the design document).
2) Data validation - in data validation you test the mapping transformations using SQL and PL/SQL.
We generally test this using source and target table counts, Source minus Target, Source intersect Target, and Target minus Source queries (see the sketch after this list).
3) Duplicate check: to ensure there is no redundancy in the data warehouse.
4) Loading strategy check: to check whether your target table is an SCD or delete-on-reload (depends on requirements).
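As a rough illustration of the count and Source-minus-Target checks from point 2, here is a hedged JDBC sketch; the connection string, credentials, and the stg_sales/fact_sales tables and columns are made up for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SourceTargetChecks {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:oracle:thin:@//dwh-host:1521/DWH", "tester", "secret");
         Statement stmt = conn.createStatement()) {

      // 1) Row-count reconciliation between the staging (source) and fact (target) tables.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT (SELECT COUNT(*) FROM stg_sales) AS src_count, "
        + "       (SELECT COUNT(*) FROM fact_sales) AS tgt_count FROM dual")) {
        rs.next();
        System.out.printf("source=%d target=%d%n",
            rs.getLong("src_count"), rs.getLong("tgt_count"));
      }

      // 2) Source MINUS Target: rows present in the source but missing from the target.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT order_id, amount FROM stg_sales "
        + "MINUS "
        + "SELECT order_id, amount FROM fact_sales")) {
        while (rs.next()) {
          System.out.println("Missing in target: order_id=" + rs.getString("order_id"));
        }
      }
    }
  }
}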
I am making my first app. I am new to both SQL and GAE. Google Cloud SQL has a tier "D0", which has an "included I/O per day" of 200k. I have an example; could you please explain how many I/Os this example uses?
Suppose I have a table in my Cloud SQL database with 10 rows and 3 columns. The columns are "article name", "author", and "date of publishing", so there are 30 fields in total. When a user starts my app and requests the latest information, I want to send the user all 30 fields. I can send this to the user with a single SQL query.
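For concreteness, this is roughly the kind of single query I mean (the table and column names below are just made up for the example):

public class LatestArticlesQuery {
  // Hypothetical form of the single query described above; the table and column
  // names are made up, since the question doesn't give them.
  // Executing this one statement returns all 10 rows x 3 columns = 30 fields.
  public static final String LATEST_ARTICLES =
      "SELECT article_name, author, date_of_publishing FROM articles";
}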
Is the execution of that query counted as thirty I/Os because 30 fields were transferred, or as one I/O because one SQL query was run?
Appreciate your help.
The pricing guide has this to say:
The number of I/O requests to storage made by your database instance depends on your queries, workload and data set. Cloud SQL will cache data in memory to serve your queries efficiently and to minimise the number of I/O requests.
In other words, it is neither of your two options: some queries may be served entirely from memory, generating no I/O, while others may generate many I/O requests. Optimising the database well with indexes will make your queries cheaper; queries that trigger table scans over large tables will cost more.
In short, the same good-practice rules apply as for keeping a database fast on a local machine, but skipping the optimisation won't just make your queries slower, it will also make them cost more.
The # of I/Os refers to disk operations. So that really depends on the query and the cached data.
I have been learning RavenDB recently and would like to put it to use.
I was wondering what advice or suggestions people had around building the system in a way that is ready to scale, specifically sharding the data across servers, but that can start on a single server and only grow as needed.
Is it advisable, or even possible, to create multiple databases on a single instance and implement sharding across them? Then, to scale, it would simply be a matter of spreading these databases across the machines?
My first impression is that this approach would work, but I would be interested to hear the opinions and experiences of others.
Update 1:
I have been thinking more on this topic. I think my problem with the "sort it out later" approach is that it seems difficult to spread data evenly across servers in that situation. I will not have a string key that I can range on (A-E, F-M, ...); it will be done with numbers.
This leaves two options that I can see. Either break it at boundaries, so 1-50000 is on shard 1 and 50001-100000 is on shard 2; but then, with a site that ages, like this one, your original shards will be doing a lot less work. Alternatively, a strategy that round-robins the shards and puts the shard id into the key will suffer if you need to move a document to a new shard: it would change the key and break URLs that have used the key.
So my new idea, and again I am putting it out there for comment, would be to create a bucketing system from day one. It works like stuffing the shard id into the key, but you start with a large number of buckets, say 1000, which you distribute between evenly. Then, when it comes time to split the load onto a new shard, you can say "move buckets 501-1000 to the new server" and write your shard logic so that 1-500 goes to shard 1 and 501-1000 goes to shard 2. Then, when a third server comes online, you pick another range of buckets and adjust.
To my eye this gives you the ability to split into as many shards as you originally created buckets, spreading the load evenly in terms of both quantity and age, without having to change keys.
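A rough sketch of the routing logic I have in mind (the class and method names are just illustrative, nothing RavenDB-specific):

import java.util.TreeMap;

public class BucketRouter {
  // A fixed number of buckets is chosen up front; buckets (not documents) are
  // what move between shards later.
  private static final int BUCKET_COUNT = 1000;

  // Maps the first bucket of each range to the shard that owns that range.
  // On day one every range can point at the same server; after a split it might
  // look like this.
  private final TreeMap<Integer, String> rangeToShard = new TreeMap<>();

  public BucketRouter() {
    rangeToShard.put(1, "shard-1");    // buckets 1-500
    rangeToShard.put(501, "shard-2");  // buckets 501-1000
  }

  // The bucket is derived from the numeric id, so the key never changes when a
  // bucket range is moved to a new server.
  public int bucketFor(long documentId) {
    return (int) (documentId % BUCKET_COUNT) + 1;
  }

  public String shardFor(long documentId) {
    return rangeToShard.floorEntry(bucketFor(documentId)).getValue();
  }

  public static void main(String[] args) {
    BucketRouter router = new BucketRouter();
    System.out.println(router.shardFor(123456)); // -> shard-1 (bucket 457)
  }
}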
Thoughts?
It is possible, but really unnecessary. You can start using one instance, and then scale when necessary by setting up sharding later.
Also see:
http://ravendb.net/documentation/docs-sharding
http://ayende.com/blog/4830/ravendb-auto-sharding-bundle-design-early-thoughts
http://ravendb.net/documentation/replication/sharding
I think a good solution is to use virtual shards. You can start with one server and point all virtual shards to that single server. Use modulo on the incremental id to evenly distribute the rows across the virtual shards. With Amazon RDS you have the option to turn a slave into a master, so before you change the sharding configuration (pointing more virtual shards to the new server), you should make a slave a master, then update your configuration file, and then delete from the new master all the records whose modulo doesn't fall in the shard ranges that you use for the new instance.
You also need to delete rows from the original server, but by now all the new data with IDs whose modulo maps to the new virtual shard ranges will go to the new server. So you don't actually need to move the data; you take advantage of the Amazon RDS server promotion feature.
You can then make a replica off the original server. You create a unique ID as: shard ID + table type ID + incremental number. So when you query the database, you know which shard to go to and fetch the data from.
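A rough sketch of such a composite ID; the bit widths below are arbitrary assumptions, just to illustrate the idea:

public class CompositeId {
  // Example layout: 12 bits shard id | 10 bits table type id | 42 bits sequence.
  // The widths are arbitrary; they just need to add up to 64 bits (or fewer).
  public static long compose(int shardId, int tableTypeId, long sequence) {
    return ((long) shardId << 52) | ((long) tableTypeId << 42) | sequence;
  }

  public static int shardOf(long id)      { return (int) (id >>> 52); }
  public static int tableTypeOf(long id)  { return (int) ((id >>> 42) & 0x3FF); }
  public static long sequenceOf(long id)  { return id & ((1L << 42) - 1); }

  public static void main(String[] args) {
    long userId = compose(7, 1, 123456789L);
    // When a query comes in, the shard can be read straight off the id.
    System.out.println("route query to shard " + shardOf(userId));
  }
}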
I don't know whether it's possible to do this with RavenDB, but it can work pretty well with Amazon RDS, because Amazon already provides you with replication and the server promotion feature.
I agree that there should be a solution that offers seamless scalability right from the start, rather than telling the developer to sort the problems out when they occur. Furthermore, I've found that many NoSQL solutions that evenly distribute data across shards need to work within a low-latency cluster, so you have to take that into consideration. I've tried using Couchbase on two different EC2 machines (not in a dedicated Amazon cluster) and data balancing was very, very slow. That adds to the overall cost too.
I also want to add that Pinterest solved their scalability issues using 4096 virtual shards.
You also need to look into paging issues with many NoSQL databases. With that approach you can page data quite easily, but maybe not in the most efficient way, because you might need to query several databases. Another problem is changing the schema. Pinterest solved this by putting all the data in a JSON blob in MySQL. When you want to add a new column, you create a new table with the new column's data + key, and you can use an index on that column. If you need to query the data by email, for example, you can create another table with the emails + IDs and put an index on the email column. Counters are another problem, I mean atomic counters, so it's better to take those counters out of the JSON and put them in a column so you can increment the counter value.
There are great solutions out there, but at the end of the day you find out that they can be very expensive. I preferred spending time on building my own sharding solution and saving myself the headache later on. If you choose the other path, there are plenty of companies waiting for you to get into trouble and then asking for quite a lot of money to solve your problems, because at the moment you need them, they know you will pay anything to make your project work again. That's from my own experience; that's why I am breaking my head to build my own sharding solution using your approach, which will also be much cheaper.
Another option is to use middleware solutions for MySQL like ScaleBase or DBshards, so you can continue working with MySQL, but when you need to scale they have a well-proven solution. And the costs might be much lower than the alternative.
Another tip: when you create your config for shards, add a write_lock attribute that accepts true or false. When it is set, data won't be written to that shard, so when you fetch the list of shards for a specific table type (i.e. users), data will be written only to the other shards of that same type. This is also good for backups: you can show a friendly error to visitors when you want to lock all the shards while backing up all the data to get point-in-time snapshots of all the shards. Although I think you can send a global request for snapshotting all the databases with Amazon RDS and use point-in-time backups.
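A minimal sketch of that write_lock idea; the class and field names are mine, not from any particular sharding library:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ShardList {

  static class Shard {
    final String name;
    final boolean writeLock;   // true = shard is locked for writes (e.g. during backup)
    Shard(String name, boolean writeLock) { this.name = name; this.writeLock = writeLock; }
  }

  // Returns only the shards that may currently receive writes for a table type.
  static List<Shard> writableShards(List<Shard> shardsForTableType) {
    return shardsForTableType.stream()
        .filter(s -> !s.writeLock)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<Shard> userShards = Arrays.asList(
        new Shard("users-shard-1", false),
        new Shard("users-shard-2", true));   // locked, e.g. being backed up
    System.out.println(writableShards(userShards).size() + " shard(s) accept writes");
  }
}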
The thing is that most companies won't spend time working on a DIY sharding solution; they will prefer paying for ScaleBase. DIY solutions come from single developers who can't afford to pay for a scalable solution from the start, but want to rest assured that when they reach the point where they need it, they will have a solution. Just look at the prices out there and you can figure out that it will cost you A LOT. I will gladly share my code with you once I'm done. You are going down the best path in my opinion; it all depends on your application logic. I model my database to be simple, with no joins and no complicated aggregation queries, and this solves many of my problems. In the future you can use MapReduce to solve those big-data query needs.