Cost of each pipeline job - google-cloud-dataflow

My team at Moloco runs a lot of Dataflow pipelines (hourly and daily, mostly batch jobs), and from time to time we wish to calculate each pipeline's total cost to identify what improvements we can make to save costs.
For the past few weeks, one of our engineers has been going to the job monitoring UI (https://console.cloud.google.com/dataflow?project=$project-name) and manually calculating the cost by looking up the number of workers, the worker machine type, the total PD and memory used, and so on.
Recently, we noticed that the page now shows "resource metrics", which will save us time when calculating costs (along with the new pricing model that was announced a while ago).
Even so, because we run about 60-80 Dataflow jobs every day, calculating the cost per job is time consuming.
Is there a way to obtain the total vCPU, memory, and PD/SSD usage metrics via an API given a job id, perhaps via PipelineResult or from the logs of the master node? If it is not supported now, do you plan to support it in the near future?
We are wondering if we should write our own script that extracts the metrics per job id and calculates the costs, but we'd prefer not to have to do that.
Thanks!

I'm one of the engineers on the Dataflow team.
I'd recommend using the command-line tool to list these metrics and writing a script that parses them from the output and calculates your cost based on those (a rough sketch of such a script follows at the end of this answer). If you want to do this for many jobs, you may also want to list your jobs using gcloud beta dataflow jobs list. We are working on solutions to make this easier to obtain in the future.
Make sure you are using gcloud 135.0.0+:
gcloud version
If not, you can update it using:
gcloud components update
Log in with an account that has access to the project running your job:
gcloud auth login
Set your project:
gcloud config set project <my_project_name>
Run this command to list the metrics and grep the resource metrics:
gcloud beta dataflow metrics list <job_id> --project=<my_project_name> | grep Service -B 1 -A 3
Your results should be structured like so:
name:
  name: Service-mem_mb_seconds
  origin: dataflow/v1b3
scalar: 192001
updateTime: '2016-11-07T21:23:46.452Z'
--
name:
  name: Service-pd_ssd_gb_seconds
  origin: dataflow/v1b3
scalar: 0
updateTime: '2016-11-07T21:23:46.452Z'
--
name:
  name: Service-cpu_num
  origin: dataflow/v1b3
scalar: 0
updateTime: '2016-11-07T21:23:46.452Z'
--
name:
  name: Service-pd_gb
  origin: dataflow/v1b3
scalar: 0
updateTime: '2016-11-07T21:23:46.452Z'
--
name:
  name: Service-pd_gb_seconds
  origin: dataflow/v1b3
scalar: 12500
updateTime: '2016-11-07T21:23:46.452Z'
--
name:
  name: Service-cpu_num_seconds
  origin: dataflow/v1b3
scalar: 50
updateTime: '2016-11-07T21:23:46.452Z'
--
name:
  name: Service-pd_ssd_gb
  origin: dataflow/v1b3
scalar: 0
updateTime: '2016-11-07T21:23:46.452Z'
--
name:
  name: Service-mem_mb
  origin: dataflow/v1b3
scalar: 0
updateTime: '2016-11-07T21:23:46.452Z'
The relevant ones for you are:
Service-cpu_num_seconds
Service-mem_mb_seconds
Service-pd_gb_seconds
Service-pd_ssd_gb_seconds
Note: these metric names will change soon, to:
TotalVCPUUsage
TotalMemoryUsage
TotalHDDPersistentDiskUsage
TotalSSDPersistentDiskUsage
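If you do end up scripting this before the metrics become easier to consume, the sketch below is one possible approach in Python (not an official tool). It shells out to gcloud with the global --format=json flag, assumes the JSON output mirrors the YAML structure shown above, and multiplies the *_seconds totals by per-hour rates that you fill in yourself from the Dataflow pricing page.

import json
import subprocess
import sys

# Placeholder prices per resource-hour; replace with the current values from
# the Dataflow pricing page (batch and streaming rates differ).
RATES = {
    "Service-cpu_num_seconds":   0.0,  # $ per vCPU hour
    "Service-mem_mb_seconds":    0.0,  # $ per GB hour (the metric itself is in MB-seconds)
    "Service-pd_gb_seconds":     0.0,  # $ per HDD PD GB hour
    "Service-pd_ssd_gb_seconds": 0.0,  # $ per SSD PD GB hour
}

def job_metrics(job_id, project):
    """Fetch the metric list for one job as parsed JSON via the gcloud CLI."""
    out = subprocess.check_output([
        "gcloud", "beta", "dataflow", "metrics", "list", job_id,
        "--project", project, "--format", "json",
    ])
    return json.loads(out)

def estimate_cost(metrics):
    """Sum resource-hours * rate over the four service metrics listed above."""
    total = 0.0
    for metric in metrics:
        name = metric.get("name", {}).get("name", "")
        if name not in RATES:
            continue
        resource_hours = float(metric.get("scalar", 0)) / 3600.0
        if name == "Service-mem_mb_seconds":
            resource_hours /= 1024.0  # convert MB-hours to GB-hours
        total += resource_hours * RATES[name]
    return total

if __name__ == "__main__":
    job_id, project = sys.argv[1], sys.argv[2]
    print("estimated cost: $%.4f" % estimate_cost(job_metrics(job_id, project)))

You can feed it the job ids printed by gcloud beta dataflow jobs list if you want to sweep a whole day's worth of jobs.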

Related

Neo4j GraphSage training does not log anything

I am working on extracting graph embeddings by training the GraphSAGE algorithm. I am working with a large graph consisting of 82,339,589 nodes and 219,521,164 edges. When I check with the ":queries" command, the query is listed as running. The algorithm started 6 days ago. When I look at the logs with "docker logs xxx", the last lines listed are:
2021-12-01 12:03:16.267+0000 INFO Relationship Store Scan (RelationshipScanCursorBasedScanner): Imported 352,492,468 records and 0 properties from 16247 MiB (17,036,668,320 bytes); took 59.057 s, 5,968,663.57 Relationships/s, 275 MiB/s (288,477,487 bytes/s) (per thread: 1,492,165.89 Relationships/s, 68 MiB/s (72,119,371 bytes/s))
2021-12-01 12:03:16.269+0000 INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:56143] ] LOADING
INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:56143] ] LOADING Actual memory usage of the loaded graph: 8602 MiB
INFO [neo4j.BoltWorker-3 [bolt] [/10.0.0.6:64076] ] GraphSageTrain :: Start
Is there a way to see detailed logs about the training process? Is it normal for it to take 6 days on a graph of this size?
It is normal for GraphSAGE to take a long time compared to FastRP or Node2Vec. Starting in GDS 1.7, you can use
CALL gds.beta.listProgress(jobId: String)
YIELD
jobId,
taskName,
progress,
progressBar,
status,
timeStarted,
elapsedTime
If you call without passing in a jobId, it will return a list of all running jobs. If you call with a jobId, it will give you details about a running job.
This query will summarize the details for job 03d90ed8-feba-4959-8cd2-cbd691d1da6c.
CALL gds.beta.listProgress("03d90ed8-feba-4959-8cd2-cbd691d1da6c")
YIELD taskName, status
RETURN taskName, status, count(*)
Here's the documentation for progress logging. The system monitoring procedures might also be helpful to you.
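If you would rather poll progress from a script than from the browser, here is a minimal sketch using the official Neo4j Python driver; the bolt URI and credentials are placeholders for your own deployment, and the YIELD fields are the ones listed above.

from neo4j import GraphDatabase

# Placeholder connection details -- point these at your own instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # No jobId argument: returns one row per running GDS job.
    result = session.run(
        "CALL gds.beta.listProgress() "
        "YIELD jobId, taskName, progress, status, elapsedTime "
        "RETURN jobId, taskName, progress, status, elapsedTime"
    )
    for record in result:
        print(record["jobId"], record["taskName"],
              record["progress"], record["status"], record["elapsedTime"])

driver.close()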

log4j2 - keep last 7days of log file

To keep the last 3 days of log files, with each file up to 10 MB, how do I configure this in the log4j2.yml file?
I tried:
filePattern: "${log}/${app}-archive/${app}-%d{MM-dd-yyyy}-%i.log"
...
Policies:
  TimeBasedTriggeringPolicy:
    interval: 1
    modulate: true
  SizeBasedTriggeringPolicy:
    size: 10 MB
DefaultRolloverStrategy:
  delete:
    basePath: "${log}/${app}-archive"
    maxDepth: 1
    IfFileName:
      glob: "*.log"
    IfLastModified:
      age: 3d
but it only keeps up to 7 archives for the same day and deletes older files even though they are today's logs. Is there a way to keep as many files as needed as long as lastModified < 3d?
like app-04-09-2021-8.log, app-04-09-2021-9.log,....app-04-09-2021-39.log and so on.
Please, guide me.
By default, DefaultRolloverStrategy keeps at most the number of files given by its max configuration attribute (7 by default) per time-based rollover interval, which is daily in your case, as indicated by your file pattern ${app}-%d{MM-dd-yyyy}-%i.log. The max attribute applies only to the %i counter.
Provide a bigger value for that attribute, whatever you consider appropriate for your logging volume. For example:
DefaultRolloverStrategy:
  max: 100
  delete:
    basePath: "${log}/${app}-archive"
    maxDepth: 1
    IfFileName:
      glob: "*.log"
    IfLastModified:
      age: 3d
Please, see the relevant documentation.

Measure the duration of x amount of requests while using K6

I would like to use k6 to measure the time it takes for an API to process 1,000,000 requests in total.
Scenario
Execute 1,000,000 GET requests in total with 50 concurrent users/threads, so every user/thread executes 20,000 requests.
I've managed to create such a scenario with Artillery.io, but I'm not sure how to create the same one with k6. Could you point me in the right direction? (Most examples use a pre-defined duration, but in this case I don't know the duration - that is exactly what I want to measure.)
Artillery yml
config:
  target: 'https://localhost:44000'
  phases:
    - duration: 1
      arrivalRate: 50
scenarios:
  - flow:
      - loop:
          - get:
              url: "/api/Test"
        count: 20000
K6 js
import http from 'k6/http';
import { check, sleep } from 'k6';

export let options = {
  iterations: 1000000,
  vus: 50
};

export default function () {
  let res = http.get('https://localhost:44000/api/Test');
  check(res, { 'success': (r) => r.status === 200 });
}
The iterations + vus you've specified in your k6 script options would result in a shared-iterations executor, where VUs will "steal" iterations from the common pile of 1m iterations. So, the faster VUs will complete slightly more than 20k requests, while the slower ones will complete slightly less, but overall you'd still get 1 million requests. And if you want to see how quickly you can complete 1m requests, that's arguably the better way to go about it...
However, if having exactly 20k requests per VU is a strict requirement, you can easily do that with the aptly named per-vu-iterations executor:
export let options = {
  discardResponseBodies: true,
  scenarios: {
    'million_hits': {
      executor: 'per-vu-iterations',
      vus: 50,
      iterations: 20000,
      maxDuration: '2h',
    },
  },
};
In any case, I strongly suggest setting maxDuration to a high value, since the default value is only 10 minutes for either executor. And discardResponseBodies will probably help with the performance, if you don't care about the response body contents.
By the way, you can also do in k6 what you did in Artillery: have 50 VUs run a single iteration each and just loop the http.get() call 20,000 times inside that one iteration... You won't get a very nice UX that way, since the k6 progress bars will be frozen until the very end (k6 has no idea of your actual progress inside each iteration), but it will also work.

How to model decimal type in RAML

I am modelling a REST API using RAML. The response body of one endpoint (JSON format) is a list of financial transactions. Each transaction contains an amount of money: a currency and a numeric value. The following is a snippet of my RAML file; please note the amount property in the Transaction type:
#%RAML 1.0
title: Trading details API
version: v1
mediaType: application/json
baseUri: http://my.host.io/api/trading/v1/
types:
  Transactions:
    type: Transaction[]
    minItems: 1
  Transaction:
    type: object
    properties:
      refNum:
        type: string
      amount:
        type: ????
      currency:
        type: string
        minLength: 2
        maxLength: 3
/trades:
  get:
    description: Get details for a given trade
    queryParameters:
      userId:
        type: integer
        required: true
    responses:
      200:
        body:
          application/json:
            type: Transactions
Unfortunately RAML has no built-in decimal type, and the other numeric types (integer, float, or double) are not suitable here, mainly because I need to specify the number of digits after the decimal point.
So the question is: how do I correctly model the amount type in RAML?
I need to provide an exact definition of the type for every value in the response body, because this file will be the contract between the backend and the frontend (developed by 2 different teams).
Any help is welcome.
Please note that I did some research on SO, and the closest question to mine is How to define money amounts in an API, but it is not about RAML modelling and the answers did not help me.
RAML has a similar construct to the one in JSON Schema. You'll want to use type: number in combination with multipleOf to describe decimal precision.
#%RAML 1.0 DataType
type: number
multipleOf: 0.01
After months I come back to share my experience.
The way I worked around it was by using the type string and a pattern. I am aware of the many concerns around changing the data type from number to string, but this approach is elegant, robust, flexible and still simple to test and understand.
API consumers are forced to format the amount in the correct way, and the messages coming in and out of the API are consistent; that consistency cannot be guaranteed by using multipleOf 0.0001 (where 25 and 25.0000 are both accepted).
I have reused this solution over and over with great results. Therefore I am sharing it with the community.
Solution:
[...]
amount:
  type: string
  pattern: "^(([1-9][0-9]*)|[0])[.]([0-9]{4})$"
currency:
  type: string
...
The pattern accepts exactly 4 digits in the decimal part, forces the use of a . as the separator, and does not allow the amount to start with 0, with the exception of the 0.xxxx family of numbers.
The following is a list of accepted examples:
1.0000
54.0000
23456.1234
1.9800
0.0000
0.0001
And the following is a list of rejected examples:
0123.3453
12.12
1.1
000
01.0000
1.0
1.00
4.000
Moreover, you can specify the max number of digits on the left side (in this example 10):
pattern: "^(([1-9][0-9]{0,9})|[0])[.]([0-9]{4})$"
Example of accepted numbers:
1234567890.1234
3.5555
0.1234
Example of rejected numbers:
12345678901.1234
123456789012.1234
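If you want to sanity-check the pattern outside of your RAML tooling, the small throwaway Python snippet below (purely illustrative) exercises both variants against the example values above.

import re

# The two patterns from this answer: unlimited integer digits, and at most 10.
PATTERN = re.compile(r"^(([1-9][0-9]*)|[0])[.]([0-9]{4})$")
PATTERN_MAX10 = re.compile(r"^(([1-9][0-9]{0,9})|[0])[.]([0-9]{4})$")

accepted = ["1.0000", "54.0000", "23456.1234", "1.9800", "0.0000", "0.0001"]
rejected = ["0123.3453", "12.12", "1.1", "000", "01.0000", "1.0", "1.00", "4.000"]

for value in accepted:
    assert PATTERN.fullmatch(value), value
for value in rejected:
    assert not PATTERN.fullmatch(value), value

# The bounded variant caps the integer part at 10 digits.
assert PATTERN_MAX10.fullmatch("1234567890.1234")
assert PATTERN_MAX10.fullmatch("3.5555")
assert PATTERN_MAX10.fullmatch("0.1234")
assert not PATTERN_MAX10.fullmatch("12345678901.1234")
assert not PATTERN_MAX10.fullmatch("123456789012.1234")
print("all pattern checks passed")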

R - how to exclude pennystocks from environment before calculating adjusted stock returns

Within my current research I'm trying to find out how big the impact of ad-hoc sentiment on daily stock returns is.
The calculations have worked quite well so far and the results are plausible.
The calculations up to now, using the quantmod package and Yahoo financial data, look like this:
getSymbols(c("^CDAXX", Symbols), env = myenviron, src = "yahoo",
           from = as.Date("2007-01-02"), to = as.Date("2016-12-30"))

Returns <- eapply(myenviron, function(s) ROC(Ad(s), type = "discrete"))
ReturnsDF <- as.data.table(do.call(merge.xts, Returns))

# adjust column names
colnames(ReturnsDF) <- gsub(".Adjusted", "", colnames(ReturnsDF))
ReturnsDF <- as.data.table(ReturnsDF)
However, to make it more robust against the noisy influence of penny-stock data, I wonder how it is possible to exclude stocks that at any point in the time period drop below a certain value x, let's say 1€.
I guess the best thing would be to exclude them before calculating the returns and merging the xts results, or even better, before downloading them with the getSymbols command.
Does anybody have an idea how this could work best? Thanks in advance.
Try this:
build a price frame of the Adj. closing prices of your symbols
(I use the PF function of the quantmod add-on package qmao, which has lots of other useful functions for this type of analysis: install.packages("qmao", repos="http://R-Forge.R-project.org"))
check by column if any price is below your minimum trigger price
select only columns which have no closings below the trigger price
To stay more flexible I would suggest taking a sub-period - let's say no price below 5 during the last 21 trading days. The toy example below may illustrate my point.
I use AAPL, FB and MSFT as the symbol universe.
> symbols <- c('AAPL','MSFT','FB')
> getSymbols(symbols, from='2018-02-01')
[1] "AAPL" "MSFT" "FB"
> prices <- PF(symbols, silent = TRUE)
> prices
AAPL MSFT FB
2018-02-01 167.0987 93.81929 193.09
2018-02-02 159.8483 91.35088 190.28
2018-02-05 155.8546 87.58855 181.26
2018-02-06 162.3680 90.90299 185.31
2018-02-07 158.8922 89.19102 180.18
2018-02-08 154.5200 84.61253 171.58
2018-02-09 156.4100 87.76771 176.11
2018-02-12 162.7100 88.71327 176.41
2018-02-13 164.3400 89.41000 173.15
2018-02-14 167.3700 90.81000 179.52
2018-02-15 172.9900 92.66000 179.96
2018-02-16 172.4300 92.00000 177.36
2018-02-20 171.8500 92.72000 176.01
2018-02-21 171.0700 91.49000 177.91
2018-02-22 172.5000 91.73000 178.99
2018-02-23 175.5000 94.06000 183.29
2018-02-26 178.9700 95.42000 184.93
2018-02-27 178.3900 94.20000 181.46
2018-02-28 178.1200 93.77000 178.32
2018-03-01 175.0000 92.85000 175.94
2018-03-02 176.2100 93.05000 176.62
Let's assume you would like any instrument which traded below 175.40 during the last 6 trading days to be excluded from your analysis :-).
As you can see, that will exclude AAPL and MSFT.
apply and the base function any, applied(!) to a 6-day subset of prices, will give us exactly what we want. Here are the last 3 days of prices, excluding the instruments which did not meet our condition:
> tail(prices[,apply(tail(prices),2, function(x) any(x < 175.4)) == FALSE],3)
FB
2018-02-28 178.32
2018-03-01 175.94
2018-03-02 176.62
