How do I get args like epochs to show up in the UI configuration panel under hyperparameters? I want to be able to change the number of epochs and the learning rate from within the UI.
You can use argparse - ClearML will auto-magically log all parameters in the task's configuration section (under the Hyperparameters section) - see this example. You can also just connect any dictionary (see this example).
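For instance, a minimal sketch (the project/task names and the default values here are placeholders):
import argparse
from clearml import Task

# Initializing a task is enough for ClearML to start capturing argparse arguments.
task = Task.init(project_name="examples", task_name="hyperparameter demo")

parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=10)
parser.add_argument("--learning-rate", type=float, default=0.001)
args = parser.parse_args()  # logged automatically and editable in the UI

# Alternatively, connect a plain dictionary; values changed in the UI
# are reflected back into this dict when the task is re-run remotely.
params = task.connect({"epochs": 10, "learning_rate": 0.001})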
Related
I'm trying to find optimal hyperparams with ray:
tuner = tune.Tuner(
    train,
    param_space=hyperparams1,
    tune_config=tune.TuneConfig(
        num_samples=200,
        metric="score",
        mode="max",
        scheduler=ASHAScheduler(grace_period=6),
    ),
    run_config=RunConfig(
        stop={"score": 290},
        checkpoint_config=CheckpointConfig(checkpoint_score_attribute="score"),
    ),
)
Sometimes my model overfits and its results get worse over time, i.e., I get something like 100, 200, 220, 140, 90, 80. Ray shows me the current "best result" in its status, but it selects the best value only from the last iteration (i.e., for the sequence above the reported best value is 80).
I'm sure that results with higher values are better, so it would be nice to select the best result based on the whole history, not just the last value.
Is there a way to force it to use the whole training history when selecting the best result? Or should I interrupt training manually when I see that the model is no longer improving? Or is it already saving all the results, and all I need to do is filter them after it finishes?
I've seen Checkpoint best model for a trial in ray tune and added CheckpointConfig to my code, but it doesn't seem to help: I still see the last result.
The entire history of reported metrics is saved within each Result, and you can access it via result.metrics_dataframe. See this section of the user guide for an example of what you can access within each result.
It's possible to filter the entire history so that only the maximum accuracy (or whichever metric you define) is shown, using the ResultGrid output of tuner.fit(). The ResultGrid.get_dataframe(filter_metric=<your-metric>, filter_mode=<min/max>) API returns a DataFrame that filters the history of reported results for each trial. See the bottom of this section for an example of doing this.
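For example, a minimal sketch building on the Tuner above (assuming "score" is the metric your training function reports):
results = tuner.fit()

# One row per trial, with the metric columns taken from the iteration where
# "score" was at its maximum rather than from the last reported iteration.
best_per_trial = results.get_dataframe(filter_metric="score", filter_mode="max")
print(best_per_trial.sort_values("score", ascending=False).head())

# Full reported history of a single trial, as a pandas DataFrame.
history = results[0].metrics_dataframe
print(history["score"].max())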
I'm running node2vec in Neo4j, but when the algorithm runs again with the same parameters the result changes. So I read the configuration and saw that there is a seed value.
I tried setting the seed value to a specific number, but nothing changes.
https://github.com/aditya-grover/node2vec/issues/83
If you check the documentation you can see that the randomSeed parameter is available. However, the results are still non-deterministic, as per the docs:
Seed value used to generate the random walks, which are used as the training set of the neural network. Note, that the generated embeddings are still nondeterministic.
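As a rough sketch of where the parameter goes, using the Python driver (the connection details and the graph name 'myGraph' are placeholders, and it assumes the projected graph already exists in the GDS catalog); per the docs quoted above, the embeddings will still vary between runs even with the seed fixed:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
CALL gds.node2vec.stream('myGraph', {
  embeddingDimension: 128,
  randomSeed: 42
})
YIELD nodeId, embedding
RETURN nodeId, embedding
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["nodeId"], record["embedding"][:3])

driver.close()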
I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts being monitored by Telegraf.
What is the approach when you want to deviate from the above setup for one host, i.e., instead of X% for a specific server we want to alert on Y%?
I'm happy that a distinct TICKscript could be created for the custom values, but how do I go about excluding the host from the original 'global' one?
This is a simple scenario, but the solution needs to meet the needs of 10,000 hosts, of which there will be hundreds of exceptions, and it will also encompass tens or hundreds of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You might allow 100 measurements by default, but on one server, which happens to receive a massive number of datapoints, you want to limit it to 10 (a value which is easily exceeded by the _internal database, but good for our example).
Given the following excerpt from a TICKscript:
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb, and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows:
maxNumMeasurements: 10
A critical alert will then be triggered if value (and hence numMeasurements) exceeds 10 AND the hostname tag equals influxdb, OR if value exceeds 100.
There is an example in the documentation that handles scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, using docker-compose.
Note that there is a caveat with the example: the alert flaps because of a second, dynamically generated database. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10,000 servers?
Managing alerts manually, directly in Chronograf/Kapacitor, is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer object, and the number can go into the thousands. We've opted for a custom solution where we keep a standard set of template tickscripts (not to be confused with Kapacitor templates) and provide an interface to the user that exposes only the relevant variables. A service (written in Python) then combines the values for those variables with a tickscript and, using the Kapacitor API, deploys (updates or deletes) the task on the Kapacitor server. This is automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough that they don't overlap and generic enough that it's not too much work to create tasks for every little thing.
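As a rough illustration of the deployment step described above (the template file, placeholder variables and Kapacitor URL are hypothetical; the endpoint is the standard Kapacitor HTTP task API):
import requests
from string import Template

KAPACITOR_URL = "http://localhost:9092/kapacitor/v1/tasks"

# A template tickscript containing $threshold and $customer placeholders (hypothetical file).
with open("templates/cpu_alert.tick") as f:
    tick_template = Template(f.read())

script = tick_template.substitute(threshold="90", customer="acme")

task = {
    "id": "cpu_alert_acme",
    "type": "stream",
    "dbrps": [{"db": "telegraf", "rp": "autogen"}],
    "script": script,
    "status": "enabled",
}

# Create the task; use PATCH on /kapacitor/v1/tasks/<id> to update an existing one.
resp = requests.post(KAPACITOR_URL, json=task)
resp.raise_for_status()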
I'm trying to get a deeper view of my Dataflow jobs by measuring parts of them using Metrics.counter and Metrics.gauge, but I cannot find those metrics on Stackdriver.
I have a premium Stackdriver account and I can see those counters under the Custom Counters section in the Dataflow UI.
I can see the droppedDueToLateness 'custom' counter on Stackdriver, though, and that one seems to be created via Metrics.counter as well...
Aside from that, something that might be relevant: when I navigate to https://app.google.stackdriver.com/services/dataflow the message I get is:
"You do not have any resources of this type being monitored by Stackdriver." That's weird as well, as if our Cloud Dataflow weren't properly connected to Stackdriver. But, on the other hand, some metrics are displayed and can be monitored, such as System Lag, Watermark age, Elapsed time, Element count, etc...
What am I missing?
Custom metric naming conventions
When defining custom metrics in Dataflow, you have to adhere to the custom metric naming conventions, or they won't show up in Stackdriver.
Relevant snippet:
You must adhere to the following spelling rules for metric label names:
You can use upper and lower-case letters, digits, underscores (_) in the names.
You can start names with a letter or digit.
The maximum length of a metric label name is 100 characters.
If you create a metric with
Metrics.counter('namespace', 'name')
the metric shows up in Stackdriver as custom.googleapis.com/dataflow/name, so 'name' should adhere to the rules mentioned above. The namespace does not seem to be used by Stackdriver.
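For example, with the Beam Python SDK (the namespace and counter name below are just placeholders), a counter defined like this would surface as custom.googleapis.com/dataflow/valid_records:
import apache_beam as beam
from apache_beam.metrics import Metrics

class CountRecords(beam.DoFn):
    def __init__(self):
        super().__init__()
        # 'valid_records' must follow the naming rules above;
        # the namespace part is not used by Stackdriver.
        self.valid_records = Metrics.counter('my_namespace', 'valid_records')

    def process(self, element):
        self.valid_records.inc()
        yield element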
Additional: labels
It doesn't seem possible to add labels to the metrics when defined this way. However, the full description of each time series of a metric is a string with the format
'name' job_name job_id transform
So you can aggregate by these 4 properties (+ region and project).
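If you want to pull those time series programmatically, something along these lines should work with the Cloud Monitoring client library (the metric name and project ID are placeholders, and the exact label names you get back may differ from what the UI shows):
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # placeholder

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

series = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "custom.googleapis.com/dataflow/valid_records"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    # Job name, job id, transform, etc. appear as metric/resource labels.
    print(dict(ts.metric.labels), dict(ts.resource.labels), len(ts.points))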
I have built a Microsoft Azure ML Studio predictive web service in my workspace and have a scenario where I need to be able to run the service with different training datasets.
I know I can set up multiple web services via Azure ML, each with a different training set attached, but I am trying to find a way to do it all within the same workspace, passing a Web Input Parameter as the input value that chooses which training set to use.
I have found this article, which describes almost exactly my scenario. However, that article relies on the training dataset pulled by the Load Trained Data module having a static endpoint (or blob storage location). I don't see any way to dynamically (or conditionally) change this location based on a Web Input Parameter.
Basically, does Azure ML support a "conditional training data" loading?
Or, might there be a way to combine training datasets, then filter based on the passed Web Input Parameter?
This probably isn't exactly what you need, but hopefully it helps you out.
To combine data sets, you can use the Join Data module.
To filter, that may be accomplished by executing a Python script. Here's an example.
Using the Adult Census Income Binary Classification dataset, on the age column, there's a minimum age of 17.
If I wanted to filter the data set by age, I would connect it to an Execute Python Script module; here is the filtering code, using the pandas query method.
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1.query("age >= 25")
Looking at the output, the data set is now filtered so that the minimum age is 25.
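Along the same lines, a hypothetical variant for your scenario: assume the combined training data carries a dataset_name column identifying which original training set each row came from, and that the chosen name arrives on the second input port as a one-row DataFrame with a selected column (fed from your Web Input Parameter):
import pandas as pd

def azureml_main(dataframe1=None, dataframe2=None):
    # dataframe1: combined training data; dataframe2: the selection parameter
    selected = dataframe2["selected"].iloc[0]
    return dataframe1[dataframe1["dataset_name"] == selected]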
Sure, you can do that. What you would want is to use an Execute R Script or SQL Transformation module to determine, based on your input data, which model to use. Something like this:
Notice that your input data is cleaned/updated/feature-engineered, then passed to two different SQL transforms, which tell it which of the two paths to take.
Each path has its own training data.
Note: I am not exactly sure what your use case is, but if it were me, I would instead train two different models using the two different training sets, then just use those models in my web service rather than actually training on the web service, as that would likely be quite slow.