Determine attached managed disks of a VM - terraform-provider-azure

I want to determine the managed disk IDs of all data disks that are attached to a particular VM. The data sources "azurerm_managed_disk" and "azurerm_virtual_machine" are of no great help; they do not provide information about the relationship between the VM and its disks.
In PowerShell, the VM object has a .StorageProfile property that gives you that information, but how do I determine this relationship in Terraform?

There is indeed no data source that directly returns the managed disks attached to a particular VM. However, you can gather the disk information with a script, as you suggest, and then execute that script from Terraform through an External Data Source, so the disk information becomes available in Terraform indirectly. The Terraform code looks like this:
data "external" "powershell_test" {
program = ["Powershell.exe", "./vmDisk.ps1"]
}
output "value" {
value = "${data.external.powershell_test.result}"
}
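
The vmDisk.ps1 script itself is not shown above. As an illustration only, here is a hypothetical equivalent written in Python that shells out to the Azure CLI; the script name, the resource_group/vm_name query keys, and the disk_ids output key are assumptions, not part of the original answer. Terraform's external data source passes its query block as JSON on stdin and expects a flat JSON object of string values on stdout.

# vm_disks.py - hypothetical stand-in for vmDisk.ps1.
# Assumes the Azure CLI is installed and logged in.
import json
import subprocess
import sys

def main():
    # Terraform sends the data source's "query" arguments as JSON on stdin.
    query = json.load(sys.stdin)
    resource_group = query["resource_group"]
    vm_name = query["vm_name"]

    # Ask the Azure CLI for the IDs of the data disks attached to the VM.
    result = subprocess.run(
        ["az", "vm", "show", "-g", resource_group, "-n", vm_name,
         "--query", "storageProfile.dataDisks[].managedDisk.id", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    disk_ids = json.loads(result.stdout)

    # The external data source only accepts a flat JSON object of string values,
    # so the list of IDs is joined into one comma-separated string.
    print(json.dumps({"disk_ids": ",".join(disk_ids)}))

if __name__ == "__main__":
    main()

The resource group and VM name would then be supplied through the data source's query argument, and the IDs read back from data.external.<name>.result["disk_ids"].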

Related

kubeflow OutputPath/InputPath question when writing/reading multiple files

I have a data-fetch stage where I get multiple DFs and serialize them. I'm currently treating OutputPath as a directory: I create it if it doesn't exist, and so on, and then serialize all the DFs into that path under a different name for each DF.
In a subsequent pipeline stage (say, predict) I need to retrieve all of those through InputPath.
Now, from the documentation it seems InputPath/OutputPath are treated as files. Does Kubeflow have any limitation if I use them as directories?
The ComponentSpec's {inputPath: input_name} and {outputPath: output_name} placeholders and their Python analogs (input_name: InputPath()/output_name: OutputPath()) are designed to support both files/blobs and directories.
They are expected to provide the path for the input/output data, no matter whether the data is a blob/file or a directory.
The only limitation is that UX might not be able to preview such artifacts.
But the pipeline itself would work.
I have experimented with a trivial pipeline - no issue is observed if InputPath/OutputPath is used as directory.
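
For reference, a minimal sketch of such a trivial pipeline using the kfp v1 function-based components; the component names, file names, and the pandas dependency are assumptions for illustration.

from kfp.components import InputPath, OutputPath, create_component_from_func

def write_dataframes(output_dir_path: OutputPath()):
    # Treat the output path as a directory and write several CSVs into it.
    import os
    import pandas as pd
    os.makedirs(output_dir_path, exist_ok=True)
    for name in ["train", "test"]:
        pd.DataFrame({"a": [1, 2]}).to_csv(
            os.path.join(output_dir_path, name + ".csv"), index=False)

def read_dataframes(input_dir_path: InputPath()):
    # Read every CSV back from the directory produced upstream.
    import os
    import pandas as pd
    for name in os.listdir(input_dir_path):
        print(name, pd.read_csv(os.path.join(input_dir_path, name)).shape)

write_op = create_component_from_func(write_dataframes, packages_to_install=["pandas"])
read_op = create_component_from_func(read_dataframes, packages_to_install=["pandas"])

def pipeline():
    write_task = write_op()
    # kfp strips the _path suffix, so the output is named "output_dir"
    # and the downstream input is named "input_dir".
    read_op(input_dir=write_task.outputs["output_dir"])

# e.g. run with: kfp.Client().create_run_from_pipeline_func(pipeline, arguments={})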

AWS Step Function, Map State and Batch Input/Output

Let's suppose I have a Step Function with a Map State. The Map State runs a Batch Job, associated with a Docker container. I want to pass input parameters to the containers and receive output for the Step Function's other states.
I believe it could be a Lambda Function that iterates through the input as an array and passes each element as environment variables to the containers. But what would such a lambda, doing a foreach over the input and setting environment variables, look like? And how can I capture the Docker container's output (I believe it could be an S3 file/directory)?
Also, is there any alternative to a Lambda Function at all?
Handling the iterator:
If you have a predefined input array that you want to iterate on with the map state, then you can just pass that via the Map state's InputPath and ItemsPath, but in some cases you may need to set up a lambda that will go and create that list for you (a sketch of such a lambda follows the example below).
Your ItemsPath might look something like:
"list": [
{
"input": "<my_cool_input parameters>"
},
{
"input": "<my_cool_input parameters>"
}...
]
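
A minimal sketch of the list-building lambda mentioned above, written as a Python handler; the event shape (bucket/prefix keys), the S3 listing, and the "list" output key are assumptions for illustration.

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Assumption: the inputs to fan out over live under an S3 prefix passed in the event.
    response = s3.list_objects_v2(Bucket=event["bucket"], Prefix=event["prefix"])
    items = [{"input": obj["Key"]} for obj in response.get("Contents", [])]
    # The Map state's ItemsPath can then point at $.list
    return {"list": items}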
Handling the output:
As far as I know, there is currently no way to get output from Batch compute back to the state machine directly, so you will need to take an indirect approach.
One way is to write the output from your Docker container to some temporary location such as DynamoDB or S3. Then you need a step in your step function to read the output back: from DynamoDB you can do that directly, no lambda needed; if you write to S3, you will need a lambda to read the output.
It would seem that this approach is also needed to capture exceptions raised inside a Docker container - I'm all ears if anyone has a better approach.
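
As a concrete illustration of the DynamoDB variant, a sketch of what the container could run when it finishes; the table name and key schema are assumptions, and AWS_BATCH_JOB_ID is the job id AWS Batch exposes in the container environment.

# Inside the Batch container: store the result where a later state can read it.
import json
import os
import boto3

def save_result(result):
    table = boto3.resource("dynamodb").Table("batch-job-results")  # hypothetical table
    table.put_item(Item={
        "job_id": os.environ["AWS_BATCH_JOB_ID"],  # set by AWS Batch
        "result": json.dumps(result),
    })

if __name__ == "__main__":
    save_result({"status": "ok", "records_processed": 123})

A later state can then fetch the item directly with a dynamodb:getItem service integration keyed on the job id returned by the Batch step.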

How to use custom docker storage in Prefect flows?

I have set up a Dask cluster and I'm happily sending basic Prefect flows to it.
Now I want to do something more interesting: take a custom Docker image with my Python library on it and execute flows/tasks on the Dask cluster.
My assumption was that I could leave the Dask cluster (scheduler and workers) as they are, with their own Python environment (after checking that the various message-passing libraries have matching versions everywhere). That is to say, I do not expect to need to add my library to those machines if the Flow is executed within my custom storage.
However, either I have not set up storage correctly or it is not safe to assume the above. In other words, perhaps when pickling objects from my custom library, the Dask cluster does need to know about my Python library. Suppose I have some generic Python library called data...
import prefect
from prefect.engine.executors import DaskExecutor
# see https://docs.prefect.io/api/latest/environments/storage.html#docker
from prefect.environments.storage import Docker

# option 1
storage = Docker(registry_url="gcr.io/my-project/",
                 python_dependencies=["some-extra-public-package"],
                 dockerfile="/path/to/Dockerfile")
# this is the docker build and register workflow!
# storage.build()

# or option 2, specify the image directly
storage = Docker(
    registry_url="gcr.io/my-project/", image_name="my-image", image_tag="latest"
)
# storage.build()

def get_tasks():
    return [
        "gs://path/to/task.yaml"
    ]

@prefect.task
def run_task(uri):
    # fails because this data needs to be pickled ??
    from data.tasks import TaskBase
    task = TaskBase.from_task_uri(uri)
    # task.run()
    return "done"

with prefect.Flow("dask-example",
                  storage=storage) as flow:
    # chain stuff...
    result = run_task.map(uri=get_tasks())

executor = DaskExecutor(address="tcp://127.0.0.1:8080")
flow.run(executor=executor)
Can anyone explain how/if this type of docker-based workflow should work?
Your dask workers will need access to the same python libraries that your tasks rely on to run. The simplest way to achieve this is to run your dask workers using the same image as your Flow. You could do this manually, or using something like the DaskCloudProviderEnvironment that will create short-lived Dask clusters per-flow run using the same image automatically.
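
As a rough illustration of the manual route, one way to stand up a cluster whose workers run the Flow's image and point the executor at it; the FargateCluster class from dask_cloudprovider, the image URI, and the worker count are assumptions for the sketch, and any deployment mechanism that lets you pick the worker image would do.

from dask_cloudprovider import FargateCluster
from prefect.engine.executors import DaskExecutor

# Start scheduler and workers from the same image that the Docker storage built,
# so the workers can unpickle and import the custom "data" library.
cluster = FargateCluster(
    image="gcr.io/my-project/my-image:latest",  # placeholder image
    n_workers=3,
)

executor = DaskExecutor(address=cluster.scheduler_address)
flow.run(executor=executor)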

How to compute CSV files stored on each worker in parallel, without using HDFS?

The concept is the same as data locality on Hadoop, but I don't want to use HDFS.
I have 3 dask workers.
I want to compute a big CSV file, for example mydata.csv.
I split mydata.csv into small files (mydata_part_001.csv ... mydata_part_100.csv) and store them in the local folder /data on each worker,
e.g.
worker-01 stores mydata_part_001.csv - mydata_part_030.csv in local folder /data
worker-02 stores mydata_part_031.csv - mydata_part_060.csv in local folder /data
worker-03 stores mydata_part_061.csv - mydata_part_100.csv in local folder /data
How can I use Dask to compute over mydata?
Thanks.
It is more common to use some sort of globally accessible file system. HDFS is one example of this, but several other Network File Systems (NFSs) exist. I recommend looking into these instead of managing your data yourself in this way.
However, if you want to do things this way then you are probably looking for Dask's worker resources, which allow you to target specific tasks to specific machines.
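
A sketch of that approach, assuming each worker was started with a distinguishing resource tag (for example dask-worker scheduler:8786 --resources "node01=1"); the scheduler address, the tags, and the file ranges are placeholders, and pandas must be installed on the workers.

from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd

client = Client("tcp://scheduler:8786")  # placeholder scheduler address

# Which worker (identified by its resource tag) holds which local files.
parts_by_node = {
    "node01": [f"/data/mydata_part_{i:03d}.csv" for i in range(1, 31)],
    "node02": [f"/data/mydata_part_{i:03d}.csv" for i in range(31, 61)],
    "node03": [f"/data/mydata_part_{i:03d}.csv" for i in range(61, 101)],
}

futures = []
for node, paths in parts_by_node.items():
    for path in paths:
        # resources= pins each read onto the machine that actually has the file.
        futures.append(client.submit(pd.read_csv, path, resources={node: 1}))

# Stitch the per-file frames into one dask dataframe and compute on it.
ddf = dd.from_delayed(futures)
print(len(ddf))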

Cloud Composer - Get google user

Is there a way to get the Google account name running the DAGs from within the DAG definition?
This would be very helpful to track which user was running the DAGs.
I only see:
unixname --> always airflow
owner --> fixed in the DAG definition
Regards,
Eduardo
Possible, as DAGs in Composer are essentially GCS objects, and the GCS object GET API does tell you who uploaded an object. So here's one possible way of getting owner info (a sketch follows the steps below):
Define a function user_lookup() in your DAG definition.
The implementation of user_lookup() consists of the following steps: a) get the current file path (e.g., os.path.basename(__file__)); b) based on how Composer GCS objects are mapped locally, determine the corresponding GCS object path (e.g., gs://{your-bucket}/object); c) read the GCS object details and return object.owner.
In your DAG definition, set owner=user_lookup().
Let us know whether the above works for you.
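
A minimal sketch of those steps; the bucket name is a placeholder, and it assumes the DAG file sits under dags/ in the Composer bucket and that google-cloud-storage is available in the environment.

import os
from google.cloud import storage

COMPOSER_BUCKET = "your-composer-bucket"  # placeholder

def user_lookup():
    # Composer mirrors gs://{bucket}/dags into the local DAGs folder,
    # so the local file name maps back to a GCS object path.
    object_name = "dags/" + os.path.basename(__file__)
    blob = storage.Client().bucket(COMPOSER_BUCKET).get_blob(object_name)
    # Owner/ACL metadata is only returned with the "full" projection
    # (and is unavailable when uniform bucket-level access is enabled).
    blob.reload(projection="full")
    return blob.owner

default_args = {"owner": user_lookup()}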
