I'm trying to work out whether it is possible to query InfluxDB for an exact match on tags without first having to query InfluxDB for the TAG KEYS to generate my query.
Here is an example to show what I'm trying to accomplish:
Connected to http://localhost:8086 version 1.10.0
InfluxDB shell version: 1.10.0
> create database so
> use so
Using database so
> INSERT cpu,cpu=cpu0,env=intg,host=is-nflxts101t time_system=13592
> INSERT cpu,env=intg,host=is-nflxts101t time_system=134
> SELECT * FROM cpu
name: cpu
time cpu env host time_system
---- --- --- ---- -----------
1668608108642977000 cpu0 intg is-nflxts101t 13592
1668608113752018000 intg is-nflxts101t 134 # <--- We want to get this line ONLY
The goal is to get the final line where cpu isn't specified.
The naive query gives us the complete set of rows, including the one we do not want,
> SELECT * FROM cpu WHERE "env"='intg' AND "host"='is-nflxts101t'
name: cpu
time cpu env host time_system
---- --- --- ---- -----------
1668608108642977000 cpu0 intg is-nflxts101t 13592 # <--- undesired
1668608113752018000 intg is-nflxts101t 134
and while I could first fetch all TAG KEYS to programmatically generate the query,
> SELECT * FROM cpu WHERE "env"='intg' AND "host"='is-nflxts101t' AND "cpu" = ''
name: cpu
time cpu env host time_system
---- --- --- ---- -----------
1668608113752018000 intg is-nflxts101t 134
in my application code, I was hoping there is a way to express this pseudo-query:
> SELECT * FROM cpu WHERE EXACT "env"='intg' AND "host"='is-nflxts101t'
name: cpu
time cpu env host time_system
---- --- --- ---- -----------
1668608113752018000 intg is-nflxts101t 134
Is there any way to accomplish this? Or am I stuck with fetching the keys first?
Your sample query should be good enough: filtering on the empty tag value together with the other conditions does what you want:
SELECT * FROM cpu WHERE "env"='intg' AND "host"='is-nflxts101t' AND "cpu" = ''
For tags, WHERE some_tag = '' matches rows for which the tag has no value. (Note that the tag value is still returned as null in the response, not as the empty string, so the type handling isn't entirely consistent.)
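If it helps, here is a minimal sketch of issuing that query from application code with the influxdb Python client (the connection details and database name are assumptions based on the example session above):
from influxdb import InfluxDBClient

# Assumed connection details matching the example session above.
client = InfluxDBClient(host="localhost", port=8086, database="so")

# A tag that is not set matches the empty string, so no SHOW TAG KEYS round trip is needed.
query = (
    "SELECT * FROM cpu "
    "WHERE \"env\" = 'intg' AND \"host\" = 'is-nflxts101t' AND \"cpu\" = ''"
)
for point in client.query(query).get_points():
    print(point)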
Hi, I'm having issues with dockerized TF Serving seeing but not using my GPU.
It adds the GPU as device 0, allocates memory on it, but then loads the ML model into CPU device memory and runs inference using only the CPUs. GPU-util on nvidia-smi never leaves 0%.
Could anyone help me figure out why this is happening, and what should be changed?
The setup:
OS: Amazon/Deep Learning AMI (Ubuntu 18.04) on EC2 g4dn.xlarge
GPU: Tesla T4
Model: a pretrained gpt2-xl TensorFlow model from Hugging Face, which I froze into a SavedModel and uploaded to S3.
Docker: came stock with the Deep Learning AMI. I've already checked and confirmed that nvidia-smi runs containerized, so it's not an nvidia+docker issue.
TF Serving: I use the Dockerfile below to pull the latest-gpu image and download the model directly into it at build time:
FROM tensorflow/serving:latest-gpu
RUN apt-get update
ENV TZ=America
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update
RUN apt-get install -y awscli
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ARG model_name
ENV MODEL_NAME=$model_name
# Use AWS CLI to download the SavedModel into the docker container from S3 bucket
RUN aws s3 cp s3://v3-models/models/pretrained_tf_serving/${MODEL_NAME} /models/${MODEL_NAME} --recursive
EXPOSE 8500
I build and run the above Dockerfile with these commands:
#!/bin/bash
# first build the image with the model_name arg, and tag it as xl-serving
docker build -t xl-serving --build-arg model_name=gpt2-xl ../../model_server
# then run it with gpus, exposing gRPC port
docker run -it --rm --gpus all --runtime nvidia -p 8500:8500 xl-serving
Running the serving container prints this output. Notice that the GPU is added.
2020-11-06 05:25:34.671071: I tensorflow_serving/model_servers/server.cc:87] Building single TensorFlow model file config: model_name: gpt2-xl model_base_path: /models/gpt2-xl
2020-11-06 05:25:34.671274: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2020-11-06 05:25:34.671295: I tensorflow_serving/model_servers/server_core.cc:575] (Re-)adding model: gpt2-xl
2020-11-06 05:25:34.771644: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771673: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771687: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771724: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/gpt2-xl/1
2020-11-06 05:25:35.222512: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-11-06 05:25:35.222545: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Reading SavedModel debug info (if present) from: /models/gpt2-xl/1
2020-11-06 05:25:35.222672: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-06 05:25:35.223994: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-06 05:25:35.262238: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.263132: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-11-06 05:25:35.263149: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-11-06 05:25:35.263236: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264122: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264948: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-06 05:25:36.185140: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-06 05:25:36.185165: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-11-06 05:25:36.185171: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-11-06 05:25:36.185334: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.186222: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187046: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187852: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13896 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-11-06 05:25:37.279837: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:199] Restoring SavedModel bundle.
2020-11-06 05:25:56.154008: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /models/gpt2-xl/1
2020-11-06 05:25:57.551535: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:303] SavedModel load for tags { serve }; Status: success: OK. Took 22777844 microseconds.
2020-11-06 05:25:57.832736: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/gpt2-xl/1/assets.extra/tf_serving_warmup_requests
2020-11-06 05:25:57.835030: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:57.838329: I tensorflow_serving/model_servers/server.cc:367] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-11-06 05:25:57.840415: I tensorflow_serving/model_servers/server.cc:387] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
I then hit this server with a single, non-batched gRPC call. It runs successfully and returns a correct GPT2 output, but it takes as long as the same setup does on CPU. htop shows 8 GB of RAM (the gpt2-xl model size) loaded into host memory, and the TF Serving process running and maxing out one or two CPU cores. It appears to run only on CPU.
This is what nvidia-smi looks like while the call is running. Notice the allocated memory, and 0% GPU-Util:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 26W / 70W | 14240MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 13357 C tensorflow_model_server 14221MiB |
+-----------------------------------------------------------------------------+
I've scoured the web and can't find any advice for this. The closest I found was this GitHub issue: GPU utilization with TF serving #1440, but its fixes did not work for me. They were dealing with low GPU utilization; I'm dealing with 0%.
Any advice on what the issue is?
Thank you very much. I've been banging my head against the wall for days on this, so I very much appreciate your help :)
Update #1:
I've written a Python script (below) that uses tensorflow==2.3.0 to load the model and run it. It runs in a conda env with CUDA 11.0, successfully runs inference on the GPU, and is a good 15x faster than what I'm getting with TF Serving.
import tensorflow as tf
import numpy as np

model = tf.saved_model.load('/home/ubuntu/models/gpt2-xl/1/')
servable = model.signatures["forward"]

# Create input tensor
tensor_in = tf.constant([[198, 15667, 6530, 25, 29437, 1706, 1610, 977, 948, 33611]])

# Run a loop of 10 inferences on the model, to predict the next 10 tokens.
for i in range(10):
    pred = servable(tensor_in)
    logits = pred['output_0']
    logits = logits[:, -1, :] / 0.8
    next_id = tf.random.categorical(tf.nn.log_softmax(logits, axis=-1), num_samples=1)
    next_id = tf.dtypes.cast(next_id, tf.int32).numpy()
    tensor_in = np.concatenate((tensor_in, next_id), axis=1)
Up next: I will try running TF Serving outside of a container. Update to come...
How did you save your model? Add clear_devices=True when saving the model and try again.
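For reference, here is a hedged sketch of what that could look like when re-exporting an existing SavedModel with the TF1-style builder API (the paths are placeholders and this is not the asker's actual export script):
import tensorflow as tf

# The TF1-style Session/loader API is used below, so disable eager execution under TF 2.x.
tf.compat.v1.disable_eager_execution()

src_dir = "/models/gpt2-xl/1"        # existing SavedModel (placeholder path)
dst_dir = "/models/gpt2-xl-clear/1"  # re-exported copy (placeholder path)

builder = tf.compat.v1.saved_model.builder.SavedModelBuilder(dst_dir)
with tf.compat.v1.Session(graph=tf.Graph()) as sess:
    # Load the existing SavedModel graph and variables into this session.
    meta_graph = tf.compat.v1.saved_model.loader.load(sess, ["serve"], src_dir)
    # Re-export with clear_devices=True so device assignments baked into the graph are stripped.
    builder.add_meta_graph_and_variables(
        sess,
        ["serve"],
        signature_def_map=dict(meta_graph.signature_def),
        clear_devices=True)
builder.save()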
I am trying to use the following InfluxDB query with conditions on both time and field value, but it returns no results:
> select * from something where (time > 1 and time < 20000) or (def > 999)
However, when I remove the last condition, my measurement is returned:
> select * from something where (time > 1 and time < 20000)
name: something
time abc def id
---- --- --- --
10000 444 555 123
Is this a bug in InfluxDB, or am I doing something wrong? I can't find anything in the documentation indicating that time and field conditions can't be combined... I've tried upgrading from 1.7 to 1.8.
To try this yourself:
$ influx
Connected to http://localhost:8086 version 1.8.0
InfluxDB shell version: 1.8.0
> drop database testdb
> create database testdb
> use testdb
Using database testdb
> insert something,id=123 abc=444i,def=555i 10000
> select * from something where (time > 1 and time < 20000) or (def > 999)
I'm using the open source version of InfluxDB on Windows with default settings. I tried 1.6.4 and 1.7.1.
When I specify any retention policy other than the default, the data is not stored.
For test purposes I've created two identical retention policies - default and non_default:
show retention policies
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
default 168h0m0s 24h0m0s 1 true
non_default 168h0m0s 24h0m0s 1 false
When I post data with the non_default retention policy, nothing happens. The server returns success but no data shows up.
$ curl -i -XPOST "http://influx1:8086/write?db=test&rp=non_default" --data-binary 'TestViaHttp,mytag=a myfield=90'
When I post data with the default retention policy, it is inserted successfully.
$ curl -i -XPOST "http://influx1:8086/write?db=test&rp=default" --data-binary 'TestViaHttp,mytag=a myfield=90'
Does anyone have an idea how to fix this?
I figured out that you should specify the retention policy in the SELECT statement:
SELECT * FROM "non_default"."TestViaHttp"
Looks like retention policies are similar to schemas in MS SQL.
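For completeness, the measurement can also be fully qualified with the database name, so the same query works without a prior USE test (standard InfluxQL syntax):
SELECT * FROM "test"."non_default"."TestViaHttp"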
I am using Centreon (Nagios) to monitor the CPUs of some VMs using NSClient. In my case it only makes sense to set the CPU probe to a critical state if the average CPU load is > 95 over the 5m period. Is this achievable?
I cannot find documentation on how to specify that in the critical parameter.
Default command
check_cpu
Returns
CPU Load ok
'total 5m load'=0%;80;90 'total 1m load'=0%;80;90 'total 5s load'=7%;80;90
Command with a specific threshold (but any time period can match)
check_cpu "critical=load > 90"
It is not exactly what I wanted to do, but here is what I did:
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "crit=load > 95" "warn=load > 90" time=5m
This limits the output to the 5m time period.
Note that to execute this from Centreon you have to set the following variables inside the nsclient.ini file (I wasted a lot of time on that one):
[/settings/NRPE/server]
allow nasty characters=true
[/settings/external scripts]
allow nasty characters=true
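For reference, a sketch of how that check_nrpe call could be wrapped in a Nagios/Centreon command definition (the command name and macro usage here are illustrative, not taken from my setup):
define command{
        command_name    check_remote_cpu_5m
        command_line    $USER1$/check_nrpe -u -H $HOSTADDRESS$ -c check_cpu -a "crit=load > 95" "warn=load > 90" time=5m
}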
Check this configuration:
define service{
use generic-service
host_name xxx
service_description CPU Load
check_command check_nrpe!check_load
contact_groups sysadmin
}
---
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
You can try something like this:
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "warning=time = '5m' and load > 80" "critical=time = '5m' and load > 90" show-all
You can also check the documentation for more info.
I have created a series of functions that collect all of a site's IIS configuration. Run locally on the server, they execute without issue (albeit slowly), but when I run them remotely using Invoke-Command in PowerShell 2, the run mysteriously stops approximately 15-20 seconds into the process. It generally stalls on the same request, but not always. The same commands executed locally work without any issues. No exception is raised; it just hangs indefinitely.
I can post the code if necessary, however it is several hundred lines, so I'm more looking for guidance on how to investigate a problem like this, or whether anyone has encountered something similar.
Comparing IISConfig between [targetserver] and localhost.
Checking Installed IIS version on [targetserver]:
IIS major version : 7
IIS minor version : 5
IIS7+ detected, using WebAdmin module and IIS metabase
Name Value
---- -----
name Default Web Site
id 1
serverAutoStart True
state 1
Site Configuration:
Name Path PSPath Handlers_Ac Access_sslF Asp_AppAllo Asp_AppAllo Asp_limits_ Asp_EnableP Asp_limits_
cessFlags lags wClientDebu wDebugging bufferingLi arentPaths queueTimeou
g mit t
---- ---- ------ ----------- ----------- ----------- ----------- ----------- ----------- -----------
Default ... IIS:Site... WebAdmin... Read,Script False False 25000000 True 00:00:00
WebApp VDir: /MyApp, App Pool: MyApp
App pool Configuration:
AppPoolID Enable32Bit managedPipe managedRunt AppPoolName AppPoolAuto processMode processMode processMode recycling_l
AppOnWin64 lineMode imeVersion Start l_idleTimeo l_identityT l_UserName ogEventOnRe
ut ype cycle
--------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
False Classic v2.0 MyApp True 00:20:00 LocalSer... Time,Req...
Analyzing web directories for /MyApp, this could take a while....
Initial Collection Completed, found 141... took 0.9516122 seconds
0 C:\inetpub\wwwroot\MyApp\Core
1 C:\inetpub\wwwroot\MyApp\Core\AdminTools
2 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Cache
3 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Extra
4 C:\inetpub\wwwroot\MyApp\Core\AdminTools\HTTPPostTest
5 C:\inetpub\wwwroot\MyApp\Core\AdminTools\IISAdmin
6 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Profiling
7 C:\inetpub\wwwroot\MyApp\Core\AdminTools\RecordTestData
8 C:\inetpub\wwwroot\MyApp\Core\AdminTools\ScrambleTest
9 C:\inetpub\wwwroot\MyApp\Core\AdminTools\Sessions
Analyzed 10 so far... took 6.7236862 seconds, remaining time 88.08028922 seconds
Current Folder: C:\inetpub\wwwroot\MyApp\Core\AdminTools\Sessions
10 C:\inetpub\wwwroot\MyApp\Core\AdminTools\SoapTest
11 C:\inetpub\wwwroot\MyApp\Core\AdminTools\StaticContent
Sometimes it makes it to 15 or so. I tried from my laptop and from one server to another and the behavior is the same.
Here is the loop which is hanging:
$start = [System.DateTime]::Now
$numanalyzed = 0
if ($true) #skip to test
{
    # loop through all physical folders as it is much faster
    foreach ($folder in $folders)
    {
        write-host $numanalyzed $folder.fullname
        # figure out the virtual path to the folder
        $iis7vwebfolderpath = $folder.FullName.Replace($iis7webapp.PhysicalPath, $iis7VDirWebApppath)
        #Get-item $iis7vwebfolderpath | gm
        $iis7VWebDirConfigItem = Get-LNOSIIS7ConfigForPSPath -PSPath $iis7vwebfolderpath
        # add new item to list
        $iis7VWebDirConfig += $iis7VWebDirConfigItem
        # increment counter and report out progress every 10
        $numAnalyzed++
        if ($numanalyzed % 10 -eq 0)
        {
            $end = [System.DateTime]::Now
            $timeSoFar = (New-TimeSpan -Start $start -End $end).TotalSeconds
            $timeremaining = ($folders.Count - $numAnalyzed) * ($timeSoFar / $numanalyzed)
            "Analyzed {0} so far... took {1} seconds, remaining time {2} seconds" -f $numanalyzed,$timeSoFar,$timeremaining | Write-Host
            "Current Folder: {0}" -f $folder.FullName | Write-Host
        }
    }
}
$end = [System.DateTime]::Now
"Processed web dirs: {0} took {1} seconds" -f $iis7VWebDirConfig.Count,(New-TimeSpan -Start $start -End $end).TotalSeconds | Write-Host
The function I'm having performance problems with is the subject of a separate question I've posted, which also contains its source code:
web-administration vs WMI to query web directory properties performance problems
In my case, it seemed my PowerShell call froze because the idle timeout expired (the call runs for a very long time).
Setting the IdleTimeout value to a sufficiently long duration fixed my issue.
Once again, query the current configuration using
winrm get winrm/config/winrs
And set the timeout using
winrm set winrm/config/winrs '@{IdleTimeout="18000000"}'
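The same value can also be set through the WSMan: drive from an elevated PowerShell prompt on the remote server (an alternative to the winrm syntax above):
# Value is in milliseconds; new remote sessions pick up the new timeout.
Set-Item WSMan:\localhost\Shell\IdleTimeout 18000000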
I think I may have discovered the problem; I started getting some odd failures in other parts of the script:
[SEVERNAME] Processing data from remote server SERVERNAME failed with the following error message: The WSMan provider host process did not return a proper response. A provider in the host process may have behaved improperly. For more information, see the about_Remote_Troubleshooting Help topic.
+ CategoryInfo : OpenError: (SERVERNAME:String) [], PSRemotingTransportException
+ FullyQualifiedErrorId : 1726,PSSessionStateBroken
and
Processing data for a remote command failed with the following error message: Not enough storage is available to complete this operation. For more information, see the about_Remote_Troubleshooting Help topic.
+ CategoryInfo : OperationStopped (System.Manageme...pressionSyncJob:PSInvokeExpressionSyncJob) [], PSRemotingTransportException
+ FullyQualifiedErrorId : JobFailure
This led me to the following site: http://www.gsx.com/blog/bid/83018/Troubleshooting-unknown-PowerShell-error-messages
The following recommendations seem to have cleared up most of the problems, although I still have some testing to do.
Excerpt from site below:
As the first error message specifies, an overflow of memory in the remote session has occurred. Open a PowerShell prompt on the remote server and display the configuration of winrs using:
winrm get winrm/config/winrs
Check the "MaxMemoryPerShellMB" value. It is set by default to 150 MB on Windows Server 2008 R2 and Windows 7. This is something that Microsoft changed in Windows Server 2012 and Windows 8 to 1024 MB.
In order to resolve this issue, you need to increase the value to at least 512 MB with the following command:
winrm set winrm/config/winrs '@{MaxMemoryPerShellMB="512"}'
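This setting can also be changed through the WSMan: drive from an elevated PowerShell prompt on the remote server (equivalent to the winrm command above):
# Raise the per-shell memory quota (in MB); new remote sessions pick up the new quota.
Set-Item WSMan:\localhost\Shell\MaxMemoryPerShellMB 512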
As an FYI if Invoke-Command always hangs:
Try a simple command against the system:
Invoke-Command -ComputerName XXXXX -ScriptBlock { Get-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion }
Start the Windows Remote Management Service (on that system)
Check for the listening port:
netstat -aon | findstr "5985"
TCP 0.0.0.0:5985 0.0.0.0:0 LISTENING 4
TCP [::]:5985 [::]:0 LISTENING 4
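A quick check from the calling machine can also confirm the WinRM endpoint responds (XXXXX being the target host placeholder, as above):
# Returns the remote WinRM identity information if the listener is reachable.
Test-WSMan -ComputerName XXXXX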