Telegraf: trigger at start instead of on an interval? - influxdb

Hello, I am currently trying to parse a folder of many CSV files (ca. 3 GB) into InfluxDB.
On the InfluxData blog it was suggested that this would be the fastest way, since Telegraf is written in Go. So:
I can get everything to work, and I can parse all CSVs and write them to InfluxDB.
The problem is that parsing and writing the files takes a lot of time (old MacBook... more than an hour, I think), and when the agent interval is shorter than the time it takes to write the data, the Telegraf agent starts reading and writing all the files again at the next interval. So it never finishes, and my RAM gets packed with the same parsed data over and over. When I set the interval really high, I have to wait one interval before the agent starts, so that is not an option either.
The question is:
Can Telegraf be triggered like a script, so that I just run it once and don't have to wait for an interval to start?

The functionality you need has been added since this question was asked. You can now run Telegraf with a --once flag.
I can't find it documented anywhere, but the commit is here.
It's available in v1.15.0-rc1
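For reference, a one-shot run looks something like this (assuming a configuration file named telegraf.conf; Telegraf gathers and writes all metrics once, then exits):
telegraf --config telegraf.conf --once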

Related

Google Cloud Run is very slow vs. local machine

We have a small script that scrapes a webpage (~17 entries) and writes them to a Firestore collection. For this, we deployed a service on Google Cloud Run.
The execution of this code takes ~5 seconds when tested locally using Docker Container image.
The same image when deployed to Cloud Run takes over 1 minute.
Even a simple command such as "Delete all documents in a collection", which takes 2-3 seconds locally, takes over 10 seconds when deployed on Cloud Run.
We are aware of Cold Start, and so we tested the performance of Cloud Run on the third, fourth and fifth subsequent runs, but it's still quite slow.
We also experimented with the number of CPUs, instances, concurrency, memory, using both default values as well as extreme values at both ends, but Cloud Run's performance is slow.
Is this expected? Are individual instances of Cloud Run really this weak? Can we do something to make it faster?
The problem with this slowness is that if we run our code for a large number of entries, Cloud Run would eventually time out (not to mention the cost of Cloud Run per second).
Posting an answer to my own question, as we experimented a lot with this and found issues in our own implementation.
In our case, the reason for super slow performance was async calls without Promises or callbacks.
What we initially missed was this section: Avoiding background activities
Our code didn't wait for the async operations to finish and responded to the request right away. The async work then became background activity and took forever to finish.
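The service in question was Node.js, but the same principle applies in any runtime: complete (or await) all pending work before sending the response, because Cloud Run may throttle the CPU once the response has been returned. As a rough illustration in Java (the language used elsewhere on this page), assuming the Firestore Java client; the class and method names here are hypothetical:
import com.google.api.core.ApiFuture;
import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.WriteResult;
import java.util.List;
import java.util.Map;

public class ScrapeHandler {
  private final Firestore db;

  public ScrapeHandler(Firestore db) {
    this.db = db;
  }

  // Wait for every Firestore write to be acknowledged before the HTTP
  // response is sent, so no work is left running as "background activity".
  public void saveEntries(List<Map<String, Object>> entries) throws Exception {
    for (Map<String, Object> entry : entries) {
      ApiFuture<WriteResult> write = db.collection("entries").document().set(entry);
      write.get(); // block until this write completes
    }
    // only after this method returns should the handler send its response
  }
}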
Responding to comments posted, or similar questions that may arise:
1. We didn't try to reproduce the setup locally in a VM with the same configuration, because we found the cause before that.
2. We are not writing anything to the filesystem (yet), and the operations are simple calls. But this is a good question, and we'll keep it in mind when we start storing/writing data.

Redis replication without Lua

Some information that's important to the question, before I describe the problems and issues.
Redis Lua scripting replicates the script itself instead of replicating the single commands, both to slaves and to the AOF file. This is needed as often scripts are one or two orders of magnitude faster than executing commands in a normal way, so for a slave to be able to cope with the master replication link speed and number of commands per second this is the only solution available.
More information about this decision in Lua scripting: determinism, replication, AOF (GitHub issue).
Question
Is there any way or workaround to replicate the single commands instead of the Lua script itself?
Why?
We use Redis as a natural language processing (multinomial naive Bayes) application server. Each time we learn from new text, we have to update a big list of word weights; the list contains approximately 1,000,000 words. Processing time using Lua is ~350 ms per run. Processing using a separate application server (hiredis based) is 37 seconds per run.
I'm thinking about workarounds like these:
After the computation is done, transfer the key to another (read-only) server with MIGRATE
From time to time, save the RDB, move it to the other server, and load it by hand.
Is there any other workaround to solve this?
Yes, in the near future we're gonna have just that: https://www.reddit.com/r/redis/comments/3qtvoz/new_feature_single_commands_replication_for_lua/

Dataflow job takes too long to start

I'm running a job which reads about 70 GB of compressed data.
In order to speed up processing, I tried to start the job with a large number of instances (500), but after 20 minutes of waiting, it doesn't seem to start processing the data (I have a counter for the number of records read). The reason for the large number of instances is that one of the steps needs to produce an output similar to an inner join, which results in a much bigger intermediate dataset for later steps.
What is the typical delay between submitting a job and when it starts executing? Does it depend on the number of machines?
While I might have a bug that causes that behavior, I still wonder what that number/logic is.
Thanks,
G
The time necessary to start VMs on GCE grows with the number of VMs you start, and in general VM startup/shutdown performance can have high variance. 20 minutes would definitely be much higher than normal, but it is somewhere in the tail of the distribution we have been observing for similar sizes. This is a known pain point :(
To verify whether VM startup is actually at fault this time, you can look at Cloud Logs for your job ID, and see if there's any logging going on: if there is, then some VMs definitely started up. Additionally you can enable finer-grained logging by adding an argument to your main program:
--workerLogLevelOverrides=com.google.cloud.dataflow#DEBUG
This will cause workers to log detailed information, such as receiving and processing work items.
Meanwhile, I suggest enabling autoscaling instead of specifying a large number of instances manually - it should gradually scale to the appropriate number of VMs at the appropriate moment in the job's lifetime.
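For example, instead of pinning --numWorkers=500, something like the following should let the service pick the worker count (flag names as in the Dataflow Java SDK; verify them against your SDK version):
--autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=500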
Another possible (and probably more likely) explanation is that you are reading a compressed file that needs to be decompressed before it is processed. It is impossible to seek in the compressed file (since gzip doesn't support it directly), so even though you specify a large number of instances, only one instance is being used to read from the file.
The best way to solve this would be to split the single compressed file into many files that are compressed separately.
The best way to debug this problem would be to try it with a smaller compressed input and take a look at the logs.
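Once the input is split into per-shard files, a glob pattern lets each shard be decompressed and read by a different worker. A minimal sketch in the Dataflow Java SDK, with hypothetical bucket and path names (TextIO picks up gzip compression from the .gz extension by default):
// Each matching shard can be read and decompressed in parallel.
PCollection<String> lines = pipeline.apply(TextIO.Read
    .named("read-split-shards")
    .from("gs://my-bucket/input/part-*.gz"));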

What is the Cloud Dataflow equivalent of BigQuery's table decorators?

We have a large table in BigQuery where the data is streaming in. Each night, we want to run a Cloud Dataflow pipeline which processes the last 24 hours of data.
In BigQuery, it's possible to do this using a 'Table Decorator', and specifying the range we want i.e. 24 hours.
Is the same functionality somehow possible in Dataflow when reading from a BQ table?
We've had a look at the 'Windows' documentation for Dataflow, but we can't quite figure out if that's what we need. We came up with this so far (we want the last 24 hours of data using FixedWindows), but it still tries to read the whole table:
pipeline.apply(BigQueryIO.Read
        .named("events-read-from-BQ")
        .from("projectid:datasetid.events"))
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardHours(24))))
    .apply(ParDo.of(denormalizationParDo)
        .named("events-denormalize")
        .withSideInputs(getSideInputs()))
    .apply(BigQueryIO.Write
        .named("events-write-to-BQ")
        .to("projectid:datasetid.events")
        .withSchema(getBigQueryTableSchema())
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Are we on the right track?
Thank you for your question.
At this time, BigQueryIO.Read expects table information in "project:dataset.table" format, so specifying decorators would not work.
Until support for this is in place, you can try the following approaches:
Run a batch stage which reads the whole BigQuery table, filters out the unnecessary data, and processes the rest (see the sketch after the next point). If the table is really big, you may want to fork the filtered data into a separate table, provided the amount of data you actually need is significantly smaller than the total.
Use streaming Dataflow. For example, you may publish the data onto Pub/Sub and create a streaming pipeline with a 24-hour window. The streaming pipeline runs continuously, but provides sliding windows vs. daily windows.
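For the first approach, a rough sketch in the same (old) Dataflow Java SDK style as the snippet above; ts_millis is a hypothetical column holding each event's time as epoch milliseconds:
PCollection<TableRow> last24h = pipeline
    .apply(BigQueryIO.Read
        .named("events-read-from-BQ")
        .from("projectid:datasetid.events"))
    .apply(ParDo.of(new DoFn<TableRow, TableRow>() {
      @Override
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        // Keep only rows from the last 24 hours; drop everything else.
        long eventMillis = Long.parseLong(String.valueOf(row.get("ts_millis")));
        long cutoffMillis = System.currentTimeMillis() - Duration.standardHours(24).getMillis();
        if (eventMillis >= cutoffMillis) {
          c.output(row);
        }
      }
    }).named("events-filter-last-24h"));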
Hope this helps

Is it possible to write a daemon to do cleanup on Parse?

I am working on a Snapchat clone to get familiar with Parse. I was wondering if there is a way to write a script that runs at predefined intervals and deletes messages that are over 24 hours old.
You can write a background job (https://www.parse.com/docs/cloud_code_guide#jobs).
Then schedule the job to run every 24 hours.
