What causes Flume with a GCS sink to throw an OutOfMemoryError? - docker

I am using Flume to write to Google Cloud Storage. Flume listens for HTTP on port 9000. It took me some time to make it work (adding the GCS libraries, using a credentials file, ...), but now it seems to communicate over the network.
I am sending very small HTTP requests for my tests, and I have plenty of RAM available:
curl -X POST -d '[{ "headers" : { timestamp=1417444588182, env=dev, tenant=myTenant, type=myType }, "body" : "some body ONE" }]' localhost:9000
I encounter this out-of-memory error on the first request (then, of course, it stops working):
2014-11-28 16:59:47,748 (hdfs-hdfs_sink-call-runner-0) [INFO - com.google.cloud.hadoop.util.LogUtil.info(LogUtil.java:142)] GHFS version: 1.3.0-hadoop2
2014-11-28 16:59:50,014 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:467)] process failed
java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:76)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:79)
at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:820)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
(see complete stack trace as a gist for full details)
The strange part is that the folders and files are created the way I want, but the files are empty.
gs://my_bucket/dev/myTenant/myType/2014-12-01/14-36-28.1417445234193.json.tmp
Is something wrong with the way I configured Flume + GCS, or is it a bug in the GCS connector jar?
Where should I look to gather more data?
PS: I am running flume-ng inside Docker.
My flume.conf file:
# Name the components on this agent
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem
# Describe/configure the source
a1.sources.http.type = org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000
# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = gs://my_bucket/%{env}/%{tenant}/%{type}/%Y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000
a1.channels.mem.transactionCapacity = 1000
# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem
Related question from my Flume/GCS journey: What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with Flume?

When uploading files, the GCS Hadoop FileSystem implementation sets aside a fairly large (64MB) write buffer per FSDataOutputStream (file open for write). This can be changed by setting "fs.gs.io.buffersize.write" to a smaller value, in bytes, in core-site.xml. I imagine 1MB would suffice for low-volume log collection.
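For example, a hedged sketch of the core-site.xml override (1048576 bytes = 1MB; adjust to your traffic):
<property>
  <name>fs.gs.io.buffersize.write</name>
  <value>1048576</value> <!-- 1MB write buffer instead of the 64MB default -->
</property>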
In addition, check what the maximum heap size is set to when launching the JVM for flume. The flume-ng script sets a default JAVA_OPTS value of -Xmx20m to limit the heap to 20MB. This can be set to a larger value in flume-env.sh (see conf/flume-env.sh.template in the flume tarball distribution for details).
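For example, a hedged sketch for conf/flume-env.sh (-Xmx512m is an illustrative value, not a recommendation):
# conf/flume-env.sh -- raise the 20MB default heap; size this to your host/container
export JAVA_OPTS="-Xmx512m"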

Related

Inconsistency between the docker stats command and the Docker REST API memory stats

When looking at a running container with the docker stats command, I can see that its memory usage is 202.3MiB.
However, when looking at the same container through the REST API with
GET /containers/container_name/stats -> memory_stats -> usage, the usage there shows 242.10MiB.
There is a big difference between those values.
What might be the reason for the difference? I know that the docker client uses the REST API to get its stats, but what am I missing here?
Solved my problem. Initially, I did not take cache memory into account when calculating memory usage.
Say stats is the JSON returned from GET /containers/container_name/stats; the correct formula is:
memory_usage = stats["memory_stats"]["usage"] - stats["memory_stats"]["stats"]["cache"]
limit = stats["memory_stats"]["limit"]
memory_utilization = memory_usage/limit * 100
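For reference, a hedged sketch of fetching that JSON with the Docker SDK for Python (assuming the docker package is installed; container_name is a placeholder):
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()
# one snapshot of the same JSON the REST endpoint returns; the formula above indexes into it
stats = client.containers.get("container_name").stats(stream=False)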
Use the rss value, i.e. rss = usage - cache:
"memory_stats": {
"stats": {
"cache": 477356032,
"rss": 345579520,
},
"usage": 822935552
}
On Linux, the Docker CLI reports memory usage by subtracting page cache usage from the total memory usage.
The API does not perform such a calculation but rather provides the total memory usage and the amount from the page cache so that clients can use the data as needed. (https://docs.docker.com/engine/reference/commandline/stats/)
The accepted answer is incorrect for recent Docker versions (greater than 19.03).
The correct way to get the same number that docker stats reports is:
memory = stats['memory_stats']['usage'] - stats["memory_stats"]["stats"]["total_inactive_file"]
memory_limit = stats['memory_stats']['limit']
memory_perc = (memory / memory_limit) * 100
JavaScript code, following the Docker CLI source code:
const memStats = stats.memory_stats
const memoryUsage = memStats.stats.total_inactive_file && memStats.stats.total_inactive_file < memStats.usage
? memStats.usage - memStats.stats.total_inactive_file
: memStats.usage - memStats.stats.inactive_file
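For completeness, the same calculation as a hedged Python sketch (mirroring the JavaScript above; which key is present depends on whether the host uses cgroup v1 or v2):
def docker_stats_memory(stats):
    """Return (usage_bytes, usage_percent) the way recent docker stats computes them."""
    mem = stats["memory_stats"]
    # cgroup v1 exposes "total_inactive_file"; cgroup v2 exposes "inactive_file"
    inactive = mem["stats"].get("total_inactive_file", mem["stats"].get("inactive_file", 0))
    usage = mem["usage"] - inactive if inactive < mem["usage"] else mem["usage"]
    return usage, usage / mem["limit"] * 100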

How to select the preferred file transport method?

I have what I think is a problem with my Prosody configuration. When I send files (for example photos) larger than ~2 or 3 megabytes (as I established experimentally) using Conversations 2.* (an Android IM app), they are transferred over a peer-to-peer connection instead of being uploaded to the server with a link sent to my interlocutor. Small files transfer fine using HTTP upload, and I couldn't find a reason for this behavior.
Here are the http_upload lines from my config, which I took from the official documentation (where I did not find a setting for turning off peer-to-peer file transfer):
http_upload_file_size_limit = 536870912 -- 512 MB in bytes
http_upload_expire_after = 604800 -- 60 * 60 * 24 * 7
http_upload_quota = 10737418240 -- 10 GB
http_upload_path = "/var/lib/prosody"
And this is my full config: https://pastebin.com/V6DNYrhe
Small files transfer fine using HTTP upload, and I couldn't
find a reason for this behavior.
TL;DR: You put options in the wrong place. The default 1MB limit
applies. This is advertised to clients so they know about it and can use
more efficient p2p transfer methods for very large files.
http_upload_path = "/var/lib/prosody"
This line makes Prosody's data directory public, allowing anyone easy
access to all user data. You really don't want to do that. You are
lucky you did not put that in the correct section.
And this is my full config: https://pastebin.com/V6DNYrhe
"http_upload" is in the global modules_enabled list which will load
it onto all VirtualHost(s).
You have added options to the end of the config file, putting them under
a Component section. That makes those options only apply to that
Component.
Thus, the VirtualHost where mod_http_upload is loaded sees no options
set and will use the defaults.
http_upload_file_size_limit = 536870912 -- 512 MB in bytes
Don't do this. Prosody's built-in HTTP server is not optimized for very
large uploads. There is a safety limit on HTTP request size that caps
the upload size at 10M to prevent DoS attacks.
While that limit can be changed, I would strongly suggest you look at
https://modules.prosody.im/mod_http_upload_external.html instead.
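Putting the points above together, a hedged sketch of the corrected placement (illustrative values, kept in the question's style):
-- prosody.cfg.lua: keep these in the global section, before any VirtualHost or
-- Component block, since "http_upload" is loaded from the global modules_enabled list
http_upload_file_size_limit = 10485760 -- 10 MB, within the built-in request-size safety limit
http_upload_expire_after = 604800 -- 60 * 60 * 24 * 7
http_upload_quota = 10737418240 -- 10 GB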

Dask script fails on a large CSV file in a localhost environment

We are trying to use Dask to clean up some data as part of an ETL process.
The original file is a CSV of over 3GB.
When we run the code on a 1GB subset, it runs successfully (with a few user warnings regarding our cleaning procedures such as the following):
import re  # needed for the compiled pattern below

# ddf is a dask DataFrame and id1 the name of the column being cleaned (both defined elsewhere)
ddf[id1] = ddf[id1].str.extract(r'(\d+)')
repeater = re.compile(r'((\d)\2{5,})')  # the same digit repeated six or more times
mask_repeater = ddf[id1].str.contains(repeater, regex=True)
ddf = ddf[~mask_repeater]
On the 3GB file the process nearly completes (only one task is left, drop-duplicates-agg) and then restarts from the middle (that is what I can see on the Bokeh status page). We also see the same warning as when the script starts to run:
RuntimeWarning: Couldn't detect a suitable IP address for reaching '8.8.8.8', defaulting to '127.0.0.1'...
I'm running on an offline, single Windows 64-bit workstation with 24 cores.
Any suggestions?

Error creating vocabulary from big text file on disk

I am trying to run the example from https://cran.r-project.org/web/packages/text2vec/vignettes/files-multicore.html, but with my own file "text": 3.7GB of plain text built from a Wikipedia XML dump with the Perl script from http://mattmahoney.net/dc/textdata.html
setwd("c:/rtest")
library(text2vec)
library(doParallel)
N_WORKERS = 2
registerDoParallel(N_WORKERS)
it_files_par = ifiles_parallel(file_paths = "text")
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it_token_par)
This causes the error:
Error in unserialize(socklist[[n]]) : error reading from connection
I have 8GB of RAM; a word2vec model is created from this file without any errors.
First of all, it doesn't make sense to use parallel iterators on a single file: each file is processed in a separate R worker process, so here it will be worse than plain itoken. It also involves sending the result from each worker back to the master process, and here we see that the result is too big to be sent through a socket.
Long story short: just use itoken, or split your file into several smaller files.
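A minimal sketch of the sequential route, assuming the file layout from the question (setwd("c:/rtest") and a single file named "text"):
library(text2vec)
it_files = ifiles("text")   # plain, non-parallel file iterator
it_token = itoken(it_files, preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it_token)  # built in one process, nothing sent over sockets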

How to log uWSGI vassal metrics in emperor mode?

We're using uWSGI in Emperor mode. We want to be able to track the default (non-custom) metrics like worker.0.requests, and we're trying to use the metrics-dir configuration parameter in the vassals' ini files. For example:
enable-metrics = true
metrics-dir = /tmp/pametrics
Files are being written to the directory we specify, and their timestamps are being updated each time we hit the app being served by the vassal, but they are all 4096 bytes long and full of zero bytes; they are not text files as the documentation says.
What are we missing?
They are memory-mapped files, so their size is the same as a memory page.
Being zero-terminated, you can use the classic Unix utilities to manage them.
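For example, a hedged sketch (the metric file name is taken from the question; tr and od are standard utilities):
# print one metric value, stripping the NUL padding of the memory-mapped page
tr -d '\0' < /tmp/pametrics/worker.0.requests; echo
# or inspect the raw bytes
od -c /tmp/pametrics/worker.0.requests | head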
