After a job is finished, how can I know the maximum resident size it required at any given point while running?
(I tried /usr/bin/time, but it is not installed on the server.)
Thank you!
PBS MOM reports resource-usage statistics back to the server, and they are recorded in the PBS server log.
A handy utility called tracejob parses the logs to extract all entries related to a specific job given a job ID.
For example, after job completion on PBS Pro 12.1, tracejob returns several lines, including the following:
07/11/2014 16:37:27 S Exit_status=0 resources_used.cpupercent=98
resources_used.cput=01:49:14 resources_used.mem=5368kb
resources_used.ncpus=1 resources_used.vmem=38276kb
resources_used.walltime=01:49:22
Here, 5368 kb corresponds to the maximum RSS.
Similarly, on Torque 3.0.5:
07/15/2014 03:45:12 S Exit_status=0 resources_used.cput=20:44:10
resources_used.mem=704692kb
resources_used.vmem=1110224kb
resources_used.walltime=20:44:30
Here, the maximum RSS was 704692 kb.
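If you need the value in a script rather than by eye, the resources_used fields can be pulled out of the tracejob text with a regular expression. This is a minimal sketch, assuming the output looks like the excerpts above; the helper names parse_resources_used and mem_kb are made up for illustration, and only the kb suffix shown in these logs is handled.

```python
import re

# Matches "resources_used.<name>=<value>" pairs as they appear in the
# tracejob excerpts above.
RESOURCE_RE = re.compile(r"resources_used\.(\w+)=(\S+)")


def parse_resources_used(text):
    """Return a dict of the resources_used.* fields found in tracejob output."""
    return dict(RESOURCE_RE.findall(text))


def mem_kb(value):
    """Convert a value such as '704692kb' to an integer number of kilobytes."""
    match = re.fullmatch(r"(\d+)kb", value)
    if match is None:
        # Assumption: other unit suffixes are not handled in this sketch.
        raise ValueError(f"unexpected memory value: {value!r}")
    return int(match.group(1))


# The Torque excerpt quoted above, pasted in as sample input.
sample = """07/15/2014 03:45:12 S Exit_status=0 resources_used.cput=20:44:10
resources_used.mem=704692kb
resources_used.vmem=1110224kb
resources_used.walltime=20:44:30"""

fields = parse_resources_used(sample)
print(mem_kb(fields["mem"]), "kb maximum resident memory")
```

In practice you would feed it the captured output of tracejob for your job ID instead of the pasted sample.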
We have a Spring Boot 2.0.4 application that uses a distributed Hazelcast 3.11 cache. In our application we configured a HazelcastClient that connects to a Hazelcast server running in a Docker container.
In the cache we store different "persons" in one map, and the same "persons" as a list in another map (~900 persons in one list under one key). The persons in the two maps are not 100% identical: both describe the same real-life person, but the ones in the list have fewer properties. All the maps are of BINARY type.
When we ran stress tests that get a person by random id from the cache (1st map), everything went excellently. 5000 concurrent requests didn't affect our application heap at all, and 10000 affected it only slightly. In JSON format, one person's details are 10 kB in size.
When we ran stress tests that get the list of persons from the cache (2nd map), we ran into problems with the heap of our application, where the client is configured. We made just 500 concurrent requests and the heap grew to 4 GB! In JSON format the list is 800 kB in size. It is stored in the 2nd map and was requested by the same key 500 times.
Does anybody know what is going on?
DTO
Controller
Method of a Facade which is retrieved from the Controller, and where caching takes place via @Cacheable annotation
HazelcastInstance configuration
hazelcast.xml configuration for the server side
500 concurrent requests (3 times in a row)
Heap, Classes
UPDATED:
I made 500 concurrent requests sequentially 23 times. Below we can see the final minutes of the test.
Telemetries Overview
@Nicolay, correct me if I'm wrong:
The second map contains lists of people, ~900 people per entry. You mentioned each person is ~10KB, so each entry in the second map is ~9MB, even though you say it is 800KB in JSON format. Can you please check the size of the entries in the second map through Hazelcast, e.g. client.getMap(map_name).getEntryView(key).getCost()? This gives you the entry's memory cost in bytes.
500 concurrent requests, at ~9MB per entry, would require ~4.5GB of additional heap, which matches what you observed.
Looking at the numbers, everything seems fine, other than the JSON size being 800KB.
Can you check those numbers?
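The arithmetic behind that estimate can be written out as follows. This is only a back-of-the-envelope sketch using the figures quoted in this thread (900 persons, ~10 KB each, 500 concurrent requests); the function names are illustrative, and it assumes every in-flight request holds its own deserialized copy of the entry.

```python
# Figures quoted in the thread; treat them as rough estimates.
PERSONS_PER_LIST = 900
PERSON_SIZE_KB = 10          # one full person record in JSON
CONCURRENT_REQUESTS = 500


def entry_size_mb(persons, person_kb):
    """Approximate size of one list entry in MB (1 MB = 1000 KB here)."""
    return persons * person_kb / 1000


def heap_needed_gb(entry_mb, requests):
    """Heap needed if every concurrent request holds its own copy of the entry."""
    return entry_mb * requests / 1000


entry_mb = entry_size_mb(PERSONS_PER_LIST, PERSON_SIZE_KB)
print(entry_mb, "MB per entry if each person is ~10 KB")
print(heap_needed_gb(entry_mb, CONCURRENT_REQUESTS), "GB for 500 copies")

# For comparison: the same calculation with the 800 KB JSON size from the question.
print(heap_needed_gb(0.8, CONCURRENT_REQUESTS), "GB if an entry is really 800 KB")
```

The gap between the two results is why the entry cost reported by Hazelcast is worth checking: only the ~9 MB figure is in line with the 4 GB heap growth that was observed.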
We have a 1 GB list that was created using the View.asList() method on Beam SDK 2.0. We are trying to iterate through every member of the list and, for now, do nothing significant with it (we just sum up a value). Just reading this 1 GB list takes about 8 minutes (and that was after we set workerCacheMb=8000, which we think means that the worker cache is 8 GB). If we don't set workerCacheMb to 8000, it takes over 50 minutes before we just kill the job. We're using an n1-standard-32 instance, which should have more than enough RAM. There is ONLY a single thread reading this 1 GB list. We know this because we create a dummy PCollection of one integer and use it to read the 1 GB ViewList side input.
It should not take 8 minutes to read in a 1 GB list, especially if there's enough RAM. EVEN if the list were materialized to disk (which it shouldn't be), a normal single NON-SSD disk can read data at 100 MB/s, so it should take ~10 seconds to read it in this absolute worst-case scenario.
What are we doing wrong? Did we discover a Dataflow bug? Or maybe workerCacheMb is really in KB instead of MB? We're tearing our hair out here.
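To put the two figures from the question side by side, here is the arithmetic as a short sketch. It assumes 1 GB = 1000 MB and uses the 100 MB/s disk rate and the 8-minute read time reported above; the function names are illustrative only.

```python
def read_seconds(size_mb, throughput_mb_per_s):
    """Time needed to read size_mb at a given sequential throughput."""
    return size_mb / throughput_mb_per_s


def throughput_mb_per_s(size_mb, seconds):
    """Effective throughput implied by reading size_mb in the given time."""
    return size_mb / seconds


LIST_SIZE_MB = 1000  # the 1 GB side input, assuming 1 GB = 1000 MB

# Worst case from the question: a spinning disk at 100 MB/s.
print(read_seconds(LIST_SIZE_MB, 100), "seconds at 100 MB/s")

# What the observed 8 minutes corresponds to.
print(round(throughput_mb_per_s(LIST_SIZE_MB, 8 * 60), 2), "MB/s observed")
```

The observed rate is roughly fifty times slower than the disk-only worst case, which suggests the time is going somewhere other than raw I/O.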
Try setWorkerCacheMb(1000); 1000 MB is around 1 GB. Each worker node will then pick up the side input from its cache, which will be fast.
DataflowWorkerHarnessOptions options = PipelineOptionsFactory.create().cloneAs(DataflowWorkerHarnessOptions.class);
options.setWorkerCacheMb(1000);
Is it really necessary to iterate over the whole 1 GB of side input data every time, or do you only need some specific data from it?
If you only need specific data, you should fetch it by passing the specific index into the list. Fetching an element by index is a much faster operation than iterating over the whole 1 GB of data.
After checking with the Dataflow team, the rate of 1GB in 8 minutes sounds about right.
Side inputs in Dataflow are always serialized. This is because to have a side input, a view of a PCollection must be generated. Dataflow does this by serializing it to a special indexed file.
If you give more information about your use case, we can help you think of ways of doing it in a faster manner.
I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table to BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
 .apply("Repartition", Repartition.of())
 .apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
 .apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
 .apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
 .apply("Window into partitions", Window.into(new TablePartWindowFun()))
 .apply("Write to BigQuery", BigQueryIO.Write
     .to(new DayPartitionFunc("someproject:somedataset", tableName))
     .withSchema(TableRowConverter.getSchema())
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition is something I built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
The pipeline works for a small number of smaller files, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
1. Uncompress all files on GCS, thereby enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding.
2. Build "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework. Is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then down from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the ExpandIterable and ParDo(Streaming Write) run-time in the end is 0. The picture shows it slightly before running "backwards".
In the logs of the workers I see some "exception thrown while executing request" messages coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in chunks of 175 days (25 weeks), running one pipeline after the other, so as not to overwhelm the system. In the loop, make sure the last files of the previous iteration are re-processed and the startDate is moved forward at the same speed as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of the chunks are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
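The chunking loop from the first change can be sketched independently of Dataflow. The snippet below only shows how a roughly two-year range splits into 175-day chunks; it is written in Python for brevity, the start and end dates are made up, and chunk_ranges is an illustrative helper, not part of any SDK. Selecting the input files per chunk (including re-reading the last files of the previous chunk) is not shown.

```python
from datetime import date, timedelta


def chunk_ranges(start, end, chunk_days=175):
    """Yield (chunk_start, chunk_end) pairs covering [start, end), end exclusive."""
    step = timedelta(days=chunk_days)
    current = start
    while current < end:
        # The last chunk is cut short so it never runs past the overall end date.
        yield current, min(current + step, end)
        current += step


# Hypothetical two-year range; substitute the real first and last day of the logs.
ranges = list(chunk_ranges(date(2015, 1, 1), date(2017, 1, 1)))
for chunk_start, chunk_end in ranges:
    # One pipeline run per chunk would be launched here, one after the other.
    print(chunk_start, "->", chunk_end)
```

Because 175 days is exactly 25 of the 7-day files, the chunk boundaries advance in step with the file boundaries, which is what keeps the re-processing of boundary files simple.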
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile trying to allocate more resources to your pipeline by setting --numWorkers with a higher value when you run your pipeline. This is one of the possible solutions discussed in the “Troubleshooting Your Pipeline” online document, at the "Common Errors and Courses of Action" sub-chapter.
We are using Sidekiq Pro 1.7.3 and Sidekiq 3.1.4, Ruby 2.0, and Rails 4.0.5 on Heroku with the RedisGreen add-on with 1.75 GB of memory.
We run a lot of sidekiq batch jobs, probably around 2 million jobs a day. What we've noticed is that the redis memory steadily increases over the course of a week. I would have expected that when the queues are empty and no workers are busy that redis would have low memory usage, but it appears to stay high. I'm forced to do a flushdb pretty much every week or so because we approach our redis memory limit.
I've had a series of correspondence with Redisgreen and they suggested I reach out to the sidekiq community. Here are some stats from redisgreen:
Here's a quick summary of RAM use across your database:
The vast majority of keys in your database are simple values taking up 2 bytes each.
200MB is being consumed by "queue:low", the contents of your low-priority sidekiq queue.
The next largest key is "dead", which occupies about 14MB.
And:
We just ran an analysis of your database - here is a summary of what we found in 23129 keys:
18448 strings with 1048468 bytes (79.76% of keys, avg size 56.83)
6 lists with 41642 items (00.03% of keys, avg size 6940.33)
4660 sets with 3325721 members (20.15% of keys, avg size 713.67)
8 hashs with 58 fields (00.03% of keys, avg size 7.25)
7 zsets with 1459 members (00.03% of keys, avg size 208.43)
It appears that you have quite a lot of memory occupied by sets. For example - each of these sets have more than 10,000 members and occupies nearly 300KB:
b-3819647d4385b54b-jids
b-3b68a011a2bc55bf-jids
b-5eaa0cd3a4e13d99-jids
b-78604305f73e44ba-jids
b-e823c15161b02bde-jids
These look like Sidekiq Pro "batches". It seems like some of your batches are getting filled up with very large numbers of jobs, which is causing the additional memory usage that we've been seeing.
Let me know if that sounds like it might be the issue.
Don't be afraid to open a Sidekiq issue or email prosupport@sidekiq.org directly.
Sidekiq Pro Batches have a default expiration of 3 days. If you set the Batch's expires_in setting longer, the data will sit in Redis longer. Unlike jobs, batches do not disappear from Redis once they are complete. They need to expire over time. This means you need enough memory in Redis to hold N days of Batches, usually not a problem for most people, but if you have a busy Sidekiq installation and are creating lots of batches, you might notice elevated memory usage.
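To get a feel for how much of the database those batch sets could account for, here is a rough estimate built only from the RedisGreen figures quoted in the question. It assumes the sample sets (a little over 10,000 members, nearly 300KB) are representative of all sets, which may not hold, so treat the result as an order-of-magnitude guess. The function names are illustrative.

```python
# Figures quoted from the RedisGreen analysis in the question.
SET_MEMBERS_TOTAL = 3_325_721     # "4660 sets with 3325721 members"
SAMPLE_SET_MEMBERS = 10_000       # "more than 10,000 members"
SAMPLE_SET_BYTES = 300 * 1024     # "nearly 300KB"


def bytes_per_member(set_bytes, members):
    """Average cost of one set member, derived from a sample set."""
    return set_bytes / members


def estimated_total_mb(total_members, per_member_bytes):
    """Extrapolate the per-member cost to all set members, in MB (1 MB = 1024 * 1024 bytes)."""
    return total_members * per_member_bytes / (1024 * 1024)


per_member = bytes_per_member(SAMPLE_SET_BYTES, SAMPLE_SET_MEMBERS)
print(round(per_member, 1), "bytes per member (rough upper bound)")
print(round(estimated_total_mb(SET_MEMBERS_TOTAL, per_member)), "MB held in sets (rough)")
```

Under these assumptions the sets come to something on the order of 100 MB, which is data that stays in Redis until the batches expire.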
Simple question:
What is the memory limitation of the Pig LOAD statement?
More detailed question:
Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?
Scenario:
A research project is using a Pig script that is trying to load a directory containing 12,000+ files with a total size of 891GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration until I finally was able to review the code.
-- load data from I_WATS_DIR
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4} as
(src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);
Details:
CLUSTER
1 front end node, 16 cores, 64GB RAM, 128GB swap, NameNode
3 compute nodes, 16 cores, 128GB RAM, 128GB swap, DataNode
TEST JOB 1
Same script referenced above, loading a directory with 1 file
Resident memory reported 1.2GB
Input: 138MB
Output: 207MB
Reduce input records: 1,630,477
Duration: 4m 11s
TEST JOB 2
Same script, 17 files
Resident memory: 16.4GB
Input: 3.5GB
Output: 1.3GB
Reduce input records: 10,648,807
Duration: 6m 48s
TEST JOB 3
Same script, 51 files
Resident memory: 41.4GB
Input: 10.9GB
Output: not recorded
Reduce input records: 31,968,331
Duration: 6m 18s
Final thoughts:
This is a 4 node cluster with nothing else running on it, fully dedicated to Cloudera Hadoop CDH4, running this 1 job only. Hoping this is all the info people need to answer my original question! I strongly suspect that some sort of file parsing loop that loads 1 file at a time is the solution, but I know even less about Pig than I do about Hadoop. I do have a programming/development background, but in this case I am the sys admin, not the researcher or programmer.
Based on your description of your cluster and the amount of data you're pushing through it, it sounds like you are running out of space during the map/shuffle phase of the job. The temporary data is sent over the network, uncompressed, and then written to disk on the reducer before being processed in the reduce phase. One thing you can try is to compress the output of the mappers by setting mapred.compress.map.output to true (mapreduce.map.output.compress under MRv2) and specifying the desired codec.
But with only four nodes, I suspect you're just trying to do too much at once. If you can, try splitting up your job into multiple steps. For example, if you are doing the standard word count example, do the word count over small portions of your data, and then run a second MR program that sums those counts.
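The two-step idea can be illustrated without Hadoop at all. The sketch below is plain Python, not MapReduce or Pig: it counts words over small portions of the input independently and then sums the partial counts in a second step. The sample data and function names are made up for illustration.

```python
from collections import Counter


def count_words(lines):
    """Step 1: word counts for one small portion of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(partials):
    """Step 2: sum the partial counts produced by step 1."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Made-up input, already split into two portions.
portions = [
    ["a b a", "c"],
    ["b b", "a c c"],
]

partials = [count_words(portion) for portion in portions]
total = merge_counts(partials)
print(total)
```

Each portion only needs enough resources for its own slice of the data, and the second step works on the much smaller partial results.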