Slow write performance in GlusterFS I/O

I am a newbie to GlusterFS. I have currently set up GlusterFS on two servers with the following options:
performance.cache-size 2GB
performance.io-thread-count 16
performance.client-io-threads on
performance.io-cache on
performance.readdir-ahead on
When I run my binary as follows:
./binary > shared_location/log
it takes roughly 1.5 minutes, and the log size is roughly 100 MB. Whereas running it this way:
./binary > local_location/log
takes roughly 10 seconds.
This is a huge difference in time. Number of cores on the GlusterFS server: 2; on the current machine: 2.
Is there any way I can reduce time?
Also, is there any standard configuration to start off with, so I can avoid basic issues like above?
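A quick way to check whether the gap is raw write throughput on the GlusterFS mount, rather than something specific to the binary, is a small timing sketch like the one below; the paths and sizes are placeholders:

# A minimal write-throughput comparison, assuming shared_location is the
# GlusterFS mount and local_location is a local disk; adjust paths/sizes.
import os
import time

def time_write(path, total_mb=100, chunk_kb=64):
    chunk = b'x' * (chunk_kb * 1024)
    start = time.time()
    with open(path, 'wb') as f:
        for _ in range(total_mb * 1024 // chunk_kb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())   # make sure the data actually hits the volume
    return time.time() - start

print('gluster:', time_write('shared_location/testfile'))
print('local  :', time_write('local_location/testfile'))

If this shows the same order-of-magnitude gap, the slowdown is in the write path itself (network round trips per write) rather than in the binary.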

Related

How to pick proper number of threads, workers, processes for Dask when running in an ephemeral environment as single machine and cluster

Our company is currently leveraging prefect.io for data workflows (ELT, report generation, ML, etc). We have just started adding the ability to do parallel task execution, which is powered by Dask. Our flows are executed using ephemeral AWS Fargate containers, which will use Dask LocalCluster with a certain number of workers, threads, processes passed into the LocalCluster object.
Our journey on Dask will look very much like this:
Continue using a single-machine LocalCluster until we outgrow the max CPU/memory allowed.
When we outgrow a single container, spawn additional worker containers from the initial container (a la dask-kubernetes) and join them to the LocalCluster.
We're currently starting with containers that have 256 CPU units (.25 vCPU) and 512 MB of memory, and pinning the LocalCluster to n_workers=1 and threads_per_worker=3 to get a reasonable amount of parallelism. However, this really is guesswork: 1 worker since it's a machine with less than 1 vCPU, and 3 threads because that doesn't sound crazy to me based on my previous experience running other Python-based applications in Fargate. This seems to work fine in a very simple example that just maps a function against a list of items.
from pathlib import Path
from time import sleep
from prefect import Flow, task
# HANDLER, SCHEDULE and LOGGER are defined elsewhere in this module.

RENEWAL_TABLES = [
    'Activity',
    'CurrentPolicyTermStaus',
    'PolicyRenewalStatus',
    'PolicyTerm',
    'PolicyTermStatus',
    'EndorsementPolicyTerm',
    'PolicyLifeState'
]
RENEWAL_TABLES_PAIRS = [
    (i, 1433 + idx) for idx, i in enumerate(RENEWAL_TABLES)
]

@task(state_handlers=[HANDLER])
def dummy_step():
    LOGGER.info('Dummy Step...')
    sleep(15)

@task(state_handlers=[HANDLER])
def test_map(table):
    LOGGER.info('table: {}...'.format(table))
    sleep(15)

with Flow(Path(__file__).stem, SCHEDULE, state_handlers=[HANDLER]) as flow:
    first_step = dummy_step()
    test_map.map(RENEWAL_TABLES_PAIRS).set_dependencies(upstream_tasks=[first_step])
I see no more than 3 tasks executed at once.
I would really like to understand how to best configure n_workers (single machine), threads, and processes as we expand the size of the single machine and start adding remote workers. I know it depends on my workload, but you could see a combination of things in a single flow where one task does an extract from a database to a CSV and another task runs a pandas computation. I have seen suggestions online that threads should equal the number of CPUs requested, per the documentation, but it seems like you can still achieve parallelism with less than one CPU in Fargate.
Any feedback would be appreciated and could help others looking to leverage Dask in a more ephemeral nature.
Given that Fargate increments from .25 -> .50 -> 1 -> 2 -> 4 for vCPU, I think it's safe to go with a 1 worker to 1 vCPU setup. However, it would be helpful to understand how to choose a good upper limit for the number of threads per worker given how Fargate vCPU allotment works.
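For reference, a minimal sketch of pinning the LocalCluster explicitly; the worker/thread counts below are just the values discussed above for a .25 vCPU task, not a recommendation:

# A minimal sketch, assuming dask.distributed is installed; the numbers mirror
# the .25 vCPU / 512 MB Fargate task discussed above and are not a recommendation.
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(
    n_workers=1,             # one worker process for a sub-1-vCPU container
    threads_per_worker=3,    # threads sharing that worker's CPU allotment
    memory_limit='512MB',    # cap the worker near the Fargate task memory
)
client = Client(cluster)
print(client)                # reports the workers/threads actually started

With a setup like this, mapping over 7 items will still run at most 3 tasks concurrently (one per thread), which matches the behaviour described above.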

Lua and Torch issues with GPU

I am trying to run the Lua-based program from OpenNMT. I have followed the procedure from here: http://forum.opennmt.net/t/text-summarization-on-gigaword-and-rouge-scoring/85
I have used the command:
th train.lua -data textsum-train.t7 -save_model textsum1 -gpuid 0 1 2 3 4 5 6 7
I am using 8 GPUs, but the process is still very slow, as if it were running on the CPU. Kindly let me know what might be the solution for optimizing the GPU usage.
Here are the stats of the GPU usage (screenshot not reproduced here).
Kindly let me know how I can make the process run faster using all of the GPUs. I have 11 GB of GPU memory available, but the process only consumes 2 GB or less, hence the process is very slow.
As per the OpenNMT documentation, you need to remove the 0 right after the -gpuid option, since 0 stands for the CPU and you are effectively reducing the training speed to that of a CPU-powered run.
To use data parallelism, assign a list of GPU identifiers to the -gpuid option. For example:
th train.lua -data data/demo-train.t7 -save_model demo -gpuid 1 2 4
will use the first, the second and the fourth GPU of the machine as returned by the CUDA API.

Spark JobServer, memory settings for release

I've set up a spark-jobserver to enable complex queries on a reduced dataset.
The jobserver executes two operations:
Sync with the main remote database: it dumps some of the server's tables, reduces and aggregates the data, saves the result as a Parquet file, and caches it as a SQL table in memory. This operation will be done every day;
Queries: when the sync operation is finished, users can perform complex SQL queries on the aggregated dataset, (eventually) exporting the result as a CSV file. Every user can do only one query at a time and waits for its completion.
The biggest table (before and after the reduction, which also includes some joins) has almost 30M rows, with at least 30 fields.
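For concreteness, the sync step described above corresponds roughly to this PySpark-style sketch; the connection details, table names, and columns are placeholders, and the spark-jobserver plumbing is omitted:

# A rough sketch of the daily sync, ignoring the spark-jobserver job classes;
# the JDBC URL, credentials, tables and columns are all placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('daily-sync').getOrCreate()
jdbc_opts = {'user': 'readonly', 'password': 'secret'}   # placeholder credentials
url = 'jdbc:postgresql://main-db:5432/prod'              # placeholder URL

policies = spark.read.jdbc(url, 'policies', properties=jdbc_opts)
terms = spark.read.jdbc(url, 'policy_terms', properties=jdbc_opts)

# reduce and aggregate before anything is cached
aggregated = (policies.join(terms, on='policy_id')
              .groupBy('region', 'product')
              .agg(F.count('*').alias('n_policies')))

aggregated.write.mode('overwrite').parquet('/data/aggregated.parquet')
aggregated.createOrReplaceTempView('aggregated')
spark.catalog.cacheTable('aggregated')   # subsequent SQL queries hit the in-memory table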
Currently I'm working on a dev machine with 32 GB of RAM dedicated to the job server, and everything runs smoothly. The problem is that in production we have the same amount of RAM, shared with a PredictionIO server.
I'm asking how to determine the memory configuration to avoid memory leaks or crashes in Spark.
I'm new to this, so every reference or suggestion is accepted.
Thank you
Take an example: if you have a server with 32 GB of RAM, you would set the following parameter:
spark.executor.memory = 32g
Take note (from the Cloudera post linked below):
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won't fit within the 63GB capacity of the NodeManagers.
The application master will take up a core on one of the nodes, meaning that there won't be room for a 15-core executor on that node.
15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors. --executor-memory was derived as (63/3 executors per node) = 21. 21 * 0.07 = 1.47. 21 - 1.47 ~ 19.
This is explained here if you want to know more:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
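To make the quoted arithmetic explicit, here is a small sketch of the rule of thumb; the 63 GB per NodeManager, 3 executors per node, and 7% overhead figures all come from the Cloudera example, not from the 32 GB machine in the question:

# Sizing rule of thumb from the quoted Cloudera example; the inputs are
# illustrative, not a recommendation for the 32 GB production server.
def executor_memory_gb(node_memory_gb, executors_per_node, overhead_fraction=0.07):
    per_executor = node_memory_gb / executors_per_node    # 63 / 3 = 21
    overhead = per_executor * overhead_fraction           # 21 * 0.07 = 1.47
    return int(per_executor - overhead)                   # 21 - 1.47 ~ 19

print(executor_memory_gb(63, 3))   # -> 19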

First MongoDB query ultra slow on Linode

When I start my Rails application and open a page that needs to query my MongoDB database, I run into the following problem:
on my local machine it takes about 1600 ms to perform the queries and render everything
on my Linode it takes about 4 min to perform the first query and render everything
After that everything is faster: caching kicks in, pages load instantly, etc.
But really, 4 minutes? Why is that? Is that MongoDB loading data from disk into memory? Why does it take so much longer than on my local machine?
Is this due to the hard drive being shared on Linode? I noticed a lot of activity when running iostat:
$ iostat -d 2
Linux 3.12.6-x86_64-linode36 (linode)  01/31/2014  _x86_64_  (8 CPU)

Device:            tps    kB_read/s    kB_wrtn/s      kB_read     kB_wrtn
xvda           1129.69     43026.47        17.62   1940251345      794504
xvdb            248.43      2572.50       698.08    116005452    31479356

Device:            tps    kB_read/s    kB_wrtn/s      kB_read     kB_wrtn
xvda           4491.50    179012.00         0.00       358024           0
xvdb              0.00         0.00         0.00            0           0
It's my understanding that Mongo loads all the data from disk into memory, so I guess it's likely that you're experiencing slow performance during that phase. Perhaps it makes sense to hit the db with several queries to warm it up before you enable your application.
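As a concrete example of that warm-up idea, here is a minimal pymongo sketch; the database and collection names are placeholders for whatever the Rails pages actually query:

# A minimal warm-up sketch, assuming pymongo is installed; 'mydb', 'posts'
# and 'users' are placeholders for the application's real collections.
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['mydb']

for name in ['posts', 'users']:
    count = 0
    for _ in db[name].find({}, {'_id': 1}):   # touch every document so it is paged into memory
        count += 1
    print('warmed {}: {} documents'.format(name, count))

Running something like this once after mongod starts (or after a deploy) should move the cost of the initial disk reads out of the first user-facing request.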

Specific memory limitations of Pig LOAD statement?

Simple question:
What is the memory limitation of the Pig LOAD statement?
More detailed question:
Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?
Scenario:
A research project is using a Pig script that is trying to load a directory containing 12,000+ files with a total size of 891GB in a single Pig LOAD statement, copied below. The files are gzipped WAT files which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion has been on resources and configuration until I finally was able to review the code.
-- load data from I_WATS_DIR
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4}') AS
(src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);
Details:
CLUSTER
1 front end node, 16 cores, 64GB RAM, 128GB swap, NameNode
3 compute nodes, 16 cores, 128GB RAM, 128GB swap, DataNode
TEST JOB 1
Same script referenced above, loading a directory with 1 file
Resident memory reported 1.2GB
Input: 138MB
Output: 207MB
Reduce input records: 1,630,477
Duration: 4m 11s
TEST JOB 2
Same script, 17 files
Resident memory: 16.4GB
Input: 3.5GB
Output: 1.3GB
Reduce input records: 10,648,807
Duration: 6m 48s
TEST JOB 3
Same script, 51 files
Resident memory: 41.4GB
Input: 10.9GB
Output: not recorded
Reduce input records: 31,968,331
Duration: 6m 18s
Final thoughts:
This is a 4 node cluster with nothing else running on it, fully dedicated to Cloudera Hadoop CDH4, running this 1 job only. Hoping this is all the info people need to answer my original question! I strongly suspect that some sort of file parsing loop that loads 1 file at a time is the solution, but I know even less about Pig than I do about Hadoop. I do have a programming/development background, but in this case I am the sys admin, not the researcher or programmer.
Based on your description of your cluster and the amount of data you're pushing through it, it sounds like you are running out of space during the map/shuffle phase of the job. The temporary data is sent over the network, uncompressed, and then written to disk on the reducer before being processed in the reduce phase. One thing you can try is to compress the output of the mappers by setting mapred.compress.map.output to true (and specifying the desired codec).
But with only four nodes, I suspect you're just trying to do too much at once. If you can, try splitting up your job into multiple steps. For example, if you are doing the standard word count example, do the word count over small portions of your data, and then run a second MR program that sums those counts.
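If it helps, here is a rough sketch (not a drop-in solution) of driving the existing script over the input one file at a time, along the lines of the loop the question already suspects; the listing file, script name, and -param usage are assumptions about how the script is invoked:

# A rough driver loop for the "one file at a time" idea; 'wat_files.txt'
# (one HDFS path per line) and 'wat_job.pig' are placeholders.
import subprocess

with open('wat_files.txt') as listing:
    paths = [line.strip() for line in listing if line.strip()]

for path in paths:
    # Pig parameter substitution: '$I_WATS_DIR' in the script is replaced by this path
    subprocess.check_call(['pig', '-param', 'I_WATS_DIR=' + path, 'wat_job.pig'])

Grouping a few dozen paths per run (for example, by staging each batch into its own directory) would cut the per-job launch overhead if 12,000 single-file jobs turn out to be too slow end to end.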

Resources