Why is proc upload so slow?

I've also posted this question on runsubmit.com, a site outside the SE network for SAS-related questions.
At work there are two SAS servers I use. When I transfer a SAS dataset from one to the other via proc upload, it goes at about 2.5 MB/s. However, if I map a drive on one server as a network drive on the other and copy the file across, it runs much faster, at around 80 MB/s (over the same gigabit connection).
Could anyone suggest what might be causing this and what I can do either to fix it or as a workaround?
There is also a third server I use that cannot map network drives on the other two; SAS is the only available means of transferring files from that one, so I need a SAS-based solution. Although individual transfers from this server run at 2.5 MB/s, I've found that it's possible to run several transfers in parallel, each at 2.5 MB/s.
Would SAS FTP via filenames and a data step be any faster than using proc upload? I might try that next, but I would prefer not to: we only have SAS 9.1.3, so SFTP isn't available.
Update - Further details:
I'm connecting to a spawner, and I think it uses 'SAS proprietary encryption' (based on what I recall seeing in the logs).
The uploads are Windows client -> Windows remote in the first case and Unix client -> Windows remote in the second case.
The SAS datasets in question are compressed (i.e. by SAS, not some external compression utility).
The transfer rate is similar when using proc upload to transfer external files (.bz2) in binary mode.
All the servers have very fast disk arrays handled by enterprise-grade controllers (minimum 8 drives in RAID 10)
Potential solutions
Parallel PROC UPLOAD - potentially fast enough, but extremely CPU-heavy
PROC COPY - much faster than PROC UPLOAD, much less CPU overhead
SAS FTP - not secure, unknown speed, unknown CPU overhead
Update - test results
Parallel PROC UPLOAD: involves quite a lot of setup* and a lot of CPU, but works reasonably well.
PROC COPY: exactly the same transfer rate per session as proc upload, and far more CPU time used.
FTP: About 20x faster, minimal CPU (100MB/s vs. 2.5MB/s per parallel proc upload).
*I initially tried the following:
local session -> remote session on source server -> n remote sessions on destination server -> recombine n pieces on destination server
Although this resulted in n simultaneous transfers, they each ran at 1/n of the original rate, probably due to a CPU bottleneck on the source server. To get it to work with n times the bandwidth of a single transfer, I had to set it up as:
local session -> n remote sessions on source server -> 1 remote session each on destination server -> recombine n pieces on destination server
SAS FTP code
/* FTP access to the source directory on the remote server (binary, directory-style) */
filename source ftp '\dir1\dir2'
    host='servername'
    binary dir
    user="&username" pass="&password";

/* Point a fileref at the local WORK directory */
%let work=%sysfunc(pathname(work));
filename target "&work";

/* Copy the dataset file byte for byte */
data _null_;
    infile source('dataset.sas7bdat') truncover;
    input;
    file target('dataset.sas7bdat');
    put _infile_;
run;

My understanding of PROC UPLOAD is that it is performing a record-by-record upload of the file along with some conversions and checks, which is helpful in some ways, but not particularly fast. PROC COPY, on the other hand, will happily copy the file without being quite as careful to maintain things like indexes and constraints; but it will be much faster. You just have to define a libref for your server's files.
For example, I sign on to my server and assign it the 'unix' nickname. Then I define a library on it:
libname uwork server=unix slibref=work;
Then I execute the following PROC COPY code, using a randomly generated 1e7 row datafile. Following that, I also RSUBMIT a PROC UPLOAD for comparison purposes.
48 proc copy in=work out=uwork;
NOTE: Writing HTML Body file: sashtml.htm
49 select test;
50 run;
NOTE: Copying WORK.TEST to UWORK.TEST (memtype=DATA).
NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: The data set UWORK.TEST has 10000000 observations and 1 variables.
NOTE: PROCEDURE COPY used (Total process time):
real time 13.07 seconds
cpu time 1.93 seconds
51 rsubmit;
NOTE: Remote submit to UNIX commencing.
3 proc upload data=test;
4 run;
NOTE: Upload in progress from data=WORK.TEST to out=WORK.TEST
NOTE: 80000000 bytes were transferred at 1445217 bytes/second.
NOTE: The data set WORK.TEST has 10000000 observations and 1 variables.
NOTE: Uploaded 10000000 observations of 1 variables.
NOTE: The data set WORK.TEST has 10000000 observations and 1 variables.
NOTE: PROCEDURE UPLOAD used:
real time 55.46 seconds
cpu time 42.09 seconds
NOTE: Remote submit to UNIX complete.
PROC COPY is still not quite as fast as OS copying, but it's much closer in speed. PROC UPLOAD is actually quite a bit slower than even a regular data step, because it's doing some checking; in fact, here the data step is comparable to PROC COPY due to the simplicity of the dataset (and probably the fact that I have a 64k block size, meaning that a data step is using the server's 16k block size while PROC COPY presumably does not).
52 data uwork.test;
53 set test;
54 run;
NOTE: There were 10000000 observations read from the data set WORK.TEST.
NOTE: The data set UWORK.TEST has 10000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 12.60 seconds
cpu time 1.66 seconds
In general, in real-world situations, PROC COPY is faster than a data step, and both are faster than PROC UPLOAD, unless you need PROC UPLOAD because of complexities in your situation (I have never seen a reason to, but I know it is possible). I think PROC UPLOAD was more necessary in older versions of SAS but is largely unneeded now; that said, my experience with different hardware setups is fairly limited, so this may not apply to your situation.

FTP, if available from the source server, is much faster than PROC UPLOAD or PROC COPY. Both of those operate on a record-by-record basis and can be CPU-bound over fast network connections, especially for very wide datasets. A single FTP transfer will attempt to use all available bandwidth, with negligible CPU cost.
This assumes that the destination server can use the unmodified transferred file - if not, the time required to make it usable might negate the increased transfer speed of FTP.

Related

Control the tie-breaking choice for loading chunks in the scheduler?

I have some large files in a local binary format, which contain many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (it could be chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation; I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask and re-construct the chunks as Array objects. This works fine - I can slice and dice the arrays and do calculations on them, and when I finally compute() the final result, the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage; however, it's ugly and brittle, and it will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks"?
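For reference, the kludge looks roughly like this (a sketch only: is_load_task, disk_offset and lazy_array stand in for my actual checks and lookups against the file's table of contents):

from dask.callbacks import Callback

class DiskOrderCallback(Callback):
    """Re-order 'ready' tasks so chunk loads follow their on-disk order."""
    def _start_state(self, dsk, state):
        def key_fn(task_key):
            # Chunk-loading tasks are ranked by their byte offset in the file;
            # everything else keeps a neutral rank.
            return disk_offset(task_key) if is_load_task(task_key) else 0
        # The local scheduler pops tasks from the end of state['ready'],
        # so sort descending to run the earliest-on-disk chunks first.
        state['ready'].sort(key=key_fn, reverse=True)

with DiskOrderCallback():
    result = lazy_array.sum().compute()  # any computation over the lazy Array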
Your assessment is correct. As of 2018-06-16 there is currently no way to add a final tie-breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.
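If you do want to drive the order explicitly, a minimal sketch with the distributed scheduler might look like the following, where file_order is a hypothetical list of your per-chunk delayed loaders sorted by byte offset; note again that these priorities override the other ordering heuristics rather than just breaking ties:

from dask.distributed import Client

client = Client(processes=False)  # the "distributed" scheduler on one machine

# file_order: hypothetical list of delayed chunk-loading tasks, already
# sorted by their byte offset in the file (from the table of contents).
futures = [client.compute(chunk, priority=-i)  # earlier on disk => higher priority
           for i, chunk in enumerate(file_order)]
results = client.gather(futures)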

Talend - memory issues when working with big files

Before the admins start eating me alive, I would like to say in my defense that I cannot comment on the original posts because I do not have enough reputation, so I have to ask about this again.
I have issues running a job in Talend (Open Studio for Big Data). I have a 3 GB file to process, which I do not consider too much, since my computer has 32 GB of RAM.
While trying to run my job, first I got an error related to heap memory, then it changed to a garbage collector error, and now it doesn't even give me an error (it just does nothing and then stops).
I found these solutions:
a) Talend performance
#Kailash commented that parallel execution is only available if you are subscribed to one of the Talend Platform solutions. My comment/question: so is there no other option to parallelize a job with a 3 GB file?
b) Talend 10 GB input and lookup out of memory error
#54l3d mentioned that one option is to split the lookup file into manageable chunks (maybe 500 MB), then perform the join in several stages, one for each chunk. My comment/cry for help/question: how can I do that? I do not know how to split the lookup; can someone explain this to me a little more graphically?
c) How to push a big file data in talend?
Just to mention that I also went through (c), but I don't have any comment about it.
The job I am performing (thanks to #iMezouar) looks like this:
1) I have an inputFile MySQLInput coming from a DB in MySQL (3GB)
2) I used the tFirstRows to make it easier for the process (not working)
3) I used the tSplitRow to transform the data from many similar columns to only one column.
4) MySQLOutput
Thanks again for reading me and double thanks for answering.
From what I understand, your query returns a lot of data (3 GB), and that is causing an error in your job. I suggest the following:
1. Filter data on the database side: replace tSampleRow with a WHERE clause in your tMysqlInput component, in order to retrieve fewer rows in Talend.
2. The MySQL JDBC driver by default retrieves all data into memory, so you need to use the stream option in tMysqlInput's advanced settings in order to stream rows.
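For readers doing the equivalent outside Talend, the underlying idea of point 2 (stream rows instead of materializing the whole result set in memory) looks roughly like this in Python with pymysql and a server-side cursor; the connection details, query, and process() handler are placeholders:

import pymysql
import pymysql.cursors

conn = pymysql.connect(host='dbhost', user='user', password='secret',
                       database='mydb',
                       cursorclass=pymysql.cursors.SSCursor)  # server-side (streaming) cursor
try:
    with conn.cursor() as cur:
        # Point 1: filter on the database side instead of inside the job
        cur.execute("SELECT col1, col2 FROM big_table WHERE created >= %s",
                    ('2017-01-01',))
        for row in cur:      # rows are fetched incrementally, not all at once
            process(row)     # placeholder for the per-row transformation
finally:
    conn.close()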

Flash Memory Management

I'm collecting data on an ARM Cortex M4 based evaluation kit in a remote location and would like to log the data to persistent memory for access later.
I would be logging roughly 300 bytes once every hour, and would want to come collect all the data with a PC after roughly 1 week of running.
I understand that I should attempt to minimize the number of writes to flash, but I don't have a great understanding of the best way to do this. I'm looking for a resource that would explain memory management techniques for this kind of situation.
I'm using the ADUCM350 which looks like it has 3 separate flash sections (128kB, 256kB, and a 16kB eeprom).
For logging applications the simplest and most effective wear leveling tactic is to treat the entire flash array as a giant ring buffer.
Define the entry size to be some integer fraction of the smallest erasable flash unit. Say a sector is 4 KB (4096 bytes); let the entry size be 256 bytes.
This keeps all log entries sector-aligned and allows you to erase any sector without cutting a log entry in half.
At boot, walk the memory and find the first empty entry; this is the write_pointer.
When a log entry is written, simply write it at write_pointer and increment write_pointer.
If write_pointer lands on a sector boundary, erase the sector at write_pointer to make room for the coming writes. Essentially, this guarantees that there is always at least one empty log entry for you to find at boot, which is how you restore the write_pointer.
If you dedicate 128 KB to the log entries and have an endurance of 20,000 write/erase cycles, this gives you a total of (128 KB / 256 bytes) x 20,000 = 10,240,000 entries written before failure, or about 1,168 years of continuous logging at one entry per hour...
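A minimal sketch of that bookkeeping, simulated in Python against an in-memory byte array (the sizes, the 0xFF erased value, and the flash primitives are assumptions for illustration, not ADuCM350 code):

SECTOR_SIZE = 4096                     # smallest erasable unit (assumed)
ENTRY_SIZE = 256                       # an integer fraction of the sector size
LOG_SIZE = 128 * 1024                  # 128 KB dedicated to log entries
NUM_ENTRIES = LOG_SIZE // ENTRY_SIZE

flash = bytearray(b'\xFF' * LOG_SIZE)  # stand-in for the real flash array

def flash_erase_sector(addr):
    flash[addr:addr + SECTOR_SIZE] = b'\xFF' * SECTOR_SIZE

def flash_write(addr, data):
    flash[addr:addr + len(data)] = data

def find_write_pointer():
    """At boot, walk the log and return the offset of the first empty entry."""
    for i in range(NUM_ENTRIES):
        addr = i * ENTRY_SIZE
        if all(b == 0xFF for b in flash[addr:addr + ENTRY_SIZE]):
            return addr
    return 0  # should not happen: the erase-ahead always leaves one empty entry

def log_entry(write_pointer, payload):
    """Write one fixed-size entry (payload <= ENTRY_SIZE bytes) and advance."""
    flash_write(write_pointer, payload.ljust(ENTRY_SIZE, b'\xFF'))
    write_pointer = (write_pointer + ENTRY_SIZE) % LOG_SIZE
    # Erase ahead when crossing into a new sector so that at least one
    # empty entry always exists for find_write_pointer() at the next boot.
    if write_pointer % SECTOR_SIZE == 0:
        flash_erase_sector(write_pointer)
    return write_pointer

wp = find_write_pointer()
wp = log_entry(wp, b'2018-01-01T00:00Z,adc=1234,temp=25')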

Reading large gzip JSON files from Google Cloud Storage via Dataflow into BigQuery

I am trying to read about 90 gzipped JSON logfiles from Google Cloud Storage (GCS), each about 2GB large (10 GB uncompressed), parse them, and write them into a date-partitioned table to BigQuery (BQ) via Google Cloud Dataflow (GCDF).
Each file holds 7 days of data, the whole date range is about 2 years (730 days and counting). My current pipeline looks like this:
p.apply("Read logfile", TextIO.Read.from(bucket))
.apply("Repartition", Repartition.of())
.apply("Parse JSON", ParDo.of(new JacksonDeserializer()))
.apply("Extract and attach timestamp", ParDo.of(new ExtractTimestamps()))
.apply("Format output to TableRow", ParDo.of(new TableRowConverter()))
.apply("Window into partitions", Window.into(new TablePartWindowFun()))
.apply("Write to BigQuery", BigQueryIO.Write
.to(new DayPartitionFunc("someproject:somedataset", tableName))
.withSchema(TableRowConverter.getSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
The Repartition step is something I've built in while trying to make the pipeline reshuffle after decompressing; I have tried running the pipeline with and without it. Parsing JSON works via a Jackson ObjectMapper and corresponding classes, as suggested here. The TablePartWindowFun is taken from here; it is used to assign a partition to each entry in the PCollection.
The pipeline works for smaller files, and not too many of them, but breaks for my real data set. I've selected large enough machine types and tried setting a maximum number of workers, as well as using autoscaling up to 100 n1-highmem-16 machines. I've tried streaming and batch mode, and diskSizeGb values from 250 up to 1200 GB per worker.
The possible solutions I can think of at the moment are:
Uncompress all files on GCS, thereby enabling dynamic work splitting between workers, as it is not possible to leverage GCS's gzip transcoding
Building "many" parallel pipelines in a loop, with each pipeline processing only a subset of the 90 files.
Option 2 seems to me like programming "around" a framework, is there another solution?
Addendum:
With Repartition after Reading the gzip JSON files in batch mode with 100 workers max (of type n1-highmem-4), the pipeline runs for about an hour with 12 workers and finishes the Reading as well as the first stage of Repartition. Then it scales up to 100 workers and processes the repartitioned PCollection. After it is done the graph looks like this:
Interestingly, when reaching this stage, it first processes up to 1.5 million elements/s, then the progress goes down to 0. The size of the OutputCollection of the GroupByKey step in the picture first goes up and then down, from about 300 million to 0 (there are about 1.8 billion elements in total), as if it were discarding something. Also, the run-time of ExpandIterable and ParDo(Streaming Write) ends up at 0. The picture shows the pipeline slightly before it starts running "backwards".
In the worker logs I see some exception thrown while executing request messages coming from the com.google.api.client.http.HttpTransport logger, but I can't find more info in Stackdriver.
Without Repartition after Reading the pipeline fails using n1-highmem-2 instances with out of memory errors at exactly the same step (everything after GroupByKey) - using bigger instance types leads to exceptions like
java.util.concurrent.ExecutionException: java.io.IOException:
CANCELLED: Received RST_STREAM with error code 8 dataflow-...-harness-5l3s
talking to frontendpipeline-..-harness-pc98:12346
Thanks to Dan from the Google Cloud Dataflow Team and the example he provided here, I was able to solve the issue. The only changes I made:
Looping over the days in 175-day (25-week) chunks, running one pipeline after the other, so as not to overwhelm the system (sketched below, after this list). In the loop, make sure the last files of the previous iteration are re-processed and that the startDate is moved forward at the same pace as the underlying data (175 days). As WriteDisposition.WRITE_TRUNCATE is used, incomplete days at the end of a chunk are overwritten with correct, complete data this way.
Using the Repartition/Reshuffle transform mentioned above, after reading the gzipped files, to speed up the process and allow smoother autoscaling
Using DateTime instead of Instant types, as my data is not in UTC
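The chunked loop itself is simple; a minimal sketch of the date-window arithmetic (run_pipeline() and the date range are placeholders for building and running one pipeline per window):

from datetime import date, timedelta

CHUNK = timedelta(days=175)                        # 25 weeks per pipeline run
start, end = date(2015, 1, 1), date(2017, 1, 1)    # placeholder overall range

current = start
while current < end:
    chunk_end = min(current + CHUNK, end)
    # run_pipeline() is a placeholder for building and running one pipeline
    # over the files covering [current, chunk_end); files straddling the
    # boundary are re-read, and WRITE_TRUNCATE fixes up incomplete days.
    run_pipeline(current, chunk_end)
    current = chunk_end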
UPDATE (Apache Beam 2.0):
With the release of Apache Beam 2.0 the solution became much easier. Sharding BigQuery output tables is now supported out of the box.
It may be worthwhile to allocate more resources to your pipeline by setting --numWorkers to a higher value when you run it. This is one of the possible solutions discussed in the "Troubleshooting Your Pipeline" online document, in the "Common Errors and Courses of Action" section.

Compressing/encoding SQL commands for faster network transmission

I am developing an iOS app which stores its data in an sqlite3 database. Every insert, update or delete operation is logged locally and then pushed up to iCloud, and other devices running the app can download these transaction logs and execute the SQL commands within them to keep all devices running the app synchronised. This is working extremely well.
I am now looking into optimising the process and it occurs to me that logging the whole SQL command results in a lot of redundant data being pushed to and pulled from the cloud, which will ultimately result in longer sync times and increased data usage.
The SQL queries are very predictable (there is only one format each of insert, update and delete used in the app) so I am considering using an encoding/decoding routine which will compress the SQL command for storage in the transaction log, and then decompress it from the log for execution.
The string compression methods I have found don't seem to do too well with SQL queries, so I've devised my own:
Single byte to identify the SQL command type
Table and column names indexed in arrays in the app, and the names are encoded using their index position in the array
String of tab separated digits to represent groups of columns, and tab separated values (e.g. in a VALUES() clause)
Encoded check column and value (for the WHERE clause in an update or delete command)
Using this format I have compressed one example query of 186 bytes down to just 78 bytes. This has clear advantages for speed of data transmission and amount of data usage.
The disadvantage I foresee is that it will require more processing on the client end to encode and decode the commands. I am wondering whether anybody has done anything similar and has any advice to offer.
To make it clearer what I am asking: in general, is it better to minimise the amount of data being synced and increase the burden on the client to interpret those data, or is it preferable to just sync the data as-is and leave the client to use it as-is?
I'm posting an answer to my own question because I have some information which may be useful to others who are looking into the same thing.
I spent some time yesterday writing SQL query compression and decompression functions in Objective C. These functions use the method detailed in my original question and in so doing reduce all non-data parts of the queries (SQL command [insert/update/delete], table names and column names) to a single digit to represent each, and remove all of the remaining SQL syntax (key words like "FROM", spaces, commas, brackets...).
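As a rough illustration of the scheme (the real functions are written in Objective-C; the table/column registries, separators, and names below are simplified assumptions, and only INSERT is shown):

TABLES = ['contacts', 'notes']                     # indexed table registry
COLUMNS = {'contacts': ['id', 'name', 'phone']}    # indexed column registry

def encode_insert(table, columns, values):
    """'I' + table index + '|' + column indexes + '|' + tab-separated values."""
    t = TABLES.index(table)
    cols = ','.join(str(COLUMNS[table].index(c)) for c in columns)
    return 'I%d|%s|%s' % (t, cols, '\t'.join(values))

def decode(encoded):
    """Rebuild the full SQL text from the compact form (INSERT only here)."""
    t, cols, vals = encoded[1:].split('|', 2)
    table = TABLES[int(t)]
    names = [COLUMNS[table][int(i)] for i in cols.split(',')]
    values = ', '.join("'%s'" % v for v in vals.split('\t'))
    return 'INSERT INTO %s (%s) VALUES (%s)' % (table, ', '.join(names), values)

enc = encode_insert('contacts', ['name', 'phone'], ['Ann', '555-0100'])
print(enc)          # e.g. I0|1,2|Ann<TAB>555-0100
print(decode(enc))  # INSERT INTO contacts (name, phone) VALUES ('Ann', '555-0100')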
I have done some testing by creating five records both with and without the query compression enabled, and here are the results for the file sizes:
Full SQL queries (logged unmodified to the transactions file)
Zip compressed: 747 bytes
Uncompressed: 1,515 bytes
Compressed SQL queries (compressed using my custom format to the transactions file)
Zip compressed: 673 bytes
Uncompressed: 785 bytes
As you can see, the greatest benefit came from using both types of compression. Encoding the queries alone achieved roughly 50% compression. Encoding and then zipping the queries achieved about 10% better compression than zipping the unencoded queries.
The question I really need to ask myself now is whether it's worth the additional overhead to encode and then zip the queries, and to unzip and decode them after downloading the transaction logs from other devices. In this example I only saved 74 bytes overall by encoding and then compressing the transaction log. With five queries this averages 14.8 bytes saved per query (compared to zipping alone). That is just 14.4 KB per 1000 records, which does not seem like a lot.
