Creating a tar stream of arbitrary data and size - tar

I need to create an arbitrarily large tarfile for testing but don't want it to hit the disk.
What's the easiest way to do this?

You can easily use Python to generate such a tar file:
mktar.py:
#!/usr/bin/env python3
import sys
import tarfile
import time

# "w|" writes a non-seekable tar stream, so nothing ever hits the disk.
tar = tarfile.open(fileobj=sys.stdout.buffer, mode="w|")
info = tarfile.TarInfo(name="fizzbuzz.data")
info.mode = 0o644
info.size = 1048576 * 16              # 16 MiB of member data
info.mtime = int(time.time())
with open('/dev/urandom', 'rb') as rand:
    tar.addfile(info, rand)           # reads exactly info.size bytes from rand
tar.close()
michael@challenger:~$ ./mktar.py | tar tvf -
-rw-r--r-- 0/0 16777216 2012-08-02 13:39 fizzbuzz.data

You can use tar's -O option to extract a member to standard output, like this: tar -xOzf foo.tgz bigfile | process
https://www.gnu.org/software/tar/manual/html_node/Writing-to-Standard-Output.html
PS: However, you may not get the benefit you are hoping for, as tar appears to start writing to stdout only after it has read through the entire compressed file. You can demonstrate this behaviour by starting a large extraction and watching the output size over time; it stays at zero for most of the run and only starts growing very late. On the other hand, I haven't researched this extensively; there may be a workaround, or I might simply be wrong, going by my first-hand out-of-memory experience.
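If buffering does turn out to be a problem, here is a minimal Python sketch of the same idea (not part of the answer above): tarfile's stream mode "r|gz" reads the archive strictly as a stream from stdin, so the chosen member is copied to stdout as it arrives. The script name and the member-name argument are placeholders.
#!/usr/bin/env python3
# streamone.py: copy one member of a gzipped tar stream from stdin to stdout.
import shutil
import sys
import tarfile

wanted = sys.argv[1]
with tarfile.open(fileobj=sys.stdin.buffer, mode="r|gz") as tar:
    for member in tar:                 # members arrive in archive order
        if member.name == wanted:
            shutil.copyfileobj(tar.extractfile(member), sys.stdout.buffer)
            break
Usage would then look something like: cat foo.tgz | ./streamone.py bigfile | process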

Related

Is it possible to add a file to a zip stream?

Let's suppose I download a zip archive; I mean something like adding a file to the data stream on the fly, avoiding the use of a temp file:
wget http://example.com/archive.zip -O - | zipadder -f myfile.txt | pv
I read somewhere that bsdtar can manipulate such streams.
This will likely be hard on RAM, as it requires manipulating the zip structure entirely in memory. That said, it should be possible to write zipadder in Python, using an in-memory file-like object (io.BytesIO) to hold the data read from stdin, like so:
#!/usr/bin/env python3
import io
import sys
import zipfile

s = io.BytesIO(sys.stdin.buffer.read())   # read the whole archive from stdin into memory
with zipfile.ZipFile(s, 'a') as f:
    f.write('myfile.txt')                  # append myfile.txt to the in-memory archive
sys.stdout.buffer.write(s.getvalue())      # write the updated archive to stdout
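Saved as, say, zipadder.py (the file to add is hard-coded here), the pipeline from the question would look roughly like:
wget http://example.com/archive.zip -O - | ./zipadder.py | pv > patched.zip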

Extract a file from an xz file

I have a huge file file.tar.xz containing many smaller text files with a similar structure. I want to quickly examine one file out of the compressed archive to get a glimpse of the files' content structure. I don't have information about the names of the files within the archive. Is there any way to extract a single file in this scenario?
Thank you.
EDIT: I don't want to tar -xvf file.tar.xz.
Based on the discussion in the comments, I tried the following, which worked for me. It might not be the most optimal solution, and the regex might need some improvement, but you'll get the idea.
I first created a demo archive:
cd /tmp
mkdir demo
for i in {1..100}; do echo $i > "demo/$i.txt"; done
cd demo && tar cfJ ../demo.tar.xz * && cd ..
demo.tar.xz now contains 100 txt files.
The following lists the contents of the archive, selects the first file and stores the path within the archive into the variable firstfile:
firstfile=`tar -tvf demo.tar.xz | grep -Po -m1 "(?<=:[0-9]{2} ).*$"`
echo $firstfile will output 1.txt.
You can now extract this single file from the archive:
tar xf demo.tar.xz $firstfile
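A rough Python alternative (not from the answer above), assuming the archive is called demo.tar.xz: tarfile can iterate over the archive and stop at the first regular file, so nothing needs to be extracted to disk at all.
#!/usr/bin/env python3
# Print the name and contents of the first regular file in demo.tar.xz.
import tarfile

with tarfile.open("demo.tar.xz", mode="r:xz") as tar:
    for member in tar:
        if member.isfile():
            print(member.name)
            print(tar.extractfile(member).read().decode())
            break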

tar pre-run to evaluate expected size or number of files

The problem:
I have a back-end process that at some point collects and builds a big tar file.
The tar command is given a few directories and an exclude file.
The process can take up to a few minutes, and I want my front-end process (GUI) to report on the progress of the tarring (this is a big issue for a user who presses the download button and then sees nothing happening...).
I know I can use -v -R in the tar command and count files and track size progress, but I am looking for some kind of tar pre-run / dry-run mode to help me evaluate either the expected number of files or the expected tar size.
The command I am using: tar -jcf 'FILE.tgz' 'exclude_files' 'include_dirs_and_files'
Thanks to everyone who is willing to assist.
You can pipe the output to the wc tool instead of actually making a file.
With file listing (verbose):
[git@server]$ tar czvf - ./test-dir | wc -c
./test-dir/
./test-dir/test.pdf
./test-dir/test2.pdf
2734080
Without:
[git@server]$ tar czf - ./test-dir | wc -c
2734080
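The same trick also works from Python, in case the back-end prefers to stay in one language. This is only a sketch (not part of the answer above): hand tarfile a dummy file object that merely counts the bytes written to it; "./test-dir" is a placeholder path.
#!/usr/bin/env python3
# Count the bytes of a gzipped tar stream without writing it anywhere.
import tarfile

class ByteCounter:
    def __init__(self):
        self.count = 0
    def write(self, data):
        self.count += len(data)
        return len(data)

sink = ByteCounter()
with tarfile.open(fileobj=sink, mode="w|gz") as tar:
    tar.add("./test-dir")          # placeholder directory
print(sink.count)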
Why don't you run something like this beforehand:
DIRS=("./test-dir" "./other-dir-to-test")
find "${DIRS[@]}" -type f | wc -l
This gets all the files (-type f), one per line, and counts them. DIRS is a bash array, so you can store the folders in a variable.
If you want to know the total size of the stored files, you can use du:
DIRS=("./test-dir" "./other-dir-to-test")
du -c -d 0 "${DIRS[@]}" | tail -1 | awk -F ' ' '{print $1}'
This prints the disk usage with du, calculates a grand total (-c flag), takes the last line (for example "4378921 total"), and keeps just the first column with awk.
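If the front end is easier to feed from a script, here is a minimal sketch (not from either answer) that pre-computes the file count and total byte size with os.walk; the directory names are placeholders and exclude patterns are left out.
#!/usr/bin/env python3
# Pre-compute how many files and how many bytes tar is about to process.
import os

dirs = ["./test-dir", "./other-dir-to-test"]
total_files = 0
total_bytes = 0
for top in dirs:
    for root, _subdirs, files in os.walk(top):
        for name in files:
            total_files += 1
            total_bytes += os.path.getsize(os.path.join(root, name))

print(f"{total_files} files, {total_bytes} bytes")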

HDF5 integration with ROOT framework

I've worked extensively with ROOT, which has its own format for data files, but for various reasons we would like to switch to HDF5 files. Unfortunately, we would still need some way of translating files between the formats. Does anyone know of any existing libraries which do this?
You might check out rootpy, which has a facility for converting ROOT files into HDF5 via PyTables: http://www.rootpy.org/commands/root2hdf5.html
If this issue is still of interest to you, recently there have been large improvements to rootpy's root2hdf5 script and the root_numpy package (which root2hdf5 uses to convert TTrees into NumPy structured arrays):
root2hdf5 -h
usage: root2hdf5 [-h] [-n ENTRIES] [-f] [--ext EXT] [-c {0,1,2,3,4,5,6,7,8,9}]
[-l {zlib,lzo,bzip2,blosc}] [--script SCRIPT] [-q]
files [files ...]
positional arguments:
files
optional arguments:
-h, --help show this help message and exit
-n ENTRIES, --entries ENTRIES
number of entries to read at once (default: 100000.0)
-f, --force overwrite existing output files (default: False)
--ext EXT output file extension (default: h5)
-c {0,1,2,3,4,5,6,7,8,9}, --complevel {0,1,2,3,4,5,6,7,8,9}
compression level (default: 5)
-l {zlib,lzo,bzip2,blosc}, --complib {zlib,lzo,bzip2,blosc}
compression algorithm (default: zlib)
--script SCRIPT Python script containing a function with the same name
that will be called on each tree and must return a tree or
list of trees that will be converted instead of the
original tree (default: None)
-q, --quiet suppress all warnings (default: False)
As of when I last checked (a few months ago), root2hdf5 had the limitation that it could not handle TBranches which were arrays. For that reason I wrote a bash script: root2hdf (sorry for the uncreative name).
It takes a ROOT file and the path to the TTree in the file as input arguments, generates source code, and compiles it into an executable which can be run on ROOT files, converting them into HDF5 datasets.
It also has the limitation that it cannot handle compound TBranch types, but I don't know whether root2hdf5 handles those either.
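For flat trees, the same conversion can also be done by hand in a few lines. This is only a sketch, assuming root_numpy and h5py are installed; the file name data.root and the tree name mytree are placeholders.
#!/usr/bin/env python3
# Convert a single flat TTree into an HDF5 dataset by hand.
import h5py
from root_numpy import root2array

arr = root2array("data.root", treename="mytree")   # TTree -> NumPy structured array
with h5py.File("data.h5", "w") as out:
    out.create_dataset("mytree", data=arr)          # structured array -> HDF5 compound dataset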

Can't read Mahout-generated sequence files with Hadoop streaming

I am trying to stream a sequence file generated by one of the Mahout examples to see its contents:
hadoop jar hadoop-streaming-0.20.2-cdh3u0.jar \
-input /tmp/mahout-work-me/20news-bydate/bayes-test-input-output/ \
-output /tmp/me/mm \
-mapper "cat" \
-reducer "wc -l" \
-inputformat SequenceFileAsTextInputFormat
The job starts successfully and eventually dies with:
11/11/30 21:08:39 INFO streaming.StreamJob: map 0% reduce 0%
11/11/30 21:09:17 INFO streaming.StreamJob: map 100% reduce 100%
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: org.apache.mahout.common.StringTuple
I wonder if something is wrong with my streaming jar file, if I need to point explicitly to the Mahout jar that contains this class (I tried setting HADOOP_CLASSPATH to the location of mahout-core-0.5-cdh3u2.jar, but that did not work), or maybe something else entirely?
Any help is appreciated. Thanks.
Add this option:
-libjars mahout-core-0.5-cdh3u2.jar
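Note that -libjars is a generic Hadoop option and has to come before the streaming-specific options, so the full invocation would look something like this (paths as in the question):
hadoop jar hadoop-streaming-0.20.2-cdh3u0.jar \
-libjars mahout-core-0.5-cdh3u2.jar \
-input /tmp/mahout-work-me/20news-bydate/bayes-test-input-output/ \
-output /tmp/me/mm \
-mapper "cat" \
-reducer "wc -l" \
-inputformat SequenceFileAsTextInputFormat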
