Let's suppose I download a zip archive and want to add a file to it on the fly, in the data stream, avoiding the use of a temp file:
wget http://example.com/archive.zip -O - | zipadder -f myfile.txt | pv
I read somewhere that bsdtar can manipulate such streams.
This will likely be hard on RAM, as it requires manipulating the zip structure entirely in memory. That said, it should be possible to write zipadder in Python, using StringIO to manipulate a memory-backed file-like object read from stdin, like so:
#!/usr/bin/env python
import zipfile
import sys
import StringIO
s = StringIO.StringIO(sys.stdin.read())  # read the whole zip from stdin into a memory buffer
f = zipfile.ZipFile(s, 'a')              # open the buffer as a zip archive in append mode
f.write('myfile.txt')                    # add the file to the archive
f.close()
sys.stdout.write(s.getvalue())           # write the buffer to stdout (avoids the trailing newline that print would add)
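Note that this sketch hardcodes myfile.txt rather than parsing a -f flag, but saved as an executable zipadder (with myfile.txt present in the working directory) it should slot into the original pipeline:
wget http://example.com/archive.zip -O - | ./zipadder | pv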
Is it possible to use a file containing filters as a filter itself, instead of having to write each filter as -f ...... -f .......? I would like a file that contains all the filters I wish to use for the capture. What should the format of this file be, and how do I create it? Something like: "Filter1" udp "Filter2" ip6 ........ When using this file from CMD, what would the expression be? dumpcap -i 5 -???????? -w capture.pcapng
I am looking for what to type in CMD in order to use a file as a capture filter, instead of manually writing all the filters as -f ........ -f .......
I am trying to check from the UNIX command line whether the variable GROUP exists in a SAS data set file, but unfortunately it reports that the GROUP variable does not exist in the data set, even though GROUP is present in the SAS data set.
In my command I am using grep's i and w options for case-insensitive and whole-word matching respectively, but the UNIX command is still not giving the expected result. Is there any way to fix this issue?
Below is the command which I am using:
sasfile="sasdata"
rwords="GROUP"
cat $sasfile | grep -iqw "$rwords"
Thank you
As mentioned in an earlier comment:
SAS data sets are stored in disk files using a proprietary format. There may be encodings and storage methodologies that do not yield the information you seek in a plain text examination of said disk file.
Running SAS code in a SAS session is the definitive way to glean information about a data set.
What will that code look like?
Proc CONTENTS
Data step or macro code that uses VARNAME function
... many other ways ...
On UNIX, SAS can use stdio.
From "SAS(R) 9.2 Companion for UNIX Environments", STDIO System Option: UNIX
Details
This option tells SAS to take its input from standard input (stdin), to write its log to standard error (stderr), and to write its output to standard output (stdout).
This option is designed for running SAS in batch mode or from a shell script. If you specify this option interactively, SAS starts a line mode session.
The STDIO option overrides the DMS, DMSEXP, and EXPLORER system options. The STDIO option does not affect the assignment of the Stdio, Stdin, and Stderr filerefs. See Filerefs Assigned by SAS in UNIX Environments for more information.
For example, in the following SAS command, the file myinput is used as the source program, and files myoutput and mylog are used for the procedure output and log respectively.
sas -stdio < myinput > myoutput 2> mylog
If you are using the C shell, you should
use parentheses:
(sas -stdio < myinput > myoutput ) >& output_log
With -stdio you want a short SAS program that can indicate if a variable is present in a data set, or perhaps emit a list of variables in a data set for further shell processing. A Proc CONTENTS step is short and sweet.
So, looking for your proverbial needle in a haystack:
sasfile=<path to data set file>/<dataset>.sas7bdat
needle=GROUP
echo "Proc CONTENTS data=""$sasfile""" | sas -stdio | grep $needle
The default CONTENTS output might yield some false matches, so you could also try
echo "Proc CONTENTS noprint data=""$sasfile"" out=list;data _null_;set list;file print;put name;"
| sas -stdio
| grep -i "GROUP"
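If all you need in a shell script is a yes/no answer, a sketch combining the pipeline above with grep's quiet flag and exit status (assuming the sasfile and needle variables defined earlier, and that sas is on the PATH):
if echo "Proc CONTENTS noprint data=\"$sasfile\" out=list; data _null_; set list; file print; put name; run;" \
     | sas -stdio \
     | grep -iqw "$needle"
then
  echo "variable $needle exists in $sasfile"
else
  echo "variable $needle not found"
fi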
You could try:
sasfile="sasdata"
rwords="GROUP"
grep -iw "$rwords" "$sasfile"
The only differences between this and your original command are that I dropped cat and grep's quiet flag -q.
Sample input in sasdata:
fasd group
fdsfds fdsfdsa
fdsfd as GROUP afdsfdsa
Output:
fasd group
fdsfd as GROUP afdsfdsa
The -q flag of grep suppresses standard output, but echo $? can retrieve grep's return value. Using the same input file as before:
grep -iqw "$rwords" "$sasfile" # No stdout
echo $? # Prints 0, means grep succeeded
grep -iqw "word" "$sasfile" # No stdout
echo $? # Prints 1, means grep failed
I have a huge file file.tar.xz containing many smaller text files with a similar structure. I want to quickly examine one file out of the compressed archive to get a glimpse of the files' content structure. I don't have information about the names of the files within the archive. Is there any way to extract a single file given the above scenario?
Thank you.
EDIT: I don't want to tar -xvf file.tar.xz.
Based on the discussion in the comments, I tried the following which worked for me. It might not be the most optimal solution, the regex might need some improvement, but you'll get the idea.
I first created a demo archive:
cd /tmp
mkdir demo
for i in {1..100}; do echo $i > "demo/$i.txt"; done
cd demo && tar cfJ ../demo.tar.xz * && cd ..
demo.tar.xz now contains 100 txt files.
The following lists the contents of the archive, selects the first file and stores the path within the archive into the variable firstfile:
firstfile=`tar -tvf demo.tar.xz | grep -Po -m1 "(?<=:[0-9]{2} ).*$"`
echo $firstfile will output 1.txt.
You can now extract this single file from the archive:
tar xf demo.tar.xz $firstfile
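If you only want to glimpse the file's contents without writing it to disk at all, GNU tar can send the extracted member to stdout with -O (--to-stdout); a small sketch using the same variable:
tar -xOf demo.tar.xz "$firstfile" | head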
I need to create an arbitrarily large tarfile for testing but don't want it to hit the disk.
What's the easiest way to do this?
You can easily use python to generate such a tarfile:
mktar.py:
#!/usr/bin/python
import datetime
import sys
import tarfile
tar = tarfile.open(fileobj=sys.stdout, mode="w|")  # stream the tar to stdout, never touching disk
info = tarfile.TarInfo(name="fizzbuzz.data")
info.mode = 0644
info.size = 1048576 * 16  # 16 MiB entry; raise this for a larger archive
info.mtime = int(datetime.datetime.now().strftime('%s'))
rand = open('/dev/urandom', 'r')
tar.addfile(info, rand)   # reads info.size bytes from /dev/urandom into the entry
tar.close()
michael@challenger:~$ ./mktar.py | tar tvf -
-rw-r--r-- 0/0 16777216 2012-08-02 13:39 fizzbuzz.data
You can use tar with the -O option, like this: tar -xOzf foo.tgz bigfile | process
https://www.gnu.org/software/tar/manual/html_node/Writing-to-Standard-Output.html
PS: However, it could be that you will not get the benefits you intend to gain, as tar starts writing to stdout only after it has read through the entire compressed file. You can demonstrate this behavior by starting a large file extraction and following the file size over time; it should be zero for most of the processing time and start growing at a very late stage. On the other hand, I haven't researched this extensively; there might be a workaround, or I might be just plain wrong based on my first-hand out-of-memory experience.
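One minimal way to follow that suggestion (assuming GNU tar and the hypothetical foo.tgz archive with a member bigfile from the example above) is to start the extraction in the background and poll the output file's size:
tar -xzf foo.tgz bigfile &
watch -n 1 ls -l bigfile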
I've worked extensively with ROOT, which has its own format for data files, but for various reasons we would like to switch to HDF5 files. Unfortunately we'd still require some way of translating files between the formats. Does anyone know of any existing libraries which do this?
You might check out rootpy, which has a facility for converting ROOT files into HDF5 via PyTables: http://www.rootpy.org/commands/root2hdf5.html
If this issue is still of interest to you, recently there have been large improvements to rootpy's root2hdf5 script and the root_numpy package (which root2hdf5 uses to convert TTrees into NumPy structured arrays):
root2hdf5 -h
usage: root2hdf5 [-h] [-n ENTRIES] [-f] [--ext EXT] [-c {0,1,2,3,4,5,6,7,8,9}]
[-l {zlib,lzo,bzip2,blosc}] [--script SCRIPT] [-q]
files [files ...]
positional arguments:
files
optional arguments:
-h, --help show this help message and exit
-n ENTRIES, --entries ENTRIES
number of entries to read at once (default: 100000.0)
-f, --force overwrite existing output files (default: False)
--ext EXT output file extension (default: h5)
-c {0,1,2,3,4,5,6,7,8,9}, --complevel {0,1,2,3,4,5,6,7,8,9}
compression level (default: 5)
-l {zlib,lzo,bzip2,blosc}, --complib {zlib,lzo,bzip2,blosc}
compression algorithm (default: zlib)
--script SCRIPT Python script containing a function with the same name
that will be called on each tree and must return a tree or
list of trees that will be converted instead of the
original tree (default: None)
-q, --quiet suppress all warnings (default: False)
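For example, using only the options listed in the help above (myfile.root is a hypothetical input file), a conversion that overwrites existing output and uses a different compression library might look like:
root2hdf5 -f -c 9 -l lzo myfile.root
With the default --ext, this should write myfile.h5 next to the input.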
As of when I last checked (a few months ago), root2hdf5 had a limitation that it could not handle TBranches which were arrays. For this reason I wrote a bash script: root2hdf (sorry for the uncreative name).
It takes a ROOT file and the path to the TTree in the file as input arguments, generates source code, and compiles it into an executable which can be run on ROOT files, converting them into HDF5 datasets.
It also has a limitation that it cannot handle compound TBranch types, but I don't know that root2hdf5 does either.