Why does deleting a line in sed use memory?

I have a 1.5 GB gzipped text file from which I want to delete one (syntactically invalid) line. I'm using the following command for this:
zcat foo.gz | sed -n '36690930d' | gzip > bar.gz
The above results in sed using all the memory on my machine and getting killed by the OS.
Why does sed use any noticeable amount of memory at all? The way I see it, it should just stream through the file and skip one line somewhere, with no memory needed.
And what would be a better way to achieve the line deletion using little or no memory?
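One low-memory alternative (my own sketch, not from the original post; it reuses the line number from the command above) is to stream the file through awk, which reads one line at a time and never buffers the whole input:
zcat foo.gz | awk 'NR != 36690930' | gzip > bar.gz
Every line is printed except the one whose record number matches, so the pipeline should keep a constant, tiny memory footprint.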

Related

Grepping list of phpass hashes against a file

I'm trying to grep multiple strings that look like this (there are a few hundred of them) against a file that contains data:string pairs.
Example strings (no sensitive data is included; they have been modified):
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1
I've been researching how to grep a file of patterns against another file, and came across the following commands
grep -f strings.txt datastring.txt > output.txt
grep -Ff strings.txt datastring.txt > output.txt
But unfortunately, these commands do NOT work as expected and only print a handful of results to my output file. I think it may be something to do with the symbols contained in strings.txt, but I'm unsure. Any help/advice would be great.
I should also mention that I'm using Cygwin on Windows (in case this is relevant).
Here's an updated example:
strings.txt contains the following:
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1
datastring.txt contains the following:
$H$9a...DcuCqC/rMVmfiFNm2rqhK5vFW1:53491
$H$9n...AHZAV.sTefg8ap8qI8U4A5fY91:03221
$H$9o...Bi6Z3E04x6ev1ZCz0hItSh2JJ/:20521
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1:30142
So technically, all lines should be included in the output file, but only this line is output:
$H$9w...CFva1ddp8IRBkgwww3COVLf/K1:30142
I just don't understand.
You have shown the output of cat -A strings.txt elsewhere, which includes ^M, representing a CR (carriage return) character, at the end of each line.
This indicates that your file has Windows line endings (CR LF) instead of the Unix line endings (LF only) that grep expects.
You can convert files with dos2unix strings.txt and back with unix2dos strings.txt.
Alternatively, if you don't have dos2unix installed in your Cygwin environment, you can do the same with sed:
sed -i 's/\r$//' strings.txt # dos2unix
sed -i 's/$/\r/' strings.txt # unix2dos
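If you prefer not to modify strings.txt in place, a small variation (my own sketch, assuming a bash-compatible shell, which Cygwin provides) strips the carriage returns on the fly with process substitution:
grep -Ff <(tr -d '\r' < strings.txt) datastring.txt > output.txt
Here tr -d '\r' removes the CR characters before grep ever sees the patterns, so the file on disk is left untouched.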

Output file much larger than input files after cat + grep

I have 18 CSV files, all between 1 MB and 14 MB; the sum of all files is 64 MB. I want to create a new CSV file that contains a subset of those files: only the lines featuring the pattern "Hello" (or "HELLO", or "hello", ...). Here's what I'm doing:
cat *.csv | head -n 1 > new.csv # I want to create a header first
cat *.csv | grep -i "hello" >> new.csv
I'm running Debian on WSL. The output file is much, much larger than the original 64 MB (I stopped the process after more than an hour, and the file was already 300+ GB).
How can a subset of the text files be larger than the original files? Does it have anything to do with WSL?
This is not an OS issue. By the time your second command runs, new.csv already exists (your first command created it), so the glob expression *.csv expands to include new.csv as well. grep is then reading new.csv while appending to it, and that is the root cause of the runaway, recursive growth you are seeing.
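A quick way to see this (an illustration with made-up file names, not the asker's data):
$ ls
a.csv  b.csv
$ : > new.csv      # new.csv now exists, as it does after your first command
$ echo *.csv       # the glob now picks up the output file too
a.csv b.csv new.csv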
You are reading all the files twice, which is not necessary. You can make your operation a lot simpler and more efficient with a single awk command:
awk 'NR==1 {print} tolower($0) ~ /hello/ {print}' *.csv > csv.new
mv csv.new new.csv
Since the output file is named csv.new, it won't interfere with the glob *.csv.
NR==1 picks up the first line (the header) from the very first file.
The awk command can be written more succinctly as:
awk 'NR==1 || tolower($0) ~ /hello/' *.csv > csv.new
You are using *.csv and redirecting the output to new.csv, which itself matches *.csv, so grep ends up re-reading its own output. Perhaps you can try:
grep -i hello *.csv --exclude="new.csv" >> new.csv

Why is pcregrep faster than grep?

I have a large text file (a 3 GB Rails log file) on a CentOS machine, with a corrupted byte somewhere in it. When I try to search for a pattern using grep, it runs indefinitely and I have to kill it; with pcregrep it takes less than a minute. Any clue why the difference?
My search using grep:
grep -Pzo "2016-04-20(.*?)SomeController#index" production.log | wc -l
Using pcregrep:
pcregrep -M "2016-04-20(.*?)SomeController#index" production.log | wc -l

How do I extract partial path from pwd in tcsh?

I basically want to implement an alias (using cd) that takes me to the 5th directory in my pwd, i.e.
if my pwd is /hm/foo/bar/dir1/dir2/dir3/dir4/dir5, I want my alias, say cdf, to take me to /hm/foo/bar/dir1/dir2.
So basically I am trying to figure out how to strip a given path down to a given number of directory levels in tcsh.
Any pointers?
Edit:
Okay, I got this far in printing out the directory I want to cd into, using awk:
alias cdf 'echo `pwd` | awk -F '\''/'\'' '\''BEGIN{OFS="/";} {print $1,$2,$3,$4,$5,$6,$7;}'\'''
I am finding it difficult to wrap a cd around this, as it has already turned into a mess of escaped characters.
This should do the trick:
alias cdf source ~/.tcsh/cdf.tcsh
And in ~/.tcsh/cdf.tcsh:
cd "`pwd | cut -d/ -f1-6`"
We use the pwd tool to get the current path and pipe it to cut, where we split on the delimiter / (-d/) and keep the first six fields (-f1-6). Since the path starts with /, field 1 is empty, so six fields correspond to five directory levels.
You can see cut as a very light awk; in many cases it's enough, and it hugely simplifies things.
The problem with your alias is tcsh's quirky quoting rules. I'm not even going to try to fix that; we use source to sidestep all of it.
tcsh lacks functions, but you can sort of emulate them this way. I never said it was pretty.
@carpetsmoker's solution using cut is nice and simple. But since his solution awkwardly uses an extra file and source, here's a demonstration of how to avoid that. Using single quotes prevents the premature evaluation.
% alias cdf 'cd "`pwd | cut -d/ -f1-6`"'
% alias cdf
cd "`pwd | cut -d/ -f1-6`"
Here's a simple demonstration of how single quotes can work with backticks:
% alias pwd2 'echo `pwd`'
% alias pwd2
echo `pwd`
% pwd2
/home/shx2

tar pre-run to evaluate expected size or amount of files

The problem:
I have a back-end process that at some point collects files and builds a big tar archive.
The tar command receives a few directories and an exclude file.
The process can take up to a few minutes, and I want my front-end process (GUI) to report the progress of the tarring (this is a big issue for a user who presses the download button and then sees nothing happening...).
I know I can use -v and -R in the tar command and count files and bytes as they are written, but I am looking for some kind of tar pre-run / dry-run mode to help me estimate either the expected number of files or the expected tar size.
The command I am using: tar -jcf 'FILE.tgz' 'exclude_files' 'include_dirs_and_files'
Thanks to everyone who is willing to assist.
You can pipe the output to the wc tool instead of actually making a file.
With file listing (verbose):
[git@server]$ tar czvf - ./test-dir | wc -c
./test-dir/
./test-dir/test.pdf
./test-dir/test2.pdf
2734080
Without:
[git@server]$ tar czf - ./test-dir | wc -c
2734080
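Applied to the command from the question, the same idea might look like this (a sketch: it assumes the exclude list is meant to be passed to tar with -X/--exclude-from, and it keeps the asker's placeholder names):
tar -cjf - -X 'exclude_files' 'include_dirs_and_files' | wc -c
Note that this does all the compression work once just to get the byte count, so it roughly doubles the total run time if you then create the real archive.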
Why don't you run a
DIRS=("./test-dir" "./other-dir-to-test")
find ${DIRS[@]} -type f | wc -l
beforehand? This gets all the files (-type f), one per line, and counts the number of files. DIRS is an array in bash, so you can store the folders in a variable.
If you want to know the total size of all the stored files, you can use du:
DIRS=("./test-dir" "./other-dir-to-test")
du -c -d 0 ${DIRS[@]} | tail -1 | awk -F ' ' '{print $1}'
This prints the disk usage with du, calculates a grand total (-c flag), takes the last line (e.g. 4378921 total), and keeps just the first column with awk.
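Putting both estimates together, a small wrapper (my own sketch, reusing the hypothetical DIRS array from above) could hand the numbers to the front-end before the real tar run starts:
#!/bin/bash
DIRS=("./test-dir" "./other-dir-to-test")
# expected number of files
FILES=$(find "${DIRS[@]}" -type f | wc -l)
# expected total size in bytes (apparent size of the inputs, not the compressed archive)
BYTES=$(du -cb "${DIRS[@]}" | tail -1 | awk '{print $1}')
echo "files=$FILES bytes=$BYTES"
The GUI can then show these targets while counting the -v output of the actual tar run against them.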
