I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?
grep -v -x -f f2 f1 should do the trick.
Explanation:
-v to select non-matching lines
-x to match whole lines only
-f f2 to get patterns from f2
You can add -F (or use fgrep) to match the lines of f2 as fixed strings rather than as patterns, in case you want to remove lines in a "what you see is what you get" manner instead of treating the lines in f2 as regular expressions.
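The difference between pattern matching and fixed-string matching can be sketched in Python. This only illustrates the semantics and is not how grep is implemented; the function names and sample data are made up for the example, and Python's `re` syntax is not identical to grep's basic regular expressions.

```python
import re

def remove_lines_regex(lines, patterns):
    """Roughly like `grep -v -x -f`: drop lines fully matched by any pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [ln for ln in lines if not any(c.fullmatch(ln) for c in compiled)]

def remove_lines_fixed(lines, patterns):
    """Roughly like `grep -v -x -F -f`: drop lines equal to any pattern."""
    unwanted = set(patterns)
    return [ln for ln in lines if ln not in unwanted]

f1 = ["a.c", "abc", "line3"]  # hypothetical contents of f1
f2 = ["a.c"]                  # hypothetical contents of f2

print(remove_lines_regex(f1, f2))  # ['line3']: '.' matches any character, so "abc" goes too
print(remove_lines_fixed(f1, f2))  # ['abc', 'line3']: only the literal "a.c" is removed
```

If f2 holds plain text rather than patterns, the fixed-string variant is usually the one you want.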
Try comm instead (assuming f1 and f2 are "already sorted")
comm -2 -3 f1 f2
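To see why comm needs sorted input, here is a rough Python sketch of the merge-style walk behind `comm -23`. The function name is hypothetical, and Python compares strings by code point, whereas comm follows the locale's collation order, so treat it as an approximation.

```python
def comm_23(a, b):
    """Return the lines found only in sorted list `a`, like `comm -23 a b`."""
    out = []
    j = 0
    for line in a:
        # Skip lines of b that sort before the current line of a.
        while j < len(b) and b[j] < line:
            j += 1
        if j < len(b) and b[j] == line:
            j += 1            # pair off one matching line from b
        else:
            out.append(line)  # no partner in b: unique to a
    return out

print(comm_23(["a", "b", "b", "d"], ["b", "c"]))  # ['a', 'b', 'd']
```

Each list is walked only once, which is why the approach is fast but gives wrong answers on unsorted input.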
For exclude files that aren't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n) (exclude-these.txt size) + O(n) (from-this.txt size)
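The same idea as a Python sketch, for comparison: one pass builds a hash set from the exclude list, and one pass filters the other list. The function name and the `ignore_case` parameter are made up for this example.

```python
def exclude_lines(exclude, source, ignore_case=True):
    """Return the lines of `source` not listed in `exclude`, in original order."""
    key = str.lower if ignore_case else (lambda s: s)
    seen = {key(line) for line in exclude}                     # pass over the exclude list
    return [line for line in source if key(line) not in seen]  # pass over the source

print(exclude_lines(["Line2", "line8"], ["line1", "LINE2", "line3"]))
# ['line1', 'line3']
```

Set lookups are constant time on average, so the total work grows with the sum of the two file sizes.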
Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line, no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
If you have Ruby (1.9+):
#!/usr/bin/env ruby
b=File.read("file2").split
open("file1").each do |x|
x.chomp!
puts x if !b.include?(x)
end
This has O(N^2) complexity. If you care about performance, here's another version:
b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}
This uses a hash to perform the subtraction, so its complexity is O(n) (size of a) + O(n) (size of b).
Here's a little benchmark of the above, courtesy of user576875, but with 100K lines:
$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test
real 0m0.639s
user 0m0.554s
sys 0m0.021s
$ time sort file1 file2|uniq -u > sort.test
real 0m2.311s
user 0m1.959s
sys 0m0.040s
$ diff <(sort -n ruby.test) <(sort -n sort.test)
$
diff was used to show there are no differences between the 2 files generated.
Some timing comparisons between various other answers:
$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null
real 0m0.019s
user 0m0.023s
sys 0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null
real 0m0.026s
user 0m0.018s
sys 0m0.007s
$ time grep -xvf f2 f1 > /dev/null
real 0m43.197s
user 0m43.155s
sys 0m0.040s
sort f1 f2 | uniq -u isn't even a symmetrical difference, because it removes lines that appear multiple times in either file.
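A small Python sketch of that point, using made-up data: a line repeated within one file is dropped by the `sort | uniq -u` approach even though it never appears in the other file. The function names are hypothetical, and Python's sort order may differ from a locale-aware shell sort.

```python
from collections import Counter

def sort_uniq_u(*files):
    """Emulate `sort f1 f2 | uniq -u`: keep lines that occur exactly once overall."""
    counts = Counter(line for lines in files for line in lines)
    return sorted(line for line, n in counts.items() if n == 1)

def symmetric_difference(a, b):
    """Lines that occur in exactly one of the two files, ignoring repeats."""
    return sorted(set(a) ^ set(b))

f1 = ["x", "x", "y"]  # "x" appears twice in f1 and never in f2
f2 = ["z"]

print(sort_uniq_u(f1, f2))           # ['y', 'z']
print(symmetric_difference(f1, f2))  # ['x', 'y', 'z']
```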
comm can also be used with stdin and here strings:
echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
Seems to be a job suitable for the SQLite shell:
create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify “ .separator ××any_improbable_string×× ”
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
Did you try this with sed?
sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
sed -i 's#$#%%g'"'"' f1#g' f2.sh
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh
Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.
Obviously won't work for huge files but it did the trick for me. A few notes:
I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data
A Python way of filtering one list using another list.
Load files:
>>> f1 = open('f1').readlines()
>>> f2 = open('f2.txt').readlines()
Remove '\n' string at the end of each line:
>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]
Print only the f1 lines that do not contain any line of f2 (note that this is a substring test, not a whole-line comparison):
>>> [a for a in f1 if all(b not in a for b in f2)]
$ cat values.txt
apple
banana
car
taxi
$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...
I made a small shell script to "weed" out the lines of the source file that contain any of the values in values.txt.
$ cat weed_out.sh
from=$1
cp -p "$from" "$from.final"
# read values.txt line by line; each value is used as a grep pattern
while IFS= read -r x
do
grep -v -- "$x" "$from.final" > "$from.final.tmp"
mv "$from.final.tmp" "$from.final"
done < values.txt
Executing:
$ ./weed_out.sh source.txt
and you get a nicely cleaned-up file, source.txt.final.
I have two text files containing one column each, for example -
File_A File_B
1 1
2 2
3 8
If I do grep -f File_A File_B > File_C, I get File_C containing 1 and 2. I want to know how to use grep -v on two files so that I can get the non-matching values, 3 and 8 in the above example.
Thanks.
You can also use comm, if your version allows an empty output delimiter:
$ # -3 means suppress lines common to both input files
$ # by default, tab character appears before lines from second file
$ comm -3 f1 f2
3
8
$ # change it to empty string
$ comm -3 --output-delimiter='' f1 f2
3
8
Note: comm requires sorted input, so use comm -3 --output-delimiter='' <(sort f1) <(sort f2) if they are not already sorted
You can also pass the common lines found by grep as input to grep -v. Tested with GNU grep; some versions might not support all of these options.
$ grep -Fxf f1 f2 | grep -hxvFf- f1 f2
3
8
-F option to match strings literally, not as regex
-x option to match whole lines only
-h to suppress file name prefix
-f- to read the patterns from stdin instead of from a file
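The two-step grep pipeline amounts to "find the common lines, then print everything from both files that is not common". A minimal Python sketch of the same logic, with a hypothetical function name and the sample values from the question:

```python
def unique_to_either(a, b):
    """Return lines that occur in only one of the two lists, a's lines first."""
    common = set(a) & set(b)  # what the first grep (-Fxf) finds
    return [line for line in a + b if line not in common]  # what grep -v keeps

file_a = ["1", "2", "3"]
file_b = ["1", "2", "8"]
print(unique_to_either(file_a, file_b))  # ['3', '8']
```

Unlike comm, this does not need sorted input and keeps each file's original line order.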
awk 'NR==FNR{a[$0]=$0;next} !($0 in a) {print a[(FNR)], $0}' f1 f2
3 8
To understand the meaning of NR and FNR, look at what they print:
awk '{print NR,FNR}' f1 f2
1 1
2 2
3 3
4 4
5 1
6 2
7 3
8 4
The condition NR==FNR is true only while the first file is being read, because NR and FNR are equal only there; it is used to collect the data from the first file.
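The way awk counts records can be mimicked in Python to make the NR==FNR idiom concrete. This is a sketch with hypothetical names, working on in-memory lists instead of files.

```python
def number_records(*files):
    """Yield (NR, FNR, line): NR counts across all inputs, FNR restarts per input."""
    nr = 0
    for lines in files:
        for fnr, line in enumerate(lines, start=1):
            nr += 1
            yield nr, fnr, line

f1 = ["1", "2", "3"]
f2 = ["1", "2", "8"]

first_file = set()
only_in_second = []
for nr, fnr, line in number_records(f1, f2):
    if nr == fnr:                   # still reading the first input
        first_file.add(line)
    elif line not in first_file:    # second input: keep unseen lines
        only_in_second.append(line)

print(only_in_second)  # ['8']
```

One caveat that the sketch shares with awk: if the first input is empty, NR equals FNR for the second input as well, so the idiom silently treats the second file as the first.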
With GNU diff command (to compare files line by line):
diff --suppress-common-lines -y f1 f2 | column -t
The output (the left column contains lines from f1, the right column lines from f2):
3 | 8
-y, --side-by-side - output in two columns
I am trying to build a csproj file with FAKE, following the "Getting started with FAKE - F# Make" tutorial from the FAKE website. I am trying to understand what this line is doing:
|> Log "AppBuild-Output: "
I understand that it is creating the log of the respective build. But where is this log file being created?
This is the code:
Target "BuildApp" (fun _ ->
!! "src/app/**/*.csproj"
|> MSBuildRelease buildDir "Build"
|> Log "AppBuild-Output: "
)
When I delete the last line, i.e. if I write:
Target "BuildApp" (fun _ ->
!! "src/app/**/*.csproj"
|> MSBuildRelease buildDir "Build"
)
I am getting an error, type 'unit' doesn't match the type 'string list'.
I have been given a task to transform this grammar to LL(1)
E → E+E | E-E | E*E | E/E | E^E | -E | (E)| id | num
As a first step I eliminated the ambiguity and got this:
E → E+T | E-T | T
T → T*P | T/P | P
P → P ^ F | F
F → id | num | -E | (E)
And after that I have eliminated left-recursion and got this:
E → TE'
E' → +TE' | -TE' | ɛ
T → PT'
T' → *PT' | /PT' | ɛ
P → FP'
P' → ^FP' | ɛ
F → id | num | -E | (E)
When I put this into JFLAP and click 'Build LL(1) parse table', I get a warning that the grammar above is not LL(1).
Why is this grammar not LL(1), and what do I need to change to make it LL(1)?
Your grammar is still ambiguous, so it can't be LL(1).
This production F → -E makes it possible to mix an expression with lower precedence operators in a level (unary operator) where they shouldn't appear.
Note that id + - id + id has two derivation trees.
You shouldn't use E there, but a symbol that represents an atomic value. You could replace
F → id | num | -E | (E)
with
F → -A | A
A → id | num | (E)
Or F → -F | A if you want to allow multiple unary minuses.
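To check that the corrected grammar can be parsed with a single token of lookahead, here is a small recursive-descent sketch in Python. It assumes the F → -F | A variant, one method per nonterminal, and a hypothetical tokenizer and tuple-based tree format. The primed nonterminals (E', T', P') are written as loops that fold to the left, matching the original left-recursive rules.

```python
import re

END = "<end>"  # end-of-input marker; the tokenizer can never produce it

def tokenize(text):
    """Split text into numbers, identifiers and single-character symbols."""
    return re.findall(r"\d+|[A-Za-z_]\w*|\S", text) + [END]

class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def take(self, expected=None):
        tok = self.peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def binary(self, operand, operators):
        # X -> operand X' ; X' -> op operand X' | epsilon
        node = operand()
        while self.peek() in operators:  # one token decides the production
            op = self.take()
            node = (op, node, operand())
        return node

    def E(self):
        return self.binary(self.T, {"+", "-"})

    def T(self):
        return self.binary(self.P, {"*", "/"})

    def P(self):
        return self.binary(self.F, {"^"})

    def F(self):
        if self.peek() == "-":  # F -> - F
            self.take()
            return ("neg", self.F())
        return self.A()         # F -> A

    def A(self):
        tok = self.peek()
        if tok == "(":          # A -> ( E )
            self.take("(")
            node = self.E()
            self.take(")")
            return node
        if re.fullmatch(r"\d+|[A-Za-z_]\w*", tok):  # A -> id | num
            return self.take()
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(text)
    tree = parser.E()
    parser.take(END)  # reject trailing input
    return tree

print(parse("id + - id + id"))
# ('+', ('+', 'id', ('neg', 'id')), 'id')
```

The input that had two derivation trees under F → -E now has exactly one, because the unary minus binds only to the following F.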
When I execute my FAKE build script, I run into a problem where the parentheses in my project file names seem to be converted to %28 and %29 respectively. This results in a failed build.
Here is my FAKE script below:
// include Fake lib
#I @"../tools/FAKE/tools/"
#r @"FakeLib.dll"
open Fake
let buildDir = "./.build"
let dotNet40ProjectsIncludeStr = "**/*(NET40).*proj"
let dotNet40Projects =
!! dotNet40ProjectsIncludeStr
let dotNet45Projects =
!! "**/*.*proj"
-- dotNet40ProjectsIncludeStr
// Default target
Target "Default" (fun _ ->
trace "Executed Default target"
)
Target "Build NET45" (fun _ ->
dotNet45Projects |> Seq.iter (log << sprintf "%s%s" "Net45Projects:")
MSBuildRelease (buildDir @@ "net45") "Build" dotNet45Projects
|> Log "BuildNet45: "
)
Target "Build NET40" (fun _ ->
//let projects = dotNet40Projects |> Seq.map (sprintf "%s")
//projects |> Seq.iter (log << sprintf "%s%s" "Net40Projects:")
dotNet40Projects |> Seq.iter (log << sprintf "%s%s" "Net40Projects:")
MSBuildRelease (buildDir @@ "net40") "Build" dotNet40Projects
|> Log "BuildNet40: "
)
"Build NET40"
==> "Build NET45"
==> "Default"
// start build
RunTargetOrDefault "Default"
The output I get contains the following:
Starting Target: Build NET40
Net40Projects:C:\dev\work\My-Company\dev\MyProject\Data\MyProject\MyProject.Data(NET40).csproj
Net40Projects:C:\dev\work\My-Company\dev\MyProject\Data\MyProject.Data.Tests\MyProject.Data.Tests(NET40).csproj
Running build failed.
Error:
System.IO.FileNotFoundException: Could not find file 'C:\dev\work\MyCompany\dev\MyProject\Data\MyProject.Data\MyProject.Data%28NET40%29.csproj'.
I consider this a bug. Please open an issue in FAKE's GitHub issue tracker.
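For context, %28 and %29 are the URL percent-encodings of "(" and ")". The Python snippet below only illustrates that encoding; it makes no claim about where FAKE or MSBuild performs it. The file name is taken from the error message above.

```python
from urllib.parse import quote, unquote

name = "MyProject.Data(NET40).csproj"

encoded = quote(name)    # percent-encode characters outside the unreserved set
print(encoded)           # MyProject.Data%28NET40%29.csproj
print(unquote(encoded))  # MyProject.Data(NET40).csproj
```

A path that has been encoded this way no longer names a file on disk, which matches the FileNotFoundException in the output.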