Compare 2 files line by line - comparison

Assume that I have two files like these:
File 1:
Verrucomicrobiaceae
Porphyromonadaceae
Clostridium
Verrucomicrobiaceae
Clostridium
Bacteroidaceae
Clostridium
Verrucomicrobiaceae
Verrucomicrobiaceae
Verrucomicrobiaceae
Verrucomicrobiaceae
Clostridium
File 2:
Verrucomicrobiaceae
Porphyromonadaceae
Verrucomicrobiaceae
Porphyromonadaceae
Verrucomicrobiaceae
Verrucomicrobiaceae
Verrucomicrobiaceae
Verrucomicrobiaceae
I would like to count the following:
Number of lines that are identical in file 1 and file 2
Number of lines that differ between file 1 and file 2
Number of lines where file 1 has a string while the same line in file 2 is blank
Number of lines where file 2 has a string while the same line in file 1 is blank
I tried comm, cmp and diff, but none of them could do this.
Is there a Linux command that can do this?

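This is specialized enough that it probably isn't easy to do with standard tools. I'd write a small program to do the comparison, for example in Perl:
#!/usr/bin/perl
use strict;
use warnings;
open(my $in1, '<', 'file1') or die "file1: $!";
open(my $in2, '<', 'file2') or die "file2: $!";
my ($count1, $count2, $count3, $count4) = (0, 0, 0, 0);
while (1) {
    my $line1 = <$in1>;
    my $line2 = <$in2>;
    last unless defined $line1 or defined $line2;
    # A line missing from the shorter file is treated as blank.
    $line1 = '' unless defined $line1;
    $line2 = '' unless defined $line2;
    # Strip the newline so a blank line really equals "".
    chomp($line1, $line2);
    $count1++ if $line1 eq $line2;
    $count2++ if $line1 ne $line2;
    $count3++ if $line1 ne '' && $line2 eq '';
    $count4++ if $line1 eq '' && $line2 ne '';
}
print "$count1 $count2 $count3 $count4\n";
Only minimal error checking. Lines past the end of the shorter file count as blank, and the second count (different) also includes the two blank cases.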
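The same counting can be sketched in Python, in case that is easier to adapt. This is only a rough sketch: `compare_lines` is a made-up helper name, the two lists simply reuse the sample data from the question, and, like the Perl version, it counts the two "blank" cases as different lines too.

```python
from itertools import zip_longest


def compare_lines(lines1, lines2):
    """Return (identical, different, only_in_1, only_in_2) for paired lines."""
    same = diff = only1 = only2 = 0
    # zip_longest pads the shorter input, so a missing line counts as blank.
    for a, b in zip_longest(lines1, lines2, fillvalue=""):
        a, b = a.rstrip("\n"), b.rstrip("\n")
        if a == b:
            same += 1
        else:
            diff += 1
            if b == "":
                only1 += 1
            elif a == "":
                only2 += 1
    return same, diff, only1, only2


# Sample data copied from the question (File 1 and File 2).
file1 = [
    "Verrucomicrobiaceae", "Porphyromonadaceae", "Clostridium",
    "Verrucomicrobiaceae", "Clostridium", "Bacteroidaceae",
    "Clostridium", "Verrucomicrobiaceae", "Verrucomicrobiaceae",
    "Verrucomicrobiaceae", "Verrucomicrobiaceae", "Clostridium",
]
file2 = [
    "Verrucomicrobiaceae", "Porphyromonadaceae", "Verrucomicrobiaceae",
    "Porphyromonadaceae", "Verrucomicrobiaceae", "Verrucomicrobiaceae",
    "Verrucomicrobiaceae", "Verrucomicrobiaceae",
]

print(compare_lines(file1, file2))
```

To run it on real files, pass two open file objects instead of the lists, e.g. `compare_lines(open("file1"), open("file2"))`.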

Related

Removing the file paths and using the file number to perform some calculations while plotting

I am trying to read .txt files from a directory. They are named as follows:
x-23.txt
x-43.txt
x-83.txt
:
:
x-243.txt
I am listing these files using filename = system("ls ../Data/*.txt"). The goal is to load these files and plot certain columns. At the same time, I am trying to parse the file names down to the numbers shown below, so that I can use them as titles in the plot and add them to or subtract them from a certain column:
23
43
83
:
:
243
For that, I tried the following:
dirname = '../Data/'
str = system('echo "'.dirname. '" | perl -pe ''s/x[\d-](\d+).txt/\1.\2/'' ')
cv = word(str, 1)
The above lines don't seem to strip the file names down to the numbers. The complete code:
filelist1 = system("ls ../Data/*.txt")
print filelist1
dirname = '../Data/'
str = system('echo "'.dirname. '" | perl -pe ''s/x[\d-](\d+).txt/\1.\2/'' ')
cv = word(str, 1)
plot for [filename1 in filelist1] filename1 using (-cv/1000+ Tx($4)):(X($3)) with points pt 7 lc 6 title system('basename '.filename1),\
I am trying to use the file number "cv", parsed from each .txt file name, to subtract it from column Tx($4) while plotting.
directory = "../temp/"
filelist = system("cd ../temp/ ; ls *.txt")
files = words(filelist)
filename(i) = directory . word(filelist,i)
title(i) = word(filelist,i)[3 : strstrt(word(filelist,i),'.')-1]
plot for [i=1:files] filename(i) using ... title title(i)
Test case (edited to show pulling files from another directory):
gnuplot> print filelist
x-234.txt
x-23.txt
x-2.txt
x-34.txt
gnuplot> do for [i=1:files] { print i, ": ", filename(i) }
1: ../temp/x-234.txt
2: ../temp/x-23.txt
3: ../temp/x-2.txt
4: ../temp/x-34.txt
gnuplot> plot for [i=1:files] x*i title title(i)
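To check the slicing in title(i) outside gnuplot, here is a rough Python equivalent. It is only an illustration of the substring logic, not part of the gnuplot solution: `file_number` is a made-up name, and it assumes every file name starts with the two-character prefix "x-" and has a single extension. gnuplot's [3 : strstrt(...)-1] is 1-based and inclusive, which corresponds to Python's [2:index_of_dot].

```python
import os


def file_number(path):
    """Return the text between the 2-character 'x-' prefix and the first '.'."""
    name = os.path.basename(path)
    return name[2:name.index(".")]


# File names taken from the test case above.
names = ["x-234.txt", "x-23.txt", "x-2.txt", "x-34.txt"]
numbers = [int(file_number(n)) for n in names]
print(numbers)
```

Converting to int is what makes the value usable in arithmetic, which is the role "cv" was meant to play in the plot command.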

Reading File Path to a Variable

I want to use a variable that contains a file path, passed to a function, but I can't get it working.
For example, I have a path like "/samba-test/log_gen/log_gen/log_generator", and when I use this path through a variable it doesn't work as expected. Please see the explanation in the code; my comments are tagged with the string "VENK". Any help would be appreciated.
/* caller */
config_path = "/samba-test/log_gen/log_gen/log_generator"
ReadWrite_Config(config_path)
/*definition*/
def ReadWrite_Timeline(lp_readpath, lp_filterlist):
current_parent_path = lp_readpath
current_search_list = lp_filterlist
print(current_parent_path) >>>>>> VENK - PATH prints fine here as expected <<<<<<<<.
strings_1 = ("2e88422c-4b61-41d7-9cf9-4650edaa4e56", "2017-11-27 16:1")
for index in range(0,3):
print (current_search_list[index])
files=None
filext=[".txt",".log"]
#outputfile = open(wrsReportFileName, "a")
for ext in filext:
print("Current_Parent_Path",current_parent_path ) <<<<<<VENK - Prints as Expected ""
#VENK - The above line prints as ('Current_Parent_Path', '/samba-test/log_gen/log_gen/log_generator') which is expected
#The actual files are inside the 'varlog' where the 'varlog' folder is inside '/samba-test/log_gen/log_gen/log_generator'
#possible problematic line below.
varlogpath = "(current_parent_path/varlog)/*"+ext >>>>>>>>>>> VENK- Unable to find the files if given in this format <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
print("varlogpath",varlogpath) >>>>>>>>>>>> VENK- varlogpath doesn't print as expected <<<<<<<<<<<<<<<<<<<<
#VENK - The above line prints as ('varlogpath', 'current_parent_path/varlog/*.txt') which I feel is problematic.
#VENK - If I give the absolute path as below it works fine
#varlogpath = "/samba-test/log_gen/log_gen/log_generator/varlog/*"+ext
files = glob.glob(varlogpath)
for file in files:
fname_varlog = open(file, 'r')
outputfile.write("\n")
outputfile.write(file)
outputfile.write("\n")
for line in fname_varlog:
#if any(s in line for s in strings):
"""
#s1 searches the mandatory arguments
#s2 searches the optional arguments
"""
if all(s1 in line for s1 in strings_1):
#if all(s1 in line for s1 in strings_1) or all(s2 in line for s2 in strings_2):
#print (file, end="")
outputfile.write(line)
fname_varlog.close()
outputfile.write("\n")
outputfile.write("10.[##### Summary of the Issue #####] -- Enter Data Manually \n")
outputfile.write("\n")
outputfile.close()
#print (ext)
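Joining the pattern onto the variable 'current_parent_path' with os.path.join resolved the problem (as below). In the original line the variable name sat inside the string literal, so it was never replaced by its value.
varlogpath = os.path.join(current_parent_path, "*"+ext)
If the files live in the 'varlog' subfolder as described in the question, pass it as an extra component: os.path.join(current_parent_path, "varlog", "*"+ext).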
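A minimal, self-contained sketch of that pattern follows. It assumes the 'varlog' subfolder layout described in the question; `find_logs` and the demo file names are hypothetical, and the demo runs in a temporary directory rather than the real /samba-test path.

```python
import glob
import os
import tempfile


def find_logs(parent_path, extensions=(".txt", ".log")):
    """Return sorted files under parent_path/varlog matching the extensions."""
    matches = []
    for ext in extensions:
        # The variable is passed to os.path.join, not written inside a string.
        pattern = os.path.join(parent_path, "varlog", "*" + ext)
        matches.extend(glob.glob(pattern))
    return sorted(matches)


# Demo in a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as parent:
    os.mkdir(os.path.join(parent, "varlog"))
    for name in ("a.txt", "b.log", "c.dat"):
        open(os.path.join(parent, "varlog", name), "w").close()
    found = [os.path.basename(p) for p in find_logs(parent)]

print(found)
```

Printing the pattern before calling glob.glob is a quick way to confirm that the variable's value, not its name, ended up in the path.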

ProgressBar Overlay Or ProgressBar from Copy Functions

Function GetTotalBytesOfCopyDestination{
    param($destinationPath);
    $colItems = (Get-ChildItem $destinationPath | Measure-Object -property length -sum)
    return $colItems.sum;
}
Function GetBytesOfFile{
    param($sourcePath);
    return (Get-Item $sourcePath).length;
}
Function GetPosition{
    param([double]$currentOfBytesSended)
    param([double]$countsOfBytesWillSend)
    $position = ($currentOfBytesSended / $countsOfBytesWillSend) * 100;
    #range 0 - 100
    #(15800 bytes / 1975633689 bytes)*100
    #
    return $position;
}
Function Copy-File {
    #.Synopsis
    # Copies all files and folders in $source folder to $destination folder, but with .copy inserted before the extension if the file already exists
    param($source,$Destination2)
    # create destination if it's not there ...
    mkdir $Destination2 -force -erroraction SilentlyContinue
    [double]$currentOfBytesSended = 0;
    [double]$countsOfBytesWillSend = 0;
    [double]$countsOfBytesWillSend = GetTotalBytesOfCopyDestination($source);
    $progressbar6.Maximum = 100;
    $progressbar6.Step = 1;
    foreach($original in ls $source -recurse) {
        $result = $original.FullName.Replace($source,$Destination2)
        while(test-path $result -type leaf){ $result = [IO.Path]::ChangeExtension($result,"copy$([IO.Path]::GetExtension($result))") }
        [System.Windows.Forms.Application]::DoEvents()
        if($original.PSIsContainer) {
            mkdir $result -ErrorAction SilentlyContinue
        } else {
            copy $original.FullName -destination $result
            [System.Windows.Forms.Application]::DoEvents()
            $currentCopyingFileSizeInBytes = 0;
            $currentCopyingFileSizeInBytes = GetBytesOfFile($original.FullName);
            $currentOfBytesSended = [double]$currentOfBytesSended + [double]$currentCopyingFileSizeInBytes;
            #$currentOfBytesSended += $currentCopyingFileSizeInBytes;
            $progressbar6.Value=GetPosition([double]$currentOfBytesSended, [double]$countsOfBytesWillSend);
            [System.Windows.Forms.Application]::DoEvents()
            #$progressbar6.PerformStep();
            $progressbar6.Refresh();
        }
    }
}
What I'm trying to achieve: the Copy-File function should copy files and directories from a remote machine to the local machine while advancing a progress bar. The bar position is computed from the total amount to copy and the amount already copied. Instead, I get this error:
ERROR: GetPosition : Cannot process argument transformation on parameter 'currentOfBytesSended'. Cannot convert the "System.Object[]" value of type "System.Object[]"
ERROR: to type "System.Double".
Assit App.pff (337): ERROR: At Line: 337 char: 35
ERROR: + $progressbar6.Value=GetPosition([double]$currentOfBytesSended, [double]$coun ...
ERROR: + ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
ERROR: + CategoryInfo : InvalidData: (:) [GetPosition], ParameterBindingArgumentTransformationException
ERROR: + FullyQualifiedErrorId : ParameterArgumentTransformationError,GetPosition
ERROR:
>> Script Ended
First of all, welcome to StackOverflow. Try to break up your questions into individual questions, rather than grouping many of them together. The format of this site is better suited to one question, one answer, unless maybe there is guidance that states otherwise.
Excluding Folders
You can ignore folders by using the Exclude parameter on the Get-ChildItem command.
Imagine that you have a folder structure similar to the following:
c:\gci
|
|\a
| \a.txt
|\b
| \b.txt
|\c
| \c.txt
If you only want to get the contents of c:\gci\a and c:\gci\c, you can exclude c:\gci\b using the command:
Get-ChildItem -Path c:\gci -Exclude b* -Recurse;
Keep in mind that this will also exclude other items starting with "b" such as c:\gci\bcd.txt.
Progress Bar
You can create a progress bar using the Write-Progress command. You will have to write your own, custom logic to determine what tasks to report progress on, and what the percentage of progress is. You can do this based on the number of bytes copied vs. the total number of bytes to be copied, the number of files copied vs. the total number of files, or some other sort of metric.
There's no simple answer to this one. You will have to perform the calculations yourself, and call Write-Progress at the appropriate times.
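The byte-based calculation described above can be sketched as follows. It is written in Python rather than PowerShell, purely to illustrate the arithmetic; `copy_with_progress` and the `report` callback are hypothetical names, and the sketch only handles the files directly inside one folder. In PowerShell, the percentage would be passed to Write-Progress instead.

```python
import os
import shutil
import tempfile


def copy_with_progress(src_dir, dst_dir, report):
    """Copy the files directly inside src_dir, calling report(percent) after each."""
    files = [f for f in os.listdir(src_dir)
             if os.path.isfile(os.path.join(src_dir, f))]
    # Total is computed once up front so each percentage has a fixed denominator.
    total = sum(os.path.getsize(os.path.join(src_dir, f)) for f in files)
    os.makedirs(dst_dir, exist_ok=True)
    copied = 0
    for name in sorted(files):
        src = os.path.join(src_dir, name)
        shutil.copy2(src, os.path.join(dst_dir, name))
        copied += os.path.getsize(src)
        # Guard against division by zero when every file is empty.
        percent = 100.0 * copied / total if total else 100.0
        report(percent)


# Demo with two hypothetical files of 100 and 300 bytes.
progress = []
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    with open(os.path.join(src, "a.bin"), "wb") as f:
        f.write(b"\0" * 100)
    with open(os.path.join(src, "b.bin"), "wb") as f:
        f.write(b"\0" * 300)
    copy_with_progress(src, dst, progress.append)

print(progress)
```

Reporting by bytes rather than by file count keeps the bar from jumping when file sizes vary a lot, at the cost of one extra pass to sum the sizes.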

Why will Powershell write to the console, but not to a file?

I have a PowerShell script that sets flags based on various conditions of the file. I'll abbreviate for brevity.
$path = "c:\path"
$srcfiles = Get-ChildItem $path -filter *.htm*
ForEach ($doc in $srcfiles) {
$s = $doc.Fullname
Write-Host "Processing :" $doc.FullName
if (stuff) {flag 1 = 1} else {flag 1 = 0}
if (stuff) {flag 1 = 1} else {flag 1 = 0}
if (stuff) {flag 1 = 1} else {flag 1 = 0}
$t = "$s;$flag1;$flag2;$flag2"
Write-Host "Output: " $t
This all works great. My file processes, the flags are set properly, and a neat semicolon delimited line is generated as $t. However, if I slap these two lines at the end of the function,
$stream = [System.IO.StreamWriter] "flags.txt"
$stream.WriteLine $t
I get this error.
Unexpected token 't' in expression or statement.
At C:\CGC003a.ps1:53 char:25
+ $stream.WriteLine $t <<<<
+ CategoryInfo : ParserError: (t:String) [], ParseException
+ FullyQualifiedErrorId : UnexpectedToken
If I'm reading this right, it appears that Write-Host flushed my variable $t before it got to the WriteLine. Before I try Out-File, though, I want to understand what's happening to $t after Write-Host that prevents WriteLine from seeing it as valid. Or is there something else I'm missing?
Try:
$stream.WriteLine($t)
WriteLine is a method of the .NET StreamWriter object. To pass a value to a method, you need to enclose the argument in parentheses.
If you need to append to the file, create the StreamWriter like this:
$a = new-object 'System.IO.StreamWriter' -ArgumentList "c:\path\to\flags.txt",$true
The boolean argument can be true to append data to the file or false to overwrite it.
I suggest passing the full path:
$stream = [System.IO.StreamWriter] "c:\path\to\flags.txt"
Otherwise the file is created in the .NET current directory (probably c:\windows\system32 if your PowerShell console is running as administrator; to check, type [System.IO.Directory]::GetCurrentDirectory()).
You could also try:
$t > c:\path\to\flags.txt

Build CSV from parse files python

I am building a small database (for personal use) from over 1000 files. I am looking for a specific word in each file; the issue is that if the word is not in the file, I don't know how to write a NoData entry. What I would like to have is:
Africa Botswana test 51.1922546 -113.9366341
Africa Kenya Skydive Kenya -13.788388 33.78498
Africa Malawi Skydive Malawi NoData NoData
Africa Mauritius SkyDive Austral 30.5000854 -8.824510574
Africa Morocco Beni Mellal NoData NoData
for i in os.listdir(Main_Path):
    if "-" in i:
        for filename in os.listdir(Main_Path+i):
            if ".dat" in filename and os.path.isdir(Main_Path+i):
                f_split = filename.split("-")
                if len(f_split) == 4:
                    continent.append(f_split[0])
                    country.append(f_split[1])
                    state.append(f_split[2].split(".")[0])
                else:
                    continent.append(f_split[0])
                    country.append("")
                    state.append(f_split[1].split(".")[0])
                d = open(Main_Path+i+"/" + filename, "r")
                files = d.readlines()
                d.close()
                for k, line in enumerate(files):
                    if "Dropzone.com :" in line:
                        dzname.append(line.split(":")[1].strip())
                    elif 'id="lat"' in line:
                        lat.append(line.split("=")[3].split('"')[1].strip())
myFile = open(Main_Path+"MYFILE.csv", "wb")
wtr= csv.writer( myFile )
for a,b,c,d,e in zip(continent,country,state,dzname,lat):
    wtr.writerow([a,b,c,d,e])
myFile.close()
I am stuck at "elif 'id="lat"' in line:" because it only adds to the list "lat" for files that contain id="lat". I understand why, but I would like the parser to add NoData to the list when a file has no such line.
Sorry, I wrote the question from another computer.
Do you mean something like this?
That is: if no line in files contains id="lat" it will append "No Data" to lat.
snip...
d = open(Main_Path+i+"/" + filename, "r")
files = d.readlines()
d.close()
found_latitude = False
for k, line in enumerate(files):
    if "Dropzone.com :" in line:
        dzname.append(line.split(":")[1].strip())
    elif 'id="lat"' in line:
        found_latitude = True
        lat.append(line.split("=")[3].split('"')[1].strip())
if not found_latitude:
    lat.append("No Data")
snip...
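The reason this matters is that zip() stops at the shortest list, so every file has to contribute exactly one entry to each list or the CSV rows drift out of alignment. A small stand-alone sketch of the same idea is below. The helper name `first_match_or_default` and the sample file contents are invented for illustration, and it returns the whole matching line rather than reproducing the original split() parsing, since the real file format is not shown.

```python
def first_match_or_default(lines, marker, default="NoData"):
    """Return the first line containing marker (stripped), or default if none does."""
    for line in lines:
        if marker in line:
            return line.strip()
    return default


# Hypothetical file contents: the second "file" has no latitude line.
files = {
    "Africa-Kenya-a.dat": [
        "Dropzone.com : Skydive Kenya\n",
        '<input id="lat" value="-13.78">\n',
    ],
    "Africa-Malawi-b.dat": [
        "Dropzone.com : Skydive Malawi\n",
    ],
}

# Exactly one entry per file, so the list stays aligned with the others.
lat = [first_match_or_default(lines, 'id="lat"') for lines in files.values()]
print(lat)
```

The same helper could be applied to the "Dropzone.com :" line, which has the same alignment problem if a file lacks it.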

Resources