Error with BiGSCAPE even after installing - biopython

I am not a coding person by any means, but I try my best to work through issues. I installed BiG-SCAPE to look at secondary metabolite clusters. I am running it in a conda environment and it seems to be installed fine, as it reports the version number.
However, I keep getting the error below. I have tested it with the example data as well, and it returns the same result/error.
The version I have installed is:
BiG-SCAPE 1.1.4 (2022-04-14)
(bigscape) Shaheens-MacBook-Pro:BIG-SCAPE shaheenbibi$ python bigscape.py -i /Downloads/gbks -o ResultsAndres
/Users/shaheenbibi/miniconda3/envs/bigscape/lib/python3.6/site-packages/Bio/SubsMat/__init__.py:131: BiopythonDeprecationWarning: Bio.SubsMat has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.substitution_matrices as a replacement, and contact the Biopython developers if you still need the Bio.SubsMat module.
BiopythonDeprecationWarning,
Processing input files - -
Output folder already exists
Logs folder already exists
Cache folder already exists
BGC fastas folder already exists
Domtable folder already exists
Domains folder already exists
pfs folder already exists
pfd folder already exists
Including files with one or more of the following strings in their filename: 'cluster', 'region'
Skipping files with one or more of the following strings in their filename: 'final'
Importing GenBank files
Starting with 0 files
Files that had its sequence extracted: 0
Creating output directories
SVG folder already exists
Networks folder already exists
Trying threading on 4 cores
Predicting domains using hmmscan
All fasta files had already been processed
Finished generating domtable files.
Parsing hmmscan domtable files
All domtable files had already been processed
Finished generating pfs and pfd files.
Processing domains sequence files
Adding sequences to corresponding domains file
Reading the ordered list of domains from the pfs files
Creating arrower-like figures for each BGC
Parsing hmm file for domain information
Done
All SVG from the input files seem to be in the SVG folder
Finished creating figures
Calculating distance matrix - -
Performing multiple alignment of domain sequences
No domain fasta files found to align
Trying to read domain alignments (*.algn files)
No aligned sequences found in the domain folder (run without the --skip_ma parameter or point to the correct output folder)

Starting with 0 files indicates that something is wrong with your input directory: BiG-SCAPE did not find any usable .gbk files. You may need to put some .gbk files in /Downloads/gbks; also note that /Downloads/gbks is an absolute path from the filesystem root, so you probably meant ~/Downloads/gbks (i.e. /Users/shaheenbibi/Downloads/gbks).
BiG-SCAPE also puts a number of constraints on the names of the .gbk files: https://git.wageningenur.nl/medema-group/BiG-SCAPE/-/wikis/input. As your log shows, only files whose names contain 'cluster' or 'region' (and not 'final') are picked up, so your input .gbk files may need to be renamed.
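As a quick sanity check, something like this small script could help (just a sketch; input_dir is my assumption of where your files live, so point it at whatever folder you pass with -i):
import os

input_dir = os.path.expanduser("~/Downloads/gbks")  # assumption: adjust to the folder you pass with -i

for name in sorted(os.listdir(input_dir)):
    if not name.endswith(".gbk"):
        continue
    included = any(s in name for s in ("cluster", "region")) and "final" not in name
    print(name, "-> included" if included else "-> skipped by BiG-SCAPE")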

Related

Nasdaq ITCH order book build -> EOF error when expanding '.gz' file

I'm taking my first steps in ML using Jupyter's IPython, and I was advised to start with Nasdaq's ITCH order book dataset to create models. I'm following the steps in this tutorial on GitHub.
I can't seem to unzip/expand files from the ITCH dataset. When executing the function may_be_download(url) and the following code (code cell nr. 5 in the tutorial):
file_name = may_be_download(urljoin(FTP_URL, SOURCE_FILE))
date = file_name.name.split('.')[0]
I get the following error: EOFError: Compressed file ended before the end-of-stream marker was reached
Nor am I able to simply unzip the file by clicking on it in Finder or by using gzip/gunzip in Terminal.
I took the following steps:
Executed all previous code cells (1-4)
Copied the file 03272019.NASDAQ_ITCH50.gz into a folder named data at the relative path
First I clicked on the sample link in the notebook
Then logged in as a guest and navigated to the folder Nasdaq ITCH
Then located the file 03272019.NASDAQ_ITCH50.gz and copied it to a local folder.
Executed code cell nr. 5 listed above.
I've searched for and tried numerous solutions to similar issues listed here on Stack Overflow and GitHub, but none seem to solve this particular problem. I would deeply appreciate any help and thoughts on what may be occurring and how I might go about solving it.
I'll leave you with a picture of the error logs, in case it is of some help.
Thanks for reading.
I downloaded that file and one other from that site. They both appear to be corrupted, both failing with incomplete deflate data.
What's more, there are MD5 signatures for the files there, and the downloaded files have MD5 signatures that do not match.
This is not being caused by the FTP server doing end-of-line conversions, because the lengths of the files in bytes exactly match the lengths on the server. Also, a histogram of the byte values shows no bias.
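If you want to confirm this on your copy, a minimal sketch along these lines should do it (both the local path and the expected digest are placeholders; take the expected value from the MD5 listing on the server):
import hashlib

path = "data/03272019.NASDAQ_ITCH50.gz"        # placeholder: your local copy
expected = "<md5 value listed on the server>"  # placeholder

md5 = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MB chunks
        md5.update(chunk)

print("local :", md5.hexdigest())
print("server:", expected)
print("match :", md5.hexdigest() == expected.lower())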

how to find and deploy the correct files with Bazel's pkg_tar() in Windows?

please take a look at the bin-win target in my repository here:
https://github.com/thinlizzy/bazelexample/blob/master/demo/BUILD#L28
It seems to be properly packing the executable inside a file named bin-win.tar.gz, but I still have some questions:
1. On my machine, the file is generated in this directory:
C:\Users\John\AppData\Local\Temp\_bazel_John\aS4O8v3V\execroot\__main__\bazel-out\x64_windows-fastbuild\bin\demo
which makes finding the tar.gz file a cumbersome task.
The question is: how can I make my bin-win target move the file from there to a "better location"? (perhaps defined by an environment variable or a command-line parameter/flag)
2. How can I include more files with my executable? My actual use case is that I want to ship data files and some DLLs together with the executable. Should I use a filegroup() rule and refer to its name in the "srcs" attribute as well?
2a. For the DLLs, is there a way to make a filegroup() rule interpret environment variables (e.g. the directories of the DLLs)?
Thanks!
Look for the bazel-bin and bazel-genfiles directories in your workspace. These are actually junctions (directory symlinks) that Bazel updates after every build. If you bazel build //:demo, you can access its output as bazel-bin\demo.
(a) You can also set TMP and TEMP in your environment to point to e.g. c:\tmp. Bazel will pick those up instead of C:\Users\John\AppData\Local\Temp, so the full path for the output directory (that bazel-bin points to) will be c:\tmp\aS4O8v3V\execroot\__main__\bazel-out\x64_windows-fastbuild\bin.
(b) Or you can pass the --output_user_root startup flag, e.g. bazel --output_user_root=c:\tmp build //:demo. That will have the same effect as (a).
There's currently no way to get rid of the _bazel_John\aS4O8v3V\execroot part of the path.
Yes, I think you need to put those files in pkg_tar.srcs. Whether you use a filegroup() rule is irrelevant; filegroup just lets you group files together so you can refer to the group by name, which is useful when you need to refer to the same files in multiple rules. A rough BUILD sketch follows after the next point.
2.a. I don't think so.
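For question 2, a rough sketch of what the BUILD file could look like (just a sketch, not taken from your repository; the load line assumes the rules_pkg version of pkg_tar and may differ in your setup, and the target and file names are made up):
load("@rules_pkg//pkg:tar.bzl", "pkg_tar")

filegroup(
    name = "runtime_files",   # data files and DLLs shipped next to the binary
    srcs = glob(["data/**"]) + ["third_party/some.dll"],
)

pkg_tar(
    name = "bin-win",
    srcs = [
        ":demo",              # the executable target
        ":runtime_files",
    ],
    extension = "tar.gz",
)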

Lua - My documents path and file creation date

I'm planning to write a program with Lua that will first of all read specific files
and get information from those files. So my first question is: what is the "My Documents" path name? I have searched in a lot of places, but I'm unable to find anything. My second question is: how can I use the first four letters of a file name to see which one is the newest?
In short: find the files in "My Documents", then find the most recently created file and read it.
The reading part shouldn't be a problem; the difficulty is navigating to "My Documents" and finding the most recently created file in the folder.
For your first question, it depends how robust you want your script to be. You could use Lua's built-in os.getenv() to get a variety of environment variables related to the user, such as USERNAME, USERPROFILE, HOMEDRIVE, and HOMEPATH. Example:
username = os.getenv('USERNAME')
dir = 'C:\\users\\' .. username .. '\\Documents'
For the second question, there is no built-in mechanism in Windows to put the file creation or modification timestamp into the filename. You could read the creation or modification timestamp via a C extension you write, or by using an existing Lua library like LuaFileSystem (lfs). Or you could read the contents of a folder and parse the filenames, if they are named according to the pattern you mention. Again, there is nothing built into Lua to do this; you would use os.execute(), lfs, your own C extension module, or a combination of these.
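For example, with lfs installed, an untested sketch along these lines would find the most recently modified file in that folder (swap 'modification' for 'creation' if you specifically want creation time on Windows):
-- sketch: find the newest file in the Documents folder using LuaFileSystem (lfs)
local lfs = require('lfs')

local dir = 'C:\\Users\\' .. os.getenv('USERNAME') .. '\\Documents'
local newest_name, newest_time

for name in lfs.dir(dir) do
    local path = dir .. '\\' .. name
    if lfs.attributes(path, 'mode') == 'file' then
        local t = lfs.attributes(path, 'modification')
        if newest_time == nil or t > newest_time then
            newest_name, newest_time = name, t
        end
    end
end

print('newest file:', newest_name)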

How to compress multiple folders into one archive?

I have some compression components (like KAZip, JVCL, zLib) and know exactly how to use them to compress files, but I want to compress multiple folders into one single archive and keep the folder structure after extraction. How can I do it?
In all those components I can only give a list of files to compress; I cannot give the structure of folders to extract. There is no way (or I couldn't find one) to tell where every file must be extracted to:
I have a file named myText.txt in folder FOLDER_A and a file with the same name myText.txt in folder FOLDER_B:
|
|__________ FOLDER_A
| |________ myText.txt
|
|__________ FOLDER_B
| |________ myText.txt
|
I can give a list of files to compress: myList(myText.txt, myText.txt), but I can't give the structure for uncompressing the files. What is the best way to know which file belongs to which folder?
The zip format just does not have folders. Well, it kind of does, but they are empty placeholders, only inserted if you need metadata storage like user access rights. Other than those rather rare advanced things, there is no need for folders at all. What is really done - and what you can observe by opening a zip file in Notepad and scrolling to the end - is that each file has its path stored with it, starting from the "archive root". In your example the zip file should have two entries (two files):
FOLDER_A/myText.txt
FOLDER_B/myText.txt
Note that the separators used are forward slashes, as in the UNIX world, not the backslashes used in the DOS/Windows world. Some libraries will fix backslashes for you, some will not - just run your own tests.
Now, let's assume that tree is contained in D:\TEMP\Project - just as an example.
D:\TEMP\Project\FOLDER_A\myText.txt
D:\TEMP\Project\FOLDER_B\myText.txt
There are two more questions (other than path separators): are there more folders within D:\TEMP\Project\ that should be ignored rather than zipped (like maybe D:\TEMP\Project\FOLDER_C\*.*)? And does your zip library have a direct API to pack a folder with all its internal subfolders and files, or do you have to do it file by file?
You should ask yourself those three questions and check them while choosing the library; the code drafts would be somewhat different depending on the answers.
Now let's start drafting for the libraries themselves:
The default variant is just using Delphi itself.
Enumerate the files in the folder: http://docwiki.embarcadero.com/CodeExamples/XE3/en/DirectoriesAndFilesEnumeraion_(Delphi)
If that enumeration results in absolute paths, then strip the common D:\TEMP\Project from the beginning: something like If AnsiStartsText('D:\TEMP\Project\', filename) then Delete(filename, 1, Length('D:\TEMP\Project\'));. You should get paths relative to the chosen containing folder, especially if you do not compress the whole tree and leave some FOLDER_C out of the archive.
Maybe you should also call StringReplace to change '\' into '/' in the filenames.
Then you can zip them using System.Zip.TZipFile.Add (http://docwiki.embarcadero.com/Libraries/XE2/en/System.Zip.TZipFile.Add) - take care to specify the correct relative ArchiveFileName, like the aforementioned FOLDER_A/myText.txt.
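Putting those steps together, a rough, untested sketch with System.Zip could look like this (the paths are just the example ones from above):
// sketch: zip D:\TEMP\Project while keeping the folder structure (System.Zip, XE2+)
uses
  System.SysUtils, System.IOUtils, System.Zip;

procedure ZipProject;
var
  Zip: TZipFile;
  Root, FileName, ArchiveName: string;
begin
  Root := 'D:\TEMP\Project\';
  Zip := TZipFile.Create;
  try
    Zip.Open('D:\TEMP\Project.zip', zmWrite);
    for FileName in TDirectory.GetFiles(Root, '*', TSearchOption.soAllDirectories) do
    begin
      // strip the common prefix and switch to forward slashes
      ArchiveName := StringReplace(Copy(FileName, Length(Root) + 1, MaxInt),
        '\', '/', [rfReplaceAll]);
      Zip.Add(FileName, ArchiveName);  // e.g. FOLDER_A/myText.txt
    end;
  finally
    Zip.Free;
  end;
end;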
You can use the ZipMaster library. It is very VCL-bound and may cause trouble with threads or DLLs, but for simple applications it just works: http://www.delphizip.org/
The latest version page has links to a "setup" package with sources, help, and demos. Among the demos there is a full-featured archive browser capable of storing folders, so you can read the code directly from it: http://www.delphizip.org/191/v191.html
You talked about JVCL, which means you already have the Jedi Code Library installed. JCL comes with a class and function that, judging by the name, can directly do what you want: function TJclSevenzipCompressArchive.AddDirectory(const PackedName: WideString; const DirName: string = ''; RecurseIntoDir: Boolean = False; AddFilesInDir: Boolean = False): Integer;
Actually, all those libraries are rather similar at a basic level. When I made an XLSX export I just made a uniform zipping API that is used the same way no matter which actual zipping engine is installed. It works with in-memory TStream objects rather than on-disk files, so it would not help you directly, but I learned that, apart from a few quirks (like instant vs. postponed zipping), at ground level all those libs work the same.

Jenkins archive artifact excluding all subdirectories

I have a couple of jobs in Jenkins that archive artifacts from the source tree for another job (some unit tests or the like). I have the following situation:
top_dir
\scripts_dir
\some_files
\dir1
\dir2
\dir3
\other_dir
I would like to archive everything in "top_dir", including the files in "scripts_dir", but not the subdirectories "dir1, dir2, ..." inside "scripts_dir", whose names I do not know. These subdirs are actually Windows directory junctions that point to other places on the disk, and I do not want them to be copied.
How do I achieve this with the include/exclude patterns of Jenkins?
I already tried include=top_dir/ with exclude set to:
**/scripts_dir/*/
**/scripts_dir/*/**
**/scripts_dir/**/*
but it always excludes the whole "scripts_dir" folder.
Finally, by brute force, I found that the following expression does exclude all the files in the subdirectories of scripts_dir (whether symlinks or not), thereby removing those subdirs, while keeping the files directly in scripts_dir:
**/scripts_dir/**/*/*/
Thanks for the help anyway.
Reading the Ant manual, there is a followsymlinks attribute that defaults to true. You said the things you want to exclude are symlinks (although I am not sure whether this will work with Windows junctions). Try adding followsymlinks=false.
Another solution: if all your files under scripts_dir have a set number of characters in the extension, you can put that into your include statement. This will only pick up files with extensions of three characters:
**/scripts_dir/*.???
More on this here
