File Organization in a tar.gz archive - tar

I have observed that the files inside a folder are usually listed sequentially in a tar.gz archive, but in one exceptional case I found that they are listed in a seemingly random order. For example, say there are three folders a, b and c, and each contains files 1, 2 and 3. In the usual case the archive entries are listed as a/1, a/2, a/3, b/1, b/2, b/3, c/1, c/2, c/3, but in this case they appear as something like b/2, a/1, b/4, ... Why could this happen? I rely on the first ordering assumption to read a .tar.gz archive and process the data inside at a folder level. Is there any way to get the folder listings sorted inline for such cases, without traversing the whole archive each time and building a parent/child mapping? Sample code below:
// for a .tar.gz, wrap the stream in a GzipCompressorInputStream first
TarArchiveInputStream tis = new TarArchiveInputStream(new FileInputStream("a.tar"));
TarArchiveEntry entry;
while ((entry = tis.getNextTarEntry()) != null) {
    System.out.println(entry.getName());
}
I could not find any API that would give me such a sorted list inline. Any pointers would be very helpful; I'm stuck on this case.
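Since a tar stream has no central directory, there is no way to get a folder-sorted listing without reading through the entries once; the usual workaround is a single pass that buckets entry names by folder, after which each folder can be processed in order. Below is a minimal sketch of that idea, written with Python's tarfile purely for brevity and using a placeholder archive name; the same structure maps to a TreeMap<String, List<String>> filled from TarArchiveInputStream in Java.

import tarfile
from collections import defaultdict

# Single pass over the archive, bucketing file names by their top-level folder.
# "a.tar.gz" is a placeholder path.
by_folder = defaultdict(list)
with tarfile.open("a.tar.gz", "r:gz") as archive:
    for member in archive:
        if member.isfile():
            top = member.name.split("/", 1)[0]
            by_folder[top].append(member.name)

# Now each folder can be processed as a unit, regardless of the entry order in the archive.
for folder in sorted(by_folder):
    for name in sorted(by_folder[folder]):
        print(folder, name)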

Nasdaq ITCH order book build -> EOF error when expanding '.gz' file

I'm taking my first steps in ML using Jupyter's IPython, and I was advised to start with Nasdaq's ITCH order book dataset to create models. I'm following the steps in this tutorial on GitHub.
I can't seem to unzip/expand files from the ITCH dataset. When executing the function may_be_download(url) and the following code (code cell nr. 5 in the tutorial):
file_name = may_be_download(urljoin(FTP_URL, SOURCE_FILE))
date = file_name.name.split('.')[0]
I get the following error: EOFError: Compressed file ended before the end-of-stream marker was reached
Nor am I able to simply unzip the file by clicking on it in Finder or by using gzip and gunzip in Terminal.
I took the following steps:
Executed all previous code cells (1-4)
Copied the file 03272019.NASDAQ_ITCH50.gz to a folder named data in the relative path
First I clicked on the sample link in the notebook
Then logged in as a guest and navigated to the folder Nasdaq ITCH
Then located the file 03272019.NASDAQ_ITCH50.gz and copied it to a local folder.
Executed code cell nr.5 listed above.
I've searched for and tried numerous solutions to similar issues listed here on Stack Overflow and GitHub, but none seem to solve this particular problem. I would deeply appreciate any help and thoughts on what may be happening and how I might go about solving it.
I'll leave you with a picture of the error logs, assuming it may be of some help
Thanks for reading.
I downloaded that file and one other from that site. They both appear to be corrupted, both failing with incomplete deflate data.
What's more, there are MD5 signatures for the files on the site, and the downloaded files have MD5 signatures that do not match.
This is not caused by the FTP server doing end-of-line conversions, because the length of each file in bytes exactly matches the length on the server. Also, a histogram of the byte values shows no bias.
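One quick way to confirm that a local copy is one of the corrupted downloads is to compare its MD5 against the value published on the FTP site. A minimal sketch in Python (the path and the expected digest below are placeholders, not the real published checksum):

import hashlib

# Placeholders: the real path to your download and the MD5 published alongside it.
path = "data/03272019.NASDAQ_ITCH50.gz"
expected_md5 = "0123456789abcdef0123456789abcdef"

md5 = hashlib.md5()
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        md5.update(chunk)

print(md5.hexdigest())
print("matches published checksum" if md5.hexdigest() == expected_md5 else "does NOT match - re-download the file")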

Determine if files are part of any package

Given I have a list of files, e.g. foo/src/main.cpp, foo/src/bar.cpp, foo/README.md, is it possible to determine which of those files are part of a Bazel package?
In my example the output would be foo/src/main.cpp and foo/src/bar.cpp, since the README.md is not part of the build.
One way to do this would be to call bazel query on each file and see if it produces any output, but that is quite inefficient, so I was wondering if there is an easier way.
Background: I am trying to determine whether changes in a set of files have an impact on a target, and I want to use bazel query somepath(//some/target, set($FILES)) for that, but this will fail if any of the files in $FILES is not part of a BUILD file.
How about flipping it around and querying for all the source files of the target with:
bazel query 'kind("source file", deps(//some:target))'
and then checking whether the result contains any of the files in your set.
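A rough way to combine the two steps is to run that query once, turn the resulting labels into workspace-relative paths, and intersect them with the changed files. A sketch of that idea in Python (the target name and file set are the example values from the question; it assumes main-repository labels of the //pkg:path form):

import subprocess

# Example input from the question.
changed = {"foo/src/main.cpp", "foo/src/bar.cpp", "foo/README.md"}

# Run the query once; each output line is a label such as //foo/src:main.cpp.
out = subprocess.run(
    ["bazel", "query", 'kind("source file", deps(//some:target))'],
    capture_output=True, text=True, check=True,
).stdout

sources = set()
for label in out.splitlines():
    if label.startswith("//"):                      # skip external-repo (@...) labels
        pkg, _, name = label[2:].partition(":")
        sources.add(pkg + "/" + name if pkg else name)

# Files from the change set that are actually inputs of the target.
print(changed & sources)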

Bulk update links within the files on Google Drive?

We use Google Drive (GAFE) to prepare and present teaching/training materials. We'd like to maintain archived versions of past iterations, and then work on a new copy for each consecutive training session.
I've succeeded in making a copy of our training folder (using ericyd's gdrive-copy), and we're happily working away on that, BUT... the files are fairly heavily cross-linked. The Slides, for instance, have links to the Docs handouts and PDF assignments associated with that lesson. When I made a copy of the whole folder structure, the files were copied over, but the links still all point to the original files, when what we want is for them to point to their respective copies.
This makes sense - obviously, when you make a copy of a file, you usually don't want to change its contents at the same time. However, when you're making an archive of a whole folder, ideally you'd like the links within the files to update as well.
I can compile a spreadsheet with the file IDs for each "original and copy" pair. Is there any way to iterate through all Google Docs/Sheets/Slides in a folder, and substitute the original URLs from the spreadsheet file with their respective copy URLs?
I'm practically a beginner when it comes to Google Apps Script, so while I have found Get All Links in a Document and am guessing it would be part of the answer, I have no clue where to go beyond that.
(Btw, if doing all three isn't feasible, automating the link fixes in Slides would be the most helpful, as that's where the bulk of them are.)
I know this is a rather old topic, but I recently ran into a similar situation that I needed to solve. In my searching, this is the only reference I could find to cross-linking as a result of duplication. Unfortunately, I was not able to come up with a purely automated solution, but with a bit of ingenuity I was able to reduce the number of steps required to update my hyperlinks to reference the duplicated files rather than the originals.
First, I borrowed some script code I found online to generate a list of the files within a Google Drive folder and their URLs; I'll post the code below. It creates a new Google Sheet named "URL LIST" (you can change the name in the script). Once it has been generated, you'll need to find it in the recent list in your Google Drive and move it to the folder containing the copied documents and sheets.
Next, in the Google Sheet that holds my hyperlinks, I created an additional tab, also called URL LIST, and in A1 added an IMPORTRANGE() to import the URL LIST contents. Once this is set up, you only have to update this one reference with each copy you make, which dramatically reduces the number of updates: IMPORTRANGE() points at a specific URL, so each newly generated URL LIST has a new URL that the copied document containing your hyperlinks and IMPORTRANGE() needs to point to. Hopefully that makes sense.
Next, your hyperlinks need a formula along the lines of =HYPERLINK(VLOOKUP(A1,'URL LIST'!$A$1:$B$10,2,FALSE)) to grab the imported URLs. It's important to indicate that the lookup range is not sorted (the FALSE argument), because the order in which the script lists the documents may change depending on how the folder is sorted when the script runs; with FALSE you don't need the list sorted. You can then copy the formula to each cell that needs a hyperlink.
Equally important: the VLOOKUP() search key must match exactly how the name is listed in your URL LIST.
This method reduced updating my hyperlinks from 9 steps down to the single step of updating the IMPORTRANGE() each time I make copies.
I hope this helps you or someone else!
Copy and paste the following script into your script editor:
// Replace your-folder below with the name of the folder you want a listing of
function listFolderContents() {
  var foldername = 'your-folder';
  var folderlisting = 'URL LIST';

  var folders = DriveApp.getFoldersByName(foldername);
  var folder = folders.next();
  var contents = folder.getFiles();

  // Create a new spreadsheet and write a header row
  var ss = SpreadsheetApp.create(folderlisting);
  var sheet = ss.getActiveSheet();
  sheet.appendRow(['name', 'link']);

  // One row per file: its name and its Drive URL
  var file;
  var name;
  var link;
  while (contents.hasNext()) {
    file = contents.next();
    name = file.getName();
    link = file.getUrl();
    sheet.appendRow([name, link]);
  }
}

Lua - My documents path and file creation date

I'm planning to write a program in Lua that will first of all read specific files
and get information from them. So my first question is: what is the path of the "My Documents" folder? I have searched in a lot of places, but I'm unable to find anything. My second question is: how can I use the first four letters of a file name to see which file is the newest?
In short: find the files in "My Documents", then find the newest created file and read it.
The reading part shouldn't be a problem; the hard part is navigating to "My Documents" and finding the newest created file in a folder.
For your first question, it depends on how robust you want your script to be. You could use Lua's built-in os.getenv() to read a variety of user-related environment variables, such as USERNAME, USERPROFILE, HOMEDRIVE and HOMEPATH. Example:
username = os.getenv('USERNAME')
dir = 'C:\\users\\' .. username .. '\\Documents'
For the second question, there is no built-in mechanism in Windows for including the file creation or modification timestamp as part of the filename. You could read the creation or modification timestamp via a C extension you create, or by using an existing Lua library like lfs. Alternatively, you could read the contents of a folder and parse the filenames if they are named according to the pattern you mention. Again, there is nothing built into Lua to do this; you would use os.execute(), or lfs, or your own C extension module, or a combination of these.

How to compress multiple folders into one archive?

I have some compression components (like KAZip, JVCL, zLib) and I know exactly how to use them to compress files, but I want to compress multiple folders into one single archive and keep the folder structure after extraction. How can I do that?
In all of those components I can only give a list of files to compress; I cannot give the structure of folders to extract, and there is no way (or I couldn't find one) to tell where every file must be extracted to:
I have a file named myText.txt in folder FOLDER_A and a file with the same name myText.txt in folder FOLDER_B:
|
|__________ FOLDER_A
| |________ myText.txt
|
|__________ FOLDER_B
| |________ myText.txt
|
I can give a list of files to compress: myList(myText.txt, myText.txt), but I can't give the structure for uncompressing the files. What is the best way to record which file belongs to which folder?
The zip format just does not have folders. Well, it kind of does, but they are mostly empty placeholders, only inserted if you need metadata storage like user access rights. Apart from those rather rare advanced uses there is no need for folders at all. What really happens - and what you can observe by opening a zip file in Notepad and scrolling to the end - is that each file has its path stored with it, starting from the "archive root". In your example the zip file should have two entries (two files):
FOLDER_A/myText.txt
FOLDER_B/myText.txt
Note that the separators used are true slashes, as in the UNIX world, not the back-slashes used in the DOS/Windows world. Some libraries will fix the back-slashes for you, some will not - just do your tests.
Now, let's assume that tree is contained in D:\TEMP\Project - just for example.
D:\TEMP\Project\FOLDER_A\myText.txt
D:\TEMP\Project\FOLDER_B\myText.txt
There are two more questions (besides path separators): are there folders within D:\TEMP\Project\ that should be ignored rather than zipped (like maybe D:\TEMP\Project\FOLDER_C\*.*)? And does your zip library have a direct API to pack a folder with all its internal subfolders and files, or do you have to do it file by file?
Those three questions you should ask yourself and check while choosing the library. The code drafts would be somewhat different.
Now let's start drafting for the libraries themselves:
The default variant is just using Delphi itself.
Enumerate the files in the folder: http://docwiki.embarcadero.com/CodeExamples/XE3/en/DirectoriesAndFilesEnumeraion_(Delphi)
If that enumeration results in absolute paths, then strip the common D:\TEMP\Project\ from the beginning: something like If AnsiStartsText('D:\TEMP\Project\', filename) then Delete(filename, 1, Length('D:\TEMP\Project\'));. You should get paths relative to the chosen containing folder, especially if you do not compress the whole path and leave some FOLDER_C out of the archive.
Maybe you should also call StringReplace to change '\' into '/' in the filenames.
Then you can zip them using http://docwiki.embarcadero.com/Libraries/XE2/en/System.Zip.TZipFile.Add - take care to specify the correct relative ArchiveFileName, like the aforementioned FOLDER_A/myText.txt.
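To make the recipe above concrete in a language-neutral way, here is a short sketch of the same steps using Python's zipfile (the folder names, the skipped FOLDER_C and the project.zip output name are just the example values from this answer): enumerate the tree, strip the D:\TEMP\Project prefix, flip the separators and store each file under its relative archive name.

import os
import zipfile

root = r"D:\TEMP\Project"   # the containing folder from the example
skip = {"FOLDER_C"}         # folders to leave out of the archive

with zipfile.ZipFile("project.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in skip]   # prune ignored folders
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            # relative path from the "archive root", stored with forward slashes
            arcname = os.path.relpath(full, root).replace("\\", "/")
            zf.write(full, arcname)                             # e.g. FOLDER_A/myText.txt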
You can use the ZipMaster library. It is very VCL-bound and may cause trouble when used from threads or DLLs, but for simple applications it just works. http://www.delphizip.org/
The latest version page has links to a "setup" package which contains sources, help and demos. Among the demos there is a full-featured archive browser capable of storing folders, so you can read the code directly from it. http://www.delphizip.org/191/v191.html
You talked about JVCL, which means you already have the JEDI Code Library installed. And JCL comes with a class and function that, judging by the name, can directly do what you want it to: function TJclSevenzipCompressArchive.AddDirectory(const PackedName: WideString; const DirName: string = ''; RecurseIntoDir: Boolean = False; AddFilesInDir: Boolean = False): Integer;
Actually, all those libraries are rather similar at the basic level. When I made an XLSX export I just made a uniform zipping API that is used the same way no matter which actual zipping engine is installed. It works with in-memory TStreams rather than on-disk files, so it would not help you directly, but I learned that apart from a few quirks (like instant vs. postponed zipping) all those libs work the same at ground level.
