Usage of URL_HASH in FetchContent_Declare - url

I am a newbie in CMake and trying to understand the following CMake command
FetchContent_Declare(curl
URL https://github.com/curl/curl/releases/download/curl-7_75_0/curl-7.75.0.tar.xz
URL_HASH SHA256=fe0c49d8468249000bda75bcfdf9e30ff7e9a86d35f1a21f428d79c389d55675
USES_TERMINAL_DOWNLOAD TRUE)
When I open a browser and enter https://github.com/curl/curl/releases/download/curl-7_75_0/curl-7.75.0.tar.xz, the file curl-7.75.0.tar.xz starts downloading without any need for the URL_HASH. I am sure it is not redundant, so what is the purpose of the URL_HASH?
Also, how can the SHA256 value be found? When I visit https://github.com/curl/curl/releases/download/curl-7_75_0 to find out more, the link is broken.

I am sure it is not redundant. I wanted to know what the purpose of the URL_HASH is?
Secure hash functions like SHA256 are designed to be one-way; it is (in practice) impossible to craft a malicious version of a file with the same SHA256 hash as the original. It is even infeasible to find any two files that have the same hash. Such a pair is called a "collision", and finding even one would constitute a major breakthrough in cryptanalysis.
The purpose of this hash in a CMakeLists.txt, then, is as an integrity check. If a bad actor has somehow intercepted your connection, comparing the hash of the file you actually downloaded against this hard-coded expected hash will detect whether the file was altered in transit. It will also catch less nefarious corruption, such as that caused by a faulty hard drive.
Including such a hash (a "checksum") is absolutely necessary when downloading code or other binary artifacts.
Also how can SHA256 be found?
Often, these will be published alongside the binaries. Use a published value if available.
If you have to compute it yourself, you have a few options. On the Linux command line, you can use the sha256sum command. As a hack, you can also set a deliberately wrong value such as SHA256=0 and copy the actual hash from the resulting error message, since CMake reports the hash it observed when the check fails.
Note that if you compute the hash yourself, you should either (a) download the file from an absolutely trusted connection and device or (b) download it from multiple independent devices (free CI systems like GitHub Actions are useful for this) and ensure the hash is the same across all of them.
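If you would rather compute the value programmatically than with a command-line tool, a minimal standalone sketch (Java here; the curl archive name is used purely as an example) could look like this:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class Sha256OfFile {
    public static void main(String[] args) throws Exception {
        // The archive name is just an example; point this at whatever file you downloaded.
        byte[] data = Files.readAllBytes(Paths.get("curl-7.75.0.tar.xz"));

        // Hash the entire file with SHA-256.
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);

        // Print the lowercase hex string that goes after "SHA256=" in URL_HASH.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}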

Related

Uniquely identify files with same name and size but with different contents

We have a scenario in our project where files come from the client with the same file name, sometimes with the same file size too. Currently, when a file is uploaded, we check the new file name against the existing files in the database, and if there is a match we mark it as a duplicate and do not allow the upload at all. But now we have a requirement to check the content of files that have the same name, so we need a way to differentiate such files based on their contents. How do we do that efficiently, with essentially no chance of error?
Rails 3.1, Ruby 1.9.3
Below is one option I have read about in a web reference.
require 'digest'
digest_value = Digest::MD5.base64digest(File.read( file_path ))
The line above will read the entire contents of the incoming file and generate a hash based on them, right? Then we can use that hash for unique file identification. But we have more than 500 users working simultaneously, around the clock, and most of them will be doing this operation. So if an incoming file is large (> 25 MB), the digest will take longer to read the whole contents and thereby cause performance issues. What would be a better solution considering all these facts?
I have read the question and the comments, and I have to say the problem is not stated 100% correctly. It seems that what you need is to identify identical content, period, regardless of whether the name and size are equal. Correct me if I am wrong, but you likely don't want to allow users to upload 100 duplicates of the same file just because a user happens to have 100 local copies under different names.
So far, so good. I would use the following approach. The file name is not involved at all. The file size can help as a fast uniqueness pre-check: if the sizes differ, the files are definitely different.
Then one might allow the upload with an instant "OK" response. Afterwards, the server should run Digest::MD5 in the background, comparing the new file against everything already uploaded. If there is a duplicate, the new copy of the file should be removed, but its name should stay on the filesystem as a symbolic link to the original.
That way you will not frustrate users, since they can keep as many copies of the file as they want under different names, while disk usage stays at the lowest possible level.
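The question is about Rails, but the flow itself is language-agnostic. Here is a minimal sketch in plain Java (the class and method names are made up for the illustration); folding the size into the lookup key gives the fast size pre-check described above:
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class UploadDeduplicator {
    // Key "size:md5" -> path of the first upload that had that exact content.
    private final Map<String, Path> seenContent = new HashMap<>();

    // Run in the background, after the upload was already acknowledged with an instant "OK".
    public void deduplicate(Path uploaded) throws Exception {
        String key = Files.size(uploaded) + ":" + md5Hex(uploaded);
        Path original = seenContent.putIfAbsent(key, uploaded);
        if (original != null && !original.equals(uploaded)) {
            // Same size and same digest: drop the new copy but keep its name as a symlink.
            Files.delete(uploaded);
            Files.createSymbolicLink(uploaded, original);
        }
    }

    private static String md5Hex(Path file) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Stream the file in chunks instead of reading it fully into memory (matters for >25 MB uploads).
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            for (int read = in.read(buffer); read != -1; read = in.read(buffer)) {
                md5.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}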

swapbuffers minifilter problems

I implemented a minifilter driver using the swapbuffers example. I made two changes:
attach only to \Device\HarddiskVolume3
encryption by XORing with 0xFF
Encryption works, but volume 3 (E: on my system) does not: E: shows up as an unrecognized file system, and chkdsk E: reports that all boot sectors are corrupted.
After investigating with procmon.exe, I found that chkdsk.exe creates a shadow copy of the volume. If the driver attaches to the shadow copy as well, chkdsk E: is OK and reports the filesystem as perfect, but E: itself remains unrecognized.
Any idea what I should change?
Assuming no simple mistake was made, that is, the volume was unmounted, you added the filter, and remounted, then evidently the mount/filesystem is not going through your filter.
I noticed a comment in the example code about "not for kernel mode drivers".
What you want to research is "whole disk encryption". A Google search for "windows whole disk encryption" will help.
In particular, TrueCrypt does what you want. Since it is open source and available on sourceforge.net, you could download the source and figure out how to hook in your own code by studying how TrueCrypt does it.
Just one problem: TrueCrypt has known security gaps, so the sourceforge.net page is now just migration info pointing to BitLocker. But the code still exists, and other pages have been created where you can get it. Notably, VeraCrypt is a fork of TrueCrypt.
Just one of the pages in the search is: http://www.howtogeek.com/203708/3-alternatives-to-the-now-defunct-truecrypt-for-your-encryption-needs/
UPDATE
Note: after I wrote this update, I realized that there may be hope, so keep reading.
Minifilters appear to be for filesystems, not the underlying storage. It may still work; you just need to find a lower-level hook. What about filter stack altitude? Here's a link: https://msdn.microsoft.com/en-us/library/windows/hardware/ff540402%28v=vs.85%29.aspx It also has documentation on fltmc and the !fltkd debugger extension.
In this [short] blog: http://blogs.msdn.com/b/erick/archive/2006/03/27/562257.aspx it says:
The Filter Manager was meant to create a simple mechanism for drivers to filter file system operations: file system minifilter drivers. File system minifilter drivers are located between the I/O manager and the base filesystem, not between the filesystem and the storage driver(s) like legacy file system filter drivers.
Figuring out what that means will help. Is the hook point between the filesystem and the I/O manager [which I don't know much about] sufficient? Or do you need to hook between the filesystem and the storage drivers [implying a legacy filter]?
My suspicion is that a "legacy" filter driver may be what you need, if the minifilter does not have something that can do the same.
Since your hooks need to work on unmounted storage so that chkdsk will work, this may imply a legacy filter. On the other hand, you mentioned that you were able to hook the shadow copy and it worked for chkdsk. That implies the minifilter has the right stuff.
Here's a link that I think is a bit more informative: http://blogs.msdn.com/b/ntdebugging/archive/2013/03/25/understanding-file-system-minifilter-and-legacy-filter-load-order.aspx It has a direct example about the altitude of an encryption filter. You may just need more hook points and a lower altitude for your minifilter.
UPDATE #2
Swapbuffers just hooks a few things: IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL. These are file I/O related, not device I/O related. The example is fine, just not necessarily for your purposes.
The link I gave you to fltmc is one page in MS's entire reference for filters. If you meander around that, you'll find more interesting things like IoGetDeviceAttachmentBaseRef, IoGetDiskDeviceObject. You need to find the object for the device and filter its I/O operations.
I think that you'll have to read the reference material in addition to examples. As I've said previously, your filter needs to hook more or different things.
In the VeraCrypt source, the Driver subdirectory is an example of the types of things you may need to do. In DriveFilter.c, it uses IRP_MJ_READ but also uses IRP_MN_START_DEVICE [A hook when the device is started].
Seriously, this may be more work than you imagine. Is this just for fun, or is this just a test case for a much larger project?

TextIO.Write - does it append to or replace the output files (Google Cloud Dataflow)

I cannot find any documentation on it, so I wonder: what is the behavior if the output files already exist (in a gs:// bucket)?
Thanks,
G
The files will be overwritten. There are several motivations for this:
The "report-like" use case (compute a summary of the input data and put the results on GCS) seems to be a lot more frequent than the use case where you are producing data incrementally and putting more of it onto GCS with each execution of the pipeline.
It is good if rerunning a pipeline is idempotent(-ish). E.g. if you find a bug in your pipeline, you can just fix it and rerun it, and enjoy the overwritten, correct results. A pipeline that appended to files would be very difficult to work with in this regard.
You are not required to specify the number of output shards for TextIO.Write; it can differ slightly between executions, even for exactly the same pipeline and the same input data. The semantics of appending in that case would be very confusing.
Appending is, as far as I know, impossible to implement efficiently on any filesystem while preserving the atomicity and fault-tolerance guarantees (e.g. that you produce all of the output or none of it, even in the face of bundle re-executions due to failures).
This behavior will be documented in the next version of the SDK that appears on GitHub.
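For reference, a minimal write with the pre-Beam Dataflow Java SDK that these answers refer to looks roughly like this (the bucket and paths are placeholders); rerunning it overwrites the previously produced output rather than appending to it:
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

public class OverwriteOutputExample {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read placeholder input files from GCS.
        p.apply(TextIO.Read.from("gs://my-bucket/input/*.txt"))
         // Write the results; on a rerun, the output under this prefix is overwritten, not appended to.
         .apply(TextIO.Write.to("gs://my-bucket/output/results"));

        p.run();
    }
}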

Ignore/skip GCS input files that don't exist

Our requirement is to process the last 24 hours of adserving logs that Google DFP writes directly to our GCS bucket.
We currently achieve this by using a Flatten, and passing in all the file names for the last 24 hours. The file names are in yyyyMMdd_hh format.
But we've identified that sometimes DFP fails to write a file for some of the hours. We've raised that issue with the DFP guys.
However, is there a way to configure our Dataflow job to ignore any missing GCS files, and not fail in that case? It currently fails if one or more files don't exist.
Using Dataflow APIs like TextIO.Read or AvroIO.Read to read from a non-existent file will, of course, throw an error and cause the pipeline to fail. This is working as intended, and I cannot think of a direct workaround.
Now, reading from a filepattern like yyyyMMdd_* may solve your problem, at least partially. Dataflow will expand the filepattern into a set of files and process them. As long as at least one file exists that matches the pattern provided, the pipeline should proceed.
The approach of having one source per file is often an anti-pattern: it is less efficient and less elegant, but functionally the same. Nevertheless, you can still solve the problem by using the Google Cloud Storage API to confirm the presence of each file before constructing your Dataflow pipeline. If an input file is not present, you simply skip generating that source.
Either way, please keep in mind the eventual-consistency guarantee of the GCS list API: expanding a file pattern may not immediately include every file that is otherwise readable. The one-source-per-file anti-pattern may be a good workaround for this case, however.
Maybe not the best answer, but you can always use
GcsUtilFactory.create(options).expand(...)
to grab all the files that exist, and then create the Flatten accordingly.
Waiting for more professional answers.
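Putting the two suggestions together, a sketch with the pre-Beam Dataflow Java SDK might look like the following; the bucket name, the 20160101_* pattern, and the class name are placeholders, and the expansion happens at pipeline-construction time:
import java.util.List;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.GcsOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Flatten;
import com.google.cloud.dataflow.sdk.util.GcsUtil;
import com.google.cloud.dataflow.sdk.util.gcsfs.GcsPath;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

public class SkipMissingInputs {
    public static void main(String[] args) throws Exception {
        GcsOptions options = PipelineOptionsFactory.fromArgs(args).as(GcsOptions.class);
        Pipeline p = Pipeline.create(options);

        // Expand the hourly pattern before constructing the pipeline;
        // only objects that actually exist right now are returned.
        GcsUtil gcsUtil = options.getGcsUtil();
        List<GcsPath> existing = gcsUtil.expand(GcsPath.fromUri("gs://my-bucket/dfp-logs/20160101_*"));

        // One TextIO.Read per existing file, flattened together, so a missing
        // hour simply contributes nothing instead of failing the whole job.
        PCollectionList<String> parts = PCollectionList.empty(p);
        for (GcsPath path : existing) {
            parts = parts.and(p.apply(TextIO.Read.from(path.toUri().toString())));
        }
        PCollection<String> allLogs = parts.apply(Flatten.<String>pCollections());

        // ... downstream processing of allLogs goes here ...
        p.run();
    }
}
Keep the GCS listing consistency caveat from the first answer in mind: a file written very recently may not show up in the expansion yet.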

PHP fails to parse large post variable

I'm trying to pass a rather large POST request to PHP, and when I var_dump the $_POST array, one variable, the largest one, is missing. (It is actually a base64-encoded binary upload sent as part of the POST request.)
The funny thing is that on my development PC the exact same request is parsed correctly, without any missing variables.
I compared the contents of php://input on the server and on the development PC and they are exactly the same; the MD5 hashes match. Yet the development PC recognizes all the variables while the server misses one.
I tried changing many different options in php.ini, with zero effect.
Maybe someone will point me to the right one.
Here is my php://input (~5 megabytes) http://www.mediafire.com/?lp0uox53vhr35df
It's possible the server is blocking it because of the Suhosin extension.
http://www.hardened-php.net/suhosin/configuration.html#suhosin.post.max_value_length
suhosin.post.max_value_length
Type: Integer. Default: 65000. Defines the maximum length of a variable that is registered through a POST request.
This will have to be changed in the php.ini.
Keep in mind that this is different from the Suhosin patch, which is common on a lot of shared hosts. I don't know whether the patch would cause this problem.
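If Suhosin turns out to be the cause, the fix is to raise the limit in php.ini and restart the web server. The values below are only an illustration sized for a roughly 5 MB POST body; PHP's own post_max_size must also be large enough:
suhosin.post.max_value_length = 10000000
suhosin.request.max_value_length = 10000000
post_max_size = 16M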
