Hard drive reads and writes without file creation

There are programs out there that recover deleted files from the hard drive and also ones that overwrite free space in order to prevent deleted files from being recovered.
The act of overwriting free space seems understandable. The program creates files and writes arbitrary bytes to them.
However, when it comes to reading deleted files, I'm stumped. I understand that deleting a file only gets rid of the reference in the file system and that recovery programs search for common file headers in order to determine which part of the 'free space' could be a recoverable file.
But how can a program read data from the hard disk that is not part of the file system? Every language I've used or read documentation for only allows reading from the hard disk by opening a file - and free space is not a file.
I would also be grateful for a small example of a read from hard disk maybe in C++, Java or Python.
Also, I am a Windows user.
EDIT: This is what the Java guys came up with: How to access specific raw data on disk from java

Every OS out there has the notion of a block device, with a hard disk being the canonical example. The beauty is that in most implementations (this includes Windows), these can be opened just as if they were files on a file system, by referring to special file names that would be invalid inside the file system (appropriate user privileges assumed).
On Windows, for example, opening \\?\Device\Harddisk0\Partition1 gives you access to the first partition of the first hard drive. With read access to this special "file", you can read the drive's content without going through the file system, which makes it possible to discover and salvage objects that are no longer part of the file system but have not yet been overwritten or trimmed.
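Since the question asks for a small example, here is a minimal C++ sketch of such a raw read (not taken from the original answer). It assumes the \\.\PhysicalDrive0 alias for the first physical disk (the exact device path is an assumption; a partition device can be opened the same way), requires administrator rights, and reads in whole sectors.

    #include <windows.h>
    #include <cstdio>

    int main() {
        // Open the first physical drive for raw reads that bypass the file system.
        // Requires administrator privileges; the device path is an assumption.
        HANDLE hDisk = CreateFileW(L"\\\\.\\PhysicalDrive0", GENERIC_READ,
                                   FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                                   OPEN_EXISTING, 0, NULL);
        if (hDisk == INVALID_HANDLE_VALUE) {
            printf("CreateFile failed: %lu\n", GetLastError());
            return 1;
        }

        // Raw device I/O must be sector-aligned in offset, length, and buffer
        // address; VirtualAlloc returns page-aligned memory, which satisfies that.
        const DWORD sectorSize = 512;
        BYTE* sector = (BYTE*)VirtualAlloc(NULL, sectorSize, MEM_COMMIT, PAGE_READWRITE);
        if (!sector) { CloseHandle(hDisk); return 1; }

        DWORD bytesRead = 0;
        if (ReadFile(hDisk, sector, sectorSize, &bytesRead, NULL))
            printf("Read %lu bytes; first byte: 0x%02X\n", bytesRead, sector[0]);
        else
            printf("ReadFile failed: %lu\n", GetLastError());

        VirtualFree(sector, 0, MEM_RELEASE);
        CloseHandle(hDisk);
        return 0;
    }

Subsequent sectors can be read by continuing to call ReadFile, or by seeking with SetFilePointerEx (always in whole-sector multiples), which is how a recovery tool scans unallocated regions for the file headers mentioned in the question.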

Related

Flash memory raw data changes depending on the reading tool. Why?

I've been playing around with the raw data inside an 8GB Memory Stick, reading and writing directly into specific sectors, but for some reason changes don't remain consistent.
I've used Active@ Disk Editor to write a string at a specific sector and it seems consistent when I read it through Active@ (it survives unmounting, rebooting...), but if I try to read it through the terminal using dd and hexdump the outcome is different.
Some time ago I was researching ways to fully and effectively erase a disk, and I read somewhere that solid-state storage such as flash drives or SSDs has more memory than is stated, and its internals keep remapping parts of the memory in order to increase lifespan (wear leveling) or something like that.
I don't know if it is because of that or if it's even correct. Could you tell me if I'm wrong or where to find good documentation about the subject?
Okay I just figured it out.
Apparently, when you open a disk inside a hex editor there are two ways you can go: you can open it as a physical disk (the whole disk) or as a logical disk, aka a volume or a partition.
Active@ Disk Editor was opening it as a physical disk, while dd and hexdump were dumping it as a logical disk - in other words, dumping the content of the only partition inside the physical disk. This means there was an offset between the real physical sectors where I was writing data using Active@ and the ones that I was reading (quite a big offset: 2048 sectors of 512 bytes each).
So changes were being made, I was just looking at the wrong positions. Hope this saves someone a few minutes.
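As a concrete illustration of the mismatch (numbers taken from the question, with an assumed 512-byte sector size), here is a small C++ sketch of how a sector address seen through the partition device maps to the whole physical disk:

    #include <cstdint>
    #include <cstdio>

    int main() {
        // From the question: the single partition starts 2048 sectors into the disk,
        // with an assumed sector size of 512 bytes.
        const uint64_t partitionStart = 2048;   // sectors
        const uint64_t sectorSize     = 512;    // bytes

        // Fixed gap between "physical disk" and "logical disk" addressing:
        printf("offset = %llu bytes (1 MiB)\n",
               (unsigned long long)(partitionStart * sectorSize));

        // A sector addressed through the partition (e.g. by dd on the volume)
        // maps to this sector of the whole disk (as seen by the hex editor):
        uint64_t logicalSector  = 100;
        uint64_t physicalSector = logicalSector + partitionStart;
        printf("logical sector %llu -> physical sector %llu\n",
               (unsigned long long)logicalSector,
               (unsigned long long)physicalSector);
        return 0;
    }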

Reed-Solomon in file recovery

A piece of software I'm working on outputs quite a lot of files which are then stored on a server. During its runtime I've had one file go corrupt on me. These files are critical to the operation, so this cannot happen. I'm therefore trying to come up with a way of adding error correction to the files to prevent this from ever happening again.
I've read up on Reed-Solomon, which encodes k blocks of data plus m blocks of parity, and can then reconstruct up to m missing blocks. So what I'm thinking is taking the data stream, splitting it into these blocks, and then storing them in sequence on disk, first the data blocks, then the parity blocks. Repeat until the entire file is stored. k, m, and the block size are of course variables I'll have to investigate and play with.
However, it's my understanding that Reed-Solomon requires you to know which blocks are corrupt. How could I possibly know that? My thinking is I'd have to add some extra, simpler, error detection code to each of the blocks as I write them, otherwise I can't know if they're corrupted. Like CRC32 or something.
Have I understood this correctly, or is there a better way to accomplish this?
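For illustration only (this is not the poster's code), here is a minimal C++ sketch of the per-block layout the question describes: each block is written together with a CRC32 of its payload, and a failed checksum on read-back marks that block as an erasure whose index can then be handed to a Reed-Solomon decoder from a library. Block size and layout are arbitrary assumptions.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Standard CRC-32 (reflected polynomial 0xEDB88320), computed bitwise for brevity.
    static uint32_t crc32(const uint8_t* data, size_t len) {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; ++i) {
            crc ^= data[i];
            for (int b = 0; b < 8; ++b)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    // Hypothetical on-disk layout: [payload][4-byte CRC], repeated for every
    // data and parity block of a Reed-Solomon group.
    constexpr size_t kBlockSize = 4096;   // arbitrary choice

    void writeBlock(FILE* f, const std::vector<uint8_t>& payload) {
        uint32_t crc = crc32(payload.data(), payload.size());
        fwrite(payload.data(), 1, payload.size(), f);
        fwrite(&crc, sizeof crc, 1, f);
    }

    // Returns false if the block is truncated or its CRC does not match; the
    // caller records its index as an erasure position for the RS decoder.
    bool readBlock(FILE* f, std::vector<uint8_t>& payload) {
        payload.resize(kBlockSize);
        uint32_t stored = 0;
        if (fread(payload.data(), 1, kBlockSize, f) != kBlockSize) return false;
        if (fread(&stored, sizeof stored, 1, f) != 1) return false;
        return crc32(payload.data(), kBlockSize) == stored;
    }

The undamaged blocks plus the list of erasure positions are then enough for an RS(k+m, k) code to reconstruct up to m missing blocks, which is exactly the property the question relies on.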
This is a bit of an older question, but (in my mind) it is always something that is useful and in some cases necessary. Bit rot will never be completely cured (hush, ZFS community; ZFS only has control of what's on its filesystem while it's there), so we always have to come up with proactive prevention and recovery plans.
While it was designed to facilitate piracy (specifically storing and extracting multi-GB files in chunks on newsgroups where any chunk could go missing or be corrupted), "Parchives" are actually exactly what you're looking for (see the white paper, though don't implement that scheme directly as it has a bug and newer schemes are available), and they work in practice as follows:
The complete file is input into the encoder
Blocks are processed and Reed-Solomon blocks are generated
.par files containing those blocks are output alongside the original file
When integrity is checked (typically on the other end of a file transfer), the blocks are rechecked and any blocks that need to be used to reconstruct missing data are pulled from the .par files.
Things eventually settled into "PAR2" (essentially a rewrite with additional features) with the following scheme:
The large file is compressed with RAR and split into chunks (typically around 100MB each, as that was a "usually safe" maximum on Usenet)
An "index" file is placed alongside the file (for example bigfile.PAR2). This has no recovery chunks.
A series of PAR2 files totaling 10% of the original data size is placed alongside, in increasingly larger file sizes (bigfile.vol029+25.PAR2, bigfile.vol104+88.PAR2, etc.)
The person on the other end then gets all the .rar files
An integrity check is run, and returns a MB count of how much data needs recovery
.PAR2 files are downloaded in an amount equal to or greater than the need
Recovery is done and integrity verified
RAR is extracted, and the original file is successfully transferred
Now without a filesystem layer this system is still fairly trivial to implement using the Parchive tools, but it has two requirements:
That the files do not change (as any change to the file on-disk will invalidate the parity data (of course you could do this and add complexity with a copy-on-change writing scheme))
That you run both the file generation and integrity check/recovery when appropriate.
Since all the math and methods are both known and battle-tested, you can also roll your own to meet whatever needs you have (as a hook into file read/write, spanning arbitrary path depths, storing recovery data on a separate drive, etc.). For initial tips, refer to the pros: https://www.backblaze.com/blog/reed-solomon/
Edit: The same research that led me to this question led me to a whole subset of already-done work that I was previously unaware of
https://crates.io/crates/solana-reed-solomon-erasure (as well as a bunch of other implementations in the Rust crate registry)
https://github.com/klauspost/reedsolomon (based on the BackBlaze code, and processes 1Gbps per core)
Etc. Look for "Reed-Solomon file recovery".

Reading/Writing to/from iPhone's Documents folder performance

I have an iPhone application where I archive permanent data (arrays, dictionaries) in the Documents folder of the application. I read from and write to the Documents folder quite frequently, and I would like to know whether this is considered a bad habit. Wouldn't it be better if I had a singleton class, read and wrote to arrays there, and only when the application quits wrote this data to the Documents folder? I do not see/feel any performance issues right now on my iPhone 5, but I wanted to know whether this is a bad practice.
Flash memory has a limited number of write cycles - a long time ago it was rated in some increment of thousands. Not sure where it is today.
That said, if your app is using the standard file system APIs, then the system is using the file cache, and you might open a file, read it then change it many times without the file system ever writing to flash. The system may sync to flash occasionally, but that process is opaque - no way to really know when or why iOS does it.
The UNIX APIs allow syncing the file system cache to the storage system (on iOS this is flash), but if you are not using those, then you are probably not doing much I/O at all, given what you say above.
Given that Apple does not discourage developers from writing to the file system, I for sure would not worry about this.
But, if you said you were writing gigabytes of image data every few minutes - well - that might be a problem.
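For reference, a minimal sketch (in portable C/C++ rather than Objective-C, with placeholder names) of the POSIX sync call referred to above: the data is written and then explicitly pushed out of the file-system cache with fsync().

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstring>

    // Write a string to 'path' and force it out of the file-system cache.
    // Returns 0 on success, -1 on failure. Path and contents are placeholders.
    int saveAndSync(const char* path, const char* data) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        ssize_t n = write(fd, data, strlen(data));
        int rc = (n < 0) ? -1 : fsync(fd);   // flush cached pages to storage
        // On Apple platforms, fcntl(fd, F_FULLFSYNC) additionally asks the
        // device itself to flush its cache, at a further performance cost.
        close(fd);
        return rc;
    }

Without such an explicit call, the point at which iOS actually writes the cached data to flash remains, as noted above, opaque.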

What is the fastest way for reading huge files in Delphi?

My program needs to read chunks from a huge binary file with random access. I have got a list of offsets and lengths which may have several thousand entries. The user selects an entry and the program seeks to the offset and reads length bytes.
The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:
FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
This works fine but unfortunately it becomes increasingly slower as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 kbytes in size.
The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.
I am using Delphi 2007 on a Windows XP box.
What can I do to speed up this file access?
edit:
The file access is slow for large files, regardless of which part of the file is being read.
The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.
I need to apologize to everybody: after I implemented file access using a memory-mapped file as suggested, it turned out that it did not make much of a difference. But after I added some more timing code, it also turned out that it is not the file access that slows down the program. The file access actually takes nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data, and somehow I failed to see the difference when I first timed the processes.
I am sorry for being sloppy in identifying the bottleneck.
If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can play with them to gain some performance.
Next, copying the file data, even just 100 KB of it, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid the copying and also possibly get certain performance benefits (but you need to measure the speed carefully).
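For illustration, a minimal Win32 sketch (in C++ rather than Delphi, with assumed names) of the mapping approach just described. Note that MapViewOfFile requires the view offset to be a multiple of the system allocation granularity (commonly 64 KB), so the requested offset is rounded down and the difference added back to the returned pointer.

    #include <windows.h>

    // Map 'size' bytes starting at 'offset' of an already opened file, read-only.
    // Returns a pointer to the requested bytes, or NULL on failure; the caller
    // passes *view to UnmapViewOfFile() when finished with the chunk.
    const BYTE* MapChunk(HANDLE hFile, ULONGLONG offset, SIZE_T size, LPVOID* view) {
        SYSTEM_INFO si;
        GetSystemInfo(&si);

        // Round the offset down to the allocation granularity (usually 64 KB).
        ULONGLONG aligned = offset - (offset % si.dwAllocationGranularity);
        SIZE_T delta = (SIZE_T)(offset - aligned);

        HANDLE hMapping = CreateFileMappingW(hFile, NULL, PAGE_READONLY, 0, 0, NULL);
        if (!hMapping) return NULL;

        *view = MapViewOfFile(hMapping, FILE_MAP_READ,
                              (DWORD)(aligned >> 32), (DWORD)(aligned & 0xFFFFFFFFu),
                              size + delta);
        CloseHandle(hMapping);            // the view keeps the mapping object alive
        if (!*view) return NULL;
        return (const BYTE*)*view + delta;
    }

No data is copied into an intermediate TMemoryStream here; the pages are faulted in by the OS only as the chunk is actually touched.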
Maybe you can take this approach:
Sort the entries by maximum file position and then do the following:
Take the entries that only need the first X MB of the file (up to a certain file position)
Read X MB from the file into a buffer (TMemoryStream)
Now read the entries from the buffer (maybe multithreaded)
Repeat this for all the entries.
In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part, etc.
Maybe you can gain speed if you just take your original approach, but sort the entries on position.
The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.
The stock Delphi TFileStream is also not the most efficient component. Way back in history Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.
Good luck.

FlushFileBuffers as good as CloseHandle then CreateFile at saving data to disk?

For a file on disk, is the Win32 function FlushFileBuffers as reliable and certain as closing the file using CloseHandle then re-opening the file using CreateFile?
Are there circumstances where CloseHandle then CreateFile are better because they save the data correctly to disk when FlushFileBuffers does not?
It is better; CloseHandle() doesn't flush the file system cache write buffers. Beware of the cost: it can take a long time to get the data to the disk. The FILE_FLAG_NO_BUFFERING option for CreateFile allows you to avoid flushing altogether, but it is very expensive and difficult to get right due to the limitations on the written data.
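For reference, a minimal Win32 sketch of the pattern being compared (the file name is a placeholder): the data is written and then explicitly pushed through the file-system cache with FlushFileBuffers before the handle is closed.

    #include <windows.h>

    // Write a buffer and force it through the file-system cache to the disk.
    // Returns TRUE only if both the write and the flush succeed.
    BOOL WriteDurably(const void* data, DWORD size) {
        HANDLE h = CreateFileW(L"durable.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return FALSE;

        DWORD written = 0;
        BOOL ok = WriteFile(h, data, size, &written, NULL) && written == size;
        if (ok) ok = FlushFileBuffers(h);   // CloseHandle alone would not flush the cache
        CloseHandle(h);
        return ok;
    }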
According to the documentation, FlushFileBuffers does write everything to the disk. However, it probably doesn't hurt to test it yourself. I've done BRS testing (big red switch ... well PCs used to have big red switches) in the past, and I found that it did cause everything to be written. After the call to FlushFileBuffers, turn the PC off without a clean shutdown. Turn it back on and see if the data is all there. The behavior may change by OS (in theory it shouldn't ... but you never know). It was quite some time ago that I did tests like that (it was on XP or possibly even Windows 2000).
And I suppose it goes without saying, but you probably don't want to do this testing on a workstation that you really care about.
Although this information is not related to Delphi, the most widely deployed SQL database on earth, SQLite (used for example in Firefox), takes care of such things, and you can read a lot about atomic commit here: http://www.sqlite.org/atomiccommit.html
Below is a quote from the article about FlushFileBuffers:
9.2 Incomplete Disk Flushes
SQLite uses the fsync() system call on Unix and the FlushFileBuffers() system call on w32 in order to sync the file system buffers onto disk oxide as shown in step 3.7 and step 3.10. Unfortunately, we have received reports that neither of these interfaces works as advertised on many systems. We hear that FlushFileBuffers() can be completely disabled using registry settings on some Windows versions. ...
