What is the difference between a file and a random access file?
A random access file is a file where you can "jump" to anywhere within it without having to read sequentially until the position you are interested in.
For example, say you have a 1MB file, and you are interested in 5 bytes that start after 100k of data. A random access file will allow you to "jump" to the 100k-th position in one operation. A non-random access file will require you to read 100k bytes first, and only then read the data you're interested in.
Hope that helps.
Clarification: this description is language-agnostic and does not relate to any specific file wrapper in any specific language/framework.
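To make the 1MB example concrete, here is a minimal sketch in Ruby (the language is chosen only for illustration; the helper names `read_random` and `read_sequential` and the demo file are assumptions, not part of any particular API). Both helpers return the same 5 bytes, but only the first one gets there with a single positioning call.

```ruby
require 'tempfile'

# Read `length` bytes starting at byte `offset` by jumping straight there.
def read_random(path, offset, length)
  File.open(path, 'rb') do |f|
    f.seek(offset)   # one positioning call, no data is read on the way
    f.read(length)
  end
end

# Reach the same bytes the sequential way: read and discard what precedes them.
def read_sequential(path, offset, length)
  File.open(path, 'rb') do |f|
    f.read(offset)   # consume the first `offset` bytes just to get past them
    f.read(length)
  end
end

# Demo on a throwaway 1 MB file with 5 interesting bytes after 100k of filler.
demo = Tempfile.new('random_access_demo')
demo.binmode
demo.write('x' * 100_000 + 'HELLO' + 'x' * (1_048_576 - 100_005))
demo.flush

puts read_random(demo.path, 100_000, 5)
puts read_sequential(demo.path, 100_000, 5)
```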
Almost nothing these days. There used to be a time in certain operating systems where there were different types of files - some of which could be accessed randomly (at any point in the file) and others which could only be accessed sequentially. This made more sense when you were using a sequential medium such as tape. Any file system worth its salt these days supports random access for every file.
We have a scenario in our project where files come from the client with the same file name, and sometimes with the same file size too. Currently, when a file is uploaded, we check the new file name against the existing files in the database, and if there is a match we mark it as a duplicate and do not allow the upload at all. But now we have a requirement to check the content of the files when they have the same file name. So we need a solution that differentiates such files based on their contents. How do we do that efficiently, meaning how do we avoid even a minute chance of error?
Rails 3.1, Ruby 1.9.3
Below is one option I have read from a web reference.
require 'digest'
digest_value = Digest::MD5.base64digest(File.read( file_path ))
The above line will read the entire contents of the incoming file and generate a unique hash from it, right? Then we can use that hash for unique file identification. But we have more than 500 users working simultaneously in 24/7 mode, and most of them will be doing this operation. So if the incoming file is large (> 25MB), the digest will take more time to read the whole contents and thereby cause performance issues. What could be a better solution considering all these facts?
I have read the question and the comments, and I have to say the problem is not stated quite correctly. It seems that what you need is to identify identical content. Period. Regardless of whether the name and size are equal or not. Correct me if I am wrong, but you likely don’t want to allow users to upload 100 duplicates of the same file just because the user has 100 local copies of it under different names.
So far, so good. I would use the following approach. The file name is not involved at all. The file size can serve as a fast uniqueness check: if the sizes differ, the files are definitely different.
Then one might allow the upload with an instant “OK” response. Afterwards, the server in the background should run Digest::MD5, comparing the file against all already uploaded. If there is a duplicate, the new copy of the file should be removed, but the name should stay on the filesystem, being a symbolic link to the original.
That way you won't frustrate users: they can keep as many copies of the file as they want under different names, while disk usage stays as low as possible.
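A rough sketch of that approach, under stated assumptions: `known_files` is a hypothetical in-memory stand-in for the database records, `CHUNK_SIZE` is an arbitrary value, and the helper names are made up for illustration. The digest is fed in chunks so a 25MB upload is never held in memory at once, and it is only computed when the cheap size check finds a candidate.

```ruby
require 'digest'

CHUNK_SIZE = 1024 * 1024 # assumption: read 1 MB at a time

# MD5 of a file, computed chunk by chunk instead of via File.read.
def streamed_md5(path)
  digest = Digest::MD5.new
  File.open(path, 'rb') do |f|
    while (chunk = f.read(CHUNK_SIZE))
      digest.update(chunk)
    end
  end
  digest.base64digest
end

# known_files: array of hashes like { size: 123, md5: '...', path: '/...' }
# (hypothetical stand-in for the records kept in the database).
def find_duplicate(path, known_files)
  size = File.size(path)
  candidates = known_files.select { |rec| rec[:size] == size }
  return nil if candidates.empty? # sizes differ, so the content must differ

  md5 = streamed_md5(path)
  candidates.find { |rec| rec[:md5] == md5 }
end

# Drop the duplicate's bytes but keep its name as a symbolic link.
# File.symlink is not available on every platform (for example Windows).
def replace_with_symlink(duplicate_path, original_path)
  File.delete(duplicate_path)
  File.symlink(original_path, duplicate_path)
end
```

Since the question asks to avoid even a minute chance of error, one option is to follow a digest match with a byte-by-byte comparison of the two files, or to use a stronger digest such as Digest::SHA256, before removing anything.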
I've coded a random access file to be created, and am wondering;
Is the random access file stored on the RAM, or the hard drive?
If it's the hard drive, why is it called a "Random Access" file?
Thanks
Random access has nothing to do with storage location (which for a file is the disk). It has to do with how you can access (read/write) that file content.
Random access means you can access any location in the file between the start and end, in any order, at any time. It's the opposite of sequential access, which means you can only access the file from start to end.
In other words, with random access you can read the last byte (or block of bytes) from the file, the first byte (or block), and then a byte or block from the middle somewhere. With sequential access, you have to read the first byte/block, then the second byte/block, then the third, and so on.
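The "last, first, then middle" order described above can be sketched like this in Ruby (an illustration only; the helper name `read_out_of_order` and the ten-byte sample file are assumptions). The file lives on disk the whole time; only the read position moves around.

```ruby
require 'tempfile'

# Returns [last byte, first byte, a byte from the middle], read in that order.
def read_out_of_order(path)
  File.open(path, 'rb') do |f|
    f.seek(-1, IO::SEEK_END)  # jump to the last byte
    last = f.read(1)
    f.seek(0, IO::SEEK_SET)   # jump back to the first byte
    first = f.read(1)
    f.seek(4, IO::SEEK_SET)   # then somewhere in the middle
    middle = f.read(1)
    [last, first, middle]
  end
end

sample = Tempfile.new('access_order_demo')
sample.binmode
sample.write('ABCDEFGHIJ')
sample.flush

p read_out_of_order(sample.path)
```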
Random access is the opposite of sequential access. Random file reads and writes are somewhat similar to accessing RAM. A file is still a file: it can be read into RAM, but it is stored on the hard drive.
If I have a large number of files (n x 100K individual files), what would be the most efficient way to store them in the iOS file system (from the point of view of speed of access to a file by path)? Should I dump them all in a single folder or break them up into a multilevel folder hierarchy?
Basically this breaks down into three questions:
Does file access time depend on the number of "sibling" files? (I think the answer is yes. If I am correct, file names are organized into a b-tree, so it should be O(log n).)
How expensive is traversing from one folder to another along the path? (Is it something like m * O(log nm), where m is the number of components in the path and nm is the number of "siblings" at each path component?)
What gets cached at the file system level that would make the above assumptions incorrect?
It would be great if someone had direct experience with this kind of problem and could share some real-life results.
Your comments will be highly appreciated.
This seems like it might provide relevant, hard data:
File System vs Core Data: the image cache test
http://biasedbit.com/blog/filesystem-vs-coredata-image-cache
Conclusion:
File system cache is, as expected, faster. Core Data falls shortly behind when storing (marginally slower) but load times are way higher when performing single random accesses.
For such a simple case Core Data functionality really doesn't pay up, so stick to the file system version.
I think you should store everything in one folder and create a hash table that holds key (file name) and value (source path) pairs. With a hash table, lookup complexity will be constant, O(1), and this will speed up your process as well.
The file system is not an optimal database. With that many thousands of files, you should consider using Core Data, or other database instead to store the name and contents of each file.
37signals suggests id partitioning to accomplish this.
http://37signals.com/svn/archives2/id_partitioning.php
Any suggestions would be more than welcome.
Thanks.
We use Paperclip for storing our files. It can do what you want pretty easily.
We use partitioning by date so an image uploaded today would end up in 2009/12/10/image_12345.jpg. The path is stored in the db for reference and the path to the image folder (the parent of 2009) is placed in some config file. If we need to change things later it makes it very easy.
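A minimal sketch of the date scheme in plain Ruby (this is not Paperclip's API; `IMAGE_ROOT` and both helper names are hypothetical). Only the relative path goes into the database, so the root folder can move later by changing one config value.

```ruby
require 'date'

# Hypothetical config value: the parent directory of the year folders.
IMAGE_ROOT = '/var/app/images'

# Relative path to store in the database, e.g. "2009/12/10/image_12345.jpg".
def partitioned_path(filename, uploaded_on)
  File.join(uploaded_on.strftime('%Y/%m/%d'), filename)
end

# Full path on disk, rebuilt from the config value and the stored path.
def absolute_path(relative_path)
  File.join(IMAGE_ROOT, relative_path)
end

puts partitioned_path('image_12345.jpg', Date.new(2009, 12, 10))
```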
You can map by virtually anything. We use mapping by user in our designs, but it's an HR system so it makes sense (there's no way a user will have 32k file entries) and the files are clearly connected with a user. In the Media Library parts of the system, dividing by date or ID would be more useful.
The catch is, you should store some part of the file path in a database table (as suggested before). That part can be the date, or the user hash/name (often also divided, e.g. u/user, j/john, j/jo/john etc). Then you don't have to worry about changing the division scheme, as this will only require a database update.
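One way to read the "j/john" and "j/jo/john" layouts is as prefix directories built from the user name. The sketch below assumes exactly that interpretation; `sharded_dir` and its `levels` parameter are hypothetical names.

```ruby
# Builds "j/jo/john" for levels = 2, or "j/john" for levels = 1.
def sharded_dir(user_name, levels = 2)
  name = user_name.downcase
  prefixes = (1..levels).map { |n| name[0, n] } # "j", "jo", ...
  File.join(*prefixes, name)
end

puts sharded_dir('john')
puts sharded_dir('john', 1)
```

Storing the resulting directory in the database row, rather than recomputing it on every access, is what keeps a later change of scheme down to a database update.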
We take text/csv like data over long periods (~days) from costly experiments and so file corruption is to be avoided at all costs.
Recently, a file was copied from the Explorer in XP whilst the experiment was in progress and the data was partially lost, presumably due to multiple access conflict.
What are some good techniques to avoid such loss? - We are using Delphi on Windows XP systems.
Some ideas we came up with are listed below - we'd welcome comments as well as your own input.
Use a database as a secondary data storage mechanism and take advantage of the atomic transaction mechanisms
How about splitting the large file into separate files, one for each day.
If these machines are on a network: send a HTTP post with the logging data to a webserver.
(sending UDP packets would be even simpler).
Make sure you only copy old data. If you have a timestamp on the filename with a 1 hour resolution, you can safely copy the data older than 1 hour.
If a write fails, cache the result for a later write - so if a file is opened externally the data is still stored internally, or could even be stored to a disk
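The last idea (keep data that could not be written and retry later) might look roughly like this. The sketch is in Ruby purely for illustration, since the project itself uses Delphi; the class name `RetryingWriter` is hypothetical, and the in-memory queue would be lost if the program itself crashed.

```ruby
# Appends lines to a file; lines that cannot be written stay queued in memory.
class RetryingWriter
  attr_reader :pending

  def initialize(path)
    @path = path
    @pending = [] # lines not yet safely written
  end

  # Queue the line, then try to write everything that is queued.
  def log(line)
    @pending << line
    flush_pending
  end

  # Returns true when the queue was written, false when the file was unavailable.
  def flush_pending
    File.open(@path, 'a') do |f|
      f.write(@pending.map { |l| "#{l}\n" }.join)
    end
    @pending.clear
    true
  rescue SystemCallError, IOError
    # File locked, missing directory, etc.: keep the data and retry on the next call.
    false
  end
end
```

If a write fails halfway through, the queued lines are written again on the next attempt, so a real implementation would need to tolerate or detect duplicate rows.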
I think what you're looking for is the Win32 CreateFile API, with these flags:
FILE_FLAG_WRITE_THROUGH : Write operations will not go through any intermediate cache, they will go directly to disk.
FILE_FLAG_NO_BUFFERING : The file or device is being opened with no system caching for data reads and writes. This flag does not affect hard disk caching or memory mapped files.
There are strict requirements for successfully working with files opened with CreateFile using the FILE_FLAG_NO_BUFFERING flag, for details see File Buffering.
Each experiment must use a 'work' file and a 'done' file. The work file is opened exclusively, and the done file is copied to a place on the network. An application on the receiving machine would feed those files into a database. If Explorer tries to move or copy the work file, it will receive an 'Access denied' error.
The 'work' file would become 'done' after a certain period (say 6, 12 or 24 hours, or whatever period suits). The program then creates another work file (the name must contain the timestamp) and sends the 'done' file over the network (or a human can do that, which is what you are doing now, if I understand your text correctly).
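The naming and hand-off part of this scheme can be sketched as follows (Ruby for illustration only; the `work_`/`done_` prefixes, the `.csv` extension and all helper names are assumptions). The sketch does not cover opening the work file with an exclusive share mode, which is what produces the 'Access denied' error on Windows.

```ruby
# Name of a new work file; the timestamp keeps successive files distinct.
def work_file_name(dir, started_at)
  File.join(dir, "work_#{started_at.strftime('%Y%m%d_%H%M%S')}.csv")
end

# Rename a finished work file to "done_...". Call this only after the
# writer has closed the file.
def mark_done(work_path)
  done_name = File.basename(work_path).sub(/\Awork_/, 'done_')
  done_path = File.join(File.dirname(work_path), done_name)
  File.rename(work_path, done_path)
  done_path
end

# Files that are safe to copy to the network: only the finished ones.
def ready_to_copy(dir)
  Dir.glob(File.join(dir, 'done_*.csv')).sort
end
```

Because the copier only ever looks at `done_` files, it never touches a file that is still being written.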
Copying a file while it is in use is asking for it to be corrupted.
Write data to a buffer file in an obscure directory and copy the data to the 'public' data file periodically (every 10 points for instance), thereby reducing writes and also providing a backup
Write data points discretely, i.e. open and close the file handle for every data point write - this reduces the amount of time the file is held open, provided the interval between data points is long
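A sketch of the open-write-close idea, again in Ruby for illustration rather than Delphi (the helper name `append_point` and the comma-joined row format are assumptions). The file is only open for the duration of one write, and the data is pushed to disk before the handle is closed.

```ruby
# Append one data point as a CSV row, then release the file immediately.
def append_point(path, fields)
  File.open(path, 'a') do |f|
    f.write(fields.join(',') + "\n")
    f.flush # push Ruby's own buffer to the operating system
    f.fsync # ask the operating system to write its cache to disk
  end
end
```

Usage would be one call per measurement, for example `append_point('run42.csv', [Time.now.to_i, 2.5])`, where the file name is a placeholder.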