I'm planning an HDFS system that will host image files (a few MB to 200 MB) for a digital repository (Fedora Commons). I found from another Stack Overflow post that CombineFileInputFormat can be used to create input splits consisting of multiple input files. Can this approach be used for images or PDFs? Inside the map task, I want to process individual files in their entirety, i.e. process each image in the input split separately.
I'm aware of the small files problem, and it will not be an issue for my case.
I want to use CombineFileInputFormat for its benefits: avoiding mapper task setup/cleanup overhead and preserving data locality.
If you want to process images in Hadoop, I can only recommend using HIPI, which should allow you to do what you need.
Otherwise, when you say you want to process individual files in their entirety, I don't think you can do this with conventional input formats, because even with CombineFileInputFormat, you have no guarantee that what's in your split is exactly one image.
Another approach you could consider is to take as input a file containing the URLs/locations of your images (you could put them in Amazon S3, for example), make sure you have as many mappers as images, and then have each map task process an individual image. I did something similar not long ago and it worked fine.
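As a rough sketch of that idea (class names and paths are placeholders, and it assumes the image locations are plain HTTP(S) URLs), you could feed the job a text file with one image location per line via NLineInputFormat, so each map task fetches and processes exactly one whole image:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OneImagePerMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String imageUrl = value.toString().trim();              // one image location per input line
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (InputStream in = new URL(imageUrl).openStream()) {
      byte[] chunk = new byte[8192];
      for (int n; (n = in.read(chunk)) != -1; ) {
        buffer.write(chunk, 0, n);
      }
    }
    byte[] imageBytes = buffer.toByteArray();                // the whole image, in one piece
    // ... your per-image processing goes here ...
    context.write(new Text(imageUrl), new Text("processed " + imageBytes.length + " bytes"));
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "one-image-per-mapper");
    job.setJarByClass(OneImagePerMapper.class);
    job.setMapperClass(OneImagePerMapper.class);
    job.setNumReduceTasks(0);                                // map-only job
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);            // one image per map task
    FileInputFormat.addInputPath(job, new Path(args[0]));    // the list of image locations
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The trade-off is that you pay the task setup cost once per image rather than once per combined split, which is part of what CombineFileInputFormat is meant to avoid, so it's worth measuring with your actual file counts.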
We have a scenario in our project where files come from the client with the same file name, sometimes with the same file size too. Currently, when a file is uploaded, we check the new file name against the existing files in the database, and if there is a match we mark it as a duplicate and do not allow the upload at all. But now we have a requirement to check the content of files that have the same file name. So we need a solution to differentiate such files based on their contents. How do we do that efficiently, while avoiding even the smallest chance of error?
Rails 3.1, Ruby 1.9.3
Below is one option I read in a web reference.
require 'digest'
digest_value = Digest::MD5.base64digest(File.read(file_path))
The above line will read the entire contents of the incoming file and generate a unique hash from it, right? Then we can use that hash for unique file identification. But we have more than 500 users working simultaneously, 24/7, and most of them will be doing this operation. So if the incoming file is large (> 25 MB), the digest will take longer to read the whole contents and we may suffer performance issues. What would be a better solution considering all these facts?
I have read the question and the comments, and I have to say the problem is not stated quite correctly. It seems that what you need is to identify identical content, period, regardless of whether the name and size are equal. Correct me if I am wrong, but you likely don't want to allow users to upload 100 duplicates of the same file just because they have 100 copies of it locally under different names.
So far, so good. I would use the following approach. The file name is not involved at all. The file size can help as a quick uniqueness check (if the sizes differ, the files are definitely different).
Then you can accept the upload with an instant "OK" response. Afterwards, the server should run Digest::MD5 in the background, comparing the file against everything already uploaded. If it is a duplicate, the new copy of the file should be removed, but its name should stay on the filesystem as a symbolic link to the original.
That way you won't frustrate users, who keep the ability to have as many copies of the file as they want under different names, while keeping disk usage as low as possible.
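The idea is language-agnostic; here is a rough sketch of the background check (shown in Java purely for illustration - in Rails you would do the same with Digest::MD5 and File.symlink, and the "known digests" lookup would live in a database table rather than an in-memory map):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Map;

public class DuplicateCheck {

  // Streams the file through MD5 so a 25 MB+ upload never has to be read into memory at once.
  static String md5Of(Path file) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      for (int n; (n = in.read(buf)) != -1; ) {
        md.update(buf, 0, n);
      }
    }
    return Base64.getEncoder().encodeToString(md.digest());
  }

  // knownDigests maps digest -> path of the original copy already on disk.
  static void deduplicate(Path uploaded, Map<String, Path> knownDigests) throws Exception {
    String digest = md5Of(uploaded);
    Path original = knownDigests.get(digest);
    if (original == null) {
      knownDigests.put(digest, uploaded);              // first copy: keep it
    } else {
      Files.delete(uploaded);                          // duplicate: drop the bytes...
      Files.createSymbolicLink(uploaded, original);    // ...but keep the name as a symlink
    }
  }
}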
The new .zif single-file format provided by Zoomify Pro seems to have some performance issues. Compared to the old file structure, the page loads 3 to 4 times slower and over 50% more requests are sent (tested with the same initial image in multiple file formats).
Using the old format is not feasible for our product, and we are stuck with over a minute of load time.
Has anyone encountered this issue, and are there any workarounds? Search results on the internet and the official site don't seem to be of any help.
NOTE: Contacting the vendor hasn't led to anything yet.
Although the official site claims the ZIF format can handle very large images, I'm skeptical because the viewer tries to do everything in JavaScript. Performance depends entirely on the client's machine. Try opening it on a faster machine and see if it improves.
Alternative solution: You could create Deep Zoom Image tiles by using the VIPS library.
More information here:
https://libvips.github.io/libvips/API/current/Making-image-pyramids.md.html
Scroll further down in the article and you'll see this snippet:
With 7.40 and later, you can use --container to set the container type. Normally dzsave will write a tree of directories, but with --container zip you'll get a zip file instead. Use .zip as the directory suffix to turn on zip format automatically:
$ vips dzsave wtc.tif mypyr.zip
to write a zipfile containing the tiles.
Also, checkout this tutorial:
Serve deepzoom images from a zip archive with openseadragon
https://web.archive.org/web/20170310042401/https://literarymachin.es/deepzoom-osd-server/
The community (openseadragon and vips) is much stronger over there so you'll get help when you hit a wall.
If you want to take a break from all of this and just want the images zoomable, you could use a third-party service such as zoomable.ca or zoomo.ca. They're free and user-friendly (upload your image and embed the viewer in your site, much like a Google Map).
ZIF format designer here... ZIF can easily handle monstrous images, up to hundreds of terabytes in size.
Without a server, of course the viewer tries to do everything; it's the only option. As a result, serving ZIF directly from a web server will not be as performant as using an image server. But... you can DO it. With Zoomify tile folders, speed will be faster, but you may have hundreds of thousands or millions of tiles to deal with on the server side, and transfers will be horrendously slow and error-prone.
There are always trade-offs. See zif.photo for the specification.
Is it possible to move a file in GCS after the Dataflow pipeline has finished running? If so, how? Should it be the last .apply? I can't imagine that being the case.
The case here is that we are importing a lot of CSVs from a client. We need to keep those CSVs indefinitely, so we either need to "mark each CSV as already handled", or move them out of the initial folder that TextIO uses to find them. The only thing I can currently think of is storing the file names (I'm not even sure how I'd get them; I'm a Dataflow newbie) in BigQuery, perhaps, and then somehow excluding files that have already been stored from the execution pipeline. But there has to be a better approach.
Is this possible? What should I check out?
Thanks for any help!
You can try using BlockingDataflowPipelineRunner and run arbitrary logic in your main program after p.run() (it will wait for the pipeline to finish).
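A minimal sketch of that pattern, assuming the Dataflow SDK for Java 1.x and the google-cloud-storage client library for the move itself (the bucket name, prefixes, and TextIO glob below are placeholders):

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.BlockingDataflowPipelineRunner;
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class ImportAndArchive {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setRunner(BlockingDataflowPipelineRunner.class);   // p.run() will block until the job finishes

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("gs://my-bucket/incoming/*.csv"));
    // ... the rest of your pipeline ...
    p.run();   // returns only once the pipeline has finished

    // The files are now safe to move out of the incoming folder.
    // GCS has no rename, so a "move" is a copy followed by a delete.
    Storage storage = StorageOptions.getDefaultInstance().getService();
    for (Blob blob : storage.list("my-bucket", Storage.BlobListOption.prefix("incoming/")).iterateAll()) {
      BlobId target = BlobId.of("my-bucket", blob.getName().replaceFirst("^incoming/", "processed/"));
      storage.copy(Storage.CopyRequest.of(blob.getBlobId(), target)).getResult();
      storage.delete(blob.getBlobId());
    }
  }
}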
See Specifying Execution Parameters, specifically the section "Blocking execution".
However, in general, it seems that you really want a continuously running pipeline that watches the directory with CSV files and imports new files as they appear, never importing the same file twice. This would be a great case for a streaming pipeline: you could write a custom UnboundedSource (see also Custom Sources and Sinks) that would watch a directory and return filenames in it (i.e. the T would probably be String or GcsPath):
p.apply(Read.from(new DirectoryWatcherSource(directory)))
.apply(ParDo.of(new ReadCSVFileByName()))
.apply(the rest of your pipeline)
where DirectoryWatcherSource is your UnboundedSource, and ReadCSVFileByName is also a transform you'll need to write that takes a file path and reads it as a CSV file, returning the records in it (unfortunately right now you cannot use transforms like TextIO.Read in the middle of a pipeline, only at the beginning - we're working on fixing this).
It may be somewhat tricky, and as I said we have some features in the works to make it a lot simpler, and we're considering creating a built-in source like that, but it's possible that for now this would still be easier than "pinball jobs". Please give it a try and let us know at dataflow-feedback@google.com if anything is unclear!
Meanwhile, you can also store information about which files you have or haven't processed in Cloud Bigtable - it'd be a better fit for that than BigQuery, because it's more suited for random writes and lookups, while BigQuery is more suited for large bulk writes and queries over the full dataset.
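For example, here is a rough sketch of that bookkeeping using Bigtable's HBase-compatible client (the project, instance, table, and column names below are made up for illustration):

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ProcessedFileLedger implements AutoCloseable {
  private static final byte[] CF = Bytes.toBytes("f");    // single column family
  private final Connection connection;
  private final Table table;

  public ProcessedFileLedger(String projectId, String instanceId) throws IOException {
    connection = BigtableConfiguration.connect(projectId, instanceId);
    table = connection.getTable(TableName.valueOf("processed_files"));   // row key = GCS path
  }

  public boolean alreadyProcessed(String gcsPath) throws IOException {
    return table.exists(new Get(Bytes.toBytes(gcsPath)));   // single-row lookup
  }

  public void markProcessed(String gcsPath) throws IOException {
    Put put = new Put(Bytes.toBytes(gcsPath));
    put.addColumn(CF, Bytes.toBytes("processed_at"), Bytes.toBytes(System.currentTimeMillis()));
    table.put(put);
  }

  @Override
  public void close() throws IOException {
    table.close();
    connection.close();
  }
}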
I am building an ASP.NET website that accepts PDFs as input and processes them. I am generating an intermediate file with a particular name. But I want to know: if multiple users are using the site at the same time, how will the server handle this?
How can I handle this? Will multi-threading do the job? What about the file names of the intermediate files I am generating - how can I make sure they won't overwrite each other? And how do I keep performance up?
Sorry if the question is too basic for you.
I'm not into .NET but it sounds like a generic problem anyways, so here are my two cents.
Like you said, multithreading (since different requests usually run in different threads) takes care of most problems of that kind, as every method invocation involves new objects running in a separate context.
There are exceptions, though:
- Singleton (global) objects where any operation has side effects
- Other shared resources (files, etc.) - this is exactly your case.
So in the case of files, I'd ponder these (mutually exclusive) alternatives:
(1) Never write the uploaded file to disk; instead, hold it in memory and process it there (e.g. as a byte array). In this case you're leveraging the thread-per-request protection. This cannot be applied if your files are really big.
(2) Choose highly randomized names (like UUIDs) and write the files to a temporary location, so their names won't clash if two users upload at the same time (see the sketch after this list).
I'd go with (1) whenever possible.
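For completeness, option (2) boils down to something like the following - a generic sketch in Java, since as noted this isn't really a .NET-specific problem (in .NET you'd reach for Guid.NewGuid() and Path.GetTempPath()):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.UUID;

public class UploadStaging {

  // A shared temp directory; the random names keep concurrent requests from clashing.
  private static final Path STAGING_DIR = Paths.get(System.getProperty("java.io.tmpdir"));

  // Each request writes its upload under its own unique name.
  public static Path stageUpload(InputStream upload) throws IOException {
    Path target = STAGING_DIR.resolve("upload-" + UUID.randomUUID() + ".pdf");
    Files.copy(upload, target, StandardCopyOption.REPLACE_EXISTING);
    return target;   // process it, then delete it when done
  }
}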
I'm writing an MVC application which serves transformed versions of user-uploaded images (rotated, cropped, and watermarked.) Once the transformations are specified, they're not likely to change, so I'd like to aggressively cache the generated output, and serve it as efficiently as possible.
The transformation options are stored in the database and used by the image servers to create the images on demand; only the original uploads are stored permanently. Caching the generated images in a local directory lets IIS 7 serve them without touching the ASP.NET process, since static files in images/ take precedence over the dynamic MVC route /images/{id}.jpg.
My concern at this point is when the user actually changes the transformation options -- the images need to be re-generated, but I'd like to avoid manually deleting them. I'm storing a last-modified field in the database, so I could append that timestamp to the URL, e.g. http://images.example.com/images/153453543.jpg?m=123542345453. This would work if the caching was handled by the ASP.NET process, which could vary the output cache by the parameter m, but seeing as I need to serve large quantities of images I'd rather avoid that.
Is there an intelligent way to get the IIS to discard static files if some condition is met?
If you don't want your ASP.NET code to be invoked every time someone requests an image, then I would recommend deleting the images when updating the transformations. It is a relatively "free" operation, since it's just a cache and they will be regenerated when needed.
You might be concerned about tracking whether the transformation actually changed when the user updates image properties, but how often will the user make changes at all? Does it matter if you regenerate an image a bit too often?
You could include the timestamp in the filename itself, e.g.
http://images.example.com/images/153453543_20091124120059.jpg.
That way you could avoid deleting the images when updating. However, you would leave a trail of old, outdated files...
Why not run the process to generate the physical image whenever those settings are changed, rather than on each request?