I'm putting together a script to find and remove duplicates in a large library of images. At the moment I'm doing a two-pass filter: first finding files of the same size, and then doing a sha256 on a 10240-byte piece of each file to get a fingerprint of the files with the same size (code here).
It works well, but I'm guessing there are probably checksums built in to the jpeg format that I could use instead of doing the sha256.
Does anyone know if there are checksums or other components that could act as checksums / fingerprints? If so, is there an efficient way to access them?
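For reference, the two-pass filter described in the question could look roughly like the sketch below. The asker's actual code is not shown, so the function names (`partial_sha256`, `find_duplicate_candidates`) and the choice to hash the first 10240 bytes (rather than some other piece of the file) are assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 10240  # size of the piece that gets hashed, as in the question


def partial_sha256(path, chunk_size=CHUNK_SIZE):
    """SHA-256 of the first chunk_size bytes of a file (assumed: the leading bytes)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(chunk_size)).hexdigest()


def find_duplicate_candidates(root):
    """Return lists of paths that share both file size and partial digest."""
    # Pass 1: group every file under root by its size in bytes.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: only hash files whose size is shared with at least one other file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[partial_sha256(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

Files in the same group are only candidates: two files can share a size and a leading chunk and still differ further in, so a full comparison is needed before deleting anything.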
I don't think the JPEG specification includes any kind of checksum in the way you're describing.
A JPEG can contain a thumbnail as part of its EXIF metadata, though. It's not a perfect indicator, since it's possible for two different images to have the same thumbnail. There's at least one documented case of a thumbnail not being replaced after the image had undergone substantial modifications, said thumbnail revealing much more than the publisher had intended.
It's been a while since I've dug into the IJG library, but I don't think there's an easy class member or function call you can use there to check for some type of fingerprint. You could use the built-in EXIF tags if you can control the encoding of the images...
I just built a very similar script. I don't want to checksum metadata; I want to see if the actual images are duplicates even if tags have been modified. The best approach for that is not to sort by size, but to sort by the checksum itself. I use jhead to remove the metadata and then checksum the whole file (I also thought about checksumming just part of it, but I don't think it saves much time). jhead doesn't work through pipes and overwrites the file in place, so I just copy the file to shared memory first. I place the checksum in the ImageDescription field for faster retrieval later. This also makes it possible to check image integrity later, which is part of why I checksum the whole thing. Tip: exiv2 is MUCH faster than exiftool for reading and writing metadata when you are making decisions one file at a time.
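The idea of checksumming the image while ignoring its tags can also be sketched without jhead. The function below is not what jhead does internally; it is a simplified stand-in that walks the JPEG segments, skips every APPn (0xE0-0xEF) and COM (0xFE) segment, and hashes the rest. Treating all APPn segments as metadata is an assumption (APP0 and APP14 can affect how colours are interpreted), and the name `image_data_digest` is made up for this example. Fill bytes between segments are not handled.

```python
import hashlib
import struct


def image_data_digest(jpeg_bytes):
    """SHA-256 over a JPEG's segments, leaving out APPn and COM segments."""
    if jpeg_bytes[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    h = hashlib.sha256()
    pos = 2
    while pos + 4 <= len(jpeg_bytes):
        if jpeg_bytes[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = jpeg_bytes[pos + 1]
        if marker == 0xDA:
            # Start of scan: the compressed image data follows, hash it all.
            h.update(jpeg_bytes[pos:])
            return h.hexdigest()
        # Segment length is big-endian and counts its own two bytes.
        (length,) = struct.unpack(">H", jpeg_bytes[pos + 2:pos + 4])
        end = pos + 2 + length
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            h.update(jpeg_bytes[pos:end])
        pos = end
    raise ValueError("no SOS marker found")
```

Because the original file is never rewritten, this avoids the copy-to-shared-memory step, at the cost of being a much cruder filter than a dedicated tool.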
As far as I know, the JPEG standard (ITU-T T.81) has no field or syntax element that holds a checksum or similar for the whole compressed JPEG file, unless a custom application puts such a field in an application segment, or in the metadata segments that the standard provides for.
So for your purpose, what you are doing is one solution.
Another could be some kind of application wrapper that calls a binary file compare utility (like Beyond Compare, or even the Windows command fc /b), checks the result of the comparison, and takes whatever decision you want.
-AD
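If the wrapper is a script anyway, the external compare utility can be replaced by a library call. A minimal sketch in Python, assuming a byte-for-byte comparison is what is wanted (the helper name `files_identical` is made up for this example):

```python
import filecmp


def files_identical(path_a, path_b):
    """True if both files have exactly the same bytes."""
    # shallow=False forces the contents to be read and compared,
    # instead of trusting size and modification time alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

This is roughly the role fc /b would play, and it is a sensible last check on any pair that a size or partial-hash filter has flagged.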
One approach is to reduce all images to a fixed size and store that as a thumbnail. Comparing those same-sized thumbnails then gives you a likelihood that two images are duplicates, which is useful if you have resized or cropped (but not heavily cropped) images and want to find those 'duplicates'.
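One common way to turn the fixed-size thumbnail idea into a comparison is an "average hash". The sketch below is an assumption about how this could be done, not something the answer specifies: it takes an already decoded greyscale image as a list of rows of pixel values (decoding the JPEG is left to whatever library is available), block-averages it down to 8x8, and compares the resulting bit patterns. All function names are made up for this example.

```python
def shrink(pixels, size=8):
    """Block-average a 2D greyscale grid down to size x size."""
    height, width = len(pixels), len(pixels[0])
    out = []
    for ty in range(size):
        y0 = ty * height // size
        y1 = max((ty + 1) * height // size, y0 + 1)  # never an empty block
        row = []
        for tx in range(size):
            x0 = tx * width // size
            x1 = max((tx + 1) * width // size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels, size=8):
    """One bit per thumbnail cell: is the cell brighter than the thumbnail mean?"""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    return [v > mean for v in flat]


def hamming_distance(hash_a, hash_b):
    """Number of differing bits; small values suggest similar images."""
    return sum(a != b for a, b in zip(hash_a, hash_b))
```

A distance of 0 or a few bits marks a pair as worth a closer look; where to put the threshold depends on the library and has to be tuned by hand.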
In the XMP specification there are document ID and version ID which should uniquely identify the version of the image.
The problem with these (and with any other metadata-based identification method) is that they might not be respected by some applications, which can change the content of the JPEG without updating the metadata accordingly.
Is there a command line tool to remove all spot color channels from a vector input image (type can be ai, eps) and keep only the CMYK or RGB color channels?
What I've been able to come up with so far is using Ghostscript's tiffsep device and then recombining the color channel images into one image using ImageMagick's -combine option. The drawback of this method is that it is quite complicated and I end up with a TIFF image instead of the original (vector) format.
'Image' has a defined meaning in PostScript, it means a bitmap, a raster. I think, from the context, that you mean something more general.
The simple answer is no, in general you can't do this, and I don't know of any tool which will.
The reason is that to do so would lose information; the marks defined in Separation or DeviceN space would be lost entirely, and it's generally regarded as a Bad Idea to discard random parts of the document.
Perhaps you could explain what you are trying to achieve with this (ie why are you doing this), and it might be possible to suggest an alternative method.
If you are a competent C programmer you could produce a Ghostscript subclass device using the existing FILTER device (in gdevflt.c) as a template. That device looks at the type of operation, and either passes it on to the output device, or throws it away. It would be reasonably simple to look at the current colour space and discard anything in Separation or DeviceN space. If you then use the pdfwrite/ps2write/eps2write output device you'd get a PDF file, PostScript program or EPS as the output.
Whether you go down this route, continue with what you have, or find an alternative approach, there are a couple of things you need to think about; how do you plan to tackle Separation inks with process colour names ? Eg /Separation /Black. What about DeviceN spaces where some of the inks are process colours ? Eg a duotone Black and Pantone ink. Should these be preserved or discarded ?
Your current approach will use the parts of the object which mark process plates, but not those which mark spot colours, which could give some very peculiar results.
[EDIT]
PDF, PostScript and EPS don't have 'layers' (PDF has a feature, Optional Content, which uses the term 'layers' as a description in the specification but that's all).
An application such as Photoshop and Illustrator can have layers, but in general what they export to has to have those 'layers' converted into something else. That 'something else' depends on what you are saving it as.
Part of the problem is that you are apparently trying to deal with 3 different kinds of input, you say Illustrator (PDF, more or less), Photoshop (raster image) and EPS (PostScript). There is little common ground between the 3, is there a reason to support all of them ?
If you are content to stick with just Illustrator you might be able to do something with Optional Content. I'm not terribly familiar with modern versions of Illustrator, but wouldn't it be simpler to save two versions of the file, one with the answer layer and one without ?
Anyway, Ghostscript can honour Optional Content, so if you can save a PDF file (not PostScript or EPS) from Illustrator, it may be that the layers will persist into the PDF as Optional Content. I suspect they will going by a quick Google. In that case you might be able to run the file through Ghostscript, telling it not to honour the Optional Content portion, and get a PDF file without it present.
Another solution (again limited to PDF) would be to open the PDF file with an editing application such as Acrobat Pro, and simply delete the bits you don't want. Deletion of that kind is relatively reliable.
It still feels like rather a long-winded way to get a PDF file with some of the content removed though. I can't help feeling that just saving two versions from the creating application would be easier.
I have a JPEG image stored in memory as a blob and am looking to apply some basic transformations to it (e.g. resize, convert to greyscale, rotate etc.)
I am currently using Google Scripts which doesn't have a native image library as far as I can tell.
Are there standard algorithms or similar which would allow me to work with the raw binary array, knowing it represents a JPEG image, to achieve such a transformation?
Not the answer you are looking for I guess, but...
To be able to do image processing using JPEG files as input, you need to decode the images. Well, actually, 90/180/270 degree rotation, flipping and cropping is possible as lossless operations, and thus without decoding the image data. But for anything more advanced, like resizing, you need to work with a decoded image.
Both the file structure (JIF/JFIF) and algorithms used to compress the image data in standard JPEG format are well defined and properly documented. But at the same time, the specification is quite complex. It's certainly doable if you have the time and know what you are doing. And if you are lucky, and your JPEG blobs are all written the same way, you might get away with implementing only some of the spec. But even then, you will need to (re-)implement large parts of the spec, and it might just not be worth it.
Using a 3rd party service to convert it for you, or create your own using a known library, like libjpeg or Java's ImageIO, etc. might be your best bets, if you need a quick solution, and don't have too strict requirements for performance.
There are no straightforward image processing capabilities available in Apps Script. You'll either have to expose your Python code as a web service and call it from Apps Script, use the Drive REST API to access the files from your Python app, or use some other web service API.
GAE Python has image processing capabilities; see the URL below:
https://developers.google.com/appengine/docs/python/images/
Available image transformations
The Images service can resize, rotate, flip, and crop images, and enhance photographs. It can also composite multiple images into a single image.
I've asked similar questions before, but have not received a definitive answer. Seems that there must be a way to simply add/modify metadata to an image without loading the image into memory, without having to deal with directly reading bits.
Seems like ways exist when using CMSampleBufferRefs, but I need to be able to do this with a regular image already saved to disk.
For instance, given a very large png at /Documents/photo.png, I want to modify its exif metadata without having to load that image.
You can use libexif - I've had success with compiling it for iOS before. With libexif, you can modify any image's EXIF metadata.
If you know how to modify the EXIF, you can modify the binary data directly in the file. Just replace the relevant binary portion of the image with the new one.
I don't know if Objective-C permits this, but in ANSI C it should be simple. The complicated part is identifying the exact part to change.
How can I get the summary information for file images in Delphi?
You will have to use parsers for each file type. The simple solution would be to use something like GraphicEx and to load each supported image into a temporary object, extract the information you want then dispose of it.
For EXIF information (the information attached by modern day cameras) you might want to use a different component. I know there are a few components floating around that will give you access to this special format, however this data is not supported by all image types and is normally seen in JPEG files.
You don't need anything.
It's included in Windows (Win32 COM)
See the Win32 API group (classic question...)
I want to distribute a few images and not allow others to see them unless they are using my program. My intention will be to use JPG files in which I will alter the header so other image viewers cannot read them anymore. For example I can delete the bytes 7-10 which are the magic signature for JPG. Later, my program will reconstruct the header and show the JPG file.
Question: how do I do this on the fly, without reading the “broken” JPG file, restoring the header, saving the good file to disk and then re-loading it as a “good” JPG file?
Load the "broken" file into a TMemoryStream, patch the bytes in-memory, and use TGraphic.LoadFromStream() to load the fixed JPG file.
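The answer is about Delphi (TMemoryStream and TGraphic.LoadFromStream()); the same patch-in-memory idea is sketched here in Python for illustration only. It assumes a JFIF file whose deleted bytes 7-10 were the ASCII signature "JFIF", as the question suggests; the constant and function names are made up for this example.

```python
import io

SIGNATURE = b"JFIF"   # assumed content of the deleted bytes 7-10
SIGNATURE_OFFSET = 6  # byte 7 when counting from 1


def restore_header(broken):
    """Re-insert the removed signature bytes into the in-memory data."""
    return broken[:SIGNATURE_OFFSET] + SIGNATURE + broken[SIGNATURE_OFFSET:]


def open_repaired(path):
    """Read the 'broken' file and return a file-like object holding a valid JPEG."""
    with open(path, "rb") as f:
        # The repaired bytes only ever exist in memory, never on disk.
        return io.BytesIO(restore_header(f.read()))
```

The returned stream can be handed to any image loader that accepts a file-like object, which plays the same role as LoadFromStream() in the Delphi answer.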
Encrypt them. Load the encrypted image, decrypt in memory and then do the loadfromstream like mghie suggested.
Why not just encrypt the images with a private key and distribute your public key to the people you want to view the images? much easier to distribute a public key than writing some custom software and distributing that. Don't forget; anything displayed on screen can be screen captured. The fact a custom-mangled JPEG can only be displayed with your app is no protection. Also don't forget; people can simply distribute your software with the mangled image.
Mghie's answer is about as good as you'll find, but it's not likely to be too effective. If someone wants to look at your images and they know anything about image formats, they'll open it in a hex editor and most likely recognize what they see as a JPEG with the magic header removed.
If you really want to keep someone from viewing your images, construct your own image format, (it's not as hard as it sounds, really,) and put as little metadata in as possible, and then hope that works. Or encrypt them, or put them into an archive, (construct your own archive format for best results,) and hope that works.
Thing is, ultimately, anything that's encoded has to be decoded before it can be shown, and any sufficiently-talented hacker can trace their way through your decoding routine and figure out how it works. Why are you trying to hide things from your users anyway?
You could make it more difficult for them by byte packing the images as encrypted resources. But like anything else, if they have access to the files they could get the images out. It just depends on how much effort they are willing to put in.
Depending on how secure you need it to be, you could do something as simple as changing the file extension to one that is only opened by your application. This will only work if the images you are changing aren't super secret.