Jackrabbit: how are indexes created and updated? How are the index files updated? - sling

We are using Jackrabbit Oak with MongoDB as the document store and S3 as the blob store. The content we are currently storing is mostly text and very few images, yet the size of the S3 bucket is around 80 GB whereas the size of MongoDB is only ~10 GB. I have Lucene indexes enabled. I assume Oak stores the indexes and revision files in S3. Can someone explain how this works? What is the 80 GB of content that Oak is creating in S3?
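For context on the Lucene part of the setup: in Oak, a Lucene index is itself defined as content under /oak:index and is built asynchronously by a background indexer. Below is a rough sketch of such a definition through the JCR API; the index name, indexed property, and exact node structure are illustrative and vary by Oak version and compatVersion, so treat it as a sketch rather than a drop-in definition.

```java
import javax.jcr.Node;
import javax.jcr.Session;

public class LuceneIndexSetup {

    // Sketch only: defines an async Lucene index on a "title" property.
    public static void defineIndex(Session session) throws Exception {
        Node oakIndex = session.getRootNode().getNode("oak:index");

        Node lucene = oakIndex.addNode("myLucene", "oak:QueryIndexDefinition");
        lucene.setProperty("type", "lucene");
        lucene.setProperty("async", "async");      // built by the background async indexer
        lucene.setProperty("compatVersion", 2L);

        Node rule = lucene.addNode("indexRules", "nt:unstructured")
                          .addNode("nt:base");
        Node title = rule.addNode("properties", "nt:unstructured")
                         .addNode("title");
        title.setProperty("name", "title");
        title.setProperty("propertyIndex", true);

        session.save();                            // the definition is persisted as ordinary content
    }
}
```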

Related

Is Dataflow making use of Google Cloud Storage's gzip transcoding?

I am trying to process JSON files (10 GB uncompressed/2 GB compressed) and I want to optimize my pipeline.
According to the official docs, Google Cloud Storage (GCS) has the option to transcode gzip files, which means the application receives them uncompressed when they are tagged correctly.
Google Cloud Dataflow (GCDF) has better parallelism when dealing with uncompressed files, so I was wondering if setting the meta tag on GCS has a positive effect on performance?
Since my input files are relatively large, does it make sense to unzip them so that Dataflow splits them into smaller chunks?
You should not use this meta tag. It's dangerous, as GCS would report the size of your file incorrectly (it reports the compressed size, while Dataflow/Beam would read the uncompressed data).
In any case, splitting relies on reading different segments of an uncompressed file in parallel, and this is not possible if the file is stored compressed.
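To make the trade-off concrete, here is a small sketch using the Apache Beam Java SDK (the question predates Beam and used the Dataflow 1.x SDK, so the method names here are the current Beam ones; bucket paths are placeholders). Reading plain files lets the runner split each file into ranges, while gzip files are read as single unsplittable units.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadJsonLines {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Uncompressed files: the runner can split each file into byte ranges
    // and read them in parallel across workers.
    PCollection<String> uncompressed =
        p.apply("ReadUncompressed",
            TextIO.read().from("gs://my-bucket/input/*.json"));

    // Gzip files: reading works, but each .gz file is a single unsplittable
    // unit, so parallelism is limited to one worker per file.
    PCollection<String> compressed =
        p.apply("ReadGzip",
            TextIO.read()
                .from("gs://my-bucket/input-gz/*.json.gz")
                .withCompression(Compression.GZIP));

    p.run().waitUntilFinish();
  }
}
```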

Where to store created PDF files on ASP .NET MVC site [duplicate]

Possible Duplicate:
storing uploaded photos and documents - filesystem vs database blob
I am starting to develop a web app, the primary purpose of which is to display photos. The users will be able to upload photos as well.
The first question that came up was where to store the photos: on the file system or the database.
I will be using a Windows box to host the site. The database is MySQL and the backend code is in C# utilizing ASP.NET MVC.
Filesystem, of course, unless you're aiming for a story on thedailywtf. The easiest way is to have the photos organized by a property you can derive from the file itself, such as its SHA-1 hash. Then just store the hash in the database, attached to the photo's primary key and other attributes (who uploaded it, upload date, etc).
It's also a good idea to divvy up the photos on the filesystem, so you don't end up with millions of files in a single directory. So you'll have something like this:
storage/00/e4/f56c0de1c61fdb926e79e8a0a65bd12930c9.jpg
storage/25/9a/ec1c55bfb660548a6770238668c4b117d92f.jpg
storage/5d/d5/4b01d98f17a9ad9dd1526b49ba39b5aa37a1.jpg
storage/63/49/6f740b6c284ce6685dc17d473a7360ace249.jpg
storage/b1/75/066d178188dde110149a8422ab651b0ee615.jpg
storage/b1/20/a2b7d02b7b0c43530677ab06235382a37e20.jpg
storage/da/39/a3ee5e6b4b0d3255bfef95601890afd80709.jpg
This is also easy to port if you ever move to sharded storage.
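The original question is about C#/ASP.NET MVC, but the layout above is language-neutral; here is a minimal sketch in Java of deriving the sharded path from the photo's SHA-1 (the .jpg extension and the storage/ base directory are just the assumptions from the example above).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Hypothetical helper: derive the sharded storage path from the photo's SHA-1 hash.
public final class PhotoPaths {

    public static Path storagePath(byte[] photoBytes) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(photoBytes);

        // Hex-encode the 20-byte digest into 40 hex characters.
        StringBuilder hex = new StringBuilder(40);
        for (byte b : digest) {
            hex.append(String.format("%02x", b & 0xff));
        }
        String hash = hex.toString();

        // storage/<chars 0-1>/<chars 2-3>/<remaining 36 chars>.jpg
        return Paths.get("storage",
                hash.substring(0, 2),
                hash.substring(2, 4),
                hash.substring(4) + ".jpg");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(storagePath("example photo bytes".getBytes(StandardCharsets.UTF_8)));
    }
}
```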
If you are using SQL Server 2008 there's a Filestream datatype that handles most of the problems mentioned about the DB getting larger. It handles all the annoying details of synchronizing between the filesystem and the table.
Look here for a blog post about the topic: Store any data in SQL Server 2008 (Katmai)
If you're building a website around photos then forget the database. If it becomes popular, your database is going to be hit hard and the majority of its time will be spent delivering photos. Databases also don't scale very well for this. There are many more advantages to keeping photos on the file system, and you can scale very well by having static content servers and using services for content delivery.
Also, Amazon S3 or other cloud providers do have their advantages. For instance S3 + Amazon CloudFront will provide good performance. CloudFront caches your files on servers around the world so they'll be very easily/fast accessible from anywhere. BUT if we're talking pictures and the site becomes popular your bills might be quite high.
For S3, Amazon charges for storage and for transfer in and out of the cloud.
For CloudFront, per transfer.
Generally, people store binary data such as images on the filesystem, not the database. They reference the filesystem path from the database. Retrieving BLOBs (binary large objects) from the database is slower than allowing the web server to serve static files from the file system.
I would use something like Amazon S3.
But, if the choice is between filesystem and database, I would choose the filesystem because it's faster to serve images from a filesystem than from a database.
The only reason I would put photos as BLOBs in a database would be if I had a cluster of servers, and I was using database replication to automatically copy the photos to every machine in the cluster.
Life is much simpler if you just store the photos as files, and store the filenames of the photos in the database. If you need to create unique filenames for the photos, you can use a primary key integer from the database as part of the filename. But you could also just use a hash of the photo itself, as suggested by John Milliken. That's simple, and simple is better.
Some people point out that it's easier to manage if everything's in the database, including making backups and preserving referential integrity.
If you store it in the db, the db will grow quickly and will be much, much larger. It is just a touch more complicated to get an image out of the db for display than it is to get it from a file system. On the other hand, you had better make sure that the file names and paths do not get out of sync with what is stored in the db. In the past I have chosen to store on disk instead of in the db. It made it easier for me to move the database to different boxes. It worked out well.
We had a similar decision to make for a project I am on. The compelling thing about jamming stuff (images and other BLOBy things) into the DB is that it is less likely that someone might delete/alter something (either intentionally or unintentionally). But that isn't the choice we made. Instead we store the path info in the DB and use it to reference the data via a UNC path. Data paths are stored in two parts: a part that references the location of the data relative to the machine it resides on, and a part that points to which machine that group of data is on. When we need to move data around, we can update the appropriate path info.
It is certainly quicker to get at the data without pulling it out of the DB. Ultimately that was a major deciding factor.
It makes life so easy when you have a blob database: you can forget about the nightmare that is file system management.
EDIT: All you really need is a table with an ID column and a VARBINARY column for the file contents.
From experience this is an efficient way to manage binary files. You have one database that holds only binary files. How can this be any harder to back up?
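As a rough sketch of that setup, assuming a hypothetical Files table with an identity ID column and a VARBINARY(MAX) Data column (the connection URL, credentials, and Microsoft JDBC driver are assumptions), storing and reading a file with plain JDBC might look like this:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlobStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=Photos", "user", "password")) {

            // Insert: stream the file into the VARBINARY(MAX) column.
            try (InputStream in = new FileInputStream("photo.jpg");
                 PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO Files (Data) VALUES (?)")) {
                insert.setBinaryStream(1, in);
                insert.executeUpdate();
            }

            // Read back by ID and hand the bytes to whatever serves the image.
            try (PreparedStatement select = conn.prepareStatement(
                    "SELECT Data FROM Files WHERE ID = ?")) {
                select.setInt(1, 1);
                try (ResultSet rs = select.executeQuery()) {
                    if (rs.next()) {
                        byte[] data = rs.getBytes("Data");
                        // serve or write the bytes...
                    }
                }
            }
        }
    }
}
```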

Using EC2 to resize images stored on S3 on demand

We need to serve the same image in a number of possible sizes in our app. The library consists of tens of thousands of images which will be stored on S3, so storing the same image in all its possible sizes does not seem ideal. I have seen a few mentions on Google that EC2 could be used to resize S3 images on the fly, but I am struggling to find more information. Could anyone please point me in the direction of some more info or, ideally, some code samples?
Tip
It was not obvious to us at first, but never serve images to an app or website directly from S3; it is highly recommended to use CloudFront instead. There are 3 reasons:
Cost - CloudFront is cheaper
Performance - CloudFront is faster
Reliability - S3 will occasionally not serve a resource when queried frequently i.e. more than 10-20 times a second. This took us ages to debug as resources would randomly not be available.
The above are not necessarily failings of S3, as it's meant to be a storage service, not a content delivery service.
Why not store all image sizes, assuming you aren't talking about hundreds of different possible sizes? Storage cost is minimal. You would also then be able to serve your images up through Cloudfront (or directly from S3) such that you don't have to use your application server to resize images on the fly. If you serve a lot of these images, the amount of processing cost you save (i.e. CPU cycles, memory requirements, etc.) by not having to dynamically resize images and process image requests in your web server would likely easily offset the storage cost.
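To make that concrete, here is a rough sketch of the "store every size up front" approach: generating a fixed set of renditions at upload time with plain javax.imageio. The width list, file names, and the S3 key scheme mentioned in the comments are assumptions, not anything from the question.

```java
import java.awt.Graphics2D;
import java.awt.Image;
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class Renditions {

    // Hypothetical set of widths to pre-generate at upload time.
    static final int[] WIDTHS = {100, 400, 800};

    public static void main(String[] args) throws Exception {
        BufferedImage original = ImageIO.read(new File("photo.jpg"));

        for (int width : WIDTHS) {
            int height = original.getHeight() * width / original.getWidth();

            // Scale the original down to the target width, preserving aspect ratio.
            BufferedImage resized = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            Graphics2D g = resized.createGraphics();
            g.drawImage(original.getScaledInstance(width, height, Image.SCALE_SMOOTH), 0, 0, null);
            g.dispose();

            // Write one rendition per width; in practice each file would be uploaded
            // to S3 under a size-specific key (e.g. photos/800/photo.jpg) and served
            // via CloudFront or directly from S3.
            ImageIO.write(resized, "jpg", new File("photo_" + width + ".jpg"));
        }
    }
}
```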
What you need is an image server. Yes, it can be hosted on EC2. These links should help you get started: https://github.com/adamdbradley/foresight.js/wiki/Server-Resizing-Images
http://en.wikipedia.org/wiki/Image_server

Cropping and resizing images on the fly with node.js

I run a node.js server on Amazon EC2. I am getting a huge csv file with data containing links to product images on a remote host. I want to crop and store the images in different sizes on Amazon S3.
How could this be done, preferably just with streams, without saving anything to disk?
I don't think you can get around saving the full-size image to disk temporarily, since resizing/cropping/etc would normally require having the full image file. So, I say ImageMagick.
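For illustration, here is a sketch of that ImageMagick approach, written in Java to match the other sketches on this page (the question itself is about node.js, where the same convert command could be spawned with child_process); the file names and the 200x200 geometry are placeholders.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Resize {
    public static void main(String[] args) throws IOException, InterruptedException {
        // The full-size image has to exist on disk (at least as a temp file)
        // before ImageMagick can work on it.
        Path input = Paths.get("downloaded-original.jpg"); // fetched from the remote host
        Path output = Files.createTempFile("resized-", ".jpg");

        // Center-crop and resize to 200x200 using ImageMagick's convert.
        Process convert = new ProcessBuilder(
                "convert", input.toString(),
                "-resize", "200x200^",
                "-gravity", "center",
                "-extent", "200x200",
                output.toString())
            .inheritIO()
            .start();

        if (convert.waitFor() != 0) {
            throw new IOException("convert exited with a non-zero status");
        }
        // The resized file can now be uploaded to S3 and the temp files deleted.
    }
}
```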

Cannot upload files bigger than 8GB to Amazon S3 by multi-part upload Java API due to broken pipe

I implemented S3 multi-part upload in Java, both high level and low level version, based on the sample code from
http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?HLuploadFileJava.html and http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?llJavaUploadFile.html
When I uploaded files smaller than 4 GB, the upload process completed without any problem. When I uploaded a file of 13 GB, the code started throwing IO exceptions (broken pipe). After multiple retries, it still failed.
Here is how to reproduce the scenario with the 1.1.7.1 release:
create a new bucket in US standard region
create a large EC2 instance as the client to upload file
create a file of 13GB in size on the EC2 instance.
run the sample code from either the high-level or the low-level API S3 documentation page on the EC2 instance
test any of the three part sizes: the default part size (5 MB), 100,000,000 bytes, or 200,000,000 bytes.
So far the problem shows up consistently. I ran a tcpdump; it appeared the HTTP server (S3 side) kept resetting the TCP stream, which caused the client side to throw an IO exception (broken pipe) after the uploaded byte count exceeded 8 GB. Has anyone had similar experiences when uploading large files to S3 using multi-part upload?
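For reference, the high-level upload described in the question looks roughly like this with the AWS SDK for Java 1.x TransferManager. The bucket, key, credentials, and the ~100 MB part size are placeholders, the exact configuration setters vary between SDK versions, and this only shows the setup being discussed, not a fix for the broken-pipe failures.

```java
import java.io.File;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerConfiguration;
import com.amazonaws.services.s3.transfer.Upload;

public class MultipartUpload {
    public static void main(String[] args) throws Exception {
        TransferManager tm = new TransferManager(
                new BasicAWSCredentials("accessKey", "secretKey"));

        // Raise the part size so a 13 GB file is uploaded in fewer, larger parts
        // (the default part size is 5 MB).
        TransferManagerConfiguration config = new TransferManagerConfiguration();
        config.setMinimumUploadPartSize(100000000L); // ~100 MB parts
        tm.setConfiguration(config);

        // TransferManager performs the multipart upload in the background.
        Upload upload = tm.upload("my-bucket", "big-file.bin", new File("/data/big-file.bin"));
        upload.waitForCompletion();

        tm.shutdownNow();
    }
}
```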
