Restrictions for keys in Couchbase - Membase

I found some info about limitations for documents in Couchbase (/thread/key-length) saying that the maximum length of a key is 250 bytes, but I couldn't find any official confirmation.
Can someone confirm the maximum length of a key for Couchbase document?
What are other limitations for keys, and what are good practices for them?
What about indexes (keys for map functions)?
My use case is that I want to store documents identified by URL. The straightforward solution is to key the documents by URL. Assuming some URLs are longer than 250 bytes, I need to choose another key, e.g. md5(url), and store the URL as an element of the document. Is this a good model for Couchbase?

Yes, there is a 250-byte restriction on key names in Couchbase Server. Your idea to hash the URLs should work great.
The bytes that make up a key also need to be legal UTF-8 (you can store and retrieve non-string keys, but they won't participate in the full set of Couchbase features, like views and cross-datacenter replication).
Keys in map/reduce views must be UTF-8 and are limited to 65 KB in length.
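For illustration, here is a minimal Python sketch of the hashing approach (the hexadecimal md5 digest is 32 ASCII characters, so it is valid UTF-8 and well under the 250-byte limit); the upsert call at the end stands in for whatever Couchbase SDK method you use to store the document:
import hashlib

url = "https://example.com/some/very/long/path?with=many&query=parameters"
key = hashlib.md5(url.encode("utf-8")).hexdigest()        # 32-char ASCII key
doc = {"url": url}                                        # keep the full URL inside the document

# with the Couchbase Python SDK this would be something like:
# collection.upsert(key, doc)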

This isn't a good model for Couchbase. The reason is that Couchbase is meant to be accessed by application servers and not the end user. If you set up Couchbase with an open port then there is nothing stopping someone from modifying or deleting all of the data in your database.

Related

What is the best way to save images?

The common method of storing images in a database is to convert the image to base64 before storing it. This process increases the size by 33%. Alternatively, it is possible to store the image directly as a BLOB; for example:
$mysqli = new mysqli("localhost", "user", "pass", "db"); // connection details are placeholders
$image  = new Imagick("image.jpg");
$data   = $image->getImageBlob();
$data   = $mysqli->real_escape_string($data);
$mysqli->query("INSERT INTO images (data) VALUES ('$data')");
and then display the image with
echo '<img src="data:image/jpeg;base64,' . base64_encode($data) . '" />';
With the latter method, we save about 1/3 of the storage space. Why, then, is it more common to store images as base64 in MySQL databases?
UPDATE: There are many debates about the advantages and disadvantages of storing images in databases, and most people believe it is not a practical approach. Anyway, here I assume we are storing images in the database and am discussing the best method to do so.
I contend that images (files) are NOT usually stored in a database base64 encoded. Instead, they are stored in their raw binary form in a binary column, blob column, or file.
Base64 is only used as a transport mechanism, not for storage. For example, you can embed a base64 encoded image into an XML document or an email message.
Base64 is also stream friendly. You can encode and decode on the fly (without knowing the total size of the data).
While base64 is fine for transport, do not store your images base64 encoded.
Base64 provides no checksum or anything of any value for storage.
Base64 encoding increases the storage requirement by 33% over a raw binary format. It also increases the amount of data that must be read from persistent storage, which is still generally the largest bottleneck in computing. It's generally faster to read fewer bytes and encode them on the fly. Only if your system is CPU-bound rather than I/O-bound, and you regularly output the image in base64, should you consider storing it in base64.
Inline images (base64 encoded images embedded in HTML) are a bottleneck themselves--you're sending 33% more data over the wire, and doing it serially (the web browser has to wait on the inline images before it can finish downloading the page HTML).
On MySQL, and perhaps similar databases, for performance reasons, you might wish to store very small images in binary format in BINARY or VARBINARY columns so that they are on the same page as the primary key, as opposed to BLOB columns, which are always stored on a separate page and sometimes force the use of temporary tables.
If you still wish to store images base64 encoded, please, whatever you do, make sure you don't store base64 encoded data in a UTF8 column then index it.
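As a sketch of the "store raw binary" advice (using Python and the mysql-connector-python driver purely as an example stack; the table and column names are placeholders), a parameterized insert avoids both the base64 step and manual escaping:
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="user",
                               password="secret", database="mydb")
cur = conn.cursor()

with open("image.jpg", "rb") as f:
    raw = f.read()                                    # raw bytes, no base64 step

cur.execute("INSERT INTO images (data) VALUES (%s)", (raw,))  # driver handles binary safely
conn.commit()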
Pro base64: the encoded representation you handle is a pretty safe string. It contains neither control chars nor quotes. The latter point helps against SQL injection attempts. I wouldn't expect any problem with just adding the value to a "hand-coded" SQL query string.
Pro BLOB: the database manager software knows what type of data it has to expect, and it can optimize for that. If you store base64 in a TEXT field, it might try to build an index or other data structure for it, which would be really nice and useful for "real" text data but pointless and a waste of time and space for image data. And the BLOB is the smaller representation in terms of number of bytes.
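The 33% overhead is easy to verify; a quick Python sketch (the 300 KB of random bytes stands in for an image):
import base64, os

raw = os.urandom(300 * 1024)            # pretend this is a 300 KB image
encoded = base64.b64encode(raw)
print(len(raw), len(encoded))           # ~307200 vs ~409600 bytes
print(len(encoded) / len(raw))          # ~1.33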
I just want to give one example of why we decided to store images in the DB rather than as files or on a CDN: storing images of signatures.
We tried doing it via CDN, cloud storage, and files, and finally decided to store them in the DB. We are happy with that decision, as it proved right later when we moved, upgraded our scripts, and migrated the sites several times.
In our case, we wanted the signatures to stay with the records that belong to the author of the documents.
Storing them as files risks them going missing or being deleted by accident.
We stored them as binary BLOBs in MySQL, and later as base64-encoded images in a text field. The decision to switch to base64 was due to the smaller resulting size for some reason, and faster loading; the BLOB was slowing down the page load for some reason.
In our case, this solution of storing signature images in the DB (whether as BLOB or base64) was driven by:
Most signature images are very small.
We don't need to index the signature images stored in DB.
Index is done on the primary key.
We may have to move or switch servers; moving physical image files to different servers can cause images not to be found because links change.
It is embarrassing to ask the authors to re-sign their signatures.
It is more secure to keep them in the DB than to expose them as files that can be downloaded if security is compromised. Storing them in the DB gives us better control over access.
For any future migrations or changes of web design, hosting, or servers, we have zero worries about reconciling the signature file names against the physical files; it is all in the DB!
I recommend looking at modern NoSQL databases, and I also agree with user1252434's post. For instance, I am storing a few <500 KB PNGs as base64 in my MongoDB with binary set to true, with no performance hit at all. MongoDB can also be used to store large files like 10 MB videos, which can offer huge time-saving advantages in metadata searches for those videos; see storing large objects and files in MongoDB.
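For reference, storing the raw bytes (rather than base64) in MongoDB is straightforward with the official Python driver; a minimal sketch, assuming a local server and a hypothetical images collection (documents are capped at 16 MB, beyond which GridFS is the usual route):
from pymongo import MongoClient
from bson.binary import Binary

db = MongoClient()["mydb"]

with open("signature.png", "rb") as f:
    db.images.insert_one({"name": "signature.png",
                          "data": Binary(f.read())})   # stored as raw BSON binary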

How much data can a column of a Mnesia table store?

How much data can a column of a Mnesia table store? Is there any limit on it, or can we store as much as we want? Any pointers? (The table is a disc_only_copy.)
As with any potentially large data set (in terms of total entries, not total volume of bytes) the real question isn't how much you can cram into a single table, but how you want to partition the data and how unified or distinct those partitions should appear to the system.
In the context of a chat system, for example, you may want to be able to save the chat history forever, which is a reasonable goal. But you may not want all chat entries to be in the same table forever and ever (10 years? how long? who knows!) right next to chat entries made yesterday. You may also discover as time moves on that storing every chat message in a single table was a painfully naive decision that has to be overcome later on down the road.
So this brings up the issue of partitioning. How do you want to do it? (Staying within the context of a chat system, but easily transferable to other problems...) By time? By channel? By user? By time and channel?
How do you want to locate the data later? This brings up obvious answers that are the same as above: By time? By channel? By user? By time and channel?
This issue exists whether you're dealing with Mnesia or with Postgres -- or any database -- when you're contemplating the storage of lots of entries. So think about your problem in the context of how you want to partition the data.
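As the answer says, the issue is the same in Mnesia, Postgres, or any other store; a language-agnostic sketch of routing data by channel and time (shown here in Python, with the names invented for illustration) might look like:
from datetime import datetime

def partition_name(channel: str, when: datetime) -> str:
    # one table (or partition) per channel per month, e.g. 'chat_history_lobby_2014_06'
    return "chat_history_{}_{:04d}_{:02d}".format(channel, when.year, when.month)

# writes and reads both route through the same function,
# so "where do I put it?" and "where do I find it?" stay in sync
print(partition_name("lobby", datetime(2014, 6, 15)))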
The second issue is the volume of the data in bytes, and the most natural representation of that data. Considering basic chat data, it's not that hard to imagine simply plugging everything into the database. But if it's a chat system that can have large files attached to a message, I would probably want to have those files stored as what they are (files) somewhere in a system made for that (like a file system!) and store only a reference to them in the database. If I were creating a movie archive I would certainly feel comfortable using Mnesia to store titles, actors, years, and a pointer (URL or file system path) to the movie, but I wouldn't dream of storing movie file data in my database, even if I was using Postgres (which can actually stand up to that sort of abuse... but think about the new awkwardness of database dumps and backups, and the massive bottleneck introduced in the form of everyone's download/upload speed being whatever the core service's bandwidth to the database backend is!).
In addition to these issues, you want to think about how the data backend will interface with the rest of the system. What is the API you wish you could use? Write it now and think it through to see if it's silly. Once it seems perfect, go back through critically and toss out any elements you don't have an immediate need to actually use right now.
So, that gives us:
Partition scheme
Context of future queries
Volume of data in bytes
Natural state of the different elements of data you want to store
Interface to the overall system you wish you could use
When you start wondering how much data you can put into a database these are the questions you have to start asking yourself.
Now that all that's been written, here is a question that discusses Mnesia in terms of entries, bytes, and how many bytes different types of entries might represent: What is the storage capacity of a Mnesia database?
Mnesia started as an in-memory database. That means it is not designed to store large amounts of data. When you ask yourself this question, it means you should look at another ejabberd backend.

DynamoDB auto incremented ID & server time (iOS SDK)

Is there an option in DynamoDB to store an auto-incremented ID as the primary key in tables? I also need to store the server time in tables for "created at" fields (e.g., user created at). But I can't find any way to get the server time from DynamoDB or any other AWS service.
Can you guys help me with:
Working with auto-incremented IDs in DynamoDB tables
Storing server time in tables for "created at"-like fields.
Thanks.
Actually, there are very few features in DynamoDB, and this is precisely its main strength: simplicity.
There is no way to automatically generate IDs or UUIDs.
There is no way to auto-generate a date.
For the "date" problem, it should be easy to generate it on the client side. May I suggest you use the ISO 8601 date format? It's both programmer- and computer-friendly.
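Generating the ISO 8601 timestamp client-side is a one-liner in most languages; for example, in Python:
from datetime import datetime, timezone

created_at = datetime.now(timezone.utc).isoformat()
# e.g. '2014-05-18T10:30:00.123456+00:00' -- sorts lexicographically in time order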
Most of the time, there is a better way than using automatic IDs for Items. This is often a bad habit taken from the SQL or MongoDB world. For instance, an e-mail or a login will make a perfect ID for a user. But I know there are specific cases where IDs might be useful.
In these cases, you need to build your own system. In this SO answer and this article from the DynamoDB-Mapper documentation, I explain how to do it. I hope it helps.
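One common way to build such a system (not necessarily the one in the linked answer) is an atomic counter item updated with an ADD expression; a minimal boto3 sketch, where the table and attribute names are assumptions:
import boto3

counters = boto3.resource("dynamodb").Table("counters")   # hypothetical counters table

def next_id(counter_name: str = "user_id") -> int:
    resp = counters.update_item(
        Key={"name": counter_name},
        UpdateExpression="ADD #v :inc",                    # atomic increment
        ExpressionAttributeNames={"#v": "current_value"},
        ExpressionAttributeValues={":inc": 1},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["current_value"])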
Rather than working with auto-incremented IDs, consider working with GUIDs. You get higher theoretical throughput and better failure handling, and the only thing you lose is the natural time-order, which is better handled by dates.
Higher throughput because you don't need to ask Dynamo to generate the next available IDs (which would require some resource somewhere obtaining a lock, getting some numbers, and making sure nothing else gets those numbers). Better failure handling comes when you lose your connection to Dynamo (Dynamo goes down, or you are bursty and your application is doing more work than the currently provisioned throughput allows). A write-only application can continue "working" and generating data complete with IDs, queueing it up to be written to Dynamo, and never worry about ID collisions.
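A minimal sketch of the GUID approach with boto3 (the table and attribute names are invented for illustration); the ID is generated entirely client-side, so no round trip or lock is needed:
import uuid
from datetime import datetime, timezone
import boto3

users = boto3.resource("dynamodb").Table("users")          # hypothetical table

users.put_item(Item={
    "id": str(uuid.uuid4()),                                # client-generated GUID partition key
    "created_at": datetime.now(timezone.utc).isoformat(),   # time ordering handled by the date
    "name": "example user",
})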
I've created a small web service just for this purpose. See this blog post, that explains how I'm using stateful.co with DynamoDB in order to simulate auto-increment functionality: http://www.yegor256.com/2014/05/18/cloud-autoincrement-counters.html
Basically, you register an atomic counter at stateful.co and increment it every time you need a new value, through RESTful API.

Methods of reducing URL size?

So, we have a very large and complex website that requires a lot of state information to be placed in the URL. Most of the time, this is just peachy and the app works well. However, there are (an increasing number of) instances where the URL length gets reaaaaallllly long. This causes huge problems in IE because of the URL length restriction.
I'm wondering, what strategies/methods have people used to reduce the length of their URLs? Specifically, I'd just need to reduce certain parameters in the URL, maybe not the entire thing.
In the past, we've pushed some of this state data into session... however this decreases addressability in our application (which is really important). So, any strategy which can maintain addressability would be favored.
Thanks!
Edit: To answer some questions and clarify a little, most of our parameters aren't an issue... however some of them are dynamically generated with the possibility of being very long. These parameters can contain anything legal in a URL (meaning they aren't just numbers or just letters, could be anything). Case sensitivity may or may not matter.
Also, ideally we could convert these to POST, however due to the immense architectural changes required for that, I don't think that is really possible.
If you don't want to store that data in the session scope, you can:
Send the data as a POST parameter (in a hidden field), so data will be sent in the HTTP request body instead of the URL
Store the data in a database and pass a key (that gives you access to the corresponding database record) back and forth, which raises some scalability and maybe security issues. I suppose this approach is similar to using the session scope.
most of our parameters aren't an issue... however some of them are dynamically generated with the possibility of being very long
I don't see a way to get around this if you want to keep full state info in the URL without resorting to storing data in the session, or permanently on server side.
You might save a few bytes using some compression algorithm, but it will make the URLs unreadable, most algorithms are not very efficient on small strings, and compressing does not produce predictable results.
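To make the trade-off concrete, here is a small Python sketch of compressing a query string into a URL-safe token; note that for short strings the token can even end up longer than the original:
import base64, zlib

state = "filter=price&min=10&max=500&sort=asc&view=grid"
token = base64.urlsafe_b64encode(zlib.compress(state.encode())).decode()
restored = zlib.decompress(base64.urlsafe_b64decode(token)).decode()

print(len(state), len(token))   # compare sizes; savings are unpredictable
assert restored == state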
The only other ideas that come to mind are
Shortening parameter names (query => q, page=> p...) might save a few bytes
If the parameter order is very static, using mod_rewrite-style directory structures (/url/param1/param2/param3) may save a few bytes because you don't need to include parameter names
Whatever data is repetitive and can be "shortened" into numeric IDs or shorter identifiers (like place names of company branches, product names, ...), keep in an internal, global, permanent lookup table (London => 1, Paris => 2, ...)
Other than that, I think storing data on the server side, identified by a random key as #Guido already suggests, is the only real way. The upside is that you have no size limit at all: a URL like
example.com/?key=A23H7230sJFC
can "contain" as much information on server side as you want.
The downside, of course, is that in order for these URLs to work reliably, you'll have to keep the data on your server indefinitely. It's like having your own little URL-shortening service... Whether that is an attractive option will depend on the overall situation.
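A minimal sketch of that idea in Python (the in-memory dict stands in for a persistent database table, and the token length is arbitrary):
import secrets

state_store = {}                          # stand-in for a persistent table: key -> state

def shorten(params: dict) -> str:
    key = secrets.token_urlsafe(9)        # short random key, e.g. 'A23H7230sJFC'-style
    state_store[key] = params
    return key

def resolve(key: str) -> dict:
    return state_store[key]

key = shorten({"query": "widgets", "page": 7, "filters": ["red", "large"]})
print("example.com/?key=" + key, resolve(key))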
I think that's pretty much it!
One option, which is good when they really are navigable parameters, is to work these parameters into the first section of the URL, e.g.
http://example.site.com/ViewPerson.xx?PersonID=123
=>
http://example.site.com/View/Person/123/
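The example URLs look ASP.NET-flavoured; purely as an illustration of the same idea in a Python/Flask sketch (route and function names invented), path segments replace the query parameter:
from flask import Flask

app = Flask(__name__)

@app.route("/View/Person/<int:person_id>")
def view_person(person_id: int):
    # person_id arrives as part of the path instead of ?PersonID=123
    return "Viewing person {}".format(person_id)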
If the data in the URL is automatically generated can't you just generate it again when needed?
With little information it is hard to think of a solution but I'd start by researching what RESTful architectures do in terms of using hypermedia (i.e. links) to keep state. REST in Practice (http://tinyurl.com/287r6wk) is a very good book on this very topic.
Not sure what application you are using. I have had the same problem and I use a couple of solutions (ASP.NET):
Use Server.Transfer and HttpContext (PreviousPage in .Net 2+) to get access to a public property of the source page which holds the data.
Use Server.Transfer along with a hidden field in the source page.
Use compression on the query string.

What is the difference between file and random access file?

What is the difference between a file and a random access file?
A random access file is a file where you can "jump" to anywhere within it without having to read sequentially until the position you are interested in.
For example, say you have a 1MB file, and you are interested in 5 bytes that start after 100k of data. A random access file will allow you to "jump" to the 100k-th position in one operation. A non-random access file will require you to read 100k bytes first, and only then read the data you're interested in.
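In Python terms, that example looks roughly like this (the file name is hypothetical):
with open("large.bin", "rb") as f:
    f.seek(100 * 1024)      # jump straight to the 100 KB offset, no sequential reading required
    chunk = f.read(5)       # read only the 5 bytes of interest

print(chunk)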
Hope that helps.
Clarification: this description is language-agnostic and does not relate to any specific file wrapper in any specific language/framework.
Almost nothing these days. There used to be a time in certain operating systems where there were different types of files - some of which could be accessed randomly (at any point in the file) and others which could only be accessed sequentially. This made more sense when you were using a sequential medium such as tape. Any file system worth its salt these days only supports random access.
