I have to pass a large URL query string, and when it exceeds a certain number of characters it causes problems when passed in the URL.
So far I have tried deflate + Base64 encoding, which gives me around 30-35% compression.
So if my query string becomes too large, say 4400 characters, it will be compressed to approximately 2650 characters, which still won't fit in my URL.
I need a solution that gives better results than this one.
I have searched a lot, but have not been able to find a better solution.
Any suggestions on what else could be done would be appreciated. Thanks.
Example of my query string:
3d7821d1-e324-4cea-9bd7-763c0b62cdc2|94db7bdb-5e16-4700-a1f9-408ba7f7bee1|63360a17-0807-45a0-a798-31eb2614b0f7|9b37f302-2757-40e5-b9b4-390e5b786010|46ef6bce-c7e9-47d6-90d8-bc7c2b5784c0|e5f450a5-724b-42a0-aff9-34be2d50f59b|33db4e6b-bc53-4774-8267-759167a8dba9|30a8c7a9-0a3b-4df3-ab01-5e9b262d1902|d31086bb-98e8-41d0-a6cf-0bd48986bce7|30f27de5-1536-483a-85aa-6eb5000ba67b|41498746-3f45-4c16-9152-a6ca8355d502|6b5c643b-03f6-4390-9d54-79bf978f8e15|4537e3ba-09ed-465a-aad8-1c842084c3af|ad1161ab-0393-4a66-a538-6dda0c7b892a.....
The current solution (deflate + Base64) does not completely solve my issue, but it improves the situation, so I have integrated it into my code.
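For reference, the approach I integrated looks roughly like this (a minimal sketch; the function name is mine, and it assumes the query string is ASCII):

    import base64
    import zlib

    def compress_for_url(query: str) -> str:
        # Raw DEFLATE (wbits=-15 omits the zlib header and checksum),
        # then URL-safe Base64 so the result can be embedded in a URL.
        deflater = zlib.compressobj(9, zlib.DEFLATED, -15)
        packed = deflater.compress(query.encode("ascii")) + deflater.flush()
        return base64.urlsafe_b64encode(packed).rstrip(b"=").decode("ascii")

Random hex carries only 4 bits of entropy per character, so DEFLATE can at best roughly halve the text, and Base64 then adds a third back; that lands almost exactly at the 30-35% I am seeing.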
For future work, I am thinking about:
Converting the request to POST
OR
Taking sequential IDs (1, 2, 3, ...) instead of UUIDs
(the example query string shows that it is a concatenation of UUIDs)
and concatenating those, and passing them in a GET request.
(A third option is sketched below.)
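Another idea, sketched here as an assumption of mine rather than something tested: since each UUID is 128 bits printed as 36 hex-and-dash characters, I could send the raw 16 bytes per UUID and Base64-encode those, with no compressor at all:

    import base64
    import uuid

    def pack_uuids(query: str) -> str:
        # 36 text characters per UUID become 16 raw bytes each; URL-safe
        # Base64 then costs ~1.33 characters per byte.
        raw = b"".join(uuid.UUID(u).bytes for u in query.split("|"))
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

    def unpack_uuids(token: str) -> list[str]:
        # Restore Base64 padding, then cut the byte string back into UUIDs.
        raw = base64.urlsafe_b64decode(token + "=" * (-len(token) % 4))
        return [str(uuid.UUID(bytes=raw[i:i + 16])) for i in range(0, len(raw), 16)]

For the 4400-character example (about 119 pipe-separated UUIDs) this yields roughly 2540 characters: only marginally better than deflate + Base64, because random hex has no further redundancy to squeeze out. That is why POST or sequential IDs look like the real fix.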
Related
Take this hash for example:
ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad
It's too long for my purposes so I intend to use a small chunk from it, such as:
ba7816bf8f01
ba7816bf
Or similar. My intended use case:
Video gallery on a website, represented by thumbnails. They are in random order.
They play in the lightbox. They don't have a unique ID, only their URL is unique.
While the lightbox is open I add something to the end of the page URL with JS History API.
//example.com/video-gallery/lightbox/ba7816bf8f01
The suffix needs to be short and simple, definitely not a URL.
People share the URL.
The server can make sense of the lightbox/ba7816bf8f01 in relation to /video-gallery.
Visiting the URL, the lightbox needs to find which video the suffix belongs to and play it.
I thought I'd SHA256 the URL of the video, use the first few characters as an ad-hoc ID. How many characters should I use from the generated hash, to considerably reduce the chance of collision?
I got the idea from URLs and Hashing by Google.
The Wikipedia page on birthday attacks has a table with the number of entries you need to produce a certain chance of collision with a certain number of bits as a random identifier. If you want to have a one in a million chance of a collision and expect to store a million documents, for example, you’ll need fewer than 64 bits (16 hex characters).
Base64 is a good way to fit more bits into the same length of string compared to hex, too, taking 1⅓ characters per byte instead of 2.
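A quick sketch of the truncation idea in Python (8 bytes is just one row of that table, and the URL is a made-up example):

    import base64
    import hashlib

    def short_video_id(video_url: str, n_bytes: int = 8) -> str:
        # 8 bytes = 64 bits. By the birthday bound p ~ n^2 / 2^(bits+1),
        # a million URLs collide with probability about 10^12 / 2^65 ~ 3e-8.
        digest = hashlib.sha256(video_url.encode("utf-8")).digest()
        # Hex would be digest[:n_bytes].hex() -> 16 characters;
        # URL-safe Base64 fits the same 64 bits into 11.
        return base64.urlsafe_b64encode(digest[:n_bytes]).rstrip(b"=").decode("ascii")

    print(short_video_id("https://example.com/videos/clip.mp4"))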
I'm downloading a CSV from Google Docs and in it characters like “ are saved as \xE2\x80\x9C and ” are saved as \xE2\x80\x9D.
My question is... what charset are those being saved in? How might I go about figuring that out?
It is UTF-8. You can tell by decoding it as UTF-8: it yields the correct characters.
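You can check this in a Python shell:

    >>> b"\xe2\x80\x9c".decode("utf-8")
    '“'
    >>> b"\xe2\x80\x9d".decode("utf-8")
    '”'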
UTF-8 also has a unique and very distinctive byte pattern: just 3 bytes with the high bit set forming a valid UTF-8 sequence are enough to tell that something is UTF-8 with 99% confidence, and even 2 such bytes already get you to 90%.
If it weren't UTF-8 and were some 8-bit code page instead, it would be impossible to tell just by looking at the bytes alone. Without any other information, you would basically have to brute-force it by decoding in various 8-bit encodings and seeing whether the result looks correct. The other possibility is an algorithm that goes through the encodings automatically and checks whether the result makes sense in any language.
With more information, such as which operating system and locale the file was saved under, you could greatly reduce the number of encodings to try, though.
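A rough sketch of that brute-force fallback (the candidate list is an assumption; most 8-bit codecs accept nearly any byte, so anything past strict UTF-8 is only "plausible", not "correct"):

    def guess_encoding(raw: bytes) -> str:
        # UTF-8 first, because a strict decode fails fast on invalid
        # sequences; cp1252 rejects only a handful of bytes, and
        # latin-1 never fails, so order matters here.
        for encoding in ("utf-8", "cp1252", "latin-1"):
            try:
                raw.decode(encoding)
                return encoding
            except UnicodeDecodeError:
                continue
        return "unknown"

    print(guess_encoding(b"\xe2\x80\x9cquoted\xe2\x80\x9d"))  # -> utf-8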
I am trying to verify an OAuth signature generated in code against a "known reputable source". All my steps are verified correct except the last, in which a 'base signature string' is HMAC-SHA1 hashed with a secret key and then Base64-encoded.
I have confirmed that my hash value is the one the algorithm expects; I then found that my Base64 encoding was not. To work out why my encoding failed, I wanted to check the encoder I was using.
Here is the (hash) string being base64 encoded:
203ebb13a65cccaae5cb1b9d5af51fe41f534357
Here is the Base64 encoding that results from my code:
MjAzZWJiMTNhNjVjY2NhYWU1Y2IxYjlkNWFmNTFmZTQxZjUzNDM1Nw==
According to http://www.motobit.com/util/base64-decoder-encoder.asp, that is the correct result.
But according to http://www.online-convert.com/result/096d7b00138f3726daee5f6ddb107a62 (provided with the secret and base string, not the hash), a different Base64 string should have been output. Note that its hash output is my correct hash despite the difference in Base64.
Finally, the "official" tester (http://hueniverse.com/oauth/guide/authentication/) outputs a third, different Base64 string from the same hash.
I have no idea what I'm doing wrong, and the fact that these tools output different results makes me wonder whether there really is a single thing called Base64 encoding, or whether they are actually using different algorithms. Perhaps the fact that it's for OAuth will help you help me identify the answer.
Thanks for any leads from the wise.
OK, in this case the first website was making the same "mistake" I was (in my case it was a mistake; the first website may just be making an unstated assumption).
That mistake is whether the hash is interpreted as a string (which gets Base64-encoded as ASCII text) or as a series of hexadecimal values which get decoded to raw bytes and then Base64-encoded. In the former case the resulting encoding is longer than the original string, while in the latter it is shorter. This is not only empirically true; the interwebs show that it is one of the concepts behind the standard in the first place.
The second website, working from (as stated) "hex" data, got the correct answer.
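To make the difference concrete (the key and base string below are placeholders, not my real OAuth inputs):

    import base64
    import hashlib
    import hmac

    hex_hash = "203ebb13a65cccaae5cb1b9d5af51fe41f534357"

    # Interpretation 1: Base64 of the 40-character ASCII string (56 chars out).
    print(base64.b64encode(hex_hash.encode("ascii")).decode())
    # MjAzZWJiMTNhNjVjY2NhYWU1Y2IxYjlkNWFmNTFmZTQxZjUzNDM1Nw==

    # Interpretation 2: Base64 of the 20 raw bytes the hex spells (28 chars out).
    print(base64.b64encode(bytes.fromhex(hex_hash)).decode())
    # ID67E6ZczKrlyxudWvUf5B9TQ1c=

    # For OAuth, never detour through hex at all: Base64 the raw HMAC-SHA1 digest.
    signature = base64.b64encode(
        hmac.new(b"secret", b"base_string", hashlib.sha1).digest()
    )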
Try checking via https://base64-encode.org.
On this website you can convert all types of images to a Base64 string.
I've seen a number of QR codes that contain a URL but also have some extra text after it. Something like:
http://www.example.com Thanks for scanning this QR code!
I've experimented with using a number of different delimiting methods (several spaces, a question mark, two dashes, one or two returns) and all work to varying degrees on various scanning programs.
Some respect the space character, others respect the return. Some think the URL isn't a URL at all when I use a return. Long story short, it's all over the map how the various scanning programs (NeoReader, iNigma, Qrafter, Beetag, OptiScan, etc.) treat characters after a URL.
Is there any consensus on whether (a) this is even a good idea and (b) if so, what the 'correct' (best-practice) way to do it is? (I know I should go read the RFC for the exact definition of a URL, but since the reader programs are all over the map, I suspect they didn't read it either.)
You can make it work by converting the text message into a valid URL, while trying to keep it readable.
In your case it can be:
http://www.example.com?Thanks_for_scanning_this_QR_code
It's not perfect, but it can also help on the web-analytics side to distinguish QR-code users.
Spaces are definitely not part of a URL, so in that sense a space should delimit the end of one.
Taken as a whole, though, the entire string is not a URL. So yes, it's asking for trouble.
As you've found, the empirical answer is that not every reader does what you want. Barcode Scanner, for instance, understands the split here but does not prompt the user to launch the browser, since the payload isn't a URL per se.
So: it's a bad idea.
In the 90s there was a toy called Barcode Battler. It scanned barcodes and, from the values, generated an RPG-like monster with various stats such as hit points, attack power, magic power, etc. Could there be a way to do a similar thing with a URL, generating stats like that from just an ordinary URL? I was thinking of maybe taking the ASCII values of various characters in the URL and using them, but that seems too predictable and obvious.
Take the MD5 sum of the ASCII encoding of the URL? Incredibly easy to do on most platforms. That would give you 128 bits to come up with the stats from. If you want more, use a longer hash algorithm.
(I can't remember the details about what's allowed in a URL - if non-ASCII is allowed, you could use UTF-8 instead.)
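A minimal sketch of that (stat names are from the question; the byte slices and 1-100 ranges are arbitrary choices of mine):

    import hashlib

    def monster_stats(url: str) -> dict:
        # 128-bit MD5 digest of the URL's UTF-8 bytes (UTF-8 also covers
        # the non-ASCII case above); carve fixed slices into 1-100 stats.
        # The slight modulo bias is fine for a toy.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        return {
            "hit_points":   1 + int.from_bytes(digest[0:2], "big") % 100,
            "attack_power": 1 + int.from_bytes(digest[2:4], "big") % 100,
            "magic_power":  1 + int.from_bytes(digest[4:6], "big") % 100,
        }

    print(monster_stats("http://www.example.com"))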