I have a web application in which the user can configure reports (ASP.NET MVC, no Reporting Services or anything). The configuration is then represented as a JavaScript object which is sent to the server as JSON to retrieve data and actually generate the report. The submission HTML looks similar to this:
<form method="post" action="/RenderReport" target="_blank">
<input type="hidden" name="reportJson"/>
</form>
This works very well for opening the report in a new browser window. However, in this report I want to include images that are generated from the data. How can this be done in a good way? The obvious ways that come to mind are:
Embed the metadata necessary to generate the images in the URL, like <img src="/GenerateImage/?metadata1=2&metadata2=4"/>. This won't work, however, since the metadata is very likely to make the URL exceed the 2,083-character maximum in IE.
Use an ajax POST request, and then when the response comes back, create an image element like <img src="data:image/png;base64,{data_in_json_response}"/>. This is not possible, though, since my application has to work in IE6, which doesn't support data URIs.
Generate the images while generating the report, creating a unique key for each image, and then use URLs of the form <img src="/GetCachedImage?id=23497abc289"/>. This is my current best idea, but it does raise the issue of where to cache the images. The places I can think of are:
In the session. Advantage: the cached item is automatically deleted when the session is abandoned. Disadvantage: accessing the session serializes requests to the page within a session. This is bad in my case (but see the note after this list).
In the database. Advantage: works well. Disadvantage: unnecessary overhead, and the cached items must be deleted at some point.
In the Application / Cache object. I haven't really thought through all advantages and disadvantages of this one.
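For what it's worth, on ASP.NET MVC 3 or later the session-serialization problem could presumably be reduced by marking the image-serving controller's session access as read-only, so concurrent requests from the same session aren't forced to queue. A rough sketch (the controller name is just a placeholder):

using System.Web.Mvc;
using System.Web.SessionState;

// Read-only session access lets concurrent requests from the same session
// run in parallel instead of being serialized on the session lock.
[SessionState(SessionStateBehavior.ReadOnly)]
public class CachedImageController : Controller
{
    // Serves /GetCachedImage?id=... (route mapping omitted)
    public ActionResult GetCachedImage(string id)
    {
        var bytes = Session[id] as byte[];   // reading the session is still allowed
        if (bytes == null)
            return HttpNotFound();           // MVC 3+ helper
        return File(bytes, "image/png");
    }
}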
It also raises the question of when to delete the cached items. If I delete them right after the image is shown, it seems the page can't be refreshed or printed without the images turning into broken-image placeholders (red Xes). Every other option means extra complexity.
How can this problem be solved in a good way, or at least one that isn't bad?
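For reference, option 3 would presumably look something like the following sketch, using the ASP.NET Cache object with a sliding expiration (the class name and expiration value are placeholders; HttpNotFound() assumes MVC 3 or later):

using System;
using System.Web;
using System.Web.Caching;
using System.Web.Mvc;

public class ReportImageController : Controller
{
    // Called while the report is rendered: cache the PNG bytes under a fresh key.
    public static string CacheImage(byte[] pngBytes)
    {
        string key = Guid.NewGuid().ToString("N");
        HttpRuntime.Cache.Insert(
            key,
            pngBytes,
            null,                              // no cache dependency
            Cache.NoAbsoluteExpiration,
            TimeSpan.FromMinutes(30));         // sliding expiration handles cleanup
        return key;
    }

    // Serves /GetCachedImage?id=... (route mapping omitted)
    public ActionResult GetCachedImage(string id)
    {
        var bytes = HttpRuntime.Cache[id] as byte[];
        if (bytes == null)
            return HttpNotFound();             // expired or evicted under memory pressure
        return File(bytes, "image/png");
    }
}

The sliding expiration would at least answer the "when to delete" question, at the cost of the images disappearing if the page is refreshed or printed much later.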
You can implement a rotating disk cache of images rather easily; search for "ASP.NET image resizing module": the source code includes a disk-caching module with a configurable size.
However,
If the report is HTML and contains image references, you have no way of knowing how long that report will be hanging around. Those images may be needed forever: say someone copies and pastes the report into an e-mail; those links will stick around, and suddenly break when the cache is cleared.
If you only have a single server, you could use a hybrid approach. Create a Dictionary<string, object> of cached image definitions, where the string key is the ID value from your example and the object is the collection of parameters you need to create the image. Then you can just make a request for yourserver/generate/image/123456 and return the appropriate content type.
This wouldn't work in a server farm unless you have some way to share the "object" that represents your parameters. You will still have to clean up this dictionary somehow or risk it growing without bound.
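A rough sketch of that hybrid approach (assuming .NET 4 for ConcurrentDictionary and MVC 3+ for HttpNotFound(); ImageParameters and RenderChart are made-up placeholders):

using System;
using System.Collections.Concurrent;
using System.Web.Mvc;

public class ImageParameters              // whatever data the chart/image needs
{
    public int Metadata1 { get; set; }
    public int Metadata2 { get; set; }
}

public static class PendingImages
{
    // Shared, thread-safe map of image id -> parameters (single-server only).
    private static readonly ConcurrentDictionary<string, ImageParameters> Map =
        new ConcurrentDictionary<string, ImageParameters>();

    // Called while the report is built: returns the id to embed in <img src="...">.
    public static string Register(ImageParameters parameters)
    {
        string id = Guid.NewGuid().ToString("N");
        Map[id] = parameters;
        return id;
    }

    public static bool TryGet(string id, out ImageParameters parameters)
    {
        return Map.TryGetValue(id, out parameters);
    }
}

public class ImageController : Controller
{
    // e.g. GET /Image/Generate/23497abc289 with the default route (adjust routing to taste)
    public ActionResult Generate(string id)
    {
        ImageParameters parameters;
        if (!PendingImages.TryGet(id, out parameters))
            return HttpNotFound();

        byte[] png = RenderChart(parameters);    // hypothetical rendering helper
        return File(png, "image/png");
    }

    private static byte[] RenderChart(ImageParameters parameters)
    {
        // ...render the PNG from the parameters (System.Drawing, a chart library, etc.)...
        throw new NotImplementedException();
    }
}

Remember that the dictionary is per-process, so an app-pool recycle empties it; that's part of the cleanup problem mentioned above.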
When looking at how websites such as Facebook store profile images, the URLs seem to use randomly generated values. For example, Google's Facebook page's profile picture has the following URL:
https://scontent-lhr3-1.xx.fbcdn.net/hprofile-xft1/v/t1.0-1/p160x160/11990418_442606765926870_215300303224956260_n.png?oh=28cb5dd4717b7174eed44ca5279a2e37&oe=579938A8
However, why not just organise it like so:
https://scontent-lhr3-1.xx.fbcdn.net/{{ profile_id }}/50x50.png
Clearly this would be much easier in terms of storage and simplicity. Am I missing something? Thanks.
Companies like Facebook run fairly intense CDNs. The URLs may look randomly generated, but they aren't; each individual route is deliberate and programmed to be handled in that manner.
They aren't after simplicity of storage the way you would be if you were just using FTP to connect to a basic marketing website server. While you might put all your images in an /images folder, Facebook is much too complex for that: dozens of different types of applications access hundreds if not thousands of CDN nodes and servers worldwide.
If you ever build a web app, such as a Ruby on Rails app, and work with a service such as AWS (Amazon Web Services), you'll also encounter what seem like nonsensical URLs. But it's all part of the fast delivery network provided within the architecture. Every time you "push" your app up to the server, new URLs are generated automatically for each unique resource: CSS files, JavaScript files, image files, and so on, all dynamically created. You don't have to type in each of these unique URLs individually every time you publish the app; the code simply knows where to look for them as part of the publishing process.
Example: you tell the web app to look for
//= require jquery
and it returns you http://example.com/assets/jquery-eb3e278249152b5b5d5170b73d9dbf52.js?body=1 in your header.
It doesn't matter that the URL is more complex than it needs to be; the application recognizes it, and that's all that matters.
Simply put, I think it boils down to two main reasons, security and cache:
Security - adding these long, unpredictable hashes prevents others from guessing photo URLs and makes it pretty hard to download photos you aren't supposed to see.
Consider what would happen if I could easily guess your profile photo URL and download it, even when you explicitly chose to share it only with friends.
Cache - by adding "random" query params to each photo, you make sure each photo instance gets its own URL. Thus you can store the photo in the browser's cache for a long time, knowing that whenever you replace it with a new one, the new photo will have a fresh URL and the browser won't keep showing you the old photo.
If you were to keep the same URL for each user's profile photo (e.g. https://scontent-lhr3-1.xx.fbcdn.net/{{ profile_id }}/50x50.png) and then upload a new photo, one of these two things would happen:
If you stored the photo in the browser's cache for a long time, the browser would keep showing you the cached version (as long as the URL is the same and the cache hasn't expired, there's no need to re-download the image).
If, instead, you only kept the image in the cache for a short period of time, you would end up hitting your server much more than actually needed, increasing the load and hurting performance.
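To make the cache point concrete, here is a rough sketch (the helper name and CDN host are made up) of deriving a version token from the photo bytes, so a newly uploaded photo automatically gets a new URL while the old one can safely be cached for a long time:

using System;
using System.Security.Cryptography;

public static class PhotoUrls
{
    // Hypothetical helper: the version token changes whenever the bytes change,
    // so browsers can cache the old URL aggressively and still see new uploads.
    public static string BuildPhotoUrl(long profileId, byte[] photoBytes)
    {
        using (var md5 = MD5.Create())
        {
            string version = BitConverter.ToString(md5.ComputeHash(photoBytes))
                                         .Replace("-", string.Empty)
                                         .ToLowerInvariant();
            return string.Format("https://cdn.example.com/{0}/50x50.png?v={1}",
                                 profileId, version);
        }
    }
}

Any stable fingerprint works here; the point is only that the URL changes exactly when the content does.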
I hope this clarifies it.
With your route scheme, how would you prevent strangers from accessing the pictures of a private account? The hash also prevents bots from downloading all the pictures.
I get your pain :-) Rather than dwell further on how this problem arises, let me speak of a solution. It is normal for hashed or base64-encoded values to look like a mess when you deal with them in general code, but with an identifier alongside to explain what each one is, the confusion doesn't last long!
I used to work at a company where we collated Facebook posts: using the Graph API we fetched each post's Insights object and extracted information from it, so it could be passed around easily within the UI and sent back to our Redis cache store. Once we had defined a data structure in TaffyDB for how an object would be organized, everything made sense, thanks to its ability to query the useful, finite pieces out of a long, junk-looking stream of minified JavaScript.
Refer: http://www.taffydb.com/
The extra values in the URL are useful to:
Track access. This is like when a newspaper appends "&homepage" vs. "&email" to an article URL, so their system knows how a reader found the page.
Avoid abuse and control access. Imagine that a user uploaded a small, popular pornographic image as a profile image. They could then hijack the CDN as a free web host for their porn site. But that code is used internally by the CDN to limit the number of views.
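As a rough illustration of that access-limiting idea (an assumed scheme for illustration only, not Facebook's actual one), the CDN could embed an expiry timestamp and an HMAC in the URL and verify both before serving the image:

using System;
using System.Security.Cryptography;
using System.Text;

// Assumed scheme: the image URL carries an expiry timestamp ("oe") and an HMAC
// token ("oh"); the edge server recomputes the token before serving the file,
// so guessed or expired URLs are rejected.
public static class SignedImageUrls
{
    private static readonly byte[] Secret = Encoding.UTF8.GetBytes("shared-cdn-secret"); // placeholder key

    public static string ComputeToken(string path, long expiresUnix)
    {
        using (var hmac = new HMACSHA256(Secret))
        {
            byte[] hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(path + "|" + expiresUnix));
            return BitConverter.ToString(hash).Replace("-", string.Empty).ToLowerInvariant();
        }
    }

    public static string Sign(string path, DateTimeOffset expires)
    {
        long expiresUnix = expires.ToUnixTimeSeconds();
        return string.Format("{0}?oe={1}&oh={2}", path, expiresUnix, ComputeToken(path, expiresUnix));
    }

    // The edge server refuses expired or tampered URLs.
    public static bool IsValid(string path, long expiresUnix, string token)
    {
        return DateTimeOffset.UtcNow.ToUnixTimeSeconds() <= expiresUnix
            && string.Equals(token, ComputeToken(path, expiresUnix), StringComparison.Ordinal);
    }
}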
I am working on an ASP.NET MVC app with Knockout. It's a single-page app, and I have a feature to upload an image. I am not sure what the best option is. The problem I have is that the session is not sticky, which means there is no guarantee that a request will go to the same box. The options I have tried for the image upload are:
1.) Data URI - I have created a custom Knockout binding for image upload which posts a form to the MVC controller; the controller converts the image to a base64 string, and I set the response on a viewmodel property in JS and bind it to an img tag. While this seemed to be the best solution, I have to support IE8, and it didn't work there because IE8 has limitations on data URIs.
2.) Storing the image in a temp folder on the application server - since sticky sessions are not available, this won't work reliably.
3.) Storing the image in the session - this seems non-performant, as the session would end up hogging memory.
Is there any other approach?
Ad 1) This option avoids the session problems, but you will have a really big transfer overhead (every image is sent to the client and back multiple times).
Ad 2) You can store the image in the temp folder in a unique subfolder (for example, one with a GUID name) and send only that GUID to the client. Of course you will have to provide something that cleans this folder up from time to time, but that is quite easy.
Ad 3) If you store images in an in-memory session, your web server will quite quickly "blow up" with an out-of-memory exception.
So in my opinion the best option is 2, because:
You save on transfer, so your site will work faster for the client
It is quite easy to implement
It is easy to manage.
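A minimal sketch of option 2 (controller, routes, and folder are made up; in a web farm the folder would have to live on storage every server can reach, since requests aren't sticky):

using System;
using System.IO;
using System.Linq;
using System.Web;
using System.Web.Hosting;
using System.Web.Mvc;

public class ImageUploadController : Controller
{
    // Root for temporary uploads; the location here is just an assumption.
    private static readonly string UploadRoot =
        HostingEnvironment.MapPath("~/App_Data/TempUploads");

    // POST /ImageUpload/Save -> returns the GUID the client keeps in its viewmodel.
    [HttpPost]
    public ActionResult Save(HttpPostedFileBase image)
    {
        string id = Guid.NewGuid().ToString("N");
        string folder = Path.Combine(UploadRoot, id);
        Directory.CreateDirectory(folder);
        image.SaveAs(Path.Combine(folder, "upload" + Path.GetExtension(image.FileName)));
        return Json(new { id });
    }

    // GET /ImageUpload/Get?id=... -> streams the image back for an <img src="...">.
    public ActionResult Get(string id)
    {
        Guid parsed;
        if (!Guid.TryParse(id, out parsed))        // reject anything that isn't a GUID
            return HttpNotFound();

        string folder = Path.Combine(UploadRoot, parsed.ToString("N"));
        string file = Directory.Exists(folder)
            ? Directory.GetFiles(folder).FirstOrDefault()
            : null;

        if (file == null)
            return HttpNotFound();
        return File(file, "image/png");            // content type simplified for the sketch
    }
}

A scheduled job (or a check on each upload) can then delete subfolders older than some cutoff to keep the folder from growing.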
Ok, I'm building a PoC for a mobile application that needs to have offline capabilities, and I have several questions about whether I'm designing the application correctly and also what behavior I will get from the cache manifest.
This question is about including URLs of Controller actions in both the CACHE section of the manifest as well as in the NETWORK section.
I believe I've read some conflicting information online about this. On a few sites I read that including the wildcard in the NETWORK section would make the browser try to retrieve everything from the server when it's online, and just use whatever is cached if there is no internet connection.
However, this morning I read the following on Dive Into HTML5: Let's Take This Offline:
The line marked NETWORK: is the beginning of the “online whitelist” section. Resources in this section are never cached and are not available offline. (Attempting to load them while offline will result in an error.)
So, which information is correct? How would the application behave if I added the URL for a controller action in both the CACHE and the NETWORK sections?
I have a very simple and small PoC working so far, and this is what I've observed regarding this question:
I have a controller action that just generates 4 random numbers and sets them on the ViewBag, and the View will display them on a UL.
I'm not using Output caching at all. The only caching comes from the manifest file.
Before adding the manifest attribute to my Layout.cshtml's html tag, each time I requested the View, I'd get different random numbers every time, and a breakpoint set on the controller action would be hit.
The first time I requested the URL/View after adding the manifest attribute, the breakpoint on the controller was hit 3 times (as opposed to just once before). This is already weird and I'll post a separate question about it; I'm just writing it here for reference.
After the manifest and the resources are cached (verified by looking at the Console window in Chrome Dev Tools), every time I request the View/URL I get the cached version and the breakpoint is never hit again.
This behavior makes me believe that whatever is in the CACHE section will override or ignore anything that is on the NETWORK section, but like I said (and the reason I'm asking here) is because I'm new to working with this and I'm not sure if this is how it's supposed to work or if I'm missing something or not using it correctly.
Any help is greatly appreciated
Here's the relevant section of the cache.manifest:
CACHE MANIFEST
#V1.0
CACHE:
/
/Content/Site.css
/Content/themes/base/jquery-ui.css
NETWORK:
*
/
FALLBACK:
As it turns out, HTML5 appcache (manifest caching) does work differently than I expected it to.
Here's a quote from whatwg.org, which explains it nicely:
Offline Web Applications
The application cache feature works best if the application logic is separate from the application and user data, with the logic (markup, scripts, style sheets, images, etc) listed in the manifest and stored in the application cache, with a finite number of static HTML pages for the application, and with the application and user data stored in Web Storage or a client-side Indexed Database, updated dynamically using Web Sockets, XMLHttpRequest, server-sent events, or some other similar mechanism.
Legacy applications, however, tend to be designed so that the user data and the logic are mixed together in the HTML, with each operation resulting in a new HTML page from the server.
The mixed-content model does not work well with the application cache feature: since the content is cached, it would result in the user always seeing the stale data from the previous time the cache was updated.
While there is no way to make the legacy model work as fast as the separated model, it can at least be retrofitted for offline use using the prefer-online application cache mode. To do so, list all the static resources used by the HTML page you want to have work offline in an application cache manifest, use the manifest attribute to select that manifest from the HTML file, and then add the following line at the bottom of the manifest:
SETTINGS:
prefer-online
NETWORK:
*
So, as it turns out, the application cache is not a good fit for pages with dynamic information that are rendered on the server; whatwg.org calls these types of apps "legacy".
For a natural fit with the application cache, you'd need to have only the display and generic logic in your HTML page and retrieve any dynamic information through AJAX requests.
Hope this helps.
I'm a newbie to web development (and development in general) and I'm building out a Rails app which scrapes data from a third-party website. I'm using Nokogiri to parse for the specific HTML elements I'm interested in, and these elements are stored in a database.
However, I'd like to save the HTML of the whole page I'm scraping as a back-up, in case I change my mind about what type of information I want, or in case the site removes the page (or updates it).
What's the best practice for storing the archived html?
Should I extract it as a string and put it in a database, write it to a log or text file, or what?
Edit:
I should have clarified a bit. I am crawling on the order of 10K websites a week and anticipate only needing to access the back-ups on a once-off basis if I redefine the type of data I want.
So as an example, if I was crawling UN data on country populations and originally was looking at age distributions but later realized I wanted the gender distributions as well, I'd want to go back to all my HTML archives and pull that data out. I don't anticipate this happening much (maybe 1-3 times a month), but when it does I'll want to retrieve it across 10K-100K listings. The task should only take a few hours for around 10K records, so I guess each fetch should take at most a second. I don't need any versioning capability. Hope this clarifies.
I'm not sure what the "best practice" for this case is (it will vary with the specifics of your project), but as a starting point I'd suggest creating a model with a string field for the URL and a text field for the HTML itself, and saving the pages there. You might add a uniqueness validator on the URL to make sure you don't store the same HTML twice.
You could then optionally add model methods that instantiate a Nokogiri document from the HTML text, thus using the HTML string as the "master" record (in the DB) and generating the Nokogiri document on the fly when needed. But again, as @dave-newton points out, a lot of this will depend on what you're going to do with the HTML.
I would strongly suggest saving it into a table in the same DB as the data you are scraping. Why change what works? Keep it all as you normally would, or write it all to a separate database entirely and keep some form of reference linking the scraped data to the backups, just in case.
Yesterday morning I noticed Google Search was using hash parameters:
http://www.google.com/#q=Client-side+URL+parameters
which seems to be the same as the more usual search (with search?q=Client-side+URL+parameters). (It seems they are no longer using it by default when doing a search using their form.)
Why would they do that?
More generally, I see hash parameters cropping up on a lot of web sites. Is it a good thing? Is it a hack? Is it a departure from REST principles? I'm wondering if I should use this technique in web applications, and when.
There's a discussion by the W3C of different use cases, but I don't see which one would apply to the example above. They also seem undecided about recommendations.
Google has many live experimental features that are turned on/off based on your preferences, location and other factors (probably random selection as well.) I'm pretty sure the one you mention is one of those as well.
What happens in the background when a hash is used instead of a query string parameter is that it queries the "real" URL (http://www.google.com/search?q=hello) using JavaScript, then modifies the existing page with the content. This will appear much more responsive to the user, since the page does not have to reload entirely. The reason for the hash is so that browser history and state are maintained. If you go to http://www.google.com/#q=hello you'll find that you actually get the search results for "hello" (even if your browser is really only requesting http://www.google.com/). With JavaScript turned off, however, it wouldn't work, and you'd just get the Google front page.
Hashes are appearing more and more as dynamic web sites are becoming the norm. Hashes are maintained entirely on the client and therefore do not incur a server request when changed. This makes them excellent candidates for maintaining unique addresses to different states of the web application, while still being on the exact same page.
I have been using them myself more and more lately, and you can find one example here: http://blixt.org/js -- If you have a look at the "Hash" library on that page, you'll see my implementation of supporting hashes across browsers.
Here's a little guide for using hashes for storing state:
How?
Maintaining state in hashes implies that your application (I'll call it application since you generally only use hashes for state in more advanced web solutions) relies on JavaScript. Without JavaScript, the only function of hashes would be to tell the browser to find content somewhere on the page.
Once you have implemented some JavaScript to detect changes to the hash, the next step would be to parse the hash into meaningful data (just as you would with query string parameters.)
Why?
Once you've got the state in the hash, it can be modified by your code (or your user) to represent the current state in your application. There are many reasons for why you would want to do this.
One common case is when only a small part of a page changes based on a variable, and it would be inefficient to reload the entire page to reflect that change (Example: You've got a box with tabs. The active tab can be identified in the hash.)
Other cases are when you load content dynamically in JavaScript, and you want to tell the client what content to load (Example: http://beta.multifarce.com/#?state=7001, will take you to a specific point in the text adventure.)
When?
If you had a look at my "JavaScript realm" you'll see a borderline-overkill case. I did it simply because I wanted to cram as much JavaScript dynamics into that page as possible. In a normal project I would be conservative about when to do this, and only do it when you will see positive changes in one or more of the following areas:
User interactivity
Usually the user won't see much difference, but the URLs can be confusing
Remember loading indicators! Loading content dynamically can be frustrating to the user if it takes time.
Responsiveness (time from one state to another)
Performance (bandwidth, server CPU)
No JavaScript?
Here comes a big deterrent. While you can safely rely on 99% of your users to have a browser capable of using your page with hashes for state, there are still many cases where you simply can't rely on this. Search engine crawlers, for example. While Google is constantly working to make their crawler work with the latest web technologies (did you know that they index Flash applications?), it still isn't a person and can't make sense of some things.
Basically, you're at a crossroads between compatibility and user experience.
But you can always build a road in between, which of course requires more work. In less metaphorical terms: implement both solutions, so that there is a server-side URL for every client-side URL that outputs relevant content. For compatible clients, it would redirect them to the hash URL. This way, Google can index the "hard" URLs, and when users click them, they get the dynamic state stuff!
Recently Google also stopped serving direct links in search results, offering redirects instead.
I believe both have to do with gathering usage statistics: which searches were performed by the same user, in what sequence, which of the search results the user followed, etc.
P.S. Now that's interesting: direct links are back. I clearly remember seeing only redirects there in the last couple of weeks. They are definitely experimenting with something.