upper bound - display - image-processing

This is an idea that came to mind.
All display devices (screens made of pixels, etc.) have an upper bound on the number of different images they can show.
For example, a 1024*768 display with 32-bit pixels can only show (2^32)^(1024*768) distinct frames before it has to repeat one.
The funny thing is that we could, in principle, pre-generate every film and every window we have ever seen on a screen.
The question is: can anybody use this abstract idea to create something useful? :D

You're talking about a number of roughly
(2^32)^(1024*768) ~ ((2^4)^8)^(10^6) ~ (10^8)^(10^6) = 10^8000000.
The number of atoms in the observable universe is about
10^80 // http://en.wikipedia.org/wiki/Observable_universe#Matter_content
I think that there is no way we could pre-generate all the screens in our lives.
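The size of this number can be checked without ever constructing it, by working with logarithms. The sketch below (the function name is only illustrative) computes the base-10 exponent of (2^bits)^(width*height). For the display in the question it comes out at roughly 10^7575668, a little below the rounded estimate above but still hopelessly beyond 10^80.

```python
import math

def log10_distinct_frames(width, height, bits_per_pixel):
    """Return log10 of the number of distinct frames a display can show.

    The count itself is (2**bits_per_pixel)**(width*height), which equals
    2**(bits_per_pixel*width*height), so its log10 is a simple product.
    """
    total_bits = width * height * bits_per_pixel
    return total_bits * math.log10(2)

exponent = log10_distinct_frames(1024, 768, 32)
print(f"about 10^{exponent:.0f} distinct frames")
```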
Let me formulate another question. Given a number this big, what can we do to reduce it? How can we aggregate similar pictures in order to reduce the complexity?
Another nice question is: what kind of data structure do we need to store all this information? Suppose that, by grouping similar images, we reduce the number to 10^10. What kind of structure can handle so many different kinds of pictures in an efficient way?

Given some extra information about the scenes you could generate, you might be able to pick out the scenes that no one has ever seen.
If you could take all the pictures on the internet, together with statistics about which ones were popular or viewed a lot, and then compute all possible screens, you could pick out the ones that were not viewed much.
With some basic rules about the complexity of the image, you might be able to come up with images that have not been seen before. For example, a rule like "80% flesh tones", coupled with a variance threshold to ensure some range, might render naked people. :-)
Of course, the computation behind such an idea is vastly beyond our capabilities. (2^32)^(1024*768) is in the superexponential range, far beyond anything we could ever enumerate. I tried to compute it in Ruby, and it just died. It would have been fun if it had actually worked. :-)


Change in two 3D models

I'm trying to think of the best way to conduct some sort of analysis between two 3D models of the same object.
The first scan is of the original item and the second scan is after it has been put under some load x.
An example would be trying to find the difference between two types of metal.
I would like to be able to scan the initial metal cylinder, apply a measured load, scan it again, and then finally apply some sort of algorithm to compare the difference.
Is it possible to do this efficiently (maybe using MATLAB) over, say, 50 - 100 items for an object of around 5 in^3?
I am assuming I will need to work out some sort of utility function as the total mass should be the same?
Would machine learning be beneficial in this case?
Any suggestions or direction would be amazing.
Thank you :)
EDIT: The scan files are coming through as '.stl'

Finding features for classifying document into printable or non-printable

I would like to perform a binary classification of documents (.txt, .pdf, .jpeg, .img, etc.) into two categories: printable and non-printable. Essentially our school runs a free printing service for clubs, but the reality is that many clubs abuse the free printing and end up printing their homework, papers, etc., which amounts to thousands of dollars in ink and paper. Thus we would like to take some unsupervised methods to help limit this by determining whether a document is with high probability not club related (e.g. Biophysics paper, there is no biophysics club!).
So this is a very simple binary classification problem. I am not looking for low-level implementation details or which ML algorithms I should use, but rather how I should discover the relevant features that will then be fed to the training, etc.
My first idea was to gather all the documents that students print in the library. The idea is that if you have actual club printing, you'll do it for free at the club printing center rather than pay for it at the library. That would be a massive dataset, assuming every document printed at the library is assigned the non-printable/club material category. Unfortunately, the school is very liberal and opposed to allowing this due to privacy concerns, so it is not really an option without legal risks.
A similar-minded option would be to collect documents that are tied to courses / school work, e.g. course syllabi, available course documents online (homeworks, papers, etc.) and do feature extraction / selection on these. The assumption is that students would be abusing the printing to generally print material relevant to their studies.
While this approach should have reasonable performance for .pdf and .txt based documents, I am at a loss as to how to classify image based documents, besides perhaps using the title of the document and other metadata. A clever violator could simply convert all their text documents to image format to circumvent this system. However, that is outside the scope of this question and should be saved for a future question / research. For now the scope is just text based documents.
Note that there are previous questions on topics similar to this, but mine is very specific and I believe it may pose challenges that something like movie review classification might not have to face.
I just wanted to leave a comment but it ended up way longer than I imagined.
While this is an interesting problem, I'm not sure ML will easily get you what you need.
Firstly, your classification problem is of the type "A vs. the world", and A isn't strictly defined. Unless you know exactly what kind of stuff the clubs print, you can't really say whether new material belongs to that class or not.
This will prove particularly difficult when you need to assemble a training set large enough to cover whatever can or cannot be printed. Such a task will be extremely tedious, and as you said you won't have access to what the clubs usually print out, so at best you will have a large class imbalance in your training set.
As the goal is to make the system automated (if there is human interaction anyway, it's faster to check what will be printed than to build an ML algorithm that provides a score a human then has to investigate), the number of false positives and false negatives will also be problematic. There will be cases where the clubs won't be able to print things they have the right to.
As you said, you could greatly simplify the problem by classifying Course Material vs. Not Course Material. For that I would look at a bag-of-words (BoW) representation, because some words are more common than others in papers or course material (everything remotely technical). The number of words as well as the overall size of the file seem like sensible things to extract. The structure is often also distinctive, so it might be a good idea to extract things such as "number of lines with fewer than x words", "number of lines per page", "number of pictures" (if that's something you can extract from the file), ...
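As a rough illustration of the features listed above, here is a minimal sketch for plain text that has already been extracted from the file. The function name, the feature names, the threshold of 5 words and the example vocabulary are all assumptions made for the example, not part of any particular library.

```python
import re
from collections import Counter

def extract_features(text, short_line_words=5, vocabulary=None):
    """Turn extracted document text into a flat dict of numeric features."""
    lines = [ln for ln in text.splitlines() if ln.strip()]  # non-empty lines
    words = re.findall(r"[a-z']+", text.lower())            # crude tokenizer
    counts = Counter(words)
    features = {
        "n_words": len(words),
        "n_chars": len(text),
        "n_lines": len(lines),
        # "number of lines with fewer than x words"
        "n_short_lines": sum(
            1 for ln in lines if len(ln.split()) < short_line_words
        ),
    }
    # Bag-of-words counts for a fixed, hand-picked vocabulary.
    for term in (vocabulary or []):
        features[f"bow_{term}"] = counts[term]
    return features

sample = "Theorem 1\nLet x be a real number and assume the theorem holds for x.\nProof"
print(extract_features(sample, vocabulary=["theorem", "club"]))
```

Per-page counts and the number of pictures would need a format-specific parser, which is left out here.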
For pictures, the major thing to check would be whether this is a scan of something (I guess they will often scan and print course related things). The format of the image is already a good indication of that, but I don't see other things that would be particularly "course related".
So for me, if you can't precisely define one of your two classes, don't go with classification, or reduce the problem to something you can really define (course related things).
If you are able to compile a "black list" of documents students are not allowed to print, you can then implement a multi-layer rejection mechanism.
I would suggest these 3 levels:
1) compare the md5 of the file they want to print with a database of the md5 hashes of all the black-listed documents.
2) if 1) is passed, repeat 1) at page level rather than at document level (perhaps they want to print just a few pages rather than the entire document).
3) if 2) is passed, compare each page they want to print with the pages of the black-listed documents using an image similarity method, like SSIM. If you get a high score between a page they want to print and one of the black-listed items, do not print, and update your md5 database accordingly.
if 3) is passed: print!
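The first two levels only need hash lookups. A minimal sketch, assuming each document is available as a list of per-page byte strings (how pages are rendered to bytes is left open, and hashing the concatenated pages stands in for hashing the file itself). The function names are hypothetical. Note that md5 only matches byte-identical content, which is why the similarity level is still needed afterwards.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def build_blacklist(documents):
    """documents: iterable of documents, each a list of per-page bytes."""
    doc_hashes, page_hashes = set(), set()
    for pages in documents:
        doc_hashes.add(md5_hex(b"".join(pages)))        # level 1: whole document
        page_hashes.update(md5_hex(p) for p in pages)   # level 2: single pages
    return doc_hashes, page_hashes

def check_job(pages, doc_hashes, page_hashes):
    """Run levels 1 and 2 on a print job given as a list of per-page bytes."""
    if md5_hex(b"".join(pages)) in doc_hashes:
        return "reject: blacklisted document"
    if any(md5_hex(p) in page_hashes for p in pages):
        return "reject: blacklisted page"
    return "pass to similarity check"                   # level 3 (e.g. SSIM)

docs, pages = build_blacklist([[b"page one", b"page two"]])
print(check_job([b"page two"], docs, pages))
```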
A few words about SSIM:
This method is fairly robust to noise, so even a smart student who adds some sort of noise to the image will be caught.
You have to find a proper way to extract a region of interest (ROI) from the page and from the database of documents (if the two ROIs cover different areas of the page, the SSIM score will be low).
SSIM might be slow, so a compiled implementation (e.g. in C) is probably needed here.
I think SSIM is not rotation invariant, hence the check will fail if they print the page upside down (unless you have a smart way to rotate the page).

What's a good way to generate levels? How does a game like Defender (Droid Hen) generate their levels?

I started making a game for iPhone and am wondering what the best way to generate the levels would be. The way I am currently doing it is to create level files and then load them in through functions, but what if I have over 100 levels? Am I supposed to create 100 files?
A game like Defender (by DroidHen) has maybe 1000 levels... I don't know, because I stopped after about 600. Are they supposed to have 1000 files? Or do they just have one basic structure and then use a random layout function that generates the enemies?
I just need some insight on this so I can get an idea of how to approach this task... If this is the wrong place to ask this question, please let me know where else I could ask.
There are a number of ways to tackle this, depending on your game design. Obviously (?), manually-created levels that are tweaked for being "interesting" are best, but algorithmic level generation can mean potentially unlimited replayability (if your basic game design is solid).
You don't necessarily need to have one file per level, though that's a common-enough design.
The game you mentioned seems to be a "waves of enemies" style game (it's basically tower defense). It's really easy to come up with a very compact representation of each level in that case. You just keep a list for each "level" of how many of which enemy show up, and how much time to leave between them appearing.
So, an individual level could be a short list of entries meaning "create 5 type 'A' enemies, wait 500ms (1/2 second), then generate 5 type 'B' enemies".
You could fit thousands of those levels into a reasonably-compact source file.
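One possible encoding of such wave lists, as a sketch only: the step format `("spawn", type, count)` / `("wait", ms)` and the function name are made up for this example, and the same idea carries over directly to whatever language the game is written in.

```python
# Each level is a list of steps:
#   ("spawn", enemy_type, count)  or  ("wait", milliseconds)
LEVELS = [
    [("spawn", "A", 5), ("wait", 500), ("spawn", "B", 5)],
    [("spawn", "A", 10), ("wait", 250), ("spawn", "B", 8),
     ("wait", 250), ("spawn", "A", 10)],
]

def schedule(level):
    """Expand a level into a list of (time_ms, enemy_type) spawn events."""
    events, clock = [], 0
    for step in level:
        if step[0] == "wait":
            clock += step[1]
        else:
            _, enemy_type, count = step
            events.extend((clock, enemy_type) for _ in range(count))
    return events

for time_ms, enemy_type in schedule(LEVELS[0]):
    print(time_ms, enemy_type)
```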
Another option for more-complex game designs is to generate a large number of random levels, and manually select the "good" ones. If you save just the seeds for the random generator, that's a very compact representation.
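The seed idea relies on the generator being deterministic: the same seed must always rebuild the same level. A minimal sketch, where the seed values, the wave format and the parameter ranges are all invented for illustration:

```python
import random

# Hypothetical seeds that were hand-picked after play-testing.
GOOD_SEEDS = [17, 42, 1234]

def generate_level(seed, n_waves=3, enemy_types=("A", "B", "C")):
    """Rebuild a level from its seed; the same seed gives the same level."""
    rng = random.Random(seed)  # private generator, unaffected by other code
    level = []
    for _ in range(n_waves):
        level.append(("spawn", rng.choice(enemy_types), rng.randint(3, 10)))
        level.append(("wait", rng.choice((250, 500, 1000))))
    return level

print(generate_level(GOOD_SEEDS[1]))
```

Only the list of seeds has to ship with the game. The catch is that any later change to the generator changes every level, so the generator code has to stay frozen once levels have been selected.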

How to bulk-load an r-tree in C#?

I am looking for C# code to construct an r-tree. I have code that builds an r-tree incrementally i.e. items are added one by one to the tree, but I guess a better r-tree could be built if all items are given all at once to the tree creation algorithm. Please let me know if anyone knows how to bulk-load an r-tree in this manner. I tried doing some search but couldn't find anything very useful.
The most common method for low-dimensional point data is sort-tile-recursive (STR). It does exactly that: sort the data, tile it into the optimal number of slices, then recurse if necessary.
The leaf level of an STR-loaded tree with point data will have no overlap, so it is really good. Higher levels may have overlap, as STR does not take the extent of objects into account.
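To make the sort/tile steps concrete, here is a sketch of the leaf-level packing for 2-D points. It is written in Python rather than C# to keep it short, the function names are illustrative, and it only shows the packing step, not a full R-tree.

```python
import math

def str_pack_leaves(points, capacity):
    """Group 2-D points into leaf pages using sort-tile-recursive packing."""
    n = len(points)
    if n == 0:
        return []
    n_leaves = math.ceil(n / capacity)
    n_slices = math.ceil(math.sqrt(n_leaves))   # number of vertical slices
    slice_size = n_slices * capacity            # points per vertical slice
    by_x = sorted(points, key=lambda p: p[0])   # 1. sort by x
    leaves = []
    for i in range(0, n, slice_size):           # 2. tile into vertical slices
        vertical_slice = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        for j in range(0, len(vertical_slice), capacity):
            leaves.append(vertical_slice[j:j + capacity])  # 3. pack by y
    return leaves

def bounding_box(leaf):
    xs = [p[0] for p in leaf]
    ys = [p[1] for p in leaf]
    return (min(xs), min(ys), max(xs), max(ys))

grid = [(x, y) for x in range(10) for y in range(10)]
leaves = str_pack_leaves(grid, capacity=4)
print(len(leaves), bounding_box(leaves[0]))
```

Upper levels are built by applying the same packing to the bounding boxes of the level below (for instance, sorted by box centre) until a single root remains.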
A provably good bulk-loading strategy is a key component of the Priority R-tree, too.
And even when not bulk-loading, the insertion strategy makes a big difference. R-trees built with linear splits such as Guttman's or Ang-Tan will usually be worse than those built with the R*-tree split heuristics. In particular, Ang-Tan tends to produce "sliced" pages that are very unbalanced in their spatial extent. It is a fast split strategy and probably the simplest, but the results aren't good.
A paper by Achakeev et al., "Sort-based Parallel Loading of R-trees", might be of some help. You could also find other methods in its references.

Game Terrain Database Model

I am developing a game for the web. The map of this game will be a minimum of 2000km by 2000km. I want to be able to encode elevation and terrain type at some level of granularity - 100m X 100m for example.
For a 2000km by 2000km map, storing this information in 100m x 100m buckets would mean 20000 by 20000 elements, or a total of 400,000,000 records in a database.
Is there some other way of storing this type of information?
The map itself will not ever be displayed in its entirety. Units will be moved on the map in a turn based fashion and the players will get feedback on where they are located and what the local area looks like. Terrain will dictate speed and prohibition of movement.
I guess I am trying to say that the map will be used for the game and not necessarily for a graphical or display purposes.
It depends on how you want to generate your terrain.
For example, you could procedurally generate it all (interpolating a low-resolution terrain/height map, stored as two "bitmaps", with random variation seeded from the xy coords so that the terrain doesn't morph between visits) and use minimal storage.
If you wanted areas of terrain that were completely defined, you could store these separately and use them where appropriate, randomly generating the rest.
If you want completely defined terrain, then you're going to need to look into some kind of compression/streaming technique to only pull terrain you are currently interested in.
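The procedural option can be sketched as follows for elevation: bilinear interpolation of a coarse height map plus a small offset derived from a hash of the cell coordinates, so a given cell always yields the same value. The coarse grid, its spacing, the noise amplitude and the function names are all made-up values for illustration.

```python
import hashlib

# Hypothetical low-resolution height map in metres, one sample per 10 km.
COARSE = [
    [100, 200, 150],
    [120, 400, 300],
    [ 90, 250, 500],
]
COARSE_STEP = 100  # number of fine (100 m) cells between coarse samples

def cell_noise(x, y, amplitude=10.0):
    """Deterministic pseudo-random offset in [-amplitude, amplitude)."""
    digest = hashlib.sha256(f"{x},{y}".encode()).digest()
    unit = int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)
    return (2 * unit - 1) * amplitude

def elevation(x, y):
    """Elevation of fine cell (x, y): bilinear interpolation plus noise."""
    gx, gy = x / COARSE_STEP, y / COARSE_STEP
    x0 = min(int(gx), len(COARSE[0]) - 2)   # clamp so x0 + 1 stays in range
    y0 = min(int(gy), len(COARSE) - 2)
    tx, ty = gx - x0, gy - y0
    top = COARSE[y0][x0] * (1 - tx) + COARSE[y0][x0 + 1] * tx
    bottom = COARSE[y0 + 1][x0] * (1 - tx) + COARSE[y0 + 1][x0 + 1] * tx
    return top * (1 - ty) + bottom * ty + cell_noise(x, y)

print(elevation(37, 142))
```

A hash of the coordinates is used instead of a shared random generator so that cells can be evaluated in any order and still agree between requests.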
I would treat it differently, by separating terrain type and elevation.
Terrain type, I assume, does not change as rapidly as elevation - there are probably sectors of the same type of terrain that stretch over much longer than the lowest level of granularity. I would map those sectors into database records or some kind of hash table, depending on performance, memory and other requirements.
Elevation, I would assume, is semi-continuous, as it changes gradually for the most part. I would try to map the values into a set of continuous functions (with different sets on either side of a discontinuity, such as a sudden change in elevation). For any set of coordinates where the terrain has the same elevation or can be described by a simple function, you just need to define the range this function covers. This should greatly reduce the amount of information you need to record to describe the elevation at each point in the terrain.
So basically I would break the map down into sectors consisting of (x,y) ranges, once for terrain type and once for terrain elevation, and build a hash table for each which can return the appropriate value as needed.
If you want the kind of granularity that you are looking for, then there is no obvious way of doing it.
You could try a 2-dimensional wavelet transform, but that's pretty complex. Something like a Fourier transform would do quite nicely. Plus, you probably wouldn't store the terrain in a one-record-per-piece-of-land way; it makes more sense to have some sort of database field which can store an encoded matrix.
I think the usual solution is to break your domain up into "tiles" of manageable sizes. You'll have to add a little bit of logic to load the appropriate tiles at any given time, but not too bad.
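The "little bit of logic" for tiles might look like the sketch below: map a cell coordinate to its tile, load tiles on demand, and keep only the most recently used ones in memory. The tile size, the cache size and the `load_tile` callback are assumptions for the example. Where the tiles actually live (files, database blobs) is up to the caller.

```python
from collections import OrderedDict

TILE_SIZE = 256  # cells per tile side (an arbitrary choice)

def tile_key(x, y):
    """Tile coordinates of the tile containing cell (x, y)."""
    return (x // TILE_SIZE, y // TILE_SIZE)

class TileCache:
    """Loads tiles on demand and keeps the most recently used in memory."""

    def __init__(self, load_tile, max_tiles=64):
        self.load_tile = load_tile   # callable (tx, ty) -> 2-D list of cells
        self.max_tiles = max_tiles
        self.tiles = OrderedDict()

    def cell(self, x, y):
        key = tile_key(x, y)
        if key in self.tiles:
            self.tiles.move_to_end(key)          # mark as recently used
        else:
            self.tiles[key] = self.load_tile(*key)
            if len(self.tiles) > self.max_tiles:
                self.tiles.popitem(last=False)   # evict least recently used
        return self.tiles[key][y % TILE_SIZE][x % TILE_SIZE]

def flat_tile(tx, ty):
    # Stand-in loader: every cell in the tile is "grass" at elevation 0.
    return [[("grass", 0)] * TILE_SIZE for _ in range(TILE_SIZE)]

cache = TileCache(flat_tile)
print(cache.cell(12345, 6789))
```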
You shouldn't need to access all that info at once. Even if each 100m x 100m bucket occupied a single pixel on the screen, no screen I know of could show 20k x 20k pixels at once.
Also, I wouldn't use a database--look into height mapping--effectively using a black & white image whose pixel values represent heights.
Good luck!
That will be an awful lot of information no matter which way you look at it. 400,000,000 grid cells will take their toll.
I see two ways of going about this. Firstly, since it is a web-based game, you might be able to get a server with a decently sized HDD and store the 400M records on it just as you would normally, or, more likely, create your own storage mechanism for efficiency. Then you would only have to devise a way to access the data efficiently, which could be done by taking into account the fact that you are unlikely to need it all at once. ;)
The other way would be some kind of compression. You have to be careful with this though. Most out-of-the-box compression algorithms won't allow you to decompress an arbitrary location in the stream. Perhaps your terrain data has some patterns in it you can use? I doubt it will be completely random. More likely I predict large areas with the same data. Perhaps those can be encoded as such?
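If large areas really do share the same data, run-length encoding each row is one simple scheme that still allows looking up an arbitrary cell without decompressing anything. A minimal sketch with illustrative function names: each row is stored as a sorted list of run start positions plus the value of each run, and a binary search finds the run containing a given column.

```python
import bisect

def rle_encode_row(row):
    """Compress a row into parallel lists: run start indices and run values."""
    starts, values = [], []
    for i, value in enumerate(row):
        if not values or value != values[-1]:
            starts.append(i)       # a new run begins at column i
            values.append(value)
    return starts, values

def rle_lookup(starts, values, x):
    """Return the value at column x without decompressing the row."""
    return values[bisect.bisect_right(starts, x) - 1]

row = ["grass"] * 5000 + ["water"] * 200 + ["grass"] * 14800
starts, values = rle_encode_row(row)
print(len(row), "cells stored as", len(starts), "runs")
print(rle_lookup(starts, values, 5100))
```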
