What do the different symbols mean when coding, such as * < > != and #? Why are they needed, and how do I know which ones to use? And which words do I use to begin code, like including a save option? I have read the course work, downloaded additional wiki docs and Python.org tutorials, and I have not found any answers to my questions.
They are called operators, and they perform some operation on the values associated with them (known as operands). For example, 1 * 2 means to multiply (operator *) 1 and 2 (the operands) and return the product.
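For example, in Python (note that # is not an operator at all: it begins a comment, which the interpreter ignores):

    # In Python, '#' begins a comment: everything after it is ignored.
    product = 3 * 4        # '*' multiplies: product is 12
    print(product < 20)    # '<' compares less-than: prints True
    print(product > 20)    # '>' compares greater-than: prints False
    print(product != 12)   # '!=' tests inequality: prints False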
You should seek more intensive programming education before attempting to implement a save option. Your question is very basic and indicates that you are still learning the basics. Keep at it! The greatest writers of all time once had to spend time learning their ABC's and how to sound out words.
Related
I need to write up an evaluation of the advantages and disadvantages of linked lists and binary trees as structures for storing and searching for data. But I'm a little bit lost on what the advantages and disadvantages could be.
Any help would be greatly appreciated, thanks!
Let's think about how a linked list works versus how a binary tree works.
For a doubly linked list, we have elements like
head <-> 5 <-> 10 <-> 4 <-> tail
If we want to add an element, we can easily add it at the head or tail of the list: point the new element at the endpoint and at the node the endpoint currently points to, then update both of those to point to the new element (making sure to correctly assign previous and next). I've glossed over it here, but there are lots of good resources available if you search for insertion into a linked list. This operation has O(1) time complexity. Do some research on insertion into (balanced) binary trees; that will take longer, typically O(log n).
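As a rough sketch of that O(1) head insertion in Python (a simplified doubly linked list with sentinel head and tail nodes; the class and method names are just for illustration):

    class Node:
        def __init__(self, value=None):
            self.value = value
            self.prev = None
            self.next = None

    class DoublyLinkedList:
        def __init__(self):
            # Sentinel head and tail nodes simplify the edge cases.
            self.head = Node()
            self.tail = Node()
            self.head.next = self.tail
            self.tail.prev = self.head

        def push_front(self, value):
            # O(1): only four pointers change, regardless of list length.
            node = Node(value)
            node.prev = self.head
            node.next = self.head.next
            self.head.next.prev = node
            self.head.next = node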
Now what about searching? In the linked list above, if I want to find an element I have to walk through the list from one end until I hit the value I want, which is O(n) time complexity. With a balanced binary search tree, however, we can check whether the value we're looking for is higher or lower than the value at the root, which sits roughly in the middle of the sorted order. If it's higher, we can eliminate the lower half of the numbers; then we repeat the same step with what remains. This cuts the number of remaining candidates roughly in half at every step and leads to O(log n) search time.
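Here is a minimal sketch of that halving search in Python, assuming nodes with value, left, and right attributes:

    def bst_search(node, target):
        # Each comparison discards roughly half of the remaining nodes
        # in a balanced tree, giving O(log n) search.
        while node is not None:
            if target == node.value:
                return node
            elif target < node.value:
                node = node.left   # discard the larger half
            else:
                node = node.right  # discard the smaller half
        return None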
There are many, many ways to evaluate this, and they depend on the different implementations. My advice is to choose which implementation of each you want to focus on, compare the time complexity of its operations, and then consider alternative implementations and what they do to those time complexities.
For binary trees, consider balanced and unbalanced.
For linked lists, look at doubly linked and singly linked, with and without a pointer to the tail (essentially stack vs. queue).
If you have questions about specific implementations and how they compare let us know and we will do our best to clear that up.
The .weekday component starts at 1 (Sunday = 1, Monday = 2, etc.) and I'm curious whether anyone knows why. It seems that things in programming usually start at 0.
The reason for zero-based indexing in programming dates back to the time when programs were written in machine language or assembly code. It is a reflection of the base+displacement capability of memory access from CPU registers. It was maintained in low-level programming languages (such as C) that were essentially a bridge to assembly code. Zero-based indexing also provides much simpler index manipulation when processing a one-dimensional array (or memory block) as a multidimensional matrix. That being said, it is still just a convention. Some languages (such as Pascal) use one-based indexing, and normal human beings don't start numbering things at zero.
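For example, with zero-based indexes, mapping a 2-D position onto a flat block is a single multiply-add with no off-by-one corrections; a quick Python illustration:

    rows, cols = 3, 4
    flat = list(range(rows * cols))  # a 1-D block of 12 cells

    def cell(r, c):
        # Zero-based row-major mapping: no +1/-1 adjustments needed.
        return flat[r * cols + c]

    print(cell(0, 0))  # 0  (first cell)
    print(cell(2, 3))  # 11 (last cell)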
I don't know the fundamental reason for the numbering of weekdays being based on 1, but I strongly suspect that it is more consistent (and practical) with calendars, where day numbers within a month and months within a year are also 1-based. It would be very confusing to manipulate days and months as zero-based indexes. Given this, weekdays should follow the same convention.
I have been looking into a development issue that requires the use of pseudorandom number generation to allow the same set of random numbers to be generated for a given seed.
I am currently looking at using long random(void) and void srandom(unsigned seed) for this (man page), and at the moment these generate the same set of random numbers in a Mac app, an iOS app, and a 64-bit iOS app, which is what I was hoping for. The iOS tests were only in the simulator, so I don't know whether that will affect the result.
My main concern is that this algorithm could change at some point, making the applications we're developing effectively useless with old data. What are the chances of these algorithms changing or being different on a future device?
I'd say it's extremely likely they will change as the sequence is not guaranteed by any standard.
Why not use your own random number sequence? Even a simple linear congruential generator satisfies most statistical properties of randomness. Here is the formula for such a generator:
next_number = (a * current_number + b) % c
with
a = 1103515245
b = 12345
c = 4294967296 (i.e. 2^32)
These values of a, b, c give you good statistical properties and are quite well known for building quick and dirty generators.
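A minimal Python sketch of such a generator using those constants (purely illustrative, not a drop-in replacement for random()):

    class LCG:
        # Linear congruential generator: x' = (a*x + b) mod c,
        # with the constants quoted above (c = 2**32).
        A = 1103515245
        B = 12345
        C = 2 ** 32

        def __init__(self, seed):
            self.state = seed % self.C

        def next(self):
            self.state = (self.A * self.state + self.B) % self.C
            return self.state

    gen = LCG(42)
    print([gen.next() for _ in range(3)])

Because this uses only integer arithmetic under your control, the same seed yields the same sequence on every platform and every OS release.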
I don't have the slightest idea about the answer to the question you ask.
If a related question is "How can I be absolutely sure to have the same pseudo-random sequences generated in 10 years' time?", the answer is: don't rely on an external library; write the code explicitly.
Bathsheba proposed this generator. You can google for "pseudo random generator algorithm"; Wikipedia also has a list of such algorithms.
In fact, srandom did change in Mac OS X 10.7, according to this blog post. However, this was due to the way srandom was implemented: it accessed an uninitialized local variable, which is undefined behavior in C. According to the post, the new compiler shipped with Mac OS X 10.7 optimized out the uninitialized memory access, changing the function's behavior in subtle ways.
My platform here is Ruby - a webapp using Rails 3.2 in particular.
I'm trying to match objects (people) based on their ratings for certain items. People may rate all, some, or none of the same items as other people. Ratings are integers between 0 and 5. The number of items available to rate, and the number of users, can both be considered to be non-trivial.
A quick illustration -
The brute-force approach is to iterate through all people, calculating differences for each item. In Ruby-flavoured pseudo-code -
    matches = Hash.new(0)  # person id => total rating difference vs. user

    (people - [user]).each do |person|
      person.ratings.each do |rating|
        user_rating = user.rating_for(rating.item)  # assumes such a lookup exists
        next unless user_rating                     # skip items user hasn't rated
        # Use the absolute difference so over- and under-ratings don't cancel out.
        matches[person.id] += (rating.value - user_rating.value).abs
      end
    end

    # The lowest values in matches are the best matches for user.
The problem here is that as the number of items, ratings, and people increases, this code takes a very significant time to run; and, ignoring caching for now, this code has to run a lot, since this matching is the primary function of my app.
I'm open to cleverer algorithms and cleverer databases to achieve this, but doing it algorithmically, and thus being able to keep everything in MySQL or PostgreSQL, would make my life a lot easier. The only thing I'd say is that the data does need to persist.
If any more detail would help, please feel free to ask. Any assistance greatly appreciated!
Check out the KD-Tree. It's specifically designed to speed up neighbour-finding in N-Dimensional spaces, like your rating system (Person 1 is 3 units along the X axis, 4 units along the Y axis, and so on).
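For instance, using SciPy's KD-tree, under the simplifying assumption that every person has rated every item so each person is a complete point in rating space (missing ratings would need separate handling):

    import numpy as np
    from scipy.spatial import cKDTree

    # One row per person, one column per item; values are ratings 0..5.
    # (Toy data; real rows would come from the ratings table.)
    ratings = np.array([
        [3, 4, 1, 5],
        [3, 5, 1, 4],
        [0, 1, 5, 2],
    ], dtype=float)

    tree = cKDTree(ratings)

    # Find the 2 nearest neighbours of person 0 (the first hit is itself).
    distances, indices = tree.query(ratings[0], k=2)
    print(indices, distances)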
You'll likely have to do this in an actual programming language. There are spatial indexes for some DBs, but they're usually designed for geographic work, like PostGIS (which uses GiST indexing), and only support two or three dimensions.
That said, I did find this tantalizing blog post on PostGIS. I was then unable to find any other references to this, but maybe your luck will be better than mine...
Hope that helps!
Technically, your task is matching long strings made out of characters from a six-letter alphabet (the ratings 0 through 5). This kind of problem is researched extensively in computational biology (typically with four-letter alphabets). If you do not know the book http://www.amazon.com/Algorithms-Strings-Trees-Sequences-Computational/dp/0521585198, you might want to get hold of a copy. IMHO this is THE standard book on fuzzy matching / scoring of sequences.
Is your data sparse? With ratings, most of the time not every user rates every object.
Naively comparing each object to every other is O(n*n*d), where d is the dimensionality (the number of items). However, a key trick of all the Hadoop solutions is to transpose the matrix and work only on the non-zero values in the columns. Assuming a sparsity of s = 0.01, this reduces the runtime to O(d*n*s*n*s), i.e. by a factor of s*s. So if your sparsity is 1 out of 100, the computation will theoretically be 10,000 times faster.
Note that the resulting data will still be an O(n*n) distance matrix, so strictly speaking the problem is still quadratic.
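A small Python sketch of that transpose trick: build an inverted index from item to the users who rated it, so each pair of users is compared only on the items they actually share (the data layout here is my own):

    from collections import defaultdict

    # ratings[user] = {item: value}; sparse, only rated items appear.
    ratings = {
        "alice": {"film_a": 5, "film_b": 3},
        "bob":   {"film_b": 4, "film_c": 2},
        "carol": {"film_a": 1},
    }

    # Transpose: item -> list of (user, value) for non-zero entries only.
    by_item = defaultdict(list)
    for user, items in ratings.items():
        for item, value in items.items():
            by_item[item].append((user, value))

    # Accumulate pairwise distances over shared items only.
    diffs = defaultdict(int)
    for item, entries in by_item.items():
        for i, (u1, v1) in enumerate(entries):
            for u2, v2 in entries[i + 1:]:
                diffs[(u1, u2)] += abs(v1 - v2)

    print(dict(diffs))  # {('alice', 'carol'): 4, ('alice', 'bob'): 1}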
The way to beat the quadratic factor is to use index structures. The k-d-tree has already been mentioned, but I'm not aware of a version for categorical / discrete data and missing values. Indexing such data is not very well researched AFAICT.
We plan to use Mahout for a movie recommendation system, and we also plan to use SVD for model building.
When a new user comes we will require him/her to rate a certain number of movies (say 10).
The problem is that, in order to make a recommendation to this new user we have to rebuild the entire model again.
Is there a better way to do this?
Thanks
Yes... though not in Mahout. The implementations there are by nature built around periodic reloading and rebuilding of a data model. In some implementations this still lets you use new data on the fly, like neighborhood-based implementations. I don't think the SVD-based in-memory one does this (I didn't write it.)
In theory, you can start making recommendations from the very first click or rating, by projecting the new user's ratings back into the user-feature space via fold-in. To greatly simplify: if your rank-k approximate factorization of the input A is Ak = Uk * Sk * Vk', then for a new user u you want a new row Uk_u, and you have that user's rating row A_u.
From Ak = Uk * Sk * Vk' it follows that Uk = Ak * (Vk')^-1 * (Sk)^-1. The good news is that those two inverses on the right are trivial: (Vk')^-1 = Vk because Vk has orthonormal columns, and (Sk)^-1 is just a matter of taking the reciprocals of Sk's diagonal elements.
So Uk_u = Ak_u * Vk * (Sk)^-1. You don't have Ak_u, but you have A_u, which is approximately the same, so you use that: Uk_u ≈ A_u * Vk * (Sk)^-1.
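A small NumPy sketch of that fold-in (the toy data and choice of k are mine; this mirrors the formula above, not Mahout's actual code):

    import numpy as np

    # Toy ratings matrix: rows = users, columns = movies.
    A = np.array([
        [5, 4, 0, 1],
        [4, 5, 1, 0],
        [0, 1, 5, 4],
    ], dtype=float)

    # Rank-k truncated SVD: A ~= Uk * Sk * Vk'
    k = 2
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Sk, Vk = np.diag(s[:k]), Vt[:k].T

    # Fold-in a new user without refactorizing:
    # Uk_u ~= A_u * Vk * Sk^-1  ((Vk')^-1 -> Vk; Sk is diagonal)
    a_u = np.array([4, 4, 0, 1], dtype=float)  # the new user's initial ratings
    u_k = a_u @ Vk @ np.linalg.inv(Sk)

    # Predicted scores for the new user across all movies.
    print(u_k @ Sk @ Vk.T)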
If you like Mahout, and like matrix factorization, I suggest you consider the ALS algorithm. It's a simpler process, so it's faster (but it makes the fold-in a little harder -- see the end of a recent explanation I gave). It works nicely for recommendations.
This also exists in Mahout, though the fold-in isn't implemented. Myrrix, which is where I am continuing work from Mahout, implements all of this.