Background:
I'm a software engineering student, and I was checking out several algorithms for recommendation systems. One of these algorithms, collaborative filtering, has a lot of loops in it: it has to go through all of the users, and for each user, all of the ratings he has made on movies or other rateable items.
I was thinking of implementing it in Ruby for a Rails app.
The point is there is a lot of data to be processed, so:
Should this be done in the database? Using regular queries? Using PL/SQL or something similar? (Testing DBs is extremely time-consuming and hard, especially for these kinds of algorithms.)
Should I run a background job that caches the results of the algorithm? (If so, the data is processed in memory, and if there are millions of users, how well does that scale?)
Should I run the algorithm every time there is a request, or every x requests? (Again, the data is processed in memory.)
The Question:
I know there are things that do this, like Apache Mahout, but they rely on Hadoop for scaling. Is there another way out? Is there a Mahout or machine learning equivalent for Ruby, and if so, where does the computation take place?
Here are my thoughts on each of the methods:
No, it should not. Some calculations would be much faster to run in your database and some would not. However, it would be hard and time-consuming to test exactly which calculations should be run in your DB, and you would probably find that some part of the algorithm is slow in PostgreSQL or whatever you use.
More importantly: this is not the right place to run logic. As you say yourself, it would be hard to test, and it's overall a bad practice. It would also affect the performance of your requests each time the DB has to calculate the algorithm. Also, the DB would still use a lot of memory processing this, so that isn't an advantage.
By far the best solution. See below for more explanation.
This is a much better solution than number one. However, it would mean that your app's performance would be very unstable: sometimes all resources would be free for normal requests, and sometimes you would use all your resources on your calculations.
Option 2 is the best solution, as it doesn't interfere with the performance of the rest of your app and is much easier to scale, since it works in isolation. If, for example, you find that your worker can't keep up, you can just add more worker processes.
More importantly, you would be able to run the background processes on a separate server, easily monitor the memory and resource usage, and scale that server as necessary.
Even for real-time updates a background job will be the best solution (if, of course, the calculation is not small enough to be done in the request). You could create a "high priority" queue that has enough resources to almost always be empty. If you need to show the result to the user without a reload, you would have to add some kind of push notification after a background job completes. This notification could then trigger an update on the page through JavaScript (you can also check out the new live streaming feature of Rails 4).
I would recommend something like Sidekiq with Redis. You could then cache the results in Memcached, or you could recalculate the result each time; that really depends on how often you would need to calculate this. With this solution, however, it would be much easier to set up a stable cache if you want it.
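To make that concrete, here is a minimal sketch of what such a worker could look like, assuming Sidekiq plus Rails.cache (backed by Memcached or Redis). The RecommendationWorker name, the "recommendations" queue, and the compute_recommendations_for helper are invented for illustration; this is not a drop-in implementation of the algorithm itself.

    # app/workers/recommendation_worker.rb -- illustrative names only
    class RecommendationWorker
      include Sidekiq::Worker
      sidekiq_options queue: "recommendations", retry: 3

      def perform(user_id)
        user = User.find(user_id)

        # Placeholder for the actual collaborative filtering step.
        recommendations = compute_recommendations_for(user)

        # Cache the result so web requests never run the algorithm inline.
        Rails.cache.write("recommendations/#{user.id}",
                          recommendations,
                          expires_in: 12.hours)

        # A completion notification (websocket, ActionController::Live, etc.)
        # could be pushed from here if the page should update without a reload.
      end
    end

    # In a controller (assuming a current_user helper): serve the cached
    # result and fall back to enqueueing the job if it isn't there yet.
    recs = Rails.cache.read("recommendations/#{current_user.id}")
    RecommendationWorker.perform_async(current_user.id) if recs.nil?

Because the worker only talks to Redis, the database, and the cache, you can run as many of these processes as you need on a separate box without touching the app servers.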
Where I work, we have an application that runs some heavy queries with a lot of calculations like this. Each night these jobs are queued and then run on an isolated server over the next few hours. This scales really well and is also easy to monitor with New Relic.
Hope this helps, and makes sense (I know my English isn't perfect), but please feel free to ask if I misunderstood something or you have more questions.
I've got an idea for a new web app which will involve the following:
1.) lots of raw inputs (text values) that will be stored in a db - some of which contribute as signals to a ranking algorithm
2.) data crunching & analysis - a series of scripts will be written which together form an algorithm that will take said raw inputs from 1.) and then store a series of ranking values for these inputs.
Events 1.) and 2.) are independent of each other. Event 2 will probably happen once or twice a day. Event 1 will happen on an ongoing basis.
I initially dabbled with the idea of writing the whole thing in node.js sitting on top of MongoDB, as I was curious to try out something new, and while I think node.js would be perfect for event 1.), I don't think it will work well for event 2.) outlined above.
I'd also rather keep everything in one domain rather than mixing node.js with something else for step 2.
Does anyone have any recommendations for what stacks work well for computational type web apps?
Should I stick with PHP or Rails/Mysql (which I already have good experience with)?
Is MongoDB/nosql constrained when it comes to computational analysis?
Thanks for your advice,
Ed
There is no reason why node.js wouldn't work.
You would just write two node applications:
one that takes input, stores it in the database, and renders output,
and another that crunches numbers in its own process and is run once or twice per day.
Of course, if you're doing real number crunching and you need performance, you wouldn't do no. 2 in Node/Ruby/PHP. You would do it in Fortran (or maybe C).
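Just to illustrate the split (and this holds whether you end up on node or on the Rails/MySQL stack you already know), the second piece can be as simple as a standalone script run from cron once or twice a day, completely separate from the web process. This is only a rough Ruby sketch; the table names, columns, and scoring logic are invented placeholders, not a real ranking algorithm:

    #!/usr/bin/env ruby
    # crunch_rankings.rb -- run by cron, e.g.:
    #   0 3,15 * * * ruby /path/to/crunch_rankings.rb

    require "active_record"

    ActiveRecord::Base.establish_connection(
      adapter:  "mysql2",
      database: "myapp",
      username: "myapp",
      password: ENV.fetch("DB_PASSWORD")
    )

    class RawInput < ActiveRecord::Base; end  # the text values from event 1
    class Ranking  < ActiveRecord::Base; end  # the scores produced by event 2

    RawInput.find_each(batch_size: 1000) do |input|
      score = input.body.to_s.length  # stand-in for the real ranking algorithm
      Ranking.where(raw_input_id: input.id)
             .first_or_initialize
             .update(score: score)
    end

The web-facing app never runs this; it only reads the rankings table, so the two workloads can't step on each other.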
So I've been asked to demo a Windows service to customers, but I'm clueless about how to go about doing so.
I've demoed a web/Windows-based app before by just bringing up the application and going through each screen, but with a Windows service it isn't the same.
Can anyone point me in the right direction?
All the service does is download a file from an FTP site and import the file data into a DB every half hour.
Thanks
Do you have some way to trigger it? E.g. does the service download on startup? Can you change the clock to trigger the download? Is the download interval configurable?
What about some way to view the results? Can you see the results over time, or is it only the latest version that's kept up to date?
Your demo could very well be:
Start the service
Examine the results
Wait for new results (if you can set the interval low enough)
Stop the service
You can demo both that you have a Windows service (Start/Stop commands can be demoed here) and that it performs the required function (updates the DB every X minutes).
In my view, first you need to explain why you use a Windows service. If your audience is non-technical, you may need to tell them what a Windows service is as well, plus why we don't have nice GUIs :)
Then you can have a sample FTP site with access to it. Reduce your time period to something feasible, like 1 or 2 minutes. Then make some changes on the FTP site and show the resulting changes using a database access tool like DbVisualizer. It would be nice if you could create a sample web page that loads some "nice" data from the database, so the viewers see the changes in a more 'interactive' manner... :) I don't think creating such a page is a waste of time, because it will be an easier way to show them the change over time. Otherwise you will need to tell them about technical stuff related to the database, which would be a waste of time...
With the web page: Hey!! See... this is the change... it's shown right on the web page...
Without the web page: Now you can see a new row added to this table in this database... blah blah... ;)
Good luck..
You should spend time explaining what is going on. Detail the things you had to do to get it to work the way you wanted. Talk about things other people can do that are wrong. Tell them what you did that's better. Explain to them that the end result might be the same but security this or overhead that.
Looks like you need to sell your product a little.
I would also suggest some sort of flowchart if you don't have one already. They love those diagrams.
Talk about other possible solutions and why this is better. They want you to tell them things they can vaguely remember and parrot to other people. They don't know what you're talking about. They don't care how it really works.
A little bit of this, done right, goes a long way. People can and have bullshitted their way into millions of dollars like this, but it catches up with them and their reputation is ruined. Don't blatantly make stuff up. There's a difference between telling them you kept security in mind when developing this service, and telling them that it's bulletproof and this will be sitting next to the cockroaches after a Nuclear War.
First of all ask yourself:
Who am I demoing this to?
What do they want to get out of the demo?
Regarding "who", are they technical people? Business people? Smart people? Clueless people? Those are important things to understand before deciding how to demo the product (and shame on some of you readers for assuming the technical people are also smart and the business people are also clueless ;-)
If your audience knows what a Windows service is, how they work in general, and roughly what yours is supposed to accomplish, you can jump in and explain the finer points of your solution. If that's not the case, you will want to start out by explaining that you have a solution to show them that is running all of the time, waiting for new work to arrive, and then automatically starts processing when that new work arrives... or whatever your Windows service actually does. If this is what you're facing, try explaining it to your mom (assuming your mom isn't a programming wizard).
Regarding "what they want" from the demo, ask yourself if they want to know technically how the service is written and functions (what events trigger new work items? Does it have a DB back end? How is the service monitored for 24/7 availability?).
Do they care more about the business function? "OK so it can process time cards for employees as they are submitted. Can it also process time cards for contractors?"... that sort of thing.
Once you have figured out WHO and WHAT, look over the advice from the other posters and see what is a good match.
At a place I used to work, the typical response to any problem was to blame the hardware or the users for not using the system perfectly. Prior to that job I had adopted the philosophy that it's my fault until I can prove otherwise (and so far, at least 99 times out of 100, that's been correct).
One of the last "unsolvable" problems when I was there was an abundance of database timeouts. After months of research, I still only had theories but couldn't prove any of them. One of my developers adamantly suggested replacing the network (every router, switch, and access point) but couldn't provide any evidence that the network was the cause; it was, however, "obviously the cause" according to my manager (who had no development/IT experience), so he took over the problem. One caveat and Fog Creek plug: he couldn't account for the fact that the error reporting via FogBugz worked perfectly, against the same SQL Server as the rest of the data.
A couple of timeout-free months later, my manager boasted that he had fixed the timeouts ("Look, no timeouts!"). I had to hold back from grabbing a rock and saying "Look, no tigers!", but I did ask how he knew they would have occurred, to which I got no response. The timeouts did return (and in greater numbers) a couple of months later.
I'm pretty content with how I handled the situation but I'm curious how the SO crowd would have responded to letting a superior/colleague implement a solution you know (or are very sure) is wrong and will likely waste thousands of dollars?
Let them, but at the same time continue searching for the real cause.
A couple thousand dollars is money well spent if it keeps me from going against that kind of thinking (which is futile).
Well, if the problem is upper management, then I would do as you have done - lodge your complaint, then follow instructions. If it turns out they were right (it happens every now and then) then you look like a good employee despite your misgivings. If it turns out you were right, then they might be more willing to listen to you given that you allowed them their turn.
This is, of course, optimistic.
In the case of a colleague, take the problem up a level and consult your superiors for advice on how to approach the subject. Be fair to both your perspective and that of your colleague's, then follow the advice you're given.
Sometimes it's best to leave a manager be. If you think about his pressures and responsibilities, he had to be seen to be doing something rather than "nothing"; after enough time, "investigating" looks like nothing to the outside parties that need the timeouts to stop.
By taking an action, he creates an opportunity to keep researching. The trick is to find a way to put your solutions in his context: here is something we can do now, and here's what we can continue to do. For example, "We can replace the networking gear as a precautionary step, and then look at the version control logs to rule out that possibility."
This gives him something proactive so he can look productive up his chain, while you keep working toward the solution you want, which will ultimately be successful.
In the long term, you should look to work for someone who trusts your technical decisions implicitly, whom you can talk to candidly, and who will help you help him navigate the politics in a way where you both know what's going on. If your manager isn't that person, move.
Is this a big problem? It's not your job to save your company dollars, other than that you would like your company to remain solvent so you get paid.
If it's just one manager, he will be weeded out sooner or later; if your entire company culture is like this, maybe it's time to move on.
In the meantime, see if you can see this from your manager's perspective.
I'd consider your manager's intent to be a good thing. It's the people who don't want to bother that I find more difficult. It's just best to find a way to use that energy to be helpful.
One common problem for lots of people (occasionally myself included) is that they flail around when trying to diagnose a problem. If you're wildly guessing at it, then, particularly with modern computers, you have only the slimmest possibility of being right. Approaching this type of problem with that type of attitude generally means that you'll never fix it.
The best way of handling complex debugging is divide and conquer. In this case, first think up a test, then implement it. Did that test act as expected? Depending on where you are with your tests, you are getting closer to or farther away from the problem. The key is that ALL of the tests must result in some concrete (objective) behavior. If the results are ambiguous, then the test is useless.
If you're getting a disconnect in one part of the system but some other part is not, then you have a huge amount of valuable information (it also shows it's not the network). What's the difference between the parts? Just start descending until you get somewhere...
Getting back to your manager: whenever I encounter that type of personality problem, I try to redirect the energy into something more useful. The desire is there; it just needs some help getting shaped. If you can convince your manager to make sure the tests are concrete, then if they do enough of them, they'll produce enough information to correctly guess the bug. Sure, a more consistent approach might be faster, but why turn down some free assistance? I generally feel that there is some useful role for anyone on any project; it's all about making it possible to harness their efforts...
Paul.
I took a data structures class in C++ last year, and consequently implemented all the major data structures in templated code. I saved it all on a flash drive because I have a feeling that at some point in my life, I'll use it again. I imagine something I end up programming will need a B-Tree, or is that just delusional? How long do you typically save the code you write for possible reuse?
Forever (or as close as I can get). That's the whole point of a source control system.
-1 to saving everything that's ever produced. I liken that to a proud parent saving every single used nappy ever to grace the cheeks of their little nipper. It's shitty and the world doesn't benefit from its existence.
How many people here go past the first page in Google on a regular basis? Having so much crap around only seems to make it difficult to find anything useful.
+1 to keeping code forever. In this day and age, there's just no reason to delete data which could possibly be of value in the future. Even if you don't use the B-tree as a useful structure, you may want to look at the code to see how you did something. Or even better, you may wish to come back to the code someday for instructional purposes. You never know when you might want to look at that one particular snippet of code that accomplished a task in a certain way.
If I use it, it gets stuck in a Bazaar repository and uploaded to Launchpad. If it's a little side project that peters out, I usually move it to a junk/ subdirectory.
I'll use it again. I imagine something I end up programming will need a B-Tree, or is that just delusional?
Something you write will need a B-tree, but you'll be able to use a library for it because the real world values working solutions over extra code.
I keep backups of all of my code for as long as possible. The important things are backed up on my web server and external hdd. You can always delete things later, but if you think you might find a use for it, why not keep it?
I still have (some) code I wrote as far back as college, and that would be 18 years ago :-). As is often the case, it is better to have it and never want it, than to want it and not have it.
Source control, keep it offsite and keep it for life! You'll never have to worry about it.
I have code from many, many years ago. In fact, I think I still have my first PHP script. If nothing else, it's a good way to see how much you have changed over time.
I agree with the other posters. I've kept my code from school in a personal source code repository. What harm does hanging on to it really do?
I would just put it on a disk for historical purposes. Use the Standard Template Library: one mistake people make is assuming that their implementation of moderate-to-complex data structures is the best. I can't tell you how many times I have found a bug in a home-grown B-tree implementation.
Keep everything! You never know when it will save you some work. About a year ago I needed some C code to parse an expression, tokenize it for storage, and evaluate the results later. Ugly little piece of code... but it seemed familiar, as it should have: I had to do a postfix evaluator in college (30 years ago) and still had the code. Admittedly it needed a little clean-up, but it saved me a couple of days of work.
I implemented a red-black tree in Java while in college. I have always wanted to find that code again and cannot.
Now I do not have the time to recreate it from scratch since I have three kids and do not develop in Java.
I now keep everything so that I can relearn much faster. I also find it fascinating to see how I did something 1, 5, or 10 years ago. It makes me feel good because either I did it right, or I am better now and would do it differently.
If I ever go back to college to give a lecture to future students, it is on the list of things to do:
Save everything...
I'm a code packrat, for better or worse, but I guard it, because sometimes it's client-confidential.
On occasion, this has been really useful, like if a client lost their stuff, or their documentation.
I lost a lot of old code (from 10 years ago) to a computer failure because it wasn't backed up, but in fact I don't really care, because I don't really want to see code written in a very old language. Most of this code was written in VB5...
I agree that nowadays it's easy to keep everything, but I think sometimes it's good to clean up our backups and computer storage, because, as in the real world, we do not need to keep everything forever.
Forever is the beauty of the electronic medium. That's one of the most attractive aspects for me.
But keeping it depends on your coding style and what you do with it.
I'd suggest tossing your code if you're the type that...
Never looks back.
Would rather re-write from your memory to improve your craft.
Isn't very organized.
Is bothered by latent storage to no end.
Likes to live on the edge.
Worships efficiency of memory.
Logical reasons for tossing code would be...
It bothers you.
It disrupts your workflow by getting in your way.
You're ashamed of it.
It confuses you and distracts you.
Like anything that takes up physical space in life, its value is weighed against its usefulness.
All my code is kept indefinitely, with plans to return to it at some point, reflect, and refactor. I do that because it's fun to see my progress, and provides very accessible learning experiences. Furthermore, the incorporation of all my code into a consolidated framework is something I work towards all the time.
Forever...
Good code never dies. ;)
I don't own most of the code I develop: my employer does. So I don't keep that code (my employer does - or should).
Since I discovered computing, I have written code for devices that no longer exist, in languages that are no longer worth using. Maybe there is some emulator, but keeping that code and running it would be pure nostalgia.
You can find B-tree information (and many other subjects) on Wikipedia (and many other places). There is no need to keep that code.
In the end I keep only code that I own and maintain.