Where to begin for basic machine algorithms for, say, document recognition and organization? - machine-learning

Pardon if this question is not appropriate. It is kind of specific, and I am not asking for actual code but more so for guidance on whether or not this task is worth undertaking. If this is not the place, please close the question and kindly point me in the correct direction.
Short background: I have always been interested in tinkering. I used to play with partitions and OS X scripts when I was younger, eventually reaching basic-level "general programming" aptitude before my father prohibited my computer usage. I am now going to law school and working at a law firm but I love development and I want to implement more tech innovation in the field.
Main point: At our firm, we have a busy season every year from mid-March to the first week of April (immigration + H-1B deadline). We receive a lot of documents and scanned files that need to be verified, organized, and checked.
I added (very) simple lines of code to our online platform to help with organization; basically, I attached tags to all incoming documents, and once they were verified, the code would organize them by tag (like "identification doc", "work experience doc", etc.). This has made my life much easier every year, as I end up working 100+ hour weeks during this season.
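A minimal sketch of that tag-based sorting idea, assuming documents arrive as local files with a known tag and verification flag (the file names, tags, and folder layout below are all made up for illustration):

    import shutil
    from pathlib import Path

    # Toy version of the tag-based organizer described above; the tags, file
    # names, and folder layout are invented for illustration.
    DOCS = [
        {"file": "passport_scan.pdf", "tag": "identification doc", "verified": True},
        {"file": "resume.pdf", "tag": "work experience doc", "verified": False},
    ]

    def organize(docs, root="sorted"):
        for doc in docs:
            if not doc["verified"]:
                continue  # only file away documents that have been checked
            dest = Path(root) / doc["tag"].replace(" ", "_")
            dest.mkdir(parents=True, exist_ok=True)
            shutil.move(doc["file"], str(dest / doc["file"]))

    # Demo: create placeholder files, then sort the verified one into sorted/...
    for doc in DOCS:
        Path(doc["file"]).touch()
    organize(DOCS)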
I want to take this many steps further with an algorithm that can check for signatures and data mismatches between documents and ultimately organize the documents so they are ready to print. Eventually, I would like to maybe even implement machine learning and a very basic neural network to automate the whole mind-numbing and painful process...
Actual Question(s): I just wanted to know the best way for me to proceed or get started. I know a decent amount of python and java, and we have an online platform already with the documents. What other resources would you recommend in terms of books, videos, or even classes? Is there a name for this kind of basic categorization? Can I build something like this through my own effort without an advanced degree?
Stupid and over-dramatic epilogue: Truth be told, a part of me feels like I wasted my life thus far by not pursuing what I knew I loved at the age of 12. This is my way of making amends I guess, and if I can do this then maybe I can keep doing it in law and beyond...

You don't give many specifics about the task, but if you have a finite number of forms in digital form as images, then this seems very possible.
I have personally used OpenCV with Python a lot, and more complex machine learning tasks have become increasingly simple over the past 10 years.
Take, for example, object detection (e.g. 1, 2) to check whether there is anything in a signature field, or try extracting the date from an image (e.g. 1, 2).
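As a minimal sketch of the signature-field check, assuming scans with a fixed form layout (the file name, box coordinates, and ink threshold are made-up values you would calibrate per template):

    import cv2
    import numpy as np

    def signature_field_is_signed(image_path, field_box, ink_threshold=0.02):
        """Heuristic: treat a signature field as signed if enough dark pixels appear in it."""
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        x, y, w, h = field_box  # pixel coordinates measured once per form template
        field = img[y:y + h, x:x + w]
        # Otsu binarization: dark ink becomes 255, white paper becomes 0.
        _, binary = cv2.threshold(field, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        ink_ratio = np.count_nonzero(binary) / binary.size
        return ink_ratio > ink_threshold

    # Hypothetical usage, with invented coordinates for one known form layout:
    # print(signature_field_is_signed("scan_001.png", (450, 1200, 400, 120)))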
I would suggest you start with the simplest thing that would improve your work. A small and easy task will let you build up your knowledge on how to do things.

Related

Creation of a platform for scientific studies and surveys

I am looking for competent people to create a platform for scientific studies and globalized surveys. I would like each human being to have an identification number (identity verification is important; otherwise the study will not be valid), and with this system each person will be able to give their opinion on different issues. The objective is therefore large scale: the system will have to work efficiently anywhere, with many people connected at the same time.
The data war has already begun. My goal is to counterbalance that power by giving it to humanity. The platform will therefore be disruptive, so many hacking attacks are to be expected, and it will be coveted for its precious data, which we must protect.
It is a project of the heart; the aim is not pecuniary, even if we will earn money. It is humanistic. Let's stop mass manipulation, put people with real practical solutions in the spotlight, and be aware of the real percentages behind an idea.
I'm going on a world tour in a few days, and I'd like to take the opportunity to promote the project around the globe, but for that I need the interface to be functional.
I hope you like the idea. I look forward to talking to you about it and finding out how and how quickly it can be done.
Sincerely, Louise

How does gmail extract time and date from text

I was going through my mail and saw that Gmail automatically suggested adding "coming friday around 5pm" as an event on 21st Feb. I am surprised: how does Gmail do this?
I mean, how did it correctly figure out that "friday" meant the coming Friday, and also that the 5 PM was linked with Friday?
I am a newbie in NLP and machine learning, so if someone can explain it to me in layman's terms I would be very glad.
I don't think this needs a lot of machine learning as such. A bit of NLP is helpful to get the dependencies from the sentence, but even that isn't strictly necessary.
You could start off with just looking at keywords (monday, tuesday, etc.) and then do a look around to see what is around them: last monday, next monday, coming monday, previous monday, and so on. These are called window features because they provide a window of +/- 1, 2, 3, ... tokens around the feature you are interested in (monday). The "around 5pm" you could theoretically also get from just looking at window features; I don't have an intuition as to how noisy that would be. Try to think of all the ways of expressing time in that context, and then think of how those ways can be mixed up with something else. Off the top of my head it would seem relatively easy to do that.
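A minimal sketch of those window features, assuming plain tokenized text (the tokenizer regex, window size, and example sentence are just illustrative):

    import re

    WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
                "friday", "saturday", "sunday"}

    def window_features(text, window=2):
        """Find weekday keywords and report the tokens within +/- `window` of each."""
        tokens = re.findall(r"[\w:']+", text.lower())
        hits = []
        for i, tok in enumerate(tokens):
            if tok in WEEKDAYS:
                hits.append({"keyword": tok,
                             "left": tokens[max(0, i - window):i],
                             "right": tokens[i + 1:i + 1 + window]})
        return hits

    print(window_features("How about coffee next week's Friday at 2pm"))
    # [{'keyword': 'friday', 'left': ['next', "week's"], 'right': ['at', '2pm']}]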
Anyhow, the other way is to use a dependency parser to extract the grammatical relations between the elements in the sentence. This requires you to Part-of-Speech (POS) tag the sentence (after splitting it into tokens). The POS tagger would need to be trained to recognize that friday and monday are nouns, perhaps even that they are temporal expressions; the same goes for 5pm and around 5pm. That does require machine learning, and a lot of it. The benefit Google has over others is that they have a lot of data, which gives them lots and lots of examples of the different ways of expressing what is essentially the same thing. This gives their models a lot of breadth. Once you've got the sentence POS tagged, you feed it to a dependency parser (such as the Stanford Dependency Parser), which tells you what the relations between all the different tokens in the sentence are.
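For instance, here is a minimal dependency-parse sketch using spaCy rather than the Stanford parser mentioned above (a swap on my part; it assumes you have installed spaCy and downloaded its small English model):

    import spacy

    # Requires: pip install spacy && python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Can we meet coming Friday around 5pm?")

    for token in doc:
        # Each token reports its part of speech, dependency label, and head
        # token, showing how the time tokens attach to the rest of the sentence.
        print(f"{token.text:10} {token.pos_:6} {token.dep_:12} -> {token.head.text}")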
Again, Google has a lot of data, which helps. On top of all this, Google has had years to hone the output of the models so that when a model isn't entirely sure what is going on, it won't highlight/extract the result. In terms of actually applying NLP in the real world, this last step is very important because it gives people confidence in what the system is doing. Basically, if the software isn't sure what is happening, it should do nothing, because doing something risks doing the wrong thing, which then reduces people's confidence in the system as a whole.
Releasing a reliable, easy-to-use NLP application requires a tradeoff between the quality of the NLP/machine learning and the general software engineering that hides the parts where the NLP fails from the users.
Try sending yourself emails with times expressed in different ways and see which ones Google gets and which ones it doesn't. For instance:
Can we meet Friday next week?
How about coffee next week's Friday at 2pm
I can't do Friday but I can meet Wednesday at 4pm
and so on; it's always interesting to poke holes in technology. It can also reveal quite a lot about what it is doing, and how it is doing it.

Proving the ROI of a technology?

How does one prove the ROI of a technology to their manager?
The closest thing I have found to a document on how to do this is:
http://www.agilejournal.com/pdf/Finding-ROI-in-Build-Automation.pdf
There are formulas in this document, but I can't really tell if they are just a lot of marketing or if they are accurate formulas for calculating ROI.
I'm not really trying to calculate the ROI of the build tool in the above paper; I'm just trying to calculate the ROI of a simple build tool like Ant.
They don't cut to the meat of the question (the intangible benefits), though they at least try to walk through an example. The formulas are just there to get ROI as a nice percentage: if "using build tools" were a stock, how much return would I get on my investment?
Which already shows that the question itself is flawed: An automated build is mainly an instrument to improve quality; improving productivity is usually a secondary concern.
However, this doesn't help when talking to the guys sitting on the money.
Metrics I would use to analyse the effect of a build tool:
Turnaround time from checkin to final media
Number of builds (for testing, for release, ..)
Number of builds requested (with faster builds, you can expect an increase in demand)
Number of errors introduced during manual build (assuming you track those)
Number of developers able to publish a release
Estimated resources (time, licences, build server, ..) for implementation and maintenance
Analysis of low-probability, high risk scenarios
Often, an automated build tool pays for itself just by removing a bottleneck: every developer can publish the software, not just John the Builder.
The last point is important (but hardest to give numbers for), as the total cost of bugs doesn't follow a normal distribution but is highly "Pareto": a single bug can get you some nasty press, or make key customers switch to the competition.
The core argument for maintaining an automated build is that bugs introduced while publishing are mostly avoidable.
I can't imagine that there would be any sensible way to accurately measure ROI on developer tools and practices. The only place I can think of where that might be possible would be in factory environments, where you might be able to measure productivity and average quality.
I'd suggest doing what everyone else does, which is pick some formulas that will support what you want and then tweak them until the ROI is high enough to justify the investment.
Put the estimate in hours:
Estimate how much time you currently spend, and how much time you'd spend
Put the estimate in customer complaints:
Estimate your current number of bugs. Estimate how many of those bugs the new system would have caught. Find out the percentage of bugs reported by users, and document how many fewer bugs will be user visible.
Add to the hours:
Figure out how long it would take to fix the bugs that would be caught, and tack that onto the hourly estimate.
Add a non-quantifiable "salability".
With the extra time, we build extra features. With fewer bugs, that's fewer demos where sales guys shoot themselves in the foot. How many extra copies of software can we sell if we do this?
The last bit won't be successful; it's there primarily to focus attention on the first two metrics: hours and customer-visible defects.
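To make the arithmetic concrete, here is a back-of-the-envelope sketch; every number in it is a made-up assumption you would replace with your own estimates:

    # Back-of-the-envelope ROI for a build tool; all figures below are invented.
    HOURLY_RATE = 75.0           # loaded cost of a developer hour (unused below,
                                 # but handy if you want the answer in dollars)
    manual_build_hours = 3.0     # hours per release spent building by hand today
    automated_build_hours = 0.25 # hours per release after automation
    releases_per_year = 50
    setup_hours = 40.0           # one-time cost to script the build

    bugs_caught_per_year = 12    # build/packaging errors the tool would prevent
    hours_per_bug = 4.0          # average time to diagnose and fix each one

    saved = (manual_build_hours - automated_build_hours) * releases_per_year
    saved += bugs_caught_per_year * hours_per_bug
    roi = (saved - setup_hours) / setup_hours * 100

    print(f"Hours saved per year: {saved:.1f}")   # 185.5
    print(f"First-year ROI: {roi:.0f}%")          # 364%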
My advice is to discredit what's there now, then offer the alternative.
You could try putting the emphasis on the problems with how you do builds and deployment now, and win that battle first. Managers don't want to change something that isn't causing them grief, so you need to prove it is going to be a problem if nothing is done.
You should consider: how much time and credibility are lost to bad builds; how many mistakes are made currently; how much manual repetition takes place; etc., and try to put metrics on, or give examples of, these things.
If you can win the support of your developers, then you can also add their approval to the strength of your argument. Another point to make is that good developers like to work with good tools, so progressive management equals motivated developers.
If you win the hearts and minds of the developers and your manager, that may mean more than a piece of paper with some numbers on it.

What is the most critical piece of code you have written and how did you approach it?

Put another way: what code have you written that cannot fail? I'm interested in hearing from those who have worked on projects dealing with heart monitors, water testing, economic fundamentals, missile trajectories, or the O2 concentration on the space shuttle.
How did you prepare for writing this sort of code: methodologically, intellectually, and emotionally?
Edit
I've marked this as community wiki in case the rep issue is keeping people from replying. I thought there would be a good deal more perspective on this issue than there has been.
While I am not personally involved in what is described there, this article will hopefully contribute to the spirit of your question: They Write the Right Stuff.
I wrote a driver for a blood pressure measuring device for hospital use. If it "fails", the patient will not have his blood pressure checked at the scheduled time; if his blood pressure is abnormal, no alarm (in the larger system) will be triggered. Such an event could be clinically significant.
My approach was to thoroughly read the spec/documentation in a non-work environment (to avoid the temptation to start coding right away), then read it again at work. After that, I summarized the possible states and actions on paper and "flowcharted" an algorithm, and annotated all the potential real-world "bad events" (cables getting unplugged, batteries dying, etc). Finally, I wrote and rewrote the driver three times, each with different mechanisms (e.g. FSM), and compared their results. Each iteration helped me identify weaknesses I hadn't yet discovered. The third rewrite was the "official" result. I reviewed each iteration with my co-worker.
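As a toy illustration of the FSM approach mentioned above (the states, events, and transitions here are invented for the example, not taken from the real driver):

    from enum import Enum, auto

    class State(Enum):
        IDLE = auto()
        MEASURING = auto()
        REPORTING = auto()
        FAULT = auto()

    class Event(Enum):
        START = auto()
        READING_OK = auto()
        CABLE_UNPLUGGED = auto()
        BATTERY_LOW = auto()

    TRANSITIONS = {
        (State.IDLE, Event.START): State.MEASURING,
        (State.MEASURING, Event.READING_OK): State.REPORTING,
        (State.MEASURING, Event.CABLE_UNPLUGGED): State.FAULT,
        (State.MEASURING, Event.BATTERY_LOW): State.FAULT,
    }

    def step(state, event):
        # Any (state, event) pair not explicitly allowed is treated as a fault,
        # so unexpected real-world events can never be silently ignored.
        return TRANSITIONS.get((state, event), State.FAULT)

    s = State.IDLE
    for e in (Event.START, Event.CABLE_UNPLUGGED):
        s = step(s, e)
    print(s)  # State.FAULT -> the larger system can raise an alarm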
Emotional preparation consisted of convincing myself that should the unthinkable happen, at least I wasn't willfully negligent -- just incompetent (the old "I'm only human" excuse). ;-)
I have written a computer interface to an MRI machine. It had no chance of hurting the end user, as it was just record management, but it could potentially have given an incorrect diagnosis or omitted important information.
Tests, lots and lots of tests.
Unit tests, mid- and high-level tests. Simulate all possible input combinations. Also do a great deal of testing with the hardware itself. Testing must be done in a complete and methodical way. It should take a great deal more time to test than to write.
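A minimal sketch of that "simulate all possible input combinations" idea; parse_record below is a hypothetical stand-in for the real interface:

    import itertools
    import unittest

    # Hypothetical parser for device records; stands in for the real interface.
    def parse_record(mode, polarity, value):
        if mode not in ("scan", "idle"):
            raise ValueError("unknown mode")
        return (mode, polarity, value if polarity == "+" else -value)

    class ExhaustiveInputTest(unittest.TestCase):
        def test_all_combinations(self):
            # Enumerate every combination of the (small) discrete input space.
            for mode, polarity, value in itertools.product(
                    ("scan", "idle"), ("+", "-"), (0, 1, 255)):
                mode_out, _, signed = parse_record(mode, polarity, value)
                self.assertEqual(mode_out, mode)
                self.assertEqual(abs(signed), value)

    if __name__ == "__main__":
        unittest.main()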
Error Reporting
All errors must be reported and be obvious. If it won't hurt the patient to do so, fail fast.
For something that is actively keeping a person alive, things are even worse. It must never stop working. If it fails, it needs to restart and keep trying. Redundant internals are also a must in case the hardware fails.
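A sketch of that restart-and-keep-trying behaviour; read_sensor is a made-up stand-in for the real hardware call, rigged to fail a few times and then succeed:

    import time

    def read_sensor(attempt):
        if attempt < 3:
            raise IOError("hardware glitch")
        return 42

    def supervise(max_attempts=10):
        backoff = 0.1
        for attempt in range(max_attempts):
            try:
                return read_sensor(attempt)
            except IOError as err:
                print(f"error (reported loudly): {err}")  # errors must be obvious
                time.sleep(backoff)
                backoff = min(backoff * 2, 5.0)  # back off, but keep trying
        raise RuntimeError("still failing; fail over to redundant hardware")

    print(supervise())  # retries through the glitches, then prints 42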
At the wrong company, it can be a really difficult kind of situation to work in. However, if things are going well, you are well funded, and release pressure is not high, it can be a very rewarding space to work in.
Not really an answer, but:
I've got a friend who writes embedded control software for laser eye surgery machines. When he had laser eye surgery himself, he made sure to go to an ophthalmologist who used his company's system. I have great admiration for this guy. I can't think of a piece of software I've ever written whose level of quality was high enough that I'd trust my own eyesight to it.
Right now I'm working on some base code for a system that retrieves medical patient information from clinics and hospitals for a medical billing office. We're starting out with a smaller client and a long break-in period to ensure quality, but eventually this code needs to securely handle a large variety of report formats from a number of clients at different facilities.
It's not quite on the same scale as your examples, but a bad mistake could result in the wrong people being billed, the right person billed at a defunct address (screwing up credit reports), or people opened up to identity theft, so it's still pretty critical. Oh yeah, and it could mean doctors don't get paid quite as quickly. That's important too, especially from a business perspective, but not in the same class as data protection and integrity.
I've heard crazy stories about the processes used to write code at NASA for the space shuttles. Every line of code has about 10-20 lines of documentation, along with tests, full revision history, etc. Every time a bug is found, not only is the code evaluated and repaired, but the entire procedure of writing code, the entire command chain, etc. is reviewed to answer the question: "What went wrong in our process that allowed this bug to get in in the first place?"
While nothing quite so important as an MRI machine or a blood pressure monitor, I did get tapped to do a rewrite of Blackjack when I worked for an online gambling provider. Blackjack is by far the most popular online game, and millions of dollars were going to go through this software (and did).
I wrote the game engine separate from the server and the client, and used Test Driven Development to ensure that what I was assuming was coming through in the results. I also had a wrapper "server" that had console output that would allow me to play. This was actually only useful in that it mimicked the real server interface, since playing a text version of blackjack isn't very fun or easy ("You draw a 10. You now have a 10 and a 6, while the dealer has a 6 showing. [bsd] >")
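In that spirit, a minimal test-first sketch of one game rule tested in isolation (this is not the engine described above, just an illustration of the approach):

    # Checking one blackjack rule in isolation from any server or client code.
    def hand_value(cards):
        """Score a blackjack hand; aces count as 11 unless that would bust."""
        value = sum(10 if c in ("J", "Q", "K") else 11 if c == "A" else int(c)
                    for c in cards)
        aces = cards.count("A")
        while value > 21 and aces:
            value -= 10
            aces -= 1
        return value

    def test_hand_value():
        assert hand_value(["A", "K"]) == 21       # blackjack
        assert hand_value(["A", "A", "9"]) == 21  # one ace demoted to 1
        assert hand_value(["K", "Q", "5"]) == 25  # bust

    test_hand_value()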
The game is still being run on some sites out there, and to my knowledge, has never had any financial bugs after years of play.
My first "real" software job was writing a GUI app for planning stereotactic brain surgery. Testing, testing, testing... absolutely no formal methods, engineering-style thoughts, just younger programmers cranking it out. When they started talking about using the software to control a robotic arm with a laser, without any serious engineering methods in place, i got a bit worried, left for more officey lands.
I created an information system application for the local government culture and tourism department on the island of Bali, which was installed in several tourism destinations, providing extensive information about the culture, maps, accommodations, etc.
If it failed, tourists probably couldn't get the information they needed most, might get cheated by brokers, or might get lost somewhere :)

How to deal with poorly informed customer choices [closed]

Here's a scenario I'm sure you're all familiar with.
You have a fairly "hands off" customer, who really doesn't want to get too involved in the decision making despite your best efforts.
An experienced development team spend hours discussing the pros and cons of a particular approach to a problem and come up with an elegant solution which avoids the pitfalls of the more obvious approaches.
The customer casually mentions after a quick glance that they want it changed. They have no understanding of all the usability / consistency issues you were trying to avoid in your very carefully thought out approach.
Despite explanations, customer isn't interested, they just want it changed.
You sigh and do what they ask, knowing full well what will happen next...
3 weeks later, customer says it isn't working well this way, could you change it? You suggest again your original solution, and they seize on it with enthusiasm. They invariably seem to have had a form of selective amnesia and blocked out their role in messing this up in the first place.
I'm sure many of you have gone through this. The thing which gets me is always when we know the time and effort that reasonably bright and able people have put in to really understanding the problem and trying to come up with a good solution. The frustration comes in contrasting this with the knowledge that the customer's choice is made in 3 minutes in a casual glance (or worse, by their managers who often don't even know what the project is really about). The icing on the cake is that it's usually made very late in the day.
I know that the agile methodologies are designed to solve exactly this kind of problem, but it requires a level of customer buy in that certain types of customers (people spending other peoples money usually) are just not willing to give.
Anyone any clever insight into how you deal with this?
EDIT: Oops - by the way, I'm not talking about any current or recent customer in this. It's purely hypothetical...
Make your customer pay for the effort you are putting into designing and developing the solution to their problem.
The more you work, the more you get paid. The customer will have to pay for their mistakes.
The customer will eventually learn to appreciate your experience and insight in the programming field.
Niyaz is correct; unfortunately, getting customer buy-in is difficult until they have been burned like this once before.
Additionally, describe the scenario above to the customer and state how much extra it would cost if you went three or four weeks down the line and had to rewrite it due to a change; then let them use the prototype. It may take a few days to put one together so they can see both options (theirs [the wrong way] and yours [the right way]). Remember, they are paying you not only for your ability to program but also for your experience and knowledge of the issues which crop up.
Whatever decision the customer makes, ensure that you get it documented, update your risk register for the project with the risks that the chosen implementation will incur, and speak to the project manager (if it's not you) about the mitigation plans for them.
I agree with Niyaz. However, at the time the customer suggests the change, you should work out what the impact of the change will be and how likely that impact is to happen. Then ask whoever is responsible for the deliverable (it's not always that customer) whether they approve the change.
Making the impact clear (more cost, lower reliability, longer delivery time, etc.) is very important to helping the customer make a decision. It's very important to describe the impacts on the project or their business in a factual way, and to assess how likely each impact is to occur. "Maybes" and "I feel"s are very easy to ignore.
After that, as long as the right people approve the change, and as long as they pay for it... well, you did give them what they wanted :)
One thing we have done with some success in the past in these kinds of situations is to hand the issues over to the client.
"OK, you want to change it - this is
what will happen if you do that. These
are the issues involved. You have a
think about how you'd like it to work
and then get back to us".
This approach doesn't tend to yield good solutions (unsurprisingly) but does tend to let the client see that it's not a "gut feeling", wild stab in the dark kind of question.
And failing that, it usually makes them stop asking you to change it!
Usually a scenario like this is caused by one of two things: the people who are supposed to give you the requirement specifications either don't put their hearts into the project because they have no interest in it, or they really have no idea what they want.
Agile programming is one of the best ways to handle this, but there are other ways. Personally, I usually use a classic waterfall method, so spiral and agile methods are out of the question. But this doesn't mean that you can't use prototypes.
As a matter of fact, using a prototype would probably be the most helpful tool to use. Think about the iceberg effect. The secret is that People Who Aren't Programmers Do Not Understand This: http://img134.imageshack.us/my.php?image=icebergbelowwater.jpg
"You know how an iceberg is 90% underwater? Well, most software is like that too -- there's a pretty user interface that takes about 10% of the work, and then 90% of the programming work is under the covers...." - Joel Spolsky
Generating the prototype takes time and effort, but it is the most effective way to gather requirements. On my project team, the UI designer was the one who made the prototypes. If you give the users a prototype (at least a working interface showing what the application is going to look and feel like), then you will get lots of criticism, which can lead to desires and requirements. It can look like the comments on YouTube, but it's a start.
Second issue:
The customer casually mentions after a quick glance that they want it changed. They have no understanding of all the usability / consistency issues you were trying to avoid in your very carefully thought out approach.
Generate another prototype. The key here is results that the users would like to see, rather than advice that they have to listen to.
But if all else fails, you can always list the pros and cons of why you implemented your solution, even if the particular solution they like is not the one you insisted on. Make that part of the documentation as readable as possible. For example:
Problem:
The park is where all the good looking women jog to stay in shape. Johnny Bravo loves enjoying "mother nature's beauty", so he's lookin to blend in... you know... lookin all buff and do a little jogging while chasing tail.
Alternative Solutions:
1) Put on black suede shoes to look as stylish as you can.
2) Put on a pair of Nike's. Essential shoes for running. Try the latest styles.
Implemented Solution:
Black suede shoes were top choice because... well because hot mommies dig black suede shoes.
Or else, if they won't pay for the effort, just avoid putting that many resources into solving the problem; give them exactly what they've asked for, and think about it again after the three weeks have passed.
Somewhat frustrating, yes, but that's the way it'll always be with that kind of customer. At least you won't be losing money.
