I was recently looking at security issues from online-polls and the problem with online-elections and how they can sometimes very easily be tampered with.
Now it sprung to my eye that a lot of websites that I visit and even local newspapers in my area use "pinpoll" for online-polls.
So I wanted to know how trustworthy and secure these polls are?
Tobias here, Founder and CEO of Pinpoll.
I agree with #GreyFairer, let's not discuss this on SO (unless you want to know why fingerprinting libraries shouldn't be used to identify individual clients or how Pinpoll is applying ws to broadcast live updates across the globe).
Just send me an e-mail to privacy#pinpoll.com and I'm happy to explain to you in more detail what we do (and cannot do) to protect polls agains bots and fake votes.
And let me make one thing clear: we're one of the most trustworthy providers in Europe, especially when it comes to complying with the EU's strict data protection laws.
One example: You won't find a single request to a server other than our own (located in the EU) in our interactive elements.
And one last thing: What might be annoying to you (which is fully accepted), is interesting and entertaining to others. So let's agree to disagree when it comes to online polls in news portals ;)
For my bachelor´s degree I got the task to implement a B2B communication for an ERP system developed by the company I currently work for. Because it should also be able to communicate with other software I consider using EDI messages (EDIFACT) or maybe cXML. What is the best way to approach this task.
I had the idea to translate the EDIFACT message into xml defined by one xsd describing every EDIFACT message.
Then I would write the xml into the database or to business objects using a selfwritten mapper.
For writing EDIFACT messages I just use the same methods the other way round.
I thought using XML transformation first would be easier for the mapping and gives the opportunity using the xml for other purposes like writing other edi formats.
The other idea is to just use cXML and map it.
What is the best approach to this task?
You're essentially designing and implementing a public facing API for the ERP, so you need to consider security, reliability, non-repudiation, impact on business process under normal and abnormal conditions.
You'll also need to consider (ask) what sorts of information your customers will need to exchange with their partners (master data, transactional messages, financial information, etc).
I'd start by looking at the most commonly exchanged messages in the industry most representative of the ERPs users - look for message content and structure.
whether you choose to use EDIFACT, ANSI X12, cXML, XCBL, GS1XML, ebXML or something else is less important than good documentation and flexibility. it's unlikely that your choice will be exactly what any of your customers need without further transformation. you don't want to invent a new any to any transformation tool, and you probably don't even want to bundle an existing one.
I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?
Typically data-driven code is easier to read and maintain. I know I've seen cases where data-driven has been taken to the extreme and winds up very unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.
The pragmatic programmers remain in my mind the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished with very little space, and make it easy to make modifications.
A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income above $100,000 is taxed at 30%. If you were write this all out in code, it'd look about as you suspect:
total_tax_burden(income) {
if (income < 1000)
return 0
if (income < 10000)
return .1 * (income - 1000)
if (income < 100000)
return 999.9 + .2 * (income - 10000)
return 18999.7 + .3 * (income - 100000)
}
Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.
But if it were data-driven, you could store this table in a configuration file:
1000:0
10000:10
100000:20
inf:30
Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If congress decides that 1000 brackets would be better, anyone could make the tables line up with the IRS tables, and be done with it, no code recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:
#! /bin/bash
# $Id$
# Copyright (C) 2002-2007 Novell/SUSE
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation, version 2 of the
# License.
#=NAME open
#=DESCRIPTION
# Verify that the open syscall is correctly managed for confined profiles.
#=END
pwd=`dirname $0`
pwd=`cd $pwd ; /bin/pwd`
bin=$pwd
. $bin/prologue.inc
file=$tmpdir/file
okperm=rw
badperm1=r
badperm2=w
# PASS UNCONFINED
runchecktest "OPEN unconfined RW (create) " pass $file
# PASS TEST (the file shouldn't exist, so open should create it
rm -f ${file}
genprofile $file:$okperm
runchecktest "OPEN RW (create) " pass $file
# PASS TEST
genprofile $file:$okperm
runchecktest "OPEN RW" pass $file
# FAILURE TEST (1)
genprofile $file:$badperm1
runchecktest "OPEN R" fail $file
# FAILURE TEST (2)
genprofile $file:$badperm2
runchecktest "OPEN W" fail $file
# FAILURE TEST (3)
genprofile $file:$badperm1 cap:dac_override
runchecktest "OPEN R+dac_override" fail $file
# FAILURE TEST (4)
# This is testing for bug: https://bugs.wirex.com/show_bug.cgi?id=2885
# When we open O_CREAT|O_RDWR, we are (were?) allowing only write access
# to be required.
rm -f ${file}
genprofile $file:$badperm2
runchecktest "OPEN W (create)" fail $file
It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.
In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.
For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.
This algorithm is written in C and reads a file that contains an input table provided by the management with the various per-item prices and the corresponding order size thresholds. Most people would argue that a file with a simple input table is, of course, data.
Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.
The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as log() and tan(), as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation.
cost = (base * ordered * exchange * ... + ... / ...)^13
Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.
The store receives a large amount of complaints from clients that became brain-dead trying to estimate their expenses and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.
The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpretter instead. The input file now contains a Python function that polls a roomfull of Fib(n) monkeys for the cost of large orders.
Question: Is this input file data?
From a strict technical point, there is nothing different. Both the table and the expression needed to be parsed before usage. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).
Yet now many people would argue that the input file just became code.
So what is the distinguishing feature that turns the input format from data into code?
Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built-in at compile time. And let's not forget that many applications still have icons - that most people would deem data - built in their executables.
Input format: This is the - in my opinion, naively - most common factor that people consider: "If it is in a programming language then it is code". Fine, C is code - you have to compile it after all. I would also agree that Python is also code - it is a full blown language. So why isn't XML/XSL code? XSL is a quite complex language in its own right - hence the L in its name.
In my opinion, none of these two criteria is the actual distinguishing feature. I think that people should consider something else:
Maintainability: In short, if the user of the system has to hire a third party to make the expertise needed to modify the behaviour of the system available, then the system should be considered code-centric to a degree.
This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.
It also means that the distinction can be impacted by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there was some kind of third-party high-level AI tool that parses natural language and produces XML/Python/whatever, then the system becomes data-driven even for far more complex input.
A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.
On the other hand, a multi-billion international corporation usually has in its payroll a bunch of IT specialists and Web designers. Therefore, XML/XSL, Javascript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.
I believe that when designing a software system, one should strive to achieve that fine balance in the used input formats where the target audience can do what they need to, without having to frequently call on third parties.
It should be noted that outsourcing blurs the lines even more. There are quite a few issues, for which the current technology simply does not allow the solution to be approachable by the layman. In that case the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced to.
That third party can be expected to employ a fair number of experts.
One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
It is often shortened to, "write stupid code that uses smart data."
Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).
Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:
How does the data-driven approach compare to a more traditional software development
process wherein the programmer codes explicit rules? ... Clearly, the handwritten rules are difficult to develop and maintain. The big
advantage of the data-driven method is that so much knowledge is encoded in the data,
and new knowledge can be added just by collecting more data. But another advantage is
that, while the data can be massive, the code is succinct—about 50 lines for correct, compared to over 1,500 for ht://Dig’s spelling code. ...
Another issue is portability. If we wanted a Latvian spelling-corrector, the English
metaphone rules would be of little use. To port the data-driven correct algorithm to another
language, all we need is a large corpus of Latvian; the code remains unchanged.
He shows this concretely with code in Python using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- in just a couple pages, again, where Grady Booch's book spent dozens without even finishing it.
"The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.
I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.
In languages in which code can be treated as data it is a non-issue. You use what's clear, brief, and maintainable, leaning towards data, code, functional, OO, or procedural, as the solution requires.
In procedural, the distinction is marked, and we tend to think about data as something stored in an specific way, but even in procedural it is best to hide the data behind an API, or behind an object in OO.
A lookup(avalue) can be reimplemented in many different ways during its lifetime, as long as its starts as a function.
...All the time I desing programs for nonexisting machines and add: 'if we now had a machine comprising the primitives here assumed, then the job is done.'
... In actual practice, of course, this ideal machine will turn out not to exist, so our next task --structurally similar to the original one-- is to program the simulation of the "upper" machine... But this bunch of programs is written for a machine that in all probability will not exist, so our next job will be to simulate it in terms of programs for a next lower level machine, etc., until finally we have a program that can be executed by our hardware...
E. W. Dijkstra in Notes on Structured Programming, 1969, as quoted by John Allen, in Anatomy of Lisp, 1978.
When I think of this philosophy which I agree with quite a bit, the first thing that comes to mind is code efficiency.
When I'm making code I know for sure it isn't always anything close to perfect or even fully knowledgeable. Knowing enough to get close to maximum efficiency out of a machine when it is needed and good efficiency the rest of the time (perhaps trading off for better workflow) has allowed me to produce high quality finished products.
Coding in a data driven way, you end up using code for what code is for. To go and 'outsource' every variable to files would be foolishly extreme, the functionality of a program needs to be in the program and the content, settings and other factors can be managed by the program.
This also allows for much more dynamic applications and new features.
If you have even a simple form of database, you are able to apply the same functionality to many states. You may also do all manner of creative things like changing the context of what your program is doing based on file header data or perhaps directory, file name or extension, though not all data is necessarily stored on a filesystem.
Finally keeping your code in a state where it is simply handling data puts you in a state of mind where you are closer to envisioning what is actually going on. This also keeps the bulk out of your code, greatly reducing bloatware.
I believe it makes code more maintainable, more flexible and more efficient aaaand I like it.
Thank you to the others for your input on this as well! I found it very encouraging.
**update: horray! so it is a journey of practice and understanding. ;) now i no longer feel so dumb.*
I have read up many articles on REST, and coded up several rails apps that makes use of RESTful resources. However, I never really felt like I fully understood what it is, and what is the difference between RESTful and not-restful. I also have a hard time explaining to people why/when they should use it.
If there is someone who have found a very clear explanation for REST and circumstances on when/why/where to use it, (and when not to) it would benefit the world if you could put it up, thanks! =)
REST is usually learned like this:
You hear about REST being using HTTP the way it was meant to be used, and from that you shun SOAP Web Services' envelopes, since most of what's needed by many SOAP standards are handled by HTTP in a simple, no-nonsense way. You also quickly learn that you need to use the right method for the right operation.
Later, perhaps years later, you hear that REST is more than that. REST is in fact also the concept of linking between resources. This often takes a while to grasp the full meaning of, but when you learn this, you start introducing hyperlinks into your responses so that clients can navigate your system without being coupled to how the server wants to name its resources (i.e. the URIs).
Even later, you learn that you still haven't understood REST! And this is because you find out that media types are important. You start making media types called application/vnd.example.foo+json and put hyperlinks in them, since that's already your understanding of REST.
Years pass, and you re-read Fielding's thesis for the umpteenth time, to see if there's anything you missed, and it suddenly dawns upon you what really the HATEOAS constraint is: It's about the client not having any notion of how the server's resources are structured, but that it discoveres these relationships at runtime. It also means that the screen in front of the user is driven completely by what is passed over the wire, so in fact, if a server passes an image/jpeg then that's what you're supposed to show to the user, not an error message saying "AtomProcessor can't handle image/jpeg".
I'm just coming to terms with #4 and I'm hoping the ladder isn't much longer! It's taken me seven years.
This article does a good job classifying the differences in several http application styles from WS-* to RESTian purity. What I like about this post is it reminds you that most of what we call REST really is something only partly in line with Roy Fielding's original definition.
InfoQ has a whole section addressing more of the "what is REST" angle as well.
In terms of REST vs. SOAP, this question seems to have a number of good responses, particularly the selected answer.
I would imagine YMMV, but I found it very easy to start understanding the details of REST after I realised how REST essentially was a continuation of the static WWW concepts into the web application design space. I had written (a rather longish) post on the same : Why REST?
Scalability is an obvious benefit of REST (stateless, caching).
But also - and this is probably the main benefit of hypertext - REST is ideal for when you have lots of clients to your service. Following REST and the hypertext constraint drastically reduces the coupling between all those clients and your server, which means you have more freedom when evolving/developing your service over time - you are not tied down by the risk of breaking would-be-coupled clients.
On a practical note, if you're working with rails - then restfulie is a great little framework for tackling hypertext on the client and server. Server side is a rails extension, and client is a DSL for handling state changes. Interesting stuff, check it out here: http://restfulie.caelum.com.br/ - I highly recommend the tutorial/demo vids they have up on vimeo :)
Content-Type: text/x-flamebait
I've been asking the same question lately, and my supposition is that
half the problem with explaining why full-on REST is a good thing when
defining an interface for machine-consumed data is that much of the
time it isn't. OK, you'd need a really good reason to ignore the
commonsense bits (URLs define resources, HTTP verbs define actions,
etc etc) - I'm in no way suggesting we go back to the abomination that
was SOAP. But doing HATEOAS in a way that is both Fielding-approved
(no non-standard media types) and machine-friendly seems to offer
diminishing returns: it's all very well using a standard media type to
describe the valid transitions (if such a media type exists) but where
the application is at all complicated your consumer's agent still
needs to know which are the right transitions to make to achieve the
desired goal (a ticket purchase, or whatever), and it can't do that
unless your consumer (a human) tells it. And if he's required to
build into his program the out-of-band knowledge that the path with
linkrels create_order => add_line => add_payment_info => confirm is
the correct one, and reset_order is not the right path, then I don't
see that it's so much more grievous a sin to make him teach his XML
parser what to do with application/x-vnd.yourname.order.
I mean, obviously yes it's less work all round if there's a suitable
standard format with libraries and whatnot that can be reused, but in
the (probably more common) case that there isn't, your options
according to Fielding-REST are (a) create a standard, or (b) to
augment the client by downloading code to it. If you're merely
looking to get the job done and not to change the world, option (c)
"just make something up" probably looks quite tempting and I for one wouldn't
blame you for taking it.
I'm maintaining a program that needs to parse out data that is present in an "almost structured" form in text. i.e. various programs that produce it use slightly different formats, it may have been printed out and OCR'd back in (yeah, I know) with errors, etc. so I need to use heuristics that guess how it was produced and apply different quirks modes, etc. It's frustrating, because I'm somewhat familiar with the theory and practice of parsing if things are well behaved, and there are nice parsing frameworks etc. out there, but the unreliability of the data has led me to write some very sloppy ad-hoc code. It's OK at the moment but I'm worried that as I expand it to process more variations and more complex data, things will get out of hand. So my question is:
Since there are a fair number of existing commercial products that do related things ("quirks modes" in web browsers, error interpretation in compilers, even natural language processing and data mining, etc.) I'm sure some smart people have put thought into this, and tried to develop a theory, so what are the best sources for background reading on parsing unprincipled data in as principled a manner as possible?
I realize this is somewhat open-ended, but my problem is that I think I need more background to even know what the right questions to ask are.
Given the choice between what you've proposed and fighting a hungry crocodile while covered in raw-beef-flavored marmalade and both hands tied behind my back, I'd choose the ...
Well, OK on a more serious note, if you have data that doesn't abide by the any "sane" structure, you have to study the data and find frequencies of quirks in it and correlate the data for the given context (i.e. how it was generated)
Print to OCR to get the data in is almost always going to lead to heart break. The company I work for employs a veritable army of people who manually read such documents and hand "code" (i.e. enter by hand) the data for known problematic OCR scenarios, or documents our customers detect the original OCR failed on.
As for leveraging "Parsing Frameworks" these tend to expect data that will always follow the grammar rules you've laid out. The data you've described has no such guarantees. If you go that route be prepared for unexpected - though not always obvious - failures.
By all means if there is any way possible to get the original data files, do so. Or if you can demand that those providing the data make their data come in a single well defined format, even better. (It might not be "YOUR" format, but at least it's a regular and predictable format you can convert from)