Is there a way to preprocess ruby code and find errors that would occur runtime? - ruby-on-rails

We have huge code base and we are generating issues that would have been caught at compile time in type languages such as Java but we are not catching them until runtime in Ruby. This is bad since we generate bugs that most of the time are typos or refactoring that leaves some invalid code.
Example:
def mysuperfunc
# some code goes here
# this was a valid call but not anymore since enforcesecurity
# signature changed
#system.enforcesecurity
end
I mean, IDEs can do it but some guys use ATOM or sublime, so we need something to "compile" and report that kind of issues so they don't reach deployment. What have you been using?
This is generating a little percentage of our bug reports, but since we are forced to produce at a ridiculous pace we don't have 100% code coverage. If there is no tool to help, I'll just make sure everybody uses and IDE and run the reports with tools such as Rubymine.
Our stack includes, rspec, minitest, SimpleCov. We enforce code reviews, multistack deployments (dev, qa, pre-prod, sandbox, prod). And still some issues are reaching higher level and makes us programmers look bad. I'm not looking of magic, just a little automation that might help a bit.

Unfortunately, the Halting Problem, Rice's Theorem, and all the other Undecidability and Uncomputability Results tell us that it is simply impossible in the general case to statically determine any "interesting" property about the runtime behavior of a program. We cannot even statically determine something as simple as "will it halt", so how are we going to determine "is bug-free"?
There are certain things that can be statically determined, and there are certain restricted programs for which some interesting properties can be statically determined, but largely, this is not possible. And even to the small extent that it is possible, it generally requires the language to be specifically designed to be easy to statically analyze (which Ruby isn't).
That being said, there are certain tools that contain certain heuristics to point out code that may have problems. There are certain coding standards that may help avoid bugs, and there are tools to enforce those coding standards. Keywords to search for are "code quality tools", "linter", "static analyzer", etc. You have already been given examples in the other answers and comments, and given those examples and these keywords, you'll likely find more.
However, I also wanted to discuss something you wrote:
we are forced to produce at a ridiculous pace we don't have 100% code coverage
That's a problem, which has to be approached from two sides:
Practice, practice, practice. You need to practice testing and writing high-quality code until it is so naturally to you that not doing it actually ends up being harder and slower. It should become second nature to you, such that under pressure when your mind goes blank, the only thing you know is to write tests and write well-designed, well-factored, high-quality code. Note: I'm talking about deliberate practice, which means setting time aside to really practice … and practice is practice, it's not work, it's not fun, it's not hobby, if you don't delete the code you wrote immediately after you have written it, you are not practicing, you are working.
Sustainable Pace. You should never develop faster than the pace you could sustain indefinitely while still producing well-tested, well-designed, well-factored, high-quality code, having a fulfilling social life, no stress, plenty of free time, etc. This is something that has to be backed and supported and understood by management.

I'm unaware of anything exactly like you want. However, there are a few gems that will analyze code and warn you about some errors and/or bad practices. Try these:
https://github.com/bbatsov/rubocop
https://github.com/railsbp/rails_best_practices

FLAY
https://rubygems.org/gems/flay
Via the repo https://github.com/seattlerb/flay:
DESCRIPTION:
Flay analyzes code for structural similarities. Differences in literal
values, variable, class, method names, whitespace, programming style,
braces vs do/end, etc are all ignored. Making this totally rad.
[FEATURES:]
Reports differences at any level of code.
Adds a score multiplier to identical nodes.
Differences in literal values, variable, class, and method names are ignored.
Differences in whitespace, programming style, braces vs do/end, etc are ignored.
Works across files.
Add the flay-persistent plugin to work across large/many projects.
Run --diff to see an N-way diff of the code.
Provides conservative (default) and --liberal pruning options.
Provides --fuzzy duplication detection.
Language independent: Plugin system allows other languages to be flayed.
Ships with .rb and .erb.
javascript and others will be
available separately.
Includes FlayTask for Rakefiles.
Uses path_expander, so you can use:
dir_arg -- expand a directory automatically
#file_of_args -- persist arguments in a file
-path_to_subtract -- ignore intersecting subsets of
files/directories
Skips files matched via patterns in .flayignore (subset format of .gitignore).
Totally rad.
FLOG
https://rubygems.org/gems/flog
Via the repo https://github.com/seattlerb/flog:
DESCRIPTION:
Flog reports the most tortured code in an easy to read pain report.
The higher the score, the more pain the code is in.
[FEATURES:]
Easy to read reporting of complexity/pain.
Uses path_expander, so you can use:
dir_arg – expand a directory automatically
#file_of_args – persist arguments in a file
-path_to_subtract – ignore intersecting subsets of files/directories
SYNOPSIS:
% ./bin/flog -g lib
Total Flog = 1097.2 (17.4 flog / method)
323.8: Flog total
85.3: Flog#output_details
61.9: Flog#process_iter
53.7: Flog#parse_options
...

There is a ruby gem called guard that does automated testing. You can set your own custom rules.
For example, you can make it where anytime you modify certain files, the test framework will automatically run.
Here is the link for guard

Related

Is there a sandboxable compiled programming language simililar to lua

I'm working on a crowd simulator. The idea is people walking around a city in 2D. Think gray rectangles for the buildings and colored dots for the people. Now I want these people to be programmable by other people, without giving them access to the core back end.
I also don't want them to be able to use anything other than the methods I provide for them. Meaning no file access, internet access, RNG, nothing.
They will receive get events like "You have just been instructed to go to X" or "You have arrived at P" and such.
The script should then allow them to do things like move_forward or how_many_people_are_in_front_of me and such.
Now I have found out that Lua and python are both thousands of times slower than compiled languages (I figured it would be in order of magnitude of 10s times slower), which is way to slow for my simulation.
So heres my question: Is there a programming language that is FOSS, allows me to restrict system access (sandboxing) the entire language to limit the amount of information the script has by only allowing it to use my provided functions, that is reasonably fast, something like <10x slower than Java, where I can send events to objects inside that language with which I can load in new Classes/Objects on the fly.
Don't you think that if there was a scripting language faster than lua and python, then it'd be talked about at least as much as they are?
The speed of a scripting language is rather vague term. Scripting languages essentially are converted to a series of calls to functions written in fast compiled languages. But the functions are usually written to be general with lots of checks and fail-safes, rather than to be fast. For some problems, not a lot of redundant actions stacks up and the script translation results in essentially same machine code as the compiled program would have. For other problems, a person, knowledgeable about the language, might coerce it to translate to essentially same machine code. For other problems the price of convenience stay forever with the script.
If you look at the timings of benchmark tasks, you'll find that there's no consistent winner across them. For one task the language is fastest, for the other it is way behind.
It would make sense to gauge language speed at your task by looking at similar tasks in benchmarks. So, which of those problem maps the closest to yours? My guess would be: none.
Now, onto the question of user programs inside your program.
That's how script languages came to existence in the first place. You can read up on why such a language may be slow for example in SICP.
If you evaluate what you expect people to write in their programs, you might decide, that you don't need to give them whole programming language. Then you may give them a simple set of instructions they can use to describe a few branching decisions and value lookups. Then your own very performant program will construct an object that encompasses the described logic. This tric is described here and there.
However if you keep adding more and more complex commands for users to invoke, you'll just end up inventing your own language. At that point you'll likely wish you'd went with Lua from the very beginning.
That being said, I don't think the snippet below will run significantly different in compiled code, your own interpreter object, or any embedded scripting language:
if event = "You have just been instructed to go to X":
set_front_of_me(X) # call your function
n = how_many_people_are_in_front_of_me() #call to your function
if n > 3:
move_to_side() #call to function provided by you
else:
move_forward() #call to function provided by you
Now, if the users would need to do complex computer-sciency stuff, solve np-problems, do machine learning or other matrix multiplications, then yes, that would be slow, provided someone would actually trouble themselves with implementing that.
If you get to that point, it seem that there are at least some possibilities to sandbox the compiled dlls (at least in some languages). Or you could do compilation of users' code yourself to control the functionality they invoke and then plug it in as a library.

reading/parsing common lisp files from lisp without all packages available or loading everything

I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in 2.3. So you can't easily intervene in this.
Secondly there's not portably enough information in any kind of condition the reader might signal to be able to handle them.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of intofmration.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by #Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc with ones which just make a list, which sounds less heroic than the suggestions from #tfb, but I don't have time for that shit.

Which of the three mutually exclusive Clang sanitizers should I default to?

Clang has a number of sanitizers that enable runtime checks for questionable behavior. Unfortunately, they can't all be enabled at once.
It is not possible to combine more than one of the -fsanitize=address, -fsanitize=thread, and -fsanitize=memory checkers in the same program.
To make things worse, each of those three seems too useful to leave out. AddressSanitizer checks for memory errors, ThreadSanitizer checks for race conditions and MemorySanitizer checks for uninitialized reads. I'm worried about all of those things!
Obviously, if I have a hunch about where a bug lies, I can choose a sanitizer according to that. But what if I don't? Going further, what if I want to use the sanitizers as a preventative tool rather than a diagnostic one, to point out bugs that I didn't even know about?
In other words, given that I'm not looking for anything in particular, which sanitizer should I compile with by default? Am I just expected to compile and test the whole program three times, once for each sanitizer?
As you pointed out, sanitizers are typically mutually exclusive (you can combine only Asan+UBsan+Lsan, via -fsanitize=address,undefined,leak, maybe also add Isan via -fsanitize=...,integer if your program does not contain intentional unsigned overflows) so the only way to ensure complete coverage is to do separate QA runs with each of them (which implies rebuilding SW for every run). BTW doing yet another run with Valgrind is also recommended.
Using Asan in production has two aspects. On one hand common experience is that some bugs can only be detected in production so you do want to occasionally run sanitized builds there, to increase test coverage [*]. On the other hand Asan has been reported to increase attack surface in some cases (see e.g. this oss-security report) so using it as hardening solution (to prevent bugs rather than detect them) has been discouraged.
[*] As a side note, Asan developers also strongly suggest using fuzzing to increase coverage (see e.g. Cppcon15 and CppCon17 talks).
[**] See Asan FAQ for ways to make AddressSanitizer more strict (look for "aggressive diagnostics")

Find unused code in a Rails app

How do I find what code is and isn't being run in production ?
The app is well-tested, but there's a lot of tests that test unused code. Hence they get coverage when running tests... I'd like to refactor and clean up this mess, it keeps wasting my time.
I have a lot of background jobs, this is why I'd like the production env to guide me. Running at heroku I can spin up dynos to compensate any performance impacts from the profiler.
Related question How can I find unused methods in a Ruby app? not helpful.
Bonus: metrics to show how often a line of code is run. Don't know why I want it, but I do! :)
Under normal circumstances the approach would be to use your test data for code coverage, but as you say you have parts of your code that are tested but are not used on the production app, you could do something slightly different.
Just for clarity first: Don't trust automatic tools. They will only show you results for things you actively test, nothing more.
With the disclaimer behind us, I propose you use a code coverage tool (like rcov or simplecov for Ruby 1.9) on your production app and measure the code paths that are actually used by your users. While these tools were originally designed for measuring test coverage, you could also use them for production coverage
Under the assumption that during the test time-frame all relevant code paths are visited, you can remove the rest. Unfortunately, this assumption will most probably not fully hold. So you will still have to apply your knowledge of the app and its inner workings when removing parts. This is even more important when removing declarative parts (like model references) as those are often not directly run but only used for configuring other parts of the system.
Another approach which could be combined with the above is to try to refactor your app into distinguished features that you can turn on and off. Then you can turn features that are suspected to be unused off and check if nobody complains :)
And as a final note: you won't find a magic tool to do your full analysis. That's because no tool can know whether a certain piece of code is used by actual users or not. The only thing that tools can do is create (more or less) static reachability graphs, telling you if your code is somehow called from a certain point. With a dynamic language like Ruby even this is rather hard to achieve, as static analysis doesn't bring much insight in the face of meta-programming or dynamic calls that are heavily used in a rails context. So some tools actually run your code or try to get insight from test coverage. But there is definitely no magic spell.
So given the high internal (mostly hidden) complexity of a rails application, you will not get around to do most of the analysis by hand. The best advice would probably be to try to modularize your app and turn off certain modules to test f they are not used. This can be supported by proper integration tests.
Checkout the coverband gem, it does what you exactly what are you searching.
Maybe you can try to use rails_best_practices to check unused methods and class.
Here it is in the github: https://github.com/railsbp/rails_best_practices .
Put 'gem "rails_best_practices" ' in your Gemfile and then run rails_best_practices . to generate configuration file
I had the same problem and after exploring some alternatives I realized that I have all the info available out of the box - log files. Our log format is as follows
Dec 18 03:10:41 ip-xx-xx-xx-xx appname-p[7776]: Processing by MyController#show as HTML
So I created a simple script to parse this info
zfgrep Processing production.log*.gz |awk '{print $8}' > ~/tmp/action
sort ~/tmp/action | uniq -c |sort -g -r > ~/tmp/histogram
Which produced results of how often an given controller#action was accessed.
4394886 MyController#index
3237203 MyController#show
1644765 MyController#edit
Next step is to compare it to the list of all controller#action pair in the app (using rake routes output or can do the same script for testing suite)
You got already the idea to mark suspicious methods as private (what will maybe break your application).
A small variation I did in the past: Add a small piece code to all suspicious methods to log it. In my case it was a user popup "You called a obsolete function - if you really need please contact the IT".
After one year we had a good overview what was really used (it was a business application and there where functions needed only once a year).
In your case you should only log the usage. Everything what is not logged after a reasonable period is unused.
I'm not very familiar with Ruby and RoR, but what I'd suggest some crazy guess:
add :after_filter method wich logs name of previous called method(grab it from call stack) to file
deploy this to production
wait for a while
remove all methods that are not in log.
p.s. probably solution with Alt+F7 in NetBeans or RubyMine is much better :)
Metaprogramming
Object#method_missing
override Object#method_missing. Inside, log the calling Class and method, asynchronously, to a data store. Then manually call the original method with the proper arguments, based on the arguments passed to method_missing.
Object tree
Then compare the data in the data store to the contents of the application's object tree.
disclaimer: This will surely require significant performance and resource consideration. Also, it will take a little tinkering to get that to work, but theoretically it should work. I'll leave it as an exercise to the original poster to implement it. ;)
Have you tried creating a test suite using something like sahi you could then record all your user journies using this and tie those tests to rcov or something similar.
You do have to ensure you have all user journies but after that you can look at what rcov spits out and at least start to prune out stuff that is obviously never covered.
This isn't a very proactive approach, but I've often used results gathered from New Relic to see if something I suspected as being unused had been called in production anytime in the past month or so. The apps I've used this on have been pretty small though, and its decently expensive for larger applications.
I've never used it myself, but this post about the laser gem seems to talk about solving your exact problem.
mark suspicious methods as private. If that does not break the code, check if the methods are used inside the class. then you can delete things
It is not the perfect solution, but for example in NetBeans you can find usages of the methods by right click on them (or press Alt+F7).
So if method is unused, you will see it.

Thoughts on minimize code and maximize data philosophy

I have heard of the concept of minimizing code and maximizing data, and was wondering what advice other people can give me on how/why I should do this when building my own systems?
Typically data-driven code is easier to read and maintain. I know I've seen cases where data-driven has been taken to the extreme and winds up very unusable (I'm thinking of some SAP deployments I've used), but coding your own "Domain Specific Languages" to help you build your software is typically a huge time saver.
The pragmatic programmers remain in my mind the most vivid advocates of writing little languages that I have read. Little state machines that run little input languages can get a lot accomplished with very little space, and make it easy to make modifications.
A specific example: consider a progressive income tax system, with tax brackets at $1,000, $10,000, and $100,000 USD. Income below $1,000 is untaxed. Income between $1,000 and $9,999 is taxed at 10%. Income between $10,000 and $99,999 is taxed at 20%. And income above $100,000 is taxed at 30%. If you were write this all out in code, it'd look about as you suspect:
total_tax_burden(income) {
if (income < 1000)
return 0
if (income < 10000)
return .1 * (income - 1000)
if (income < 100000)
return 999.9 + .2 * (income - 10000)
return 18999.7 + .3 * (income - 100000)
}
Adding new tax brackets, changing the existing brackets, or changing the tax burden in the brackets, would all require modifying the code and recompiling.
But if it were data-driven, you could store this table in a configuration file:
1000:0
10000:10
100000:20
inf:30
Write a little tool to parse this table and do the lookups (not very difficult, right?) and now anyone can easily maintain the tax rate tables. If congress decides that 1000 brackets would be better, anyone could make the tables line up with the IRS tables, and be done with it, no code recompiling necessary. The same generic code could be used for one bracket or hundreds of brackets.
And now for something that is a little less obvious: testing. The AppArmor project has hundreds of tests for what system calls should do when various profiles are loaded. One sample test looks like this:
#! /bin/bash
# $Id$
# Copyright (C) 2002-2007 Novell/SUSE
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License as
# published by the Free Software Foundation, version 2 of the
# License.
#=NAME open
#=DESCRIPTION
# Verify that the open syscall is correctly managed for confined profiles.
#=END
pwd=`dirname $0`
pwd=`cd $pwd ; /bin/pwd`
bin=$pwd
. $bin/prologue.inc
file=$tmpdir/file
okperm=rw
badperm1=r
badperm2=w
# PASS UNCONFINED
runchecktest "OPEN unconfined RW (create) " pass $file
# PASS TEST (the file shouldn't exist, so open should create it
rm -f ${file}
genprofile $file:$okperm
runchecktest "OPEN RW (create) " pass $file
# PASS TEST
genprofile $file:$okperm
runchecktest "OPEN RW" pass $file
# FAILURE TEST (1)
genprofile $file:$badperm1
runchecktest "OPEN R" fail $file
# FAILURE TEST (2)
genprofile $file:$badperm2
runchecktest "OPEN W" fail $file
# FAILURE TEST (3)
genprofile $file:$badperm1 cap:dac_override
runchecktest "OPEN R+dac_override" fail $file
# FAILURE TEST (4)
# This is testing for bug: https://bugs.wirex.com/show_bug.cgi?id=2885
# When we open O_CREAT|O_RDWR, we are (were?) allowing only write access
# to be required.
rm -f ${file}
genprofile $file:$badperm2
runchecktest "OPEN W (create)" fail $file
It relies on some helper functions to generate and load profiles, test the results of the functions, and report back to users. It is far easier to extend these little test scripts than it is to write this sort of functionality without a little language. Yes, these are shell scripts, but they are so far removed from actual shell scripts ;) that they are practically data.
I hope this helps motivate data-driven programming; I'm afraid I'm not as eloquent as others who have written about it, and I certainly haven't gotten good at it, but I try.
In modern software the line between code and data can become awfully thin and blurry, and it is not always easy to tell the two apart. After all, as far as the computer is concerned, everything is data, unless it is determined by existing code - normally the OS - to be otherwise. Even programs have to be loaded into memory as data, before the CPU can execute them.
For example, imagine an algorithm that computes the cost of an order, where larger orders get lower prices per item. It is part of a larger software system in a store, written in C.
This algorithm is written in C and reads a file that contains an input table provided by the management with the various per-item prices and the corresponding order size thresholds. Most people would argue that a file with a simple input table is, of course, data.
Now, imagine that the store changes its policy to some sort of asymptotic function, rather than pre-selected thresholds, so that it can accommodate insanely large orders. They might also want to factor in exchange rates and inflation - or whatever else the management people come up with.
The store hires a competent programmer and she embeds a nice mathematical expression parser in the original C code. The input file now contains an expression with global variables, functions such as log() and tan(), as well as some simple stuff like the Planck constant and the rate of carbon-14 degradation.
cost = (base * ordered * exchange * ... + ... / ...)^13
Most people would still argue that the expression, even if not as simple as a table, is in fact data. After all it is probably provided as-is by the management.
The store receives a large amount of complaints from clients that became brain-dead trying to estimate their expenses and from the accounting people about the large amount of loose change. The store decides to go back to the table for small orders and use a Fibonacci sequence for larger orders.
The programmer gets tired of modifying and recompiling the C code, so she embeds a Python interpretter instead. The input file now contains a Python function that polls a roomfull of Fib(n) monkeys for the cost of large orders.
Question: Is this input file data?
From a strict technical point, there is nothing different. Both the table and the expression needed to be parsed before usage. The mathematical expression parser probably supported branching and functions - it might not have been Turing-complete, but it still used a language of its own (e.g. MathML).
Yet now many people would argue that the input file just became code.
So what is the distinguishing feature that turns the input format from data into code?
Modifiability: Having to recompile the whole system to effect a change is a very good indication of a code-centric system. Yet I can easily imagine (well, more like I have actually seen) software that has been designed incompetently enough to have e.g. an input table built-in at compile time. And let's not forget that many applications still have icons - that most people would deem data - built in their executables.
Input format: This is the - in my opinion, naively - most common factor that people consider: "If it is in a programming language then it is code". Fine, C is code - you have to compile it after all. I would also agree that Python is also code - it is a full blown language. So why isn't XML/XSL code? XSL is a quite complex language in its own right - hence the L in its name.
In my opinion, none of these two criteria is the actual distinguishing feature. I think that people should consider something else:
Maintainability: In short, if the user of the system has to hire a third party to make the expertise needed to modify the behaviour of the system available, then the system should be considered code-centric to a degree.
This, of course, means that whether a system is data-driven or not should be considered at least in relation to the target audience - if not in relation to the client on a case-by-case basis.
It also means that the distinction can be impacted by the available toolset. The UML specification is a nightmare to go through, but these days we have all those graphical UML editors to help us. If there was some kind of third-party high-level AI tool that parses natural language and produces XML/Python/whatever, then the system becomes data-driven even for far more complex input.
A small store probably does not have the expertise or the resources to hire a third party. So, something that allows the workers to modify its behaviour with the knowledge that one would get in an average management course - mathematics, charts etc - could be considered sufficiently data-driven for this audience.
On the other hand, a multi-billion international corporation usually has in its payroll a bunch of IT specialists and Web designers. Therefore, XML/XSL, Javascript, or even Python and PHP are probably easy enough for it to handle. It also has complex enough requirements that something simpler might just not cut it.
I believe that when designing a software system, one should strive to achieve that fine balance in the used input formats where the target audience can do what they need to, without having to frequently call on third parties.
It should be noted that outsourcing blurs the lines even more. There are quite a few issues, for which the current technology simply does not allow the solution to be approachable by the layman. In that case the target audience of the solution should probably be considered to be the third party to which the operation would be outsourced to.
That third party can be expected to employ a fair number of experts.
One of five maxims under the Unix Philosophy, as presented by Rob Pike, is this:
Data dominates. If you have chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
It is often shortened to, "write stupid code that uses smart data."
Other answers have already dug into how you can often code complex behavior with simple code that just reacts to the pattern of its particular input. You can think of the data as a domain-specific language, and of your code as an interpreter (maybe a trivial one).
Given lots of data you can go further: the statistics can power decisions. Peter Norvig wrote a great chapter illustrating this theme in Beautiful Data, with text, code, and data all available online. (Disclosure: I'm thanked in the acknowledgements.) On pp. 238-239:
How does the data-driven approach compare to a more traditional software development
process wherein the programmer codes explicit rules? ... Clearly, the handwritten rules are difficult to develop and maintain. The big
advantage of the data-driven method is that so much knowledge is encoded in the data,
and new knowledge can be added just by collecting more data. But another advantage is
that, while the data can be massive, the code is succinct—about 50 lines for correct, compared to over 1,500 for ht://Dig’s spelling code. ...
Another issue is portability. If we wanted a Latvian spelling-corrector, the English
metaphone rules would be of little use. To port the data-driven correct algorithm to another
language, all we need is a large corpus of Latvian; the code remains unchanged.
He shows this concretely with code in Python using a dataset collected at Google. Besides spelling correction, there's code to segment words and to decipher cryptograms -- in just a couple pages, again, where Grady Booch's book spent dozens without even finishing it.
"The Unreasonable Effectiveness of Data" develops the same theme more broadly, without all the nuts and bolts.
I've taken this approach in my work for another search company and I think it's still underexploited compared to table-driven/DSL programming, because most of us weren't swimming in data so much until the last decade or two.
In languages in which code can be treated as data it is a non-issue. You use what's clear, brief, and maintainable, leaning towards data, code, functional, OO, or procedural, as the solution requires.
In procedural, the distinction is marked, and we tend to think about data as something stored in an specific way, but even in procedural it is best to hide the data behind an API, or behind an object in OO.
A lookup(avalue) can be reimplemented in many different ways during its lifetime, as long as its starts as a function.
...All the time I desing programs for nonexisting machines and add: 'if we now had a machine comprising the primitives here assumed, then the job is done.'
... In actual practice, of course, this ideal machine will turn out not to exist, so our next task --structurally similar to the original one-- is to program the simulation of the "upper" machine... But this bunch of programs is written for a machine that in all probability will not exist, so our next job will be to simulate it in terms of programs for a next lower level machine, etc., until finally we have a program that can be executed by our hardware...
E. W. Dijkstra in Notes on Structured Programming, 1969, as quoted by John Allen, in Anatomy of Lisp, 1978.
When I think of this philosophy which I agree with quite a bit, the first thing that comes to mind is code efficiency.
When I'm making code I know for sure it isn't always anything close to perfect or even fully knowledgeable. Knowing enough to get close to maximum efficiency out of a machine when it is needed and good efficiency the rest of the time (perhaps trading off for better workflow) has allowed me to produce high quality finished products.
Coding in a data driven way, you end up using code for what code is for. To go and 'outsource' every variable to files would be foolishly extreme, the functionality of a program needs to be in the program and the content, settings and other factors can be managed by the program.
This also allows for much more dynamic applications and new features.
If you have even a simple form of database, you are able to apply the same functionality to many states. You may also do all manner of creative things like changing the context of what your program is doing based on file header data or perhaps directory, file name or extension, though not all data is necessarily stored on a filesystem.
Finally keeping your code in a state where it is simply handling data puts you in a state of mind where you are closer to envisioning what is actually going on. This also keeps the bulk out of your code, greatly reducing bloatware.
I believe it makes code more maintainable, more flexible and more efficient aaaand I like it.
Thank you to the others for your input on this as well! I found it very encouraging.

Resources