Is there a data model, or principle, for structuring data for scripting, rather than the human eye? - google-sheets

I spend a lot of time writing scripts to automate Google Sheets using Google Apps Script.
Google Apps Script can get quite slow once your data grows large, spanning multiple sheets across multiple workbooks. I've spent my fair share of time refactoring my code to stay within Apps Script's 6-minute execution quota (not counting workarounds like the Cache Service).
I believe the root of the problem is the way data is organized on the sheets. An example is splitting one column into five columns for readability even though the data is only ever touched by automation. There are certainly other considerations, such as validating data once it's processed into information and making sure there are no bugs, but beyond that, code doesn't care what the data looks like.
Additionally, I don't think this is a matter of the XY Problem. The best way, overall, to make Google Apps Script more efficient is to use less of it.
I'm not well versed in data science in general, so I could definitely be wrong, but is this an existing principle I can learn more about? I've tried searching for this already, but it's hard to look for what you don't know.
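As an illustration of "code doesn't care what data looks like": the usual pattern is to pull a whole range into a 2D array with one call, do all the work in memory, and write the result back with one call, regardless of how the columns are laid out for readers. A minimal sketch in Apps Script; the sheet name, column positions, and output range are assumptions for illustration only:

```javascript
// Minimal sketch (Google Apps Script). The sheet name "Data", the column
// positions, and the output range are assumptions, not a prescribed layout.
function summarizeData() {
  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Data');

  // One batch read instead of thousands of getValue() calls.
  const rows = sheet.getDataRange().getValues();   // 2D array: [row][column]
  rows.shift();                                    // drop the header row

  // Work entirely in memory; the script does not care how many
  // "human-readable" columns the values were split into.
  const totals = {};
  rows.forEach(function (row) {
    const key = row[0];                            // e.g. an id column
    const value = Number(row[4]) || 0;             // e.g. a numeric column
    totals[key] = (totals[key] || 0) + value;
  });

  // One batch write back.
  const out = Object.keys(totals).map(function (k) { return [k, totals[k]]; });
  sheet.getRange(2, 10, out.length, 2).setValues(out);  // columns J:K, arbitrary
}
```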

Related

Why one of two identical Google Sheets runs so much slower than the other

I have a Google Sheet built on IMPORTXML functions (among others) that serves as the decision basis for some dependent Zapier functions in a third application, and it has been running increasingly slowly. This causes downstream issues for the cascading schedule of systems I use afterwards, not to mention mis-triggers in Zapier as it tries to catch up with cell values that change while the formulas run.
I've tried to optimise what I can to get it to run faster. One step I took was to clone the original sheet so I could test optimisations on the clone (essentially a staging sheet).
What I've noticed in doing this is that the clone runs all the included formulas so much quicker than its identical twin, the original.
Just wondering if anyone knows of some obvious reasons why this would be happening, and what I can do to get the original running as fast as the clone?

Best db engine for building a web app with ranking algorithms

I've got an idea for a new web app which will involve the following:
1.) lots of raw inputs (text values) that will be stored in a db - some of which contribute as signals to a ranking algorithm
2.) data crunching & analysis - a series of scripts will be written which together form an algorithm that will take said raw inputs from 1.) and then store a series of ranking values for these inputs.
Events 1.) and 2.) are independent of each other. Event 2 will probably happen once or twice a day. Event 1 will happen on an ongoing basis.
I initially dabbled with the idea of writing the whole thing in node.js on top of MongoDB, as I was curious to try something new. While I think node.js would be perfect for event 1.), I don't think it will work well for event 2.) outlined above.
I'd also rather keep everything in one domain rather than mixing node.js with something else for step 2.
Does anyone have any recommendations for what stacks work well for computational type web apps?
Should I stick with PHP or Rails/Mysql (which I already have good experience with)?
Is MongoDB/nosql constrained when it comes to computational analysis?
Thanks for your advice,
Ed
There is no reason why node.js wouldn't work.
You would just write two node applications.
One that takes input, stores it in the database, and renders output,
and another that crunches the numbers in its own process and is run once or twice per day.
Of course, if you're doing real number crunching and you need performance, you wouldn't do no. 2 in Node/Ruby/PHP. You would do it in Fortran (or maybe C).
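A minimal sketch of what that second application might look like as a standalone Node.js script run by cron, assuming MongoDB as mentioned in the question; the database, collection, and signal field names are made up for illustration:

```javascript
// rank.js -- sketch of the "event 2" batch process, run by cron once or
// twice a day. Database, collection and field names are assumptions.
const { MongoClient } = require('mongodb');

async function recomputeRankings() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const db = client.db('myapp');

  // Pull the raw inputs written by the web-facing app (event 1).
  const inputs = await db.collection('inputs').find({}).toArray();

  // Placeholder scoring step: whatever signals your algorithm uses go here.
  const ranked = inputs.map(doc => ({
    inputId: doc._id,
    score: (doc.signalA || 0) * 0.7 + (doc.signalB || 0) * 0.3,
  }));

  // Store the ranking values separately so event 1 is never blocked.
  await db.collection('rankings').deleteMany({});
  if (ranked.length) {
    await db.collection('rankings').insertMany(ranked);
  }

  await client.close();
}

recomputeRankings().catch(err => { console.error(err); process.exit(1); });
```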

Suitability of Amazon SimpleDB for large temporal data sets emanating from thousands of separate devices

I'm trying to establish whether Amazon SimpleDB is suitable for a subset of data I have.
I have thousands of deployed autonomous sensor devices recording data.
Each sensor device essentially reports a couple of values four times an hour each day, over months and years. I need to keep all of this data for historic statistical analysis. Generally, it is write once, read many times. Server-based applications run regularly to query the data to infer other information.
The rows of data today, in SQL look something like this:
(id, device_id, utc_timestamp, value1, value2)
Our existing MySQL solution is not going to scale up much further, with tens of millions of rows. We query things like "tell me the sum of all the value1 yesterday" or "show me the average of value2 in the last 8 hours". We do this in SQL today but could happily move it into code (see the sketch after this question). SimpleDB's "eventual consistency" appears fine for our purposes.
I'm reading up all I can and am about to start experimenting with our AWS account, but it's not clear to me how the various SimpleDB concepts (items, domains, attributes, etc.) relate to our domain.
Is SimpleDB an appropriate vehicle for this and what would a generalised approach be?
PS: We mostly use Python, but this shouldn't matter when considering this at a high level. I'm aware of the boto library at this point.
Edit:
Continuing to search for solutions, I came across the Stack Overflow question "What is the best open source solution for storing time series data?", which was useful.
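For reference, "doing it in code" could look roughly like the sketch below: select the raw attributes for the time window and sum them client-side, since SimpleDB has no aggregate functions. It is shown in JavaScript with the AWS SDK rather than Python/boto, and the domain name, attribute names, and ISO-8601 string timestamps are assumptions:

```javascript
// Sketch of "the sum of all the value1 yesterday" done in application code.
// Domain ("readings") and attribute names are assumptions for illustration.
const AWS = require('aws-sdk');
const sdb = new AWS.SimpleDB({ region: 'us-east-1' });

async function sumValue1(fromIso, toIso) {
  // SimpleDB compares attribute values as strings, so zero-padded ISO
  // timestamps sort correctly in a lexical range query.
  const query = "select value1 from readings " +
                "where utc_timestamp >= '" + fromIso + "' " +
                "and utc_timestamp < '" + toIso + "'";

  let total = 0;
  let nextToken;
  do {
    const params = { SelectExpression: query };
    if (nextToken) params.NextToken = nextToken;

    const page = await sdb.select(params).promise();
    (page.Items || []).forEach(item => {
      item.Attributes.forEach(attr => {
        if (attr.Name === 'value1') total += Number(attr.Value);
      });
    });
    nextToken = page.NextToken;          // pages are capped, so keep paging
  } while (nextToken);

  return total;
}
```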
Just following up on this one many months later...
I did actually have the opportunity to speak to Amazon directly about this last summer, and got access to the beta programme for what eventually became DynamoDB, but was not able to talk about it at the time.
I would recommend it for this sort of scenario, where you need a primary key and what might be described as a secondary index/range, e.g. timestamps. This allows you much greater confidence in search, i.e. "show me all the data for device X between Monday and Friday".
We haven't actually moved to this yet for various reasons but do still plan to.
http://aws.amazon.com/dynamodb/
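A minimal sketch of the hash-key-plus-range-key idea described above, using the AWS SDK for JavaScript; the table name, key names, and DocumentClient usage are assumptions for illustration, not the poster's actual schema:

```javascript
// Sketch of querying a DynamoDB table where device_id is the hash
// (partition) key and utc_timestamp is the range (sort) key.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient({ region: 'us-east-1' });

// "Show me all the data for device X between Monday and Friday."
async function readingsForDevice(deviceId, fromIso, toIso) {
  const result = await dynamo.query({
    TableName: 'sensor_readings',                       // assumed table name
    KeyConditionExpression:
      'device_id = :d AND utc_timestamp BETWEEN :from AND :to',
    ExpressionAttributeValues: { ':d': deviceId, ':from': fromIso, ':to': toIso },
  }).promise();
  return result.Items;
}
```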
In my opinion, Amazon SimpleDB, like Microsoft Azure Tables, is a fine solution as long as your queries are quite simple. As soon as you try to do things that are a complete non-issue on relational databases, like aggregates, you begin to run into trouble. So if you are going to do some heavy reporting, it might get messy.
It sounds like your problem may be best handled by a round-robin database (RRD). An RRD stores time-varying data in such a way that the file size never grows beyond its initial allocation. It's extremely cool and very useful for generating graphs and time series information.
I agree with Oliver Weichhold that a cloud-based database solution will handle the use case you described. You can spread your data across multiple SimpleDB domains (like partitions) and store it in a way that lets most of your queries be answered from a single domain without having to traverse the entire database. Defining your partition strategy will be key to the success of moving towards a cloud-based DB. Data set partitioning is talked about here
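One possible shape for such a partition strategy, purely as an illustration (the per-device, per-month naming scheme is my assumption, not part of the answer above):

```javascript
// Sketch: route each reading to a SimpleDB domain named after its device
// and month, so a typical time-window query touches a single domain.
function domainFor(deviceId, utcTimestamp) {
  const month = utcTimestamp.slice(0, 7).replace('-', '_');  // e.g. "2012_03"
  return 'readings_' + deviceId + '_' + month;
}

// domainFor('dev42', '2012-03-15T10:00:00Z') === 'readings_dev42_2012_03'
```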

Riak vs Amazon SimpleDB

I am looking for an eventually consistent key-value data store, and I have decided to choose between Amazon SimpleDB and Riak, so can anyone share their experiences comparing the two?
Thanks in advance
Fedrick
Riak is a key-value store. The data values you store are opaque to the database, so you have no secondary indexes, but you do have the ability to run map-reduce if your data is JSON (or XML, I think). You can run map-reduce over all the data or just a subset ("seed keys"). It also has a "link walking" feature where documents can refer to other documents, which can be auto-fetched. They don't currently have incremental map-reduce like CouchDB, which means any secondary (non-key) queries are quite expensive. They have plans to fix this.
SimpleDB is actually halfway between a docstore and a keystore: Each key->item supports multiple attributes, but it only goes one level deep. You can query on your key or your attribute values.
In production, Riak should be pretty "hands-off". If it's slow or getting full, just spin up a new server and tell it to join the cluster. (unlike CouchDB or MongoDB where you have to futz with multiple config files).
SimpleDB can take a pounding (tens of thousands of requests per second I've heard), but you are responsible for data scaling (i.e. don't violate their domain size limits or it will slow down).
I have used SimpleDB for about 6 months now and am going into production with it. It works well, but I wish it were faster. I perform %like% queries for searching, and I can't seem to get it to scan through more than a few MB per second's worth of values, though non-%like% searches are much faster. I get the feeling it could be sped up if someone at Amazon wrote a few algorithms in good old C rather than Erlang, but then again I am a C coder.
Also, the first few queries on a recently opened domain will take longer, as the system reads it all in.
Overall it worked for me, but if I want to scale higher I will have to go with something else.
Also, I think that almost all my use of it will be free - there is a generous allocation of space, etc.
Make sure you plan around the fact that SimpleDB currently has no "read-only" access modes, etc. Any user that can use it can edit it.
--Tom

Tracking/Monitoring sudden trend changes

I track a lot of things with RRD, e.g. uptime, network throughput, etc. This works well when you can fit all the graphs on a single page; however, once you scale beyond a page it becomes difficult to use graphs to catch issues. You need to look at them to see that there is a problem, and if there are hundreds or thousands of graphs, that obviously isn't possible.
So, is there any standard way, or existing software, for monitoring RRD databases for trend changes? E.g. every day, network traffic looks pretty much the same; if it spikes or dips dramatically in a single hour/day/week compared to the norm, I'd like to be alerted to it.
Or even just generic methods for finding changes in trends.
You can read the RRD file directly, rather than just using the generated graphs. You might need to write your own app to do this, but the file format is an open standard, so it shouldn't be that difficult to get what you need.
RRD File Format
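If you do go the "write your own app" route, a generic trend-change check can be as simple as flagging points that drift too far from a rolling mean. A minimal sketch, with an arbitrary window and threshold, assuming you have already pulled the series out of the RRD file (e.g. via rrdtool fetch):

```javascript
// Sketch: flag points that deviate from the recent rolling mean by more
// than `threshold` standard deviations. Window and threshold are arbitrary.
function findTrendBreaks(series, window = 24, threshold = 3) {
  const alerts = [];
  for (let i = window; i < series.length; i++) {
    const recent = series.slice(i - window, i);
    const mean = recent.reduce((a, b) => a + b, 0) / window;
    const variance = recent.reduce((a, b) => a + (b - mean) ** 2, 0) / window;
    const stddev = Math.sqrt(variance) || 1;          // avoid divide-by-zero
    if (Math.abs(series[i] - mean) > threshold * stddev) {
      alerts.push({ index: i, value: series[i], mean: mean });
    }
  }
  return alerts;
}
```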
Looks like RRD actually supports this,
http://cricket.sourceforge.net/aberrant/rrd_hw.htm
Would be interested in hearing if anyone has used this.
