In FPA, are ILFs and EIFs really only Files or Databases? - function-points

In FPA, are ILFs and EIFs really only Files or Databases - or can they be internal data structures?
I have not been able to determine how to apply FPA to complex algorithms or number crunching (such as signal processing), so I am wondering where internal data structures (that are computed only at runtime) are accounted for in FPA. Could such processing (and data) be accounted for in the External Output (EO)?
Thanks gang.

Based on various sources, it does seem that ILFs and EIFs are limited to physical files or databases rather than internal data structures. FPA seems oriented toward MIS-type applications.

How to load a huge model on Dask with limited RAM?

I want to load a model (an ANNOY model) on Dask. The size of the model is 60 GB and the available Dask RAM is only 2 GB. Is there a way to load the model in a distributed manner?
If by "load" you mean: "store in memory", then obviously there is no way to do this. If you need access to the whole dataset in memory at once, you'll need a machine that can handle this.
However, you very probably meant that you want to do some processing on the data and get a result (a prediction, a statistical score...) that does fit in memory.
Since I don't know what ANNOY is (array? dataframe? something else?), I can only give you general rules. For dask to work, it needs to be able to split a job into tasks. For data IO, this commonly means that the input is in multiple files, or that the files have some natural internal structure such that they can be loaded chunk-wise. For example, zarr (for arrays) stores each chunk of a logical dataset as a separate file, parquet (for dataframes) chunks up data into pages within columns within groups within files, and even CSV can be loaded chunkwise by looking for newline characters.
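For instance, here is a minimal sketch of that chunk-wise loading pattern (the file names, the column name and the chunk size are hypothetical, and this is not ANNOY-specific):

import dask.array as da
import dask.dataframe as dd

# One task per on-disk zarr chunk; nothing is loaded until compute().
arr = da.from_zarr("embeddings.zarr")
# One task per ~64 MB block of CSV; "value" is a made-up column.
df = dd.read_csv("measurements-*.csv", blocksize="64MB")

# Only the small reduced results are ever held in memory at once.
print(arr.sum().compute())
print(df["value"].mean().compute())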
I suspect annoy ( https://github.com/spotify/annoy ?) has a complex internal storage structure, and you may need to raise an issue on their repo asking about dask support.

Data Type Overhead

Whenever a program implements specific data types, there must be a header and some sort of system to detect what data type a variable is. Does this dramatically affect the required space to store smaller data types such as bytes and booleans, or the time required to change or apply operations to them? Are there any techniques commonly used to minimize this overhead for the smaller variables?
In compiled languages there is generally no such overhead: the type information is needed at compile time, but not beyond that.
However, some object-oriented languages do keep type descriptors for objects (though not for integers and booleans).
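By way of contrast, here is a concrete illustration in a dynamic language (CPython), where every boxed value does carry a header with type information, along with the usual technique for minimizing that overhead for many small values:

import array
import sys

# Each boxed Python value carries an object header with a type pointer,
# so even a boolean or a small integer costs roughly 28 bytes on a 64-bit build.
print(sys.getsizeof(True), sys.getsizeof(7))

# Common mitigation: pack many small values into a typed, homogeneous buffer,
# so the type is recorded once for the whole container rather than per element.
packed = array.array("b", [0, 1, 0, 1])  # 1 byte per element plus a fixed header
boxed = [False, True, False, True]       # 1 pointer per element, each to a boxed object
print(packed.itemsize)                   # 1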

In beam custom combine function, does serialization occur even if the object is on "same" machine?

We have a custom combine function (on Beam SDK 2.0) in which millions of objects get accumulated but do NOT necessarily get reduced; that is, they sometimes get added to a List, such that eventually the List might get quite large (hundreds of megabytes, even gigabytes).
To minimize the problem of having to "pass around" these objects (during merging of accumulators) between nodes, we've created a SINGLE giant node (of 64 cores, tonnes of RAM).
So, in "theory", dataflow does not need to serialize the List object (and any of these big objects in the List) even during "merge accumulator" operations, since all the objects are on the same node. But, does dataflow still serialize even if all the objects of interest are on the same node or is it smart enough to know that an object is on the same node vs separate nodes?
Ideally, when objects are on the same node, we can just pass around references to the objects (rather than serializing/deserializing the contents of these objects, which can be very, very large). (I understand, of course, that when dealing with multiple nodes there's no choice but to serialize/deserialize, since the data has to be passed around somehow; but within a node, is Beam SDK 2.0 smart enough to not serialize/deserialize during these combine functions, group-bys, etc.?)
The Dataflow service aggressively optimizes your pipeline to avoid needless serialization. The optimization you are interested in is fusion, described here in the Dataflow documentation. When data moves through a fused "stage" (a sequence of low-level instructions roughly corresponding to steps in your input pipeline), it is not serialized and deserialized.
However, if your CombineFn builds a list, and that list grows large, you should try to rephrase your pipeline to use a raw GroupByKey. Another important optimization is "combiner lifting" or "mapper-side combine", where your CombineFn is applied per key locally prior to shuffling your data between machines, based on the assumption that the accumulator will be smaller than just a list of elements. Since your accumulator is that list, the assumption does not hold: the whole list will be serialized, shuffled, and deserialized before the Combine transform completes. If, instead, you use a GroupByKey directly, your elements will be streamed much more efficiently, without serializing an entire list.
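For what it's worth, here is a rough sketch of the contrast (written against the Beam Python SDK rather than the Java SDK you are using, with made-up keys and values):

import apache_beam as beam

class ToListFn(beam.CombineFn):
    # The accumulator is the list itself, so merging and shipping accumulators
    # means serializing the whole (potentially huge) list.
    def create_accumulator(self):
        return []
    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator
    def merge_accumulators(self, accumulators):
        merged = []
        for accumulator in accumulators:
            merged.extend(accumulator)
        return merged
    def extract_output(self, accumulator):
        return accumulator

with beam.Pipeline() as p:
    kvs = p | beam.Create([("key", i) for i in range(1000)])
    # Combiner lifting runs ToListFn per worker, then shuffles the big lists.
    as_lists = kvs | "CombineToList" >> beam.CombinePerKey(ToListFn())
    # A plain GroupByKey streams the grouped values instead of materializing
    # one serialized list per accumulator.
    grouped = kvs | "JustGroup" >> beam.GroupByKey()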
I should note that Beam's other runners also perform the standard fusion optimization, among others. These all generally come from functional programming work in the late '80s / early '90s and were applied to distributed data processing in FlumeJava, circa 2010, so it is a baseline expectation now.

How does Titan achieve constant time lookup using HBase / Cassandra?

In the O'Reilly book "Graph Databases", in chapter 6, which is about how Neo4j stores a graph database, it says:
To understand why native graph processing is so much more efficient than graphs based on heavy indexing, consider the following. Depending on the implementation, index lookups could be O(log n) in algorithmic complexity versus O(1) for looking up immediate relationships. To traverse a network of m steps, the cost of the indexed approach, at O(m log n), dwarfs the cost of O(m) for an implementation that uses index-free adjacency.
It is then explained that Neo4j achieves this constant-time lookup by storing all nodes and relationships as fixed-size records:
With fixed sized records and pointer-like record IDs, traversals are implemented simply by chasing pointers around a data structure, which can be performed at very high speed. To traverse a particular relationship from one node to another, the database performs several cheap ID computations (these computations are much cheaper than searching global indexes, as we’d have to do if faking a graph in a non-graph native database).
This last sentence triggers my question: how does Titan, which uses Cassandra or HBase as a storage backend, achieve these performance gains or make up for it?
Neo4j only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Neo4j is slow because of pointer chasing on disk (they have a poor disk representation).
Titan only achieves O(1) when the data is in-memory in the same JVM. When the data is on disk, Titan is faster than Neo4j because it has a better disk representation.
Please see the following blog post that explains the above quantitatively:
http://thinkaurelius.com/2013/11/24/boutique-graph-data-with-titan/
Thus, it's important to understand, when people say O(1), what part of the memory hierarchy they are in. When you are in a single JVM (a single machine), it's easy to be fast, as both Neo4j and Titan demonstrate with their respective caching engines. When you can't put the entire graph in memory, you have to rely on intelligent disk layouts, distributed caches, and the like.
Please see the following two blog posts for more information:
http://thinkaurelius.com/2013/11/01/a-letter-regarding-native-graph-databases/
http://thinkaurelius.com/2013/07/22/scalable-graph-computing-der-gekrummte-graph/
OrientDB uses a similar approach, where relationships are managed without indexes (index-free adjacency) but rather with direct pointers (LINKS) between vertices. It's like in-memory pointers, but on disk. In this way OrientDB achieves O(1) traversal both in memory and on disk.
But if you have a vertex "City" with thousands of edges to "Person" vertices, and you're looking for all the people with age > 18, then OrientDB uses indexes, because a query is involved, so in this case it's O(log N).
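To make the index-free adjacency idea concrete, here is a toy sketch (in Python, and nothing like the actual on-disk formats of Neo4j, Titan, or OrientDB): record IDs double as offsets into fixed-size record arrays, so each hop is a constant-time ID computation, whereas the indexed alternative pays an O(log n) lookup per hop.

import bisect

# Fixed-size "node records": each node stores the ID of its first outgoing edge (-1 = none).
node_first_edge = [0, 2, -1]
# Fixed-size "edge records": (target node ID, ID of the next edge with the same source).
edge_records = [(1, 1), (2, -1), (2, -1)]

def neighbors_adjacency(node_id):
    # O(1) per hop: chase record IDs, no global index involved.
    edge_id = node_first_edge[node_id]
    while edge_id != -1:
        target, edge_id = edge_records[edge_id]
        yield target

# The "heavy indexing" alternative: a sorted global (source, target) index, O(log n) per lookup.
edge_index = sorted([(0, 1), (0, 2), (1, 2)])

def neighbors_indexed(node_id):
    i = bisect.bisect_left(edge_index, (node_id, -1))
    while i < len(edge_index) and edge_index[i][0] == node_id:
        yield edge_index[i][1]
        i += 1

print(list(neighbors_adjacency(0)))  # [1, 2]
print(list(neighbors_indexed(0)))    # [1, 2]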

Rules engine for spatial and temporal reasoning?

I have an application that receives a number of datums that characterize 3 dimensional spatial and temporal processes. It then filters these datums and creates actions which are then sent to processes that perform the actions. Rinse and repeat.
At present, I have a collection of custom filters that perform a lot of complicated spatial/temporal calculations.
Many times, as I discuss my system with individuals in my company, they ask if I'm using a rules engine.
I have yet to find a rules engine that is able to reason well temporally and spatially. (Things like: When are two 3D entities ever close? Is 3D entity A ever contained in 3D region B? If entity C is near entity D but oriented backwards relative to C then perform action D.)
I have looked at Drools, Cyc, and Jess in the past (say 3-4 years ago). It's time to re-examine the state of the art. Any suggestions? Any standards that you know of that support this kind of reasoning? Any de facto standards? Any applications?
Thanks!
Premise - remember that a SQL-based¹ DBMS is a (quite capable) inference engine, as can be seen from these comparisons between SQL and Prolog:
prolog to SQL converter
difference between SQL and Prolog
To address specifically your spatio-temporal applications, this book will help:
TEMPORAL DATA AND THE RELATIONAL MODEL - A Detailed Investigation into the Application of Interval and Relation Theory to the Problem of Temporal Database Management.
That is, by combining Interval and Relation Theory it is possible to reason about spatio-temporal problems effectively (see 5.2, Applications of Intervals).
Of course, if your SQL-based DBMS is not (yet) equipped with interval (and other) operators, you will need to extend it appropriately (via stored procedures and/or user-defined functions - UDFs).
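As a toy illustration of such an extension (using Python's built-in sqlite3 purely for brevity; the table and column names are made up):

import sqlite3

def intervals_overlap(a_start, a_end, b_start, b_end):
    # Closed intervals [a_start, a_end] and [b_start, b_end] overlap
    # iff each starts no later than the other ends.
    return int(a_start <= b_end and b_start <= a_end)

con = sqlite3.connect(":memory:")
con.create_function("overlaps", 4, intervals_overlap)  # a user-defined interval operator
con.execute("CREATE TABLE presence (entity TEXT, t_start REAL, t_end REAL)")
con.executemany("INSERT INTO presence VALUES (?, ?, ?)",
                [("A", 0.0, 5.0), ("B", 4.0, 9.0), ("C", 10.0, 12.0)])

# "Which entities are ever present at the same time?"
rows = con.execute("""
    SELECT p.entity, q.entity
    FROM presence p JOIN presence q ON p.entity < q.entity
    WHERE overlaps(p.t_start, p.t_end, q.t_start, q.t_end)
""").fetchall()
print(rows)  # [('A', 'B')]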
Update: skimming the paper pointed out in the comments by timemirror (Towards a 3D Spatial Query Language for Building Information Models), they do essentially what I touched on above:
(last page)
IMPLEMENTATION CONCEPTS
The implementation of the abstract type system into a query language will be performed on the basis of the query language SQL, which is a widely established standard in the field of object-relational databases. The international standard SQL:1999 extends the relational model to include object-oriented aspects, such as the possibility to define complex abstract data types with integrated methods.
I do not concur with the "object-relational database" terminology (for reasons that are off-topic here) but I think the rest is pertinent.
Update: a quote regarding 3D and interval theory from the book cited above:
NOTE: All of the intervals discussed so far can be thought of as one-dimensional. However, we might want to combine two one-dimensional intervals to form a two-dimensional interval. For example, a rectangular plot of ground might be thought of as a two-dimensional interval, because it is, by definition, an object with length and width, each of which is basically a one-dimensional interval measured along some axis. And, of course, we can extend this idea to any number of dimensions. For example, a (rather simple!) building might be regarded as a three-dimensional interval: It is an object with length, width, and height, or in other words a cuboid. (More realistically, a building might be regarded as a set of several such cuboids that overlap in various ways.) And so on. In what follows, however, we will restrict our attention to one-dimensional intervals specifically, barring explicit statements to the contrary, and we will omit the "one-dimensional" qualifier for simplicity.
Note
¹ I wrote SQL-based and not relational because there are ways to use such DBMSes that completely deviate from relational theory.
This is spatial reasoning... there are a few models, but DE-9IM is now accepted by the OGC and implemented in PostGIS and other programming tools.
PostGIS implements a spatial reasoning engine based on the dimensionally extended 9-intersection model (DE-9IM):
http://postgis.refractions.net/documentation/manual-svn/ch04.html#DE-9IM
Check section 4.3.6.1, Theory...
So does the Java Topology Suite (and the Net Topology Suite for C#, etc.):
http://docs.codehaus.org/display/GEOTDOC/Point+Set+Theory+and+the+DE-9IM+Matrix
In particular, check out the geometry.relate calls, such as
boolean isRelated = geometry.relate( geometry2, "T*T***T**" );
You can test the relationships, or filter data based on them.
This works with points, lines, polygons, etc.
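If you are working in Python, shapely (which wraps GEOS, a port of JTS) exposes the same DE-9IM machinery; here is a small sketch, with the pattern-matching helper written out by hand for clarity:

from shapely.geometry import Point, Polygon

region = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
entity = Point(5, 5)

# relate() returns the 9-character DE-9IM intersection matrix.
matrix = entity.relate(region)
print(matrix)  # expected "0FFFFF212" for a point strictly inside a polygon

def matches(matrix, pattern):
    # 'T' means "any intersection" (dimension 0, 1 or 2), '*' means "don't care",
    # anything else must match the matrix character exactly.
    for m, p in zip(matrix, pattern):
        if p == "*":
            continue
        if p == "T" and m in "012":
            continue
        if m != p:
            return False
    return True

print(matches(matrix, "T*F**F***"))  # the "within" pattern -> True here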
This might help with the temporal stuff:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.87.4643&rep=rep1&type=pdf
Check out SpatialRules at http://www.objectfx.com/. It's a geospatial complex event processor for 2D and 3D.
