Matrix storage in Rails? - ruby-on-rails

I'm wondering what's the best way to handle a huge matrix in Rails 3. This matrix would store distances between points (it's symetric).
Points could be added anytime so the matrix could be frequently updated.
I see two ways:
storing values in database and get distances through db requests (easy but a bit slow)
storing values in a file and put this file in cache (could be hard to update)
Thoughts?
PS: I'm packaging this for a new release of my gmaps4rails gem (dedicated to make gmaps easy for rails users)

If you have to store a unique and big matrix, I would recommend doing it in a separate table (column/line/value). It will scale better than with a file, and :
You can access and update individual cell more easily
You mentioned using a file to cache your matrix, but if the need arise you can also fetch your entire table to cache your matrix
You can update rows, columns, and sub-matrixes with well formed queries
If you encounter performance problems when making your matrix grow, take a look at the activerecord-import library. It will help you batch insert data in your matrix.

Related

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches for building such system. I choose to implement one facet of such system by detection of features shared by the majority of samples. I acknowledge the possible insufficiencies of such method but for my specific use-case: (1) It suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision.(2) I'm interested in the insights such method will offer to the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model a training dataset by grouping all the features observed in the data (the set of all unique keys) and setting it as the model's feature space. I define each sample by setting its values for existing keys and None for values in features it does not include.
Given this training data set I want to determine which features reoccur in the data; and for such reoccurring features, do they mostly share a single value.
My question:
A simple solution would be to count everything - for each of the N features calculate the distribution of values. However as M and N are potentially large, I wonder if there is a more compact way to represent the data or more sophisticated method to make claims about features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such task it would be even better.
If I understand correctly your question,
you need to go over all the data anyway, so why not using hash?
Actually two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
In this way, the size of the inner hash table will indicate how is the feature common in your data, and the actual values will indicate how they differ one another. Another thing to notice is that you go over your data only once, and the time complexity for every operation (almost) on hash tables (if you allocate enough space from the beginning) is O(1).
Hope it helps

Strategies to speed up access to databases when working with columns containing massive amounts of data (spatial columns, etc)

First things first, I am an amateur, self-taught ruby programmer who came of age as a novice engineer in the age of super-fast computers where program efficiency was not an issue in the early stages of my primary GIS software development project. This technical debt is starting to tax my project and I want to speed up access to this lumbering GIS database.
Its a postgresql database with a postgis extension, controlled inside of rails, which immediately creates efficiency issues via the object-ification of database columns when accessing and/or manipulating database records with one or many columns containing text or spatial data easily in excess of 1 megabyte per column.
Its extremely slow now, and it didn't used to be like this.
One strategy: I'm considering building child tables of my large spatial data tables (state, county, census tract, etc) so that when I access the tables I don't have to load the massive spatial columns every time I access the objects. But then doing spatial queries might be difficult on a parent table's children. Not sure exactly how I would do that but I think its possible.
Maybe I have too many indexes. I have a lot of spatial indexes. Do additional spatial indexes from tables I'm not currently using slow down my queries? How about having too many for one table?
These tables have a massive amount of columns. Maybe I should remove some columns, or create parent tables for the columns with massive serialized hashes?
There are A LOT of tables I don't use anymore. Is there a reason other than tidiness to remove these unused tables? Are they slowing down my queries? Simply doing a #count method on some of these tables takes TIME.
PS:
- Looking back at this 8 hours later, I think what I'm equally trying to understand is how many of the above techniques are completely USELESS when it comes to optimizing (rails) database performance?
You don't have to read all of the columns of the table. Just read the ones you need.
You can:
MyObject.select(:id, :col1, :col2).where(...)
... and the omitted columns are not read.
If you try to use a method that needs one of the columns you've omitted then you'll get an ActiveModel::MissingAttributeError (Rails 4), but you presumably know when you're going to need them or not.
The inclusion of large data sets in the table is going to be a noticeable problem from the database side if you have full table scans, and then you might consider moving these data to other tables.
If you only use Rails to read and write the large data columns, and don't use PostgreSQL functions on them, you might be able to compress the data on write and decompress on read. Override the getter and setter methods by using write_attribute and read_attribute, compressing and decompressing (respectively of course) the data.
Indexing. If you are using postgres to store such large chucks of data in single fields consider storing it as Array, JSON or Hstore fields. If you index it using the gin index types so you can search effectively within a given field.

How to improve the performance in big table join?

Please help me out with this big data problem.
I have a very large table (500G) that stores cookie information collected from one website, and I try to provide service to many other clients. For each client, they have their cookies, so in the end I need to do query on 500G+300G(client_data).
Since some query use both my cookie data and client cookie data, it is possible that I need to do a join between my table and their table, therefore the performance is bad. To solve this problem, I put the entire 800GB data into a giant table. Since there is no join table, the performance is good. But when I expand my service to multiple client, it takes too much storage.
Current I am using Vertica as my data source, and use bitmap to store my information.
Any suggestion that can maintain my current performance but also support like 40 cients? My storage is about 12 TB and each client in current solution talkes 1.5T.
what I want is either a replacement of Vertica with can support bitmap operation and quick table join. Or a better way to represent my data.
My storage is about 12 TB and each client in current solution talkes 1.5T.
If you have 40 * 1.5TB worth of non-duplicated cookie data to store, there's no magic to make that fit into 12TB.
This will be an imprecise answer due to the lack of details about definitions, etc. But I would add the following about performance:
Look at your projection definitions. You may be able to get performance gains depending on what you put in the order by clause of the projection.
You have a few ways forward, depending on the specifics of your case. Point 1 and 3 are the easiest to deal with:
You can properly set projections, to make sure that both tables are identically segmented: https://my.vertica.com/docs/6.1.x/HTML/index.htm#12549.htm
You can set up pre join projections, where the join cost is paid during data load, not during data retrieval, see https://my.vertica.com/docs/6.1.x/HTML/index.htm#1299.htm
Make sure that your data type is the best possible. Matching on ints is faster than matching on strings, matching columns with low cardinality is faster than matching columns with high cardinality.
If 1 and 3 are well set, Vertica can actually apply filters before decompression, fastening a lot your query and thus using a lot less memory.

How to display large list in the view without using database slicing?

I have a service that generates a large map through multiple iterations and calculations from multiple tables. My problem is I cannot use pagination offset to slice the data because the data is coming from multiple tables and different modifications happen on the data. To display this on the screen; I have to send the map with 10-20,000 records to the view and that is problematic with this large dataset.
At this time I have on-page pagination but this is very slow and inefficient.
One thing I thought is to dump it on a table and query it each time but then I have to deal with concurrent users.
My question is what is the best approach to display this list when I cannot use database slicing (offset, max)?
I am using
grails 1.0.3
datatables and jquery
Maybe SlickGrid! is an option for you. One of there examples works with 50000 rows and it seems to be fast.
Christian
I end up writing the result of the map in a table and use the data slicing on that table for pagination. It takes some time to save the data but at least I don't have to worry about the performance with the large data. I use time-stamp to differentiate between requests. each requests will be saved and retrieved with its time stamp.

Architecture of finding movable geotagged objects

I currently have a Postgres DB filled with approx. 300.000 data-sets of moving vehicles all over the world. My very frequently repeated query is: Give me all vehicles in a 5/10/20mile radius. Currently I spend around 600 to 1200 ms in the DB to prepare the set of located vehicle-objects.
I am looking to vastly improve this time by ideally one or two orders of magnitude if possible. I am working in a Ruby on Rails 3.0beta environment if this is relevant.
Any ideas how to architect the whole system to accelerate this query? Any NoSQL database able to deliver this kind of geolocation performance? I know of MongoDB working on an extension to facilitate this scenario but haven't tried it yet. Any intelligent use of Redis to achieve this?
One problem with SQL-DBs here seems to be that I can't possibly use indexes because my vehicles are mostly moving around, meaning I had to constantly created DB indexes which, by itself, is probably more expensive than just doing the searching without index.
Looking forward to your thoughs, Thanks!
If you use the right algorithm for organizing your data, you will be able to use a spatial index which can dramatically speed up your queries.
The best practice for the geolocation domain is to use a geohash, quad-tree, R-tree or similar data structure (R-trees are the most generic, but it sounds like you're querying point data, so that may not matter). In each case, you can create a spatial index that uses a single, linear column where each value represents a bounding box of varying size and shape. This should let you answer most queries with a single range query in your database. Spatial indices can be implemented in SQL (PostGIS, MS SQL, MySQL all have spatial datatypes and spatial indices which use one of these techniques) or NoSQL (popular for its horizontal scalability; AppEngine has geomodel, SimpleGeo uses Cassandra, Foursquare uses MongoDB).
Using an index can be complicated by constantly moving points, but I would suspect that writes, even slightly heavier writes that update indices, wouldn't be your bottleneck.
Even though your vehicles are moving around all the time, I assume they have some kind of speed limit. What you can do is to create some kind of discrete coordinate system, one example would be the integer part of the lat/long coordinate. Then you put those values in separate columns, keeping the exact location in another column. You should then be able to index the integer columns, as the vehicles won't move so much that they change those values very often.
When doing a search, you first find out what "squares" are interesting, and restrict your query to the vechicles within those sqeares, using the indexed columns. Then you have to do a full search of all vehicles within each square. The number of vehicles you have to do a full search over should now only be a small fraction of all vechiles. The efficiency of this strategy of course depends on the distribution of your vechiles. If 50% of them are in a certain city somewhere this will not work, but assuming the largest group of vehicles in one place is 5-10% it should improve performance.

Resources