GeoMesa creates multiple tables in HBase to support its indexes. I have a few questions:
What does GeoMesa do to ensure these tables are in sync?
What is the impact on GeoMesa queries if the index tables are not in sync?
What happens (with write calls) if GeoMesa is not able to write to one of the index tables?
Is synchronization between tables best-effort, or does GeoMesa ensure the availability of data with eventual consistency?
I am planning to use GeoMesa with HBase (backed by S3) to store my geospatial data; the data size can grow to terabytes or even petabytes.
I am investigating how reliable GeoMesa is in terms of synchronization between the primary and index tables.
HBase Tables:
catalog1
catalog1_node_id_v4 (Main Table)
catalog1_node_z2_geom_v5 (Index Table)
catalog1_node_z3_geom_lastUpdateTime_v6 (Index Table)
catalog1_node_attr_identifier_geom_lastUpdateTime_v8 (Index Table)
GeoMesa Schema
geomesa-hbase describe-schema -c catalog1 -f node
INFO Describing attributes of feature 'node'
key | String
namespace | String
identifier | String (Attribute indexed)
versionId | String
nodeId | String
latitude | Integer
longitude | Integer
lastUpdateTime | Date (Spatio-temporally indexed)
tags | Map
geom | Point (Spatio-temporally indexed) (Spatially indexed)
User data:
geomesa.index.dtg | lastUpdateTime
geomesa.indices | z3:6:3:geom:lastUpdateTime,z2:5:3:geom,id:4:3:,attr:8:3:identifier:geom:lastUpdateTime
GeoMesa does not do anything to sync indices; generally this should be taken care of in your ingest pipeline.
If you have a reliable feature ID tied to a given input feature, then you can write that feature multiple times without causing duplicates. During ingest, if a batch of features fails due to a transient issue, then you can just re-write them to ensure that the indices are correct.
For HBase, when you call flush or close on a feature writer, the pending mutations will be sent to the cluster. Once that method returns successfully, then the data has been persisted to HBase. If an exception is thrown, you should re-try the failed features. If there are subsequent HBase failures, you may need to recover write-ahead logs (WALs) as per standard HBase operation.
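For illustration, here is a minimal sketch of that write/flush/retry pattern using the GeoTools feature writer API with provided feature IDs. The `hbase.catalog` parameter name, the attribute layout and the exact package locations are assumptions that vary by GeoMesa/GeoTools version, and the retry policy is left to the caller:

import org.geotools.data.DataStore;
import org.geotools.data.DataStoreFinder;
import org.geotools.data.FeatureWriter;
import org.geotools.data.Transaction;
import org.geotools.filter.identity.FeatureIdImpl;
import org.geotools.util.factory.Hints;
import org.opengis.feature.simple.SimpleFeature;
import org.opengis.feature.simple.SimpleFeatureType;

import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeBatchWriter {
    // Writes one batch of "node" features; safe to re-run with the same rows if it throws,
    // because the provided (stable) feature IDs make the write idempotent across all index tables.
    static void writeBatch(List<Object[]> rows, List<String> stableIds) throws Exception {
        Map<String, Serializable> params = new HashMap<>();
        params.put("hbase.catalog", "catalog1"); // assumed parameter name; check your GeoMesa version
        DataStore ds = DataStoreFinder.getDataStore(params);
        try (FeatureWriter<SimpleFeatureType, SimpleFeature> writer =
                     ds.getFeatureWriterAppend("node", Transaction.AUTO_COMMIT)) {
            for (int i = 0; i < rows.size(); i++) {
                SimpleFeature sf = writer.next();
                sf.setAttributes(rows.get(i));
                // use a provided (stable) feature ID so re-writing never creates duplicates
                sf.getUserData().put(Hints.USE_PROVIDED_FID, Boolean.TRUE);
                ((FeatureIdImpl) sf.getIdentifier()).setID(stableIds.get(i));
                writer.write();
            }
        } finally {
            // closing the writer flushes pending mutations; if it (or write) throws, retry the whole batch
            ds.dispose();
        }
    }
}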
A feature may also fail to be written due to validation (e.g. a null geometry). In this case, you would not want to re-try the feature as it will never ingest successfully. If you are using the GeoMesa converter framework, you can pre-validate features to ensure that they will ingest ok.
If you do not have an ingest pipeline already, you may want to check out geomesa-nifi, which lets you convert and validate input data, and re-try failures automatically through NiFi flows.
I have to implement a system where a tenant can store multiple key-value stores. One key-value store can have a million records, and there will be multiple columns in one store.
[Edited] I have to store tabular data (lists with multiple columns), like Excel, where column headers are unique and there is no defined schema.
This is mostly static data (updated occasionally).
We will provide a UI to handle those updates.
Every tenant wants to store multiple tables of structured data that they refer to from different applications, and the contract will be JSON only.
For example, an organization/tenant wants to store its employee list or a country-state list, plus some custom lists specific to the product, and this data runs into millions of records.
A simple solution is to use SQL, but here the schema is not fixed; it is user-defined. I have handled this in SQL, but there are performance issues, so I want to choose a NoSQL DB that suits this requirement better.
Design Constraints:
Get API latency should be minimal.
We can assume the Pareto rule (80:20): 80% read calls and 20% writes, so it is a read-heavy application.
Users can update a single record or a single column.
Users can query based on column values, so we need to implement indexes on multiple columns.
It's schema-less, which suggests NoSQL; SQL also supports JSON, but it is very hard to update a single row and we cannot define indexes on dynamic columns.
I want to segregate key-value stores per tenant; no list will be shared between tenants.
One key-value store example:
Another key-value store example: https://datahub.io/core/country-list
I am thinking of Cassandra or another wide-column database; we could also consider a document database (MongoDB), where every collection can be a key-value store, or Amazon DynamoDB.
Cassandra allows you to partition data by partition key, but in my use case I may want to get data by different columns; in Cassandra that means querying all partitions, which is expensive.
Your example data shows duplicate items, which is not something NoSQL databases can store.
DynamoDB can handle this scenario quite efficiently; it's well suited to high read activity and delivers consistent single-digit-millisecond latency at any scale. One caveat of DynamoDB compared to the others you mention is the 400 KB item size limit.
In order to get top performance from DynamoDB, you have to utilize the partition key as much as possible, because it provides you with hash-based access (super fast).
It's obvious that a unique identifier for the user (username?) should be part of the PK, but if there is another field that you always have at request time, like the country for example, you should include it in the PK as well.
Like so
PK                                     SK
Username#S2#Country#US#State#Georgia   Address#A1
It might be worth storing a mapping for the countries alone so you can retrieve them before executing the heavy query. Global secondary indexes are limited to 20 per table by default, so keep that in mind and reuse/overload indexes and keys as much as possible.
Stick to a single-table design to make the most of this.
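As a rough sketch (the table, key and attribute names here are hypothetical, not taken from your post), a single-table query against such a composite PK with the AWS SDK for Java v2 could look like this:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;
import software.amazon.awssdk.services.dynamodb.model.QueryResponse;

import java.util.Map;

public class TenantKvQuery {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        // Fetch all "Address#..." items for one user in one state with a single partition read
        QueryRequest request = QueryRequest.builder()
                .tableName("tenant_kv")   // hypothetical single-table name
                .keyConditionExpression("PK = :pk AND begins_with(SK, :prefix)")
                .expressionAttributeValues(Map.of(
                        ":pk", AttributeValue.builder()
                                .s("Username#S2#Country#US#State#Georgia").build(),
                        ":prefix", AttributeValue.builder().s("Address#").build()))
                .build();

        QueryResponse response = ddb.query(request);
        response.items().forEach(System.out::println);
    }
}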
As mentioned by Lee Hannigan, duplicate items are not supported; all keys (including those of the indexes) must be unique pairs.
I am building a dashboard using InfluxDB. I have a source which generates approx. 2000 points per minute. Each point has 5 tags, 6 fields. There is only one measurement.
Everything works fine for about 24 hours, but as the data size grows I am not able to run any queries on InfluxDB. For example, right now I have approximately 48 hours of data and even a basic select brings down InfluxDB:
select count(field1) from measurementname
It times out with the error:
ERR: Get http://localhost:8086/query?db=dbname&q=select+count%28field1%29+from+measuementname: EOF
Configuration:
InfluxDB version: 0.10.1 (default configuration)
OS version: Ubuntu 14.04.2 LTS
Hardware: 30 GB RAM, 4 vCPUs, 150 GB HDD
Some Background:
I have a dashboard and a web app querying InfluxDB. The web app lets a user query the DB based on tag1 or tag2.
Tags:
tag1 - unique for each record. Used in a where clause in the web app to get the record based on this field.
tag2 - unique for each record. Used in a where clause in the web app to get the record based on this field.
tag3 - used in group by. Think of it as departmentid tying a bunch of employees.
tag4 - used in group by. Think of it as departmentid tying a bunch of employees.
tag5 - used in group by. Values 0 or 1 or 2.
Pasting the answer from the InfluxDB Google Groups mailing list: https://groups.google.com/d/msgid/influxdb/b4fb503e-18a5-4bd5-84b1-632dc4950747%40googlegroups.com?utm_medium=email&utm_source=footer
tag1 - unique for each record.
tag2 - unique for each record.
This is a poor schema. You are creating a new series for every record, which puts a punishing load on the database. Each series must be indexed, and the entire index currently must reside in RAM. I suspect you are running out of memory after 48 hours because of series cardinality, and the query is just the last straw, not the actual cause of the low RAM situation.
It is very bad practice to use unique values in tags. You can still use fields in the WHERE clause; they just aren't as performant, and the damage to your system is much less than having a unique series for every point.
https://docs.influxdata.com/influxdb/v0.10/concepts/schema_and_data_layout/
https://docs.influxdata.com/influxdb/v0.10/guides/hardware_sizing/#when-do-i-need-more-ram
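For illustration, here is a hedged sketch with the influxdb-java client of how a point could be written with the per-record unique identifiers as fields rather than tags; the measurement, tag and field names are placeholders, and the retention policy name depends on your InfluxDB version:

import org.influxdb.InfluxDB;
import org.influxdb.InfluxDBFactory;
import org.influxdb.dto.Point;

import java.util.concurrent.TimeUnit;

public class WritePoint {
    public static void main(String[] args) {
        InfluxDB influxDB = InfluxDBFactory.connect("http://localhost:8086", "user", "pass");

        Point point = Point.measurement("measurementname")
                .time(System.currentTimeMillis(), TimeUnit.MILLISECONDS)
                // low-cardinality values stay as tags (indexed, usable in GROUP BY)
                .tag("tag3", "department-42")
                .tag("tag5", "1")
                // per-record unique identifiers become fields (not indexed, no series explosion)
                .addField("record_id1", "a1b2c3")
                .addField("record_id2", "x9y8z7")
                .addField("field1", 123.4)
                .build();

        influxDB.write("dbname", "default", point);  // "default" was the auto-created RP name in 0.10
        influxDB.close();
    }
}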
I'm interested in counting user interactions with uniquely identifiable resources between two points in time.
My use cases are:
Retrieve the total count for an individual resourceId (between time x and time y)
Produce a list of the top resourceIds ordered by count (between time x and time y)
Ideally I'd like to achieve this using DynamoDB. Sorting time series data in DynamoDB looks to have its challenges, and I'm running into some anti-best-practices whilst attempting to model the data.
Data model so far
A downsampled table could look like this, where count is the number of interactions with a resourceId within the bounds of a timebin.
| resourceId | timebin | count |
|---------------|-----------|-------|
|(Partition Key)| (Sort Key)| |
The total interaction count for each resource is the sum of the count attribute in each of the items with the same resourceId. As an unbounded "all time" count is of interest, older events will never become obsolete, but they can be further downsampled and rolled into larger timebins.
With the above schema, use case 1 is fulfilled by querying a resource using its hash key and enforcing time constraints using the sort key. The total count can then be calculated application side.
For use case 2, I'm looking to achieve the equivalent of an SQL GROUP BY resourceId, SUM(count). To do this the database needs to return all of the items that match the provided timebin constraints, regardless of resourceId. Grouping and summing of counts can then be performed application side.
Problem: With the above schema a full table scan is required to do this.
This is obviously something I would like to avoid.
Possible solutions
Heavily cache the query for use case 2, so that scan is used, but only rarely (e.g. once a day).
Maintain an aggregate table with, for example, predefined time ranges as the partition key and the corresponding count as the sort key.
i.e.
| resourceId | timeRange (partition) | count (sort) |
|------------|------------------------|--------------|
| 1234 | "all_time" | 9999 |
| 1234 | "past_day" | 533 |
Here, "all_time" has a fixed FROM date, so could be incremented each time a resourceId event is received. "past_day", however, has a moving FROM date so would need to be regularly re-aggregated using updated FROM and TO markers.
My Question
Is there a more efficient way to model this data?
Based on your description, with resourceId as the hash key of the table, aggregations within a single hash key can be accomplished with a query. Additionally, since timebin, the range key, can be compared using greater-than and less-than operators, you can get directly to the records you want with an efficient query and then sum up the counts on the application side.
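A rough sketch of that query with the AWS SDK for Java v2 (the table name, timebin format and attribute names are assumptions):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.QueryRequest;

import java.util.Map;

public class ResourceCount {
    public static void main(String[] args) {
        DynamoDbClient ddb = DynamoDbClient.create();

        QueryRequest request = QueryRequest.builder()
                .tableName("interactions")   // hypothetical table name
                .keyConditionExpression("resourceId = :rid AND #tb BETWEEN :from AND :to")
                .expressionAttributeNames(Map.of("#tb", "timebin"))
                .expressionAttributeValues(Map.of(
                        ":rid", AttributeValue.builder().s("1234").build(),
                        ":from", AttributeValue.builder().s("2016-01-01T00").build(),
                        ":to", AttributeValue.builder().s("2016-01-02T00").build()))
                .build();

        // Sum the per-timebin counts on the application side
        long total = ddb.query(request).items().stream()
                .mapToLong(item -> Long.parseLong(item.get("count").n()))
                .sum();
        System.out.println("total interactions: " + total);
    }
}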
However, this will not accomplish your second point, so additional work will be required to meet both requirements.
Maintaining an aggregate table seems like the logical approach for a global leader board. I'd recommend using DynamoDB Streams with AWS Lambda to maintain that aggregate table in near-real-time. This follows the AWS best practices.
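Here is a hedged sketch of what such a stream-triggered Lambda could look like in Java; the aggregate table name and keys are hypothetical, and it assumes the stream is configured with NEW_AND_OLD_IMAGES so the count delta can be applied:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

import java.util.Map;

public class AggregateUpdater implements RequestHandler<DynamodbEvent, Void> {

    private final DynamoDbClient ddb = DynamoDbClient.create();

    @Override
    public Void handleRequest(DynamodbEvent event, Context context) {
        for (DynamodbStreamRecord record : event.getRecords()) {
            if (record.getDynamodb().getNewImage() == null) {
                continue;  // ignore deletes
            }
            String resourceId = record.getDynamodb().getNewImage().get("resourceId").getS();
            long newCount = Long.parseLong(record.getDynamodb().getNewImage().get("count").getN());
            long oldCount = record.getDynamodb().getOldImage() == null
                    ? 0L
                    : Long.parseLong(record.getDynamodb().getOldImage().get("count").getN());

            // Atomically add the per-timebin delta to the "all_time" aggregate row
            ddb.updateItem(UpdateItemRequest.builder()
                    .tableName("interaction_aggregates")   // hypothetical aggregate table
                    .key(Map.of(
                            "resourceId", AttributeValue.builder().s(resourceId).build(),
                            "timeRange", AttributeValue.builder().s("all_time").build()))
                    .updateExpression("ADD #c :delta")
                    .expressionAttributeNames(Map.of("#c", "count"))
                    .expressionAttributeValues(Map.of(
                            ":delta", AttributeValue.builder().n(Long.toString(newCount - oldCount)).build()))
                    .build());
        }
        return null;
    }
}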
The periodic scan and aggregate approach is also valid, and depending on your table size it may be more practical since it is more straightforward to implement, but there are a number of things to watch out for...
Make sure the process that scans is separate from your main application execution logic. Populating this cache in real time would not be practical. Table scans are only practical for real time requests if the number of items in the table is just a few hundred or less.
Make sure you rate limit your scan so that this process doesn't consume all of the IOPS. Alternatively, you could substantially raise the IOPS during this time period and then lower them back once the process completes. Another alternative would be to make a GSI that is as narrow as possible to scan; dedicating the GSI to this process would avoid the need to rate limit, as it could consume all of the IOPS it wants without impacting other users of the table.
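A rough sketch of such a throttled scan-and-aggregate job (the page size, sleep interval and table name are arbitrary assumptions):

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import software.amazon.awssdk.services.dynamodb.model.ScanResponse;

import java.util.HashMap;
import java.util.Map;

public class ThrottledAggregationScan {
    public static void main(String[] args) throws InterruptedException {
        DynamoDbClient ddb = DynamoDbClient.create();
        Map<String, Long> totals = new HashMap<>();          // resourceId -> summed count
        Map<String, AttributeValue> lastKey = null;

        do {
            ScanRequest.Builder scan = ScanRequest.builder()
                    .tableName("interactions")                // hypothetical table name
                    .limit(100);                              // small pages keep consumed capacity per call low
            if (lastKey != null) {
                scan.exclusiveStartKey(lastKey);
            }
            ScanResponse page = ddb.scan(scan.build());

            for (Map<String, AttributeValue> item : page.items()) {
                totals.merge(item.get("resourceId").s(),
                        Long.parseLong(item.get("count").n()), Long::sum);
            }

            lastKey = page.hasLastEvaluatedKey() ? page.lastEvaluatedKey() : null;
            Thread.sleep(200);                                // crude rate limit between pages
        } while (lastKey != null);

        // top resourceIds by count, computed application side
        totals.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(10)
                .forEach(e -> System.out.println(e.getKey() + " -> " + e.getValue()));
    }
}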
We are designing a Staging layer to handle incremental load. I want to start with a simple scenario to design the staging.
In the source database there are two tables, e.g. tbl_Department and tbl_Employee. Both tables load a single table in the destination database, e.g. tbl_EmployeRecord.
The query that loads tbl_EmployeRecord is:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
Now we need to identify the incremental changes in tbl_Department and tbl_Employee, store them in staging, and load only the incremental data to the destination.
The columns of the tables are:
tbl_Department : DEPARTMENTID,DEPTNAME
tbl_Employee : EMPID,EMPNAME,DEPARTMENTID
tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME
Kindly suggest how to design the staging for this to handle Insert, Update and Delete.
Identifying Incremental Data
Incremental loading needs to be based on some segregating information present in your source table. Such information helps you identify the incremental portion of the data that you will load. Often, the load date or the last-updated date of the record is a good choice for this.
Consider this: your source table has a date column that stores both the date a record was inserted and the date of any update made to that record. On any given day during your staging load, you can use this date to identify which records were newly inserted or updated since your last staging load, and you consider only those changed/updated records as your incremental delta.
Given your table structure, I am not sure which column you could use for this. ID columns will not help, because if a record gets updated you won't know it.
Maintaining Load History
It is important to store information about how much you have loaded today so that you can load the next part in the next load. To do this, maintain a staging table, often called a batch load details table. That table will typically have a structure such as the one below:
BATCH ID | START DATE | END DATE | LOAD DATE | STATUS
------------------------------------------------------
1 | 01-Jan-14 | 02-Jan-14 | 02-Jan-14 | Success
You need to insert a new record into this table every day before you start the data load. The new record will have a start date equal to the end date of the last successful load and a null status. Once loading is successful, you update the status to 'Success'.
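For illustration, here is a minimal sketch of that bookkeeping via JDBC; the connection details are placeholders, the batch ID generation is deliberately simplified, and the SQL may need adjusting for your database's dialect:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BatchLoadBookkeeping {
    public static void main(String[] args) throws Exception {
        // Connection URL and credentials are placeholders
        try (Connection conn = DriverManager.getConnection("jdbc:<your-db-url>", "user", "password");
             Statement stmt = conn.createStatement()) {

            // 1. Open a new batch: start date = end date of the last successful load, status left null
            stmt.executeUpdate(
                "INSERT INTO BATCH_LOAD (BATCH_ID, START_DATE, END_DATE, LOAD_DATE, STATUS) " +
                "SELECT MAX(BATCH_ID) + 1, MAX(END_DATE), CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, NULL " +
                "FROM BATCH_LOAD WHERE STATUS = 'Success'");

            // 2. ... run the extraction query from the next section and load the delta here ...

            // 3. Mark the batch as successful only after the load completes
            stmt.executeUpdate("UPDATE BATCH_LOAD SET STATUS = 'Success' WHERE STATUS IS NULL");
        }
    }
}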
Modifying the Extraction Query to Take Advantage of the Batch Load Table
Once you maintain your loading history like this, you can include this table in your extraction query:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
WHERE E.load_date >= (SELECT max(START_DATE) FROM BATCH_LOAD WHERE status IS NULL)
What I am going to suggest is by no means a standard. In fact, you should evaluate my suggestion carefully against your requirements.
Suggestion
Use incremental loading for transaction data, not for master data. Transaction data is generally higher in volume and can easily be segregated into incremental chunks. Master data tends to be more manageable and can be fully loaded every time. In the above example, I am assuming your Employee table behaves like transactional data, whereas your Department table is your master data.
I trust this article on incremental loading will be very helpful for you.
I'm not sure what database you are using, so I'll just talk in conceptual terms. If you want to add tags for specific technologies, we can probably provide specific advice.
It looks like you have 1 row per employee and that you are only keeping the current record for each employee. I'm going to assume that EMPIDs are unique.
First, add a field to the query that currently populates the dimension. This field will be a hash of the other fields in the table (EMPID, EMPNAME, DEPTNAME). You can create a view, populate a new staging table, or just use the query. Also add this same hash field to the dimension table. Basically, the hash is an easy way to generate a single field that represents the record's values and is efficient to compare.
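Conceptually, the hash is just a digest of the concatenated attribute values; most databases also expose built-in hash functions you could use directly in the populating query. A small illustrative Java sketch (the column values are made up):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowHash {

    // Concatenate the dimension attributes with a delimiter and hash the result,
    // so two rows compare equal only when all of their attribute values match.
    static String rowHash(String empId, String empName, String deptName) throws NoSuchAlgorithmException {
        String concatenated = empId + "|" + empName + "|" + deptName;
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(concatenated.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(rowHash("101", "Jane Doe", "Engineering"));
    }
}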
Inserts: These are the records for which the EMPID does not already exist in the dimension table but does exist in your staging query/view.
Updates: These are the records for which the EMPID exists in both the staging query/view and the dimension table, but the hash field doesn't match.
Deletes: These are the records for which the EMPID exists in the dimension but does not exist in the staging query/view.
If this will be high-volume, you may want to create new tables to hold the records that should be inserted and the records that should be updated. Once you have identified the records, you can insert/update them all at once instead of one-by-one.
It's a bit uncommon to delete lots of records from a data warehouse, as they are typically used to keep history. I would suggest perhaps creating a status column or a bit field that indicates whether the record is active or deleted in the source. Of course, how you handle deletes should depend upon your business needs/reporting requirements. Just remember that if you do a hard delete, you can never get that data back if you decide you need it later.
Updating the existing dimension in place (rather than creating historical records for each change) is called a Type 1 dimension in dimensional modeling terms. This is fairly common. But if you decide you need to keep history, you can use the hash to help you create the SCD Type 2 records.
I have to update a field in a MySQL database on each HTTP request (number of page visits). Sometimes I have to do this for static files as well. In a heavy-load environment this can potentially block responses, because MySQL can only perform one update on a given record at a time.
I was thinking of:
appending to a file and parsing it in the background every X minutes
writing log records to a special MySQL table and doing the processing in the background
How would you optimize that?
I'm not convinced by what you've said about MySQL; more than one update can happen concurrently.
If you do reach MySQL's concurrency limits, have you considered using a different data store for this information? For example, MongoDB is very well adapted to these sorts of upsert operations, as it has a wide range of "fire and forget" update operators known as modifiers.
For example you can do something like
db.pagecounters.update({name: 'x'}, {$inc: {count: 1}}, true)
This would find the pagecounter object with name x and either increment count by 1 (if count exists) or set count to 1. If there was no such page counter, the final true (the upsert flag) would create one.
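For reference, the equivalent upsert with the modern MongoDB Java driver would look roughly like this (the connection string and database name are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class PageCounter {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> counters =
                    client.getDatabase("db").getCollection("pagecounters");

            // Increment the counter for page "x", creating the document if it doesn't exist yet
            counters.updateOne(
                    Filters.eq("name", "x"),
                    Updates.inc("count", 1),
                    new UpdateOptions().upsert(true));
        }
    }
}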