Amazon Kinesis Firehose - How to pause a stream?

I need the ability to pause a stream in AWS Kinesis Firehose.
I need it when I have to perform a schema change that requires re-creating the table (for example, a change in the sort key).
Those changes usually require creating a new table, inserting the rows into the new table, then dropping the original table and renaming the new table to the original name. Doing this will result in the loss of rows that were streamed during this process.
I can think of two workarounds:
1. Renaming the original table at the beginning of the process, then forcing Firehose to fail and retry until you make the change and rename it back. I am not sure the retry mechanism is bulletproof enough for this.
2. Defining a time interval of a few hours (as needed) between the loads, then watching the "COPY" queries and doing the same as #1 just after the COPY. This is just a bit safer than #1.
Neither workaround feels like a best practice, to put it mildly.
Is there a better solution?
How bulletproof are my solutions?

I encountered the same issue and did the following. Note: for this method to work, you must have timestamps (created_at in the answer below) on the events you are ingesting into Redshift from Kinesis.
Assume table1 is the table you already have, and Kinesis is dumping events into it from firehose1.
1. Create a new firehose, firehose2, that dumps events to a new table, table2, which has the same schema as table1.
2. Once you can confirm that events are landing in table2, and max(created_at) in table1 is less than min(created_at) in table2, delete firehose1. We can now be sure we will not lose any data because there is already an overlap between table1 and table2.
3. Create a table table3 that has the same schema as table1. Copy all events from table1 into table3.
4. Drop table1 and recreate it, this time with the sort key.
5. Recreate firehose1 to continue dumping events into table1.
6. Once events start landing in table1 again, confirm that min(created_at) in table1 is less than max(created_at) in table2. When this is true, delete firehose2.
7. Copy all events from table2 with created_at strictly greater than max(created_at) in table3 and strictly less than min(created_at) in table1 into table1. If your system allows events with the same timestamp, there may be duplicates introduced in this step.
8. Copy all events from table3 back into the new table1.
EDIT: You can avoid using table3 if you use ALTER TABLE to rename table1 to table1_old and then make the table2 described above the new table1.
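For concreteness, a rough Redshift SQL sketch of the overlap check, the backfill from table2 and the restore from table3 described above (table and column names are the ones used in this answer; treat it as an illustration rather than a tested script):
-- Overlap check before deleting firehose2: the first value should be smaller than the second.
SELECT MIN(created_at) FROM table1;
SELECT MAX(created_at) FROM table2;
-- Backfill the gap from table2. Run this before copying table3 back,
-- so MIN(created_at) in table1 still reflects only the newly streamed rows.
INSERT INTO table1
SELECT * FROM table2
WHERE created_at > (SELECT MAX(created_at) FROM table3)
  AND created_at < (SELECT MIN(created_at) FROM table1);
-- Restore the original rows.
INSERT INTO table1
SELECT * FROM table3;
-- Alternative from the EDIT: skip table3 and swap by renaming instead.
-- ALTER TABLE table1 RENAME TO table1_old;
-- ALTER TABLE table2 RENAME TO table1;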

Since an AWS Kinesis data stream can retain data (1 day by default and up to a year, though retention beyond 24 hours incurs additional charges), I recommend deleting the delivery stream (Kinesis Firehose) and, once you are done with the upgrade/maintenance work, simply re-creating a new delivery stream.

Related

Reducing over multiple joins in CouchDB

In my CouchDB database, I have the following models (implemented as documents in the database with different type fields):
Team: name, id (has many matches, has many fans)
Match: name, team_a, team_b, time (has many teams, has many tweets)
Fan: team_id (has many tweets)
Tweet: time, sentiment, fan_id
I want to average the tweet sentiment for each team. If I were using SQL I'd do it like this:
SELECT avg(sentiment)
FROM team
JOIN match on team.id = match.team_a OR team.id = match.team_b
JOIN fan on fan.team_id = team.id
JOIN tweet on (tweet.time BETWEEN match.time AND match.time + interval '1 hour') AND tweet.fan_id = fan.id
GROUP BY team.id
However in CouchDB you can at best do 1 join in a view function, as explained in the docs (by emitting the join field as the key).
How can this be better modelled in CouchDB to allow for this query to work? I don't really want to denormalise too much, but I guess I will if I have to?
It's a bit complex, but I use what I call "tertiary indexes". The goal is to be able to write a view that is applied to another view. Unfortunately, the only way to do this is to use a view to write data to a secondary database and then have another view that works on that database. Doing this requires an outside process - I use a script that listens to the _changes feed of the primary database, and then updates the relevant documents in the secondary database when something changes.
So in your example your secondary database could consist of a single document for each team with all of the (or the latest) match/fan/tweet data in that one document. Then you write a view that extracts the sentiment (or whatever) from that secondary database.

How can I replace records into another dataset?

I want to copy records of a few employees from one company into another larger one. If a primary key duplicate conflict occurs, the record has to be replaced. For Delphi DataSet there are commands "Insert", "Append", "Edit", and "Delete", but is there an easy way to "Replace" the record between the same tables, without knowing the full table structure or primary keys? There are like 30+ fields and they may be changed in the future.
In MySQL it would be REPLACE INTO table2 (SELECT * FROM table1) but I wanted to change a few fields in the target table, like employee's ID and department codes.
I'm afraid there is no way to replace/overwrite dataset records in Delphi. But using MySQL I can select the source data into a temp table, modify the temp table data, and place it into the target table.
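For what it's worth, a rough sketch of that temp-table route in MySQL (table1 and table2 are the names from the question; the employee_id and department_code columns, the filter and the remapping are made up for illustration, and both tables are assumed to share the same column order):
-- Copy the selected employees into a temporary table.
CREATE TEMPORARY TABLE tmp_employees AS
SELECT * FROM table1 WHERE department_code = 'OLD_DEPT';
-- Adjust the fields that must differ in the target company.
UPDATE tmp_employees
SET employee_id = employee_id + 100000,
    department_code = 'NEW_DEPT';
-- REPLACE overwrites rows whose primary key already exists in table2.
REPLACE INTO table2
SELECT * FROM tmp_employees;
DROP TEMPORARY TABLE tmp_employees;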

Update data between databases sqlite - ios

Here's my case
I have to replace my car company database file with a new one because of some structural changes.
The newer car company table (in the database) has some additional companies compared with the old one.
car company
( id,
companyname,
address,
phone,
isUserFavorite,
...
)
I want to back up the user favorite field from the old database, which means I have to SELECT and UPDATE between the two databases.
Also, I need to back up the history table too (SELECT and INSERT into the newer one).
I think I have to ATTACH DATABASE, do my tasks, and then DETACH DATABASE when I'm done, right?
But I don't know how to do that in particular. Do I have to write multiple methods to do these tasks, or how can I execute multiple SQLite queries in one method?
Thanks
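For reference, a minimal sketch of the ATTACH/UPDATE/DETACH flow described above (the carcompany and history table names, the column names and the file path are assumptions based on the question):
-- Run against the new database, attaching the old one.
ATTACH DATABASE '/path/to/old.sqlite' AS olddb;
-- Carry the user's favorites over from the old database by matching ids.
UPDATE carcompany
SET isUserFavorite = (SELECT o.isUserFavorite
                      FROM olddb.carcompany o
                      WHERE o.id = carcompany.id)
WHERE id IN (SELECT id FROM olddb.carcompany);
-- Copy the history table into the new database (same schema assumed).
INSERT INTO history SELECT * FROM olddb.history;
DETACH DATABASE olddb;
Each statement can be executed one after another on the same database handle (for example with sqlite3_exec), so a single method can run the whole sequence.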

2 column table, ignore duplicates on mass insert postgresql

I have a Join table in Rails which is just a 2 column table with ids.
In order to mass insert into this table, I use
ActiveRecord::Base.connection.execute("INSERT INTO myjointable (first_id,second_id) VALUES #{values}")
Unfortunately this gives me errors when there are duplicates. I don't need to update any values, simply move on to the next insert if a duplicate exists.
How would I do this?
As an FYI, I have searched Stack Overflow and most of the answers are a bit advanced for me to understand. I've also checked the PostgreSQL documents and played around in the Rails console, but still to no avail. I can't figure this one out, so I'm hoping someone else can help tell me what I'm doing wrong.
The closest statement I've tried is:
INSERT INTO myjointable (first_id,second_id) SELECT 1,2
WHERE NOT EXISTS (
SELECT first_id FROM myjointable
WHERE first_id = 1 AND second_id IN (...))
Part of the problem with this statement is that I am only inserting 1 value at a time whereas I want a statement that mass inserts. Also the second_id IN (...) section of the statement can include up to 100 different values so I'm not sure how slow that will be.
Note that for the most part there should not be many duplicates so I am not sure if mass inserting to a temporary table and finding distinct values is a good idea.
Edit to add context:
The reason I need a mass insert is because I have a many-to-many relationship between 2 models, where 1 of the models is never populated by a form. I have stocks and stock price histories. The stock price histories are never created in a form, but rather mass inserted by pulling the data from Yahoo Finance with their API. I use the activerecord-import gem to mass insert stock price histories (i.e. Model.import columns,values), but I can't type jointable.import columns,values because I get an error that jointable is an undefined local variable.
I ended up using the WITH clause to select my values and give it a name. Then I inserted those values and used WHERE NOT EXISTS to effectively skip any items that are already in my database.
So far it looks like it is working...
WITH withqueryname(first_id,second_id) AS (VALUES(1,2),(3,4),(5,6)...etc)
INSERT INTO jointablename (first_id,second_id)
SELECT * FROM withqueryname
WHERE NOT EXISTS(
SELECT first_id FROM jointablename WHERE
first_id = 1 AND
second_id IN (1,2,3,4,5,6..etc))
You can interchange the Values with a variable. Mine was VALUES#{values}
You can also interchange the second_id IN with a variable. Mine was second_id IN #{variable}.
Here's how I'd tackle it: Create a temp table and populate it with your new values. Then lock the old join values table to prevent concurrent modification (important) and insert all value pairs that appear in the new table but not the old one.
One way to do this is by doing a left outer join of the old values onto the new ones and filtering for rows where the old join table values are null. Another approach is to use an EXISTS subquery. The two are highly likely to result in the same query plan once the query optimiser is done with them anyway.
Example, untested (since you didn't provide an SQLFiddle or sample data) but should work:
BEGIN;
CREATE TEMPORARY TABLE newjoinvalues(
first_id integer,
second_id integer,
primary key(first_id,second_id)
);
-- Now populate `newjoinvalues` with multi-valued inserts or COPY
COPY newjoinvalues(first_id, second_id) FROM stdin;
LOCK TABLE myjoinvalues IN EXCLUSIVE MODE;
INSERT INTO myjoinvalues
SELECT n.first_id, n.second_id
FROM newjoinvalues n
LEFT OUTER JOIN myjoinvalues m ON (n.first_id = m.first_id AND n.second_id = m.second_id)
WHERE m.first_id IS NULL AND m.second_id IS NULL;
COMMIT;
This won't update existing values, but you can do that fairly easily too with a second query that does an UPDATE ... FROM while still holding the write table lock.
Note that the lock mode specified above will not block SELECTs, only writes like INSERT, UPDATE and DELETE, so queries can continue to be made against the table while the process is ongoing; you just can't write to it.
If you can't accept that, an alternative is to run the update in SERIALIZABLE isolation (which only works properly for this purpose in Pg 9.1 and above). This will result in the query failing whenever a concurrent write occurs, so you have to be prepared to retry it over and over again. For that reason it's likely better to just live with locking the table for a while.
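For reference, a sketch of the equivalent NOT EXISTS form mentioned above, run against the same temporary table:
INSERT INTO myjoinvalues (first_id, second_id)
SELECT n.first_id, n.second_id
FROM newjoinvalues n
WHERE NOT EXISTS (
    SELECT 1
    FROM myjoinvalues m
    WHERE m.first_id = n.first_id
      AND m.second_id = n.second_id
);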

How to make sure that it is possible to update a database table column only in one way?

I am using Ruby on Rails v3.2.2 and I would like to "protect" a class/instance attribute so that a database table column value can be updated only one way. That is, for example, given I have two database tables:
table1
- full_name_column
table2
- name_column
- surname_column
and I manage the table1 so that the full_name_column is updated by using a callback stated in the related table2 class/model, I would like to make sure that it is possible to update the full_name_column value only through that callback.
In other words, I should ensure that the table1.full_name_column value is always
"#{table2.name_column} #{table2.surname_column}"
and that it can't be another value. So, for example, if I try to "directly" update the table1.full_name_column, it should raise something like an error. Of course, that value must be readable.
Is it possible? What do you advice on handling this situation?
Reasons to this approach...
I want to use that approach because I am planning to perform database searches on table1 columns, where table1 contains other values related to a "profile"/"person" object... otherwise I would probably have to make some hack (maybe a complex hack) to direct those searches to table2 so as to look for "#{table2.name_column} #{table2.surname_column}" strings.
So, I think a simple way is to denormalize data as explained above, but it requires implementing an "uncommon" way of handling that data.
BTW: An answer should be intended to "solve" the related process or to find a better approach for handling the search functionality.
Here are two approaches for maintaining the data at the database level...
Views and materialized tables.
If possible, table1 could be a VIEW or, for example, a MATERIALIZED QUERY TABLE (MQT). The terminology might differ slightly depending on the RDBMS used; I think Oracle has MATERIALIZED VIEWs whereas DB2 has MATERIALIZED QUERY TABLEs.
A VIEW is simply access to data that physically lives in some different table, whereas a MATERIALIZED VIEW/QUERY TABLE is a physical copy of the data, and therefore, for example, not in sync with the source data in real time.
Anyway, these approaches would provide read-only access to data that is owned by table2 but accessible through table1.
Example of very simple view:
CREATE VIEW table1 AS
SELECT surname||', '||name AS full_name
FROM table2;
Triggers
Sometimes views are not convenient, as you might actually want to have some data in table1 that is not available from anywhere else. In these cases you could consider using database triggers, i.e. create a trigger so that when table2 is updated, table1 is also updated within the same database transaction.
With triggers the problem might be that you then have to grant the client privileges to update table1 as well. Some RDBMSs might provide ways to tune the access control of triggers, i.e. the operations performed by the TRIGGER would run with different privileges from the operations that initiate the TRIGGER.
In this case the TRIGGER could look something like this:
CREATE TRIGGER UPDATE_NAME
AFTER UPDATE OF NAME, SURNAME ON TABLE2
REFERENCING NEW AS NEWNAME
FOR EACH ROW
BEGIN ATOMIC
UPDATE TABLE1 SET FULL_NAME = NEWNAME.SURNAME||', '||NEWNAME.NAME
WHERE SOME_KEY = NEWNAME.SOME_KEY;
END;
By replicating the data from table2 into table1 you've already de-normalized it. As with any de-normalization, you must be disciplined about maintaining sync. This means not updating things you're not supposed to.
Although you can wall off things with attr_accessible to prevent accidental assignment, the way Ruby works means there's no way to guarantee that value will never be modified. If someone's determined enough, they will find a way. This is where the discipline comes in.
The best approach is to document that the column should not be modified directly, block mass-assignment with attr_accessible, and leave it at that. There's no concept of a write-protected attribute, really, as far as I know.
