Delete Bigtable row in Apache Beam 2.2.0 - google-cloud-dataflow

In Dataflow 1.x versions, we could use CloudBigtableIO.writeToTable(TABLE_ID) to create, update, and delete Bigtable rows. As long as a DoFn was configured to output a Mutation object, it could output either a Put or a Delete, and CloudBigtableIO.writeToTable() successfully created, updated, or deleted a row for the given RowID.
It seems that the new Beam 2.2.0 API uses the BigtableIO.write() function, which works with KV<RowID, Iterable<Mutation>>, where the Iterable contains a set of row-level operations. I have figured out how to use that for cell-level data, so it's OK to create new rows and create/delete columns, but how do we delete rows now, given an existing RowID?
Any help appreciated!
** Some further clarification:
From this document: https://cloud.google.com/bigtable/docs/dataflow-hbase I understand that changing the dependency ArtifactID from bigtable-hbase-dataflow to bigtable-hbase-beam should be compatible with Beam version 2.2.0, and the article suggests doing Bigtable writes (and hence deletes) in the old way by using CloudBigtableIO.writeToTable(). However, that requires imports from the com.google.cloud.bigtable.dataflow family of dependencies, which the Release Notes suggest is deprecated and shouldn't be used (and indeed it seems incompatible with the new Configuration classes/etc.)
** Further Update:
It looks like my pom.xml didn't refresh properly after the change from the bigtable-hbase-dataflow to the bigtable-hbase-beam ArtifactID. Once the project got updated, I am able to import from the com.google.cloud.bigtable.beam.* package, which seems to be working at least for a minimal test.
HOWEVER: It looks like there are now two different Mutation classes:
com.google.bigtable.v2.Mutation and
org.apache.hadoop.hbase.client.Mutation?
And in order to get everything to work together, does it have to be specified explicitly which Mutation is used for which operation?
Is there a better way to do this?

Unfortunately, Apache Beam 2.2.0 doesn't provide a native interface for deleting an entire row (including the row key) in Bigtable. The only full solution would be to continue using the CloudBigtableIO class as you already mentioned.
A different solution would be to just delete all the cells from the row. This way, you can fully move forward with using the BigtableIO class. However, this solution does NOT delete the row key itself, so the cost of storing the row key remains. If your application requires deleting many rows, this solution may not be ideal.
import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Mutation.DeleteFromRow;

// Mutation that deletes all cells from a row
Mutation deleteAllCells = Mutation.newBuilder()
    .setDeleteFromRow(DeleteFromRow.getDefaultInstance())
    .build();
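For reference, here is a minimal sketch of wiring that mutation into a pipeline that feeds BigtableIO.write(). This is an assumption-laden example: the ClearRows/ToDeleteFromRowFn names and the project/instance/table IDs are placeholders, and the exact BigtableIO configuration methods may vary slightly between Beam versions.

import com.google.bigtable.v2.Mutation;
import com.google.bigtable.v2.Mutation.DeleteFromRow;
import com.google.cloud.bigtable.config.BigtableOptions;
import com.google.protobuf.ByteString;
import java.util.Collections;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class ClearRows {

  // Turns a row key into a single DeleteFromRow mutation for that key.
  static class ToDeleteFromRowFn extends DoFn<String, KV<ByteString, Iterable<Mutation>>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      Mutation clearRow = Mutation.newBuilder()
          .setDeleteFromRow(DeleteFromRow.getDefaultInstance())
          .build();
      Iterable<Mutation> mutations = Collections.singletonList(clearRow);
      c.output(KV.of(ByteString.copyFromUtf8(c.element()), mutations));
    }
  }

  // Attaches the transform and the Bigtable sink to a PCollection of row keys.
  static void clearRows(PCollection<String> rowKeys) {
    rowKeys
        .apply(ParDo.of(new ToDeleteFromRowFn()))
        .apply(BigtableIO.write()
            .withBigtableOptions(new BigtableOptions.Builder()
                .setProjectId("my-project")    // placeholder
                .setInstanceId("my-instance")  // placeholder
                .build())
            .withTableId("my-table"));         // placeholder
  }
}

Note that this uses the protobuf Mutation (com.google.bigtable.v2), which is the one BigtableIO.write() expects.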

I would suggest continuing to use CloudBigtableIO with bigtable-hbase-beam. It shouldn't be too different from CloudBigtableIO in bigtable-hbase-dataflow.
CloudBigtableIO uses the HBase org.apache.hadoop.hbase.client.Mutation classes and translates them into their Bigtable equivalents under the covers.
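If you go that route, a minimal sketch could look like the following (the DeleteRows/ToDeleteFn names and the project/instance/table IDs are placeholders; check the bigtable-hbase-beam documentation for the exact builder methods in your version):

import com.google.cloud.bigtable.beam.CloudBigtableIO;
import com.google.cloud.bigtable.beam.CloudBigtableTableConfiguration;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteRows {

  // Converts a row key into an HBase Delete that removes the whole row.
  static class ToDeleteFn extends DoFn<String, Mutation> {
    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output(new Delete(Bytes.toBytes(c.element())));
    }
  }

  // Attaches the delete step and the CloudBigtableIO sink to a PCollection of row keys.
  static void deleteRows(PCollection<String> rowKeys) {
    CloudBigtableTableConfiguration config =
        new CloudBigtableTableConfiguration.Builder()
            .withProjectId("my-project")    // placeholder
            .withInstanceId("my-instance")  // placeholder
            .withTableId("my-table")        // placeholder
            .build();
    rowKeys
        .apply(ParDo.of(new ToDeleteFn()))
        .apply(CloudBigtableIO.writeToTable(config));
  }
}

Here Delete and Mutation are the HBase classes (org.apache.hadoop.hbase.client), which also answers the two-Mutation question above: the HBase Mutation goes with CloudBigtableIO, while com.google.bigtable.v2.Mutation goes with BigtableIO.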

Related

Deletion of module columns/attribute not visible on a given View

I have been given the task of creating a DXL script. First problem is that I have never used DXL before, even though I have many years experience with DOORS itself. I have been surfing the Net to seek guidance on my particular problem. I also have a few specimen DXL scripts for reference.
My new client requires that for each View of a given Module, of which there are many Views, new "reduced" Modules are to be produced reflecting each View.
By "reduced", I mean that these new Modules are to contain nothing that isn't actually needed for that View., i.e. Columns, Attributes etc. These new Modules will only have the single View.
So, the way forward as I see it, is to take copies of the single master Module, one for each View, rename those copies to reflect a given Master Module/Required View, select that required View in the given copy Module and then delete everything that is not needed by that View, i.e. available Columns, Attributes etc.
This would be simple if I had the required DXL knowledge, which I am endeavouring to pick up as fast as I can.
If at all possible, this script has to be generic and be able to work upon any of the master Module copies to produce the associated "reduced" Module reflecting a particular View.
The client aims to use the script periodically for View archiving (I know, that's the way they want it).
Clarification
Some clarification of what I believe is required, given the following text from my original question:
If at all possible, this script has to be generic and be able to work upon any of the master Module copies to produce the associated "reduced" Module reflecting a particular View.
So, say there are ten views of the master Module, outside of the DXL script, I would copy the master Module ten times, renaming each copy to reflect each of the ten views. Unless you know different, each of those ten copies will reflect the same “Absolute Number”s as are in the master Module, so no problem there?
So, starting with the first of the copied Modules, each named to reflect the View it will eventually represent, its View would be set from the ten Views available to it, that which matches its title.
The single generic DXL script would then be run against that first copy Module, the aim being to delete everything not actually needed for that view, i.e. Attributes, Columns etc. Would some kind of purging command be required in the script for any aforementioned deleted items?
The single generic DXL script would then delete ALL views from that copy Module. The log that is produced when running the script also needs capturing, but I'm not sure whether this should be done from within the script, if possible, or as a separate manual task outside of the script.
The aforementioned (indented) process would then be repeated, using the same generic script, against the remaining nine copied Modules. The intention is to leave us with ten copy Modules, each one reflecting one of the ten possible Views, with each one containing only the Attributes, Columns etc. required for that View.
Creating a mirror of a module with this approach is not so easy IMO. Think e.g. about "Absolute Number". If the original module contains the numbers 15 (level 1), 2000 (level 2), 1 (level 1), you will have to create 2000 objects, purge 1997 of them and move them to the correct place.
There is a "duplicate" tool at https://www.ibm.com/developerworks/community/forums/html/topic?id=43862118-113d-4eac-b3f1-21d3b73959d1 which tries to do this, but as stated there, this script is said not to work correctly in all situations.
So, I would rather use the approach "string clipCopy (Item i); string clipPaste(Folder folderRef)". Should be faster and less error prone. But: all Out-Links will also be copied with this method, you will probably have to delete these after the copy or else the link target module(s) will have lots of In-Links.
The problem is still not so easy to solve, as every view might have DXL columns that rely on some or other attribute, and it might contain DXL attributes which again might rely on something else. I doubt that there is a way to analyze DXL code "on the fly" and find out which columns may be deleted.
Perhaps a totally different approach would be feasible: open each view and create an export to Excel; this way you will get rid of any dynamic dependencies. Then re-import the Excel sheet into a new DOORS module. You will still have the "Absolute Number" problem, but perhaps you can make a deal that you will have a pseudo attribute "Original Absolute Number" and disregard the "new" "Absolute Number".
Quite a big task for a DXL beginner....
Update: On second thought, perhaps you might want to combine these approaches
agree with your employer that you will use an alternative attribute for Absolute Number
use a loop like Russel suggested, when creating objects remember that objects might have to be created "below" or "after" its predecessor or sibling
for DXL attributes do not copy the DXL code but the actual current value of the object
for DXL columns create pseudo attributes _ and create a new view that uses these pseudo attributes instead of the original value
Copying the entire module, then deleting everything not in that view, seems worse than just copying the things you need from each particular view.
I would take the following as the outline of your program:
for view in main module do {
    for column in view do {
        Find attribute for each column and store (possibly in a skip list?)
        Store name of column
    }
    create new module
    create needed types / attributes in new module
    create new view in new module
    for object in main module {
        create object in new module
        for attribute in main module {
            check if attribute is in new module {
                copy info from old object to new
            }
        }
    }
}
Each of these "for X in Y" loops should be in the DXL reference manual in some form or another.
If you need more help, let me know!

Neo4j APOC import error

I have a data model that starts with a single record. This has a custom "recordId" that's a UUID; it then relates out to other nodes, and they in turn relate to each other. That starting node is what defines the data that "belongs" together, as if we had separate databases inside Neo4j. I need to export this data into a backup data set that can be re-imported into either the same or a new database with ease.
After some help, I'm using APOC to do the export:
call apoc.export.cypher.query("MATCH (start:installations)
WHERE start.recordId = \"XXXXXXXX-XXX-XXX-XXXX-XXXXXXXXXXXXX\"
CALL apoc.path.subgraphAll(start, {}) YIELD nodes, relationships
RETURN nodes, relationships", "/var/lib/neo4j/data/test_export.cypher", {})
There are then 2 problems I'm having:
Problem 1 is that the exported data uses internal Neo4j identifiers to generate the relationships. This is bad if we need to import into a new database and the UNIQUE IMPORT ID values already exist. I need to have this data generated with my own custom recordIds as the point of reference.
Problem 2 is that the import doesn't even work.
call apoc.cypher.runFile("/var/lib/neo4j/data/test_export.cypher") yield row, result
returns:
Failed to invoke procedure apoc.cypher.runFile: Caused by: java.lang.RuntimeException: Error accessing file /var/lib/neo4j/data/test_export.cypher
I'm hoping someone can help me figure out what may be going on, but I'm not sure what additional info is helpful. No one in the Neo4j slack channel has been able to help find a solution.
Thanks.
Problem 1:
The exported file does not contain any internal Neo4j ids. It is not safe to use Neo4j ids outside of the database, since they are not globally unique, so you should not use them to transfer data from one database to another.
If you want globally unique ids, you can use an external plugin like the GraphAware UUID plugin. (Disclaimer: I work for GraphAware.)
Problem 2:
If you cannot access the file, then possible reasons are:
apoc.import.file.enabled=true is not set in neo4j.conf
an OS-level permission is not set

Loading a .trig file with inference to Fuseki using the 'tdbloader' bulk loader?

I am currently writing some Java code that extracts some data and writes it out as Linked Data, using the TriG syntax. I am now using Jena, and Fuseki to create a SPARQL endpoint to query and visualize this data.
The data is written so that each source dataset gives me a .trig file, containing one named graph. So I want to load those files into Fuseki. Except that it doesn't seem to understand the TriG syntax...
If I remove the named graphs and rename the files as .ttl, everything loads perfectly into the default graph. But if I try to import TriG files:
using Fuseki's webapp uploader, it either crashes ("Can't make new graphs") or adds nothing except the prefixes, as if the graphs other than the default ones could not be added (the logs say nothing helpful except the error code and description).
using Java code, the process is too slow. I used this technique: "Loading a .trig file into TDB?", but my TriG files are pretty big, so this solution is not very good for me.
So I tried to use the bulk loader, the console command 'tdbloader'. This time everything seems fine, but in the webapp, there is still no data.
You can see the process going fine here: quads are added just fine.
But the result still keeps only the default graph and its original data: nothing is added.
So, I don't know what to do. The guys behind Jena and Fuseki suggested not using the bulk loader from the Java code (but rather the command-line tool), so that's one solution I guess I'd like to avoid.
Did I miss something obvious about how to load TRIG files to Fuseki? Thanks.
UPDATE :
As it seemed to be a problem in my configuration (see the comments of this post for a link to my config file; I cannot post more than 2 links), I tried to add some kind of specification for some named graphs I would like to see added to the dataset on Fuseki.
I added code to link (with ja:namedgraph) external graphs that I added via tdbloader. This seems to work. Great!
Now another problem: there's no inference, even though my config file specifies an inference model... I set that queries should be applied with named graphs merged as the default graph, but this does not seem to carry the OWL inference rules... So simple queries work, but I have (1) to specify the graph I query (with "FROM") and (2) no inference in my data.
The two methods are to use the TDB bulk loader offline, or to POST data into the dataset directly (i.e. HTTP POST operations to http://localhost:3030/ds).
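If you want to do the POST from Java, here is a hedged sketch (assuming a recent Jena with the org.apache.jena packages; the file name and the dataset's data endpoint are placeholders, and the /ds/data path should be adjusted to your Fuseki setup) that reads the TriG file into a Dataset and pushes each named graph over the graph store protocol:

import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetAccessor;
import org.apache.jena.query.DatasetAccessorFactory;
import org.apache.jena.riot.RDFDataMgr;

public class PushTrig {
  public static void main(String[] args) {
    // Parse the TriG file into an in-memory dataset (named graphs preserved).
    Dataset ds = RDFDataMgr.loadDataset("data.trig");  // placeholder file
    DatasetAccessor accessor =
        DatasetAccessorFactory.createHTTP("http://localhost:3030/ds/data");  // placeholder endpoint
    // Push each named graph to the Fuseki graph store endpoint.
    ds.listNames().forEachRemaining(
        name -> accessor.putModel(name, ds.getNamedModel(name)));
  }
}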
You can test whether your graphs are there with a query like:
SELECT (count(*) AS ?C) { GRAPH ?g { ?s ?p ?o } }
The named graphs will show up when the Fuseki server is started unless your configuration of the SPARQL services only exports one graph.

Creating a DTS package that uses a stored procedure

We're trying to make a DTS package where it'll launch a stored procedure and capture the contents in a flat file. This will have to run every night, and the new file should overwrite the existing file.
This wouldn't normally be a problem, as we just plug in the query and it runs, but this time everything was complicated enough that we chose to approach it with a stored procedure employing temporary tables. How can I go about using this in a DTS package? I tried going the normal route with the Wizard and then plugging in EXEC BlahBlah.dbo... It did not care for that:
The Statement could not be parsed. Additional information: Invalid object name '#DestinyDistHS'. (Microsoft SQL Server Native Client 10.0)
Can anyone guide me in the right direction here?
Thanks.
Is it an option to simply populate a non-temp table in your SP, call it, and select from that non-temp table when exporting?
This is only an issue if you have multiple simultaneous calls to the stored procedure. In this case you can't save to a single table.
If you do have multiple simultaneous calls then you might be able to:
Create a temp table to hold results
Use INSERT INTO #TempTable EXEC YourProc
SELECT * FROM #TempTable
You might need to do this in a more forgiving command line tool (like SQLCMD). It's not as fussy about metadata.

Neo4J and timestamps

I need information about a node's creation and last-modification dates...
Is there a way to automatically handle created and updated properties for a node?
Hibernate offers @Version for the updated field. Is there something similar with Neo4j?
I found http://neo4j.rubyforge.org/classes/Neo4j/Rails/Timestamps.html but it seems to be only available for Ruby.
You can use annotations from the spring-data-commons library. Use @CreatedDate and @LastModifiedDate on properties of type Long. Make sure you're using the simple mapping mode. For now, advanced mapping mode does not support this; see DATAGRAPH-335.
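A minimal sketch of what that can look like (the entity and field names here are made up, and depending on your Spring Data version you may also need auditing enabled in your configuration):

import org.springframework.data.annotation.CreatedDate;
import org.springframework.data.annotation.LastModifiedDate;
import org.springframework.data.neo4j.annotation.GraphId;
import org.springframework.data.neo4j.annotation.NodeEntity;

@NodeEntity
public class Article {

    @GraphId
    private Long id;

    // Set automatically when the node is first saved.
    @CreatedDate
    private Long created;

    // Updated automatically on each subsequent save.
    @LastModifiedDate
    private Long lastModified;
}

The created/lastModified fields hold timestamps as Long values, matching the "properties of type Long" requirement above.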
