Connection to Hive/BigData through Knowage in Fiware ecosystems - docker

I'm new to Knowage and the Fiware ecosystem. I'm building a dataflow that stores context data in Orion and persists it to HDFS through Cygnus. Now I'm experimenting with connecting Knowage to my Spark/Hadoop environment to visualize and analyze these data, but I'm having a lot of difficulty configuring the correct jars for the Hive2 JDBC connector. I'm using the knowage:8.0 Docker image from Docker Hub, and I noticed that $(TOMCAT_HOME)/webapps/knowage/WEB-INF/lib does not contain the connectors for Hive/Hive2/Spark, so I assumed I had to put them there manually, but it has been a real struggle...
My question: is it possible in any way to connect Knowage Community Edition to a Big Data source? I read this in the documentation, in the section about Big Data and NoSQL (https://knowage-suite.readthedocs.io/en/master/administrator-guide/configure-data-sources.html):
"Please note that these connections are available for products KnowageBD and KnowagePM only."
I tried different versions of hive-jdbc, but I always get the same exception:
Caused by: java.sql.SQLException: Method not supported
at org.apache.hive.jdbc.HiveConnection.isValid(HiveConnection.java:1018)
at org.apache.commons.dbcp2.DelegatingConnection.isValid(DelegatingConnection.java:916)
at org.apache.commons.dbcp2.PoolableConnection.validate(PoolableConnection.java:282)
at org.apache.commons.dbcp2.PoolableConnectionFactory.validateConnection(PoolableConnectionFactory.java:362)
at org.apache.commons.dbcp2.BasicDataSource.validateConnectionFactory(BasicDataSource.java:2340)
at org.apache.commons.dbcp2.BasicDataSource.createPoolableConnectionFactory(BasicDataSource.java:2323)
... 50 more
Here is the popup message from the Knowage UI: https://i.stack.imgur.com/UUIDB.png
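As a sanity check outside Knowage, a minimal standalone test along these lines (the host, port and credentials are placeholders for my setup) should tell whether the hive-jdbc jar and its transitive dependencies are usable at all. Note that the failing call in the stack trace above is Connection.isValid(), which this Hive driver apparently does not implement, so even a connection that works here can still fail the pool's validation inside Knowage.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSmokeTest {
        public static void main(String[] args) throws Exception {
            // Requires hive-jdbc (or the standalone jar) plus its dependencies on the classpath
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Placeholder connection details: adjust host, port, database and credentials
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-server:10000/default", "hive", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }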

Related

Insert/update docs in neo4j using Couchbase

I want to insert/update documents in Couchbase, and from there they should be automatically inserted/updated in the Neo4j database. Is there any plugin or software to do this? How can I achieve this functionality?
Couchbase enterprise version: 6.6
Neo4j enterprise version: 4.1.3
I read this blog https://dzone.com/articles/couchbase-amp-jdbc-integrations-for-neo4j-3x but I am not clear about the Neo4j JSON Loader; please guide me on this.
You could also use the Couchbase Eventing Service, which responds to every mutation and triggers a fragment of JavaScript code. Refer to https://docs.couchbase.com/server/current/eventing/eventing-overview.html
You would probably want to use something similar to the code in this scriptlet example: https://docs.couchbase.com/server/current/eventing/eventing-handler-curl-post.html. Provided that the Neo4j REST API responds in under 1 ms and honors KeepAlive, a 12-physical-core system could stream about 40K inserts (or updates) per second from Couchbase to your Neo4j instance.
You can use the Couchbase Kafka connector to send CDC events to Kafka.
https://docs.couchbase.com/kafka-connector/current/quickstart.html
From there, you can read the Kafka topics in order to import the data into Neo4j:
https://github.com/neo4j-contrib/neo4j-streams
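If you would rather not install the neo4j-streams plugin inside Neo4j, a small consumer of your own is also an option. Here is a rough sketch (the topic name, Bolt URI, credentials and the MERGE query are placeholders, not something prescribed by the connector) using the Kafka Java client and the official Neo4j Java driver:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Session;
    import org.neo4j.driver.Values;

    public class CouchbaseCdcToNeo4j {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "couchbase-to-neo4j");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                         AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {
                // Topic written by the Couchbase Kafka source connector (placeholder name)
                consumer.subscribe(Collections.singletonList("couchbase.mybucket"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        // Upsert one node per Couchbase document id; the value is kept
                        // as a raw JSON string to keep the example short
                        session.run("MERGE (d:Document {id: $id}) SET d.json = $json",
                                Values.parameters("id", record.key(), "json", record.value()));
                    }
                }
            }
        }
    }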

Neo4J online backup — any way to address the security flaw?

If I am to make an online backup using the neo4j-admin backup tool remotely, as is advised by Neo4J, I have to open a public IP and the backup port on my Neo4J application.
However, I don't see neo4j-admin asking for any login credentials, which basically makes it possible for anybody to access the server and copy all the data while the port is open.
There is no setting inside the neo4j.conf that would only accept backup requests from a certain address.
So what does it mean? When the online backups are done remotely, as is advised, the database may be vulnerable to somebody else just copying all the data.
I didn't find anything in the Neo4J documentation that addresses this flaw (only a warning), and it looks like in the more than 7 years this feature has been available as part of the commercial Enterprise edition, no solution has been offered for it.
What do you do to protect the DB then? At the moment the only solution seems to be not backing it up remotely, but that causes additional stress on the server and is not ideal. Plus, the online backup is not stable when done locally for large DBs. Another option could be to open the port only via some kind of API to the server while the backup runs, but that may still be exploited if somebody figures out the time frame when the backup is made.
The documentation states that neo4j-admin must be invoked as the neo4j user, i.e. the user that owns the neo4j executables and the databases. So security is handled by the OS login, and the file permissions should be set to prevent unauthorised access to the neo4j directories/files, including the neo4j-admin executable.

Problem: empty graphs in GKE cluster node detail (No data for this time interval). How can I fix it?

I have a cluster in Google Cloud, and I need information about resource usage.
In the interface of each node there are three graphs for CPU, memory and disk usage. But all of these graphs, on every node, show the warning "No data for this time interval" for any time interval.
I upgraded all clusters and nodes to the latest version 1.15.4-gke.22 and changed "Legacy Stackdriver Logging" to "Stackdriver Kubernetes Engine Monitoring".
But it didn't help.
In the Stackdriver Workspace only "disk_read_bytes" has a graph; any other request in Metrics Explorer only shows the message "No data for this time interval".
If I run "kubectl top nodes" on the command line, I see current data for CPU and memory. But I need to see it on the node detail page to understand the peak load. How can I configure it?
In my case, I was missing permissions on the IAM service account associated with the cluster - make sure it has the roles:
Monitoring Metrics Writer (roles/monitoring.metricWriter)
Logs Writer (roles/logging.logWriter)
Stackdriver Resource Metadata Writer (roles/stackdriver.resourceMetadata.writer)
This is documented here
Actually, it sounds strange: if you can get metrics on the command line but the Stackdriver interface doesn't show them, it may be a bug.
I recommend this: if you are able, create a cluster with minimal resources and check the same Stackdriver metrics; if the metrics appear there, it may be a bug and you can report it through the appropriate GCP channel.
Check the documentation about how to get support within GCP:
Best Practices for Working with Cloud Support
Getting support for Google Cloud

Difference between database connector/reader nodes in KNIME

While creating a basic workflow using KNIME and PostgreSQL, I have encountered problems with selecting the proper node for fetching data from the database.
In the node repository we can find at least:
PostgreSQL Connector
Database Reader
Database Connector
Actually, we can do the same using 2) alone, or by connecting either 1) or 3) to the input of node 2).
I assumed there are some hidden advantages, like improved performance with complex queries or better overall stability, but on the other hand we are using exactly the same database driver anyway.
There is a big difference between the Connector Nodes and the Reader Node.
The Database Reader reads data into KNIME; the data is then on the machine running the workflow. This can be a bad idea for big tables.
The Connector nodes do not. The data remains where it is (usually on a remote machine in your cluster). You can then connect Database nodes to the connector nodes. All data manipulation will then happen within the database; no data is loaded onto your machine (unless you use the output port preview).
As for the difference between the other two:
The PostgreSQL Connector is just a special case of the Database Connector with a pre-set configuration. However, you can create the same configuration with the Database Connector, which lets you choose more detailed options for non-standard databases.
One advantage of using 1 or 3 is that you only need to enter connection details once for a database in a workflow, and can then use multiple reader or writer nodes. I'm not sure if there is a performance benefit.
1 offers simpler connection details than 2, thanks to the bundled Postgres JDBC drivers.

does neo4j embedded driver lock the db files?

I have a general question about the embedded driver for neo4j. What exactly does it mean to be embedded, besides being lower level and higher performance? Is it an actual instance of the database service, or just a driver for connecting to an existing database process or service? For instance:
Does using the embedded driver libraries acquire an exclusive lock on the database files?
Can multiple clients use the embedded driver to use the same database at the same time?
Can it run against a database that already has a database service (along with the REST API) running? Initial tests seem to indicate no, since it throws a file lock exception.
Does the embedded driver have to be on the same machine or process as the database service? For instance, suppose the db data files are on a shared SAN that multiple machines can access, and there is another server running the REST API and the neo4j service. The configuration of the driver seems to point to the data files directly rather than to a service or port.
I am using embedded Neo4j in a project.
Embedded Neo4j is a Neo4j server started and shut down by your application, so it is not just a driver used to connect to some standalone server. For a standalone server you would use Neo4j over REST (locally or remotely).
Because of its implementation, embedded Neo4j can be used by only one application: the application that started the embedded instance. It acquires a lock on the graph files, and you can't use any other application (e.g. neo4j-shell) to access those files as long as the embedded server is running.
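To make the "started and shut down by your application" part concrete, here is a rough sketch using the 3.x embedded API (the store path is a placeholder; older versions follow the same idea with slightly different factory signatures):

    import java.io.File;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Label;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.graphdb.factory.GraphDatabaseFactory;

    public class EmbeddedNeo4jExample {
        public static void main(String[] args) {
            // Opens the store directory directly and takes the lock on its files;
            // a second process pointing at the same directory fails with a lock exception.
            GraphDatabaseService db = new GraphDatabaseFactory()
                    .newEmbeddedDatabase(new File("/var/lib/neo4j/data/graph.db"));

            // Release the lock cleanly when the application exits
            Runtime.getRuntime().addShutdownHook(new Thread(db::shutdown));

            try (Transaction tx = db.beginTx()) {
                db.createNode(Label.label("Person"));
                tx.success();
            }
        }
    }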
