Error creating a keyspace with dataframes in spark cassandra - docker

I'm trying to connect Spark to Cassandra, and then query the keyspace and table from Flask.
The problem is that when I run the web application I get an error saying that the keyspace is not created:
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace MyKeyspace does not exist"
In Spark I run the following code:
val flightRecommendations = finalPredictions.writeStream.foreachBatch {
  (batchDF: DataFrame, batchId: Long) =>
    batchDF
      .write
      .cassandraFormat("MytableName", "MyKeyspace")
      .option("cluster", "cassandra_cluster")
      .mode("append")
      .save
}.start()
My question is whether the above code automatically generates the keyspace and the table.
I think it could also be a connection problem, because I'm working in Docker, and the setting I use is this:
spark.setCassandraConf("cassandra_cluster", CassandraConnectorConf.ConnectionHostParam.option("cassandra"))
Also, in the spark-submit command I pass the following two configurations:
--conf spark.cassandra.connection.host=cassandra \
--conf spark.sql.extensions=com.datastax.spark.connector.CassandraSparkExtensions \
It's strange, because spark-submit doesn't report any errors, but the keyspace is not created.

Yes, that's possible since Spark Cassandra Connector 2.5.0. There is a new function, createCassandraTableEx, that allows you to create a new table based on the DataFrame schema, and it has an option to handle the case where the table already exists (in addition to other things, like controlling the sorting of the clustering columns, table options, etc.). Before 2.5.0 there was the createCassandraTable function, but it threw an exception if the table already existed.
Here is an example from the blog post that announced the 2.5.0 release. For a DataFrame with the following structure:
root
|-- id: integer (nullable = false)
|-- c: integer (nullable = false)
|-- t: string (nullable = true)
it's possible to create a new table using the following code:
import com.datastax.spark.connector.cql.ClusteringColumn
import org.apache.spark.sql.cassandra._
import com.datastax.spark.connector._
data.createCassandraTableEx("test", "test_new", Seq("id"),
  Seq(("c", ClusteringColumn.Descending)),
  ifNotExists = true, tableOptions = Map("gc_grace_seconds" -> "1000"))
And you don't need to use foreachBatch with that version; it was required only before 2.5.0. In the new version you can just write:
val query = streamingCountsDF.writeStream
  .outputMode(OutputMode.Update)
  .format("org.apache.spark.sql.cassandra")
  .option("checkpointLocation", ".../checkpoint")
  .option("table", "tablename")
  .option("keyspace", "ksname")
  .start()
And with Spark 3.x & SCC 3.x you can create keyspaces & tables in Cassandra using Spark SQL - see the documentation for more details.
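For illustration, here is a minimal sketch of that approach, assuming SCC 3.x is on the classpath; the catalog name mycatalog and the keyspace/table names myks and mytable are placeholders, not taken from the question:
// Register a Cassandra catalog backed by the connector (the catalog name is arbitrary)
spark.conf.set("spark.sql.catalog.mycatalog",
  "com.datastax.spark.connector.datasource.CassandraCatalog")

// Create a keyspace with SimpleStrategy replication
spark.sql(
  """CREATE DATABASE IF NOT EXISTS mycatalog.myks
    |WITH DBPROPERTIES (class='SimpleStrategy', replication_factor='1')""".stripMargin)

// Create a table with "id" as the partition key
spark.sql(
  """CREATE TABLE IF NOT EXISTS mycatalog.myks.mytable
    |(id INT, c INT, t STRING)
    |USING cassandra
    |PARTITIONED BY (id)""".stripMargin)
The catalog should pick up the connection settings (e.g. spark.cassandra.connection.host) passed to spark-submit, and the table is then addressable from Spark SQL as mycatalog.myks.mytable.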

Related

Flink consumer job is unable to connect to Kafka from Docker

I have a simple streaming Flink Scala job which connects to a Kafka topic and maps its org.apache.avro.generic.GenericRecord messages into JSON strings.
When it is running in IntelliJ it ingests the topic well and prints out the JSONs.
When I run it in docker-compose I get the following exception:
com.esotericsoftware.kryo.KryoException: Error constructing instance of class: org.apache.avro.Schema$LockableArrayList
Serialization trace:
types (org.apache.avro.Schema$UnionSchema)
schema (org.apache.avro.Schema$Field)
fieldMap (org.apache.avro.Schema$RecordSchema)
schema (org.apache.avro.generic.GenericData$Record)
at com.twitter.chill.Instantiators$$anon$1.newInstance(KryoBase.scala:136)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1061)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.create(CollectionSerializer.java:89)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:93)
at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:22)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:143)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:21)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:679)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:106)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:657)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:262)
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.copy(TupleSerializer.java:111)
at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.copy(TupleSerializer.java:37)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:635)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:612)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:592)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:727)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:705)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collect(StreamSourceContexts.java:104)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$NonTimestampContext.collectWithTimestamp(StreamSourceContexts.java:111)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher.emitRecordWithTimestamp(AbstractFetcher.java:398)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.emitRecord(KafkaFetcher.java:185)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:150)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:715)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:208)
Caused by: java.lang.IllegalAccessException: Class com.twitter.chill.Instantiators$ can not access a member of class org.apache.avro.Schema$LockableArrayList with modifiers "public"
at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
at java.lang.reflect.AccessibleObject.slowCheckMemberAccess(AccessibleObject.java:296)
at java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:288)
at java.lang.reflect.Constructor.newInstance(Constructor.java:413)
at com.twitter.chill.Instantiators$.$anonfun$normalJava$1(KryoBase.scala:170)
at com.twitter.chill.Instantiators$$anon$1.newInstance(KryoBase.scala:133)
... 37 more
I tried forcing Avro serialization with:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.disableForceKryo()
env.getConfig.enableForceAvro()
but got the same error.
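For reference, the usual way to keep GenericRecord out of Kryo is to give Flink explicit Avro type information for it. A minimal sketch, assuming the flink-avro dependency is available; the schema string below is only a placeholder, not the schema from the job above:
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.formats.avro.AvroDeserializationSchema
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo

// Placeholder schema: replace with the actual Avro schema of the topic's records.
val schemaJson = """{"type":"record","name":"MyRecord","fields":[{"name":"f1","type":"string"}]}"""
val schema: Schema = new Schema.Parser().parse(schemaJson)

// Deserializer for the Kafka source, so records arrive as GenericRecord.
val deserializer = AvroDeserializationSchema.forGeneric(schema)

// Making the Avro type info implicit lets the Scala DataStream API serialize
// GenericRecord with Flink's Avro serializer instead of falling back to Kryo.
implicit val genericRecordTypeInfo: TypeInformation[GenericRecord] =
  new GenericRecordAvroTypeInfo(schema)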
Based on [this][1] I'm using all Flink-related dependencies as "provided", with no good result.
What could be the difference between running the job in the IDE and in Docker?
How can I fix the job to be able to read the Kafka topic from Docker?
How should I set up Docker for this?
How can I handle the Kryo/Avro serialization issue?
SS
[1]: http://www.alternatestack.com/development/com-esotericsoftware-kryo-kryoexception-unusual-solution-upgrading-flink/

pyhdfs.HdfsIOException: Failed to find datanode, suggest to check cluster health. excludeDatanodes=null

I am trying to run Hadoop using the Docker setup provided here:
https://github.com/big-data-europe/docker-hadoop
I use the following command:
docker-compose up -d
to bring up the services, and I am able to access them and browse the file system at localhost:9870. The problem arises whenever I try to use pyhdfs to put a file on HDFS. Here is my sample code:
hdfs_client = HdfsClient(hosts = 'localhost:9870')
# Determine the output_hdfs_path
output_hdfs_path = 'path/to/test/dir'
# Does the output path exist? If not then create it
if not hdfs_client.exists(output_hdfs_path):
    hdfs_client.mkdirs(output_hdfs_path)
hdfs_client.create(output_hdfs_path + 'data.json', data = 'This is test.', overwrite = True)
If the test directory does not exist on HDFS, the code is able to create it successfully, but when it gets to the .create part it throws the following exception:
pyhdfs.HdfsIOException: Failed to find datanode, suggest to check cluster health. excludeDatanodes=null
What surprises me is that my code is able to create the empty directory but fails to put the file on HDFS. My docker-compose.yml file is exactly the same as the one provided in the GitHub repo. The only change I've made is in the hadoop.env file, where I changed:
CORE_CONF_fs_defaultFS=hdfs://namenode:9000
to
CORE_CONF_fs_defaultFS=hdfs://localhost:9000
I have seen this other post on Stack Overflow and tried the following command:
hdfs dfs -mkdir hdfs:///demofolder
which works fine in my case. Any help is much appreciated.
I would keep the default CORE_CONF_fs_defaultFS=hdfs://namenode:9000 setting.
It works fine for me after adding a leading forward slash to the paths:
import pyhdfs
fs = pyhdfs.HdfsClient(hosts="namenode")
output_hdfs_path = '/path/to/test/dir'
if not fs.exists(output_hdfs_path):
    fs.mkdirs(output_hdfs_path)
fs.create(output_hdfs_path + '/data.json', data = 'This is test.')
# check that it's present
list(fs.walk(output_hdfs_path))
[('/path/to/test/dir', [], ['data.json'])]

Use MongoDB with docker-compose: create database and user

Is it possible to create a database using docker-compose? I'm trying to run MongoDB on Docker, but I'm not able to create a user and an initial database :(
version: '3.1'
services:
  mongo:
    image: mongo
    restart: always
    environment:
      - MONGO_INITDB_DATABASE=test
      - MONGO_INITDB_ROOT_USERNAME=admin
      - MONGO_INITDB_ROOT_PASSWORD=admin
    volumes:
      - ~/docker/volumes/mongodb:/data/db
    ports:
      - 27017:27017
What is missing in my script?
test code:
from pymongo import MongoClient
mongo_client = MongoClient('mongodb://%s:%s@127.0.0.1' % ('admin', 'admin'))
cursor = mongo_client.list_databases()
for db in cursor:
    print(db)
The above docker-compose seems fine. It will create a user named admin, and the DB will be named test, but only if your *.js init file contains a script that inserts data; otherwise no database is created, because MongoDB only creates a database when data is first written to it.
You can log in with the credentials specified in the environment:
- MONGO_INITDB_DATABASE=test
- MONGO_INITDB_ROOT_USERNAME=test
- MONGO_INITDB_ROOT_PASSWORD=admin
To initialize the DB, you need to mount your DB init script:
volumes:
  - ~/docker/volumes/mongodb:/data/db
  - ./yourdbscript/init.js:/docker-entrypoint-initdb.d/init.js
MONGO_INITDB_DATABASE
This variable allows you to specify the name of a database to be used
for creation scripts in /docker-entrypoint-initdb.d/*.js (see
Initializing a fresh instance below). MongoDB is fundamentally
designed for "create on first use", so if you do not insert data with
your JavaScript files, then no database is created.
Initializing a fresh instance
When a container is started for the first time it will execute files
with extensions .sh and .js that are found in
/docker-entrypoint-initdb.d. Files will be executed in alphabetical
order. .js files will be executed by mongo using the database
specified by the MONGO_INITDB_DATABASE variable, if it is present, or
test otherwise. You may also switch databases within the .js script.
mongo-docker
Update:
For your information, the code you pasted is not a JS file; it's a Python script. Also, consider changing the user name from admin, as it might conflict with the built-in admin user.
from pymongo import MongoClient
mongo_client = MongoClient('mongodb://%s:%s@127.0.0.1' % ('admin', 'admin'))
cursor = mongo_client.list_databases()
for db in cursor:
    print(db)
Your init script will be something like:
yourdbscript/init.js
You can try this:
db = db.getSiblingDB("test");
db.article.drop();
db.article.save( {
    title : "this is my title" ,
    author : "bob" ,
    posted : new Date(1079895594000) ,
    pageViews : 5 ,
    tags : [ "fun" , "good" , "fun" ] ,
    comments : [
        { author : "joe" , text : "this is cool" } ,
        { author : "sam" , text : "this is bad" }
    ],
    other : { foo : 5 }
});
Just like @Adiii's answer: if you just declare a DB name and do not insert any data, the DB will not actually be created.
As for your question: you can add a simple script (such as an .sh or .js script) to the path /docker-entrypoint-initdb.d, as described in the Initializing a fresh instance section of the mongo docs.

How to use the "recovery tool" to repair a perfino h2 database?

Our Perfino server crashed recently and has been logging the ERROR shown below ever since. (There are some clues hinting at an OutOfMemoryError resulting in a corrupt DB.)
It is suggested: 'Possible solution: use the recovery tool'. But neither the official Perfino documentation nor the logs offer more instructions on how to proceed.
So here is the question: how do I use the recovery tool?
Stacktrace:
ERROR [collector] server: could not load transaction data
org.h2.jdbc.JdbcSQLException: File corrupted while reading record: "[495834] stream data key:64898 pos:11 remaining:0". Possible solution: use the recovery tool; SQL statement:
SELECT value FROM transaction_names WHERE id=? [90030-176]
at org.h2.message.DbException.getJdbcSQLException(DbException.java:344)
at org.h2.message.DbException.get(DbException.java:178)
at org.h2.message.DbException.get(DbException.java:154)
at org.h2.index.PageDataIndex.getPage(PageDataIndex.java:242)
at org.h2.index.PageDataNode.getNextPage(PageDataNode.java:233)
at org.h2.index.PageDataLeaf.getNextPage(PageDataLeaf.java:400)
at org.h2.index.PageDataCursor.nextRow(PageDataCursor.java:95)
at org.h2.index.PageDataCursor.next(PageDataCursor.java:53)
at org.h2.index.IndexCursor.next(IndexCursor.java:278)
at org.h2.table.TableFilter.next(TableFilter.java:361)
at org.h2.command.dml.Select.queryFlat(Select.java:533)
at org.h2.command.dml.Select.queryWithoutCache(Select.java:646)
at org.h2.command.dml.Query.query(Query.java:323)
at org.h2.command.dml.Query.query(Query.java:291)
at org.h2.command.dml.Query.query(Query.java:37)
at org.h2.command.CommandContainer.query(CommandContainer.java:91)
at org.h2.command.Command.executeQuery(Command.java:197)
at org.h2.jdbc.JdbcPreparedStatement.executeQuery(JdbcPreparedStatement.java:109)
at com.mchange.v2.c3p0.impl.NewProxyPreparedStatement.executeQuery(NewProxyPreparedStatement.java:353)
at com.perfino.a.f.b.a.a(ejt:70)
at com.perfino.a.f.o.a(ejt:880)
at com.perfino.a.f.o.a(ejt:928)
at com.perfino.a.f.o.a(ejt:60)
at com.perfino.a.f.aa.a(ejt:783)
at com.perfino.a.f.o.a(ejt:847)
at com.perfino.a.f.o.a(ejt:792)
at com.perfino.a.f.o.a(ejt:787)
at com.perfino.a.f.o.a(ejt:60)
at com.perfino.a.f.ac.a(ejt:1011)
at com.perfino.b.a.b(ejt:68)
at com.perfino.b.a.c(ejt:82)
at com.perfino.a.f.o.a(ejt:1006)
at com.perfino.a.i.b.d.a(ejt:168)
at com.perfino.a.i.b.d.b(ejt:155)
at com.perfino.a.i.b.d.b(ejt:52)
at com.perfino.a.i.b.d.a(ejt:45)
at com.perfino.a.i.a.b.a(ejt:94)
at com.perfino.a.c.a.b(ejt:105)
at com.perfino.a.c.a.a(ejt:37)
at com.perfino.a.c.c.run(ejt:57)
at java.lang.Thread.run(Thread.java:745)
Notice: I couldn't recover my database with the procedure described below. I'm still keeping this post as a reference, as the probability of a successful recovery will depend on how broken the database is, and there is no evidence that this procedure is invalid.
By default, Perfino uses the H2 Database Engine as its persistent storage. H2 has a recovery tool and a RunScript tool to import SQL statements:
# 1. Create a dump of the current database using the tool [1]
# This tool creates a 'config.h2.sql' and a 'perfino.h2.sql' db dump
cd ${PERFINO_DATA_DIR}
java -cp ${PATH_TO_H2_LIB}/h2*.jar org.h2.tools.Recover
# 2. Rename the corrupt database file to e.g. *bkp
mv perfino.h2.db perfino.h2.db.bkp
# 3. Import the dump from step 1, ignoring errors
java -cp ${PATH_TO_H2_LIB}/h2*.jar \
org.h2.tools.RunScript \
-url jdbc:h2:${PERFINO_DATA_DIR}/db/perfino \
-script perfino.h2.sql -checkResults
[1]: Perfino includes a version of the h2.jar under ${PERFINO_INSTALL_DIR}/lib/common/h2.jar. You could of course download the official jar and try with it, but in my case, I could only restore the database with the jar supplied with Perfino.
This failed for me with:
Exception in thread "main" org.h2.jdbc.JdbcSQLException: Feature not supported: "Restore page store recovery SQL script can only be restored to a PageStore file".
If this happens to you, try:
# 1. Delete the existing database and MVStore files
cd ${PERFINO_DATA_DIR}
rm perfino.h2.db perfino.mv.db
# 2. Create a PageStore database manually
touch perfino.h2.db
# 3. Try the import again with MV_STORE=FALSE in the URL [2]
java -cp ${PATH_TO_H2_LIB}/h2*.jar \
  org.h2.tools.RunScript \
  -url "jdbc:h2:${PERFINO_DATA_DIR}/db/perfino;MV_STORE=FALSE" \
  -script perfino.h2.sql \
  -checkResults \
  -continueOnError
[2]: this forces H2 to recreate a PageStore DB instead of using the new storage engine (see this thread in Metabase).
I found this article while trying to repair a Confluence internal H2 database, and this worked for me. Here's a shell script as a gist on my GitHub with what I did; you'll have to adjust it for your environment.

Error when using export-graphml in Neo4j 2.2

I am trying to use the export-graphml function in Neo4j 2.2. I have downloaded the neo4j-shell-tools and extracted them into the lib directory. I am able to export the entire database as a GraphML file. However, if I try to export a subset using a query, I receive the following error:
Error occurred in server thread; nested exception is:
java.lang.NoSuchMethodError: org.neo4j.cypher.export.CypherResultSubGraph.from(Lorg/neo4j/cypher/javacompat/ExecutionResult;Lorg/neo4j/graphdb/GraphDatabaseService;Z)Lorg/neo4j/cypher/export/SubGraph;
The statement I used is:
export-graphml -o /path/to/file/out.graphml match (n:Person)-[r:RELATIONSHIP]-() WHERE n.id = 12345 return n, r
I have tried different variations with the different options (-r, -t) and none of them work.
