I am building a simple prototype in which I read data from Pub/Sub and use Beam SQL; the code snippet is below:
val eventStream: SCollection[String] = sc.pubsubSubscription[String]("projects/jayadeep-etl-platform/subscriptions/orders-dataflow")
  .withFixedWindows(Duration.standardSeconds(10))

val events: SCollection[DemoEvents] = eventStream.applyTransform(ParDo.of(new DoFnExample()))
events.map(row => println("Input Stream:" + row))

val pickup_events = SideOutput[DemoEvents]()
val delivery_events = SideOutput[DemoEvents]()

val (mainOutput: SCollection[DemoEvents], sideOutputs: SideOutputCollections) = events
  .withSideOutputs(pickup_events, delivery_events)
  .flatMap {
    case (evts, ctx) =>
      evts.eventType match {
        // Send to side outputs via `SideOutputContext`
        case "pickup"   => ctx.output(pickup_events, evts)
        case "delivery" => ctx.output(delivery_events, evts)
      }
      Some(evts)
  }

val pickup: SCollection[DemoEvents] = sideOutputs(pickup_events)
val dropoff = sideOutputs(delivery_events)

pickup.map(row => println("Pickup:" + row))
dropoff.map(row => println("Delivery:" + row))

val consolidated_view = tsql"select $pickup.order_id as orderId, $pickup.area as pickup_location, $dropoff.area as dropoff_location, $pickup.restaurant_id as resturantId from $pickup as pickup left outer join $dropoff as dropoff ON $pickup.order_id = $dropoff.order_id".as[Output]

consolidated_view.map(row => println("Output:" + row))

sc.run().waitUntilFinish()
()
I am using DirectRunner to test this locally, and I can see results right up until the Beam SQL step is executed. The output from Beam SQL is not getting printed.
Input Stream:DemoEvents(false,pickup,Bangalore,Indiranagar,1566382242,49457442008,1566382242489,7106576,1566382242000,178258,7406545542,,false,null,htr23e22-329a-4b05-99c1-606a3ccf6a48,972)
Pickup:DemoEvents(false,pickup,Bangalore,Indiranagar,1566382242,49457442008,1566382242489,7106576,1566382242000,178258,7406545542,,false,null,htr23e22-329a-4b05-99c1-606a3ccf6a48,972)
Input Stream:DemoEvents(false,delivery,Bangalore,Indiranagar,1566382242,49457442008,2566382242489,7106576,1566382242000,178258,7406545542,,false,null,htr23e22-329a-4b05-99c1-606a3ccf6a48,972)
Delivery:DemoEvents(false,delivery,Bangalore,Indiranagar,1566382242,49457442008,2566382242489,7106576,1566382242000,178258,7406545542,,false,null,htr23e22-329a-4b05-99c1-606a3ccf6a48,972)
The issue was related to a bug in DirectRunner; when I changed the runner to DataflowRunner, the code ran as expected.
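For completeness, a minimal sketch of how the runner can be switched through the pipeline options that build the ScioContext. The project, region, and bucket below are placeholder values, not from the original job:

// Build the ScioContext from command-line args; --runner selects the runner.
val (sc, args) = ContextAndArgs(cmdlineArgs)

// Local testing:
//   --runner=DirectRunner
// On Dataflow (placeholder project/region/bucket):
//   --runner=DataflowRunner --project=my-gcp-project --region=us-central1 --tempLocation=gs://my-bucket/tmp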
I'm trying to convert a predicted RasterFrameLayer in RasterFrames into a GeoTiff file after training a machine learning model.
When using the demo data Elkton-VA from RasterFrames, it works fine.
But when using a cropped Sentinel-2A TIFF with an NDVI index (normalized from -1000 to 1000), it fails with a NullPointerException in the toRaster step.
It feels like it's due to the NoData values outside the ROI.
The test data is here: geojson and log.
GeoTrellis version: 3.3.0
RasterFrames version: 0.9.0
import geotrellis.proj4.LatLng
import geotrellis.raster._
import geotrellis.raster.io.geotiff.{MultibandGeoTiff, SinglebandGeoTiff}
import geotrellis.raster.io.geotiff.reader.GeoTiffReader
import geotrellis.raster.render.{ColorRamps, Png}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql._
import org.locationtech.rasterframes._
import org.locationtech.rasterframes.ml.{NoDataFilter, TileExploder}
object ClassificationRaster extends App {

  def readTiff(name: String) = GeoTiffReader.readSingleband(getClass.getResource(s"/$name").getPath)

  def readMtbTiff(name: String): MultibandGeoTiff = GeoTiffReader.readMultiband(getClass.getResource(s"/$name").getPath)

  implicit val spark = SparkSession.builder()
    .master("local[*]")
    .appName(getClass.getName)
    .withKryoSerialization
    .getOrCreate()
    .withRasterFrames

  import spark.implicits._

  val filenamePattern = "xiangfuqu_202003_mask_%s.tif"
  val bandNumbers = "ndvi".split(",").toSeq
  val bandColNames = bandNumbers.map(b ⇒ s"band_$b").toArray
  val tileSize = 256

  val joinedRF: RasterFrameLayer = bandNumbers
    .map { b ⇒ (b, filenamePattern.format(b)) }
    .map { case (b, f) ⇒ (b, readTiff(f)) }
    .map { case (b, t) ⇒ t.projectedRaster.toLayer(tileSize, tileSize, s"band_$b") }
    .reduce(_ spatialJoin _)
    .withCRS()
    .withExtent()

  val tlm = joinedRF.tileLayerMetadata.left.get
  // println(tlm.totalDimensions.cols)
  // println(tlm.totalDimensions.rows)
  joinedRF.printSchema()

  val targetCol = "label"

  val geojsonPath = "/Users/ethan/work/data/L2a10m4326/zds/test.geojson"
  spark.sparkContext.addFile(geojsonPath)
  import org.locationtech.rasterframes.datasource.geojson._

  val jsonDF: DataFrame = spark.read.geojson.load(geojsonPath)
  val label_df: DataFrame = jsonDF
    .select($"CLASS_ID", st_reproject($"geometry", LatLng, LatLng).alias("geometry"))
    .hint("broadcast")

  val df_joined = joinedRF.join(label_df, st_intersects(st_geometry($"extent"), $"geometry"))
    .withColumn("dims", rf_dimensions($"band_ndvi"))

  val df_labeled: DataFrame = df_joined.withColumn(
    "label",
    rf_rasterize($"geometry", st_geometry($"extent"), $"CLASS_ID", $"dims.cols", $"dims.rows")
  )

  df_labeled.printSchema()

  val tmp = df_labeled.filter(rf_tile_sum($"label") > 0).cache()

  val exploder = new TileExploder()
  val noDataFilter = new NoDataFilter().setInputCols(bandColNames :+ targetCol)

  val assembler = new VectorAssembler()
    .setInputCols(bandColNames)
    .setOutputCol("features")

  val classifier = new DecisionTreeClassifier()
    .setLabelCol(targetCol)
    .setFeaturesCol(assembler.getOutputCol)

  val pipeline = new Pipeline()
    .setStages(Array(exploder, noDataFilter, assembler, classifier))

  val evaluator = new MulticlassClassificationEvaluator()
    .setLabelCol(targetCol)
    .setPredictionCol("prediction")
    .setMetricName("f1")

  val paramGrid = new ParamGridBuilder()
    //.addGrid(classifier.maxDepth, Array(1, 2, 3, 4))
    .build()

  val trainer = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(4)

  val model = trainer.fit(tmp)

  val metrics = model.getEstimatorParamMaps
    .map(_.toSeq.map(p ⇒ s"${p.param.name} = ${p.value}"))
    .map(_.mkString(", "))
    .zip(model.avgMetrics)

  metrics.toSeq.toDF("params", "metric").show(false)

  val scored = model.bestModel.transform(joinedRF)

  scored.groupBy($"prediction" as "class").count().show
  scored.show(20)

  val retiled: DataFrame = scored.groupBy($"crs", $"extent").agg(
    rf_assemble_tile(
      $"column_index", $"row_index", $"prediction",
      tlm.tileCols, tlm.tileRows, IntConstantNoDataCellType
    )
  )

  val rf: RasterFrameLayer = retiled.toLayer(tlm)

  val raster: ProjectedRaster[Tile] = rf.toRaster($"prediction", 5848, 4189)

  SinglebandGeoTiff(raster.tile, tlm.extent, tlm.crs).write("/Users/ethan/project/IdeaProjects/learn/spark_ml_learn.git/src/main/resources/easy_b1.tif")

  val clusterColors = ColorRamp(
    ColorRamps.Viridis.toColorMap((0 until 1).toArray).colors
  )

  // val pngBytes = retiled.select(rf_render_png($"prediction", clusterColors)).first //It can output the png.
  // retiled.tile.renderPng(clusterColors).write("/Users/ethan/project/IdeaProjects/learn/spark_ml_learn.git/src/main/resources/classified2.png")
  // Png(pngBytes).write("/Users/ethan/project/IdeaProjects/learn/spark_ml_learn.git/src/main/resources/classified2.png")

  spark.stop()
}
I suspect there is a bug in the way the toLayer extension method works; I will follow up with a bug report to the RasterFrames project. That will take a little more effort, I suspect.
Here is a possible workaround that is a little lower level. In this case it results in 25 non-overlapping GeoTiffs being written out.
import geotrellis.store.hadoop.{SerializableConfiguration, _}
import geotrellis.spark.Implicits._
import org.apache.hadoop.fs.Path

// Need this to write local files from Spark
val hconf = SerializableConfiguration(spark.sparkContext.hadoopConfiguration)

ContextRDD(
  rf.toTileLayerRDD($"prediction")
    .left.get
    .filter {
      case (_: SpatialKey, null) ⇒ false // remove any null Tiles
      case _ ⇒ true
    },
  tlm)
  .regrid(1024) // Regrid the Tiles so that they are 1024 x 1024
  .toGeoTiffs()
  .foreach { case (sk: SpatialKey, gt: SinglebandGeoTiff) ⇒
    val path = new Path(new Path("file:///tmp/output"), s"${sk.col}_${sk.row}.tif")
    gt.write(path, hconf.value)
  }
EDIT: Just got the same behavior on 3.4.
EDIT 2: If I remove disableLosslessIntegers from the connection, the issue goes away, but then all integer numbers come back as {low: 20, high: 0}-style structures, which breaks my entire application.
The following code works fine on neo4j 3.3 using the 1.7.2 neo4j-driver for Node:
import {v1 as neo4j} from 'neo4j-driver';

const url: string = process.env.COREDB_URL || '';
const user: string = process.env.COREDB_USERNAME || '';
const password: string = process.env.COREDB_PASSWORD || '';

const driver = neo4j.driver(url, neo4j.auth.basic(user, password), {disableLosslessIntegers: true});
let connection = driver.session();

async function go() {
    let res = await connection.run(`create (b:Banana {tag: 'test'}) return b, id(b) as id`, {});
    let b = res.records[0].get('b').properties;
    console.log('b', b);
    let id = res.records[0].get('id');
    console.log('id', id);

    res = await connection.run(`MATCH (u) where id(u)=$id return u as id`, {id: id});
    console.log(res.records);
    let id2 = res.records[0].get('id').properties;
    console.log('id2', id2);
}

go().then(() => console.log('done')).catch((e) => console.log(e.message));
It gives the following output:
> node tools\test-id.js
b { tag: 'test' }
id 1858404
[ Record {
keys: [ 'id' ],
length: 1,
_fields: [ [Node] ],
_fieldLookup: { id: 0 } } ]
id2 { tag: 'test' }
done
Under 3.5.1 it does not work. The second statement returns no records:
> node tools\test-id.js
b { tag: 'test' }
id 1856012
[]
Cannot read property 'get' of undefined
By the way, the reason I need to do the get-by-id right after the create is that I am using an APOC trigger to add things to the node after creation, and APOC triggers apparently run after the object is created and returned, so I need the second get to see the transformed node.
For this distilled example, however, I removed the trigger from my DB to ensure it was not causing the issue.
I am using a Groovy script to pull data from an Oracle DB for a Jenkins job combo box. It is taking a lot of time to pull the data.
How can I improve the performance?
import groovy.sql.Sql

Properties properties = new Properties()
File propertiesFile = new File('/opt/groovy/db.properties')
propertiesFile.withInputStream {
    properties.load(it)
}

def Param = []
def arg = []
args.each { arg.push(it) }

def dbUrl = 'jdbc:oracle:thin:@' + properties.dbServer + ':52000/' + properties.dbSchema
sql = Sql.newInstance(dbUrl, properties.dbUser, properties.dbPassword, properties.dbDriver)

switch (arg[0]) {
    case { it == 'APP' }:
        Param.push('Select')
        query = "SELECT DISTINCT APP FROM INV ORDER BY APP"
        sql.eachRow(query) { row ->
            Param.push(row[0])
        }
        def App_array_final = Param.collect { '"' + it + '"' }
        print App_array_final
        break;
I'm attempting to build a ROM-based Window function using DSPComplex and FixedPoint types, but seem to keep running into the following error:
chisel3.core.Binding$ExpectedHardwareException: vec element 'dsptools.numbers.DspComplex#32' must be hardware, not a bare Chisel type
The source code for my attempt at this looks like the following:
class TaylorWindow(len: Int, window: Seq[FixedPoint]) extends Module {
  val io = IO(new Bundle {
    val d_valid_in = Input(Bool())
    val sample = Input(DspComplex(FixedPoint(16.W, 8.BP), FixedPoint(16.W, 8.BP)))
    val windowed_sample = Output(DspComplex(FixedPoint(24.W, 8.BP), FixedPoint(24.W, 8.BP)))
    val d_valid_out = Output(Bool())
  })

  val win_coeff = Vec(window.map(x => DspComplex(x, FixedPoint(0, 16.W, 8.BP))).toSeq) // ROM storing our coefficients.

  io.d_valid_out := io.d_valid_in

  val counter = Reg(UInt(10.W))
  // Implicit reset
  io.windowed_sample := io.sample * win_coeff(counter)

  when(io.d_valid_in) {
    counter := counter + 1.U
  }
}
println(getVerilog(new TaylorWindow(1024, fp_seq)))
I'm actually reading the coefficients in from a file (this particular window has a complex generation function that I'm doing in Python elsewhere) with the following sequence of steps:
val filename = "../generated/taylor_coeffs"
val coeff_file = Source.fromFile(filename).getLines
val double_coeffs = coeff_file.map(x => x.toDouble)
val fp_coeffs = double_coeffs.map(x => FixedPoint.fromDouble(x, 16.W, 8.BP))
val fp_seq = fp_coeffs.toSeq
Does this mean the DspComplex type can't be translated to Verilog?
Commenting out the win_coeff line seems to make the whole thing generate (but it clearly doesn't do what I want it to do).
I think you should try using
val win_coeff = VecInit(window.map(x=>DspComplex.wire(x, FixedPoint.fromDouble(0.0, 16.W, 8.BP))).toSeq) // ROM storing our coefficients.
which will create hardware values as you want. Vec on its own just creates a Vec of the specified type, i.e. a bare Chisel type rather than hardware.
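For reference, a minimal sketch of that distinction (assuming chisel3 with the experimental FixedPoint type; the widths and coefficient values below are placeholders):

// Vec(n, gen) only describes a type, e.g. usable when declaring a port:
val coeffPort = Input(Vec(4, FixedPoint(16.W, 8.BP)))
// VecInit(seq) takes hardware elements (here FixedPoint literals) and builds an actual ROM:
val coeffRom = VecInit(Seq(0.5, 0.25, 0.125, 0.0625).map(d => FixedPoint.fromDouble(d, 16.W, 8.BP)))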
Below is a simplified version of my code. I'm simply trying to get Shiny to pass an input value to an rmongodb query, run the query based on the user input, and then plot the mean of a variable. The code below includes everything needed to replicate my issue, including insertion of documents into the collection.
I'd be very grateful for any help! I'm pulling my hair out (and there wasn't much left to begin with). I suspect that I'm placing the reactive() command inappropriately or something along those lines.
Many thanks to whoever can provide assistance.
#Install required packages and call each
library(devtools)
install_github(repo = "mongosoup/rmongodb")
library(rmongodb)
library(shiny)
#Establish connection with mongodb, check status, name database and collection, insert some documents, return one document
mongo <- mongo.create()
mongo.insert(mongo, "simpledb.main",'{"user":"Joe", "age":34}')
mongo.insert(mongo, "simpledb.main",'{"user":"Joe", "age":31}')
mongo.insert(mongo, "simpledb.main",'{"user":"Joe", "age":53}')
mongo.insert(mongo, "simpledb.main",'{"user":"Kate", "age":29}')
mongo.insert(mongo, "simpledb.main",'{"user":"Lisa", "age":21}')
mongo.insert(mongo, "simpledb.main",'{"user":"Henry", "age":34}')
mongo.insert(mongo, "simpledb.main",'{"user":"David", "age":43}')
if(mongo.is.connected(mongo) == TRUE) {
    help("mongo.count")
    mongo.count(mongo, "simpledb.main")
}
if(mongo.is.connected(mongo) == TRUE) {
    mongo.find.one(mongo, "simpledb.main")
}
#Code needed for Shiny UI
ui <- fluidPage(
    fluidRow(
        column(2, textInput(inputId = "userName", label = "", value = "Enter name here"))),
    mainPanel(plotOutput(outputId = "main_plot"))
)

#Code needed for Shiny server
server <- function(input, output) {
    queryReactive <- reactive({
        nameFinal <- paste0(input$userName)
        query = mongo.bson.buffer.create()
        mongo.bson.buffer.append(query, "user", nameFinal)
        query = mongo.bson.from.buffer(query)
    })

    #Run the query and store results as an R list object
    queryresults <- mongo.find.all(mongo=mongo, ns = coll, query=queryReactive)

    #Convert the R list object into a data frame
    resultsdf <- data.frame(matrix(unlist(queryresults), nrow=length(queryresults), byrow=T), stringsAsFactors=FALSE)

    output$main_plot <- renderPlot({boxplot(as.numeric(resultsdf$X3))})
}

#Code needed to call Shiny UI and server
shinyApp(ui = ui, server = server)
There is no need for a reactive command in your server function. I have simplified and corrected your function below:
server <- function(input, output) {
    output$main_plot <- renderPlot({
        nameFinal <- paste0(input$userName)
        query = mongo.bson.buffer.create()
        mongo.bson.buffer.append(query, "user", nameFinal)
        query = mongo.bson.from.buffer(query)
        queryresults <- mongo.find.all(mongo=mongo, ns = "simpledb.main", query=query)
        if (length(queryresults) > 0) {
            resultsdf <- data.frame(matrix(unlist(queryresults), nrow=length(queryresults), byrow=T), stringsAsFactors=FALSE)
            boxplot(as.numeric(resultsdf$X3))
        }
        else boxplot(c(0))
    })
}