I'm trying to observe how Spark Streaming uses the RDDs inside a DStream to join two DStreams, but I'm seeing strange results that confuse me.
In my code, I collect data from a socket stream and split it into two paired DStreams by some logic. To have several batches available for the join, I created a window that covers the last three batches. However, the join results make no sense to me. Please help me understand.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.util.Random

object Join extends App {
  val conf = new SparkConf().setMaster("local[4]").setAppName("KBN Streaming")
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(BATCH_INTERVAL_SEC))
  val lines = ssc.socketTextStream("localhost", 8091)
  //println(s"lines.slideDuration : ${lines.slideDuration}")
  val ds = lines.map(x => x)

  val randNums = List(1, 2, 3, 4, 5, 6)
  // Inputs of length <= 2 go to one stream, longer inputs to the other.
  val less = ds.filter(x => x.length <= 2)
  // Every element gets a random key in 0..5.
  val lessPairs = less.map(x => (Random.nextInt(randNums.size), x))
  val greater = ds.filter(x => x.length > 2)
  val greaterPairs = greater.map(x => (Random.nextInt(randNums.size), x))
  val join = lessPairs.join(greaterPairs).window(Seconds(30), Seconds(30))
}
Test Results:
-------------------------------------------
Time: 1473344240000 ms
-------------------------------------------
(1,b)
(4,s)

-------------------------------------------
Time: 1473344240000 ms
-------------------------------------------
(5,333)

-------------------------------------------
Time: 1473344250000 ms
-------------------------------------------
(2,x)

-------------------------------------------
Time: 1473344250000 ms
-------------------------------------------
(4,the)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(2,a)
(0,b)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(2,ten)
(1,one)
(3,two)

-------------------------------------------
Time: 1473344260000 ms
-------------------------------------------
(4,(b,two))
When join is called, the two RDDs are recomputed, and because the keys come from Random.nextInt, the recomputed RDDs contain different values from the ones shown when they were printed. So we need to cache both RDDs when they are computed for the first time; the same values are then reused when join is called later, instead of both RDDs being recomputed. I tried this on multiple examples and it works fine. I was missing a basic core concept of Spark.
Excerpt from "Learning Spark" book:
Persistence (Caching)
As discussed earlier, Spark RDDs are lazily evaluated, and sometimes we may wish to use the same RDD multiple times. If we do this naively, Spark will recompute the RDD and all of its dependencies each time we call an action on the RDD.
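To make the recomputation concrete, here is a rough sketch in plain Python rather than Spark. LazyDataset and random_key are made-up names for illustration only; the class just mimics the idea that a lazy dataset stores a recipe and reruns it on every action unless it has been cached.

```python
import random


class LazyDataset:
    """Toy stand-in for an RDD: it stores a recipe, not the data."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function producing a list
        self._cached = None       # filled in after cache() and the first collect()
        self._use_cache = False
        self.computations = 0     # how many times the recipe actually ran

    def map(self, fn):
        # Build a new lazy dataset on top of this one; nothing runs yet.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        # "Action": run the recipe unless a cached result is available.
        if self._use_cache and self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result


def random_key(x):
    # Same idea as Random.nextInt(6) in the question: a key in 0..5.
    return (random.randint(0, 5), x)


source = LazyDataset(lambda: ["a", "b", "x"])

uncached = source.map(random_key)
first, second = uncached.collect(), uncached.collect()   # keys may differ

cached = source.map(random_key).cache()
third, fourth = cached.collect(), cached.collect()       # same keys both times

print(uncached.computations, cached.computations)  # prints "2 1"
```

Without the cache, the "print" and the "join" each trigger their own run of random_key, which is why the printed keys and the joined keys in the question do not line up.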
I have a simple graph that reads from a pubsub message (currently just a single string key), creates a very short window, generates 3 integers that use this key via a beam.ParDo, and a simple Map that creates a single "config" with this as a key.
Ultimately, there are 2 PCollections:
items: [('key', 0), ('key', 1), ...]
infos: [('key', 'the value is key')]
I want a final beam.Map over items that uses infos as a dictionary side input so I can look up the value in the dictionary.
Using the LocalRunner, the final print works with the side input.
On Dataflow the first two steps print, but the final Map with the side input is never called, presumably because the side input somehow ends up in an unbounded window (despite the earlier window function).
I am using runner_v2, dataflow prime, and streaming engine.
p = beam.Pipeline(options=pipeline_options)
pubsub_message = (
p | beam.io.gcp.pubsub.ReadFromPubSub(
subscription='projects/myproject/testsubscription') |
'SourceWindow' >> beam.WindowInto(
def _create_items(pubsub_key: bytes) -> Iterable[tuple[str, int]]:
    for i in range(3):
        yield pubsub_key.decode(), i

def _create_info(pubsub_key: bytes) -> tuple[str, str]:
    return pubsub_key.decode(), f'the value is {pubsub_key.decode()}'

items = pubsub_message | 'CreateItems' >> beam.ParDo(_create_items) | beam.Reshuffle()
info = pubsub_message | 'CreateInfo' >> beam.Map(_create_info)

def _print_item(keyed_item: tuple[str, int], info_dict: dict[str, str]) -> None:
    key, _ = keyed_item
    log(key + '::' + info_dict[key])

_ = items | 'MapWithSideInput' >> beam.Map(_print_item, info_dict=beam.pvalue.AsDict(info))
Here is the output in local runner:
Creating item 0
Creating item 1
Creating item 2
Creating info b'key'
key::the value is key
key::the value is key
key::the value is key
Here is the DataFlow graph:
I've tried various windowing functions over the AsDict, but I can never get it to be exactly the same window as my input.
Thoughts on what I might be doing wrong here?
We have bounded data, around 3.5 million records in BigQuery. This data needs to be processed using Dataflow (mostly external API calls plus transformations).
From the documentation - https://cloud.google.com/dataflow/docs/resources/faq#beam-java-sdk
I see that batch mode uses a single thread per worker and streaming uses 300 threads per worker. For us, most of the work is network bound because of the external API calls.
Considering this, which one would be more performant and cost efficient? Batch with x workers, or streaming with x workers and 300 threads each?
If it is streaming, should I send the data that is in BigQuery to Pub/Sub? Is my understanding correct?
The batch vs. streaming decision usually comes from the source that you are reading from (bounded vs. unbounded). When you read with BigQueryIO, the source is bounded.
There are ways to convert a BoundedSource into an UnboundedSource (see "Using custom DataFlow unbounded source on DirectPipelineRunner"), but I don't see it recommended anywhere, and I am not sure you would get any benefit from it. Streaming has to keep track of checkpoints and watermarks, which could add overhead for your workers.
Here is an example of a DoFn that processes multiple items concurrently:
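import queue
import threading

import apache_beam as beam
from apache_beam.utils.windowed_value import WindowedValue

class MultiThreadedDoFn(beam.DoFn):
    def __init__(self, func, num_threads=10):
        self.func = func
        self.num_threads = num_threads

    def setup(self):
        self.done = False
        self.input_queue = queue.Queue(2)
        self.output_queue = queue.Queue()
        self.threads = [
            threading.Thread(target=self.work, daemon=True)
            for _ in range(self.num_threads)]
        for t in self.threads:
            t.start()

    def work(self):
        # Worker loop: take an element, apply func, queue the result.
        while not self.done:
            try:
                windowed_value = self.input_queue.get(timeout=0.1)
                self.output_queue.put(
                    windowed_value.with_value(self.func(windowed_value.value)))
            except queue.Empty:
                pass  # check self.done

    def start_bundle(self):
        self.pending = 0

    def process(self, element,
                timestamp=beam.DoFn.TimestampParam,
                window=beam.DoFn.WindowParam):
        self.pending += 1
        self.input_queue.put(WindowedValue(
            element, timestamp, (window,)))
        # Emit whatever results are already available, without blocking.
        try:
            while not self.output_queue.empty():
                yield self.output_queue.get(block=False)
                self.pending -= 1
        except queue.Empty:
            pass

    def finish_bundle(self):
        # Drain the results still in flight for this bundle.
        while self.pending > 0:
            yield self.output_queue.get()
            self.pending -= 1

    def teardown(self):
        self.done = True
        for t in self.threads:
            t.join()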
It can be used as
import logging
import time

def func(n):
    time.sleep(n / 10)
    return n + 1

with beam.Pipeline() as p:
    p | beam.Create([1, 3, 5, 7] * 10 + [9]) | beam.ParDo(MultiThreadedDoFn(func)) | beam.Map(logging.error)
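If managing threads and queues inside a DoFn feels heavy, a plainer variant of the same idea is to hand a whole batch of elements to a thread pool. This is only a sketch under assumptions: process_batch and slow_increment are names I made up, and it presumes the elements have already been grouped into lists (for example with beam.BatchElements) before being passed to it from something like beam.FlatMap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch, func, num_threads=10):
    """Apply func to every element of batch using a pool of threads.

    Results come back in input order because Executor.map preserves it.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(func, batch))


def slow_increment(n):
    time.sleep(0.05)  # stands in for a network-bound API call
    return n + 1


results = process_batch([1, 3, 5, 7, 9], slow_increment)
print(results)  # [2, 4, 6, 8, 10]
```

The trade-off is that a batch only finishes when its slowest call returns, whereas the DoFn above keeps emitting results as they become available.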
Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the Dataframe API doesn't seem to support a PTransform on a collection of filenames.
Detailed Background
The distribution of files is like this:
StateData - 51 total files (US states + DC)
CountyData - ~2000 total files (county specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
County Level Data - thousands of these (~300 cols wide)
Roland Heights
Roland Heights
Roland Heights
So we join many county-level to a single state level, and thus have an aggregated, more-complete state-level data set.
Then we aggregate all the states, and we have a national level data set.
Desired Outcome
I can successfully merge one state at a time (many county to one state). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"

def key_by_uid(elem):
    return (elem.uid, elem)

with beam.Pipeline() as p:
    county_df = p | read_csv(COUNTY)
    county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)
    state_df = pd.read_csv(STATE, compression="zip")
    state_rows_keys = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)
    merged = ({ "state": state_rows_keys, "county": county_rows_keyed } ) | beam.CoGroupByKey() | beam.Map(merge_logic)
    merged | WriteToParquet()
Starting with a list of states
By state, generate filepatterns to the source data
By state, load and merge the filenames
Flatten the output from each state into a US data set.
Write to Parquet file.
with beam.Pipeline(options=pipeline_options) as p:
    merged_data = (
        p
        | beam.Create(cx.STATES)
        | "PathsKeyedByState" >> tx.PathsKeyedByState()
        # ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv', 'state-data': 'gs://data/state-data/AL.zip'})
        | "MergeSourceDataByState" >> tx.MergeSourceDataByState()
        | "MergeAllStateData" >> beam.Flatten()
    )
    merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
I have two filepatterns in a dictionary, per state. To access those I need to use a DoFn to get at the element.
To communicate the way the data flows, I need access to Pipeline, which is a PTransform. Ex: df = p | read_csv(...)
These appear to be incompatible needs.
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
    df = p | read_csv('/path/to/state')
    df['state'] = state
    if state_dataframe is None:
        state_dataframe = df
    else:
        state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes the county data in as a filename as an input element (i.e. you'd have a PCollection of county data filenames), opens it up using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.
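As a rough sketch of the per-file merge step described above, in plain Python so it runs without Beam: merge_county_file, the "owner" and "city" columns, and the sample rows are all hypothetical, and the state lookup dict plays the role of the side input. In a pipeline, a function like this could be applied with beam.FlatMap and the state data passed as beam.pvalue.AsDict(...), but that wiring is an assumption about your setup.

```python
import csv
import io


def merge_county_file(county_csv_text, state_by_uid):
    """Merge the rows of one county file with the matching state rows.

    county_csv_text: contents of one county CSV (a real DoFn would open
        the filename it receives as its element instead).
    state_by_uid: dict mapping uid -> state row, standing in for the
        side input.
    """
    for county_row in csv.DictReader(io.StringIO(county_csv_text)):
        state_row = state_by_uid.get(county_row["uid"], {})
        # County columns win on name clashes; that choice is an assumption.
        yield {**state_row, **county_row}


# Hypothetical sample data, keyed by uid as in key_by_uid from the question.
state_by_uid = {"1": {"uid": "1", "owner": "Smith"}}
county_csv = "uid,city\n1,Roland Heights\n2,Roland Heights\n"

merged = list(merge_county_file(county_csv, state_by_uid))
print(merged)
```

Because each county file is handled independently, the runner is free to spread the thousands of files across workers; only the state data has to be small enough to be held as a side input.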
I find myself using rmarkdown/rnotebooks quite a bit for exploratory analysis, since I can combine code, prose and graphs. Often I'll write my entire predictive modeling approach, and the model itself, within markdown.
However, then I end up with forecast models embedded within rmarkdown, unlinked to a target within my drake_plan. Today, I save these to disk first, then read them back in to my plan using file_in or other similar approach.
My question is - can I have a markdown document return an object directly to a drake target?
plan = drake_plan(
  dat = read_data(),
  model = analyze_data(dat)
)
analyze_data = function(dat){
  result = render(....)
}
This way - I can get my model directly into my drake target, but if I need to investigate the model, I can open up my markdown/HTML.
I recommend you include those models as targets in the plan, but what you describe is possible. R Markdown and knitr automatically run code chunks in the calling environment, so the variable assignments you make in the report are available.
library(drake)
library(tibble)

simulate <- function(n){
  tibble(x = rnorm(n), y = rnorm(n))
}

render_and_return <- function(input, output) {
  rmarkdown::render(input, output_file = output, quiet = TRUE)
  return_value # Assigned in the report.
}

lines <- c(
  "---",
  "output: html_document",
  "---",
  "",
  "```{r show_data}",
  "return_value <- head(readd(large))", # return_value gets assigned here.
  "```"
)
writeLines(lines, "report.Rmd")

plan <- drake_plan(
  large = simulate(1000),
  subset = render_and_return(knitr_in("report.Rmd"), file_out("report.html"))
)

make(plan)
#> target large
#> target subset

readd(subset)
#> # A tibble: 6 x 2
#>        x       y
#>    <dbl>   <dbl>
#> 1  1.30  -0.912
#> 2 -0.327  0.0622
#> 3  1.29   1.18
#> 4 -1.52   1.06
#> 5 -1.18   0.0295
#> 6 -0.985 -0.0475
Created on 2019-10-10 by the reprex package (v0.3.0)
I've been using F# for nearly six months and have been so sure that F# Interactive should have the same performance as compiled, that when I bothered to benchmark it, I was convinced it was some kind of compiler bug. Though now it occurs to me that I should have checked here first before opening an issue.
For me it is roughly 3x slower and the optimization switch does not seem to be doing anything at all.
Is this supposed to be standard behavior? If so, I really got trolled by the #time directive. I have the timings for how long it takes to sum 100M elements on this Reddit thread.
Thanks to FuleSnabel, I uncovered some things.
I tried running the example script from both fsianycpu.exe (which is the default F# Interactive) and fsi.exe, and I am getting different timings for the two runs: 134ms for the former and 78ms for the latter. Those two timings also correspond to the timings from the unoptimized and optimized binaries respectively.
What makes the matter even more confusing is that the first project I used to compile the thing is a part of the game library (in script form) I am making and it refuses to compile the optimized binary, instead switching to the unoptimized one without informing me. I had to start a fresh project to get it to compile properly. It is a wonder the other test compiled properly.
So basically, something funky is going on here and I should look into switching fsianycpu.exe to fsi.exe as the default interpreter.
I tried the example code from the pastebin, but I don't see the behavior you describe. These are the results from my performance runs:
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2836 ms
reduce array, result 450015000, time 594 ms
for loop array, result 450015000, time 180 ms
reduce list, result 450015000, time 593 ms
fsi -O --exec .\Interactive.fsx
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2617 ms
reduce array, result 450015000, time 589 ms
for loop array, result 450015000, time 168 ms
reduce list, result 450015000, time 603 ms
It's expected that Seq.reduce would be the slowest, the for loop the fastest and that the reduce on list/array is roughly similar (this assumes locality of list elements which isn't guaranteed).
I rewrote your code to allow for longer runs without running out of memory and to improve the cache locality of the data. With short runs, the uncertainty of the measurements makes it hard to compare the data.
module fs

let stopWatch =
  let sw = new System.Diagnostics.Stopwatch()
  sw.Start ()
  sw

let total = 300000000
let outer = 10000
let inner = total / outer

// Runs the action `outer` times and prints the result of the first run.
let timeIt (name : string) (a : unit -> 'T) : unit =
  let t = stopWatch.ElapsedMilliseconds
  let v = a ()
  for i = 2 to outer do
    a () |> ignore
  let d = stopWatch.ElapsedMilliseconds - t
  printfn "%s, result %A, time %d ms" name v d

let sumTest(args) =
  let numsList = [1..inner]
  let numsArray = [|1..inner|]

  printfn "Total iterations: %d, Outer: %d, Inner: %d" total outer inner

  let sumsSeqReduce () = Seq.reduce (+) numsList
  timeIt "reduce sequence of list" sumsSeqReduce

  let sumsArray () = Array.reduce (+) numsArray
  timeIt "reduce array" sumsArray

  let sumsLoop () =
    let mutable total = 0
    for i in 0 .. inner - 1 do
      total <- total + numsArray.[i]
    total
  timeIt "for loop array" sumsLoop

  let sumsListReduce () = List.reduce (+) numsList
  timeIt "reduce list" sumsListReduce
#load "Program.fs"
fs.sumTest [||]
PS. I am running on Windows with Visual Studio 2015. 32bit or 64bit seemed to make only marginal difference