The out-of-the-box join capability in Spark Streaming does not cover many real-life use cases, because it only joins the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams, enrich each object in stream1 with its corresponding object from stream2 in Spark, and save the result to HBase.
The implementation would:
maintain a dataset in memory of objects from stream2, adding or replacing objects as and when they are received
for every element in stream1, access the cache to find a matching object from stream2; save to HBase if a match is found, or put it back on the Kafka stream if not.
This question explores Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs -- not just the ones in that micro-batch. Basically you keep a "running total" RDD that you keep filling, something like this:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
  globalRDD2 = globalRDD2.union(rdd)
  globalRDD1.join(globalRDD2).foreach(...) // etc, etc
})
A good start would be to look into mapWithState. This is a more efficient replacement for updateStateByKey. These are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
  value.foreach(state.update(_))
  (key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
stream2State is now a stream where each batch contains the (K, V) pairs with the latest value seen for each key. You can join this stream with stream1 to perform the further logic for your second point.
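For the second point, a minimal sketch could look like the following (assuming stream1 is also keyed by K with values of some type W, and that saveToHBase and sendBackToKafka are your own hypothetical helpers, not Spark API):

def stream1: DStream[(K, W)] = ???

// Left outer join against the state-backed stream: Some(v) means a match was found.
stream1.leftOuterJoin(stream2State).foreachRDD { rdd =>
  rdd.foreach {
    case (key, (left, Some(right))) => saveToHBase(key, left, right)   // enrich and persist
    case (key, (left, None))        => sendBackToKafka(key, left)      // no match yet, re-queue
  }
}

In practice you would typically use foreachPartition instead of foreach so that each partition can reuse a single HBase/Kafka connection.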
Related
I have a Kafka producer that reads data from two large files and sends them in the JSON format with the same structure:
def create_sample_json(row_id, data_file): return {'row_id':int(row_id), 'row_data': data_file}
The producer breaks every file into small chunks, creates JSON from each chunk, and finally sends the chunks in a for-loop.
The process of sending those two files happens simultaneously through multithreading.
I want to join those streams (s1.row_id == s2.row_id), and eventually do some stream processing, while my producer is still sending data to Kafka. Because the producer generates a huge amount of data from multiple sources, I can't wait to consume it all; the join must happen simultaneously.
I am not sure if the Table API is a good approach, but this is my PyFlink code so far:
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings
from pyflink.table.expressions import col
from pyflink.table.table_environment import StreamTableEnvironment
KAFKA_SERVERS = 'localhost:9092'
def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.add_jars("file:///flink_jar/kafka-clients-3.3.2.jar")
    env.add_jars("file:///flink_jar/flink-connector-kafka-1.16.1.jar")
    env.add_jars("file:///flink_jar/flink-sql-connector-kafka-1.16.1.jar")

    settings = EnvironmentSettings.new_instance() \
        .in_streaming_mode() \
        .build()

    t_env = StreamTableEnvironment.create(stream_execution_environment=env, environment_settings=settings)

    t1 = f"""
    CREATE TEMPORARY TABLE table1(
        row_id INT,
        row_data STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'datatopic',
        'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
        'properties.group.id' = 'MY_GRP',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
    """

    t2 = f"""
    CREATE TEMPORARY TABLE table2(
        row_id INT,
        row_data STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'datatopic',
        'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
        'properties.group.id' = 'MY_GRP',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
    """

    p1 = t_env.execute_sql(t1)
    p2 = t_env.execute_sql(t2)
Please tell me what I should do next.
Questions:
1) Do I need to consume the data in my consumer class separately and then insert it into those tables, or will the data be consumed by what we implemented here (since I passed the connector name, topic, bootstrap servers, etc.)?
2) If so:
2.1) How can I join those streams in Python?
2.2) How can I avoid reprocessing previous data, given that my producer will send thousands of messages? I want to make sure not to run duplicate queries.
3) If not, what should I do?
Thank you very much.
1) Do I need to consume the data in my consumer class separately and then insert it into those tables, or will the data be consumed by what we implemented here (since I passed the connector name, topic, bootstrap servers, etc.)?
The latter: data will be consumed by the 'kafka' table connector that we defined. You also need to define a sink table as the target of your insert; the sink table could be another Kafka connector table with the topic you want to output to.
2.1) How can I join those streams in Python?
You can write SQL that joins table1 and table2 and then inserts the result into your sink table, all from Python (see the sketch below).
2.2) How can I avoid reprocessing previous data, given that my producer will send thousands of messages? I want to make sure not to run duplicate queries.
You can filter these messages before the join or before the insert; a WHERE clause is enough in your case.
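A minimal sketch of 2.1 and 2.2 combined, continuing from the t_env above (the sink topic name 'joined_topic', the output column names, and the WHERE condition are illustrative assumptions, not part of the original question):

# Hypothetical sink table; the topic name 'joined_topic' is an assumption.
t_sink = f"""
CREATE TEMPORARY TABLE sink_table(
    row_id INT,
    row_data_1 STRING,
    row_data_2 STRING
) WITH (
    'connector' = 'kafka',
    'topic' = 'joined_topic',
    'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
    'format' = 'json'
)
"""
t_env.execute_sql(t_sink)

# Join the two source tables on row_id and write the result to the sink.
# The WHERE clause is only a placeholder for whatever filter you need (question 2.2).
t_env.execute_sql("""
    INSERT INTO sink_table
    SELECT t1.row_id, t1.row_data, t2.row_data
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.row_id = t2.row_id
    WHERE t1.row_id > 0
""").wait()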
I'm modding a game, and I'd like to optimize a frequently called function if possible. The function looks into a dictionary table (consisting of an estimated 10-100 entries). I'm considering two patterns: a) direct reference and b) lookup with ipairs:
PATTERN A
tableA = { ["moduleName.propertyName"] = { some stuff } } -- the key is a string with dot inside, hence the quotation marks
result = tableA["moduleName.propertyName"]
PATTERN B
function lookup(type)
    local result
    for i, obj in ipairs(tableB) do
        if obj.type == type then
            result = obj
            break
        end
    end
    return result
end
***
tableB = {
    [1] = {
        type = "moduleName.propertyName",
        ... some stuff ...
    }
}
result = lookup("moduleName.propertyName")
Which pattern should be faster on average? I'd expect the 'native' referencing to be faster (it is certainly much neater), but maybe this is a silly assumption? I'm able to sort tableB (to some extent) in order of lookup frequency, whereas (as I understand it) tableA will, in Lua, have an arbitrary internal order by default, even if I declare the keys in the proper order.
A lookup table will always be faster than searching a table every time.
For 100 elements that's one indexing operation compared to up to 100 loop cycles, iterator calls, conditional statements...
It is questionable, though, whether you would experience a difference in your application with so few elements.
So if you build that data structure for this purpose only, go with a look-up table right away.
If you already have this data structure for other purposes and you just want to look something up once, traverse the table with a loop.
If you already have this structure and you need to look up values more than once, build a look-up table for that purpose.
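For that last case, a minimal sketch of building the look-up index once from the existing tableB (names as in the question):

-- Build an index once, keyed by each entry's 'type' field.
local indexByType = {}
for _, obj in ipairs(tableB) do
    indexByType[obj.type] = obj
end

-- Every subsequent lookup is then a single hash access instead of a loop.
local result = indexByType["moduleName.propertyName"]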
I've run into significant performance degradation of a stored procedure when using HANA Graph script.
My task is the following: I'm doing a BFS traversal on a graph using the standard BFS feature of HANA SP03. My graph is pretty dense, and the result can easily reach several thousand rows.
CREATE PROCEDURE "MY_PROC" (IN word VARCHAR(100), IN category VARCHAR(100), OUT res "RESULT" DEFAULT EMPTY)
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
    Graph g = Graph("SCHEMA1","MYGRAPH");
    Multiset<Edge> filteredEdges = Multiset<Edge>(:g);
    TRAVERSE BFS :g FROM Vertex(:g, :word)
        ON VISIT EDGE (Edge e) {
            Vertex sourceV = SOURCE(:e);
            IF (:sourceV."WORD" != :word) {
                filteredEdges = :filteredEdges UNION {:e};
            }
        };
    --copy all results into output object
    res = SELECT :e."TARGET", :e."CATEGORY_ID" FOREACH e IN :filteredEdges;
END;
I'm returning a TABLE type and using the statement above, pretty much the simplest thing possible as per the tutorial.
It takes up to 10 seconds in my environment to prepare that result, which is obviously not acceptable. I've tested the running time of the other parts combined, and it's in the tens of milliseconds. When the result collection has only several hundred records, the running time becomes moderate: 100-200 milliseconds.
Is there another, faster way of returning thousands of rows from the graph script? I have a lot of liberty in my implementation, so I'll consider any approach that works. What I need in the OUT parameter is a collection of some attributes of the vertices and edges.
Thanks in advance
I think I got the answer, thanks to the SAP HANA team.
There are several key ideas:
narrow the initial graph down to the smallest possible sub-graph using Subgraph
use a custom BFS via NEIGHBORS instead of the standard TRAVERSE BFS; just set the limit big enough
use UNION ALL instead of UNION if the logic allows - it's faster
So the initial procedure transforms into something like this; the running time dropped to tens of milliseconds:
CREATE PROCEDURE "MY_PROC" (IN word VARCHAR(100), IN category VARCHAR(100), IN is_direct_category BOOLEAN, OUT res "TARGET_CATEGORY_RESULT" DEFAULT EMPTY)
LANGUAGE GRAPH READS SQL DATA AS
BEGIN
    Graph g_all = Graph("SCHEMA1","MYGRAPH");
    Vertex startV = Vertex(:g_all, :word);
    Multiset<Vertex> m_reachable = NEIGHBORS(:g_all, :startV, 0, 100);
    Graph g = Subgraph(:g_all, :m_reachable);
    if (:is_direct_category == TRUE) {
        Multiset<Edge> properEdges = e in Edges(:g) where :e."CATEGORY_ID" == :category;
        Graph res_g = Subgraph(:g, :properEdges);
        Multiset<Edge> e_res = Edges(:res_g);
        res = SELECT :hypoEdge."TARGET", :hypoEdge."CATEGORY_ID" FOREACH hypoEdge IN :e_res;
    } else {
        Multiset<Edge> e_res = Edges(:g);
        res = SELECT :hypoEdge."TARGET", :hypoEdge."CATEGORY_ID" FOREACH hypoEdge IN :e_res;
    }
END;
Now, I have the below code:
PCollection<String> input_data =
    pipeline
        .apply(PubsubIO
            .Read
            .withCoder(StringUtf8Coder.of())
            .named("ReadFromPubSub")
            .subscription("/subscriptions/project_name/subscription_name"));
Looks like you want to read some messages from Pub/Sub, convert each of them to multiple parts by splitting the message on space characters, and then feed the parts to the rest of your pipeline. No special configuration of PubsubIO is needed, because this is not a "reading data" problem - it's a "transforming data you have already read" problem - you simply need to insert a ParDo which takes your "composite" record and breaks it down in the way you want, e.g.:
PCollection<String> input_data =
    pipeline
        .apply(PubsubIO
            .Read
            .withCoder(StringUtf8Coder.of())
            .named("ReadFromPubSub")
            .subscription("/subscriptions/project_name/subscription_name"))
        .apply(ParDo.of(new DoFn<String, String>() {
          public void processElement(ProcessContext c) {
            String composite = c.element();
            for (String part : composite.split(" ")) {
              c.output(part);
            }
          }
        }));
I take it you mean that the data you want is present in different elements of the PCollection and that you want to extract and group it somehow.
A possible approach is to write a DoFn that processes each String in the PCollection and outputs a key-value pair for each piece of data you want to group. You can then use the GroupByKey transform to group all the relevant data together.
For example you have the following messages from pubsub in your PCollection:
User 1234 bought item A
User 1234 bought item B
The DoFn will output a key-value pair with the user id as the key and the item bought as the value: (<1234, A>, <1234, B>).
Using the GroupByKey transform, you group the two values together into one element. You can then perform further processing on that element.
This is a very common pattern in big data, called MapReduce.
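A minimal sketch of that pattern, assuming the messages really have the fixed "User <id> bought item <X>" shape from the example (the word-position parsing and the variable names are purely illustrative):

PCollection<KV<String, Iterable<String>>> itemsPerUser = input_data
    .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
      public void processElement(ProcessContext c) {
        String[] words = c.element().split(" ");   // ["User", "1234", "bought", "item", "A"]
        c.output(KV.of(words[1], words[4]));       // key = user id, value = item
      }
    }))
    .apply(GroupByKey.<String, String>create());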
You can output an Iterable<A> and then use Flatten to squash it. Unsurprisingly, this is termed flatMap in many next-gen data processing platforms, cf. Spark / Flink.
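A rough sketch of that idea on the same pipeline (again splitting on spaces, purely as an illustration):

PCollection<String> parts = input_data
    .apply(ParDo.of(new DoFn<String, Iterable<String>>() {
      public void processElement(ProcessContext c) {
        // Emit the whole list of parts as one Iterable per input message.
        c.output(Arrays.asList(c.element().split(" ")));
      }
    }))
    .setCoder(IterableCoder.of(StringUtf8Coder.of()))
    .apply(Flatten.<String>iterables());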
I'm using Riak to store JSON documents right now, and I want to sort them based on some attribute. Let's say there's a key, i.e.:
{
"someAttribute": "whatever",
"order": 1
}
So I want to sort the documents based on "order".
I am currently retrieving the documents in Riak with the Erlang interface. I can retrieve the document back as a string, but I don't really know what to do after that. I'm thinking the map function just returns the JSON document itself, and in the reduce function I'd check whether the item I'm looking at has a higher "order" than the head of the rest of the list, and if so append it to the beginning, and then return a lists:reverse.
Despite my ideas above, I've had zero results after almost an entire day; I'm so confused by the Erlang interface in Riak. Can someone provide insight on how to write this map/reduce function, or just how to parse the JSON document?
As far as I know, you do not have access to the input list in the map phase. From the map you emit each document as a one-element list.
Inputs (all the docs to handle, as {Bucket, Key}) -> Map (handles a single doc) -> Reduce (handles the whole list emitted from the maps).
Maps are executed per document on many nodes, whereas the reduce is done once on the so-called coordinator node (the one where the query was called).
Solution:
Define the inputs (as a list or a bucket)
Retrieve the value in the map and emit the whole doc or {Id, Val_to_sort_by}
Sort in the reduce (using a regular lists:keysort)
This is not a map/reduce solution, but you should check out Riak Search.
so i "solved" the problem using javascript, still can't do it using erlang.
here is my query
{"inputs":"test",
"query":[{"map":{"language":"javascript",
"source":"function(value, keyData, arg){ var data = Riak.mapValuesJson(value)[0]; var obj = {}; obj[data.order] = data; return [ obj ];}"}},
{"reduce":{"language":"javascript",
"source":"function(values, arg){ return [ values.reduce(function(acc, item){ for(var order in item){ acc[order] = item[order]; } return acc; }) ];}",
"keep":true}}
]
}
So in the map phase, all I do is create a new object, obj, with the order as the key and the data itself as the value. Visually, obj looks like this:
{"1":{"firstName":"John","order":1}}
In the reduce phase, I'm just putting everything into the accumulator, so basically that's the sort if you think about it, because when you're done, everything will be put in order for you. I put 2 JSON documents in for testing: one is above, the other is just firstName: Billie, order: 2. Here is my result for the query above:
[{"1":{"firstName":"John","order":1},"2":{"firstName":"Billie","order":2}}]
So it works! But I still need to do this in Erlang. Any insights?