Table API to join two Kafka streams simultaneously - join

I have a Kafka producer that reads data from two large files and sends them in the JSON format with the same structure:
def create_sample_json(row_id, data_file): return {'row_id':int(row_id), 'row_data': data_file}
The producer breaks every file into small chunks, builds a JSON message from each chunk, and sends the messages in a for-loop.
The process of sending those two files happens simultaneously through multithreading.
I want to join those streams (s1.row_id == s2.row_id) and eventually do some stream processing while my producer is still sending data to Kafka. Because the producer generates a huge amount of data from multiple sources, I can't wait until everything has been consumed; the join must happen while the data is arriving.
I am not sure if Table API is a good approach but this is my pyflink code so far:
from pyflink.datastream.stream_execution_environment import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings
from pyflink.table.expressions import col
from pyflink.table.table_environment import StreamTableEnvironment
KAFKA_SERVERS = 'localhost:9092'
def log_processing():
    env = StreamExecutionEnvironment.get_execution_environment()
    env.add_jars("file:///flink_jar/kafka-clients-3.3.2.jar")
    env.add_jars("file:///flink_jar/flink-connector-kafka-1.16.1.jar")
    env.add_jars("file:///flink_jar/flink-sql-connector-kafka-1.16.1.jar")
    settings = EnvironmentSettings.new_instance() \
        .in_streaming_mode() \
        .build()
    t_env = StreamTableEnvironment.create(stream_execution_environment=env, environment_settings=settings)
    t1 = f"""
        CREATE TEMPORARY TABLE table1(
            row_id INT,
            row_data STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'datatopic',
            'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
            'properties.group.id' = 'MY_GRP',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """
    t2 = f"""
        CREATE TEMPORARY TABLE table2(
            row_id INT,
            row_data STRING
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'datatopic',
            'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
            'properties.group.id' = 'MY_GRP',
            'scan.startup.mode' = 'latest-offset',
            'format' = 'json'
        )
    """
    p1 = t_env.execute_sql(t1)
    p2 = t_env.execute_sql(t2)
# Please tell me what I should do next.
# Questions:
# 1) Do I need to consume the data separately in my consumer class and then insert it into those tables, or will the data be consumed by what we implemented here (since I passed the connector name, topic, bootstrap.servers, etc.)?
# 2) If so:
#    2.1) How can I join those streams in Python?
#    2.2) How can I avoid reprocessing previous data, given that my producer will send thousands of messages? I want to make sure I don't run duplicate queries.
# 3) If not, what should I do?
Thank you very much.

1) Do I need to consume the data separately in my consumer class and then insert it into those tables, or will the data be consumed by what we implemented here (since I passed the connector name, topic, bootstrap.servers, etc.)?
The latter: the data will be consumed by the 'kafka' table connector defined in your CREATE TABLE statements. You also need to define a sink table as the target of your INSERT; the sink can itself be a Kafka connector table whose topic is where you want the output written.
2.1) How can I join those streams in Python?
You can write a SQL statement that joins table1 and table2 and inserts the result into your sink table, then run it from Python.
2.2) How can I avoid reprocessing previous data, given that my producer will send thousands of messages? I want to make sure I don't run duplicate queries.
You can filter those messages before the join or before the insert; a WHERE clause is enough in your case.
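The steps in this answer could be sketched roughly as follows. This is only a sketch under assumptions: the sink table name joined_sink, the topic joined_topic, and the WHERE condition are placeholders that do not come from the question, and run_join expects the t_env in which table1 and table2 were already created.

```python
KAFKA_SERVERS = 'localhost:9092'

# Hypothetical sink table; the topic name 'joined_topic' is an assumption.
SINK_DDL = f"""
    CREATE TEMPORARY TABLE joined_sink(
        row_id INT,
        data_1 STRING,
        data_2 STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'joined_topic',
        'properties.bootstrap.servers' = '{KAFKA_SERVERS}',
        'format' = 'json'
    )
"""

# Continuous inner join on row_id; the WHERE clause is a placeholder filter.
JOIN_DML = """
    INSERT INTO joined_sink
    SELECT t1.row_id, t1.row_data, t2.row_data
    FROM table1 AS t1
    JOIN table2 AS t2 ON t1.row_id = t2.row_id
    WHERE t1.row_id >= 0
"""


def run_join(t_env):
    """Register the sink table and submit the join as a streaming job."""
    t_env.execute_sql(SINK_DDL)
    # execute_sql on an INSERT statement submits the job;
    # wait() blocks the Python process while the job runs.
    t_env.execute_sql(JOIN_DML).wait()
```

Calling run_join(t_env) at the end of log_processing() would start the job. A regular join like this keeps rows from both inputs in Flink state so that late matches can still be found, so for long-running jobs it is worth looking at the state retention settings.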

BioPython Entrez article limit

I've been using the classic article function which returns the articles for a string
from Bio import Entrez, __version__
print('Biopython version : ', __version__)

def article_machine(t):
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='100000',
                            retmode='xml',
                            term=t)
    return list(Entrez.read(handle)["IdList"])

print(len(article_machine('T cell')))
I've now noticed that there is a limit on the number of articles I receive (and it is not the value I pass in retmax).
I get 9999 PMIDs for keywords that used to return 100k PMIDs (T cell, for example).
I know it's not a bug in the package itself but in NCBI.
Has someone managed to solve it?
from : The E-utilities In-Depth: Parameters, Syntax and More
retmax
Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. If usehistory is set to 'y', the remainder of the retrieved set will be stored on the History server; otherwise these UIDs are lost. Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect that contains additional logic to batch PubMed search results
Unfortunately, my code based on the information above doesn't work:
from Bio import Entrez, __version__
print('Biopython version : ', __version__)

def article_machine(t):
    all_res = []
    Entrez.email = 'email'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            rettype='count',
                            term=t)
    number = int(Entrez.read(handle)['Count'])
    print(number)
    retstart = 0
    while retstart < number:
        retstart += 1000
        print('\n retstart now : ' , retstart ,' out of : ', number)
        Entrez.email = 'email'
        handle = Entrez.esearch(db='pubmed',
                                sort='relevance',
                                rettype='xml',
                                retstart = retstart,
                                retmax = str(retstart),
                                term=t)
        all_res.extend(list(Entrez.read(handle)["IdList"]))
    return all_res

print(article_machine('T cell'))
If I change while retstart < number: to while retstart < 5000:
the code works, but as soon as retstart exceeds 9998, which happens with
the original while loop needed to access all the results, I get the following error:
RuntimeError: Search Backend failed: Exception:
'retstart' cannot be larger than 9998. For PubMed,
ESearch can only retrieve the first 9,999 records
matching the query.
To obtain more than 9,999 PubMed records, consider
using EDirect that contains additional logic
to batch PubMed search results automatically
so that an arbitrary number can be retrieved.
For details see https://www.ncbi.nlm.nih.gov/books/NBK25499/
The error message points to https://www.ncbi.nlm.nih.gov/books/NBK25499/, but the link it should give is https://www.ncbi.nlm.nih.gov/books/NBK179288/?report=reader
Have a look at the NCBI APIs to see whether any of them can solve your problem through a Python interface (I am not an expert on this, sorry): https://www.ncbi.nlm.nih.gov/pmc/tools/developers/
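One workaround that stays in Biopython, and which is not the EDirect route the error message recommends, is to split the query into publication-date windows that each stay under the limit. The sketch below is only an illustration under assumptions: split_date_range and pubmed_count are hypothetical helper names, and it relies on the datetype, mindate and maxdate parameters of ESearch. The splitting logic is independent of NCBI, so it is demonstrated with a stand-in counting function.

```python
from datetime import date, timedelta

PUBMED_LIMIT = 9999  # limit quoted in the error message above


def split_date_range(start, end, count_fn, limit=PUBMED_LIMIT):
    """Yield (start, end) date windows whose hit count is at most `limit`.

    A window covering a single day is yielded even if it is still too
    large, because it cannot be split any further by date.
    """
    if start >= end or count_fn(start, end) <= limit:
        yield (start, end)
        return
    mid = start + (end - start) // 2
    yield from split_date_range(start, mid, count_fn, limit)
    yield from split_date_range(mid + timedelta(days=1), end, count_fn, limit)


def pubmed_count(term):
    """Build a count_fn that asks ESearch how many records a window holds."""
    def count_fn(start, end):
        from Bio import Entrez  # imported lazily; Entrez.email must be set
        handle = Entrez.esearch(db='pubmed', term=term, retmax=0,
                                datetype='pdat',
                                mindate=start.strftime('%Y/%m/%d'),
                                maxdate=end.strftime('%Y/%m/%d'))
        return int(Entrez.read(handle)['Count'])
    return count_fn


# Stand-in counter for demonstration: pretend there is one hit per day.
def fake_count(start, end):
    return (end - start).days + 1


windows = list(split_date_range(date(2020, 1, 1), date(2020, 1, 4),
                                fake_count, limit=2))
print(windows)
```

With a real query, each window returned by split_date_range(start, end, pubmed_count('T cell')) could then be fetched with an ordinary esearch call that passes the same mindate and maxdate.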

Google DataFlow: Read from BigQuery, combine three string fields, write key/value fields to Google Cloud Spanner

None of the provided DataFlow templates match what I need to do, so I'm trying to write my own. I managed to run example code such as the word count example without issue, so I tried to cobble together parts of separate examples that read from BigQuery and write to Spanner, but there are so many things in the source code that I don't understand and cannot adapt to my own problem.
I'm REALLY lost on this and any help is greatly appreciated!
The goal is to use DataFlow and Apache Beam SDK to read from a BigQuery table with 3 string fields and 1 integer field, then concatenate the content of the 3 string fields into one string and put that new string in a new field called "key", then I want to write the key field and the integer field (which is unchanged) to a Spanner table that already exists, ideally append rows with a new key and update the integer field of rows with a key that already exists.
I'm trying to do this in Java because there is no i/o connector for Python. Any advice on doing this with Python is much appreciated.
For now I would be super happy if I could just read a table from BigQuery and write whatever I get from that table to a table in Spanner, but I can't even make that happen.
Problems:
I'm using Maven and I don't know what dependencies I need to put in the pom file
I don't know which package and import I need at the beginning of my java file
I don't know if I should use readTableRows() or read(SerializableFunction) to read from BigQuery
I have no idea how to access the string fields in the PCollection to concatenate them or how to make the new PCollection with only the key and integer field
I somehow need to make the PCollection into a Mutation to write to Spanner
I want to use an INSERT UPDATE query to write to the Spanner table, which doesn't seem to be an option in the Spanner i/o connector.
Honestly, I'm too embarrassed to even show that code I'm trying to run.
public class SimpleTransfer {
    public static void main(String[] args) {
        // Create and set your PipelineOptions.
        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        // For Cloud execution, set the Cloud Platform project, staging location, and specify DataflowRunner.
        options.setProject("myproject");
        options.setStagingLocation("gs://mybucket");
        options.setRunner(DataflowRunner.class);
        // Create the Pipeline with the specified options.
        Pipeline p = Pipeline.create(options);
        String tableSpec = "database.mytable";
        // read whole table from bigquery
        rowsFromBigQuery =
            p.apply(
                BigQueryIO.readTableRows()
                    .from(tableSpec);
        // Hopefully some day add a transform
        // Somehow make a Mutation
        PCollection<Mutation> mutation = rowsFromBigQuery;
        // Only way I found to write to Spanner, not even sure if that works.
        SpannerWriteResult result = mutation.apply(
            SpannerIO.write().withInstanceId("myinstance").withDatabaseId("mydatabase").grouped());
        p.run().waitUntilFinish();
    }
}
It's intimidating to deal with these strange data types, but once you get used to the TableRow and Mutation types, you'll be able to code robust pipelines.
The first thing you need to do is take your PCollection of TableRows, and convert those into an intermediate format that is convenient for you. Let's use Beam's KV, which defines a key-value pair. In the following snippet, we're extracting the values from the TableRow, and concatenating the string you want:
rowsFromBigQuery
    .apply(
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                             TypeDescriptors.integers()))
            .via((TableRow tableRow) -> KV.of(
                (String) tableRow.get("myKey1")
                    + (String) tableRow.get("myKey2")
                    + (String) tableRow.get("myKey3"),
                // toString() + valueOf works whether the value arrives as a String or a number
                Integer.valueOf(tableRow.get("myIntegerField").toString()))));
Finally, to write to Spanner, we use Mutation-type objects, which define the kind of mutation that we want to apply to a row in Spanner. We'll do it with another MapElements transform, which takes N inputs, and returns N outputs. We define the insert or update mutations there:
myKvPairsPCollection
    .apply(
        MapElements.into(TypeDescriptor.of(Mutation.class))
            .via((KV<String, Integer> elm) -> Mutation.newInsertOrUpdateBuilder("myTableName")
                .set("key").to(elm.getKey())
                .set("value").to(elm.getValue())
                .build()));
And then you can pass the output of that to SpannerIO.write. The whole pipeline looks something like this:
Pipeline p = Pipeline.create(options);
String tableSpec = "database.mytable";
// read whole table from bigquery
PCollection<TableRow> rowsFromBigQuery =
    p.apply(
        BigQueryIO.readTableRows().from(tableSpec));
// Take in a TableRow, and convert it into a key-value pair
PCollection<Mutation> mutations = rowsFromBigQuery
    // First we make the TableRows into the appropriate key-value
    // pair of string key and integer.
    .apply(
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
                                             TypeDescriptors.integers()))
            .via((TableRow tableRow) -> KV.of(
                (String) tableRow.get("myKey1")
                    + (String) tableRow.get("myKey2")
                    + (String) tableRow.get("myKey3"),
                // toString() + valueOf works whether the value arrives as a String or a number
                Integer.valueOf(tableRow.get("myIntegerField").toString()))))
    // Now we construct the mutations
    .apply(
        MapElements.into(TypeDescriptor.of(Mutation.class))
            .via((KV<String, Integer> elm) -> Mutation.newInsertOrUpdateBuilder("myTableName")
                .set("key").to(elm.getKey())
                .set("value").to(elm.getValue())
                .build()));
// Now we pass the mutations to spanner
SpannerWriteResult result = mutations.apply(
    SpannerIO.write()
        .withInstanceId("myinstance")
        .withDatabaseId("mydatabase"));
p.run().waitUntilFinish();
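Since the question also asks about Python, the key-building step on its own can be sketched in plain Python. This is only an illustration under assumptions: row_to_key_value is a hypothetical helper name, the field names are the placeholders used in the Java answer, and each row is assumed to arrive as a dict. Writing to Spanner from Python is outside this sketch.

```python
def row_to_key_value(row):
    """Concatenate the three string fields into one key and keep the integer."""
    key = row["myKey1"] + row["myKey2"] + row["myKey3"]
    # int() accepts both a numeric value and its string form.
    return key, int(row["myIntegerField"])


sample = {"myKey1": "a", "myKey2": "b", "myKey3": "c", "myIntegerField": "7"}
print(row_to_key_value(sample))  # ('abc', 7)
```

In a Beam Python pipeline, a function of this shape could be applied to each element with beam.Map(row_to_key_value).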

How to aggregate data using apache beam api with multiple keys

I am new to the Google Cloud data platform as well as to the Apache Beam API. I would like to aggregate data based on multiple keys. In my requirement I will get a transaction feed with fields such as customer id, customer name, transaction amount and transaction type. I would like to aggregate the data by customer id and transaction type. Here is an example.
customer id,customer name,transaction amount,transaction type
cust123,ravi,100,D
cust123,ravi,200,D
cust234,Srini,200,C
cust444,shaker,500,D
cust123,ravi,100,C
cust123,ravi,300,C
The output should be
cust123,ravi,300,D
cust123,ravi,400,C
cust234,Srini,200,C
cust444,shaker,500,D
Most of the examples I find on Google are based on a single key, such as grouping by a single key. Can anyone please help me with what my PTransform should look like for this requirement, and how to produce the aggregated data along with the rest of the fields?
Regards,
Ravi.
Here is an easy way. I concatenated all the keys into a single key, did the sum, and then split the key again to arrange the output the way you wanted. Please let me know if you have any questions.
The code does not expect a header in the CSV file. I kept it short to show the main point you are asking about.
import apache_beam as beam
import sys

class Split(beam.DoFn):
    def process(self, element):
        """
        Splits each row on commas and returns a (combined key, amount) tuple
        """
        customer_id, customer_name, transaction_amount, transaction_type = element.split(",")
        return [
            (customer_id + "," + customer_name + "," + transaction_type, float(transaction_amount))
        ]

def format_result(kv):
    # kv is (combined_key, total); rebuild the original column order
    combined_key, total_balance = kv
    customer_id, customer_name, transaction_type = combined_key.split(",")
    return '%s,%s,%s,%s' % (customer_id, customer_name, total_balance, transaction_type)

if __name__ == '__main__':
    p = beam.Pipeline(argv=sys.argv)
    input_path = 'aggregate.csv'
    output_prefix = 'C:\\pythonVirtual\\Mycodes\\output'
    (p
     | 'ReadFile' >> beam.io.ReadFromText(input_path)
     | 'parse' >> beam.ParDo(Split())
     | 'sum' >> beam.CombinePerKey(sum)
     | 'convertToString' >> beam.Map(format_result)
     | 'write' >> beam.io.WriteToText(output_prefix)
    )
    p.run().wait_until_finish()
it will produce output as below:
cust234,Srini,200.0,C
cust444,shaker,500.0,D
cust123,ravi,300.0,D
cust123,ravi,400.0,C
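The composite key does not have to be a concatenated string; a tuple can play the same role and avoids re-splitting the key afterwards. The sketch below shows the idea in plain Python without Beam, so it illustrates the grouping logic only and is not the pipeline above; aggregate is a hypothetical helper name. In Beam, emitting a tuple key from the parse step should work with CombinePerKey in the same way, but treat that as an assumption to verify.

```python
from collections import defaultdict

rows = [
    "cust123,ravi,100,D",
    "cust123,ravi,200,D",
    "cust234,Srini,200,C",
    "cust444,shaker,500,D",
    "cust123,ravi,100,C",
    "cust123,ravi,300,C",
]


def aggregate(lines):
    """Sum the amount per (customer_id, customer_name, transaction_type)."""
    totals = defaultdict(float)
    for line in lines:
        customer_id, customer_name, amount, transaction_type = line.split(",")
        totals[(customer_id, customer_name, transaction_type)] += float(amount)
    return dict(totals)


for (customer_id, customer_name, transaction_type), total in aggregate(rows).items():
    print(f"{customer_id},{customer_name},{total},{transaction_type}")
```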

SPARK - Joining two data streams - maintenance of cache

The out-of-the-box join capability in Spark Streaming does not cover a lot of real-life use cases, because it joins only the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams, enrich each object in stream1 with its corresponding object in stream2 in Spark, and save the result to HBase.
The implementation would
maintain an in-memory dataset of objects from stream2, adding or replacing objects as they are received
for every element in stream1, look up the matching object from stream2 in the cache; save to HBase if a match is found, or put the element back on the Kafka stream if not.
This question explores Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs -- not just the ones in that micro-batch. Basically you keep a "running total" RDD that you fill something like:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
  globalRDD2 = globalRDD2.union(rdd)
  globalRDD1.join(globalRDD2).foreach(...) // etc, etc
})
A good start would be to look into mapWithState, which is a more efficient replacement for updateStateByKey. Both are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
  value.foreach(state.update(_))
  (key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
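stream2State is now a stream in which each batch contains a (K, V) pair carrying the latest value for every key that received an update in that batch; if you need the latest value for all keys seen so far, stream2State.stateSnapshots() provides that. You can join either of these with stream1 to implement the logic of your second point.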
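Independently of the Spark API, the two points from the question (keep the latest stream2 object per key, then enrich stream1 or send the element back for retry) can be sketched in plain Python. This is only an illustration of the logic, not Spark code; LatestValueCache and the sample keys and values are hypothetical.

```python
class LatestValueCache:
    """Keeps the most recent stream2 object seen for each key."""

    def __init__(self):
        self._latest = {}

    def update(self, key, value):
        # Adds a new key or replaces the previous value for an existing key.
        self._latest[key] = value

    def enrich(self, key, left):
        """Return (key, left, right) if a match exists, otherwise None."""
        if key in self._latest:
            return (key, left, self._latest[key])
        return None


cache = LatestValueCache()
cache.update("k1", "profile-v1")
cache.update("k1", "profile-v2")  # replaces the earlier value for k1

enriched = []     # stands in for "save to HBase"
retry_queue = []  # stands in for "put it back on the Kafka stream"
for key, event in [("k1", "click"), ("k2", "view")]:
    result = cache.enrich(key, event)
    if result is None:
        retry_queue.append((key, event))
    else:
        enriched.append(result)

print(enriched)     # [('k1', 'click', 'profile-v2')]
print(retry_queue)  # [('k2', 'view')]
```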

Is it possible to read a message from a PubSub and separate its data in different elements of a PCollection<String>? If so, how?

Now, I have the below code:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"));
Looks like you want to read some messages from pubsub and convert each of them to multiple parts by splitting a message on space characters, and then feed the parts to the rest of your pipeline. No special configuration of PubsubIO is needed, because it's not a "reading data" problem - it's a "transforming data you have already read" problem - you simply need to insert a ParDo which takes your "composite" record and breaks it down in the way you want, e.g.:
PCollection<String> input_data =
    pipeline
        .apply(PubsubIO
            .Read
            .withCoder(StringUtf8Coder.of())
            .named("ReadFromPubSub")
            .subscription("/subscriptions/project_name/subscription_name"))
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            String composite = c.element();
            for (String part : composite.split(" ")) {
              c.output(part);
            }
          }
        }));
I take it you mean that the data you want is present in different elements of the PCollection and want to extract and group it somehow.
A possible approach is to write a DoFn function that processes each String in the PCollection. You output a key value pair for each piece of data you want to group. You can then use the GroupByKey transform to group all the relevant data together.
For example you have the following messages from pubsub in your PCollection:
User 1234 bought item A
User 1234 bought item B
The DoFn function will output a key value pair with the user id as key and the item bought as value. ( <1234,A> , <1234, B> ).
Using the GroupByKey transform you group the two values together in one element. You can then perform further processing on that element.
This is a very common pattern in big data processing, known as MapReduce.
You can output an Iterable<A> and then use Flatten to squash it. Unsurprisingly, this is termed flatMap in many next-generation data processing platforms, cf. Spark and Flink.
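The key-value and grouping steps described above can be sketched in plain Python. This only illustrates the pattern and is not Dataflow code; parse_purchase and group_by_key are hypothetical names, and the message shape is assumed to be exactly "User <id> bought item <item>".

```python
from collections import defaultdict


def parse_purchase(message):
    """Turn 'User 1234 bought item A' into the pair ('1234', 'A')."""
    parts = message.split(" ")
    return parts[1], parts[-1]


def group_by_key(pairs):
    """Collect all values that share a key, like a GroupByKey step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


messages = ["User 1234 bought item A", "User 1234 bought item B"]
grouped = group_by_key(parse_purchase(m) for m in messages)
print(grouped)  # {'1234': ['A', 'B']}
```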
