I'm using a sliding time window of size X and period Y. In order to mark the output of each window, I'd like to get the timestamp of the current window of the PCollection.
PCollection<T> windowedInputs = input
.apply(Window.<T>into(
SlidingWindows.of(Duration.standardMinutes(10))
.every(Duration.standardMinutes(1))));
// Extract key from each input and run a function per group.
//
// Q: ExtractKey() depends on the window triggered time.
// How can I pass the timestamp of windowedInputs to ExtractKey()?
PCollection<KV<K, Iterable<T>>> groupedInputs = windowedInputs
.apply(ParDo.of(new ExtractKey()))
.apply(GroupByKey.<K, T>create());
// Run Story clustering and write outputs.
//
// Q: Also I'd like to add a window timestamp suffix to the output.
// How can I pass (or get) the timestamp to SomeDoFn()?
PCollection<String> results = groupedInputs.apply(ParDo.of(new SomeDoFn()));
A DoFn is allowed to access the window of the current element via an optional BoundedWindow parameter on the @ProcessElement method:
class SomeDoFn extends DoFn<KV<K, Iterable<T>>, String> {
@ProcessElement
public void process(ProcessContext c, BoundedWindow window) {
...
}
}
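With SlidingWindows the window passed in is an IntervalWindow, so its start and end timestamps can be read directly. A minimal sketch (the output formatting is only an illustration of adding a window-timestamp suffix):

class SomeDoFn extends DoFn<KV<K, Iterable<T>>, String> {
  @ProcessElement
  public void process(ProcessContext c, BoundedWindow window) {
    // SlidingWindows assigns IntervalWindows, so this cast is safe here;
    // interval.start() and interval.end() give the window boundaries.
    IntervalWindow interval = (IntervalWindow) window;
    Instant windowEnd = interval.end();

    // Example: suffix each output with the window's end timestamp.
    c.output(c.element().getKey() + "-" + windowEnd);
  }
}

The same BoundedWindow parameter can be added to ExtractKey()'s @ProcessElement method if the key needs to depend on the window.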
I have a topic in Kafka where I am getting multiple types of events in JSON format. I have created a StreamingFileSink to write these events to S3 with bucketing.
FlinkKafkaConsumer errorTopicConsumer = new FlinkKafkaConsumer(ERROR_KAFKA_TOPICS,
new SimpleStringSchema(),
properties);
final StreamingFileSink<Object> errorSink = StreamingFileSink
.forRowFormat(new Path(outputPath + "/error"), new SimpleStringEncoder<>("UTF-8"))
.withBucketAssigner(new EventTimeBucketAssignerJson())
.build();
env.addSource(errorTopicConsumer)
.name("error_source")
.setParallelism(1)
.addSink(errorSink)
.name("error_sink").setParallelism(1);
public class EventTimeBucketAssignerJson implements BucketAssigner<Object, String> {
@Override
public String getBucketId(Object record, Context context) {
StringBuffer partitionString = new StringBuffer();
Tuple3<String, Long, String> tuple3 = (Tuple3<String, Long, String>) record;
try {
partitionString.append("event_name=")
.append(tuple3.f0).append("/");
String timePartition = TimeUtils.getEventTimeDayPartition(tuple3.f1);
partitionString.append(timePartition);
} catch (Exception e) {
partitionString.append("year=").append(Constants.DEFAULT_YEAR).append("/")
.append("month=").append(Constants.DEFAULT_MONTH).append("/")
.append("day=").append(Constants.DEFAULT_DAY);
}
return partitionString.toString();
}
@Override
public SimpleVersionedSerializer<String> getSerializer() {
return SimpleVersionedStringSerializer.INSTANCE;
}
}
Now I want to publish an hourly count of each event as a metric to Prometheus and build a Grafana dashboard on top of it.
Could someone please help me understand how I can compute an hourly count per event using Flink metrics and publish it to Prometheus?
Thanks
Normally, this is done by simply creating a counter for requests and then using the rate() function in Prometheus; this will give you the rate of requests over the given time range.
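As a rough sketch of that approach (the operator below, the metric name events_total, and its placement in your job are assumptions, not something your pipeline already has), you could register a Flink Counter in a RichMapFunction placed in front of the sink and let Prometheus compute the hourly figure, e.g. with increase() over a 1h range on the exported counter; per-event counts could be obtained by registering one such counter per event name, for instance under a metric sub-group per event.

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

// Counts every element that passes through; Prometheus does the hourly math via increase()/rate().
public class CountingMap<T> extends RichMapFunction<T, T> {

    private transient Counter eventCounter;

    @Override
    public void open(Configuration parameters) {
        // "events_total" is an assumed metric name; pick whatever fits your naming scheme.
        eventCounter = getRuntimeContext().getMetricGroup().counter("events_total");
    }

    @Override
    public T map(T value) {
        eventCounter.inc();
        return value;
    }
}

You would also need a Prometheus metrics reporter configured in flink-conf.yaml so that the counter is actually scraped.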
If you, however, want to do this on your own for some reason, you can do something similar to what is done in org.apache.kafka.common.metrics.stats.Rate. In that case you would need to keep a list of samples together with the time at which each was collected, plus the window size you want to use for the rate calculation; the calculation itself is then simple: remove the samples that have fallen out of the window and count how many samples remain in it.
You could then set a Gauge to the calculated value.
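A minimal sketch of such a gauge, assuming all you need is the number of samples seen within the last hour (the class name and registration below are made up for illustration); you would call record() once per event inside the operator where the gauge is registered, e.g. via getRuntimeContext().getMetricGroup().gauge("events_last_hour", gauge):

import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.flink.metrics.Gauge;

// Keeps the timestamps of recent samples and, each time it is scraped,
// drops the expired ones and reports how many are left inside the window.
public class WindowedCountGauge implements Gauge<Integer> {

    private final long windowSizeMs;
    private final Deque<Long> sampleTimestamps = new ArrayDeque<>();

    public WindowedCountGauge(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    public synchronized void record() {
        sampleTimestamps.addLast(System.currentTimeMillis());
    }

    @Override
    public synchronized Integer getValue() {
        long cutoff = System.currentTimeMillis() - windowSizeMs;
        while (!sampleTimestamps.isEmpty() && sampleTimestamps.peekFirst() < cutoff) {
            sampleTimestamps.removeFirst();
        }
        return sampleTimestamps.size();
    }
}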
I'm trying to join two streams in Apache Flink to get some results.
The current state of my project is that I fetch Twitter data and map it into a 2-tuple in which the user's language and the number of tweets within a defined time window are stored.
I do this both for the number of tweets per language and for the number of retweets per language. The tweet/retweet aggregation works fine in other processes.
I now want to get the ratio of retweets to all tweets per language within a time window.
Therefore I use the following code:
Time windowSize = Time.seconds(15);
// Sum up tweets per language
DataStream<Tuple2<String, Integer>> tweetsLangSum = tweets
.flatMap(new TweetLangFlatMap())
.keyBy(0)
.timeWindow(windowSize)
.sum(1);
// ---
// Get retweets out of all tweets per language
DataStream<Tuple2<String, Integer>> retweetsLangMap = tweets
.keyBy(new KeyByTweetPostId())
.flatMap(new RetweetLangFlatMap());
// Sum up retweets per language
DataStream<Tuple2<String, Integer>> retweetsLangSum = retweetsLangMap
.keyBy(0)
.timeWindow(windowSize)
.sum(1);
// ---
tweetsLangSum.join(retweetsLangSum)
.where(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> tweet) throws Exception {
return tweet.f0;
}
})
.equalTo(new KeySelector<Tuple2<String, Integer>, String>() {
@Override
public String getKey(Tuple2<String, Integer> tweet) throws Exception {
return tweet.f0;
}
})
.window(TumblingEventTimeWindows.of(windowSize))
.apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple4<String, Integer, Integer, Double>>() {
@Override
public Tuple4<String, Integer, Integer, Double> join(Tuple2<String, Integer> in1, Tuple2<String, Integer> in2) throws Exception {
String lang = in1.f0;
Double percentage = (double) in1.f1 / in2.f1;
return new Tuple4<>(in1.f0, in1.f1, in2.f1, percentage);
}
})
.print();
When I print tweetsLangSum or retweetsLangSum the output seems to be fine. My problem is that I never get any output from the join. Does anyone have an idea why? Or am I using the window function wrongly in the first aggregation step when it comes to the join?
This might be caused by a mix of different time semantics. The KeyedStream.timeWindow() method is a shortcut that creates a window operator based on the configured time characteristics, i.e., an event-time window if event-time is enabled or a processing-time window otherwise. For the join, you explicitly define an event-time window.
Did you enable event-time processing?
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
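If event time is what you want, the tweets stream also needs timestamps and watermarks assigned before the windowed aggregations. A minimal sketch, assuming an element type Tweet with a getTimestampMs() accessor (both hypothetical stand-ins for your actual tweet type):

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Tweet> tweetsWithTimestamps = tweets
    .assignTimestampsAndWatermarks(
        // Allow events to arrive up to 5 seconds out of order.
        new BoundedOutOfOrdernessTimestampExtractor<Tweet>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(Tweet tweet) {
                return tweet.getTimestampMs(); // hypothetical accessor
            }
        });

The windowed sums and the join would then be applied to tweetsWithTimestamps instead of tweets.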
I have a Beam pipeline that starts off by reading multiple text files, where each line in a file represents a row that gets inserted into Bigtable later in the pipeline. The scenario requires confirming that the count of rows extracted from each file and the count of rows later inserted into Bigtable match. For this I am planning to develop a custom windowing strategy so that the lines from a single file get assigned to a single window, based on the file name as the key that is passed to the windowing function.
Is there any code sample for creating custom Windowing functions?
Although I changed my strategy for confirming the number of inserted rows, for anyone who is interested in windowing elements read from a batch source, e.g. FileIO in a batch job, here is the code for creating a custom windowing strategy:
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow>{
private static final long serialVersionUID = -476922142925927415L;
private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);
@Override
public IntervalWindow assignWindow(Instant timestamp) {
Instant end = new Instant(timestamp.getMillis() + 1);
IntervalWindow interval = new IntervalWindow(timestamp, end);
LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
return interval;
}
@Override
public boolean isCompatible(WindowFn<?, ?> other) {
return this.equals(other);
}
@Override
public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
if (!this.isCompatible(other)) {
throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
}
}
@Override
public Coder<IntervalWindow> windowCoder() {
return IntervalWindow.getCoder();
}
}
and then it can be used in the pipeline as below:
p
.apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
.apply("Assign_Window_to_Each_Message", Window.<KV<String,String>>into(new FileWindows())
.withAllowedLateness(Duration.standardMinutes(1))
.discardingFiredPanes());
Please keep in mind that you will need to write the AssignTimestampFn() so that each message carries a timestamp.
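A minimal sketch of what such a DoFn could look like, assuming the elements are KV<String, String> pairs whose key is the file name, and deriving one synthetic timestamp per file from that key (both are assumptions for illustration; if the new timestamp can be earlier than the element's current timestamp you may also need to override getAllowedTimestampSkew()):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

public class AssignTimestampFn extends DoFn<KV<String, String>, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Derive a per-file timestamp (here simply from the file-name hash) so that
    // every line of the same file lands in the same one-millisecond FileWindows window.
    Instant fileTimestamp = new Instant(c.element().getKey().hashCode() & 0xFFFFFFFFL);
    c.outputWithTimestamp(c.element(), fileTimestamp);
  }
}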
We have a problem making the asList() method return a sorted list.
We thought we could do this by just extending the View class and overriding the asList method, but realized that the View class has a private constructor, so we could not do this.
Our other attempt was to fork the Google Dataflow code on GitHub and modify the PCollectionViews class to return a sorted list by using the Collections.sort method, as shown in the code snippet below:
@Override
protected List<T> fromElements(Iterable<WindowedValue<T>> contents) {
Iterable<T> itr = Iterables.transform(
contents,
new Function<WindowedValue<T>, T>() {
@SuppressWarnings("unchecked")
@Override
public T apply(WindowedValue<T> input){
return input.getValue();
}
});
LOG.info("#### About to start sorting the list !");
List<T> tempList = new ArrayList<T>();
for (T element : itr) {
tempList.add(element);
};
Collections.sort((List<? extends Comparable>) tempList);
LOG.info("##### List should now be sorted !");
return ImmutableList.copyOf(tempList);
}
Note that we are now sorting the list.
This seemed to work when run with the DirectPipelineRunner, but when we tried the BlockingDataflowPipelineRunner, it didn't seem like the code change was being executed.
Note: We actually recompiled the Dataflow SDK and used it in our project, but this did not work.
How can we achieve this (a sorted list from the asList method call)?
The classes in PCollectionViews are not intended for extension. Only the primitive view types provided by View.asSingleton, View.asIterable, View.asList, View.asMap, and View.asMultimap are supported.
To obtain a sorted list from a PCollectionView, you'll need to sort it after you have read it. The following code demonstrates the pattern.
// Assume you have some PCollection
PCollection<MyComparable> myPC = ...;
// Prepare it for side input as a list
final PCollectionView<List<MyComparable>> myView = myPC.apply(View.asList());
// Side input the list and sort it
someOtherValue.apply(
ParDo.withSideInputs(myView).of(
new DoFn<A, B>() {
@Override
public void processElement(ProcessContext ctx) {
List<MyComparable> tempList =
Lists.newArrayList(ctx.sideInput(myView));
Collections.sort(tempList);
// do whatever you want with sorted list
}
}));
Of course, you may not want to sort it repeatedly, depending on the cost of sorting vs. the cost of materializing it as a new PCollection. Instead, you can output the sorted value once and read it as a new side input without difficulty:
// Side input the list, sort it, and put it in a PCollection
PCollection<List<MyComparable>> sortedSingleton = Create.<Void>of(null).apply(
ParDo.withSideInputs(myView).of(
new DoFn<Void, List<MyComparable>>() {
@Override
public void processElement(ProcessContext ctx) {
List<MyComparable> tempList =
Lists.newArrayList(ctx.sideInput(myView));
Collections.sort(tempList);
ctx.output(tempList);
}
}));
// Prepare it for side input as a list
final PCollectionView<List<MyComparable>> sortedView =
sortedSingleton.apply(View.asSingleton());
someOtherValue.apply(
ParDo.withSideInputs(sortedView).of(
new DoFn<A, B>() {
@Override
public void processElement(ProcessContext ctx) {
... ctx.sideInput(sortedView) ...
// do whatever you want with sorted list
}
}));
You may also be interested in the unsupported sorter contrib module for doing larger sorts using both memory and local disk.
We tried to do it the way Kenn Knowles suggested. There's a problem for large datasets. If the tempList is large (so the sort takes measurable time, being O(n log n)) and if there are millions of elements in the "someOtherValue" PCollection, then we are unnecessarily re-sorting the same list millions of times. We should be able to sort ONCE and FIRST, before passing the list to the someOtherValue.apply's DoFn.
OK, so here is a clearer explanation:
I have now understood that I need to use the SparseGraph or SparseMultigraph type to be able to have bidirectional edges, so I have changed my GraphML class as follows:
class GraphML
{
public GraphML(String filename) throws ParserConfigurationException, SAXException, IOException
{
//Step 1 we make a new GraphML Reader. We want a directed Graph of type node and edge.
GraphMLReader<SparseMultigraph<node, edge>, node, edge> gmlr =
new GraphMLReader<SparseMultigraph<node, edge>, node, edge>(new VertexFactory(), new EdgeFactory());
//Next we need a Graph to store the data that we are reading in from GraphML. This is also a directed Graph
// because it needs to match to the type of graph we are reading in.
final SparseMultigraph<node, edge> graph = new SparseMultigraph<node, edge>();
gmlr.load(filename, graph);
// gmlr.load(filename, graph); //Here we read in our graph. filename is our .graphml file, and graph is where we
// will store our graph.
BidiMap<node, String> vertex_ids = gmlr.getVertexIDs(); //The vertexIDs are stored in a BidiMap.
Map<String, GraphMLMetadata<node>> vertex_color = gmlr.getVertexMetadata(); //Our vertex Metadata is stored in a map.
Map<String, GraphMLMetadata<edge>> edge_meta = gmlr.getEdgeMetadata(); // Our edge Metadata is stored in a map.
// Here we iterate through our vertices, n, and we set the value and the color of our nodes from the data we have
// in the vertex_ids map and vertex_color map.
for (node n : graph.getVertices())
{
n.setValue(vertex_ids.get(n)); //Set the value of the node to the vertex_id which was read in from the GraphML Reader.
n.setColor(vertex_color.get("d0").transformer.transform(n)); // Set the color, which we get from the Map, vertex_color.
//Let's print out the data so we can get a good understanding of what we've got going on.
System.out.println("ID: "+n.getID()+", Value: "+n.getValue()+", Color: "+n.getColor());
}
// Just as we added the vertices to the graph, we add the edges as well.
for (edge e : graph.getEdges())
{
e.setValue(edge_meta.get("d1").transformer.transform(e)); //Set the edge's value.
System.out.println("Edge ID: "+e.getID());
}
TreeBuilder treeBuilder = new TreeBuilder(graph);
// create a simple graph for the demo:
//First we make a VisualizationViewer, of type node, edge. We give it our Layout, and the Layout takes a graph in its constructor.
//VisualizationViewer<node, edge> vv = new VisualizationViewer<node, edge>(new FRLayout<node, edge>(graph));
VisualizationViewer<node, edge> vv = new VisualizationViewer<node, edge>(new TreeLayout<node, edge>(treeBuilder.getTree()));
//Next we set some rendering properties. First we want to color the vertices, so we provide our own vertexPainter.
vv.getRenderContext().setVertexFillPaintTransformer(new vertexPainter());
//Then we want to provide labels to our nodes. JUNG provides a nice function which makes the graph use a vertex's toString function
//as its way of labelling. We do the same for the edge. Look at the edge and node classes for their toString functions.
vv.getRenderContext().setVertexLabelTransformer(new ToStringLabeller<node>());
vv.getRenderContext().setEdgeLabelTransformer(new ToStringLabeller<edge>());
// Next we do some Java stuff, we create a frame to hold the graph
final JFrame frame = new JFrame();
frame.setTitle("GraphMLReader for Trees - Reading in Attributes"); //Set the title of our window.
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE); //Give a close operation.
//Here we get the contentPane of our frame and add our a VisualizationViewer, vv.
frame.getContentPane().add(vv);
//Finally, we pack it to make sure it is pretty, and set the frame visible. Voila.
frame.pack();
frame.setVisible(true);
}
}
And then I changed my TreeBuilder class constructor to take a SparseMultigraph:
public class TreeBuilder
{
DelegateForest<node,edge> mTree;
TreeBuilder(SparseMultigraph<node, edge> graph)
{
mTree = new DelegateForest<node, edge>();
for (node n : graph.getVertices())
{
mTree.addVertex(n);
}
for (edge e : graph.getEdges())
{
mTree.addEdge(e, graph.getSource(e),graph.getDest(e));
}
}
public DelegateForest<node, edge> getTree()
{
return mTree;
}
}
When I run my Main class:
public class Main
{
public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException
{
String filename = "attributes.graphml";
if(args.length > 0)
filename = args[0];
new GraphML(filename);
}
}
I don't get an error, but the edges are not present (the nodes are there in the graph but not properly displayed).
Thanks
Zied