How to realize merge operation in Lambda-architecture?

I am implementing the Lambda architecture, using Spark for the batch layer and Spark Streaming for the speed layer. For now, I store both the batch views and the real-time views in HBase, but in different tables.
I am stuck on how to merge the batch views generated by the batch layer with the real-time views generated by the speed layer in order to serve queries. What is the right way to do this? Should I just dump them into the same HBase table and have the client query HBase directly?

First of all, I think that HBase is not the best option for the real-time views, as a heavy random read/random write load is not HBase's strongest side.
Anyway, one way could be the following:
cache the batch view in Spark, e.g. as a DataFrame/Dataset;
fetch the real-time view via Spark and represent it as a DataFrame/Dataset too;
create an appropriate pipeline to merge those structures when needed, e.g. upon a request from the UI, etc.
A very simplified flow for doing that can be found in my GitHub.
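For illustration, here is a very rough, self-contained sketch in Java of what such a merge could look like. The column names, the counting semantics, and the in-memory rows standing in for the HBase-backed views are all assumptions, not part of the original flow.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MergedViewSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lambda-merge-sketch")
                .master("local[*]")
                .getOrCreate();

        StructType schema = new StructType()
                .add("entity_id", DataTypes.StringType)   // assumed key column
                .add("count", DataTypes.LongType);        // assumed aggregate column

        // In a real deployment these would be read from the batch-view and
        // real-time-view HBase tables; they are inlined here to keep the sketch runnable.
        List<Row> batchRows = Arrays.asList(
                RowFactory.create("page-a", 100L),
                RowFactory.create("page-b", 40L));
        List<Row> speedRows = Arrays.asList(
                RowFactory.create("page-a", 3L),          // increments since the last batch run
                RowFactory.create("page-c", 1L));

        Dataset<Row> batchView = spark.createDataFrame(batchRows, schema).cache();
        Dataset<Row> realtimeView = spark.createDataFrame(speedRows, schema);

        // Merge on request: union both views and re-aggregate so the speed-layer
        // increments are added on top of the batch totals.
        Dataset<Row> merged = batchView.unionByName(realtimeView)
                .groupBy("entity_id")
                .sum("count")
                .withColumnRenamed("sum(count)", "count");

        merged.show();
        spark.stop();
    }
}
```

The idea is simply that the serving query unions the cached batch view with the much smaller real-time view and re-aggregates, so the speed-layer increments sit on top of the batch totals.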

Related

Is apache-beam a good choice when event time ordering has to be preserved when writing to the sink?

I'm considering using Apache Beam to write a streaming pipeline that applies a stream of mutations, replicating events from a source database into a destination database in event-time order. The source could be either Kafka or Pub/Sub.
An example would be something like this, except that the mutations must be applied to the sink in the order in which they arrived.
I did go over some of the previous questions asked on preserving order:
Processing Total Ordering of Events By Key using Apache Beam
Sort elements within a fixed window - Cloud Dataflow - this seems to be the same use case I'm interested in.
I understand that if I go down the Apache Beam road I would have to:
1. choose a windowing strategy that accommodates late data (either a fixed windowing strategy with an allowed lateness, or a global window with triggers to emit panes and a buffer for late data);
2. apply transformations;
3. GroupByKey over a single key (so that everything goes to the same worker), sort, and write to the sink.
In addition to the above, I would have to make sure the windows (if I follow a fixed-window strategy) are executed in order. Step 3 is bound to be the bottleneck.
If step 2 above involves a lot of computation, then Apache Beam makes sense in order to take advantage of the parallelism it offers. But if step 2 is just a simple one-to-one mapping, does Apache Beam make sense for this replication use case? Please let me know if I'm missing something.
Note: We do have a batch pipeline on Dataflow, using Apache Beam, that loads a data dump from GCS into a database; there the entire data is on disk and the order in which it is written to the sink does not matter.
Preserving order is possible, but I'm not sure it's straightforward or efficient.
It also depends on how much data (elements/sec) you're expecting, as well as what the sink type is. Potentially you could have the pipeline write ordered entries out to GCS, and have the sink read the files in, in order, as a secondary process.
Your other option, using parallel writes and making sure the database is only usable up to the output watermark time of the last Beam stage, is maybe doable, but it's not really the core use case of Dataflow/Apache Beam.
Maybe there could be ways to process the stream out of order, but write to an intermediate sink that can easily be read from in order, i.e. writing out the mutation batches with a step or file number that can easily be used to order the files when they are applied to the final sink.
The window + write-to-final-sink architecture is going to be difficult to get right: probably too complex for a low volume of elements, and too inefficient for a large volume. This is a good example of what this could look like.
But again, keep in mind that all of these approaches are definitely not the core use case for Dataflow/Apache Beam.
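For reference, here is a minimal sketch in Java of the fixed-window + single-key GroupByKey + in-window sort approach described in the question. The inline Create source stands in for KafkaIO/PubsubIO, and the final "apply" step is only a placeholder for writing to the destination database.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TimestampedValue;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class OrderedReplicationSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Stand-in for KafkaIO/PubsubIO: a few timestamped "mutations".
        PCollection<String> mutations = p.apply(Create.timestamped(
                TimestampedValue.of("UPDATE t SET v = 2 WHERE id = 1", new Instant(2000L)),
                TimestampedValue.of("INSERT INTO t VALUES (1, 1)", new Instant(1000L)),
                TimestampedValue.of("DELETE FROM t WHERE id = 1", new Instant(3000L))));

        mutations
                // Step 1: fixed windows with some allowed lateness for late data.
                .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                        .withAllowedLateness(Duration.standardMinutes(5))
                        .discardingFiredPanes())
                // Step 2: per-element transformations would go here; this sketch only
                // pairs each mutation with its event timestamp under a single key.
                .apply("KeyByConstant", ParDo.of(new DoFn<String, KV<String, KV<Long, String>>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        c.output(KV.of("all", KV.of(c.timestamp().getMillis(), c.element())));
                    }
                }))
                // Step 3: everything in a window lands on one worker...
                .apply(GroupByKey.<String, KV<Long, String>>create())
                // ...then sort by event time and apply to the sink in order.
                .apply("SortAndApply", ParDo.of(new DoFn<KV<String, Iterable<KV<Long, String>>>, Void>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        List<KV<Long, String>> batch = new ArrayList<>();
                        c.element().getValue().forEach(batch::add);
                        batch.sort((a, b) -> Long.compare(a.getKey(), b.getKey()));
                        for (KV<Long, String> kv : batch) {
                            // Placeholder for applying the mutation to the destination DB.
                            System.out.println("apply @" + kv.getKey() + ": " + kv.getValue());
                        }
                    }
                }));

        p.run().waitUntilFinish();
    }
}
```

Note that pushing everything through one key serializes step 3 onto a single worker, which is exactly why it tends to become the bottleneck.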

Simple inquiry about streaming data directly into Cloud SQL using Google DataFlow

So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running, streaming into BigQuery, but I am going to want to stream into a full relational DB (i.e. Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here, because when I look up info on how to do this it all refers to writing to Cloud SQL in batches and not full-out streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it via batch instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumptions about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that, for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it just means that it tries to avoid the overhead of an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general it depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and it can be less than that.
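For concreteness, a minimal JdbcIO write might look like the sketch below. The Cloud SQL connection string, table, credentials, and the bounded Create source standing in for the real streaming source are all placeholders.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;

public class JdbcStreamingWriteSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // A bounded stand-in for the real streaming source (PubsubIO, KafkaIO, ...);
        // JdbcIO.write() does not care whether its input is bounded or unbounded.
        p.apply(Create.of(KV.of(1, "alice"), KV.of(2, "bob")))
         .apply(JdbcIO.<KV<Integer, String>>write()
                 .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                         "com.mysql.cj.jdbc.Driver",
                         // Placeholder Cloud SQL JDBC URL; substitute your instance's
                         // connection string (e.g. via the Cloud SQL socket factory).
                         "jdbc:mysql://google/mydb?cloudSqlInstance=my-project:my-region:my-instance"
                                 + "&socketFactory=com.google.cloud.sql.mysql.SocketFactory")
                         .withUsername("user")
                         .withPassword("password"))
                 .withStatement("INSERT INTO users (id, name) VALUES (?, ?)")
                 // How many records to write per database call (1000 is the default).
                 .withBatchSize(1000)
                 .withPreparedStatementSetter(
                         new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
                             public void setParameters(KV<Integer, String> element,
                                     PreparedStatement statement) throws SQLException {
                                 statement.setInt(1, element.getKey());
                                 statement.setString(2, element.getValue());
                             }
                         }));

        p.run().waitUntilFinish();
    }
}
```

In a streaming pipeline you would replace the Create stand-in with PubsubIO or KafkaIO; the JdbcIO.write() part stays the same.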

Multiple export using google dataflow

Not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that partitions a data source into multiple chunks in multiple places. However, I feel that if I try to write to too many tables at once in one job, the Dataflow job is more likely to fail with an HTTP transport exception, and I assume there is some bound on how much I/O, in terms of sources and sinks, I can wrap into one job.
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs, but that would mean processing the same data source multiple times (once per Dataflow job). That is okay for now, but ideally I would like to avoid it later if my data source grows huge.
Therefore I am wondering whether there is any rule of thumb for how many sources and sinks I can group into one steady job, and whether there is any better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it output to. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got I'm unsure whether that was the right thing to do or not. Writing Output of a Dataflow Pipeline to a Partitioned Destination
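For what it's worth, a minimal sketch of the Partition-then-write-per-destination approach mentioned above could look like this; the routing rule, destination names, and bucket paths are made up.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionedWriteSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Made-up records: a destination prefix followed by the payload.
        PCollection<String> records = p.apply(Create.of(
                "us,order-1", "eu,order-2", "us,order-3", "apac,order-4"));

        // Hypothetical routing rule: the prefix before the comma picks the destination.
        List<String> destinations = Arrays.asList("us", "eu", "apac");

        PCollectionList<String> partitions = records.apply(
                Partition.of(destinations.size(),
                        (String element, int numPartitions) ->
                                // Assumes every record's prefix is a known destination.
                                destinations.indexOf(element.split(",", 2)[0])));

        // One sink per partition, all within the same job; bundles that fail to
        // write are retried by the service.
        for (int i = 0; i < destinations.size(); i++) {
            partitions.get(i).apply("Write-" + destinations.get(i),
                    TextIO.write().to("gs://my-bucket/" + destinations.get(i) + "/orders"));
        }

        p.run().waitUntilFinish();
    }
}
```

Each partition gets its own sink within the same job, so the data source only has to be read once.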

Is there a solution like Apache ActiveMQ on top of HDFS?

I want to store webpages fetched by a web crawler. I don't need any random access, so whenever I want to read the stored data I read it from start to end.
We have tried solutions like HBase, but one of the best things about HBase is random access to records, which we don't need at all. HBase has not proved stable for us after 1.5 years of testing.
I just want a stack or queue on top of HDFS, because the number of webpages is about 1 billion. I don't even want the queueing behaviour of ActiveMQ; I just want to be able to store the webpages so that I can read them all in case of a failure.
I don't want to use plain files, because I don't want to handle things like file rotation, file consistency and so on.
It is worth mentioning that we need HDFS so we can run MapReduce jobs on the data when we want to send all the stored data to a Solr cluster, and to get things like redundancy and availability from HDFS.
Is there a service on HDFS that just stores JMS records, without any functionality for random access and without a transparent view of the records?

multiple db connections vs. centralized/redundant db

I have a project to create a dashboard that will connect to existing systems as well as create new features based on combining data from the existing systems. For example, the dashboard will be able to generate "orders" containing data merged from "members" (MS Access DB), "employees" (MySQL DB) and "products" (flat file), and there will also be new attributes particular to "orders."
At first I thought it would be most efficient to have my application connect to each of the systems separately and perform cross-vendor joins between the different databases. But then I thought that creating a centralized/redundant db (built with scripts pushing and pulling data between the systems) might also be useful because it would empower some semi-technical staff to use products like OOBase, which can only make a single connection.
Are there any other advantages to creating a centralized/redundant DB like the one I'm talking about? Or are multiple direct connections the best approach?
Thanks in advance for any tips.
To give you a short answer: yes, you want central data storage.
You don't want to run complex reports against your live database. As your live database grows you will want to do some housekeeping and clean it up, but keep the data for analysis.
You will also want the data to be aggregated so you can perform historical analysis.
The data which comes from different sources will require some clean-up. You will also probably need to know how to link your data together, and there are quite a lot of things like that you will have to be aware of to do the job properly.
You might consider reading up on data warehousing (Wikipedia) and business intelligence (Wikipedia).
If you want to have 'new features' added to this system, you could also look up orchestration (Wikipedia). It will allow you to link your heterogeneous business processes together.
All of these are quite specialized and complex disciplines in their own right, so you might want to have a specialist advise you.
Be very, very careful about copying lots of data around. If you do, here are some important guidelines:
Make sure that one system is defined as the master and that no other system may tamper with the data.
Always copy data from the master to the slaves.
When you copy the data, use a checksum of some kind to make sure all the data has been copied. Make sure you can handle "yesterday, the copy failed".
If a slave must make a change, push the change to the master and then use the standard "update" path to merge it back to the slave. Avoid "save the change on the slave and update the master some time in the future".
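As a loose illustration of the checksum guideline, one could hash the same ordered projection of a table on the master and on a copy and compare the results. The connection strings, table, and columns below are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CopyVerification {

    // Hash the same ordered projection of a table so the master and a copy can be
    // compared; this catches "yesterday, the copy failed" the next day.
    static String tableDigest(Connection conn, String table) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, name, updated_at FROM " + table + " ORDER BY id")) {
            while (rs.next()) {
                String row = rs.getInt("id") + "|" + rs.getString("name")
                        + "|" + rs.getTimestamp("updated_at");
                digest.update(row.getBytes(StandardCharsets.UTF_8));
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical connection strings for the master and one copy.
        try (Connection master = DriverManager.getConnection(
                     "jdbc:mysql://master-host/db", "user", "password");
             Connection copy = DriverManager.getConnection(
                     "jdbc:mysql://copy-host/db", "user", "password")) {
            boolean ok = tableDigest(master, "members").equals(tableDigest(copy, "members"));
            System.out.println(ok ? "copy verified" : "copy failed - re-run the sync");
        }
    }
}
```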
