Your advice on a Hadoop MapReduce job - join

I have 2 files stored on a HDFS filesystem:
tbl_userlog: <website url (non canonical)> <tab> <username> <tab> <timestamp>
example: www.website.com, foobar87, 201101251456
tbl_websites: <website url (canonical)> <tab> <total hits>
example: website.com, 25889
I have written a sequence of Hadoop jobs that joins the 2 files on the website, filters on total hits > n per website, and then counts, for each user, the number of websites he has visited that have > n total hits. The details of the sequence are as follows:
A Map-only job which canonicalizes the url in tbl_userlog (i.e. removes www, http:// and https:// from the url field)
A Map-only job which sorts tbl_websites on the url
An identity Map-Reduce job which takes the output of the 2 previous jobs as KeyValueTextInputFormat and feeds them to a CompositeInputFormat in order to make use of Hadoop's native join feature, defined with jobConf.set("mapred.join.expr", CompositeInputFormat.compose("inner" (...))
A Map and Reduce job which filters the result of the previous job on total hits > n in its Map phase, groups the results by user in the shuffle phase, and counts the number of websites for each user in the Reduce phase.
In order to chain these steps, I just call the jobs sequentially in the described order. Each individual job outputs its results into HDFS which the following job in the chain then retrieves and processes in turn.
As I am new to Hadoop, I would like to ask for your advice:
Is there a better way to chain these jobs? In this configuration all intermediate results are written to HDFS and then read back.
Do you see any design flaw in this job, or could it be written more elegantly by making use of some Hadoop feature that I have missed?
I am using Apache Hadoop 0.20.2 and using higher-level frameworks such as Pig or Hive is not possible in the scope of the project.
Thanks in advance for your replies!

I think what you have will work, with a couple of caveats. Before I list them, I want to make two definitions clear. A map-only job is a job that has a defined Mapper and runs with 0 reducers. If the job runs with more than 0 IdentityReducers, it is not a map-only job. A reduce-only job is a job that has a defined Reducer and runs with an IdentityMapper.
Your first job can be a map-only job, since all you're doing is canonicalizing URLs. But if you want to use CompositeInputFormat, you should run it with an IdentityReducer and more than 0 reducers.
For your second job, I don't know what you mean by a map-only job that sorts. Sorting is by its very nature a reduce-side task. You probably mean that it has a defined Mapper but no Reducer. But in order for the URLs to be sorted, you should run it with an IdentityReducer and more than 0 reducers.
Your third job is an interesting idea, but you have to be careful with CompositeInputFormat. Two conditions must be met before you can use this input format. The first is that both input directories must contain the same number of files. This can be achieved by setting the same number of reducers for Job1 and Job2. The second is that the input files CANNOT be splittable. This can be achieved by using a non-splittable compression codec such as gzip (bzip2 also qualifies on Hadoop 0.20.2, which cannot split it).
Your fourth job sounds good, although you could filter out websites with n or fewer hits in the reducer of the previous job and save yourself some I/O.
There's obviously more than one solution to a problem in software, so while your solution would work, I wouldn't recommend it. Having 4 MapReduce jobs for this task is a bit expensive IMHO. The implementation I have in mind is an M-R-R workflow that uses Secondary Sort.
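To make the idea concrete, here is a minimal pure-Python simulation of a reduce-side join followed by a counting step. It is only a sketch of the general technique, not the exact workflow the answer has in mind: the function names, the "W"/"U" tags, and the choice to count distinct websites per user are all assumptions made for illustration. The sort that puts the website record first stands in for what Secondary Sort would guarantee in a real job.

```python
import re
from collections import defaultdict


def canonicalize(url):
    """Strip a leading http:// or https:// and a leading www. (assumed rule)."""
    url = re.sub(r"^https?://", "", url.strip())
    return re.sub(r"^www\.", "", url)


def map_userlog(line):
    # tbl_userlog: <url> TAB <username> TAB <timestamp>
    url, user, _timestamp = line.split("\t")
    return canonicalize(url), ("U", user)


def map_websites(line):
    # tbl_websites: <canonical url> TAB <total hits>
    url, hits = line.split("\t")
    return url, ("W", int(hits))


def reduce_join(ordered_values, n):
    """First reduce: expects the website record first, then user records.

    Returns the distinct users of this URL if its total hits exceed n.
    """
    values = iter(ordered_values)
    first = next(values, None)
    if first is None or first[0] != "W" or first[1] <= n:
        return set()
    return {user for tag, user in values if tag == "U"}


def run(userlog_lines, website_lines, n):
    # Simulated shuffle: group tagged records by canonical URL.
    groups = defaultdict(list)
    for line in userlog_lines:
        key, value = map_userlog(line)
        groups[key].append(value)
    for line in website_lines:
        key, value = map_websites(line)
        groups[key].append(value)

    # Second reduce: count qualifying websites per user.
    counts = defaultdict(int)
    for values in groups.values():
        # Stand-in for Secondary Sort: the "W" record sorts before "U" records.
        ordered = sorted(values, key=lambda tv: tv[0] != "W")
        for user in reduce_join(ordered, n):
            counts[user] += 1
    return dict(counts)
```

Because the website record arrives first, the reducer can decide whether to skip a URL before it sees any user record, so nothing has to be buffered in memory.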

As far as chaining jobs is concerned, you should have a look at Oozie, which is a workflow manager. I have yet to use it, but that's where I'd start.

Related

Jenkins plugin with viewing\aggregating possibilities depending on one of the parameters

I'm looking for a plugin that aggregates settings and views for many cases, the same way a multi-branch pipeline does. Instead of basing it on various branches, I want to base it on one branch and vary the parameters. The picture below is from the multi-branch pipeline mentioned above; instead of "Branches" I'm looking for "Cases", and instead of the "Name" column I need a configurable parameter.
In addition, I need various periodic build triggers of the form
H 22 * * 5 %param1=value1 %param2=value3
H 22 * * 5 %param1=value2 %param2=value3
The second requirement could be met with a standard job, but there will be many such cases launched periodically (every week, every two weeks, or every month), and the difference in param1 is crucial, so it is important to have it readable and easily visible in order to quickly tell which case has failed.
I was looking for such a plugin but couldn't find anything like it. Maybe someone knows of such a plugin or a way to solve this.
An alternative is to create a "super" job whose build steps launch my current job with specific parameters. My layout would then change from many rows to many columns; since there are over 20 cases, this would IMHO significantly reduce readability, and not all cases would be launched with the same periodicity. So there would have to be some ready-made sets selected by a parameter, and most super builds would consist mostly of skipped cases. As a result, one might not see the last result for one of the cases.
Note that param2 always has the same value for periodic launches. Other values are used only with a manual trigger. Param2 can, but doesn't have to, be visible in the "multi-branch pipeline"-like solution.
I hope my explanation of the issue is clear. Looking forward to answers, suggestions, etc. :)

How to define Alerts with exception in InfluxDB/Kapacitor

I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts being monitored by Telegraf.
What is the approach when you want to deviate from the above setup for a host, i.e. instead of X% we want to alert on Y% for a specific server?
I'm happy that a distinct TICKscript could be created for the custom values, but how do I go about excluding the host from the original 'global' one?
This is a simple scenario but this needs to meet the needs of 10,000 hosts of which there will be 100s of exceptions and this will also encompass 10s/100s of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default. Only on one server, which happens to get a massive number of datapoints, you want to limit it to 10 (a value which is exceeded by the _internal database easily, but good for our example).
Given the following excerpt from a tick script
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows
maxNumMeasurements: 10
A critical alert is triggered if value (and hence numMeasurements) exceeds 10 and the hostname tag equals influxdb, or if value exceeds 100 on any host.
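The override-with-default behaviour can be illustrated with a few lines of plain Python. This is only a sketch of the semantics described above, not Kapacitor's implementation; the `overrides` dictionary stands in for the per-host YAML files, and the function names are made up for the example.

```python
DEFAULT_MAX = 100  # default passed to .field('maxNumMeasurements', 100)

# Stand-in for /etc/kapacitor/customizations/demo/hosts/host-<hostname>.yaml
overrides = {
    "influxdb": {"maxNumMeasurements": 10},
}


def threshold_for(hostname, overrides, default=DEFAULT_MAX):
    """Return the per-host threshold, falling back to the default."""
    return overrides.get(hostname, {}).get("maxNumMeasurements", default)


def is_critical(hostname, value, overrides):
    """Mirror of .crit(lambda: "value" > "maxNumMeasurements")."""
    return value > threshold_for(hostname, overrides)
```

With this layout a host without an override file needs no configuration at all, so only the exceptions have to be maintained.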
There is an example in the documentation that handles scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, that uses docker-compose.
Note that there is a caveat with the example: The alert flaps because of a second database dynamically generated. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10 thousand servers?
Managing alerts manually, directly in Chronograf/Kapacitor, is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer_objects. The number can go into the 1000s. We've opted for a custom solution where we keep a standard set of template TICKscripts (not to be confused with Kapacitor templates), and we provide an interface to the user where we only expose the relevant variables. A service (written in Python) then combines the values for those variables with a TICKscript and uses the Kapacitor API to deploy (or update, or delete) the task on the Kapacitor server. This is automated so that data for new customers/objects is combined with the templates and automatically deployed to Kapacitor.
You obviously need to design your tasks to be specific enough so that they don't overlap and generic enough so that it's not too much work to create tasks for every little thing.
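The template step described above could look roughly like the following sketch. It only covers rendering a script and deriving a task ID; the call to the Kapacitor API is deliberately left out. The template text, the variable names (`database`, `host`, `crit_idle`), and the task ID scheme are all hypothetical and not taken from the actual AMMP service.

```python
from string import Template

# Hypothetical template TICKscript; $-placeholders are filled in per customer.
CPU_TEMPLATE = Template("""\
stream
    |from()
        .database('$database')
        .measurement('cpu')
        .where(lambda: "host" == '$host')
    |alert()
        .crit(lambda: "usage_idle" < $crit_idle)
""")


def render_task(template, variables):
    """Fill in the template; raises KeyError if a variable is missing."""
    return template.substitute(variables)


def task_id(alert_name, variables):
    """Build a deterministic ID so a redeploy updates the same task."""
    return "{}-{}-{}".format(alert_name, variables["database"], variables["host"])


def build_tasks(alert_name, template, rows):
    """Return {task_id: script} for every set of user-supplied variables."""
    return {task_id(alert_name, row): render_task(template, row) for row in rows}
```

Keeping the task ID deterministic means the deploying service can decide between create, update, and delete by comparing the generated IDs with the tasks already on the server.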

Too many 'steps' when executing a pipeline

We have a large data set which needs to be partitioned into 1,000 separate files, and the simplest implementation we wanted to use is to apply a PartitionFn which, given an element of the data set, returns a random integer between 1 and 1,000.
The problem with this approach is it ends up creating 1,000 PCollections and the pipeline does not launch as there seems to be a hard limit on the number of 'steps' (which correspond to the boxes shown on the job monitoring UI in execution graph).
Is there a way to increase this limit (and what is the limit)?
The solution we are using to get around this issue is to partition the data into smaller subsets first (say 50 subsets), and for each subset run another layer of partitioning pipelines to produce 20 subsets of each subset (so the end result is 1,000 subsets). It would be nice to avoid this extra layer, as it ends up creating 1 + 50 pipelines and incurs the extra cost of writing and reading the intermediate data.
Rather than using the Partition transform and introducing many steps in the pipeline, consider using either of the following approaches:
Many sinks support the option to specify the number of output shards. For example, TextIO has a withNumShards method. If you pass it 1000, it will produce 1,000 separate shards in the specified directory.
Use the shard number as a key, then use a GroupByKey plus a DoFn to write the results.
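The second approach can be sketched without any Dataflow dependency. The code below is a plain-Python simulation, not Dataflow SDK code: `assign_shard` plays the role of the keying DoFn, `group_by_key` stands in for GroupByKey, and the file naming scheme is an assumption chosen for the example.

```python
import random
from collections import defaultdict


def assign_shard(element, num_shards, rng):
    """Keying step: pair each element with a random shard number."""
    return rng.randrange(num_shards), element


def group_by_key(pairs):
    """Stand-in for GroupByKey: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def shard_filename(prefix, shard, num_shards):
    # Hypothetical naming scheme, e.g. out-00012-of-01000
    return f"{prefix}-{shard:05d}-of-{num_shards:05d}"


def partition(elements, num_shards, seed=None, prefix="out"):
    """Return {filename: elements}; a real DoFn would write each group out."""
    rng = random.Random(seed)
    pairs = (assign_shard(e, num_shards, rng) for e in elements)
    return {
        shard_filename(prefix, shard, num_shards): values
        for shard, values in group_by_key(pairs).items()
    }
```

The point of this shape is that the number of pipeline steps stays constant (key, group, write) no matter how many shards are requested, whereas Partition adds one output collection per shard.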

How can I reduce number of handshakes to a redis client?

I am writing an application which requires a very fast response time. I have some queries to Redis which require the intersection and union of multiple sets.
An example would be
((A union B) intersection C)
However, when I do it with the Java client, each query requires one more handshake, which increases my response time.
I was wondering if there is a way to do it in a single handshake. Lua scripting looks like a good option, but I'm not sure how it would work internally.
You can also use pipelining to reduce the RTT.
With a pipeline, you can send multiple commands to Redis at one time and read all the replies later.
A Redis Lua script is a blocking call which does everything at once inside the Redis server, so you will certainly avoid the network round trips, which is what you need.
Since it is blocking, you should always be aware of when to use it. If all those unions and intersections take 5 seconds in a production environment, the script blocks every other command waiting to be executed. Those commands may not get executed and may fail with a timeout error, which can make things worse.
So alternatively you can make partial Lua script calls for every 50 elements or so (derive an optimal number by trying out different values).
Also consider calling SUNION and SINTER with multiple sets at once instead of only 2 at a time, if it fits.
i.e.,
sunion set1 set2 set3...
instead of
sunionstore temp set1 set2
sunionstore temp temp set3 and so on.
Hope this helps.

Set num of output shard in Write.to(Sink) in dataflow

I have a custom sink extending FileBasedSink, to which I write by calling PCollection.apply(Write.to(MySink)) in Dataflow (very similar to XmlSink.java). However, it seems that by default simply calling Write.to always results in 3 output shards. Is there any way to define the number of output shards (like TextIO.Write.withNumShards) just in the custom sink class definition, or do I have to define another custom PTransform like TextIO.Write?
Unfortunately, right now FileBasedSink does not support specifying the number of shards.
In practice, the number of shards you get will be dependent on how the framework chooses to optimize the parts of the pipeline producing the collection you're writing, so there's essentially no control over that.
I've filed a JIRA issue for your request so you can subscribe to the status.
