I am trying to set up Flume on an edge node. I have checked many blogs but haven't learned much, as most of them refer to a single-node cluster. Can someone tell me whether it is a good idea to set this up on an edge node, or whether it should go on a server where HDFS runs or on a worker node (DataNode)? If the edge node is right, what would the configuration look like?
As Viren suggested, in a production environment you should configure Flume on the edge node only. It's not that you can't do it on the NameNode server, but you should avoid that for performance reasons.
If this is a production environment, it's a good idea to avoid the NameNode server(s), Resource Manager server(s), journal nodes, and DataNodes. That leaves you with the edge node.
The process would be to:
1) Install the Hadoop client.
2) Install Flume.
3) Configure Flume in a flume.conf file (or whatever name you want to give it); a sample is sketched below, and you can find many more online.
4) Set the monitoring type to http for a quick check of performance data (see the start command below).
5) Open the ports for the sources and sinks.
6) Start the agent.
7) Check the agent log to see that all components started.
8) Try sending some sample data and check that it reaches the destination.
9) Debug any failures.
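For step 3, here is a minimal flume.conf sketch; the agent, source, channel, and sink names and the HDFS path are hypothetical placeholders you would replace with your own:

# Hypothetical layout: one netcat source -> memory channel -> HDFS sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: listens for test data on a TCP port (open this port, per step 5)
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: writes events to HDFS via the Hadoop client from step 1
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events
agent1.sinks.sink1.channel = ch1

For steps 4 and 6, the agent can then be started with HTTP monitoring enabled (the monitoring port is arbitrary):

flume-ng agent --name agent1 --conf conf --conf-file flume.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=34545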
Let me know if you need more information.
I am looking at how to get information on the number of consumers from a Kafka server running in a Docker container.
But I'll also take almost any info to help point me in a direction that is forward movement. I've been trying through Python and URI requests, but I'm getting the feeling I need to get back to Java to ask Kafka questions about its status?
In the threads I've seen, many handy scripts from $KAFKA_HOME are referenced, but the systems I have access to do not have $KAFKA_HOME defined, nor do they have the contents of that directory. My world is a Docker container without CLI access, so I haven't been able to apply the solutions requiring shell scripts or other tools from $KAFKA_HOME to my running system.
One of the things I have tried is a Python script using requests.get(uri...),
where the uri looks like:
http://localhost:9092/connectors/
The code looks like:
r = requests.get("http://%s:%s/connectors" % (config.parameters['kafkaServerIPAddress'], config.parameters['kafkaServerPort']))
currentConnectors = r.json()
So far I get a "nobody's home at that address" response.
I'm really stuck, and a pointer to something akin to a "Beginner's Guide to Getting Kafka Monitoring Information" would be great. Also, if there's a way to grab the helpful Kafka shell scripts & tools, that would be great too - where do they come from?
One last thing - I'm new enough to Kafka that I don't know what I don't know.
Thanks.
running in a Docker container
That shouldn't matter, but Confluent maintains a few pages that go over how to configure the containers for monitoring:
https://docs.confluent.io/platform/current/installation/docker/operations/monitoring.html
https://docs.confluent.io/platform/current/kafka/monitoring.html
number of consumers
Such a metric doesn't exist.
Python and URI requests
You appear to be using the /connectors endpoint of the Kafka Connect REST API (which runs on port 8083, not 9092). It is not a monitoring endpoint for brokers or non-Connect-API consumers.
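For illustration, a running Connect worker would answer on its own port, not the broker's (the hostname here is a placeholder):

# Kafka Connect REST API lives on the worker process, default port 8083
curl http://localhost:8083/connectors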
way to grab the helpful kafka shell scripts & tools
https://kafka.apache.org/downloads > Binary downloads
You don't need container shell access, but you will need external network access, just as all clients outside of a container would.
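As an illustration, once you have the binary download, the bundled scripts can be pointed at the broker from outside the container; the host and group name here are placeholders:

# list the consumer groups the broker knows about
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
# describe one group; each row with a CONSUMER-ID is an active member
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group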
A friend of mine and I are trying to develop a CorDapp for a financial use case. I can run the cordapp-tutorial and the demos; however, they only run on localhost.
We would like to create two "real" nodes. If I understood correctly, we should build two Corda nodes, with my PC as one node server and his PC as the other, but how can we actually connect them over the internet? On Slack I was told to enable dev mode, but how do you enable it?
We have a corda.jar and the nodea.conf, but the part I don't really understand from the documentation is:
"Each node server by default must have a node.conf file in the current working directory. After first execution of the node server there will be many other configuration and persistence files created in this workspace directory. The directory can be overridden by the --base-directory= command line argument."
What is meant by the working directory?
I've read this documentation: Corda Nodes.
Thanks to all. I think I will be asking a lot of questions in the near future :D
In Corda 3.1, you can use the network bootstrapper to create a dev-mode network of nodes running on two separate machines as follows:
Create the nodes by following the instructions here (e.g. by using gradlew deployNodes)
Navigate to the folder where the nodes were created (e.g. build/nodes)
Open the node.conf file of each node and change the localhost part of its p2pAddress to the IP address of the machine where the node will be run (e.g. p2pAddress="10.18.0.166:10007")
After making these changes, we need to redistribute the updated nodeInfo files to each node, so that they have the updated IP addresses for each node. Use the network bootstrapper tool to automatically update the files and have them distributed to each node:
java -jar network-bootstrapper.jar kotlin-source/build/nodes
Move the node folders to their individual machines (e.g. using a USB key). It is important that none of the nodes - including the notary - end up on more than one machine. Each computer should also have a copy of runnodes and runnodes.bat.
For example, you may end up with the following layout:
Machine 1: Notary, PartyA, runnodes, runnodes.bat
Machine 2: PartyB, PartyC, runnodes, runnodes.bat
After starting each node, the nodes will be able to see one another and agree ledger updates among themselves.
Warning
The bootstrapper must be run after the node.conf files have been modified, but before the nodes are distributed across machines. Otherwise, the nodes will not have the updated IP addresses for each node and will not be able to communicate.
Each of the nodes will have a node.conf file. To enable devMode add this line to the node.conf file.
devMode=true
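Putting the pieces together, a node.conf edited per the steps above might look roughly like this (the legal name, IP address, and ports are illustrative):

myLegalName="O=PartyA,L=London,C=GB"
p2pAddress="10.18.0.166:10007"
rpcSettings {
    address="localhost:10008"
    adminAddress="localhost:10048"
}
devMode=true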
I'm launching a Kafka connector in distributed mode in a local 'launch' Docker container (separate from the Kafka node container). The connector works as expected, but when I kill the launch container the connector stops working. I expected it to keep working, since I believed it to be registered and running on a worker on the Kafka node in a different container. My setup in more detail follows:
Currently I'm running everything through Docker containers locally. I have:
A Zookeeper node (3.4.9)
A Kafka node (Apache, 0.10.1.0)
A 'launch' node.
The launch node downloads the appropriate Kafka version and unzips its contents. It then builds the connector source, sets the classpath to include the necessary JARs, and executes the connector as follows:
connect-distributed.sh config/connect-distributed.properties
The distributed properties file sets the group id, the various topic names, schemas and converters and also the bootstrap servers (which point to the Kafka node (2) above).
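For reference, a minimal version of that properties file looks roughly like this (the host and topic names are placeholders):

# broker(s) the worker connects to (the Kafka node container)
bootstrap.servers=kafka:9092
# workers sharing this group.id form one Connect cluster
group.id=connect-cluster
# converters for message keys and values
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# internal topics where connector configs, offsets, and status are persisted
config.storage.topic=connect-configs
offset.storage.topic=connect-offsets
status.storage.topic=connect-status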
The connect-distributed.sh command seems to execute properly, with the RESTful connector HTTP service starting successfully. I can then issue a POST request to http://example:8083/connectors, supplying the configuration for the connector tasks. The command completes without error and the connector starts successfully. I can consume from a topic on the Kafka node (2) and see output indicating the connector is working and sending data through.
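The POST itself, for illustration, looks something like this (the connector name and class are hypothetical):

curl -X POST -H "Content-Type: application/json" http://example:8083/connectors -d '{"name": "my-connector", "config": {"connector.class": "com.example.MySourceConnector", "tasks.max": "1"}}'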
When I kill the launch node (3), I expect the connector to continue running, since I registered it with the Kafka cluster, albeit a cluster of one. Instead, the connector appears to die with the launch node. Isn't the connector supposed to be managed by a worker in the cluster now? Do I need to change how I'm launching the connector, or am I misunderstanding something?
Kafka Connectors do not execute on the Kafka brokers. They are executed in "Kafka Connect Worker" processes, which is what your question is calling "a 'launch' node". These processes accept REST requests for connectors and run the connectors in the worker processes. Under the hood, those processes are simply interacting with the Kafka brokers via normal producers and consumers. Kafka Connect is providing a framework on top of those clients to make it easy to build scalable connectors so connector developers only need to focus on how to pull or push data to the system the connector is written for. This means that processing only continues if at least one worker process is still alive.
There are two types of worker processes. In standalone mode, the connector configuration is not persisted anywhere -- you generally pass it in via the command line. Offset information (i.e. which data you've already copied) is maintained on the local filesystem. Therefore, in this mode, you can only assume you'll resume where you left off if you restart the process on the same node with access to the same filesystem.
In distributed mode, the workers coordinate to distribute the work and they share common, persistent storage (in Kafka) for connector configs, offsets, etc. This means that if you start up one instance and create a connector, shutting down that instance will halt all work. However, when you start an instance again, it will resume where it left off without re-submitting the connector configuration because that information has been persisted to Kafka. If you start multiple instances, they will coordinate to load balance the tasks between them, and if one instance fails (due to crash, elastically scaling down the # of instances you are running, power failure, etc), the remaining instances will redistribute the work automatically.
You can find more details about workers, the different types, and how failover works in distributed mode in Confluent's Kafka Connect documentation.
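As a sketch of the fix under this model: run a second worker with the same connect-distributed.properties (in particular the same group.id and bootstrap.servers) on another container or machine, and the connector survives the loss of either worker:

# on a second machine or container, start another worker from the same config
connect-distributed.sh config/connect-distributed.properties
# the workers now form one Connect cluster; if either dies, the survivor
# takes over the connector's tasks automatically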
I am trying to deploy multiple Usergrid containers on different machines and make them point to a Cassandra cluster, but I cannot find documentation about running multiple Usergrid nodes; I only found instructions about the Cassandra cluster.
Is this the right way to scale up my Usergrid services? Or what is the best practice for running multiple Usergrid nodes?
My understanding is that this is the correct way to go about it. You just need to deploy the ROOT.war file to a new Tomcat instance.
Docs for configuring the usergrid-deployment.properties file so that Usergrid knows where the Cassandra and Elasticsearch instances are, and then deploying to Tomcat, are steps 4 and 5 here: https://usergrid.apache.org/docs/installation/deployment-guide.html#deploying-the-usergrid-stack
You can also use the AWS cloudformation scripts in the repo to have AWS handle this for you (https://github.com/apache/usergrid/tree/master/deployment/aws)
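The key idea is that every Tomcat instance serves the same ROOT.war and has a usergrid-deployment.properties pointing at the shared Cassandra and Elasticsearch clusters. A rough sketch (the hosts are placeholders, and the property names should be verified against the sample file in the linked guide):

# shared Cassandra cluster; same values on every Tomcat node
cassandra.url=cass1:9160,cass2:9160,cass3:9160
# shared Elasticsearch cluster
elasticsearch.hosts=es1,es2
elasticsearch.port=9300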
There is no documented architecture for a scalable Usergrid deployment; you need to design your own based on your requirements. Some samples can be found on the internet, and this presentation helped me configure our Usergrid installation: http://events.linuxfoundation.org/sites/events/files/slides/Intro-To-Usergrid%20-%20ApacheCon%20EU%202014.pdf (pages 47-48).
And here is my deployment strategy: all the components (Tomcat, Cassandra, Elasticsearch) are Java applications, so putting them on the same machine will be expensive in RAM. So separate the layers and scale them independently. For example, if your application chokes on incoming user connections, scale up the Tomcat cluster (probably behind a load balancer). Spend time configuring Cassandra, and don't stick to the default values; your data will live there and you don't want to lose it.
I have just started working with DSE 4.8.4 on AWS EC2. I launched two m3.xlarge instances in the us-west-1a availability zone, so both nodes are in the same region and the same availability zone. This is a fresh installation with no user-defined keyspaces, no data, etc.
On both instances, DSE 4.8.4 was installed per the DataStax documentation. The dse service starts on both nodes individually with the default endpoint_snitch of "com.datastax.bdp.snitch.DseSimpleSnitch", and I have used private IP addresses for all address settings in both nodes' cassandra.yaml files.
Now, when I changed the 2nd node's cassandra.yaml seeds property to point to the 1st node's private IP address, the dse service on the 2nd node no longer starts; the error indicates "Unable to find gossip ...", with a hint to fix the snitch settings.
I looked around, and it seems the snitch should be Ec2Snitch.
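For reference, I believe the relevant cassandra.yaml settings on the 2nd node would end up looking something like this (the IPs stand in for my private addresses):

# 1st node's private IP as the seed
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1"
# this (2nd) node's own private IP
listen_address: 10.0.0.2
rpc_address: 10.0.0.2
endpoint_snitch: Ec2Snitch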
Q1) Do both nodes need to have the same snitch?
Q2) Will the cassandra-rackdc.properties file need any changes because of us-west-1a?
Q3) Should I follow the steps described in http://docs.datastax.com/en/cassandra/2.1/cassandra/initialize/initializeSingleDS.html ?
My objective is to build this 2-node cluster manually (I did not use OpsCenter), with suitable changes to the relevant config files.
I would really appreciate it if someone could point me in the right direction.