I am trying to write a huge amount of logs to HDFS. For that I am using Flume with HDFS as the sink and Avro as the source. What I need to do is serialize my logs with Avro and send them over the network to Flume. The source of the Flume agent is configured as:
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
Use Flume's RpcClient:
RpcClient client = RpcClientFactory.getDefaultInstance(host, 4141);
try {
    client.append(EventBuilder.withBody(message)); // message is the byte[] payload of the event
} finally {
    client.close();
}
I have one powerful machine (a remote machine), accessible through SSH. My data is stored on the remote machine.
I want to run computations on, and access data from, the remote machine. For this, I ran a dask-scheduler and a dask-worker on the remote machine. Then I started a Jupyter notebook on my laptop (the local machine) with client = Client('scheduler-ip:8786'), but it still refers to data on the local machine, not on the remote machine.
How do I refer to data on the remote machine from the notebook running on the local machine?
import dask.dataframe as dd
from dask.distributed import Client
client = Client('remote-ip:8786')
ddf = dd.read_csv(
    'remote-machine-file.csv',
    header=None,
    assume_missing=True,
    dtype=object,
)
It fails with
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-37-17d26dadb3a8> in <module>
----> 1 ddf = dd.read_csv('remote-machine-file.csv', header=None, assume_missing=True, dtype=object)
/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read(urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
735 storage_options=storage_options,
736 include_path_column=include_path_column,
--> 737 **kwargs,
738 )
739
/usr/local/conda/lib/python3.7/site-packages/dask/dataframe/io/csv.py in read_pandas(reader, urlpath, blocksize, lineterminator, compression, sample, sample_rows, enforce, assume_missing, storage_options, include_path_column, **kwargs)
520
521 # Infer compression from first path
--> 522 compression = infer_compression(paths[0])
523
524 if blocksize == "default":
IndexError: list index out of range
When using dask.dataframe with a distributed.Client, while the majority of the I/O is done by remote workers, dask does rely on the client machine being able to access the data for scheduling.
To run anything purely on the worker, you can always have the worker schedule the operation, e.g. with:
client = Client('remote-ip:8786')  # connect to the remote scheduler
# use the client to have the worker run the dask.dataframe command!
f = client.submit(dd.read_csv, 'remote-machine-file.csv')
# because the worker is holding a dask dataframe object, requesting
# the result brings the dask.dataframe object/metadata to the
# local client, while leaving the data on the remote machine
df = f.result()
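To confirm the round trip works, you can then use df as usual; a minimal usage sketch, assuming the snippet above has already run:

# these computations are scheduled on the remote worker, which can see the file
print(len(df))      # row count, computed remotely
print(df.head())    # only a small sample is transferred back to the local client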
Alternatively, you can partition the job yourself: e.g. if you have many files, read them into memory on the workers, and finally construct the dask dataframe locally with dask.dataframe.from_delayed:
import pandas as pd
import dask.dataframe as dd

files_on_remote = ['data/file_{}.csv'.format(i) for i in range(100)]
# have the workers read the data with pandas
futures = client.map(pd.read_csv, files_on_remote)
# use dask.dataframe.from_delayed to construct a dask.dataframe from the
# remote pandas objects
df = dd.from_delayed(futures)
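If you prefer lazy reads over eagerly materialising the pandas frames on the workers, the same idea can be written with dask.delayed; a sketch under the same assumed file layout:

import pandas as pd
import dask.dataframe as dd
from dask import delayed

# assumes the distributed Client created above is still the default scheduler
files_on_remote = ['data/file_{}.csv'.format(i) for i in range(100)]
# each file is read on a worker only when the resulting dataframe is computed
parts = [delayed(pd.read_csv)(path) for path in files_on_remote]
df = dd.from_delayed(parts)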
I know the enterprise way (Cloudera, for example): using CM (via a browser) or the Cloudera REST API, one can access monitoring and configuration facilities.
But how do I schedule (run and rerun) the Flume agents' lifecycle, and monitor their running/failure status, without CM? Are there such things in the Flume distribution?
Flume's JSON Reporting API can be used to monitor health and performance.
I tried adding the flume.monitoring.type/port properties when starting flume-ng, and it completely fits my needs.
Let's create a simple agent a1, for example, which listens on localhost:44444 and logs to the console as a sink:
# flume.conf
a1.sources = s1
a1.channels = c1
a1.sinks = d1
a1.sources.s1.channels = c1
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
a1.sinks.d1.channel = c1
a1.sinks.d1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100
a1.channels.c1.transactionCapacity = 10
Run it with the additional flume.monitoring.type and flume.monitoring.port parameters:
flume-ng agent -n a1 -c conf -f flume.conf -Dflume.root.logger=INFO,console -Dflume.monitoring.type=http -Dflume.monitoring.port=44123
And then monitor the output in a browser at http://localhost:44123/metrics:
{"CHANNEL.c1":{"ChannelCapacity":"100","ChannelFillPercentage":"0.0","Type":"CHANNEL","EventTakeSuccessCount":"570448","ChannelSize":"0","EventTakeAttemptCount":"570573","StartTime":"1567002601836","EventPutAttemptCount":"570449","EventPutSuccessCount":"570448","StopTime":"0"}}
Just try some load:
dd if=/dev/urandom count=1024 bs=1024 | base64 | nc localhost 44444
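The same endpoint can also be polled from a script, which makes it easy to plug into an external scheduler or alerting tool without CM. Below is a minimal sketch using only the Python standard library; the host, port and channel name match the example above:

import json
from urllib.request import urlopen

# read the metrics JSON exposed by -Dflume.monitoring.type=http
with urlopen('http://localhost:44123/metrics') as resp:
    metrics = json.load(resp)

channel = metrics['CHANNEL.c1']
fill = float(channel['ChannelFillPercentage'])
print('channel c1 is {:.1f}% full, {} events taken so far'.format(
    fill, channel['EventTakeSuccessCount']))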
I am new to FreeRADIUS. I do not understand why radiusd does not take the clients.conf configuration file into account.
Extract from the server logs:
including configuration file /etc/freeradius/clients.conf
...
radiusd: #### Loading Clients ####
 client localhost {
   ipaddr = 127.0.0.1
   require_message_authenticator = no
   secret = <<< secret >>>
   nas_type = "other"
   proto = "*"
   limit {
     max_connections = 16
     lifetime = 0
     idle_timeout = 30
   }
 }
 client localhost_ipv6 {
   ipv6addr = ::1
   require_message_authenticator = no
   secret = <<< secret >>>
   limit {
     max_connections = 16
     lifetime = 0
     idle_timeout = 30
   }
 }
And my clients.conf in /etc/freeradius/:
client dockernet
{
    ipaddr    = 172.17.0.0
    secret    = testing123
    netmask   = 24
    shortname = dockernet
}
OK, I am running FreeRADIUS with Docker.
I was modifying the wrong config file.
When FreeRADIUS starts up in debug mode, e.g.
radiusd -X
it prints out all the files it's reading. You need to run this to check that the file you are editing is the one actually being used.
Note that the configuration is often in different places depending on the installation.
Installed from source, the config is in /usr/local/etc/raddb or /etc/raddb. On RedHat/CentOS-based systems it's in /etc/raddb, and on Debian/Ubuntu systems it's in /etc/freeradius or /etc/freeradius/3.0.
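A quick way to see which of those locations actually exist on a given box (just a convenience check; adjust the list for your distribution):

ls -d /usr/local/etc/raddb /etc/raddb /etc/freeradius /etc/freeradius/3.0 2>/dev/null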
For more advanced use cases, the -d option can tell FreeRADIUS to read its configuration from a different location, e.g.
radiusd -X -d /opt/raddb
This problem often comes about from having two installations, e.g. one installed from packages, and then installing from source on the same system.
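In that situation the debug output is authoritative: it lists every configuration file the running server actually reads (the "including configuration file" lines quoted in the question), so you can filter it to confirm which clients.conf is really in use:

radiusd -X 2>&1 | grep 'including configuration file'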
[screenshots of the run output]
My Flume configuration file is as follows:
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/hadoop/flume-1.5.0-bin/log_exec_tail
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
And I start my Flume agent with the following script:
bin/flume-ng agent -n a1 -c conf -f conf/flume_log.conf -Dflume.root.logger=INFO,console
Question 1: the run result is as follows, but I don't know whether it ran successfully or not!
Question 2: the following instructions were also given, and I don't understand what this Flume test means:
NOTE: To test that the Flume agent is running properly, open a new terminal window and change directories to /home/horton/solutions/:
horton@ip:~$ cd /home/horton/solutions/
Run the following script, which writes log entries to nodemanager.log:
$ ./test_flume_log.sh
If successful, you should see new files in the /user/horton/flume_sink directory in HDFS
Stop the logagent Flume agent
As per your Flume configuration, whenever the file /home/hadoop/flume-1.5.0-bin/log_exec_tail is appended to, the exec source tails it and the new lines are written to the console by the logger sink.
So to test that it is working correctly:
1. Run the command bin/flume-ng agent -n a1 -c conf -f conf/flume_log.conf -Dflume.root.logger=INFO,console
2. Open another terminal and add a few lines to the file /home/hadoop/flume-1.5.0-bin/log_exec_tail (for example, with the command shown after this list)
3. Save it
4. Now check the terminal where you started the Flume command
5. You should see the newly added lines displayed
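For example, step 2 can be done straight from the second terminal instead of an editor (the path is taken from the configuration above):

echo "test event $(date)" >> /home/hadoop/flume-1.5.0-bin/log_exec_tail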
I am trying to give a normal text file to Flume as the source, with HDFS as the sink. The source, channel and sink show as registered and started, but nothing appears in the output directory on HDFS. I am new to Flume, can anyone help me through this?
The configuration in my Flume .conf file is:
agent12.sources = source1
agent12.channels = channel1
agent12.sinks = HDFS
agent12.sources.source1.type = exec
agent12.sources.source1.command = tail -F /usr/sap/sample.txt
agent12.sources.source1.channels = channel1
agent12.sinks.HDFS.channels = channel1
agent12.sinks.HDFS.type = hdfs
agent12.sinks.HDFS.hdfs.path= hdfs://172.18.36.248:50070:user/root/xz
agent12.channels.channel1.type =memory
agent12.channels.channel1.capacity = 1000
The agent is started using:
/usr/bin/flume-ng agent -n agent12 -c usr/lib//flume-ng/conf/sample.conf -f /usr/lib/flume-ng/conf/flume-conf.properties.template