How do I give the channel directories (checkpoint & data dir) a date-based name dynamically in Flume?

I use a file channel as a backup in Flume without any sink, and it works correctly. Below is my working configuration, but how can I set the directory or file name dynamically? (I want to name the directories by date: when the date changes, a new directory should be created dynamically and the previous one should remain as a backup.)
# Name the components on this agent
a1.sources = r1
a1.channels = c1
# Describe/configure the source r1
a1.sources.r1.type = http
a1.sources.r1.port = 40441
a1.sources.r1.bind = X.X.X.X
a1.sources.r1.channels = c1
# Use a channel which buffers events in file
a1.channels.c1.type = file
a1.channels.c1.dataDirs = /data/disk11/flume/Test/dataDirs{%y%m%d}
a1.channels.c1.checkpointDir = /data/disk11/flume/Test/checkpointDir{%y%m%d}
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

Related

How to iterate over dates (days/hours/months) inside the data pipeline using Beam on Cloud Dataflow?

Greetings folks!
I am trying to load data from GCS to BigQuery using Cloud Dataflow.
The data inside the bucket is stored in the following structure:
"bucket_name/user_id/date/date_hour_user_id.csv"
For example: "my_bucket/user_1262/2021-01-02/2021-01-02_18_user_id.csv"
Say I have 5 users, for example ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"],
and I want to load one hour of data to BQ (for example hour = "18") for all clients over a range of one week.
I want to iterate over all clients to get the files with the 18 prefix. I wrote the code below, but the iteration
breaks the data pipeline: each time it moves from one client to another, the code runs a new pipeline.
import argparse
import datetime as dt

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(argv=None):
    mydate = ['2021-01-02 00:00:00', '2021-01-02 23:00:00']
    fmt = '%Y-%m-%d %H:%M:%S'
    hour = dt.timedelta(hours=1)
    day = dt.timedelta(days=1)
    start_time, end_time = [dt.datetime.strptime(d, fmt) for d in mydate]
    currdate = start_time
    cols = ['cols0', 'cols1']
    parser = argparse.ArgumentParser(description="User Input Data.")
    args, beam_args = parser.parse_known_args(argv)
    while currdate <= end_time:
        str_date = currdate.strftime('%Y-%m-%d')
        str_hour = '%02d' % (int(currdate.strftime('%H')))
        print("********WE ARE PROCESSING FILE ON DATE ---> %s HOUR --> %s" % (str_date, str_hour))
        user_list = ["user_1262", "user_1263", "user_1264", "user_1265", "user_1266"]
        for user_id in user_list:
            file_path_user = "gs://user_id/%s/%s/%s_%s_%s.csv" % (user_id, str_date, str_date, str_hour, user_id)
            with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
                input_data = p | 'ReadUserfile' >> beam.io.ReadFromText(file_path_user)
                decode = input_data | 'decodeData' >> beam.ParDo(de_code())
                clean_data = decode | 'clean_dt' >> beam.Filter(clea_data)
                writetobq....
        currdate += day

run()
You can continue to generate the list of input files in your pipeline creation script. However, instead of creating a new pipeline for each input file, put the file names into a list. Then make your pipeline begin with a Create transform reading that list, followed by a textio.ReadAllFromText transform. This will create a PCollection out of your list of files and then read from every file in that list.
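A hedged sketch of that pattern, reusing the bucket layout from the question (my_bucket); build_file_list and run_single_pipeline are hypothetical names introduced here for illustration, not Beam APIs:

import datetime as dt

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def build_file_list(start_time, end_time, str_hour, user_list):
    # Hypothetical helper: collect every expected GCS path for the chosen hour
    # across the date range, instead of opening a pipeline per file.
    paths = []
    day = dt.timedelta(days=1)
    currdate = start_time
    while currdate <= end_time:
        str_date = currdate.strftime('%Y-%m-%d')
        for user_id in user_list:
            paths.append("gs://my_bucket/%s/%s/%s_%s_%s.csv"
                         % (user_id, str_date, str_date, str_hour, user_id))
        currdate += day
    return paths

def run_single_pipeline(beam_args, file_list):
    # One pipeline for all files: Create turns the Python list into a
    # PCollection of file names, and ReadAllFromText reads every one of them.
    with beam.Pipeline(options=PipelineOptions(beam_args)) as p:
        lines = (p
                 | 'FileList' >> beam.Create(file_list)
                 | 'ReadAll' >> beam.io.ReadAllFromText())
        # ... apply the existing decodeData / clean_dt steps to `lines`
        # and write the result to BigQuery (writetobq) as before.

With a single pipeline, Dataflow parallelises the reads across the files itself, rather than launching one job per user and hour.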

twitter source to hive sink using flume

I am trying to connect a Twitter source to a Hive sink using Flume.
My property file is given below:
# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = k1
# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
#TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = xxx
TwitterAgent.sources.Twitter.consumerSecret = xxx
TwitterAgent.sources.Twitter.accessToken = xxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxx
TwitterAgent.sources.Twitter.keywords = kafka, flume, hadoop, hive
# Describing/Configuring the sink
TwitterAgent.sinks.k1.type = hive
TwitterAgent.sinks.k1.channel = MemChannel
TwitterAgent.sinks.k1.hive.metastore = thrift://xxxx:9083
TwitterAgent.sinks.k1.hive.database = sample
TwitterAgent.sinks.k1.hive.table = tweets_twitter
#TwitterAgent.sinks.k1.hive.partition = user_location
TwitterAgent.sinks.k1.useLocalTimeStamp = false
TwitterAgent.sinks.k1.round = true
TwitterAgent.sinks.k1.roundValue = 10
TwitterAgent.sinks.k1.roundUnit = minute
TwitterAgent.sinks.k1.serializer = DELIMITED
TwitterAgent.sinks.k1.serializer.delimiter = "\t"
TwitterAgent.sinks.k1.serializer.serdeSeparator = '\t'
#TwitterAgent.sinks.k1.serializer.fieldnames =user_friends_count,user_location,user_email
# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 100000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
TwitterAgent.channels.MemChannel.byteCapacity = 6912212
# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.k1.channel = MemChannel
I am not creating any database or table in Hive here. Do I need to create the database, table, partition column and field names before starting the agent?
If so, where should I get the schema of the Twitter streaming data?
I am starting the Flume agent using the command below:
bin/flume-ng agent --conf ./conf/ -f conf/twitter_hive.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent --classpath "/usr/hdp/2.6.3.0-235/hive-hcatalog/share/hcatalog/*":"/usr/hdp/2.6.3.0-235/hive/lib/*"
Where should I get the schema of the Twitter data to create the Hive tables referenced in the twitter_hive.conf property file?
HiveSink was introduced in version 1.6 and, as per the documentation, yes, the metastore, the database name and the table name are mandatory. The partition part, however, is optional, as Flume can create the missing partitions.
As for the schema for Twitter, it seems to be a problem others have also faced, and I found this link quite useful (you may have already come across it). It mentions some of the data structures available in Hive that you may need when working with data coming in JSON format. You may need to alter some bits and pieces for your scenario, but it should give you a good start.
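As a hedged, practical sketch (not from the linked answer): one way to see which fields the incoming JSON actually carries is to capture a single raw event from the source and list its keys before designing the Hive table. sample_tweet.json below is a hypothetical file name.

import json

# Hypothetical sample: one raw event captured from the Twitter source and
# saved locally before the Hive table is designed.
with open("sample_tweet.json") as f:
    tweet = json.load(f)

# Print the top-level field names and their Python types -- a starting point
# for choosing Hive column types (structs/maps for the nested parts).
for key, value in sorted(tweet.items()):
    print("%-30s %s" % (key, type(value).__name__))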
I hope this helps.

loading large files into hdfs using Flume (spool directory)

We copied a 150 MB CSV file into Flume's spool directory. When it gets loaded into HDFS, the file is split into much smaller files of around 80 KB each. Is there a way to load the file with Flume without it being split into smaller files? The smaller files generate more metadata inside the NameNode, so we need to avoid them.
My flume-ng configuration looks like this:
# Initialize agent's source, channel and sink
agent.sources = TwitterExampleDir
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
agent.sources.TwitterExampleDir.type = spooldir
agent.sources.TwitterExampleDir.spoolDir = /usr/local/flume/live
# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactionCapacity = 1000000
# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://info3s7:54310/spool5
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount=0
agent.sinks.flumeHDFS.hdfs.rollInterval=2000
agent.sinks.flumeHDFS.hdfs.rollSize = 0
agent.sinks.flumeHDFS.hdfs.batchSize =1000000
# never rollover based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# rollover file based on max time of 1 min
#agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
agent.sources.TwitterExampleDir.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
What you want is this:
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 10000000
agent.sinks.flumeHDFS.hdfs.batchSize = 10000
From the Flume documentation:
hdfs.rollSize: File size to trigger roll, in bytes (0: never roll based on file size)
In your example you use a rollInterval of 2000, which rolls the file over after 2000 seconds, resulting in small files.
Also note that batchSize reflects the number of events before the file is flushed to HDFS, not necessarily the number of events before the file is closed and a new one created. You'll want to set that to some value small enough to not time out writing a large file but large enough to avoid overhead of many requests to HDFS.

Apache Flume multiple agent

I have tested Apache Flume for transferring files from a local disk to HDFS. But if the source files come from multiple servers (transferring files from different servers' local disks to HDFS), can I run just one Flume instance and simply add more agents to flume-conf.properties?
If I can, how can I edit the following parameters in flume-conf.properties:
agent1.sources.spooldirSource1.spoolDir = ?(server1/path)
agent2.sources.spooldirSource2.spoolDir = ?(server2/path)
And also, how can I run Flume?
./flume-ng agent -n agent -c conf -f apache-flume-1.4.0-bin/conf/flume-conf.properties
can only run one agent. What about two or more?
Add as many sources as you need, but configure them to use the same channel, which will then feed the same sink. So it's something like this (note that this snippet is incomplete):
agent1.sources.spooldirSource1.spoolDir = server1/path
agent1.sources.spooldirSource1.channels = myMemoryChannel
agent1.sources.spooldirSource2.spoolDir = server2/path
agent1.sources.spooldirSource2.channels = myMemoryChannel
Using the same channel for two sources isn't good practice; you can easily get an OutOfMemory error on the channel (for a MemoryChannel) in this case.
It's better to use a separate channel for every source (within the same agent):
a1.sources = r1 r2
a1.sinks = k1 k2
a1.channels = c1 c2
a1.sources.r1.channels = c1
a1.sources.r2.channels = c2
and likewise bind sink k1 to channel c1 and sink k2 to channel c2.

Excel VBA list files from a URL

OK, how would I list all the files in a folder that is located on a remote server and that I have to access via a URL, e.g. http://domain.com/folder? Inside the folder there are a bunch of files that I would like to list in Excel. There are functions that do this when listing files in a folder on your local C:\ drive, but they don't work when trying to list files from a URL. I am not sure if this can be done! Thanks.
Map the server directory as a local drive on your machine (e.g. O:) and use a small VBA procedure, e.g.
Sub ServerDir()
    Dim Idx As Integer, FN As String, R As Range
    Set R = Selection
    Idx = 1
    ' URL = http://repository.XXXXX.com/content/dav/Repository/Users/Users-M/mike.d/mike.d-Public/
    ' mapped http://repository.XXXXX.com/content/dav/ to O:\
    FN = Dir("O:\Repository\Users\Users-M\mike.d\mike.d-Public\*.ppt")
    Do While FN <> ""
        R(Idx, 1) = FN
        FN = Dir()
        Idx = Idx + 1
    Loop
End Sub
