Loading a whole file from source into HDFS in Flume

How can I keep the original source filename when loading a file from a source into HDFS with Flume?
For example, the source file /usr/sample.txt should end up in HDFS as /tmp/sample.txt, not as something like flumeevents.23343.tmp.
How do I stop Flume from appending the timestamp and .tmp suffix? For example, in flumeevent.12334343.tmp I don't want the 12334343.tmp part.
How can I read a whole file at once with Flume?
How can I read a CSV file with Flume?

With the spooling directory source you need to enable a parameter that adds a header to each event; it is false by default:
agentname.sources.sourcename.fileHeader=true
The HDFS sink can then reference that header (for example %{file}, or %{basename} if basenameHeader is enabled) in its path or file prefix, so the file keeps its original name when it is pushed into HDFS. A configuration sketch follows below.
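A minimal sketch of such a configuration, assuming an agent named agent1 with a spooling-directory source src1 and an HDFS sink sink1 (all names and paths here are placeholders):
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /usr/flume/spool
agent1.sources.src1.fileHeader = true
agent1.sources.src1.basenameHeader = true
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp
agent1.sinks.sink1.hdfs.filePrefix = %{basename}
agent1.sinks.sink1.hdfs.fileType = DataStream
fileHeader puts the full source path into the file header and basenameHeader puts just the file name into the basename header; the sink's filePrefix can reference either. Note that the standard HDFS sink still appends a counter to the file name to avoid collisions, and the .tmp in-use suffix (hdfs.inUseSuffix) only marks a file that is still open; it disappears once the file is rolled and closed.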

Related

In Rails I want to read an Excel file from a live path like http://www.carsa.jp/admin/data.xlsx

I want to read an Excel file hosted at a live URL on another website.
When I hit that URL in the browser the file downloads, but in my Rails app it gives the error below:
No such file or directory # rb_sysopen - http://www.carsa.jp/admin/data.xlsx (Errno::ENOENT)
My Rails app code is below:
data = Roo::Excelx.new('http://www.carsa.jp/admin/data.xlsx')
header = data.row(1)
puts header
Note: if I download the file and place it within my application it works fine, but the requirement is to read it from the third-party website in a scheduled job, as in the script above.
data = Roo::Excelx.new('lib/data.xlsx')
header = data.row(1)
puts header
Try using Roo::Spreadsheet.open instead of Roo::Excelx.new. According to the Roo Readme:
Roo::Spreadsheet.open can accept both paths and File instances.
This should do the trick:
Roo::Spreadsheet.open('http://www.carsa.jp/admin/data.xlsx')

Watson Studio "Spark Environment" - how to increase `spark.driver.maxResultSize`?

I'm running a Spark job where I'm reading, manipulating and merging a lot of text files into a single file, but I'm hitting this issue:
Py4JJavaError: An error occurred while calling o8483.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 838 tasks (1025.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Is it possible to increase the size of spark.driver.maxResultSize?
Note: this question is about the WS Spark “Environments” NOT about Analytics Engine.
If you are using an "Analytics Engine" Spark cluster instance, you can increase the default value through the Ambari console. You can get the link and credentials for the Ambari console from the IAE instance page in console.bluemix.net. From the Ambari console, add a new property under
Spark2 -> "Custom spark2-defaults" -> Add property -> spark.driver.maxResultSize = 2GB
Make sure the spark.driver.maxResultSize value is less than the driver memory, which is set in
Spark2 -> "Advanced spark2-env" -> content -> SPARK_DRIVER_MEMORY
Another suggestion, if you are just trying to create a single CSV file and don't want to change Spark conf values because you don't know how large the final file will be, is to use a function like the one below, which uses HDFS getmerge to create a single CSV file, much as pandas would.
import os
import tempfile

def line_prepender(filename, line):
    # Helper assumed by the function below: prepend the CSV header line to a local file.
    with open(filename, 'r+') as f:
        content = f.read()
        f.seek(0, 0)
        f.write(line.rstrip('\r\n') + '\n' + content)

def writeSparkDFAsCSV_HDFS(spark_df, file_location, file_name, csv_sep=',', csv_quote='"'):
    """
    Write a large Spark dataframe as a single CSV file without running
    into memory issues while converting it to a pandas dataframe.
    It first writes the Spark df to a temp HDFS location and uses getmerge to create
    a single file. After adding a header, the merged file is moved to HDFS.
    Args:
        spark_df (Spark dataframe) : Data object to be written to file.
        file_location (String)     : HDFS directory to write the final file to.
        file_name (String)         : Name of the file to write.
        csv_sep (character)        : Field separator to use in the CSV file.
        csv_quote (character)      : Quote character to use in the CSV file.
    """
    # define temp and final paths
    file_path = os.path.join(file_location, file_name)
    temp_file_location = tempfile.NamedTemporaryFile().name
    temp_file_path = os.path.join(temp_file_location, file_name)

    print("Create directories")
    # create directories if they do not exist, both locally and in HDFS
    # (the ! lines are IPython shell magics, so this runs in a notebook cell)
    !mkdir $temp_file_location
    !hdfs dfs -mkdir $file_location
    !hdfs dfs -mkdir $temp_file_location

    # write the dataframe as part files to the temp HDFS location
    print("Write to temp hdfs location : {}".format("hdfs://" + temp_file_path))
    spark_df.write.csv("hdfs://" + temp_file_path, sep=csv_sep, quote=csv_quote)

    # merge the HDFS part files into a single local file at the same path
    print("Merge and put file at {}".format(temp_file_path))
    !hdfs dfs -getmerge $temp_file_path $temp_file_path

    # prepend the header to the merged file; remove the .crc checksum files
    # written by getmerge so the modified file can be uploaded without a checksum mismatch
    header = ",".join(spark_df.columns)
    !rm $temp_file_location/.*crc
    line_prepender(temp_file_path, header)

    # move the final file to HDFS
    !hdfs dfs -put -f $temp_file_path $file_path

    # clean up the temp locations, local and HDFS
    print("Cleanup..")
    !rm -rf $temp_file_location
    !hdfs dfs -rm -r $temp_file_location
    print("Done!")

contentType getting prefixed to data written by the HDFS sink

I am using the HDFS sink and writing to HDFS, but the payload I write to HDFS is prefixed with ?contentType "text/plain", although this is not in the payload.
Please let me know why this is being prefixed and how to remove it.
stream create --definition ":streaming --spring.cloud.stream.bindings.output.producer.headerMode=raw > myprocessor --spring.cloud.stream.bindings.output.content-type=text/plain --spring.cloud.stream.bindings.input.consumer.headerMode=raw|hdfs --spring.hadoop.fsUri=hdfs://127.0.0.1:50071 --hdfs.directory=/ws/sparkoutput --hdfs.file-name=sparkstream --hdfs.enable-sync=true --hdfs.flush-timeout=10000 --spring.cloud.stream.bindings.input.consumer.headerMode=raw --spring.cloud.stream.bindings.input.content-type=text/plain" --name sparkstream
If you expect the header mode for the hdfs sink's input to be raw, then you should make the output of myprocessor raw as well, i.e.
myprocessor --spring.cloud.stream.bindings.output.content-type=text/plain --spring.cloud.stream.bindings.input.consumer.headerMode=raw --spring.cloud.stream.bindings.output.producer.headerMode=raw
Or alternatively you should remove the header settings on hdfs (since the sink will just process the payload then).
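Applying the first option to the original definition, the full stream would look roughly like this (only the producer.headerMode property on myprocessor is new; everything else is copied from the question):
stream create --definition ":streaming --spring.cloud.stream.bindings.output.producer.headerMode=raw > myprocessor --spring.cloud.stream.bindings.output.producer.headerMode=raw --spring.cloud.stream.bindings.output.content-type=text/plain --spring.cloud.stream.bindings.input.consumer.headerMode=raw|hdfs --spring.hadoop.fsUri=hdfs://127.0.0.1:50071 --hdfs.directory=/ws/sparkoutput --hdfs.file-name=sparkstream --hdfs.enable-sync=true --hdfs.flush-timeout=10000 --spring.cloud.stream.bindings.input.consumer.headerMode=raw --spring.cloud.stream.bindings.input.content-type=text/plain" --name sparkstream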

Flume hdfs sink

I am trying to use Flume with HDFS as the sink. The file is being exported, but I want to customize the name of the output file. I am using the hdfs.filePrefix property for this, yet it always creates a file named FlumeData.timestamp.
Please paste your configuration.
I tried it and it did work.
My setting:
agent.sinks.flumeHDFS.hdfs.filePrefix = stackoverflow
and I get the expected result.
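For reference, a minimal sketch of the surrounding sink configuration, assuming an agent named agent with a sink named flumeHDFS bound to a channel named memoryChannel (names and paths are placeholders); the prefix only takes effect if the agent and sink names in the property match the rest of your configuration:
agent.sinks = flumeHDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.channel = memoryChannel
agent.sinks.flumeHDFS.hdfs.path = /tmp/flume
agent.sinks.flumeHDFS.hdfs.filePrefix = stackoverflow
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
Even with the prefix applied, the sink still appends a timestamp-based counter, so the result is something like stackoverflow.1526382849 rather than just stackoverflow.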

How to call an input file which is already in the package

In my Hadoop MapReduce application I have one input file. I want the input file to be picked up automatically when I execute the jar of my application. To do this I wrote a class that specifies the input, the output and the file itself, but where I reference the file I need to give its path. I used this code:
QueriesTest.class.getResourceAsStream("/src/main/resources/test")
but it is not working (it cannot read the input file from the generated jar), so I used this one:
URL url = this.getClass().getResource("/src/main/resources/test")
and here I get a problem with the URL. Please help me out. I am using Hadoop 0.21.
I'm not sure what you want to achieve with your resource loading (note that resources bundled in a jar are addressed by their classpath location, e.g. /test, not by their source-tree path /src/main/resources/test), but the usual way to add an input file is this:
Configuration conf = new Configuration();
Job job = new Job(conf);
Path in = new Path("YOUR_PATH_IN_HDFS");
FileInputFormat.addInputPath(job, in);
job.setInputFormatClass(TextInputFormat.class); // could be a sequencefile also
// set the other stuff
job.waitForCompletion(true);
Make sure your file resides in HDFS in that case.
