How to use Flume for data streaming between two directories/locations?

How can I use Flume to stream data between two directories? This is my agent configuration:
spool_dir.sources = src-1
spool_dir.channels = channel-1
spool_dir.sinks = sink-1
# source
spool_dir.sources.src-1.type = spooldir
spool_dir.sources.src-1.channels = channel-1
spool_dir.sources.src-1.spoolDir = /usr/lib/flume/source
#sink
spool_dir.sinks.sink-1.type = spooldir
spool_dir.sinks.sink-1.channels = channel-1
spool_dir.sinks.sink-1.spoolDir = /usr/lib/flume/sink
# Bind the source and sink to the channel
spool_dir.sources.src-1.channels = channel-1
spool_dir.sinks.sink-1.channel = channel-1

What I understand you're asking is that you want to monitor files arriving in one folder and copy them to another folder. If that's the case, your source looks good, but for the sink use file_roll instead of spooldir.
a1.channels = c1
a1.sinks = k1
a1.sinks.k1.type = file_roll
a1.sinks.k1.channel = c1
a1.sinks.k1.sink.directory = /var/log/flume
Basically, in Flume, Source and Sink are different interfaces. You choose spooldir as the source to say "I want to monitor this particular directory", but you then write to a file or to HDFS using one of the sinks. There is no spooldir sink; the closest is the file_roll sink, which may or may not work for you. Choose one of the Flume sinks as the target: http://flume.apache.org/FlumeUserGuide.html#flume-sinks
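For reference, a complete agent sketch that combines the spooldir source with a file_roll sink might look like the following (the channel type, capacity and roll interval are assumptions; the directories are the ones from the question):
spool_dir.sources = src-1
spool_dir.channels = channel-1
spool_dir.sinks = sink-1
# source: watch the input directory
spool_dir.sources.src-1.type = spooldir
spool_dir.sources.src-1.channels = channel-1
spool_dir.sources.src-1.spoolDir = /usr/lib/flume/source
# channel: simple in-memory channel (assumed)
spool_dir.channels.channel-1.type = memory
spool_dir.channels.channel-1.capacity = 10000
# sink: roll events into files in the output directory
spool_dir.sinks.sink-1.type = file_roll
spool_dir.sinks.sink-1.channel = channel-1
spool_dir.sinks.sink-1.sink.directory = /usr/lib/flume/sink
spool_dir.sinks.sink-1.sink.rollInterval = 30
The agent could then be started with something like (the config file name is assumed):
flume-ng agent --conf conf --conf-file spool_dir.conf --name spool_dir -Dflume.root.logger=INFO,console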

Is there a way to create an empty file at the end, after the output files are successfully emitted for each window, in Apache Beam?

I have a streaming pipeline that consumes events tagged with timestamps. All I want to do is batch them into FixedWindows of 5 minutes each and then write all events in a window into a single file (or multiple files, based on shards), with an empty file at the end (this file should be created only after all events in that window have been successfully emitted into files).
Basically, I would expect output like this:
|---window_1_output_file
|---window_1_empty_file (this file should be created only after window_1_output_file has been created).
The windowing strategy with triggering being used is as follows:
timestampedLines = timestampedLines.apply("FixedWindows", Window.<String>into(
        FixedWindows.of(Utilities.resolveDuration(options.getWindowDuration())))
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(options.getWindowElementCount())))
    .withAllowedLateness(Utilities.resolveDuration(options.getWindowLateness()))
    .accumulatingFiredPanes());
Is there a way to create this empty file at the end, after the output files have been successfully emitted for each window, in Apache Beam? And where should the logic to create this additional empty file be applied?
Thanks in advance.
Could you do something like so:
// Generate one element per window
PCollection<String> empties = p.apply(GenerateSequence.from(0)
        .withRate(1, Utilities.resolveDuration(options.getWindowDuration())))
    .apply(MapElements.into(TypeDescriptors.strings()).via(elm -> ""));

// Your actual PCollection data
PCollection<String> myActualData = p.apply...........

// Flatten the per-window empty elements together with your actual data
PCollection<String> myActualDataWithDataOnEveryWindow =
    PCollectionList.of(empties).and(myActualData).apply(Flatten.pCollections());
Once you have an element in every window, you can do what you were doing:
myActualDataWithDataOnEveryWindow.apply("FixedWindows", Window.<String>into(
        FixedWindows.of(Utilities.resolveDuration(options.getWindowDuration())))
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(options.getWindowElementCount())))
    .withAllowedLateness(Utilities.resolveDuration(options.getWindowLateness()))
    .accumulatingFiredPanes())
    .apply(FileIO........);
Is this pointing you somewhere useful? Or am I too lost? : )
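For the final write, the answer leaves the FileIO call open; one possible shape, shown with TextIO purely for illustration, would be the sketch below (windowedData stands for the windowed PCollection produced above, and the output prefix and shard count are assumptions, not part of the original answer):
// Sketch only: windowed text output, one shard per window/pane.
windowedData.apply(TextIO.write()
    .to("gs://my-bucket/output/events")   // hypothetical output prefix
    .withWindowedWrites()
    .withNumShards(1));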

Watson Studio "Spark Environment" - how to increase `spark.driver.maxResultSize`?

I'm running a spark job where I'm reading, manipulating and merging a lot of txt files into a single file, but I'm hitting this issue:
Py4JJavaError: An error occurred while calling o8483.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 838 tasks (1025.6 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Is it possible to increase the size of spark.driver.maxResultSize?
Note: this question is about the WS Spark “Environments” NOT about Analytics Engine.
You can increase the default value through the Ambari console if you are using an "Analytics Engine" Spark cluster instance. You can get the link and credentials for the Ambari console from the IAE instance in console.bluemix.net. From the Ambari console, add a new property under
Spark2 -> "Custom spark2-defaults" -> Add property -> spark.driver.maxResultSize = 2GB
Make sure the spark.driver.maxResultSize value is less than the driver memory, which is set under
Spark2 -> "Advanced spark2-env" -> content -> SPARK_DRIVER_MEMORY
Another suggestion, if you are just trying to create a single CSV file and don't want to change Spark conf values because you don't know how large the final file will be, is to use a function like the one below, which uses the hdfs getmerge command to create a single CSV file, much like pandas would.
import os
import tempfile

def writeSparkDFAsCSV_HDFS(spark_df, file_location, file_name, csv_sep=',', csv_quote='"'):
    """
    Write a large Spark dataframe as a CSV file without running into memory
    issues while converting to a pandas dataframe.
    It first writes the Spark df to a temp HDFS location and uses getmerge to
    create a single file. After adding a header, the merged file is moved to HDFS.

    Args:
        spark_df (Spark dataframe) : Data object to be written to file.
        file_location (String) : Directory location of the file.
        file_name (String) : Name of the file to write to.
        csv_sep (character) : Field separator to use in the CSV file.
        csv_quote (character) : Quote character to use in the CSV file.
    """
    # define temp and final paths
    file_path = os.path.join(file_location, file_name)
    temp_file_location = tempfile.NamedTemporaryFile().name
    temp_file_path = os.path.join(temp_file_location, file_name)

    print("Create directories")
    # create directories if they do not exist, in both local and HDFS
    !mkdir $temp_file_location
    !hdfs dfs -mkdir $file_location
    !hdfs dfs -mkdir $temp_file_location

    # write to temp hdfs location
    print("Write to temp hdfs location : {}".format("hdfs://" + temp_file_path))
    spark_df.write.csv("hdfs://" + temp_file_path, sep=csv_sep, quote=csv_quote)

    # merge the part files from hadoop into a single local file
    print("Merge and put file at {}".format(temp_file_path))
    !hdfs dfs -getmerge $temp_file_path $temp_file_path

    # Add header to the merged file (line_prepender writes the header line at the top)
    header = ",".join(spark_df.columns)
    !rm $temp_file_location/.*crc
    line_prepender(temp_file_path, header)

    # move the final file to hdfs
    !hdfs dfs -put -f $temp_file_path $file_path

    # cleanup temp locations
    print("Cleanup..")
    !rm -rf $temp_file_location
    !hdfs dfs -rm -r $temp_file_location
    print("Done!")

Local file for Google Speech

I followed this page:
https://cloud.google.com/speech/docs/getting-started
and I could reach the end of it without problems.
In the example though, the file
'uri':'gs://cloud-samples-tests/speech/brooklyn.flac'
is processed.
What if I want to process a local file? If that is not possible, how can I upload my .flac via the command line?
Thanks
You're now able to process a local file by specifying a local path instead of the google storage one:
gcloud ml speech recognize '/Users/xxx/cloud-samples-tests/speech/brooklyn.flac' \
    --language-code='en-US'
You can send this command by using the gcloud tool (https://cloud.google.com/speech-to-text/docs/quickstart-gcloud).
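If you would rather stay in Python than use the gcloud CLI, short local files can also be sent inline to the synchronous recognize call. This is only a sketch using the google-cloud-speech client; the local file name and the sample rate are assumptions:
import io

from google.cloud import speech

client = speech.SpeechClient()

# Read the local audio file and send its bytes inline (no bucket needed).
with io.open("brooklyn.flac", "rb") as f:  # hypothetical local copy of the sample file
    content = f.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
    sample_rate_hertz=16000,  # assumed; adjust to your file
    language_code="en-US",
)

response = client.recognize(config=config, audio=audio)
for result in response.results:
    print(result.alternatives[0].transcript)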
Solution found:
I created my own bucket (my_bucket_test) and uploaded the file there via:
gsutil cp speech.flac gs://my_bucket_test
If you don't want to create a bucket (it costs extra time and money), you can stream local files. The following code is copied directly from the Google Cloud docs:
def transcribe_streaming(stream_file):
    """Streams transcription of the given audio file."""
    import io
    from google.cloud import speech

    client = speech.SpeechClient()

    with io.open(stream_file, "rb") as audio_file:
        content = audio_file.read()

    # In practice, stream should be a generator yielding chunks of audio data.
    stream = [content]
    requests = (
        speech.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream
    )

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )
    streaming_config = speech.StreamingRecognitionConfig(config=config)

    # streaming_recognize returns a generator.
    responses = client.streaming_recognize(
        config=streaming_config,
        requests=requests,
    )

    for response in responses:
        # Once the transcription has settled, the first result will contain the
        # is_final result. The other results will be for subsequent portions of
        # the audio.
        for result in response.results:
            print("Finished: {}".format(result.is_final))
            print("Stability: {}".format(result.stability))
            alternatives = result.alternatives
            # The alternatives are ordered from most likely to least.
            for alternative in alternatives:
                print("Confidence: {}".format(alternative.confidence))
                print(u"Transcript: {}".format(alternative.transcript))
Here is the URL in case the package's function names change over time: https://cloud.google.com/speech-to-text/docs/streaming-recognize
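A hypothetical call, assuming a local 16 kHz LINEAR16 (PCM/WAV) recording to match the config above:
transcribe_streaming("recording.wav")  # placeholder path; the file must match the LINEAR16 / 16 kHz config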

Flume Subdirectory

How can I make the Flume Spooling Directory Source work with the subdirectories of a folder too?
My source folder has several other folders inside it, and I want my Flume agent to look into these subdirectories too for files to dump into the sink.
Is there any way to do it?
The Spooling Directory Source won't check any subdirectories unless you explicitly configure a separate source for each subdirectory, e.g.:
a1.channels = ch-1
a1.sources = src-1 src-sub-1 src-sub-2
a1.sources.src-1.type = spooldir
a1.sources.src-sub-1.type = spooldir
a1.sources.src-sub-2.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-sub-2.channels = ch-1
a1.sources.src-sub-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-sub-1.spoolDir = /var/log/apache/flumeSpool/subdir
a1.sources.src-sub-2.spoolDir = /var/log/apache/flumeSpool/secondSubdir
In the currently released version of Flume (1.6.0) there isn't a built-in way to do this; however, there is an issue being worked on to add it:
https://issues.apache.org/jira/browse/FLUME-1899
There is a patch available in the issue; it may or may not help you, depending on whether you're able to build a custom Flume deployable.
a1.sources.src-1.recursiveDirectorySearch=true
This makes the source check all subdirectories present in the spool directory (the recursiveDirectorySearch option comes from the fix for FLUME-1899, so it is only available in newer Flume releases that include it).
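Putting that together, a minimal single-source sketch might look like this (the paths reuse the example above; the memory channel is an assumption):
a1.sources = src-1
a1.channels = ch-1
a1.channels.ch-1.type = memory
a1.sources.src-1.type = spooldir
a1.sources.src-1.channels = ch-1
a1.sources.src-1.spoolDir = /var/log/apache/flumeSpool
a1.sources.src-1.recursiveDirectorySearch = true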

Load or Stress Testing Tool with URL Import Functionality

Can someone recommend a load testing tool which allows you to do at least one of the following:
a. replay IIS (7) logs to simulate a real live site's daily run;
b. import a CSV or equivalent list of URLs so we can achieve a similar thing as above, but at a URL level;
c. offer a .NET API so I can easily create simple tests from my list of URLs; that would also be a good way to go.
I do not really want to record my tests.
I think I can do (b) with WAPT, but I'd need to create an XML file manually; not too much grief, but I'm wondering if any tools cover these scenarios out of the box.
Visual Studio Test Edition would require some code to parse the file into a suitable test run.
It is a great load testing solution.
Our load testing service lets you write a very simple script using JavaScript to pull data out of a CSV file and then fetch those URLs. For example, the following code would pluck 10 random URLs from the CSV file and fetch them as part of a single session:
var c = browserMob.openHttpClient();
var csv = browserMob.getCSV("urls.csv");

browserMob.beginTransaction();
for (var i = 0; i < 10; i++) {
    browserMob.beginStep("Step 1");
    var url = csv.random().get("url");
    c.get(url);
    browserMob.endStep();
}
browserMob.endTransaction();
The CSV file itself needs to be a normal CSV file with the first row containing a header named "url". This script would be run repeatedly for each virtual user participating in a load test.
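For example, a tiny urls.csv might look like this (the URLs are placeholders):
url
http://www.example.com/
http://www.example.com/products
http://www.example.com/about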
We have support for the so-called 'uri format' in our open-source tool, Yandex.Tank. You simply put all your URIs into a file, one URI per line, then specify the headers in your load.ini like this:
[phantom]
address=example.org
rps_schedule=line(1, 1600, 2m)
headers = [Host: mts-maps.yandex.ru]
[Connection: close] [Bloody: yes]
ammo_file = ammo.uri
ammo.uri:
/
/index.html
/1/example.html
/2/example.html
