I am learning Hadoop, machine learning and Spark. I have downloaded the Cloudera 5.7 QuickStart VM. I have also downloaded the examples as a zip file and copied them to the Cloudera VM. I am having trouble running the machine learning examples, or any of the other examples. I tried running the simple word count example, but it failed. Below are my steps and the error I get:
[cloudera@quickstart.cloudera] cd /spark-master/examples/src/main/python/ml
[cloudera@quickstart.cloudera] spark-submit
All examples I try to run fail with the below error.
Traceback (most recent call last):
File "/home/cloudera/training/spark-master/examples/src/main/python/ml/", line 23, in
from pyspark.sql import SparkSession
I did a search for the file pyspark.sql but I could only find the below file
cd /spark-master
find . -name pyspark.sql
Please advise on how I can resolve these errors so that I can run this example and speed up my learning of machine learning and big data.
The code for the word count example is below:
from __future__ import print_function

# $example on$
from pyspark.ml.feature import Word2Vec
# $example off$
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("Word2VecExample")\
        .getOrCreate()

    # $example on$
    # Input data: Each row is a bag of words from a sentence or document.
    documentDF = spark.createDataFrame([
        ("Hi I heard about Spark".split(" "), ),
        ("I wish Java could use case classes".split(" "), ),
        ("Logistic regression models are neat".split(" "), )
    ], ["text"])
    # Learn a mapping from words to Vectors.
    word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
    model = word2Vec.fit(documentDF)
    result = model.transform(documentDF)
    for feature in result.select("result").take(3):
        print(feature)
    # $example off$

line 23: spark = SparkSession\
SparkSession is new in Spark 2.0, and Cloudera only ships with Spark 1.6 by default. You can either download the examples from Spark 1.6 or install Spark 2.0 on Cloudera.
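To see which route applies, it helps to know which Spark version is installed (spark-submit --version prints it). In Spark 1.x the DataFrame entry point in pyspark.sql is SQLContext, and SparkSession replaces it from 2.0 on. The following is only a small sketch of that rule; sql_entry_point is a hypothetical helper name, not part of PySpark:

```python
def sql_entry_point(spark_version):
    """Name the pyspark.sql entry point available for a Spark version string."""
    major = int(spark_version.split(".")[0])
    # SparkSession was introduced in Spark 2.0; earlier releases use SQLContext.
    return "SparkSession" if major >= 2 else "SQLContext"


for version in ("1.6.0", "2.0.0"):
    print(version, "->", sql_entry_point(version))
```

On the stock Cloudera 5.7 VM this would point at SQLContext, which is why examples taken from the Spark master branch fail at the SparkSession import.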


Apache Flume agent not starting but not showing error

I'm attempting to run an Apache Flume agent from an AWS EC2 cluster but when I start the agent, it neither starts nor throws an obvious error.
I'm just starting with the simple example from Apache's documentation.
When I run:
ubuntu@ip-172-31-41-5:~/Flume$ ./bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console
The console output is the following:
Info: Sourcing environment configuration script /home/ubuntu/Flume/conf/
Info: Including HBASE libraries found via (/home/ubuntu/hbase-2.4.4/bin/hbase) for HBASE access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/lib/jvm/java-8-openjdk-amd64/bin/java -Xmx20m -Dflume.root.logger=DEBUG,console -Dflume.root.logger=DEBUG,console -cp '/home/ubuntu/Flume/conf:/home/ubuntu/Flume/lib/*:/home/ubuntu/hbase-2.4.4/conf:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar:/home/ubuntu/hbase-2.4.4:/home/ubuntu/hbase-2.4.4/lib/shaded-clients/hbase-shaded-client-2.4.4.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/audience-annotations-0.5.0.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/commons-logging-1.2.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/log4j-1.2.17.jar:/home/ubuntu/hbase-2.4.4/lib/client-facing-thirdparty/slf4j-api-1.7.30.jar:/home/ubuntu/hbase-2.4.4/conf:/lib/*' -Djava.library.path=:/usr/java/packages/lib/amd64:/usr/lib/x86_64-linux-gnu/jni:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu:/usr/lib/jni:/lib:/usr/lib org.apache.flume.node.Application --conf-file example.conf --name a1 ./bin/flume-ng agent --conf-file example.conf --name a1
The agent doesn't throw an error but never gets further than this. I have also tried some variations including --conf-file conf/example.conf
Flume and Java appear to be installed correctly:
ubuntu@ip-172-31-41-5:~/Flume$ ./bin/flume-ng version
Source code repository:
Revision: 1a15927e594fd0d05a59d804b90a9c31ec93f5e1
Compiled by rgoers on Sun Oct 16 14:44:15 MST 2022
From source with checksum bbbca682177262aac3a89defde369a37
ubuntu@ip-172-31-41-5:~/Flume$ java -version
openjdk version "11.0.17" 2022-10-18
OpenJDK Runtime Environment (build 11.0.17+8-post-Ubuntu-1ubuntu222.04)
OpenJDK 64-Bit Server VM (build 11.0.17+8-post-Ubuntu-1ubuntu222.04, mixed mode, sharing)
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
# If this file is placed at FLUME_CONF_DIR/, it will be sourced
# during Flume startup.
# Enviroment variables can be set here.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m"
# Let Flume write raw event data and configuration information to its log files for debugging
# purposes. Enabling these flags is not recommended in production,
# as it may result in logging sensitive user information or encryption secrets.
# export JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "
# Note that the Flume conf directory is always included in the classpath.
The only clue I have is in flume.log, which shows the following error. I've even copied example.conf into the main Flume directory, but it doesn't seem to make a difference.
03 Dec 2022 21:10:03,538 ERROR [main] (org.apache.flume.node.Application.main:506) - A fatal error occurred while running. Exception Unable to read file /home/ubuntu/Flume/example.conf
at org.apache.flume.node.FileConfigurationSource.<init>( ~[flume-ng-node-1.11.0.jar:1.11.0]
at org.apache.flume.node.FileConfigurationSourceFactory.createConfigurationSource( ~[flume-ng-node-1.11.0.jar:1.11.0]
at org.apache.flume.node.ConfigurationSourceFactory.getConfigurationSource( ~[flume-ng-node-1.11.0.jar:1.11.0]
at org.apache.flume.node.Application.main( ~[flume-ng-node-1.11.0.jar:1.11.0]
Caused by: java.nio.file.NoSuchFileException: /home/ubuntu/Flume/example.conf
at sun.nio.fs.UnixException.translateToIOException( ~[?:1.8.0_352]
at sun.nio.fs.UnixException.rethrowAsIOException( ~[?:1.8.0_352]
at sun.nio.fs.UnixException.rethrowAsIOException( ~[?:1.8.0_352]
at sun.nio.fs.UnixFileSystemProvider.newByteChannel( ~[?:1.8.0_352]
at java.nio.file.Files.newByteChannel( ~[?:1.8.0_352]
at java.nio.file.Files.newByteChannel( ~[?:1.8.0_352]
at java.nio.file.Files.readAllBytes( ~[?:1.8.0_352]
at org.apache.flume.node.FileConfigurationSource.<init>( ~[flume-ng-node-1.11.0.jar:1.11.0]
... 3 more
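The NoSuchFileException for /home/ubuntu/Flume/example.conf suggests that, at least for the run that produced this log entry, the relative --conf-file value was looked up in the directory the agent was started from and nothing was found there. The snippet below is only a diagnostic sketch in Python, not part of Flume: check_conf_file is a hypothetical helper, and the temporary directory merely stands in for ~/Flume with the file placed under conf/.

```python
import os
import tempfile


def check_conf_file(conf_file, working_dir):
    """Resolve conf_file against working_dir and report whether it can be read."""
    resolved = os.path.abspath(os.path.join(working_dir, conf_file))
    readable = os.path.isfile(resolved) and os.access(resolved, os.R_OK)
    return resolved, readable


# Throwaway directory standing in for the Flume home directory.
with tempfile.TemporaryDirectory() as flume_home:
    os.makedirs(os.path.join(flume_home, "conf"))
    with open(os.path.join(flume_home, "conf", "example.conf"), "w") as f:
        f.write("a1.sources = r1\n")
    results = {
        name: check_conf_file(name, flume_home)[1]
        for name in ("example.conf", "conf/example.conf")
    }

print(results)
```

Running the same check against the real working directory, with the exact value passed to --conf-file, shows whether the file is where the agent will look for it.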

Debugging AML Model Deployment

I have an ML model (trained locally) in Python. Previously the model was deployed to a Windows IIS server, where it works fine.
Now I am trying to deploy it as a service on Azure Container Instances (ACI) with 1 core and 1 GB of memory. I followed two Microsoft docs. The docs use the SDK for all the steps, but I am using the GUI in the Azure portal.
After registering the model, I created an entry script and a conda environment YAML file (see below), and uploaded both under "Custom deployment asset" (in the Deploy model area).
Unfortunately, after hitting deploy, the deployment state is stuck at Transitioning. Even after 4 hours the state remains the same, and there are no deployment logs either, so I am unable to find out what I am doing wrong.
NOTE: below is just an excerpt of the entry script
import os
import pandas as pd
import pickle
import re, json
import numpy as np
import sklearn

def init():
    global model
    global classes
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'randomForest50.pkl')
    # Load the pickled model once, when the service starts
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    # Map the predicted class index to its label
    classes = lambda x: ["F", "M"][x]

def run(data):
    try:
        namesList = json.loads(data)["data"]["names"]
        # preprocessing() is defined in the part of the script not shown here
        pred = list(map(classes, model.predict(preprocessing(namesList))))
        return str(pred[0])
    except Exception as e:
        error = str(e)
        return error
name: gender_prediction
dependencies:
  - python
  - numpy
  - scikit-learn
  - pip:
    - pandas
    - pickle
    - re
    - json
The issue was in the YAML file. The dependencies listed in the YAML must be packages that can actually be installed into the conda environment. I changed the file accordingly, and it worked.
Modified YAML file:
name: gender_prediction
dependencies:
  - python=3.7
  - numpy
  - scikit-learn
  - pip:
    - azureml-defaults
    - pandas
    - pickle4
    - regex
    - inference-schema[numpy-support]
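One plausible reason the first YAML could not be built is that pickle, re and json ship with Python itself, so they are not packages that pip can install. The sketch below flags such entries before a deployment is attempted. It is an illustration only: stdlib_entries is a hypothetical helper, sys.stdlib_module_names needs Python 3.10 or later, and the fallback set is a hand-picked assumption for older interpreters.

```python
import sys

# sys.stdlib_module_names exists on Python 3.10+; otherwise use a small hand-picked set.
STDLIB = getattr(sys, "stdlib_module_names",
                 frozenset({"pickle", "re", "json", "os", "sys"}))


def stdlib_entries(pip_packages):
    """Return the entries that name standard-library modules rather than installable packages."""
    flagged = []
    for entry in pip_packages:
        # Drop a version pin such as "numpy==1.21" before looking the name up.
        name = entry.split("==")[0].strip()
        if name in STDLIB:
            flagged.append(entry)
    return flagged


original_pip = ["pandas", "pickle", "re", "json"]
print(stdlib_entries(original_pip))
```

Entries flagged this way can simply be dropped from the pip section, since the corresponding imports in the entry script work without any installation.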

Dependency parse large text file with python

I am trying to parse a large txt file (about 2000 sentences). When I set the model_path, I get this message:
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
And also when I set the CLASSPATH to this file, another message comes out:
NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar!
Set the CLASSPATH environment variable.
Would you help me to solve it?
This is my code:
import nltk
from nltk.parse.stanford import StanfordDependencyParser
dependency_parser = StanfordDependencyParser( model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-parser.jar, see:
import os
os.environ['CLASSPATH'] = "stanford-corenlp-full-2018-10-05/*"
dependency_parser = StanfordDependencyParser( model_path="edu\stanford\lp\models\lexparser\englishPCFG.ser.gz")
NLTK was unable to find stanford-parser.jar! Set the CLASSPATH
environment variable.
For more information, on stanford-parser.jar, see:
os.environ['CLASSPATH'] = "stanford-corenlp-full-2018-10-05/stanford-parser-full-2018-10-17/stanford-parser.jar"
>>> dependency_parser = StanfordDependencyParser( model_path="stanford-corenlp-full-2018-10-05/stanford-parser-full-2018-10-17/edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
NLTK was unable to find stanford-parser-(\d+)(.(\d+))+-models.jar!
Set the CLASSPATH environment variable.
For more information, on stanford-parser-(\d+)(.(\d+))+-models.jar, see:
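Taken together, the two messages indicate that NLTK looks for two separate jars on CLASSPATH: stanford-parser.jar and a models jar whose name matches the pattern in the second message. CLASSPATH entries are separated by os.pathsep, so both can be listed at once. This is a minimal sketch; the directory and the models jar file name are assumptions about the unpacked download and need to be adjusted to the files actually on disk.

```python
import os
import re

# Assumed location of the unpacked parser distribution (hypothetical path).
parser_dir = "stanford-parser-full-2018-10-17"
parser_jar = os.path.join(parser_dir, "stanford-parser.jar")
# Assumed file name; use whichever *-models.jar is present in the directory.
models_jar = os.path.join(parser_dir, "stanford-parser-3.9.2-models.jar")

# Same shape as the pattern in the NLTK message, with the dots escaped.
models_pattern = re.compile(r"stanford-parser-(\d+)(\.(\d+))+-models\.jar")
name_ok = models_pattern.fullmatch(os.path.basename(models_jar)) is not None

# Put both jars on CLASSPATH, separated by the platform's path separator.
os.environ["CLASSPATH"] = os.pathsep.join([parser_jar, models_jar])
print(name_ok, os.environ["CLASSPATH"])
```

With both entries in place, the StanfordDependencyParser call from the question should at least get past the two jar lookups.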
You should get the new stanfordnlp dependency parser that is native to Python!
It will run slower on the CPU than GPU, but it still should run reasonably fast.
Just run pip install stanfordnlp to install.
import stanfordnlp
stanfordnlp.download('en') # This downloads the English models for the neural pipeline
nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
doc = nlp("Barack Obama was born in Hawaii. He was elected president in 2008.")
There is also a helpful command line tool:
python -m stanfordnlp.run_pipeline -l en example.txt
Full details here:

How to change the Python version in Azure Machine Learning sdk ContainerImage with CondaDependencies

I am trying to get my Faster R-CNN model into a container instance on ACI. For that I need my Docker image to have Python version 3.5.*. I specify that in my conda YAML file, but every time I spin up an instance and docker run -it *** /bin/bash into it, I see that it only has Python 3.6.7.
How can I get my Docker image to have Python version 3.5.*? I already tried conda-installing Python 3.5.2, but that didn't work: in the end the image still had only 3.6.7, not 3.5.2. (dfimage lets you see the Dockerfile from which the image was created.)
My yaml:
name: project_environment
- python=3.5.2
- pip:
- matplotlib
- opencv-python==
- azureml-core==1.0.6
- numpy
- cntk
- cython
- anaconda
Notebook cell:
from azureml.core.conda_dependencies import CondaDependencies
svmandss = CondaDependencies.create(python_version="3.5.2", pip_packages=[
"cython"], )
with open("fasterrcnn.yml","w") as f:
Another notebook cell with ContainerImage specifications.
image_config = ContainerImage.image_configuration(execution_script="",runtime="python",conda_file="./fasterrcnn.yml",dependencies=listdir("utils"),docker_file="./Dockerfile")
service = Webservice.deploy_from_model(workspace=ws,
models=[Model(workspace=ws, name='Faster-RCNN')],
For better readability see my GitHub issue:
Currently, the version of Python is fixed to what's in Azure ML's base image, when deploying the web service. We're investigating removing this limitation in future.
Since this is one of the top Google answers when searching for "azureml python version" I'm posting the answer here. The documentation is not very clear when it comes to this, but the following will work:
from azureml.core import Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
ws = Workspace.from_config()
# This is the important part
conda_dep = CondaDependencies(conda_dependencies_file_path="pipeline/environment.yml")
aml_run_config = RunConfiguration(conda_dependencies=conda_dep)
# Define compute target - must be preconfigured in the workspace
compute_target = ws.compute_targets['my-azureml-target']
aml_run_config.target = compute_target
from azureml.pipeline.steps import PythonScriptStep
script_source_dir = "./pipeline"
step_1_script = ""
step_1 = PythonScriptStep(
from azureml.pipeline.core import Pipeline
# Build the pipeline
pipeline1 = Pipeline(workspace=ws, steps=[step_1])
from azureml.core import Experiment
# Submit the pipeline to be run
pipeline_run1 = Experiment(ws, 'Test-pipeline').submit(pipeline1)
This assumes the following directory structure:
where is the file above, is the script you would like to run and environment.yml is the conda environment file - including the python version.
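To confirm that the version pinned in environment.yml actually took effect, a small check can be run at the top of the step script. This is a sketch only, assuming the pin is written as python=X.Y or python=X.Y.Z; pinned_python and interpreter_matches are illustrative names, not Azure ML SDK functions, and the YAML text is an inline stand-in for the real file.

```python
import re
import sys


def pinned_python(conda_yaml_text):
    """Return the pinned Python version as a tuple of ints, or None if it is not pinned."""
    match = re.search(r"^\s*-\s*python\s*=+\s*([\d.]+)\s*$",
                      conda_yaml_text, re.MULTILINE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split(".") if part)


def interpreter_matches(pin, version_info=sys.version_info):
    """True when the interpreter agrees with every component given in the pin."""
    return tuple(version_info[:len(pin)]) == pin


# Inline stand-in for the contents of environment.yml.
spec = """name: project_environment
dependencies:
  - python=3.5.2
  - pip:
    - numpy
"""

pin = pinned_python(spec)
print("pinned:", pin, "running:", tuple(sys.version_info[:3]),
      "match:", interpreter_matches(pin))
```

If the printed result reports a mismatch inside the run, the environment file was not the one used to build the image.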
I was able to change the Python version by registering the environment in Azure ML Workspace:
from azureml.core.environment import Environment, Workspace
environment = Environment.from_conda_specification(name='myenv', file_path='environment.yml')
environment.python.user_managed_dependencies = False
workspace = Workspace.from_config()
environment = environment.register(workspace=workspace)
env_build =
Then, configure the endpoint for publishing as follows:
from azureml.core.model import InferenceConfig
environment = Environment.get(workspace=workspace, name='myenv')
inference_config = InferenceConfig(
This is using Azure ML SDK 1.29.0. Perhaps this has already been fixed and the original method works as well, but I didn't test that.
This is no longer an issue for me. I found another way to get my code to work with python version 3.6.7.
This is however still an issue if you ask me. If in the future I do need python version 3.5 then there will not be a solution as of now.
You can still post an answer if you would like.

Issues with Flume HDFS sink from Twitter

I currently have this configuration in Flume :
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = YPTxqtRamIZ1bnJXYwGW
TwitterAgent.sources.Twitter.consumerSecret = Wjyw9714OBzao7dktH0csuTByk4iLG9Zu4ddtI6s0ho
TwitterAgent.sources.Twitter.accessToken = 2340010790-KhWiNLt63GuZ6QZNYuPMJtaMVjLFpiMP4A2v
TwitterAgent.sources.Twitter.accessTokenSecret = x1pVVuyxfvaTbPoKvXqh2r5xUA6tf9einoByLIL8rar
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics, bigdata, cloudera, data science, data scientiest, business intelligence, mapreduce, data warehouse, data warehousing, mahout, hbase, nosql, newsql, businessintelligence, cloudcomputing
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
The twitter app auth keys are correct.
And I keep getting this error in the flume log file:
ERROR org.apache.flume.SinkRunner
Unable to deliver event. Exception follows.
org.apache.flume.EventDeliveryException: java.lang.IllegalArgumentException: hadoop1
at org.apache.flume.sink.hdfs.HDFSEventSink.process(
at org.apache.flume.sink.DefaultSinkProcessor.process(
at org.apache.flume.SinkRunner$
Caused by: java.lang.IllegalArgumentException: hadoop1
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(
at org.apache.hadoop.hdfs.DFSClient.<init>(
at org.apache.hadoop.hdfs.DFSClient.<init>(
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(
at org.apache.hadoop.fs.FileSystem.createFileSystem(
at org.apache.hadoop.fs.FileSystem.access$200(
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(
at org.apache.hadoop.fs.FileSystem$Cache.get(
at org.apache.hadoop.fs.FileSystem.get(
at org.apache.hadoop.fs.Path.getFileSystem(
at org.apache.flume.sink.hdfs.BucketWriter$
at org.apache.flume.sink.hdfs.BucketWriter$
at org.apache.flume.sink.hdfs.BucketWriter$8$
at org.apache.flume.sink.hdfs.BucketWriter.runPrivileged(
at org.apache.flume.sink.hdfs.BucketWriter.access$800(
at org.apache.flume.sink.hdfs.BucketWriter$
at java.util.concurrent.FutureTask$Sync.innerRun(
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(
at java.util.concurrent.ThreadPoolExecutor$
... 1 more
Caused by: hadoop1
... 23 more
Does anyone here know why this happens and could explain it to me?
Thanks in advance.
According to the exception, the problem is that the host hadoop1 is unknown.
The hdfs.path in your Flume configuration file points at the host hadoop1, which must be reachable from the machine running the Flume agent. A bare machine name only works if the agent's machine can resolve it, for example because both machines are in the same domain or the name is listed in /etc/hosts. Otherwise, access HDFS using the IP address as set in core-site.xml.
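A quick way to test this from the machine that runs the agent is to ask the operating system to resolve the name. The snippet below is a minimal sketch using Python's standard socket module; resolve_host is a hypothetical helper, and the result for hadoop1 depends entirely on the local DNS and hosts configuration.

```python
import socket


def resolve_host(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unknown hosts.
        return None


for host in ("localhost", "hadoop1"):
    address = resolve_host(host)
    print(host, "->", address if address else "cannot be resolved from this machine")
```

If hadoop1 does not resolve, either add it to the name resolution of the agent's machine or put the NameNode's IP address into hdfs.path.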
