I am trying to use Flume to consume the Twitter Streaming API and index the tweets into my Elasticsearch. I set up my flume.conf to use com.cloudera.flume.source.TwitterSource as the Twitter source (with my dev tokens) and I use the default Elasticsearch sink.
I am able to get the tweets (I also save them into HDFS, and when I open the file I can see the tweets), but when I search in my Elasticsearch I get this response:
{
    _index: twitter-2014-02-14
    _type: tweet-rt
    _id: ilL5ZrBRSlqrZcsVUbnO-g
    _version: 1
    _score: 1
    _source: {
        @message: org.elasticsearch.common.xcontent.XContentBuilder@12da4409
        @timestamp: 2014-02-14T10:16:13.000Z
        @fields: {
            timestamp: 1392372973000
        }
    }
}
Here is an example of my Flume config:
# - ElasticSearch Sink
TwitterAgent.sinks.ES.type = elasticsearch
TwitterAgent.sinks.ES.channel = FileChannel
TwitterAgent.sinks.ES.hostNames = 192.168.10.100:9300
TwitterAgent.sinks.ES.indexName = twitter
TwitterAgent.sinks.ES.indexType = tweet-rt
TwitterAgent.sinks.ES.clusterName = testou
Do I have to add something else? I don't understand why ES cannot deserialize my tweet.
Any ideas?
Thank you.
This is weird. It's storing some form of identityHashCode of the XContentBuilder as the message, and it should not.
I think I'd recommend clearing out Flume and re-installing. I'd be concerned about classpath and JAR dependency issues.
What version of Flume?
For others who come across this error: this is a bug in the Flume Elasticsearch sink which has since been fixed. See https://issues.apache.org/jira/browse/FLUME-2126
If you are on a Flume version earlier than 1.6, you may want to cherry-pick the patch and build the sink against your version.
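If you need to do that, the rough workflow is something like the sketch below; the release tag and commit id are placeholders, so take the actual commit from the JIRA ticket.
git clone https://github.com/apache/flume.git
cd flume
git checkout <your-flume-release-tag>
git cherry-pick <commit-id-for-FLUME-2126>
mvn clean package -DskipTests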
My goal is to write transformed data from a MongoDB collection into Neo4j using Spark Structured Streaming. According to the Neo4j docs, this should be possible with the "Neo4j Connector for Apache Spark" version 4.1.2.
Batch queries work fine so far. However, with the example below I run into an error message:
spark-shell --packages org.mongodb.spark:mongo-spark-connector:10.0.2,org.neo4j:neo4j-connector-apache-spark_2.12:4.1.2_for_spark_3
val dfTxn = spark.readStream.format("mongodb")
  .option("spark.mongodb.connection.uri", "mongodb://<IP>:<PORT>")
  .option("spark.mongodb.database", "test")
  .option("spark.mongodb.collection", "txn")
  .option("spark.mongodb.read.readPreference.name", "primaryPreferred")
  .option("spark.mongodb.change.stream.publish.full.document.only", "true")
  .option("forceDeleteTempCheckpointLocation", "true").load()

val query = dfTxn.writeStream.format("org.neo4j.spark.DataSource")
  .option("url", "bolt://<IP>:<PORT>")
  .option("save.mode", "Append")
  .option("checkpointLocation", "/tmp/checkpoint/myCheckPoint")
  .option("labels", "Account")
  .option("node.keys", "txn_snd").start()
This gives me the following error message:
java.lang.UnsupportedOperationException: Data source org.neo4j.spark.DataSource does not support streamed writing
However, the connector is supposed to officially support streaming starting with version 4.x. Does anybody have an idea what I'm doing wrong?
Thanks in advance!
In case the connector doesn't support streaming writes, you can try something like the approach below.
You can leverage the foreachBatch() functionality of Spark Structured Streaming and write the data into Neo4j in batch mode.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch
def process_entry(batch_df, batch_id):
    # write each micro-batch with the connector's batch (non-streaming) writer
    batch_df.write.format("org.neo4j.spark.DataSource").mode("Append") \
        .option("url", "bolt://<IP>:<PORT>").option("labels", "Account").option("node.keys", "txn_snd").save()

query = df.writeStream.foreachBatch(process_entry).start()
Inside process_entry you can put your Neo4j writer logic and write the data into the database in batch mode.
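Note that foreachBatch hands you each micro-batch as a regular (non-streaming) DataFrame, so the connector's batch writer, which already works for you, can be reused as-is; if you also set checkpointLocation on the writeStream, Structured Streaming still handles checkpointing and recovery for the query.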
In my Flume agent I am collecting events from Kafka topics and I need to insert them into ES. However, I need to perform a digestion step first, so I need to write a custom sink that passes the data from the agent's channel to a Java digestion module (which I have already written).
Can anyone share a template of a custom sink that I can use as a reference? Flume's official website doesn't say much about this topic:
A custom sink’s class and its dependencies must be included in the agent’s classpath when starting the Flume agent. The type of the custom sink is its FQCN.
https://flume.apache.org/FlumeUserGuide.html#custom-sink
And once the custom sink is ready, how can I link the following three pieces to make the agent work:
custom sink
ingestion jar (java module to perform the ingestion process)
FlumeAgent.properties
Thank you for any feedback. I will keep adding information as I make progress on this task.
It sounds like you are trying to use Flume to receive events from Kafka (source) and forward them to ES (sink), with some data-processing logic you have already written.
With that understanding, I would suggest looking into Flume interceptors, which are responsible for altering/filtering events on the fly before they are passed on to the sink.
So all your business logic to alter the events can be implemented as a custom interceptor, and it should be configured on the Flume source.
For reference you can check out the source code of the native interceptors that ship with Flume. That should give you an idea of the Flume interceptor framework; a minimal skeleton is also sketched after the sample config below.
Here is the ES Sink source code
Sample Flume config
a1.sources = kafkaSource
a1.sinks = ES_Sink
a1.channels = channel1
a1.sources.kafkaSource.interceptors = i1
a1.sources.kafkaSource.interceptors.i1.type = org.apache.flume.interceptor.<Custom_Interceptor_name>$Builder
a1.sinks.ES_Sink.channel = channel1
a1.sinks.ES_Sink.type = elasticsearch
a1.sinks.ES_Sink.hostNames = 127.0.0.1:9200
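And here is the minimal interceptor skeleton mentioned above. Treat it as a sketch only: the package name, class name, and digest() method are placeholders for your own digestion module, not an existing API.
package com.example.flume;  // hypothetical package; use your own

import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class DigestInterceptor implements Interceptor {

    @Override
    public void initialize() {
        // set up your digestion module here if it needs initialization
    }

    @Override
    public Event intercept(Event event) {
        // replace the event body with its digested version
        event.setBody(digest(event.getBody()));
        return event;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        for (Event event : events) {
            intercept(event);
        }
        return events;
    }

    @Override
    public void close() {
        // release any resources held by the digestion module
    }

    // placeholder for a call into your existing Java digestion jar
    private byte[] digest(byte[] body) {
        return body;
    }

    // Flume instantiates interceptors through this nested Builder,
    // which is why the config references <Custom_Interceptor_name>$Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
            return new DigestInterceptor();
        }

        @Override
        public void configure(Context context) {
            // read any interceptor properties from the agent config here
        }
    }
}
Build this class along with your digestion jar, put both on the agent's classpath (for example under the agent's plugins.d directory, or via FLUME_CLASSPATH in flume-env.sh), and then point a1.sources.kafkaSource.interceptors.i1.type at com.example.flume.DigestInterceptor$Builder in FlumeAgent.properties.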
It looks like Base CRM has upgraded their API and replaced all of their endpoints/parameters.
Previously I was able to retrieve "Won" deals using this call:
session = BaseCrm::Session.new("<LEGACY_ACCESS_TOKEN>")
session.deals.all(stage: :won, sort_by: :last_activity, sort_order: :desc, page: 1)
This query recently started ignoring my parameters, yet it continued to respond with unfiltered data (that was fun when I realized that was happening).
The new syntax is:
client = BaseCRM::Client.new(access_token: "<YOUR_PERSONAL_ACCESS_TOKEN>")
client.deals.where(organization_id: google.id, hot: true)
yet this does not work:
client.deals.where(stage_name: :won)
client.deals.where(stage_name: "Won")
client.deals.where(stage_id: 8) # specified ID found in Base Docs for "Won"
etc.
I've looked into the most recent updates to the Base CRM Gem as well as the Base CRM API Docs but have not found a solution to searching by specific deal stage.
Has anyone had any luck with the new API and this kind of query?
Is there a way to use the legacy API?
I've left a message with Base, but I really need to fix this, you know, yesterday.
Thanks for your help!
ADDITIONAL INFO
The legacy API/gem responded with JSON, whereas the v2 API/gem responds with a BaseCRM::Deal object:
$ session.deals.find(123456)
#<BaseCRM::Deal
dropbox_email="dropbox@67890.deals.futuresimple.com",
name="Cool Deal Name",
owner_id=54321,
creator_id=65432,
value=2500,
estimated_close_date=nil,
last_activity_at="2016-04-21T02:29:43Z",
tags=[],
stage_id=84588,
contact_id=098765432,
custom_fields={:"Event Location"=>"New York, NY", :Source=>"Friend"},
last_stage_change_at="2016-04-21T02:08:20Z",
last_stage_change_by_id=559951,
created_at="2016-04-18T22:16:35Z",
id=123456,
updated_at="2016-04-21T02:08:20Z",
organization_id=nil,
hot=false,
currency="USD",
source_id=1466480,
loss_reason_id=nil
>
Check out stage_id. Is this a bug? According to the docs, stage_id should be an integer between 1 and 10.
I am kind of a newbie to Apache Flume. I have configured a single-tier agent with a sink group (load balancing) manually, and I would like to know how I can test the sink group's load balancing. Any ideas, folks?
You can define two different sinks and list them in a sink group as below:
agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = HDFS1 HDFS2
agent1.sinkgroups.g1.processor.type = load_balance
agent1.sinkgroups.g1.processor.backoff = true
agent1.sinkgroups.g1.processor.selector = round_robin
Here both of them are HDFS sinks.
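For example, the two sinks could be defined like this (the channel name and HDFS paths are just illustrative):
agent1.sinks = HDFS1 HDFS2
agent1.sinks.HDFS1.type = hdfs
agent1.sinks.HDFS1.channel = channel1
agent1.sinks.HDFS1.hdfs.path = hdfs://namenode/flume/sink1
agent1.sinks.HDFS2.type = hdfs
agent1.sinks.HDFS2.channel = channel1
agent1.sinks.HDFS2.hdfs.path = hdfs://namenode/flume/sink2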
You can set the processor selector (round_robin [default], random, or a custom selector), which defines how the load should be balanced between the two sinks.
When you run the agent, you can verify the load balancing by checking that data is written to both of the respective HDFS paths (sinks).
Two other optional parameters are backoff and selector.maxTimeOut.
You can refer to the Flume 1.6.0 User Guide for more info.
I would like to monitor Elasticsearch using Nagios.
Basically, I want to know if Elasticsearch is up.
I think I can use the Elasticsearch Cluster Health API (see here)
and use the 'status' I get back (green, yellow, or red), but I still don't know how to wire that into Nagios (Nagios is on one server and Elasticsearch is on another server).
Is there another way to do that?
EDIT:
I just found this: check_http_json. I think I'll try it.
After a while, I've managed to monitor Elasticsearch using NRPE.
I wanted to use the Elasticsearch Cluster Health API, but I couldn't call it from another machine due to security restrictions...
So, on the monitoring server I created a new service whose check_command is check_nrpe!check_elastic. Then, on the remote server where Elasticsearch runs, I edited the nrpe.cfg file with the following:
command[check_elastic]=/usr/local/nagios/libexec/check_http -H localhost -u /_cluster/health -p 9200 -w 2 -c 3 -s green
This is allowed, since the command is run from the remote server itself, so there are no security issues here...
It works!!!
I'll still try the check_http_json command that I posted in my question, but for now my solution is good enough.
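For completeness, the matching service definition on the Nagios monitoring server (mentioned above) might look roughly like this; the host name, description, and template are placeholders for your own setup:
define service {
    use                     generic-service
    host_name               es-server
    service_description     Elasticsearch Cluster Health
    check_command           check_nrpe!check_elastic
}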
After playing around with the suggestions in this post, I wrote a simple check_elasticsearch script. It returns the status as OK, WARNING, and CRITICAL corresponding to the "status" parameter in the cluster health response ("green", "yellow", and "red" respectively).
It also grabs all the other parameters from the health page and dumps them out in the standard Nagios format.
Enjoy!
Shameless plug: https://github.com/jersten/check-es
You can use it with ZenOSS/Nagios to monitor cluster health, data indices, and individual node heap usage.
You can use this cool Python script for monitoring your Elasticsearch cluster. It checks your IP:port for the Elasticsearch status. This script, and more Python scripts for monitoring Elasticsearch, can be found here.
#!/usr/bin/python
from nagioscheck import NagiosCheck, UsageError
from nagioscheck import PerformanceMetric, Status
import urllib2
import optparse

try:
    import json
except ImportError:
    import simplejson as json


class ESClusterHealthCheck(NagiosCheck):

    def __init__(self):
        NagiosCheck.__init__(self)
        self.add_option('H', 'host', 'host', 'The cluster to check')
        self.add_option('P', 'port', 'port', 'The ES port - defaults to 9200')

    def check(self, opts, args):
        host = opts.host
        port = int(opts.port or '9200')

        try:
            response = urllib2.urlopen(r'http://%s:%d/_cluster/health'
                                       % (host, port))
        except urllib2.HTTPError, e:
            raise Status('unknown', ("API failure", None,
                         "API failure:\n\n%s" % str(e)))
        except urllib2.URLError, e:
            raise Status('critical', (e.reason))

        response_body = response.read()

        try:
            es_cluster_health = json.loads(response_body)
        except ValueError:
            raise Status('unknown', ("API returned nonsense",))

        cluster_status = es_cluster_health['status'].lower()

        if cluster_status == 'red':
            raise Status("CRITICAL", "Cluster status is currently reporting as "
                         "Red")
        elif cluster_status == 'yellow':
            raise Status("WARNING", "Cluster status is currently reporting as "
                         "Yellow")
        else:
            raise Status("OK",
                         "Cluster status is currently reporting as Green")


if __name__ == "__main__":
    ESClusterHealthCheck().run()
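Assuming you save the script as something like check_es_cluster_health.py and make it executable, you can invoke it like any other Nagios plugin, for example:
./check_es_cluster_health.py -H <ES_HOST> -P 9200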
I wrote this a million years ago, and it might still be useful: https://github.com/radu-gheorghe/check-es
But it really depends on what you want to monitor. The above measures:
if Elasticsearch responds to HTTP
if the ingestion rate drops under the defined levels
if the total number of documents drops under the defined levels
But of course there's much more that might be interesting. From query time to JVM heap usage. We wrote a blog post about the most important ones here: https://sematext.com/blog/top-10-elasticsearch-metrics-to-watch/
Elasticsearch has APIs for all these, so you may be able to use a generic check_http_json to get the needed metrics. Alternatively, you may want to use something like Sematext Monitoring for Elasticsearch, which gets these metrics out of the box, then forward threshold/anomaly alerts to Nagios. (disclosure: I work for Sematext)