Indexing and parsing XML files with Elasticsearch

I need to index multiple XML files under multiple directories into Elasticsearch and parse them into JSON format, possibly adding some tags. Can this be done with Elasticsearch and Logstash, and if so, how?
Thank you!

It is possible. Point Logstash at your XML files and use tagging to tag different files differently, which determines how Logstash handles them further down the road. Inside Logstash you can set up filters to add tags and other fields, and in the output section you can specify which events get added to which index in Elasticsearch.
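A minimal pipeline sketch along those lines, assuming the XML lives under /data/xml/ and that the paths, tag names, and index names are placeholders to adapt:

input {
  file {
    # read every XML file under the app1 directory and tag it
    path => ["/data/xml/app1/*.xml"]
    start_position => "beginning"
    tags => ["app1"]
    # treat everything between XML declarations as one event
    codec => multiline {
      pattern => "^<\?xml"
      negate => true
      what => "previous"
    }
  }
}

filter {
  # parse the XML body into a structured field
  xml {
    source => "message"
    target => "doc"
  }
  mutate { add_tag => ["xml-import"] }
}

output {
  # route tagged events to different indices
  if "app1" in [tags] {
    elasticsearch { hosts => ["localhost:9200"] index => "app1-xml" }
  } else {
    elasticsearch { hosts => ["localhost:9200"] index => "misc-xml" }
  }
}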

Related

kubeflow OutputPath/InputPath question when writing/reading multiple files

I have a data-fetch stage where I get multiple DataFrames and serialize them. I'm currently treating OutputPath as a directory: I create it if it doesn't exist, etc., and then serialize all the DataFrames into that path with a different name for each one.
In a subsequent pipeline stage (say, predict) I need to retrieve all those through InputPath.
Now, from the documentation it seems InputPath/OutputPath are meant to be used as a file. Does Kubeflow have any limitation if I use them as a directory?
The ComponentSpec's {inputPath: input_name} and {outputPath: output_name} placeholders and their Python analogs (input_name: InputPath()/output_name: OutputPath()) are designed to support both files/blobs and directories.
They are expected to provide the path for the input/output data, no matter whether the data is a blob/file or a directory.
The only limitation is that UX might not be able to preview such artifacts.
But the pipeline itself would work.
I have experimented with a trivial pipeline, and no issue is observed when InputPath/OutputPath is used as a directory.
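As a rough illustration (the function names, file names, and parquet serialization here are made up, not from the question), one component can create its OutputPath as a directory and a downstream component can simply list its InputPath:

from kfp.components import InputPath, OutputPath, create_component_from_func

def fetch_data(output_dir_path: OutputPath()):
    # write several DataFrames into the output path, treated as a directory
    import os
    import pandas as pd
    os.makedirs(output_dir_path, exist_ok=True)
    for name in ["a", "b"]:
        df = pd.DataFrame({"x": [1, 2, 3]})
        df.to_parquet(os.path.join(output_dir_path, name + ".parquet"))

def predict(input_dir_path: InputPath()):
    # read every DataFrame back from the input path, treated as a directory
    import os
    import pandas as pd
    for name in sorted(os.listdir(input_dir_path)):
        df = pd.read_parquet(os.path.join(input_dir_path, name))
        print(name, df.shape)

fetch_data_op = create_component_from_func(fetch_data, packages_to_install=["pandas", "pyarrow"])
predict_op = create_component_from_func(predict, packages_to_install=["pandas", "pyarrow"])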

Problems with Dockerbeats dashboard containerName field

I have Dockerbeat set up on a local cluster that is running the ELK stack and some other misc. Docker containers (all controlled via Kubernetes). I set up the dashboard from Ingensi (Ingensi dockerbeat Dashboard) for Kibana and ran into an issue with the containerNames field while setting up the graphs. Now, for context, my Docker containers have names like these:
k8s_dockerbeats.79c42f90_dockerbeats-796n9_default_472faa11-1b3a-11e6-8bf4-28924a2bffbf_2832ea88
k8s_POD.6d00e006_dockerbeats-796n9_default_472faa11-1b3a-11e6-8bf4-28924a2bffbf_3ddcfe44
(as well as supporting containers for kubernetes with similar container names)
When I set up the dashboard in Kibana, I get an issue where I see multiple containerNames from the same container. For example, instead of a single containerName output, I get the containerName split up into smaller segments:
k8s_dockerbeats
79c42f90_dockerbeats
796n9
28924a2bffbf_3ddcfe44
and so on...
I assume that the format of the container name is confusing the dashboard (maybe in the way that it parses the name information), and I could probably work around it by renaming every container to a more sensible name.
But before I do that, is there a way to configure the dashboard so that it reads in the entire container name string and does not break it up like above? (I'm assuming I'll have to dig into the .json files from the repository mentioned above.)
Thanks in advance if anyone answers this.
It sounds like the container name is being analyzed by Elasticsearch. You need to make sure that the container name field is marked as not_analyzed in the Elasticsearch index template. You can do this by installing the index template provided by Dockerbeat.
Marking the field as not_analyzed ensures that the data is not tokenized and that it gets indexed as-is. It will then only be searchable by specifying the exact string.
You will need to delete your current indexes after installing the new index template in order to change the mappings.
Install the provided index template:
curl -XPUT 'http://elasticsearch:9200/_template/dockerbeat' -d@dockerbeat.template.json
Delete the existing indexes:
curl -XDELETE 'http://elasticsearch:9200/dockerbeat-*'
You can view your current mappings by querying Elasticsearch:
curl http://elasticsearch:9200/dockerbeat-*/_mapping
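For reference, the relevant part of such a template marks the name field as not_analyzed; the exact field name and template contents depend on the Dockerbeat version, so treat this as an illustration rather than the shipped file:

{
  "template": "dockerbeat-*",
  "mappings": {
    "_default_": {
      "properties": {
        "containerName": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}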

Flume morphline interceptor for data cleaning

I have a simple structured input arriving in real time.
But it also has garbage in the values, such as '#' or hexadecimal characters in some places.
How can I use the Morphline Flume interceptor to clean the data?
My sink here will be HBase.
It sounds like you can use Morphlines on top of Flume for your requirements.
In general, Morphlines offer some basic functions for parsing and transforming data. On top of that, you can build your own Morphline command and then use it in your morphlines config.
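For example, a minimal morphlines config that strips literal '#' characters might look like this (the command chain and field names are assumptions to adapt to your data):

morphlines : [
  {
    id : cleanEvent
    importCommands : ["org.kitesdk.**"]
    commands : [
      # read each Flume event body as a line of text into the message field
      { readLine { charset : UTF-8 } }
      # remove literal '#' characters from the message field
      {
        findReplace {
          field : message
          pattern : "#"
          replacement : ""
          isRegex : false
        }
      }
    ]
  }
]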

Is there a way to send files AS-IS via Fluentd?

I'm trying to use Fluentd to aggregate log files from various servers. By default it parses the log lines in various ways (and I can see the value in doing that), but in my current situation I would like to send the files AS-IS, without parsing and without changing a thing.
I'm using the in_tail plugin with the following configurations:
<source>
type tail
format none
read_from_head true
path /path/to/logs/*.log
pos_file /path/to/logs/pos_file
tag mylog
</source>
And even this none format parses the logs. For example, the line
I am a line of log
gets parsed as
{"message":"I am a line of log"}
I guess the question is: Is there a way for it to send the tail content, without altering anything?
Thanks!
Well, all messages in Fluentd are handled as JSON objects, but what you could do is match with a file output (out_file) on the receiving end, which would basically just create log files there with the same content as the source.
http://docs.fluentd.org/articles/out_file
You could even "hack" it to output with the format csv and set the delimiter to a whitespace. That could also work...
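A sketch of the receiving side (the path is a placeholder); note that out_file's default format writes time, tag, and the JSON record on each line, so you would switch the format if you want only the raw message:

<match mylog>
  type file
  path /path/to/received/mylog
  # default output: time<TAB>tag<TAB>{"message":"..."}
  # e.g. "format csv" with "fields message", or "format single_value",
  # gets you closer to the original line
</match>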

Can the Sphinx Search Engine index a folder of xml files?

I have folders that contain xml that I need to index in Sphinx. I explored the xmlpipe2 driver, and my understanding is that it only reads xml generated from a script, i.e., one source. Is there a way to index a folder of xml files if I don't have the option of putting it all in a single xml file?
An xmlpipe2 script is just a script that outputs XML, which Sphinx then ingests.
It doesn't matter where that script gets the data that it outputs.
It could get it from other XML files: the script would just walk the folder structure, read all the files, and output XML.
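A rough sketch of such a script (the directory and schema fields are made up); it walks a folder of XML files and prints the xmlpipe2 stream that Sphinx ingests:

#!/usr/bin/env python
# Walk a folder of XML files and emit them as an xmlpipe2 document stream.
import os
from xml.sax.saxutils import escape

XML_DIR = "/path/to/xml/files"  # assumed location of the folder to index

print('<?xml version="1.0" encoding="utf-8"?>')
print('<sphinx:docset>')
print('<sphinx:schema>')
print('  <sphinx:field name="content"/>')
print('  <sphinx:attr name="filename" type="string"/>')
print('</sphinx:schema>')

doc_id = 0
for root, _dirs, files in os.walk(XML_DIR):
    for name in sorted(files):
        if not name.endswith(".xml"):
            continue
        doc_id += 1
        with open(os.path.join(root, name)) as f:
            content = f.read()
        print('<sphinx:document id="%d">' % doc_id)
        print('  <content>%s</content>' % escape(content))
        print('  <filename>%s</filename>' % escape(name))
        print('</sphinx:document>')

print('</sphinx:docset>')

In sphinx.conf, the source would then use type = xmlpipe2 with xmlpipe_command pointing at this script.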
