Telegraf config file - substrings and converting hexadecimal to int for ingestion - influxdb

I have incoming BLE beacon data from a gateway in the following format:
{"msg":"advData","gmac":"94A408B02508","obj":
[
{"type":32,"dmac":"AC233FE0784F","data1":"0201060303F1FF1716E2C56DB5DFFB48D2B060D0F5A71096E000000000C564","rssi":-45,"time":"2022-10-13 02:46:24"},
{"type":32,"dmac":"AC233FE078A1","data1":"0201060303F1FF1716E2C56DB5DFFB48D2B060D0F5A71096E000000000C564","rssi":-42,"time":"2022-10-13 02:46:26"}
]
}
and I want to extract the attributes gmac, dmac, and rssi, process the data1 attribute, and ingest these into InfluxDB via a Telegraf config file.
I can successfully ingest gmac, dmac, and rssi using the Telegraf config below:
## Data format to consume.
## Each data format has its own unique set of configuration options, read
## more about them here:
## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_INPUT.md
data_format = "json_v2"
tagexclude = ["topic"]

[[inputs.mqtt_consumer.json_v2]]
  measurement_name = "a"
  timestamp_path = "obj.#.time"
  timestamp_format = "unix"

  [[inputs.mqtt_consumer.json_v2.tag]]
    path = "gmac"
    rename = "g"

  [[inputs.mqtt_consumer.json_v2.tag]]
    path = "obj.#.dmac"
    rename = "d"

  [[inputs.mqtt_consumer.json_v2.field]]
    path = "obj.#.rssi"
    type = "int"
    rename = "r"
However, I'm not sure how to process the data1 attribute where I need to (1) extract characters 15 and 16 and convert this from a hexadecimal value to an integer, and (2) extract characters 13 and 14 and convert each hexadecimal value to an integer before combining them together as a float (character 13 is the whole number component, character 14 is the decimal component).
Can anybody provide some guidance here?
Many thanks!

Got it working thanks to some help over at the Influx Community. I've pasted the relevant section of the Telegraf config file below in case it is of help to anyone else:
data_format = "json_v2"
tagexclude = ["topic"]
[[inputs.mqtt_consumer.json_v2]]
measurement_name = "a"
timestamp_path = "obj.#.time"
timestamp_format = "unix"
[[inputs.mqtt_consumer.json_v2.tag]]
path = "gmac"
# g is gateway MAC address
rename = "g"
[[inputs.mqtt_consumer.json_v2.tag]]
path = "obj.#.dmac"
# d is beacon MAC address
rename = "d"
[[inputs.mqtt_consumer.json_v2.field]]
path = "obj.#.rssi"
type = "int"
# r is RSSI of beacon
rename = "r"
[[inputs.mqtt_consumer.json_v2.field]]
path = "obj.#.data1"
data_type = "string"
[[processors.starlark]]
namepass = ["a"]
source = '''
def apply(metric):
data1 = metric.fields.pop("data1")
tempWhole = int("0x" + data1[26:28], 0)
tempDecimal = int("0x" + data1[28:30], 0)
tempDecimal = tempDecimal / 100
# t is temperature of chip to two decimal points precision
metric.fields["t"] = tempWhole + tempDecimal
# b is battery level in mV
metric.fields["b"] = int("0x" + data1[30:34], 0)
return metric
'''
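To sanity-check the slicing and hex conversion outside Telegraf, here is a minimal plain-Python sketch of the same logic as the Starlark processor above. The data1 value is a made-up example (not one of the beacon frames shown earlier), padded so the interesting bytes land at the character offsets the processor uses:
# Minimal Python sketch of the Starlark conversion above.
# The data1 payload here is hypothetical, chosen so the numbers are easy to follow.
data1 = "0" * 26 + "19" + "2C" + "0BB8"

temp_whole = int("0x" + data1[26:28], 0)       # 0x19 -> 25
temp_decimal = int("0x" + data1[28:30], 0)     # 0x2C -> 44
temperature = temp_whole + temp_decimal / 100  # 25.44

battery_mv = int("0x" + data1[30:34], 0)       # 0x0BB8 -> 3000 mV

print(temperature, battery_mv)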

Related

influxdb/telegraf mqtt consumer parsing, value is not named correctly

Hi, I have this telegraf configuration:
[[inputs.mqtt_consumer]]
  servers = ["tcp://test_mosquitto_1:1883"]
  # data_format = "influx"
  username = "rasp"
  password = "XXXXY"
  topics = [
    "battery/#"
  ]
  data_format = "value"
  data_type = "float" # required

  [[inputs.mqtt_consumer.topic_parsing]]
    data_format = "value"
    data_type = "float"
    topic = "battery/+/+/temperature"
    measurement = "measurement/_/_/_"
    tags = "_/site/device_name/_"
    fields = "_/_/_/temperature"

  [[inputs.mqtt_consumer.topic_parsing]]
    data_format = "value"
    data_type = "int"
    topic = "battery/+/+/voltage"
    measurement = "measurement/_/_/_"
    tags = "_/site/device_name/_"
    fields = "_/_/_/voltage"
I'm pushing topics over MQTT to "battery/hamburg/devicename2312/temperature" and the payload is the temperature value. The location hamburg should be tagged (site) and the device name should be tagged as well. Everything works except that the value is not named correctly; see the InfluxDB log:
battery,device_name=101A14420210010,host=5cc0065d3907,site=hamburg,topic=battery/hamburg/101A14420210010/temperature value=23.35001,temperature="temperature" 1653991738177023790 telegraf_1 |
I now have "value" in my InfluxDB database, plus a field "temperature" (as a string) with the value "temperature". I just want Telegraf to save the value to "temperature".
Here you see the mqtt explorer view
After hours of googling and reading, it works now.
Here is the changed part of the config:
[[inputs.mqtt_consumer.topic_parsing]]
  data_format = "value"
  data_type = "float"
  topic = "battery/+/+/temperature"
  measurement = "measurement/_/_/_"
  tags = "_/site/device_name/field"
  fields = "_/_/_/temperature"

[[processors.pivot]]
  tag_key = "field"
  value_key = "value"
More information here:
https://www.influxdata.com/blog/pivot-mqtt-plugin/
Hi, it seems I currently have the same question, but I can't figure out the answer for my case. Could you please paste the whole mqtt consumer config, i.e. including the [[inputs.mqtt_consumer]] section?
Mine currently looks like this:
[[inputs.mqtt_consumer]]
  name_override = "chn0"
  servers = ["tcp://127.0.0.1:1883"]
  topics = [
    "vzlogger/data/chn0/raw/#"
  ]
  data_format = "json"
I tried to adapt your code to mine but I get a strange behavior.
[[inputs.mqtt_consumer]]
  servers = ["tcp://127.0.0.1:1883"]
  topics = [
    "vzlogger/data/chn0/raw"
  ]
  data_format = "value"
  data_type = "float"

  [[inputs.mqtt_consumer.topic_parsing]]
    topic = "vzlogger/+/chn0/+"
    measurement = "measurement/_/_/_"
    tags = "_/_/channel/_"
    fields = "_/_/_/chn0"

[[processors.pivot]]
  tag_key = "field"
  value_key = "value"
It creates a new measurement, which is not bad at all.
It still writes the value into the field/tag "value".
The field chn0 gets the value "raw".
In my first code snippet I just put each channel (I have three different ones) into a different measurement, but this is not a good solution from my point of view.
battery,device_name=....,host=....,site=hamburg,topic=battery/hamburg/101A14420210010/temperature value=23.35001,temperature="temperature" 1653991738177023790
[[inputs.mqtt_consumer.topic_parsing]]
  data_format = "value"
  data_type = "float"
  topic = "battery/+/+/temperature"
  measurement = "measurement/_/_/_"
  tags = "_/site/device_name/field"    <<<< "field" gets replaced with the actual name of the tag, which is "temperature" (the last segment of battery/hamburg/101A14420210010/temperature)
  fields = "_/_/_/temperature"

[[processors.pivot]]
  tag_key = "field"      <<<< use the "field" tag's value to rename the next value_key, which is called "value"
  value_key = "value"    <<<< replaces value=23.35001 in the output with temperature=23.35001

Using a single pattern to capture multiple values contained in a file in a Lua script

I have a text file that contains data in the format YEAR, CITY, COUNTRY, written as one YEAR, CITY, COUNTRY per line, e.g.:
1896, Athens, Greece
1900, Paris, France
Previously I was using the data hard-coded like this:
local data = {}
data[1] = { year = 1896, city = "Athens", country = "Greece" }
data[2] = { year = 1900, city = "Paris", country = "France" }
data[3] = { year = 1904, city = "St Louis", country = "USA" }
data[4] = { year = 1908, city = "London", country = "UK" }
data[5] = { year = 1912, city = "Stockholm", country = "Sweden" }
data[6] = { year = 1920, city = "Antwerp", country = "Netherlands" }
Now I need to read the lines from the file and get the values into the private knowledge base "local data = {}".
I can't figure out how to capture multiple values using a single pattern from the data in the file.
My code so far is:
local path = system.pathForFile( "olympicData.txt", system.ResourceDirectory )

-- Open the file handle
local file, errorString = io.open( path, "r" )

if not file then
    -- Error occurred; output the cause
    print( "File error: " .. errorString )
else
    -- Read each line of the file
    for line in file:lines() do
        local i, value = line:match("%d")
        table.insert(data, i)
    end
    -- Close the file
    io.close(file)
end

file = nil
Given that you read a line like
1896, Athens, Greece
You can simply obtain the desired values using captures.
https://www.lua.org/manual/5.3/manual.html#6.4.1
Captures: A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the
substrings of the subject string that match captures are stored
(captured) for future use. Captures are numbered according to their
left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the
part of the string matching "a*(.)%w(%s*)" is stored as the first
capture (and therefore has number 1); the character matching "." is
captured with number 2, and the part matching "%s*" has number 3.
As a special case, the empty capture () captures the current string
position (a number). For instance, if we apply the pattern "()aa()" on
the string "flaaap", there will be two captures: 3 and 5.
local example = "1896, Athens, Greece"
local year, city, country = example:match("(%d+), (%w+), (%w+)")
print(year, city, country)

Using AKSampleDescriptor

I am using an adapted AKSampler example, in which I try to use the Sforzando output of Fluid.sf3 melodicSounds. Sforzando creates .sfz files for each instrument, but they all point, for the global sample, to one huge .wav file.
In all the instrument .sfz files there is an offset and endpoint description for the part of the wave file to be used.
When I load the .sfz file I get a crash due to memory problems. It seems that for every defined region in the .sfz file the complete .wav file (140 MB) is loaded again.
Most likely, loading the sample file with AKSampleDescriptor as done in the AKSampler example ignores offset and endpoint (AKSampleDescriptor.startPoint and AKSampleDescriptor.endPoint) while reloading the complete .wav file.
Is there a way to load just the wanted start-to-end part of the sample file? The complete file has all the sample data for all the instruments. (I know and use polyphony that extracts only one instrument at a time and works fine, but this is for other use.)
Or, and that seems best to me, just load the file once and then have the sample descriptors point to the data in memory.
Good suggestions, Rob. I just ran into this one-giant-WAV issue myself, having never seen it before. I was also using Sforzando for conversion. I'll look into adding the necessary capabilities to AKSampler. In the meantime, it might be easier to write a program to cut up the one WAV file into smaller pieces and adjust the SFZ accordingly.
Here is some Python 2.7 code to do this, which I have used successfully with a Sforzando-converted sf2 soundfont. It might need changes to work for you--there is huge variability among sfz files--but at least it might help you get started. This code requires the PyDub library for manipulating WAV audio.
import os
import re
from pydub import AudioSegment

def stripComments(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith('/'):
            return " " # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

def updateSplitList(splitList, regionLabels, values):
    if len(values) > 3:
        start = int(values['offset'])
        length = int(values['end']) - start
        name = regionLabels.pop(0)
        splitList.add((name, start, length))
def lookupSplitName(splitList, offset, end):
    # splitList holds (name, start, length) tuples, so compare against start + length
    for (name, start, length) in splitList:
        if offset == start and end == start + length:
            return name
    return None
def outputGroupAndRegion(outputFile, splitList, values):
    if values.has_key('lokey') and values.has_key('hikey') and values.has_key('pitch_keycenter'):
        outputFile.write('<group> lokey=%s hikey=%s pitch_keycenter=%s\n' % (values['lokey'], values['hikey'], values['pitch_keycenter']))
    elif values.has_key('key') and values.has_key('pitch_keycenter'):
        outputFile.write('<group> key=%s pitch_keycenter=%s\n' % (values['key'], values['pitch_keycenter']))
    if len(values) > 3:
        outputFile.write(' <region> ')
        if values.has_key('lovel') and values.has_key('hivel'):
            outputFile.write('lovel=%s hivel=%s ' % (values['lovel'], values['hivel']))
        if values.has_key('tune'):
            outputFile.write('tune=%s ' % values['tune'])
        if values.has_key('volume'):
            outputFile.write('volume=%s ' % values['volume'])
        if values.has_key('offset'):
            outputFile.write('offset=0 ')
        if values.has_key('end'):
            outputFile.write('end=%d ' % (int(values['end']) - int(values['offset'])))
        if values.has_key('loop_mode'):
            outputFile.write('loop_mode=%s ' % values['loop_mode'])
        if values.has_key('loop_start'):
            outputFile.write('loop_start=%d ' % (int(values['loop_start']) - int(values['offset'])))
        if values.has_key('loop_end'):
            outputFile.write('loop_end=%d ' % (int(values['loop_end']) - int(values['offset'])))
        outputFile.write('sample=samples/%s' % lookupSplitName(splitList, int(values['offset']), int(values['end'])) + '.wav\n')

def process(inputFile, outputFile):
    # create a list of region labels
    regionLabels = list()
    for line in open(inputFile):
        if line.strip().startswith('region_label'):
            regionLabels.append(line.strip().split('=')[1])
    # read entire input SFZ file
    sfz = open(inputFile).read()
    # strip comments and create a mixed list of <header> tags and key=value pairs
    sfz_list = stripComments(sfz).split()
    inSection = "none"
    default_path = ""
    global_sample = None
    values = dict()
    splitList = set()
    # parse the input SFZ data and build up splitList
    for item in sfz_list:
        if item.startswith('<'):
            inSection = item
            updateSplitList(splitList, regionLabels, values)
            values.clear()
            continue
        elif item.find('=') < 0:
            #print 'unknown:', item
            continue
        key, value = item.split('=')
        if inSection == '<control>' and key == 'default_path':
            default_path = value.replace('\\', '/')
        elif inSection == '<global>' and key == 'sample':
            global_sample = value.replace('\\', '/')
        elif inSection == '<region>':
            values[key] = value
    # split the wav file
    bigWav = AudioSegment.from_wav(global_sample)
    #print "%d channels, %d bytes/sample, %d frames/sec" % (bigWav.channels, bigWav.sample_width, bigWav.frame_rate)
    frate = float(bigWav.frame_rate)
    for (name, start, length) in splitList:
        startMs = 1000 * start / frate
        endMs = 1000 * (start + length) / frate
        wav = bigWav[startMs : endMs]
        wavName = 'samples/' + name + '.wav'
        wav.export(wavName, format='wav')
    # parse the input SFZ data again and generate the output SFZ
    for item in sfz_list:
        if item.startswith('<'):
            inSection = item
            outputGroupAndRegion(outputFile, splitList, values)
            values.clear()
            continue
        elif item.find('=') < 0:
            #print 'unknown:', item
            continue
        key, value = item.split('=')
        if inSection == '<control>' and key == 'default_path':
            default_path = value.replace('\\', '/')
        elif inSection == '<global>' and key == 'sample':
            global_sample = value.replace('\\', '/')
        elif inSection == '<region>':
            values[key] = value

dirPath = '000'
fileNameList = os.listdir(dirPath)
for fileName in fileNameList:
    if fileName.endswith('.sfz'):
        inputFile = os.path.join(dirPath, fileName)
        outputFile = open(fileName, 'w')
        print fileName
        process(inputFile, outputFile)
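A note on running the script above (my reading of the driver code at the bottom, not something stated in the answer): it expects the Sforzando-converted .sfz files in a subdirectory named 000, writes the split WAVs into a samples/ subdirectory, and writes the rewritten .sfz files into the current directory. Since pydub's export does not create directories, something like this before the os.listdir(dirPath) loop avoids a failure on the first export:
# hypothetical guard, not part of the original answer
if not os.path.isdir('samples'):
    os.makedirs('samples')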

how to speed up "POS tag" with StanfordPOSTagger?

I wanted to extract noun phrases from tweets; the code follows. The problem is that it only processes 300 tweets at a time and takes 5 minutes. How can I speed it up?
By the way, some of the code was adapted from TextBlob.
I use the gate-EN-twitter model (https://gate.ac.uk/wiki/twitter-postagger.html) and the NLTK interface to the Stanford POS tagger to tag tweets.
from nltk.tag import StanfordPOSTagger
from nltk.tokenize import word_tokenize
import time, nltk

start_time = time.time()

CFG = {
    ('NNP', 'NNP'): 'NNP',
    ('NN', 'NN'): 'NNI',
    ('NNI', 'NN'): 'NNI',
    ('JJ', 'JJ'): 'JJ',
    ('JJ', 'NN'): 'NNI',
}

st = StanfordPOSTagger('/models/gate-EN-twitter.model', '/twitie_tagger/twitie_tag.jar', encoding='utf-8')

def _normalize_tags(chunk):
    '''Normalize the corpus tags.
    ("NN", "NN-PL", "NNS") -> "NN"
    '''
    ret = []
    for word, tag in chunk:
        if tag == 'NP-TL' or tag == 'NP':
            ret.append((word, 'NNP'))
            continue
        if tag.endswith('-TL'):
            ret.append((word, tag[:-3]))
            continue
        if tag.endswith('S'):
            ret.append((word, tag[:-1]))
            continue
        ret.append((word, tag))
    return ret

def noun_phrase_count(text):
    matches1 = []
    print('len(text)', len(text))
    for i in range(len(text)//1000):
        tokenized_text = word_tokenize(text[i*1000:i*1000+1000])
        classified_text = st.tag(tokenized_text)
        tags = _normalize_tags(classified_text)
        merge = True
        while merge:
            merge = False
            for x in range(0, len(tags) - 1):
                t1 = tags[x]
                t2 = tags[x + 1]
                key = t1[1], t2[1]
                value = CFG.get(key, '')
                if value:
                    merge = True
                    tags.pop(x)
                    tags.pop(x)
                    match = '%s %s' % (t1[0], t2[0])
                    pos = value
                    tags.insert(x, (match, pos))
                    break
        matches = [t[0] for t in tags if t[1] in ['NNP', 'NNI']]
        matches1 += matches
    print("--- %s seconds ---" % (time.time() - start_time))
    fdist = nltk.FreqDist(matches1)
    return [(tag, num) for (tag, num) in fdist.most_common()]

noun_phrase_count(tweets)
This looks like a duplicate of "Stanford POS tagger with GATE twitter model is slow", so you may find more info there.
Additionally, if there's any chance of stumbling upon identical inputs (tweets) twice (or more), you can consider a dictionary with the tweet (plain str) as key and the tagged result as value, so that when you encounter a tweet you first check whether it's already in your dict. If not, tag it and put it there (and if this route is viable, why not pickle/unpickle that dictionary so that debugging and subsequent runs of your code go faster as well).
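A minimal sketch of that caching idea (the cache file name and helper function are my own; it assumes the st tagger and word_tokenize from the question's code):
import os
import pickle

CACHE_FILE = 'tag_cache.pkl'  # hypothetical cache location

# load a previously pickled cache if one exists
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, 'rb') as f:
        tag_cache = pickle.load(f)
else:
    tag_cache = {}

def tag_cached(tweet):
    # only call the (slow) Stanford tagger for tweets we have not seen before
    if tweet not in tag_cache:
        tag_cache[tweet] = st.tag(word_tokenize(tweet))
    return tag_cache[tweet]

# ... after tagging, persist the cache for the next run
with open(CACHE_FILE, 'wb') as f:
    pickle.dump(tag_cache, f)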

gap characters corresponding to the missing residues in a protein sequence using biopython

I have extracted a sequence from a PDB file without the missing residues, but I need a sequence that has gaps (dash characters) in place of the missing residues. How can I do it?
Thank you
I did it manually, by parsing the ATOM records of the PDB file to get the existing residues and the REMARK 465 records to get the missing residues:
pname = '4g5j.pdb' # downloaded file from PDB
fin = open(pname, 'r')
content = fin.readlines()
fin.close()

res = []
mis_res = []
print('CHAIN A will be used for ALIGNMENT')
het_chain = 'A'

for i, line in enumerate(content):
    if line[0:4] == 'ATOM':
        split = [line[:6], line[6:11], line[12:16], line[17:20], line[21], line[22:26], line[30:38], line[38:46], line[46:54]]
        if split[4] != het_chain:
            continue
        res.append(int(split[5]))

for i, line in enumerate(content):
    if line[0:10] == 'REMARK 465':
        split = [line[:10], line[19], line[21:26]]
        if split[1] == het_chain:
            mis_res.append(int(split[2]))

resindexes = sorted(list(set(sorted(res))))
missed_resindexes = sorted(list(set(mis_res)))
missed_resindexes = [el for el in missed_resindexes if el not in resindexes]
all_indexes = sorted(resindexes + missed_resindexes)
print(len(all_indexes))

# here you should have your real sequence!
real_seq = 'GSMGEAPNQALLRILKETEFKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEYLIPQQG'
missed_seq = real_seq[:]
for i, el in enumerate(real_seq):
    if all_indexes[i] in missed_resindexes:
        print(i)
        missed_seq = missed_seq[:i] + '-' + missed_seq[i+1:]
print(missed_seq)
OUTPUT:
---GEAPNQALLRILKETEFKKIKVLGS----TVYKGLWIPEGEKVKIPVAIKE----------KEILDEAYVMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNWCVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAEGGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILEKGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERMHLPSPTDSNFYRALMDEEDMDDVVDADEY------
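For a Biopython-based alternative (my suggestion, not part of the answer above): as far as I know, SeqIO's "pdb-atom" parser fills numbering gaps caused by missing residues with 'X' characters, so something like the sketch below may get you close. Note that 'X' can also stand for genuinely unknown residues, so verify against your structures.
# hedged sketch, assuming the "pdb-atom" parser marks missing residues with 'X'
from Bio import SeqIO

for record in SeqIO.parse('4g5j.pdb', 'pdb-atom'):
    if record.id.endswith(':A'):  # chain A, as in the manual approach above
        print(str(record.seq).replace('X', '-'))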
