druid-kinesis-indexing-service giving unparseable characters - stream

I am switching our ingestion from Tranquility to the druid-kinesis-indexing-service. However, when I connect to the data, it shows lines of correct JSON sandwiched between unparseable characters, e.g.:
�0{"message": {"ex_json_key":1}�0,
�0{"message": {"ex_json_key":2}�0,
This means the parser cannot parse these lines correctly. I have tried fiddling with many of the input configurations in the supervisor spec, but they do not seem to make a difference. This is not an issue at all when using the same Kinesis stream with Tranquility. Would anyone know what the issue is here and/or how to fix it?
Thanks!
An abbreviated version of our supervisor spec is here:
{
    "type": "kinesis",
    "spec": {
        "dataSchema": {
            "dataSource": "new_source_kinesis",
            "metricsSpec": [],
            "granularitySpec": {
                "segmentGranularity": "hour",
                "queryGranularity": "minute",
                "rollup": true,
                "type": "uniform"
            },
            "dimensionsSpec": {
                "dimensions": [
                    "coln"
                ]
            },
            "timestampSpec": {
                "column": "timecol",
                "format": "auto"
            }
        },
        "ioConfig": {
            "stream": "stream_name",
            "inputFormat": {
                "type": "json",
                "flattenSpec": {
                    "useFieldDiscovery": true,
                    "fields": [
                        {
                            "type": "path",
                            "name": "coln",
                            "expr": "$.message.n"
                        }
                    ]
                }
            },
            "endpoint": "kinesis.us-east-1.amazonaws.com",
            "taskCount": 2
        },
        "tuningConfig": {
            "type": "kinesis",
            "reportParseExceptions": true,
            "logParseExceptions": true,
            "intermediatePersistPeriod": "PT10M",
            "maxRowsInMemory": 75000
        }
    }
}

We were able to solve this by following the deaggregation portion of the documentation:
https://druid.apache.org/docs/latest/development/extensions-core/kinesis-ingestion.html#deaggregation
Our steps were:
Set "deaggregate": true in the ioConfig portion of the supervisor spec (a sketch follows these steps).
Add amazon-kinesis-client 1.9.2 under the druid-kinesis-indexing-service extensions folder on the MiddleManagers/Coordinator:
sudo wget https://repo1.maven.org/maven2/com/amazonaws/amazon-kinesis-client/1.9.2/amazon-kinesis-client-1.9.2.jar -P /druid-0.18.1/extensions/druid-kinesis-indexing-service/
Remove the existing amazon-kinesis-client 1.13 from druid-0.18.1/lib:
sudo rm amazon-kinesis-client-1.13.3.jar
(Without this step we got the error: Caused by: java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: com/amazonaws/services/kinesis/model/Record)
Restart the MiddleManagers/Coordinator.
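For reference, a minimal sketch of the resulting ioConfig, mirroring the abbreviated spec above with the flag added (the flattenSpec is omitted here):

"ioConfig": {
    "stream": "stream_name",
    "inputFormat": { "type": "json" },
    "deaggregate": true,
    "endpoint": "kinesis.us-east-1.amazonaws.com",
    "taskCount": 2
}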

Related

special character conversion in bash output

Hi :P I'm a beginner here and I was doing a small project in bash to automatically create prewritten text files, when at the last moment I realized that characters like " were deleted when writing output to my file (example: "format_version" comes out as format_version). Please help me ('-')
python ID_1.py
python ID_2.py
echo "bash(...)"
ID1=$(cat link1.txt)
ID2=$(cat link2.txt)
echo "Name of your pack=$NP"
read NP
echo "This description=$DC"
read DC
cd storage/downloads
echo "{
("format_version"): 1,
"header": {
"name": "$NP",
"description": "$DC",
"uuid": "$ID1",
"version": [0, 0, 1]
},
"modules": [
{
"type": "resources",
"uuid": "$ID2",
"version": [0, 0, 1]
}
]
}" > manisfest.json

Override json file values with environment variables docker

Assume that I have a complex JSON file that is used to configure my project, like the one below:
{
    "apis": {
        "payment": {
            "base_url": "https://example.com/"
        },
        "order": {
            "base_url": "https://example.com/"
        }
    },
    "features": {
        "authentication": {
            "authProviders": true,
            "registration": false
        }
    },
    "availableLocales": [
        "en",
        "es"
    ]
}
With .Net there's a feature that allows us to override values based on environment variables: if I wanted to override the value of apis.payment.base_url, I could pass an environment variable APIS__PAYMENT__BASE_URL and the value would be replaced.
Since I'm currently not using .Net, are there any alternatives?
This is what I'm using right now, but it does not fit my needs:
FROM code as prepare-build

ENV JQ_VERSION=1.6

RUN wget --no-check-certificate \
    https://github.com/stedolan/jq/releases/download/jq-${JQ_VERSION}/jq-linux64 \
    -O /tmp/jq-linux64
RUN cp /tmp/jq-linux64 /usr/bin/jq
RUN chmod +x /usr/bin/jq

WORKDIR /code/public

RUN jq 'reduce path(recurse | scalars) as $p (.; setpath($p; "$" + ($p | join("_"))))' \
    ./configurations/settings.json > ./configurations/settings.temp.json && \
    yes | cp ./configurations/settings.temp.json ./configurations/settings.json

WORKDIR /code/deploy

RUN echo "#!/usr/bin/env sh" | tee -a /code/deploy/start.sh > /dev/null && \
    echo 'export EXISTING_VARS=$(printenv | awk -F= '\''{print $1}'\'' | sed '\''s/^/\$/g'\'' | paste -sd,);' | tee -a /code/deploy/start.sh > /dev/null && \
    echo 'for file in $CONFIGURATIONS_FOLDER;' | tee -a /code/deploy/start.sh > /dev/null && \
    echo 'do' | tee -a /code/deploy/start.sh > /dev/null && \
    echo '  cat $file | envsubst $EXISTING_VARS | tee $file' | tee -a /code/deploy/start.sh > /dev/null && \
    echo 'done' | tee -a /code/deploy/start.sh > /dev/null && \
    echo 'nginx -g '\''daemon off;'\''' | tee -a /code/deploy/start.sh > /dev/null

WORKDIR /code
This way I have a problem: I need to pass all the JSON paths as environment variables for the override to work correctly; if I don't, the variables are replaced with just their own path.
I think the best approach would be: read the environment variables and create a JSON file from their values, then override the existing JSON file with the values of the created one.
Does anyone have anything that could help me achieve this?
To summarize: to make it easy to identify which environment variables should be used, let's assume they will have a prefix of SETTINGS.
Example of how I would override values:

JSON PATH              EQUIVALENT ENVIRONMENT VARIABLE
APIS.PAYMENT.BASE_URL  SETTINGS__APIS__PAYMENT__BASE_URL
AVAILABLELOCALES[0]    SETTINGS__AVAILABLELOCALES__0
The task can be solved using jq. This version is robust against settings that do not match a path in the document.
Variables
SETTINGS__APIS__PAYMENT__BASE_URL=https://example2.com
SETTINGS__AVAILABLELOCALES__0=cs
SETTINGS__UNAVAILABLE__PATH=1
Code
jq 'def settings:
        def prepareVariables:
            [$ENV | to_entries[] | select(.key | startswith("SETTINGS__"))]  # select all variables that start with "SETTINGS__"
            | map(.key |= (. / "__" | map(tonumber? // .))[1:]);             # convert variable names to path arrays
        [paths(scalars) | [., map(ascii_upcase? // .)]]                      # collect all leaf paths from the input file, adding an uppercase copy
        | reduce .[] as $leafPath                                            # attach each leaf path to its corresponding setting
            (prepareVariables; map(select($leafPath[1] == .key) |= . + {path: $leafPath[0]}))
        | map(select(has("path")));                                          # drop settings for unknown paths
    . as $input |
    reduce settings[] as $setting                                            # apply new settings from variables to the input file
        ($input; setpath($setting["path"]; $setting["value"]))
' input.json
Output
{
    "apis": {
        "payment": {
            "base_url": "https://example2.com"
        },
        "order": {
            "base_url": "https://example.com/"
        }
    },
    "features": {
        "authentication": {
            "authProviders": true,
            "registration": false
        }
    },
    "availableLocales": [
        "cs",
        "es"
    ]
}
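For reference, one way to run it, assuming the filter above is saved as settings.jq (a hypothetical file name) and the variables listed earlier are exported:

export SETTINGS__APIS__PAYMENT__BASE_URL=https://example2.com
export SETTINGS__AVAILABLELOCALES__0=cs
jq -f settings.jq input.json > settings.new.json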
I'm a jq novice, and I'd be very interested in a better jq script, but here's one way to use environment variables to modify a settings.json file.
$ cat settings.json
{
    "apis": {
        "payment": {
            "base_url": "https://example.com/"
        },
        "order": {
            "base_url": "https://example.com/"
        }
    },
    "features": {
        "authentication": {
            "authProviders": true,
            "registration": false
        }
    },
    "availableLocales": [
        "en",
        "es"
    ]
}
$ printenv|grep SETTINGS__
SETTINGS__APIS__PAYMENT__BASE_URL=https://example2.com
SETTINGS__AVAILABLELOCALES__0=cs
$ jq -n '
  inputs as $i
  | [ $i
      | ..
      | keys_unsorted?
      | .[]
      | strings
    ]
  | unique as $allKeys
  |
  def fixCase:
    . as $w
    | reduce ($allKeys[] | select(length == ($w|length))) as $k
        (""; . + $k | match($w; "i").string)
  ;
  def envpaths:
    [
      $ENV
      | to_entries[]
      | select(.key | startswith("SETTINGS__"))
      | [ [ (.key | split("__"))[1:][]
            | if test("^[0-9]+$") then tonumber else fixCase end
          ],
          .value
        ]
    ]
  ;
  reduce envpaths[] as $p ($i; setpath($p[0]; $p[1]))' settings.json
# the output
{
    "apis": {
        "payment": {
            "base_url": "https://example2.com"
        },
        "order": {
            "base_url": "https://example.com/"
        }
    },
    "features": {
        "authentication": {
            "authProviders": true,
            "registration": false
        }
    },
    "availableLocales": [
        "cs",
        "es"
    ]
}
See it work on jqplay.org.

OPA/Rego execute function for each element of an array

I am new to OPA/Rego and I am trying to write a policy that checks whether an Azure Network Security Group contains all of the rules that I define in an array:
package sample

default compliant = false

toSet(arr) = {x | x := arr[_]}

checkProperty(rule, index, propertySingular, propertyPlural) = true {
    object.get(input.properties.securityRules[index].properties, propertySingular, "") == object.get(rule, propertySingular, "")
    count(toSet(object.get(input.properties.securityRules[index].properties, propertyPlural, [])) - toSet(object.get(rule, propertyPlural, []))) == 0
}

existRule(rule) = true {
    input.properties.securityRules[i].name == rule.name
    input.properties.securityRules[i].properties.provisioningState == rule.provisioningState
    input.properties.securityRules[i].properties.description == rule.description
    input.properties.securityRules[i].properties.protocol == rule.protocol
    checkProperty(rule, i, "sourcePortRange", "sourcePortRanges")
    checkProperty(rule, i, "destinationPortRange", "destinationPortRanges")
    checkProperty(rule, i, "sourceAddressPrefix", "sourceAddressPrefixes")
    checkProperty(rule, i, "destinationAddressPrefix", "destinationAddressPrefixes")
    input.properties.securityRules[i].properties.access == rule.access
    input.properties.securityRules[i].properties.priority == rule.priority
    input.properties.securityRules[i].properties.direction == rule.direction
}
compliant {
    rules := [
        {
            "name": "name1",
            "provisioningState": "Succeeded",
            "description": "description1",
            "protocol": "*",
            "sourcePortRange": "*",
            "destinationPortRange": "53",
            "destinationAddressPrefix": "*",
            "access": "Allow",
            "priority": 1,
            "direction": "Inbound",
            "sourceAddressPrefixes": [
                "xx.xx.xx.xx",
                "xx.xx.xx.xx",
                "xx.xx.xx.xx"
            ]
        },
        {
            "name": "name2",
            "provisioningState": "Succeeded",
            "description": "description2",
            "protocol": "*",
            "sourcePortRange": "*",
            "destinationPortRange": "54",
            "sourceAddressPrefix": "*",
            "access": "Allow",
            "priority": 2,
            "direction": "Outbound",
            "destinationAddressPrefixes": [
                "xx.xx.xx.xx",
                "xx.xx.xx.xx",
                "xx.xx.xx.xx"
            ]
        }
    ]

    # checks
    existRule(rules[i])
}
The issue seems to be that when existRule(rules[i]) is executed, it returns true if any one of the rules matches, no matter whether the others match or not.
If I replace existRule(rules[i]) with existRule(rules[0]) or existRule(rules[1]), it returns true or false depending on whether the rule in that position matches.
Is there any way to get the result of the execution of existRule(rules[i]) for all the elements of the array?
I already tried result := [existRule(rules[i])] but it returns only one element with true.
Sure! Use a list comprehension and call the function inside of it, then compare the size of the result with the size of the original array. Given your example, you would replace existRule(rules[i]) with something like this:
compliantRules := [rule |
    rule := rules[_]
    existRule(rule)
]
count(compliantRules) == count(rules)
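For illustration, a sketch (under the same assumptions as the question) of where that lands inside the compliant rule, with the rules array elided:

compliant {
    rules := [
        # ... the same rule objects as in the question ...
    ]

    compliantRules := [rule |
        rule := rules[_]
        existRule(rule)
    ]
    count(compliantRules) == count(rules)
}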

Is there a way to use OCR to extract specific data from a CAD technical drawing?

I'm trying to use OCR to extract only the base dimensions of a CAD model, but there are other associative dimensions that I don't need (like angles, length from baseline to hole, etc.). Here is an example of a technical drawing. (The numbers in red circles are the base dimensions; the rest, in purple highlights, are the ones to ignore.) How can I tell my program to extract only the base dimensions (the height, length, and width of a block before it goes through the CNC)?
The issue is that the drawings I get are not in a specific format, so I can't tell the OCR where the dimensions are; it has to figure that out contextually on its own.
Should I train the program through machine learning by running several iterations and correcting it? If so, what methods are there? The only thing I can think of is OpenCV cascade classifiers.
Or are there other methods for solving this problem?
Sorry for the long post. Thanks.
I feel you... it's a very tricky problem, and we spent the last 3 years finding a solution for it. Forgive me for mentioning our own solution, but it will certainly solve your problem: pip install werk24
from werk24 import Hook, W24AskVariantMeasures
from werk24.models.techread import W24TechreadMessage
from werk24.utils import w24_read_sync

from . import get_drawing_bytes  # define your own


def recv_measures(message: W24TechreadMessage) -> None:
    for cur_measure in message.payload_dict.get('measures'):
        print(cur_measure)


if __name__ == "__main__":
    # define what information you want to receive from the API
    # and what shall be done when the info is available.
    hooks = [Hook(ask=W24AskVariantMeasures(), function=recv_measures)]

    # submit the request to the Werk24 API
    w24_read_sync(get_drawing_bytes(), hooks)
In your example it would return, for instance, the following measure:
{
    "position": <STRIPPED>,
    "label": {
        "blurb": "ø30 H7 +0.0210/0",
        "quantity": 1,
        "size": {
            "blurb": "30",
            "size_type": "DIAMETER",
            "nominal_size": "30.0"
        },
        "unit": "MILLIMETER",
        "size_tolerance": {
            "toleration_type": "FIT_SIZE_ISO",
            "blurb": "H7",
            "deviation_lower": "0.0",
            "deviation_upper": "0.0210",
            "fundamental_deviation": "H",
            "tolerance_grade": {
                "grade": 7,
                "warnings": []
            },
            "thread": null,
            "chamfer": null,
            "depth": null,
            "test_dimension": null
        }
    },
    "warnings": [],
    "confidence": 0.98810
}
or, for a GD&T:
{
    "position": <STRIPPED>,
    "frame": {
        "blurb": "[⟂|0.05|A]",
        "characteristic": "⟂",
        "zone_shape": null,
        "zone_value": {
            "blurb": "0.05",
            "width_min": 0.05,
            "width_max": null,
            "extend_quantity": null,
            "extend_shape": null,
            "extend": null,
            "extend_angle": null
        },
        "zone_combinations": [],
        "zone_offset": null,
        "zone_constraint": null,
        "feature_filter": null,
        "feature_associated": null,
        "feature_derived": null,
        "reference_association": null,
        "reference_parameter": null,
        "material_condition": null,
        "state": null,
        "data": [
            {
                "blurb": "A"
            }
        ]
    }
}
Check the documentation on Werk24 for details.
Although it is a managed offering, Mixpeek is one free option:
pip install mixpeek
from mixpeek import Mixpeek

mix = Mixpeek(
    api_key="my-api-key"
)

mix.upload(file_name="design_spec.dwg", file_path="s3://design_spec_1.dwg")
This /upload endpoint will extract the contents of your DWG file; then, when you search for terms, the results will include the file_path so you can render it in your HTML.
Behind the scenes it uses the open source LibreDWG library to run a number of AutoCAD native commands such as DATAEXTRACTION.
Now you can search for a term and the relevant DWG file (in addition to the context in which it exists) will be returned:
mix.search(query="retainer", include_context=True)

[
    {
        "file_id": "6377c98b3c4f239f17663d79",
        "filename": "design_spec.dwg",
        "context": [
            {
                "texts": [
                    {
                        "type": "text",
                        "value": "DV-34-"
                    },
                    {
                        "type": "hit",
                        "value": "RETAINER"
                    },
                    {
                        "type": "text",
                        "value": "."
                    }
                ]
            }
        ],
        "importance": "100%",
        "static_file_url": "s3://design_spec_1.dwg"
    }
]
More documentation here: https://docs.mixpeek.com/

difference between detect-secrets and detect-secrets-hook results

I'm evaluating detect-secrets and I'm not sure why I get different results from detect-secrets and the hook.
Here is a log of a simplified reproduction:
$ cat docs/how-to-2.md
AZ_STORAGE_CS="DefaultEndpointsProtocol=https;AccountName=storageaccount1234;AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==;EndpointSuffix=core.windows.net"
$ detect-secrets scan --string $(cat docs/how-to-2.md)
AWSKeyDetector : False
ArtifactoryDetector : False
Base64HighEntropyString: True (5.367)
BasicAuthDetector : False
CloudantDetector : False
HexHighEntropyString : False
IbmCloudIamDetector : False
IbmCosHmacDetector : False
JwtTokenDetector : False
KeywordDetector : False
MailchimpDetector : False
PrivateKeyDetector : False
SlackDetector : False
SoftlayerDetector : False
StripeDetector : False
TwilioKeyDetector : False
$ detect-secrets-hook docs/how-to-2.md
$ detect-secrets-hook --baseline .secrets.baseline docs/how-to-2.md
I would have expected that detect-secrets-hook would tell me about that Azure storage account key that has high entropy.
More details about the baseline:
$ cat .secrets.baseline
{
    "custom_plugin_paths": [],
    "exclude": {
        "files": null,
        "lines": null
    },
    "generated_at": "2020-10-09T10:06:54Z",
    "plugins_used": [
        { "name": "AWSKeyDetector" },
        { "name": "ArtifactoryDetector" },
        { "base64_limit": 4.5, "name": "Base64HighEntropyString" },
        { "name": "BasicAuthDetector" },
        { "name": "CloudantDetector" },
        { "hex_limit": 3, "name": "HexHighEntropyString" },
        { "name": "IbmCloudIamDetector" },
        { "name": "IbmCosHmacDetector" },
        { "name": "JwtTokenDetector" },
        { "keyword_exclude": null, "name": "KeywordDetector" },
        { "name": "MailchimpDetector" },
        { "name": "PrivateKeyDetector" },
        { "name": "SlackDetector" },
        { "name": "SoftlayerDetector" },
        { "name": "StripeDetector" },
        { "name": "TwilioKeyDetector" }
    ],
    "results": {
        ".devcontainer/Dockerfile": [
            {
                ###obfuscated###
            }
        ],
        "deployment/export-sp.sh": [
            {
                ###obfuscated###
            }
        ],
        "docs/pip-install-from-artifacts-feeds.md": [
            {
                ###obfuscated###
            }
        ]
    },
    "version": "0.14.3",
    "word_list": {
        "file": null,
        "hash": null
    }
}
This is definitely peculiar behavior, but after some investigation, I realize that you've stumbled upon an edge case of the tool.
tl;dr
HighEntropyStringPlugin supports a limited set of characters (not including ;)
To reduce false positives, HighEntropyStringPlugin leverages the heuristic that strings are quoted in certain contexts.
To improve UI, inline string scanning does not require quoted strings.
Therefore, the functionality differs: when run through detect-secrets-hook, the quoted payload is not parsed as a secret because of the embedded ; characters. When run through detect-secrets scan --string, however, quotes are not required and the string is broken up into parseable pieces.
Detailed Explanation
HighEntropyString tests are pretty noisy if not aggressively pruned for false positives. One way the plugin attempts to do this is by applying a rather strict regex (source), which requires the candidate string to be inside quotes. However, in certain contexts this quoted requirement is removed (e.g. YAML files, and inline string scanning).
When this quoted requirement is removed, we get the following breakdown:
>>> line = 'AZ_STORAGE_CS="DefaultEndpointsProtocol=https;AccountName=storageaccount1234;AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==;EndpointSuffix=core.windows.net"'
>>> with self.non_quoted_string_regex(is_exact_match=False):
... self.regex.findall(line)
['AZ_STORAGE_CS=', 'DefaultEndpointsProtocol=https', 'AccountName=storageaccount1234', 'AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==', 'EndpointSuffix=core', 'windows', 'net']
When we do so, we can see that the AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A== would trigger the base64 plugin, as demonstrated below:
$ detect-secrets scan --string 'AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A=='
AWSKeyDetector : False
ArtifactoryDetector : False
Base64HighEntropyString: True (5.367)
BasicAuthDetector : False
CloudantDetector : False
HexHighEntropyString : False
IbmCloudIamDetector : False
IbmCosHmacDetector : False
JwtTokenDetector : False
KeywordDetector : False
MailchimpDetector : False
PrivateKeyDetector : False
SlackDetector : False
SoftlayerDetector : False
StripeDetector : False
TwilioKeyDetector : False
However, when this quoted requirement is applied, then the entire string payload is scanned as one potential secret: DefaultEndpointsProtocol=https;AccountName=storageaccount1234;AccountKey=1OM7c6u5Ocp/zyUMWcRChowzd8czZmxPhzHZ8o45X7tAryr6JFF79+zerFFQS34KzVTK0yadoZGkvZh42A==;EndpointSuffix=core.windows.net
This doesn't get flagged because it fails the original base64 regex rules, which don't know how to handle ;.
>>> self.regex.findall(line)
[]
Therefore the functionality differs, though this is not immediately apparent from the invocation pattern described.
How Do I Fix This?
This is a much more challenging question, since allowing other characters would change the entropy calculation and the probability of flagging strings. There has been some discussion around creating a plugin for all characters, but the team has not yet been able to decide on a default entropy limit for it.
