Schema registry on AWS - avro

I'm evaluating Kinesis as a replacement for Kafka. One thing I'm missing is an equivalent of Schema Registry. In particular I need:
schema upgrades - validate compatibility with the previous version
versioning of Avro schemas, similar to what Schema Registry does
What are the options for handling the two requirements above? The only thing I found was the Glue Catalog, but it doesn't seem to
In the end I also want to use Firehose (output to Redshift), but from what I understand this is not possible and will require writing a custom Lambda.

AWS just launched Schema Registry functionality for AWS Kinesis, https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html

https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sch-dis-ref.html is as far as it gets us. It is still not as complete or elegant as what, say, the Kafka ecosystem bundled by Confluent offers. I also have doubts about how schema compatibility works here.
This might deviate from your question, but would you consider MSK, or deploying Kafka on AWS yourself (although managing it is an overhead)?

AWS Glue provides a schema registry that can be used with the Avro data format.
At the moment, Glue only supports Java producers and consumers.
It is quite easy to use.
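
To make the "validate compatibility with the previous version" requirement from the question concrete, here is a minimal sketch of one rule that Avro-style backward compatibility checks rely on: a new (reader) schema can only read data written with the old schema if every field it adds carries a default. This is a deliberate simplification, not the Glue or Confluent implementation; it ignores type promotion, aliases, unions and nested records, and the `Order` schemas are hypothetical.

```python
import json


def missing_defaults(old_schema, new_schema):
    """Return names of fields added in new_schema that have no default.

    Such fields prevent a reader using new_schema from decoding
    records that were written with old_schema.
    """
    old_fields = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_fields and "default" not in field
    ]


def is_backward_compatible(old_schema, new_schema):
    """Simplified check: only looks at top-level record fields."""
    return not missing_defaults(old_schema, new_schema)


# Hypothetical schemas, for illustration only.
v1 = json.loads("""
{"type": "record", "name": "Order",
 "fields": [{"name": "id", "type": "string"}]}
""")

v2_ok = json.loads("""
{"type": "record", "name": "Order",
 "fields": [{"name": "id", "type": "string"},
            {"name": "note", "type": ["null", "string"], "default": null}]}
""")

v2_bad = json.loads("""
{"type": "record", "name": "Order",
 "fields": [{"name": "id", "type": "string"},
            {"name": "amount", "type": "double"}]}
""")

print(is_backward_compatible(v1, v2_ok))
print(is_backward_compatible(v1, v2_bad), missing_defaults(v1, v2_bad))
```

A real registry applies the full Avro schema resolution rules, so a check like this is only useful as a quick pre-flight test before registering a new version.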

InfluxDB - 2.0 - stand alone DB

I have been using InfluxDB with Grafana for a while and I like it.
I am confused by the new version, InfluxDB 2.0. I searched the docs and could not find useful info.
I have some questions.
Will InfluxDB be available only as a single binary bundling DB + UI going forward? Can we have a standalone DB?
Will Flux replace the current SQL-like InfluxQL, or will InfluxQL also be supported?
Yes, I believe the intention is to bundle the UI into the single binary so that it is always available with no additional installs. You can continue to use Grafana though, ignoring the bundled UI entirely*. Ignoring it causes no problems, so the DB is still effectively "standalone". Since it is OSS, you could build a binary without the bundled UI if that is important for your use case.
InfluxDB 2.0 OSS is currently in RC0 (as of late Oct 2020). This version supports both InfluxQL via a compatibility API (/query) and Flux via the new /api/v2/query API. The query and response formats are different. The docs have examples. In general, Flux is the direction InfluxDB is going.
*There may be some rough edges in the RC around configuring the first user without using the UI and only using the API. I have not tried this. I would expect the API to continue to improve in this area.
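
To illustrate how the two query paths differ, here is a rough sketch that only builds the HTTP requests without sending them. The base URL, database, org and token values are placeholders, and the parameter and header names reflect my understanding of the 2.0 docs, so verify them against the version you run (the compatibility endpoint may also need a database/retention-policy mapping to a bucket).

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8086"  # assumption: a local InfluxDB 2.0 instance


def influxql_request(db, query, token):
    """Build a request for the InfluxQL compatibility endpoint (/query)."""
    url = f"{BASE_URL}/query?" + urlencode({"db": db, "q": query})
    return Request(url, headers={"Authorization": f"Token {token}"}, method="GET")


def flux_request(org, flux_script, token):
    """Build a request for the Flux endpoint (/api/v2/query)."""
    url = f"{BASE_URL}/api/v2/query?" + urlencode({"org": org})
    headers = {
        "Authorization": f"Token {token}",
        "Content-Type": "application/vnd.flux",  # body is a raw Flux script
        "Accept": "application/csv",             # results come back as annotated CSV
    }
    return Request(url, data=flux_script.encode("utf-8"), headers=headers, method="POST")


# Placeholder values, for illustration only.
ql = influxql_request("mydb", "SELECT * FROM cpu LIMIT 1", "my-token")
fx = flux_request("my-org", 'from(bucket: "mydb") |> range(start: -1h)', "my-token")
print(ql.get_method(), ql.full_url)
print(fx.get_method(), fx.full_url)
```

Passing either object to `urllib.request.urlopen` would send it. The InfluxQL path returns JSON in the 1.x style while the Flux path returns CSV, which is the format difference mentioned above.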

difference between spring-cloud-starter-dataflow-server (Data Flow Server Starter) and spring-cloud-starter-dataflow-server-local (Local Data Flow Se

I've recently started learning Spring Cloud Data Flow, also known as SCDF. I've just started looking at https://codenotfound.com/spring-batch-admin-example.html, which seems like a very nice example, but I would need more examples to really understand how to use Spring Cloud Data Flow with Spring Batch, with which I have good experience.
What's the difference between spring-cloud-starter-dataflow-server (Data Flow Server Starter) and spring-cloud-starter-dataflow-server-local (Local Data Flow Server Starter)?
We used to ship spring-cloud-starter-dataflow-server-local as a standalone uber-jar for local deployments a few years ago. Similarly, we used to have spring-cloud-starter-dataflow-server-kubernetes, spring-cloud-starter-dataflow-server-cloudfoundry, and others.
However, we have consolidated all the supported platform implementations of SCDF into a single uber-jar, and that is spring-cloud-starter-dataflow-server. Please only use this artifact for any development/deployment, even if it is only used locally.
As for feature capabilities, we have a dedicated page that lists them. Once you dig into the relevant sections ranging from developer guides [example: batch developer guide] to recipes, hopefully, you will have an idea.
And, likewise, you might find the architecture and concepts useful for your research, which will cover the broad set of capabilities that SCDF supports including first-class orchestration experience for Spring Batch workloads.

Is there any available panel to store machine learning models and their config files in a structured manner?

Saving different models with their corresponding config files, tracking the results and parameters, searching among them using customized filters and maybe always having a pointer to the current SOTA can be quite time-saving.
I couldn't even find something similar to TensorFlow Hub that runs on a local server. Right now, the closest I could get is Git LFS.
Is there anything better out there?
I found the answer. A few open source projects are trying to do the job. The first one is named Data Science Version Control, or DVC. According to the docs:
simple command line Git-like experience. Does not require installing and maintaining any databases. Does not depend on any proprietary online services;
It manages and versions datasets and machine learning models. Data is saved in S3, Google cloud, Azure, Alibaba cloud, SSH server, HDFS or even local HDD RAID;
It makes projects reproducible and shareable, it helps to answer the question: "how the model was built";
It helps manage experiments with Git tags or branches and metrics tracking;
The other possible solution to consider is MinIO, which is an object storage server suited for storing unstructured data such as photos, videos, log files, backups and container / VM images.
Microsoft Azure has a service called Azure Machine Learning service that does exactly this, but goes much further with governance, model explainability, DevOps, etc. We also include free tiers for a lot of services and recently announced unlimited private repos on GitHub.

Gitlab-CI, Review Apps, GKE, the good way?

I'm starting with Kubernetes (through GKE) and I want to set up GitLab Review Apps.
My use case is quite simple. I've read tons of articles but I could not find clear explanations and best practices on the way to do it. This is the reason why I'm asking here.
Here is what I want to achieve :
I have a PHP application, based on Symfony 4, versioned on my GitLab CE instance (self-hosted)
I set up my Kubernetes cluster on GKE and connected it to GitLab
I want, on each merge request, to deploy a new environment on my cluster where I am able to test the application and the new feature (this is the principle of Review Apps).
As far as I read, I've only found simple implementations of it. What I want to do, is deploy to a new LAMP (or LEMP) environment to test my new feature.
The thing that I don't understand is how to proceed to deploy my application.
I'm able to write Dockerfiles, store the images in my GitLab registry, etc.
In my use case, what is the best way to proceed?
In my application repository, do I have to store a Dockerfile which includes all my LAMP configuration (a complete image with my whole LAMP setup)? I don't like this approach, it seems strange to me.
Do I have to store different custom images (for Apache, MySQL, PHP-FPM, Redis) in my registry, then pull and deploy them on GKE during the review stage in my .gitlab-ci.yml file?
I'm a little bit stuck on that and I can't share code because it's more about the way to handle everything.
If you have any explanations to help me, it would be great!
I can, of course, explain a little bit more if needed!
Thanks a lot for your help.
Happy new year!

Tinkerpop common version for multiple databases

Summary
I am developing an app that is intended to work across multiple graph databases supported by TinkerPop.
Details
Based on my research, the same version of the TinkerPop library (gremlin-python) does not work with the latest version of all the graph databases. What is the best approach for this situation? The databases I intend to test are:
JanusGraph 0.2.0, which supports gremlin-python 3.2.7
Neo4j 3.3.3, which supports gremlin-python 3.3.2
I am still trying to integrate some more databases, like OrientDB and Amazon Neptune. Do you know which versions they support?
This issue can be a little tricky especially with non-open source systems that don't publish version and feature support clearly. For open source systems, you can typically find the version of TinkerPop they support for a particular version by looking at the pom.xml of the project. For OrientDB that means finding the version you want (in this case 3.2.3.0) and then looking for the gremlin-core dependency:
https://github.com/orientechnologies/orientdb-gremlin/blob/3.2.3.0/driver/pom.xml#L47
The version points to a property, so examine the pom a bit further and you'll see that number defined above:
https://github.com/orientechnologies/orientdb-gremlin/blob/3.2.3.0/driver/pom.xml#L14
So OrientDB 3.2.3.0 supports TinkerPop 3.2.3. With closed source systems you can only search around until you find the answer you're looking for, or ask the vendor directly, I guess. I've seen that Neptune is on 3.3.x, but I'm not sure which version of "x".
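
The pom lookup described above (find the gremlin-core dependency, then follow its version property) can be sketched in a few lines. The embedded pom is a hypothetical, trimmed-down stand-in, and the property name `tinkerpop.version` is an assumption; a real pom may inherit the value from a parent pom, which this sketch does not follow.

```python
import re
import xml.etree.ElementTree as ET

NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical pom, for illustration only.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <properties>
    <tinkerpop.version>3.2.3</tinkerpop.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>org.apache.tinkerpop</groupId>
      <artifactId>gremlin-core</artifactId>
      <version>${tinkerpop.version}</version>
    </dependency>
  </dependencies>
</project>"""


def dependency_version(pom_text, artifact_id):
    """Return the version of artifact_id, resolving a ${property} reference."""
    root = ET.fromstring(pom_text)

    # Collect <properties> children as {name: value}, dropping the XML namespace.
    props_el = root.find("m:properties", NS)
    props = {} if props_el is None else {
        child.tag.split("}")[-1]: child.text for child in props_el
    }

    for dep in root.iterfind(".//m:dependency", NS):
        if dep.findtext("m:artifactId", namespaces=NS) != artifact_id:
            continue
        version = dep.findtext("m:version", namespaces=NS) or ""
        match = re.fullmatch(r"\$\{(.+)\}", version.strip())
        if match:
            return props.get(match.group(1))  # None if defined elsewhere
        return version or None
    return None


print(dependency_version(SAMPLE_POM, "gremlin-core"))
```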
Just because all of these systems support different versions of TinkerPop and the general recommendation is to use a matching TinkerPop version to connect to them doesn't mean that you can't get a 3.3.x driver to connect to a 3.2.x based server. You may not have the best experience doing so and you would need to be aware of a few things as you do that, but I think it can be done.
The key to making this work from a driver perspective is to ensure that you have the right serialization configuration for the graph you are connecting to. This is true whether or not you are connecting to a system on the same version. By default, TinkerPop ensures that these configurations within the same version are aligned so that they work out of the box. This is why we tend to recommend that you use the same version when possible. When not possible, you need to make those alignments manually.
For example, if you scroll down in this link a bit to the "Serialization" section you will find the supported formats for Neptune:
https://docs.aws.amazon.com/neptune/latest/userguide/access-graph-gremlin-differences.html
As long as you configure your driver to match one of those formats it should work for you. The same could be said of JanusGraph, which in contrast to Neptune, will not support Gryo or GraphSON 3.0 as it is bound to the 3.2.x line. The configuration for the serializers can be found in JanusGraph's packaging of Gremlin Server:
https://github.com/JanusGraph/janusgraph/blob/v0.2.0/janusgraph-dist/src/assembly/static/conf/gremlin-server/gremlin-server.yaml#L15-L21
As to how you configure your python driver for serialization? Admittedly, there isn't a lot written on that. The key is to set the message_serializer when configuring the Client (from gremlinpython 3.3.2):
https://github.com/apache/tinkerpop/blob/3.3.2/gremlin-python/src/main/jython/gremlin_python/driver/client.py#L44-L45
You can see there that by default it is set to GraphSON 3.0. So, that's perfect for Neptune, but not JanusGraph. For JanusGraph, which doesn't support GraphSON 3.0 yet, you would just change the configuration to use the GraphSON 2.0 serializer:
https://github.com/apache/tinkerpop/blob/3.3.2/gremlin-python/src/main/jython/gremlin_python/driver/serializer.py#L149
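
As a rough sketch of that configuration, the helper below picks a serializer class by the TinkerPop line the server is on and passes it as `message_serializer`. The version rule (GraphSON 3.0 for 3.3 and later, GraphSON 2.0 otherwise) and both function names are my own assumptions for illustration; a given server may enable a different set of formats, so check its gremlin-server.yaml. The driver import is done lazily so the selection logic can run without gremlinpython installed.

```python
def pick_serializer_name(server_tinkerpop_version):
    """Map a server's TinkerPop version (e.g. "3.2.7") to a serializer class name.

    Assumption: servers on the 3.3.x line or later accept GraphSON 3.0,
    older ones need GraphSON 2.0.
    """
    major, minor = (int(part) for part in server_tinkerpop_version.split(".")[:2])
    if (major, minor) >= (3, 3):
        return "GraphSONSerializersV3d0"
    return "GraphSONSerializersV2d0"


def make_client(url, server_tinkerpop_version):
    """Create a gremlinpython Client whose serializer matches the server."""
    # Requires gremlinpython 3.3.2 or later to be installed.
    from gremlin_python.driver import client, serializer

    serializer_cls = getattr(serializer, pick_serializer_name(server_tinkerpop_version))
    return client.Client(url, "g", message_serializer=serializer_cls())


print(pick_serializer_name("3.2.7"))  # e.g. JanusGraph 0.2.0
print(pick_serializer_name("3.3.2"))
```

For the JanusGraph case above, `make_client("ws://localhost:8182/gremlin", "3.2.7")` would build a client that speaks GraphSON 2.0; the URL is a placeholder.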
So, that is just getting a connection working - then there are other things to consider:
If you use a new version of gremlinpython against an older server, you will need to make sure that you are aware of any features that aren't supported on the server (e.g. don't use the math() step from your 3.3.x client because it won't work on a 3.2.x server)
CosmosDB may allow you to connect with 3.3.x, but it doesn't have full Gremlin support and at this time does not support bytecode-based traversals, only strings
A number of bugs have been fixed in GraphSON serialization over these releases, and sometimes certain types may have a revised serialization scheme that may prevent a 3.3.x client from talking to a 3.2.x server. I can't think of any big issues like that offhand, but I'm pretty sure it's happened, perhaps in the serialization of Tree and some of the extended types. You can always look at the full list of GraphSON types here and compare between published versions if you run into trouble.
