How to select a subset of Avro files from Azure Data Lake Gen2 by data content

I have lots of Avro files in an Azure Data Lake Gen2 storage sent by an Event Hub service with capture enabled. These Avro files contain data from different sensors and engines. The structure of the directory is organized in folders with the following path format (typical of Azure Blobs):
namespace/eventhub/partition/year/month/day/hour/minute/file.avro
I need to access some of these files in order to get data to 1) pre-process and 2) train or re-train a machine learning model. I'd like to know what procedure I could follow to download or mount just the files containing data from a particular engine and/or sensor, given that not every Avro file contains data from all of them. Let's assume I'm interested only in files containing data from:
Engine = engine_ID_4012
Sensor = sensor_engine_4012_ID_0114
I'm aware that Spark offers some advantages when working with Avro files, so I could consider carrying out this task using Databricks. Otherwise the option is the Azure Machine Learning service, but maybe there are other possibilities, for instance a combination of both. The goal is to speed up the data ingestion process, avoiding reading files that contain no needed data.
Thanks.

Thanks for reaching out. In Azure Machine Learning, you can:
1. Create a datastore to connect to your storage service (ADLS Gen2 in your case).
Sample code: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data#azure-data-lake-storage-generation-2
2. Create a FileDataset from the ADLS Gen2 datastore pointing to your Avro files.
Sample code: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets#create-a-filedataset
3. Download or mount those files on your compute target in ML experiments:
https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-with-datasets
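For reference, here is a minimal sketch of those three steps with the Azure ML Python SDK (v1). The storage account, service principal values, datastore name and path glob are placeholders, and note that the glob narrows files by path only, not by engine/sensor content:
```python
# Sketch only: assumes the azureml-core (SDK v1) package and placeholder credentials.
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()

# 1. Register the ADLS Gen2 account as a datastore.
adls = Datastore.register_azure_data_lake_gen2(
    workspace=ws,
    datastore_name="eventhub_capture",   # hypothetical name
    filesystem="<capture-container>",    # container the capture writes into
    account_name="<storage-account>",
    tenant_id="<tenant-id>",
    client_id="<client-id>",
    client_secret="<client-secret>",
)

# 2. Create a FileDataset over the captured Avro files.
#    The glob follows namespace/eventhub/partition/year/month/day/hour/minute/file.avro
#    and can be tightened to a given date range, but it cannot filter by content.
ds = Dataset.File.from_files(path=(adls, "namespace/eventhub/*/*/*/*/*/*/*.avro"))

# 3. Download (or mount) the files on the compute target inside a run.
paths = ds.download(target_path="./avro_data", overwrite=True)
# or: ctx = ds.mount("./avro_mount"); ctx.start()
```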

Related

How to separate application and data syncing implementations in Kubernetes?

I want to build an application/service which uses static application data that can get updated over time. Currently, I have implemented this by having both the application and the data within the same container. However, this requires a redeployment whenever either the application or the data changes. I want to separate the application and data volume implementations so that I can update them independently, meaning I won't have to rebuild the application layer when the application data is updated, and vice versa.
Here are the characteristics of the Application Data and its Usage:
Data is not frequently updated, but read very frequently by the application
Data is not a database; it's a collection of file objects with sizes ranging from 100 MB to 4 GB, initially stored in cloud storage
Data stored in the cloud storage serves as a single source of truth for the application data
The application will only read from the Data. The process of updating data in cloud storage is an external process outside the application scope.
So here, we are interested in syncing the data in cloud storage to the volume in the Kubernetes deployment. What's the best way to achieve this objective in Kubernetes?
I have several options in mind:
Using one app container in one deployment, in which the app also includes the logic for loading and updating data, pulling it from cloud storage into the container --> simple but tightly coupled with the storage read-write implementation
Using the cloud store directly from the app --> this doesn't require a container volume, but I was concerned about the huge file sizes because the app is an interactive service which requires quick responses
Using two containers in one deployment sharing the same volume --> allow great flexibility for the storage read-write implementation
one container for application service reading from the shared volume
one container for updating data and listening for data-update requests, writing data to the shared volume --> this process will pull data from cloud storage to the shared volume
Using one container with a Persistent Disk
an external process which writes to the persistent disk (not sure how to do this yet with cloud storage/file objects; need to find a way to sync GCS to the persistent disk)
one container for application service which reads from the mounted volume
Using Fuse mounts
an external process which writes to cloud storage
a container which uses fuse mounts
I am currently leaning towards option 3, but I am not sure whether it's common practice for achieving my objective. Please let me know if you have better solutions.
Yes, option 3 is the most common approach, but make sure you use an initContainer to copy the data from your cloud storage to a local volume. That local volume can be any of the volume types supported by Kubernetes.
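As a rough illustration, the copy step such an initContainer (or sidecar) runs could look something like the following, assuming the data lives in GCS and the google-cloud-storage client is available; the bucket name and mount path are placeholders:
```python
# Hypothetical sync script for an initContainer: copy application data from a
# GCS bucket into the shared volume mounted at DATA_DIR.
import os
from google.cloud import storage

BUCKET = os.environ.get("DATA_BUCKET", "my-app-data")  # placeholder bucket name
DEST = os.environ.get("DATA_DIR", "/data")             # shared volume mount path

client = storage.Client()
for blob in client.bucket(BUCKET).list_blobs():
    target = os.path.join(DEST, blob.name)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    blob.download_to_filename(target)
    print(f"synced gs://{BUCKET}/{blob.name} -> {target}")
```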

Blockchain ledger storage

I've been studying blockchain for a while and I've been looking for information explaining where the blockchain ledger is saved and how it is saved locally (as in, locally on a full node). What I have mostly found is the state database used by Ethereum, or Hyperledger Fabric using LevelDB or RocksDB etc. for saving state information. I've been struggling to find out where the blockchain ledger itself gets saved, apart from the state being kept in some on-disk key-value store/database. I'm studying linked lists and Merkle trees (hash trees), which are used to store newly created blocks: blocks are hashed and placed in a Merkle tree for verification by full nodes, and light nodes can query and verify whether transactions exist.
Thanks and best,
Rohit
In Bitcoin Core, the blocks are stored in .dat files in the blocks folder under the data directory (the default on Linux is ~/.bitcoin). These files are not necessarily numbered or organized in any strict fashion, because blocks are downloaded as they become available, instead of waiting for each sequential block to become available for download from a peer. For that reason, there is a LevelDB database (in ~/.bitcoin/blocks/index) which indexes the blockchain by storing the names and locations of the .dat files.
Linked lists and Merkle trees are not data storage mechanisms but abstract data types, which can exist in a database, as flat files, etc. A Merkle tree can make validation much faster because it improves the efficiency of the verification algorithm, which is usually based on a hash function.
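To make the verification point concrete, here is a small Python sketch of a Bitcoin-style Merkle root computation (double SHA-256, duplicating the last hash on odd-length levels); proving that a transaction belongs to a block then needs only a logarithmic number of hashes rather than the whole block:
```python
# Minimal sketch: Bitcoin-style Merkle root over a list of transactions.
import hashlib

def sha256d(data):
    # Double SHA-256, as used for Bitcoin block and transaction hashing.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root(transactions):
    level = [sha256d(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate the last hash on odd levels
        level = [sha256d(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

print(merkle_root([b"tx1", b"tx2", b"tx3"]).hex())
```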
In Hyperledger Fabric, the state database does not store all the blocks; it saves only the current state of an asset. For example, if a bank account has a transaction of 10 debit and another transaction of 2 credit, the state DB will hold the current value of 8.
The actual blocks are saved in a local file on the peers, which can be queried via the SDK.

How can I connect the Object Storage with WML(Watson Machine Learning) service?

I'd like to connect Object Storage with WML (Watson Machine Learning service) in IBM Cloud.
I uploaded a stream model file (.str) to WML, and this model has multiple input file nodes (3 csv files and 13 sad files).
So I put those input files into Object Storage and changed the model's nodes to use the Object Storage files in Data Science Experience.
But that changed model (modified in DSX) couldn't be uploaded to WML (ERROR: models refer to undefined field).

Reading video during Cloud Dataflow: use GCS Fuse, download locally, or write a new Beam reader?

I am building a Python cloud video pipeline that will read video from a bucket, perform some computer vision analysis and return frames back to a bucket. As far as I can tell, there is no Beam read method to pass GCS paths to OpenCV, similar to TextIO.read(). My options moving forward seem to be: download the files locally (they are large), use GCS Fuse to mount them on a local worker (is that possible?), or write a custom source method. Does anyone have experience with what makes the most sense?
My main confusion was this question here
Can google cloud dataflow (apache beam) use ffmpeg to process video or image data
How would ffmpeg have access to the path? It's not just a question of uploading the binary? There needs to be a Beam method to pass the item, correct?
I think that you will need to download the files first and then pass them through.
However, instead of saving the files locally, is it possible to pass bytes through to OpenCV? Does it accept any sort of byte stream or input stream?
You could have one ParDo which downloads the files using the GCS API, then passes them to OpenCV through a stream, byte channel, stdin pipe, etc.
If that is not available, you will need to save the files to disk locally and then pass OpenCV the filename. This could be tricky because you may end up using too much disk space, so make sure to garbage collect the files properly and delete them from local disk after OpenCV processes them.
I'm not sure, but you may also need to select a certain VM machine type to ensure you have enough disk space, depending on the size of your files.
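A rough sketch of the download-then-process pattern described above: one ParDo copies each object from GCS to a temporary local file, hands the filename to OpenCV, and deletes the file afterwards. The bucket path is a placeholder, and this assumes apache-beam[gcp] and opencv-python are installed on the workers:
```python
# Sketch only: a DoFn that downloads a GCS video, processes it with OpenCV,
# and cleans up the local copy afterwards.
import os
import shutil
import tempfile

import apache_beam as beam

class ProcessVideo(beam.DoFn):
    def process(self, gcs_path):
        import cv2                                   # imported on the worker
        from apache_beam.io.gcp.gcsio import GcsIO

        fd, local_path = tempfile.mkstemp(suffix=".mp4")
        os.close(fd)
        try:
            # Copy the object from GCS to local disk (OpenCV wants a filename).
            with GcsIO().open(gcs_path, "rb") as src, open(local_path, "wb") as dst:
                shutil.copyfileobj(src, dst)

            cap = cv2.VideoCapture(local_path)
            while True:
                ok, frame = cap.read()
                if not ok:
                    break
                yield gcs_path, frame.shape          # placeholder for real analysis
            cap.release()
        finally:
            os.remove(local_path)                    # free worker disk space

if __name__ == "__main__":
    with beam.Pipeline() as p:
        (p
         | beam.Create(["gs://my-bucket/videos/clip1.mp4"])   # hypothetical path
         | beam.ParDo(ProcessVideo()))
```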

Apache Samza local storage - OrientDB / Neo4J graph instead of KV store

Apache Samza uses RocksDB as the storage engine for local storage. This allows for stateful stream processing and here's a very good overview.
My use case:
I have multiple streams of events that I wish to process taken from a system such as Apache Kafka.
These events create state - the state I wish to track is based on previous messages received.
I wish to generate new stream events based on the calculated state.
The input stream events are highly connected, and a graph database such as OrientDB / Neo4j would be the ideal medium for querying the data to create the new stream events.
My question:
Is it possible to use a non-KV store as the local storage for Samza? Has anyone ever done this with OrientDB / Neo4J and is anyone aware of an example?
I've been evaluating Samza and I'm by no means an expert, but I'd recommend reading the official documentation, and even reading through the source code; other than the fact that it's in Scala, it's remarkably approachable.
In this particular case, toward the bottom of the documentation's page on State Management you have this:
Other storage engines
Samza’s fault-tolerance mechanism (sending a local store’s writes to a replicated changelog) is completely decoupled from the storage engine’s data structures and query APIs. While a key-value storage engine is good for general-purpose processing, you can easily add your own storage engines for other types of queries by implementing the StorageEngine interface. Samza’s model is especially amenable to embedded storage engines, which run as a library in the same process as the stream task.
Some ideas for other storage engines that could be useful: a persistent heap (for running top-N queries), approximate algorithms such as bloom filters and hyperloglog, or full-text indexes such as Lucene. (Patches accepted!)
I actually read through the code for the default StorageEngine implementation about two weeks ago to gain a better sense of how it works. I definitely don't know enough to say much intelligently about it, but I can point you at it:
https://github.com/apache/samza/tree/master/samza-kv-rocksdb/src/main/scala/org/apache/samza/storage/kv
https://github.com/apache/samza/tree/master/samza-kv/src/main/scala/org/apache/samza/storage/kv
The major implementation concerns seem to be:
Logging all changes to a topic so that the store's state can be restored if a task fails.
Restoring the store's state in a performant manner
Batching writes and caching frequent reads in order to save on trips to the raw store.
Reporting metrics about the use of the store.
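To picture those concerns without the Samza/Scala specifics, here is a rough Python sketch of a changelog-backed store; the changelog producer and replayer are left abstract and this is not Samza's actual API:
```python
# Conceptual sketch of changelog-backed local state (not Samza's real interface).
class ChangelogBackedStore:
    def __init__(self, changelog_producer, changelog_replayer, topic="store-changelog"):
        self.store = {}                      # stands in for the local RocksDB store
        self.producer = changelog_producer   # anything with send(topic, key, value)
        self.replayer = changelog_replayer   # iterable of (key, value) changelog records
        self.topic = topic

    def restore(self):
        # Rebuild local state by replaying the changelog after a task restart.
        for key, value in self.replayer:
            self.store[key] = value

    def put(self, key, value):
        # Log the change before applying it locally so a restart can recover it.
        self.producer.send(self.topic, key, value)
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)
```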
Do the input stream events define one global graph, or multiple graphs, one for each matching Kafka/Samza partition? That is important, as Samza state is local, not global.
If it's one global graph, you can update/query a separate graph system from the Samza task's process method. Titan on Cassandra would be one such graph system.
If it's multiple separate graphs, you can use the current RocksDB KV store to mimic graph database operations, as sketched below. Titan on Cassandra does just that: it uses the Cassandra KV store to store and query the graph. Graphs are stored either as an adjacency matrix (set [i,j] to 1 if the nodes are connected) or as an edge list. For each node, use it as the key and store its set of neighbors as the value.
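For the adjacency-list flavour of that last point, the idea is roughly the following, with a plain Python dict standing in for the Samza KeyValueStore and JSON-encoded neighbor lists as values:
```python
# Sketch: mimic a graph on top of a key-value store.
# Key = node id, value = JSON-encoded list of neighbor ids.
import json

class KVGraph:
    def __init__(self, store=None):
        self.store = store if store is not None else {}   # stand-in for a KV store

    def add_edge(self, src, dst):
        neighbors = set(json.loads(self.store.get(src, "[]")))
        neighbors.add(dst)
        self.store[src] = json.dumps(sorted(neighbors))

    def neighbors(self, node):
        return set(json.loads(self.store.get(node, "[]")))

g = KVGraph()
g.add_edge("a", "b")
g.add_edge("a", "c")
print(g.neighbors("a"))   # {'b', 'c'}
```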
