See the contents of Checkpoint files?

Per the documentation, variables in a session can be saved to and restored from a binary file with a tf.train.Saver object.
But is there any way to see the content of the binary file?

A checkpoint file is an sstable. The value for each record is a serialized SavedTensorSlices message.
To see the content of a record, deserialize its value into a SavedTensorSlices object, something like this:
// Parse the record value read from the checkpoint sstable into a SavedTensorSlices
// proto and print it in human-readable form
SavedTensorSlices message;
message.ParseFromString(value);
std::cout << message.DebugString();

The files are read and written in C++ by TensorSliceReader and TensorSliceWriter, using what seems to be a special format for tensor data.
The files contain the values of the saved tensors. The simplest way to inspect those values would be to restore the tensors from the checkpoint file and inspect the tensors directly.
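For example, a minimal Python sketch, assuming the TensorFlow 1.x API and a checkpoint saved under the hypothetical prefix model.ckpt; tf.train.NewCheckpointReader lists and reads the saved tensors without rebuilding the graph:

import tensorflow as tf

# Assumes a TF 1.x checkpoint written by tf.train.Saver under the prefix "model.ckpt" (placeholder name)
reader = tf.train.NewCheckpointReader("model.ckpt")
for name, shape in reader.get_variable_to_shape_map().items():
    print(name, shape)              # variable name and shape
    print(reader.get_tensor(name))  # the saved values as a NumPy array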

Related

Adding date to the key expression in s3 sink properties of spring cloud stream

We have a Spring Cloud Data Flow stream which processes input files and produces output files in an S3 bucket.
We are using the following key-expression property to specify the folder for the output file.
app.s3-sink-rabbit.s3.key-expression='XYZ/abc/'+headers.file_name
We are trying to add the date in YYYYMMDD format as a folder for our output files,
i.e. the output location should be XYZ/abc/20230110/{filename}.
We understand that the folder is created automatically in S3 if it does not exist when the file is generated.
We could append the date in YYYYMMDD format and a '/' to the file name programmatically, but we want to know if it can be done through an expression in the property.
I believe the following may do what you want; BASIC_ISO_DATE formats the date as yyyyMMdd (e.g. 20230110), so the resulting key becomes XYZ/abc/20230110/{filename}:
app.s3-sink-rabbit.s3.key-expression='XYZ/abc/'+T(java.time.LocalDate).now().format(T(java.time.format.DateTimeFormatter).BASIC_ISO_DATE)+'/'+headers.file_name
Try the key-expression property.
From the S3MessageHandler javadocs...
An S3 Object key for upload and download can be determined by the provided keyExpression, or the File#getName() is used directly. The former has precedence.

How to understand the concept of Object Container Files in Avro?

I'm quite confused about the concept of Object Container Files in Avro.
https://avro.apache.org/docs/current/spec.html#Object+Container+Files
Does 'Object Container Files' mean the files that are produced by Avro when serializing data? Avro persists the serialized data into one or more files; are these files called Object Container Files?
If you store Avro data on disk as files, those files follow the object container file specification described there.
The files contain binary data, written after the data has been serialized.
One file contains a schema and many serialized records matching that schema
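As a small illustration, a Python sketch using the third-party fastavro package (an assumption; the official avro package behaves similarly) that writes and then reads one object container file, showing that the schema travels inside the file together with the records:

import fastavro

# Hypothetical schema and records, just for illustration
schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age", "type": "int"}],
})
records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Write the object container file: a header carrying the schema, then blocks of records
with open("users.avro", "wb") as out:
    fastavro.writer(out, schema, records)

# Read it back: the schema comes from the file header, no external schema needed
with open("users.avro", "rb") as fo:
    reader = fastavro.reader(fo)
    print(reader.writer_schema)   # schema embedded in the file
    for record in reader:
        print(record)             # deserialized records matching that schema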

ADTF dat files - streams and structure types

An ADTF dat file contains streams of data. In the .dat file there is only a stream name; to find the structure of a stream, one has to go through the DDL .description file.
Sometimes the .description files are incomplete or are missing the link from a stream name to the corresponding structure.
Is there some additional information about the structure name hidden in the .dat file itself? (Or is my understanding completely wrong?)
You must distinguish between ADTF 2.x and ADTF 3.x and their (adtf)dat file structures.
ADTF 2.x:
You are right: you can only interpret the data with DDL. The stream must point to a structure described in the Media Description.
Sometimes the .description files are incomplete or are missing the link from a stream name to the corresponding structure.
You can avoid this by enabling the option Create Media Description in the Harddisk Recorder. Then a *.dat.description file will be stored next to the same-titled *.dat file, containing the correct stream and structure references, because they were available during recording.
Is there some additional information about structure name hidden in the .dat file itself?
No, it is only the stream name, so you need to know the data structure behind it in order to interpret the data. If you have the header (C struct), you can also convert it to DDL and refer to that.
ADTF 3.x:
To avoid these problems with unavailable or incorrect description files, in ADTF 3.x the DDL is now stored in the *.adtfdat file itself.

HDFS Flume sink - Roll by File

Is it possible for the HDFS Flume sink to roll whenever a single file (from a Flume source, say Spooling Directory) ends, instead of rolling after a certain number of bytes (hdfs.rollSize), a time interval (hdfs.rollInterval), or a number of events (hdfs.rollCount)?
Can Flume be configured so that a single file is a single event?
Thanks for your input.
Regarding your first question, it is not possible, because the sink's logic is disconnected from the source's logic: a sink only sees events being put into the channel that it must process; it does not know whether an event is the first or the last one of a file.
Of course, you could try to create your own source (or extend an existing one) in order to add a header to the event with a value meaning "this is the last event". Then a custom sink could behave depending on such a header: for instance, if the header is not set, the events are not persisted but kept in memory until the header is seen; then all the information is persisted in the final backend as a batch. Another possibility is that the custom sink persists the data to a file until the header is seen; then the file is closed and another one is opened.
Regarding your second question, it depends on the source. The spooldir source behaves based on the deserializer parameter; by default its value is LINE, which means:
Specify the deserializer used to parse the file into events. Defaults to parsing each line as an event. The class specified must implement EventDeserializer.Builder.
But other custom Java classes can be configured, as said above; for instance, a deserializer for the whole file.
You can set hdfs.rollSize to a small number and combine it with BlobDeserializer to load files one by one instead of combining them into blocks. This is really helpful when you have unsplittable binary files such as PDF or gz files.
This is the relevant part of the configuration:
#Set deserializer to BlobDeserializer and set the maximum blob size to be 1GB.
#Notice that the blobs have to fit in memory so this doesn't work for files that cannot fit in memory.
agent.sources.spool.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
agent.sources.spool.deserializer.maxBlobLength = 1000000000
#Set rollSize to 1024 to avoid combining multiple small files into one part.
agent.sinks.hdfsSink.hdfs.rollSize = 1024
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.rollInterval = 0
The answer to the question "Can Flume be configured so that a single file is a single event?" is yes.
You only have to set the following property to 1:
hdfs.rollCount = 1
I'm still looking for a solution to your first question, because sometimes the file is too big and needs to be split into several chunks.
You can use any event headers in hdfs.path. ( https://flume.apache.org/FlumeUserGuide.html#hdfs-sink )
If you are using Spooling Directory Source, you can enable putting the file name in the events using fileHeaderKey or basenameHeaderKey ( https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source ).
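For example, a sketch reusing the source and sink names from the configuration above (spool and hdfsSink; the HDFS path is a placeholder): the Spooling Directory Source puts each input file's base name into a header, and the HDFS sink substitutes that header into hdfs.path so events from one source file land under their own directory:

# Put each input file's base name into an event header
agent.sources.spool.basenameHeader = true
agent.sources.spool.basenameHeaderKey = basename
# Substitute the header into the sink path (the path prefix is a placeholder)
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/%{basename}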
Can Flume be configured so that a single file is a single event?
It could be, but it is not recommended: the underlying implementation (protobuf) limits file sizes to 64 MB, and Flume events are meant to be small due to its architecture and design (fault tolerance, etc.).

Rails: possible to check if a string is binary?

In a particular Rails application, I'm pulling binary data out of LDAP into a variable for processing. Is there a way to check if the variable contains binary data? I don't want to continue with processing of this variable if it's not binary. I would expect to use is_a?...
In fact, the binary data I'm pulling from LDAP is a photo. So maybe there's an even better way to ensure the variable contains binary JPEG data? The result of this check will determine whether to continue processing the JPEG data, or to render a default JPEG from disk instead.
There is actually a lot more to this question than you might think. Only since Ruby 1.9 has there been a concept of characters (in some encoding) versus raw bytes, so in Ruby 1.9 you might be able to get away with checking the encoding. Since you are getting the data from LDAP, the encoding of the incoming strings should be well known, most likely ISO-8859-1 or UTF-8.
In which case you can get the encoding and act on that:
some_variable.encoding # => when ASCII-8BIT, treat as a photo
Since you really want to verify that the binary data is a photo, it would make sense to run it through an image library. RMagick comes to mind. The documentation will show you how to verify that any binary data is actually JPEG encoded. You will then also be able to store other properties such as width and height.
If you don't have RMagick installed, an alternative approach would be to save the data into a Tempfile, drop down into Unix (assuming you are on Unix) and try to identify the file. If your system has ImageMagick installed, the identify command will tell you all about images. But just calling file on it will tell you this too:
~/Pictures$ file P1020359.jpg
P1020359.jpg: JPEG image data, EXIF standard, comment: "AppleMark"
You need to call the identify and file commands in a shell from Ruby (interpolate the Tempfile's path, not the object itself):
%x(identify #{tempfile.path})
%x(file #{tempfile.path})
