Access checksum of zfs dataset via cli

Is it possible to read/access the checksum of a ZFS dataset? I want to access it to validate that the dataset didn't change between boots.
Reading https://en.wikipedia.org/wiki/ZFS#ZFS_data_integrity: is the top checksum of ZFS's Merkle-tree-like checksumming scheme accessible from userspace?

There's a tool called zdb (aimed mainly at developers) which can do this. It's hard to use and its output format is not always backwards compatible :-)
However, if all you want is to make sure that a filesystem hasn't changed, you can use snapshots for this purpose. First, create a snapshot at the point you want to compare to later on with zfs snapshot <pool>/<fs>@<before-reboot-snap>. Then there are two different ways to compare the filesystem to that snapshot later:
After the reboot, run zfs diff <pool>/<fs>@<before-reboot-snap> <pool>/<fs>. This will show you a list of "diffs" between the snapshot and the current filesystem:
# ls /tank/hello
file1 file2 file3 file4 file5
# zfs snapshot tank/hello@snap
# zfs diff tank/hello@snap tank/hello
# touch /tank/hello/file6
# zfs diff tank/hello@snap tank/hello
M /tank/hello/
+ /tank/hello/file6
# rm /tank/hello/file6
# zfs diff tank/hello@snap tank/hello
M /tank/hello/
Note that even after I deleted the new file, the directory it lived in is still marked as modified.
Take another snapshot after the reboot, and then use zfs send -i @<before-reboot-snap> <pool>/<fs>@<after-reboot-snap> to create a stream of all the changes that happened between those snapshots, and analyze it with another tool called zstreamdump:
zfs send -i @snap tank/hello@snap2 | zstreamdump
BEGIN record
hdrtype = 1
features = 4
magic = 2f5bacbac
creation_time = 59036f98
type = 2
flags = 0x4
toguid = 2f080aca53bff68e
fromguid = 66a1da82cd5f1571
toname = tank/hello@snap2
END checksum = 91043406e5/38f3c4043049b/ed0867661876670/1e265bea2b6c3315
SUMMARY:
Total DRR_BEGIN records = 1
Total DRR_END records = 1
Total DRR_OBJECT records = 12
Total DRR_FREEOBJECTS records = 5
Total DRR_WRITE records = 1
Total DRR_WRITE_BYREF records = 0
Total DRR_WRITE_EMBEDDED records = 0
Total DRR_FREE records = 17
Total DRR_SPILL records = 0
Total records = 37
Total write size = 512 (0x200)
Total stream length = 13232 (0x33b0)
The example above shows that there have been a bunch of changes -- any WRITE, FREE, OBJECT, or FREEOBJECTS records indicate a difference from the original snapshot.
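If you want to script that check after a reboot, a minimal sketch might look like this (this is only an illustration, not part of the original answer; the dataset and snapshot names are placeholders):
#!/usr/bin/env python3
# Minimal sketch: flag whether a dataset changed since a pre-reboot snapshot
# by checking whether `zfs diff` prints anything. Names below are placeholders.
import subprocess
import sys

DATASET = "tank/hello"              # hypothetical dataset
SNAPSHOT = f"{DATASET}@before-reboot"  # hypothetical snapshot name

def dataset_changed() -> bool:
    """Return True if `zfs diff` reports any changes since the snapshot."""
    out = subprocess.run(
        ["zfs", "diff", SNAPSHOT, DATASET],
        check=True, capture_output=True, text=True,
    ).stdout
    return bool(out.strip())        # empty output means no recorded changes

if __name__ == "__main__":
    if dataset_changed():
        print(f"{DATASET} changed since {SNAPSHOT}")
        sys.exit(1)
    print(f"No changes recorded for {DATASET} since {SNAPSHOT}")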


Splitting file processing by initial keys

Use Case
I have some terabytes of US property data to merge. It is spread across two distinct file formats and thousands of files. The source data is split geographically.
I can't find a way to branch a single pipeline into many independent processing flows.
This is especially difficult because the Dataframe API doesn't seem to support a PTransform on a collection of filenames.
Detailed Background
The distribution of files is like this:
StateData - 51 total files (US states + DC)
CountyData - ~2000 total files (county specific, grouped by state)
The ideal pipeline would split into thousands of independent processing steps and complete in minutes.
1 -> 51 (each US state + DC starts processing)
51 -> thousands (each US state then spawns a process that merges the counties, combining at the end for the whole state)
The directory structure is like this:
📂state-data/
|-📜AL.zip
|-📜AK.zip
|-📜...
|-📜WY.zip
📂county-data/
|-📂AL/
  |-📜COUNTY1.csv
  |-📜COUNTY2.csv
  |-📜...
  |-📜COUNTY68.csv
|-📂AK/
  |-📜...
|-📂.../
|-📂WY/
  |-📜...
Sample Data
This is extremely abbreviated, but imagine something like this:
State Level Data - 51 of these (~200 cols wide)
uid      census_plot    flood_zone
abc121   ACVB-1249575   R50
abc122   ACVB-1249575   R50
abc123   ACVB-1249575   R51
abc124   ACVB-1249599   R51
abc125   ACVB-1249599   R50
...      ...            ...
County Level Data - thousands of these (~300 cols wide)
uid      county   subdivision      tax_id
abc121   04021    Roland Heights   3t4g
abc122   04021    Roland Heights   3g444
abc123   04021    Roland Heights   09udd
...      ...      ...              ...
So we join many county-level files to a single state-level file, and thus have an aggregated, more complete state-level data set.
Then we aggregate all the states, and we have a national level data set.
Desired Outcome
I can successfully merge one state at a time (many counties to one state). I built a pipeline to do that, but the pipeline starts with a single CountyData CSV and a single StateData CSV. The issue is getting to the point where I can load the CountyData and StateData.
In other words:
#
# I need to find a way to generalize this flow to
# dynamically created COUNTY and STATE variables.
#
import apache_beam as beam
import pandas as pd
from apache_beam.dataframe.convert import to_pcollection
from apache_beam.dataframe.io import read_csv

COUNTY = "county-data/AL/*.csv"
STATE = "state-data/AL.zip"

def key_by_uid(elem):
    return (elem.uid, elem)

with beam.Pipeline() as p:
    # County data: read with the Beam dataframe API, then key by uid.
    county_df = p | read_csv(COUNTY)
    county_rows_keyed = to_pcollection(county_df) | beam.Map(key_by_uid)

    # State data: read eagerly with pandas, convert, then key by uid.
    state_df = pd.read_csv(STATE, compression="zip")
    state_rows_keyed = to_pcollection(state_df, pipeline=p) | beam.Map(key_by_uid)

    # merge_logic and WriteToParquet are defined elsewhere.
    merged = (
        {"state": state_rows_keyed, "county": county_rows_keyed}
        | beam.CoGroupByKey()
        | beam.Map(merge_logic)
    )
    merged | WriteToParquet()
Starting with a list of states:
By state, generate filepatterns to the source data.
By state, load and merge the files.
Flatten the output from each state into a US data set.
Write to a Parquet file.
with beam.Pipeline(options=pipeline_options) as p:
    merged_data = (
        p
        | beam.Create(cx.STATES)
        | "PathsKeyedByState" >> tx.PathsKeyedByState()
        # ('AL', {'county-data': 'gs://data/county-data/AL/COUNTY*.csv', 'state-data': 'gs://data/state-data/AL.zip'})
        | "MergeSourceDataByState" >> tx.MergeSourceDataByState()
        | "MergeAllStateData" >> beam.Flatten()
    )
    merged_data | "WriteParquet" >> tx.WriteParquet()
The issue I'm having is something like this:
I have two filepatterns in a dictionary, per state. To access those I need to use a DoFn to get at each element.
To describe the way the data flows, I need access to the Pipeline so I can apply a PTransform, e.g. df = p | read_csv(...).
These appear to be incompatible needs.
Here's an alternative answer.
Read the state data one at a time and flatten them, e.g.
state_dataframe = None
for state in STATES:
    df = p | read_csv('/path/to/state')
    df['state'] = state
    if state_dataframe is None:
        state_dataframe = df
    else:
        state_dataframe = state_dataframe.append(df)
Similarly for county data. Now join them using dataframe operations.
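A sketch of how that might continue for the county data and the join (the uid column name, the per-state labels, and the merge call are assumptions for illustration, not from the answer):
# Mirror the state loop above for the county files, then join. This assumes
# the deferred dataframe API supports merge on the shared uid column here.
county_dataframe = None
for state in STATES:
    df = p | f"ReadCounty_{state}" >> read_csv(f'/path/to/county-data/{state}/*.csv')
    df['state'] = state
    if county_dataframe is None:
        county_dataframe = df
    else:
        county_dataframe = county_dataframe.append(df)

merged = state_dataframe.merge(county_dataframe, on='uid')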
I'm not sure exactly what kind of merging you're doing here, but one way to structure this pipeline might be to have a DoFn that takes the county data in as a filename as an input element (i.e. you'd have a PCollection of county data filenames), opens it up using "normal" Python (e.g. pandas), and then reads the relevant state data in as a side input to do the merge.
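A minimal sketch of that shape might look like the following (the paths, the uid join key, and the load_state helper are all assumptions for illustration, not from the answer):
import apache_beam as beam
import pandas as pd

class MergeCountyWithState(beam.DoFn):
    """Open one county CSV with pandas and join it against that state's
    DataFrame, which arrives as a dict-valued side input."""

    def process(self, county_path, state_frames):
        # county_path looks like 'county-data/AL/COUNTY1.csv' (placeholder layout).
        state_code = county_path.split("/")[-2]
        county_df = pd.read_csv(county_path)
        merged = county_df.merge(state_frames[state_code], on="uid", how="left")
        yield from merged.to_dict("records")

def load_state(state_code):
    # Hypothetical helper: eagerly read one state's zipped CSV into pandas.
    return state_code, pd.read_csv(f"state-data/{state_code}.zip", compression="zip")

with beam.Pipeline() as p:
    states = ["AL", "AK"]  # abbreviated; the real list has 51 entries
    state_frames = (
        p
        | "States" >> beam.Create(states)
        | "LoadState" >> beam.Map(load_state)
    )
    county_paths = p | "CountyPaths" >> beam.Create(
        ["county-data/AL/COUNTY1.csv", "county-data/AK/COUNTY1.csv"]
    )
    merged = county_paths | "Merge" >> beam.ParDo(
        MergeCountyWithState(), state_frames=beam.pvalue.AsDict(state_frames)
    )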

Size in MB of mnesia table

How do you read the output of :mnesia.info?
For example, I only have one table, some_table, and :mnesia.info returns this.
---> Processes holding locks <---
---> Processes waiting for locks <---
---> Participant transactions <---
---> Coordinator transactions <---
---> Uncertain transactions <---
---> Active tables <---
some_table: with 16020 records occupying 433455 words of mem
schema : with 2 records occupying 536 words of mem
===> System info in version "4.15.5", debug level = none <===
opt_disc. Directory "/home/ubuntu/project/Mnesia.nonode@nohost" is NOT used.
use fallback at restart = false
running db nodes = [nonode@nohost]
stopped db nodes = []
master node tables = []
remote = []
ram_copies = ['some_table',schema]
disc_copies = []
disc_only_copies = []
[{nonode@nohost,ram_copies}] = [schema,'some_table']
488017 transactions committed, 0 aborted, 0 restarted, 0 logged to disc
0 held locks, 0 in queue; 0 local transactions, 0 remote
0 transactions waits for other nodes: []
Also calling:
:mnesia.table_info("some_table", :size)
It returns 16020, which I think is the number of keys, but how can I get the memory usage?
First, you need mnesia:table_info(Table, memory) to obtain the number of words occupied by your table; in your example you are getting the number of items in the table, not the memory. To transform that value to MB, first use erlang:system_info(wordsize) to get the word size in bytes for your machine architecture (on a 32-bit system a word is 4 bytes and on 64-bit it's 8 bytes), multiply it by your Mnesia table's memory to obtain the size in bytes, and finally convert that value to megabytes like so:
MnesiaMemoryMB = (mnesia:table_info("some_table", memory) * erlang:system_info(wordsize)) / (1024*1024).
You can use erlang:system_info(wordsize) to get the word size in bytes; on a 32-bit system a word is 32 bits or 4 bytes, on 64-bit it's 8 bytes. So your table is using 433455 * wordsize bytes (with an 8-byte word size, that's 3,467,640 bytes, or roughly 3.3 MB).

How can I read a directory on iso9660 from the path table when the table does not include size?

According to the spec for the structure of ISO 9660 / ECMA-119, the path table contains a record for each directory, including the location of its starting sector and its name, but not its size. I can find the directory entry, but I don't know how many sectors (normally 2048 bytes each) it occupies. Is it one? Two? Six?
If I "walk the directory tree", each directory entry includes the referenced location and size, so I can know how many bytes (essentially, how many sectors, since a directory must use entire sectors) to read. However, the path table only includes the starting location, and not the size, leaving me not knowing how many bytes to read.
In an example iso I have (ubuntu-18.04.1-live-server-amd64.iso fwiw), the root directory entry in the primary volume descriptor shows:
Root Directory:
Directory Record Length: 34
Extended Attribute Length: 0
Location of Extent: 20 $00000014 00:00:20
Data Length: 2048 $00000800
Recording Date and Time: 23:39:04 07/25/2018 GMT 0
File Flags: $02 visible regular dir non-record no-perms single-extent
File Unit Size: 0
Interleave Gap Size: 0
Volume Sequence Number: 1
File Identifier: . (current directory)
Since it says the Data Length is 2048, I know to read just one sector.
However, the root directory entry in the path table shows:
Path Record Length: 10 $0A
Extended Attribute Length: 0 $00
Location of Extent: 20 $00000014 00:00:20
Parent Directory Number: 1 $0001
File Identifier: . (current directory)
It also points to sector 20, but doesn't tell me how many sectors it uses, leaving me guessing.
Yes, unused bytes in a sector should be all 0x00, so if I read in a sector, read records, and then come to one whose first byte (length) is 0x00, then I know I have reached the end of records, but that has three issues:
If that were the canonical way, why bother including size in the directory entry?
If it includes 2 or 3 sectors, it is more efficient for me to read them all at once than one at a time.
If I have a directory whose records precisely fill a sector, without some size attribute, I don't know if the next sector is supposed to be read as an entry, or if the directory ended here.
Basically, I know how to read the ordered path table to get the directory entry, but don't know how to use that to know how many sectors to read for the directory itself. I could, in theory, read the parent to get the entry for this directory to know the size, but that adds a seek and read and pretty much defeats the purpose of the path table.
Ah, I figured it out. Because a directory's extent always starts with a directory record for the directory itself, and the data length always occupies bytes 10-17 of that record (10-13 little-endian, 14-17 big-endian), you can just read bytes 10-17 from the beginning of the sector and get the size. Still not as efficient as putting it in the path table itself - no idea why they did not - but it works.
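As a concrete illustration of that (a minimal sketch, not from the original answer; it assumes 2048-byte logical sectors and a local .iso file):
import struct

SECTOR_SIZE = 2048  # logical block size assumed here; it is declared in the PVD

def directory_data_length(iso_path, extent_lba):
    """Read the first directory record of a directory's extent (the '.' entry)
    and return the directory's data length in bytes."""
    with open(iso_path, "rb") as f:
        f.seek(extent_lba * SECTOR_SIZE)
        record = f.read(SECTOR_SIZE)
    # Bytes 10-13: data length, little-endian (bytes 14-17 hold the same
    # value big-endian, per ECMA-119's both-byte encoding).
    return struct.unpack_from("<I", record, 10)[0]

# e.g. the root directory of the example image starts at sector 20:
# directory_data_length("ubuntu-18.04.1-live-server-amd64.iso", 20)  # -> 2048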

loading large files into hdfs using Flume (spool directory)

We copied a 150 MB CSV file into Flume's spool directory. When it is loaded into HDFS, the file gets split into much smaller files of around 80 KB each. Is there a way to load the file without it getting split into smaller files using Flume? The small files generate extra metadata in the namenode, so we need to avoid them.
My flume-ng configuration looks like this:
# Initialize agent's source, channel and sink
agent.sources = TwitterExampleDir
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
agent.sources.TwitterExampleDir.type = spooldir
agent.sources.TwitterExampleDir.spoolDir = /usr/local/flume/live
# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactionCapacity = 1000000
# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://info3s7:54310/spool5
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount=0
agent.sinks.flumeHDFS.hdfs.rollInterval=2000
agent.sinks.flumeHDFS.hdfs.rollSize = 0
agent.sinks.flumeHDFS.hdfs.batchSize =1000000
# never rollover based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# rollover file based on max time of 1 min
#agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
agent.sources.TwitterExampleDir.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
What you want is this:
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount = 0
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
agent.sinks.flumeHDFS.hdfs.rollSize = 10000000
agent.sinks.flumeHDFS.hdfs.batchSize = 10000
From the Flume documentation:
hdfs.rollSize: File size to trigger roll, in bytes (0: never roll based on file size)
In your example you use a rollInterval of 2000, which rolls the file over after 2000 seconds, resulting in small files.
Also note that batchSize reflects the number of events before the file is flushed to HDFS, not necessarily the number of events before the file is closed and a new one created. You'll want to set it to a value small enough not to time out while writing a large file, but large enough to avoid the overhead of many requests to HDFS.

SE 4.10 bcheck <filename>, SE 2.10 bcheck <filename.ext> and other bcheck anomalies

ISQL-SE 4.10.DD6 (DOS 6.22):
BCHECK C-ISAM B-tree Checker version 4.10.DD6
C-ISAM File: c:\dbfiles.dbs\*.*
ERROR: cannot open C-ISAM file
In SE 2.10 it worked with the wildcard *.* for all files, but in SE 4.10 it doesn't. I have an SQL script which my users periodically run to reorg the customer and transactions tables. Then I have a FIX.BAT DOS script [bcheck -y *.*] as a utility option for my users in case any tables get screwed up. Since users running the reorg will now increment the table version number (example: CUSTO102, 110, ...), I'm going to have to devise a way to strip the .DAT extensions from the .DBS dir and feed the names to BCHECK. Before, my reorg would always re-create a static CUSTOMER.DAT with CREATE TABLE customer IN "C:\DBFILES.DBS\CUSTOMER"; but that created the write-permission problem and I had to revert back to SE's default datafile journaling...
Before running BCHECK on CUSTO102, its .IDX file size was 22,089 bytes and its .DAT size was 882,832 bytes.
After running BCHECK on CUSTO102, its .IDX size increased to 122,561 bytes, and a new .IDY file of 88,430 bytes was created.
What's a .IDY file?
C:\DBFILES.DBS> bcheck -y CUSTO102
BCHECK C-ISAM B-tree Checker version 4.10.DD6
C-ISAM File: c:\dbfiles.dbs\CUSTO102
Checking dictionary and file sizes.
Index file node size = 512
Current C-ISAM index file node size = 512
Checking data file records.
Checking indexes and key descriptions.
Index 1 = unique key
0 index node(s) used -- 1 index b-tree level(s) used
Index 2 = duplicates (2,30,0)
42 index node(s) used -- 3 index b-tree level(s) used
Index 3 = unique key (32,5,0)
29 index node(s) used -- 2 index b-tree level(s) used
Index 4 = duplicates (242,4,2)
37 index node(s) used -- 2 index b-tree level(s) used
Index 5 = duplicates (241,1,0)
36 index node(s) used -- 2 index b-tree level(s) used
Index 6 = duplicates (46,4,2)
38 index node(s) used -- 2 index b-tree level(s) used
Checking data record and index node free lists.
ERROR: 177 missing index node pointer(s)
Fix index node free list ? yes
Recreating index node free list.
Recreating index 6.
Recreating index 5.
Recreating index 4.
Recreating index 3.
Recreating index 2.
Recreating index 1.
184 index node(s) used, 177 free -- 1083 data record(s) used, 0 free
The problem with the wildcards is more likely an issue with the command interpreter that was used to run bcheck than with bcheck itself. If you give bcheck a list of file names (such as 'abc def.dat def.idx'), then it will process the C-ISAM file pairs (abc.dat, abc.idx), (def.dat, def.idx) and (def.dat, def.idx) again. Since it complained about being unable to open 'c:\dbfiles.dbs\*.*', it means that the command interpreter did not expand the '*.*' bit, or there was nothing for it to expand into.
I expect that the '.IDY' file is an intermediate used while rebuilding the indexes for the table. I do not know why it was not cleaned up - maybe the process did not complete.
About sizes, I think your table has about 55,000 rows of size 368 bytes each (SE might say 367; the difference is the record status byte at the end, which is either '\0' for deleted or '\n' for current). The unique index on the CHAR(5) column (index 3) requires 9 bytes per entry, or about 56 keys per index node, for about 1000 index nodes. The duplicate indexes are harder to size; you need space for the key value plus a list of 4-byte numbers for the duplicates, all packed into 512-byte pages. The 22 KB index file was missing a lot of information. The revised index file is about the right size. Note that index 1 is the 'ROWID' index; it does not occupy any space. (Index 1 is also why although every table created by SE is stored in a C-ISAM file, not all C-ISAM files are necessarily compatible with SE.)
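For reference, the back-of-envelope arithmetic in that last paragraph, written out (a sketch using the answer's own estimates for row count and entry size, not measured values):
# Back-of-envelope index sizing for the unique CHAR(5) index (index 3).
NODE_SIZE = 512          # C-ISAM index node size reported by bcheck
ENTRY_SIZE = 9           # bytes per entry estimated for the unique CHAR(5) index
ROWS = 55_000            # the answer's rough row-count estimate

keys_per_node = NODE_SIZE // ENTRY_SIZE          # ~56 keys per index node
index_nodes = -(-ROWS // keys_per_node)          # ceiling division, ~1000 nodes
index_bytes = index_nodes * NODE_SIZE            # ~0.5 MB for that one index

print(keys_per_node, index_nodes, index_bytes)   # 56 983 503296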
