I understand that the fsimage is loaded into memory on startup and that any further transactions are added to the edit log rather than to the fsimage, for performance reasons.
The fsimage in memory gets refreshed when the namenode is restarted. For efficiency, the secondary namenode periodically performs a checkpoint to update the fsimage so that namenode recovery is faster. All of this is fine.
But one point which I fail to understand is this:
Let's say that a file already exists and the info about this file is in the fsimage in memory.
Now I move this file to a different location, which is recorded in the edit log.
Now when I try to list the old file path, it complains that it does not exist.
Does this mean that the namenode looks at the edit log as well, which would contradict the purpose of the fsimage in memory? Or how does it know that the file location has changed?
The answer is: by looking at the information in the edit logs; if the information is not available in the edit logs, the fsimage is consulted. The same question holds for the use case where we write a new file to HDFS. While your namenode is running, if you remove the fsimage file and try to read an HDFS file, it is still able to read it.
Removing the fsimage file from a running namenode will not cause issues with read/write operations. When we restart the namenode, there will be errors stating that the image file is not found.
Let me try to give some more explanation to help you out.
Hadoop looks at the fsimage file only on startup; if it is not there, the namenode does not come up and logs that the namenode needs to be formatted.
The hadoop namenode -format command creates the fsimage file (if edit logs are present). After namenode startup, file metadata is fetched from the edit logs (and if the information is not found in the edit logs, the fsimage file is searched). So the fsimage just works as a checkpoint where information was last saved. This is also one of the reasons the secondary namenode keeps syncing from the edit logs (after 1 hour / 1 million transactions), so that on startup not much needs to be synced since the last checkpoint.
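If you want to see for yourself what each file contains, you can dump them offline with the image and edits viewers (the paths and transaction ids below are only placeholders; look under your own dfs.namenode.name.dir):

hdfs oiv -p XML -i /data/hadoop-namenode-data-temp/current/fsimage_0000000000000000169 -o fsimage.xml
hdfs oev -p xml -i /data/hadoop-namenode-data-temp/current/edits_inprogress_0000000000000000170 -o edits.xml

The fsimage dump shows the namespace as of the last checkpoint, while the edits dump shows the transactions applied since then.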
If you turn on safe mode (command: hdfs dfsadmin -safemode enter) and then use saveNamespace (command: hdfs dfsadmin -saveNamespace), it will show log messages like the ones below.
2014-07-05 15:03:13,195 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Saving image file /data/hadoop-namenode-data-temp/current/fsimage.ckpt_0000000000000000169 using no compression
2014-07-05 15:03:13,205 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: Image file /data/hadoop-namenode-data-temp/current/fsimage.ckpt_0000000000000000169 of size 288 bytes saved in 0 seconds.
2014-07-05 15:03:13,213 INFO org.apache.hadoop.hdfs.server.namenode.NNStorageRetentionManager: Going to retain 2 images with txid >= 0
2014-07-05 15:03:13,237 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 170
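For reference, the full sequence that produces a checkpoint like the one above is (run as the HDFS superuser, and remember to leave safe mode afterwards):

hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave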
I'm kind of late to this question, but I think it's worth a clearer response.
If I got you right, you want to know: if the metadata is stored in the edit log, why, after deleting a file, does listing the old file path complain that it does not exist? And how does the namenode know that the file or directory has been deleted without reading the edit log?
This is covered exactly in chapter 11 of Hadoop: The Definitive Guide:
When a filesystem client performs a write operation (such as creating
or moving a file), the transaction is first recorded in the edit log.
The namenode also has an in-memory representation of the filesystem
metadata, which it updates after the edit log has been modified.
The in-memory metadata is used to serve read requests.
Having said that, the answer is simple: after updating the edit log, the namenode updates its in-memory representation. So when a read request is received, it knows that the file or directory has been deleted and will complain that it does not exist.
The entire file system namespace, including the "mapping of blocks to files" and file system properties, is stored in a file called the FsImage. Remember that the "mapping of blocks to files" is part of the FsImage. This is stored both in memory and on disk. Along with the FsImage, Hadoop also stores in memory the block-to-datanode mapping, built from block reports while the namenode is (re)started and periodically afterwards.

So when you move a file to a different location, this is tracked in the edit log on disk, and also, when a block report is sent by a datanode to the namenode, the namenode gets an up-to-date view of where the blocks are located on the cluster. That is why you will not be able to see the data at the old path: the block report has updated the "mapping of blocks to datanodes". But remember, the update has happened only in memory.

Now after a certain amount of time, either at checkpointing or when the namenode is restarted, the edit logs on disk, which already contain the updates you have made (in your case the movement of the file), get merged with the old FsImage on disk to create a new FsImage. This updated FsImage is then loaded into memory and the same process repeats.
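To tie this back to the original question, here is what the move scenario looks like from the client side (paths are made up for the example):

hdfs dfs -mkdir -p /data/old /data/new
hdfs dfs -put file1.txt /data/old/
hdfs dfs -mv /data/old/file1.txt /data/new/
hdfs dfs -ls /data/old/file1.txt     # complains that the path does not exist
hdfs dfs -ls /data/new/file1.txt     # shows the file at its new location

Both ls calls are answered from the namenode's in-memory metadata; neither one re-reads the edit log or the fsimage on disk.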
We have recently started using Duplicati for backup of some of our data systems. We run an ERP solution that uses Pervasive (v10).
When Duplicati begins its backup process, to the best of my understanding, it uses either the file date and/or the file byte size to determine what to back up.
The problem that I see with that solution is that some of the data is missing from the table. For example, the workorders module we are certain had new rows of data on the server (source machine) that were NOT copied over to the new file.
Last night we backed up our ERP platform then restored to a new location so as to do a compare of what was backed up during the evening against what the source machine had. We noticed that there are rows missing from one table in the restored backup, that are there in the source table.
The backup is being created from the data directory. We are NOT using the integrated backup that came with the ERP suite.
What I personally believe is happening is that the database isn't writing the data out to the table until the last client disconnects from the ERP software. Also, the byte size of the file with missing data and of the file on the source machine are the same, even though the source file holds more data.
Last week we did the same test that we did last night, and I noticed that when I closed the ERP suite, the file updated its modified stamp and the new rows were added to the table, but not before the client disconnected.
Can someone shed some light on why this is happening?
Are the data files open according to Pervasive when the backup occurs? If so, you should be using some sort of agent to close the files, or put them into Continuous Operation mode, or use Backup Agent.
From the docs:
Continuous Operations provides the ability to backup data files while
database applications are running and users are connected.
When Continuous Operation mode is started, a delta file (.^^^) is created and the original data file is 'closed' so backup programs can access the file and back it up.
Backup Agent puts a GUI front end to Continuous Operation mode but is only supported with PSQL v11 and newer.
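If you don't have Backup Agent, Continuous Operation mode can also be driven from the command line with the butil utility; a rough sketch (the list file and its path are placeholders, check the Pervasive docs for your version):

rem put the data files into Continuous Operation mode before the backup runs
butil -startbu @C:\backup\btrfiles.lst
rem ... let Duplicati back up the data directory here ...
rem take the files back out of Continuous Operation mode afterwards
butil -endbu @C:\backup\btrfiles.lst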
With Duplicati, you can set --disable-filetime-check=true to ignore the timestamps and sizes, and scan each file for changes.
This option is not active by default, because it may take a long time to fully read the file contents. For normal file operations, the OS should set the timestamp, but some applications, like TrueCrypt, will revert the timestamp.
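As a sketch of how that looks if you run backups through the command-line client (the storage URL and source path below are placeholders, not your actual job):

Duplicati.CommandLine.exe backup "file://D:\Backups\ERP" "C:\ERP\data" --disable-filetime-check=true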
I'm trying to move the logical logs out of the *.db folder and put them on another disk volume. However, I don't see any option in the neo4j config files that allows this. Is it possible to configure this?
My neo4j version is 3.2.1.
Thanks
No, it is - at this time - not possible to move the transaction logs to some other place. Note that while the term logs is technically correct, these files are essential to the integrity of the database (unlike a regular log it would be very unwise to delete them) and it is therefore logical that they live together with the actual datafiles.
Hope this helps,
Tom
See the file conf/neo4j.conf and the line:
#dbms.directories.logs=logs
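For example, to point it somewhere else, uncomment the line and set the path (the path below is just an example):

dbms.directories.logs=/mnt/other-volume/neo4j/logs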
I am trying to use a neo4j HA cluster (neo4j 2.0.1), but I get the error "No such log version: ..." after the database is copied from the master.
I deleted all *.log files, but that did not help.
Can you help me with this problem?
TIA.
"Log" in Neo4j refers to the write-ahead-log that the database uses to ensure durability and consistency. It is stored in files named something like nioneo_logical_log.vX, where X is a number. You should never delete these files manually, even if the database is turned off, this may lead to data loss. If you wish to limit the amount of disk space used by neo for logs, use this configuration:
keep_logical_logs=128M
The error you are seeing means that the database copied cannot perform an incremental update to catch up with the master, because the log files have been deleted. This may happen normally, if the copy is very old, or if you have keep_logical_logs set to false.
What you need to do to resolve this is to set keep_logical_logs to some value that makes sense to you, perhaps "keep_logical_logs=2 days" to keep logs up to two days old around. Once you've done that, you need to do a new full store copy on your slave, which can be triggered by shutting the slave off, deleting the database folder (data/graph.db), and starting the slave back up again.
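Roughly, the whole procedure looks like this (paths assume a default 2.0.x install; adjust to your layout):

# in conf/neo4j.properties on each instance
keep_logical_logs=2 days

# on the slave, force a full store copy from the master
./bin/neo4j stop
rm -rf data/graph.db
./bin/neo4j start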
Newer versions of Neo will do this automatically.
Could anyone tell me how to verify whether my data is reaching the specified location or not? I am able to issue the command successfully but unable to see the data. I am trying to move my data from the local disk to a file on the local disk itself. I am using the following configuration -
host : text("/home/hadoop/file1.txt") | agentSink("localhost",35853);
node2 : collectorSource(35853) | collectorSink("file:///home/hadoop/","file2.txt");
It's hard to tell exactly:
make sure port 35853 is open
I guess node2 is on the same machine as the agent (since you configured the agent with localhost)
I'd change the agentSink to console (temporarily) just to make sure the file reading is working properly; see the sketch after this list
use the apache-flume users mailing list :-) I found them very responsive and helpful
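For the temporary console test mentioned above, the source line would look something like this (same file path as in your config):

host : text("/home/hadoop/file1.txt") | console;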
Please explain the concept of the Ab Initio recovery file.
When an Ab Initio graph fails during execution, in which cases should we roll back the recovery file, and in which cases should we not?
Please provide links to any Ab Initio materials.
Thanks.
The only time you would want to use the recovery file (.rec) is when you are executing a multi-phase graph and at least one phase has completed. You can then use the .rec file to restart the graph from the most recently completed phase.
However, you should only use the .rec file if something external to the graph caused the failure. Examples of this are the network going down, a shared disk becoming unavailable, or something similar. If you have a bug in your code and that caused the failure, then you'll want to use m_rollback to remove both the .rec file and any intermediate files Ab Initio created, and start over.
Ab Initio does not publish their manuals, you will have to contact Ab Initio directly for materials.
m_rollback with the -d option will delete the job, its temporary files, and the recovery file after the rollback is successful.
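As a sketch, the command takes the recovery file of the failed job (the file name here is just an example):

m_rollback -d my_graph.rec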