Get directory size from WebHDFS?

I see that WebHDFS does not support directory size. In HDFS, I can use
hdfs dfs -du -s -h /my/directory
Is there a way to derive this from WebHDFS? I need to do this programmatically, not by viewing the page.

I think WebHDFS's GETCONTENTSUMMARY can provide you with this information. More information here: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Get_Content_Summary_of_a_Directory
Here is the schema for GETCONTENTSUMMARY: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#ContentSummary_JSON_Schema
You'll see that it has the field "spaceConsumed", which is the disk space consumed.
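For example, a quick sketch from the command line (the namenode host and port are placeholders; the default HTTP port is 50070 on Hadoop 2.x and 9870 on 3.x, and jq is only used here for convenience):
curl -s "http://namenode:9870/webhdfs/v1/my/directory?op=GETCONTENTSUMMARY"
# Returns JSON along these lines:
# {"ContentSummary":{"directoryCount":2,"fileCount":1,"length":24930,"quota":-1,"spaceConsumed":24930,"spaceQuota":-1}}
# Pull out just the two size fields:
curl -s "http://namenode:9870/webhdfs/v1/my/directory?op=GETCONTENTSUMMARY" | jq '.ContentSummary | {length, spaceConsumed}'
Note that "length" is the sum of the file lengths (the first column of hdfs dfs -du -s), while "spaceConsumed" also accounts for replication.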

Related

Optimize an image once and avoid doing it again

I have a large repository of images, mostly JPEG, which I'd like to optimize using a library like ImageMagick or a Linux CLI tool like jpegtran (as covered in JPG File Size Optimization - PHP, ImageMagick, & Google's Page Speed), but I don't want to have to track which ones have been optimized already, and I don't want to re-optimize every one again later. Is there some sort of flag I could easily add to the file that would make it easy to detect and skip the optimization? Preferably one that would stay with the file when backed up to other filesystems?
E.g.: a small piece of exif data, a filesystem flag, some harmless null bytes added at the end of the file, a tool that is already intelligent enough to do this itself, etc..
You could use "extended attributes", which are metadata stored in the filesystem. Read and write them with xattr:
# Read existing attributes
xattr -l image.png
# Set an optimised flag of your own invention/description
xattr -w optimised true image.png
# Read attributes again
xattr -l image.png
optimised: true
The presence of an extended attribute can be detected in a long listing - it is the # sign after the permissions:
-rw-r--r--# 1 mark staff 275 29 May 07:54 image.png
As you allude to in your comments, make sure that any backup programs you use honour the attributes before committing to this as a strategy. FAT-32 filesystems are notoriously poor at this sort of thing, though a tar file or similar may survive a trip to Windows-land and back.
As an alternative, just set a comment in the EXIF header - I have already covered that in this related answer...
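If you go the EXIF route instead, something along these lines with exiftool would do it (exiftool itself and the choice of the UserComment tag are my assumptions, not part of the original answer):
# Tag the file as optimised (exiftool keeps a backup named image.jpg_original unless -overwrite_original is given)
exiftool -EXIF:UserComment=optimised image.jpg
# Later, print just the value (empty output if the tag is absent)
exiftool -s -s -s -EXIF:UserComment image.jpg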
To piggyback off of Mark Setchell's answer: if you use xattr you'll most likely need to use one of the recognised attribute namespaces (user, trusted, security, or system), otherwise you're likely to get an "Operation not supported" error. Most of the documentation I could find refers to "setfattr", but the same namespace rules apply to "xattr" as well.
For example, using the user namespace:
# Set an optimised flag of your own invention/description
xattr -w user.optimised true image.png
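Putting the two answers together, a minimal sketch of the skip-if-already-flagged loop might look like this (the jpegtran invocation and the user.optimised attribute name are purely illustrative):
for f in *.jpg; do
  # xattr -p exits non-zero if the attribute is missing
  if xattr -p user.optimised "$f" >/dev/null 2>&1; then
    continue  # already optimised, skip it
  fi
  jpegtran -optimize -copy all -outfile "$f.tmp" "$f" && mv "$f.tmp" "$f"
  xattr -w user.optimised true "$f"
done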

How to replace a File Sink with a File Meta Sink in GRC?

I'm new to GRC. I want to modify the attached config to replace the "file sink" at the bottom of the image with a "file meta sink", so I can use a program to read the header and then the complex samples. I guess I might need a different stream adapter, but it is not clear which one or how.
Help please.
Thanks,
Don

How to change "max_kv_size_per_doc"?

I need to change this parameter. Default is 1048576 bytes. I want at least 2097152 bytes.
I know this is not recommended for a production environment, as I have already read in this forum.
I'm using Couchbase 4.5 with Docker.
I already changed the file "opt/couchbase/etc/couchdb/default.ini" and restarted the container, and later the service inside the container, but it doesn't take effect.
I also tried this:
curl -X POST http://Administrator:password@localhost:8091/diag/eval -d 'rpc:eval_everywhere(erlang, apply, [fun() -> couch_config:set("mapreduce", "max_kv_size_per_doc", "2097152") end, []]).'
But with no success.
During my tests, I changed the file "opt/couchbase/etc/couchdb/default.ini" to set a new value for the parameter "function_timeout", and that worked: I tested with a sleep inside a map function and the log showed a timeout error.
That means that when I restart Couchbase it picks up the new configuration.
But changing the parameter "max_kv_size_per_doc" makes no difference. Does anyone know why?
Any help would be appreciated.
Regards,
Angelo

Does speed of tar.gz file listing depend on tar size?

I am using the tf option of tar to list the contents of a tar.gz file. It is pretty large, ~1 GB. There are around 1000 files organized in a year/month/day directory structure.
The listing operation takes quite a bit of time. It seems like a listing should be fast. Can anyone enlighten me on the internals?
Thanks -
Take a look at Wikipedia, for example, to verify that each file inside the tar is preceded by a header. To verify all the files inside the tar, it is necessary to read the whole archive.
There is no "index" at the beginning of the tar describing its contents.
Tar has a simple file structure: if you want to list the files, you must parse the whole archive.
If you only want to find one file, you can stop processing as soon as you reach it, but you must be sure the archive contains only one version of that file. That is typical for compressed archives, because appending to them is unsupported.
For example, you can do something like this:
tar tvzf somefile.gz | grep 'something' | \
while read file; do foundfile="$file"; break; done
This loop will break early, so it does not read everything, only from the start of the archive up to the position of the matching file.
If you need to do something more with the list, save it to a temporary file. You can gzip that file to save space if needed:
tar tvzf somefile.gz|gzip >temporary_filelist.gz
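Later lookups can then hit the saved listing instead of decompressing the whole 1 GB archive again, for example (assuming zgrep is available; the pattern is just a placeholder for one of your year/month/day paths):
zgrep '2016/01/31' temporary_filelist.gz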

ack misses results (vs. grep)

I'm sure I'm misunderstanding something about ack's file/directory ignore defaults, but perhaps somebody could shed some light on this for me:
mbuck$ grep logout -R app/views/
Binary file app/views/shared/._header.html.erb.bak.swp matches
Binary file app/views/shared/._header.html.erb.swp matches
app/views/shared/_header.html.erb.bak: <%= link_to logout_text, logout_path, { :title => logout_text, :class => 'login-menuitem' } %>
mbuck$ ack logout app/views/
mbuck$
Whereas...
mbuck$ ack -u logout app/views/
Binary file app/views/shared/._header.html.erb.bak.swp matches
Binary file app/views/shared/._header.html.erb.swp matches
app/views/shared/_header.html.erb.bak
98:<%= link_to logout_text, logout_path, { :title => logout_text, :class => 'login-menuitem' } %>
Simply calling ack without options can't find the result within a .bak file, but calling with the --unrestricted option can find the result. As far as I can tell, though, ack does not ignore .bak files by default.
UPDATE
Thanks to the helpful comments below, here are the new contents of my ~/.ackrc:
--type-add=ruby=.haml,.rake
--type-add=css=.less
ack is peculiar in that it doesn't have a blacklist of file types to ignore, but rather a whitelist of file types that it will search in.
To quote from the man page:
With no file selections, ack-grep only searches files of types that it recognizes. If you have a file called foo.wango, and ack-grep doesn't know what a .wango file is, ack-grep won't search it.
(Note that I'm using Ubuntu where the binary is called ack-grep due to a naming conflict)
ack --help-types will show a list of types your ack installation supports.
If you are ever confused about what files ack will be searching, simply add the -f option. It will list all the files that it finds to be searchable.
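For example, for the case in the question:
# List the files ack considers searchable under app/views/
# (the .bak and .swp files will not appear here)
ack -f app/views/
# Search everything regardless of type
ack -u logout app/views/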
ack --man states:
If you want ack to search every file, even ones that it always ignores like coredumps and backup files, use the "-u" switch.
and
Why does ack ignore unknown files by default? ack is designed by a programmer, for programmers, for searching large trees of code. Most codebases have a lot of files in them which aren't source files (like compiled object files, source control metadata, etc), and grep wastes a lot of time searching through all of those as well and returning matches from those files.
That's why ack's behavior of not searching things it doesn't recognize is one of its greatest strengths: the speed you get from only searching the things that you want to be looking at.
EDIT: Also, if you look at the source code, .bak files are ignored by default.
Instead of wrestling with ack, you could just use plain old grep, from 1973. Because it uses an explicit blacklist of files rather than a whitelist of file types, it never omits correct results. Given a couple of lines of config (which I created in my home directory 'dotfiles' repo back in the 1990s), grep actually matches or surpasses many of ack's claimed advantages, in particular speed: when searching the same set of files, grep is faster than ack.
The grep config that makes me happy looks like this, in my .bashrc:
# Custom 'grep' behaviour
# Search recursively
# Ignore binary files
# Output in pretty colors
# Exclude a bunch of files and directories by name
# (this both prevents false positives, and speeds it up)
function grp {
    grep -rI --color --exclude-dir=node_modules --exclude-dir=.bzr --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=build --exclude-dir=dist --exclude-dir=.tox --exclude=tags "$@"
}
function grpy {
    grp --include='*.py' "$@"
}
The exact list of files and directories to ignore will probably differ for you: I'm mostly a Python dev and these settings work for me.
It's also easy to add sub-customisations, as I show with my 'grpy', which I use to grep Python source.
Defining bash functions like this is preferable to setting GREP_OPTIONS, which will cause ALL executions of grep from your login shell to behave differently, including those invoked by programs you have run. Those programs will probably barf on the unexpectedly different behaviour of grep.
My new functions, 'grp' and 'grpy', deliberately don't shadow 'grep', so that I can still use the original behaviour any time I need that.
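Usage then mirrors plain grep, for example (some_function is just a placeholder pattern):
# After sourcing ~/.bashrc (or opening a new shell)
grp logout app/views/
grpy some_function .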
