WebHDFS: open and read a file - webhdfs

I'm trying to automate a process that imports multiple files from an HDFS folder. Is there a way to get multiple files in one command, instead of fetching the JSON listing and then downloading each file in a separate iteration?
In the command below, is there a way to put a wildcard in place of the actual file name (the <PATH> component), so I can get multiple files in a single read?
curl -i -L "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN
[&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]"
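For what it's worth, the WebHDFS OPEN operation takes a single file path and, as far as I know, does not accept wildcards. A common workaround is to list the directory once with op=LISTSTATUS and loop over the returned names. The sketch below assumes a host, port, directory, and the jq tool, none of which come from the question:

# Sketch: list the directory once, then OPEN each file.
# HOST, PORT, DIR and the use of jq are assumptions for illustration.
HOST=namenode.example.com; PORT=9870; DIR=/data/incoming
curl -s "http://$HOST:$PORT/webhdfs/v1$DIR?op=LISTSTATUS" \
  | jq -r '.FileStatuses.FileStatus[].pathSuffix' \
  | while read -r name; do
      curl -s -L "http://$HOST:$PORT/webhdfs/v1$DIR/$name?op=OPEN" -o "$name"
    done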

Related

Exclude a directory from `podman/docker export` stream and save to a file

I have a container that I want to export as a .tar file. So far I have used podman run with a tar --exclude=/dir1 --exclude=/dir2 … command that writes to a file on a bind-mounted host directory. But recently this has been giving me tar: .: file changed as we read it errors, which podman/docker export would avoid. Besides, I suppose the export is more efficient. So I'm trying to migrate to using export, but the major obstacle is that I can't find a way to exclude paths from the tar stream.
If possible, I'd like to avoid modifying a tar archive already saved on disk, and instead modify the stream before it gets saved to a file.
I've been banging my head for multiple hours, trying useless advice from ChatGPT, looking at cpio, and attempting to pipe podman export into a tar --exclude … command. With the last approach I did have some small success at one point, but couldn't get tar to save the result to a specifically named file.
Any suggestions?
(note: I don't distinguish between docker and podman here, as their export commands behave the same, and it's useful for searchability)
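One approach worth sketching (an assumption on my part, not something from the question): bsdtar from libarchive can rebuild a tar stream read from stdin while applying exclusions, so the export never has to touch disk first. The container name and directory names below are placeholders:

# Sketch: rewrite the export stream, dropping excluded paths; adjust the
# patterns to match the entry names in your archive (usually no leading slash).
podman export mycontainer \
  | bsdtar -cf filtered.tar --exclude 'dir1' --exclude 'dir1/*' \
           --exclude 'dir2' --exclude 'dir2/*' @-

GNU tar's --delete reportedly can also act as a stdin-to-stdout filter (e.g. podman export mycontainer | tar -f - --delete dir1 dir2 > filtered.tar), but I'd verify that against your tar version before relying on it.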

Nitrogen - File upload directly to database

In the Nitrogen Web framework, uploaded files always end up in the ./scratch/ directory when using #upload{}. From there you are supposed to manage the uploaded files, for example by copying them to their final destination directory.
However, if the destination is a database, is there a way of uploading these files straight to the database? The use case is Riak KV.
You can upload a file to Riak KV using an HTTP POST request. You can see the details in the Creating Objects documentation, which shows how to do it using curl.
To send the contents of a file instead of a value, something like this should work:
curl -XPOST http://127.0.0.1:8098/types/default/buckets/scratch/keys/newkey \
     -d @path/to/scratch.file \
     -H "Content-Type: application/octet-stream"

Making an instance of GraphDB on Docker

I am trying to create an instance of GraphDB on Docker. After creating the instance, I need to create a repository so I can import data into it. However, after I create a repository, I'm told that it does not exist: when I use the loadrdf command to import data, I receive an error saying the repository does not exist.
dist/bin/loadrdf -f -i repo-test -m parallel /opt/graphdb/home/data/*.ttl
The default data location of GraphDB is the data sub-directory of GraphDB's home directory, which in turn defaults to the distribution directory.
For the docker image this is /opt/graphdb/dist, so the default data directory is /opt/graphdb/dist/data.
But also in the docker image the default home is changed to /opt/graphdb/home, so the data directory becomes /opt/graphdb/home/data. This is done by passing the -Dgraphdb.home=/opt/graphdb/home java option when starting GraphDB.
So, when you created your repository it was created at /opt/graphdb/home/data/repositories/repo-test.
Your problem is that the loadrdf tool doesn't know about the changed home directory.
To overcome this, try exporting the GDB_JAVA_OPTS variable with the value -Dgraphdb.home=/opt/graphdb/home before running loadrdf, or as a one-liner:
GDB_JAVA_OPTS='-Dgraphdb.home=/opt/graphdb/home' ./dist/bin/loadrdf -f -i repo-test -m parallel /opt/graphdb/home/data/*.ttl
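For clarity, the same thing split into an export followed by the load command (a sketch; it assumes you're running a shell inside the container, e.g. via docker exec, from the same working directory as the one-liner above):

# Sketch: same option as the one-liner, set via export first;
# paths and repository id follow the question.
export GDB_JAVA_OPTS='-Dgraphdb.home=/opt/graphdb/home'
./dist/bin/loadrdf -f -i repo-test -m parallel /opt/graphdb/home/data/*.ttl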

Download google sheets file as csv to cpanel using cron task

I have a specific task to accomplish which involves downloading a file from Google Sheets. I need to always have just one downloaded file, so the new file should overwrite any previous one (if it exists).
I have tried the following command but I can't quite get it to work. Not sure what's missing.
/usr/local/bin/php -q https://docs.google.com/spreadsheets/d/11rFK_fQPgIcMdOTj6KNLrl7pNrwAnYhjp3nIrctPosg/ -o /usr/local/bin/php /home/username/public_html/wp-content/uploads/wpallimport/files.csv
Managed to solve with the following:
curl --user-agent cPanel-Cron https://docs.google.com/spreadsheets/d/[...]/edit?usp=sharing --output /home/username/public_html/wp-content/uploads/wpallimport/files/file.csv
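One thing worth noting (not part of the original solution): fetching the /edit?usp=sharing URL may return the HTML editor page rather than CSV. Google Sheets also exposes a CSV export endpoint, which a sketch like the following would use; <SHEET_ID> and gid=0 are placeholders:

# Sketch: download via the CSV export endpoint instead of the edit page.
curl --user-agent cPanel-Cron \
  "https://docs.google.com/spreadsheets/d/<SHEET_ID>/export?format=csv&gid=0" \
  --output /home/username/public_html/wp-content/uploads/wpallimport/files/file.csv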

GSUTIL not re-uploading a file that has already been uploaded earlier that day

I'm running GSUTIL v3.42 from a Windows CMD script on a Windows server 2008 R2 using Python 2.7.6. Files to be uploaded arrive in an "outgoing" directory and are uploaded in parallel by GSUTIL to an "incoming" bucket. The script requests a listing of the "incoming" bucket after uploading has finished and then compares the files listed with those it attempted to upload, in order to detect any upload failures. Another separate script moves files from the "incoming" bucket to a "processed" bucket afterwards.
If I attempt to upload the identical file (same name/size/content/date etc.) a second time, it doesn't upload, although I get no errors and nothing in my logging to indicate failure. I am not using the "no clobber" option, so I would expect gsutil to just upload the file.
In the scenario below, assume that the file has been successfully uploaded and moved to the "processed" bucket already on that day. In case timings matter, the second upload is being attempted within half an hour of the first.
1. File A arrives in the "outgoing" directory.
2. I get a file listing of "outgoing" and write it to dirListing.txt.
3. I perform a GSUTIL upload using
   type dirListing.txt | python gsutil -m cp -I -L myGsutilLogFile.txt gs://myIncomingBucket
4. I then perform a GSUTIL listing
   python gsutil ls -l -h gs://myIncomingBucket > bucketListing.txt
5. I file-match dirListing.txt and bucketListing.txt to detect mismatches and hence upload failures.
On the second run, File A isn't being uploaded in step 3 and consequently isn't returned in step 4, causing a mismatch in step 5. [I've checked the content of all of the relevant files and it's definitely in dirListing.txt and not in bucketListing.txt]
I need the ability to re-process a file in case the separate script that moves the file from the "incoming" to the "processed" bucket fails for some reason or doesn't do what it should do. I have to upload in parallel because there are normally hundreds of files on each run.
Is what I've described above expected behaviour from GSUTIL? (I haven't seen anything in the documentation that suggests this.) If so, is there any way of forcing GSUTIL to re-attempt the upload? Or am I missing something obvious? I have debug output from GSUTIL if that's necessary/useful.
From the above, it looks like you're uploading using "-L" to log to a manifest file. If you're using the same manifest file, and the file has already been uploaded once, then gsutil will not try to re-upload the file. From the docs on "-L" in "gsutil help cp":
If the log file already exists, gsutil will use the file as an
input to the copy process, and will also append log items to the
existing file. Files/objects that are marked in the existing log
file as having been successfully copied (or skipped) will be
ignored.
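If that's the cause, one way to keep the parallel upload but still allow re-uploads (a sketch, not something from the answer above) is to delete or rotate the manifest file at the start of each run of the CMD script, so gsutil has no record of earlier copies:

rem Sketch: clear the manifest before each run; file and bucket names
rem are taken from the question.
if exist myGsutilLogFile.txt del myGsutilLogFile.txt
type dirListing.txt | python gsutil -m cp -I -L myGsutilLogFile.txt gs://myIncomingBucket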
