hdfs dfs -ls webhdfs:// not listing all files - webhdfs

When I run the following command it lists 91 files
hdfs dfs -ls /data
But when I use the following command it returns only 89 files
hdfs dfs -ls webhdfs://x.x.x.x:14000/data
What could be the reason?

I haven't seen this issue in the last 3 years, and I use this command almost every day. Can you paste the names of the files that are missing from the webhdfs listing? Also, which user is listing the files in each case, and does that user have sufficient privileges to read the files?
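To narrow it down, it can also help to diff the two listings and see exactly which entries differ. A rough sketch, assuming the same host/port as above and that both commands run as the same user:
# keep only the file name (last path component) from each listing so the two outputs are comparable
hdfs dfs -ls /data | awk '{print $NF}' | awk -F/ '{print $NF}' | sort > native.txt
hdfs dfs -ls webhdfs://x.x.x.x:14000/data | awk '{print $NF}' | awk -F/ '{print $NF}' | sort > webhdfs.txt
diff native.txt webhdfs.txt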

Related

Web hdfs open and read a file

I'm trying to automate a process that imports multiple files from an HDFS folder. Is there a way to get multiple files in one command, instead of fetching the JSON listing and downloading each file in every iteration?
In the command below, is there a way to put a wildcard in the path instead of the actual file name, so I can get multiple files in a single read?
curl -i -L "http://:/webhdfs/v1/?op=OPEN
[&offset=][&length=][&buffersize=]"
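As far as I know, op=OPEN takes a single path, so one workaround for the loop described above is to list the directory once with op=LISTSTATUS and then fetch each entry. A minimal sketch, assuming a hypothetical namenode host/port, the /data directory, and that jq is available:
HOST=namenode.example.com
PORT=50070
DIR=/data
# list the directory once, pull the file names out of the JSON, then OPEN each one
curl -s "http://$HOST:$PORT/webhdfs/v1$DIR?op=LISTSTATUS" \
  | jq -r '.FileStatuses.FileStatus[] | select(.type=="FILE") | .pathSuffix' \
  | while read -r name; do
      curl -s -L -o "$name" "http://$HOST:$PORT/webhdfs/v1$DIR/$name?op=OPEN"
    done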

Import data into Cassandra using Docker on Windows

Hi, I am trying to load data into Cassandra running in Docker. Unfortunately, I can't make it work. I am pretty sure the path is correct, as I copied and pasted it directly from the file's properties. Is there any alternative way to solve this?
P.S. I am using Windows 11 and the latest Cassandra 4.1.
cqlsh:cds502_project> COPY data (id)
... FROM 'D:\USM\Data Science\CDS 502 Big data storage and management\Assignment\Project\forest area by state.csv'
... WITH HEADER = TRUE;
Using 7 child processes
Starting copy of cds502_project.data with columns [id].
Failed to import 0 rows: OSError - Can't open 'D:\\USM\\Data Science\\CDS 502 Big data storage and management\\Assignment\\Project\\forest area by state.csv' for reading: no matching file found, given up after 1 attempts
Processed: 0 rows; Rate: 0 rows/s; Avg. rate: 0 rows/s
0 rows imported from 0 files in 0.246 seconds (0 skipped).
Above are my code and the result. I followed https://www.geeksforgeeks.org/export-and-import-data-in-cassandra/ exactly, and it works when I create the data inside Docker, export it, and re-import it, but it does not work when I use external data.
I also noticed that the CSV file I exported using Cassandra in Docker is missing on my laptop but can be accessed by Docker.
The behaviour you are observing is what is expected from Docker. As far as I know, there are cp commands in Kubernetes which copy data from outside to inside a container and vice versa. You can either check those commands to move the data into or out of the container, or, alternatively, push your CSV into the container by building it into a Docker image.
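The Docker counterpart of that is docker cp, which copies a file from the host into a running container. A minimal sketch, assuming a hypothetical container name cassandra and the path from the question:
docker cp "D:\USM\Data Science\CDS 502 Big data storage and management\Assignment\Project\forest area by state.csv" cassandra:/tmp/forest.csv
# then, inside cqlsh, point COPY at the in-container path:
# COPY data (id) FROM '/tmp/forest.csv' WITH HEADER = TRUE;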
You need to leverage a Docker bind mount so that local files are accessible within your container: docker run -it -v <host_path>:<container_path> ...
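A minimal sketch of that, assuming a hypothetical container name and mount point, with the host folder taken from the question:
docker run -it --name cassandra -v "D:\USM\Data Science\CDS 502 Big data storage and management\Assignment\Project":/import cassandra:4.1
# inside cqlsh, the file is then visible under the mount point:
# COPY data (id) FROM '/import/forest area by state.csv' WITH HEADER = TRUE;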
See references below:
https://www.digitalocean.com/community/tutorials/how-to-share-data-between-the-docker-container-and-the-host
https://www.docker.com/blog/file-sharing-with-docker-desktop/

Is there any way to make a COPY command on Dockerfile silent or less verbose?

We have a Dockerfile which copies the folder to another folder.
Something like
COPY . /mycode
But the problem is that there are tons of files in the generated code, and the step produces 10K+ lines in the Jenkins log where we run the CI/CD pipeline:
copying /bla/blah/bla to /bla/blah/bla, repeated 10k times.
Is there a way to make this COPY less verbose or silent? The Jenkins admin has already warned us that our log file is nearing the maximum size.
You can tar/zip the files on the host, so there's only one file to copy. Then untar/unzip after it's been copied and direct the output of untar/unzip to /dev/null.
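A hedged sketch of that approach, assuming the Jenkins job builds a hypothetical code.tar.gz from the build context first (on the Jenkins agent, before the docker build):
tar --exclude=./code.tar.gz -czf code.tar.gz .
and then in the Dockerfile, copy the single archive and unpack it quietly:
COPY code.tar.gz /tmp/code.tar.gz
RUN mkdir -p /mycode && tar xzf /tmp/code.tar.gz -C /mycode > /dev/null 2>&1 && rm /tmp/code.tar.gz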
By definition, the docker cp command doesn't have any "silent" switches, but perhaps redirecting the output may help. Have you tried:
docker cp -a CONTAINER:SRC_PATH DEST_PATH|- &> /dev/null
I know it's not the most elegant, but if what you seek is to suppress the console output, that may help.
Sources
[1] https://docs.docker.com/engine/reference/commandline/cp/
[2] https://csatlas.com/bash-redirect-stdout-stderr/

Why does it show "File not found" when I am trying to run a command from a Dockerfile to find and remove specific logs?

I have a Dockerfile which contains the command below.
#Kafka log cleanup for log files older than 7 days
RUN find /opt/kafka/logs -name "*.log.*" -type f -mtime +7 -exec rm {} \;
While executing, it gives an error that /opt/kafka/logs is not found, but I can access that directory. Any help on this is appreciated. Thank you.
Changing the contents of a directory defined with VOLUME in your Dockerfile using a RUN step will not work. The temporary container is started with an anonymous volume, and only changes to the container filesystem are saved to the image layer, not changes to the volume.
The RUN step, along with every other step in the Dockerfile, is used to build the image, and this image is the input to the container. The build does not use your running containers or volumes as input, so it makes no sense to clean up files that are not created as part of your image build.
If you do delete files created during your image build, make sure the delete happens within the same RUN step that created them. Otherwise, the files have already been written to an image layer and are still transferred and stored on disk; they are just not visible in containers based on the layer that includes the delete step.
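A minimal Dockerfile sketch of that last point, assuming a hypothetical /tmp/kafka.tgz copied in by an earlier step: a file created and deleted within the same RUN step never reaches an image layer, whereas splitting the two across steps leaves it stored in the earlier layer.
RUN tar xzf /tmp/kafka.tgz -C /opt && rm /tmp/kafka.tgz
# by contrast, splitting it into two steps keeps /tmp/kafka.tgz in the first layer:
# RUN tar xzf /tmp/kafka.tgz -C /opt
# RUN rm /tmp/kafka.tgz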

Heroku Bash: ls command not displaying .dat files

I am storing some .dat files in the public folder of my Rails app on Heroku. However, I cannot see those files from the Heroku bash.
Notice the "total 4" in the ls -l result
~/public/files $ ls
README.txt
~/public/files $ ls -a
. .. README.txt
~/public/files $ ls -l
total 4
-rw------- 1 u5517 5517 25 2013-04-13 23:35 README.txt
So I know they are there; they are just not being shown. I need to be able to look at them to verify my app is working correctly.
Thanks.
Heroku dynos have ephemeral file systems, so if you are writing files in your web process, they will not be available to other dynos, including one created from a heroku run bash session. Please also remember that dynos are restarted at least every 24 hours, so unless these are just temp files, it would be better to put them somewhere like S3 that has long-term persistence and can be accessed by all your dynos.
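A hedged sketch of the S3 route, assuming the AWS CLI is configured and using hypothetical bucket and file names (in a real app the upload would usually happen from the web process as the files are produced):
# upload a .dat file as soon as the app writes it
aws s3 cp public/files/output.dat s3://my-app-bucket/output.dat
# later, from any dyno or machine, pull it back down to inspect it
aws s3 cp s3://my-app-bucket/output.dat ./output.dat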
