How to prevent storing files I haven't imported/pinned into my node? - storage

I have just installed the IPFS Desktop app on my computer for the first time, gone to the Files section, and removed the two files that were pinned there. I didn't even understand why anything was pinned by default right after installation.
Then I just watched what would happen. After a few minutes I started to see spikes in network bandwidth, and the number of blocks and the storage size started to increase.
So, the questions are:
If I haven't even imported/pinned any files yet, why has the storage started to fill? I assume it was filling with someone else's files.
How can I prevent this and "seed" only the files/data I manually add to my IPFS node?
I'd like to "seed" my files in a read-only fashion, prevent the constant writes that wear out my SSD, and avoid unneeded network traffic.

IPFS caches things you access by default.
That cache is cleared during "garbage collection", which happens by default once every hour.
You can change this default behavior:
Reprovider.Strategy: "pinned" (https://docs.ipfs.io/how-to/configure-node/#reprovider)
Routing.Type: "dhtclient" (https://docs.ipfs.io/how-to/configure-node/#routing)
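For example, assuming the go-ipfs (Kubo) CLI that backs IPFS Desktop is on your PATH, those two settings can be applied like this (restart the node afterwards for them to take effect):

    ipfs config Reprovider.Strategy pinned   # only announce (reprovide) content you have pinned
    ipfs config Routing.Type dhtclient       # act as a DHT client only, not a DHT server
    ipfs config show                         # verify the resulting configuration

This is only a sketch of the configuration change the links above describe; it reduces announcement and DHT traffic, while garbage collection still handles clearing the cache.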

Related

In Perforce (P4V), I am trying to delete a workspace but getting an error that says the filesystem P4ROOT doesn't have enough space

The exact error message says: "The filesystem 'P4ROOT' has only 1.9G free, but the server configuration requires at least 2G available."
I am trying to delete a new workspace I made by accident, but I keep getting this error. P4V now won't let me do anything, including deleting the workspace that seems to be causing the issue. How do I fix this?
P4ROOT is on the server, so if you're connecting to a remote server, you need to contact the admin of that server and let them know that it's wedged. Your workspace specifically is not the problem, the overall lack of space on the server is. All that needs to happen to fix it is increasing the available disk space. (Deleting your workspace would free up a little space on the remote server by pruning the associated db entries, but those are very small compared to the depot files.)
The "requires 2G available" thing is because by default the server looks for an available 2GB of empty space before it starts any operation; that's to provide reasonable assurance that it won't run out of space completely during the operation, since actually hitting a hard limit can be hard to recover from (db tables might be in a partially-written state, etc).
If the admin wants to try fixing this by obliterating large files (this is usually a pain and I'd recommend just throwing a bigger hard drive at the problem instead), they can temporarily lower that threshold to be able to run the obliterate, but I'd recommend bumping it back afterwards.
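If the admin does go that route, the threshold in question is (as far as I know) the filesys.P4ROOT.min configurable, so the temporary change would look roughly like this on the server:

    p4 configure set filesys.P4ROOT.min=1G    # temporarily lower the free-space threshold
    # ... run the obliterate / cleanup here ...
    p4 configure set filesys.P4ROOT.min=2G    # restore the previous threshold

The exact values are illustrative; the point is just to lower the limit long enough to free space and then put it back.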

Memory Monitoring in Apache Ignite

I am using Apache Ignite 2.8.0.
I am running my server without persistence.
I have some records in my cache, and the metrics show "totalAllocatedSize": 18869600.
Now I have cleared my cache, and it still shows the same "totalAllocatedSize": 18869600 (even though I no longer have any records in the cache).
Why does it show this? Since the cache is empty, I would expect it to report 0, but it still shows the value from when the cache had records in it.
Why does it behave like this, and how can I get the actual memory used right now?
Like many databases, Apache Ignite does not de-allocate memory it has already allocated. You can tell that the space is available again by the decreased fill factor metric.
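A minimal Java sketch of reading both numbers (assuming metrics are enabled on the data region; the region name and values here are illustrative):

    import java.util.Collection;

    import org.apache.ignite.DataRegionMetrics;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class RegionMetricsExample {
        public static void main(String[] args) {
            // Enable metrics on the default data region so the fill factor is reported.
            DataRegionConfiguration region = new DataRegionConfiguration()
                .setName("Default_Region")
                .setMetricsEnabled(true);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(
                    new DataStorageConfiguration().setDefaultDataRegionConfiguration(region));

            try (Ignite ignite = Ignition.start(cfg)) {
                Collection<DataRegionMetrics> metrics = ignite.dataRegionMetrics();
                for (DataRegionMetrics m : metrics) {
                    // totalAllocatedSize stays the same after a cache is cleared;
                    // pagesFillFactor drops, which shows the space freed inside those pages.
                    System.out.printf("%s: totalAllocatedSize=%d, pagesFillFactor=%.2f%n",
                        m.getName(), m.getTotalAllocatedSize(), m.getPagesFillFactor());
                }
            }
        }
    }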

Dataflow job takes too long to start

I'm running a job which reads roughly 70GB of compressed data.
In order to speed up processing, I tried to start the job with a large number of instances (500), but after 20 minutes of waiting it doesn't seem to have started processing the data (I have a counter for the number of records read). The reason for the large number of instances is that one of the steps needs to produce an output similar to an inner join, which results in a much bigger intermediate dataset for later steps.
What should be an average delay before the job is submitted and when it starts executing? Does it depend on the number of machines?
While I might have a bug that causes that behavior, I still wonder what that number/logic is.
Thanks,
G
The time necessary to start VMs on GCE grows with the number of VMs you start, and in general VM startup/shutdown performance can have high variance. 20 minutes would definitely be much higher than normal, but it is somewhere in the tail of the distribution we have been observing for similar sizes. This is a known pain point :(
To verify whether VM startup is actually at fault this time, you can look at Cloud Logs for your job ID, and see if there's any logging going on: if there is, then some VMs definitely started up. Additionally you can enable finer-grained logging by adding an argument to your main program:
--workerLogLevelOverrides=com.google.cloud.dataflow#DEBUG
This will cause workers to log detailed information, such as receiving and processing work items.
Meanwhile, I suggest enabling autoscaling instead of specifying a large number of instances manually - it should gradually scale to the appropriate number of VMs at the appropriate moment in the job's lifetime.
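For example (flag names assume the Java Dataflow SDK's standard pipeline options; the worker count is illustrative), autoscaling is requested with something like:

    --autoscalingAlgorithm=THROUGHPUT_BASED \
    --maxNumWorkers=500

instead of pinning --numWorkers=500 up front.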
Another possible (and probably more likely) explanation is that you are reading a compressed file that needs to be decompressed before it is processed. It is impossible to seek in the compressed file (since gzip doesn't support it directly), so even though you specify a large number of instances, only one instance is being used to read from the file.
The best way to solve this would be to split the single compressed file into many files that are compressed separately.
The best way to debug this problem would be to try it with a smaller compressed input and take a look at the logs.
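As a rough sketch (assuming a gzip-compressed, line-oriented input called input.gz), the split can be done with standard tools before uploading:

    # Decompress, cut into ~1,000,000-line chunks, then recompress each chunk separately.
    zcat input.gz | split -l 1000000 - part_
    gzip part_*

Each part_* .gz file can then be read by a different worker in parallel.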

How to avoid slow TokuMX startup?

We run a TokuMX replica-set (2 instances + arbiter) with about 120GB of data (on disk) and lots of indices.
Since the upgrade to TokuMX 2.0 we noticed that restarting the SECONDARY instance always took a very long time. The database kept getting stuck at STARTUP2 for 1h+, before switching to normal mode. While the server is at STARTUP2, it's running at a continuous CPU load - we assume it's rebuilding its indices, even though it was shut down properly before.
While this is annoying, with the PRIMARY being available it caused no downtime. But recently during an extended maintenance we needed to restart both instances.
We stopped the SECONDARY first, then the PRIMARY and started them in reverse order. But this resulted in both taking the full 1h+ startup-time and therefore the replica-set was not available for this time.
Not being able to restart a possibly downed replica-set without waiting that long is a risk we'd rather not take.
Is there a way to avoid the (possible) full index-rebuild on startup?
@Chris - We are revisiting your ticket now. It may have been inadvertently closed prematurely.
@Benjamin: You may want to post this on https://groups.google.com/forum/#!forum/tokumx-user where many more TokuMX users, and the Tokutek support team, will see it.
This is a bug in TokuMX, which is causing it to load and iterate the entire oplog on startup, even if the oplog has been mostly (or entirely) replicated already. I've located and fixed this issue in my local build of TokuMX. The pull request is here: https://github.com/Tokutek/mongo/pull/1230
This has reduced my node startup times from hours to <5 seconds.

Uploading files to EC2, first to an EBS volume, then moving to S3

http://farm8.staticflickr.com/7020/6702134377_cf70482470_z.jpg
OK, sorry for the terrible drawing, but it seemed a better way to organize my thoughts and convey them. I have been wrestling for a while with how to create an optimal, decoupled, easily scalable system for uploading files to a web app on AWS.
Uploading directly to S3 would work, except that the files need to be instantly accessible to the uploader for manipulation; once manipulated, they can go to S3, where they will be served to all instances.
I played with the idea of creating a SAN with something like GlusterFS, then uploading directly to that and serving from it. I have not ruled it out, but from various sources the reliability of this solution might be less than ideal (if anyone has better insight on this, I would love to hear it). In any case, I wanted to formulate a more "out of the box" (in the context of AWS) solution.
So, to elaborate on the diagram: I want the file to be uploaded to the local filesystem of whichever instance it happens to go to, which is an EBS volume. The storage location of the file would not be served to the public (e.g. /tmp/uploads/). It could still be accessed by the instance through a readfile() operation in PHP, so that the user could see and manipulate it right after uploading. Once the user is finished manipulating the file, a message to move it to S3 could be queued in SQS.
My question is: once I save the file "locally" on the instance (which could be any instance, due to the load balancer), how can I record which instance it is on (in the DB) so that subsequent requests through PHP to read or move the file will find it?
If anyone with more experience in this has some insight I would be very grateful. Thanks.
I have a suggestion for a different design that might solve your problem.
Why not always write the file to S3 first? And then copy it to the local EBS file system on whichever node you're on while you're working on it (I'm not quite sure what manipulations you need to do, but I'm hoping it doesn't matter). When you're finished modifying the file, simply write it back to S3 and delete it from the local EBS volume.
In this way, none of the nodes in your cluster need to know which of the others might have the file because the answer is it's always in S3. And by deleting the file locally, you get a fresh version of the file if another node updates it.
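A minimal sketch of that flow (the question is PHP-based, but the calls are the same idea in any SDK; shown here with the AWS SDK for Java, using a hypothetical bucket, key, and local path):

    import java.io.File;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    public class S3FirstWorkflow {
        public static void main(String[] args) {
            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

            String bucket = "my-upload-bucket";               // hypothetical bucket
            String key = "uploads/photo.jpg";                 // hypothetical object key
            File local = new File("/tmp/uploads/photo.jpg");  // scratch copy on the EBS volume

            // 1. The upload handler writes the original straight to S3 (the source of truth).
            s3.putObject(bucket, key, local);

            // 2. Whichever node needs the file pulls a fresh copy down to its local disk.
            s3.getObject(new GetObjectRequest(bucket, key), local);

            // ... manipulate the local copy here ...

            // 3. Write the result back to S3 and drop the local scratch copy.
            s3.putObject(bucket, key, local);
            local.delete();
        }
    }

The key property is that S3 is always the authoritative copy, so no node ever needs to know which other node touched the file last.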
Another thing you might consider, if it's too expensive to copy the file from S3 every time (it's too big, or you don't like the latency), is turning on session affinity in the load balancer (AWS calls it sticky sessions). This can be handled by your own cookie or by the ELB. Subsequent requests from the same browser then come to the same cluster node. Simply check the modified time of the file on the local EBS volume against the S3 copy, and refresh the local copy if the S3 version is newer. That way you get to take advantage of the local EBS filesystem while the file is being worked on.
Of course there are a bunch of things I don't get about your system. Apologies for that.

Resources