Dev / test infrastructure lack of - devops

As a developer during fast paced dev and test cycles do you have sufficient infrastructure (VMs, disk space...)to do all the things you want to do?
or are you challenged with infrastructure? i.e. lack of VMs, disk space etc.
We are struggling with this and just deleting all unnecessary data (including logs etc) so we have enough disk space.
I would like to know if this is an issue for other developers.

Related

Does containerization always lead to cpu, ram and storage cost savings as compared to VMs?

Being a naive in the world of containers, and after reading a lot of literature online, I was wondering if someone could render some guidance.
I wanted to know if containers always lead to cost savings in terms of cpu, memory and storage when compared with the same application running inside a VM.
I can think of a scenario when it won’t when the scaleset in case of VM running inside an orchestrator like kubernetes is a high number leading to more consumption of compute.
I was wondering what is the general understanding here
Containerization is not about cost savings in terms of CPU/RAM/Storage, but a lot more.
When an app gets deployed on a VM, you need to have specific tools like Ansible/Chef/Puppet to optimize deployments, and you also need additional tools to monitor the load to increase/decrease the number of VMs running, you also need additional tools to provide WideIP support across the running services in case of a REST API, and the list goes on.
With Containers running on Kubernetes, you have all these features built in to some extent, and when you deploy Servicemesh framework like Istio, you get additional features which add lot of value with minimum effort including Circuit Breakers, retries, authentication, etc.

How big can a GKE container image get before it's a problem?

This question is admittedly somewhat vague. If you have suggestions how to better word it, please by all means, give me feedback...
I want to understand how big a GKE container image can get before there may be problems, either serious or minor. For example, I've built a docker image (not deployed yet) that is 683 MB.
(As an aside, the reason it's so big is that I'm running a computer vision library licensed from a company with certain attributes: (1) uses native libraries that are not compatible with Alpine; (2) uses Java; (3) uses Node.js to run a required licensing daemon in same container; (4) has some very large machine learning model files.)
Although the service will have auto-scaling enabled, I expect the auto-scaling to be fairly light. It might add a new pod occasionally, but not major spikes up and down.
The size of the container will determine how many resources to assign it and thus how much CPU, memory and disk space your nodes.must have. I have seen containers require over 2 GB of memory and still work fine within the cluster.
There probably is an upper limit but the containers would have to be enormous, your container size should not pose any issues aside from possibly container startup
In practice, you're going to have issues pushing an image to GCR before you have issues running it on GKE, but there isn't a hard limit outside the storage capabilities of your nodes. You can get away with O(GB) pretty easily.

AWS server became slow after traffic increase

I have a single page Angular app that makes request to a Rails API service. Both are running on a t2xlarge Ubuntu instance. I am using a Postgres database.
We had increase in traffic, and my Rails API became slow. Sometimes, I get an error saying Passenger queue full for rails application.
Auto scaling on the server is working; three more instances are created. But I cannot trace this issue. I need root access to upgrade, which I do not have. Please help me with this.
As you mentioned that you are using T2.2xlarge instance type. Firstly I want to tell you should not use T2 instance type for production environment. Cause of T2 instance uses CPU Credit. Lets take a look on this
What happens if I use all of my credits?
If your instance uses all of its CPU credit balance, performance
remains at the baseline performance level. If your instance is running
low on credits, your instance’s CPU credit consumption (and therefore
CPU performance) is gradually lowered to the base performance level
over a 15-minute interval, so you will not experience a sharp
performance drop-off when your CPU credits are depleted. If your
instance consistently uses all of its CPU credit balance, we recommend
a larger T2 size or a fixed performance instance type such as M3 or
C3.
Im not sure you won't face to the out of CPU Credit problem because you are using Xlarge type but I think you should use other fixed performance instance types. So instance's performace maybe one part of your problem. You should use cloudwatch to monitor on 2 metrics: CPUCreditUsage and CPUCreditBalance to make sure the problem.
Secondly, how about your ASG? After scale-out, did your service become stable? If so, I think you do not care about this problem any more because ASG did what it's reponsibility.
Please check the following
If you are opening a connection to Database, make sure you close it.
If you are using jquery, bootstrap, datatables, or other css libraries, use the CDN links like
<link rel="stylesheet" ref="https://cdnjs.cloudflare.com/ajax/libs/bootstrap-select/1.12.4/css/bootstrap-select.min.css">
it will reduce a great amount of load on your server. do not copy the jquery or other external libraries on your own server when you can directly fetch it from other servers.
There are a number of factors that can cause an EC2 instance (or any system) to appear to run slowly.
CPU Usage. The higher the CPU usage the longer to process new threads and processes.
Free Memory. Your system needs free memory to process threads, create new processes, etc. How much free memory do you have?
Free Disk Space. Operating systems tend to thrash when the file systems on system drives run low on free disk space. How much free disk space do you have?
Network Bandwidth. What is the average bytes in / out for your
instance?
Database. Monitor connections, free memory, disk bandwidth, etc.
Amazon has CloudWatch which can provide you with monitoring for everything except for free disk space (you can add an agent to your instance for this metric). This will also help you quickly see what is happening with your instances.
Monitor your EC2 instances and your database.
You mention T2 instances. These are burstable CPUs which means that if you have consistenly higher CPU usage, then you will want to switch to fixed performance EC2 instances. CloudWatch should help you figure out what you need (CPU or Memory or Disk or Network performance).
This is totally independent of AWS Server. Looks like your software needs more juice (RAM, StorageIO, Network) and it is not sufficient with one machine. You need to evaluate the metric using cloudwatch and adjust software needs based on what is required for the software.
It could be memory leaks or processing leaks that may lead to this as well. You need to create clusters or server farm to handle the load.
Hope it helps.

Do I need to worry about corrupt memory in an otherwise correct program?

We're working on an application meant to run on an embedded system, in a moderately harsh environment (a controller for a heating system in a residential building).
That application should run for years without needing to reboot the system. It runs on an embedded PC running Linux. The program instantiates several classes whose lifetime is the same as the application's.
Should I worry about memory becoming corrupt over such a long lifetime? Does it make sense to periodically check the class invariants to detect any such memory corruption? Or does modern hardware make such corruption astronomically unlikely?
I have seen my share of cheap sd cards on boards, they can die on you easily.
Few months ago have been dealing with one maker, under high data throughput SD card was unable to react in time. Some irq failure messages pop up and whole partition blows up.
If it's not intended for mass production I would definitely suggest you to choose some good and recommended storage.
But really, I can not remember memory corruption issues(besides rom), I would worry about memory leaks. Those are the most nasty problems for embedded system intended to last long without reboot.
Have to be really careful, they can happen either in userspace or in kernel space. Even software which you have always had confidence in may have them, depending on the build version. Have to choose Linux distribution carefully, if there is no dedicated kernel development team usually this stuff is outsourced to companies which build stable systems, where every included package is tested and confirmed to not leak.
In the end, definitely a few cycles of stress testing are needed, if there are problems with memory you will notice.

Large Storage Solution

We are a small bootstrapped ISP in a third world country where bandwidths are usually expensive and slow. We recently got a customer who need storage solution, of 10s of TB of mostly video files (its a tv station). The thing is I know my way around linux but I have never done anything like this before. We have a backblaze 3 storage pod casing which we are thinking of using as a storage server. The Server will be connected to customer directly so its not gonna go through the internet, because 100+mbps speed is unheard off in this part of the world.
I was thinking of using 4TB HDD all formatted with ext4 and using LVM to make them one large volume (50-70tb at least). So customer logs in to an FTP like client and dumps whatever files he/she wants. But the customer only sees a single volume, and we can add space as his requirements increases. Of course this is just on papers from preliminary research as i don't have prior experience with this kind of system. Also I have to take cost in to consideration so can't go for any proprietary solution.
My questions are:
Is this the best way to handle this probably, are there equally good or better solutions out there?
For large storage solutions (at least large for me) what are my cost effective options when it comes to dealing with data corruption and HD failure.
Would love to hear any other solutions and tips you guys might have. thanks!
ZFS might be a good option but there is no native bug-free solution for Linux, yet. I would recommend other operating systems in that case.
Today I would recommend Linux MD raid5 on enterprise disks or raid6 on consumer/desktop disks. I would not assign more than 6 disks to an array. LVM can then be used to tie the arrays to a logical volume suitable for ext4.
The ext4-filesystem is well tested and stable while XFS might be better for large file storage. The downside to XFS is that it is not possible to shrink an XFS filesystem. I would prefer ext4 because of it's more flexible nature.
Please also take into consideration that backups are still required even if you are storing your data on raid-arrays. The data can silently corrupt or be accidentally deleted.
In the end, everything depends on what the customer wants. Telling the customer the price of the service usually has an effect on the requirements.
I would like to add to the answer that mingalsuo gave. As he stated, it really comes down to the customer requirements. You don't say what, specifically, the customer will do with this data. Is it for archive only? Will they be actively streaming the data? What is your budget for this project? These types of answers will better determine the proposed solution. Here are some options based on a great many assumptions. Maybe one of them will be a good fit for your project.
CAPACITY:
In this case, you are not that concerned about performance but more interested in capacity. In this case, the number of spindles don't really matter much. As Mingalsuo stated, put together a set of RAID-6 SATA arrays and use LVM to produce a large volume.
SMALL BUSINESS PERFORMANCE:
In this case, you need performance. The customer is going to store files but also requires the ability for a small number of simultaneous data streams. Here you want as many spindles as possible. For streaming, it does little good to focus on the size of the controller cache. Just focus on the number of spindles. You want as many as possible. Keep in mind that the time to rebuild a failed drive increases with the size of the drive. And, during a rebuild, your performance will suffer. For these reasons I'd suggest smaller drives. Maybe 1TB drives at most. This will provide you with faster rebuild times and more spindles for streaming.
ENTERPRISE PERFORMANCE:
Here you need high performance - similar to that that an enterprise demands. You require many simultaneous data streams and performance is required. In this case, I would stay away from SATA drives and use 900G or 1.2TB SAS drives instead. I would also suggest that you consider abstracting the storage layer from the server layer. Create a Linux server and use iSCSI (or fibre) to connect to the storage device. This will allow you to load balance if possible, or at the very least make recovery from disaster easier.
NON TRADITIONAL SOLUTIONS:
You stated that the environment has few high-speed connections to the internet. Again, depending on the requirements, you still might consider cloud storage. Hear me out :) Let's assume that the files will be uploaded today, used for the next week or month, and then rarely read. In this case, these files are sitting on (potentially) expensive disks for no reason except archive. Wouldn't it be better to keep those active files on expensive (local) disk until they "retire" and then move them to less expensive disk? There are solutions that do just that. One, for example, is called StorSimple. This is an appliance that contains SAS (and even flash) drives and uses cloud storage to automatically migrate "retired" data from the local storage to cloud storage. Because this data is retired it wouldn't matter if it took longer than normal to move it to the cloud. And, this appliance automatically pulls it back from the cloud to local storage when it is accessed. This solution might be too expensive for your project but there are similar ones that you might find will work for you. The added benefit of this is that your data is automatically backed up by the cloud provider and you have an unlimited supply of storage at your disposal.

Resources