Is it possible to destroy your computer by running an expensive program on a big data set? (heat)

This is my machine.
I'm running the perceptron algorithm on a very large data set.
I'm working in a hot room with no air conditioning.
It takes a long time to process this data, there are many operations involved, and my computer is getting very hot.
Is it possible to destroy my machine this way, or cause it serious harm?

It depends on the computer's configuration. Most modern hardware has overheat protection: it will throttle performance or shut down when it overheats. But be careful, you can still lose your calculations when that happens.
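
A cheap precaution, regardless of the hardware protection, is to checkpoint your progress and back off when the CPU runs hot. Below is a minimal sketch assuming Linux with readable sensors and the psutil package; train_one_epoch() and save_checkpoint() are hypothetical stand-ins for your own perceptron code.

    # Minimal sketch: checkpoint each pass and pause when the CPU reports a high temperature.
    # Assumes Linux with readable sensors and the psutil package; the training functions
    # below are hypothetical stand-ins for your own perceptron code.
    import time
    import psutil

    TEMP_LIMIT_C = 85  # assumed threshold; tune for your CPU

    def cpu_temp_c():
        temps = psutil.sensors_temperatures()
        readings = [t.current for entries in temps.values() for t in entries]
        return max(readings) if readings else None

    def train_one_epoch():
        pass  # hypothetical: one perceptron pass over the data set

    def save_checkpoint(epoch):
        pass  # hypothetical: persist the weights so a thermal shutdown loses little work

    for epoch in range(100):
        train_one_epoch()
        save_checkpoint(epoch)
        temp = cpu_temp_c()
        if temp is not None and temp > TEMP_LIMIT_C:
            time.sleep(60)  # let the machine cool down before the next pass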

Related

Do I need to worry about corrupt memory in an otherwise correct program?

We're working on an application meant to run on an embedded system, in a moderately harsh environment (a controller for a heating system in a residential building).
That application should run for years without needing to reboot the system. It runs on an embedded PC running Linux. The program instantiates several classes whose lifetime is the same as the application's.
Should I worry about memory becoming corrupt over such a long lifetime? Does it make sense to periodically check the class invariants to detect any such memory corruption? Or does modern hardware make such corruption astronomically unlikely?
I have seen my share of cheap SD cards on boards; they can die on you easily.
A few months ago I was dealing with one maker: under high data throughput the SD card was unable to react in time. Some IRQ failure messages popped up and the whole partition blew up.
If it's not intended for mass production, I would definitely suggest choosing good, recommended storage.
But really, I cannot remember memory corruption issues (besides ROM); I would worry about memory leaks instead. Those are the nastiest problems for an embedded system intended to run for a long time without a reboot.
You have to be really careful: leaks can happen either in userspace or in kernel space. Even software you have always had confidence in may have them, depending on the build version. You also have to choose your Linux distribution carefully; if there is no dedicated kernel development team, this work is usually outsourced to companies that build stable systems, where every included package is tested and confirmed not to leak.
In the end, a few cycles of stress testing are definitely needed; if there are memory problems, you will notice.
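
If you still want a cheap guard against both concerns, one option is a periodic self-check. The sketch below is illustrative only: it logs resident memory (via Linux /proc) so a slow leak shows up over time, and calls a hypothetical check_invariants() standing in for your classes' own consistency checks.

    # Minimal sketch: periodically log resident memory (RSS) and re-validate invariants
    # so a slow leak or corrupted state is noticed long before it becomes fatal.
    # check_invariants() is a hypothetical stand-in for your classes' own checks.
    import logging
    import time

    logging.basicConfig(level=logging.INFO)

    def rss_kb():
        # Linux-only: read the resident set size from /proc/self/status.
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
        return -1

    def check_invariants():
        return True  # hypothetical: validate your long-lived objects here

    def watchdog(period_s=3600):
        while True:
            if not check_invariants():
                logging.error("invariant violation detected; restarting is safer than continuing")
            logging.info("VmRSS = %d kB", rss_kb())
            time.sleep(period_s)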

TensorFlow scalability

I am using TensorFlow to train a DNN. My network structure is very simple, and each minibatch takes about 50 ms with only one parameter server and one worker. To process a huge number of samples, I am using distributed ASGD training. However, I found that increasing the worker count does not increase throughput: for example, 40 machines achieve 1.5 million samples per second, and after doubling both the parameter server machine count and the worker machine count, the cluster still only processes 1.5 million samples per second, or even fewer. The reason is that each step takes much longer when the cluster is larger. Does TensorFlow scale well, and is there any advice for speeding up training?
The general approach to solving these problems is to find where the bottlenecks are. You could be hitting a bottleneck in software or in your hardware.
A general example of doing the math: suppose you have 250M parameters and each backward pass takes 1 second. At 4 bytes per float32 parameter, each worker will be sending 1 GB/sec of data and receiving 1 GB/sec of data. If you have 40 machines, that's 80 GB/sec of transfer between the workers and the parameter servers. Suppose the parameter server machines only have 1 GB/sec full-duplex NICs. Then if you have fewer than 40 parameter server shards, NIC speed will be the bottleneck.
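
That arithmetic is easy to redo for your own model size; a short sketch, assuming float32 parameters as in the example above:

    # Back-of-the-envelope bandwidth check for the example above (float32 parameters).
    params = 250e6                # model parameters
    bytes_per_param = 4           # float32
    step_time_s = 1.0             # one backward pass per second

    per_worker_gb_s = params * bytes_per_param / step_time_s / 1e9  # ~1 GB/s each direction
    workers = 40
    total_gb_s = 2 * workers * per_worker_gb_s                      # send + receive
    nic_gb_s = 1.0                                                  # full-duplex NIC per PS machine
    min_ps_shards = workers * per_worker_gb_s / nic_gb_s            # shards needed to avoid a NIC bottleneck

    print(per_worker_gb_s, total_gb_s, min_ps_shards)               # 1.0 80.0 40.0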
After ruling that out, you should consider interconnect speed. You may have N network cards in your cluster, but the cluster most likely can't handle all network cards sending data to all other network cards at once. Can your cluster handle 80 GB/sec of data flowing between 80 machines? Google designs its own network hardware to handle its interconnect demands, so this is an important problem constraint.
Once you have checked that your network hardware can handle the load, I would check the software. For example, with a single worker, how does "time to send" scale with the number of parameter server shards? If the scaling is strongly sublinear, this suggests a bottleneck, perhaps some inefficient scheduling of threads or some such.
As an example of finding and fixing a software bottleneck, see the "grpc RecvTensor is slow" issue. That issue involved the gRPC layer becoming inefficient when you try to send messages larger than 100 MB. The fix landed in an upstream gRPC release but has not been integrated into a TensorFlow release yet, so the current work-around is to break messages into pieces of 100 MB or smaller, along the lines of the sketch below.
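
A minimal sketch of that chunking work-around, using plain NumPy; the 100 MB threshold comes from the issue above, and the function names are just illustrative:

    # Minimal sketch: split a large tensor into chunks of at most ~100 MB before sending,
    # then concatenate on the receiving side. Function names are illustrative only.
    import numpy as np

    MAX_BYTES = 100 * 1024 * 1024  # stay at or below the ~100 MB threshold from the gRPC issue

    def split_for_send(tensor):
        flat = tensor.ravel()
        items_per_chunk = max(1, MAX_BYTES // flat.itemsize)
        return [flat[i:i + items_per_chunk] for i in range(0, flat.size, items_per_chunk)]

    def reassemble(chunks, shape):
        return np.concatenate(chunks).reshape(shape)

    # Usage: a 300 MB float32 tensor becomes 3 chunks.
    big = np.zeros(75_000_000, dtype=np.float32)
    restored = reassemble(split_for_send(big), big.shape)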
The general approach to finding these is to write lots of benchmarks to validate your assumptions about the speed.
Here are some examples:
benchmark sending messages between workers (local)
sharded PS benchmark (local)

Mirrored queue performance factors

We operate two dual-node brokers, each broker having quite different queues and workloads. Each box has 24 cores (H/T) of Xeon E5645 @ 2.4 GHz with 48 GB RAM, connected by gigabit LAN with ~150 μs latency, running RHEL 5.6, RabbitMQ 3.1, and Erlang R16B with HiPE off. We've tried with HiPE on, but it made no noticeable performance impact and was very crash-prone.
We appear to have hit a ceiling for our message rates of between 1,000/s and 1,400/s, both in and out. This is broker-wide, not per-queue. Adding more consumers doesn't improve throughput overall, it just gives that particular queue a bigger slice of this apparent "pool" of resources.
Every queue is mirrored across the two nodes that make up the broker. Our publishers and consumers connect equally to both nodes in a persistent way. We notice an ADSL-like asymmetry in the rates too: if we manage to publish at a high message rate, the delivery rate drops to the high double digits. Testing with an unmirrored queue gives much higher throughput, as expected. Queues and exchanges are durable; messages are not persistent.
We'd like to know what we can do to improve the situation. The CPU on the box is fine: beam takes a core and a half for one process, then another 80% each of two cores for another couple of processes. The rest of the box is essentially idle. We are using ~20 GB of RAM in userland, with the system cache filling the rest. IO rates are fine. The network is fine.
Is there any Erlang/OTP tuning we can do? delegate_count is at its default of 16; could someone explain what this does in a bit more detail, please?
This is difficult to answer without knowing more about how your producers and consumers are configured, which client library you're using, and so on. As discussed on IRC (http://dev.rabbitmq.com/irclog/index.php?date=2013-05-22) a minute ago, I'd suggest you attempt to reproduce the topology using the MulticastMain Java load-test tool that ships with the RabbitMQ Java client. You can configure multiple producers/consumers, message sizes, and so on. I can certainly get 5 kHz out of a two-node cluster with HA on my desktop, so this may be a client (or application code) related issue.
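
If you'd rather script a quick reproduction than drive MulticastMain, here is a minimal publish-rate probe; it assumes the Python pika client and a broker reachable on localhost, and the queue name and message size are arbitrary choices:

    # Minimal publish-rate probe (Python pika client; broker assumed reachable on localhost).
    # Durable queue, non-persistent messages, matching the setup described in the question.
    import time
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="perf-test", durable=True)

    body = b"x" * 1024   # 1 kB messages (arbitrary)
    count = 50_000
    start = time.time()
    for _ in range(count):
        channel.basic_publish(
            exchange="",
            routing_key="perf-test",
            body=body,
            properties=pika.BasicProperties(delivery_mode=1),  # transient messages
        )
    elapsed = time.time() - start
    print(f"{count / elapsed:.0f} msg/s published")
    connection.close()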

MPI/Pthread program does not scale

I have an MPI/Pthreads program in which each MPI process runs on a separate compute node. Within each MPI process, a certain number of Pthreads (1-8) are launched. However, no matter how many Pthreads are launched within an MPI process, the overall performance is pretty much the same. I suspect all the Pthreads are running on the same CPU core. How can I assign threads to different CPU cores?
Each compute node has 8 cores (two quad-core Nehalem processors).
Open MPI 1.4
Linux x86_64
Questions like this are often dependent on the problem at hand. Most likely, you are running into a resource lock issue (where the threads are competing for a lock) -- this would look like only one core was doing any work, because only one thread can (effectively) do any work at any given time.
Setting CPU affinity for a particular thread is not a good solution. You should allow the OS scheduler to determine the optimal physical core assignment for a given pthread.
Look at your code and try to figure out where you are locking where you shouldn't be, or if you've come up with a correct parallel solution to the problem at hand. You should also test a version of the program using only pthreads (not MPI) and see if scaling is achieved.
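
If you want to confirm or rule out the "all threads on one core" suspicion before digging into locks, one Linux-only way is to look at which core each thread last ran on. A small sketch (pure Python, reading /proc; pass the PID of one of your MPI processes):

    # Minimal sketch (Linux-only): report which CPU core each thread of a process last ran on,
    # by reading /proc/<pid>/task/<tid>/stat (field 39 is the "processor" column).
    import os
    import sys

    def thread_cores(pid):
        cores = {}
        task_dir = f"/proc/{pid}/task"
        for tid in os.listdir(task_dir):
            with open(f"{task_dir}/{tid}/stat") as f:
                line = f.read()
            # The comm field may contain spaces, so split after the closing ')'.
            fields = line.rsplit(")", 1)[1].split()
            cores[int(tid)] = int(fields[39 - 3])  # fields[0] is field 3 of the stat line
        return cores

    if __name__ == "__main__":
        pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
        for tid, cpu in sorted(thread_cores(pid).items()):
            print(f"thread {tid}: last ran on core {cpu}")

If the reported cores are all the same under load, the problem is placement or serialization; if they differ, look harder at locking and at the parallel decomposition itself.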

localhost vs LAN: speed difference?

I'm currently in the process of performance profiling. We have a basic client/server application. Would the TCP transfer speed be different if I ran client/server on the same machine (localhost) vs across two computers on a LAN?
The TCP transfer speed will be different, because if you run both on the same computer the packets are forwarded locally over the loopback interface without ever touching the LAN or the network adapter.
But the overall speed of client plus server may be better on separate machines, especially if you do not communicate with the server very often.
When using localhost, local resources are more likely to be the performance bottleneck: memory, disk, CPU, etc. When using two computers, it's more likely the network will be the bottleneck: latency, bandwidth, throughput, packet loss, etc.
It depends on what your application does and how it uses the network, client, and server.
Yes, definitely: the latency of sending data across the network will slow the program down. The throughput won't, but if you're waiting for replies before sending more data, the extra latency builds up.
I just hit this issue on a project at work. Using UDP with localhost is at least an order of magnitude faster than over a network connection (maybe two orders of magnitude), and I believe with localhost there is no MTU ceiling of 1500 as there normally is for network ports.
One unconfirmed suspicion is that the built-in network ports on PCs are not all the same quality, so even if they claim to be gigabit, you may not really be able to go that fast. It may also be that making lots of Windows system calls (one OS call per packet) is a significant overhead. With TCP I can hand the OS a large chunk of data to write in a single call; with UDP I have to hand it one packet at a time, limited by the MTU size, resulting in a much larger number of OS calls. But this is unconfirmed as yet.
I have not tried Linux yet.
It really depends on what your application does....
As an example:
If it transfers 10GB files from client to server, then yes, it will make a difference.
I don't know if it is measurable (that also depends on your LAN's speed), but from a logical point of view, of course there is a difference. Localhost will always be the fastest, as the data does not have to be sent through another medium (like air or copper wire).
But depending on what your application does, this might or might not matter.
The transfer times would almost certainly be faster if the client and server were on the same machine. That may not actually matter to the performance of your program as a whole, depending on the other resources consumed by the client and server.
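
To put numbers on it for your own setup, here is a minimal throughput test you can point either at 127.0.0.1 or at a LAN host; plain Python sockets, with the port, chunk size, total transfer size, and the bench.py name all being arbitrary illustrative choices:

    # Minimal TCP throughput test: run "python bench.py server" on one end and
    # "python bench.py client <host>" on the other (use 127.0.0.1 for the localhost case).
    import socket
    import sys
    import time

    PORT = 50007
    CHUNK = 64 * 1024
    TOTAL_BYTES = 500 * 1024 * 1024  # 500 MB per run

    def server():
        with socket.create_server(("", PORT)) as srv:
            conn, _ = srv.accept()
            received = 0
            with conn:
                while True:
                    data = conn.recv(CHUNK)
                    if not data:
                        break
                    received += len(data)
            print(f"received {received / 1e6:.0f} MB")

    def client(host):
        payload = b"x" * CHUNK
        with socket.create_connection((host, PORT)) as sock:
            start = time.time()
            sent = 0
            while sent < TOTAL_BYTES:
                sock.sendall(payload)
                sent += len(payload)
            elapsed = time.time() - start
        print(f"{sent / elapsed / 1e6:.0f} MB/s sent")

    if __name__ == "__main__":
        if sys.argv[1] == "server":
            server()
        else:
            client(sys.argv[2])

Running the same script both ways makes the difference discussed above easy to see: localhost runs tend to be limited by memory copies and CPU, while LAN runs are limited by link speed and latency.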

Resources