Multiple hits to an API bringing server to its knees - ruby-on-rails

I am using an API (let's pretend it's Facebook) to gather data between two given dates. Because of API restrictions (as with most APIs) I can only grab so many results at a time, and therefore have to page my way through the results.
Here is my issue/question though: is it better to
get fewer results back and make more calls to the API, or
get more results back and make fewer calls to the API?
I am running a cloud server instance with 4 GB of RAM.
The data I'm looking at is in XML format and contains about 20k entries. Each entry contains roughly another 20 tags within it. Once completely pulled down, the data ends up being about 10 MB. My problem is that while my server is hitting the API and gathering this information, CPU and memory spike to nearly 100%. I've tried retrieving 500 at a time, 1,000 at a time, and 5,000 at a time. Is this something where I need to gather 20 at a time, or is there something else I should look at?
I'm not sure what else to provide; if there is something that would help, just let me know.
Updates based on answers
I host with Storm on Demand, which runs perfectly for us and seems to be great hardware - https://www.stormondemand.com/cloud-server/
I use Hpricot to parse the XML (which could probably be optimized; I'm no expert here)
I do need all of the data, this service doesn't offer an export, only API.
EDIT [to help people stumbling on this later]
I switched from Hpricot to Nokogiri, which is MUCH faster.
Also, I was building an XML file in memory; apparently that is extremely memory- and CPU-intensive and was a very time-consuming task. By fixing these two things I've cut the operation down from about 10 minutes to just over 1 minute.
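To make the "don't hold it all in memory" point concrete, here is a minimal sketch of streaming through one page of XML results element by element instead of building the whole document tree. It's shown in Go purely for illustration (the entry shape and file name are made up); in Ruby the equivalent idea is Nokogiri's SAX parser.

package main

import (
    "encoding/xml"
    "fmt"
    "os"
)

// Entry mirrors one <entry> element from the feed (hypothetical shape).
type Entry struct {
    ID   string `xml:"id"`
    Name string `xml:"name"`
}

func main() {
    f, err := os.Open("page.xml") // one page of API results
    if err != nil {
        panic(err)
    }
    defer f.Close()

    dec := xml.NewDecoder(f)
    count := 0
    for {
        tok, err := dec.Token() // walk the stream token by token
        if err != nil {
            break // io.EOF once the page is exhausted
        }
        // Decode each <entry> as it appears instead of holding the whole tree.
        if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "entry" {
            var e Entry
            if err := dec.DecodeElement(&e, &se); err != nil {
                continue
            }
            count++
            // ... persist e here, then let it be garbage collected ...
        }
    }
    fmt.Println("processed", count, "entries")
}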

Here's a list of things to look at:
optimize your code. Try profiling it and see if you can improve it. Most likely switching to a better parsing strategy (SAX instead of DOM) is possible.
get better hardware/hosting. 4 GB is just memory; most likely you are on shared hosting or a VM and are CPU-limited.
offload some CPU/memory-heavy operations, like XML processing, data analysis, or file I/O, to a faster service/application, e.g. one written in C/C++.
in a proper cloud environment you should be able to spawn more VMs and adjust your jobs/load accordingly. That will cost more though and requires some kind of job manager.

The question you need to ask is: why are your CPU and memory spiking? 4 GB is plenty to handle this data, so is your code optimized to handle this task? If not, what can you do?
Is your code already optimized? Fair enough; you can then rewrite the hot spots as C extensions.
After optimizing your code, I'd suggest processing this data 'later', as in a delayed job. This way you aren't blocking on the entire dataset, which can strain your server.
You also mentioned you are running a cloud server, so I assume you have access to more virtual machines. You can process this data in parallel to reduce the stress per machine.
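One way to read "process it later / in parallel" concretely: split the date range into chunks and hand each chunk to a small, fixed pool of workers, so no single request holds the whole dataset. A rough sketch of the pattern (Go, with fetchChunk as a placeholder for the actual API call and parsing):

package main

import (
    "fmt"
    "sync"
    "time"
)

// fetchChunk is a stand-in for "call the API for one slice of the date range,
// stream-parse the response, persist the rows".
func fetchChunk(start, end time.Time) {
    _ = start
    _ = end
}

func main() {
    from := time.Now().AddDate(0, -1, 0) // e.g. the last month
    to := time.Now()

    // Split the range into one-day chunks and queue them up.
    chunks := make(chan [2]time.Time)
    go func() {
        for d := from; d.Before(to); d = d.AddDate(0, 0, 1) {
            chunks <- [2]time.Time{d, d.AddDate(0, 0, 1)}
        }
        close(chunks)
    }()

    // A small worker pool keeps CPU/memory pressure bounded.
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for c := range chunks {
                fetchChunk(c[0], c[1])
            }
        }()
    }
    wg.Wait()
    fmt.Println("done")
}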

Related

Sending large amounts of data from iPhone to a backend service

I have a use case where I need to post ~600 MB of JSON daily to a backend server that will persist this data. To limit the size I could compress the data with zlib, or convert it altogether to Avro and then compress it with Snappy.
Does this seem like good practice for mobile-to-backend interaction with this much data?
Thank you!
I think even if you got this down by 10x, it would still be pushing it. At this rate, you'll be running into upload speed issues, bandwidth limits, all kinds of problems. I don't think users are going to be happy with this behavior. Though I suppose that does depend on the use case (which I'm now curious about).
Your backend is going to regularly experience upload failures; remember that networks are unreliable. And with that much posting, depending on how many instances of your app are out there, it could put very high demands on your backend system. If this data is important, you will have to break it up into smaller chunks, at an absolute minimum.
I'd experiment with a few different formats/compression to see how much savings you can get. But, you may also want to throw some brainpower at rethinking the implementation.
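To get a feel for the compress-then-chunk idea, here is a rough sketch that gzips the payload once and posts it in fixed-size pieces, so a flaky connection only costs one chunk instead of the whole upload. The endpoint, offset parameter, and chunk size are made up, and retry/backoff is deliberately omitted.

package main

import (
    "bytes"
    "compress/gzip"
    "fmt"
    "net/http"
)

const chunkSize = 4 << 20 // 4 MB per request; tune for your network

func upload(data []byte) error {
    // Compress once up front; JSON usually shrinks a lot under gzip.
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    if _, err := zw.Write(data); err != nil {
        return err
    }
    if err := zw.Close(); err != nil {
        return err
    }
    compressed := buf.Bytes()

    // Post the compressed payload in chunks; the server reassembles by offset.
    for off := 0; off < len(compressed); off += chunkSize {
        end := off + chunkSize
        if end > len(compressed) {
            end = len(compressed)
        }
        resp, err := http.Post(
            fmt.Sprintf("https://example.com/ingest?offset=%d", off), // hypothetical endpoint
            "application/octet-stream",
            bytes.NewReader(compressed[off:end]),
        )
        if err != nil {
            return err // in practice: retry this chunk with backoff
        }
        resp.Body.Close()
    }
    return nil
}

func main() {
    _ = upload([]byte(`{"example": "payload"}`))
}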

Neo4j inserting large files - huge difference in time between batch sizes

I am inserting a set of files (PDFs, each about 2 MB) into my database.
Inserting 100 files at once takes about 15 seconds, while inserting 250 files at once takes 80 seconds.
I am not quite sure why this big difference is happening, but I assume it is because free memory runs out somewhere between those batch sizes. Could this be the problem?
If there is any more detail I can provide, please let me know.
Not exactly sure what is happening on your side, but it really looks like what is described in the Neo4j performance guide.
It could be:
Memory issues
If you are experiencing poor write performance after writing some data (initially fast, then massive slowdown) it may be the operating system that is writing out dirty pages from the memory mapped regions of the store files. These regions do not need to be written out to maintain consistency, so to achieve highest possible write speed that type of behavior should be avoided.
Transaction size
Are you using multiple transactions to upload your files?
Many small transactions result in a lot of I/O writes to disc and should be avoided. Too big transactions can result in OutOfMemory errors, since the uncommitted transaction data is held on the Java heap in memory.
If you are on linux, they also suggest some tuning to improve performance. See here.
You can look up the details on the page.
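In practice the "transaction size" advice boils down to committing in moderate batches rather than one transaction per file or one giant transaction for all 250 files. A sketch of just the batching pattern (the insertFile/commit functions below are placeholders, not a real Neo4j driver API):

package main

import "fmt"

// insertFile and commit are placeholders for whatever driver calls you use;
// only the batching pattern matters here.
func insertFile(tx *[]string, name string) { *tx = append(*tx, name) }

func commit(tx *[]string) {
    fmt.Printf("committing %d files in one transaction\n", len(*tx))
    *tx = (*tx)[:0]
}

func main() {
    files := make([]string, 250)
    for i := range files {
        files[i] = fmt.Sprintf("file-%d.pdf", i)
    }

    const batchSize = 50 // big enough to amortize I/O, small enough for the heap
    tx := []string{}
    for i, f := range files {
        insertFile(&tx, f)
        if (i+1)%batchSize == 0 {
            commit(&tx)
        }
    }
    if len(tx) > 0 {
        commit(&tx) // flush the final partial batch
    }
}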
Also, if you are on linux, you can check memory usage by yourself during import by using this command:
$ free -m
I hope this helps!

Detailed multitasking monitoring

I'm trying to put together a model of a computer and run some simulations on it (part of a school assignment). It's a very simple model: a CPU, a disk, and a process generator that generates user processes that take turns using the CPU and accessing the disk (I've decided to omit the various system processes because, according to Microsoft's Process Explorer tool running on Windows 7, they use next to no CPU time). And this is where I've stopped.
I have no idea how to get relevant data on how often various processes read/write to disk, how much data they move at once, and how much time they spend using the CPU. Let's say I want to get some statistics for some typical operations on a PC: playing music/movies, browsing the internet, playing games, working with Office, video editing and so on. Is there even a way to gather such data?
I'm simulating preemptive multitasking using RR with a time quantum of 15ms for switching processes, and this is how it looks:
->Process gets to CPU
->Process does its work in 0-15ms, gives up the CPU or is cut off
And now, two options arise:
a) the process just sits and waits before it gets the CPU again, or waits for user input if there is nothing to do
b) the process requested data from disk, and does not rejoin the queue until said data is available
I would like the decision between a) and b) in the model to be made based on a probability, for example 90% for a) and 10% for b). But I do not know how to get those percentages to be at least a bit realistic for a certain type of process. Also, how much data can and does a process typically access at once?
Any hints, sources, utilities available for this?
I think I found an answer myself, albeit an unreliable one.
The Process Explorer utility for Windows measures disk I/O, by volume and by occurrences. So there's a rough way to get the answer:
say a process performs 3,000 reads in 30 minutes, whilst using 2% of CPU during that time (assuming a single-core CPU). The process has then used 36,000 ms of CPU time, divided into roughly 4,800 blocks (this is the unreliable part: the process in all probability does not use the whole of each time slot, so I'll just divide by half the time slot, i.e. 7.5 ms). 3,000/4,800 gives about a 62% chance of reading data after using the CPU.
I hope I did not misunderstand the "reads" statistic in Process Explorer.
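For the model itself, the a)/b) choice after each quantum is just a weighted coin flip. A minimal sketch of that step (the 10% disk probability is the assumed figure from the question, not measured data):

package main

import (
    "fmt"
    "math/rand"
)

const (
    quantumMs  = 15.0 // RR time quantum from the model
    pDiskAfter = 0.10 // assumed: 10% chance the process blocked on disk I/O
)

func main() {
    waitingForDisk := 0
    backToReady := 0

    for i := 0; i < 100000; i++ {
        // The process runs for 0-15 ms (not modelled further here), then we
        // decide whether it blocked on the disk (case b) or rejoins the ready
        // queue (case a).
        _ = rand.Float64() * quantumMs
        if rand.Float64() < pDiskAfter {
            waitingForDisk++
        } else {
            backToReady++
        }
    }
    fmt.Printf("ready again: %d, blocked on disk: %d\n", backToReady, waitingForDisk)
}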

OutOfMemoryException Processing Large File

We are loading a large flat file, about 125 MB, into BizTalk Server 2006 (original release, not R2). We run a map against it and then take each row and make a call out to a stored procedure.
We receive an OutOfMemoryException during orchestration processing; the Windows service restarts, uses the full 2 GB of memory, and crashes again.
The server is 32-bit and set to use the /3GB switch.
Also I've separated the flow into 3 hosts - one for receive, the other for orchestration, and the third for sends.
Anyone have any suggestions for getting this file to process without error?
Thanks,
Krip
If this is a flat file being sent through a map, you are converting it to XML, right? The increase in size could be huge; XML can easily add a factor of 5-10x over a flat file, especially if you use descriptive or long XML tag names (which normally you would).
Something simple you could try is to rename the XML nodes to shorter names; depending on the number of records (which sounds like a lot), it might actually have a pretty significant impact on your memory footprint.
Perhaps a more enterprise approach would be to subdivide this, in a custom pipeline, into separate message packets that can be fed through the system in more manageable chunks (similar to what Chris suggests). Then the system throttling and memory metrics could take over. Without knowing more about your data it would be hard to say how best to do this, but with a 125 MB file I am guessing that you probably have a ton of repeating rows that do not need to be processed sequentially.
Where does it crash? Does it make it past the Transform shape? Another suggestion is to run the transform in the receive port. For more efficient processing, you could even debatch the message and have multiple simultaneous orchestration instances calling the stored procs. This would definitely reduce the memory profile and increase performance.
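Outside of the BizTalk specifics, "debatching" just means splitting the big file into groups of rows before any per-row work happens, so nothing ever holds all 125 MB (let alone its XML expansion) at once. A rough sketch of that idea in plain code, not BizTalk (file name and batch size are made up):

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    f, err := os.Open("bigfile.txt") // the large flat file
    if err != nil {
        panic(err)
    }
    defer f.Close()

    const batchSize = 1000 // rows per message / unit of work
    batch := make([]string, 0, batchSize)
    flush := func() {
        if len(batch) == 0 {
            return
        }
        // Hand this batch off as its own message/orchestration instance,
        // e.g. call the stored procedure for each row here.
        fmt.Printf("dispatching batch of %d rows\n", len(batch))
        batch = batch[:0]
    }

    sc := bufio.NewScanner(f)
    for sc.Scan() {
        batch = append(batch, sc.Text())
        if len(batch) == batchSize {
            flush()
        }
    }
    flush() // final partial batch
}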

How does shared memory vs message passing handle large data structures?

In looking at Go and Erlang's approach to concurrency, I noticed that they both rely on message passing.
This approach obviously alleviates the need for complex locks because there is no shared state.
However, consider the case of many clients wanting parallel read-only access to a single large data structure in memory -- like a suffix array.
My questions:
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
One thing to realise is that the Erlang concurrency model does NOT really specify that the data in messages must be copied between processes, it states that sending messages is the only way to communicate and that there is no shared state. As all data is immutable, which is fundamental, then an implementation may very well not copy the data but just send a reference to it. Or may use a combination of both methods. As always, there is no best solution and there are trade-offs to be made when choosing how to do it.
The BEAM uses copying, except for large binaries where it sends a reference.
Yes, shared state could be faster in this case. But only if you can forgo the locks, and this is only doable if it's absolutely read-only. If it's 'mostly read-only' then you need a lock (unless you manage to write lock-free structures; be warned that they're even trickier than locks), and then you'd be hard-pressed to make it perform as fast as a good message-passing architecture.
Yes, you could write a 'server process' to share it. With really lightweight processes, it's no more heavy than writing a small API to access the data. Think like an object (in OOP sense) that 'owns' the data. Splitting the data in chunks to enhance parallelism (called 'sharding' in DB circles) helps in big cases (or if the data is on slow storage).
Even though NUMA is getting mainstream, you still have more and more cores per NUMA cell. And a big difference is that a message can be passed between just two cores, while a lock has to be flushed from cache on ALL cores, limiting it to the inter-cell bus latency (even slower than RAM access). If anything, shared state and locks are getting more and more infeasible.
In short: get used to message passing and server processes; it's all the rage.
Edit: revisiting this answer, I want to add about a phrase found on Go's documentation:
share memory by communicating, don't communicate by sharing memory.
the idea is: when you have a block of memory shared between threads, the typical way to avoid concurrent access is to use a lock to arbitrate. The Go style is to pass a message with the reference; a thread only accesses the memory after receiving the message. It relies on some measure of programmer discipline, but results in very clean-looking code that can be easily proofread, so it's relatively easy to debug.
the advantage is that you don't have to copy big blocks of data on every message, and you don't have to flush caches as on some lock implementations. It's still somewhat early to say whether the style leads to higher-performance designs or not (especially since the current Go runtime is somewhat naive about thread scheduling).
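A tiny example of that style (nothing here is from a real codebase): each slice is only ever touched by whichever goroutine currently "owns" it, and the channel send is how ownership moves, so there are no locks at all.

package main

import "fmt"

func main() {
    work := make(chan []int)
    done := make(chan int)

    // The worker only touches a slice after receiving it; no locks needed,
    // because the sender stops using the slice once it has been sent.
    go func() {
        sum := 0
        for buf := range work {
            for _, v := range buf {
                sum += v
            }
        }
        done <- sum
    }()

    for i := 0; i < 3; i++ {
        buf := []int{i, i + 1, i + 2} // this goroutine owns buf...
        work <- buf                   // ...until it hands it over here
    }
    close(work)
    fmt.Println("total:", <-done)
}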
In Erlang, all values are immutable - so there's no need to copy a message when it's sent between processes, as it cannot be modified anyway.
In Go, message passing is by convention - there's nothing to prevent you sending someone a pointer over a channel, then modifying the data pointed to, only convention, so once again there's no need to copy the message.
Most modern processors use variants of the MESI protocol. Because of its Shared state, passing read-only data between different threads is very cheap. Modified shared data is very expensive though, because all other caches that store this cache line must invalidate it.
So if you have read-only data, it is very cheap to share it between threads instead of copying with messages. If you have read-mostly data, it can be expensive to share between threads, partly because of the need to synchronize access, and partly because writes destroy the cache friendly behavior of the shared data.
Immutable data structures can be beneficial here. Instead of changing the actual data structure, you simply make a new one that shares most of the old data, but with the things changed that you need changed. Sharing a single version of it is cheap, since all the data is immutable, but you can still update to a new version efficiently.
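One crude way to get that behaviour without a functional language is to treat each version of the structure as immutable and publish a whole new version atomically: readers never take a lock and always see a consistent snapshot. The sketch below rebuilds the map on every update for simplicity; a proper persistent structure would share most of the old data instead of copying it.

package main

import (
    "fmt"
    "sync/atomic"
)

func main() {
    // Readers load the current snapshot; writers build a new map and swap it in.
    var snapshot atomic.Pointer[map[string]int]

    v1 := map[string]int{"a": 1}
    snapshot.Store(&v1)

    // Reader: lock-free, always sees a complete version.
    read := func(key string) int { return (*snapshot.Load())[key] }

    // Writer: copy, modify the copy, publish the new version atomically.
    v2 := map[string]int{}
    for k, v := range *snapshot.Load() {
        v2[k] = v
    }
    v2["b"] = 2
    snapshot.Store(&v2)

    fmt.Println(read("a"), read("b"))
}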
What is a large data structure?
One person's large is another person's small.
Last week I talked to two people. One person was making embedded devices and used the word "large"; I asked him what it meant and he said over 256 KBytes. Later in the same week a guy was talking about media distribution; he used the word "large", I asked him what he meant, he thought for a bit and said "won't fit on one machine", say 20-100 TBytes.
In Erlang terms "large" could mean "won't fit into RAM". So with 4 GBytes of RAM, data structures over 100 MBytes might be considered large; copying a 500 MByte data structure might be a problem. Copying small data structures (say under 10 MBytes) is never a problem in Erlang.
Really large data structures (i.e. ones that won't fit on one machine) have to be copied and "striped" over several machines.
So I guess you have the following:
Small data structures are no problem: since they are small, data processing times are fast, copying is fast, and so on.
Big data structures are a problem because they don't fit on one machine, so copying is essential.
Note that your questions are technically nonsensical because message passing can use shared state, so I shall assume that you mean message passing with deep copying to avoid shared state (as Erlang currently does).
Will using shared state be faster and use less memory than message passing, as locks will mostly be unnecessary because the data is read-only, and only needs to exist in a single location?
Using shared state will be a lot faster.
How would this problem be approached in a message passing context? Would there be a single process with access to the data structure and clients would simply need to sequentially request data from it? Or, if possible, would the data be chunked to create several processes that hold chunks?
Either approach can be used.
Given the architecture of modern CPUs & memory, is there much difference between the two solutions -- i.e., can shared memory be read in parallel by multiple cores -- meaning there is no hardware bottleneck that would otherwise make both implementations roughly perform the same?
Copying is cache unfriendly and, therefore, destroys scalability on multicores because it worsens contention for the shared resource that is main memory.
Ultimately, Erlang-style message passing is designed for concurrent programming whereas your questions about throughput performance are really aimed at parallel programming. These are two quite different subjects and the overlap between them is tiny in practice. Specifically, latency is typically just as important as throughput in the context of concurrent programming and Erlang-style message passing is a great way to achieve desirable latency profiles (i.e. consistently low latencies). The problem with shared memory then is not so much synchronization among readers and writers but low-latency memory management.
One solution that has not been presented here is master-slave replication. If you have a large data structure, you can replicate changes to all slaves, which then perform the update on their own copy.
This is especially interesting if you want to scale to several machines that don't even have the possibility of sharing memory without very artificial setups (mmap of a block device that reads/writes from a remote computer's memory?).
A variant is to have a transaction manager that one asks nicely to update the replicated data structure, and which makes sure it serves one and only one update request at a time. This is more like the mnesia model for master-master replication of mnesia table data, which qualifies as a "large data structure".
The problem at the moment is indeed that the locking and cache-line coherency might be as expensive as copying a simpler data structure (e.g. a few hundred bytes).
Most of the time a cleverly written multi-threaded algorithm that tries to eliminate most of the locking will be faster, and a lot faster with modern lock-free data structures, especially when you have well-designed cache systems like Sun's Niagara chip-level multithreading.
If your system/problem cannot easily be broken down into a few simple data accesses, then you have a problem. And not all problems can be solved by message passing. This is why there are still some Itanium-based supercomputers sold: they have terabytes of shared RAM and up to 128 CPUs working on the same shared memory. They are an order of magnitude more expensive than a mainstream x86 cluster with the same CPU power, but you don't need to break down your data.
Another reason not mentioned so far is that programs can become much easier to write and maintain when you use multi-threading; message passing and the shared-nothing approach make them even more maintainable.
As an example, Erlang was never designed to make things faster but instead to use a large number of threads to structure complex data and event flows.
I guess this was one of the main points in the design. In the web world of Google you usually don't care about performance, as long as it can run in parallel in the cloud. And with message passing you ideally can just add more computers without changing the source code.
Usually message-passing languages (this is especially easy in Erlang, since it has immutable variables) optimise away the actual data copying between processes (local processes only, of course: you'll want to think about your network distribution pattern wisely), so this isn't much of an issue.
The other concurrency paradigm is STM, software transactional memory. Clojure's refs are getting a lot of attention. Tim Bray has a good series exploring Erlang's and Clojure's concurrency mechanisms:
http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot-next
http://www.tbray.org/ongoing/When/200x/2009/12/01/Clojure-Theses
