What is the difference between Tika App, Tika Server and the Java wrapper? Which one is used, and when? - apache-tika

I want to use Apache Tika at enterprise scale, for a huge number of large documents. Which should I use: Tika Server, Tika App, or direct Java calls? Can you suggest a system architecture (e.g. 3-4 load-balanced Tika instances on physically separate servers)?

Making PUT calls to a REST endpoint to send thousands of 0.5 GB documents over HTTP, one at a time, is not an appropriate scenario for the Tika Server. It is not memory efficient, and the server will likely crash eventually due to memory leaks or parser bugs.
That said, as of v1.19 there is a -spawnChild option that periodically restarts the child process after it has handled -maxFiles files; as of v2.x this is the default.
For your needs, you should simply use the tika-app in batch mode, which:
Runs locally, using an input and output directory that you specify
Sets up parent/child processes to robustly handle hangs/OOMEs
Runs multiple parser threads in parallel
Can restart child every x minutes or after y files to avoid memory leaks
Logs failures
java -jar tika-app.jar -i <input_directory> -o <output_directory>
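If you do want to spread the work over 3-4 machines as in the question, one simple pattern is to partition the corpus into one input directory per machine and run that same batch command against each machine's slice. Below is a minimal Python sketch of the idea; the paths, the node count and the round-robin copy are illustrative assumptions, not part of Tika.

# Sketch: shard a flat corpus into N per-node input directories, then run
# tika-app batch mode against one node's slice. Paths and node count are
# illustrative assumptions, not Tika requirements.
import shutil
import subprocess
from pathlib import Path

CORPUS = Path("/data/corpus")        # all source documents (assumed layout)
NODES = 4                            # number of Tika machines

# 1) Partition: round-robin the files into /data/corpus_node0 .. _node3.
for i, doc in enumerate(sorted(CORPUS.iterdir())):
    slice_dir = Path(f"/data/corpus_node{i % NODES}")
    slice_dir.mkdir(exist_ok=True)
    shutil.copy2(doc, slice_dir / doc.name)

# 2) On each node, run tika-app in batch mode against its own slice;
#    shown here for node 0 (normally launched on the node itself).
subprocess.run(
    ["java", "-jar", "tika-app.jar",
     "-i", "/data/corpus_node0",
     "-o", "/data/extracted_node0"],
    check=True,
)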

Related

dask scheduler OOM opening a large number of avro files

I'm trying to run a pipeline via dask on a cluster on GCP. The pipeline loads a lot of avro files from cloud storage (~5300 files of around 300 MB each) like this:
import dask.bag as db

bag = db.read_avro(
    'gcs://mybucket/myfiles-*.avro',
    blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point runs out of memory (I have tried scaling the master node running the scheduler up to 64 GB of RAM, but that still does not suffice), while the workers sit idle. I assume the problem is that the scheduler has to create an excessive number of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern that I'm using when trying to open a very large number of files? If so, is there perhaps a built-in way to better cope with this or would I have to split the avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see whether large objects are being serialised into it, or whether a massive number of tasks is simply being generated. If it's the former, it may be solvable, and you could raise an issue.
As you say, you may be able to keep the load on the scheduler down by processing sub-batches out of the total set of files at a time and waiting for each batch to complete.
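For instance, here is a minimal sketch of that sub-batching approach; the bucket, batch size and output layout are placeholder assumptions, and your existing transformations would go where indicated.

# Sketch: process the avro files in fixed-size sub-batches so the scheduler
# only ever holds the task graph for one batch at a time.
import dask.bag as db
import gcsfs

fs = gcsfs.GCSFileSystem()                   # assumes default GCP credentials
files = sorted(fs.glob("mybucket/myfiles-*.avro"))
BATCH = 200                                  # files per sub-batch (tune to taste)

for n, start in enumerate(range(0, len(files), BATCH)):
    batch = ["gcs://" + f for f in files[start:start + BATCH]]
    bag = db.read_avro(batch, blocksize=5000000)
    # ... apply the same transformations as in the original pipeline ...
    df = bag.to_dataframe()
    # writing blocks until this batch has completed on the workers
    df.to_parquet(f"gcs://mybucket/output/batch-{n}")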

Why are multiple Ruby processes on an EC2 server causing 100% CPU utilisation?

I have a Rails application which has 100% CPU utilization most of the time.
I am not able to figure out why there is so much load on the server. I am using the Puma web server with its default configuration, and am running multiple background jobs using the sucker_punch gem. There are 7 job classes that use Sucker Punch with 5 workers each, along these lines:
class SomeJob                 # class name illustrative; each of the 7 job classes looks like this
  include SuckerPunch::Job
  workers 5
end
I ran top -i and found the following processes running on the server. I can see multiple Ruby processes listed. Can someone tell me whether this is normal behavior on a server, or if something is wrong?
Some Ways to Reduce Resource Contention
Your user space load is high (~48%), so you'll probably want to reduce the number of workers in your web application, increase the number of CPUs available on your instance, move to a version of Ruby that has better concurrency and real multi-core support (e.g. Rubinius or JRuby), or some combination of these options. Depending on what your code is actually doing, you may also need to re-architect your application to offload expensive I/O from the application server.
In addition, your steal time is quite high (~41%), so your EC2 instance is probably overloaded. Simply moving your application to a less-loaded instance may free up sufficient resources to reduce application wait times.

How many requests can Puma buffer simultaneously?

I understand a benefit of Puma over other Rails web servers is how it handles slow clients. While a Puma server receives and downloads a (potentially slow) request, it can still receive and download other requests that might download quicker and be passed on to a worker for processing before the slow request has finished being received.
But I can't find any information about what, if any, limits there are to this.
Can Puma download any number of requests at the same time? If 1000 slow requests hit it at the same time, would the 1001st request reach a Puma worker first assuming it wasn't a slow request?
I guess what I'm interested in generally is what impact multiple slow requests have on other requests, including each other - because I'm working on an application that's likely to involve plenty of 'slow requests' (image uploads from phones via 3G).
This great article by Nate Berkopec helps explain in principle how Puma helps with slow clients: "In clustered mode, then, Puma can deal with slow requests (thanks to a separate master process whose responsibility it is to download requests and pass them on)..." Any more light anyone can shed would be very welcome.
There are a number of considerations, such as the IO polling system, memory, and concurrency.
IO polling system
Edit (Sep. 9th, 2020): By now the Puma server runs on nio4r and should no longer be subject to the limits of the select system call (where file descriptor values are limited to 1023).
As far as I know, Puma uses the select system call (unlike iodine or passenger, which also protect you from slow clients but use kqueue or epoll).
The select system call is limited on most systems (usually up to 1024 clients / maxfd). I would assume that would create a limit.
However, I know Puma is working on replacing the select system call with something both portable and effective (such as leveraging the nio4r gem).
I don't know if that was already achieved, but it will break this limit and probably improve performance.
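To make that limit concrete, here is a small Python sketch (not Puma code; the pipe count is just for the demonstration, and you may need to raise the open-file limit first, e.g. ulimit -n 2048).

# Sketch: select() refuses file descriptors whose numeric value is >= 1024
# (FD_SETSIZE), which is the limit discussed above.
import os
import select

pipes = [os.pipe() for _ in range(600)]      # creates roughly 1200 descriptors
high_fd = pipes[-1][0]                       # a descriptor numbered above 1023

try:
    # Only one fd is passed, but its numeric value exceeds FD_SETSIZE,
    # so select() refuses it.
    select.select([high_fd], [], [], 0)
except (ValueError, OSError) as exc:
    print(f"select() rejected fd {high_fd}: {exc}")

# epoll/kqueue-based designs (nio4r, iodine, passenger) have no such
# per-descriptor-value limit.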
Memory
Slow clients still consume memory as they slowly fill the buffer with their header data, or as they slowly download the buffered response that was sent to them (the buffer is kept in memory until the download is complete).
Memory limitations will always add limitations to slow client handling.
Some of these limitations can be alleviated, for example by sending static files using X-Sendfile (supported by iodine, and by Puma or Passenger when running under nginx)... but this isn't really something you can fix completely.
Concurrency
Puma handles slow clients within Ruby's GIL (global interpreter lock). This means that no other threads / instructions can execute while Puma is handling slow clients.
This is often a non-issue, but a large enough number of slow clients increases the cost of context switching and system calls. This could (potentially) slow a server considerably.
Both Passenger and iodine perform slow-client buffering outside of the GIL, allowing these system calls to be truly concurrent (when multiple CPU cores are available).
This mitigates the issue but doesn't fully solve it.
Conclusion and Caveats
The biggest issue is usually the IO polling system. A solution for this is on Puma's roadmap (maybe already implemented, I'm not sure).
The other issues (memory limits and concurrency limits) are relatively less important, but they can't be mitigated without using language extensions (the iodine server is written in C and Passenger is written in C++).
Since Puma doesn't (currently) require any language extensions (except for its integrated HTTP parsers in C and Java), these issues remain.
I should point out that I'm the author of the iodine HTTP/WebSocket server, so I'm somewhat biased.

Max CPU% on Continuous WebJob

I have a continuous WebJob running that reads messages from a queue, reads a file from Blob Storage, converts it, and then writes the converted file to a different blob container. All of the files are being converted properly. The Kudu site for my App Service shows it running at nearly 100% CPU. The process explorer in Kudu shows my WebJob as the only other process running within that server. Conventional wisdom indicates that the WebJob is probably the problem. Are there any tools for determining what the problem might be?
Thanks!
The simplest approach is to download a process dump (From Kudu process explorer), and then analyze it locally, e.g. with windbg or Visual Studio. By looking at the threads, you should be able to identify which ones are actively spinning CPU.
Another simple test is to temporarily stop creating new blobs, and check whether that causes the CPU usage to go down. If it's still high in the absence of any processing, then something strange is going on.

Under what conditions does Jena go parallel?

I've run queries against Jena TDB and the Jena in-memory triple store, both on a single-core machine and on a machine with 16 CPUs. I observe that on the 16-core machine Jena spawns a lot of threads to handle both query and inference operations.
So I wonder: does Jena behave in parallel by default, or is there a way to force or avoid parallelism?
No, Jena in and of itself uses minimal parallelism. In particular, the query engine is entirely non-parallel in its design and implementation. Jena may on its own spawn an additional thread per query in order to be able to monitor and kill the query if it exceeds a timeout, but that will depend on the app configuration.
You've left out a lot of detail about your setup, but I assume you are using Fuseki as a server for your tests?
Fuseki uses the Jetty web server framework, which inherently has parallelism built in so that it can service multiple requests. So the parallelism you observe is likely just a side-effect of the web server's parallelism. If you are sending multiple requests to the server in parallel then you will see parallel request processing on the server side.
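To illustrate that last point, here is a minimal Python sketch that fires several SPARQL queries at a Fuseki endpoint concurrently (the endpoint URL and dataset name are assumptions); any extra JVM threads you then see come from Jetty servicing the concurrent HTTP requests, not from the query engine itself.

# Sketch: send several SPARQL queries to a Fuseki server in parallel.
# The server-side parallelism observed comes from Jetty handling the requests.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "http://localhost:3030/ds/sparql"   # assumed host and dataset name
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

def run_query(_):
    resp = requests.post(
        ENDPOINT,
        data={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"][0]["n"]["value"]

with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(run_query, range(8))))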
