How to fix Tika server 413 entity too large error - apache-tika

I want to use Tika server to extract text from PDFs using a POST request. However, when the PDF is too large, the server returns a 413 Entity Too Large error. How can I increase the size limit?
My request is like this:
curl -F upload=@price.xls http://localhost:9998/tika/form

You can always use PDFBox to split the document into single pages.
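For example, here is a minimal sketch (assuming PDFBox 2.x; file names are placeholders) that splits a large PDF into one-page documents, each of which can then be posted to the Tika endpoint separately:

import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;
import java.io.File;
import java.util.List;

public class SplitPdf {
    public static void main(String[] args) throws Exception {
        // Load the PDF that is too large to send in one request
        try (PDDocument document = PDDocument.load(new File("large.pdf"))) {
            Splitter splitter = new Splitter();              // defaults to one page per output document
            List<PDDocument> pages = splitter.split(document);
            int i = 1;
            for (PDDocument page : pages) {
                page.save("page-" + i++ + ".pdf");           // each part stays well under the size limit
                page.close();
            }
        }
    }
}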

Related

Can CDNs handle base64 encoded data?

I'm trying to make an app where I take pictures from users, add them to a canvas, draw stuff on them, then convert them to a base64 string and upload them.
For this purpose I'm considering using a CDN, but I can't find information on what I can upload to one and how client-side uploading works. I'd like to be able to send the image as base64 along with the name to be given to the file, so that when it arrives at the origin CDN, the base64 image is decoded and saved under the specified name (which I will add to the database on the server). Is this possible? Can I have some kind of save.php file on the origin CDN where I write my logic to save the file and to which I'll send XHR requests? Or how does this whole thing work? I know this question may sound trivial, but I've been looking for hours and still haven't found anything which explains in detail how client-side uploading works for CDNs.
CDNs usually do not provide such an upload service for the client side, so you cannot do it this way.

Get 503 response from resumable upload API (v2) although the video has been uploaded successfully

When I uploaded a 150 MB video from my website, I ran into the following problem:
I got a 503 response instead of a 201 Created response from YouTube, but the video had already been uploaded to YouTube, since I can see it on my YouTube page.
I use the APIs described in:
https://developers.google.com/youtube/2.0/developers_guide_protocol_resumable_uploads
So could someone tell me what has happened?
What should I do in this case?
Thanks!
I get the same problem when using the resumable upload. If I don't specify Content-Range in the request header, I can even upload a 550 MB file. But if I cut the file into chunks and upload them in separate requests with the Content-Range header set, it ends up with the same 503 error.
My guess is that the culprit is the Content-Length header. Google expects Content-Length to be the remaining size of the total content length, while every browser rewrites that value to be the actual size of the data you sent.
For example, suppose you are sending 3 bytes of data in 3 chunks of 1 byte each. During the first chunk, you set Content-Length to 3 (the remaining size), but the browser rewrites it to 1, which makes Google think only 1 byte is left. After the first chunk is sent, the session is closed and you cannot send any more data to that specific upload URL, so error code 503 is returned.
So I guess the better solution would be for Google to use a different header than Content-Length. A little trick is to still use the resumable upload method, but without cutting the file into parts and without setting the Content-Range header in the request. While uploading, you can watch the status by calling the same UPLOAD_URL in the specific way the Google docs describe. At least I have managed to upload a 550 MB file successfully this way.
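For reference, here is a rough sketch (the upload URL, file name, and chunk size are made up) of how a chunked resumable upload typically sets Content-Range per request; note that the Content-Length actually transmitted covers only the current chunk, which is exactly the mismatch the answer above is speculating about:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class ChunkedUpload {
    public static void main(String[] args) throws Exception {
        String uploadUrl = "https://uploads.example.com/resumable/session123";  // placeholder session URL
        byte[] video = Files.readAllBytes(Paths.get("video.mp4"));
        int chunkSize = 1024 * 1024;  // 1 MB per request

        for (int offset = 0; offset < video.length; offset += chunkSize) {
            int end = Math.min(offset + chunkSize, video.length);
            byte[] chunk = Arrays.copyOfRange(video, offset, end);

            HttpURLConnection conn = (HttpURLConnection) new URL(uploadUrl).openConnection();
            conn.setRequestMethod("PUT");
            conn.setDoOutput(true);
            // Content-Range says where this chunk sits within the whole file
            conn.setRequestProperty("Content-Range",
                    "bytes " + offset + "-" + (end - 1) + "/" + video.length);
            // The Content-Length sent on the wire is the size of this chunk only
            conn.setFixedLengthStreamingMode(chunk.length);
            try (OutputStream out = conn.getOutputStream()) {
                out.write(chunk);
            }
            System.out.println("bytes " + offset + "-" + (end - 1) + " -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}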

Check a file of HTTP URLs for HTTP response codes

I need a way to check a file that contains links to see if any of them are broken. The file contains thousands of different URLs. I don't need to crawl or spider any further than the URLs that are in the file; I just need an HTTP response code for each URL.
Take a look at Xenu.
It does exactly what you need, assuming the input is a web page or a text file of links. You can control how deep it follows links.
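If you would rather script it yourself, a simple sketch (the file name is a placeholder; one URL per line) that issues a HEAD request per URL and prints the status code could look like this:

import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LinkChecker {
    public static void main(String[] args) throws Exception {
        List<String> urls = Files.readAllLines(Paths.get("urls.txt"));
        for (String url : urls) {
            if (url.trim().isEmpty()) continue;
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                conn.setRequestMethod("HEAD");      // status code only, no body download
                conn.setConnectTimeout(5000);
                conn.setReadTimeout(5000);
                System.out.println(conn.getResponseCode() + "  " + url);
                conn.disconnect();
            } catch (Exception e) {
                System.out.println("FAIL  " + url + "  (" + e.getMessage() + ")");
            }
        }
    }
}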

Rails: Sending cached gzip content directly to client using caches_action

I am using caches_action to cache the response of one of my actions.
I want to store the compressed response in the cache and then send it as-is if the browser supports that compression; otherwise, decompress it and then send it.
Some characteristics of my content:
1. It rarely changes
2. About 90% of the browsers requesting it support gzip
Do you see any issue with this approach?
If it is the right approach, is there an easy way to achieve it?
Compression should be handled by Apache or whatever web server you use.
Assuming the client supports compression, the web server will load your static file and serve a compressed response.
I suggest you have a look at your web server configuration.
Here's an example using Apache.
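As a generic illustration (assuming mod_deflate is enabled), a typical snippet looks something like this:

<IfModule mod_deflate.c>
    # Compress common text responses on the fly for clients that send Accept-Encoding: gzip
    AddOutputFilterByType DEFLATE text/html text/plain text/css
    AddOutputFilterByType DEFLATE application/javascript application/json application/xml
</IfModule>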

Sending binary data to (Rails) RESTful endpoint via JSON/XML?

I am currently putting together a Rails-based web application which will only serve and receive data via JSON and XML. However, some requirements call for the ability to upload binary data (images).
Now, to my understanding, JSON is not entirely meant for that... but how do you generally tackle the problem of receiving binary files/data through those two entry points to your application?
I suggest encoding the binary data in something like base64. This would make it safe to use in XML or JSON format.
http://en.wikipedia.org/wiki/Base64
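As a rough illustration (the file name is a placeholder), encoding a file and embedding it in a JSON payload boils down to this:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class EncodeImage {
    public static void main(String[] args) throws Exception {
        byte[] image = Files.readAllBytes(Paths.get("photo.png"));
        String encoded = Base64.getEncoder().encodeToString(image);   // binary -> ASCII-safe text
        String json = "{\"filename\":\"photo.png\",\"data\":\"" + encoded + "\"}";
        // The receiving side simply reverses the step to get the original bytes back
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(json.length() + " chars of JSON, " + decoded.length + " bytes decoded");
    }
}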
Maybe you could have a look at the Base64 algorithm.
It is used to "transform" everything into ASCII characters.
You can encode and decode with it. It's used for web services, or even in .NET serialization.
Hope this helps a little.
Edit: I saw the "new post" notice while writing; someone was faster. Rails base64
If you are using Rails with JSON and XML, then you are using HTTP. POST is part of HTTP and is the best way to transfer binary data. Base64 is a very inefficient way of doing this.
If your server is sending data, I would recommend putting a path to the file on the server in the XML or JSON. That way your server doesn't have to base64 encode the data and your client, which already supports HTTP GET, can pull down the data without decoding it. (GET /path/to/file)
For sending files, have your server and/or client generate a unique file name and use a two-step process: the client sends the XML or JSON message with fileToBeUploaded: "name of file.ext", and after sending the message it POSTs the data under the aforementioned filename. Again, client and server won't have to encode and decode the data. This can also be done in one request using a multipart request.
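A rough sketch of that two-step flow (the endpoint URLs and field name are invented for illustration) using Java's built-in HTTP client:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Paths;

public class TwoStepUpload {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String filename = "photo.png";  // placeholder

        // Step 1: send the JSON message announcing the file that will follow
        HttpRequest announce = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/uploads"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"fileToBeUploaded\":\"" + filename + "\"}"))
                .build();
        client.send(announce, HttpResponse.BodyHandlers.ofString());

        // Step 2: POST the raw bytes under the agreed-upon name; no Base64 involved
        HttpRequest upload = HttpRequest.newBuilder()
                .uri(URI.create("https://api.example.com/uploads/" + filename))
                .header("Content-Type", "application/octet-stream")
                .POST(HttpRequest.BodyPublishers.ofFile(Paths.get(filename)))
                .build();
        HttpResponse<String> response = client.send(upload, HttpResponse.BodyHandlers.ofString());
        System.out.println("HTTP " + response.statusCode());
    }
}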
Base64 is easy but will quickly chew up CPU and/or memory depending on the size of the data and frequency of requests. On the server-side, it's also not an operation which is cached whereas the operation of your web server reading the file from disk is.
If your images are not too large, putting them in the database with a RoR :binary type makes a lot of sense. If you have database replicas, the images get copied for free to the other sites, there's no concern about orphaned or widowed images, and the atomic transaction issues become far simpler.
On the other hand, Nessence is right that Base64, as with any encoding layer, does add network, memory and CPU load to the transactions. If network bandwidth is your top issue, make sure your web service accepts and offers deflate/gzip compressed connections. This will reduce the cost of the Base64 data on the network layer, albeit at the cost of even more memory and CPU load.
These are architectural issues that should be discussed with your team and/or client.
Finally, let me give you a heads up about RoR's XML REST support. The Rails :binary database type will become <object type="binary" encoding="base64">...</object> XML objects when you render to XML using code like this from the default scaffolding:
def show
  @myobject = MyObject.find(params[:id])
  respond_to do |format|
    format.xml { render :xml => @myobject }
  end
end
This works great for GET operations, and the PUT and POST operations are about as easy to write. The catch is that the Rails PUT and POST operations don't accept the same tags. This is because the from_xml code does not interpret the type="binary" attribute, but instead looks for type="binaryBase64". There is a bug report with a patch at the Rails Lighthouse site to correct this.
