Apache Tika Server - Request Header Parameters?

The Apache Tika Server provides a REST API to extract text from a document. It is also possible to set specific request header parameters such as X-Tika-PDFOcrStrategy, e.g.:
$ curl -T test/Dokument01.pdf http://localhost:9998/tika --header "X-Tika-PDFOcrStrategy: ocr_only"
Across a number of different documents about Tika, I found these additional header parameters documented:
X-Tika-OCRLanguage: eng
X-Tika-PDFextractInlineImages: true | false
X-Tika-PDFOcrStrategy: ocr_only | ocr_and_text_extraction
X-Tika-OCRoutputType: hocr
But there seems to be no documentation about how to use the X-Tika-... header parameters, or about which parameters are supported and which are not.
For example, I wonder if it is possible to override the image type or the DPI with something like:
X-Tika-PDFocrImageType: rgb
X-Tika-PDFocrDPI: 100
My question is: which header parameters are supported, and which naming convention do these parameters follow?

The code that handles the X-Tika-OCR and X-Tika-PDF headers is TikaResource.processHeaderConfig.
Those header suffixes and values are then mapped onto the TesseractOCRConfig and PDFParserConfig configuration objects via reflection.
So, to see what X-Tika headers you can set, look up the options on the config class you want to tweak things on (Tesseract or PDF), then build the name, then set the header. If you are not sure what the option does, or what values it takes, look at the JavaDocs for the underlying setter method that will get called.
For example, setExtractInlineImages on the PDF config maps to X-Tika-PDFextractInlineImages.
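The naming rule described above can be sketched in a few lines of Python. This is only an illustration: `tika_header` is a hypothetical helper, and lower-casing the first letter of the suffix is an assumption taken from the single example in the answer (the headers listed in the question show both capitalisations). `setOcrDPI` is likewise assumed to exist on the PDF config class; check the JavaDocs before relying on it.

```python
def tika_header(prefix: str, setter: str) -> str:
    """Build an X-Tika-* header name from a config setter name.

    prefix is "PDF" (PDFParserConfig) or "OCR" (TesseractOCRConfig).
    Assumption: the header suffix is the setter name minus "set"; the first
    letter is lower-cased only to match the example given in the answer.
    """
    if prefix not in ("PDF", "OCR"):
        raise ValueError("prefix must be 'PDF' or 'OCR'")
    if not setter.startswith("set") or len(setter) <= 3:
        raise ValueError("expected a setter name such as 'setOcrStrategy'")
    suffix = setter[3:]
    return "X-Tika-" + prefix + suffix[0].lower() + suffix[1:]


# Header dictionary that could be passed to an HTTP client of your choice.
headers = {
    tika_header("PDF", "setExtractInlineImages"): "true",
    tika_header("PDF", "setOcrDPI"): "100",  # assumed setter, see lead-in
}
print(headers)
```

Whether the server accepts a given value still depends on the setter's parameter type, so the setter's JavaDoc remains the reference for valid values.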

Related

Accept/Content-Type header based processing in Quart and Quart-Schema

Because I am rewriting a legacy app, I cannot change what the clients either send or accept. I have to accept and return JSON, HTML, and an in-house XML-like serialization.
Fortunately, they do set headers that describe what they are sending and what they accept.
So right now, what I do is have a decoder module and an encoder module with methods that are basically if/elif/else chains. When a route is ready to process/return something, I call the decoder/encoder module with the python object and the header field, which returns the formatted object as a string and the route processes the result or returns Response().
I am wondering if there is a more Quart native way of doing this.
I'm also trying to figure out how to make this work with Quart-Schema. I see from the docs that one can do app.json_encoder = <class>, and I suppose I could substitute a different processor there, but it seems to be application-global; there's no way to set it based on what the client sends. Ideally, I could just pass the results of a dynamically chosen parser to Quart-Schema and let it do its thing on Python objects.
Thoughts and suggestions welcome. Thanks!
You can write your own decorator, like Quart-Schema's @validate_headers(). Inside the decorator, check the Content-Type header, parse the body accordingly, and pass the parsed object to func(...).
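A framework-agnostic sketch of that idea follows. It replaces the if/elif chain with a registry keyed by media type and wraps a handler in a decorator. All names here (`DECODERS`, `decoder`, `parse_body`, `with_parsed_body`) are hypothetical; in a Quart route the wrapper would presumably be `async` and take the header from `request.headers` and the body from `await request.get_data()` rather than from arguments.

```python
import json
from functools import wraps

DECODERS = {}  # maps a media type to a function: bytes -> Python object


def decoder(media_type):
    """Register a decoder function for one media type."""
    def register(func):
        DECODERS[media_type] = func
        return func
    return register


@decoder("application/json")
def decode_json(raw: bytes):
    return json.loads(raw.decode("utf-8"))


@decoder("text/html")
def decode_html(raw: bytes):
    # Placeholder: a real handler would parse the markup.
    return raw.decode("utf-8")


def parse_body(content_type: str, raw: bytes):
    """Pick a decoder from the Content-Type header value and apply it."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    func = DECODERS.get(media_type)
    if func is None:
        raise ValueError(f"unsupported Content-Type: {media_type!r}")
    return func(raw)


def with_parsed_body(func):
    """Decorator: the wrapped handler receives the parsed object."""
    @wraps(func)
    def wrapper(content_type, raw, *args, **kwargs):
        return func(parse_body(content_type, raw), *args, **kwargs)
    return wrapper


@with_parsed_body
def handle(payload):
    return payload["name"]


print(handle("application/json; charset=utf-8", b'{"name": "quart"}'))
```

An encoder registry keyed on the Accept header could mirror this for responses; the in-house XML-like format would simply be one more registered function.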

How to POST SPARQL to Virtuoso?

I am using two different HTTP POST utilities (poster out of Firefox as well as Python requests API) to post a simple SPARQL insert to Virtuoso.
My URL is: http://localhost:8890/sparql
My request parameters are:
default-graph-uri: <MY_GRAPH>
should-sponge: soft
debug: on
timeout:
format: application/xml
save: display
fname:
I put the actual SPARQL (INSERT DATA { GRAPH...) in the content of the message.
I tried different content types, none of which worked. I do get a 200, but the response is HTML even though the parameter set above specifies application/xml, and no data is inserted. When I try a content type of text/turtle, I get 409 Invalid Path, which is also referenced in this post.
I can successfully do HTTP GET, however, that has a payload length limitation which I would like to exceed for performance reasons. The only difference with the GET is that the SPARQL goes in the URL under query parameter and the POST should enable a much larger payload in the message content, by including multiple triples in the same request, not just one (I have 100s of 1000s of inserts). I was trying to follow this documentation page.
I stopped by this question days ago trying to achieve the same with curl. Since it is a powerful (and far more convenient) alternative to browser extensions, here is the formulation that eventually proved successful:
curl -X POST \
-H "Content-Type:application/sparql-update" \
-H "Accept:text/html" \
--data "select distinct ?Concept where {[] a ?Concept} LIMIT 100" http://localhost:8890/sparql
More details on the headers in this thread.
If you are using Python, I would avoid using the requests library directly. There are dedicated libraries for RDF which abstract the process and make your life easier.
Try:
SPARQLWrapper
RDFLib
They are both from the same family of packages, maintained by the rdflib project.
Based on experience, I find the SPARQLWrapper significantly simpler and easier to use for your use case. It's an abstracted version of RDFLib. The docs suggest something like this could work:
from SPARQLWrapper import SPARQLWrapper, POST

sparql = SPARQLWrapper("https://example.org/sparql")
sparql.setCredentials("some-login", "some-password")  # if required
sparql.setMethod(POST)  # this is the crucial option
# Build the query string beforehand if you need to substitute variables into it.
sparql.setQuery("""
<QUERY GOES HERE>
""")
results = sparql.query()
print(results.response.read())
Make sure you add the option for POST. You should be doing bulk I/O in no time :).
There are many aspects to this "question" making it difficult to provide a simple answer, suitable to this site. This is one of the reasons I suggested the mailing list, which is better suited to conversational and/or multi-facet assistance.
Have you tried using curl as most of our examples do?
Looking at the Poster page on Mozilla Add-Ons, I see that you may need to manually add a ? to the end of your target URI -- so http://localhost:8890/sparql? rather than http://localhost:8890/sparql -- and it's not clear whether you've done that in your testing. On the project page, I also note its last commit was in 2012, and there are a great many open issues.
I'm not at all familiar with Python, so I've not dug in there.
Have you tried setting an Accept: header? This can have significant impact on the content returned by the server.
If I understand your described efforts correctly, your format: query parameter should be output-format:, and its value should not be application/xml but one of the supported formats listed in the documentation.
Neither the virtuoso-users post you referenced nor this question have enough detail to analyze the cause of the 409 Invalid Path error. Explicit details that allow us to reproduce this result would be helpful, optimally in a distinct thread.
This seems to be a Virtuoso-specific issue: you can only POST a query by using the content type "application/sparql-update" instead of the more common "application/sparql-query".
The request is done as follows with Python:
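from requests import Response, Session
from requests.adapters import HTTPAdapter


def post_sparql(server_url: str, sparql_string: str) -> dict:
    # Virtuoso expects this content type for POSTed SPARQL
    headers = {
        'Content-Type': 'application/sparql-update',
        'Accept': 'application/json',
    }
    s = Session()
    s.mount(server_url, HTTPAdapter(max_retries=1))
    response: Response = s.post(server_url, data=sparql_string, headers=headers, timeout=100)
    return response.json()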
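Since the question is about sending many triples per request, the payload for such a POST could be built in batches along these lines. This is a sketch, not something taken from the Virtuoso documentation: `insert_batches`, the batch size and the example URIs are all assumptions, and the triples are expected to be already serialised as SPARQL terms.

```python
from typing import Iterable, Iterator, List, Tuple

# Terms are already serialised, e.g. "<http://example.org/s>" or '"literal"'.
Triple = Tuple[str, str, str]


def _insert_data(graph_uri: str, lines: List[str]) -> str:
    """Wrap serialised triples in a single INSERT DATA update."""
    body = "\n".join(lines)
    return f"INSERT DATA {{ GRAPH <{graph_uri}> {{\n{body}\n}} }}"


def insert_batches(graph_uri: str, triples: Iterable[Triple],
                   batch_size: int = 1000) -> Iterator[str]:
    """Yield INSERT DATA updates, each carrying at most batch_size triples."""
    batch: List[str] = []
    for s, p, o in triples:
        batch.append(f"{s} {p} {o} .")
        if len(batch) == batch_size:
            yield _insert_data(graph_uri, batch)
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield _insert_data(graph_uri, batch)


triples = [
    ("<http://example.org/a>", "<http://example.org/p>", '"1"'),
    ("<http://example.org/b>", "<http://example.org/p>", '"2"'),
    ("<http://example.org/c>", "<http://example.org/p>", '"3"'),
]
updates = list(insert_batches("http://example.org/g", triples, batch_size=2))
print(len(updates))
print(updates[0])
```

Each yielded string would then be sent as the body of one POST request; the best batch size depends on the server's limits and is worth measuring.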

Rails format specifier differences

I'm about to lead a training seminar on REST for some coworkers, and I'd like to verify something regarding Rails routing.
Our app in its current form allows clients to specify format in three different ways:
1. /path/to/resource.json
2. /path/to/resource?format=json
3. Accept header of the request
My question pertains to the first 2 options: is there any inherent difference in what these specifications do? Specifically, do they set only the Accept header, or the Content-Type header as well?
Well, 1 and 2 are not exactly different, since Rails typically generates routes like:
/something(.:format)
That means "there is an optional parameter format delimited with a dot". Parameters, however, can also be specified in query string, which is not part of the route.
So the second way of querying for JSON will make the route system think that the format is not in the route at all. When it comes to the controller, however, Rails will already have that query string parsed and will find the format when the time comes to respond.
That said, if you hit plain /path/to/resource without format specified anywhere, you'll get the same result as 2: you hit a route assuming there's no format given. Still, Rails will parse the headers and determine the format it should respond with.
As for what the client needs to set: the Accept header only. Content-Type only makes sense when the client itself sends an entity; it tells Rails how to parse the incoming parameters and has nothing to do with the response. Of course, by default Rails does its best to set a sensible Content-Type on the response.
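The lookup order described in this answer can be modelled roughly as follows. This is a simplified Python illustration, not Rails code: `resolve_format` and the `KNOWN` table are hypothetical, Accept q-values are ignored, and letting the path suffix win over the query string when both are present is an assumption.

```python
import posixpath
from urllib.parse import parse_qs, urlsplit

# Tiny stand-in for a MIME type registry.
KNOWN = {"application/json": "json", "application/xml": "xml", "text/html": "html"}


def resolve_format(url: str, accept: str = "") -> str:
    """Return the response format for a request URL and Accept header."""
    parts = urlsplit(url)
    # 1. Optional ".format" suffix captured by a route like /something(.:format)
    ext = posixpath.splitext(parts.path)[1].lstrip(".")
    if ext:
        return ext
    # 2. ?format=... in the query string ends up as the same parameter
    query_format = parse_qs(parts.query).get("format")
    if query_format:
        return query_format[0]
    # 3. Otherwise fall back to the Accept header (first known type wins here)
    for item in accept.split(","):
        media_type = item.split(";", 1)[0].strip().lower()
        if media_type in KNOWN:
            return KNOWN[media_type]
    return "html"


print(resolve_format("/path/to/resource.json"))
print(resolve_format("/path/to/resource?format=json"))
print(resolve_format("/path/to/resource", accept="application/json"))
```

All three calls resolve to the same format, which is the point of the answer: options 1 and 2 both end up as a format parameter, and neither touches the request's Content-Type.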
Please check out the following lines in the Rails source:
https://github.com/rails/rails/blob/756baf296b3cb3f7bc40d5843e259276695071ab/actionpack/lib/action_dispatch/http/response.rb#L113
This is where Rails looks up the Content-Type header and derives the content type and charset from it:
if content_type = self[CONTENT_TYPE]
  type, charset = content_type.split(/;\s*charset=/)
  @content_type = Mime::Type.lookup(type)
  @charset = charset || self.class.default_charset
end
So you can set the content type programmatically via the header, via params, or via a .format suffix.

How do you change the default format to XML in Symfony?

I'm writing a restful XML API for a university assignment, the spec requires no HTML frontend.
There doesn't seem to be any documentation (or guessable functionality) on how to change the default format. Whilst thus far I have created all templates as ...Success.xml.php, it would be easier to just use the regular ones and set this globally. I really expected this to be configurable from YAML, yet I have found some hard-coded references to the HTML format.
The main issue I'm encountering is that part of the assessment is returning a 404 in a certain way (not as a 404 :/), but importantly it must always return XML, and the default setup for a missing route is an HTML 404, not XML (so it only works when I use forward404 from an action running via an XML route).
So in summary, is there a way to do this / what class(es) do I have to override?
Try putting this in factories.yml
all:
  request:
    class: sfWebRequest
    param:
      default_format: xml
That will still need the template names changing though. It will just mean that urls that don't specify a format will revert to xml instead of html.
You can subclass sfPHPView and override the initialize method to affect this (copy and paste the initialize method from sfView). Lines like this need changing:
if ('html' != $format)
You then need to change the view class used ... try this:
http://mirmodynamics.com/post/2009/02/23/symfony%3A-use-your-own-View-class

Why we don't use such URL formats?

I am reworking the URL formats of my project. The basic format of our search URLs is this:
www.projectname/module/search/<search keyword>/<exam filter>/<subject filter>/... other params ...
On searching with no search keyword and exam filter, the URL will be :-
www.projectname/module/search///<subject filter>/... other params ...
My question is: why don't we see such URLs with back-to-back slashes (3 slashes after www.projectname/module/search)? Please note that I am not using .htaccess rewrite rules in my project anymore. This URL works perfectly. So, should I use this format?
For more details on why we chose this format, please check my other question:-
Suggest best URL style
Web servers will typically remove multiple slashes before the application gets to see the request, for a mix of compatibility and security reasons. When serving plain files, it is usual to allow any number of slashes between path segments to behave as one slash.
Blank URL path segments are not invalid in URLs, but they are typically avoided because relative URLs with blank segments may parse unexpectedly. For example, on the page /module/search, a link to //subject/param is not resolved relative to that page; it is read as a link to the host named subject with the path /param.
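Python's standard `urllib.parse.urljoin` follows the same resolution rules and makes the pitfall easy to see. The host name `www.example.com` is a placeholder.

```python
from urllib.parse import urljoin

base = "http://www.example.com/module/search"

# A leading "//" makes this a network-path reference: "subject" becomes the host.
resolved = urljoin(base, "//subject/param")
print(resolved)  # http://subject/param

# Without the leading slashes the link stays on the same host.
same_host = urljoin(base, "subject/param")
print(same_host)  # http://www.example.com/module/subject/param
```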
Whether you can see the multiple-slash sequences from the original URL depends on your server and application framework. In CGI, for example (and other gateway standards based on it), the PATH_INFO variable that is typically used to implement routing will usually omit multiple slashes. But on Apache there is a non-standard environment variable REQUEST_URI which gives the original form of the request without having elided slashes or done any %-unescaping like PATH_INFO does. So if you want to allow empty path segments, you can, but it'll cut down on your deployment options.
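To see why this matters for a positional scheme like the one in the question, here is a small sketch of what slash collapsing does to the segments. `collapse_slashes` only imitates the normalisation a server or gateway may apply; it is not any particular server's implementation, and the example path is made up.

```python
import re


def collapse_slashes(path: str) -> str:
    """Imitate a server that treats runs of slashes as a single slash."""
    return re.sub(r"/{2,}", "/", path)


def segments(path: str) -> list:
    """Split a path into its segments, keeping empty ones."""
    return path.strip("/").split("/")


original = "/module/search///maths/2019"
print(segments(original))
# ['module', 'search', '', '', 'maths', '2019']

print(segments(collapse_slashes(original)))
# ['module', 'search', 'maths', '2019']
```

After collapsing, the subject filter sits where the search keyword was expected, so the application can no longer tell which positions were left empty.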
There are other strings than the empty string that don't make good path segments either. Using an encoded / (%2F), \ (%5C) or null byte (%00) is blocked by default by many servers. So you can't put any old string in a segment; it'll have to be processed to remove some characters (often ‘slug’-ified to remove all but letters and numbers). Whilst you are doing this you may as well replace the empty string with _.
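A minimal sketch of that processing step, assuming ASCII-only slugs are acceptable (non-ASCII letters are dropped here) and using `_` as the stand-in for an empty value. `to_segment` and `search_path` are hypothetical helpers.

```python
import re


def to_segment(value: str) -> str:
    """Reduce a value to lower-case letters, digits and hyphens; '_' if empty."""
    slug = re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")
    return slug or "_"


def search_path(keyword: str, exam: str, subject: str) -> str:
    """Build a search path in which every position is always filled."""
    parts = (to_segment(v) for v in (keyword, exam, subject))
    return "/module/search/" + "/".join(parts)


print(search_path("", "", "Organic Chemistry"))
# /module/search/_/_/organic-chemistry
```

The router then needs to treat `_` as "no value", and a search for a literal underscore can no longer be expressed, which is usually an acceptable trade.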
Probably because it's not clearly defined whether the extra / should be ignored.
For instance: http://news.bbc.co.uk/sport and http://news.bbc.co.uk//////////sport both display the same page in Firefox and Chrome. The server is treating the two urls as the same thing, whereas your server obviously does not.
I'm not sure whether this behaviour is defined somewhere or not, but it does seem to make sense (at least for the BBC website - if I type an extra /, it does what I meant it to do.)

Resources