Is there a way to turn off parsing of embedded docs in the tika-server? - apache-tika

I run an unmodified JAX-RS instance of the Apache tika-server 1.22 and use it as an HTTP end-point service that I post files to (mostly Office, PDF and RTF) and get plain-text renditions back with HTTP requests (using the Accept="text/plain" header) from our application.
Since Tika 1.15, the default behaviour is to "extract all embedded documents" (TIKA-2096).
I want to be able to turn this behaviour off on our tika-server so that embedded documents are NOT extracted and I only get the text rendition of the main document contents.
Is it possible to do this via a tika-config.xml file, or do I need to do a custom build and subclass EmbeddedDocumentExtractor so that it doesn't do anything?
An answer to tika-parser-exclude-pdf-attachments indicates that you can turn this behaviour off by subclassing EmbeddedDocumentExtractor, but I'd like to check if it's possible to do this via tika-config.xml without having to do a custom build of the tika-server.
I have looked at Configuring Tika but there is no mention of embedded docs here.

The answers in tika-parser-exclude-pdf-attachments are excellent if you are calling Tika from code.
Previously there was no way to do this for embedded files in Tika Server, other than disabling a whole file type using EmptyParser with something like the below:
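<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Use the default parsers for everything except the excluded types -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>image/jpeg</mime-exclude>
      <mime-exclude>application/zip</mime-exclude>
    </parser>
    <!-- Hand the excluded types to EmptyParser, which extracts nothing -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>image/jpeg</mime>
      <mime>application/zip</mime>
    </parser>
  </parsers>
</properties>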
This has become a common request, so I've added a feature, coming in Tika 1.25 (yet to be released), that allows skipping embedded files using a header setting:
curl -T test_recursive_embedded.docx http://localhost:9998/tika --header "Accept: text/html" --header "X-Tika-Skip-Embedded: true"
Any parser using the EmbeddedDocumentExtractor will honour this.
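If you post files from application code rather than curl, the same request can be sketched in Python with only the standard library. This is a minimal sketch: the server URL and the helper names `build_tika_request` and `extract_text` are assumptions for illustration. It asks for text/plain as in the question, and the X-Tika-Skip-Embedded header only has an effect on a tika-server version that includes the feature described above.

```python
import urllib.request


def build_tika_request(document: bytes,
                       server_url: str = "http://localhost:9998/tika",
                       skip_embedded: bool = True) -> urllib.request.Request:
    """Build a PUT request like the curl command above (curl -T uploads with PUT)."""
    headers = {"Accept": "text/plain"}
    if skip_embedded:
        # Header named in the answer; assumed to be supported by the running server.
        headers["X-Tika-Skip-Embedded"] = "true"
    return urllib.request.Request(server_url, data=document,
                                  headers=headers, method="PUT")


def extract_text(path: str) -> str:
    """Send the file at `path` to tika-server and return the plain-text rendition."""
    with open(path, "rb") as fh:
        request = build_tika_request(fh.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `extract_text("test_recursive_embedded.docx")` would then return the text of the main document only, provided the server honours the header.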

Related

how to generate html from swagger yaml file [duplicate]

The only HTML outputs available for download (HTML, HTML2 and dynamic) are all ugly, but the site, e.g. https://app.swaggerhub.com/apis/{user}/{project}/{version}
(and many others!), offers a pretty HTML interface. How can I download this pretty HTML
as complete, self-contained HTML code (a file or a zip of files)?
I have a good, valid swagger.yaml or swagger.json file for my API, so another solution would be to run an open-source (plug and play!) tool on my API description file.
The pretty: [screenshot]
The ugly: [screenshot]
The "pretty interface" in your screenshot is Swagger UI. It's free and open-source. There's a demo at http://petstore.swagger.io, where you can load your own YAML/JSON files from a URL and see how they would be rendered.
To use Swagger UI locally:
1. Go to https://github.com/swagger-api/swagger-ui, download the repository as a ZIP and unpack it.
2. Edit the dist\index.html file and change the line
   url: "http://petstore.swagger.io/v2/swagger.json",
   to the URL of your Swagger .json or .yaml file, e.g.
   url: "http://api.mysite.com/swagger.json",
3. (Optional) Add/change other configuration parameters in the SwaggerUIBundle initialization code in dist\index.html.
4. Open the dist\index.html file in your browser to preview your API docs.
   Note: If the spec does not load or "try it out" does not work, you probably need to enable CORS on your server. See https://github.com/swagger-api/swagger-ui/blob/master/docs/usage/cors.md and https://enable-cors.org.
5. Upload the files from the dist folder somewhere to your server - and now you have pretty API docs too!
Alternatively, SwaggerHub (which you mentioned) provides cloud hosting for Swagger specs among other things, and has Swagger UI integrated. You can import your Swagger .json/.yaml files there and have your API docs hosted on SwaggerHub. A free plan is available.
Thanks to @tleyden at swagger-ui/issues for good clues!
Use the index and assets folder of this project, https://github.com/okfn-brasil/swagger-ui-html

Setting the mime-type for one specific file (which has a blank extension!)

I'm trying to make a universal URL link work on iOS. Apple wants me to put a file called "apple-app-site-association" in a specific place on my server. So far so good.
However, in the interests of making things harder than they need to be, Apple ALSO wants me to serve this file with mime type application/json.
I'm trying to do it by putting this in htaccess:
<Files apple-app-site-association>
AddType application/json .
</Files>
But... it looks like Apache doesn't want to play .htaccess with a blank file extension. Can anyone clue me in on how I can solve the Apple Sphinx's riddle, with hopefully the same results Theseus got? How do I serve this blank-extension file as application/json? I'm using httpd on CentOS 8.
The solution:
<Files "apple-app-site-association">
ForceType application/json
</Files>
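To confirm that the file really is served with the expected type, a small check along these lines can help. It is only a sketch using the Python standard library: the function names are made up for illustration, and the URL you pass in (for example the location of apple-app-site-association on your own host) is an assumption.

```python
import urllib.request


def is_json_content_type(header_value: str) -> bool:
    """True if a Content-Type value names application/json, ignoring parameters."""
    media_type = header_value.split(";", 1)[0].strip().lower()
    return media_type == "application/json"


def served_as_json(url: str) -> bool:
    """Issue a HEAD request and inspect the Content-Type header of the response."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return is_json_content_type(response.headers.get("Content-Type", ""))
```

`served_as_json(...)` returns True when the server answers with application/json, with or without a charset parameter.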
There are a couple of solutions for this on the web with formatting that causes internal server errors, so I'm putting this up for posterity.

large file upload via Zuul

I'm trying to upload a large file through Zuul.
Basically I have the applications set up like this:
UI: this is where the Zuul Gateway is located
Backend: this is where the file must finally arrive.
I used the functionality described here, so everything works fine if I use "Transfer-Encoding: chunked". However, this can only be set via curl. I haven't found any way to set this header in the browser (the header is rejected with the console error "Refused to set unsafe header ..").
Any idea how to instruct the browser to set this header?
It seems that there are actually 2 possible ways to upload large files via Zuul:
1. By using "Transfer-Encoding: chunked" in the header (but this cannot be used in a browser, as mentioned in the initial question, because this header is considered unsafe).
2. By bypassing the DispatcherServlet used by Zuul (putting the /zuul path in front of the usual path that I was using).
I found the documentation not very clear on this point (that you can use either of the 2 options). In my case, since the file was being uploaded via AngularJS (hence in the browser), I had to use the second approach.
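The second approach only changes the URL the client posts to. A minimal sketch of that rewrite, in Python for illustration; the helper name `zuul_servlet_url` and the example route are assumptions, not part of Zuul:

```python
from urllib.parse import urlsplit, urlunsplit


def zuul_servlet_url(url: str, prefix: str = "/zuul") -> str:
    """Put the /zuul prefix in front of the path so the upload bypasses DispatcherServlet."""
    parts = urlsplit(url)
    path = parts.path if parts.path.startswith("/") else "/" + parts.path
    # Leave URLs alone that already target the Zuul servlet path.
    if path == prefix or path.startswith(prefix + "/"):
        return url
    return urlunsplit(parts._replace(path=prefix + path))
```

For a hypothetical route http://localhost:8080/api/files/upload, the browser would post to http://localhost:8080/zuul/api/files/upload instead.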

How to export swagger.json (or yaml)

How can I export a Swagger definition file? It should be a JSON or YAML file, e.g. swagger.json or swagger.yaml.
Let's say I have an endpoint looking like http://example.com//swagger/ui/index#!:
The version is api version: v1.
There is no "Export" button that I can see. So how do I export it?
The URL of the API definition is displayed in the top bar of Swagger UI. In your example it's
/v2/api-docs?group=full-petstore-api
So the full URL appears to be
http://localhost:8080/v2/api-docs?group=full-petstore-api
In newer versions of Swagger UI, the link to the API definition is often displayed below the API title, so you can right-click the link and Save As.
If your Swagger UI does not have a visible link to the API definition, view the page source and look for the url parameter, such as:
const ui = SwaggerUIBundle({
url: "https://petstore.swagger.io/v2/swagger.json", // <-------
dom_id: '#swagger-ui',
If you don't see the url or if url is a code expression, open the browser dev tools, switch to the Network tab and disable caching. Then refresh the page and search for the API definition file (swagger.json, swagger.yaml, api-docs or similar) among HTTP requests. You can filter by XHR to narrow down the list.
Another way to find the actual url is to use the browser console and evaluate one of the following values, depending on your UI version:
Swagger UI 3.x:
ui.getConfigs().url
Swagger UI 2.x:
swaggerUi.api.url
Sometimes the OpenAPI definition may be embedded within a .js file – in this case take this file and strip out the extra parts.
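Once you know the URL of the definition, saving it is a plain HTTP download. A minimal sketch with the Python standard library, assuming the URL returns JSON; the function name `save_definition` and the default output file name are illustrative:

```python
import json
import urllib.request


def save_definition(url: str, out_path: str = "swagger.json") -> dict:
    """Download the API definition at `url`, save it to `out_path`, and return it."""
    with urllib.request.urlopen(url) as response:
        spec = json.load(response)  # raises ValueError if the body is not valid JSON
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(spec, fh, indent=2)
    return spec
```

With the example above, `save_definition("http://localhost:8080/v2/api-docs?group=full-petstore-api")` would write swagger.json to the current directory.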
Though this has already been answered correctly, I thought I would post a more detailed version. Hope this helps.
If you do have the Swagger JSON file that you feed to Swagger UI, then to generate a .yaml file just open the link below, copy-paste your JSON into the editor and download the YAML file. This is the straightforward method.
link : https://editor.swagger.io/#
The second way, for when you don't have any Swagger JSON file, is as follows:
Open Swagger UI, open the browser inspector (Shift+Ctrl+I) and refresh the page; you will see the developer-tools tabs.
Choose the XHR or All tab under the Network tab, look for the file api-doc?group=* and click the Response subtab. Now copy the content of the api-doc?group=* file and use the same editor link to convert it to a YAML file.
link : https://editor.swagger.io/#
The JSON may also be inlined in the document, specifically for Swagger version 2.0. If you haven't found anything after walking through @Helen's answer, give this a try:
View Page Source
Search for "swagger" or "spec"
If you see a <script type="application/json"> tag with something similar to the following in it, this is effectively your swagger.json content. Copy everything inside of the <script> tags and save into a file named swagger.json and you should be good to go.
<script id="swagger-data" type="application/json">
{"spec":{"definitions":{},"info":{},"paths":{},"schemes":[],"swagger":"2.0"}}
</script>
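If you would rather not copy the content by hand, the same extraction can be sketched with Python's standard html.parser module. The class and function names here are made up for illustration, and the sketch assumes the page embeds the definition in a script tag with type="application/json", as in the snippet above:

```python
import json
from html.parser import HTMLParser


class _JsonScriptCollector(HTMLParser):
    """Collect the text of every <script type="application/json"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._inside = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data


def extract_inline_json(page_source: str) -> list:
    """Return the parsed JSON of each inline application/json script block."""
    collector = _JsonScriptCollector()
    collector.feed(page_source)
    collector.close()
    return [json.loads(block) for block in collector.blocks if block.strip()]
```

Feed it the page source saved from the browser, then write the block you need to swagger.json with json.dump.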
I'm using Django Rest Framework (so pip package django-rest-swagger==2.2.0) and the above answers weren't really sufficient. There were two options:
1) View the page source with developer tools. When I hit my http://localhost:8000/docs/ endpoint, I see:
The docs/ endpoint was configured in Django, so it may be different for you. When digging into the details of that, I can go to the Response tab (in Chrome) and scroll down to find the actual JSON. It's the value in window.drsSpec
2) The alternative (and perhaps easier) approach is to add ?format=openapi to my endpoint, as suggested in https://github.com/marcgibbons/django-rest-swagger/issues/590
This will directly spit out the JSON you need. I imported it into Postman by changing the swagger field to openapi which seems a little hacky but it worked 🤷🏻‍♂️
For Swashbuckle.AspNetCore (5.5.0), try:
services.AddControllers()
    .AddJsonOptions(options =>
        options.JsonSerializerOptions.Converters.Add(new JsonStringEnumConverter()));
I tried this in an ASP.NET Core Web API project. You also need this using directive:
using System.Text.Json.Serialization;
Visit http://localhost:49846/swagger/docs/v1
The above URL will return JSON. Save the JSON as swagger.json
Please replace the port number with your port number.
This can also be achieved at build time using a JUnit test case; see https://github.com/springfox/springfox/issues/1959 for more details.

Get text from doc/docx file in pages using Apache tika

I am using the Apache Tika command-line tool to extract text from doc and docx files. I can get the whole text, but I am unable to get it page by page so that I can store each page separately. Is there any way to achieve that?
Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
Since POI (fundamentally) cannot read out those page numbers and Tika is not meant to be a document renderer either, the answer is very simply: No, this is not possible.
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.
