Extract text content from Tika without specifying the file header - apache-tika

Is there a way to extract content from a file with a Tika server without explicitly defining the header? For example for a specific file named "file.pdf" if I do
curl -X PUT --data-binary @file.pdf localhost:9998/tika --header "Content-type: application/pdf" > file.txt
I get the extracted content in "file.txt" but if I omit the
' --header "Content-type: application/pdf" '
I get an empty "file.txt".
In general, is there a way to automate submitting a document to a Tika server and extracting its content as text with a single command?
Or alternatively, how can I use a pipeline to feed the content type that Tika detects for a file into the command at the beginning of this question?
Thank you very much community!

You're calling the Tika Server the wrong way to get auto-detection. As detailed on the Tika Server wiki page, to have the plain text extracted from any file (including PDF) you should run curl as:
curl -T file.pdf http://localhost:9998/tika --header "Accept: text/plain"
You need an Accept header to tell Tika what format you want your result in (plain text or HTML for text extraction; more formats are available for metadata). As long as you send the file directly with the -T option, its type will be auto-detected for you. By contrast, --data-binary makes curl label the body as application/x-www-form-urlencoded unless you override the Content-Type, which is likely why the original command returned nothing without the explicit header.
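For scripting the same request without curl, here is a minimal Python sketch of what the -T command does: a PUT of the raw file bytes with an Accept: text/plain header and no Content-Type, so the server has to detect the type itself. The function name extract_text is made up for illustration, the host and port are the ones from the question, and the UTF-8 decoding of the response is an assumption.

```python
import http.client


def extract_text(data: bytes, host: str = "localhost", port: int = 9998) -> str:
    """PUT raw file bytes to /tika and return the plain-text response.

    Hypothetical helper: host and port default to the values used in the question.
    """
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        # Only Accept is set; no Content-Type is sent, mirroring `curl -T`.
        conn.request("PUT", "/tika", body=data, headers={"Accept": "text/plain"})
        response = conn.getresponse()
        body = response.read()
        if response.status != 200:
            raise RuntimeError(f"server answered {response.status} {response.reason}")
        # Assumption: the server returns the extracted text as UTF-8.
        return body.decode("utf-8")
    finally:
        conn.close()
```

With a server running, a call such as extract_text(open("file.pdf", "rb").read()) would return the text that the curl command writes to stdout.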

Related

Nitrogen - File upload directly to database

In the Nitrogen Web framework, files uploaded always end in the ./scratch/ directory when using #upload{}. From here you are supposed to manage the uploaded files, for example, by copying them to their final destination directory.
However, in case the destination is a database, is there a way of uploading these files straight to the database? Use case RIAK-KV.
You can upload a file to Riak KV using an HTTP POST request. You can see the details in the Creating Objects documentation, which shows how to do it using curl.
To send the contents of a file instead of a value, something like this should work (--data-binary is used rather than -d because -d strips line breaks from the file):
curl -XPOST http://127.0.0.1:8098/types/default/buckets/scratch/keys/newkey \
  --data-binary @path/to/scratch.file \
  -H "Content-Type: application/octet-stream"

Uploading file to rest API using JMeter

Note: I have checked the BlazeMeter tutorial, which uploads the document as Body Data, whereas I use the File Upload tab.
Here is how my request looks:
On execution I get the following request:
POST https://xxx
POST data:
<actual file content, not shown here>
[no cookies]
Request Headers:
Connection: keep-alive
Content-Type: multipart/form-data
Accept-Language: en-US
Authorization: bearer <>
Accept: application/json
Content-Length: 78920
Host: test-host
User-Agent: Apache-HttpClient/4.5.2 (Java/1.8.0_102)
And the request fails with 400 error -
Response code: 400
Response message: Bad Request
Since I am able to carry out the file upload using curl, I assume that I missed some configuration in JMeter. The curl command looks like this:
curl -X POST --header 'Content-Type: multipart/form-data' --header 'Accept: application/json' --header 'Authorization: Bearer <>' -F upload_file=@"test.pdf" 'https://xxx'
What did I miss in JMeter file upload?
Another vote for using the Java implementation in the Advanced tab in JMeter. My headers and body were exactly the same between Postman and JMeter, but it wouldn't upload my file (I got response code 415) until I changed to the Java implementation.
If you can successfully upload file via curl, why don't you just record the upload through JMeter HTTP(S) Test Script Recorder like:
curl -x http://localhost:8888 -X POST --header 'Content-Type....."
If you still need to build the request manually, consider two important points:
You need to check "Use multipart/form-data for POST".
Most importantly, you need to supply the "Parameter Name". According to the HTTP Request sampler manual:
For the POST and PUT method, if there is no file to send, and the name(s) of the parameter(s) are omitted, then the body is created by concatenating all the value(s) of the parameters.
Judging by the curl command in your case, the "Parameter Name" should be upload_file.
So the final configuration should look like:
See the "Performance Testing: Upload and Download Scenarios with Apache JMeter" guide for the above steps described in detail.
My backend server is implemented in Java, and in the file upload request I had to select Java as the Implementation!
Here is the file upload section
Thank you for the Java implementation of HTTP! File uploads are working again for me; they hadn't worked since 2.13.
Here's my post from elsewhere:
I had the same issue. I thought JMeter was doing something wrong, since this worked for me in 2.13 and hasn't worked since version 3. Then I saw a post somewhere saying that using the Java implementation of HTTP worked. Guess what? It worked for me too! I had been struggling to dissect every part of the POST. I was doing it right all along; I just needed the Java implementation of HTTP, and voilà!
Hope that helps!

curl post data and file contemporary

I have tried:
curl -v --http1.0 --data "mac=00:00:00" -F "userfile=@/tmp/02-02-02-02-02-22" http://url_address/getfile.php
but it fails with the following message:
Warning: You can only select one HTTP request!
How can I send a mix of data and file by curl? Is it possible or not?
Thank you
Read up on how -F actually works! You can add any number of data parts and file parts in a multipart formpost that -F makes. -d however makes a "standard" clean post and you cannot mix -d with -F.
You need to first figure out which kind of post you want, then you pick either -d or -F depending on your answer.
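To make concrete what a multipart formpost with one data part and one file part looks like on the wire, here is a rough Python sketch that builds such a body by hand. It roughly corresponds to passing both values with -F (for example -F "mac=00:00:00" -F "userfile=@/tmp/02-02-02-02-02-22"). The helper name build_multipart is made up for illustration, and labelling the file part application/octet-stream is an assumption rather than something the answer specifies.

```python
import uuid


def build_multipart(fields: dict, files: dict) -> tuple:
    """Return (content_type, body) for a multipart/form-data post.

    fields maps part names to text values; files maps part names to
    (filename, bytes) pairs. Hypothetical helper for illustration only.
    """
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        # A plain data part: headers, blank line, then the value.
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            value.encode(),
        ]
    for name, (filename, content) in files.items():
        # A file part additionally carries a filename and its own content type.
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'.encode(),
            b"Content-Type: application/octet-stream",  # assumed type
            b"",
            content,
        ]
    # The closing delimiter ends the body.
    lines += [f"--{boundary}--".encode(), b""]
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(lines)
```

The returned content type has to be sent as the request's Content-Type header, because the receiver needs the boundary to split the parts.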

curl needs to send '\r\n' - need transformation of a working solution

I need a transformation of the following working curl command:
curl --data-binary @"data.txt" http://www.example.com/request.asp
The data.txt includes this:
foo=bar
parameter1=4711
parameter2=4712
The key is that I need to send the line breaks, and they are \r\n. It works with the file because the file has the right encoding, but how do I get this curl command to run without the file? I'm looking for a one-liner that sends the parameters with the correct \r\n at the end of each.
None of my tests with different URL encodings, etc. worked. I never got the same result as with the file.
I need this information because I am having serious trouble getting this POST to run in my Ruby on Rails app using net/http.
Thanks!
One way to solve it is to generate the binary stream with something on the fly, like the printf command, and have curl read the data from stdin:
printf 'foo=bar\r\nparameter1=4711\r\nparameter2=4712' | curl --data-binary @- http://example.com
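If the request is eventually built in code rather than on the command line, the same bytes can be produced directly. This is a small Python sketch of the body that the printf pipeline above generates; the helper name crlf_body is made up for illustration.

```python
def crlf_body(params: dict) -> bytes:
    """Join key=value pairs with CRLF, with no trailing line break."""
    return "\r\n".join(f"{key}={value}" for key, value in params.items()).encode("ascii")


body = crlf_body({"foo": "bar", "parameter1": "4711", "parameter2": "4712"})
print(repr(body))  # b'foo=bar\r\nparameter1=4711\r\nparameter2=4712'
```

Like the printf version, this puts \r\n between the parameters only; append one more "\r\n" if the receiving script expects a line break after the last parameter as well.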

Getting only response header from HTTP POST using cURL

One can request only the headers using HTTP HEAD, as option -I in curl(1).
$ curl -I /
Lengthy HTML response bodies are a pain to get in command-line, so I'd like to get only the header as feedback for my POST requests. However, HEAD and POST are two different methods.
How do I get cURL to display only response headers to a POST request?
-D, --dump-header <file>
Write the protocol headers to the specified file.
This option is handy to use when you want to store the headers
that a HTTP site sends to you. Cookies from the headers could
then be read in a second curl invocation by using the -b,
--cookie option! The -c, --cookie-jar option is however a better
way to store cookies.
and
-S, --show-error
When used with -s, --silent, it makes curl show an error message if it fails.
from the man page. So
curl -sS -D - www.acooke.org -o /dev/null
dumps the headers to stdout and sends the body to /dev/null (add -L if you also want it to follow redirects). That's a GET, not a POST, but you can do the same thing with a POST: just add whatever option you're already using for POSTing data.
Note the - after the -D, which indicates that the output "file" is stdout.
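The same idea (keep the headers of a POST response, discard the body) can be sketched in Python for scripts where parsing curl output is inconvenient. The function name post_headers_only is made up for illustration; it sends a real POST, collects the status and headers, and closes the connection without reading the body.

```python
import http.client
from urllib.parse import urlsplit


def post_headers_only(url: str, body: bytes = b"") -> tuple:
    """POST `body` to `url` and return (status, headers) without reading the body.

    Hypothetical helper; headers is a list of (name, value) pairs.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=30)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=30)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("POST", target, body=body)
        response = conn.getresponse()  # returns once the header block is parsed
        return response.status, response.getheaders()
    finally:
        conn.close()  # the response body is never read
```

Unlike a HEAD request, the server still processes the POST and starts sending its body; the client just does not consume it.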
The other answers require the response body to be downloaded. But there's a way to make a POST request that will only fetch the header:
curl -s -I -X POST http://www.google.com
An -I by itself performs a HEAD request which can be overridden by -X POST to perform a POST (or any other) request and still only get the header data.
The following command displays extra information:
curl -X POST http://httpbin.org/post -v > /dev/null
You can ask the server to send just the headers instead of the full response:
curl -X HEAD -I http://httpbin.org/
Note: in some cases, the server may send different headers for POST and HEAD, but in almost all cases the headers are the same.
For long response bodies (and various other similar situations), the solution I use is always to pipe to less, so
curl -i https://api.github.com/users | less
or
curl -s -D - https://api.github.com/users | less
will do the job.
Maybe it is little bit of an extreme, but I am using this super short version:
curl -svo. <URL>
Explanation:
-v print debug information (which does include headers)
-o. sends the web page data (which we want to ignore) to the file ., which is a directory and therefore an invalid destination, so the output is discarded.
-s no progress bar, no error information (otherwise you would see Warning: Failed to create the file .: Is a directory)
Warning: the exit status always indicates failure (whether or not the site is reachable), so do not use this in, say, conditional statements in shell scripts.
Much easier – this also follows links.
curl -IL http://example.com/in-the-shadows
While the other answers have not worked for me in all situations, the best solution I could find (working with POST as well) is:
curl -vs 'https://some-site.com' 1> /dev/null
headcurl.cmd (windows version)
curl -sSkv -o NUL %* 2>&1
I don't want a progress bar (-s),
but I do want errors (-S),
I'm not bothering about valid HTTPS certificates (-k),
I want high verbosity (-v; this is about troubleshooting, isn't it?),
and no output (in a clean way).
Oh, and I want to forward stderr to stdout, so I can grep against the whole thing (since most or all of the output comes on stderr).
%* means "pass on all parameters to this script" (see https://stackoverflow.com/a/980372/444255); usually that's just one parameter: the URL you are testing.
real-world example (on troubleshooting proxy issues):
C:\depot>headcurl google.ch | grep -i -e http -e cache
Hostname was NOT found in DNS cache
GET HTTP://google.ch/ HTTP/1.1
HTTP/1.1 301 Moved Permanently
Location: http://www.google.ch/
Cache-Control: public, max-age=2592000
X-Cache: HIT from company.somewhere.ch
X-Cache-Lookup: HIT from company.somewhere.ch:1234
Linux version
for your .bash_aliases / .bashrc:
alias headcurl='curl -sSkv -o /dev/null $@ 2>&1'

Resources