I am trying to call Apache Tika via its REST API. I have successfully uploaded a PDF document and retrieved the document's text via curl:
curl -X PUT --data-binary @<filename>.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
That translated to Indy like so:
function GetPDFText(const FileName: String): String;
var
IdHTTP: TIdHTTP;
Params: TIdMultiPartFormDataStream;
begin
IdHTTP := TIdHTTP.Create;
try
Params := TIdMultiPartFormDataStream.Create;
try
Params.Add('file', FileName, 'application/pdf');
Result := IdHTTP.PUT('http://localhost:9998/tika', Params);
finally
Params.Free;
end;
finally
IdHTTP.Free;
end;
end;
Now I want to upload a Word document (.docx).
I assumed that all I would need to do is change the content type when I add my file to Params, but that doesn't seem to produce any results, although no error is reported back. I was able to get the following curl command to work correctly:
curl -T <myDOCXfile>.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
How do I modify my HTTP call from curl -X PUT to curl -T?
There are at least two issues in your implementation:
Your translation from curl -X PUT to TIdHTTP is wrong.
You don't specify the Accept HTTP header to retrieve the extracted text in a specific format.
How to translate curl -X PUT to Indy?
First, let's make it clear that curl -X PUT --data-binary @<filename> <url> is the same as curl -T <filename> <url> when:
<url>'s scheme is HTTP or HTTPS
<url> does not end with /
Therefore, using one or the other shouldn't matter in your case. See also the curl documentation.
Secondly, TIdMultiPartFormDataStream is designed for use with the POST verb. Nothing stops you from passing it to TIdHTTP.Put, because it is indirectly derived from TStream, but only the TIdHTTP.Post method has a dedicated overload that accepts a TIdMultiPartFormDataStream:
function Post(AURL: string; ASource: TIdMultiPartFormDataStream): string; overload;
To upload a file to the service, just use the TIdHTTP.Put method with a file stream as the argument, and provide the proper content type of the uploaded file in the HTTP header.
Finally, you're trying to extract plain text from the document, but you didn't specify the content type that the service should return. This is done via the Accept HTTP header. A default instance of TIdHTTP has the IdHTTP.Request.Accept property initialized to 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' (this may vary depending on the Indy version). Therefore, by default Tika will return HTML-formatted text. To get plain text, change it to 'text/plain; charset=utf-8'.
Fixed implementation:
uses IdGlobal, IdHTTP;
function GetDocumentText(const FileName, ContentType: string): string;
var
IdHTTP: TIdHTTP;
Stream: TIdReadFileExclusiveStream;
begin
IdHTTP := TIdHTTP.Create;
try
IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
IdHTTP.Request.ContentType := ContentType;
Stream := TIdReadFileExclusiveStream.Create(FileName);
try
Result := IdHTTP.Put('http://localhost:9998/tika', Stream);
finally
Stream.Free;
end;
finally
IdHTTP.Free;
end;
end;
function GetPDFText(const FileName: string): string;
const
PDFContentType = 'application/pdf';
begin
Result := GetDocumentText(FileName, PDFContentType);
end;
function GetDOCXText(const FileName: string): string;
const
DOCXContentType = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
begin
Result := GetDocumentText(FileName, DOCXContentType);
end;
According to Tika's documentation, it also supports posting multipart form data. If you insist on using this approach, change the target resource to /tika/form and switch to the Post method in your implementation:
function GetDocumentText(const FileName, ContentType: string): string;
var
IdHTTP: TIdHTTP;
FormData: TIdMultiPartFormDataStream;
begin
IdHTTP := TIdHTTP.Create;
try
IdHTTP.Request.Accept := 'text/plain; charset=utf-8';
FormData := TIdMultiPartFormDataStream.Create;
try
FormData.AddFile('file', FileName, ContentType); { older Indy versions: FormData.Add(...) }
Result := IdHTTP.Post('http://localhost:9998/tika/form', FormData);
finally
FormData.Free;
end;
finally
IdHTTP.Free;
end;
end;
Why does the original implementation in question work with PDF files?
When you Post multipart form data via TIdHTTP, Indy automatically sets the content type of the request to 'multipart/form-data; boundary=...whatever...'. This is not the case when you Put data (unless you set it manually before performing the request), so TIdHTTP.Request.ContentType remains blank. I can only guess that when Tika sees an empty content type, it falls back to some default type, which could be PDF, and is still somehow able to read the document from the multipart request.
Related
I am trying to use IdHTTP to replicate the following curl operation:
curl -X POST -F "message=@C:\Users\santon\Desktop\ESM_download\token.txt" "https://esm-db.eu/esmws/eventdata/1/query?eventid=IT-1997-0004&station=CLF&format=ascii" -o RecordFileName.zip
The curl command downloads a file from the server, which is then saved on the hard drive as RecordFileName.zip. Authorization is required through a token file on the hard drive called token.txt. The path of the token file is passed to curl as a parameter.
The best I could do is the following code:
procedure TMainForm.HTTPGetFile;
var
IdHTTP: TIdHTTP;
Params: TIdMultipartFormDataStream;
LHandler: TIdSSLIOHandlerSocketOpenSSL;
begin
try
Params := TIdMultipartFormDataStream.Create;
Params.AddFormField('message', '@"C:\Users\santon\Desktop\ESM_download\token.txt"');
IdHTTP := TIdHTTP.Create(nil);
LHandler:= TIdSSLIOHandlerSocketOpenSSL.Create(self);
LHandler.SSLOptions.Method := sslvTLSv1;
try
IdHTTP.IOHandler := LHandler;
IdHTTP.Post('https://esm-db.eu/esmws/eventdata/1/query?eventid=IT-1997-0004&station=CLF&format=ascii',Params);
finally
IdHTTP.Free;
LHandler.Free;
Params.Free;
end;
except
on E: Exception do
ShowMessage('Error: '+E.ToString);
end;
end;
But I keep getting an HTTP/1.1 403 Forbidden error.
Any ideas about what I am doing wrong?
Thanks in advance
You are not loading the token file into TIdMultiPartFormDataStream correctly.
Per the curl documentation:
https://curl.se/docs/manpage.html#-F
-F, --form <name=content>
(HTTP SMTP IMAP) For HTTP protocol family, this lets curl emulate a filled-in form in which a user has pressed the submit button. This causes curl to POST data using the Content-Type multipart/form-data according to RFC 2388.
...
This enables uploading of binary files etc. To force the 'content' part to be a file, prefix the file name with an @ sign. To just get the content part from a file, prefix the file name with the symbol <. The difference between @ and < is then that @ makes a file get attached in the post as a file upload, while the < makes a text field and just get the contents for that text field from a file.
...
Example: send an image to an HTTP server, where 'profile' is the name of the form-field to which the file portrait.jpg will be the input:
curl -F profile=@portrait.jpg https://example.com/upload.cgi
...
In your code, you are creating a text field whose content is the filename itself. You are not creating a file upload field whose content is the data from the file.
Try this instead:
procedure TMainForm.HTTPGetFile;
var
IdHTTP: TIdHTTP;
Params: TIdMultipartFormDataStream;
LHandler: TIdSSLIOHandlerSocketOpenSSL;
LOutFile: TFileStream;
begin
try
Params := TIdMultipartFormDataStream.Create;
try
Params.AddFile('message', 'C:\Users\santon\Desktop\ESM_download\token.txt');
IdHTTP := TIdHTTP.Create(nil);
try
LHandler := TIdSSLIOHandlerSocketOpenSSL.Create(IdHTTP);
LHandler.SSLOptions.Method := sslvTLSv1;
IdHTTP.IOHandler := LHandler;
LOutFile := TFileStream.Create('<path>\RecordFileName.zip', fmCreate);
try
IdHTTP.Post('https://esm-db.eu/esmws/eventdata/1/query?eventid=IT-1997-0004&station=CLF&format=ascii', Params, LOutFile);
finally
LOutFile.Free;
end;
finally
IdHTTP.Free;
end;
finally
Params.Free;
end;
except
on E: Exception do
ShowMessage('Error: ' + E.ToString);
end;
end;
I have seen a lot of examples online, but I cannot understand why my code doesn't work.
I have an url that looks like this:
http://www.domain.com/confirm.php?user=USERNAME&id=THEID
confirm.php is a page that does some checks on a MySQL database and then the only output of the page is a 0 or a -1 (true or false):
<?php
//long code...
if ( ... ) {
echo "0"; // success!
die();
} else {
echo "-1"; // fail!
die();
}
?>
My Delphi FireMonkey app has to open the URL above, passing the username and the id, and then read the result of the page. The result is only a -1 or a 0. This is the code.
//I have created a subclass of TThread
procedure TRegister.Execute;
var
conn: TIdHTTP;
res: string;
begin
inherited;
Queue(nil,
procedure
begin
ProgressLabel.Text := 'Connecting...';
end
);
//get the result -1 or 0
try
conn := TIdHTTP.Create(nil);
try
res := conn.Get('http://www.domain.com/confirm.php?user='+FUsername+'&id='+FPId);
finally
conn.Free;
end;
except
res := 'error!!';
end;
Queue(nil,
procedure
begin
ProgressLabel.Text := res;
end
);
end;
The value of res is always error!! and never -1 or 0. Where is my code wrong? The error caught from on E: Exception do is:
HTTP/1.1 406 not acceptable
I have found a solution using System.Net.HttpClient. I can simply use this function:
function GetURL(const AURL: string): string;
var
HttpClient: THttpClient;
HttpResponse: IHttpResponse;
begin
HttpClient := THTTPClient.Create;
try
HttpResponse := HttpClient.Get(AURL);
Result := HttpResponse.ContentAsString();
finally
HttpClient.Free;
end;
end;
This works and gives me -1 and 0 as I expected. To compare both approaches side by side, I tested this:
procedure TForm1.Button1Click(Sender: TObject);
function GetURL(const AURL: string): string;
var
HttpClient: THttpClient;
HttpResponse: IHttpResponse;
begin
HttpClient := THTTPClient.Create;
try
HttpResponse := HttpClient.Get(AURL);
Result := HttpResponse.ContentAsString();
finally
HttpClient.Free;
end;
end;
function GetURLAsString(const aURL: string): string;
var
lHTTP: TIdHTTP;
begin
lHTTP := TIdHTTP.Create;
try
Result := lHTTP.Get(aURL);
finally
lHTTP.Free;
end;
end;
begin
Memo1.Lines.Add(GetURL('http://www.domain.com/confirm.php?user=user&id=theid'));
Memo1.Lines.Add(GetURLAsString('http://www.domain.com/confirm.php?user=user&id=theid'))
end;
end.
The first function works perfectly, but Indy raises the exception HTTP/1.1 406 Not Acceptable. It seems that Indy cannot automatically handle the content type of the page.
HTTP error 406 Not Acceptable typically means that the server cannot respond with a content type the client asked for. The client and the server have to agree on the MIME type: your client's Accept header should list the type of response you want, and your server should respond with one of the listed types. In your case, the Content-Type will most likely be text/plain.
Long story short, your client is expecting a MIME type that the server does not explicitly return in its response. The problem could be on either side, or both.
Your client's Accept headers must list the MIME type(s) you expect and need. Specifically: Accept, Accept-Charset, Accept-Language, and Accept-Encoding. By default, Indy's TIdHTTP sets these headers to accept essentially anything, assuming they haven't been overwritten. The Accept header defaults to text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8, where */* opens the door for any MIME type.
Your server's response Content-Type must match one of the offered MIME types, and the response must be in the format the client asked for. It is likely that your HTTP server is not providing an appropriate Content-Type in its response. If the server responds with anything covered by the */* filter (which should mean everything, including text/plain), the client will accept it. If the server responds with an invalid content type (such as just text or plain), it could be rejected.
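To find out which side is at fault, it can help to send an explicit Accept header and to look at the status code and body when the request fails. The following is a minimal sketch, assuming the script really returns text/plain as suggested above. The function name GetPlainText and the chosen Accept value are illustrative assumptions, not something the server in the question is known to require.

```pascal
uses
  SysUtils, IdHTTP;

// Hypothetical helper: returns the page body, or a short description
// of the HTTP error if the server answers with a failure status.
function GetPlainText(const AURL: string): string;
var
  HTTP: TIdHTTP;
begin
  HTTP := TIdHTTP.Create(nil);
  try
    // Assumption: the server answers with text/plain; */* keeps
    // every other type acceptable as a fallback.
    HTTP.Request.Accept := 'text/plain, */*;q=0.8';
    try
      Result := HTTP.Get(AURL);
    except
      on E: EIdHTTPProtocolException do
        // ErrorCode holds the HTTP status (for example 406);
        // ErrorMessage holds the body the server sent with the error.
        Result := Format('HTTP %d: %s', [E.ErrorCode, E.ErrorMessage]);
    end;
  finally
    HTTP.Free;
  end;
end;
```

If the 406 persists even with an explicit Accept header, the body returned in ErrorMessage is usually the best hint as to what the server objects to.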
I am sending an HTTP Get request to Google's Map API, and I fill my StringStream with the response. However, when I try to read from the stream, I am just presented with an empty string ''.
{ Attempts to get JSON back from Google's Directions API }
function GetJSONString_OrDie(url : string) : string;
var
lHTTP: TIdHTTP;
SSL: TIdSSLIOHandlerSocketOpenSSL;
Buffer: TStringStream;
begin
{Sets up SSL}
SSL := TIdSSLIOHandlerSocketOpenSSL.Create(nil);
{Creates an HTTP request}
lHTTP := TIdHTTP.Create(nil);
{Sets the HTTP request to use SSL}
lHTTP.IOHandler := SSL;
{Set up the buffer}
Buffer := TStringStream.Create(Result);
try
{Attempts to get JSON back from Google's Directions API}
lHTTP.Get(url, Buffer);
Result:= Buffer.ReadString(Buffer.Size); //An empty string is put into Result
finally
{Frees up the HTTP object}
lHTTP.Free;
{Frees up the SSL object}
SSL.Free;
end;
Why am I getting an empty string back, when I can see that the StringStream Buffer has plenty of data (size of 32495 after the Get is called)?
I've tested my call, and it returns valid JSON.
First, you are using TStringStream to receive the response data. If you are using Delphi 2009+, DO NOT do that! TStringStream is tied to a specific encoding that has to be declared in the constructor before the stream is populated with data, and it cannot be changed dynamically. The default encoding is TEncoding.Default, which represents the OS default encoding. If the HTTP response uses a different encoding, the data will not decode to a String correctly.
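If a TStringStream is used anyway, the encoding has to be chosen when the stream is constructed. The following is a minimal sketch, assuming the payload is UTF-8 (an assumption for illustration; the real charset is whatever the response's Content-Type header declares). The helper name DecodeWith is made up for this example.

```pascal
uses
  SysUtils, Classes;

// Hypothetical helper: pushes raw bytes into a TStringStream that is
// bound to AEncoding and returns the decoded string.
function DecodeWith(const ABytes: TBytes; AEncoding: TEncoding): string;
var
  Buffer: TStringStream;
begin
  // The encoding is fixed here and cannot be changed afterwards.
  Buffer := TStringStream.Create('', AEncoding);
  try
    if Length(ABytes) > 0 then
      Buffer.WriteBuffer(ABytes[0], Length(ABytes));
    // DataString decodes the whole stream using the encoding above.
    Result := Buffer.DataString;
  finally
    Buffer.Free;
  end;
end;
```

For example, DecodeWith(TBytes.Create($C3, $A9), TEncoding.UTF8) yields the single character é, whereas the same two bytes decoded through a stream created with a single-byte encoding such as Windows-1252 come out as two characters.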
Second, you are not seeking the stream's Position back to 0 before calling ReadString(). An easier way to retrieve a TStringStream's content as a decoded String is to use the DataString property instead, which ignores the Position property and returns the entire stream content as a whole:
Result := Buffer.DataString;
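The effect of Position can be reproduced without any network call. The following is a minimal sketch that uses only the RTL; the three function names are made up for illustration.

```pascal
uses
  SysUtils, Classes;

// Writes AText, then reads without rewinding: Position is already at
// the end of the stream, so there is nothing left to read.
function ReadWithoutRewind(const AText: string): string;
var
  Buffer: TStringStream;
begin
  Buffer := TStringStream.Create('');
  try
    Buffer.WriteString(AText);
    Result := Buffer.ReadString(Integer(Buffer.Size));
  finally
    Buffer.Free;
  end;
end;

// Same as above, but seeks back to the start before reading.
function ReadAfterRewind(const AText: string): string;
var
  Buffer: TStringStream;
begin
  Buffer := TStringStream.Create('');
  try
    Buffer.WriteString(AText);
    Buffer.Position := 0;
    Result := Buffer.ReadString(Integer(Buffer.Size));
  finally
    Buffer.Free;
  end;
end;

// DataString ignores Position and returns the whole content.
function ReadViaDataString(const AText: string): string;
var
  Buffer: TStringStream;
begin
  Buffer := TStringStream.Create('');
  try
    Buffer.WriteString(AText);
    Result := Buffer.DataString;
  finally
    Buffer.Free;
  end;
end;
```

ReadWithoutRewind('abc') returns an empty string, while ReadAfterRewind('abc') and ReadViaDataString('abc') both return 'abc'.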
Third, you are doing too much manual work. TIdHTTP.Get() has an overloaded version that returns a decoded String. The benefit of using this method is that it uses the actual charset of the response, rather than the charset of a TStringStream:
function GetJSONString_OrDie(const URL: string): string;
var
lHTTP: TIdHTTP;
SSL: TIdSSLIOHandlerSocketOpenSSL;
begin
{Creates an HTTP request}
lHTTP := TIdHTTP.Create(nil);
try
{Sets the HTTP request to use SSL}
lHTTP.IOHandler := TIdSSLIOHandlerSocketOpenSSL.Create(lHTTP);
{Attempts to get JSON back from Google's Directions API}
Result := lHTTP.Get(URL);
finally
{Frees up the HTTP object}
lHTTP.Free;
end;
end;
Which can be simplified further if you are using an up-to-date version of Indy (see this blog post for details):
function GetJSONString_OrDie(const URL: string): string;
var
lHTTP: TIdHTTP;
begin
{Creates an HTTP request}
lHTTP := TIdHTTP.Create(nil);
try
{Attempts to get JSON back from Google's Directions API}
Result := lHTTP.Get(URL);
finally
{Frees up the HTTP object}
lHTTP.Free;
end;
end;
Maybe first set Buffer.Position := 0?
Currently I am able to run a command, but I can't figure out how to get the result into a string.
I do a Get like so:
idhttp1.get('http://codeelf.com/games/the-grid-2/grid/',TStream(nil));
and everything seems to run OK; in Wireshark I can see the results from that command. Now if I do
HTML := idhttp1.get('http://codeelf.com/games/the-grid-2/grid/');
it freezes the app. In Wireshark I can see that it sent the GET and got a response, but I don't know why it freezes. HTML is just a string variable.
EDIT FULL CODE
BUTTON CLICK
login(EUserName.Text,EPassWord.Text);
procedure TForm5.Login(name: string; Pass: string);
var
Params: TStringList;
html : string;
begin
Params := TStringList.Create;
try
Params.Add('user='+name);
Params.Add('pass='+pass);
Params.Add('sublogin=Login');
//post password/username
IdHTTP1.Post('http://codeelf.com/games/the-grid-2/grid/', Params);
//get the grid source
HTML := idhttp1.Get('http://codeelf.com/games/the-grid-2/grid/');
finally
Params.Free;
end;
llogin.Caption := 'Logged In';
end;
RESPONSE
The response I get says Transfer-Encoding: chunked\r\n and Content-Type: text/html\r\n; I don't know if that matters.
Thanks
Indy has support for some types of streamed HTTP responses (see New TIdHTTP hoNoReadMultipartMIME flag), but this will only help if the server uses multipart/* responses. The linked blog article explains the details further and also shows how the Indy HTTP component can feed a MIME decoder with a continuous response stream.
If this is not applicable to your case, a workaround is to go down to the "raw" TCP layer: send the HTTP request using a TIdTCPClient component, and then read the response line by line (or byte by byte) from the IOHandler. This gives total control over response handling. The request and response should be processed in a worker thread to keep them off the main thread.
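The raw TCP approach could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name RawHttpGet is hypothetical, the server is assumed to speak plain HTTP on port 80, and HTTP/1.0 with "Connection: close" is used so that the sketch does not have to decode chunked transfer encoding.

```pascal
uses
  Classes, IdTCPClient;

// Hypothetical helper: performs a bare HTTP/1.0 GET over a raw TCP
// connection. The status line and headers are appended to AHeaders;
// the response body is returned as the function result.
function RawHttpGet(const AHost, APath: string; AHeaders: TStrings): string;
var
  Client: TIdTCPClient;
  Line: string;
begin
  Client := TIdTCPClient.Create(nil);
  try
    Client.Host := AHost;
    Client.Port := 80; // plain HTTP; HTTPS would also need an SSL IOHandler
    Client.Connect;
    Client.IOHandler.WriteLn('GET ' + APath + ' HTTP/1.0');
    Client.IOHandler.WriteLn('Host: ' + AHost);
    Client.IOHandler.WriteLn('Connection: close');
    Client.IOHandler.WriteLn(''); // empty line terminates the request headers
    // Read the status line and the headers one line at a time.
    repeat
      Line := Client.IOHandler.ReadLn;
      if Line <> '' then
        AHeaders.Add(Line);
    until Line = '';
    // Read everything that follows until the server disconnects.
    // The bytes are decoded with the IOHandler's default text encoding.
    Result := Client.IOHandler.AllData;
  finally
    Client.Free;
  end;
end;
```

Because every read blocks until data arrives, a call such as RawHttpGet('codeelf.com', '/games/the-grid-2/grid/', Headers) belongs in a worker thread rather than in a button click handler.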
TIdHTTP.Post() returns the response data; you should not be calling TIdHTTP.Get() to retrieve it separately:
procedure TForm5.Login(name: string; Pass: string);
var
Params: TStringList;
html : string;
begin
Params := TStringList.Create;
try
Params.Add('user='+name);
Params.Add('pass='+pass);
Params.Add('sublogin=Login');
//post password/username
HTML := IdHTTP1.Post('http://codeelf.com/games/the-grid-2/grid/', Params);
finally
Params.Free;
end;
llogin.Caption := 'Logged In';
end;
In Delphi XE2, I am trying to upload the lines of a memo to a file on my webspace with IdHTTP.Put:
procedure TForm1.btnUploadClick(Sender: TObject);
var
StringToUpload: TStringStream;
begin
StringToUpload := TStringStream.Create('');
try
StringToUpload.WriteString(memo.Lines.Text);
// Error: HTTP/1.1 405 Method Not Allowed.
IdHTTP1.Put(edtOnlineFile.Text, StringToUpload);
finally
StringToUpload.Free;
end;
end;
But I always get this error message: HTTP/1.1 405 Method Not Allowed.
So what must I do to avoid the error and make the upload work?
It means the HTTP server does not support the PUT method on that URL (if at all). There is nothing you can do about that. You will likely have to upload your data another way, usually involving POST instead, or a completely different protocol, like FTP.
BTW, when using TStringStream like this, don't forget to reset the Position if you use the WriteString() method:
StringToUpload.WriteString(memo.Lines.Text);
StringToUpload.Position := 0;
Otherwise, use the constructor instead:
StringToUpload := TStringStream.Create(memo.Lines.Text);
Thanks for the above code. Here is a little more information, with a small helper function to assist with that stream constructor, which worked for every string I tried.
//Helper function to wrap a JSON string in a UTF-8 stream for POST / PUT
function StringToStream(const AString: string): TStream;
begin
Result := TStringStream.Create(AString, TEncoding.UTF8);
end;
//somewhere in your code; I am posting to Spring REST, encoding must be utf-8
//response, URL, body are declared as String; Stream is declared as TStream
IdHTTP1.Request.ContentType := 'application/json'; //very important
IdHTTP1.Request.CharSet := 'utf-8'; //charset of the body; ContentEncoding is meant for compression such as gzip
Stream := StringToStream(body);
try
response := IdHTTP1.Put(URL, Stream);
finally
Stream.Free; //the caller owns the stream returned by the helper
end;