How to use Tika in server mode - apache-tika

On Tika's website it says (concerning tika-app-1.2.jar) it can be used in server mode. Does anyone know how to send documents and receive parsed text from this server once it is running?

Tika supports two "server" modes. The simpler, original one is the --server flag of Tika-App. The more functional, but more recent, one is the JAX-RS (JSR-311) server component, which is an additional jar.
The Tika-App Network Server is very simple to use. Start Tika-App with the --server flag and a --port ### flag telling it which port to listen on. Then connect to that port and send it a single file; you'll get back the HTML version. NetCat works well for this: something like java -jar tika-app.jar --server --port 12345 followed by nc 127.0.0.1 12345 < MyFileToExtract will get you back the HTML.
The JAX-RS (JSR-311) server component supports a few different URLs, for things like metadata, plain text, etc. You start the server with java -jar tika-server.jar, then make HTTP PUT calls to the appropriate URL with your input document, and you'll get the resource back. There are loads of details and examples (including using curl for testing) on the wiki page.
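For callers who would rather not shell out to NetCat, the same exchange can be sketched in Python. This is only an illustration, not part of Tika: extract_html is a hypothetical helper name, port 12345 is the example port from above, and the sketch assumes the server replies once the client has finished sending and then closes the connection.

```python
import socket


def extract_html(path, host="127.0.0.1", port=12345):
    """Send one file to a Tika-App --server port and return the reply as text.

    Assumption: the server answers after it has seen the end of the input
    and closes the connection when it is done.
    """
    with open(path, "rb") as f:
        payload = f.read()

    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)
        # Half-close the socket so the server sees end-of-input
        # while we can still read its reply.
        sock.shutdown(socket.SHUT_WR)

        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:  # server closed the connection
                break
            chunks.append(data)

    return b"".join(chunks).decode("utf-8", errors="replace")
```

Calling extract_html("MyFileToExtract") against a running Tika-App server would play the role of the nc command above.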
The Tika App Network Server is fairly simple, only supports one mode (extract to HTML), and is generally used for testing / demos / prototyping / etc. The Tika JAXRS Server is a fully RESTful service which talks HTTP, and exposes a wide range of Tika's modes. It's the generally recommended way these days to interface with Tika over the network, and/or from non-Java stacks.

Just adding to @Gagravarr's great answer.
When talking about Tika in server mode, it is important to differentiate between two versions which can otherwise cause confusion:
tika-app.jar has the --server --port 9998 options to start a simple server
tika-server.jar is a separate component using JAX-RS
The first option only provides text extraction and returns the content as HTML. Most likely, what you really want is the second option, which is a RESTful service exposing many more of Tika's features.
You can simply download the tika-server.jar from the Tika project site. Start the server using
java -jar tika-server-x.x.jar -h 0.0.0.0
The -h 0.0.0.0 (host) option makes the server listen for incoming requests on all interfaces; without it, the server only accepts requests from localhost. You can also add the -p option to change the port, which otherwise defaults to 9998.
Then, once the server has started you can simply access it using your browser. It will list all available endpoints.
Finally, to extract metadata from a file you can use cURL like this:
curl -T testWORD.doc http://example.com:9998/meta
This returns the metadata as key/value pairs, one per line. You can also have Tika return the results as JSON by adding the proper Accept header:
curl -H "Accept: application/json" -T testWORD.doc http://example.com:9998/meta
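The same request can be made from code. Below is a minimal Python sketch of the curl call above using only the standard library. tika_meta is a hypothetical helper name, the base URL assumes a server on the default port 9998, and sending Content-Type: application/octet-stream (so that urllib does not add its form-encoded default) is an assumption rather than something the curl example does.

```python
import json
import urllib.request


def tika_meta(path, base_url="http://localhost:9998"):
    """PUT a file to the /meta endpoint and return the parsed JSON reply."""
    with open(path, "rb") as f:
        body = f.read()

    request = urllib.request.Request(
        base_url.rstrip("/") + "/meta",
        data=body,
        method="PUT",
        headers={
            # Ask for JSON instead of one key/value pair per line.
            "Accept": "application/json",
            # Assumption: a generic type; urllib would otherwise send a
            # form-encoded Content-Type by default.
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, tika_meta("testWORD.doc", "http://example.com:9998") would mirror the second curl command.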
[Update 2015-01-19] Previously the comment said that tika-server.jar is not available as download. Fixed that since it actually does exist as a binary download.

To add to Gagravarr's excellent answer:
If your document is fetched from a web server => curl "http://myserver-domain/*path-to-doc*/doc-name.extension" | nc 127.0.0.1 12345
And if the document is protected by a password => curl -u login:*password* "http://myserver-domain/*path-to-doc*/doc-name.extension" | nc 127.0.0.1 12345

Related

ClearML SSH port forwarding fileserver not available in WEB Ui

I'm trying to use clearml-server on my own Ubuntu 18.04.5 machine with SSH port forwarding, and I'm not able to see my debug samples.
My setup:
ClearML server on hostA
SSH Tunnel connections to access Web App from working machine via localhost:18080
Web App: ssh -N -L 18080:127.0.0.1:8080 user@hostA
Fileserver: ssh -N -L 18081:127.0.0.1:8081 user@hostA
In the Web App, under Task->Results->Debug Samples, the images are still referenced by localhost:8081.
Where can I set the fileserver URL to be localhost:18081 in Web App?
I tried ~/clearml.conf, but this did not work (I think it only applies to my Python script).
Disclaimer: I'm a member of the ClearML team (formerly Trains)
In ClearML, debug images' URL is registered once they are uploaded to the fileserver. The WebApp doesn't actually decide on the URL for each debug image, but rather obtains it for each debug image from the server. This allows you to potentially upload debug images to a variety of storage targets, ClearML File Server simply being the most convenient, built-in option.
So, the WebApp will always look for localhost:8081 for debug images that have already been uploaded to the fileserver and contain localhost:8081 in their URL.
A possible solution is to simply add another tunnel in the form of ssh -N -L 8081:127.0.0.1:8081 user@hostA.
For future experiments, you can choose to keep using 8081 (and keep using this new tunnel), or to change the default fileserver URL in clearml.conf to point to port localhost:18081, assuming you're running your experiments from the same machine where the tunnel to 18081 exists.

What does it mean to run a local web server?

I can program and develop in Ruby on Rails/JS/HTML/CSS to make a full stack app. However, there are holes in my understanding of the HTTP request/response cycle. Are the following points correct?
If I make a Rails app, and on the command line type rails server I get a local server, which I can make requests to. If I open a browser, type localhost:3000, and press enter, I am making an HTTP request to the local server.
Rails uses by default a web server called WEBrick, though there are others like Thin, Puma, and Unicorn. These are all pieces of software, and what makes them web servers is the fact that the software implements functionality to process HTTP requests.
When I run a local web server, it means that my computer is running one of these pieces of software that listen for HTTP requests.
Is the above what it means "to run a local web server"?
I have seen other examples of ways to "run a local web server". One of them is to run npm install -g http-server in a project directory, and then navigate to localhost:8080. Is this also just software that starts running and accepts HTTP requests on port 8080?
On a Ruby command line, install rack gem: gem install rack. Then in a new Ruby file we require 'rack', start a web server:
Rack::Server.start({ app: MySimpleApp, port: 3000 })
We can then define a web application MySimpleApp that is rack-compliant (object that responds to call method):
class MySimpleApp
  def self.call
    (...)
  end
end
So now when we navigate in our browser to localhost:3000, MySimpleApp is executed. Is rack simply running its default WEBrick server? Do the above commands simply run a local web server and define what to do when an HTTP request comes in (execute MySimpleApp)?
You're pretty much right on your understanding there. HTTP is just a text-based protocol that, like many, operates over TCP/IP.
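The point that HTTP is just text over TCP can be made concrete with a tiny sketch. It is written in Python rather than Ruby purely for illustration, and it is not how WEBrick or Rack are implemented; make_listener, serve_once and RESPONSE_BODY are hypothetical names. It opens a port, reads the request text from one connection, and writes a response by hand.

```python
import socket

RESPONSE_BODY = b"Hello from a local web server\n"


def make_listener(port=3000, host="127.0.0.1"):
    """Open a TCP socket that listens on the given local port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def serve_once(listener):
    """Handle a single HTTP request and return its request line."""
    conn, _addr = listener.accept()
    with conn:
        request = b""
        # The header section of an HTTP request ends with a blank line.
        while b"\r\n\r\n" not in request:
            chunk = conn.recv(4096)
            if not chunk:
                break
            request += chunk

        # The response is plain text too: status line, headers, blank line, body.
        head = (
            "HTTP/1.1 200 OK\r\n"
            "Content-Type: text/plain\r\n"
            f"Content-Length: {len(RESPONSE_BODY)}\r\n"
            "Connection: close\r\n"
            "\r\n"
        ).encode("ascii")
        conn.sendall(head + RESPONSE_BODY)

    lines = request.decode("latin-1").splitlines()
    return lines[0] if lines else ""
```

Running serve_once(make_listener()) and then visiting localhost:3000 in a browser would show the greeting; the returned request line is the first line of text the browser sent.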
The built-in WEBrick server isn't the best example of an HTTP server written in Ruby, but it's included for legacy reasons because it's often "good enough" to get you started. Pow is considerably better and despite being produced by the same company that produced Rails it's largely written in Node.
The beauty of HTTP, like a lot of internet based protocols, is it doesn't matter what language you use so long as you comply with the standard.
Rack is a layer that operates behind HTTP and provides a thin layer of abstraction on the request/response cycle.
A server is something that opens up a port (80, 443, 8080) for some sort of data transfer. Port 80 is the HTTP port and port 443 is the HTTPS port. 8080 is a commonly used port for development (as is 3000). https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
A local server by definition is a server running on your machine.
Overall, you are definitely on the right track.

Rails app call APIs using proxy

I have subscribed to an API service which provides access based on static IP (For both Live and Testing).
Since the ISP in my development area doesn't provide a static IP, I have enabled API access for my staging machine's IP, which is static. I installed squid and set up a proxy server on my staging server so that I can use it as a proxy and make calls to the API while I do development.
I am using a Mac for development, and the Network > Proxy settings don't apply system-wide (for example, in Terminal). Because of this, I was using trial versions of MacProxy and Proxifier (proxy clients), and all was working fine until the trials expired. Are there any free alternatives to these for Mac?
I tried to set up a proxy by creating an SSH SOCKS proxy and setting http_proxy="xxx" in the terminal. When I check the terminal's IP after setting it, using curl ipecho.net/plain ; echo, it shows the proper IP, but when I run the local Rails development server and try to access the API, the call is rejected with an invalid IP (it shows the non-proxied IP).
A free alternative that might solve your problem is a project on GitHub:
sshuttle (read me)
It forwards TCP and DNS requests to a remote SSH server.
The most basic use of sshuttle looks like this:
./sshuttle -r username@sshserver 0.0.0.0/0 -vv
To tunnel all traffic you might do:
./sshuttle --dns -vr ssh_server 0/0
There are also helper functions available here, which can simplify some of the commands.
The system-level proxy settings aren't used by Ruby applications. Typically this is a code-level option passed to the library you are using to make connections.
If you want Savon to use a proxy then you need to pass this to Savon when you create the client:
client = Savon.client(proxy: "http://example.org", ...)
If this call is being made inside a gem, then unless that gem already provides that option, you would need to fork it to add the option.
The gem you mention seems to already implement this: its configuration class has a proxy attribute that seems to be passed through to Savon.

How to view neo4j database on the hosted linode server

I am running standalone neo4j database server at localhost:7474 on a linode instance.
Is there any way to view this in the browser?
If you have SSH access to the Linode instance then you can run ssh -L 7474:localhost:7474 youruser@123.123.123.123 which will tunnel the remote port 7474 to localhost 7474. In your browser you can now use http://localhost:7474 to see the remote server without opening anything to the world.
You want what's called a "reverse proxy". Outside of your box, you can't talk about localhost:7474 as a hostname. So you want an external facing web server that "proxies" requests and sends them to localhost:7474.
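One such option is Apache mod_proxy used as a reverse proxy. Examples on how to use it are behind the link. In general it's going to boil down to configuration directives that look something like this (ProxyPass forwards the requests; ProxyPassReverse only rewrites headers in the responses):
ProxyPass /neo4j http://localhost:7474
ProxyPassReverse /neo4j http://localhost:7474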
You also really want to read the documentation on securing the neo4j server.
WARNING - neo4j's web interface will let you do just about anything without authentication, including deleting all of your data, changing it, putting new data in, and so on. It is a very bad idea to expose that functionality to the entire internet. So if you use a reverse proxy as suggested above, make sure you add some authentication layer (again, you can do this with Apache and mod_proxy) to prevent just any random person from connecting to your instance and optionally deciding to trash it.

monit and apache site behind http-basic-auth

I have a web server which is protected behind http-basic-auth. I've read through the monit docs and it doesn't seem like there's a clear way to pass credentials in order to test that the test page on the server is being returned correctly.
Any thoughts?
Thanks!
(Please don't confuse this with monit's own httpd for showing monit status in a web page)
PS this is monit 4.8.1 -- that which comes with Ubuntu Hardy 8.04
It seems to be possible to include the credentials in the URL. Have you tried this?
(from http://mmonit.com/monit/documentation/monit.html#connection_testing )
[...] Where URL-spec is an URL on the standard form as specified in RFC 2396:
<protocol>://<authority><path>?<query>
Here is an example of an URL where all components are used:
http://user:password@www.foo.bar:8080/document/?querystring#ref
If a username and password is included in the URL Monit will attempt to login at the server using Basic Authentication.
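To see what "using Basic Authentication" amounts to on the wire, here is a small Python sketch. It is not Monit's code, and basic_auth_header is a hypothetical helper name. It pulls the user:password part out of a URL in the form quoted above and builds the Authorization header value that a Basic Authentication client would send.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(url):
    """Return (url_without_credentials, Authorization header value or None)."""
    parts = urlsplit(url)
    if parts.username is None:
        return url, None

    # Basic Authentication sends base64("user:password") in a header.
    credentials = f"{parts.username}:{parts.password or ''}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")

    # Rebuild the authority without the user:password@ prefix.
    host = parts.hostname
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    clean_url = parts._replace(netloc=host).geturl()

    return clean_url, f"Basic {token}"
```

Passing the example URL from the documentation, with user:password@ before the host, returns the URL without the credentials plus a header value starting with "Basic ". Note that base64 is an encoding, not encryption, so over plain HTTP the password is effectively sent in the clear.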
Try this if you just want to check that your web server is listening on port 80 (and you don't care what page or data it returns):
if failed port 80 type TCP then restart

Resources