Tika 1.7 integration to Solr 5.1.0 - parsing

I want to parse (many) RSS/Atom/RDF feeds using Tika 1.7 (works pretty well but not perfectly) and upload the data into Solr 5.1.0 automatically.
I can see the data in my terminal - looks pretty nice, each item parsed: title, link, description etc - but I don't know how to load data automatically into Solr.
Any help is welcome,
Kind regards,
Christian

There's a contrib module called "Solr Cell" (the ExtractingRequestHandler) that handles content extraction. It should do what you want, and it uses Tika behind the scenes for text and metadata extraction.
More information here: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
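For what it's worth, here is a minimal sketch of what an upload through Solr Cell could look like from a script, written in Ruby. It assumes Solr is running on localhost:8983 and that the core has the /update/extract handler enabled in its solrconfig.xml. The core name "feeds", the id "feed-1" and both helper names are made up for illustration; only the handler path and the literal.id and commit parameters come from Solr Cell itself.

```ruby
require "net/http"
require "uri"

# Build the URL of Solr's extracting handler (Solr Cell) for one document.
# base_url, core and doc_id are placeholders for your own setup.
def extract_url(base_url, core, doc_id)
  uri = URI("#{base_url.chomp('/')}/#{core}/update/extract")
  # literal.id sets the document's id field; commit=true commits right away.
  uri.query = URI.encode_www_form("literal.id" => doc_id, "commit" => "true")
  uri
end

# POST the raw bytes of a file; Solr runs Tika on them server-side.
def post_to_solr_cell(uri, path, content_type = "application/octet-stream")
  request = Net::HTTP::Post.new(uri)
  request["Content-Type"] = content_type
  request.body = File.binread(path)
  Net::HTTP.start(uri.hostname, uri.port) { |http| http.request(request) }
end

puts extract_url("http://localhost:8983/solr", "feeds", "feed-1")
```

Calling post_to_solr_cell(extract_url(...), "some-feed.xml") in a loop over the downloaded feeds would be the automatic part. Note that this indexes each file as a single document; if you need one Solr document per feed item, you would have to split the items yourself before posting.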

Related

Grails 2.4.4 How do I export excel file?

I've looked at some plugins but no success.
I tried Export Plugin 1.6 as well but the view doesn't recognize r:.. and export:.. tags.
What is the best way to export rows of data from a PostgreSQL database into an Excel file at the click of a button?
Thank you.
You could create a GSP that renders a .csv file and set the content type of the response to application/vnd.ms-excel in the controller.
That's the easiest way, but you will not be able to control the format of the cells.
Apache POI - as mentioned by Abincepto - is another solution; it is more complex but gives you full control over the generated Excel file.
Did you try using Apache POI directly?
From the website:
The Apache POI Project's mission is to create and maintain Java APIs for manipulating various file formats based upon the Office Open XML standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2). In short, you can read and write MS Excel files using Java. In addition, you can read and write MS Word and MS PowerPoint files using Java. Apache POI is your Java Excel solution (for Excel 97-2008). We have a complete API for porting other OOXML and OLE2 formats and welcome others to participate.
EDIT:
Here is a tutorial: Read / Write Excel file in Java using Apache POI
and a quick guide
EDIT2:
I just found another link using Grails that could help you. The example uses another library: jexcelapi
The export plugin is dependent on the resources plugin. You can add the resources plugin and try again. I use resources 1.2.8. Also you need to add this to your dependencies:
dependencies {
    ............
    // Needed for the export plugin?
    compile 'commons-beanutils:commons-beanutils:1.8.3'
}
plugins {
    ............
    runtime ":resources:1.2.8"
}

Get text from doc/docx file in pages using Apache tika

I am using the Apache Tika command line tool to extract text from doc and docx files. I can get the whole text, but I am unable to get it page by page so that I can store each page separately. Is there any way to achieve that?
Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
Since POI (fundamentally) cannot read out those page numbers and Tika is not meant to be a document renderer either, the answer is very simply: No, this is not possible.
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.

Grails Export Plugin - Embed Image

I've got the Grails export plugin working nicely, but is there any way of embedding an image in the produced PDF report? I can't find an obvious way of doing it :(
many thanks
Tom
Why would you need to embed an image when the export plugin only deals with domain objects? If you are looking to generate a report that has images and other HTML content, I think the iText library is the way forward.
Thanks, KJ

Advanced PDF Generation with Ruby / Rails

We have a document management system written in PHP that uses mPDF to generate rather complex PDFs. We grew to love it, and mPDF allowed us to:
Use HTML/CSS to style the pages
Produce 200+ Page Documents
Support alternating Portrait/Landscape pages throughout the document
Automatically generate Multi-Level PDF Bookmarks
Import 3rd-party PDFs into the document
We want the new version of the system to be written in Ruby on Rails, and for that we would need a Ruby PDF generation alternative. We checked out Prawn, PDFKit, Wicked PDF, and Prince XML, but reading their docs (which are often one page long), I'm not sure if they are as full-featured as mPDF. They seem to go for "Ease of Use" rather than functionality.
Is there a PDF generator for Ruby that is as advanced as mPDF, or should we keep PDF generation PHP-based as it is?
mPDF seems to be a composite tool that uses a portable PDF lib and an html2pdf converter.
It's hard to compare those to the libs/tools you mentioned. PrinceXML should be similar to html2pdf, but you could also use wkhtmltopdf (via PDFKit or Wicked PDF), which uses WebKit and is free of charge.
Combining those with Prawn, which would translate to FPDF in PHP, should do everything you need.
You might want to look at Docmosis which has a Ruby example in the sample code for talking to their Document engine. The templating capabilities are pretty good and I've seen it producing large documents. I don't think it can stitch/import PDFs so you would have to use it with another library that can do the combining.
Please note I work with the company that produces Docmosis.

Creating Microsoft Word (.docx) documents in Ruby

Is there an easy way to create Word documents (.docx) in a Ruby application? Actually, in my case it's a Rails application served from a Linux server.
A gem similar to Prawn but for DOCX instead of PDF would be great!
As has been noted, there don't appear to be any libraries to manipulate Open XML documents in Ruby, but OpenXML Developer has complete documentation on the format of Open XML documents.
If what you want is to send a copy of a standard document (like a form letter) customized for each user, it should be fairly simple given that a DOCX is a ZIP file that contains various parts in a directory hierarchy. Have a DOCX "template" that contains all the parts and tree structure that you want to send to all users (with no real content), then simply create new (or modify existing) pieces that contain the user-specific content you want and inject it into the ZIP (DOCX file) before sending it to the user.
For example: You could have document-template.xml that contains Dear [USER-PLACEHOLDER]:. When a user requests the document, you replace [USER-PLACEHOLDER] with the user's name, then add the resulting document.xml to the your-template.docx ZIP file (which would contain all the images and other parts you want in the Word document) and send that resulting document to the user.
Note that if you rename a .docx file to .zip it is trivial to explore the structure and format of the parts inside. You can remove or replace images or other parts very easily with any ZIP manipulation tools or programmatically with code.
Generating a brand new Word document with completely custom content from raw XML would be very difficult without access to an API to make the job easier. If you really need to do that, you might consider installing Mono, then use VB.NET, C# or IronRuby to create your Open XML documents using the Open XML Format SDK 1.0. Since you would just be using the Microsoft.Office.DocumentFormat.OpenXml.Packaging Namespace to manipulate Open XML documents, it should work okay in Mono, which seems to support everything the SDK requires.
Maybe this gem is interesting for you.
https://github.com/trade-informatics/caracal/
It's like Prawn, but for docx.
You can use Apache POI. It is written in Java, but integrates with Ruby as an extension
This is an old question but there's a new answer. If you'd like to turn an HTML doc into a Word (docx) doc, just use the 'htmltoword' gem:
https://github.com/karnov/htmltoword
I'm not sure why there was answer creep and everyone started posting templating solutions, but this answers the OP's question. Just like Prawn, except Word instead of PDF.
UPDATE:
There's also pandoc and an API wrapper for pandoc called docverter. Both have slightly complicated installs since pandoc is a Haskell library.
I know that if you serve an HTML document as a Word document with the .doc extension, it will open in Word just fine. Just don't do anything fancy.
Edit: Here is an example using classic ASP. http://www.aspdev.org/asp/asp-export-word/
Using a technique very similar to that suggested by Grant Wagner, I have created a Ruby HTML-to-Word gem that should allow you to easily output Word docx files from your Ruby app. You can check it out at http://github.com/nickfrandsen/htmltoword - simply pass it an HTML string and it will create a corresponding Word docx file.
def show
  respond_to do |format|
    format.docx do
      file = Htmltoword::Document.create params[:docx_html_source], "file_name.docx"
      send_file file.path, :disposition => "attachment"
    end
  end
end
Hope you find it useful. If you have any problems with it feel free to open a github issue.
Disclosure: I'm the leader of the docxtemplater project.
I know you're looking for a ruby solution, but because all other solutions only tell you how to do it globally, without giving you a library that does exactly what you want, here's a solution based on JS or NodeJS (works in both)
DocxTemplater Library
Demo of the library
You can also use it in the commandline:
npm install docxtemplater -g
docxtemplater <configFile>
----config.docxFile: The input file in docx format
----config.outputFile: The outputfile of the document
Doccy (doccyapp.com) has an API that does just that, which you can use. It supports docx, odt and pages, and converts to PDF as well if you like.
Further to Grant's answer, you can also send Word a "Flat OPC" file, which is essentially the docx unzipped and concatenated to create a single XML file. This way, you can replace [USER-PLACEHOLDER] in one file and be done with it (i.e. no zipping or unzipping).
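A rough sketch of that placeholder replacement in plain Ruby, under a few assumptions: the template has already been saved from Word as a Flat OPC XML file, the placeholder sits inside a single text run (Word sometimes splits typed text across several runs, in which case a literal search will not find it), and the helper names xml_escape and fill_flat_opc are made up for illustration.

```ruby
PLACEHOLDER = "[USER-PLACEHOLDER]"
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Escape the characters that would otherwise break the XML document.
def xml_escape(text)
  text.gsub(/[&<>]/, XML_ESCAPES)
end

# Return a copy of the Flat OPC XML with every placeholder replaced.
# The block form of gsub keeps backslashes in the name from being
# interpreted as back-references.
def fill_flat_opc(template_xml, user_name)
  template_xml.gsub(PLACEHOLDER) { xml_escape(user_name) }
end

template = "<w:t>Dear [USER-PLACEHOLDER]:</w:t>"
puts fill_flat_opc(template, "Tom & Co")
```

In a real app you would read the template with File.read, call fill_flat_opc, and send the resulting string to the user as the document.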
If anyone is still looking at this, this post explains how to use an XML data source. This works nicely for me.
http://seroter.wordpress.com/2009/12/23/populating-word-2007-templates-through-open-xml/
Check out this github repo: https://github.com/jawspeak/ruby-docx-templater
It allows you to create a document from a word template.
If you're running on Windows, of course, it's a matter of WIN32OLE and some pain with the Word COM objects.
Chances are that you're serving from a *nix environment, though. Word 2007 uses the "Microsoft Office Open XML" format (*.docx), which older versions of Word can open using the appropriate compatibility pack from Microsoft.
Some of the more recent Office apps (2002/XP and 2003 at least) had their own XML formats which may also be useable.
I'm not aware of any Ruby tools to make the process easier, sadly.
If it can be made acceptable, I think I'd be inclined to go down the renamed-html file route. I just saved a document as HTML from WordXP, renamed it to a .doc and opened it without problem.
I encountered the same problem. Unfortunately I could not manipulate the XML, because my clients need to fill in the templates themselves, and that is not always possible (for example, Office for Mac does not allow this).
As a solution to this problem, I made a simple gem that lets you use an RTF document as a template with embedded Ruby: https://github.com/eicca/rtf-templater
I tested it and it works OK for filling in reports and documents. However, the formatting displays badly with complex loops and conditions.
