Unable to read text files using tika - apache-tika

I'm trying to parse text files using Apache Tika. I pass a file to the following code:
public String parseFile(File file) throws Exception {
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    BodyContentHandler handler = new BodyContentHandler(1000000000);
    FileInputStream is = new FileInputStream(file);
    parser.parse(is, handler, metadata, parseContext);
    is.close();
    System.out.println("Content: " + handler.toString());
    return handler.toString();
}
My issue is that Tika recognises SOME files but not ALL files. I'm sorry but I'm unable to actually attach the files that do not work (corporate reasons). If I do find an acceptable example, I will share it. I wanted to know if there was something blatantly obvious which I am doing wrong. I'm not sure how the BodyContentHandler class works. In most of the tutorials I read online, the code is as follows:
ContentHandler handler = new BodyContentHandler();
However, my Eclipse refuses to accept that and asks me to cast BodyContentHandler to ContentHandler, which causes other problems.
I'm trying to support text files, PDF files, Word documents, and Excel files. The text files that DO NOT work have a certain property: I copy-paste an email thread from Outlook into Notepad. Most files that do not work are of this kind.
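For comparison, here is a minimal sketch of the usual AutoDetectParser setup (assuming Tika 1.x; the method name and the -1 write limit are illustrative choices, not from the original code). The cast error above usually means the wrong ContentHandler is imported: BodyContentHandler implements org.xml.sax.ContentHandler, so with that import the assignment compiles. Passing the file name into the Metadata object gives the type detector an extra hint, and -1 disables the default 100,000-character write limit.
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;   // the ContentHandler that BodyContentHandler implements

public String parseFileSketch(File file) throws Exception {
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // Give the type detector a hint about the original file name.
    metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
    // -1 removes the default 100,000 character write limit.
    ContentHandler handler = new BodyContentHandler(-1);
    try (InputStream is = new FileInputStream(file)) {
        parser.parse(is, handler, metadata, new ParseContext());
    }
    return handler.toString();
}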

Related

How to generate multiple reports in Grails using Export plugin?

I am using the Export plugin in Grails for generating PDF/Excel reports. I am able to generate a single report on a PDF/Excel button click. But now I want to generate multiple reports on a single button click. I tried a for loop and method calls but had no luck.
Reference links are OK. I don't expect the entire code, just a reference.
If you take a look at the source code for the ExportService in the plugin, you will notice there are various export methods, two of which support OutputStreams. Using either of these methods (depending on your requirements for the other parameters) will allow you to render a report to an output stream. Then, using those output streams, you can create a zip file which you can deliver to the HTTP client.
Here is a very rough example, which was written off the top of my head so it's really just an idea rather than working code:
// Needs these imports at the top of the controller:
// import java.util.zip.Deflater
// import java.util.zip.ZipEntry
// import java.util.zip.ZipOutputStream

// Assumes you have a list of maps.
// Each map will have two keys:
// outputStream and fileName
List files = []

// Call the exportService and populate your list of files, e.g.:
// ByteArrayOutputStream outputStream = new ByteArrayOutputStream()
// exportService.export('pdf', outputStream, ...)
// files.add([outputStream: outputStream, fileName: 'whatever.pdf'])
// ByteArrayOutputStream outputStream2 = new ByteArrayOutputStream()
// exportService.export('pdf', outputStream2, ...)
// files.add([outputStream: outputStream2, fileName: 'another.pdf'])

// Create a temporary file for the zip file
File tempZipFile = File.createTempFile("temp", ".zip")
ZipOutputStream out = new ZipOutputStream(new FileOutputStream(tempZipFile))
// Set the compression level
out.setLevel(Deflater.BEST_SPEED)

// Iterate through the list of files, adding each one to the ZIP file
files.each { file ->
    // Wrap the rendered report in an input stream
    ByteArrayInputStream input = new ByteArrayInputStream(file.outputStream.toByteArray())
    // Add a ZIP entry to the output stream
    out.putNextEntry(new ZipEntry(file.fileName))
    // Transfer bytes from the current file to the ZIP file
    org.apache.commons.io.IOUtils.copy(input, out)
    // Close the current entry
    out.closeEntry()
    // Close the current input stream
    input.close()
}
// Close the ZIP file
out.close()

// Next you need to deliver the zip file to the HTTP client
response.setContentType("application/zip")
response.setHeader("Content-disposition", "attachment;filename=WhateverFilename.zip")
org.apache.commons.io.IOUtils.copy(new FileInputStream(tempZipFile), response.outputStream)
response.outputStream.flush()
response.outputStream.close()
That should give you an idea of how to approach this. Again, the above is just for demonstration purposes and isn't production-ready code, nor have I even attempted to compile it.

Struts File Upload identify correct extension instead of .tmp

I am new to Struts and working on file upload using Struts.
Client:
It is a Java program which hits my Struts app using the Apache HttpClient API and provides me with a file.
Depending on its needs, the client sometimes gives me a .wav file, sometimes a .zip file, and sometimes both.
Server:
A Struts app which receives the request from the client app and uploads the file.
Here is the problem: when I upload the file, it gets saved with a ".tmp" extension, whereas I want it saved with the same extension the client passed.
Alternatively, is there any other way to check what extension the file sent by the client has?
I am stuck on this problem and not able to go ahead.
Please find the code attached and tell me what modifications I have to make:
Server Code:
MultiPartRequestWrapper multiWrapper = null;
File baseFile = null;
System.out.println("inside do post");
multiWrapper = ((MultiPartRequestWrapper) request);
Enumeration e = multiWrapper.getFileParameterNames();
while (e.hasMoreElements()) {
    // get the value of this input tag
    String inputValue = (String) e.nextElement();
    // Get a File object for the uploaded File
    File[] file = multiWrapper.getFiles(inputValue);
    // If it's null the upload failed
    if (file != null) {
        FileInputStream fis = new FileInputStream(file[0]);
        System.out.println(file[0].getAbsolutePath());
        System.out.println(fis);
        int ch;
        while ((ch = fis.read()) != -1) {
            System.out.print((char) ch);
        }
        fis.close();
    }
}
System.out.println("III :" + multiWrapper.getParameter("method"));
Client code:
HttpClient client = new HttpClient();
MultipartPostMethod mPost = new MultipartPostMethod(url);
File zipFile = new File("D:\\a.zip");
File wavFile = new File("D:\\b.wav");
mPost.addParameter("recipientFile", zipFile);
mPost.addParameter("promptFile", wavFile);
mPost.addParameter("method", "addCampaign");
statusCode1 = client.executeMethod(mPost);
Actually, the client was written long back and can't be modified, so I want to identify the extension on the server side only.
Please help. Thanks.
The Struts2 fileUpload interceptor passes the content type information to the Action class when a file is uploaded, so one can easily find the file type by comparing the contentType with known MIME types.
If you want, you can create a map with the content type as the key and the file extension as its value, like:
map.put("image/bmp", ".bmp");
map.put("image/gif", ".gif");
map.put("image/jpeg", ".jpeg");
and easily fetch the extension based on the content type provided. Hope this will help you.
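To illustrate, here is a minimal sketch of that approach, assuming Struts 2 with the fileUpload interceptor; the action name, the form field name ("upload"), and the MIME map entries are placeholders, not taken from the original code. The interceptor injects the original file name alongside the content type, so the extension can be taken from the file name directly or looked up from the MIME type as a fallback:
import java.io.File;
import java.util.HashMap;
import java.util.Map;

import com.opensymphony.xwork2.ActionSupport;

public class UploadAction extends ActionSupport {

    // Injected by the Struts2 fileUpload interceptor for a form field named "upload"
    private File upload;              // the temporary .tmp file on disk
    private String uploadContentType; // MIME type sent by the client
    private String uploadFileName;    // original file name, e.g. "a.zip" or "b.wav"

    // Fallback lookup: MIME type -> extension (illustrative entries only)
    private static final Map<String, String> EXTENSIONS = new HashMap<>();
    static {
        EXTENSIONS.put("application/zip", ".zip");
        EXTENSIONS.put("audio/x-wav", ".wav");
        EXTENSIONS.put("audio/wav", ".wav");
    }

    public String execute() {
        // Prefer the original file name if it is present
        String extension;
        int dot = (uploadFileName != null) ? uploadFileName.lastIndexOf('.') : -1;
        if (dot >= 0) {
            extension = uploadFileName.substring(dot);
        } else {
            extension = EXTENSIONS.getOrDefault(uploadContentType, ".bin");
        }
        // Copy/rename the .tmp file using the recovered extension here...
        System.out.println("Detected extension: " + extension);
        return SUCCESS;
    }

    public void setUpload(File upload) { this.upload = upload; }
    public void setUploadContentType(String uploadContentType) { this.uploadContentType = uploadContentType; }
    public void setUploadFileName(String uploadFileName) { this.uploadFileName = uploadFileName; }
}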

How do I save the origin html file with Apache Nutch

I'm new to search engines and web crawlers. I want to store all the original pages of a particular website as HTML files, but with Apache Nutch I can only get the binary database files. How do I get the original HTML files with Nutch?
Does Nutch support this? If not, what other tools can I use to achieve my goal? (Tools that support distributed crawling are preferable.)
Well, Nutch writes the crawled data in binary form, so if you want it saved in HTML format you will have to modify the code (this will be painful if you are new to Nutch).
If you want a quick and easy solution for getting HTML pages:
If the list of pages/URLs that you intend to crawl is quite small, it is better to get it done with a script which invokes wget for each URL.
Or use the HTTrack tool.
EDIT:
Writing your own Nutch plugin would be great. Your problem will get solved, plus you can contribute to Nutch by submitting your work! If you are new to Nutch (in terms of code and design), you will have to invest a lot of time building a new plugin; otherwise it's easy to do.
A few pointers to help you get started:
Here is a page which talks about writing your own Nutch plugin.
Start with Fetcher.java. See lines 647-648. That is the place where you can get the fetched content on a per-URL basis (for those pages which were fetched successfully).
pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS);
updateStatus(content.getContent().length);
You should add code right after this to invoke your plugin and pass the content object to it. By now, you will have guessed that content.getContent() is the content for the URL you want. Inside the plugin code, write it to some file. The file name should be based on the URL, otherwise it will be difficult to work with. The URL can be obtained from fit.url.
You must make these modifications while running Nutch in Eclipse.
Once you are able to run it, open Fetcher.java and add the lines between the "content saver" comment lines.
case ProtocolStatus.SUCCESS:        // got a page
    pstatus = output(fit.url, fit.datum, content, status, CrawlDatum.STATUS_FETCH_SUCCESS, fit.outlinkDepth);
    updateStatus(content.getContent().length);
    //------------------------------------------- content saver ---------------------------------------------\\
    String filename = "savedsites//" + content.getUrl().replace('/', '-');
    File file = new File(filename);
    file.getParentFile().mkdirs();
    boolean exist = file.createNewFile();
    if (!exist) {
        System.out.println("File exists.");
    } else {
        FileWriter fstream = new FileWriter(file);
        BufferedWriter out = new BufferedWriter(fstream);
        out.write(content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));
        out.close();
        System.out.println("File created successfully.");
    }
    //------------------------------------------- content saver ---------------------------------------------\\
To update this answer:
It is possible to post-process the data from your crawldb segment folder and read in the HTML (along with the other data Nutch has stored) directly.
Configuration conf = NutchConfiguration.create();
FileSystem fs = FileSystem.get(conf);
Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
try
{
    Text key = new Text();
    Content content = new Content();
    while (reader.next(key, content))
    {
        System.out.println(new String(content.getContent()));
    }
}
catch (Exception e)
{
    // At minimum, log the failure rather than swallowing it silently
    e.printStackTrace();
}
finally
{
    reader.close();
}
The answers here are obsolete. It is now possible to get the plain HTML files simply with nutch dump. Please see this answer.
In Apache Nutch 2.3.1:
You can save the raw HTML by editing the Nutch code. First, run Nutch in Eclipse by following https://wiki.apache.org/nutch/RunNutchInEclipse
After you finish running Nutch in Eclipse, edit the file FetcherReducer.java, add this code to the output method, and run ant eclipse again to rebuild the class.
Finally, the raw HTML will be added to the reprUrl column in your database.
if (content != null) {
    ByteBuffer raw = fit.page.getContent();
    if (raw != null) {
        ByteArrayInputStream arrayInputStream = new ByteArrayInputStream(raw.array(), raw.arrayOffset() + raw.position(), raw.remaining());
        Scanner scanner = new Scanner(arrayInputStream);
        scanner.useDelimiter("\\Z"); // read all scanner content into one String
        String data = "";
        if (scanner.hasNext()) {
            data = scanner.next();
        }
        fit.page.setReprUrl(StringUtil.cleanField(data));
        scanner.close();
    }
}

tika returning incorrect line of text for pdf with lots of tables

I am using Tika to extract text from a PDF file that has a lot of tables.
java -jar tika-app-0.9.jar -t https://s3.amazonaws.com/centraldoc/alg1.pdf
It is returning some invalid text, and sometimes it trims the whitespace between two words; for example it returns
"qu inakli fmyathematical ideas to the real world" instead of "Link mathematical ideas to the real world".
Is there a way to minimize this kind of error? Or is there another library that I can use? Does it make sense to use OCR to process these kinds of PDFs?
Try to control the order when using the PDFBox parser: PDFTextStripper has a flag that controls the order of lines in the document. By default (in PDFBox) it's set to false for performance reasons (no order preserved), but Tika has changed its behavior between releases, switching this flag on and off.
There are more details on exactly this problem in my blog post Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood).
To get the text from a PDF to display in the right order, I had to set the sortByPosition flag to true... (tika-app-1.19.jar)
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParser pdfParser = new PDFParser();
PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);
pdfParser.parse(is, handler, metadata, context); // 'is' is an InputStream opened on the PDF file

Open XML SDK: opening a Word template and saving to a different file-name

This is one very simple thing, but I can't find the right technique. What I want is to open a .dotx template, make some changes, and save it with the same name but a .docx extension. I can save a WordprocessingDocument, but only to the place it was loaded from. I've tried manually constructing a new document using the WordprocessingDocument with the changes made, but nothing has worked so far; I tried MainDocumentPart.Document.WriteTo(XmlWriter.Create(targetPath)); and just got an empty file.
What's the right way here? Is a .dotx file special at all, or just another document as far as the SDK is concerned - should I simply copy the template to the destination, then open that, make changes, and save? I did have some concerns about whether, if my app is called from two clients at once, it can open the same .dotx file twice... in that case creating a copy would be sensible anyway... but for my own curiosity I still want to know how to do "Save As".
I would suggest just using File.IO to copy the dotx file to a docx file and make your changes there, if that works for your situation. There's also a ChangeDocumentType function you'll have to call to prevent an error in the new docx file.
File.Copy(@"\path\to\template.dotx", @"\path\to\template.docx");
using (WordprocessingDocument newdoc = WordprocessingDocument.Open(@"\path\to\template.docx", true))
{
    newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // manipulate document....
}
While M_R_H's answer is correct, there is a faster, less IO-intensive method:
Read the template or document into a MemoryStream.
Within a using statement:
open the template or document on the MemoryStream.
If you opened a template (.dotx) and you want to store it as a document (.docx), you must change the document type to WordprocessingDocumentType.Document. Otherwise, Word will complain when you try to open the document.
Manipulate your document.
Write the contents of the MemoryStream to a file.
For the first step, we can use the following method, which reads a file into a MemoryStream:
public static MemoryStream ReadAllBytesToMemoryStream(string path)
{
    byte[] buffer = File.ReadAllBytes(path);
    var destStream = new MemoryStream(buffer.Length);
    destStream.Write(buffer, 0, buffer.Length);
    destStream.Seek(0, SeekOrigin.Begin);
    return destStream;
}
Then, we can use that in the following way (replicating as much of M_R_H's code as possible):
// Step #1 (note the using declaration)
using MemoryStream stream = ReadAllBytesToMemoryStream(@"\path\to\template.dotx");
// Step #2
using (WordprocessingDocument newdoc = WordprocessingDocument.Open(stream, true))
{
    // You must do the following to turn a template into a document.
    newdoc.ChangeDocumentType(WordprocessingDocumentType.Document);
    // Manipulate the document (completely in memory now) ...
}
// Step #3
File.WriteAllBytes(@"\path\to\template.docx", stream.GetBuffer());
See this post for a comparison of methods for cloning (or duplicating) Word documents or templates.
