Losing the input stream in Apache Tika

I get the input stream from the HTTP request and use the same input stream to extract the metadata, as shown below.
ServletFileUpload upload = new ServletFileUpload();
FileItemIterator iter = upload.getItemIterator(request);
--- more lines for the iteration and getting the stream ------
InputStream input = item.openStream();
This input is passed to the parser as follows:
public Map<String, String> extractMetadata(InputStream is) {
Map<String,String> map = new HashMap<>();
ContentHandler contentHandler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext parseContext = new ParseContext();
parseContext.set(Parser.class ,
new ParserDecorator(parser));
try {
TikaInputStream tikaInputStream = TikaInputStream.get(is);
parser.parse(tikaInputStream, contentHandler, metadata,parseContext);
for (String name : metadata.names()) {
map.put(name ,metadata.get(name));
}
} catch (IOException|SAXException|TikaException e) {
map.put("ERROR","Error while retrieving Metadata");
}
return map;
}
But when I read the input stream again afterwards, its content is not the same as when I skip the Tika extraction.
Does Tika dirty the stream?
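On the question of whether the stream is left "dirty": reading from an InputStream advances it, and a stream opened from an upload generally cannot be rewound, so whatever reads it first leaves nothing for the next reader. A minimal sketch of one possible workaround, assuming the upload is small enough to hold in memory, is to copy the bytes once and hand each consumer its own ByteArrayInputStream. The names StreamReuse, readFully, and consume are hypothetical; consume only stands in for a parser call such as extractMetadata.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamReuse {

    // Copy the whole stream into memory so its content can be replayed later.
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Hypothetical stand-in for any consumer that reads to end of stream
    // (a parser call would go here). Returns the number of bytes it saw.
    public static int consume(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for item.openStream(); the real upload stream is an assumption here.
        InputStream upload = new ByteArrayInputStream(
                "hello upload".getBytes(StandardCharsets.UTF_8));

        byte[] bytes = readFully(upload); // the original stream is now exhausted

        int first = consume(new ByteArrayInputStream(bytes));  // e.g. metadata extraction
        int second = consume(new ByteArrayInputStream(bytes)); // e.g. storing the file

        // Both consumers see all 12 bytes; the original stream has nothing left.
        System.out.println(first + " " + second + " leftover=" + upload.read());
    }
}
```

In the servlet above, the same idea would mean calling readFully(item.openStream()) once and passing a new ByteArrayInputStream over those bytes to extractMetadata and to any later code. For large uploads, spooling to a temporary file instead of memory may be the safer choice.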

Related

Unable to parse .docx or .xlsx files using Apache Tika 1.6. The JAR files are loaded, but nothing is parsed

The last line returns a blank value.
Parser _autoParser = new AutoDetectParser();
ContentHandler textHandler = new BodyContentHandler(-1);
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
System.out.println("inside Tika");
Metadata metadata = new Metadata();
ParseContext contextParse = new ParseContext();
contextParse.set(PDFParserConfig.class, pdfConfig);
contextParse.set(Parser.class, _autoParser);
InputStream input = new FileInputStream(fLoc);
System.out.println("trying to read the file content");
_autoParser.parse(input, textHandler, metadata, contextParse);

How to extract content from a .pst file using Apache Tika?

How can I parse a .pst file using Apache Tika 1.2?
How can I get the entire body, the attachments, and all metadata of an email while searching with Lucene?
for (File file : docs.listFiles()) {
    Metadata metadata = new Metadata();
    ContentHandler handler = new BodyContentHandler();
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, handler, metadata, context);
    }
    catch (IOException | TikaException | SAXException e) {
        e.printStackTrace();
    }
}
If you're stuck with 1.2, you might try the recommendation here
If you're able to upgrade, we added that as the RecursiveParserWrapper in 1.7. Just upgrade to 1.12 if you can, or wait a week or two and 1.13 should be out.
Via commandline:
java -jar tika-app.jar -J -t -i input_directory -o output_directory
Or in code:
Parser p = new AutoDetectParser();
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p,
        new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
ParseContext context = new ParseContext();
// file is assumed to be a java.nio.file.Path
try (InputStream is = Files.newInputStream(file)) {
    wrapper.parse(is, new DefaultHandler(), new Metadata(), context);
}
// One Metadata object per document: the container first, then each embedded document
int i = 0;
for (Metadata metadata : wrapper.getMetadata()) {
    for (String name : metadata.names()) {
        for (String value : metadata.getValues(name)) {
            System.out.println(i + " " + name + ": " + value);
        }
    }
    i++;
}

Get attachment names from Tika

I'm parsing an EML file (RFC822) using Tika, but for some reason I cannot get the attachment names. I get the body, To, Cc, Bcc, the attachments' text, etc., but not the attachment names. Any ideas? Below is the code I'm using.
var handler = new BodyContentHandler();
Metadata metadata = new Metadata();
FileInputStream inputstream = new FileInputStream(new File(@"C:\Users\test\Desktop\testemail.eml"));
ParseContext pcontext = new ParseContext();
var parser = new RFC822Parser();
parser.parse(inputstream, handler, metadata, pcontext);
Debug.WriteLine("Contents of the document:" + handler.toString());
Debug.WriteLine("Metadata of the document:");
String[] metadataNames = metadata.names();
foreach (String name in metadataNames)
{
Debug.WriteLine(name + ": " + metadata.get(name));
}

How to put two JasperReports in one zip file to download?

public String generateReport() {
try
{
final FacesContext facesContext = FacesContext.getCurrentInstance();
final HttpServletResponse response = (HttpServletResponse) facesContext.getExternalContext().getResponse();
response.reset();
response.setHeader("Content-Disposition", "attachment; filename=\"" + "myReport.zip\";");
final BufferedOutputStream bos = new BufferedOutputStream(response.getOutputStream());
final ZipOutputStream zos = new ZipOutputStream(bos);
for (final PeriodScale periodScale : Scale.getPeriodScales(this.startDate, this.endDate))
{
final JasperPrint jasperPrint = JasperFillManager.fillReport(
this.reportsPath() + File.separator + "periodicScale.jasper",
this.parameters(this.reportsPath(), periodScale.getScale(),
periodScale.getStartDate(), periodScale.getEndDate()),
new JREmptyDataSource());
final byte[] bytes = JasperExportManager.exportReportToPdf(jasperPrint);
response.setContentLength(bytes.length);
final ZipEntry ze = new ZipEntry("periodicScale"+ periodScale.getStartDate() + ".pdf"); // periodicScale13032015.pdf for example
zos.putNextEntry(ze);
zos.write(bytes, 0, bytes.length);
zos.closeEntry();
}
zos.close();
facesContext.responseComplete();
}
catch (final Exception e)
{
e.printStackTrace();
}
return "";
}
This is the action method in my managed bean, which the user calls to print a JasperReport. When I try to put more than one report inside the zip file, it does not work.
getPeriodScales returns two objects, and JasperFillManager.fillReport runs correctly: the report prints when I generate data for just one report. When I stream two reports and open the archive in WinRAR, only one appears and I get an "unexpected end of archive" error; in 7-Zip both appear, but the second is corrupted.
What am I doing wrong? Or is there a way to stream multiple reports without zipping them?
I figured out what the problem was: I was setting the content length of the response to bytes.length, but it should be bytes.length * Scale.getPeriodScales(this.startDate, this.endDate).size().
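Multiplying one report's size by the number of reports is only an estimate: the reports may differ in size, and the ZIP format adds its own headers and compresses the data. One alternative, sketched here under the assumption that the archive fits in memory, is to build the whole zip in a ByteArrayOutputStream first and use its length as the content length (or not to set a content length at all). The names ZipInMemory, zip, and entryNames are made up for this illustration, and plain byte arrays stand in for the exported PDFs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipInMemory {

    // Build the complete archive in memory; the length of the returned
    // array is the exact number of bytes that would be sent to the client.
    public static byte[] zip(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                zos.putNextEntry(new ZipEntry(file.getKey()));
                zos.write(file.getValue());
                zos.closeEntry();
            }
        } // closing the ZipOutputStream writes the central directory
        return buffer.toByteArray();
    }

    // Read the entry names back, as a quick check that the archive is complete.
    public static List<String> entryNames(byte[] archive) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(archive))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content; in the question these would be the exported PDF bytes.
        Map<String, byte[]> reports = new LinkedHashMap<>();
        reports.put("periodicScale-1.pdf", "first report".getBytes(StandardCharsets.UTF_8));
        reports.put("periodicScale-2.pdf", "second report".getBytes(StandardCharsets.UTF_8));

        byte[] archive = zip(reports);
        System.out.println("Content-Length would be " + archive.length);
        System.out.println(entryNames(archive));
    }
}
```

In the action method, this would mean calling response.setContentLength(archive.length) once, after the loop, and then writing the archive bytes to response.getOutputStream().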
public JasperPrint generatePdf(long consumerNo) {
Consumer consumerByCustomerNo = consumerService.getConsumerByCustomerNo(consumerNo);
consumerList.add(consumerByCustomerNo);
BillHeaderIPOP billHeaderByConsumerNo = billHeaderService.getBillHeaderByConsumerNo(consumerNo);
Long billNo = billHeaderByConsumerNo.getBillNo();
List<BillLineItem> billLineItemByBilNo = billLineItemService.getBillLineItemByBilNo(billNo);
System.out.println(billLineItemByBilNo);
List<BillReadingLine> billReadingLineByBillNo = billReadingLineService.getBillReadingLineByBillNo(billNo);
File jrxmlFile = ResourceUtils.getFile("classpath:demo.jrxml");
JasperReport jasperReport = JasperCompileManager.compileReport(jrxmlFile.getAbsolutePath());
pdfContainer.setName(consumerByCustomerNo.getName());
pdfContainer.setTelephone(consumerByCustomerNo.getTelephone());
pdfContainer.setFromDate(billLineItemByBilNo.get(0).getStartDate());
pdfContainer.setToDate(billLineItemByBilNo.get(0).getEndDate());
pdfContainer.setSupplyAddress(consumerByCustomerNo.getSupplyAddress());
pdfContainer.setMeterNo(billReadingLineByBillNo.get(0).getMeterNo());
pdfContainer.setBillType(billHeaderByConsumerNo.getBillType());
pdfContainer.setReadingType(billReadingLineByBillNo.get(0).getReadingType());
pdfContainer.setLastBilledReadingInKWH(billReadingLineByBillNo.stream().filter(billReadingLine -> billReadingLine.getRegister().contains("KWH")).collect(Collectors.toList()).get(0).getLastBilledReading());
pdfContainer.setLastBilledReadingInKW(billReadingLineByBillNo.stream().filter(billReadingLine -> billReadingLine.getRegister().contains("KW")).collect(Collectors.toList()).get(0).getLastBilledReading());
pdfContainer.setReadingType(billReadingLineByBillNo.get(0).getReadingType());
pdfContainer.setRateCategory(billLineItemByBilNo.get(0).getRateCategory());
List<PdfContainer> pdfContainerList = new ArrayList<>();
pdfContainerList.add(pdfContainer);
Map<String, Object> parameters = new HashMap<>();
parameters.put("billLineItemByBilNo", billLineItemByBilNo);
parameters.put("billReadingLineByBillNo", billReadingLineByBillNo);
parameters.put("consumerList", consumerList);
parameters.put("pdfContainerList", pdfContainerList);
JasperPrint jasperPrint = JasperFillManager.fillReport(jasperReport, parameters, new JREmptyDataSource());
return jasperPrint;
}
// The code above follows my own requirements; just focus on the jasperPrint object it returns. That jasperPrint object is then used to generate the PDFs, which are stored in a zip file.
@GetMapping("/batchpdf/{rangeFrom}/{rangeTo}")
public String batchPdfBill(@PathVariable("rangeFrom") long rangeFrom, @PathVariable("rangeTo") long rangeTo) throws JRException, IOException {
consumerNosInRange = consumerService.consumerNoByRange(rangeFrom, rangeTo);
String zipFilePath = "C:\\Users\\Barada\\Downloads";
FileOutputStream fos = new FileOutputStream(zipFilePath +"\\"+ rangeFrom +"-To-"+ rangeTo +"--"+ Math.random() + ".zip");
BufferedOutputStream bos = new BufferedOutputStream(fos);
ZipOutputStream outputStream = new ZipOutputStream(bos);
try {
for (long consumerNo : consumerNosInRange) {
JasperPrint jasperPrint = generatePdf(consumerNo);
byte[] bytes = JasperExportManager.exportReportToPdf(jasperPrint);
outputStream.putNextEntry(new ZipEntry(consumerNo + ".pdf"));
outputStream.write(bytes, 0, bytes.length);
outputStream.closeEntry();
}
} finally {
outputStream.close();
}
return "All Bills PDF Generated.. Extract ZIP file get all Bills";
}
}

Tika--Extracting Distinct Items from a Compound Document

Question:
Assume an email message with an attachment (assume a JPEG attachment). How do I parse (not using the Tika facade classes) the email message and return the distinct pieces--a) the email text contents and b) the email attachment?
Configuration:
Tika 1.2
Java 1.7
Details:
I have been able to properly parse email messages in basic email message formats. However, after the parsing, I need to know a) the email's text contents and b) the contents of any attachment to the email. I will store these items in my database as essentially a parent email with child attachments.
What I cannot figure out is how I can "get back" the distinct parts and know that the parent email has attachments and be able to separately store those attachments referenced to the mail. This is, I believe, essentially similar to extracting ZipFile contents.
Code Example:
private Message processDocument(String fullfilepath) {
try {
File filename = new File(fullfilepath) ;
return this.processDocument(filename) ;
} catch (NullPointerException npe) {
Message error = new Message(false) ;
error.appendErrorMessage("The file name was null.") ;
return error ;
}
}
private Message processDocument(File filename) {
InputStream stream = null;
try {
stream = new FileInputStream(filename) ;
} catch (FileNotFoundException fnfe) {
// TODO Auto-generated catch block
fnfe.printStackTrace();
System.out.println("FileNotFoundException") ;
return diag ;
}
int writelimit = -1 ;
ContentHandler texthandler = new BodyContentHandler(writelimit);
this.safehandlerbodytext = new SafeContentHandler(texthandler);
this.meta = new Metadata() ;
ParseContext context = new ParseContext() ;
AutoDetectParser autodetectparser = new AutoDetectParser() ;
try {
autodetectparser.parse(
stream,
texthandler,
meta,
context) ;
this.documenttype = meta.get("Content-Type") ;
diag.setSuccessful(true);
} catch (IOException ioe) {
// if the document stream could not be read
System.out.println("TikaTextExtractorHelper IOException " + ioe.getMessage()) ;
//FIXME -- add real handling
} catch (SAXException se) {
// if the SAX events could not be processed
System.out.println("TikaTextExtractorHelper SAXException " + se.getMessage()) ;
//FIXME -- add real handling
} catch (TikaException te) {
// if the document could not be parsed
System.out.println("TikaTextExtractorHelper TikaException " + te.getMessage()) ;
System.out.println("Exception Filename = " + filename.getName()) ;
//FIXME -- add real handling
}
return diag ;
}
When Tika hits an embedded document, it goes to the ParseContext to see if you have supplied a recursing parser. If you have, it'll use that to process any embedded resources. If you haven't, it'll skip them.
So, what you probably want to do is something like:
public static class HandleEmbeddedParser extends AbstractParser {
    public List<File> found = new ArrayList<File>();

    @Override
    public Set<MediaType> getSupportedTypes(ParseContext context) {
        // Return what you want to handle
        HashSet<MediaType> types = new HashSet<MediaType>();
        types.add(MediaType.application("pdf"));
        types.add(MediaType.application("zip"));
        return types;
    }

    @Override
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context
    ) throws IOException {
        // Do something with the child documents
        // eg save to disk
        File f = File.createTempFile("tika", "tmp");
        found.add(f);
        // try-with-resources closes the file even if the copy fails
        try (FileOutputStream fout = new FileOutputStream(f)) {
            IOUtils.copy(stream, fout);
        }
    }
}
ParseContext context = new ParseContext();
context.set(Parser.class, new HandleEmbeddedParser());
parser.parse(....);
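The lookup described in this answer (check the context for a registered handler, use it if present, skip otherwise) can be pictured with a toy model. This is not Tika's implementation: ToyContext, EmbeddedHandler, and parseContainer are hypothetical names, and the sketch only mirrors the set/get-by-class usage seen in the snippets above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyContextDemo {

    // Hypothetical stand-in for whatever processes an embedded document.
    public interface EmbeddedHandler {
        void handle(String name, byte[] content);
    }

    // Toy registry keyed by class, loosely modelled on context.set(X.class, x).
    public static class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        public <T> T get(Class<T> key) {
            return key.cast(entries.get(key)); // null when nothing is registered
        }
    }

    // Toy "container parser": passes each attachment to the registered handler,
    // or skips them all when none is registered. Returns how many were handled.
    public static int parseContainer(Map<String, byte[]> attachments, ToyContext context) {
        EmbeddedHandler handler = context.get(EmbeddedHandler.class);
        if (handler == null) {
            return 0; // nothing registered: embedded documents are skipped
        }
        for (Map.Entry<String, byte[]> attachment : attachments.entrySet()) {
            handler.handle(attachment.getKey(), attachment.getValue());
        }
        return attachments.size();
    }

    public static void main(String[] args) {
        Map<String, byte[]> attachments = new LinkedHashMap<>();
        attachments.put("photo.jpg", new byte[] {1, 2, 3});

        ToyContext context = new ToyContext();
        System.out.println("handled without handler: " + parseContainer(attachments, context));

        List<String> found = new ArrayList<>();
        context.set(EmbeddedHandler.class, (name, content) -> found.add(name));
        System.out.println("handled with handler: " + parseContainer(attachments, context));
        System.out.println("collected: " + found);
    }
}
```

The point of the pattern is that the caller, not the container parser, decides what happens to child documents, which is how the parent email and its attachments can be stored separately.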

Resources