Tika--Extracting Distinct Items from a Compound Document - apache-tika

Question:
Assume an email message with an attachment (assume a JPEG attachment). How do I parse (not using the Tika facade classes) the email message and return the distinct pieces--a) the email text contents and b) the email attachment?
Configuration:
Tika 1.2
Java 1.7
Details:
I have been able to properly parse email messages in basic email message formats. However, after the parsing, I need to know a) the email's text contents and b) the contents of any attachment to the email. I will store these items in my database as essentially a parent email with child attachments.
What I cannot figure out is how I can "get back" the distinct parts and know that the parent email has attachments and be able to separately store those attachments referenced to the mail. This is, I believe, essentially similar to extracting ZipFile contents.
Code Example:
private Message processDocument(String fullfilepath) {
try {
File filename = new File(fullfilepath) ;
return this.processDocument(filename) ;
} catch (NullPointerException npe) {
Message error = new Message(false) ;
error.appendErrorMessage("The file name was null.") ;
return error ;
}
}
private Message processDocument(File filename) {
InputStream stream = null;
try {
stream = new FileInputStream(filename) ;
} catch (FileNotFoundException fnfe) {
// TODO Auto-generated catch block
fnfe.printStackTrace();
System.out.println("FileNotFoundException") ;
return diag ;
}
int writelimit = -1 ;
ContentHandler texthandler = new BodyContentHandler(writelimit);
this.safehandlerbodytext = new SafeContentHandler(texthandler);
this.meta = new Metadata() ;
ParseContext context = new ParseContext() ;
AutoDetectParser autodetectparser = new AutoDetectParser() ;
try {
autodetectparser.parse(
stream,
texthandler,
meta,
context) ;
this.documenttype = meta.get("Content-Type") ;
diag.setSuccessful(true);
} catch (IOException ioe) {
// if the document stream could not be read
System.out.println("TikaTextExtractorHelper IOException " + ioe.getMessage()) ;
//FIXME -- add real handling
} catch (SAXException se) {
// if the SAX events could not be processed
System.out.println("TikaTextExtractorHelper SAXException " + se.getMessage()) ;
//FIXME -- add real handling
} catch (TikaException te) {
// if the document could not be parsed
System.out.println("TikaTextExtractorHelper TikaException " + te.getMessage()) ;
System.out.println("Exception Filename = " + filename.getName()) ;
//FIXME -- add real handling
}
return diag ;
}

When Tika hits an embedded document, it checks the ParseContext to see whether you have supplied a recursing parser. If you have, it uses that parser to process any embedded resources. If you haven't, it skips them.
So, what you probably want to do is something like:
public static class HandleEmbeddedParser extends AbstractParser {
public List<File> found = new ArrayList<File>();
@Override
public Set<MediaType> getSupportedTypes(ParseContext context) {
// Return the types you want to handle
Set<MediaType> types = new HashSet<MediaType>();
types.add(MediaType.application("pdf"));
types.add(MediaType.application("zip"));
return types;
}
@Override
public void parse(
InputStream stream, ContentHandler handler,
Metadata metadata, ParseContext context
) throws IOException, SAXException, TikaException {
// Do something with the child documents
// e.g. save to disk
File f = File.createTempFile("tika","tmp");
found.add(f);
FileOutputStream fout = new FileOutputStream(f);
try {
IOUtils.copy(stream,fout);
} finally {
// Close the file even if the copy fails
fout.close();
}
}
}
ParseContext context = new ParseContext();
// Register the handler so Tika calls it for embedded resources
context.set(Parser.class, new HandleEmbeddedParser());
parser.parse(....);

Related

JavaMail MIME attachment link by cid

Background
I have banged my head against this for a while and not made much progress. I am generating MPEG_4 / AAC files in Android and sending them by email as .mp3 files. I know they aren't actually .mp3 files, but that allows Hotmail and Gmail to play them in Preview. They don't work on iPhone though, unless they are sent as .m4a files instead which breaks the Outlook / Gmail Preview.
So I have thought of a different approach which is to attach as a .mp3 file but have an HTML link in the email body which allows the attached file to be downloaded and specifies a .m4a file name. Gmail / Outlook users can click the attachment directly whereas iPhone users can use the HTML link.
Issue
I can send an email using JavaMail with HTML in it including a link which should be pointing at the attached file to allow download of that file by the link. Clicking on the link in Gmail (Chrome on PC) gives a 404 page and iPhone just ignores my clicking on the link.
Below is the code in which I generate a multipart message and assign a CID to the attachment which I then try to access using the link in the html part. It feels like I am close, but maybe that is an illusion. I'd be massively grateful if someone could help me fix it or save me the pain if it isn't possible.
private int send_email_temp(){
Properties props = new Properties();
props.put("mail.smtp.auth", "true");
props.put("mail.smtp.host", smtp_host_setting);
//props.put("mail.debug", "true");
props.put("mail.smtp.ssl.enable", "true");
props.put("mail.smtp.starttls.enable", "true");
props.put("mail.smtp.port", smtp_port_setting);
session = Session.getInstance(props);
ActuallySendAsync_temp asy = new ActuallySendAsync_temp(true);
asy.execute();
return 0;
}
class ActuallySendAsync_temp extends AsyncTask<String, String, Void> {
public ActuallySendAsync_temp(boolean boo) {
// something to do before sending email
}
@Override
protected Void doInBackground(String... params) {
try {
Message message = new MimeMessage(session);
message.setFrom(new InternetAddress(username));
message.setRecipients(Message.RecipientType.TO,
InternetAddress.parse(recipient_email_address));
message.setSubject(email_subject);
Multipart multipart = new MimeMultipart();
MimeBodyPart messageBodyPart = new MimeBodyPart();
String file = mFileName;
/**/
DataSource source = new FileDataSource(file);
messageBodyPart.setDataHandler(new DataHandler(source));
/* /
File ff = new File(file);
try {
messageBodyPart.attachFile(ff);
} catch(IOException eio) {
Log.e("Message Error", "Old Macdonald");
}
/* /
messageBodyPart = new PreencodedMimeBodyPart("base64");
byte[] file_bytes = null;
File ff = new File(file);
try {
int length = (int) ff.length();
BufferedInputStream reader = new BufferedInputStream(new FileInputStream(ff));
file_bytes = new byte[length];
reader.read(file_bytes, 0, length);
reader.close();
} catch (IOException eio) {
Log.e("Message Error", "Old Macdonald");
}
messageBodyPart.setText(Base64.encodeToString(file_bytes, Base64.DEFAULT));
messageBodyPart.setHeader("Content-Transfer-Encoding", "base64");
/**/
messageBodyPart.setFileName( DEFAULT_AUDIO_FILENAME );//"AudioClip.mp3");
//messageBodyPart.setContentID("<audio_clip>");
String content_id = UUID.randomUUID().toString();
messageBodyPart.setContentID("<" + content_id + ">");
messageBodyPart.setDisposition(Part.ATTACHMENT);//INLINE);
messageBodyPart.setHeader("Content-Type", "audio/mp4");
multipart.addBodyPart(messageBodyPart);
MimeBodyPart messageBodyText = new MimeBodyPart();
//final String MY_HTML_MESSAGE = "<h1>My HTML</h1><a download=\"AudioClip.m4a\" href=\"cid:audio_clip\">iPhone Download</a>";
final String MY_HTML_MESSAGE = "<h1>My HTML</h1><a download=\"AudioClip.m4a\" href=\"cid:" + content_id + "\">iPhone Download</a>";
messageBodyText.setContent( MY_HTML_MESSAGE, "text/html");
multipart.addBodyPart(messageBodyText);
message.setContent(multipart);
Print_Message_To_Console(message);
Transport transport = session.getTransport("smtp");
transport.connect(smtp_host_setting, username, password);
transport.sendMessage(message, message.getAllRecipients());
transport.close();
} catch (MessagingException e) {
e.printStackTrace();
} finally {
}
return null;
}
@Override
protected void onPostExecute(Void aVoid) {
super.onPostExecute(aVoid);
// something to do after sending email
}
}
int Print_Message_To_Console(Message msg) {
int ret_val = 0;
int line_num = 0;
InputStream in = null;
InputStreamReader inputStreamReader = null;
BufferedReader buff_reader = null;
try {
in = msg.getInputStream();
inputStreamReader = new InputStreamReader(in);
buff_reader = new BufferedReader(inputStreamReader);
String temp = "";
while ((temp = buff_reader.readLine()) != null) {
Log.d("Message Line " + Integer.toString(line_num++), temp);
}
} catch(Exception e) {
Log.d("Message Lines", "------------ OOPS! ------------");
ret_val = 1;
} finally {
try {
if (buff_reader != null) buff_reader.close();
if (inputStreamReader != null) inputStreamReader.close();
if (in != null) in.close();
} catch(Exception e2) {
Log.d("Message Lines", "----------- OOPS! 2 -----------");
ret_val = 2;
}
}
return ret_val;
}
You need to create a multipart/related and set the main text part as the first body part.
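To make the layout concrete, here is a minimal sketch that builds the raw MIME text by hand with only the JDK, so the structure is visible: a `multipart/related` root, the HTML part first, then the attachment carrying the `Content-ID` that the `cid:` link refers to. In JavaMail the root would typically be created with `new MimeMultipart("related")` and the HTML part added before the attachment. The class name `RelatedLayoutSketch`, the boundary string, and the sample content ID are assumptions for illustration only, and this does not guarantee that a given mail client will follow a `cid:` link as a download.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class RelatedLayoutSketch {

    // Builds a multipart/related message: HTML part first, attachment second.
    public static String build(String boundary, String contentId, String html, byte[] audio) {
        String crlf = "\r\n";
        StringBuilder sb = new StringBuilder();
        sb.append("MIME-Version: 1.0").append(crlf);
        sb.append("Content-Type: multipart/related; boundary=\"")
          .append(boundary).append("\"").append(crlf);
        sb.append(crlf);

        // First body part: the main HTML text that links to the attachment.
        sb.append("--").append(boundary).append(crlf);
        sb.append("Content-Type: text/html; charset=UTF-8").append(crlf);
        sb.append(crlf);
        sb.append(html).append(crlf);

        // Second body part: the attachment. The Content-ID header uses angle
        // brackets; the cid: URL in the HTML uses the same ID without them.
        sb.append("--").append(boundary).append(crlf);
        sb.append("Content-Type: audio/mp4").append(crlf);
        sb.append("Content-Transfer-Encoding: base64").append(crlf);
        sb.append("Content-ID: <").append(contentId).append(">").append(crlf);
        sb.append("Content-Disposition: attachment; filename=\"AudioClip.mp3\"").append(crlf);
        sb.append(crlf);
        sb.append(Base64.getMimeEncoder().encodeToString(audio)).append(crlf);

        // Closing boundary ends the multipart body.
        sb.append("--").append(boundary).append("--").append(crlf);
        return sb.toString();
    }

    public static void main(String[] args) {
        String cid = "audio_clip"; // hypothetical content ID
        String html = "<a download=\"AudioClip.m4a\" href=\"cid:" + cid + "\">iPhone Download</a>";
        byte[] fakeAudio = "not real audio".getBytes(StandardCharsets.US_ASCII);
        System.out.print(build("sketch-boundary", cid, html, fakeAudio));
    }
}
```

Comparing this output with what `Print_Message_To_Console` logs for the real message is one way to check that the root type and the part order are what the answer describes.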

I am not able to parse iOS driver page source

I got the page source using
String pageSource = driver.getPageSource();
Now I need to save this XML to a local cache, so that I can read element attributes such as x and y from it rather than calling element.getAttribute("x") every time. However, I am not able to parse the page source XML because of some special characters. I cannot simply remove these characters, because then the element value/text would differ from the original. Appium handles the page source the same way.
I was facing the same issue and resolved it with the code below, which I wrote and which works fine for me:
public static void removeEscapeCharacter(File xmlFile) {
String pattern = "(\\\"([^=])*\\\")";
String contentBuilder = null;
try {
contentBuilder = Files.toString(xmlFile, Charsets.UTF_8);
} catch (IOException e1) {
e1.printStackTrace();
}
if (contentBuilder == null)
return;
Pattern pattern2 = Pattern.compile(pattern);
Matcher matcher = pattern2.matcher(contentBuilder);
StrBuilder sb = new StrBuilder(contentBuilder);
while (matcher.find()) {
String str = matcher.group(1).substring(1, matcher.group(1).length() - 1);
try {
sb = sb.replaceFirst(StrMatcher.stringMatcher(str),
StringEscapeUtils.escapeXml(str));
} catch (Exception e) {
e.printStackTrace();
}
}
try {
Writer output = null;
output = new BufferedWriter(new FileWriter(xmlFile, false));
output.write(sb.toString());
output.close();
} catch (IOException e) {
e.printStackTrace();
}
}
If you hit that kind of problem, catch the exception, escape the special characters, and parse again.
try {
doc = db.parse(fileContent);
} catch (Exception e) {
removeEscapeCharacter(file);
doc = db.parse(file);
}
It might work for you.
I was also able to do the same using SAXParser with a custom handler.
See: SAX Parser
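A minimal sketch of the SAX approach mentioned above, using only the JDK's `javax.xml.parsers.SAXParser`: a handler collects the `x` and `y` attributes of every element in a single pass, so they can be cached instead of being fetched per element. The class name `PageSourceSaxSketch`, the sample XML, and the attribute names `name`, `x`, and `y` are assumptions about what the page source looks like. Input that is not well-formed XML would still need to be escaped first.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class PageSourceSaxSketch {

    // Returns one "element name x y" entry per element that has both x and y.
    public static List<String> positions(String pageSource) {
        final List<String> found = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                String x = attrs.getValue("x");
                String y = attrs.getValue("y");
                if (x != null && y != null) {
                    // Attribute values arrive already unescaped (e.g. &amp; becomes &).
                    found.add(qName + " " + attrs.getValue("name") + " " + x + " " + y);
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            byte[] bytes = pageSource.getBytes(StandardCharsets.UTF_8);
            parser.parse(new ByteArrayInputStream(bytes), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse page source", e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical page source fragment.
        String xml = "<AppiumAUT><UIAButton name=\"A &amp; B\" x=\"10\" y=\"20\"/></AppiumAUT>";
        for (String p : positions(xml)) {
            System.out.println(p);
        }
    }
}
```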

How to extract content from a .pst file using Apache Tika?

How can I parse a .pst file using Apache Tika 1.2?
How can I get the entire body, the attachments, and all metadata of each email for searching with Lucene?
for (File file : docs.listFiles()) {
Metadata metadata = new Metadata();
ContentHandler handler = new BodyContentHandler();
ParseContext context = new ParseContext();
Parser parser = new AutoDetectParser();
InputStream stream = new FileInputStream(file);
try {
parser.parse(stream, handler, metadata, context);
}
catch (TikaException e) {
e.printStackTrace();
}
catch (SAXException e) {
e.printStackTrace();
}
}
If you're stuck with 1.2, you might try the recommendation here
If you're able to upgrade, we added that as the RecursiveParserWrapper in 1.7 ...just upgrade to 1.12 if you can, or wait a week or two and 1.13 should be out.
Via commandline:
java -jar tika-app.jar -J -t -i input_directory -o output_directory
Or in code:
Parser p = new AutoDetectParser();
ParseContext context = new ParseContext();
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p,
new BasicContentHandlerFactory(
BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
try (InputStream is = Files.newInputStream(file)) {
wrapper.parse(is, new DefaultHandler(), new Metadata(), context);
}
int i = 0;
for (Metadata metadata : wrapper.getMetadata()) {
for (String name : metadata.names()) {
for (String value : metadata.getValues(name)) {
System.out.println(i + " " + name +": " + value);
}
}
i++;
}

Convert the bytes into a readable string format in BlackBerry?

I am working on a BlackBerry app in which I need to open an HTTP connection, pass the name of an image stored on the server, and get back the text written in that image document.
I am getting the response in RTF format.
When I hit the server URL directly in the Chrome browser, an RTF file gets downloaded.
Now I need to do that programmatically:
1) Either convert the bytes coming in the response into a simple string so that I can read it,
or
2) Download the file, as happens manually in the browser, so that I can read the information written in the document from that file.
Please suggest how I can read the data from the server by hitting a URL.
Currently I am working with this code:
try {
byte []b = send("new_image.JPG");
String s = new String(b, "UTF-8");
System.out.println(s);
} catch (Exception e) {
e.printStackTrace();
}
public byte[] send(String Imagename) throws Exception
{
HttpConnection hc = null;
String imageName = "BasicExp_1345619462234.jpg";
InputStream is = null;
ByteArrayOutputStream bos = new ByteArrayOutputStream();
byte[] res = null;
try
{
hc = (HttpConnection) Connector.open("http://webservice.tvdevphp.com/basisexpdemo/webservices/ocr.php?imgname="+imageName);
hc.setRequestProperty("Content-Type", "multipart/form-data;");
hc.setRequestMethod(HttpConnection.GET);
int ch;
StringBuffer sb= new StringBuffer();
is = hc.openInputStream();
while ((ch = is.read()) != -1)
{
bos.write(ch);
sb.append((char) ch); // cast so the character, not its numeric code, is appended
}
System.out.println(sb.toString());
res = bos.toByteArray();
}
catch(Exception e){
e.printStackTrace();
}
finally
{
try
{
if(bos != null)
bos.close();
if(is != null)
is.close();
if(hc != null)
hc.close();
}
catch(Exception e2)
{
e2.printStackTrace();
}
}
return res;
}
The response is like:
{\rtf1\ansi\ansicpg1252\uc1\deflang1033\adeflang1033...................
I can read the data, but it is not formatted in a way that lets me read it programmatically.
I am done with this task.
Actually, the mistake was on the server side.
When they were performing OCR, the format parameter was not set correctly; that was the reason.

How to send mail with an image attachment in BlackBerry

I am developing a BlackBerry application that uses the Mail functionality. My problem is
I want to send mail with an image attachment. How can I do that?
You can convert the image to a byte array and then use the following method to send it as an attachment.
public synchronized boolean sendMail(final byte []data)
{
Folder[] folders = store.list(4);
Folder sentfolder = folders[0];
// create a new message and store it in the sent folder
msg = new Message(sentfolder);
multipart = new Multipart();
textPart = new TextBodyPart(multipart,"Image");
Address recipients[] = new Address[1];
try {
recipients[0] = new Address(address, "XYZ");
msg.addRecipients(Message.RecipientType.TO, recipients);
msg.setSubject("Image");
try {
Thread thread = new Thread("Send mail") {
public void run() {
try {
attach = new SupportedAttachmentPart(
multipart, "application/octet-stream",
"title",data);
multipart.addBodyPart(textPart);
multipart.addBodyPart(attach);
msg.setContent(multipart);
Transport.send(msg);
}
catch(SendFailedException e)
{
}
catch (final MessagingException e) {
}
catch (final Exception e) {
}
}
};
thread.start();
return true;
}
catch (final Exception e)
{
}
}catch (final Exception e) {
}
return false;
}
This may help you; check it:
//create a multipart
Multipart mp = new Multipart();
//data for the content of the file
String fileData = "<html>just a simple test</html>";
String messageData = "Mail Attachment Demo";
//create the file
SupportedAttachmentPart sap = new SupportedAttachmentPart(mp,"text/html","file.html",fileData.getBytes());
TextBodyPart tbp = new TextBodyPart(mp,messageData);
//add the file to the multipart
mp.addBodyPart(tbp);
mp.addBodyPart(sap);
//create a message in the sent items folder
Folder folders[] = Session.getDefaultInstance().getStore().list(Folder.SENT);
Message message = new Message(folders[0]);
//add recipients to the message and send
try {
Address toAdd = new Address("email@company.com","my email");
Address toAdds[] = new Address[1];
toAdds[0] = toAdd;
message.addRecipients(Message.RecipientType.TO,toAdds);
message.setContent(mp);
Transport.send(message);
} catch (Exception e) {
Dialog.inform(e.toString());
}
This is for an image file:
InputStream inputStream;
FileConnection fconn = (FileConnection) Connector.open(fName, Connector.READ_WRITE);
if(fconn.exists()){
inputStream=fconn.openInputStream();
byte[] data = IOUtilities.streamToBytes(inputStream);
inputStream.close();
fconn.close();
Multipart multipart = new Multipart();
// The second argument is the MIME content type; "image/jpeg" is assumed here for a JPEG image
SupportedAttachmentPart attach = new SupportedAttachmentPart(multipart, "image/jpeg", "attachment1", data);
multipart.addBodyPart(attach);
}

Resources