Apache Tika BodyContentHandler() is Empty

I'm using Apache Tika 1.18. The code below works under one web service framework (sparkjava), but under Spring Boot the BodyContentHandler is empty, so the text I return is empty.
I'm not sure what is going on and would appreciate any suggestions.
The input is a Base64-encoded string that is also URL-encoded, hence the two decode calls in the first two lines.
Running this code in the debugger under Spring Boot, the variables hold the same values as under sparkjava. But once I get to the BodyContentHandler() line, the sparkjava version shows the input text in the handler variable, while the Spring Boot version shows "" for handler.
I also tested with Tika 1.17, with the same result. Removing the -1 argument from the new BodyContentHandler() constructor made no difference either.
Thanks in advance.
A string containing "data=" is passed into the Spring Boot POST method.
String bodyData = URLDecoder.decode(data.substring(data.indexOf("data=") + 5));
byte[] decodedBodyData = java.util.Base64.getMimeDecoder().decode(bodyData);
Tika tika = new Tika();
try
{
    Parser parser = new AutoDetectParser();
    // line of code below returns "". Problem!
    BodyContentHandler handler = new BodyContentHandler(-1); // handle larger files.
    Metadata metadata = new Metadata();
    InputStream inputStream = new ByteArrayInputStream(decodedBodyData);
    ParseContext context = new ParseContext();
    // parsing the file
    parser.parse(inputStream, handler, metadata, context);
    textToReturn = handler.toString();
}
catch (IOException e)
{
    e.printStackTrace();
}
catch (SAXException e)
{
    e.printStackTrace();
}
catch (TikaException e)
{
    e.printStackTrace();
}
catch (Exception e)
{
    e.printStackTrace();
}
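One thing that may be worth ruling out (an assumption, since the controller code is not shown): if the framework has already URL-decoded the form body before the method sees it, a second URLDecoder.decode call turns every '+' in the Base64 text into a space. The MIME decoder skips characters outside the Base64 alphabet instead of failing, so the decoded bytes come out short or empty and the parser has little or nothing to read. The sketch below reproduces that effect with the JDK alone. The class name and sample bytes are made up for illustration, and the Charset overloads used here need Java 10 or later.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class; not part of Tika, sparkjava, or Spring Boot.
final class DoubleDecodeDemo {

    // URL-decodes `value` the given number of times, then Base64-decodes the
    // result with the same MIME decoder the question uses.
    static byte[] decode(String value, int urlDecodePasses) {
        String text = value;
        for (int i = 0; i < urlDecodePasses; i++) {
            text = URLDecoder.decode(text, StandardCharsets.UTF_8);
        }
        return Base64.getMimeDecoder().decode(text);
    }

    public static void main(String[] args) {
        // These three bytes were chosen because their Base64 form is "++++".
        byte[] original = {(byte) 0xFB, (byte) 0xEF, (byte) 0xBE};
        String base64 = Base64.getEncoder().encodeToString(original);
        // What a client would put on the wire: each '+' becomes "%2B".
        String wire = URLEncoder.encode(base64, StandardCharsets.UTF_8);

        byte[] once = decode(wire, 1);
        byte[] twice = decode(wire, 2);

        System.out.println("decoded once matches original: " + Arrays.equals(original, once));
        System.out.println("decoded twice, byte count: " + twice.length);
    }
}
```

If this is what is happening, logging the raw parameter on entry to the POST method should show whether it still contains %2B sequences or has already been decoded.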

Related

After tika-core is upgraded from 1.26 to 2.1.0, TIKA no longer throws an exception when parsing encrypted documents in .doc format

After upgrading tika-core from 1.26 to 2.1.0, no exception is thrown for encrypted .doc documents.
protected boolean checkMsmime(InputStream stream) throws IOException, SAXException {
    Metadata metadata = new Metadata();
    ContentHandler handler = new DefaultHandler();
    ParseContext context = new ParseContext();
    BodyContentHandler bch = new BodyContentHandler();
    try {
        new AutoDetectParser().parse(stream, handler, metadata, context);
    } catch (TikaException e) {
        // doc Encryption protection
        if (e instanceof EncryptedDocumentException) {
            return true;
        }
        // office docx Encryption protection
        if (e.getCause() instanceof org.apache.poi.EncryptedDocumentException) {
            return true;
        }
        log.error(e);
        return false;
    } catch (IOException exception) {
        System.out.println("exception exception1 " + exception);
    } catch (SAXException exception) {
        System.out.println("exception exception2 " + exception);
    }
    return false;
}
In Tika 1.26, AutoDetectParser().parse() throws an exception if the .doc document is encrypted. After upgrading to 2.1.0, no exception is thrown and the file is treated as not encrypted.
Encrypted files in other formats still throw exceptions; only encrypted documents in .doc format no longer do.

How to train Open NLP without file

I have the following code for training the OpenNLP POS tagger:
Trainer(String trainingData, String modelSavePath, String dictionary) {
    try {
        dataIn = new MarkableFileInputStreamFactory(
                new File(trainingData));
        lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
        ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
        POSTaggerFactory fac = new POSTaggerFactory();
        if (dictionary != null && dictionary.length() > 0)
        {
            fac.setDictionary(new Dictionary(new FileInputStream(dictionary)));
        }
        model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), fac);
    } catch (IOException e) {
        // Failed to read or parse training data, training failed
        e.printStackTrace();
    } finally {
        if (lineStream != null) {
            try {
                lineStream.close();
            } catch (IOException e) {
                // Not an issue, training already finished.
                // The exception should be logged and investigated
                // if part of a production system.
                e.printStackTrace();
            }
        }
    }
}
This works just fine. Now, is it possible to do the same without involving files? I want to store the training data in a database, read it as a stream or in chunks, and feed it to the trainer. I do not want to create a temp file. Is this possible?

Yes. Instead of passing a FileInputStream to the Dictionary, you can create your own implementation of InputStream, say DatabaseSourceInputStream, and use that instead.
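For the training data itself, the question's code reads through a stream factory rather than a raw InputStream, so the same idea applies one level up: the factory can hand out a fresh in-memory stream each time it is asked. A minimal sketch of that shape follows. StreamFactory is a local stand-in declared only so the example compiles with the JDK alone; it is assumed to mirror the single createInputStream() method of OpenNLP's InputStreamFactory. InMemoryStreamFactory and the sample text are hypothetical.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Local stand-in (assumption) for opennlp.tools.util.InputStreamFactory.
interface StreamFactory {
    InputStream createInputStream() throws IOException;
}

// Serves training text that was loaded from somewhere other than a file,
// for example a database column. Each call returns a new stream positioned
// at the start, so a consumer can re-read the data from the beginning.
final class InMemoryStreamFactory implements StreamFactory {
    private final byte[] data;

    InMemoryStreamFactory(String trainingText) {
        this.data = trainingText.getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public InputStream createInputStream() {
        return new ByteArrayInputStream(data);
    }

    // Demo helper: reads the first line from a freshly created stream.
    static String firstLine(StreamFactory factory) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(factory.createInputStream(), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StreamFactory factory =
                new InMemoryStreamFactory("The_DT dog_NN barks_VBZ\nIt_PRP runs_VBZ\n");
        System.out.println(firstLine(factory));
        // Asking again yields the same first line, because the stream is recreated.
        System.out.println(firstLine(factory));
    }
}
```

With OpenNLP on the classpath, such a class would implement InputStreamFactory directly and be passed to PlainTextByLineStream in place of MarkableFileInputStreamFactory. For data too large to hold in memory, createInputStream() could open a new database stream on each call instead.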

I am not able to parse IOS driver page source

I got the page source using:
String pageSource = driver.getPageSource();
Now I need to save this XML locally as a cache, so that I can read element attributes such as x and y from it instead of calling element.getAttribute("x"); every time. But I am not able to parse the pageSource XML because of some special characters. I cannot simply remove these characters, because the element value/text would then differ from the original. Appium does this the same way.

I was facing the same issue and resolved it with the code below, which I wrote myself and which works fine:
public static void removeEscapeCharacter(File xmlFile) {
    String pattern = "(\\\"([^=])*\\\")";
    String contentBuilder = null;
    try {
        contentBuilder = Files.toString(xmlFile, Charsets.UTF_8);
    } catch (IOException e1) {
        e1.printStackTrace();
    }
    if (contentBuilder == null)
        return;
    Pattern pattern2 = Pattern.compile(pattern);
    Matcher matcher = pattern2.matcher(contentBuilder);
    StrBuilder sb = new StrBuilder(contentBuilder);
    while (matcher.find()) {
        String str = matcher.group(1).substring(1, matcher.group(1).length() - 1);
        try {
            sb = sb.replaceFirst(StrMatcher.stringMatcher(str),
                    StringEscapeUtils.escapeXml(str));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    try {
        Writer output = null;
        output = new BufferedWriter(new FileWriter(xmlFile, false));
        output.write(sb.toString());
        output.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
If you run into this kind of problem, catch the parse failure, escape the special characters, and parse again:
try {
    doc = db.parse(fileContent);
} catch (Exception e) {
    removeEscapeCharacter(file);
    doc = db.parse(file);
}
It might work for you.

I was able to do the same using SAXParser with a handler added for this.
See: SAX Parser
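For the original goal of reading attributes such as x and y from the cached page source, XML held in a string can be parsed from memory with the JDK's DOM parser. A minimal sketch, assuming the page source is well-formed XML. The element name, attribute values, and class name below are invented sample data, not real Appium output.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration only.
final class PageSourceAttributes {

    // Parses `xml` and returns the value of `attribute` on the first element
    // named `tagName`, or null if no such element exists.
    static String firstAttribute(String xml, String tagName, String attribute) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            if (nodes.getLength() == 0) {
                return null;
            }
            return ((Element) nodes.item(0)).getAttribute(attribute);
        } catch (Exception e) {
            // ParserConfigurationException, SAXException, or IOException
            throw new IllegalArgumentException("Could not parse page source", e);
        }
    }

    public static void main(String[] args) {
        // Invented sample; a real page source would come from driver.getPageSource().
        String xml = "<AppiumAUT><UIAButton name=\"Save &amp; Exit\" x=\"12\" y=\"340\"/></AppiumAUT>";
        System.out.println(firstAttribute(xml, "UIAButton", "x"));
        System.out.println(firstAttribute(xml, "UIAButton", "name"));
    }
}
```

Because the parser resolves entities such as &amp;amp; while reading, correctly escaped text comes back unchanged, which avoids stripping characters from the source.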

How to compress the files in Blackberry?

In my application I use an HTML template and images for a browser field, and they are saved on the SD card. Now I want to compress those HTML and image files and send them to a PHP server. How can I compress the files and send them to the server? Some samples would help a lot.
I tried it this way. My code is:
EDIT:
private void zipthefile() {
    String out_path = "file:///SDCard/" + "newtemplate.zip";
    String in_path = "file:///SDCard/" + "newtemplate.html";
    InputStream inputStream = null;
    GZIPOutputStream os = null;
    try {
        // Open the source file; stop early if it does not exist
        // (otherwise inputStream would still be null in the copy loop).
        FileConnection fileConnection = (FileConnection) Connector.open(in_path);
        if (!fileConnection.exists()) {
            Dialog.alert("File not found: " + in_path);
            return;
        }
        inputStream = fileConnection.openInputStream();
        // Create the output file if necessary.
        FileConnection path = (FileConnection) Connector.open(out_path, Connector.READ_WRITE);
        if (!path.exists()) {
            path.create();
        }
        // Wrap the file's output stream so that written bytes are gzip-compressed.
        os = new GZIPOutputStream(path.openOutputStream());
        // Copy in 1 KB chunks rather than one byte at a time.
        byte[] buffer = new byte[1024];
        int count;
        while ((count = inputStream.read(buffer)) != -1) {
            os.write(buffer, 0, count);
        }
    } catch (Exception e) {
        Dialog.alert("" + e.toString());
    } finally {
        if (inputStream != null) {
            try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
                Dialog.alert("" + e.toString());
            }
        }
        if (os != null) {
            try {
                os.close();
            } catch (IOException e) {
                e.printStackTrace();
                Dialog.alert("" + e.toString());
            }
        }
    }
}
This code works fine for a single file, but I want to compress all the files (more than one) in the folder.

In case you are not familiar with them, the stream classes in Java follow the Decorator pattern: they are meant to be wrapped around other streams to perform additional tasks. For instance, a FileOutputStream lets you write bytes to a file. If you decorate it with a BufferedOutputStream, you also get buffering (chunks of data are held in RAM before finally being written to disk). If you decorate it with a GZIPOutputStream, you also get compression.
Example:
//To read compressed file:
InputStream is = new GZIPInputStream(new FileInputStream("full_compressed_file_path_here"));
//To write to a compressed file:
OutputStream os = new GZIPOutputStream(new FileOutputStream("full_compressed_file_path_here"));
This is a good tutorial covering basic I/O. Although it was written for Java SE, you'll find it useful since most things work the same on BlackBerry.
The BlackBerry API provides these classes:
GZIPInputStream
GZIPOutputStream
ZLibInputStream
ZLibOutputStream
If you need to convert between streams and byte arrays, use the IOUtilities class, or ByteArrayOutputStream and ByteArrayInputStream.
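The decorator chaining described above can be tried without a device. The sketch below is Java SE, using java.util.zip and in-memory streams. On BlackBerry the equivalent stream classes are the ones listed above, and their package and constructors should be checked against the BlackBerry API documentation. The class and method names here are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class (Java SE, not BlackBerry).
final class GzipRoundTrip {

    // Decorates an in-memory sink with GZIPOutputStream and writes `plain` through it.
    static byte[] compress(byte[] plain) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed at this point, so the trailer has been written.
        return sink.toByteArray();
    }

    // Decorates an in-memory source with GZIPInputStream and reads it to the end.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (InputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[1024];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                sink.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>template</body></html>".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(html);
        String restored = new String(decompress(packed), StandardCharsets.UTF_8);
        System.out.println(restored);
    }
}
```

Note that gzip compresses a single stream of bytes. Bundling several files into one upload needs either one compressed file per source file or a container format on top.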

Save and read file with stream on BlackBerry

Argument 'address' is the string "CepVizyonVersionFile", and after Connector.openDataInputStream(address) the program throws an exception with message:
no ' : ' in URL.
What format should address be in?
public void saveLocal(String fileString, String address) {
    try {
        DataOutputStream fos = Connector.openDataOutputStream(address); //openFileOutput(address);
        try {
            fos.write(fileString.getBytes());
        } finally {
            fos.close(); // close even if the write fails
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
public String readLocal(String address, int length) {
    byte[] buffer = new byte[length];
    String str = "";
    try {
        DataInputStream fis = Connector.openDataInputStream(address);
        try {
            // read() returns -1 at end of stream, so guard before using the count
            int read = fis.read(buffer);
            if (read > 0) {
                str = new String(buffer, 0, read);
            }
        } finally {
            fis.close();
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return str;
}

Where do you put your file? If it is on the media card, your address should be like this: "file:///SDCard/"+yourfilename.

The BlackBerry API documentation for Connector has an explanation of the format:
The parameter string that describes the target should conform to the URL format as described in RFC 2396. This takes the general form:
{scheme}:[{target}][{parms}]
where {scheme} is the name of a protocol such as http.
The {target} is normally some kind of network address.
Any {parms} are formed as a series of equates of the form ";x=y". Example: ";type=a".
and the supported schemes are listed as well:
comm
socket
udp
sms
mms
http
https
tls or ssl
Bluetooth Serial Port Profile
Since you want a file, you'll need to take a look at the package documentation for javax.microedition.io.file:
The format of the input string used to access a FileConnection through Connector.open() must follow the format for a fully qualified, absolute path file name as described in the file URL format as part of IETF RFCs 1738 & 2396. That RFC dictates that a file URL takes the form:
file://<host>/<path>
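The "no ':' in URL" message follows from that format: a bare name such as CepVizyonVersionFile has no scheme part. A small sketch of the check, and of building a file URL for the media card, is below. The helper names are hypothetical, the SDCard root is taken from the earlier answer, and a given device may use a different storage root.

```java
// Hypothetical helper class; plain string handling, no BlackBerry APIs involved.
final class FileUrls {
    private static final String SD_CARD_ROOT = "file:///SDCard/";

    // True if `address` has a non-empty "{scheme}:" prefix.
    static boolean hasScheme(String address) {
        return address.indexOf(':') > 0;
    }

    // Turns a bare file name into a file URL on the media card;
    // addresses that already carry a scheme are returned unchanged.
    static String toSdCardUrl(String address) {
        return hasScheme(address) ? address : SD_CARD_ROOT + address;
    }

    public static void main(String[] args) {
        System.out.println(hasScheme("CepVizyonVersionFile"));
        System.out.println(toSdCardUrl("CepVizyonVersionFile"));
    }
}
```

The resulting string is what would be passed as the address argument to saveLocal and readLocal in the question.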
