Apache Tika content issue - apache-tika

I am having a weird problem with Apache Tika. When I detect the file type first and then parse, I get no content. Code:
public static void main(String[] args) throws IOException, SAXException, TikaException {
File file = new File("sample.txt");
InputStream is = new FileInputStream(file);
TikaInputStream objectData = TikaInputStream.get(is);
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
TikaConfig config = TikaConfig.getDefaultConfig();
Detector detector = config.getDetector();
System.out.println("Hello : " + detector.detect(objectData, metadata).toString());
parser.parse(is, handler, metadata, context);
System.out.println("File Content :" + handler.toString());
}
But when I parse first and then detect the file type, I get the correct content. Code:
public static void main(String[] args) throws IOException, SAXException, TikaException {
File file = new File("sample.txt");
InputStream is = new FileInputStream(file);
TikaInputStream objectData = TikaInputStream.get(is);
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(is, handler, metadata, context);
System.out.println("File Content :" + handler.toString());
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
TikaConfig config = TikaConfig.getDefaultConfig();
Detector detector = config.getDetector();
System.out.println("Hello : " + detector.detect(objectData, metadata).toString());
}
Why is this happening, and is there any way around it? I need to manipulate the text according to the detected MIME type.
Edit: I think this may be a dependency problem. Which libraries does Tika's detection need on the classpath in order to work?
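One thing stands out in the first snippet: the detector reads through objectData, but the parser is then handed the raw stream is. If TikaInputStream buffers the stream it wraps in the way java.io.BufferedInputStream does (an assumption, not something confirmed in this thread), detection would already have pulled the file's bytes out of is, leaving the parser nothing to read. The sketch below reproduces that effect using only java.io classes. The class and method names are made up for illustration, and readAllBytes needs Java 9 or later.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: BufferedInputStream stands in for TikaInputStream.
class WrappedStreamDemo {

    // Stand-in for detection: look at the first bytes, then rewind the wrapper.
    static void peek(InputStream wrapper) throws IOException {
        wrapper.mark(1024);
        wrapper.read(new byte[8]);
        wrapper.reset();
    }

    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    // Same order as the failing snippet: peek via the wrapper, read the RAW stream.
    static String peekThenReadRaw(String text) throws IOException {
        InputStream raw = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        InputStream wrapper = new BufferedInputStream(raw);
        peek(wrapper);
        // The wrapper's buffer already swallowed the bytes of this small input.
        return readAll(raw);
    }

    // Peek via the wrapper, then keep reading from the SAME wrapper.
    static String peekThenReadWrapper(String text) throws IOException {
        InputStream raw = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        InputStream wrapper = new BufferedInputStream(raw);
        peek(wrapper);
        return readAll(wrapper);
    }

    public static void main(String[] args) throws IOException {
        System.out.println("raw after peek:     [" + peekThenReadRaw("hello tika") + "]");
        System.out.println("wrapper after peek: [" + peekThenReadWrapper("hello tika") + "]");
    }
}
```

If this is what is happening, passing objectData instead of is to parser.parse(...) in the first snippet should bring the content back, since the detector is expected to rewind the stream it was given.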

Related

Receiving an error: <identifier> expected

I'm trying to read a simple text file containing a short poem and then write each line to an output file, preceded by its line number.
I haven't figured out how to add the line numbers yet, but I keep getting an "<identifier> expected" error when I try to just write each line to the output file. Here's my code:
import java .io.File;
import java.ioFIleNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;
public class ReadFile
{
public static void main(String [] args)
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
}
in1.close();
out.close();
}
You have a typo in the FileNotFoundException import (it should be java.io.FileNotFoundException), and the closing } before in1.close(); is misplaced; it belongs after out.close();. Note that you are not handling any exceptions either.
I spotted a few issues:
// Added the throws FileNotFoundException
public static void main(String [] args) throws FileNotFoundException
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
// Close in the main body.
in1.close();
out.close();
}
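The question also asks how to put a line number in front of each line, which neither answer covers. A minimal sketch is to keep a counter next to the Scanner loop. The class name, the method name and the "n: " prefix format are choices made for this example, not anything from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper: prefixes every line read from a Scanner with its 1-based number.
class LineNumberer {

    static List<String> numberLines(Scanner in) {
        List<String> numbered = new ArrayList<>();
        int lineNumber = 1;
        while (in.hasNextLine()) {
            numbered.add(lineNumber + ": " + in.nextLine());
            lineNumber++;
        }
        return numbered;
    }

    public static void main(String[] args) {
        // A Scanner over a String keeps the demo independent of any file on disk.
        Scanner in = new Scanner("Jack be nimble\nJack be quick");
        for (String line : numberLines(in)) {
            System.out.println(line);
        }
        in.close();
    }
}
```

In the program above, the same counter would sit next to out.println(line1), with the Scanner built from new File("JackBeNimble.txt") as before.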

cannot parse pdf using Tika1.3 (+lucene4.2)

I'm trying to parse a PDF file and get its metadata and text, but I still don't get the results I want. I am sure it is a silly mistake, but I can't see it. The file d.pdf exists and is located in the project's root folder. The imports are also correct.
public class MultiParse {
public static void main(final String[] args) throws IOException,
SAXException, TikaException {
Parser parser = new AutoDetectParser();
File f = new File("d.pdf");
System.out.println("------------ Parsing a PDF:");
extractFromFile(parser, f);
}
private static void extractFromFile(final Parser parser,
final File f ) throws IOException, SAXException,
TikaException {
BodyContentHandler handler = new BodyContentHandler(10000000);
Metadata metadata = new Metadata();
InputStream is = TikaInputStream.get(f);
parser.parse(is, handler, metadata, new ParseContext());
for (String name : metadata.names()) {
System.out.println(name + ":\t" + metadata.get(name));
}
}
}
Output: no errors, but not much else either:
------------ Parsing a PDF:
Content-Type: application/pdf
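Two observations on the snippet. It never prints handler.toString(), so the extracted text would not show up even if parsing worked. Also, getting only Content-Type back is what one would expect if no PDF-specific parser ran. A possible cause, which is an assumption here and not confirmed in the thread, is that only tika-core is on the classpath and the parser implementations are missing. A quick way to check is to ask the class loader for the PDF parser class. The class name below is the one Tika's PDF parser uses as far as I know, and the helper itself is purely illustrative.

```java
// Hypothetical diagnostic: reports whether a class can be found on the classpath.
class ParserClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only look the class up, do not run static initializers.
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name of Tika's PDF parser; adjust if your Tika version differs.
        String pdfParser = "org.apache.tika.parser.pdf.PDFParser";
        System.out.println(pdfParser + " available: " + isOnClasspath(pdfParser));
    }
}
```

If this prints false inside the project, adding the Tika parsers artifact and its dependencies would be the first thing to try.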

A generic error occurred in GDI+ When uploading an image from Desktop application to web server using web api

I am trying to upload an image from my Windows desktop application (VB.NET) to a web server using Web API.
The code runs correctly on my local machine but fails on the web server with the error message "A generic error occurred in GDI+".
The following is the Web API code that accepts the image:
public void PostFile(ImageData objImage)
{
Image img = BytesToImage(objImage.ImageFile);
string ImageName = objImage.EmployeeGUID.ToString() + ".Jpg";
string FilePath = "";
FilePath = System.Web.HttpContext.Current.Server.MapPath("~/photo") ;
try {
img.Save(FilePath + '\\' + ImageName.ToString(), System.Drawing.Imaging.ImageFormat.Jpeg);
}
catch (Exception ex) {
}
}
public class ImageData
{
public long EmployeeCode;
public Guid EmployeeGUID;
public byte[] ImageFile ;
}
private Image BytesToImage(byte[] ImageBytes)
{
Image imgNew;
MemoryStream memImage = new MemoryStream(ImageBytes);
imgNew = Image.FromStream(memImage);
return imgNew;
}
The following is the VB.NET code (Windows Forms application) that uploads the image:
Public Sub SendFile()
Dim EmployeeGUID As GUID
Dim EmployeeCode As long
Dim ImagefileToSend As String
Dim objImage As ImageData
Dim client As New HttpClient
client.BaseAddress = New Uri(WebApiPath)
client.DefaultRequestHeaders.Accept.Add(New MediaTypeWithQualityHeaderValue("application/json"))
objImage = New ImageData()
objImage.EmployeeCode = EmployeeCode
objImage.EmployeeGUID = EmployeeGUID
objImage.ImageFile = ImageToBytes(Image.FromFile(ImagefileToSend))
Dim jsonFormatter As MediaTypeFormatter = New JsonMediaTypeFormatter()
Dim content As HttpContent = New ObjectContent(GetType(ImageData), objImage, jsonFormatter)
Dim result As System.Net.Http.HttpResponseMessage
Try
result = client.PostAsync("api/GetFile", content).Result
Catch ex As Exception
End Try
End Sub
Private Class ImageData
Public EmployeeCode As Long
Public EmployeeGUID As Guid
Public ImageFile As Byte()
End Class
Private Function ImageToBytes(ByVal image As Image) As Byte()
Dim memImage As New IO.MemoryStream
Dim bytImage() As Byte
image.Save(memImage, image.RawFormat)
' ToArray copies only the bytes written; GetBuffer can include unused trailing bytes
bytImage = memImage.ToArray()
Return bytImage
End Function
The photo directory did not have write permission. Now it is working correctly.

Understanding mahout classification output

I have trained a Mahout model for three categories (Category_A, Category_B, Category_C) using the 20newsgroups example, and now I want to classify my documents with it. Can somebody help me understand the output I am getting from this model?
Here is my output:
{0:-2813549.8786637094,1:-2651723.736745838,2:-2710651.7525975127}
According to this output the document's category is 1, but the expected category is 2. Am I on the right track, or is something missing in my code?
public class NaiveBayesClassifierExample {
public static void loadClassifier(String strModelPath, Vector v)
throws IOException {
Configuration conf = new Configuration();
NaiveBayesModel model = NaiveBayesModel.materialize(new Path(strModelPath), conf);
AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
Vector st = classifier.classifyFull(v);
System.out.println(st.asFormatString());
System.out.println(st.maxValueIndex());
st.asFormatString();
}
public static Vector createVect() throws IOException {
FeatureVectorEncoder encoder = new StaticWordValueEncoder("text");
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
String inputData=readData();
StringReader in = new StringReader(inputData);
TokenStream ts = analyzer.tokenStream("body", in);
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
Vector v1 = new RandomAccessSparseVector(100000);
while (ts.incrementToken()) {
char[] termBuffer = termAtt.buffer();
int termLen = termAtt.length();
String w = new String(termBuffer, 0, termLen);
encoder.addToVector(w, 1.0, v1);
}
v1.normalize();
return v1;
}
private static String readData() {
// TODO Auto-generated method stub
BufferedReader reader=null;
String line, results = "";
try{
reader = new BufferedReader(new FileReader("c:\\inputFile.txt"));
while( ( line = reader.readLine() ) != null)
{
results += line;
}
reader.close();
}
catch(Exception ex)
{
ex.printStackTrace();
}
return results;
}
public static void main(String[] args) throws IOException {
Vector v = createVect();
String mp = "E:\\Final_Model\\model";
loadClassifier(mp, v);
}
}
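On reading the output: as far as I understand it, classifyFull returns one score per label index, and maxValueIndex picks the index with the largest score. All three scores are negative, so the largest is the one closest to zero. The sketch below redoes that selection in plain Java with the numbers from the question. The class and method names are made up, and which category each index stands for depends on the label index written during training, which the question does not show.

```java
// Hypothetical helper: picks the index of the largest score, like maxValueIndex above.
class ScoreReader {

    static int bestIndex(double[] scores) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Scores copied from the question's output, in index order 0, 1, 2.
        double[] scores = {-2813549.8786637094, -2651723.736745838, -2710651.7525975127};
        System.out.println("best index: " + bestIndex(scores)); // prints "best index: 1"
    }
}
```

So the printed index 1 is consistent with the scores. Whether 1 is the wrong answer for this document is a separate question: it depends on the index-to-label mapping and on whether the input vector is built the same way as the training vectors were.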

Parsing Arabic text using SAX produces ? characters

I'm developing a LWUIT project in NetBeans to run on the BlackBerry platform. The project reads data from a .NET web service; I use ksoap2 and a SAX parser. The parser looks like this:
public static Vector ParseSAX(String input ,final String[] elements) {
final Vector values = new Vector();
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
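DefaultHandler handler = new DefaultHandler() {
// Set when a wanted element starts; the next characters() call stores its text
boolean flag = false;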
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
for(int u = 0;u < elements.length;u++){
if (qName.equalsIgnoreCase(elements[u].toString())) {
flag = true;
}
}
}
public void endElement(String uri, String localName,
String qName) throws SAXException {
}
public void characters(char ch[], int start, int length) throws SAXException {
if (flag) {
values.addElement(new String(ch, start, length));
flag = false;
}
}
};
InputStreamReader inputStream = new InputStreamReader(new ByteArrayInputStream(input.getBytes()), "UTF-8");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setCharacterStream(inputStream);
saxParser.parse(is, handler);
} catch (Exception e) {
e.printStackTrace();
}
return values;
}
I need to parse Arabic characters.
I converted the project encoding to UTF-8, set javac.encoding=UTF-8 in project.properties, and added runtime.encoding=UTF-8 in private.properties.
If I put this code in an isolated project, it runs fine.
If I add it to the BlackBerry project or a web project, it produces ? characters instead.
What can I do?
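One line worth checking is input.getBytes(). With no argument it encodes using the platform default charset, while the reader wrapped around it decodes as UTF-8. If the default charset in the BlackBerry or web environment cannot represent Arabic (an assumption about those environments, not something the question states), every Arabic character would become ? before the parser sees it. The sketch below shows the effect, with ISO-8859-1 standing in for such a default. The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: encode with one charset, decode as UTF-8, compare with the input.
class CharsetRoundTrip {

    static String roundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Five Arabic letters, written as escapes so the source file encoding does not matter.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic)); // prints "true"
        System.out.println(roundTrip(arabic, StandardCharsets.ISO_8859_1));            // prints "?????"
    }
}
```

If this is the cause, calling input.getBytes("UTF-8") in ParseSAX, or giving the InputSource a StringReader over input so that no byte conversion happens at all, should keep the Arabic text intact.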

Resources