cannot parse pdf using Tika1.3 (+lucene4.2) - parsing

I'm trying to parse a PDF file and get its metadata and text, but I still don't get the expected results. I'm sure it is a silly mistake, but I can't see it. The file d.pdf exists and is located in the project's root folder. The imports are also correct.
public class MultiParse {
public static void main(final String[] args) throws IOException,
SAXException, TikaException {
Parser parser = new AutoDetectParser();
File f = new File("d.pdf");
System.out.println("------------ Parsing a PDF:");
extractFromFile(parser, f);
}
private static void extractFromFile(final Parser parser,
final File f ) throws IOException, SAXException,
TikaException {
BodyContentHandler handler = new BodyContentHandler(10000000);
Metadata metadata = new Metadata();
InputStream is = TikaInputStream.get(f);
parser.parse(is, handler, metadata, new ParseContext());
for (String name : metadata.names()) {
System.out.println(name + ":\t" + metadata.get(name));
}
}
}
Output: no errors, but not much either :(
------------ Parsing a PDF:
Content-Type: application/pdf

Related

How do I get a Readable File?

I have a directory filled with 99 files. I want to read these files and hash each one into a SHA-256 checksum. Eventually I want to output them to a JSON file as key-value pairs, for example (File 1, 092180x0123). Currently I am having trouble passing my ParDo function a readable file; I must be missing something very easy. This is my first time using Apache Beam, so any help would be amazing. Here is what I have so far:
public class BeamPipeline {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
p
.apply("Match Files", FileIO.match().filepattern("../testdata/input-*"))
.apply("Read Files", FileIO.readMatches())
.apply("Hash File",ParDo.of(new DoFn<FileIO.ReadableFile, KV<FileIO.ReadableFile, String>>() {
@ProcessElement
public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<KV<FileIO.ReadableFile, String>> out) throws
NoSuchAlgorithmException, IOException {
// File -> Bytes
String strfile = file.toString();
byte[] byteFile = strfile.getBytes();
// SHA-256
MessageDigest md = MessageDigest.getInstance("SHA-256");
byte[] messageDigest = md.digest(byteFile);
BigInteger no = new BigInteger(1, messageDigest);
String hashtext = no.toString(16);
while(hashtext.length() < 32) {
hashtext = "0" + hashtext;
}
out.output(KV.of(file, hashtext));
}
}))
.apply(FileIO.write());
p.run();
}
}
One example to have a KV pair containing the matched filename (from MetadataResult) and the corresponding SHA-256 of the whole file (instead of reading it line by line):
p
.apply("Match Filenames", FileIO.match().filepattern(options.getInput()))
.apply("Read Matches", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction <ReadableFile, KV<String,String>>() {
public KV<String,String> apply(ReadableFile f) {
String temp;
try{
temp = f.readFullyAsUTF8String();
}catch(IOException e){
// Fail loudly instead of swallowing the error and hashing a null string
throw new RuntimeException(e);
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
return KV.of(f.getMetadata().resourceId().toString(), sha256hex);
}
}
))
.apply("Print results", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
Log.info(String.format("File: %s, SHA-256: %s ", c.element().getKey(), c.element().getValue()));
}
}
));
Full code here. The output in my case was:
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file1, SHA-256: e27cf439835d04081d6cd21f90ce7b784c9ed0336d1aa90c70c8bb476cd41157
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file2, SHA-256: 72113bf9fc03be3d0117e6acee24e3d840fa96295474594ec8ecb7bbcb5ed024
I verified both hashes with an online hashing tool.
By the way, I don't think you need OutputReceiver for a single output (no side outputs).
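For reference, the hashing step on its own can be sketched with the JDK alone. This is a minimal sketch, not the answerer's code: the class name Sha256Hex and the method sha256Hex are hypothetical. It always produces 64 hex characters, because a SHA-256 digest is 32 bytes; the `< 32` padding check in the question would leave hashes with leading zero bytes too short.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Hex {

    // Returns the SHA-256 digest of the given bytes as 64 lowercase hex characters.
    public static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so every byte becomes exactly two hex digits
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
        System.out.println(sha256Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Inside a Beam transform, the bytes would presumably come from the file contents (for example ReadableFile.readFullyAsBytes()) rather than from file.toString(), which only describes the file.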

Issue in fetching data from arrays within an array

I want to fetch data from a web service, but it shows this error:
D/exception: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was BEGIN_ARRAY at line 1 column 3 path $[0]
The response is an array that contains two arrays. I tried parsing it but was unable to get the data. The data shows up in stringResponse, but I don't know why parsing it with Gson is not working.
Model Class
MainClass
private void getStringResponse() {
String url = "http://www.mocky.io/v2/5bd7f4683100003508474b3d";
StringRequest request = new StringRequest(
Request.Method.GET,
url,
new com.android.volley.Response.Listener<String>() {
@Override
public void onResponse(String response) {
String stringResponse = response.toString();
Gson gson = new Gson();
try{
ModelOne[] data = gson.fromJson(stringResponse, ModelOne[].class);
List<ModelOne.First> first = new ArrayList<>();
first = data[0].getZero();
Log.d("data", first.get(0).getFullName());
} catch (Exception e){
Log.d("exception", e.toString());
}
}
}, new com.android.volley.Response.ErrorListener() {
@Override
public void onErrorResponse(VolleyError error) {
Log.d("error", error.toString());
}
});
RequestQueue requestQueue = Volley.newRequestQueue(this);
requestQueue.add(request);
}
I just need some guidance on how to handle and parse data when you have arrays within an array.
The site said my question contained too much code, which is why I uploaded a Dropbox link.

Apache Tika content issue

I am having a weird problem with Apache Tika. When I detect the file type first and then parse, I do not get the content. Code:
public static void main(String[] args) throws IOException, SAXException, TikaException {
File file = new File("sample.txt");
InputStream is = new FileInputStream(file);
TikaInputStream objectData = TikaInputStream.get(is);
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
TikaConfig config = TikaConfig.getDefaultConfig();
Detector detector = config.getDetector();
System.out.println("Hello : " + detector.detect(objectData, metadata).toString());
parser.parse(is, handler, metadata, context);
System.out.println("File Content :" + handler.toString());
}
But when I parse first and then detect the file type, I get the correct content. Code:
public static void main(String[] args) throws IOException, SAXException, TikaException {
File file = new File("sample.txt");
InputStream is = new FileInputStream(file);
TikaInputStream objectData = TikaInputStream.get(is);
Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
parser.parse(is, handler, metadata, context);
System.out.println("File Content :" + handler.toString());
metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
TikaConfig config = TikaConfig.getDefaultConfig();
Detector detector = config.getDetector();
System.out.println("Hello : " + detector.detect(objectData, metadata).toString());
}
Why is this happening? Is there any way around it? I need to manipulate the text according to the detected MIME type.
Edit: I think this might be a dependency problem. Which libraries does Tika detection need on the classpath?
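One possible explanation, offered as an assumption rather than a confirmed diagnosis: in the first snippet, detection reads from objectData, which wraps is, so by the time parser.parse(is, ...) runs the underlying stream may already be consumed. Passing the same wrapper (objectData) to both detect and parse would be the first thing to try. The JDK-only sketch below reproduces the effect with a BufferedInputStream standing in for TikaInputStream; StreamPeekDemo, peek and readAll are hypothetical names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamPeekDemo {

    // Reads up to n bytes (as a detector would), then rewinds the wrapper.
    public static String peek(BufferedInputStream wrapper, int n) {
        try {
            wrapper.mark(n);
            byte[] head = new byte[n];
            int read = wrapper.read(head, 0, n);
            wrapper.reset();
            return new String(head, 0, Math.max(read, 0), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains whatever is left in the stream and decodes it as UTF-8.
    public static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream("hello world".getBytes(StandardCharsets.UTF_8));
        BufferedInputStream wrapper = new BufferedInputStream(raw);

        System.out.println("peeked:  " + peek(wrapper, 5));
        // The wrapper pulled the bytes into its own buffer, so the raw stream is empty
        System.out.println("raw:     [" + readAll(raw) + "]");
        // The wrapper was rewound by reset(), so it still yields the full content
        System.out.println("wrapper: [" + readAll(wrapper) + "]");
    }
}
```

The rewind only applies to the wrapper that was marked and reset; the stream underneath keeps its advanced position, which would match the empty content seen when detection runs first.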

Receiving an error: <identifier> expected

I'm trying to read a simple text file containing a small poem and then send each line to the output file, preceded by its line number.
I haven't figured out how to add the line numbers yet, but I keep receiving the <identifier> expected error when I try to just send each line to the output file. Here's my code:
import java .io.File;
import java.ioFIleNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;
public class ReadFile
{
public static void main(String [] args)
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
}
in1.close();
out.close();
}
You have a typo in the FileNotFoundException import (it should be java.io.FileNotFoundException), and the closing } before in1.close(); is misplaced; it should come after out.close();. Note that you are not handling any exceptions either.
I spotted a few issues:
// Added the throws FileNotFoundException
public static void main(String [] args) throws FileNotFoundException
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
// Close in the main body.
in1.close();
out.close();
}
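The question also mentions line numbers, which neither answer covers. A minimal sketch of that part, assuming a simple "N: text" format; the class NumberLines and the method copyWithLineNumbers are hypothetical names, and the demo reads from a string instead of JackBeNimble.txt so it runs without any files:

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Scanner;

public class NumberLines {

    // Copies every line from in to out, prefixed with a 1-based line number.
    public static void copyWithLineNumbers(Scanner in, PrintWriter out) {
        int lineNumber = 1;
        while (in.hasNextLine()) {
            out.println(lineNumber + ": " + in.nextLine());
            lineNumber++;
        }
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        // try-with-resources closes both objects even if an exception is thrown
        try (Scanner in = new Scanner("Jack be nimble,\nJack be quick");
             PrintWriter out = new PrintWriter(buffer)) {
            copyWithLineNumbers(in, out);
        }
        System.out.print(buffer);
    }
}
```

With real files, the same method can be called with new Scanner(new File("JackBeNimble.txt")) and new PrintWriter("JBN_LineByLine.txt"), with main declared to throw FileNotFoundException as in the answer above.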

Parsing arabic text using Sax produce?

I'm developing a LWUIT project in NetBeans to run in a BlackBerry environment. The project reads data from a .NET web service; I use ksoap2 and a SAX parser. The parser looks like this:
public static Vector ParseSAX(String input ,final String[] elements) {
final Vector values = new Vector();
try {
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
DefaultHandler handler = new DefaultHandler() {
public void startElement(String uri, String localName, String qName,
Attributes attributes) throws SAXException {
for(int u = 0;u < elements.length;u++){
if (qName.equalsIgnoreCase(elements[u].toString())) {
flag = true;
}
}
}
public void endElement(String uri, String localName,
String qName) throws SAXException {
}
public void characters(char ch[], int start, int length) throws SAXException {
if (flag) {
values.addElement(new String(ch, start, length));
flag = false;
}
}
};
InputStreamReader inputStream = new InputStreamReader(new ByteArrayInputStream(input.getBytes()), "UTF-8");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setCharacterStream(inputStream);
saxParser.parse(is, handler);
} catch (Exception e) {
e.printStackTrace();
}
return values;
}
I took care to handle Arabic characters.
By the way, I converted the project encoding to UTF-8, set javac.encoding=UTF-8 in project.properties, and added runtime.encoding=UTF-8 in private.properties.
If I put this code in an isolated project, it runs fine.
If I add it to the BB project or a web project, it produces ? characters.
What can I do?
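One thing worth checking, stated as an assumption rather than a confirmed cause: input.getBytes() encodes with the platform default charset, while the reader decodes the bytes as UTF-8. On a platform whose default is not UTF-8, Arabic characters can be lost at that step. Since the input is already a String, the byte round trip can be skipped by parsing from a StringReader. The sketch below targets desktop Java (generics, StringBuilder), not the BlackBerry runtime, and ArabicSaxDemo and textOf are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ArabicSaxDemo {

    // Collects the text content of every element with the given name.
    public static List<String> textOf(String xml, final String element) {
        final List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder current; // non-null while inside a matching element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equalsIgnoreCase(element)) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called several times per element, so accumulate
                if (current != null) {
                    current.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (current != null && qName.equalsIgnoreCase(element)) {
                    values.add(current.toString());
                    current = null;
                }
            }
        };
        try {
            // A character stream needs no charset: the String is never turned into bytes
            InputSource source = new InputSource(new StringReader(xml));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        // Unicode escapes for an Arabic word, so the source file encoding does not matter
        String xml = "<root><name>\u0645\u0631\u062d\u0628\u0627</name></root>";
        System.out.println(textOf(xml, "name"));
    }
}
```

If the bytes are needed anyway, naming the charset explicitly in getBytes, so that it matches the reader's UTF-8, avoids the same mismatch.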
