I'm trying to read a simple text file containing a small poem and then send each line to an output file, preceded by its line number.
I haven't figured out how to add the line numbers yet, but I keep getting an "identifier expected" error even when I just try to send each line to the output file. Here's my code:
import java .io.File;
import java.ioFIleNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;
public class ReadFile
{
public static void main(String [] args)
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
}
in1.close();
out.close();
}
You have a typo in the FileNotFoundException import (it should be java.io.FileNotFoundException), and the closing } before in1.close(); is misplaced; it should come after out.close();. Note that you are not handling any exceptions either.
I spotted a few issues; here is a corrected version:
// Added the throws FileNotFoundException
public static void main(String [] args) throws FileNotFoundException
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
// Close in the main body.
in1.close();
out.close();
}
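To add the line numbers you mentioned, one simple approach is a counter in the read loop (a minimal sketch; the numbering format is my own choice, not from your post):

// Keep a counter and prefix each line before writing it out
int lineNumber = 1;
while (in1.hasNextLine()) {
    String line1 = in1.nextLine();
    out.println(lineNumber + ": " + line1); // e.g. "1: Jack be nimble"
    lineNumber++;
}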
I have a directory with 99 files. I want to read these files and compute a SHA-256 checksum for each one. Eventually I want to output them to a JSON file as key-value pairs, for example (File 1, 092180x0123). Currently I am having trouble passing my ParDo function a readable file; I must be missing something very easy. This is my first time using Apache Beam, so any help would be amazing. Here is what I have so far:
public class BeamPipeline {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
p
.apply("Match Files", FileIO.match().filepattern("../testdata/input-*"))
.apply("Read Files", FileIO.readMatches())
.apply("Hash File",ParDo.of(new DoFn<FileIO.ReadableFile, KV<FileIO.ReadableFile, String>>() {
@ProcessElement
public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<KV<FileIO.ReadableFile, String>> out) throws
NoSuchAlgorithmException, IOException {
// File -> Bytes
String strfile = file.toString();
byte[] byteFile = strfile.getBytes();
// SHA-256
MessageDigest md = MessageDigest.getInstance("SHA-256");
byte[] messageDigest = md.digest(byteFile);
BigInteger no = new BigInteger(1, messageDigest);
String hashtext = no.toString(16);
while(hashtext.length() < 32) {
hashtext = "0" + hashtext;
}
out.output(KV.of(file, hashtext));
}
}))
.apply(FileIO.write());
p.run();
}
}
One example that produces a KV pair containing the matched filename (from the file metadata) and the corresponding SHA-256 of the whole file (instead of reading it line by line):
p
.apply("Match Filenames", FileIO.match().filepattern(options.getInput()))
.apply("Read Matches", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction <ReadableFile, KV<String,String>>() {
public KV<String,String> apply(ReadableFile f) {
String temp = null;
try{
temp = f.readFullyAsUTF8String();
}catch(IOException e){
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
return KV.of(f.getMetadata().resourceId().toString(), sha256hex);
}
}
))
.apply("Print results", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
Log.info(String.format("File: %s, SHA-256: %s ", c.element().getKey(), c.element().getValue()));
}
}
));
Full code here. The output in my case was:
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file1, SHA-256: e27cf439835d04081d6cd21f90ce7b784c9ed0336d1aa90c70c8bb476cd41157
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file2, SHA-256: 72113bf9fc03be3d0117e6acee24e3d840fa96295474594ec8ecb7bbcb5ed024
I verified this with an online hashing tool.
By the way, I don't think you need OutputReceiver for a single output (no side outputs). Thanks to these questions/answers, which were helpful: 1, 2, 3.
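If you also want the JSON file mentioned in the question, one possible sketch (output path and JSON field names are my own assumptions) is to map each KV to a JSON line and write it with TextIO, in place of the "Print results" step:

.apply("Format as JSON", MapElements.via(new SimpleFunction<KV<String, String>, String>() {
    public String apply(KV<String, String> kv) {
        // Manual formatting for brevity; a JSON library could be used instead
        return String.format("{\"file\": \"%s\", \"sha256\": \"%s\"}", kv.getKey(), kv.getValue());
    }
}))
.apply("Write JSON lines", TextIO.write().to("hashes").withSuffix(".json"));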
How can Univocity Parsers read a .csv file when the headers are not on the first line?
There are errors if the first line in the .csv file is not the headers.
The code and stack trace are below.
Any help would be greatly appreciated.
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.common.processor.*;
import com.univocity.parsers.csv.*;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.lang.IllegalStateException;
import java.lang.String;
import java.util.List;
public class UnivocityParsers {
public Reader getReader(String relativePath) {
try {
return new InputStreamReader(this.getClass().getResourceAsStream(relativePath), "Windows-1252");
} catch (UnsupportedEncodingException e) {
throw new IllegalStateException("Unable to read input", e);
}
}
public void columnSelection() {
RowListProcessor rowProcessor = new RowListProcessor();
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setRowProcessor(rowProcessor);
parserSettings.setHeaderExtractionEnabled(true);
parserSettings.setLineSeparatorDetectionEnabled(true);
parserSettings.setSkipEmptyLines(true);
// Here we select only the columns "AUTHOR" and "ISBN".
// The parser just skips the other fields
parserSettings.selectFields("AUTHOR", "ISBN");
CsvParser parser = new CsvParser(parserSettings);
parser.parse(getReader("list2.csv"));
List<String[]> rows = rowProcessor.getRows();
String[] strings = rows.get(0);
System.out.print(strings[0]);
}
public static void main(String arg[]) {
UnivocityParsers univocityParsers = new UnivocityParsers();
univocityParsers.columnSelection();
}
}
Stack trace:
Exception in thread "main" com.univocity.parsers.common.TextParsingException: Error processing input: java.lang.IllegalStateException - Unknown field names: [author, isbn]. Available fields are: [list of books by author - created today]
Here is the file being parsed:
List of books by Author - Created today
"REVIEW_DATE","AUTHOR","ISBN","DISCOUNTED_PRICE"
"1985/01/21","Douglas Adams",0345391802,5.95
"1990/01/12","Douglas Hofstadter",0465026567,9.95
"1998/07/15","Timothy ""The Parser"" Campbell",0968411304,18.99
"1999/12/03","Richard Friedman",0060630353,5.95
"2001/09/19","Karen Armstrong",0345384563,9.95
"2002/06/23","David Jones",0198504691,9.95
"2002/06/23","Julian Jaynes",0618057072,12.50
"2003/09/30","Scott Adams",0740721909,4.95
"2004/10/04","Benjamin Radcliff",0804818088,4.95
"2004/10/04","Randel Helms",0879755725,4.50
As of today, on 2.0.0-SNAPSHOT you can do this:
settings.setNumberOfRowsToSkip(1);
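For context, a rough sketch of where that call fits with the settings from your question (assuming the 2.x API):

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setNumberOfRowsToSkip(1);         // skip the "List of books by Author..." title line
parserSettings.setHeaderExtractionEnabled(true); // headers are then read from the next row
parserSettings.selectFields("AUTHOR", "ISBN");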
On version 1.5.6 you can do this to skip the first line and correctly grab the headers:
RowListProcessor rowProcessor = new RowListProcessor(){
@Override
public void processStarted(ParsingContext context) {
super.processStarted(context);
context.skipLines(1);
}
};
An alternative is to comment out the first line of your input file (if you have control over how the file is generated) by adding a # at the beginning of the line you want to discard:
#List of books by Author - Created today
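If you go that route, the comment character can also be set explicitly on the parser format (a small sketch, assuming the same settings object as above):

parserSettings.getFormat().setComment('#'); // lines starting with '#' are discarded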
I am writing a class to recursively extract files from inside a zip file and produce them to a Kafka queue for further processing. My intent is to be able to extract files from multiple levels of nested zips. The code below is my implementation of the Tika ContainerExtractor to do this.
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.Stack;
import org.apache.commons.lang.StringUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.ContainerExtractor;
import org.apache.tika.extractor.EmbeddedResourceHandler;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pkg.PackageParser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class UberContainerExtractor implements ContainerExtractor {
/**
*
*/
private static final long serialVersionUID = -6636138154366178135L;
// statically populate SUPPORTED_TYPES
static {
Set<MediaType> supportedTypes = new HashSet<MediaType>();
ParseContext context = new ParseContext();
supportedTypes.addAll(new PackageParser().getSupportedTypes(context));
SUPPORTED_TYPES = Collections.unmodifiableSet(supportedTypes);
}
/**
* A stack that maintains the parent filenames for the recursion
*/
Stack<String> parentFileNames = new Stack<String>();
/**
* The default tika parser
*/
private final Parser parser;
/**
* Default tika detector
*/
private final Detector detector;
/**
* The supported container types into which we can recurse
*/
public final static Set<MediaType> SUPPORTED_TYPES;
/**
* The number of documents recursively extracted from the container and its
* children containers if present
*/
int extracted;
public UberContainerExtractor() {
this(TikaConfig.getDefaultConfig());
}
public UberContainerExtractor(TikaConfig config) {
this(new DefaultDetector(config.getMimeRepository()));
}
public UberContainerExtractor(Detector detector) {
this.parser = new AutoDetectParser(new PackageParser());
this.detector = detector;
}
public boolean isSupported(TikaInputStream input) throws IOException {
MediaType type = detector.detect(input, new Metadata());
return SUPPORTED_TYPES.contains(type);
}
@Override
public void extract(TikaInputStream stream, ContainerExtractor recurseExtractor, EmbeddedResourceHandler handler)
throws IOException, TikaException {
ParseContext context = new ParseContext();
context.set(Parser.class, new RecursiveParser(recurseExtractor, handler));
try {
Metadata metadata = new Metadata();
parser.parse(stream, new DefaultHandler(), metadata, context);
} catch (SAXException e) {
throw new TikaException("Unexpected SAX exception", e);
}
}
private class RecursiveParser extends AbstractParser {
/**
*
*/
private static final long serialVersionUID = -7260171956667273262L;
private final ContainerExtractor extractor;
private final EmbeddedResourceHandler handler;
private RecursiveParser(ContainerExtractor extractor, EmbeddedResourceHandler handler) {
this.extractor = extractor;
this.handler = handler;
}
public Set<MediaType> getSupportedTypes(ParseContext context) {
return parser.getSupportedTypes(context);
}
public void parse(InputStream stream, ContentHandler ignored, Metadata metadata, ParseContext context)
throws IOException, SAXException, TikaException {
TemporaryResources tmp = new TemporaryResources();
try {
TikaInputStream tis = TikaInputStream.get(stream, tmp);
// Figure out what we have to process
String filename = metadata.get(Metadata.RESOURCE_NAME_KEY);
MediaType type = detector.detect(tis, metadata);
if (extractor == null) {
// do nothing
} else {
// Use a temporary file to process the stream
File file = tis.getFile();
System.out.println("file is directory = " + file.isDirectory());
// Recurse and extract if the filetype is supported
if (SUPPORTED_TYPES.contains(type)) {
System.out.println("encountered a supported file:" + filename);
parentFileNames.push(filename);
extractor.extract(tis, extractor, handler);
parentFileNames.pop();
} else { // produce the file
List<String> parentFilenamesList = new ArrayList<String>(parentFileNames);
parentFilenamesList.add(filename);
String originalFilepath = StringUtils.join(parentFilenamesList, "/");
System.out.println("producing " + filename + " with originalFilepath:" + originalFilepath
+ " to kafka queue");
++extracted;
}
}
} finally {
tmp.dispose();
}
}
}
public int getExtracted() {
return extracted;
}
public static void main(String[] args) throws IOException, TikaException {
String filename = "/Users/rohit/Data/cd.zip";
File file = new File(filename);
TikaInputStream stream = TikaInputStream.get(file);
ContainerExtractor recursiveExtractor = new UberContainerExtractor();
EmbeddedResourceHandler resourceHandler = new EmbeddedResourceHandler() {
@Override
public void handle(String filename, MediaType mediaType, InputStream stream) {
// do nothing
}
};
recursiveExtractor.extract(stream, recursiveExtractor, resourceHandler);
stream.close();
System.out.println("extracted " + ((UberContainerExtractor) recursiveExtractor).getExtracted() + " files");
}
}
It works on multiple levels of zip as long as the files inside the zips are in a flat structure, for example:
cd.zip
- c.txt
- d.txt
The code does not work if the files in the zip are inside a directory, for example:
ab.zip
- ab/
- a.txt
- b.txt
While debugging, I came across the following code snippet in the PackageParser:
try {
ArchiveEntry entry = ais.getNextEntry();
while (entry != null) {
if (!entry.isDirectory()) {
parseEntry(ais, entry, extractor, xhtml);
}
entry = ais.getNextEntry();
}
} finally {
ais.close();
}
I tried commenting out the if condition, but it did not work. Is there a reason for this check? Is there any way of getting around it?
I am using tika version 1.6
Tackling your question in reverse order:
Is there a reason for this check?
Entries in zip files are either directories or files. File entries include the name of the directory they come from, so Tika doesn't need to do anything with the directory entries; all it needs to do is process the embedded files as and when they come up.
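To illustrate (a minimal sketch using plain java.util.zip rather than Tika, with the ab.zip layout from your question):

import java.util.zip.ZipFile;

public class ZipEntryNames {
    public static void main(String[] args) throws Exception {
        // Directory entries carry no content; file entries already include the full path
        try (ZipFile zip = new ZipFile("ab.zip")) {
            zip.stream()
               .filter(e -> !e.isDirectory())
               .forEach(e -> System.out.println(e.getName())); // prints ab/a.txt and ab/b.txt
        }
    }
}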
The code does not work if the files in the zip are inside a directory, for example: ab.zip - ab/ - a.txt - b.txt
You seem to be doing something wrong then. Tika's recursion and package parser handle zips with folders in them just fine!
To prove this, start with a zip file like this:
$ unzip -l ../tt.zip
Archive: ../tt.zip
Length Date Time Name
--------- ---------- ----- ----
0 2015-02-03 16:42 t/
0 2015-02-03 16:42 t/t2/
0 2015-02-03 16:42 t/t2/t3/
164404 2015-02-03 16:42 t/t2/t3/test.jpg
--------- -------
164404 4 files
Now, make use of the -z extraction flag of the Tika App, which causes Tika to extract all of the embedded contents of a file. Run like that, and we get:
$ java -jar tika-app-1.7.jar -z ../tt.zip
Extracting 't/t2/t3/test.jpg' (image/jpeg) to ./t/t2/t3/test.jpg
Then list the resulting directory, and we see
$ find . -type f
./t/t2/t3/Test.jpg
I can't see what's wrong with your code, but sadly for you we've shown that the problem is there and not with Tika. You'd be best off reviewing the various examples of recursion that Tika provides, such as the Tika App tool and the RecursiveParserWrapper, then rewriting your code to be something simple based on those.
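For reference, a rough sketch of the RecursiveParserWrapper approach under the Tika 1.x API (handler settings and exact signatures should be checked against your Tika version; the file path is just an example):

import java.io.File;
import java.io.InputStream;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public class RecursiveZipListing {
    public static void main(String[] args) throws Exception {
        // Wrap AutoDetectParser so every embedded document gets its own Metadata entry
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(
                new AutoDetectParser(),
                new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
        Metadata metadata = new Metadata();
        try (InputStream stream = TikaInputStream.get(new File("ab.zip"))) {
            wrapper.parse(stream, new DefaultHandler(), metadata, new ParseContext());
        }
        // One Metadata object per document, including files nested in folders and inner zips
        for (Metadata m : wrapper.getMetadata()) {
            System.out.println(m.get(Metadata.RESOURCE_NAME_KEY));
        }
    }
}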
I want to download a picture from a URL into my Lotus Notes application.
I can get a text field from the URL, but the image is difficult.
I tried to put the picture into a rich text field, but it doesn't work.
Any ideas?
You can download an image from URL via LotusScript with the help of a little Script Library of type "Java".
Create a Script Library "GetImageFromUrl" of Type "Java" and put in following code:
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
public class GetImageFromUrl {
public static boolean getImageFromUrl(String imageUrl, String filePath) {
try {
URL url = new URL(imageUrl);
InputStream is = url.openStream();
OutputStream os = new FileOutputStream(filePath);
byte[] b = new byte[2048];
int length;
while ((length = is.read(b)) != -1) {
os.write(b, 0, length);
}
is.close();
os.close();
return true;
} catch (Exception e) {
e.printStackTrace();
return false;
}
}
}
Then you can use the method getImageFromUrl(imageUrl, filePath) in your LotusScript code to download the image to a file. From there you can attach the image file to a RichText item with rtitem.EmbedObject(EMBED_ATTACHMENT, "", "c:/temp/image.jpg").
Option Declare
UseLSX "*javacon"
Use "GetImageFromUrl"
Sub Initialize
Dim jSession As New JavaSession
Dim jClass As JavaClass
Set jClass = jSession.GetClass( "GetImageFromUrl" )
If jClass.getImageFromUrl("https://your.url", "c:/temp/image.jpg") Then
MessageBox "File is downloaded"
End If
End Sub
I'm trying to parse a PDF file and get its metadata and text, but I don't get the results I want. I am sure it is a silly mistake, but I can't see it. The file d.pdf exists and is located in the project's root folder. The imports are also correct.
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class MultiParse {
public static void main(final String[] args) throws IOException,
SAXException, TikaException {
Parser parser = new AutoDetectParser();
File f = new File("d.pdf");
System.out.println("------------ Parsing a PDF:");
extractFromFile(parser, f);
}
private static void extractFromFile(final Parser parser,
final File f ) throws IOException, SAXException,
TikaException {
BodyContentHandler handler = new BodyContentHandler(10000000);
Metadata metadata = new Metadata();
InputStream is = TikaInputStream.get(f);
parser.parse(is, handler, metadata, new ParseContext());
for (String name : metadata.names()) {
System.out.println(name + ":\t" + metadata.get(name));
}
}
}
OUTPUT: no errors, but not much else either:
------------ Parsing a PDF:
Content-Type: application/pdf