I have a directory filled with 99 files, I want to read these files and then hash them into a sha256 checksum. I eventually want to output them to a JSON file with a key-value pair so for example (File 1, 092180x0123). Currently I am having trouble passing my ParDo function a readable File I must be missing something very easy. This is my first time using Apache beam so any help would be amazing. Here is what I have so far
public class BeamPipeline {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
p
.apply("Match Files", FileIO.match().filepattern("../testdata/input-*"))
.apply("Read Files", FileIO.readMatches())
.apply("Hash File",ParDo.of(new DoFn<FileIO.ReadableFile, KV<FileIO.ReadableFile, String>>() {
#ProcessElement
public void processElement(#Element FileIO.ReadableFile file, OutputReceiver<KV<FileIO.ReadableFile, String>> out) throws
NoSuchAlgorithmException, IOException {
// File -> Bytes
String strfile = file.toString();
byte[] byteFile = strfile.getBytes();
// SHA-256
MessageDigest md = MessageDigest.getInstance("SHA-256");
byte[] messageDigest = md.digest(byteFile);
BigInteger no = new BigInteger(1, messageDigest);
String hashtext = no.toString(16);
while(hashtext.length() < 32) {
hashtext = "0" + hashtext;
}
out.output(KV.of(file, hashtext));
}
}))
.apply(FileIO.write());
p.run();
}
}
One example to have a KV pair containing the matched filename (from MetadataResult) and the corresponding SHA-256 of the whole file (instead of reading it line by line):
p
.apply("Match Filenames", FileIO.match().filepattern(options.getInput()))
.apply("Read Matches", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction <ReadableFile, KV<String,String>>() {
public KV<String,String> apply(ReadableFile f) {
String temp = null;
try{
temp = f.readFullyAsUTF8String();
}catch(IOException e){
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
return KV.of(f.getMetadata().resourceId().toString(), sha256hex);
}
}
))
.apply("Print results", ParDo.of(new DoFn<KV<String, String>, Void>() {
#ProcessElement
public void processElement(ProcessContext c) {
Log.info(String.format("File: %s, SHA-256: %s ", c.element().getKey(), c.element().getValue()));
}
}
));
Full code here. The output in my case was:
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file1, SHA-256: e27cf439835d04081d6cd21f90ce7b784c9ed0336d1aa90c70c8bb476cd41157
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file2, SHA-256: 72113bf9fc03be3d0117e6acee24e3d840fa96295474594ec8ecb7bbcb5ed024
Which I verified with an online hashing tool:
By the way I don't think you need OutputReceiver for a single output (no side outputs). Thanks to these questions/answers that were helpful: 1, 2, 3.
Related
I'm working with Apache Beam, trying to enrich data (based on this), but it seems that Beam has changed in a while, as GroupByKey does not work with unbounded sources (like PubSub) without windowing.
This is what I've got (overly simplified):
PCollection<String> input = pipeline.apply("Read pubsub",
PubsubIO.readStrings().fromTopic(options.getInputTopic()))
.apply("Log element", ParDo.of(new DoFn<String, String>() {
#ProcessElement
public void processElement(ProcessContext c) {
System.out.println(String.format("incomig %s", c.element()));
c.output(c.element());
}
}))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));
PCollection<KV<String, String>> incomingData = input
.apply("Apply Random Key", MapElements
.via(new SimpleFunction<String, KV<String, String>>() {
public KV<String, String> apply(String json) {
JSONObject jsonObject = new JSONObject(json);
System.out.println(String.format("JSON: %s, %s", jsonObject.getString("id"), jsonObject.get("usageRules")));
return KV.of(jsonObject.getString("id"), json);
}
})
);
PCollection<KV<String,String>> enrichedData = incomingData
.apply("Search in db",
JdbcIO.<KV<String,String>, KV<String,String>>readAll()
.withDataSourceConfiguration(config)
.withQuery("SELECT * FROM myTable WHERE id = ?")
.withParameterSetter((element, preparedStatement) ->
preparedStatement.setString(1, element.getKey())
)
.withRowMapper(resultSet -> {
System.out.println(String.format("Result from db: %s", resultSet.getString("id")));
return KV.of(resultSet.getString("id"), resultSet.getString("id"));
})
.withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));
GroupByKey.applicableTo(enrichedData);
TupleTag<String> CREATE_TAG = new TupleTag<>();
TupleTag<String> UPDATE_TAG = new TupleTag<>();
KeyedPCollectionTuple
.of(CREATE_TAG, incomingData)
.and(UPDATE_TAG, enrichedData)
.apply("Combine", CoGroupByKey.create())
.apply("Show data?", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
#ProcessElement
public void processElement(ProcessContext context) {
System.out.println("Print from CoGbkResult");
System.out.println(context.element().getKey());
System.out.println(context.element().getValue());
}
}));
At the moment, with windowing, getting incoming data, transforming it into JSONObject and searching in BD is working fine, the problem is that any .apply done after the JdbcIO.readAll is not working at all. The line "Print from CoGbkResult" just doesn't get printed at all.
I've tried modifying the window, adding other triggers, trying just to output a result just immediately, but it just stop at the RowMapper.
Thanks for your help
I want to write multibyte string (e.g. japanese) to Spanner from Dataflow Pipeline.
But it does not working.
Below is the code I tried.
(edited: I rewrote it that is closer to actual)
ParDo.of(new DoFn<TableRow, Mutation>() {
#ProcessElement
public void processElement(ProcessContext c) throws IOException {
TableRow row = c.element();
Mutation.WriteBuilder mutationWriteBuilder = Mutation.newInsertOrUpdateBuilder('testtable');
for (Entry<String, Object> entry : row.entrySet()) {
String columnName = entry.getKey();
Object value = entry.getValue();
Charset utf8 = StandardCharsets.UTF_8;
String str = new String(value.toString().getBytes(utf8), utf8);
mutationWriteBuilder.set(columnName).to(str);
}
Mutation mutation = mutationWriteBuilder.build();
c.output(mutation)
}
}
This pipeline will succeed, but the value actually written is a garbled string like '�'.
Am I doing something wrong?
I need to execute below operations in sequence as given:-
PCollection<String> read = p.apply("Read Lines",TextIO.read().from(options.getInputFile()))
.apply("Get fileName",ParDo.of(new DoFn<String,String>(){
ValueProvider<String> fileReceived = options.getfilename();
#ProcessElement
public void procesElement(ProcessContext c)
{
fileName = fileReceived.get().toString();
LOG.info("File: "+fileName);
}
}));
PCollection<TableRow> rows = p.apply("Read from BigQuery",
BigQueryIO.read()
.fromQuery("SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + fileName +"'")
.usingStandardSql());
How to accomplish this in Apache Beam/Dataflow?
It seems that you want to apply BigQueryIO.read().fromQuery() to a query that depends on a value available via a property of type ValueProvider<String> in your PipelineOptions, and the provider is not accessible at pipeline construction time - i.e. you are invoking your job via a template.
In that case, the proper solution is to use NestedValueProvider:
PCollection<TableRow> tableRows = p.apply(BigQueryIO.read().fromQuery(
NestedValueProvider.of(
options.getfilename(),
new SerializableFunction<String, String>() {
#Override
public String apply(String filename) {
return "SELECT table,schema FROM `DatasetID.TableID` WHERE file='" + fileName +"'";
}
})));
We don't know why when running this simple test, DataflowAssert fails:
#Test
#Category(RunnableOnService.class)
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
In order to simplify the DataflowAssert test we hardcoded the output of Table.join to match DataflowAssert,having:
private static final TableRow TABLEROW_TEST = new TableRow()
.set("id", "x");
static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
PCollection<TableRow> pCollectionTable2) throws Exception {
final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
PCollection<KV<String, String>> table1Data = pCollectionTable1
.apply(ParDo.of(new ExtractTable1DataFn()));
PCollection<KV<String, String>> table2Data = pCollectionTable2
.apply(ParDo.of(new ExtractTable2DataFn()));
PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
.of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
.apply(CoGroupByKey.<String> create());
PCollection<KV<String, String>> resultCollection = kvpCollection
.apply(ParDo.named("Process join")
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
private static final long serialVersionUID = 0;
#Override
public void processElement(ProcessContext c) {
// System.out.println(c);
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
String value = null;
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
}
}));
PCollection<TableRow> formattedResults = resultCollection.apply(
ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("id", "x");
c.output(row);
}
}));
return formattedResults;
}
Does anyone know what we are doing wrong?
I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks to see if the collection with two copies of row is the contains exactly one copy of row:
#Category(RunnableOnService.class)
#Test
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
TableRow row = new TableRow().set("id", "x");
PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
DataflowAssert.that(rows).containsInAnyOrder(row);
p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
// NOTE: this updates value, and doesn't output it. So for each
// key there will be a single output with the *last* value
// rather than one for each pair.
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
I'm trying to read a text simple file containing text of a small poem and then send each line to the output file, preceded by line numbers.
I haven't figured out how to add the line numbers yet, but I keep receiving the identifier expected error when I try to just send each line to the output file. Here's my code:
import java .io.File;
import java.ioFIleNotFoundException;
import java.io.PrintWriter;
import java.util.Scanner;
public class ReadFile
{
public static void main(String [] args)
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
}
in1.close();
out.close();
}
You have a typo for FileNotFoundException (should be java.io.FileNotFoundException) and your closing } before in1.close(); is misplaced; it should be after out.close(); Note that you are not handling any exceptions neither.
I spotted a few issues,
// Added the throws FileNotFoundException
public static void main(String [] args) throws FileNotFoundException
{
//Construct Scanner Objects for input files
Scanner in1 = new Scanner(new File("JackBeNimble.txt"));
//Construct PrintWriter for the output file
PrintWriter out = new PrintWriter("JBN_LineByLine.txt");
//Read lines from the file
while(in1.hasNextLine())
{
String line1 = in1.nextLine();
out.println(line1);
}
// Close in the main body.
in1.close();
out.close();
}