Can we do image processing with Palantir Foundry? - image-processing

I'm exploring the Palantir Foundry platform and it seems to have a ton of options for rectangular or structured data. Does anyone have experience working with unstructured big data on the Foundry platform? How can we use Foundry for image analysis?

Although most examples are given using tabular data, in reality many use cases rely on Foundry for both unstructured and semi-structured data processing.
You should think of a dataset as a container of files with an API to access and process those files.
Using the file-level API, you can access the files in a dataset and process them however you like. If those files are images, you can extract information from them and use it downstream.
A common use case is to have PDFs as files in a dataset, extract information from each PDF, and store it as tabular data so you can run both structured and unstructured search over it.
Here is an example of file-level access that extracts text from PDFs:
import com.palantir.transforms.lang.java.api.Compute;
import com.palantir.transforms.lang.java.api.FoundryInput;
import com.palantir.transforms.lang.java.api.FoundryOutput;
import com.palantir.transforms.lang.java.api.Input;
import com.palantir.transforms.lang.java.api.Output;
import com.palantir.util.syntacticpath.Paths;
import com.google.common.collect.AbstractIterator;
import com.palantir.spark.binarystream.data.PortableFile;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.UUID;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public final class ExtractPDFText {

    private static String pdf_source_files_rid = "SOME RID";
    private static String dataProxyPath = "/foundry-data-proxy/api/dataproxy/datasets/";
    private static String datasetViewPath = "/views/master/";

    @Compute
    public void compute(
            @Input("/Base/project_name/treasury_pdf_docs") FoundryInput pdfFiles,
            @Output("/Base/project_name/clean/pdf_text_extracted") FoundryOutput output) throws IOException {
        Dataset<PortableFile> filesDataset = pdfFiles.asFiles().getFileSystem().filesAsDataset();

        // Extract the text of each PDF and emit it as one JSON string per file
        Dataset<String> mappedDataset = filesDataset.flatMap((FlatMapFunction<PortableFile, String>) portableFile ->
            portableFile.convertToIterator(inputStream -> {
                String pdfFileName = portableFile.getLogicalPath().getFileName().toString();
                return new PDFIterator(inputStream, pdfFileName);
            }), Encoders.STRING());

        // Parse the JSON strings into a tabular dataset and write it to the output
        Dataset<Row> dataset = filesDataset
            .sparkSession()
            .read()
            .option("inferSchema", "false")
            .json(mappedDataset);

        output.getDataFrameWriter(dataset).write();
    }

    private static final class PDFIterator extends AbstractIterator<String> {

        private InputStream inputStream;
        private String pdfFileName;
        private boolean done;

        PDFIterator(InputStream inputStream, String pdfFileName) throws IOException {
            this.inputStream = inputStream;
            this.pdfFileName = pdfFileName;
            this.done = false;
        }

        @Override
        protected String computeNext() {
            if (done) {
                return endOfData();
            }

            try {
                String objectId = pdfFileName;
                String appUrl = dataProxyPath.concat(pdf_source_files_rid).concat(datasetViewPath).concat(pdfFileName);
                PDDocument document = PDDocument.load(inputStream);
                PDFTextStripper pdfStripper = new PDFTextStripper();
                String text = pdfStripper.getText(document);
                String strippedText = text.replace("\"", "'").replace("\\", "").replace("“", "'").replace("”", "'").replace("\n", "").replace("\r", "");
                done = true;
                return "{\"id\": \"" + String.valueOf(UUID.randomUUID()) + "\", \"file_name\": \"" + pdfFileName + "\", \"app_url\": \"" + appUrl + "\", \"object_id\": \"" + objectId + "\", \"text\": \"" + strippedText + "\"}\n";
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }
}

Indeed, you can do image analysis on Foundry since you have access to the files and can use arbitrary libraries (for example Pillow or scikit-image for Python). This can also be done at scale, as the work can be parallelised.
A simple Python snippet that stitches two pictures together should get you started:
from transforms.api import transform, Input, Output
from PIL import Image


@transform(
    output=Output("/processed/stitched_images"),
    raw=Input("/raw/images"),
    image_meta=Input("/processed/image_meta")
)
def my_compute_function(raw, image_meta, output, ctx):
    image_meta = image_meta.dataframe()

    def stitch_images(clone):
        left = clone["left_file_name"]
        right = clone["right_file_name"]
        image_name = clone["image_name"]
        with raw.filesystem().open(left, mode="rb") as left_file:
            with raw.filesystem().open(right, mode="rb") as right_file:
                with output.filesystem().open(image_name, 'wb') as out_file:
                    left_image = Image.open(left_file)
                    right_image = Image.open(right_file)
                    (width, height) = left_image.size
                    result_width = width * 2
                    result_height = height
                    result = Image.new('RGB', (result_width, result_height))
                    result.paste(im=left_image, box=(0, 0))
                    # offset the right image by the left image's width
                    result.paste(im=right_image, box=(width, 0))
                    result.save(out_file, format='jpeg', quality=90)

    image_meta.rdd.foreach(stitch_images)
The image_meta dataset is just a dataset with two file names per row. To extract filenames from a dataset of raw files you can use something like:
@transform(
    output=Output("/processed/image_meta"),
    raw=Input("/raw/images"),
)
def my_compute_function(raw, output, ctx):
    file_names = [(file_status.path, 1) for file_status in raw.filesystem().ls(glob="*.jpg")]
    # create and write a Spark dataframe based on the array (illustrative schema)
    df = ctx.spark_session.createDataFrame(file_names, ["file_name", "dummy"])
    output.write_dataframe(df)

As others have mentioned, Palantir Foundry's focus is on tabular data, and it doesn't currently provide access to GPUs or other tensor processing units. So doing anything intensive like an FFT or deep learning would be ill-advised at best, if not downright impossible at worst.
That being said, you can upload image files into dataset nodes for read/write access. You could also store their binary content as a blob type in a DataFrame, so that the file bytes live in a record field. Given that a multitude of image-processing and matrix-math Python libraries are available on the platform, and given that it is also possible to upload library packages manually through the Code Repository app, it is conceivable that someone could run simple manipulations at a somewhat large scale, as long as the work isn't overly complex or memory intensive. A sketch of the blob approach is shown below.
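To illustrate the blob approach, here is a minimal sketch of a Python transform that reads image files from an input dataset and stores their raw bytes in a binary column of a Spark DataFrame. The dataset paths, column names and the *.jpg glob are hypothetical, and the files are collected on the driver, so this is only suitable for modestly sized datasets.
from transforms.api import transform, Input, Output


@transform(
    output=Output("/processed/image_blobs"),  # hypothetical output path
    raw=Input("/raw/images"),                 # hypothetical input dataset of image files
)
def my_compute_function(raw, output, ctx):
    fs = raw.filesystem()
    rows = []
    # Read each image file and keep its raw bytes alongside its path
    for file_status in fs.ls(glob="*.jpg"):
        with fs.open(file_status.path, mode="rb") as f:
            rows.append((file_status.path, bytearray(f.read())))
    # bytearray values become a binary column in the resulting DataFrame
    df = ctx.spark_session.createDataFrame(rows, ["file_name", "image_bytes"])
    output.write_dataframe(df)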

Note that Foundry seems to have no GPU support at the moment, so if you are thinking about running deep learning based image processing, this will be quite slow on CPUs.

Related

Could someone share how to delete a paragraph from a text box

I am currently working on a project to manipulate Docx files with the Apache POI project. I have used the API to remove text from a run inside of a text box, but cannot figure out how to remove a paragraph inside a text box. I assume that I need to use the class CTP to obtain the paragraph object to remove. Any examples or suggestions would be greatly appreciated.
In Replace text in text box of docx by using Apache POI I have shown how to replace text in Word text-box contents. The approach is to get a list of XML text run elements from the XPath .//*/w:txbxContent/w:p/w:r using an XmlCursor which selects that path from /word/document.xml.
The same of course can be done using the path .//*/w:txbxContent/w:p, which gets the text paragraphs in text-box contents. Having those low-level paragraph XML objects, we can convert them into XWPFParagraphs to get the plain text out of them. Then, if the plain text matches some criterion, we can simply remove the paragraph's XML.
Code:
import java.io.FileOutputStream;
import java.io.FileInputStream;
import org.apache.poi.xwpf.usermodel.*;
import org.apache.xmlbeans.XmlObject;
import org.apache.xmlbeans.XmlCursor;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
import java.util.List;
import java.util.ArrayList;

public class WordRemoveParagraphInTextBox {

    public static void main(String[] args) throws Exception {

        XWPFDocument document = new XWPFDocument(new FileInputStream("WordRemoveParagraphInTextBox.docx"));

        for (XWPFParagraph paragraph : document.getParagraphs()) {
            // select all text-box paragraphs (w:txbxContent/w:p) below this paragraph's XML
            XmlCursor cursor = paragraph.getCTP().newCursor();
            cursor.selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' .//*/w:txbxContent/w:p");
            List<XmlObject> ctpsintxtbx = new ArrayList<XmlObject>();
            while (cursor.hasNextSelection()) {
                cursor.toNextSelection();
                XmlObject obj = cursor.getObject();
                ctpsintxtbx.add(obj);
            }
            for (XmlObject obj : ctpsintxtbx) {
                // convert the low-level paragraph XML into an XWPFParagraph to read its text
                CTP ctp = CTP.Factory.parse(obj.xmlText());
                //CTP ctp = CTP.Factory.parse(obj.newInputStream());
                XWPFParagraph bufferparagraph = new XWPFParagraph(ctp, document);
                String text = bufferparagraph.getText();
                if (text != null && text.contains("remove")) {
                    obj.newCursor().removeXml();
                }
            }
        }

        FileOutputStream out = new FileOutputStream("WordRemoveParagraphInTextBoxNew.docx");
        document.write(out);
        out.close();
        document.close();
    }
}

Optimizations for merging graph with million nodes and csv with million rows

Consider I have graph with following nodes and relationships:
(:GPE)-[:contains]->(:PE)-[:has]->(:E)
with following attributes:
GPE: joinProp1, outProp1, randomProps
PE : joinProp2, outProp2, randomProps
E : joinProp3, outProp3, randomProps
Now consider I have csv of following format:
joinCol1, joinCol2, joinCol3, outCol1, outCol2, outCol3, randomProps
Now consider that I have a million rows in this csv file. I also have a million instances of each of (:GPE), (:PE), (:E) in the graph. I want to merge the graph and the csv into a new csv. For that I want to map / equate
joinCol1 with joinProp1
joinCol2 with joinProp2
joinCol3 with joinProp3
something like this (pseudo cypher) for each row in csv:
MATCH (gpe:GPE {joinProp1:joinCol1})-[:contains]->(pe:PE {joinProp2:joinCol2})-[:has]->(e:E {joinProp3:joinCol3}) RETURN gpe.outProp1, pe.outProp2, e.outProp3
So the output csv format would be:
joinCol1, joinCol2, joinCol3, outCol1, outCol2, outCol3, outProp1, outProp2, outProp3
What is a rough minimum execution time estimate (minutes or hours) for completing this task if I create indexes on all joinProps and use parameterized Cypher (assuming I implement this simple logic with the Java API)? I just want rough estimates. We have a similar (possibly un-optimized) task implemented, and it takes several hours. The challenge is to bring that execution time down. What can I do to optimize and bring the execution time down to minutes? Any quick optimization pointers / links? Would using an approach other than the Java API provide performance improvements?
Well, I tried some things that improved performance considerably.
Some neo4j performance guidelines that apply to my scenario (a compact sketch of the first three guidelines follows this list):
Process in batches: avoid making a Cypher call (over the Bolt API) for each csv row. Instead, iterate through a fixed number of csv rows and build a list of maps, where each map is one csv row. Then pass this list of maps to Cypher as a parameter, UNWIND the list inside the Cypher query, and perform the desired action. Repeat for the next set of csv rows.
Don't return node / relationship objects from Cypher to the Java side. Instead, return a list of maps shaped like the final output. When we return a list of nodes / relationships, we have to iterate through them again to merge their properties with the csv columns to form the final output row (or map).
Pass the csv column values to Cypher: to achieve point 2, send the csv column values (to be merged with graph properties) to Cypher. Perform the matching in Cypher and build the output map by merging the matched nodes' properties with the input csv columns.
Index the node / relationship properties which are to be matched (official docs).
Parameterize the Cypher queries (API example, official docs).
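To make the first three guidelines concrete, here is a minimal sketch of the batching pattern, written against the question's GPE / PE / E schema and using the neo4j Python driver for brevity (my actual implementation, shown further below, is in Java). The URI, credentials, batch size and property names are assumptions:
from neo4j import GraphDatabase  # assuming a recent version of the official neo4j Python driver

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Parameterized Cypher: UNWIND one batch of csv rows and return plain maps, not nodes
CYPHER = """
UNWIND $rows AS row
MATCH (gpe:GPE {joinProp1: row.joinCol1})-[:contains]->(pe:PE {joinProp2: row.joinCol2})-[:has]->(e:E {joinProp3: row.joinCol3})
RETURN {joinCol1: row.joinCol1, joinCol2: row.joinCol2, joinCol3: row.joinCol3,
        outCol1: row.outCol1, outCol2: row.outCol2, outCol3: row.outCol3,
        outProp1: gpe.outProp1, outProp2: pe.outProp2, outProp3: e.outProp3} AS merged
"""

def merge_batches(csv_rows, batch_size=10000):
    # csv_rows: list of dicts, one dict per csv row
    with driver.session() as session:
        for start in range(0, len(csv_rows), batch_size):
            batch = csv_rows[start:start + batch_size]
            for record in session.run(CYPHER, rows=batch):
                yield record["merged"]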
I did a quick-and-dirty experiment, as explained below.
My input csv looked like this.
inputDataCsv.csv
csv-id1,csv-id2,csv-id3,outcol1,outcol2,outcol3
gppe0,ppe1,pe2,outcol1-val-0,outcol2-val-1,outcol3-val-2
gppe3,ppe4,pe5,outcol1-val-3,outcol2-val-4,outcol3-val-5
gppe6,ppe7,pe8,outcol1-val-6,outcol2-val-7,outcol3-val-8
...
To create the graph, I first created csvs of the form:
gppe.csv (entities)
gppe0,gppe_out_prop_1_val_0,gppe_out_prop_2_val_0,gppe_prop_X_val_0
gppe3,gppe_out_prop_1_val_3,gppe_out_prop_2_val_3,gppe_prop_X_val_3
gppe6,gppe_out_prop_1_val_6,gppe_out_prop_2_val_6,gppe_prop_X_val_6
...
ppe.csv (entities)
ppe1,ppe_out_prop_1_val_1,ppe_out_prop_2_val_1,ppe_prop_X_val_1
ppe4,ppe_out_prop_1_val_4,ppe_out_prop_2_val_4,ppe_prop_X_val_4
ppe7,ppe_out_prop_1_val_7,ppe_out_prop_2_val_7,ppe_prop_X_val_7
...
pe.csv (entities)
pe2,pe_out_prop_1_val_2,pe_out_prop_2_val_2,pe_prop_X_val_2
pe5,pe_out_prop_1_val_5,pe_out_prop_2_val_5,pe_prop_X_val_5
pe8,pe_out_prop_1_val_8,pe_out_prop_2_val_8,pe_prop_X_val_8
...
gppeHasPpe.csv (relationships)
gppe0,ppe1
gppe3,ppe4
gppe6,ppe7
...
ppeContainsPe.csv (relationships)
ppe1,pe2
ppe4,pe5
ppe7,pe8
...
I loaded this in neo4j as follows:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///gppe.csv' AS line
CREATE (:GPPocEntity {id:line[0],gppe_out_prop_1: line[1], gppe_out_prop_2: line[2],gppe_out_prop_X: line[3]})  
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///ppe.csv' AS line
CREATE (:PPocEntity {id:line[0],ppe_out_prop_1: line[1], ppe_out_prop_2: line[2],ppe_out_prop_X: line[3]})
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///pe.csv' AS line
CREATE (:PocEntity {id:line[0],pe_out_prop_1: line[1], pe_out_prop_2: line[2],pe_out_prop_X: line[3]})
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///gppeHasPpe.csv' AS line
MATCH(gppe:GPPocEntity {id:line[0]})
MATCH(ppe:PPocEntity {id:line[1]})
MERGE (gppe)-[:has]->(ppe)
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///ppeContainsPe.csv' AS line
MATCH(ppe:PPocEntity {id:line[0]})
MATCH(pe:PocEntity {id:line[1]})
MERGE (ppe)-[:contains]->(pe)
Next I created indexes on the lookup properties:
CREATE INDEX ON :GPPocEntity(id)
CREATE INDEX ON :PPocEntity(id)
CREATE INDEX ON :PocEntity(id)
Below is a utility class which reads the column-mapping configuration (the CsvReader used later, which reads the csv into a list of maps, is not shown):
package csv2csv;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import org.w3c.dom.stylesheets.LinkStyle;
public class Config {
String configFilePath;
Map csvColumnToGraphNodeMapping;
List<Map> mappingsGraphRelations;
Map<String,Map<String,String>> mappingsGraphRelationsMap = new HashMap<String, Map<String,String>>();
List<String> outputColumnsFromCsv;
Map outputColumnsFromGraph;
public Config(String pConfigFilePath) {
configFilePath = pConfigFilePath;
JSONParser parser = new JSONParser();
try {
Object obj = parser.parse(new FileReader(configFilePath));
JSONObject jsonObject = (JSONObject) obj;
csvColumnToGraphNodeMapping = (HashMap) ((HashMap) jsonObject.get("csvColumn-graphNodeProperty-mapping"))
.get("mappings");
mappingsGraphRelations = (ArrayList) ((HashMap) jsonObject.get("csvColumn-graphNodeProperty-mapping"))
.get("mappings-graph-relations");
for(Map m : mappingsGraphRelations)
{
mappingsGraphRelationsMap.put(""+ m.get("start-entity") + "-" + m.get("end-entity"), m);
}
outputColumnsFromCsv = (ArrayList) ((HashMap) jsonObject.get("output-csv-columns"))
.get("columns-from-input-csv");
outputColumnsFromGraph = (HashMap) ((HashMap) jsonObject.get("output-csv-columns"))
.get("columns-from-graph");
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (ParseException e) {
e.printStackTrace();
}
}
}
The class below performs the merge and creates the output csv:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.neo4j.driver.internal.value.MapValue;
import org.neo4j.driver.internal.value.NodeValue;
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Record;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.apache.commons.lang3.time.StopWatch;
public class Csv2CsvUtil2 {
static String inCsvFilePath = "D:\\Mahesh\\work\\files\\inputDataCsv.csv";
static String outCsvFilePath = "D:\\Mahesh\\work\\files\\csvout.csv";
private final static Driver driver = GraphDatabase.driver(
"bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));
public static void main(String[] args) throws FileNotFoundException, IOException {
merge();
}
private static void merge() throws FileNotFoundException, IOException
{
List<Map<String,String>> csvRowMapList = new CsvReader(inCsvFilePath).getMapListFromCsv();
Session session = driver.session();
String cypherFilter = "";
String cypher;
PrintWriter pw = new PrintWriter(new File(outCsvFilePath));
StringBuilder sb = new StringBuilder();
List<Map<String, String>> inputMapList = new ArrayList<Map<String,String>>();
Map<String,Object> inputMapListMap = new HashMap<String,Object>();
Map<String, Object> params = new HashMap<String, Object>();
cypher = "WITH {inputMapList} AS inputMapList"
+ " UNWIND inputMapList AS rowMap"
+ " WITH rowMap"
+ " MATCH (gppe:GPPocEntity {id:rowMap.csvid1})-[:has]->(ppe:PPocEntity {id:rowMap.csvid2})-[:contains]->(pe:PocEntity {id:rowMap.csvid3})"
+ " RETURN {id1:gppe.id,id2:ppe.id,id3:pe.id"
+ ",gppeprop1: gppe.gppe_out_prop_1,gppeprop2: gppe.gppe_out_prop_2"
+ ",ppeprop1: ppe.ppe_out_prop_1,ppeprop2: ppe.ppe_out_prop_2"
+ ",peprop1: pe.pe_out_prop_1,peprop2: pe.pe_out_prop_2"
+ ",outcol1:rowMap.outcol1,outcol2:rowMap.outcol2,outcol3:rowMap.outcol3}";
int i;
for(i=0;i<csvRowMapList.size();i++)
{
Map<String, String> rowMap = new HashMap<String, String>();
rowMap.put("csvid1", csvRowMapList.get(i).get("csv-id1"));
rowMap.put("csvid2", csvRowMapList.get(i).get("csv-id2"));
rowMap.put("csvid3", csvRowMapList.get(i).get("csv-id3"));
rowMap.put("outcol1", csvRowMapList.get(i).get("outcol1"));
rowMap.put("outcol2", csvRowMapList.get(i).get("outcol2"));
rowMap.put("outcol3", csvRowMapList.get(i).get("outcol3"));
inputMapList.add(rowMap);
if(i%10000 == 0) //run in batch
{
inputMapListMap.put("inputMapList", inputMapList);
StatementResult stmtRes = session.run(cypher,inputMapListMap);
List<Record> retList = stmtRes.list();
for (Record record2 : retList) {
MapValue retMap = (MapValue) record2.get(0);
sb.append(retMap.get("id1")
+","+retMap.get("id2")
+","+retMap.get("id3")
+","+retMap.get("gppeprop1")
+","+retMap.get("gppeprop2")
+","+retMap.get("ppeprop1")
+","+retMap.get("ppeprop2")
+","+retMap.get("peprop1")
+","+retMap.get("peprop2")
+","+retMap.get("outcol1")
+","+retMap.get("outcol2")
+","+retMap.get("outcol3")
+"\n"
);
}
inputMapList.clear();
}
}
if(inputMapList.size() != 0) //process the remaining rows that did not fill
//a complete batch of 10000 records
{
inputMapListMap.put("inputMapList", inputMapList);
StatementResult stmtRes = session.run(cypher,inputMapListMap);
List<Record> retList = stmtRes.list();
for (Record record2 : retList) {
MapValue retMap = (MapValue) record2.get(0);
sb.append(retMap.get("id1")
+","+retMap.get("id2")
+","+retMap.get("id3")
+","+retMap.get("gppeprop1")
+","+retMap.get("gppeprop2")
+","+retMap.get("ppeprop1")
+","+retMap.get("ppeprop2")
+","+retMap.get("peprop1")
+","+retMap.get("peprop2")
+","+retMap.get("outcol1")
+","+retMap.get("outcol2")
+","+retMap.get("outcol3")
+"\n"
);
}
}
pw.write(sb.toString());
pw.close();
}
}
Running in non-batch mode was about 10 times slower. When guidelines 2 to 4 weren't followed, it was a lot slower still. I'd be grateful if someone could confirm whether all of the above is correct, point out any mistakes I made, tell me if I'm missing anything, or suggest further improvements.

Can I store a Map<String, Object> inside Shared Preferences in Dart?

Is there a way to save a map object into shared preferences, so that we can fetch the data from shared preferences rather than listening to the database all the time?
Actually, I want to reduce the amount of data downloaded from Firebase, so I am thinking of a solution that keeps a listener on shared prefs and reads the data from shared prefs.
But I don't see a way of achieving this in Flutter or Dart.
Can someone please help me achieve this if there is a workaround?
Many thanks,
Mahi
If you convert it to a string, you can store it
import 'dart:convert';
...
var s = json.encode(myMap);
// or var s = jsonEncode(myMap);
json.decode(...)/jsonDecode(...) makes a map from a string when you load it.
Might be easier with this package:
https://pub.dartlang.org/packages/pref_dessert
Look at the example:
import 'package:pref_dessert/pref_dessert.dart';
/// Person class that you want to serialize:
class Person {
String name;
int age;
Person(this.name, this.age);
}
/// PersonDesSer which extends DesSer<T> and implements two methods which serialize this object using CSV format:
class PersonDesSer extends DesSer<Person>{
@override
Person deserialize(String s) {
var split = s.split(",");
return new Person(split[0], int.parse(split[1]));
}
@override
String serialize(Person t) {
return "${t.name},${t.age}";
}
}
void main() {
var repo = new FuturePreferencesRepository<Person>(new PersonDesSer());
repo.save(new Person("Foo", 42));
repo.save(new Person("Bar", 1));
var list = repo.findAll();
}
Package is still under development so it might change, but any improvements and ideas are welcomed! :)
In Dart's shared preferences there is no way to store a Map directly, but you can easily work around this by converting the Map to a String, saving it as usual, and, when you need it, retrieving the String and converting it back to a Map. So simple!
Convert your map into a String using json.encode() and then save it.
When you need your map, use json.decode() to get it back from the string.
import 'dart:convert';
import 'package:shared_preferences/shared_preferences.dart';
// save: encode the map (here called yourMap) to a JSON string first
sharedPreferences.setString("yourkey", json.encode(yourMap));
// load: decode the JSON string back into a map
final yourStr = sharedPreferences.getString("yourkey");
var map = json.decode(yourStr);
For those who don't want to convert between String and JSON, I personally recommend localstorage; it is the easiest way I have found to store any data<T> in Flutter.
import 'package:localstorage/localstorage.dart';
final LocalStorage store = new LocalStorage('myapp');
...
setLocalStorage() async {
await store.ready; // Make sure store is ready
store.setItem('myMap', myMapData);
}
...
Hope this helps!

How do I write to multiple files in Apache Beam?

Let me simplify my case. I'm using Apache Beam 0.6.0. My final processed result is PCollection<KV<String, String>>. And I want to write values to different files corresponding to their keys.
For example, let's say the result consists of
(key1, value1)
(key2, value2)
(key1, value3)
(key1, value4)
Then I want to write value1, value3 and value4 to key1.txt, and write value2 to key2.txt.
And in my case:
The key set is determined while the pipeline is running, not when constructing the pipeline.
The key set may be quite small, but the number of values corresponding to each key may be very, very large.
Any ideas?
Handily, I wrote a sample of this case just the other day.
This example is written in Dataflow 1.x style.
Basically you group by each key, and then you can do this with a custom transform that connects to cloud storage. The caveat is that your list of lines per file shouldn't be massive (it has to fit into memory on a single instance, but considering you can run high-mem instances, that limit is pretty high).
...
PCollection<KV<String, List<String>>> readyToWrite = groupedByFirstLetter
.apply(Combine.perKey(AccumulatorOfWords.getCombineFn()));
readyToWrite.apply(
new PTransformWriteToGCS("dataflow-experiment", TonyWordGrouper::derivePath));
...
And then the transform doing most of the work is:
public class PTransformWriteToGCS
extends PTransform<PCollection<KV<String, List<String>>>, PCollection<Void>> {
private static final Logger LOG = Logging.getLogger(PTransformWriteToGCS.class);
private static final Storage STORAGE = StorageOptions.getDefaultInstance().getService();
private final String bucketName;
private final SerializableFunction<String, String> pathCreator;
public PTransformWriteToGCS(final String bucketName,
final SerializableFunction<String, String> pathCreator) {
this.bucketName = bucketName;
this.pathCreator = pathCreator;
}
@Override
public PCollection<Void> apply(final PCollection<KV<String, List<String>>> input) {
return input
.apply(ParDo.of(new DoFn<KV<String, List<String>>, Void>() {
@Override
public void processElement(
final DoFn<KV<String, List<String>>, Void>.ProcessContext arg0)
throws Exception {
final String key = arg0.element().getKey();
final List<String> values = arg0.element().getValue();
final String toWrite = values.stream().collect(Collectors.joining("\n"));
final String path = pathCreator.apply(key);
BlobInfo blobInfo = BlobInfo.newBuilder(bucketName, path)
.setContentType(MimeTypes.TEXT)
.build();
LOG.info("blob writing to: {}", blobInfo);
Blob result = STORAGE.create(blobInfo,
toWrite.getBytes(StandardCharsets.UTF_8));
}
}));
}
}
Just write a loop in a ParDo function!
More details:
I had the same scenario today; the only difference is that in my case key=image_label and value=image_tf_record. So, as in your question, I am trying to create separate TFRecord files, one per class, with each record file containing a number of images. HOWEVER, I am not sure whether there might be memory issues when the number of values per key is very high, as in your scenario.
(Also, my code is in Python.)
class WriteToSeparateTFRecordFiles(beam.DoFn):

    def __init__(self, outdir):
        self.outdir = outdir

    def process(self, element):
        l, image_list = element
        writer = tf.python_io.TFRecordWriter(self.outdir + "/tfr" + str(l) + '.tfrecord')
        for example in image_list:
            writer.write(example.SerializeToString())
        writer.close()
And then in your pipeline, just after the stage where you produce the key-value pairs, add these two lines:
(p
 | 'GroupByLabelId' >> beam.GroupByKey()
 | 'SaveToMultipleFiles' >> beam.ParDo(WriteToSeparateTFRecordFiles(outdir))  # outdir is your output directory path
)
You can use FileIO.writeDynamic() for that:
PCollection<KV<String, String>> readfile = (something you read..);

readfile.apply(FileIO.<String, KV<String, String>>writeDynamic()
        .by(KV::getKey)
        .withDestinationCoder(StringUtf8Coder.of())
        .via(Contextful.fn(KV::getValue), TextIO.sink())
        .to("somefolder")
        .withNaming(key -> FileIO.Write.defaultNaming(key, ".txt")));

p.run();
In the Apache Beam 2.2 Java SDK, this is natively supported in TextIO and AvroIO, using respectively TextIO.write().to(DynamicDestinations) and AvroIO.write().to(DynamicDestinations). See e.g. this method.
Update (2018): Prefer to use FileIO.writeDynamic() together with TextIO.sink() and AvroIO.sink() instead.
Just write the lines below in your ParDo class:
from apache_beam.io import filesystems

eventCSVFileWriter = filesystems.FileSystems.create(gcsFileName)
for record in list(Records):
    eventCSVFileWriter.write(record)
If you want the full code I can help you with that too.
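For context, here is a minimal sketch of how that fragment could sit inside a complete DoFn applied after a GroupByKey. The class name, output prefix and the newline-joined text format are assumptions, not part of the original answer:
import apache_beam as beam
from apache_beam.io import filesystems


class WriteKeyToFile(beam.DoFn):
    """Writes all values of one key to its own file. Hypothetical helper for illustration."""

    def __init__(self, output_prefix):
        self.output_prefix = output_prefix  # e.g. "gs://my-bucket/output/"

    def process(self, element):
        key, records = element  # element comes from a preceding GroupByKey
        writer = filesystems.FileSystems.create(self.output_prefix + str(key) + ".txt")
        try:
            for record in records:
                writer.write((record + "\n").encode("utf-8"))
        finally:
            writer.close()
You would apply it as, for example, grouped | beam.ParDo(WriteKeyToFile('gs://my-bucket/output/')).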

Why long delay (47 secs) to go from camera to AIR on iPad2/iOS6 - CameraUI, Loader, MediaEvent, MediaPromise

I've had a great deal of success using the DevGirl XpenseIt solution offered by Jason Sturges in response to a couple of other requests for help with this: (http://stackoverflow.com/questions/11812807/take-photo-using-adobe-builder-flex-for-ios being the best example)
Great success, except that after you press the 'Use' button in iOS 6, having taken a photo using the CameraUI and the util class from the tutorial, it takes a full 47 seconds (1-hippopotamus, 2-hippopotamus...) until the 'fileReady' event occurs.
It doesn't seem, to my mind, that the Loader class should take that terribly long.
Is there something I can do to improve this performance? I'm forced to add a hurry-up-and-wait UI element so my users won't think the program has hung. Here's the code of CameraUtil.as from the above, as I'm currently using it.
// http://stackoverflow.com/questions/11812807/take-photo-using-adobe-builder-flex-for-ios
package classes
{
import flash.display.BitmapData;
import flash.display.Loader;
import flash.display.LoaderInfo;
import flash.events.Event;
import flash.events.EventDispatcher;
import flash.events.IEventDispatcher;
import flash.events.MediaEvent;
import flash.filesystem.File;
import flash.filesystem.FileMode;
import flash.filesystem.FileStream;
import flash.media.CameraRoll;
import flash.media.CameraUI;
import flash.media.MediaPromise;
import flash.media.MediaType;
import flash.utils.ByteArray;
import mx.graphics.codec.JPEGEncoder;
import events.CameraEvent;
[Event(name = "fileReady", type = "events.CameraEvent")]
public class CameraUtil extends EventDispatcher
{
protected var camera:CameraUI;
protected var loader:Loader;
public var file:File;
public function CameraUtil(target:IEventDispatcher=null)
{
super(target);
if (CameraUI.isSupported)
{
camera = new CameraUI();
camera.addEventListener(MediaEvent.COMPLETE, mediaEventComplete);
}
} // End CONSTRUCTOR CameraUtil
public function takePicture():void
{
if (camera)
camera.launch(MediaType.IMAGE);
} // End FUNCTION takePicture
protected function mediaEventComplete(event:MediaEvent):void
{
var mediaPromise:MediaPromise = event.data;
if (mediaPromise.file == null)
{
// For iOS we need to load with a Loader first
loader = new Loader();
loader.contentLoaderInfo.addEventListener(Event.COMPLETE, loaderCompleted);
loader.loadFilePromise(mediaPromise);
return;
}
else
{
// Android we can just dispatch the event that it's complete
file = new File(mediaPromise.file.url);
dispatchEvent(new CameraEvent(CameraEvent.FILE_READY, file));
}
} // End FUNCTION mediaEventComplete
protected function loaderCompleted(event:Event):void
{
var loaderInfo:LoaderInfo = event.target as LoaderInfo;
if (CameraRoll.supportsAddBitmapData)
{
var bitmapData:BitmapData = new BitmapData(loaderInfo.width, loaderInfo.height);
bitmapData.draw(loaderInfo.loader);
file = File.applicationStorageDirectory.resolvePath("receipt" + new Date().time + ".jpg");
var stream:FileStream = new FileStream()
stream.open(file, FileMode.WRITE);
var j:JPEGEncoder = new JPEGEncoder();
var bytes:ByteArray = j.encode(bitmapData);
stream.writeBytes(bytes, 0, bytes.bytesAvailable);
stream.close();
dispatchEvent(new CameraEvent(CameraEvent.FILE_READY, file));
}
} // End FUNCTION loaderComplete
} // End CLASS CameraUtil
} // End PACKAGE classes
I was able to solve my delay problem by removing a step from the process. This step is one I myself do not need (at present), but others may, so removing it is not really an answer to the question of why this seemingly reasonable process takes what seems an unreasonable amount of time.
I needed the BitmapData, not an external file, so instead of:
Camera => [snap] => Media Promise => Loader => Write File => Event => Read File => Use BitmapData
I rewrote the class to cut out the File/AppStorage i/o.
Camera => [snap] => Media Promise => Loader => Use BitmapData
and it now takes a very reasonable (and expected) amount of processing time.
I am still surprised, however, that it takes such a long time to write the data to a file using the method in the CameraUtil class. I do need to write these images out to files, but not until the user has reduced the image to a 1024x768 crop area and I have encoded it into a heavily compressed jpg, so hopefully I'll only struggle with a smaller portion of the hang/processing time.
Does anybody know whether it should take this long to write one file to application storage in iOS from Adobe AIR (via Flex)?
I found this process painfully slow, too...
However, it seems the JPEGEncoder is contributing a lot to this delay.
You can speed up the process a lot using the optimized Encoder that can be found here.
http://www.bytearray.org/?p=775
Depending on your device it is about 2 to 4 times faster than the original one.
Another step that can be omitted is the bitmapData.draw() call, which is slow as well: use the Loader.content directly instead. This way you skip instantiating another bitmap, which would otherwise blow up memory usage.
Like this:
protected function loaderCompleted(event:Event):void
{
    var loader:Loader = (event.target as LoaderInfo).loader;
    var bitmap:Bitmap = loader.content as Bitmap;
    (...)
}
Still, I'm waiting for somebody to write an iOS .ane which should be able to encode a jpg in a few ms instead of seconds. But in the meantime... ;)
