I am running a Beam pipeline on Google Cloud Dataflow. However, the pipeline cannot be updated, even with exactly the same code. The code is as follows:
import com.google.common.collect.Iterables;
import com.google.common.primitives.Ints;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class PipelineTest {
private static final Logger logger = LoggerFactory.getLogger(PipelineTest.class);
public static void main(String[] args) {
int[] squares = new int[1000];
for (int i = 0; i < squares.length; i++) {
squares[i] = i * i;
}
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline pipeline = Pipeline.create(options);
PCollection<Iterable<Integer>> sideInput =
pipeline.apply("Create", Create.<Iterable<Integer>>of(Ints.asList(shit)));
PCollectionView<Iterable<Integer>> view =
sideInput.apply("CreateSideInput", View.asSingleton());
PCollection<String> done =
pipeline
.apply(
"FakeData",
GenerateSequence.from(0).to(50_000).withRate(10, Duration.standardSeconds(1)))
.apply(
"Map1",
ParDo.of(
new DoFn<Long, String>() {
@ProcessElement
public void processElement(ProcessContext ctx) {
Long element = ctx.element();
Iterable<Integer> v = ctx.sideInput(view);
String out = "element " + element + ", value " + Iterables.size(v);
logger.info("MAP1: " + out);
ctx.output(out);
}
})
.withSideInputs(view))
.apply(
"Map2",
ParDo.of(
new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext ctx) {
String element = ctx.element();
Iterable<Integer> v = ctx.sideInput(view);
String out = "element " + element + ", value " + Iterables.size(v);
logger.info("MAP2: " + out);
ctx.output(out);
}
})
.withSideInputs(view));
}
}
I tried providing a default value for the view and also using two views, but neither approach worked. However, if the view is used in two independent transforms, the pipeline can be updated.
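To illustrate what I mean by two independent transforms, here is a minimal sketch (the DoFn bodies are the same as above, shortened here): Map1 and Map2 both read the generated sequence directly instead of Map2 being chained after Map1, so Map2 takes Long input in this layout. This is only a sketch of the shape that updates cleanly, not a verified fix.

PCollection<Long> source =
    pipeline.apply(
        "FakeData",
        GenerateSequence.from(0).to(50_000).withRate(10, Duration.standardSeconds(1)));

// Branch 1: consumes the side input view
source.apply(
    "Map1",
    ParDo.of(
            new DoFn<Long, String>() {
              @ProcessElement
              public void processElement(ProcessContext ctx) {
                ctx.output("element " + ctx.element() + ", value " + Iterables.size(ctx.sideInput(view)));
              }
            })
        .withSideInputs(view));

// Branch 2: an independent transform consuming the same view
source.apply(
    "Map2",
    ParDo.of(
            new DoFn<Long, String>() {
              @ProcessElement
              public void processElement(ProcessContext ctx) {
                ctx.output("element " + ctx.element() + ", value " + Iterables.size(ctx.sideInput(view)));
              }
            })
        .withSideInputs(view));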
I am trying to write to Spanner from a Dataflow streaming job using the following dependency:
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
<version>2.18.0</version>
</dependency>
After mapping the data to a PCollection<Mutation>, I write the mutations to Spanner via SpannerIO.write:
Pipeline pipeline = Pipeline.create(options);
PCollection<Mutation> mutations = pipeline.apply...
mutations.apply("WriteMutations", SpannerIO.write()
.withInstanceId(INSTANCE_ID)
.withDatabaseId(DATABASE_ID)
);
pipeline.run();
However, it throws
java.lang.IllegalStateException: Sorter should be null here
at org.apache.beam.sdk.io.gcp.spanner.SpannerIO$GatherBundleAndSortFn.startBundle (SpannerIO.java:1080)
What would be the cause of this exception?
The following pipeline reproduces the exception. I tested it with 20 workers, but the problem appears to be independent of the data load.
import com.google.cloud.spanner.Mutation;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;
import java.util.UUID;
public final class TestPipeline {
private static final Duration WINDOW_DURATION = Duration.standardSeconds(1);
private static final String DATABASE_ID = "test";
private static final String INSTANCE_ID = "test-spanner";
private static final String TEST_TABLE = "test";
public static void main(String[] args) {
TestPipelineOptions options = PipelineOptionsFactory
.fromArgs(args)
.withValidation()
.as(TestPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply("Read pubsub", PubsubIO.readMessagesWithAttributes()
.fromSubscription(options.getInputSubscription()))
.apply("Parse message", ParDo.of(new ProcessMessage()))
.apply("Windowing", Window.<Mutation>into(new GlobalWindows())
.triggering(Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(WINDOW_DURATION)))
.withAllowedLateness(Duration.ZERO)
.discardingFiredPanes())
.apply("Write mutations", SpannerIO.write()
.withInstanceId(INSTANCE_ID)
.withDatabaseId(DATABASE_ID)
);
pipeline.run();
}
private static class ProcessMessage extends DoFn<PubsubMessage, Mutation> {
@ProcessElement
public void processElement(@Element final PubsubMessage message,
final OutputReceiver<Mutation> out) {
out.output(Mutation.newInsertOrUpdateBuilder(TEST_TABLE)
.set("id").to(UUID.randomUUID().toString())
.set("string").to("test")
.set("count").to(Long.MAX_VALUE)
.build()
);
}
}
interface TestPipelineOptions extends DataflowPipelineOptions {
void setInputSubscription(String inputSubscription);
@Description("Google Pubsub subscription id.")
String getInputSubscription();
}
}
Table definition: CREATE TABLE test (id STRING(50) NOT NULL, string STRING(50) NOT NULL, count INT64) PRIMARY KEY (id);
This issue seems to occur with Apache Beam version 2.18, but not with version 2.17.
The issue in Apache Beam 2.18 is tracked here: https://issues.apache.org/jira/browse/BEAM-9505
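Until BEAM-9505 is fixed, a straightforward workaround, given that 2.17 is not affected, is to pin the dependency back to 2.17.0 (a sketch of the pom change, mirroring the dependency shown above):

<!-- workaround sketch: 2.17.0 does not exhibit the Sorter exception -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
  <version>2.17.0</version>
</dependency>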
I was trying to run SQL over two datasets on Google Cloud Storage using Apache Beam, following the Apache Beam SQL walkthrough: https://beam.apache.org/documentation/dsls/sql/walkthrough/
But I end up with the exception below:
An exception occured while executing the Java class. org.apache.beam.sdk.transforms.MapElements
.via(Lorg/apache/beam/sdk/transforms/SimpleFunction;)Lorg/apache/beam/sdk/transforms/MapElements;
I tried changing the Beam SDK version and made other code changes, but none of them worked.
package com.nitesh.gcp.feature;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.sql.SqlTransform;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionTuple;
import org.apache.beam.sdk.values.Row;
import org.apache.beam.sdk.values.TupleTag;
import java.util.stream.Collectors;
public class beamSQL1 {
public static final String EMPHEADER = "empno,ename,job,mgr,hiredate,sal,comm,deptno";
public static final String DEPTHEADER = "deptno,dname,location";
public static final Schema EMPSCHEMA = Schema.builder()
.addStringField("empno")
.addStringField("ename")
.addStringField("job")
.addStringField("mgr")
.addStringField("hiredate")
.addStringField("sal")
.addStringField("comm")
.addStringField("deptno")
.build();
public static final Schema DEPTSCHEMA = Schema.builder()
.addStringField("deptno")
.addStringField("dname")
.addStringField("location")
.build();
public static void main(String[] args) {
PipelineOptionsFactory.register(DataflowPipelineOptions.class);
DataflowPipelineOptions options = PipelineOptionsFactory
.fromArgs(args)
.withValidation()
.as(DataflowPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
// give each transform a unique name within the pipeline
PCollection<String> employee = pipeline.apply("Read Employee From GCS", TextIO.read().from("gs://amazon-test/sqlData/employee.txt"));
PCollection<String> department = pipeline.apply("Read Department From GCS", TextIO.read().from("gs://amazon-test/sqlData/department.txt"));
PCollection<Row> employeeRow = employee.apply("Employee To Row", ParDo.of(new RowParDo())).setRowSchema(EMPSCHEMA);
PCollection<Row> departmentRow = department.apply("Department To Row", ParDo.of(new RowParDoForDept())).setRowSchema(DEPTSCHEMA);
PCollectionTuple output = PCollectionTuple.of(new TupleTag<>("emp"), employeeRow).and(new TupleTag<>("dept"), departmentRow);
output.apply(
SqlTransform.query(
// "SELECT emp.empno,emp.ename,dept.deptno,dept.dname FROM emp JOIN dept ON emp.deptno = dept.deptno"))
"SELECT * from emp JOIN dept ON emp.deptno = dept.deptno"))
/* p2.apply("Transform Sql", SqlTransform.query(
"SELECT * " +
"FROM PCOLLECTION order by sal desc LIMIT 14")
)*/
.apply("TransForm To String", ParDo.of(new RowToString()))
.apply("Write To GCS", TextIO.write().to("gs://amazon-test/sqlData/output/outputSql.csv").withoutSharding());
pipeline.run();
}
//ParDo for String -> Row (SQL)
public static class RowParDo extends DoFn<String, Row> {
@ProcessElement
public void processElement(ProcessContext c) {
if (!c.element().equalsIgnoreCase(EMPHEADER)) {
String[] vals = c.element().split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
Row appRow = Row
.withSchema(EMPSCHEMA)
.addValues(vals[0], vals[1], vals[2], vals[3], vals[4], vals[5], vals[6], vals[7])
.build();
c.output(appRow);
}
}
}
//ParDo for Row (SQL) -> String
public static class RowToString extends DoFn<Row, String> {
@ProcessElement
public void processElement(ProcessContext c) {
String line = c.element().getValues()
.stream()
.map(Object::toString)
.collect(Collectors.joining(","));
c.output(line);
}
}
//ParDo for String -> Row (SQL)
public static class RowParDoForDept extends DoFn<String, Row> {
@ProcessElement
public void processElement(ProcessContext c) {
if (!c.element().equalsIgnoreCase(DEPTHEADER)) {
String[] vals = c.element().split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
Row appRow = Row
.withSchema(DEPTSCHEMA)
.addValues(vals[0], vals[1], vals[2])
.build();
c.output(appRow);
}
}
}
}
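For what it's worth, a MapElements.via(SimpleFunction) signature error like the one above often indicates mixed Beam artifact versions on the classpath (for example, beam-sdks-java-extensions-sql resolving to a different version than beam-sdks-java-core). That is only a guess on my part, but it may be worth pinning every Beam artifact to a single version, roughly like this:

<!-- assumption: one property keeps all Beam artifacts on the same version -->
<properties>
  <beam.version>2.19.0</beam.version>
</properties>
...
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>${beam.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-extensions-sql</artifactId>
  <version>${beam.version}</version>
</dependency>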
I am using Lucene 6.6 and I am having difficulty importing lucene.queryparser; I checked the Lucene documentation and it no longer seems to exist.
I am using the code below. Is there an alternative to the query parser in Lucene 6?
import java.io.IOException;
import java.text.ParseException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
public class HelloLucene {
public static void main(String[] args) throws IOException, ParseException {
// 0. Specify the analyzer for tokenizing text.
// The same analyzer should be used for indexing and searching
StandardAnalyzer analyzer = new StandardAnalyzer();
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action", "193398817");
addDoc(w, "Lucene for Dummies", "55320055Z");
addDoc(w, "Managing Gigabytes", "55063554A");
addDoc(w, "The Art of Computer Science", "9900333X");
w.close();
// 2. query
String querystr = args.length > 0 ? args[0] : "lucene";
// the "title" arg specifies the default field to use
// when no field is explicitly specified in the query.
Query q = null;
try {
// in Lucene 6.x the classic QueryParser no longer takes a Version argument
q = new QueryParser("title", analyzer).parse(querystr);
} catch (org.apache.lucene.queryparser.classic.ParseException e) {
e.printStackTrace();
}
// 3. search
int hitsPerPage = 10;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
// in Lucene 6.x, create() just takes the number of hits to collect
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// 4. display results
System.out.println("Found " + hits.length + " hits.");
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get("isbn") + "\t" + d.get("title"));
}
// reader can only be closed when there
// is no need to access the documents any more.
reader.close();
}
private static void addDoc(IndexWriter w, String title, String isbn) throws IOException {
Document doc = new Document();
doc.add(new TextField("title", title, Field.Store.YES));
// use a string field for isbn because we don't want it tokenized
doc.add(new StringField("isbn", isbn, Field.Store.YES));
w.addDocument(doc);
}
}
Thanks!
The problem is solved.
Initially, only lucene-core-6.6.0 was added to the build path, but lucene-queryparser-6.6.0 is a separate jar file that needs to be added as well.
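If the project uses Maven rather than a plain build path, the equivalent is adding the query parser artifact (coordinates as published on Maven Central):

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>6.6.0</version>
</dependency>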
I'm trying to put a String label on an edge of a JUNG tree graph. See the example below.
What I tried is to simply add this line:
graph.addEdge("arrow", "Vmain", "Vsecond");
... but without any result (see the screenshot below). I need some advice, please.
import java.awt.BorderLayout;
import java.awt.Color;
import java.awt.Container;
import java.awt.Dimension;
import java.awt.GridLayout;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.BorderFactory;
import javax.swing.JApplet;
import javax.swing.JButton;
import javax.swing.JComboBox;
import javax.swing.JFrame;
import javax.swing.JPanel;
import org.apache.commons.collections15.Factory;
import edu.uci.ics.jung.algorithms.layout.TreeLayout;
import edu.uci.ics.jung.graph.DirectedGraph;
import edu.uci.ics.jung.graph.DirectedSparseMultigraph;
import edu.uci.ics.jung.graph.Forest;
import edu.uci.ics.jung.graph.DelegateForest;
import edu.uci.ics.jung.graph.DelegateTree;
import edu.uci.ics.jung.graph.Tree;
import edu.uci.ics.jung.visualization.GraphZoomScrollPane;
import edu.uci.ics.jung.visualization.VisualizationViewer;
import edu.uci.ics.jung.visualization.control.CrossoverScalingControl;
import edu.uci.ics.jung.visualization.control.DefaultModalGraphMouse;
import edu.uci.ics.jung.visualization.control.ModalGraphMouse;
import edu.uci.ics.jung.visualization.control.ScalingControl;
import edu.uci.ics.jung.visualization.decorators.EdgeShape;
import edu.uci.ics.jung.visualization.decorators.ToStringLabeller;
public class TreeVisualiser extends JApplet {
Forest<String, String> graph;
Factory<DirectedGraph<String, String>> graphFactory = new Factory<DirectedGraph<String, String>>() {
public DirectedGraph<String, String> create() {
return new DirectedSparseMultigraph<String, String>();
}
};
Factory<Tree<String, String>> treeFactory = new Factory<Tree<String, String>>() {
public Tree<String, String> create() {
return new DelegateTree<String, String>(graphFactory);
}
};
VisualizationViewer<String, String> vv;
String root;
TreeLayout<String, String> treeLayout;
@SuppressWarnings({"rawtypes", "unchecked"})
public TreeVisualiser() {
// create a simple graph for the demo
graph = new DelegateForest<String, String>();
createTree();
treeLayout = new TreeLayout<String, String>(graph);
vv = new VisualizationViewer<String, String>(treeLayout, new Dimension(600, 600));
vv.setBackground(Color.white);
vv.getRenderContext().setEdgeShapeTransformer(new EdgeShape.Line());
vv.getRenderContext().setVertexLabelTransformer(new ToStringLabeller());
// add a listener for ToolTips
vv.setVertexToolTipTransformer(new ToStringLabeller());
Container content = getContentPane();
final GraphZoomScrollPane panel = new GraphZoomScrollPane(vv);
content.add(panel);
final DefaultModalGraphMouse graphMouse = new DefaultModalGraphMouse();
vv.setGraphMouse(graphMouse);
JComboBox modeBox = graphMouse.getModeComboBox();
modeBox.addItemListener(graphMouse.getModeListener());
graphMouse.setMode(ModalGraphMouse.Mode.TRANSFORMING);
final ScalingControl scaler = new CrossoverScalingControl();
JButton plus = new JButton("+");
plus.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent e) {
scaler.scale(vv, 1.1f, vv.getCenter());
}
});
JButton minus = new JButton("-");
minus.addActionListener(new ActionListener() {
public void actionPerformed(ActionEvent e) {
scaler.scale(vv, 1 / 1.1f, vv.getCenter());
}
});
JPanel scaleGrid = new JPanel(new GridLayout(1, 0));
scaleGrid.setBorder(BorderFactory.createTitledBorder("Zoom"));
JPanel controls = new JPanel();
scaleGrid.add(plus);
scaleGrid.add(minus);
controls.add(scaleGrid);
controls.add(modeBox);
content.add(controls, BorderLayout.SOUTH);
}
private void createTree() {
graph.addEdge("arrow", "Vmain", "Vsecond");
}
public static void main(String[] args) {
JFrame frame = new JFrame();
Container content = frame.getContentPane();
frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
content.add(new TreeVisualiser());
frame.pack();
frame.setVisible(true);
}
}
Oh, I missed this line of code. Now it works as I wanted:
vv.getRenderContext().setEdgeLabelTransformer(new ToStringLabeller());
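In context, it goes next to the other render-context setup in the constructor (the placement below is just a sketch; anywhere before the viewer is displayed works):

vv.getRenderContext().setEdgeShapeTransformer(new EdgeShape.Line());
vv.getRenderContext().setVertexLabelTransformer(new ToStringLabeller());
// the missing piece: label each edge with its toString() value, e.g. "arrow"
vv.getRenderContext().setEdgeLabelTransformer(new ToStringLabeller());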
I downloaded the Jester example code in Mahout and tried to run it on the Jester dataset to see the evaluation results. The run completes successfully, but the console only shows:
log4j:WARN No appenders could be found for logger (org.apache.mahout.cf.taste.impl.model.file.FileDataModel).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I expect to see an evaluation score in the range 0 to 10. Can anyone help me figure out how to get the score?
I am using mahout-core-0.6.jar and the following is the code:
JesterDataModel.java:
package Jester;
import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.regex.Pattern;
import com.google.common.collect.Lists;
import org.apache.mahout.cf.taste.example.grouplens.GroupLensDataModel;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.common.iterator.FileLineIterator;
//import org.apache.mahout.cf.taste.impl.common.FileLineIterable;
public final class JesterDataModel extends FileDataModel {
private static final Pattern COMMA_PATTERN = Pattern.compile(",");
private long userBeingRead;
public JesterDataModel() throws IOException {
this(GroupLensDataModel.readResourceToTempFile("\\jester-data-1.csv"));
}
public JesterDataModel(File ratingsFile) throws IOException {
super(ratingsFile);
}
@Override
public void reload() {
userBeingRead = 0;
super.reload();
}
@Override
protected DataModel buildModel() throws IOException {
FastByIDMap<Collection<Preference>> data = new FastByIDMap<Collection<Preference>> ();
FileLineIterator iterator = new FileLineIterator(getDataFile(), false);
FastByIDMap<FastByIDMap<Long>> timestamps = new FastByIDMap<FastByIDMap<Long>>();
processFile(iterator, data, timestamps, false);
return new GenericDataModel(GenericDataModel.toDataMap(data, true));
}
@Override
protected void processLine(String line,
FastByIDMap<?> rawData,
FastByIDMap<FastByIDMap<Long>> timestamps,
boolean fromPriorData) {
FastByIDMap<Collection<Preference>> data = (FastByIDMap<Collection<Preference>>) rawData;
String[] jokePrefs = COMMA_PATTERN.split(line);
int count = Integer.parseInt(jokePrefs[0]);
Collection<Preference> prefs = Lists.newArrayListWithCapacity(count);
for (int itemID = 1; itemID < jokePrefs.length; itemID++) { // yes skip first one, just a count
String jokePref = jokePrefs[itemID];
if (!"99".equals(jokePref)) {
float jokePrefValue = Float.parseFloat(jokePref);
prefs.add(new GenericPreference(userBeingRead, itemID, jokePrefValue));
}
}
data.put(userBeingRead, prefs);
userBeingRead++;
}
}
JesterRecommenderEvaluatorRunner.java
package Jester;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
public final class JesterRecommenderEvaluatorRunner {
private static final Logger log = LoggerFactory.getLogger(JesterRecommenderEvaluatorRunner.class);
private JesterRecommenderEvaluatorRunner() {
// do nothing
}
public static void main(String... args) throws IOException, TasteException {
RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
DataModel model = new JesterDataModel();
double evaluation = evaluator.evaluate(new JesterRecommenderBuilder(),
null,
model,
0.9,
1.0);
log.info(String.valueOf(evaluation));
}
}
Mahout 0.7 is old, and 0.6 is very old. Use at least 0.7, or better, a later build from SVN.
I think the problem is exactly what you identified: you don't have any SLF4J bindings on your classpath. If you use the ".job" files in Mahout, you will have all dependencies packaged, and then you will actually see output.
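If you would rather keep the current jars, another option is to put a minimal log4j.properties on the classpath so the log.info(...) call and the evaluation score actually reach the console. A sketch, assuming the standard log4j 1.2 configuration keys (the pattern is just an example):

# log4j.properties at the classpath root
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{ISO8601} %-5p %c - %m%n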