Recommendation engine using Apache Spark MLlib returns zero recommendations after all operations complete - machine-learning

I am new to implementing ML algorithms. I wanted to build a recommendation engine and, after a little experimenting, learned that collaborative filtering can be used for this. I am using Apache Spark. I followed one of the blogs and tried to implement it on my local machine; the code I tried is below. Every time I run it, the count of recommendations printed is zero. I don't see any evident error. Could someone please help me understand this? Also, please feel free to point me to any other references on this topic.
package mllib.example;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;
public class RecommendationEngine {
public static void main(String[] args) {
// Create Java spark context
SparkConf conf = new SparkConf().setAppName("Recommendation System Example").setMaster("local[2]").set("spark.executor.memory","1g");
JavaSparkContext sc = new JavaSparkContext(conf);
// Read user-item rating file. format - userId,itemId,rating
JavaRDD<String> userItemRatingsFile = sc.textFile(args[0]);
System.out.println("Count is "+userItemRatingsFile.count());
// Read item description file. format - itemId, itemName, Other Fields,..
JavaRDD<String> itemDescritpionFile = sc.textFile(args[1]);
System.out.println("itemDescritpionFile Count is "+itemDescritpionFile.count());
// Map file to Ratings(user,item,rating) tuples
JavaRDD<Rating> ratings = userItemRatingsFile.map(new Function<String, Rating>() {
public Rating call(String s) {
String[] sarray = s.split(",");
return new Rating(Integer.parseInt(sarray[0]), Integer
.parseInt(sarray[1]), Double.parseDouble(sarray[2]));
}
});
System.out.println("Ratings RDD Object"+ratings.first().toString());
// Create tuples(itemId,ItemDescription), will be used later to get names of item from itemId
JavaPairRDD<Integer,String> itemDescritpion = itemDescritpionFile.mapToPair(
new PairFunction<String, Integer, String>() {
@Override
public Tuple2<Integer, String> call(String t) throws Exception {
String[] s = t.split(",");
return new Tuple2<Integer,String>(Integer.parseInt(s[0]), s[1]);
}
});
System.out.println("itemDescritpion RDD Object"+itemDescritpion.first().toString());
// Build the recommendation model using ALS
int rank = 10; // 10 latent factors
int numIterations = Integer.parseInt(args[2]); // number of iterations
MatrixFactorizationModel model = ALS.trainImplicit(JavaRDD.toRDD(ratings),
rank, numIterations);
//ALS.trainImplicit(arg0, arg1, arg2)
// Create user-item tuples from ratings
JavaRDD<Tuple2<Object, Object>> userProducts = ratings
.map(new Function<Rating, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(Rating r) {
return new Tuple2<Object, Object>(r.user(), r.product());
}
});
// Calculate the itemIds not rated by a particular user, say user with userId = 1
JavaRDD<Integer> notRatedByUser = userProducts.filter(new Function<Tuple2<Object,Object>, Boolean>() {
@Override
public Boolean call(Tuple2<Object, Object> v1) throws Exception {
if (((Integer) v1._1).intValue() != 0) {
return true;
}
return false;
}
}).map(new Function<Tuple2<Object,Object>, Integer>() {
@Override
public Integer call(Tuple2<Object, Object> v1) throws Exception {
return (Integer) v1._2;
}
});
// Create user-item tuples for the items that are not rated by user, with user id 1
JavaRDD<Tuple2<Object, Object>> itemsNotRatedByUser = notRatedByUser
.map(new Function<Integer, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(Integer r) {
return new Tuple2<Object, Object>(0, r);
}
});
// Predict the ratings of the items not rated by user for the user
JavaRDD<Rating> recomondations = model.predict(itemsNotRatedByUser.rdd()).toJavaRDD().distinct();
// Sort the recommendations by rating in descending order
recomondations = recomondations.sortBy(new Function<Rating,Double>(){
@Override
public Double call(Rating v1) throws Exception {
return v1.rating();
}
}, false, 1);
System.out.println("recomondations Total is "+recomondations.count());
// Get top 10 recommendations
JavaRDD<Rating> topRecomondations = sc.parallelize(recomondations.take(10));
// Join top 10 recommendations with item descriptions
JavaRDD<Tuple2<Rating, String>> recommendedItems = topRecomondations.mapToPair(
new PairFunction<Rating, Integer, Rating>() {
@Override
public Tuple2<Integer, Rating> call(Rating t) throws Exception {
return new Tuple2<Integer,Rating>(t.product(),t);
}
}).join(itemDescritpion).values();
System.out.println("recommendedItems count is "+recommendedItems.count());
//Print the top recommendations for user 1.
recommendedItems.foreach(new VoidFunction<Tuple2<Rating,String>>() {
@Override
public void call(Tuple2<Rating, String> t) throws Exception {
System.out.println(t._1.product() + "\t" + t._1.rating() + "\t" + t._2);
}
});
}
}
Also, I see that this job runs for a really long time, and it creates a model on every run. Is there a way to create the model once, persist it, and load it for subsequent predictions? Can we improve the execution speed of this job in any way?
Thanks in advance.
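One thing that may be worth checking (an assumption, not a confirmed diagnosis): the comments talk about user 1, but the filter and the generated pairs use user id 0, and the filter keeps items rated by other users without removing items the target user already rated. If the target id never occurs in the ratings file, the model may have nothing to score for it. The plain-Java sketch below, with no Spark involved and a hypothetical class name CandidateItemsSketch, shows the candidate set that the comments describe:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Plain-Java illustration only; the class and method names are hypothetical.
class CandidateItemsSketch {

    /** Items that occur in the ratings but were never rated by targetUser. */
    static Set<Integer> itemsNotRatedBy(int targetUser, List<int[]> userItemPairs) {
        Set<Integer> allItems = new LinkedHashSet<>();
        Set<Integer> ratedByTarget = new LinkedHashSet<>();
        for (int[] pair : userItemPairs) {
            allItems.add(pair[1]);               // pair = {userId, itemId}
            if (pair[0] == targetUser) {
                ratedByTarget.add(pair[1]);
            }
        }
        allItems.removeAll(ratedByTarget);       // keep only unseen items
        return allItems;
    }

    /** True if the target user has at least one rating in the data. */
    static boolean userHasRatings(int targetUser, List<int[]> userItemPairs) {
        for (int[] pair : userItemPairs) {
            if (pair[0] == targetUser) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> pairs = Arrays.asList(
                new int[]{1, 10}, new int[]{1, 11}, new int[]{2, 10},
                new int[]{2, 12}, new int[]{3, 13});
        System.out.println("user 1 has ratings: " + userHasRatings(1, pairs));
        System.out.println("candidates for user 1: " + itemsNotRatedBy(1, pairs));
    }
}
```

Running the same two checks against the real ratings file for the user id used in the code would show whether that user exists in the data and how many candidate items remain.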

Related

Google Dataflow: write multiple lines to BigQuery

I have a simple flow whose aim is to write two lines to one BigQuery table.
I use DynamicDestinations because later I will write to multiple tables; in this example it is the same table...
The problem is that I end up with only 1 line in my BigQuery table.
In the stack trace I see the following error on the second insert:
"
status: {
code: 6
message: "Already Exists: Job sampleprojet3:b9912b9b05794aec8f4292b2ae493612_eeb0082ade6f4a58a14753d1cc92ddbc_00001-0"
}
"
What does it mean?
Is it related to this limitation?
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/550
How can I make this work?
I use Beam SDK 2.0.0; I have also tried 2.1.0 (same result).
This is how I launch it:
mvn compile exec:java -Dexec.mainClass=fr.gireve.dataflow.LogsFlowBug -Dexec.args="--runner=DataflowRunner --inputDir=gs://sampleprojet3.appspot.com/ --project=sampleprojet3 --stagingLocation=gs://dataflow-sampleprojet3/tmp" -Pdataflow-runner
Pipeline p = Pipeline.create(options);
final List<String> tableNameTableValue = Arrays.asList("table1:value1", "table1:value2", "table2:value1", "table2:value2");
p.apply(Create.of(tableNameTableValue)).setCoder(StringUtf8Coder.of())
.apply(BigQueryIO.<String>write()
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
.to(new DynamicDestinations<String, KV<String, String>>() {
@Override
public KV<String, String> getDestination(ValueInSingleWindow<String> element) {
final String[] split = element.getValue().split(":");
return KV.of(split[0], split[1]) ;
}
@Override
public Coder<KV<String, String>> getDestinationCoder() {
return KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());
}
@Override
public TableDestination getTable(KV<String, String> row) {
String tableName = row.getKey();
String tableSpec = "sampleprojet3:testLoadJSON." + tableName;
return new TableDestination(tableSpec, "Table " + tableName);
}
@Override
public TableSchema getSchema(KV<String, String> row) {
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("myColumn").setType("STRING"));
TableSchema ts = new TableSchema();
ts.setFields(fields);
return ts;
}
})
.withFormatFunction(new SerializableFunction<String, TableRow>() {
public TableRow apply(String row) {
TableRow tr = new TableRow();
tr.set("myColumn", row);
return tr;
}
}));
p.run().waitUntilFinish();
Thanks
DynamicDestinations associates each element with a destination - i.e. where the element should go. Elements are routed to BigQuery tables according to their destinations: 1 destination = 1 BigQuery table with a schema: the destination should include just enough information to produce a TableDestination and a schema. Elements with the same destination go to the same table, elements with different destinations go to different tables.
Your code snippet uses DynamicDestinations with a destination type that contains both the element and the table, which is unnecessary and, of course, violates the constraint above: elements with different destinations end up going to the same table. For example, KV("table1", "value1") and KV("table1", "value2") are different destinations, but your getTable maps them to the same table, table1.
You need to remove the element from your destination type. That will also lead to simpler code. As a side note, I think you don't need to override getDestinationCoder() - it can be inferred automatically.
Try this:
.to(new DynamicDestinations<String, String>() {
@Override
public String getDestination(ValueInSingleWindow<String> element) {
return element.getValue().split(":")[0];
}
@Override
public TableDestination getTable(String tableName) {
return new TableDestination(
"sampleprojet3:testLoadJSON." + tableName, "Table " + tableName);
}
@Override
public TableSchema getSchema(String tableName) {
List<TableFieldSchema> fields = Arrays.asList(
new TableFieldSchema().setName("myColumn").setType("STRING"));
return new TableSchema().setFields(fields);
}
})
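To see the "1 destination = 1 table" rule in isolation, here is a plain-Java sketch with no Beam involved. The class name DestinationRoutingSketch and its methods are hypothetical; it only mimics how elements group when the destination is the table name alone:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not the Beam API.
class DestinationRoutingSketch {

    /** Destination = table name only, i.e. the part before ':'. */
    static String destinationOf(String element) {
        return element.split(":")[0];
    }

    /** Groups elements by destination, so each table gets exactly one group. */
    static Map<String, List<String>> route(List<String> elements) {
        Map<String, List<String>> byTable = new LinkedHashMap<>();
        for (String element : elements) {
            byTable.computeIfAbsent(destinationOf(element), k -> new ArrayList<>())
                   .add(element);
        }
        return byTable;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "table1:value1", "table1:value2", "table2:value1", "table2:value2");
        route(input).forEach((table, rows) -> System.out.println(table + " <- " + rows));
    }
}
```

With the table name as the destination, the four input strings fall into two groups, one per table. Using the whole (table, value) pair as the key would instead create four distinct destinations that point at only two tables.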

Stateful ParDo not working on Dataflow Runner

Based on Javadocs and the blog post at https://beam.apache.org/blog/2017/02/13/stateful-processing.html, I tried using a simple de-duplication example using 2.0.0-beta-2 SDK which reads a file from GCS (containing a list of jsons each with a user_id field) and then running it through a pipeline as explained below.
The input data contains about 146K events, of which only 50 are unique. The entire input is about 50MB, which should be processable in considerably less time than the 2-minute fixed window. I placed a window there only to make sure the per-key-per-window semantics hold without using a GlobalWindow. I run the windowed data through 3 parallel stages to compare the results, each of which is explained below.
[1] Just copies the contents into a new file on GCS. This ensures all the events were being processed as expected, and I verified the contents are exactly the same as the input.
[2] Combine.PerKey on the user_id, picking only the first element from the Iterable. This should essentially deduplicate the data, and it works as expected. The resulting file has the exact number of unique items from the original list of events: 50 elements.
[3] A stateful ParDo which checks whether the key has been seen already and emits an output only when it has not. Ideally, the result should match the deduplicated data from [2], but all I am seeing is 3 unique events. These 3 unique events always point to the same 3 user_ids across the few runs I did.
Interestingly, when I just switch from the DataflowRunner to the DirectRunner and run the whole process locally, the output from [3] matches [2], with only 50 unique elements as expected. So I suspect there may be an issue with the DataflowRunner for stateful ParDo.
public class StatefulParDoSample {
private static Logger logger = LoggerFactory.getLogger(StatefulParDoSample.class.getName());
static class StatefulDoFn extends DoFn<KV<String, String>, String> {
final Aggregator<Long, Long> processedElements = createAggregator("processed", Sum.ofLongs());
final Aggregator<Long, Long> skippedElements = createAggregator("skipped", Sum.ofLongs());
@StateId("keyTracker")
private final StateSpec<Object, ValueState<Integer>> keyTrackerSpec =
StateSpecs.value(VarIntCoder.of());
@ProcessElement
public void processElement(
ProcessContext context,
@StateId("keyTracker") ValueState<Integer> keyTracker) {
processedElements.addValue(1l);
final String userId = context.element().getKey();
int wasSeen = firstNonNull(keyTracker.read(), 0);
if (wasSeen == 0) {
keyTracker.write( 1);
context.output(context.element().getValue());
} else {
keyTracker.write(wasSeen + 1);
skippedElements.addValue(1l);
}
}
}
public static void main(String[] args) {
DataflowPipelineOptions pipelineOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
pipelineOptions.setRunner(DataflowRunner.class);
pipelineOptions.setProject("project-name");
pipelineOptions.setStagingLocation(GCS_STAGING_LOCATION);
pipelineOptions.setStreaming(false);
pipelineOptions.setAppName("deduper");
Pipeline p = Pipeline.create(pipelineOptions);
final ObjectMapper mapper = new ObjectMapper();
PCollection<KV<String, String>> keyedEvents =
p
.apply(TextIO.Read.from(GCS_SAMPLE_INPUT_FILE_PATH))
.apply(WithKeys.of(new SerializableFunction<String, String>() {
@Override
public String apply(String input) {
try {
Map<String, Object> eventJson =
mapper.readValue(input, Map.class);
return (String) eventJson.get("user_id");
} catch (Exception e) {
}
return "";
}
}))
.apply(
Window.into(
FixedWindows.of(Duration.standardMinutes(2))
)
);
keyedEvents
.apply(ParDo.of(new StatefulDoFn()))
.apply(TextIO.Write.to(GCS_SAMPLE_OUTPUT_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COPY_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Combine.perKey(new SerializableFunction<Iterable<String>, String>() {
@Override
public String apply(Iterable<String> input) {
return !input.iterator().hasNext() ? "empty" : input.iterator().next();
}
}))
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COMBINE_FILE_PATH).withNumShards(1));
PipelineResult result = p.run();
result.waitUntilFinish();
}
}
This was a bug in the Dataflow service in batch mode, fixed in the upcoming 0.6.0 Beam release (or HEAD if you track the bleeding edge).
Thank you for bringing it to my attention! For reference, or if anything else comes up, this was tracked by BEAM-1611.
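When comparing runner output against what the stateful DoFn is meant to produce, a plain-Java reference model can help. The sketch below ignores windows and parallelism, uses a map in place of the per-key ValueState, and has a hypothetical class name (PerKeyDedupSketch):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java reference model; not Beam code.
class PerKeyDedupSketch {

    /** Emits the value of the first element seen for each key; later ones are skipped. */
    static List<String> firstPerKey(List<String[]> keyedEvents) {
        Map<String, Integer> seenCount = new HashMap<>(); // stands in for the per-key state
        List<String> output = new ArrayList<>();
        for (String[] kv : keyedEvents) {                 // kv = {key, value}
            int wasSeen = seenCount.getOrDefault(kv[0], 0);
            if (wasSeen == 0) {
                output.add(kv[1]);
            }
            seenCount.put(kv[0], wasSeen + 1);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
                new String[]{"u1", "a"}, new String[]{"u2", "b"},
                new String[]{"u1", "c"}, new String[]{"u3", "d"});
        System.out.println(firstPerKey(events));
    }
}
```

The number of emitted values should equal the number of distinct keys, which is the property the Combine.perKey stage also satisfies.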

How can I batch-delete millions of entities using DatastoreIO and Dataflow

I'm trying to use Dataflow to delete many millions of Datastore entities and the pace is extremely slow (5 entities/s). I am hoping you can explain to me the pattern I should follow to allow that to scale up to a reasonable pace. Just adding more workers did not help.
The Datastore Admin console has the ability to delete all entities of a specific kind but it fails a lot and takes me a week or more to delete 40 million entities. Dataflow ought to be able to help me delete millions of entities that match only certain query parameters as well.
I'm guessing that some type of batching strategy should be employed (where I create a mutation with 1000 deletes in it, for example), but it's not obvious to me how I would go about that. DatastoreIO gives me just one entity at a time to work with. Pointers would be greatly appreciated.
Below is my current slow solution.
Pipeline p = Pipeline.create(options);
DatastoreIO.Source source = DatastoreIO.source()
.withDataset(options.getDataset())
.withQuery(getInstrumentQuery(options))
.withNamespace(options.getNamespace());
p.apply("ReadLeafDataFromDatastore", Read.from(source))
.apply("DeleteRecords", ParDo.of(new DeleteInstrument(options.getDataset())));
p.run();
static class DeleteInstrument extends DoFn<Entity, Integer> {
String dataset;
DeleteInstrument(String dataset) {
this.dataset = dataset;
}
@Override
public void processElement(ProcessContext c) {
DatastoreV1.Mutation.Builder mutation = DatastoreV1.Mutation.newBuilder();
mutation.addDelete(c.element().getKey());
final DatastoreV1.CommitRequest.Builder request = DatastoreV1.CommitRequest.newBuilder();
request.setMutation(mutation);
request.setMode(DatastoreV1.CommitRequest.Mode.NON_TRANSACTIONAL);
try {
DatastoreOptions.Builder dbo = new DatastoreOptions.Builder();
dbo.dataset(dataset);
dbo.credential(getCredential());
Datastore db = DatastoreFactory.get().create(dbo.build());
db.commit(request.build());
c.output(1);
count++;
if(count%100 == 0) {
LOG.info(count+"");
}
} catch (Exception e) {
c.output(0);
e.printStackTrace();
}
}
}
There is no direct way to delete entities using the current version of DatastoreIO. This version of DatastoreIO is going to be deprecated in favor of a new version (v1beta3) in the next Dataflow release. We think there is a good use case for providing a delete utility (either through an example or a PTransform), but this is still a work in progress.
For now, you can batch your deletes instead of deleting one at a time:
public static class DeleteEntityFn extends DoFn<Entity, Void> {
// Datastore max batch limit
private static final int DATASTORE_BATCH_UPDATE_LIMIT = 500;
private Datastore db;
private List<Key> keyList = new ArrayList<>();
@Override
public void startBundle(Context c) throws Exception {
// Initialize Datastore Client
// db = ...
}
@Override
public void processElement(ProcessContext c) throws Exception {
keyList.add(c.element().getKey());
if (keyList.size() >= DATASTORE_BATCH_UPDATE_LIMIT) {
flush();
}
}
@Override
public void finishBundle(Context c) throws Exception {
if (keyList.size() > 0) {
flush();
}
}
private void flush() throws Exception {
// Make one delete request instead of one for each element.
CommitRequest request =
CommitRequest.newBuilder()
.setMode(CommitRequest.Mode.NON_TRANSACTIONAL)
.setMutation(Mutation.newBuilder().addAllDelete(keyList).build())
.build();
db.commit(request);
keyList.clear();
}
}
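The buffer-and-flush pattern above can be tried out without Datastore. In this plain-Java sketch the class name BatchingSketch is hypothetical, and the Consumer passed in is a stand-in for the real commit call:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java illustration of buffer-and-flush; the commit callback is a stand-in.
class BatchingSketch<K> {
    private final int batchLimit;
    private final Consumer<List<K>> commit;
    private final List<K> buffer = new ArrayList<>();

    BatchingSketch(int batchLimit, Consumer<List<K>> commit) {
        this.batchLimit = batchLimit;
        this.commit = commit;
    }

    /** Buffers one key and flushes once the batch limit is reached. */
    void add(K key) {
        buffer.add(key);
        if (buffer.size() >= batchLimit) {
            flush();
        }
    }

    /** Flushes whatever is left, like finishBundle does. */
    void finish() {
        if (!buffer.isEmpty()) {
            flush();
        }
    }

    private void flush() {
        commit.accept(new ArrayList<>(buffer)); // one request for the whole batch
        buffer.clear();
    }

    /** Helper: sizes of the batches produced for totalKeys keys. */
    static List<Integer> batchSizes(int totalKeys, int batchLimit) {
        List<Integer> sizes = new ArrayList<>();
        BatchingSketch<Integer> batcher =
                new BatchingSketch<>(batchLimit, batch -> sizes.add(batch.size()));
        for (int i = 0; i < totalKeys; i++) {
            batcher.add(i);
        }
        batcher.finish();
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(batchSizes(1200, 500));
    }
}
```

The point of the pattern is that the number of commit requests drops from one per entity to roughly one per batchLimit entities, plus one for the remainder.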

Sum and Average Aggregation using DataFlow

I have sample data of the following type.
s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user
Now I need to find average_time_span/user, average_time_span/user_level and total_time_span/user.
I am able to compute each of the above values separately, but I have not been able to compute all of them at once. As I'm new to Dataflow, please suggest an appropriate way to do so.
static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
@Override
public void processElement(ProcessContext c) {
String[] words = c.element().split(",");
if (words.length == 5) {
Instant timestamp = Instant.parse(words[1].trim());
KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));
c.outputWithTimestamp(userTime, timestamp);
c.outputWithTimestamp(userLevelTime, timestamp);
}
}
}
public static void main(String[] args) {
TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(TestOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
.apply(ParDo.of(new ExtractUserAndUserLevelFn()))
.apply(Window.<KV<String, Long>>into(
FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
.apply(GroupByKey.<String, Long>create())
.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
public void processElement(ProcessContext c) {
String key = c.element().getKey();
Iterable<Long> docsWithThatUrl = c.element().getValue();
Long sum = 0L;
for (Long item : docsWithThatUrl)
sum += item;
KV<String, Long> userTime = KV.of(key, sum);
c.output(userTime);
}
}))
.apply(MapElements.via(new FormatAsTextFn()))
.apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
withNumShards(options.getShardsNumber()));
p.run();
}
One approach would be to first parse the lines into one PCollection that contains a record per line, and then create two PCollections of key-value pairs from that collection. Let's say you define a record representing a line like this:
static class Record implements Serializable {
final String user;
final String role;
final long duration;
// need a constructor here
}
Now, create a LineToRecordFn that creates Records from the input lines, so that you can do:
PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
.from(options.getInputFile()))
.apply(ParDo.of(new LineToRecordFn()));
You can window here, if you want. Whether you window or not, you can then create your keyed-by-role and keyed-by-user PCollections:
PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
@Override
public KV<String,Long> apply(Record r) {
return KV.of(r.role,r.duration);
}
}));
PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
@Override
public KV<String,Long> apply(Record r) {
return KV.of(r.user, r.duration);
}
}));
Now, you can get the means and sum in just a few lines:
PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
Sum.<String>longsPerKey());
Note that Dataflow does some optimization before running your job. So, while it might look like you're doing two passes over the records PCollection, that may not be true.
The Mean and Sum transforms look like they would work well for this use case. Basic usage looks like this:
PCollection<KV<String, Double>> meanPerKey =
input.apply(Mean.<String, Integer>perKey());
PCollection<KV<String, Integer>> sumPerKey = input
.apply(Sum.<String>integersPerKey());
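To check what the three requested aggregates should come out to on the sample data, here is a plain-Java sketch using streams rather than Dataflow. The class name TimeSpanAggregationSketch and its helpers are hypothetical; the three input lines are the sample rows from the question:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration of mean/sum per key; not Dataflow code.
class TimeSpanAggregationSketch {

    /** One parsed input line; only the fields needed for the aggregates. */
    static final class Row {
        final String user;
        final String level;
        final long timeSpan;

        Row(String user, String level, long timeSpan) {
            this.user = user;
            this.level = level;
            this.timeSpan = timeSpan;
        }
    }

    /** Parses "s.n., time, user, time_span, user_level" lines, skipping malformed ones. */
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String[] w = line.split(",");
            if (w.length == 5) {
                rows.add(new Row(w[2].trim(), w[4].trim(), Long.parseLong(w[3].trim())));
            }
        }
        return rows;
    }

    /** Mean time_span per key (key is chosen by the caller: user or level). */
    static Map<String, Double> meanBy(List<Row> rows, Function<Row, String> key) {
        return rows.stream().collect(Collectors.groupingBy(
                key, TreeMap::new, Collectors.averagingLong((Row r) -> r.timeSpan)));
    }

    /** Total time_span per key. */
    static Map<String, Long> sumBy(List<Row> rows, Function<Row, String> key) {
        return rows.stream().collect(Collectors.groupingBy(
                key, TreeMap::new, Collectors.summingLong((Row r) -> r.timeSpan)));
    }

    /** The three sample lines from the question. */
    static List<Row> sampleRows() {
        return parse(Arrays.asList(
                "1, 2016-01-04T1:26:13, Hari, 8, admin",
                "2, 2016-01-04T11:6:13, Gita, 2, admin",
                "3, 2016-01-04T11:26:13, Gita, 0, user"));
    }

    public static void main(String[] args) {
        List<Row> rows = sampleRows();
        System.out.println("mean per user:  " + meanBy(rows, r -> r.user));
        System.out.println("mean per level: " + meanBy(rows, r -> r.level));
        System.out.println("sum per user:   " + sumBy(rows, r -> r.user));
    }
}
```

The shape is the same as in the pipeline: parse once into records, then derive each keyed aggregate from that single parsed collection.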

Facing critical performance issue in PrimeFaces 4 & 5

I am working on a project which deals with heavy data sets. I am using PrimeFaces 4 & 5, Spring and Hibernate. I have to display very large data sets, at minimum 3000 rows with 100 columns, with various features such as sorting, filtering, row expansion etc. My problem is that my application takes 8 to 10 minutes to show the whole page, and other functions (sorting, filtering) also take a lot of time. My client is not happy at all. I could use pagination, but my client does not want paging. So I decided to use livescroll, but I failed to implement livescroll with or without lazy loading, as there were bugs in PrimeFaces regarding livescroll. I have also posted this question here earlier, but no solution was found.
This performance issue is critical and a show stopper for me. To show 3000 rows with 100 columns, the size of the page being loaded is ~10MB.
I measured the time consumed by the various JSF life-cycle phases using a PhaseListener and found that it is the browser that takes the time, parsing the response rendered by JSF. My application took only 25 sec to complete all phases.
At a minimum, I want to improve the performance of my project. Please share any idea or suggestion that could help overcome this problem.
Note: There is no database manipulation in getters and setters, and no complex business logic.
UPDATE :
This is my datatable without lazyload:
<p:dataTable
style="width:100%"
id="cdTable"
selection="#{controller.selectedArray}"
resizableColumns="true"
draggableColumns="true"
var="cd"
value="#{controller.cdDataModel}"
editable="true"
editMode="cell"
selectionMode="multiple"
rowSelectMode="add"
scrollable="true"
scrollHeight="650"
rowKey="#{cd.id}"
rowIndexVar="rowIndex"
styleClass="screenScrollStyle"
liveScroll="true"
scrollRows="50"
filterEvent="enter"
widgetVar="dt4"
>
Here everything works except filtering. Once I filter, the first page is displayed, but I am unable to sort or livescroll on the datatable. Note that I tested this in PrimeFaces 5.
2nd approach
With lazy loading, using the same datatable:
1) When I add rows="100", livescroll happens and filter & sorting work, but there are problems with row editing and row expansion.
2) When I remove rows, livescroll works with row editing, row expansion etc., but filter & sorting don't work.
My LazyLoadModel is as follows:
public class MyDataModel extends LazyDataModel<YData>
{
@Override
public List<YData> load(int first, int pageSize,
List<SortMeta> multiSortMeta, Map<String, Object> filters) {
System.out.println("multisort wala load");
return super.load(first, pageSize, multiSortMeta, filters);
}
/**
*
*/
private static final long serialVersionUID = 1L;
private List<YData> datasource;
public MyDataModel() {
}
public MyDataModel(List<YData> datasource) {
this.datasource = datasource;
}
@Override
public YData getRowData(String rowKey) {
// In a real app, a more efficient way like a query by rowKey should be
// implemented to deal with huge data
// List<YData> yList = (List<YData>) getWrappedData();
for (YData y : datasource)
{
System.out.println("datasource :"+datasource.size());
if(y.getId()!=null)
{
if (y.getId().equals(Long.valueOf(rowKey)))
{
return y;
}
}
}
return null;
}
@Override
public Object getRowKey(YData y) {
return y.getId();
}
@Override
public void setRowIndex(int rowIndex) {
/*
* The following is in ancestor (LazyDataModel):
* this.rowIndex = rowIndex == -1 ? rowIndex : (rowIndex % pageSize);
*/
if (rowIndex == -1 || getPageSize() == 0) {
super.setRowIndex(-1);
}
else
super.setRowIndex(rowIndex % getPageSize());
}
@Override
public List<YData> load(int first, int pageSize, String sortField, SortOrder sortOrder, Map<String,Object> filters) {
List<YData> data = new ArrayList<YData>();
System.out.println("sort order : "+sortOrder);
//filter
for(YData yInfo : datasource) {
boolean match = true;
for(Iterator<String> it = filters.keySet().iterator(); it.hasNext();) {
try {
String filterProperty = it.next();
String filterValue = String.valueOf(filters.get(filterProperty));
Field yField = yInfo.getClass().getDeclaredField(filterProperty);
yField.setAccessible(true);
String fieldValue = String.valueOf(yField.get(yInfo));
if(filterValue == null || fieldValue.startsWith(filterValue)) {
match = true;
}
else {
match = false;
break;
}
} catch(Exception e) {
e.printStackTrace();
match = false;
}
}
if(match) {
data.add(yInfo);
}
}
//sort
if(sortField != null) {
Collections.sort(data, new LazySorter(sortField, sortOrder));
}
int dataSize = data.size();
this.setRowCount(dataSize);
//paginate
if(dataSize > pageSize) {
try {
List<YData> subList = data.subList(first, first + pageSize);
return subList;
}
catch(IndexOutOfBoundsException e) {
return data.subList(first, first + (dataSize % pageSize));
}
}
else
return data;
}
@Override
public int getRowCount() {
// TODO Auto-generated method stub
return super.getRowCount();
}
}
I am fed up with these issues, and they have become a show stopper for me. I even tried PrimeFaces 5.
If your data is loaded from a database, I suggest writing a better LazyDataModel, like:
public class ElementiLazyDataModel<T> extends LazyDataModel<T> implements Serializable {
private Service<T> abstractFacade;
public ElementiLazyDataModel(Service<T> abstractFacade) {
this.abstractFacade = abstractFacade;
}
public Service<T> getAbstractFacade() {
return abstractFacade;
}
public void setAbstractFacade(Service<T> abstractFacade) {
this.abstractFacade = abstractFacade;
}
@Override
public List<T> load(int first, int pageSize, String sortField, SortOrder sortOrder, Map<String, Object> filters) {
PaginatedResult<T> pr = abstractFacade.findRange(new int[]{first, first + pageSize}, sortField, sortOrder, filters);
setRowCount((int) pr.getTotalItems());
return pr.getItems();
}
}
The service is some kind of backend component (like an EJB) injected into the managed bean that uses this model.
The pagination service may look like this:
@Override
public PaginatedResult<T> findRange(int[] range, String sortField, SortOrder sortOrder, Map<String, Object> filters) {
final Query query = getEntityManager().createQuery("select x from " + entityClass.getSimpleName() + " x")
.setFirstResult(range[0]).setMaxResults(range[1] - range[0]);
// Add filter sort etc.
final Query queryCount = getEntityManager().createQuery("select count(x) from " + entityClass.getSimpleName() + " x");
// Add filter sort etc.
Long rowCount = (Long) queryCount.getSingleResult();
List<T> resultList = query.getResultList();
return new PaginatedResult<T>(resultList, rowCount);
}
Note that you have to run a paginated query. With JPA as above, the ORM generates it for you; if you don't use an ORM, you have to write the paginated query yourself (for Oracle, look at top-N queries, for example: http://oracle-base.com/articles/misc/top-n-queries.php).
Remember that your return object must also contain the total record count, obtained with a fast count:
public class PaginatedResult<T> implements Serializable {
private List<T> items;
private long totalItems;
public PaginatedResult() {
}
public PaginatedResult(List<T> items, long totalItems) {
this.items = items;
this.totalItems = totalItems;
}
public List<T> getItems() {
return items;
}
public void setItems(List<T> items) {
this.items = items;
}
public long getTotalItems() {
return totalItems;
}
public void setTotalItems(long totalItems) {
this.totalItems = totalItems;
}
}
All this is useful only if your database table is set up correctly: pay attention to the execution plan of the possible queries and add the right indexes.
I hope this gives you some hints to improve your performance.
Finally, remind your end user that nobody can take in more than 10-20 records at once, so having thousands of records on one page is of little use.
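If the data stays in memory rather than coming from a paginated query, the slicing step of load can be written without relying on an IndexOutOfBoundsException. A minimal plain-Java sketch, with a hypothetical class name (PageSliceSketch) and no PrimeFaces types:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java illustration of bounds-safe page slicing.
class PageSliceSketch {

    /** Returns the rows [first, first + pageSize), clamped to the size of data. */
    static <T> List<T> page(List<T> data, int first, int pageSize) {
        if (first < 0 || pageSize <= 0 || first >= data.size()) {
            return Collections.emptyList();
        }
        int end = Math.min(first + pageSize, data.size());
        return new ArrayList<>(data.subList(first, end)); // copy, so the view is not leaked
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(page(data, 0, 3));
        System.out.println(page(data, 5, 3));
    }
}
```

Clamping the end index covers the last, shorter page directly, and a start index past the end yields an empty page instead of an exception.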
You have used the default load implementation from the PrimeFaces showcase. This is not the right implementation for your case, where you load your data from a database.
The load method should build the right query, taking into account:
1) the filter fields that are used, example:
String query = "select e from Entity e where lower(e.f1) like lower('" + filters.get(key) + "%') and ..."; // etc. for the other fields
2) the sorting columns that are used, example:
query.append(" order by ").append(sortField).append(" ").append(sortOrder == SortOrder.ASCENDING ? "asc" : "desc"); // etc. for the other columns
3) the total count for your query, with the filters from 1) applied. Example:
Long totalCount = (Long) entityManager.createQuery("select count(*) from Entity e where lower(e.f1) like lower('filterKey1%') and lower(e.f2) like lower('filterKey2%') ...").getSingleResult();
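As a variation on the string concatenation above, the filter values can be kept out of the query text by using named parameters. This is a sketch under assumptions, not the answer's own method: the class name FilterQuerySketch, the parameter names p0, p1, ... and the allowedFields whitelist are all hypothetical, and binding the values to an actual query is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration: builds query text and parameter values separately.
class FilterQuerySketch {

    /** Builds a JPQL string with one named parameter per allowed filter field. */
    static String buildJpql(String entity, Map<String, Object> filters,
                            String sortField, boolean ascending, Set<String> allowedFields) {
        StringBuilder jpql = new StringBuilder("select e from " + entity + " e");
        int i = 0;
        for (String field : filters.keySet()) {
            if (!allowedFields.contains(field)) {
                continue; // field names cannot be bound as parameters, so whitelist them
            }
            jpql.append(i == 0 ? " where " : " and ")
                .append("lower(e.").append(field).append(") like :p").append(i);
            i++;
        }
        if (sortField != null && allowedFields.contains(sortField)) {
            jpql.append(" order by e.").append(sortField).append(ascending ? " asc" : " desc");
        }
        return jpql.toString();
    }

    /** Values for the placeholders above: lower-cased prefix followed by '%'. */
    static Map<String, String> parameters(Map<String, Object> filters, Set<String> allowedFields) {
        Map<String, String> params = new LinkedHashMap<>();
        int i = 0;
        for (Map.Entry<String, Object> entry : filters.entrySet()) {
            if (!allowedFields.contains(entry.getKey())) {
                continue;
            }
            params.put("p" + i, String.valueOf(entry.getValue()).toLowerCase() + "%");
            i++;
        }
        return params;
    }
}
```

The same where clause can be reused for the count query by swapping the select part, so the row count always reflects the active filters.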

Resources