Grails bulk insert/update optimization - grails

I am importing a large amount of data from a csv file, (file size is over 100MB)
the code i'm using looks like this :
def errorLignes = []
def index = 1
csvFile.toCsvReader(['charset':'UTF-8']).eachLine { tokens ->
if (index % 100 == 0) cleanUpGorm()
index++
def order = Orders.findByReferenceAndOrganization(tokens[0],organization)
if (!order) {
order = new Orders()
}
if (tokens[1]){
def user = User.findByReferenceAndOrganization(tokens[1],organization)
if (user){
order.user = user
}else{
errorLignes.add(tokens)
}
}
if (tokens[2]){
def customer = Customer.findByCustomCodeAndOrganization(tokens[2],organization)
if (customer){
order.customer = customer
}else{
errorLignes.add(tokens)
}
}
if (tokens[3]){
order.orderType = Integer.parseInt(tokens[3])
}
// etc.....................
order.save()
}
and i'm using the cleanUpGorm method to clean session after each 100 entries
def cleanUpGorm() {
println "clean up gorm"
def session = sessionFactory.currentSession
session.flush()
session.clear()
propertyInstanceMap.get().clear()
}
I also turned 2nd level cache off
hibernate {
cache.use_second_level_cache = false
cache.use_query_cache = false
cache.provider_class = 'net.sf.ehcache.hibernate.EhCacheProvider'
}
the grails version of the project is 2.0.4 and as database i am using mysql
for every entry , i am doing 3 calls to a find
to check if the order already exists
to check if user is correct
to check if customer is correct
and finally i'm saving the order instance
the import process is too slow, i am wondering how can I speed up and optimise this code.
EDIT :
I found that the searchable plugin is also making it slower .
so , to get around this , I used the command :
searchableService.stopMirroring()
But it still not fast enough,I am finally changing the code to use groovy sql instead

This found this blog entry very useful:
http://naleid.com/blog/2009/10/01/batch-import-performance-with-grails-and-mysql/
You are already cleaning up GORM, but try cleaning every 100 entries:
def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP
propertyInstanceMap.get().clear()
Creating database indexes might help aswell and use default-storage-engine=innodb instead of MyISAM.

I'm also in the process of writing a number of services that will accomplish loads of very large datasets (multiple files of up to ~17million rows each). I initially tried the cleanUpGorm method you use, but found that, whilst it did improve things, the loading was still slow. Here's what I did to make it much faster:
Investigate what it is that is causing the app to actually be slow. I installed the Grails Melody plugin, then did a run-app then opened a browser at /monitoring. I could then see which routines took time to execute and what the worst-performing queries actually were.
Many of the Grails GORM methods map to a SQL ... where ... clause. You need to ensure that you have an index for each item used in a where clause for each query that you want to make faster, otherwise the method will become considerably slower the bigger your dataset is. This includes putting indexes on the id and version columns that are injected into each of your domain classes.
Ensure you have indexes set up for all of your hasMany and belongsTo relationships.
If the performance is still too slow, use Spring Batch. Even if you've never used it before, it should take you no time at all to set up a batch parse of a CSV file to parse into Grails domain objects. I suggest you use the grails-spring-batch plugin to do this and use the examples here to get a working implementation going quickly. It's extremely fast, very configurable and you don't have to mess around with cleaning up the session.

I had used batch insert while insert records, this is much faster than gorm cleanup method. Below example describes you how to implement it.
Date startTime = new Date()
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
(1..50000).each {counter ->
Person person = new Person()
person.firstName = "abc"
person.middleName = "abc"
person.lastName = "abc"
person.address = "abc"
person.favouriteGame = "abc"
person.favouriteActor = "abc"
session.save(person)
if(counter.mod(100)==0) {
session.flush();
session.clear();
}
if(counter.mod(10000)==0) {
Date endTime =new Date()
println "Record inserted Counter =>"+counter+" Time =>"+TimeCategory.minus(endTime,startTime)
}
}
tx.commit();
session.close();

Related

AWS DynamoDB session table keeps growing, can't delete expired sessions

the ASP.NET_SessionState table grows all the time, already at 18GB, not a sign of ever deleting expired sessions.
we have tried to execute DynamoDBSessionStateStore.DeleteExpiredSessions, but it seems to have no effect.
our system is running fine, sessions are created and end-users are not aware of the issue. however, it doesn't make sense the table keeps growing all the time...
we have triple checked permissions/security, everything seems to be in order. we use SDK version 3.1.0. what else remains to be checked?
With your table being over 18 GB, which is quite large (in this context), it does not surprise me that this isn't working after looking at the code for the DeleteExpiredSessions method on GitHub.
Here is the code:
public static void DeleteExpiredSessions(IAmazonDynamoDB dbClient, string tableName)
{
LogInfo("DeleteExpiredSessions");
Table table = Table.LoadTable(dbClient, tableName, DynamoDBEntryConversion.V1);
ScanFilter filter = new ScanFilter();
filter.AddCondition(ATTRIBUTE_EXPIRES, ScanOperator.LessThan, DateTime.Now);
ScanOperationConfig config = new ScanOperationConfig();
config.AttributesToGet = new List<string> { ATTRIBUTE_SESSION_ID };
config.Select = SelectValues.SpecificAttributes;
config.Filter = filter;
DocumentBatchWrite batchWrite = table.CreateBatchWrite();
Search search = table.Scan(config);
do
{
List<Document> page = search.GetNextSet();
foreach (var document in page)
{
batchWrite.AddItemToDelete(document);
}
} while (!search.IsDone);
batchWrite.Execute();
}
The above algorithm is executed in two parts. First it performs a Search (table scan) using a filter is used to identify all expired records. These are then added to a DocumentBatchWrite request that is executed as the second step.
Since your table is so large the table scan step will take a very, very long time to complete before a single record is deleted. Basically, the above algorithm is useful for lazy garbage collection on small tables, but does not scale well for large tables.
The best I can tell is that the execution of this is never actually getting past the table scan and you may be consuming all of the read throughput of your table.
A possible solution for you would be to run a slightly modified version of the above method on your own. You would want to call the the DocumentBatchWrite inside of the do-while loop so that records will start to be deleted before the table scan is concluded.
That would look like:
public static void DeleteExpiredSessions(IAmazonDynamoDB dbClient, string tableName)
{
LogInfo("DeleteExpiredSessions");
Table table = Table.LoadTable(dbClient, tableName, DynamoDBEntryConversion.V1);
ScanFilter filter = new ScanFilter();
filter.AddCondition(ATTRIBUTE_EXPIRES, ScanOperator.LessThan, DateTime.Now);
ScanOperationConfig config = new ScanOperationConfig();
config.AttributesToGet = new List<string> { ATTRIBUTE_SESSION_ID };
config.Select = SelectValues.SpecificAttributes;
config.Filter = filter;
Search search = table.Scan(config);
do
{
// Perform a batch delete for each page returned
DocumentBatchWrite batchWrite = table.CreateBatchWrite();
List<Document> page = search.GetNextSet();
foreach (var document in page)
{
batchWrite.AddItemToDelete(document);
}
batchWrite.Execute();
} while (!search.IsDone);
}
Note: I have not tested the above code, but it is just a simple modification to the open source code so it should work correctly, but would need to be tested to ensure the pagination works correctly on a table whose records are being deleted as it is being scanned.

Observed Lists and Maps, and Firebase, oh my. How can I improve this mess?

So this is what I ended up with to get realtime starring/liking (of communities, in my case) working, with a Firebase datastore. It's a mess and surely I'm missing some fundamentals.
Here my element gets communities, each as a Map community stored in an observed List communities. It has to rewrite that List several times as it changes each community Map based on the results of the changed star count and the user's starred state, and some other fun:
getCommunities() {
// Since we call this method a second time after user
// signed in, clear the communities list before we recreate it.
if (communities.length > 0) { communities.clear(); }
var firebaseRoot = new db.Firebase(firebaseLocation);
var communityRef = firebaseRoot.child('/communities');
// TODO: Undo the limit of 20; https://github.com/firebase/firebase-dart/issues/8
communityRef.limit(20).onChildAdded.listen((e) {
var community = e.snapshot.val();
// snapshot.name is Firebase's ID, i.e. "the name of the Firebase location",
// so we'll add that to our local item list.
community['id'] = e.snapshot.name();
print(community['id']);
// If the user is signed in, see if they've starred this community.
if (app.user != null) {
firebaseRoot.child('/users/' + app.user.username + '/communities/' + community['id']).onValue.listen((e) {
if (e.snapshot.val() == null) {
community['userStarred'] = false;
// TODO: Add community star_count?!
} else {
community['userStarred'] = true;
}
print("${community['userStarred']}, star count: ${community['star_count']}");
// Replace the community in the observed list w/ our updated copy.
communities
..removeWhere((oldItem) => oldItem['alias'] == community['alias'])
..add(community)
..sort((m1, m2) => m1["updatedDate"].compareTo(m2["updatedDate"]));
communities = toObservable(communities.reversed.toList());
});
}
// If no updated date, use the created date.
if (community['updatedDate'] == null) {
community['updatedDate'] = community['createdDate'];
}
// Handle the case where no star count yet.
if (community['star_count'] == null) {
community['star_count'] = 0;
}
// The live-date-time element needs parsed dates.
community['updatedDate'] = DateTime.parse(community['updatedDate']);
community['createdDate'] = DateTime.parse(community['createdDate']);
// Listen for realtime changes to the star count.
communityRef.child(community['alias'] + '/star_count').onValue.listen((e) {
int newCount = e.snapshot.val();
community['star_count'] = newCount;
// Replace the community in the observed list w/ our updated copy.
// TODO: Re-writing the list each time is ridiculous!
communities
..removeWhere((oldItem) => oldItem['alias'] == community['alias'])
..add(community)
..sort((m1, m2) => m1["updatedDate"].compareTo(m2["updatedDate"]));
communities = toObservable(communities.reversed.toList());
});
// Insert each new community into the list.
communities.add(community);
// Sort the list by the item's updatedDate, then reverse it.
communities.sort((m1, m2) => m1["updatedDate"].compareTo(m2["updatedDate"]));
communities = toObservable(communities.reversed.toList());
});
}
Here we toggle the star, which again replaces the observed communities List a few times as we update the count in the affected community Maps and thus rewrite the List to reflect that:
toggleStar(Event e, var detail, Element target) {
// Don't fire the core-item's on-click, just the icon's.
e.stopPropagation();
if (app.user == null) {
app.showMessage("Kindly sign in first.", "important");
return;
}
bool isStarred = (target.classes.contains("selected"));
var community = communities.firstWhere((i) => i['id'] == target.dataset['id']);
var firebaseRoot = new db.Firebase(firebaseLocation);
var starredCommunityRef = firebaseRoot.child('/users/' + app.user.username + '/communities/' + community['id']);
var communityRef = firebaseRoot.child('/communities/' + community['id']);
if (isStarred) {
// If it's starred, time to unstar it.
community['userStarred'] = false;
starredCommunityRef.remove();
// Update the star count.
communityRef.child('/star_count').transaction((currentCount) {
if (currentCount == null || currentCount == 0) {
community['star_count'] = 0;
return 0;
} else {
community['star_count'] = currentCount - 1;
return currentCount - 1;
}
});
// Update the list of users who starred.
communityRef.child('/star_users/' + app.user.username).remove();
} else {
// If it's not starred, time to star it.
community['userStarred'] = true;
starredCommunityRef.set(true);
// Update the star count.
communityRef.child('/star_count').transaction((currentCount) {
if (currentCount == null || currentCount == 0) {
community['star_count'] = 1;
return 1;
} else {
community['star_count'] = currentCount + 1;
return currentCount + 1;
}
});
// Update the list of users who starred.
communityRef.child('/star_users/' + app.user.username).set(true);
}
// Replace the community in the observed list w/ our updated copy.
communities.removeWhere((oldItem) => oldItem['alias'] == community['alias']);
communities.add(community);
communities.sort((m1, m2) => m1["updatedDate"].compareTo(m2["updatedDate"]));
communities = toObservable(communities.reversed.toList());
print(communities);
}
There's also some other craziness where we have to get the list of communities again when app.changes because we only load app.user after the app and list initially load, and now that we have the user we need to turn on the appropriate stars. So my attached() looks like:
attached() {
app.pageTitle = "Communities";
getCommunities();
app.changes.listen((List<ChangeRecord> records) {
if (app.user != null) {
getCommunities();
}
});
}
There, it seems I could just be getting the stars and updating said each affected community Map, then repopulating the observed communities List, but that's the least of it.
The full thing: https://gist.github.com/DaveNotik/5ccdc9e74429cf87d641
How can I improve all this Map/List management, e.g. where every time I change a community Map, I have to rewrite the whole communities List? Should I be thinking of it differently?
What about all this querying Firebase? Surely, there's a better way, but it seems I need to do a lot to keep it realtime, and also the element gets attached and detached, so it seems I need to run getCommunities() each time. Unless the OOP way is objects get created, and they're always there to be observed whenever the element is attached? I'm missing those fundamentals.
This app.changes business to handle the case where we load the list before we have the app.user (which then means we want to load her stars) - is there a better way?
Other ridiculousness?
Big question, I know. Thank you for helping me get a handle on the right approach as I move forward!
I think there is two different ways to choose, if you want to keep a data of your application in real time sync with server database:
1 Polling (pull method ie. a client pulls the data from server)
Application polls ie. requests the updated data from the server. Polling can be automatic (for example with interval of 60s) or requested by user (= refresh). The short automatic interval will cause high load on server and with long interval you lose real time feeling.
2 Full-duplex (push method ie. server can push the data to the client)
An application and a server have full-duplex connection in between and server is able to send the data or a notification of the data available to the client. Then the client can decide whether or not to retrieve the data.
This method is a modern one, because it'll keep the net traffic and the server load in minimum and yet providing a real time updates.
The firebase boasts with this kind of updates, but I'm not sure is it full-dublex or just a clever way of polling. Websocket protocol is a real full-duplex connection and dart server supports it.
The updated data from a server can include:
1 A full dataset
Basically the server sends a full dataset (=initial query) and the server doesn't "know" anything about updated data. This is easiest way to go, if you have reasonable small datasets. Many times you'll have a very small datasets among the big ones, so this way can be useful.
2 A dataset including a new data only
The server can send a dataset based on modified timestamp ie. every time a record in the database changes, a timestamp for update will be saved and the query can be filtered based on this timestamp. In other words application knows when it has last updated the data and then requests newer data.
3 A changed record
A server keeps a track of updated data and sends it to the application. The data can be sent record by record when changes occurs or server can collect the data for a bigger chunks to be sent. This method requires a server to keep a track of every client connected in order to send a correct data to each client. When you add an authentication process for clients ie. not every data can be send to all, it can get quite complicated.
I think the easiest way is to use the method number 2 for updated data.
Last thing...
What to do with the data received?
1 Handle everything as a new
If application receives an updated data, it will destroy/clear all the lists and maps and recreate/refill them with the new data. Typical problems with this are a user loses a current position on a page or the data user were looking jumps around. If application has modified or extended an old data for some reason, all those modifications will be lost. This method works ok, if a user requests a refresh.
2 Update only the changed data
The application never clears initial list or maps, it just updates them with a new received data. Typically you will construct a new combined map from queried data for your specific need (for example a certain view). The combined map has already all information you want to show in the specific view (default values even if the initial queries didn't had the data for the field) and you just update a new values in it.
If the updated information needs a new member in the list you just add it in the end.
If the updated information requires a deletion from the list, it might be a good thing to use extra field "active" and filter the list/map with it. With filtering you won't lose any referencies or so.
If you need to sort a data or filter it, it should be done by a view or user request. Basically the data is stored in the application and updated as needed. When a user needs to see the data in a specific way, the view should show the data a proper way. This is called model-controller-view and the main idea is to separate the data from the view.
I'm sorry this long answer didn't answer any of your questions, but I tried to cut this challenge to a smaller chunks. Many times you can see an interface between these chunks and you can design and organize your code nicely by using these interfaces.

DataNucleus Memory/Cache Handling for large update/insert

We are running application in Spring context using DataNucleus as our ORM mapping and mysql as our database.
Our application have a daily import job of some data feed into our database. The size of the data feed translate into around 1 millions row of insert/update. The performance of the import start out to be very good but then it degrade overtime (as the number of executed query increase) and at some point the application freeze or stop responding. We will have to wait for the whole job to complete before the application response again.
This behavior looks very like a memory leak to us and we have been looking hard at our code to catch any potential problem, however the problem didn't go away. One interesting thing we found from the heap dump is that org.datanucleus.ExecutionContextThreadedImpl (or the HashSet/HashMap) hold 90% of our memory (5GB) during the import. (I have attahed the screenshot of the dump below). My research on the internet said this reference is the Level1 Cache (not sure am i correct). My question is during a large import, how can i limit/control the size of the level1 cache. May be ask DN to not cache during my import?
If that's not the L1 cache, what's the possible cause of my memory issue?
Our code use a transaction for every insert to prevent locking of large chunk of data in the database. It's call the flush method every 2000 insert
As a temporary fix, we moved our import process to run overnight when no one is using our app. Obviously, this cannot go on forever. Please could someone at least point us in the right direction so that we can do more research and hoping we can find a fixes.
Would be good if someone have knowledge of decoding the heap dump
Your help would be very much appreciated by all of us here. Many thanks!
https://s3-ap-southeast-1.amazonaws.com/public-external/datanucleus_heap_dump.png
https://s3-ap-southeast-1.amazonaws.com/public-external/datanucleus_dump2.png
Code Below - Caller of this method does not have a transaction. This method will process one import object per call, and we need to process around 100K of these object daily
#Override
#PreAuthorize("(hasUserRole('ROLE_ADMIN')")
#Transactional(propagation = Propagation.REQUIRED)
public void processImport(ImportInvestorAccountUpdate account, String advisorCompanyKey) {
ImportInvestorAccountDescriptor invAccDesc = account
.getInvestorAccount();
InvestorAccount invAcc = getInvestorAccountByImportDescriptor(
invAccDesc, advisorCompanyKey);
try {
ParseReportingData parseReportingData = ctx
.getBean(ParseReportingData.class);
String baseCCY = invAcc.getBaseCurrency();
Date valueDate = account.getValueDate();
ArrayList<InvestorAccountInformationILAS> infoList = parseReportingData
.getInvestorAccountInformationILAS(null, invAcc, valueDate,
baseCCY);
InvestorAccountInformationILAS info = infoList.get(0);
PositionSnapshot snapshot = new PositionSnapshot();
ArrayList<Position> posList = new ArrayList<Position>();
Double totalValueInBase = 0.0;
double totalQty = 0.0;
for (ImportPosition importPos : account.getPositions()) {
Asset asset = getAssetByImportDescriptor(importPos
.getTicker());
PositionInsurance pos = new PositionInsurance();
pos.setAsset(asset);
pos.setQuantity(importPos.getUnits());
pos.setQuantityType(Position.QUANTITY_TYPE_UNITS);
posList.add(pos);
}
snapshot.setPositions(posList);
info.setHoldings(snapshot);
log.info("persisting a new investorAccountInformation(source:"
+ invAcc.getReportSource() + ") on " + valueDate
+ " of InvestorAccount(key:" + invAcc.getKey() + ")");
persistenceService.updateManagementEntity(invAcc);
} catch (Exception e) {
throw new DataImportException(invAcc == null ? null : invAcc.getKey(), advisorCompanyKey,
e.getMessage());
}
}
Do you use the same pm for the entire job?
If so, you may want to try to close and create new ones once in a while.
If not, this could be the L2 cache. What setting do you have for datanucleus.cache.level2.type? It think it's a weak map by default. You may want to try none for testing.

Grails flushing not working

I'm working on processing a large csv file and I found this article about batch import: http://naleid.com/blog/2009/10/01/batch-import-performance-with-grails-and-mysql/. I tried to do the same, but it seems to have no effect.
Should the instances be viewable in the database after each flushing? Because now there is either 0 or all of the entites when I try to query 'SELECT COUNT(*) FROM TABLE1', so it looks like the instances are commited all at once.
Then I also noticed that the import works quickly when importing for the first time to the blank table, but when the table is full and the entity should be either updated or saved as new, the whole process is enormously slow. It's mainly because of the memory not being cleaned and decreases to 1MB or less and the app gets stuck. So is it because of not flushing the session?
My code for importing is here:
public void saveAll(List<MedicalInstrument> listMedicalInstruments) {
log.info("start saving")
for (int i = 0; i < listMedicalInstruments.size() - 1; i++) {
def medicalInstrument = listMedicalInstruments.get(i)
def persistedMedicalInstrument = MedicalInstrument.findByCode(medicalInstrument.code)
if (persistedMedicalInstrument) {
persistedMedicalInstrument.properties = medicalInstrument.properties
persistedMedicalInstrument.save()
} else {
medicalInstrument.save()
}
if ((i + 1) % 100 == 0) {
cleanUpGorm()
if ((i + 1) % 1000 == 0) {
log.info("saved ${i} entities")
}
}
}
cleanUpGorm()
}
protected void cleanUpGorm() {
log.info("cleanin GORM")
def session = sessionFactory.currentSession
session.flush()
session.clear()
propertyInstanceMap.get().clear()
}
Thank you very much for any help!
Regards,
Lojza
P.S.: my JVM memory has 252.81 MB in total, but it's only testing environment for me and 3 other people.
I had a similar problem once. Then I realized the reason was because I was doing it in a grails service, which was transactional by default. So every call to the methods in the service was itself wrapped in a transaction which made changes to that database linger around until the method completes in which case the results are not flushed.
From my experience they still won't necessarily be visible in the database until they are all committed at the end. I'm using Oracle and the only way I've been able to get them to commit in batches (so visible in the database) is by creating separate transactions for each batch and close it out after the flush. That, however, resulted in errors at the end of the process on the final flush. I didn't have time to figure it out, and using the process above I NEVER had issues no matter how large the data load was.
Using this method still helps tremendously - it really does flush and clear the session. You can watch your memory utilization to see that.
As far as updating a table with records do you have indexes on the table? Sometimes the indexes will slow down mass insert/updates like this because the database is trying to keep the index fresh. Maybe disable the index before the import/update and then enable when it is done?

Returning Updated Results from DBSet.SqlQuery

I want to use the following method to flag people in the Person table so that they can be processed. These people must be flagged as "In Process" so that other threads do not operate on the same rows.
In SQL Management Studio the query works as expected. When I call the method in my application I receive the row for the person but with the old status.
Status is one of many navigation properties off of Person and when this query returns it is the only property returned as a proxy object.
// This is how I'm calling it (obvious, I know)
var result = PersonLogic.GetPeopleWaitingInLine(100);
// And Here is my method.
public IList<Person> GetPeopleWaitingInLine(int count)
{
const string query =
#"UPDATE top(#count) PERSON
SET PERSON_STATUS_ID = #inProcessStatusId
OUTPUT INSERTED.PERSON_ID,
INSERTED.STATUS_ID
FROM PERSON
WHERE PERSON_STATUS_ID = #queuedStatusId";
var queuedStatusId = StatusLogic.GetStatus("Queued").Id;
var inProcessStatusId = StatusLogic.GetStatus("In Process").Id;
return Context.People.SqlQuery(query,
new SqlParameter("count", count),
new SqlParameter("queuedStateId", queuedStateId),
new SqlParameter("inProcessStateId", inProcessStateId)
}
// update | if I refresh the result set then I get the correct results
// but I'm not sure about this solution since it will require 2 DB calls
Context.ObjectContext().Refresh(RefreshMode.StoreWins, results);
I know it is an old question but this could help somebody.
It seems you are using a global Context for your query, EF is designed to retain cache info, if you allways need fresh data must use a fresh context to retrieve it. as this:
using (var tmpContext = new Contex())
{
// your query here
}
This create the context and recycle it. This means no cache was stored and next time it gets fresh data from database not from cache.

Resources