Insert 10,000,000+ rows in Grails

I've read a lot of articles recently about populating a Grails table from a huge amount of data, but I seem to have hit a ceiling. My code is as follows:
import groovy.sql.Sql

class LoadingService {
    def sessionFactory
    def dataSource
    def propertyInstanceMap = org.codehaus.groovy.grails.plugins.DomainClassGrailsPlugin.PROPERTY_INSTANCE_MAP

    def insertFile(fileName) {
        InputStream inputFile = getClass().classLoader.getResourceAsStream(fileName)
        def pCounter = 1
        def mCounter = 1
        Sql sql = new Sql(dataSource)
        inputFile.splitEachLine(/\n|\r|,/) { line ->
            line.each { value ->
                if(value.equalsIgnoreCase('0')) {
                    pCounter++
                    return
                }
                sql.executeInsert("insert into Patient_MRNA (patient_id, mrna_id, value) values (${pCounter}, ${mCounter}, ${value.toFloat()})")
                pCounter++
            }
            if(mCounter % 100 == 0) {
                cleanUpGorm()
            }
            pCounter = 1
            mCounter++
        }
    }

    def cleanUpGorm() {
        sessionFactory.currentSession.clear()
        propertyInstanceMap.get().clear()
    }
}
I have disabled the second-level cache, I'm using assigned ids, and I am explicitly handling this many-to-many relationship through a join domain rather than hasMany and belongsTo.
My speed increased monumentally after applying these methods, but after a while the inserts slow down to the point of almost stopping, compared to about 623,000 per minute at the beginning.
Is there some other memory leak that I should be aware of, or have I just hit the ceiling for batch inserts in Grails?
To be clear, it takes about 2 minutes to insert 1.2 million rows, but then they start to slow down.
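For reference, a minimal sketch of what such an explicit join domain with assigned ids might look like; the class name and fields here are assumptions based on the insert statement above, not the actual code:

class PatientMrna {
    long patientId
    long mrnaId
    float value

    static mapping = {
        table 'Patient_MRNA'        // matches the table used in the raw insert above
        id generator: 'assigned'    // matches "I'm using assigned ids"
        version false               // no optimistic-locking column needed for bulk loads
        cache false                 // keep the second-level cache out of the picture
    }
}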

Try doing batch inserts; it's much more efficient:
def updateCounts = sql.withBatch { stmt ->
    stmt.addBatch("insert into TABLENAME ...")
    stmt.addBatch("insert into TABLENAME ...")
    stmt.addBatch("insert into TABLENAME ...")
    ...
}
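For example, the loop from the question could be adapted along these lines. This is only a sketch, reusing the Patient_MRNA table and counters from the question; the batch size of 1000 is an arbitrary choice:

import groovy.sql.Sql

def insertFile(fileName) {
    InputStream inputFile = getClass().classLoader.getResourceAsStream(fileName)
    Sql sql = new Sql(dataSource)
    int pCounter = 1
    int mCounter = 1
    // one prepared statement, parameters sent to the database in batches of 1000
    sql.withBatch(1000, "insert into Patient_MRNA (patient_id, mrna_id, value) values (?, ?, ?)") { ps ->
        inputFile.splitEachLine(/\n|\r|,/) { line ->
            line.each { value ->
                if (!value.equalsIgnoreCase('0')) {
                    ps.addBatch([pCounter, mCounter, value.toFloat()])
                }
                pCounter++
            }
            pCounter = 1
            mCounter++
        }
    }
}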

I have fought with this in earlier versions of Grails. Back then I resorted to either simply running the batch manually in proper chunks or using another tool for the batch import, such as Pentaho Data Integration (or another ETL tool, or DIY).

Related

efficient way to do bulk updates where values are assigned serially from a list?

Let's say there is a domain class A with a property p.
class A {
    Integer p
}
I have a list of A, i.e.
def lis = A.list()
and then I have a list of numbers
def num = [4, 1, 22, ......]
What is the most efficient way to do a bulk update in Grails where each object of A is assigned a number from num serially?
One way could be:
for(int i = 0; i < lis.size(); i++) {
    lis[i].p = num[i]
    lis[i].save(flush: true)
}
But I assume this solution is not efficient. Can this be achieved using HQL or some other efficient method? I appreciate any help! Thanks!
If your list of A and your list of numbers are very large (if lis.size() is, for example, equal to 10,000), then you should do this:
A.withNewTransaction { status -> // run the whole update in a single new transaction
    int stepForFlush = 100
    int totalLisSize = A.count()
    def lis
    for(int k = 0; k < totalLisSize; k += stepForFlush) {
        lis = A.list(max: stepForFlush, offset: k) // load only 100 elements into the current Hibernate session
        ...
        for(int i = 0; i < lis.size(); i++) {
            lis[i].p = num[k + i]
            lis[i].save()
        }
        A.withSession { session ->
            session.flush() // flush changes to the database
            session.clear() // clear the Hibernate session; the 100 elements are no longer attached to it,
                            // so they become eligible for garbage collection and you avoid keeping
                            // every element you process in memory
        }
    } // next iteration, k += 100
} // the transaction is closed and committed here: all changes flushed above are committed to the database
Note:
In this solution you do not load all your A elements into memory at once, which helps when A.list().size() is very large.
While I agree that your solution is likely to be inefficient, that's mostly because you're flushing on every save. You can get a performance boost by using a transaction, which automatically causes a flush when it commits:
A.withTransaction { status ->
    ...
    for(int i = 0; i < lis.size(); i++) {
        lis[i].p = num[i]
        lis[i].save()
    }
}
Of course, if you can use the @Transactional annotation, that would be better.
Yes, you can use HQL, but the fundamental problem here is that your list of numbers is arbitrary, so you'd need multiple HQL queries: one for each update.
Try the transactional approach first, since that's the easiest to set up.
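A sketch of the annotation-based variant, assuming a Grails service; depending on the Grails version the import is grails.transaction.Transactional (2.3+) or org.springframework.transaction.annotation.Transactional:

import grails.transaction.Transactional

class BulkUpdateService {

    @Transactional
    void assignNumbers(List<Integer> num) {
        def lis = A.list()
        lis.eachWithIndex { a, i ->
            a.p = num[i]
            a.save()   // no explicit flush; everything is flushed once when the transaction commits
        }
    }
}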

Grails: how to get last inserted record matching query

Getting the last record is trivial in SQL (e.g. for MySQL). Given these domain classes:
class ExchangeRate {
    Currency from
    Currency to
    BigDecimal rate // e.g. 10
    Date dateCreated
}
class Currency {
    String iso
    // etc.
}
SQL to get the latest is trivial:
Select max(id), rate
from exchange_rate
where from_id = 1
and to_id = 3
or
select rate
from exchange_rate
where from_id = 2
order by id desc
limit 1
The question is, how does one do this efficiently in Grails? I only want a single result.
This obviously won't work:
def query = ExchangeRate.where {
    from == from && to == to && id == max(id)
}
ExchangeRate exchangeRate = query.find()
There have been several posts on this, but none with an answer I could figure out how to apply (I am a SQL guy, I don't know the Hibernate query language, and I would prefer a solution that does not involve it if there is one).
If there were an easy way to run SQL directly without having to hand-manage result sets, that would work (as we will never use a DB other than MySQL).
I am sure it could be done with sort and limit, but a) I haven't found an example I could copy, and b) I would assume this to be inefficient, because it appears that the sorting and limiting are done in code, not in SQL?
This example is in the documentation:
Book.findAll("from Book as b where b.author=:author",
             [author: 'Dan Brown'], [max: 10, offset: 5])
could lead to this:
def exchangeRates = ExchangeRate.findAll("from ExchangeRate as e where e.from = :from order by id desc", [from: from], [max: 1])
if (exchangeRates.size() == 1) {
    return exchangeRates.first().rate
}
return null
Is there a better way (e.g. one that doesn't use the low-level Hibernate query language, one that uses SQL instead, or one that is more efficient)?
Try using a subquery according to the documentation.
def query = ExchangeRate.where {
    id == max(id).of { from == fromValue } && to == toValue
}
query.find()
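If you prefer to avoid both HQL and the where-query DSL, a criteria query also pushes the ordering and the limit down to SQL. A sketch, with fromValue and toValue standing in for the two Currency instances:

def latest = ExchangeRate.createCriteria().get {
    eq('from', fromValue)
    eq('to', toValue)
    order('id', 'desc')   // ordering happens in the database
    maxResults(1)         // limit happens in the database as well
}
return latest?.rate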

Concurrency Issue Grails

I am using this code to update a row:
SequenceNumber.withNewSession {
    def hibSession = sessionFactory.getCurrentSession()
    Sql sql = new Sql(hibSession.connection())
    def rows = sql.rows("select for update query")
}
In this query I am updating the number. Initially the sequence number is 1200, and every time this code runs it should be incremented by 1.
I am running this code 5 times in a loop, but it is not flushing the Hibernate session, so every time I get the same number, 1201.
I have also used
hibSession.clear()
hibSession.flush()
but no success.
If I use the following code then it works fine:
SequenceNumber.withNewSession {
    Sql sql = new Sql(dataSource)
    def rows = sql.rows("select for update query")
}
Every time I get a new number.
Can anybody tell me what's wrong with the above code?
Try it with a transaction, and flush at the end, like:
SequenceNumber.withTransaction { TransactionStatus status ->
    ...
    status.flush()
}
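For illustration, the full block might look roughly like this; the table and column names are placeholders, since the original select-for-update query is not shown:

import groovy.sql.Sql
import org.springframework.transaction.TransactionStatus

SequenceNumber.withTransaction { TransactionStatus status ->
    Sql sql = new Sql(dataSource)
    // placeholder query: lock the row, read the current value, then bump it
    def row = sql.firstRow("select seq_value from sequence_number where id = ? for update", [1L])
    sql.executeUpdate("update sequence_number set seq_value = ? where id = ?", [row.seq_value + 1, 1L])
    status.flush()
}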

Why is the sql.rows Groovy method so slow?

I tried to fetch some data with the sql.rows() Groovy method and it took a very long time to return the values.
So I tried the "standard" way and it's much much faster (150 times faster).
What am I missing ?
Look at the code below: the first method returns results in about 2500 ms and the second in 15 ms!
import groovy.sql.Sql
import java.sql.Connection
import java.sql.ResultSet
import java.sql.Statement

class MyService {
    javax.sql.DataSource dataSource

    def SQL_QUERY = "select M_FIRSTNAME as firstname, M_LASTNAME as lastname, M_NATIONALITY as country from CT_PLAYER order by M_ID asc"

    def getPlayers1(int offset, int maxRows) {
        def t = System.currentTimeMillis()
        def sql = new Sql(dataSource)
        def rows = sql.rows(SQL_QUERY, offset, maxRows)
        println "time1 : ${System.currentTimeMillis() - t}"
        return rows
    }

    def getPlayers2(int offset, int maxRows) {
        def t = System.currentTimeMillis()
        Connection connection = dataSource.getConnection()
        Statement statement = connection.createStatement()
        statement.setMaxRows(offset + maxRows - 1)
        ResultSet resultSet = statement.executeQuery(SQL_QUERY)
        def l_list = []
        if(resultSet.absolute(offset)) {
            while (true) {
                l_list << [
                    'firstname': resultSet.getString('firstname'),
                    'lastname' : resultSet.getString('lastname'),
                    'country'  : resultSet.getString('country')
                ]
                if(!resultSet.next()) break
            }
        }
        resultSet.close()
        statement.close()
        connection.close()
        println "time2 : ${System.currentTimeMillis() - t}"
        return l_list
    }
}
When you call sql.rows, groovy eventually calls SqlGroovyMethods.toRowResult for each row returned by the resultSet.
This method interrogates the ResultSetMetaData for the resultSet each time to find the column names, and then fetches the data for each of these columns from the resultSet into a Map which it adds to the returned List.
In your second example, you directly get the columns required by name (as you know what they are), and avoid having to do this lookup every row.
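If you want to stay with groovy.sql but skip the per-row map building, eachRow lets you read the columns by name yourself. A sketch reusing dataSource and SQL_QUERY from the service above:

def getPlayers3(int offset, int maxRows) {
    def sql = new Sql(dataSource)
    def list = []
    // the closure receives the live result-set row; only the three needed columns are read, by name
    sql.eachRow(SQL_QUERY, offset, maxRows) { row ->
        list << [firstname: row.firstname, lastname: row.lastname, country: row.country]
    }
    return list
}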
I think I found the reason this method is so slow: statement.setMaxRows() is never called!
That means that a lot of useless data is sent by the database (when you only want to see the first pages of a large data grid).
I wonder how your tests would turn out if you tried setFetchSize instead of setMaxRows. A lot of this has to do with the underlying JDBC driver's default behavior.
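One way to experiment with that is Sql.withStatement, which lets you configure the underlying Statement before it runs. A sketch only; whether setMaxRows or setFetchSize helps more depends on the driver:

def sql = new Sql(dataSource)
sql.withStatement { stmt ->
    stmt.maxRows = offset + maxRows - 1   // cap how many rows the server sends back
    // stmt.fetchSize = maxRows           // or tune how many rows are fetched per round trip
}
def rows = sql.rows(SQL_QUERY, offset, maxRows)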

list.find(closure) and executing against that value

Really my question is "Can the code sample below be even smaller?" Basically, the code sample is designed to first look through a list of objects, find the most granular (in this case it is a branch), and then query backwards depending on what object it finds:
1 - If it finds a branch, return the findAllBy against the branch
2 - If it finds a department, return the findAllBy against the department
3 - If it finds an organization, return the findAllBy against the organization
The goal is to find the most granular object (which is why order is important), but do I need to have two separate blocks (one to define the objects, the other to check if they exist)? Or can those two executions be made into one command...
def resp
def srt = [sort: "name", order: "asc"]
def branch = listObjects.find { it instanceof Branch }
def department = listObjects.find { it instanceof Department }
def organization = listObjects.find { it instanceof Organization }
resp = !resp && branch ? Employees.findAllByBranch(branch, srt) : resp
resp = !resp && department ? Employees.findAllByDepartment(department, srt) : resp
resp = !resp && organization ? Employees.findAllByOrganization(organization, srt) : resp
return resp
What I'm thinking is something along the lines of this:
def resp
resp = Employees.findAllByBranch(listObjects.find { it instanceof Branch })
resp = !resp ? Employees.findAllByDepartment(listObjects.find { it instanceof Department }) : resp
resp = !resp ? Employees.findAllByOrganization(listObjects.find { it instanceof Organization }) : resp
But I believe that will throw an exception since those objects might be null.
You can shorten it up a bit more with findResult instead of a for in loop with a variable you need to def outside:
def listObjects // = some predetermined list that you've apparently created
def srt = [sort: "name", order: "asc"]
def result = [Branch, Department, Organization].findResult { clazz ->
    listObjects?.find { it.class.isAssignableFrom(clazz) }?.with { foundObj ->
        Employees."findAllBy${clazz.name}"(foundObj, srt)
    }
}
findResult is similar to find, but it returns the result from the first non-null item rather than the item itself. It avoids the need for a separate collection variable outside of the loop.
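As a tiny illustration of that difference on a plain list:

// find returns the first element for which the closure is truthy;
// findResult returns the first non-null value the closure produces
assert [1, 2, 3].find       { it > 1 ? it * 10 : null } == 2
assert [1, 2, 3].findResult { it > 1 ? it * 10 : null } == 20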
Edit: what I had previously didn't quite match the behavior that I think you were looking for (I don't think the other answers do either, but I could be misunderstanding). You have to ensure that there's something found in the list before doing the findAllBy or else you could pull back null items which is not what you're looking for.
In real, production code, I'd actually do things a bit differently though. I'd leverage the JVM type system to only have to spin through the listObjects once and short circuit when it found the first Branch/Department/Organization like this:
def listObjects
def sort = [sort: "name", order: "asc"]
def result = listObjects?.findResult { findEmployeesFor(it, sort) }
... // then have these methods to actually exercise the type-specific findEmployeesFor
def findEmployeesFor(Branch branch, sort) { Employees.findAllByBranch(branch, sort) }
def findEmployeesFor(Department department, sort) { Employees.findAllByDepartment(department, sort) }
def findEmployeesFor(Organization organization, sort) { Employees.findAllByOrganization(organization, sort) }
def findEmployeesFor(Object obj, sort) { return null } // if listObjects can hold objects other than branch/department/organization
I think that this code is actually clearer and it reduces the number of times we iterate over the list and the number of reflection calls we need to make.
Edit:
A for-in loop is more efficient, since you want to break processing on the first non-null result (i.e. in Groovy we cannot break out of a closure iteration with "return" or "break").
def resp
for(clazz in [Branch, Department, Organization]) {
    resp = Employees."findAllBy${clazz.name}"(listObjects?.find { clazz.isInstance(it) })
    if(resp) break
}
if(resp) // do something...
Original:
List results = [Branch, Department, Organization].collect { clazz ->
    Employees."findAllBy${clazz.name}"(listObjects?.find { clazz.isInstance(it) })
}
Enjoy Groovy ;--)
I think #virtualeyes nearly had it, but instead of a collect (which, as he says, you can't break out of), you want to use a find, as that stops running at the first valid result it gets:
List results = [Branch, Department, Organization].find { clazz ->
    Employees."findAllBy${clazz.name}"(listObjects?.find { clazz.isInstance(it) })
}
