Jenkins Resource Locks: Get lock for multiple Node-specific resources

I'd like to create multiple Resources for a certain node, or use a reusable type for several nodes.
In this case it is "RAM requirement", so the resource name would be, e.g., 1GBRAM, or alternatively 1GBRAM_Nodexy if I need to specify this on a per-node basis.
In the end I'd like to limit the number of concurrent Jobs based on the peak amount of memory a Job uses on a node, to avoid hangs caused by low memory on the server, and I can set the amount of RAM that is available for executors.
Different Nodes will have different amounts of RAM, and individual Jobs have different RAM requirements.
So I would like to configure each Job with its RAM requirements:
lock(resource: '1GBRAM_Nodexy', quantity: 8)
Is this achievable with Pipelines and lockable resources?
Is there an alternative, better way to achieve this? Ideally, the locks can be checked before the slave is selected, and the best suited node is picked.
I read about resource locks and labels. I did not find any node-specific section, nor a way to acquire multiple items of the same resource.
lock(resource: '1GBRAM_Nodexy', quantity: 8)
I expect each run of the Job to lock the equivalent amount of RAM on the slave node it uses. If not enough "RAM" units are free, the Job is not run on that node.

I think you can't quite do what you're looking for, but perhaps you can come close.
First, what you want is to use label instead of resource. You'd define as many 1GB-representing resources (say, GB1, GB2, GB3, etc.) as you have RAM, giving them all the same label (say, GB), and then use a lock statement like this (e.g., if the job in question needed 4GB of memory):
lock(label: 'GB', quantity: 4)
This will lock 4 of the resources that have this GB label, waiting if needed until it's able to do so, and then will release them when leaving the locked scope.
The node-specific locking is the trickier part. If you were content with using a different label per node (NodeA_GB, NodeB_GB, etc.), and with "pinning" jobs to particular nodes, then the solution above would suffice, e.g.:
// Require 4GB of memory on NodeA
lock(label: 'NodeA_GB', quantity: 4)
What I'm not aware of is a way to have a specific node selected because it has RAM available -- i.e., your "the locks can be checked before the slave is selected, and the best suited node is picked" statement. But you could at least detect the node that was allocated by a regular agent statement, using env.NODE_NAME, then use that as part of your node-specific lock label:
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                // This assumes that all possible nodes have a label like this defined with their name in it
                lock(label: "${NODE_NAME}_GB", quantity: 4) {
                    // ... build steps
                }
            }
        }
    }
}
Incidentally, I'm using a label+quantity approach myself, but in order to achieve lock-based throttling -- restricting the total number of concurrent builds across all branches of a multibranch pipeline job -- since the Throttle Concurrent Builds plugin went through a period of not being maintained and had some significant open issues during that time.
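For illustration, a minimal sketch of that throttling pattern, assuming (hypothetically) three lockable resources THROTTLE_1 through THROTTLE_3 that all carry the label build-throttle; at most three branch builds run concurrently, and the rest queue on the lock:
pipeline {
    agent any
    stages {
        stage('Build') {
            steps {
                // Each running build holds one 'build-throttle' resource until it finishes.
                lock(label: 'build-throttle', quantity: 1) {
                    // ... build steps
                }
            }
        }
    }
}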

Addition to the accepted answer (edit queue is full):
As for selecting a specific node because it has RAM available -- i.e., your "the locks can be checked before the slave is selected, and the best suited node is picked" statement -- the org.jenkins.plugins.lockableresources.LockableResourcesManager class can be used to check the free memory resources on the nodes and decide which node to use, for example:
def nodeFreeGbThreshold = 2
def resourceManager = new org.jenkins.plugins.lockableresources.LockableResourcesManager()
def nodeAFreeGb = resourceManager.getFreeResourceAmount("NodeA_GB")
def nodeBFreeGb = resourceManager.getFreeResourceAmount("NodeB_GB")
def agentLabel = nodeAFreeGb < nodeFreeGbThreshold ? 'NodeB' : 'NodeA' // prefer NodeA when it has at least the threshold of free GB resources
pipeline {
    agent { label agentLabel }
    stages {
        stage('Build') {
            steps {
                // This assumes that all possible nodes have a label like this defined with their name in it
                lock(label: "${NODE_NAME}_GB", quantity: 4) {
                    // ... build steps
                }
            }
        }
    }
}
and for scripted pipelines:
def nodeFreeGbThreshold = 2
def resourceManager = new org.jenkins.plugins.lockableresources.LockableResourcesManager()
def nodeAFreeGb = resourceManager.getFreeResourceAmount("NodeA_GB")
def nodeBFreeGb = resourceManager.getFreeResourceAmount("NodeB_GB")
def agentLabel = nodeAFreeGb < nodeFreeGbThreshold ? 'NodeB' : 'NodeA' // prefer NodeA when it has at least the threshold of free GB resources
node(agentLabel) {
    // This assumes that all possible nodes have a label like this defined with their name in it
    lock(label: "${NODE_NAME}_GB", quantity: 4) {
        // ... build steps
    }
}

Related

Neo4J Very Large Admin Import with limited RAM

I am importing several TB of CSV data into Neo4J for a project I have been working on. I have enough fast storage for the estimated 6.6TiB; however, the machine has only 32GB of memory, and the import tool is suggesting 203GB to complete the import.
When I run the import, I see the following (I assume it exited because it ran out of memory). Is there any way I can import this large dataset with the limited amount of memory I have? Or if not with the limited amount of memory I have, with the maximum ~128GB that the motherboard this machine can support.
Available resources:
Total machine memory: 30.73GiB
Free machine memory: 14.92GiB
Max heap memory : 6.828GiB
Processors: 16
Configured max memory: 21.51GiB
High-IO: true
WARNING: estimated number of nodes 37583174424 may exceed capacity 34359738367 of selected record format
WARNING: 14.62GiB memory may not be sufficient to complete this import. Suggested memory distribution is:
heap size: 5.026GiB
minimum free and available memory excluding heap size: 202.6GiB
Import starting 2022-10-08 19:01:43.942+0000
Estimated number of nodes: 15.14 G
Estimated number of node properties: 97.72 G
Estimated number of relationships: 37.58 G
Estimated number of relationship properties: 0.00
Estimated disk space usage: 6.598TiB
Estimated required memory usage: 202.6GiB
(1/4) Node import 2022-10-08 19:01:43.953+0000
Estimated number of nodes: 15.14 G
Estimated disk space usage: 5.436TiB
Estimated required memory usage: 202.6GiB
.......... .......... .......... .......... .......... 5% ∆1h 38m 2s 867ms
neo4j@79d2b0538617:~/import$
TL;DR: Use periodic commit or transaction batching.
If you're trying to follow the Operations Manual: Neo4j Admin Import, and your CSV matches the movies.csv in that example, I would suggest instead doing a more manual USING PERIODIC COMMIT LOAD CSV...:
1. Stop the DB.
2. Put your CSV at neo4j/import/myfile.csv (if you're using Desktop: Project > DB > click the ... on the right > Open Folder).
3. Add the APOC plugin.
4. Start the DB.
Next, open a browser instance, run the following (adjust for your data), and leave it until tomorrow:
USING PERIODIC COMMIT LOAD CSV FROM 'file:///myfile.csv' AS line
WITH line[3] AS nodeLabels, {
    id: line[0],
    title: line[1],
    year: toInteger(line[2])
} AS nodeProps
CALL apoc.create.node(SPLIT(nodeLabels, ';'), nodeProps) YIELD node
RETURN count(node)
Note: There are many ways to solve this problem, depending on your source data and the model you wish to create. This solution is only meant to give you a handful of tools to help you get around the memory limit. If it is a simple CSV, and you don't care about what labels the nodes get initially, and you have headers, you can skip the complex APOC, and probably just do something like the following:
USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
CREATE (a :ImportedNode)
SET a = line
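On newer Neo4j versions (USING PERIODIC COMMIT was deprecated in 4.4 and removed in 5.x), the equivalent transaction batching for this simple case would look roughly like the sketch below (run with the :auto prefix in Browser; the batch size is just an example):
:auto LOAD CSV WITH HEADERS FROM 'file:///myfile.csv' AS line
CALL {
    WITH line
    CREATE (a :ImportedNode)
    SET a = line
} IN TRANSACTIONS OF 10000 ROWS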
File for Each Label
The original asker mentioned having a separate CSV for each label. In such instances it may be helpful to have one big command that handles all of it, rather than manually stepping through each part of the operation.
Assuming two label-types, each with a unique 'id' property, and one with a 'parent_id' referencing the other label...
UNWIND [
    { file: 'country.csv', label: 'Country'},
    { file: 'city.csv', label: 'City'}
] AS importFile
// (USING PERIODIC COMMIT must be the first clause of a query, so it can't be combined with this UNWIND)
LOAD CSV WITH HEADERS FROM 'file:///' + importFile.file AS line
CALL apoc.merge.node([importFile.label], {id: line.id}) YIELD node
SET node = line
;
// then build the relationships
MATCH (city :City)
WHERE city.parent_id IS NOT NULL
MATCH (country :Country {id: city.parent_id})
MERGE (city)-[:IN]->(country)
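One practical addition (not part of the original answer): at this data volume, the MATCH on Country {id: ...} in the relationship step will crawl unless id is indexed, so it is worth creating a uniqueness constraint before loading. A sketch in Neo4j 4.4/5.x syntax (older 4.x versions use ON ... ASSERT instead):
CREATE CONSTRAINT country_id IF NOT EXISTS
FOR (c :Country) REQUIRE c.id IS UNIQUE;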

Dynamic targetHits in Vespa yql

I'm trying to create a Vespa query where I would like to set the rate limit of the targetHits. For example the query below has a constant number of 3 targetHits:
'yql': 'select id, title from sources * where \
([{"targetHits":3}]nearestNeighbor(embeddings_vector,query_embeddings_vector));'
Is there some way I can set this number dynamically in every call?
Also what is the difference between hits and targetHits? Does it have to do with the minimum and desired requirement?
Thanks a lot.
I'm not sure what you mean by the rate limit of targetHits, but generally:
- targetHits is per content node if you run Vespa with a multi-node cluster.
- targetHits is the number of hits you want to expose to the ranking profile's first-phase ranking function.
- hits only controls how many documents to return in the SERP response. It's perfectly valid to ask for a targetHits of 500 per content node for ranking and finally return just the global best 10 (according to your ranking profile), as in the sketch below.
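For example, a sketch reusing the field names from the question (hits here is just the standard query API parameter): each content node exposes up to 500 nearest neighbors to first-phase ranking, while only the 10 globally best documents come back.
'yql': 'select id, title from sources * where \
    ([{"targetHits":500}]nearestNeighbor(embeddings_vector,query_embeddings_vector));',
'hits': 10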
Is there some way I can set this number dynamically in every call?
You can of course modify the YQL you send, but a better way to do this is often to create a Searcher as part of your application that modifies the query programmatically, e.g.:
import com.yahoo.prelude.query.Item;
import com.yahoo.prelude.query.NearestNeighborItem;
import com.yahoo.search.Query;
import com.yahoo.search.Result;
import com.yahoo.search.Searcher;
import com.yahoo.search.searchchain.Execution;

public class TargetHitsSearcher extends Searcher {
    @Override
    public Result search(Query query, Execution execution) {
        Item root = query.getModel().getQueryTree().getRoot();
        if (root instanceof NearestNeighborItem) { // In general; search recursively
            int target = query.properties().getInteger("myTarget"); // assumes the request supplies a myTarget parameter
            ((NearestNeighborItem) root).setTargetNumHits(target);
        }
        return execution.search(query);
    }
}
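With such a searcher deployed in your search chain, the target can then be passed per request as an ordinary query parameter (the name myTarget just mirrors the sketch above), e.g. (URL-encoding omitted for readability):
http://localhost:8080/search/?yql=select id, title from sources * where ([{"targetHits":3}]nearestNeighbor(embeddings_vector,query_embeddings_vector))&myTarget=100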

Determine perforce changelist number after running p4.run("sync") in Jenkins SCM pipeline

On the Jenkins server, the Perforce plugin (P4) is installed.
Within my Jenkins job pipeline (implemented as a shared library in Groovy), there is a pipeline stage to sync from Perforce to the Jenkins workspace:
p4.run("sync")
I want to determine the changelist number of this operation. I need to use this changelist number in the later stages of the pipeline.
I am thinking of doing the following:
p4.run("sync")
changelist_number = p4.run("changes -m1 #have")
Will this work? Or can you suggest a better solution? I am also very unfamiliar with this topic, so it would be nice if you could explain what all of this means.
The changelist number (that is, the highest changelist number associated with any synced revision) is returned as part of the p4 sync output if you're running in tagged mode:
C:\Perforce\test\merge>p4 changes ...
Change 226 on 2020/11/12 by Samwise@Samwise-dvcs-1509687817 'foo'
Change 202 on 2020/10/28 by Samwise@Samwise-dvcs-1509687817 'Populate //stream/test.'
C:\Perforce\test\merge>p4 -Ztag sync ...
... depotFile //stream/test/merge/foo.txt
... clientFile c:\Perforce\test\merge\foo.txt
... rev 2
... action updated
... fileSize 20
... totalFileSize 20
... totalFileCount 1
... change 226
Tagged output is converted into a dictionary that's returned by the run method, so you should be able to just do:
changelist_number = p4.run("sync")[0]["change"]
to sync and get the changelist number as a single operation.
There are some edge cases here -- deleted files aren't synced and so the deleted revisions won't factor into that changelist number.
A more ironclad method is to put the horse before the cart -- get the current changelist number (from the depot, not limited to what's in your client), and then sync to that exact number. That way consistency is guaranteed; if a new changelist is submitted between the two commands, your stored changelist number still matches what you synced to.
changelist_number = p4.run("changes", "-m1", "-ssubmitted")[0]["change"]
p4.run("sync", "#{changelist_number}")
Any other client syncing to that changelist number is guaranteed to get the same set of revisions (subject to its View).
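If you need the number again in later stages, here is a rough scripted-pipeline sketch (assuming p4 is the same P4Groovy object as in the question; the stage names are made up):
def changelist_number

stage('Sync') {
    // Pin the sync to the newest submitted changelist and remember it.
    changelist_number = p4.run("changes", "-m1", "-ssubmitted")[0]["change"]
    p4.run("sync", "@${changelist_number}")
}

stage('Build') {
    // Reuse the stored number later, e.g. to stamp artifacts.
    echo "Building from changelist ${changelist_number}"
}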

Dataflow: How to create a pipeline from an already existing PCollection spewed by another pipeline

I am trying to split my pipeline into many smaller pipelines so they execute faster. I am partitioning a PCollection of Google Cloud Storage blobs (PCollection<Blob>) so that I get a
PCollectionList<Blob> collectionList
from there I would love to be able to do something like:
Pipeline p2 = Pipeline.create(collectionList.get(0));
.apply(stuff)
.apply(stuff)
Pipeline p3 = Pipeline.create(collectionList.get(1));
.apply(stuff)
.apply(stuff)
But I haven't found any documentation about creating an initial PCollection from an already existing PCollection, I'd be very grateful if anyone can point me the right direction.
Thanks!
You should look into the Partition transform to split a PCollection into N smaller ones. You can provide a PartitionFn to define how the split is done. You can find below an example from the Beam programming guide:
// Provide an int value with the desired number of result partitions, and a PartitionFn that represents the partitioning function.
// In this example, we define the PartitionFn in-line.
// Returns a PCollectionList containing each of the resulting partitions as individual PCollection objects.
PCollection<Student> students = ...;
// Split students up into 10 partitions, by percentile:
PCollectionList<Student> studentsByPercentile =
    students.apply(Partition.of(10, new PartitionFn<Student>() {
        public int partitionFor(Student student, int numPartitions) {
            return student.getPercentile()  // 0..99
                * numPartitions / 100;
        }
    }));
// You can extract each partition from the PCollectionList using the get method, as follows:
PCollection<Student> fortiethPercentile = studentsByPercentile.get(4);
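Note that the partitions stay inside the same Pipeline object, so rather than calling Pipeline.create on a PCollection (which isn't possible) you just branch the graph by applying independent transforms to each partition. A sketch continuing the example above (the transform names and output path are made up):
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Each partition gets its own chain of transforms, all within the one pipeline that produced `students`.
PCollection<Student> fortieth = studentsByPercentile.get(4);
fortieth
    .apply("FormatFortieth", MapElements.into(TypeDescriptors.strings())
        .via((Student s) -> s.toString()))
    .apply("WriteFortieth", TextIO.write().to("gs://my-bucket/fortieth-percentile"));

// ...apply different transforms to the other partitions, then run the single pipeline once:
// pipeline.run().waitUntilFinish();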

Rails 4 - sum values grouped by external key

I know this must be simple but I'm really lost.
Three models: Job, Task and Operation, as follows:
Job
has_many :tasks
Task
belongs_to :job
belongs_to :operation
Operation
has_many :jobs
Job has an attribute, total_pieces, which tells me how many pieces you need.
For each Job, you can add a number of Tasks, which can belong to different Operations (cutting, drilling, etc.) and for every task you can set up a number of pieces.
I don't know in advance how many Operations will be needed for a single Job, but I need to alert user of the number of pieces left for that Operation when a new Task is inserted.
Let us make an example:
Job 1: total_pieces=100
- Task 1: operation 1(cutting), pieces=20
- Task 2: operation 1(cutting), pieces=30
- Task 3: operation 2(drilling), pieces=20
I need to alert the user that they still need to cut 50 pieces and to drill 80.
Hypothetically, if i add:
- Task 4: operation 3(bending), pieces=20
I need to alert the user that they also still need to bend 80 pieces.
So far I've managed to list all kinds of Operations for each Job using map, but now I need to sum up the pieces of all Tasks with the same Operation type in a Job, and only for those Operations present in the Tasks belonging to that Job.
Is there any way to do this using map? Or do I need to write a query manually?
EDIT: this is what I managed to patch up at the moment.
A method operations_applied in Job gives me a list of ids for all the Operations used in Tasks queued for the Job.
Then another method, pieces_remaining_for(operation), gives me the remaining pieces for a single operation.
Finally, in the Job views where I need it, I iterate through operations_applied, printing pieces_remaining_for each operation.
I know this is not particularly elegant but so far it works, any ideas to improve this?
Thank you.
If I'm not misunderstanding, it is not possible to do what you want with map alone, since map always preserves the array size (arr.size == arr.map {...}.size) and you want to reduce your array.
What you could do is something like this:
job = Job.first
operation_pieces = {}
job.tasks.each do |task|
operation_pieces[task.operation.id] ||= { operation: task.operation }
operation_pieces[task.operation.id][:pieces] ||= 0
operation_pieces[task.operation.id][:pieces] += task.pieces
end
Now operation_pieces contains, keyed by operation id, the operation and the summed pieces. But I'm sure there is a more elegant way to do this ;)
EDIT: changed the code example to a hash
EDIT: and here is the more elegant version:
job = Job.first
job.tasks
   .group_by(&:operation)
   .map { |op, tasks| [op, tasks.sum(&:pieces)] }
   .to_h
The group_by groups your array of tasks by the operation of the task (maybe you need to use group_by { |t| t.operation } instead, I'm not sure), and the map/to_h afterwards sums up the pieces of the tasks that share an operation. Finally, you end up with a hash of the form OPERATION => PIECES_SUM (Integer).
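If pieces is an actual column on tasks (as the examples above assume), you can also push the aggregation into the database with plain ActiveRecord grouping; a rough sketch:
# Sum of pieces per operation for one job, computed in SQL.
# Returns a hash like { operation_id => pieces_sum }.
done_by_operation = job.tasks.group(:operation_id).sum(:pieces)

# Pieces still remaining per operation used on this job:
remaining_by_operation = done_by_operation.transform_values { |done| job.total_pieces - done }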
I assume the following attributes for your query:
Task: number_of_pieces
Job: name
Operation: name

Job.joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
   .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
   .select("SUM(tasks.number_of_pieces) AS number_of_pieces, operations.name, jobs.name")
   .group("operations.id, jobs.id")
This will list all the jobs, and sum of pieces required for each operation under it.
If you have the job_id for which you want the list of operations and pieces, then use the code below:
Job.where(id: params[:job_id])
   .joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
   .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
   .select("SUM(tasks.number_of_pieces) AS number_of_pieces, operations.name, jobs.name")
   .group("operations.id, jobs.id")
Please comment if you need any explanation.
