recommended maximum dataset size for Apache Jena - jena

I would like to know the maximum dataset size currently supported by Apache Jena (https://jena.apache.org/index.html). For example, is it suited for datasets on the order of 100 million, 1 billion, or 20 billion triples?
Thank you
D063520

Related

Why is max_batches independent of the size of the dataset?

I am wondering why the number of images has no influence on the number of iterations when training. Here is an example to make my question clearer:
Suppose we have 6400 images for a training to recognize 4 classes. Based on AlexeyAB's explanations, we keep batch = 64, subdivisions = 16 and write max_batches = 8000 since max_batches is determined by #classes x 2000.
Since we have 6400 images, a complete epoch requires 100 iterations. Therefore this training ends after 80 epochs.
Now, suppose that we have 12800 images. In that case, an epoch needs 200 iterations. Therefore the training ends after 40 epochs.
Since an epoch refers to one cycle through the full training dataset, I'm wondering why we don't increase the number of iterations when our dataset increases, in order to keep the number of epochs constant.
Said differently, I'm asking for a simple explanation as to why the number of epochs seems to be irrelevant to the quality of the training. I feel that it's a consequence of Yolo's construction but I am not knowledgeable enough to understand how.
Why the number of images has no influence on the number of iterations when training?
In darknet YOLO, the number of iterations depends on the max_batches parameter in the .cfg file. After running for max_batches iterations, darknet saves the final weights.
In each epoch, all the data samples are passed through the network, so if you have many images, the training time for one epoch (and iteration) will be higher; you can test that by increasing the number of images in your data.
The subdivisions parameter accounts for the number of mini-batches. Let's say you have 100 images in your dataset, your batch size is 10, subdivisions is 2, and max_batches is 20.
So, in each iteration, 10 images are passed to the network in two mini-batches (each having 5 samples); once you have done 20 batches (20*10 data samples), the training will be completed. (The details can be a little different; I'm using a slightly modified darknet by the original author pjreddie.)
The instructions are updated now: max_batches is equal to classes*2000, but not less than the number of training images and not less than 6000. Please find it at this link.
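For reference, here is the arithmetic from the question and the updated max_batches rule as a short Python sketch. The function and variable names are made up for illustration; they are not darknet parameters.

# Illustrative sketch of the epoch/iteration arithmetic described above.
# training_summary is a hypothetical helper, not part of darknet.
def training_summary(num_images, batch=64, subdivisions=16, classes=4):
    # Updated rule: classes * 2000, but not less than the number of
    # training images and not less than 6000.
    max_batches = max(classes * 2000, num_images, 6000)
    iterations_per_epoch = num_images / batch        # one epoch = one pass over the data
    epochs = max_batches / iterations_per_epoch      # epochs completed after max_batches iterations
    mini_batch = batch // subdivisions               # images per forward pass
    return max_batches, iterations_per_epoch, epochs, mini_batch

print(training_summary(6400))    # (8000, 100.0, 80.0, 4): 80 epochs, as in the question
print(training_summary(12800))   # (12800, 200.0, 64.0, 4): max_batches raised to the image count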

Is high label cardinality but low metric/label count and infrequent sampling an acceptable use-case for Prometheus?

I have a monitoring use-case that I'm not entirely sure is a good match for Prometheus, and I wanted to ask for opinions before I delve deeper.
The numbers of what I'm going to store:
Only 1 metric.
That metric has 1 label with 1,000,000 to 2,000,000 distinct values.
The values are gauges (but does it make a difference if they are counters?)
Sample rate is once every 5 minutes. Retaining data for 180 days.
Estimated storage size if I have 1 million distinct label values (according to the formula in Prometheus' documentation: retention_time_seconds * ingested_samples_per_second * bytes_per_sample):
(24*60)/5 = 288 five-minute intervals in a day.
(180*288) samples per label value * 1,000,000 label values * 2 bytes per sample = 103,680,000,000 bytes ~= 100 GB.
So I assume 100-200 GB will be required.
Is this estimation correct?
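For reference, here is that arithmetic as a short Python sketch; the numbers are the ones from the question, and nothing here is specific to Prometheus.

# Back-of-the-envelope storage estimate, using the numbers from the question.
retention_days = 180
scrape_interval_minutes = 5
label_values = 1_000_000          # distinct label values = number of time series
bytes_per_sample = 2              # rough figure used in the question

samples_per_day = (24 * 60) // scrape_interval_minutes      # 288
samples_per_series = retention_days * samples_per_day       # 51,840
total_bytes = samples_per_series * label_values * bytes_per_sample
print(f"{total_bytes:,} bytes ~= {total_bytes / 1e9:.1f} GB")   # 103,680,000,000 bytes ~= 103.7 GB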
I read in multiple places about avoiding high-cardinality labels, and I would like to ask about this. Considering that I will be looking at one time series at a time: is the problem with the high-cardinality labels themselves, or with having a high number of time series (since each label value produces another time series)?
I also read in multiple places that Prometheus can handle millions of time series at once, so even if I have 1 label with one million distinct values, I should be fine in terms of time-series count. Do I still have to worry about the labels having high cardinality in this case? I'm aware that it depends on the strength of the server, but assuming average capacity, I would like to know whether Prometheus' implementation has a problem handling this case efficiently.
And also, if it's a matter of time-series count, am I correct in assuming that there will be no significant difference between the following options?
1 metric with 1 label of 1,000,000 distinct label values.
10 metrics, each with 1 label of 100,000 distinct label values.
X metrics, each with 1 label of Y distinct label values, where X * Y = 1,000,000.
Thanks for the help!
That might work, but it's not what Prometheus is designed for and you'll likely run into issues. You probably want a database rather than a monitoring system, maybe Cassandra here.
How the cardinality is split out across metrics won't affect ingestion performance; however, it'll be relatively slow to have to read 1M series in a query.
Note that VictoriaMetrics is an easy-to-configure backend for Prometheus which will reduce storage requirements significantly.

How to calculate optimal batch size

Sometimes I run into a problem:
OOM when allocating tensor with shape
e.g.
OOM when allocating tensor with shape (1024, 100, 160)
where 1024 is my batch size and I don't know what the rest is. If I reduce the batch size or the number of neurons in the model, it runs fine.
Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?
In short: I want the largest batch size possible in terms of my model, which will fit into my GPU memory and won't crash the program.
From the recent Deep Learning book by Goodfellow et al., chapter 8:
Minibatch sizes are generally driven by the following factors:
Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".
You might want also to consult several good posts here in Stack Exchange:
Tradeoff batch size vs. number of iterations to train a neural network
Selection of Mini-batch Size for Neural Network Regression
How large should the batch size be for stochastic gradient descent?
Just keep in mind that the paper by Keskar et al. 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections by other respectable researchers of the deep learning community.
Hope this helps...
UPDATE (Dec 2017):
There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.
UPDATE (Mar 2021):
Of interest here is also another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the "larger the better" advice; quoting from the abstract:
The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
You can estimate the largest batch size using:
Max batch size= available GPU memory bytes / 4 / (size of tensors + trainable parameters)
Use the summaries provided by torchsummary (pip install torchsummary) or Keras (built-in).
E.g.
from torchsummary import summary
summary(model)
.....
.....
================================================================
Total params: 1,127,495
Trainable params: 1,127,495
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 13.93
Params size (MB): 4.30
Estimated Total Size (MB): 18.25
----------------------------------------------------------------
Each instance you put in the batch will require a full forward/backward pass in memory; the model itself you only need once. People seem to prefer batch sizes of powers of two, probably because of automatic layout optimization on the GPU.
Don't forget to linearly increase your learning rate when increasing the batch size.
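As a minimal sketch of that heuristic (the baseline batch size and learning rate below are arbitrary, chosen only for illustration):

# Linear scaling heuristic: scale the learning rate with the batch size.
base_batch_size = 256   # arbitrary baseline for illustration
base_lr = 0.1           # arbitrary baseline for illustration
new_batch_size = 1024
new_lr = base_lr * (new_batch_size / base_batch_size)
print(new_lr)   # 0.4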
Let's assume we have a Tesla P100 at hand with 16 GB memory.
(16000 - model_size) / (forward_backward_size)
(16000 - 4.3) / 13.93 = 1148.29
rounded down to the nearest power of 2 gives a batch size of 1024
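The rule of thumb above can be written as a few lines of Python; the helper function is hypothetical, not part of any library, and the numbers come from the summary output above.

import math

def estimate_max_batch_size(gpu_memory_mb, params_size_mb, fwd_bwd_size_mb):
    # Rough upper bound on batch size, per the rule of thumb above.
    raw = (gpu_memory_mb - params_size_mb) / fwd_bwd_size_mb
    return 2 ** int(math.log2(raw))   # round down to the nearest power of 2

# Numbers from the torchsummary output above, on a 16 GB Tesla P100:
print(estimate_max_batch_size(16000, 4.3, 13.93))   # 1024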
Here is a function to find a batch size for training the model:

def FindBatchSize(model):
    """model: model architecture that is yet to be trained"""
    import os, sys, psutil, gc, tensorflow, keras
    import numpy as np
    from keras import backend as K

    BatchFound = 16
    try:
        total_params = int(model.count_params())
        GCPU = "CPU"
        # find whether a GPU is available
        try:
            if K.tensorflow_backend._get_available_gpus() == []:
                GCPU = "CPU"    # CPU and Cuda9GPU
            else:
                GCPU = "GPU"
        except Exception:
            from tensorflow.python.client import device_lib    # Cuda8GPU

            def get_available_gpus():
                local_device_protos = device_lib.list_local_devices()
                return [x.name for x in local_device_protos if x.device_type == 'GPU']

            if "gpu" not in str(get_available_gpus()).lower():
                GCPU = "CPU"
            else:
                GCPU = "GPU"

        # decide batch size on the basis of GPU availability and model complexity
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params < 1000000):
            BatchFound = 64
        if (os.cpu_count() < 16) and (total_params < 500000):
            BatchFound = 64
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params < 2000000) and (total_params >= 1000000):
            BatchFound = 32
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params >= 2000000) and (total_params < 10000000):
            BatchFound = 16
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params >= 10000000):
            BatchFound = 8
        if (os.cpu_count() < 16) and (total_params > 5000000):
            BatchFound = 8
        if total_params > 100000000:
            BatchFound = 1
    except Exception:
        pass

    try:
        # find the percentage of system memory currently in use
        memoryused = psutil.virtual_memory().percent
        if memoryused > 75.0:
            BatchFound = 8
        if memoryused > 85.0:
            BatchFound = 4
        if memoryused > 90.0:
            BatchFound = 2
        if total_params > 100000000:
            BatchFound = 1
        print("Batch Size: " + str(BatchFound))
        gc.collect()
    except Exception:
        pass

    memoryused = []; total_params = []; GCPU = ""
    del memoryused, total_params, GCPU
    gc.collect()
    return BatchFound
I ran into a similar GPU mem error which was solved by configuring the tensorflow session with the following:
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)  # allocate GPU memory on demand instead of all up front
see: google colaboratory `ResourceExhaustedError` with GPU

Gigabyte or Gibibyte (1000 or 1024)?

This may be a duplicate and I apologies if that is so but I really want a definitive answer as that seems to change depending upon where I look.
Is it acceptable to say that a gigabyte is 1024 megabytes, or should it be said that it is 1000 megabytes? I am taking computer science at GCSE, and a typical exam question could be how many bytes are in a kilobyte; I believe the exam board, AQA, gives the answer to such a question as 1024, not 1000. How is this? Are both correct? Which one should I go with?
Thanks in advance- this has got me rather bamboozled!
The sad fact is that it depends on who you ask. But computer terminology is slowly being aligned with normal terminology, in which kilo is 10^3 (1,000), mega is 10^6 (1,000,000), and giga is 10^9 (1,000,000,000).
This is reflected in the International System of Quantities and the International Electrotechnical Commission, which define gigabyte as 10^9 and use gibibyte for the computer-specific 1024 x 1024 x 1024 value.
The reason it "depends who you ask" is that for many years, specifically in relation to "bytes" of storage, the prefixes kilo, mega, and giga meant 1024, 1024^2, and 1024^3. But that flies in the face of normal convention with regard to these prefixes. So again, computer terminology is being aligned with non-computer terminology.
The term gigabyte is commonly used to mean either 1000^3 bytes or 1024^3 bytes, depending on the context. Disk manufacturers prefer the decimal term while memory manufacturers use the binary.
Decimal definition
1 GB = 1,000,000,000 bytes (= 1000^3 B = 10^9 B)
Based on powers of 10, this definition uses the prefix as defined in the International System of Units (SI). This is the recommended definition by the International Electrotechnical Commission (IEC). This definition is used in networking contexts and most storage media, particularly hard drives, flash-based storage, and DVDs, and is also consistent with the other uses of the SI prefix in computing, such as CPU clock speeds or measures of performance.
Binary definition
1 GiB = 1,073,741,824 bytes (= 1024^3 B = 2^30 B).
The binary definition uses powers of the base 2, as is the architectural principle of binary computers. This usage is widely promulgated by some operating systems, such as Microsoft Windows in reference to computer memory (e.g., RAM). This definition is synonymous with the unambiguous unit gibibyte.
The difference between units based on decimal and binary prefixes increases as a semi-logarithmic (linear-log) function—for example, the decimal kilobyte value is nearly 98% of the kibibyte, a megabyte is under 96% of a mebibyte, and a gigabyte is just over 93% of a gibibyte value. This means that a 300 GB (279 GiB) hard disk might be indicated variously as 300 GB, 279 GB or 279 GiB, depending on the operating system.
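Those ratios and the 300 GB example can be checked with a few lines of Python:

# Ratio of decimal to binary prefixes at each scale
for name, decimal, binary in [("kilo/kibi", 1000, 1024),
                              ("mega/mebi", 1000**2, 1024**2),
                              ("giga/gibi", 1000**3, 1024**3)]:
    print(f"{name}: {decimal / binary:.2%}")   # 97.66%, 95.37%, 93.13%

# A "300 GB" disk expressed in GiB
print(300 * 1000**3 / 1024**3)   # ~279.4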
The Wikipedia article https://en.wikipedia.org/wiki/Gigabyte has a good writeup of the confusion surrounding the usage of the term

IOPS versus Throughput

What is the key difference between IOPS and Throughput in large data storage?
Does file size have an effect on IOPS? Why?
IOPS measures the number of read and write operations per second, while throughput measures the number of bits read or written per second.
Although they measure different things, they generally follow each other as IO operations have about the same size.
If you have large files, you simply need more IO operations to read the entire file. The file size has no effect on the IOPS as it measures the number of clusters read or written, not the number of files.
If you have small files, there will be more overhead, so while the IOPS and throughput look good, you may experience a lower actual performance.
This is the analogy I came up with when talking about Throughput and IOPS.
Think of it as:
You have 4 buckets (Disk blocks) of the same size that you want to fill or empty with water.
You'll be using a jug to transfer the water into the buckets. Now your question will be:
At a given time (per second), how many jugs of water can you pour (write) or withdraw (read)? This is IOPS.
At a given time (per second) what's the amount (bit, kb, mb, etc) of water the jug can transfer into/out of the bucket continuously? This is throughput.
Additionally, there is a delay in the process of you pouring and/or withdrawing the water. This is Latency.
There are 3 things to consider when talking about IOPS and Throughput:
Size (file size/block size)
Patterns (Random/Sequential)
Mix (Read/Write) percentage
Disk IOPS describes the count of input/output operations on the disk per second, regardless of block size.
Disk throughput describes how much data may be transferred per second, so the block size plays a huge role in calculating the throughput required by an app.
As a sample, let's consider 3000 IOPS and a SQL database engine. The block size in terms of the DB engine is called the page size, and for SQL Server it's equal to 8 KB. If you wish to calculate the actual throughput when the IOPS is defined, you will end up with the formula below:
throughput = [IOPS] * [block size] = 3000 * 8 = 24 000 KB/s = 24 MB/s
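The same calculation as a short Python sketch:

# throughput = IOPS * block size
iops = 3000
block_size_kb = 8            # SQL Server page size
throughput_kb_s = iops * block_size_kb
print(throughput_kb_s, "KB/s =", throughput_kb_s / 1000, "MB/s")   # 24000 KB/s = 24.0 MB/s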
IOPS - the number of read/write operations, mostly useful for OLTP transactions, used in AWS for DBs like Cassandra.
Throughput - the number of bits transferred per second, i.e. data transferred per second.
Mainly a unit for high data-transfer applications like big data Hadoop and Kafka streaming.
IOPS - the number of Input/Output operations a storage system can perform per second, each measured from start to finish.
Throughput - data transfer speed, often expressed in megabytes per second. Earlier it was measured in kilobytes, but now the standard has become megabytes.
For more about this, see: What is the difference between IOPS and throughput?

Resources