Pass a large amount of binary data from U-Boot to the Linux kernel

I have some issues with passing a large amount of data (3 MB) from U-Boot to a Linux 2.6.35.3 kernel on an i.MX50 ARM board. The data is required in a kernel device driver's probe function and should be released afterwards. First U-Boot loads the data from flash into RAM, then it passes the physical address to the Linux kernel via bootargs. In the kernel I try to reserve that region of memory using request_resource() in arch/arm/kernel/setup.c:
--- a/arch/arm/kernel/setup.c	Tue Jul 17 11:22:39 2012 +0300
+++ b/arch/arm/kernel/setup.c	Fri Jul 20 14:17:16 2012 +0300
+struct resource my_mem_res = {
+	.name	= "My_Region",
+	.start	= 0x77c00000,
+	.end	= 0x77ffffff,
+	.flags	= IORESOURCE_MEM | IORESOURCE_BUSY,
+};
@@ -477,6 +479,10 @@
 	kernel_code.end     = virt_to_phys(_etext - 1);
 	kernel_data.start   = virt_to_phys(_data);
 	kernel_data.end     = virt_to_phys(_end - 1);
+	my_mem_res.start = mi->bank[i].start + mi->bank[i].size - 0x400000;
+	my_mem_res.end   = mi->bank[i].start + mi->bank[i].size - 1;
 	for (i = 0; i < mi->nr_banks; i++) {
 		if (mi->bank[i].size == 0)
@@ -496,6 +502,8 @@
 		if (kernel_data.start >= res->start &&
 		    kernel_data.end <= res->end)
 			request_resource(res, &kernel_data);
+
+		request_resource(res, &my_mem_res);
 	}
 	if (mdesc->video_start) {
With this I'm trying to tell the kernel that this memory area is reserved and that its contents should not be modified by the kernel. The resulting /proc/iomem listing is:
70000000-77ffffff : System RAM
70027000-7056ffff : Kernel text
70588000-7062094f : Kernel data
77c00000-77ffffff : My_Region
In the driver, ioremap(0x77c00000, AREA_SIZE) is used to obtain a kernel virtual address, but when I dump the contents of that memory I see only zeros. If I boot the kernel with mem=120M (128 MB of RAM is available in total), so that my data sits above the kernel's System RAM region, then I get the data I expect.
So, my questions: why do I get zeros, and how do I pass a large amount of binary data from U-Boot to the Linux kernel?

You could use a custom ATAG to either pass the data block or to pass the address & length of the data. Note that the "A" in ATAG stands for ARM, so this solution is not portable to other architectures. An ATAG is preferable to a command-line bootarg IMO because you do not want the user to muck with physical memory addresses. Also the Linux kernel will process the ATAG list before the MMU (i.e. virtual memory) is enabled.
In U-Boot, look at lib_arm/armlinux.c or arch/arm/lib/bootm.c for routines that build the ARM tag list. Write your own routine for your new tag(s), and then invoke it in do_bootm_linux().
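As a rough illustration only, here is a sketch of such a routine, modeled on the existing setup_*_tag() helpers and callable from do_bootm_linux() next to them. ATAG_MY_DATA, struct tag_my_data, and the my_data member of the tag union are assumptions you would have to add to the shared tag definitions on both sides; params and tag_next() already exist in U-Boot's tag-building code:

#define ATAG_MY_DATA 0x41000099 /* arbitrary unused tag value (assumption) */

struct tag_my_data {
	u32 addr; /* physical address of the data block */
	u32 len;  /* length of the data block in bytes */
};

static void setup_my_data_tag(ulong addr, ulong len)
{
	params->hdr.tag = ATAG_MY_DATA;
	params->hdr.size = (sizeof(struct tag_header) +
			    sizeof(struct tag_my_data)) >> 2; /* size in words */
	params->u.my_data.addr = addr; /* assumes my_data was added to the tag union */
	params->u.my_data.len = len;
	params = tag_next(params);
}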
In the Linux kernel, ATAGs are processed in arch/arm/kernel/setup.c while virtual memory has not yet been enabled. If you just pass address and length values from U-Boot, the pointer and length can be assigned to global variables that are exported,
void *my_data;
unsigned int my_dlen;
EXPORT_SYMBOL(my_data);
EXPORT_SYMBOL(my_dlen);
and then the driver can retrieve it.
extern void *my_data;
extern unsigned int my_dlen;

/* my_data holds a physical address here, hence the casts */
request_mem_region((unsigned long)my_data, my_dlen, DRV_NAME);
md_map = ioremap((unsigned long)my_data, my_dlen);
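The tag itself can be picked up in setup.c with a __tagtable() entry; a minimal sketch, assuming the same ATAG_MY_DATA value and tag_my_data layout as on the U-Boot side:

static int __init parse_tag_my_data(const struct tag *tag)
{
	my_data = (void *)tag->u.my_data.addr; /* still a physical address at this point */
	my_dlen = tag->u.my_data.len;
	return 0;
}
__tagtable(ATAG_MY_DATA, parse_tag_my_data);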
I've used similar code to probe for SRAM in U-Boot, then pass the starting address & number of KBytes found to the kernel in a custom ATAG. A kernel driver obtains these values, and if they are nonzero and have sane values, creates a block device out of the SRAM. The major difference from your situation is that the SRAM is in a completely different physical address range from the SDRAM.
NOTE
An ATAG is built by U-Boot for the physical memory that the kernel can use, so this is really where you need to define and exclude your reserved RAM. It's probably too late to do that in the kernel.
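A hedged sketch of that idea, based on setup_memory_tags() in U-Boot's ARM boot code; RESERVED_SIZE is an assumption matching the 4 MB region above:

#define RESERVED_SIZE 0x400000 /* 4 MB at the top of DRAM (assumption) */

static void setup_memory_tags(bd_t *bd)
{
	int i;

	for (i = 0; i < CONFIG_NR_DRAM_BANKS; i++) {
		params->hdr.tag = ATAG_MEM;
		params->hdr.size = tag_size(tag_mem32);
		params->u.mem.start = bd->bi_dram[i].start;
		params->u.mem.size = bd->bi_dram[i].size;
		/* hide the reserved region at the top of the last bank */
		if (i == CONFIG_NR_DRAM_BANKS - 1)
			params->u.mem.size -= RESERVED_SIZE;
		params = tag_next(params);
	}
}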

Related

malloc kernel panics instead of returning NULL

I'm attempting to do an exercise from "Expert C Programming" where the point is to see how much memory a program can allocate. It hinges on malloc returning NULL when it cannot allocate any more.
#include <stdio.h>
#include <stdlib.h>
int main() {
    int totalMB = 0;
    int oneMeg = 1 << 20;
    while (malloc(oneMeg)) {
        ++totalMB;
    }
    printf("Allocated %d Mb total \n", totalMB);
    return 0;
}
Rather than printing the total, I get a kernel panic after allocating ~8GB on my 16GB Macbook Pro.
Kernel panic log:
Anonymous UUID: 0B87CC9D-2495-4639-EA18-6F1F8696029F
Tue Dec 13 23:09:12 2016
*** Panic Report ***
panic(cpu 0 caller 0xffffff800c51f5a4): "zalloc: zone map exhausted while allocating from zone VM map entries, likely due to memory leak in zone VM map entries (6178859600 total bytes, 77235745 elements allocated)"#/Library/Caches/com.apple.xbs/Sources/xnu/xnu-3248.50.21/osfmk/kern/zalloc.c:2628
Backtrace (CPU 0), Frame : Return Address
0xffffff91f89bb960 : 0xffffff800c4dab12
0xffffff91f89bb9e0 : 0xffffff800c51f5a4
0xffffff91f89bbb10 : 0xffffff800c5614e0
0xffffff91f89bbb30 : 0xffffff800c5550e2
0xffffff91f89bbba0 : 0xffffff800c554960
0xffffff91f89bbd90 : 0xffffff800c55f493
0xffffff91f89bbea0 : 0xffffff800c4d17cb
0xffffff91f89bbf10 : 0xffffff800c5b8dca
0xffffff91f89bbfb0 : 0xffffff800c5ecc86
BSD process name corresponding to current thread: a.out
Mac OS version:
15F34
I understand that this can easily be fixed by the doctor's cliche of "It hurts when you do that? Then don't do that" but I want to understand why malloc isn't working as expected.
OS X 10.11.5
For the definitive answer to that question, you can look at the source code, which you'll find here:
zalloc.c source in XNU
In that source file find the function zalloc_internal(). This is the function that gives the kernel panic.
In that function you'll find a "for (;;) {" loop, which basically tries to allocate the memory you're requesting in the specified zone. If there isn't enough space, it immediately tries again. If that fails, it does a zone_gc() (garbage collection) to try to reclaim memory. If that also fails, it simply kernel panics, effectively halting the computer.
If you want to understand how zalloc.c works, look up zone-based memory allocators.
Your program is making the kernel run out of space in the zone called "VM map entries", which is a predefined zone allocated at boot. You could probably get the result you are expecting from your program, without a kernel panic, if you allocated more than 1 MB at a time.
In essence, it is not really a problem for the kernel to hand you several gigabytes of memory in one allocation. However, satisfying thousands of smaller allocations that sum to those gigabytes is much harder, because each one needs its own VM map entry.
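For illustration, a variant of the program above that probes in 1 GB chunks creates far fewer VM map entries and should reach the NULL return instead of exhausting the zone (a sketch; the exact behavior still depends on the OS version):

#include <stdio.h>
#include <stdlib.h>

int main() {
    size_t oneGB = (size_t)1 << 30;
    int totalGB = 0;
    /* One allocation per gigabyte: thousands of times fewer
     * "VM map entries" than the 1 MB version. */
    while (malloc(oneGB)) {
        ++totalGB;
    }
    printf("Allocated %d GB total\n", totalGB);
    return 0;
}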

SP605 Spartan 6 DDR3 addressing

The following post is quite long, but since I have had trouble making the SP605 board properly interact with the DDR3 for over a month now, hopefully it will be useful to others in the same situation. I am pretty certain it's a simple configuration or conceptual error, and I would be more than happy to have it resolved soon.
=== SCENARIO ===
I have created a USB-UART interface to communicate with the FPGA and control the DDR3. Using the IP generator in ISE, I generated a MIG wrapper and then designed the memory interface controller. I have consulted manuals UG388 and UG416, but I have not been able to make the DDR3 behave as expected.
=== PROBLEM STATEMENT ===
Playing around with the burst lengths for write and read commands, I am able to get data back from the DDR3, yet the addressing scheme does not seem to be correct as data is duplicated in addresses 0 and 1, 2 and 3, 4 and 5, and so forth. Also, whenever I write into address 0, for example, nothing changes. Then, when I write into address 1, both addresses 0 and 1 are updated with the data value I just sent. It seems I am "losing" half of the memory space due to this coupled effect.
=== DDR3 IP CONFIGURATION ===
The setup for the DDR3 using the IP generator – considering the SP605 board scenario – is listed below. In sum, I activated the DDR3 Bank 3 and configured Port0 to be 32-bit bidirectional.
Memory selection:
Enable AXI interface: unchecked
Use extended MCB performance range: unchecked
Memory type for bank 3: DDR3 SDRAM
Memory type for bank 1: none
Options for C3 – DDR3 SDRAM
Frequency: 400 MHz
Memory part: MT41J64M16XX-187E
Memory options for C3 – DDR3 SDRAM
Output driver impedance control: RZQ/6
RTT (nominal) – ODT: RZQ/4
Auto self refresh: enabled
Port configuration for C3 – DDR3 SDRAM
Two 32-bit bi-directional and four 32-bit unidirectional ports
Port0: checked
Port1: unchecked
Port2: unchecked
Port3: unchecked
Port4: unchecked
Port5: unchecked
Memory address mapping selection: row-bank-column
FPGA options for C3 – DDR3 SDRAM
Memory interface pin termination: Calibrated input termination
Select RZQ pin location: R7
Select ZIO pin location: W4
Debug signals for memory controller: disable
System clock: differential
=== DATA STRUCTURE ===
From Matlab, I send in a 64-bit command which should write or read the DDR3 based on the address and data provided in this command.
wire [00:00] cmd_instruction = usb_data[63:63]; // '0' = write; '1' = read
wire [27:00] cmd_address = usb_data[62:37]; // 26-bit address
wire [31:00] cmd_data = usb_data[31:00]; // 32-bit data
In ug388, the following can be extracted:
Page 20: The address is 26 bits wide.
C_MEM_ADDR_WIDTH = 13
C_MEM_BANKADDR_WIDTH = 3
C_MEM_NUM_COL_BITS = 10
C_P0_DATA_PORT_SIZE = 32 // 32-bit data ports
C_P0_MASK_SIZE = 4 // 4 bytes = 32 bits (1 mask bit = 1 entire data byte)
Pages 26-27: Command data structure.
pX_cmd_addr[29:0]: 30-bit address; however, the last two bits should be "00" since every word (32 bits) is formed by 4 bytes.
pX_cmd_bl[5:0]: a burst length of 1 is obtained by setting this signal to 0.
pX_cmd_instr[2:0]: the only command instructions used are write="000" and read="001".
Page 28: Write data structure.
pX_wr_mask[PX_MASKSIZE-1:0]: the 4-bit mask is set to "0000" so that all 4 bytes are always written into the memory.
=== SIGNAL ASSIGNMENTS ===
Using all this information, I assigned my signals in the following manner:
assign p0_mcb_cmd_instr = {2'b00, cmd_instruction};
assign p0_mcb_cmd_addr = {2'd0, cmd_address, 2'd0};
assign p0_mcb_cmd_bl = 6'd0;
assign p0_mcb_wr_data = cmd_data;
assign p0_mcb_wr_mask = 4'd0;
localparam C3_MEM_BURST_LEN = 8;
=== CONCLUSIONS ===
Based on the configuration, does anyone know what the expected behavior of my controller should be?
If any additional information is necessary for clarification, please let me know.
Thanks a lot,
Bruno.

Process memory limit and address space for UNIX and Linux and Windows

What is the maximum amount of memory available to a single process on UNIX, Linux, and Windows, and how is that calculated? How much user address space and kernel address space is there with 4 GB of RAM?
How much user address space and kernel address space is there with 4 GB of RAM?
The address space of a process is divided into two parts:
User space: on a standard 32-bit x86 architecture, the maximum addressable memory is 4 GB, out of which addresses from 0x00000000 to 0xbfffffff (3 GB) are meant for code and data segments. This region can be addressed when the user process executes either in user or kernel mode.
Kernel space: similarly, addresses from 0xc0000000 to 0xffffffff (1 GB) are meant for the virtual address space of the kernel and can only be addressed when the process executes in kernel mode.
This particular address-space split on x86 is determined by the value of PAGE_OFFSET. Referring to Linux 3.11.1 (page_32_types.h and page_64_types.h), the page offset is defined as below:
#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)
where Kconfig defines a default value of 0xC0000000, with other address-split options available.
Similarly, for 64-bit:
#define __PAGE_OFFSET _AC(0xffff880000000000, UL)
On 64-bit architectures the 3G/1G split no longer applies, because the address space is huge; recent Linux versions use the offset given above.
On my 64-bit x86_64 machine, a 32-bit process can have the entire 4 GB of user address space, and the kernel holds the address range above 4 GB. Interestingly, on modern 64-bit x86_64 CPUs not all address lines are enabled (the address bus is not large enough) to provide the full 2^64 = 16 exabytes of virtual address space. The AMD64/x86 architectures have the lower 48/42 bits enabled respectively, resulting in 2^48 = 256 TB / 2^42 = 4 TB of address space. This definitely improves performance with large amounts of RAM; at the same time, the question arises how it is managed efficiently within the OS limitations.
In Linux there's a way to find out the limit of address space you can have.
Using the rlimit structure.
struct rlimit {
    rlim_t rlim_cur; // current (soft) limit
    rlim_t rlim_max; // ceiling for rlim_cur (hard limit)
};
rlim_t is an unsigned integer type (typically unsigned long).
and you can have something like:
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

// Bytes to gigabytes
static inline unsigned long btogb(unsigned long bytes) {
    return bytes / (1024 * 1024 * 1024);
}

// Bytes to exabytes
static inline double btoeb(double bytes) {
    return bytes / (1024.00 * 1024.00 * 1024.00 * 1024.00 * 1024.00 * 1024.00);
}

int main() {
    printf("\n");
    struct rlimit rlim_addr_space;
    rlim_t addr_space;
    /*
     * Here we call getrlimit() with RLIMIT_AS (Address Space) and
     * a pointer to our instance of the rlimit struct.
     */
    int retval = getrlimit(RLIMIT_AS, &rlim_addr_space);
    // getrlimit() returns 0 if it succeeded, let's check that.
    if (!retval) {
        addr_space = rlim_addr_space.rlim_cur;
        fprintf(stdout, "Current address_space: %lu Bytes, or %lu GB, or %f EB\n",
                addr_space, btogb(addr_space), btoeb((double)addr_space));
    } else {
        fprintf(stderr, "Couldn't get address space current limit.");
        return 1;
    }
    return 0;
}
I ran this on my computer and... prrrrrrrrrrrrrrrrr tsk!
Output: Current address_space: 18446744073709551615 Bytes, or 17179869183 GB, or 16.000000 EB
I have 16 ExaBytes of max address space available on my Linux x86_64.
Here's getrlimit()'s definition
It also lists the other constants you can pass to getrlimit() and introduces getrlimit()'s sister, setrlimit(). That is where the rlim_max member of rlimit becomes really important: you should always check that you don't exceed this value, so the kernel doesn't punch your face, drink your coffee and steal your papers.
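A minimal sketch of that pattern, lowering the soft address-space limit while leaving the hard ceiling alone (only a privileged process may raise rlim_max):

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    rl.rlim_cur = 1UL << 30; /* new soft limit: 1 GB, must stay <= rl.rlim_max */
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    printf("soft limit now %lu bytes (hard limit %lu)\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);
    return 0;
}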
PS: please excuse my sorry excuse of a drum roll ^_^
On Linux systems, see man ulimit
It says:
The ulimit builtin is used to set the resource usage limits of the
shell and any processes spawned by it. If a new limit value is
omitted, the current value of the limit of the resource is printed.
ulimit -a prints all current limits along with their switch options; other switches, e.g. ulimit -n, print individual limits such as the maximum number of open files.
Unfortunately, "max memory size" reports "unlimited", which means that it is not limited by the system administrator.
You can view the memory size by
cat /proc/meminfo
which results in something like:
MemTotal: 4048744 kB
MemFree: 465504 kB
Buffers: 316192 kB
Cached: 1306740 kB
SwapCached: 508 kB
Active: 1744884 kB
(...)
So, if ulimit says "unlimited", the MemFree is all yours. Almost.
Don't forget that malloc() (and the new operator, which calls malloc()) is a standard library function, so if you call malloc(100) 10 times there will be a lot of "slack"; follow the link to learn why.
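A quick way to see that slack on glibc (a sketch; malloc_usable_size() is a glibc extension from <malloc.h>):

#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    for (int i = 0; i < 10; i++) {
        void *p = malloc(100);
        /* The allocator rounds the request up; the difference is the slack. */
        printf("requested 100 bytes, usable %zu\n", malloc_usable_size(p));
    }
    return 0;
}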

Detect SSD using Delphi [duplicate]

I'm getting ready to release a tool that is only effective with regular hard drives, not SSDs (solid state drives). In fact, it shouldn't be used with SSDs because it will result in a lot of reads and writes with no real benefit.
Does anyone know of a way to detect whether a given drive is solid-state?
Finally a reliable solution! Two of them, actually!
Check /sys/block/sdX/queue/rotational, where sdX is the drive name. If it's 0, you're dealing with an SSD, and 1 means plain old HDD.
I can't put my finger on the Linux version where it was introduced, but it's present in Ubuntu's Linux 3.2 and in vanilla Linux 3.6 and not present in vanilla 2.6.38. Oracle also backported it to their Unbreakable Enterprise kernel 5.5, which is based on 2.6.32.
There's also an ioctl to check if the drive is rotational since Linux 3.3, introduced by this commit. Using sysfs is usually more convenient, though.
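If you need the sysfs check from a program rather than a shell, a minimal C sketch (error handling kept short; "sda" is just an example device name):

#include <stdio.h>

/* Returns 1 for rotational (HDD), 0 for SSD, -1 on error. */
int is_rotational(const char *disk)
{
    char path[128];
    int rot = -1;
    FILE *f;

    snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", disk);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &rot) != 1)
        rot = -1;
    fclose(f);
    return rot;
}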
You can actually fairly easily determine the rotational latency; I did this once as part of a university project. It is described in this report. You'll want to skip to page 7, where you'll see some nice graphs of the latency. It goes from about 9.3 ms to 1.1 ms, a drop of 8.2 ms. That corresponds directly to 60 s / 8.2 ms = 7317 RPM.
It was done with simple C code; here's the part that measures the latency between positions a and b in a scratch file. We did this with larger and larger b values until we had wandered all the way around a cylinder:
/* Measure the difference in access time between a and b. The result
 * is measured in nanoseconds. */
int measure_latency(off_t a, off_t b) {
    cycles_t ta, tb;

    overflow_disk_buffer();   /* defeat the drive's cache between measurements */

    lseek(work_file, a, SEEK_SET);
    read(work_file, buf, KiB/2);
    ta = get_cycles();

    lseek(work_file, b, SEEK_SET);
    read(work_file, buf, KiB/2);
    tb = get_cycles();

    int diff = (tb - ta) / cycles_per_ns;
    fprintf(stderr, "%i KiB to %i KiB: %i nsec\n",
            (int)(a / KiB), (int)(b / KiB), diff);
    return diff;
}
The command lsblk -d -o name,rota lists your drives and shows a 1 under ROTA if a drive is rotational and a 0 if it's an SSD.
Example output:
NAME ROTA
sda 1
sdb 0
Detecting SSDs is not as impossible as dseifert makes out. There is already some progress in Linux's libata (http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-04/msg03625.html), though it doesn't seem user-ready yet.
And I definitely understand why this needs to be done. It's basically the difference between a linked list and an array. Defragmentation and such is usually counter-productive on an SSD.
You could get lucky by running
smartctl -i /dev/sda
from smartmontools. Almost all SSDs have "SSD" in the Model field. No guarantee though.
My two cents towards answering this old but very important question... If a disk is accessed via SCSI, then you will (potentially) be able to use the SCSI INQUIRY command to request its rotation rate. The VPD (Vital Product Data) page for that is called Block Device Characteristics and has the number 0xB1. Bytes 4 and 5 of this page contain a number with the following meaning:
0000h "Medium rotation rate is not reported"
0001h "Non-rotating medium (e.g., solid state)"
0002h - 0400h "Reserved"
0401h - FFFEh "Nominal medium rotation rate in rotations per minute (i.e., rpm) (e.g., 7 200 rpm = 1C20h, 10 000 rpm = 2710h, and 15 000 rpm = 3A98h)"
FFFFh "Reserved"
So, an SSD must have 0001h in this field. The T10.org document about this page can be found here.
However, the implementation status of this standard is not clear to me.
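On Linux, one way to try this approach in practice is to issue the INQUIRY through the SG_IO ioctl; the following is a hedged sketch (error handling kept minimal, and the 0xB1 page is assumed to be supported by the device):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <unistd.h>

/* Returns the raw rotation-rate field, or -1 on error.
 * Per the table above: 0001h = non-rotating (SSD), 0401h..FFFEh = RPM. */
int medium_rotation_rate(const char *dev)
{
    unsigned char cdb[6] = { 0x12, 0x01, 0xB1, 0x00, 64, 0x00 }; /* INQUIRY, EVPD=1 */
    unsigned char buf[64] = { 0 };
    unsigned char sense[32] = { 0 };
    struct sg_io_hdr io;
    int fd = open(dev, O_RDONLY | O_NONBLOCK);

    if (fd < 0)
        return -1;
    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.dxfer_len = sizeof(buf);
    io.dxferp = buf;
    io.mx_sb_len = sizeof(sense);
    io.sbp = sense;
    io.timeout = 5000; /* milliseconds */
    if (ioctl(fd, SG_IO, &io) < 0) {
        close(fd);
        return -1;
    }
    close(fd);
    return (buf[4] << 8) | buf[5]; /* bytes 4..5 of VPD page 0xB1 */
}

A return value of 1 would indicate a non-rotating (solid state) medium, subject to the implementation caveat above.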
I wrote the following JavaScript code. I needed to determine whether the machine was using an SSD drive and whether it was the boot drive. The solution uses the MSFT_PhysicalDisk WMI interface.
function main()
{
    var retval = false;
    // MediaType - 0 Unknown, 3 HDD, 4 SSD
    // SpindleSpeed - -1 has rotational speed, 0 has no rotational speed (SSD)
    // DeviceID - 0 boot device
    var objWMIService = GetObject("winmgmts:\\\\.\\root\\Microsoft\\Windows\\Storage");
    var colItems = objWMIService.ExecQuery("select * from MSFT_PhysicalDisk");
    var enumItems = new Enumerator(colItems);
    for (; !enumItems.atEnd(); enumItems.moveNext())
    {
        var objItem = enumItems.item();
        if (objItem.MediaType == 4 && objItem.SpindleSpeed == 0)
        {
            if (objItem.DeviceID == 0)
            {
                retval = true;
            }
        }
    }
    if (retval)
    {
        WScript.Echo("You have an SSD drive and it is your boot drive.");
    }
    else
    {
        WScript.Echo("You do not have an SSD drive");
    }
    return retval;
}
main();
SSD devices emulate a hard disk device interface, so they can just be used like hard disks. This also means that there is no general way to detect what they are.
You probably could use some characteristics of the drive (latency, speed, size), though this won't be accurate for all drives. Another possibility may be to look at the S.M.A.R.T. data and see whether you can determine the type of disk through this (by model name, certain values), however unless you keep a database of all drives out there, this is not gonna be 100% accurate either.
Another approach: write a text file, read it back, and repeat 10000 times. Then 10000/elapsed (I/O operations per second) will be much higher for an SSD. In Python 3:
import time

def ssd_test():
    doc = 'ssd_test.txt'
    start = time.time()
    for i in range(10000):
        with open(doc, 'w+') as f:
            f.write('ssd test')
        with open(doc, 'r') as f:
            ret = f.read()
    stop = time.time()
    elapsed = stop - start
    ios = int(10000 / elapsed)
    hd = 'HDD'
    if ios > 6000:  # ssd > 8000; hdd < 4000
        hd = 'SSD'
    print('detecting hard drive type by read/write speed')
    print('ios', ios, 'hard drive type', hd)
    return hd

Why is cudaMalloc giving me an error when I know there is sufficient memory space?

I have a Tesla C2070 that is supposed to have 5636554752 bytes of memory.
However, this gives me an error:
int *buf_d = NULL;
err = cudaMalloc((void **)&buf_d, 1000000000*sizeof(int));
if (err != cudaSuccess)
{
    printf("CUDA error: %s\n", cudaGetErrorString(err));
    return EXIT_ERROR;
}
How is this possible? Does this have something to do with the maximum memory pitch? Here are the GPU's specs:
Device 0: "Tesla C2070"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5636554752 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
As for the machine I'm running on, it has 24 Intel® Xeon® X565 processors and runs the Linux distribution Rocks 5.4 (Maverick).
Any ideas? Thanks!
The basic problem is in your question title: you don't actually know that you have sufficient memory, you are assuming you do. The runtime API includes the cudaMemGetInfo function, which will return how much free memory there is on the device. When a context is established on a device, the driver must reserve space for device code, local memory for each thread, FIFO buffers for printf support, stack for each thread, and heap for in-kernel malloc/new calls (see this answer for further details). All of this can consume rather a lot of memory, leaving you with much less than the maximum available memory (after ECC reservations) that you are assuming is available to your code. The API also includes cudaDeviceGetLimit, which you can use to query the amount of memory that on-device runtime support is consuming, and a companion call cudaDeviceSetLimit, which allows you to change how much memory each component of runtime support will reserve.
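A small sketch of how those two calls might be used together with cudaMemGetInfo (the 1 MB value and the choice of cudaLimitMallocHeapSize are just illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t heap = 0, free_mem = 0, total_mem = 0;

    /* Query the current reservation for in-kernel malloc/new... */
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("default in-kernel heap: %zu bytes\n", heap);

    /* ...and shrink it to 1 MB to leave more global memory for cudaMalloc. */
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, (size_t)1 << 20);

    cudaMemGetInfo(&free_mem, &total_mem);
    printf("free: %zu of %zu bytes\n", free_mem, total_mem);
    return 0;
}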
Even after you have tuned the runtime memory footprint to your tastes and have the actual free memory value from the driver, there are still page-size granularity and fragmentation considerations to contend with. Rarely is it possible to allocate every byte of what the API will report as free. Usually, I would do something like this when the objective is to try and allocate every available byte on the card:
const size_t Mb = 1<<20; // Assuming a 1Mb page size here
size_t available, total;
cudaMemGetInfo(&available, &total);

int *buf_d = 0;
size_t nwords = total / sizeof(int);
size_t words_per_Mb = Mb / sizeof(int);

while (cudaMalloc((void**)&buf_d, nwords * sizeof(int)) == cudaErrorMemoryAllocation)
{
    nwords -= words_per_Mb;
    if (nwords < words_per_Mb)
    {
        // signal no free memory
        break;
    }
}
// leaves int buf_d[nwords] on the device or signals no free memory
(Note: this has never been near a compiler, and is only safe with CUDA 3 or later.) It is implicitly assumed that none of the obvious sources of problems with big allocations apply here (a 32-bit host operating system, a WDDM Windows platform without TCC mode enabled, older known driver issues).
