LLVM Instruction Scheduling in RISC-V - clang

I am looking at instruction scheduling in LLVM for the RISC-V backend. I understand there are two ways of scheduling (ScheduleDAGRRList & MachineScheduler). From the debug logs I can see that RISC-V uses the ScheduleDAGRRList approach.
Is MachineScheduler better than ScheduleDAGRRList? If so, how can I enable MachineScheduler for RISC-V?
I tried llc -enable-misched file.ll, but with no luck.

The RISC-V backend added support for the Machine Scheduler (MISched) in LLVM release 10.0.
https://releases.llvm.org/10.0.0/docs/ReleaseNotes.html
The TableGen SchedMachineModel description in RISCVSchedRocket64.td describes Rocket as an in-order processor:
// Rocket machine model for scheduling and other instruction cost heuristics.
def Rocket64Model : SchedMachineModel {
  let MicroOpBufferSize = 0; // Explicitly set to zero since Rocket is in-order.
  let IssueWidth = 1;        // 1 micro-ops are dispatched per cycle.
  let LoadLatency = 3;
  let MispredictPenalty = 3;
}
You can enable machine scheduling for rocket-rv64 with:
-O3 -mllvm -enable-misched -mllvm -enable-post-misched -mcpu=rocket-rv64
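If you are driving llc directly (as in the question), the same options can be passed without -mllvm, since llc exposes LLVM's cl::opt flags itself. A sketch, assuming the .ll file already targets riscv64:
llc -mtriple=riscv64 -mcpu=rocket-rv64 -O3 -enable-misched -enable-post-misched file.ll -o file.s
The -mcpu choice matters because it selects a SchedMachineModel (such as Rocket64Model above) for the scheduler to use.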

Related

In PyTorch, how to convert the cuda() related codes into CPU version?

I have some existing PyTorch code using cuda() as below, where net is a MainModel.KitModel object:
net = torch.load(model_path)
net.cuda()
and
im = cv2.imread(image_path)
im = Variable(torch.from_numpy(im).unsqueeze(0).float().cuda())
I want to test the code on a machine without any GPU, so I want to convert the cuda code into a CPU version. I looked at some relevant posts regarding the CPU/GPU switch in PyTorch, but they are about the usage of device and thus don't apply to my case.
As pointed out by kHarshit in his comment, you can simply replace the .cuda() calls with .cpu():
net.cpu()
# ...
im = torch.from_numpy(im).unsqueeze(0).float().cpu()
However, this requires changing the code in multiple places every time you want to move from GPU to CPU and vice versa.
To alleviate this difficulty, PyTorch has a more "general" method, .to().
You can have a device variable defining where you want PyTorch to run; this device can also be the CPU (!).
for instance:
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
Once you have determined, in one place in your code, where you want/can run, simply use .to() to send your model/variables there:
net.to(device)
# ...
im = torch.from_numpy(im).unsqueeze(0).float().to(device)
BTW,
You can use .to() to control the data type (.float()) as well:
im = torch.from_numpy(im).unsqueeze(0).to(device=device, dtype=torch.float)
PS,
Note that the Variable API has been deprecated and is no longer required.
You can also use map_location in torch.load() to map the stored tensors to the CPU at load time:
net = torch.load(model_path, map_location=torch.device('cpu'))
Pytorch docs: https://pytorch.org/tutorials/beginner/saving_loading_models.html#save-on-cpu-load-on-gpu
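Putting the pieces together, a minimal device-agnostic sketch of the code from the question (model_path and image_path are assumed to be defined as before):

import cv2
import torch

# pick whichever device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# map the stored tensors to that device while loading, then move the model
net = torch.load(model_path, map_location=device)
net.to(device)

# load the image and move the input tensor to the same device
im = cv2.imread(image_path)
im = torch.from_numpy(im).unsqueeze(0).to(device=device, dtype=torch.float)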

How does erlang implements preemptive scheduling with one OS thread?

I want to know how Erlang's VM preempts the running code and switches the stack context. How can this be done in a language such as C?
The trick is that the Erlang runtime has control over the VM, so it can - entirely in userspace - keep track of how many VM instructions it's already executed (or, better yet, an estimate or representation of the actual physical computation required for those instructions - a.k.a. "reductions" in Erlang VM parlance) and - if that number exceeds some threshold - immediately swap around process pointers/structs/whatever and resume the execution loop.
Think of it something like this (in kind of a pseudo-C that may or may not actually be C, but I wouldn't know because I ain't a C programmer, but you asked how you'd go about it in C so I'll try my darndest):
void proc_execute(Proc* proc)
{
    /* I don't recall if Erlang's VM supports different
       reduction limits for different processes, but if it
       did, it'd be a rather intuitive way to define process
       priorities, i.e. making sure higher-priority processes
       get more reductions to spend */
    int rds = proc->max_reductions;
    for (; rds > 0; rds--) {
        /* Different virtual instructions might execute different numbers of
           physical instructions, so vm_execute_next_instruction will return
           however many reductions are left after executing that virtual
           instruction. */
        rds = vm_execute_next_instruction(proc, rds);
        if (proc->exited) break;
    }
}

void vm_loop(Scheduler* sched)
{
    Proc *proc;
    for (;;) {
        proc = sched_next_in_queue(sched);
        /* we'll assume that the proc will be null if the
           scheduler doesn't have any processes left in its
           list */
        if (!proc) break;
        proc_execute(proc);
    }
}

Proc* sched_next_in_queue(Scheduler* sched)
{
    if (!sched->current_proc->exited) {
        /* If the process hasn't exited yet, readd it to the
           end of the queue so we can resume running it
           later */
        shift(sched->queue, sched->current_proc);
    }
    sched->current_proc = pop(sched->queue);
    return sched->current_proc;
}
This is obviously quite simplified (notably excluding/eliding a lot of important stuff like how VM instructions are implemented and how messages get passed), but hopefully it illustrates how (if I'm understanding right, at least) Erlang's preemptive scheduler and process model works on a basic level.
All Erlang code compiles to the operation codes (bytecode) of Erlang's VM. The VM executes this bytecode on OS threads that are created at the VM's startup.
Erlang code runs on virtual CPUs that are controlled by Erlang's VM, and the VM treats I/O as an interrupt of those virtual CPUs. So Erlang's VM implements a machine and a scheduler, much like an OS. Because of the operation codes and non-blocking I/O, preemption can be implemented in Erlang's VM using the C language.
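To make the non-blocking I/O point concrete, here is a hedged variation of the vm_loop sketch from the answer above (poll_io_nonblocking is a made-up helper, not actual BEAM code): a single OS thread alternates between a zero-timeout I/O poll and running one process for a bounded number of reductions, so neither I/O nor any single process can block the thread.

void vm_loop(Scheduler* sched)
{
    Proc *proc;
    for (;;) {
        /* check for ready I/O without blocking (think poll()/epoll with a
           0 timeout) and turn completions into messages for the waiting
           processes */
        poll_io_nonblocking(sched);
        proc = sched_next_in_queue(sched);
        if (!proc) break;
        /* runs for at most max_reductions before control returns here */
        proc_execute(proc);
    }
}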

MSBuild is failing inconsistently when performing a TFS build (usually error C1093 / Not enough Storage)

I have a really odd and hard-to-diagnose issue with MSBuild / TFS. I have a solution that contains about 12 different build configurations. When running on the build server, it takes maybe 30 minutes to build the lot; it has worked fine for weeks, but is now occasionally failing.
Most of the time, when it fails it'll be an error like this:
19:25:45.037 2>TestPlanDocument.cpp(1): fatal error C1093: API call 'GetAssemblyRefHash' failed '0x8007000e' : ErrorMessage: Not enough storage is available to complete this operation. [C:\Builds\1\ICCSim Card Test Controller\ICCSimCTC Release\src\CardTestController\CardTestController.vcxproj]
The error will sometimes happen on a different file. It won't happen for every build configuration either; it's very inconsistent, and occasionally all of them even build successfully. There's not much difference between the build configurations either, mostly just a few string changes, and of course they all build locally just fine.
The API call in question is usually GetAssemblyRefHash but not always. I don't think this is the issue, as Googling for GetAssemblyRefHash specifically brings up next to nothing. I suspect there's some kind of resource issue at play here, but I'm at a loss as to what: there's plenty of HDD space (hundreds of GBs) and plenty of RAM (the machine originally had a 4GB minimum allocated but was dynamic as it's a Hyper-V VM; it never pushed above 2.5GB. I upped this to an 8GB minimum just in case and there's been no change).
I've set the build verbosity to diagnostic and it doesn't really show anything else that's helpful, just the same error.
For reference, the build server is fully up to date on all patches. It's running Windows Server 2012 R2, has TFS 2013 and VS 2013 installed, both are on Update 4.
I'm really at a loss at this point and would appreciate any help or pointers.
EDIT: Just to keep people up to date: the compile toolchain was in 32-bit mode; however, even after switching to 64-bit, the issue persists.
I think I found the source, but I still don't know the reason.
Browsing through the Microsoft Shared Source, we can find the source for GetAssemblyRefHash():
HRESULT CAsmLink::GetAssemblyRefHash(mdToken FileToken, const void** ppvHash, DWORD* pcbHash)
{
    if (TypeFromToken(FileToken) != mdtAssemblyRef) {
        VSFAIL( "You can only get AssemblyRef hashes for assemblies!");
        return E_INVALIDARG;
    }
    HRESULT hr;
    CAssembly *file = NULL;
    if (FAILED(hr = m_pImports->GetFile( FileToken, (CFile**)&file)))
        return hr;

    return file->GetHash(ppvHash, pcbHash);
}
There are only two places here to investigate: the call to m_pImports->GetFile(), where m_pImports is CAssembly *m_pImports;, and the call to file->GetHash().
m_pImports->GetFile() is here, and is a dead end:
HRESULT CAssembly::GetFile(DWORD index, CFile** file)
{
    if (!file)
        return E_POINTER;
    if (RidFromToken(index) < m_Files.Count()) {
        if ((*file = m_Files.GetAt(RidFromToken(index))))
            return S_OK;
    }
    return ReportError(E_INVALIDARG);
}
file->GetHash(), which is here:
HRESULT CAssembly::GetHash(const void ** ppvHash, DWORD *pcbHash)
{
    ASSERT( ppvHash && pcbHash);
    if (IsInMemory()) {
        // We can't hash an InMemory file
        *ppvHash = NULL;
        *pcbHash = 0;
        return S_FALSE;
    }

    if (!m_bDoHash || (m_cbHash && m_pbHash != NULL)) {
        *ppvHash = m_pbHash;
        *pcbHash = m_cbHash;
        return S_OK;
    }

    DWORD cchSize = 0, result;
    // AssemblyRefs ALWAYS use CALG_SHA1
    ALG_ID alg = CALG_SHA1;

    if (StrongNameHashSize( alg, &cchSize) == FALSE)
        return ReportError(StrongNameErrorInfo());

    if ((m_pbHash = new BYTE[cchSize]) == NULL)
        return ReportError(E_OUTOFMEMORY);
    m_cbHash = cchSize;

    if ((result = GetHashFromAssemblyFileW(m_Path, &alg, (BYTE*)m_pbHash, cchSize, &m_cbHash)) != 0) {
        delete [] m_pbHash;
        m_pbHash = 0;
        m_cbHash = 0;
    }

    *ppvHash = m_pbHash;
    *pcbHash = m_cbHash;
    return result == 0 ? S_OK : ReportError(HRESULT_FROM_WIN32(result));
}
We can see that about halfway down, it tries to allocate room to store the BYTE[] result, and when that fails, it returns E_OUTOFMEMORY (0x8007000e), which is the error code you're seeing:
if ((m_pbHash = new BYTE[cchSize]) == NULL)
    return ReportError(E_OUTOFMEMORY);
m_cbHash = cchSize;
There are other paths to consider, but this seems like the most obvious source. So it looks like the problem is that a plain memory allocation is failing.
What could cause this?
Lack of free physical memory pages / swap
Memory fragmentation in the process.
Inability to reserve commit space for this in the swap file
Lack of address space
At this point, my best guess would be memory fragmentation. Have you triple checked that the Microsoft CPP compiler is running in 64-bit mode? Perhaps see if you can debug the compiler (Microsoft symbol servers may be able to help you here), and set a breakpoint for that line and dump the heap when it happens.
Some specifics on diagnosing heap fragmentation - fire up Sysinternals' VMMap when the compiler breaks and look at the free list - you need three chunks of at least 64 kB free to perform an allocation; anything smaller than 64 kB won't get used, and two 64 kB chunks are reserved.
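If you do attach a debugger, a couple of stock WinDbg commands can help quantify this (a suggestion only; the diagnosis above does not depend on them):
!address -summary   (breakdown of the process address space, including the largest free region)
!heap -s            (per-heap statistics, useful for spotting fragmentation)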
Okay, I have an update to this! I opened a support ticket with Microsoft and have been busy working with them to figure out the issue.
They went down the same paths as outlined above and came to the same conclusion - it's not a resources issue.
To cut a long story short, Microsoft has now acknowledged that this is likely a bug in the VC++ compiler, which is almost certainly caused by a race condition (Though this is unconfirmed). There's no word on if they'll fix it in a future release.
There is a workaround: use the /MP flag at the project level to limit the number of compiler processes opened by MSBuild, without disabling multiple instances entirely (which for me doubled build times).
To do this, go to your project properties and under Configuration Properties -> C/C++ -> Command Line, you need to specify the /MP flag and then a number to limit the number of processes.
My build server has 8 virtual CPUs and the normal behaviour is equivalent to /MP8, but this causes the bug to sometimes appear. For me, using /MP4 seems to be enough to limit the bug without causing build times to increase too much. If you're seeing a bug similar to this, you may need to experiment with other numbers such as /MP6 or /MP2.
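If you would rather keep the setting in source control than in the property dialog, the same option can go straight into the .vcxproj; a sketch, assuming a limit of four compiler processes:

<ItemDefinitionGroup>
  <ClCompile>
    <!-- cap cl.exe at four parallel compiler processes instead of one per virtual CPU -->
    <AdditionalOptions>/MP4 %(AdditionalOptions)</AdditionalOptions>
  </ClCompile>
</ItemDefinitionGroup>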

PowerPC e500 : any "Page Global Enable" flag equivalent?

From the Intel x86 System Programming Guide :
PGE Page Global Enable (bit 7 of CR4) — (Introduced in the P6 family processors.)
Enables the global page feature when set; disables the global page feature when clear. The global page feature allows frequently used or shared pages to be marked as global to all users (done with the global flag, bit 8, in a page-directory or page-table entry). Global pages are not flushed from the translation-lookaside buffer (TLB) on a task switch or a write to register CR3.
Is there any equivalent feature on the PowerPC e500 core family?
Thanks,
Telenn
Looking at the e500 Core Reference Manual (downloadable from freescale.com) I found that the MAS[0-4] registers do roughly what you need (section 2.12). This is part of the Freescale Book E implementation, so further details can also be found on freescale.com.

Reserving a portion of SDRAM to pass data between U-Boot and the Linux Kernel

How can I reserve a portion of SDRAM, say 4 bytes, to pass a flag between U-Boot and the Linux kernel, so that this reserved memory location is not initialized at startup and the value is preserved after a warm boot? I'm trying to avoid using bootargs to minimize wear of the NAND flash used in an embedded application. My question could be considered an extension to the solution provided by:
How to detect cold boot versus warm boot on an ARM processor?
I have built U-Boot with the linker script (u-boot.lds) below and compiled it with -fno-zero-initialized-in-bss, without success.
OUTPUT_FORMAT("elf32-littlearm", "elf32-littlearm", "elf32-littlearm")
OUTPUT_ARCH(arm)
ENTRY(_start)
SECTIONS
{
    . = 0x00000000;
    . = ALIGN(4);
    .text :
    {
        cpu/arm926ejs/start.o (.text)
        *(.text)
    }
    . = ALIGN(4);
    .rodata : { *(SORT_BY_ALIGNMENT(SORT_BY_NAME(.rodata*))) }
    . = ALIGN(4);
    .data : { *(.data) }
    . = ALIGN(4);
    .got : { *(.got) }
    . = .;
    __u_boot_cmd_start = .;
    .u_boot_cmd : { *(.u_boot_cmd) }
    __u_boot_cmd_end = .;
    . = ALIGN(4);
    __bss_start = .;
    _U_BOOT_FLAG = .; . = . + 4;
    .bss (NOLOAD) : { *(.bss) . = ALIGN(4); }
    _end = .;
}
Any ideas?
There is already a method for passing data between U-Boot and the Linux ARM kernel: the ATAG memory list. Information such as usable memory regions and board information is passed from U-Boot to the Linux ARM kernel using this data list. You could define a custom ATAG for your data. In U-Boot, add a routine that builds your ARM tag in lib_arm/armlinux.c; the ATAGs are then processed by the kernel in arch/arm/kernel/setup.c.
For documentation see Section 8 of this or this alt site.
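A hedged sketch of what the U-Boot side could look like, following the setup_*_tag() pattern in lib_arm/armlinux.c (ATAG_WARMBOOT_FLAG, tag_warmboot and setup_warmboot_tag are made-up names for illustration, and struct tag_warmboot would also need to be added to the union in the struct tag definition):

/* a unique tag number agreed upon by both U-Boot and the kernel */
#define ATAG_WARMBOOT_FLAG  0x41000099

struct tag_warmboot {
    u32 flag;
};

/* called alongside the other setup_*_tag() helpers before do_bootm_linux()
   jumps to the kernel; 'params' is U-Boot's global tag-list pointer */
static void setup_warmboot_tag(u32 flag_value)
{
    params->hdr.tag = ATAG_WARMBOOT_FLAG;
    params->hdr.size = tag_size(tag_warmboot);
    params->u.warmboot.flag = flag_value;
    params = tag_next(params);
}

On the kernel side, a matching parser registered with __tagtable(ATAG_WARMBOOT_FLAG, parse_warmboot_tag) would read the flag during boot.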
ADDENDUM
Links to the referenced ATAG documentation are tenuous (even Google has a bad link).
Try searching for the actual name of the document, which is "Booting ARM Linux" by Vincent Sanders.
Currently there's a copy in Google's cache of the simtec site, and a broader search turned up a translation in Korean(?).
Another, possibly earlier (though it seems to have been updated), document on ARM booting by Russell King is here.
If you want to go by the global-variable approach in How to detect cold boot versus warm boot on an ARM processor? :
You can force that global variable to be in a specific ELF section (see http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Variable-Attributes.html) , and then in the linker script set that section to a specific address.
If you have good ld-script skills, you could even get the linker to initialize all bss sections except that one :)
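A minimal sketch of that approach, assuming a section name of .boot_flag (a placeholder to adapt). On the C side, keep the flag out of .data and .bss so that neither relocation nor the BSS-clear loop touches it:

volatile unsigned long boot_flag __attribute__((section(".boot_flag")));

and in u-boot.lds, place the section outside the range that gets cleared, e.g. just before __bss_start in the script above:

. = ALIGN(4);
.boot_flag (NOLOAD) : { *(.boot_flag) }
. = ALIGN(4);
__bss_start = .;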
