How would one go about modifying individual assembly instructions in an application while it is running?
I have a Mobile Substrate tweak that I am writing for an existing application. In the tweak's constructor (MSInitialize), I need to be able to rewrite individual instruction(s) in the app's code. What I mean by this is that there may be multiple places in the application's address space that I wish to modify, but in each instance, only a single instruction needs to be modified. I have already disabled ASLR for the application and know the exact memory address of the instruction to be patched, and I have the hex bytes (as a char[], but this is uninportant and can be changed if necessary) of the new instruction. I just need to figure out how to perform the change.
I know that iOS uses Data Execution Prevention (DEP) to specify that executable memory pages cannot also be writeable and vice versa, but I know that it is possible to bypass this on a jailbroken device. I also know that the ARM processor used by iDevices has an instruction cache that needs to be updated to reflect the change. However, I do not even know where to begin to do this.
So, to answer the question that would surely otherwise be asked, I have not tried anything. This is not because I am lazy; rather, it is because I have absolutely no clue how this could be accomplished. Any help at all would be greatly appreciated.
Edit:
If it helps at all, my ultimate goal is to use this in a Mobile Substrate tweak that hooks an App Store application. Previously, in order to mod this application, one would have to first crack it to decrypt the app so the binary could be patched. I want to make it so people wouldn't have to crack the app, since that can lead to piracy which I am strongly against. I can't use Mobile Substrate normally because all of the work is done in C++, not Objective-C, and the application is stripped, leaving no symbols to use MSHookFunction on.
Completely forgot I asked this question, so I'll show what I ended up with now. The comments should explain how and why it works.
#include <stdio.h>
#include <stdbool.h>
#include <mach/mach.h>
#include <libkern/OSCacheControl.h>
#define kerncall(x) ({ \
kern_return_t _kr = (x); \
if(_kr != KERN_SUCCESS) \
fprintf(stderr, "%s failed with error code: 0x%x\n", #x, _kr); \
_kr; \
})
bool patch32(void* dst, uint32_t data) {
mach_port_t task;
vm_region_basic_info_data_t info;
mach_msg_type_number_t info_count = VM_REGION_BASIC_INFO_COUNT;
vm_region_flavor_t flavor = VM_REGION_BASIC_INFO;
vm_address_t region = (vm_address_t)dst;
vm_size_t region_size = 0;
/* Get region boundaries */
if(kerncall(vm_region(mach_task_self(), ®ion, ®ion_size, flavor, (vm_region_info_t)&info, (mach_msg_type_number_t*)&info_count, (mach_port_t*)&task))) return false;
/* Change memory protections to rw- */
if(kerncall(vm_protect(mach_task_self(), region, region_size, false, VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY))) return false;
/* Actually perform the write */
*(uint32_t*)dst = data;
/* Flush CPU data cache to save write to RAM */
sys_dcache_flush(dst, sizeof(data));
/* Invalidate instruction cache to make the CPU read patched instructions from RAM */
sys_icache_invalidate(dst, sizeof(data));
/* Change memory protections back to r-x */
kerncall(vm_protect(mach_task_self(), region, region_size, false, VM_PROT_EXECUTE | VM_PROT_READ));
return true;
}
vm_protect to w^x, assuming you're jailbroken with a decent jailbreak (e.g. if mobilesubstrate works)
Writing to instruction memory from processor registers is, as others say above, a bit tricky. Especially with iPhones, since Apple tries to keep the processor details secret.
Permissions on memory access are the first problem. Executable memory is not normally writable. However, if this is overcome, then there is a little dance to go through to get data out of the processor registers and into the instruction pipeline. In general, there are synchronisation instructions, which force a specific order on the memory accesses before and after them, and cache commands, which force dirty write data out to memory and flush out clean and possibly stale read data. Both of these are highly dependent on the detailed implementation of the processor.
Arm Has nice manuals on the web that explain these in detail for specific processors. However, whether the processors inside iPhones do what the public Arm manuals say, I have no idea.
Here's a place to start understanding the Arm memory synchronisation model for one processor:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0092b/ch04s03s04.html
and that goes on to tell how to flush the instruction cache by a write to a control register. It certainly is possible to write self-modifying code for Arm processors because somewhere in that manual I found a statement that said that it is sometimes unavoidable and the has to be supported.
(I'm not claiming this is an answer. But it wouldn't fit in a comment.)
Related
In WIN32:
I'm sure that if the handle is the same, the memory may not be the same, and the same handle will be returned no matter how many times getMemoryWin32HandleKHR is executed.
This is consistent with vulkan's official explanation: Vulkan shares memory.
It doesn't seem to work properly in Linux.
In my program,
getMemoryWin32HandleKHR works normally and can return a different handle for each different memory.
The same memory returns the same handle.
But in getMemoryFdKHR, different memories return the same fd.
Or the same memory executes getMemoryFdKHR twice, it can return two different handles.
This causes me to fail the device memory allocation during subsequent imports.
I don't understand why this is?
Thanks!
#ifdef WIN32
texGl.handle = device.getMemoryWin32HandleKHR({ info.memory, vk::ExternalMemoryHandleTypeFlagBits::eOpaqueWin32 });
#else
VkDeviceMemory memory=VkDeviceMemory(info.memory);
int file_descriptor=-1;
VkMemoryGetFdInfoKHR get_fd_info{
VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR, nullptr, memory,
VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT
};
VkResult result= vkGetMemoryFdKHR(device,&get_fd_info,&file_descriptor);
assert(result==VK_SUCCESS);
texGl.handle=file_descriptor;
// texGl.handle = device.getMemoryFdKHR({ info.memory, vk::ExternalMemoryHandleTypeFlagBits::eOpaqueFd });
Win32 is nomal.
Linux is bad.
It will return VK_ERROR_OUT_OF_DEVICE_MEMORY.
#ifdef _WIN32
VkImportMemoryWin32HandleInfoKHR import_allocate_info{
VK_STRUCTURE_TYPE_IMPORT_MEMORY_WIN32_HANDLE_INFO_KHR, nullptr,
VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT, sharedHandle, nullptr };
#elif __linux__
VkImportMemoryFdInfoKHR import_allocate_info{
VK_STRUCTURE_TYPE_IMPORT_MEMORY_FD_INFO_KHR, nullptr,
VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT,
sharedHandle};
#endif
VkMemoryAllocateInfo allocate_info{
VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO, // sType
&import_allocate_info, // pNext
aligned_data_size_, // allocationSize
memory_index };
VkDeviceMemory device_memory=VK_NULL_HANDLE;
VkResult result = vkAllocateMemory(m_device, &allocate_info, nullptr, &device_memory);
NVVK_CHECK(result);
I think it has something to do with fd.
In my some test: if I try to get fd twice. use the next fd that vkAllocateMemory is work current......but I think is error .
The fd obtained in this way is different from the previous one.
Because each acquisition will be a different fd.
This makes it impossible for me to distinguish, and the following fd does vkAllocateMemory.
Still get an error.
So this test cannot be used.
I still think it should have the same process as win32. When the fd is obtained for the first time, vkAllocateMemory can be performed correctly.
thanks very much!
The Vulkan specifications for the Win32 handle and POSIX file descriptor interfaces explicitly state different things about their importing behavior.
For HANDLEs:
Importing memory object payloads from Windows handles does not transfer ownership of the handle to the Vulkan implementation. For handle types defined as NT handles, the application must release handle ownership using the CloseHandle system call when the handle is no longer needed.
For FDs:
Importing memory from a file descriptor transfers ownership of the file descriptor from the application to the Vulkan implementation. The application must not perform any operations on the file descriptor after a successful import.
So HANDLE importation leaves the HANLDE in a valid state, still referencing the memory object. File descriptor importation claims ownership of the FD, leaving it in a place where you cannot use it.
What this means is that the FD may have been released by the internal implementation. If that is the case, later calls to create a new FD may use the same FD index as a previous call.
The safest way to use both of these APIs is to have the Win32 version emulate the functionality of the FD version. Don't try to do any kinds of comparisons of handles. If you need some kind of comparison logic, then you'll have to implement it yourself. When you import a HANDLE, close it immediately afterwards.
What exactly should I expect to happen when using DiscardResource?
What's the difference between discard and destroying/deleting a resource.
When is a good time/use-case to discard a resource?
Unfortunately Microsoft doesn't seem to say much about it other than it "discards a resource".
TL;DR: Is a rarely used function that provides a driver hint related to handling clear compression structures. You are unlikely to use it except based on specific performance advice.
DiscardResource is the DirectX 12 version of the Direct3D 11.1 method. See Microsoft Docs
The primary use of these methods it to optimize the performance of tiled-based deferred rasterizer graphics parts by discarding the render target after present. This is a hint to the driver that the contents of the render target are no longer relevant to the operation of the program, so it can avoid some internal clearing operations on the next use.
For DirectX 11, this is in the DirectX 11 App template to use DiscardView because it makes use of DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL:
void DX::DeviceResources::Present()
{
// The first argument instructs DXGI to block until VSync, putting the application
// to sleep until the next VSync. This ensures we don't waste any cycles rendering
// frames that will never be displayed to the screen.
DXGI_PRESENT_PARAMETERS parameters = { 0 };
HRESULT hr = m_swapChain->Present1(1, 0, ¶meters);
// Discard the contents of the render target.
// This is a valid operation only when the existing contents will be entirely
// overwritten. If dirty or scroll rects are used, this call should be removed.
m_d3dContext->DiscardView1(m_d3dRenderTargetView.Get(), nullptr, 0);
// Discard the contents of the depth stencil.
m_d3dContext->DiscardView1(m_d3dDepthStencilView.Get(), nullptr, 0);
// If the device was removed either by a disconnection or a driver upgrade, we
// must recreate all device resources.
if (hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET)
{
HandleDeviceLost();
}
else
{
DX::ThrowIfFailed(hr);
}
}
The DirectX 12 App template doesn't need those explicit calls because it uses DXGI_SWAP_EFFECT_FLIP_DISCARD.
If you are wondering why the DirectX 11 app doesn't just use DXGI_SWAP_EFFECT_FLIP_DISCARD, it probably should. The DXGI_SWAP_EFFECT_FLIP_SEQUENTIAL swap effect was the only one supported by Windows 8.x for Windows Store apps, which is when DiscardView was introduced. For Windows 10 / DirectX 12 / UWP, it's probably better to always use DXGI_SWAP_EFFECT_FLIP_DISCARD unless you specifically don't want the backbuffer discarded.
It is also useful for multi-GPU SLI / Crossfire configurations since the clearing operation can require synchronization between the GPUs. See this GDC 2015 talk
There are also other scenario-specific usages. For example, if doing deferred rendering for the G-buffer where you know every single pixel will be overwritten, you can use DiscardResource instead of doing ClearRenderTargetView / ClearDepthStencilView.
I have a PCI based Device, more specifically based on tms320c6000 DSP, I am trying to communicate (reading some registers) with this device through the Jungo WinDriver. Surprisingly it sometimes work and sometimes doesn't, when it doesn't system hang and I have to restart the system.
this is the snipped code which I used to read EMIF Registers, for example.
WD_TRANSFER tt[9];
BZERO(tt);
for (unsigned i = 0; i < 9; i++) {
tt[i].cmdTrans = RM_DWORD;
tt[i].dwPort = mmr + (i * 4);
}
WD_MultiTransfer(hDevice, &tt, 9);
mmr came from WD_CardRegister function which gave information about the PCI BARs and their mapped address (mmr is non prefechtable mapped memory).
I would be very grateful if someone could give me some hint about what might cause this problem.
Thanks
I am answering my question in case the problem happened to someone else.
there are sequence of actions which should be taken before using this device.
Warm reset through HDCR Register (set WARMRESET bit).
then the EMIF registers should be initialised, these are the values I used.
struct emif emif_val = {
0x00052078, //GBLCTL;
0x73a28e01, //CE1 Flash/FPGA;
0xffffffd3, //CE0 SDRAM;
0x00000000, //Reserved;
0x22a28a22, //CE2 Daughtercard 32-bit async
0x22a28a42, //CE3 Daughtercard 32-bit sync
0x63115000, //SDRAM contral, 4 banks
0x0000081b, //SDRAM timing
0x001faf4d //SDRAM extended control
};
and then you are able to access all address space of the device without any problem.
and for more information this linux source code could be very helpful
I hope this question isn't too open-ended. I ran into a memory issue with Rust, where I got an "out of memory" from calling next on an Iterator trait object. I'm unsure how to debug it. Prints have only brought me to the point where the failure occurs. I'm not very familiar with other tools such as ltrace, so although I could create a trace (231MiB, pff), I didn't really know what to do with it. Is a trace like that useful? Would I do better to grab gdb/lldb? Or Valgrind?
In general I would try to do the following approach:
Boilerplate reduction: Try to narrow down the problem of the OOM, so that you don't have too much additional code around. In other words: the quicker your program crashes, the better. Sometimes it is also possible to rip out a specific piece of code and put it into an extra binary, just for the investigation.
Problem size reduction: Lower the problem from OOM to a simple "too much memory" so that you can actually tell the some part wastes something but that it does not lead to an OOM. If it is too hard to tell wether you see the issue or not, you can lower the memory limit. On Linux, this can be done using ulimit:
ulimit -Sv 500000 # that's 500MB
./path/to/exe --foo
Information gathering: If you problem is small enough, you are ready to collect information which has a lower noise level. There are multiple ways which you can try. Just remember to compile your program with debug symbols. Also it might be an advantage to turn off optimization since this usually leads to information loss. Both can be archived by NOT using the --release flag during compilation.
Heap profiling: One way is too use gperftools:
LD_PRELOAD="/usr/lib/libtcmalloc.so" HEAPPROFILE=/tmp/profile ./path/to/exe --foo
pprof --gv ./path/to/exe /tmp/profile/profile.0100.heap
This shows you a graph which symbolizes which parts of your program eat which amount of memory. See official docs for more details.
rr: Sometimes it's very hard to figure out what is actually happening, especially after you created a profile. Assuming you did a good job in step 2, you can use rr:
rr record ./path/to/exe --foo
rr replay
This will spawn a GDB with superpowers. The difference to a normal debug session is that you can not only continue but also reverse-continue. Basically your program is executed from a recording where you can jump back and forth as you want. This wiki page provides you some additional examples. One thing to point out is that rr only seems to work with GDB.
Good old debugging: Sometimes you get traces and recordings that are still way too large. In that case you can (in combination with the ulimit trick) just use GDB and wait until the program crashes:
gdb --args ./path/to/exe --foo
You now should get a normal debugging session where you can examine what the current state of the program was. GDB can also be launched with coredumps. The general problem with that approach is that you cannot go back in time and you cannot continue with execution. So you only see the current state including all stack frames and variables. Here you could also use LLDB if you want.
(Potential) fix + repeat: After you have a glue what might go wrong you can try to change your code. Then try again. If it's still not working, go back to step 3 and try again.
Valgrind and other tools work fine, and should work out of the box as of Rust 1.32. Earlier versions of Rust require changing the global allocator from jemalloc to the system's allocator so that Valgrind and friends know how to monitor memory allocations.
In this answer, I use the macOS developer tool Instruments, as I'm on macOS, but Valgrind / Massif / Cachegrind work similarly.
Example: An infinite loop
Here's a program that "leaks" memory by pushing 1MiB Strings into a Vec and never freeing it:
use std::{thread, time::Duration};
fn main() {
let mut held_forever = Vec::new();
loop {
held_forever.push("x".repeat(1024 * 1024));
println!("Allocated another");
thread::sleep(Duration::from_secs(3));
}
}
You can see memory growth over time, as well as the exact stack trace that allocated the memory:
Example: Cycles in reference counts
Here's an example of leaking memory by creating an infinite reference cycle:
use std::{cell::RefCell, rc::Rc};
struct Leaked {
data: String,
me: RefCell<Option<Rc<Leaked>>>,
}
fn main() {
let data = "x".repeat(5 * 1024 * 1024);
let leaked = Rc::new(Leaked {
data,
me: RefCell::new(None),
});
let me = leaked.clone();
*leaked.me.borrow_mut() = Some(me);
}
See also:
Why does Valgrind not detect a memory leak in a Rust program using nightly 1.29.0?
Handling memory leak in cyclic graphs using RefCell and Rc
Minimal `Rc` Dependency Cycle
In general, to debug, you can use either a log-based approach (either by inserting the logs yourself, or having a tool such a ltrace, ptrace, ... to generate the logs for you) or you can use a debugger.
Note that ltrace, ptrace or debugger-based approaches require that you be able to reproduce the problem; I tend to favor manual logs because I work in an industry where bug reports are generally too imprecise to allow immediate reproduction (and thus we use logs to create the reproducer scenario).
Rust supports both approaches, and the standard toolset that one uses for C or C++ programs works well for it.
My personal approach is to have some logging in place to quickly narrow down where the issue occurs, and if logging is insufficient to fire up a debugger for a more fine-combed inspection. In this case I would recommend going straight away for the debugger.
A panic is generated, which means that by breaking on the call to the panic hook, you get to see both the call stack and memory state at the moment where things go awry.
Launch your program with the debugger, set a break point on the panic hook, run the program, profit.
I'm playing heaps of videos at the same time with AVPlayer. To reduce loading times, I'm storing the corresponding views in a NSCache.
This works fine until reaching a certain number of videos, from which the videos simply stop playing, or even appearing.
There's no error, log or memory warning. In particular, I'm listening to UIApplicationDidReceiveMemoryWarningNotification to clear the cache but this is never received.
If I remove the cache, all the videos play at expense of worse performance.
This makes me suspect that AVPlayer is using memory from a different process (which one?). And when that memory reaches a certain limit, new players cease to work.
Is this correct?
If so, is there a way to be notified when this magic limit is reached to take the appropriate measures (e.g., clear the cache) to ensure playback of other media?
Good news and bad news - good is you can probably fix the problem, bad is it takes work and is somewhat complex.
Root Problem
The reason you don't get notified early happens because iOS does not find out that your app has exceeded its memory budget til its almost too late, then it immediately kills it. The problem has to do with the way iOS (and OS X) manage the file system cache. Normally, when files get opened, as you read the data, the file data gets transferred into a buffer in the Unified Buffer Cache (a term you can google for more info) - I'll call it UBC from now on.
So suppose you have 10 open files, and you have read every file to the end, but have not closed the files. Well, all that data is sitting in the UBC. Now, if you close the files, the buffers are all freed. And technically, the OS can purge this buffers too - only it seems by the time it realizes that memory is tight, it chooses to blow the app away first (and there may be valid reasons for it to do this). So imagine that you app is showing videos, and the way the videos get loaded is through the file system, the number of free buffers starts dropping. At some point iOS notices this, tracks down who most belong to (your app), and wham, kills your app ASAP.
I hit this problem myself in an open source project I support, PhotoScrollerNetwork. Users started complaining that their project was getting terminated by the system, like you, without any notification. I tried in vain to monitor the UBC (there are APIs on OS X to do so, but not on iOS). In the end I found a solution using an heuristic - monitor all your memory usage including the UBC, and don't exceed 50% of the total available iOS memory pool.
So (you might ask) - what is the Apple approved way to solve this problem? Well, there is none. How do I know that? Because I had a half hour long discussion at WWDC 2012 with the Director of Core iOS in one of the labs (after getting ping ponged around by others who had no idea what I was talking about). In the end, after I explained the above heuristic, he told me directly that the solution was probably as good as any he could think of. Without an API to directly monitor the UBC, you can only approximate its usage and adjust accordingly.
But you say, I'm using the NSCache - why doesn't the system account for the AVPlayer memory there? There reason is undoubtedly the UBC - an AVPlayer instance probably only consumes a few thousand K of memory itself - its the open file to the video that is not accounted for by iOS.
Possible Solutions
1) If you can load the videos directly into a NSData object, and keep that in the NSCache, you can most likely totally avoid the UBC issues mentioned above. [I don't know enough about the AV system to know if you can do this.] In this case the system should be capable of purging memory when it needs to.
2) Continue using your original code, but add memory management to it. That is, when you get create an AVPlayer instance, you will need to account for the size of the video in bytes, and keep a running tally of all this memory. When you approach 50% of total device free memory, then start purging old AVPlayers.
Code
For completeness, I've provided the relevant code from PhotoScrollerNetwork below. If you want more details you can peruse the project - however its quite complex so expect to spend some time (its doing JPEG decoding on the fly for massive images and writing tiles to the file system as the decode proceeds).
// Data Structure
typedef struct {
size_t freeMemory;
size_t usedMemory;
size_t totlMemory;
size_t resident_size;
size_t virtual_size;
} freeMemory;
Early on in your app:
// ubc_threshold_ratio defaults to 0.5f
// Take a big chunk of either free memory or all memory
freeMemory fm = [self freeMemory:#"Initialize"];
float freeThresh = (float)fm.freeMemory*ubc_threshold_ratio;
float totalThresh = (float)fm.totlMemory*ubc_threshold_ratio;
size_t ubc_threshold = lrintf(MAX(freeThresh, totalThresh));
size_t ubc_usage = 0;
// Method on some class to monitor the memory pool
- (freeMemory)freeMemory:(NSString *)msg
{
// http://stackoverflow.com/questions/5012886
mach_port_t host_port;
mach_msg_type_number_t host_size;
vm_size_t pagesize;
freeMemory fm = { 0, 0, 0, 0, 0 };
host_port = mach_host_self();
host_size = sizeof(vm_statistics_data_t) / sizeof(integer_t);
host_page_size(host_port, &pagesize);
vm_statistics_data_t vm_stat;
if (host_statistics(host_port, HOST_VM_INFO, (host_info_t)&vm_stat, &host_size) != KERN_SUCCESS) {
LOG(#"Failed to fetch vm statistics");
} else {
/* Stats in bytes */
natural_t mem_used = (vm_stat.active_count +
vm_stat.inactive_count +
vm_stat.wire_count) * pagesize;
natural_t mem_free = vm_stat.free_count * pagesize;
natural_t mem_total = mem_used + mem_free;
fm.freeMemory = (size_t)mem_free;
fm.usedMemory = (size_t)mem_used;
fm.totlMemory = (size_t)mem_total;
struct task_basic_info info;
if(dump_memory_usage(&info)) {
fm.resident_size = (size_t)info.resident_size;
fm.virtual_size = (size_t)info.virtual_size;
}
#if MEMORY_DEBUGGING == 1
LOG(#"%#: "
"total: %u "
"used: %u "
"FREE: %u "
" [resident=%u virtual=%u]",
msg,
(unsigned int)mem_total,
(unsigned int)mem_used,
(unsigned int)mem_free,
(unsigned int)fm.resident_size,
(unsigned int)fm.virtual_size
);
#endif
}
return fm;
}
When you open a video, add the size to ubc_usage, and when you close one decrement it. When you want to open a new video, test ubc_usage against ubc_threadhold, and its it exceeds the value you have to close something first.
PS: you can try calling that freeMemory method at other times, and see, but in my case it hardly changes at all when files get opened - the system seems to consider the whole UBC as "free", since it could purge it if it needed to (I guess).
If you're throwing all of these videos in a NSCache, you have to be prepared for the cache to throw away items when it feels like they are consuming too much memory. From the NSCache documentation:
The NSCache class incorporates various auto-removal policies, which
ensure that it does not use too much of the system’s memory. The
system automatically carries out these policies if memory is needed by
other applications. When invoked, these policies remove some items
from the cache, minimizing its memory footprint.
Check to see if you're getting nils back from the cache, and if you are, you'll have to reconstruct your objects.
Edit:
It is also worth mentioning that objc.io #7 advises against storing large objects in a NSCache:
The eviction method of NSCache is non-deterministic and not
documented. It’s not a good idea to put in super-large objects like
images that might fill up your cache faster than it can evict itself.