Recently I have read the blog from mikeash which tells the detail implementation of dispatch_once. I also get the source code of it in macosforge
I understand most of the code except this line:
dispatch_atomic_maximally_synchronizing_barrier();
It is a macro and defined :
#define dispatch_atomic_maximally_synchronizing_barrier() \
do { unsigned long _clbr; __asm__ __volatile__( \
"cpuid" \
: "=a" (_clbr) : "0" (0) : "rbx", "rcx", "rdx", "cc", "memory" \
); } while(0)
I know it is used to make sure it "Defeat the speculative read-ahead of peer CPUs", but I don't know that cpuid and the words followed. I know little about assemble language.
Could anyone elaborate it for me ? Thanks a lot.
libdispatch source code pretty much explains it.
http://opensource.apple.com/source/libdispatch/libdispatch-442.1.4/src/shims/atomic.h
// see comment in dispatch_once.c
#define dispatch_atomic_maximally_synchronizing_barrier() \
http://opensource.apple.com/source/libdispatch/libdispatch-442.1.4/src/once.c
// The next barrier must be long and strong.
//
// The scenario: SMP systems with weakly ordered memory models
// and aggressive out-of-order instruction execution.
//
// The problem:
//
// The dispatch_once*() wrapper macro causes the callee's
// instruction stream to look like this (pseudo-RISC):
//
// load r5, pred-addr
// cmpi r5, -1
// beq 1f
// call dispatch_once*()
// 1f:
// load r6, data-addr
//
// May be re-ordered like so:
//
// load r6, data-addr
// load r5, pred-addr
// cmpi r5, -1
// beq 1f
// call dispatch_once*()
// 1f:
//
// Normally, a barrier on the read side is used to workaround
// the weakly ordered memory model. But barriers are expensive
// and we only need to synchronize once! After func(ctxt)
// completes, the predicate will be marked as "done" and the
// branch predictor will correctly skip the call to
// dispatch_once*().
//
// A far faster alternative solution: Defeat the speculative
// read-ahead of peer CPUs.
//
// Modern architectures will throw away speculative results
// once a branch mis-prediction occurs. Therefore, if we can
// ensure that the predicate is not marked as being complete
// until long after the last store by func(ctxt), then we have
// defeated the read-ahead of peer CPUs.
//
// In other words, the last "store" by func(ctxt) must complete
// and then N cycles must elapse before ~0l is stored to *val.
// The value of N is whatever is sufficient to defeat the
// read-ahead mechanism of peer CPUs.
//
// On some CPUs, the most fully synchronizing instruction might
// need to be issued.
dispatch_atomic_maximally_synchronizing_barrier();
For x86_64 and i386 architecture, it uses cpuid instruction to flush the instruction pipeline as #Michael mentioned. cpuid is serializing instruction to prevent memory reordering. And __sync_synchronize for the other architecture.
https://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Atomic-Builtins.html
__sync_synchronize (...)
This builtin issues a full memory barrier.
these builtins are considered a full barrier. That is, no memory operand will be moved across the operation, either forward or backward. Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.
Related
void deal_msg(unsigned char * buf, int len)
{
unsigned char msg[1024];
strcpy(msg,buf);
//memcpy(msg, buf, len);
puts(msg);
}
void main()
{
// network operation
sock = create_server(port);
len = receive_data(sock, buf);
deal_msg(buf, len);
}
As the pseudocode shows above, the compile environment is vc6 and running environment is windows xp sp3 en. No other protection mechanisms are applied, that is stack can be executed, no ASLR.
The send data is 'A' * 1024 + addr_of_jmp_esp + shellcode.
My question is:
if strcpy is used, the shellcode is generated by msfvenom, msfvenom -p windows/exec cmd=calc.exe -a x86 -b "\x00" -f python,
msfvenom attempts to encode payload with 1 iterations of x86/shikata_ga_nai
after data is sent, no calc pops up, the exploit won't work.
But if memcpy is used, shellcode generated by msfvenom -p windows/exec cmd=calc.exe -a x86 -f python without encoding works.
How to avoid the original program's crash after calc pops up, how to keep stack balance to avoid crash?
Hard to say. I'd use a custom payload (just copy the windows/exec cmd=calc.exe) and put a 0xcc at the start and debug it (or something that will be easily recognizable under the debugger like a ud2 or \0xeb\0xfe). If your payload is executed, you'll see it. Bypass the added instruction (just NOP it) and try to see what can possibly go wrong with the remainder of the payload.
You'll need a custom payload ; Since you're on XP SP3 you don't need to do crazy things.
Don't try to do the overflow and smash the whole stack (given your overflow it seems to be perfect, just enough overflow to control rIP).
See how the target function (deal_msg in your example) behave under normal conditions. Note the stack address when the ret is executed (and if register need to have certain values, this depend on the caller).
Try to replicate that in your shellcode: you'll most probably to adjust the stack pointer a bit at the end of your shellcode.
Make sure the caller (main) stack hasn't been affected when executing the payload. This might happen, in this case reserve enough room on the stack (going to lower addresses), so the caller stack is far from the stack space needed by the payload and it doesn't get affected by the payload execution.
Finally return to the ret of the target or directly after the call of the deal_msg function (or anywhere you see fit, e.g. returning directly to ExitProcess(), but this might be more interesting to return close to the previous "normal" execution path).
All in all, returning somewhere after the payload execution is easy, just push <addr> and ret but you'll need to ensure that the stack is in good shape to continue execution and most of the registers are correctly set.
Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid costly transition to non-VEX state on all processors, including future Xeon processor, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023
People have also measured and found out that VZEROUPPER and VZEROALL are expensive on Knights Landing:
36 clock cycles for both instructions in 64-bit mode (30 clock in 32-bit mode).
See the above link.
So my code will be the following, if I have just used ymm0 and ymm1:
if [we are running on a Xeon Phi]
vpxor ymm0,ymm0,ymm0
vpxor ymm1,ymm1,ymm1
else
vzeroall
endif
How can I detect Xeon Phi (Knights Landing and later Xeon Phi processors) to implement the above code?
We now have the following situation now about the VZEROUPPER/VZEROALL:
These instructions are not needed and are very costly on Xeon Phi Knight Landing 36 clock cycles for both instructions in 64-bit mode (30 clock in 32-bit mode).
These instructions are very cheap and are needed on Xeon and Core processors (Skylake/Kaby Lake) and will be needed for Xeon in the foreseeble future, to avoid costly transition to non-VEX state.
The advertising materials claim that Xeon Phi (Knights Landing) is fully compatible with other Xeon processors.
Is there a reliable way to detect Xeon Phi, for the purpose of avoiding VZEROUPPER/VZEROALL?
There is an article "How to detect Knights Landing AVX-512 support (Intel® Xeon Phi™ processor)" by James R., Updated February 22, 2016, but it only focuses specific new instructions that became available on the Knights Landing. So it is still not very clear about the VEX transitions.
It would have been good to know whether Intel plans to implement a CPUID bit to show whether non-VEX state are costly? For example:
Bit is set to 0 - VEX state transitions are costly, but VZEROUPPER/VZEROALL are cheap and should be used to clear the state;
Bit is set to 1 – there is no transition penalty, VZEROUPPER/VZEROALL is not needed.
The above mentioned article about detecting Knights Landing suggests to check the bits AVX-512F+CD+ER+PF as introduced in Knights Landing.
So the code suggests to check all these bits at once, and if all are set, then we are on the Knights Landing:
uint32_t avx2_bmi12_mask = (1 << 16) | // AVX-512F
(1 << 26) | // AVX-512PF
(1 << 27) | // AVX-512ER
(1 << 28); // AVX-512CD
It would have been good to know whether Intel plans to add these all bits to a simple Xeon (non Phi) or Core processors in the near future, so they will also support the AVX-512F+CD+ER+PF features introduced in the Knight Landding?
In case that Xeon and Core processor will support AVX-512F+CD+ER+PF, we won’t be able to distinguish Xeon from Xeon Phi.
Please advise.
If you specifically want to check for being on a KNL (rather than the more general "Does the CPU I am running on have feature X?") you can do that by looking at the "Extended Family", "Family" and "Model" fields in %eax after calling cpuid with %eax==1 and %ecx == 0. C++ code something like that below will do the job.
However, as others are implicitly pointing out, this is a very specific test, and will, for instance, fail on future Knights cores, so you would likely be better doing as has been suggested and checking for AVX-512 features that are not in Xeon, so AVX512-ER and AVX512-PF. (Of course, such instructions could appear in future Xeons, so this is not guaranteed in the long term, but, quoting Keynes: "In the long term we're all dead" :-))
class cpuidState
{
uint32_t orig_eax; /* Values sent in to the cpuid instruction */
uint32_t orig_ecx;
uint32_t eax; /* Values received back from it. */
uint32_t ebx;
uint32_t ecx;
uint32_t edx;
void cpuid()
{
__asm__ __volatile__("cpuid"
: "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
}
void update (uint32_t eaxVal, uint32_t ecxVal)
{
orig_eax = eaxVal;
orig_ecx = ecxVal;
eax = eaxVal;
ecx = ecxVal;
cpuid();
}
void ensureCorrectLeaf(uint32_t eaxVal, uint32_t ecxVal)
{
if (orig_eax != eaxVal || orig_ecx != ecxVal)
update (eaxVal, ecxVal);
}
public:
cpuidState() : orig_eax (-1), orig_ecx(-1) { }
// Include the Extended Model in the test. Without it we see some Xeons as KNL :-(
bool onKNL() { ensureCorrectLeaf(1,0); return (eax & 0x0f0ff0) == 0x50670; }
};
Here's my linker script:
MEMORY {
text (rx) : ORIGIN = 0x000000, LENGTH = 64K
data (rw!x) : ORIGIN = 0x800100, LENGTH = 0xFFA0
}
SECTIONS {
.vectors : AT (0x0000) { entry.o (.vectors); }
.text : AT (ADDR (.vectors) + SIZEOF(.vectors)) { * (.text.startup); * (.text); * (.progmem.data); _etext = .; }
.data : AT (ADDR (.text) + SIZEOF (.text)) { PROVIDE (__data_start = .); * (.data); * (.rodata); * (.rodata.str1.1); PROVIDE (__data_end = .); } > data
.bss : AT (ADDR (.bss)) { PROVIDE (__bss_start = .); * (.bss); PROVIDE (__bss_end = .); } > data
__data_load_start = LOADADDR(.data);
__data_load_end = __data_load_start + SIZEOF(.data);
}
And this is my initialization code. init is called at reset.
.section .text,"ax",#progbits
/* Handle low level hardware initialization. */
.global init
init: eor r1, r1
out 0x3f, r1
ldi r28, 0xFF
ldi r29, 0x02
out 0x3e, r29
out 0x3d, r28
rjmp __do_copy_data
rjmp __do_clear_bss
jmp main
/* Handle copying data into RAM. */
.global __do_copy_data
__do_copy_data: ldi r17, hi8(__data_end)
ldi r26, lo8(__data_start)
ldi r27, hi8(__data_start)
ldi r30, lo8(__data_load_start)
ldi r31, hi8(__data_load_start)
rjmp .L__do_copy_data_start
.L__do_copy_data_loop: lpm r0, Z+
st X+, r0
.L__do_copy_data_start: cpi r26, lo8(__data_end)
cpc r27, r17
brne .L__do_copy_data_loop
rjmp main
/* Handle clearing the BSS. */
.global __do_clear_bss
__do_clear_bss: ldi r17, hi8(__bss_end)
ldi r26, lo8(__bss_start)
ldi r27, hi8(__bss_start)
rjmp .L__do_clear_bss_start
.L__do_clear_bss_loop: st X+, r1
.L__do_clear_bss_start: cpi r26, lo8(__bss_end)
cpc r27, r17
brne .L__do_clear_bss_loop
The problem is that the initialization code hangs sometime during the copying process. Here's an edited dump of my symbol table, if it's helpful to anyone.
00000000 a __tmp_reg__
...
00000000 t reset
...
00000001 a __zero_reg__
...
0000003d a __SP_L__
...
00000074 T main
0000009a T init
000000ae T __do_copy_data
000000c6 T __do_clear_bss
...
00000446 A __data_load_start
00000446 T _etext
0000045b A __data_load_end
00800100 D __data_start
00800100 D myint
00800115 B __bss_start
00800115 D __data_end
00800115 b foobar.1671
00800135 B ticks
00800139 B __bss_end
C is designed to work on von Neumann architectures. AVR is Harvard based. This means that C expects strings to be in RAM. As a consequence, if you ever take a look at the disassembly for any elf binary that would eventually be copied as a hex to your AVR chip, you will see two sections: __do_copy_data and __do_clear_bss [when required]. These routines, which are added in the linking stages, take care of the basic needs of the C language. As a consequence, what you are seeing here with your pointers is likely that they are pointing to the wrong addresses. In other words, they are either pointing to an address in program space but you are reading from data space [different address bus]. Or you are purposely reading from data space but have not copied the strings over.
See:
avr/pgmspace.h
FAQ: ROM Array and scroll down for strings
and of course, the AVR instruction set as provided by Atmel, especifically, the instruction to copy program memory over to data memory
Edited to reflect new question and comment: Your assembly for both sections look ok to me. I will have to take a look at your linker scripts with closer scrutinity to check if there any funny businesses going on there. Since you are writing a bootloader, do you mind if I ask if you have taken a look at bootloader support on AVR-libc?
http://deans-avr-tutorials.googlecode.com/svn/trunk/Progmem/Output/Progmem.pdf
This document may help to clarify the use of flash/ram memories of AVR.
I actually got it to work. All I had to do was enable reading from and writing to external RAM in the SREG. It was mindblowingly obvious and simple, I know, but it was buried in the data sheet and it was poorly documented.
This resource was helpful, but didn't have any documentation for my chip. If you're having this problem, look in the data sheet for your AVR and see how it is related. The process is not the same for all variations of AVRs.
How to use external RAM.
i've got a question regarding the exploit_notesearch program.
This program is only used to create a command string we finally call with the system() function to exploit the notesearch program that contains a buffer overflow vulnerability.
The commandstr looks like this:
./notesearch Nop-block|shellcode|repeated ret(will jump in nop block).
Now the actual question:
The ret-adress is calculated in the exploit_notesearch program by the line:
ret = (unsigned int) &i-offset;
So why can we use the address of the i-variable that is quite at the bottom of the main-stackframe of the exploit_notesearch program to calculate the ret address that will be saved in an overflowing buffer in the notesearch program itself ,so in an completely different stackframe, and has to contain an address in the nop block(which is in the same buffer).
that will be saved in an overflowing buffer in the notesearch program itself ,so in an completely different stackframe
As long as the system uses virtual memory, another process will be created by system() for the vulnerable program, and assuming that there is no stack randomization,
both processes will have almost identical values of esp (as well as offset) when their main() functions will start, given that the exploit was compiled on the attacked machine (i.e. with vulnerable notesearch).
The address of variable i was chosen just to give an idea about where the frame base is. We could use this instead:
unsigned long sp(void) // This is just a little function
{ __asm__("movl %esp, %eax");} // used to return the stack pointer
int main(){
esp = sp();
ret = esp - offset;
//the rest part of main()
}
Because the variable i will be located on relatively constant distance from esp, we can use &i instead of esp, it doesn't matter much.
It would be much more difficult to get an approximate value for ret if the system did not use virtual memory.
the stack is allocated in a way as first in last out approach. The location of i variable is somewhere on the top and lets assume that it is 0x200, and the return address is located in a lower address 0x180 so in order to determine the where about to put the return address and yet to leave some space for the shellcode, the attacker must get the difference, which is: 0x200 - 0x180 = 0x80 (128), so he will break that down as follows, ++, the return address is 4 bytes so, we have only 48 bytes we left before reaching the segmentation. that is how it is calculated and the location i give approximate reference point.
When a value is copied from one register to another, what happens to the value
in the source register? What happens to the value in the destination register.
I'll show how it works in simple processors, like DLX or RISC, which are used to study CPU-architecture.
When (AT&T syntax, or copy $R1 to $R2)
mov $R1, $R2
or even (for RISC-styled architecture)
add $R1, 0, $R2
instruction works, CPU will read source operands: R1 from register file and zero from... may be immediate operand or zero-generator; pass both inputs into Arithmetic Logic Unit (ALU). ALU will do an operation which will just pass first source operand to destination (because A+0 = A) and after ALU, destination will be written back to register file (but to R2 slot).
So, Data in source register is only readed and not changed in this operation; data in destination register will be overwritten with copy of source register data. (old state of destination register will be lost with generating of heat.)
At physical level, any register in register file is set of SRAM cells, each of them is the two inverters (bi-stable flip-flop, based on M1,M2,M3,M4) and additional gates for writing and reading:
When we want to overwrite value stored in SRAM cell, we will set BL and -BL according to our data (To store bit 0 - set BL and unset -BL; to store bit 1 - set -BL and unset BL); then the write is enabled for current set (line) of cells (WL is on; it will open M5 and M6). After opening of M5 and M6, BL and -BL will change state of bistable flip-flop (like in SR-latch). So, new value is written and old value is discarded (by leaking charge into BL and -BL).