Development “sys” Tree

Since I sometimes miss stuff in partial diffs and I do have a bunch of debug code that people occasionally ask for (notably, interrupt debugging for sparc64), here's a full diff against -current for my development “sys” directory.

This is meant for folks who know what they are doing and want to pick stuff out before I get around to splitting out or checking in a particular change to the OpenBSD tree.

Among other things, the iommu code has been changed somewhat, interrupts are counted individually instead of by level, and some of the schizo code has been updated (doesn't work yet). There is also a ton of debug code in there (extra printf()s and the like). This stuff should not be used in a production environment unless every single changed line is well understood.

The interrupt changes allow some useful information to get to userland:

$ vmstat -i
inr000/softclock                1148623      100
inr000/sab0                          49        0
inr000/softnet                     8446        0
inr001/tick                     1148624      100
inr7d0/pciide0                    82805        7
inr7e1/hme0                       17889        1
inr7eb/sab0                         183        0
Total                           2406619      209
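The difference between counting by level and counting per handler can be sketched roughly as follows. This is a stand-alone illustration, not the code in the diff: struct fake_intrhand, fake_intrcnt, and fake_dispatch() are hypothetical names. The DDB dump further down shows hme_intr and sab_intr both running at pil 6, so a per-level counter cannot tell them apart.

```c
#include <assert.h>
#include <stdint.h>

#define FAKE_NPIL 16    /* interrupt levels 0..15 */

/* Hypothetical handler record: one counter per handler. */
struct fake_intrhand {
    const char *ih_name;    /* e.g. "inr7e1/hme0" */
    int ih_pil;             /* processor interrupt level */
    uint64_t ih_count;      /* interrupts taken by this handler */
};

/* Old style: one counter per level, shared by every source at that level. */
static uint64_t fake_intrcnt[FAKE_NPIL];

/* Account for one interrupt both ways so the two schemes can be compared. */
static void
fake_dispatch(struct fake_intrhand *ih)
{
    assert(ih->ih_pil >= 0 && ih->ih_pil < FAKE_NPIL);
    fake_intrcnt[ih->ih_pil]++;     /* by level: sources are merged */
    ih->ih_count++;                 /* per handler: sources stay apart */
}
```

With two handlers at the same level, the per-level counter only reports their sum, while ih_count keeps one figure per source (which is what makes the vmstat -i listing above possible).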

Even more interrupt information is available through DDB (with kernel option INTR_DEBUG):

kdb breakpoint at 1274d64
Stopped at      Debugger+0x4:   nop
ddb> call psycho_dump_mapping(0x2013f00)
PCI interrupt mappings:
0xfffc4c20: 0x800007d0<V,TID=0x0,IGN=0x1f,INO=0x10,INR=0x7d0/2000,pcib,slot0,INTA#> is idle
0xfffc4c20: 0x800007d1<V,TID=0x0,IGN=0x1f,INO=0x11,INR=0x7d1/2001,pcib,slot0,INTB#> is idle
0xfffc4c20: 0x800007d2<V,TID=0x0,IGN=0x1f,INO=0x12,INR=0x7d2/2002,pcib,slot0,INTC#> is idle
0xfffc4c20: 0x800007d3<V,TID=0x0,IGN=0x1f,INO=0x13,INR=0x7d3/2003,pcib,slot0,INTD#> is idle
OBIO interrupt mappings:
0xfffc5008: 0x800007e1<V,TID=0x0,IGN=0x1f,INO=0x21,INR=0x7e1/2017,OBIO> is idle
0xfffc5018: 0x800007e3<V,TID=0x0,IGN=0x1f,INO=0x23,INR=0x7e3/2019,OBIO> is idle
0xfffc5020: 0x800007e4<V,TID=0x0,IGN=0x1f,INO=0x24,INR=0x7e4/2020,OBIO> is idle
0xfffc5028: 0x800007e5<V,TID=0x0,IGN=0x1f,INO=0x25,INR=0x7e5/2021,OBIO> is idle
0xfffc5048: 0x800007e9<V,TID=0x0,IGN=0x1f,INO=0x29,INR=0x7e9/2025,OBIO> is idle
0xfffc5050: 0x800007ea<V,TID=0x0,IGN=0x1f,INO=0x2a,INR=0x7ea/2026,OBIO> is idle
0xfffc5058: 0x800007eb<V,TID=0x0,IGN=0x1f,INO=0x2b,INR=0x7eb/2027,OBIO> is pending
0xfffc5070: 0x800007ee<V,TID=0x0,IGN=0x1f,INO=0x2e,INR=0x7ee/2030,OBIO> is idle
0xfffc5078: 0x800007ef<V,TID=0x0,IGN=0x1f,INO=0x2f,INR=0x7ef/2031,OBIO> is idle
0xfffc5080: 0x800007f0<V,TID=0x0,IGN=0x1f,INO=0x30,INR=0x7f0/2032,OBIO> is idle
0
ddb> call intr_dump
intrlev:
0x0000: ih 0x2008000 arg       0x0 pil   1 idle next 0x208cd00 pend       0x0 map          0x0 clr          0x0 fun softintr
        ih 0x208cd00 arg 0x2011c10 pil   1 busy next 0x200b180 pend       0x0 map          0x0 clr          0x0 fun generic_softclock-softintr_handler
        ih 0x200b180 arg 0x2011df0 pil   6 idle next 0x200b300 pend       0x0 map          0x0 clr          0x0 fun compoll-softintr_handler
        ih 0x200b300 arg 0x2011e40 pil   6 idle next 0x200b580 pend       0x0 map          0x0 clr          0x0 fun comkbd_soft-softintr_handler
        ih 0x200b580 arg 0x2011ea0 pil   6 idle next 0x200bf80 pend       0x0 map          0x0 clr          0x0 fun sab_softintr-softintr_handler
        ih 0x200bf80 arg       0x0 pil   1 idle next       0x0 pend       0x0 map          0x0 clr          0x0 fun softnet
0x0001: ih 0x208cc80 arg       0x0 pil  10 idle next       0x0 pend       0x0 map          0x0 clr          0x0 fun tickintr
0x07d0: ih 0x208ce00 arg 0x20ad000 pil   5 idle next       0x0 pend       0x0 map   0xfffc4c20 clr   0xfffc5480 fun pciide_pci_intr
0x07e1: ih 0x208d100 arg 0x2020000 pil   6 idle next       0x0 pend       0x0 map   0xfffc5008 clr   0xfffc5808 fun hme_intr
0x07e3: ih 0x200b000 arg 0x1fc3000 pil  13 idle next       0x0 pend       0x0 map   0xfffc5018 clr   0xfffc5818 fun ce4231_cintr
0x07e4: ih 0x200af80 arg 0x1fc3000 pil  13 idle next       0x0 pend       0x0 map   0xfffc5020 clr   0xfffc5820 fun ce4231_pintr
0x07e5: ih 0x200bb00 arg 0x2013f00 pil  15 idle next       0x0 pend       0x0 map   0xfffc5028 clr   0xfffc5828 fun psycho_powerfail
0x07e9: ih 0x200b280 arg 0x1fc3200 pil   6 idle next       0x0 pend       0x0 map   0xfffc5048 clr   0xfffc5848 fun comkbd_intr
0x07ea: ih 0x200b200 arg 0x201d000 pil   6 idle next       0x0 pend       0x0 map   0xfffc5050 clr   0xfffc5850 fun comintr
0x07eb: ih 0x200b600 arg 0x2013a00 pil   6 idle next       0x0 pend       0x0 map   0xfffc5058 clr   0xfffc5858 fun sab_intr
0x07ee: ih 0x200bc00 arg 0x2013f00 pil  15 idle next       0x0 pend       0x0 map   0xfffc5070 clr   0xfffc5870 fun psycho_ue
0x07ef: ih 0x200bb80 arg 0x2013f00 pil   1 idle next       0x0 pend       0x0 map   0xfffc5078 clr   0xfffc5878 fun psycho_ce
0x07f0: ih 0x200ba80 arg 0x2013e00 pil  15 idle next       0x0 pend       0x0 map   0xfffc5080 clr   0xfffc5880 fun psycho_bus_error
counts:
0x0000: ih 0x2008000 count            0 name /soft/softint
        ih 0x208cd00 count      1215664 name /soft/softclock
        ih 0x200b180 count            0 name /soft/com0
        ih 0x200b300 count            0 name /soft/comkbd0
        ih 0x200b580 count           55 name /soft/sab0
        ih 0x200bf80 count         8579 name /soft/softnet
0x0001: ih 0x208cc80 count      1215671 name /mainbus/tick
0x07d0: ih 0x208ce00 count        83130 name /mainbus/psycho0-pbm_mem(2-80)/pciide0
0x07e1: ih 0x208d100 count        18144 name /mainbus/psycho0-pbm_mem(2-80)/hme0
0x07e3: ih 0x200b000 count            0 name /mainbus/psycho0-pbm_mem(2-80)/ebus0_mem/audioce0-cap
0x07e4: ih 0x200af80 count            0 name /mainbus/psycho0-pbm_mem(2-80)/ebus0_mem/audioce0-play
0x07e5: ih 0x200bb00 count            0 name /mainbus/psycho0-powerfail
0x07e9: ih 0x200b280 count            0 name /mainbus/psycho0-pbm_mem(2-80)/ebus0_mem/comkbd0
0x07ea: ih 0x200b200 count            0 name /mainbus/psycho0-pbm_mem(2-80)/ebus0_mem/com0
0x07eb: ih 0x200b600 count          208 name /mainbus/psycho0-pbm_mem(2-80)/ebus0_mem/sab0
0x07ee: ih 0x200bc00 count            0 name /mainbus/psycho0-ue
0x07ef: ih 0x200bb80 count            0 name /mainbus/psycho0-ce
0x07f0: ih 0x200ba80 count            0 name /mainbus/psycho0-pci_error
intrpending:
level 1
   0x0000: ih 0x208cd00 arg 0x2011c10 pil   1 busy next 0x200b180 pend       0x0 map          0x0 clr          0x0 fun generic_softclock-softintr_handler
intrcnt:
  1:    1224029
  5:      83130
  6:      18336
 10:    1215671
0
ddb> c
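
The V, IGN, INO, and INR fields printed by psycho_dump_mapping() above can be recovered from the raw register value with a few shifts and masks. The sketch below infers the layout from the dump itself (for example, 0x800007d0 gives IGN=0x1f, INO=0x10, INR=0x7d0); the FAKE_ names and the struct are hypothetical, and the TID field is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Layout inferred from the dump output; treat it as an assumption. */
#define FAKE_INTMAP_V   0x80000000u /* valid bit */
#define FAKE_INTMAP_INR 0x000007ffu /* IGN and INO combined */

struct fake_intmap {
    int valid;
    unsigned inr;   /* full interrupt number */
    unsigned ign;   /* upper 5 bits of the INR */
    unsigned ino;   /* lower 6 bits of the INR */
};

/* Split a mapping register value into its fields. */
static struct fake_intmap
fake_intmap_decode(uint32_t reg)
{
    struct fake_intmap m;

    m.valid = (reg & FAKE_INTMAP_V) != 0;
    m.inr = reg & FAKE_INTMAP_INR;
    m.ign = m.inr >> 6;
    m.ino = m.inr & 0x3f;
    return m;
}

/* Build an INR from its two parts (the inverse of the split above). */
static unsigned
fake_inr(unsigned ign, unsigned ino)
{
    assert(ign < 32 && ino < 64);
    return (ign << 6) | ino;
}
```

The INR is also the index used in the intr_dump listing, which is why the hme0 entry (INO 0x21 in group 0x1f) shows up there as 0x07e1.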

sys_20030818.diff.gz

bus_dma

coherency | IOMMU | cache | PCI DMA

I/O on these systems can get a little complicated. There is enough scope for getting things wrong that this subject has been split into several sub-sections.

Coherency

The various bus controllers supported by the bus_dma code all interact with the memory sub-system using the same protocols as a CPU. While a streaming cache transaction does require manual synchronization, even an x86 system can have races that require manual synchronization (e.g., to make sure that a PCI write has completed, one should issue a PCI read to the same device). In short, these I/O systems are all cache coherent. No justification has been found for the various ASI_PHYS_CACHED accesses and gratuitous cache flushing through either extensive testing or in the available Sun documentation (ominous unused DCACHE_BUG flags and “Bypass non-coherent D$” comments in locore.s or no). Should anyone have information to the contrary, please speak up. Until then, the working hypothesis is that someone received a severe blow to the head shortly before writing said code.

IOMMU

The 3.2 IOMMU code had more than a few “quirks”. One of the consequences was that faster systems with a lot of network activity could panic with DMA errors due to incorrect IOMMU mappings. After much effort, it was concluded that the existing—extremely hairy—code could not be fixed within the required timeframe. The resulting rewrite has already been checked in to the source tree (see the checkin note below). While this code is believed to be correct, it was written to meet a particular deadline and may not be as efficient or flexible as it could be.

The main problem with the 3.2 code was that it computed the size, loaded, and unloaded each type of mapping using different algorithms for just about every combination of mapping and operation. Some of these were wrong and sometimes they simply did not agree with each other.

The new code simplifies this greatly by remembering the exact mapping at load time using a page table with a splay tree index (for quick PA->DVMA translation). The “unload” and some “sync” operations can be made much more efficient with this information available.
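
To make the PA->DVMA lookup concrete, here is a minimal sketch. It is not the checked-in code: it swaps the splay tree for a sorted array with a binary search to stay short, and every name in it (struct fake_page_entry, struct fake_page_map, fake_pa_to_dva()) is hypothetical. The idea is only that each page recorded at load time can later be found by its physical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE 8192u                /* sparc64 base page size */
#define FAKE_PAGE_MASK (FAKE_PAGE_SIZE - 1)
#define FAKE_MAX_PAGES 8

/* One page recorded at load time. */
struct fake_page_entry {
    uint64_t pe_pa;     /* page-aligned physical address */
    uint32_t pe_dva;    /* page-aligned DVMA address */
};

/* Entries are kept sorted by pe_pa; this stands in for the splay index. */
struct fake_page_map {
    size_t pm_count;
    struct fake_page_entry pm_entries[FAKE_MAX_PAGES];
};

/* Translate pa into a DVMA address. Returns 0 on success, -1 if unmapped. */
static int
fake_pa_to_dva(const struct fake_page_map *pm, uint64_t pa, uint32_t *dva)
{
    uint64_t page = pa & ~(uint64_t)FAKE_PAGE_MASK;
    size_t lo = 0, hi = pm->pm_count;

    assert(pm->pm_count <= FAKE_MAX_PAGES);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (pm->pm_entries[mid].pe_pa == page) {
            /* Keep the offset within the page. */
            *dva = pm->pm_entries[mid].pe_dva |
                (uint32_t)(pa & FAKE_PAGE_MASK);
            return 0;
        }
        if (pm->pm_entries[mid].pe_pa < page)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

An unload or sync can walk the same table instead of recomputing the mapping, which is where the efficiency gain mentioned above comes from.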

There was some concern that the IOMMU code can be accessed from many different interrupt levels. This means the flush synchronization code must either exclude other interrupts (not a nice thing to do when busy-waiting for a flush) or be made re-entrant. The latter was implemented by providing each mapping with a private flush area.
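
Why a private flush area helps can be shown with a toy model. This is only a sketch under stated assumptions: the "hardware" is a plain function call, the interrupt is simulated by running the inner flush in the middle of the outer one, and struct fake_flush and the fake_ functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Completion flag that the hardware sets when a flush has finished. */
struct fake_flush {
    volatile uint64_t ff_done;
};

static void
fake_flush_begin(struct fake_flush *ff)
{
    ff->ff_done = 0;
}

/* Stand-in for the hardware signalling completion. */
static void
fake_hw_complete(struct fake_flush *ff)
{
    ff->ff_done = 1;
}

/*
 * Start a flush on "outer", let a simulated higher-level interrupt run a
 * complete flush on "inner", then look at outer's flag before outer's own
 * completion has arrived. Returns 1 if outer wrongly appears finished.
 */
static int
fake_nested_flush_premature(struct fake_flush *outer, struct fake_flush *inner)
{
    int premature;

    assert(outer != NULL && inner != NULL);
    fake_flush_begin(outer);

    /* Simulated interrupt: a second flush runs to completion. */
    fake_flush_begin(inner);
    fake_hw_complete(inner);

    premature = (outer->ff_done != 0);
    fake_hw_complete(outer);
    return premature;
}
```

Passing the same flush area for both arguments models a single shared area and reports a premature completion; passing two separate areas does not.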

A number of things for a post-3.3 update to the IOMMU code have already been implemented:

  • Most of the bus_dma handlers have been consolidated into iommu.c (which removes much redundant glue code).
  • There are now VA-indexed mappings. The IOMMU API no longer requires the caller to first convert everything into a PA-based representation.
  • An IOMMU-native implementation of bus_dmamap_load_mbuf() reduces the work required to map mbufs.
  • The STC flush buffer is now accessed using normal memory accesses (using ASI_PRIMARY instead of ASI_PHYS_CACHED).
  • The bus_dma_addr_t and bus_dma_size_t types have been added for consistency. Since these only require 32 bits, the size of each bus_dma_segment_t is reduced from 40 bytes to 24 bytes. If it were not for the _ds_mlist hack, each segment would only need 8 bytes.
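
The 40/24/8 byte figures in the last item can be checked with sizeof. The layouts below are an assumption (an address, a length, two private constraint fields, and the _ds_mlist pointer, the last three belonging to the _ds_mlist hack); the struct names are hypothetical and a uint64_t stands in for the 64-bit pointer so the arithmetic does not depend on the build host.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed old layout: five 64-bit fields. */
struct fake_seg_old {
    uint64_t ds_addr;
    uint64_t ds_len;
    uint64_t _ds_boundary;
    uint64_t _ds_align;
    uint64_t _ds_mlist;     /* stands in for a 64-bit pointer */
};

/* Assumed new layout: 32-bit address and size types. */
struct fake_seg_new {
    uint32_t ds_addr;
    uint32_t ds_len;
    uint32_t _ds_boundary;
    uint32_t _ds_align;
    uint64_t _ds_mlist;     /* still a 64-bit pointer */
};

/* What would be left without the _ds_mlist bookkeeping. */
struct fake_seg_min {
    uint32_t ds_addr;
    uint32_t ds_len;
};

/* Verify the sizes quoted in the text against these assumed layouts. */
static void
fake_seg_size_check(void)
{
    assert(sizeof(struct fake_seg_old) == 40);
    assert(sizeof(struct fake_seg_new) == 24);
    assert(sizeof(struct fake_seg_min) == 8);
}
```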

Things that still should be done:

  • Change the iommu_dvmamap_append_range() interface to accept ranges that cross page boundaries (all the callers now use nearly identical cut/paste code to do this).
  • Automatically resize the page table to accommodate larger mappings.
  • A custom implementation of bus_dmamap_load_uio() would probably be beneficial.
  • Achieve the original one-loop-implementation-per-input-type goal.
  • The whole _ds_mlist implementation requires thought. The bus_dma API is severely brain-damaged when it comes to the life-cycle of bus_dmamem_alloc()-based mappings.
  • The way the handler functions find the correct STC and IOMMU is a bit convoluted.
  • The presence of an STC is still indicated by testing a pointer (which has no other purpose) instead of testing a flag.
  • Some more complexity in the address mapping table could greatly reduce the memory footprint of the table. Unfortunately the tree macros do not currently support “upper_bound”/“lower_bound” style operations.
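
The first item in the list above (ranges that cross page boundaries) amounts to moving one loop out of the callers. A rough sketch, assuming a per-page append callback: fake_append_range(), fake_append_fn, and the tally helper are hypothetical names and say nothing about the real iommu_dvmamap_append_range() signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_PAGE_SIZE 8192u
#define FAKE_PAGE_MASK ((uint64_t)FAKE_PAGE_SIZE - 1)

/* Stand-in for a per-page append; returns 0 on success. */
typedef int (*fake_append_fn)(void *cookie, uint64_t addr, uint64_t len);

/* Split [addr, addr + len) at page boundaries and append each piece. */
static int
fake_append_range(uint64_t addr, uint64_t len, fake_append_fn append,
    void *cookie)
{
    assert(append != NULL);
    while (len > 0) {
        uint64_t room = FAKE_PAGE_SIZE - (addr & FAKE_PAGE_MASK);
        uint64_t chunk = len < room ? len : room;
        int error = append(cookie, addr, chunk);

        if (error != 0)
            return error;
        addr += chunk;
        len -= chunk;
    }
    return 0;
}

/* Example callback: count the pieces and the bytes they cover. */
struct fake_tally {
    unsigned pieces;
    uint64_t bytes;
};

static int
fake_tally_append(void *cookie, uint64_t addr, uint64_t len)
{
    struct fake_tally *t = cookie;

    /* Every piece handed to the callback stays within one page. */
    assert((addr & ~FAKE_PAGE_MASK) == ((addr + len - 1) & ~FAKE_PAGE_MASK));
    t->pieces++;
    t->bytes += len;
    return 0;
}
```

For instance, a range starting at 0x1ff0 with length 0x2020 is handed to the callback as three pieces of 0x10, 0x2000, and 0x10 bytes.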

iommu_20030320.diff.gz

Module name:    src
Changes by:     henric (at) cvs.openbsd.org  2003/03/06 01:26:09

Modified files:
        sys/arch/sparc64/dev: ebus.c iommu.c iommureg.h iommuvar.h
                              psycho.c sbus.c schizo.c
        sys/arch/sparc64/include: bus.h
        sys/arch/sparc64/sparc64: machdep.c

Log message:
The existing IOMMU code had a rounding problem that was most noticeable
on faster systems under heavy network load.  This replaces some of the
unreadable iommu functions with something a little less dense and a lot
less crash prone.

The bus_dma function pointer/cookie handling was broken.  Change them
to work like the stacked bus_space drivers (where "work" is the key word).

Cache

The bus_dma code likes flushing the data cache (which happens to be a virtually addressed cache). The handler for mainbus flushes the entire data cache by calling blast_vcache() on most map unload operations (ones that do not involve a _ds_mlist mapping). One implication of this is that the entire data cache is flushed for every network packet that enters or leaves the box on a sane NIC. One may suppose that flushing the entire data cache twice for every packet routed or bridged through the box will have a negative impact on performance.

This patch removes the cache flushes when bus_dma mappings are synced or unloaded. The system can now handle significantly more network traffic before the console becomes unresponsive.

cache_20030523.diff.gz

This was checked in to -current on June 6th.

PCI Coherency

Some of the bus controllers have support for synchronizing DMA from devices behind PCI-PCI bridges. These mechanisms are not currently supported, which may lead to races with certain configurations that include standard PCI-PCI bridges. The system will ensure that no DMA is pending before propagating an interrupt to the CPU, but it is still possible that there is outstanding DMA activity by the time the device driver executes. A PCI read from the device in question would normally ensure that the device DMA is complete before the read returns, but because the Host-PCI bridge uses separate paths for PIO and DMA transactions, a further synchronization may be needed. See sections 11.2.1.1 and 19.3.0.5 of the UltraSPARC IIi User's Manual. For psycho-based systems, the U2P User's Manual discusses this in section 8.2.2.4.

It is likely that a system with a streaming cache does sufficient synchronization for streaming transfers, but even on those systems, that may not always be done for consistent (non-STC) DVMA operations (i.e., if BUS_DMA_STREAMING is not requested when loading the DMA map). Page 3-9 of the U2P manual states that the PBM automatically synchronizes consistent DVMA. Other Sun Host-PCI bridges may require manual synchronization through some mechanism similar to the IIi's write synchronization register.

Consider the following sequence of events for a network driver of some kind:

  1. A packet arrives.
  2. An interrupt is delivered to the CPU (synchronizing the bus).
  3. A second packet arrives.
  4. The interrupt handler is called to service both packets.

The interrupt-triggered synchronization may be useful (there's no point in invoking the handler before the data has been written to memory), but it is not sufficient to ensure the integrity of the second packet.
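
The four steps can be played through with a toy model. This is a worst-case sketch, not a description of the hardware: it assumes DMA writes stay buffered until something synchronizes the bus, and struct fake_bus and the fake_ functions are hypothetical names.

```c
#include <assert.h>

/* Toy bus: packets the device has sent vs. packets visible in memory. */
struct fake_bus {
    unsigned issued;    /* DMA writes started by the device */
    unsigned visible;   /* DMA writes that have reached memory */
};

static void
fake_packet_arrives(struct fake_bus *b)
{
    b->issued++;
}

/* Explicit synchronization: everything issued so far becomes visible. */
static void
fake_dma_sync(struct fake_bus *b)
{
    b->visible = b->issued;
}

/* Interrupt delivery synchronizes whatever was pending at that moment. */
static void
fake_deliver_interrupt(struct fake_bus *b)
{
    fake_dma_sync(b);
}

/*
 * The handler services every packet the device reports. Returns how many
 * of those are not yet visible in memory when it looks at them.
 */
static unsigned
fake_handler_stale(struct fake_bus *b, int sync_first)
{
    if (sync_first)
        fake_dma_sync(b);
    assert(b->visible <= b->issued);
    return b->issued - b->visible;
}
```

Running packet, interrupt, packet, handler through this model leaves one stale packet unless the handler synchronizes first.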

Keep in mind that the potential race is between PIO and DMA bus transactions. This means that to have a race, the particular driver needs to rely on the ordering of these PIO and DMA operations. Many NIC drivers indicate they are done with a frame by writing to memory. In this case, there is no PIO involved during the normal operation of the device. There may yet be interesting issues if, as is common, DMA for the packet data uses the STC and descriptor ring access does not (this is only relevant for platforms with STCs, e.g., SBUS or psycho-based PCI systems).

Note the careful use of the phrase “normal operation” above. There may yet be potential races between code that uses PIO to reset, pause, or otherwise manipulate the device and DMA already initiated by the device.

This patch adds support for the IIi/e's DMA synchronization register:

sabredma_20030602.diff.gz

This was checked in to -current on June 4th.

Header Cleanup

Reasonable documentation on gcc's inline asm clobbers could be useful. Experimentation seems to suggest that when an asm statement has input or output operands but no clobbers, gcc assumes that everything is clobbered. This suggests that accurate clobbers can decrease code size and improve performance.
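
For reference, this is what explicit operand and clobber lists look like in gcc's extended asm syntax. The sketch uses empty asm templates so that it builds on any gcc target rather than only on sparc64; the fake_ function names are hypothetical and none of this is taken from the cleanup diff.

```c
#include <assert.h>
#include <stddef.h>

/*
 * One read/write register operand and an empty clobber section: the
 * statement declares that it touches v and lists nothing else.
 */
static inline unsigned long
fake_launder(unsigned long v)
{
    __asm__ ("" : "+r" (v));
    return v;
}

/*
 * No operands, but a "memory" clobber: the compiler may not cache memory
 * values in registers across this statement or move loads and stores over it.
 */
static inline void
fake_barrier(void)
{
    __asm__ __volatile__ ("" : : : "memory");
}

/* Store the data first, then the flag, with the barrier in between. */
static void
fake_publish(unsigned long *data, unsigned long *flag, unsigned long v)
{
    assert(data != NULL && flag != NULL);
    *data = fake_launder(v);
    fake_barrier();
    *flag = 1;
}
```

Note that the "memory" clobber only constrains the compiler; it does not emit a hardware memory barrier.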

Some miscellaneous things have been changed:

  • Add clobbers to all inline asm code.
  • Use rd/wr/rdpr/wrpr functions instead of direct inline asm.
  • Remove some left-over sparc v8 defines.
  • Add a kernel config option that switches the kernel from TSO to RMO mode (option CPU_MEMORY_MODEL_RMO).
  • Change the kernel register window handler to assume a 64-bit stack (by changing WSTATE_KERN).

cleanup_20030319.diff.gz

Some pieces have been committed to -current (16-May through 22-May).

Multi-word “as” Warnings

The sparc64 set is a synthetic instruction that can generate one or more real instructions. This can cause surprising results when used in a delay slot. It may not be obvious from the code that only part of the full set will be executed before the branch is taken. Some hacking to binutils resulted in some useful warning messages when a set is in a delay slot and generates more than one real instruction.

Note that this may miss some cases and will certainly warn about code that is not wrong. However, the former can probably be fixed easily by someone familiar with the binutils source and a strong argument can be made that the latter is bad style (one can always do the sethi/or by hand to make the exact behavior obvious from inspecting the source). One might also argue that any set instruction in a delay slot is bad form. The number of generated instructions may vary depending on a constant defined elsewhere, leading to severe hair-pulling on the part of the victim of the “impossible” system behavior.
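
How the instruction count depends on the constant can be written down in a few lines. The sketch follows the expansion rules for set in the SPARC V9 manual's table of synthetic instructions for a 32-bit value; what a given version of as actually emits may differ in corner cases, and fake_set_insn_count() is a hypothetical helper, not part of the binutils patch.

```c
#include <assert.h>
#include <stdint.h>

/* Number of real instructions "set value, rd" expands to. */
static int
fake_set_insn_count(int64_t value)
{
    /* Only values that fit in 32 bits are considered here. */
    assert(value >= INT32_MIN && value <= (int64_t)UINT32_MAX);

    if (value >= -4096 && value <= 4095)
        return 1;   /* or %g0, value, rd */
    if ((value & 0x3ff) == 0)
        return 1;   /* sethi %hi(value), rd */
    return 2;       /* sethi %hi(value), rd; or rd, %lo(value), rd */
}
```

So a set of 0x1000 in a delay slot is harmless, while changing the constant to 0x1001 elsewhere silently turns it into two instructions, only the first of which sits in the delay slot.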

set_20030315.diff.gz

Links