Ok, I still don't fully understand all of it but I found what seems to be the issue. This forum post was the main inspiration: https://forum.pjrc.com/threads/25256-Teensy-3-hard-fault-due-to-SRAM_L-and-SRAM_U-boundary
Basically, it looks like the problem isn't that the variable spans the boundary between SRAM_L and SRAM_U, but that individual writes can span it... or more specifically, span it while their size is not a multiple of the block size, 32 bits (4 bytes).
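For anyone who wants to check whether a particular write would land on that boundary, here is a rough sketch of the address arithmetic. It assumes the SRAM_L/SRAM_U split sits at 0x20000000, which is where the Teensy 3.x memory map puts it as far as I know. The names kSramUpperStart and straddlesSramBoundary are made up for illustration and are not part of any Teensy library.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed start of SRAM_U; SRAM_L ends at the byte just below this address.
constexpr std::uint32_t kSramUpperStart = 0x20000000u;

// Hypothetical helper: true if one access of `size` bytes starting at `addr`
// touches bytes on both sides of the SRAM_L / SRAM_U boundary.
constexpr bool straddlesSramBoundary(std::uint32_t addr, std::size_t size) {
    return size > 0 && addr < kSramUpperStart && (kSramUpperStart - addr) < size;
}

// A 4-byte write starting 2 bytes below the boundary crosses it...
static_assert(straddlesSramBoundary(0x1FFFFFFEu, 4), "should cross");
// ...while a 4-byte write starting 4 bytes below the boundary ends exactly at it.
static_assert(!straddlesSramBoundary(0x1FFFFFFCu, 4), "should not cross");
```

This only tells you whether an access crosses the boundary, not whether it will actually fault; that part I'm still fuzzy on.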
So, I was reading from the Serial buffer (I'm using USB serial, not any of the many numbered hardware serial ports) in arbitrary numbers of bytes... basically, I'd read as much as I could get, write it to the buffer, subtract that from the number of bytes left, then try to read the remainder on the next go-around. Each pass asked for everything still outstanding, but only the last pass actually got the full amount it asked for. That also means the number of bytes I got on any given pass might not be divisible by 4 (the 32-bit block size), so I'd wind up with an oddly sized write that spans that pesky SRAM boundary.
So, I ditched that huge buffer that moves the backgroundBitmap var and instead initially limited Serial.read() to 4 bytes. And it worked... but it was slow. So I started increasing the read size and finally found that any multiple of 4 up to 64 bytes works, but anything beyond 64 bytes fails. I couldn't find any specifics on what the Teensy USB serial buffer size is, though in most other implementations it is 64 bytes, so I imagine that by going above that I run the risk of reading a full packet and then a little of the next one. This is made more confusing by the fact that the Teensy is using USB CDC ACM Serial and the USB buffer is likely built into the chip rather than stored in SRAM... checking the datasheet didn't yield any more specifics that I could find.
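To make the read-size rule concrete, here is a minimal sketch of how the size of each read could be picked. It is not my exact code: nextChunkSize and kMaxChunk are hypothetical names, and the 64-byte cap is just the largest value that worked in my testing, not a documented limit.

```cpp
#include <cstddef>

// Largest read size that worked in testing (assumption, not a documented limit).
constexpr std::size_t kMaxChunk = 64;
static_assert(kMaxChunk % 4 == 0, "cap must itself be a multiple of 4");

// Hypothetical helper: how many bytes to ask for on the next read, given how
// many are waiting (`available`) and how many the frame still needs (`remaining`).
std::size_t nextChunkSize(std::size_t available, std::size_t remaining) {
    std::size_t n = available < remaining ? available : remaining;
    if (n > kMaxChunk) {
        n = kMaxChunk;  // never ask for more than 64 bytes at once
    }
    if (n < remaining) {
        n -= n % 4;     // round down to a multiple of 4, except for the final tail
    }
    return n;           // 0 means: fewer than 4 bytes waiting, try again later
}
```

On the Teensy the idea would be to call this with Serial.available() and the number of bytes still missing, then pass the result to Serial.readBytes() and advance the destination pointer by however many bytes actually came back. If the frame size is a multiple of 4, every read ends up being a multiple of 4 as well.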
Anyways, like I said... I'm still a little fuzzy on exactly why, but it seems to be a logical solution. Also, this should work for all display sizes. And my testing shows that when limiting to 64 bytes per read, it is just as fast as when the reads were unbounded... about 12ms per frame update. Wish I could get that to be faster, but nothing comes to mind.