Unable to load 12288 bytes from serial into BackgroundLayer

Yup… it shifts by however large the buffer I add is. The “magic number” does not seem to shift by the same amount though. Probably other things moving around during compile. It’s unfortunate I cannot just tell backgroundBitmap to be in SRAM_U directly, since adding a random buffer to shift it would be kind of trial and error.

Speaking of… adding exactly 19,007 bytes of extra buffer puts backgroundBitmap at exactly memory address 0x20000000.

Annoyingly I have to actually do something with that buffer otherwise it gets compiled out. Playing with it…

Ok… played around more. If I add 18,945 bytes of buffer, that places the backgroundBitmap var at 0x1FFFFFC4, exactly 60 bytes below the SRAM_U boundary, and at that point it works. WTF? I fully expected it to work when that var started right on the boundary, but here it still crosses the boundary, so I wouldn’t have expected it to work. Unless another var is involved, in which case I have no idea which one.
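For anyone wanting to double-check the address arithmetic, here is a minimal sketch in plain C++ (it runs on a PC, not just the Teensy). The helper names are made up for illustration; the only inputs are the 0x20000000 boundary, the 0x1FFFFFC4 address, and the 12288-byte buffer size mentioned in this thread.

```cpp
#include <cstddef>
#include <cstdint>

// Boundary between SRAM_L and SRAM_U, as discussed in this thread.
constexpr std::uint32_t kSramBoundary = 0x20000000u;

// True if the range [start, start + size) has bytes on both sides of the boundary.
// (Hypothetical helper, not part of any Teensy library.)
constexpr bool straddlesBoundary(std::uint32_t start, std::size_t size) {
    return start < kSramBoundary &&
           static_cast<std::uint64_t>(start) + size > kSramBoundary;
}

// Number of bytes between `start` and the boundary (0 if start is at or above it).
constexpr std::uint32_t bytesBelowBoundary(std::uint32_t start) {
    return start < kSramBoundary ? kSramBoundary - start : 0u;
}
```

With start = 0x1FFFFFC4 and size = 12288, bytesBelowBoundary gives 60 and straddlesBoundary is true, which matches the observation that the buffer still crosses the boundary in the configuration that works.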

I searched the .map file for addresses 0x10000 (memory boundary) + 0x65c0 (your magic number in hex), and this is the only thing close:

.debug_info 0x00016590 0x2c77 C:\Users\admin\AppData\Local\Temp\buildb021674c3765f772d1e0d5e5b9c8c263.tmp/core\core.a(HardwareSerial2.cpp.o)

Are you using HardwareSerial2 for your serial transfer? Or is that just a coincidence?

Eureka!
Ok, I still don’t fully understand all of it but I found what seems to be the issue. This forum post was the main inspiration: Teensy 3, hard fault due to SRAM_L and SRAM_U boundary

Basically, it looks like the problem is not that the variable spans the SRAM boundary but that individual writes can span it… or more specifically, span it while not lining up with the 32-bit (4-byte) block size.

So, I was reading from the Serial buffer (I’m using USB serial, not any of the numbered hardware serial ports) in arbitrary numbers of bytes… basically, I’d read as much as I could get, write it to the buffer, subtract that from the number of bytes left, then try to read the remainder on the next go-around. Each time I asked for everything remaining, but only on the last loop did I actually get the full amount I asked for. That also means the number of bytes I get on any given read may not be divisible by 4 (the 32-bit block size), so I wind up with an oddly sized write that spans that pesky SRAM boundary.
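To make the theory above concrete, here is a rough model in plain C++. It assumes (and this is only an assumption, not something confirmed from the Teensy sources) that the copy into the buffer issues 4-byte stores starting at whatever destination address it is given, with no alignment fix-up. Both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Start of SRAM_U, i.e. the SRAM_L / SRAM_U boundary from this thread.
constexpr std::uint32_t kSramUBase = 0x20000000u;

// True if a 4-byte store beginning at `addr` touches both SRAM_L and SRAM_U.
// That only happens for the three unaligned addresses just below the boundary.
constexpr bool wordStoreCrosses(std::uint32_t addr) {
    return addr < kSramUBase && addr + 4u > kSramUBase;
}

// Models a copy of `len` bytes to `dst` done as consecutive 4-byte stores
// (ASSUMPTION about the copy routine). Returns true if any store crosses.
inline bool copyMayCross(std::uint32_t dst, std::size_t len) {
    for (std::size_t off = 0; off + 4 <= len; off += 4) {
        if (wordStoreCrosses(dst + static_cast<std::uint32_t>(off))) {
            return true;
        }
    }
    return false;
}
```

Under this model, a buffer at 0x1FFFFFC4 is safe as long as every read lands at an offset that is a multiple of 4 (for example offset 56), and breaks as soon as an odd-sized earlier read leaves the next write at an offset like 57.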

So, I ditched that huge buffer that moves the backgroundBitmap var and instead initially limited each serial read to 4 bytes. And it worked… but was slow. So I started increasing the read size and finally found that any multiple of 4 up to 64 bytes works, but anything beyond 64 bytes fails. I couldn’t find any specifics on what the Teensy USB serial buffer size is, though in most other implementations it is 64 bytes, so I imagine that by going above that I run the risk of reading a full packet and then a little of the next one. This is made more confusing by the fact that the Teensy is using USB CDC ACM serial and the USB buffer is likely built into the chip, not stored in SRAM… checking the datasheet didn’t yield any more specifics that I could find.
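For reference, the read loop described above could look roughly like this. It is a sketch, not the exact code from my project: readFrame and kMaxChunk are names invented here, and it is written as a template so it compiles off-device. It only relies on the Arduino-style available() and readBytes(char*, size_t) calls, which the Teensy Serial object provides. It also assumes readBytes returns everything asked for when that many bytes are already available, and it has no timeout, so a real sketch would want one.

```cpp
#include <cstddef>
#include <cstdint>

// Largest single read: the 64-byte limit found by experiment above.
constexpr std::size_t kMaxChunk = 64;

// Reads exactly `total` bytes into `dst`. Each read is at most kMaxChunk bytes
// and a multiple of 4, except possibly the final tail of the frame.
// SerialT is any type with Arduino-style available() and readBytes().
template <typename SerialT>
void readFrame(SerialT& serial, std::uint8_t* dst, std::size_t total) {
    std::size_t done = 0;
    while (done < total) {
        const std::size_t remaining = total - done;
        std::size_t want = static_cast<std::size_t>(serial.available());
        if (want > kMaxChunk) want = kMaxChunk;
        if (want > remaining) want = remaining;
        // Unless this read finishes the frame, round down to a multiple of 4
        // so the next write offset stays 4-byte aligned.
        if (want < remaining) want &= ~static_cast<std::size_t>(3);
        if (want == 0) continue;  // not enough data yet; busy-wait (no timeout!)
        done += serial.readBytes(reinterpret_cast<char*>(dst + done), want);
    }
}
```

On the Teensy this would be called as something like readFrame(Serial, buffer, 12288), where buffer is whatever pointer the background layer hands out for its bitmap.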

Anyways, like I said… I’m still a little fuzzy on exactly why, but it seems to be a logical solution. Also, this should work for all display sizes. And my testing shows that when limiting to 64 bytes per read, it is just as fast as when the reads were unbounded… about 12ms per frame update. Wish I could get that to be faster, but nothing comes to mind.

Great work tracking it down!

I believe you’re right about 64 bytes per USB packet. I remember getting similar performance to yours with my port of Fadecandy to SmartMatrix on a 128x64 display, around 90fps. That was at 36-bit color (and with Fadecandy’s interpolation turned off) if I remember correctly, so if you’re down to 24-bit maybe there’s a bit more optimization possible, though you might have to move away from USB CDC. Fadecandy keeps the USB buffers intact to use as the data source during refresh, to avoid an extra copy operation.

You can learn more about the Teensy USB stack in Teensyduino’s usb_serial.c/h and usb_mem.c/h. Compare to Fadecandy’s usb_mem and fc_usb.

Hmm… yeah, you’re probably right about USB CDC. Right now that’s certainly my bottleneck. 4ms to generate the frame and 12ms to push it. And that’s just one 128x32 display… I have 3 working as one big display. So I have to push out to all of them at once… and the performance suffers. Even threaded it still takes about 50ms to push each frame to all 3 teensys.
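To put those timings in perspective, here is the back-of-the-envelope math as a small plain C++ sketch. The function names are made up; the inputs are just the numbers from this thread (a 12288-byte frame, 12ms to push it, 4ms to generate it, about 50ms for all three displays), and it assumes generating and pushing happen one after the other.

```cpp
#include <cstddef>

// Effective transfer rate in bytes per second for one frame pushed in `ms` milliseconds.
constexpr double bytesPerSecond(std::size_t frameBytes, double ms) {
    return static_cast<double>(frameBytes) * 1000.0 / ms;
}

// Frame rate for a given total time per frame in milliseconds.
constexpr double framesPerSecond(double msPerFrame) {
    return 1000.0 / msPerFrame;
}
```

bytesPerSecond(12288, 12.0) comes out to roughly 1,024,000 bytes/s, framesPerSecond(4.0 + 12.0) to 62.5 fps for a single display, and framesPerSecond(50.0) to 20 fps for the three-display setup.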

Will have to look for a raw USB data implementation for Python (what my animation framework is written in)… thanks for the tip! Hopefully that will help :)