How often does DRAM refresh have to be done?

  • added 25. 05. 2024
  • This is a video following up on a question that came out of an earlier video on the channel: how long can you leave DRAM refresh off before memory contents start to decay?
    Code for this episode:
    github.com/wbhart/PCRetroProg...
    Minus Zero Degrees RAM information:
    minuszerodegrees.net/memory/4...

Comments • 42

  • @sandman9601
    @sandman9601 24 days ago +12

    We used to do a fun trick in our lab back around the DDR2 days. Write a pattern to memory up in an area DOS doesn't use. Then power off the system, remove the DIMM, pass it around if you'd like, and put it back in. Depending on how long you took, you could see various amounts of the 1's in the pattern decay to 0's.

    • @Heckatomba
      @Heckatomba 23 days ago +6

      Ever tried using cold spray? Not my idea - back in 2009 security researchers released a paper where they used cold spray to extend the time before the data in DRAM decayed. (2009, cold boot attack)

    • @sandman9601
      @sandman9601 23 days ago +5

      @@Heckatomba We did try that, and it worked. Cold definitely reduces leakage.

    • @JJFX-
      @JJFX- 14 days ago

      Out of curiosity, did you notice a consistent pattern for which bits seemed to degrade the fastest?

    • @sandman9601
      @sandman9601 14 days ago

      @@JJFX- Didn't really look, but nothing stood out.

  • @josephlunderville3195
    @josephlunderville3195 29 days ago +2

    After all my speculation it's incredibly gratifying to see the subsequent thorough testing. I'm happy you felt compelled to go down the rabbithole and thanks for taking us with you!

  • @pvc988
    @pvc988 29 days ago +6

    I don't know about DRAM, but when I was working with SDR SDRAM on an FPGA, memory contents easily survived reconfigurations, which take a couple of seconds. There are no refresh cycles or memory accesses during that time; every pin is in a Hi-Z state.

  • @adriansdigitalbasement
    @adriansdigitalbasement 29 days ago +7

    I'll be testing :-) You don't need to refresh the whole chip, by the way - just the part you're using. So no issues only refreshing 64k on a 256k bank. In fact you can use 256kbit chips in place of 64kbit chips and address line A8 is just not used.

    • @adriansdigitalbasement
      @adriansdigitalbasement 29 days ago +5

      Also, for visual fun, why not copy the RAM under test to the VGA framebuffer in 640*200 mode so you can see the pattern it decays to.

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 29 days ago +1

      I think the issue with only refreshing 64k is the way you'd do that. Basically you'd just refresh 128 rows (or 256 rows, depending on the kind of chips you have). This would mean that 128 bytes out of every 256 are refreshed. DOS would not be using only those bytes. (Consecutive rows correspond to consecutive bytes.) So really, you have to refresh all the rows in the chip, unless you are using memory in a fairly weird way.
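The counting argument in this reply can be sketched numerically. A minimal sketch, assuming the mapping described above (consecutive addresses land on consecutive rows, 256 rows per chip); the numbers are illustrative, not taken from any datasheet:

```python
# Toy model of partial refresh: with 256 rows per chip and consecutive
# addresses mapping to consecutive rows, refreshing only the first 128
# rows keeps just the first half of every 256-byte stripe alive.
ROWS_TOTAL = 256      # rows in the chip (illustrative)
ROWS_REFRESHED = 128  # refreshing only "64k worth" of rows

def is_refreshed(addr: int) -> bool:
    """A byte survives iff its row index falls inside the refreshed range."""
    return (addr % ROWS_TOTAL) < ROWS_REFRESHED

# Exactly half of a 64 KiB region survives - and DOS's data is scattered
# across both halves, which is why partial refresh doesn't help.
survivors = sum(is_refreshed(a) for a in range(64 * 1024))
print(survivors)  # 32768
```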

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 29 days ago

      @@adriansdigitalbasement That's a nice idea!

  • @ChrisJackson-js8rd
    @ChrisJackson-js8rd 17 days ago +1

    Careful in these tests that the length between refreshes and the length the system has been running (and therefore the temperature) don't correlate in any systematic way.
    Not that it would change the results, but if you did want to quantify the time to corruption more precisely, you would have to incorporate both temperature and time between refreshes into your analysis.
    Very nice video - I loved the systematic and logical approach you took to the question :)

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 17 days ago +1

      Yes, I think one would have to characterise the variation with temperature before attempting to understand the time to corruption.

  • @IExSet
    @IExSet 28 days ago

    Wow, you uncover super topics! I never get tired of liking your videos!

  • @Torbjorn.Lindgren
    @Torbjorn.Lindgren 23 days ago +3

    It's my understanding that the refresh period is bound not by how long it takes to decay in isolation (what you test), but by how long it takes to decay when memory rows around it are accessed. I don't know how big an effect this is on chips this old - it can be quite pronounced on newer chips, but they're many orders of magnitude denser. You may find that as you use the memory the retention time creeps down, so a healthy safety margin might be in order.
    But even with modern (ultra-dense) chips it's often possible to get away with setting refresh rates way below the official numbers in practice, with "overclockers" often maxing out the register allocated for this in the memory controller, which translates to something like 5-20 times longer than the official refresh rate, which is IIRC spec'd at 85C. At least for some memory it's documented to require refresh four times as often at 125C ("military") as at 85C ("commercial"), so there's definitely a temperature component - perhaps a doubling every 20C? I've never seen any manufacturer project this to lower temperatures, but it sounds possible that it might hold - it's known you can store memory content almost indefinitely at cryogenic temperatures.
    I.e., reading a row ALSO depletes nearby rows "a bit", and writes also "leak" somewhat into nearby rows - this is the basic idea behind the RowHammer attacks on recent DDR memory (unfortunately, with DDR4/5 things are getting so cramped that NO reasonable refresh interval might be safe, and other remedial actions have to be taken). I do know that how long memory lasts without refreshing can be extremely variable depending on the brand and model of memory chips; there are examples of 8-bit micros where some will survive a few seconds while others don't survive a brief flick (200ms?) of the power switch.
    This "accelerated decay" can be hard to profile since rows may not be laid out how you think; often both row and column address lines are routed along the "easiest" path rather than in A0/A1/... order, since it doesn't matter.
    For the memory you show, I guess you could try hammering the "next" physical row by incrementing A8 to A15 (given the 8+8 setup) and trying reads and writes (inverse bit value?) - that's "only" 16 sampling tests (8 address lines, read/write; or 18 for a 256kbit chip), but as you mention it's already slow and this would act as a multiplier. You can use your existing results as a guide to narrow it down and find out whether it has an appreciable effect on your XT's memory.
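The "doubling every 20C" speculation in this comment can be written down as a toy scaling law. Only the 4x factor between 85C and 125C is anchored by the comment; the doubling period and the extrapolation to room temperature are assumptions:

```python
def retention_scale(temp_c: float, ref_c: float = 85.0, doubling_c: float = 20.0) -> float:
    """Relative retention time vs. the reference temperature, assuming
    retention doubles for every `doubling_c` degrees of cooling (speculative)."""
    return 2.0 ** ((ref_c - temp_c) / doubling_c)

print(retention_scale(125.0))  # 0.25 -> refresh 4x as often, matching the 85C->125C ratio
print(retention_scale(25.0))   # 8.0  -> ~8x longer retention at room temp (pure extrapolation)
```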

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 23 days ago

      Interesting. I forgot to mention in the video that someone had suggested after watching the previous video that it may depend on things like how many 1's are in the row or how many 1's are nearby and so on. Some experiments along these lines would be interesting, though as you point out, not necessarily indicative of how things go on the whole.

    • @Vegemeister1
      @Vegemeister1 19 days ago +1

      Hah, yeah, this video got recommended to me and I ran down to the comments to yell about Rowhammer! Targeted Row Refresh! The YouTube algorithm is magic sometimes.
      There's also a part in recent DRAM specs (I don't remember if it came in DDR4 or DDR5) where there's a temperature threshold above which the refresh frequency is doubled.

    • @Vegemeister1
      @Vegemeister1 19 days ago

      See also www.csl.cornell.edu/~martinez/doc/isca13-mukundan.pdf

    • @JJFX-
      @JJFX- 14 days ago +1

      Yeah one of the 'easiest' ways to improve memory performance is still maxing out the refresh interval (tREFI) and/or speeding up the cycle times (tRFC) as much as possible. On good DDR5 chips we rarely see issues with the interval cranked up to ~16 microseconds or so and DDR4 often handled it too. Interestingly, I recall having to be more careful in the DDR3 days.
      This can be one of the scarier changes though because errors don't always show up in testing as you'd imagine. You could test it for a week straight but if the environment warms up enough a few months later you could end up with problems.
      Memory has really become one of the final frontiers of traditional overclocking now that CPUs and GPUs are pushed so close to the edge out of the box. Even profiled memory kits are often so badly tuned that squeezing another 20-30% performance out of cheap kits is still fairly common. I expect this to change and be less relevant as dynamic refresh features are actually implemented and we see CPU cache sizes increase.

  • @volodumurkalunyak4651
    @volodumurkalunyak4651 25 days ago +2

    Setting tREFI to 262k clock cycles on DDR5 is way more interesting.
    2 GB / 32 banks / 8192 bytes per row (maximum) = a minimum of 8192 rows.
    Refresh takes at most 1 row in each bank at a time -> 262k × 8192 rows / (5200 MT/s) = 0.413 s for the whole memory to refresh.
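As a sanity check, the arithmetic in this comment can be reproduced directly (these are the commenter's figures, not JEDEC spec values):

```python
# One refresh command per tREFI, each covering one row in every bank,
# so a full pass over all rows takes tREFI * rows transfer cycles.
TREFI_CYCLES = 262_144                 # "262k" tREFI setting
ROWS = (2 * 2**30) // 32 // 8192       # 2 GB / 32 banks / 8192-byte rows = 8192 rows
RATE = 5200e6                          # DDR5-5200: 5.2e9 transfers per second

full_pass_s = TREFI_CYCLES * ROWS / RATE
print(round(full_pass_s, 3))  # 0.413
```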

  • @MonochromeWench
    @MonochromeWench 9 days ago +1

    Nice of IBM to let you disable DRAM refresh and do things like this. Can it be done on other contemporary systems? Once-per-frame refresh would be a big improvement over the anytime-during-a-frame refresh it does by default. Even if refreshing during frames, you can cycle-count your custom DRAM refresh and exit it if you need to do something at that time. It would be tricky to time your CGA register writes around DRAM refresh code, but it might be doable if there were no other choice; the DRAM chips seem very forgiving, though, so it is probably more trouble than it's worth. Testing temperature dependence might be an idea: blast the chips with very warm/hot air and see if they decay faster. A hair dryer on low heat would probably be good enough and shouldn't get hot enough to melt anything.

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 9 days ago

      I believe it is possible to change the DRAM refresh on more recent systems, but of course it is done differently.
      I have thought about timing the DRAM refresh at the normal rate (or close to it), and people have definitely done that. We may end up needing to do something similar for some effects, because the PIT cannot be put exactly in sync with the CGA card: they start up at random times relative to one another, and the dividers they use have a common factor in the number of cycles per division.
      The hot air gun is a nice idea which I must admit I didn't think of.

  • @Roxor128
    @Roxor128 27 days ago +2

    Downside of trying to include error-correction back in the 1980s is the number of extra bits you need. For parity, it's just one extra bit, and therefore one extra chip. If you want to protect 8 bits of data, you need 13 bits for ECC, needing 62.5% more chips.
    Really making use of ECC came later when memory was being accessed in larger blocks from chips that would read out multiple bits at a time. With 9 chips of 8 bits each, you can do a 72-bit code with 64 bits of data, which can correct one error and detect two. Though this isn't using the code to its full capacity, and neither was the 8-bit example from earlier. That 72-bit code is just a truncated version of a 128-bit one, but that wouldn't have a nice power-of-two number of data bits in it (120). The 13-bit code is truncated from a 16-bit one, which would have 11 data bits.
    It took me a while between finding out about Hamming Codes and figuring out how the 72/64 one used for ECC memory would actually work. It's basically just calculating the error-correction bits with bits 0-71 normally, and acting as if bits 72-127 are always zero, and as that last range of bits would all be data bits, it doesn't need to bother storing any of them.
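The 13-bit example from the first paragraph of this comment can be made concrete with a toy SECDED encoder: Hamming(12,8) plus an overall parity bit. This is a from-scratch sketch of the general technique, not any memory controller's actual bit layout:

```python
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)  # non-power-of-two positions hold data
PARITY_POS = (1, 2, 4, 8)               # power-of-two positions hold check bits

def encode(data8: int) -> int:
    """Encode 8 data bits into a 13-bit SECDED word (bit 0 = overall parity)."""
    bits = [0] * 13
    for i, p in enumerate(DATA_POS):
        bits[p] = (data8 >> i) & 1
    for p in PARITY_POS:
        # Each check bit covers every position whose index has bit p set.
        bits[p] = 0
        for pos in range(1, 13):
            if pos & p and pos != p:
                bits[p] ^= bits[pos]
    bits[0] = 0
    for pos in range(1, 13):             # overall parity over the Hamming word
        bits[0] ^= bits[pos]
    return sum(b << i for i, b in enumerate(bits))

def correct(word13: int) -> int:
    """Repair a single flipped bit in positions 1-12 (double errors not handled)."""
    bits = [(word13 >> i) & 1 for i in range(13)]
    syndrome = 0
    for p in PARITY_POS:
        par = 0
        for pos in range(1, 13):
            if pos & p:
                par ^= bits[pos]
        if par:
            syndrome |= p
    if syndrome:                         # syndrome is the index of the flipped bit
        bits[syndrome] ^= 1
    return sum(b << i for i, b in enumerate(bits))

word = encode(0xA5)
assert all(correct(word ^ (1 << k)) == word for k in range(1, 13))
print("single-bit errors corrected")
```

Flipping any single bit of the Hamming portion leaves a nonzero syndrome that names the bad position, so the decoder can repair it; the overall parity bit is what lets a double error be detected rather than miscorrected (not shown here).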

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 27 days ago +1

      Ah, very interesting. That's a nice trick. I've used ECC and once a long time ago read about how the idea basically worked, but never looked at it in that much detail.

    • @Roxor128
      @Roxor128 26 days ago

      @@pcretroprogrammer2656 What got my head around it was a combination of 3Blue1Brown's videos about it, plus a lot of fiddling around in Logisim Evolution implementing it. I went as far as 16 data bits with a 24-bit communications channel (truncated from the 32-bit code, with 2 bits unused), but just finding that you can truncate a code and have it still work was what finally got my head around things.

  • @georgegonzalez2476
    @georgegonzalez2476 23 days ago

    The refresh doesn't touch every memory location. It only has to touch the row or column addresses.

    • @DerIchBinDa
      @DerIchBinDa 21 days ago +1

      Only the row; the column does not play any role in refresh.

  • @danielkowalski7527
      @danielkowalski7527 11 days ago

    so how often? ^^

  • @tighematt
    @tighematt a month ago +1

    Those results seem odd - surely the decay can't always be identical, yet you wait 10 times the same duration and only see 1 error. Even on the larger test the error count stays almost constant after the first wait?
    If you use a slightly longer wait or more iterations, do you see more random behaviour?

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 29 days ago

      Remember that reading the values out to check them has the effect of refreshing them. So it is basically the same experiment run 10 times. I'd expect the results to stabilise at some point, with all the bits that are going to fail in a given interval of time eventually failing, and all the ones that can hold their contents for that long never failing. That was one of the conclusions of the video.
      I'm sure as it heats up things would be different, but a couple of minutes is not enough heating for it to show up in the results.

    • @tighematt
      @tighematt 29 days ago +1

      Thanks, yes I understood that. The results just seemed too consistent? It’s certainly piqued my interest! It would be interesting to try to log which byte or ideally bit fails each time - perhaps that is different. Just seems odd that the ram would decay identically every time, but maybe that is just how it is?
      I’ll try your code on my XT later on. Thanks for interesting video.

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 29 days ago +1

      @@tighematt I guess there could be a bug in my code somewhere. It is pretty rough code.
      I'll be interested in what you find, and I hope the video spurs some interesting follow-ons from various people, even if it does just turn out that I made a silly mistake somewhere.

    • @tighematt
      @tighematt 29 days ago +1

      I ran your code on my XT - it has a V20, so I had to tweak ITERS a little… but I got the exact same result as you!
      I checked the utility I wrote some years ago to reduce RAM refresh speed for performance. It sets the period to 14ms - I went as far as I could without getting parity errors. So it seems odd that it's so different!

    • @pcretroprogrammer2656
      @pcretroprogrammer2656 28 days ago +1

      @@tighematt It's also odd that I always got parity errors when turning NMI back on. I'm not sure what accounts for the difference, other than possible bugs in my very rough code. Given that I can't think of anything else, I'm just supposing that some bits fail very quickly, but most bits take a long time. I guess we need more data (and more careful code). I like Adrian Black's idea of copying the data to the screen memory in mode 6 so we can actually see the decay after each interval.