As previously discussed in this article, I have been wearing the Libre 3, Dexcom G6 and Dexcom G7 simultaneously to compare performance across all three.
These are the results. They are laid out as a Consensus Error Grid for each device, plus all three together, followed by tables covering bias, MARD vs fingersticks (MARDf), 20/20 performance vs fingersticks, and hypo detection.
If you don’t want to read the details, the basics are that the Libre 3 proved to be the best performer of the three, with the best results on nearly all of the metrics in this n=1 experiment.
First, a reminder of the approach.
The approach taken is outlined here, with the one change that the 20/20 measure uses 100mg/dl as the cut off point, rather than 80mg/dl as specified in the text. It’s worth bearing in mind that the Dexcom G7 in use here is the version sourced from Europe with the updated algorithm that was used for the accuracy study that can be found here.
To achieve this, there were some 78 fingerpricks during the G6/G7 life and 98 for the Libre3.
As always, this is n=1 and the results should be taken that way.
Consensus Error Grids
Below are the consensus error grids for the three devices, plus the combined one. It’s pretty clear from these that there were differences between the three sensors. The Dexcom G6 showed the greatest dispersion, while the G7 and Libre 3 were less dispersed but appeared to show a slightly positive bias in the G7’s case and a slightly negative one in the Libre 3’s case.
When we look at all three sets of data combined on the same grid, the dispersion and location of the readings is much clearer.
While the above charts give a pretty clear indicator of the relationship between the fingerprick tests and the CGMs, they’re not that easy to compare to headline figures. Remember, the published MARD of the G7 vs a YSI is 8.7% in adults, and the rather questionable MARD of the L3 in their study was 7.9%.
The following tables show the MARDf and Bias measured in this experiment, and the day by day MARDf, to give an indication of how things progressed over the life of the sensors.
Now you’ll note that in this case, there are two MARDf and bias values given: a total value and a post-calibration value. This is because, during the early life of both the G6 and G7, I found significant variation between the CGM readings and the fingerprick tests. Both fingerprick meters had been checked with control solution, so calibration of the CGMs was required. Judging by the number of readings that were more than 20% away from the blood tests, this took place on day 4 with the G7 and day 3 with the G6. Both the G6 and G7 needed calibrating because they were producing values a long way from the blood values.
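For readers who want to reproduce these headline numbers from their own paired readings, the two metrics are straightforward to compute. This is a minimal sketch, not the code used for this analysis; the function and variable names are my own.

```python
# Illustrative calculation of MARDf and bias from paired CGM/fingerstick
# readings, both in mg/dl. These are generic formulas, not the author's
# actual analysis script.

def mardf(cgm, ref):
    """Mean Absolute Relative Difference vs fingersticks, as a percentage."""
    return 100 * sum(abs(c - r) / r for c, r in zip(cgm, ref)) / len(ref)

def bias(cgm, ref):
    """Mean signed difference (CGM minus fingerstick), in mg/dl."""
    return sum(c - r for c, r in zip(cgm, ref)) / len(ref)

# Hypothetical example pairs, purely for illustration.
cgm = [95, 140, 62, 180]
ref = [100, 130, 70, 170]
print(f"MARDf: {mardf(cgm, ref):.1f}%  Bias: {bias(cgm, ref):.2f} mg/dl")
```

A positive bias means the CGM reads higher than blood on average; a negative one, as the Libre 3 grid suggests here, means it reads lower.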
Other datapoints of interest relate to hypo detection and accuracy. Here we have three additional data sets. The 20/20 percentage shows the proportion of CGM values that were within 20mg/dl of the reference when the reference was less than 100mg/dl, and within 20% of it when the reference was greater than 100mg/dl. There are also “hypo count” metrics, which show the percentage of readings that weren’t hypo when blood showed they should be, and the percentage of CGM readings that were hypo when blood said otherwise.
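The three metrics above can be sketched in code as follows. Note the assumptions: the article doesn’t state the hypo threshold, so 70mg/dl (the usual clinical cut-off) is assumed here, and readings at exactly 100mg/dl are treated under the 20% branch.

```python
# Hedged sketch of the 20/20 and hypo-detection metrics described above.
# HYPO threshold of 70 mg/dl is an assumption; the original text does not
# specify the value used.

HYPO = 70  # mg/dl, assumed

def within_20_20(cgm, ref):
    """Fraction of pairs within 20 mg/dl (ref < 100) or 20% (ref >= 100)."""
    hits = sum(
        abs(c - r) <= 20 if r < 100 else abs(c - r) <= 0.2 * r
        for c, r in zip(cgm, ref)
    )
    return hits / len(ref)

def missed_hypos(cgm, ref):
    """Share of reference hypos that the CGM failed to flag as hypo."""
    lows = [(c, r) for c, r in zip(cgm, ref) if r < HYPO]
    return sum(c >= HYPO for c, _ in lows) / len(lows) if lows else 0.0

def false_hypos(cgm, ref):
    """Share of CGM hypo readings where blood was not actually hypo."""
    cgm_lows = [(c, r) for c, r in zip(cgm, ref) if c < HYPO]
    return sum(r >= HYPO for _, r in cgm_lows) / len(cgm_lows) if cgm_lows else 0.0
```

The two hypo counts pull in opposite directions: a sensor that reads low will rarely miss a real hypo but will raise more false alarms, which is consistent with the patterns described below.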
What we can see from these tables is that the Libre 3 was more consistent and had considerably fewer datapoints that fell outside the 20/20 rule. Again, this is n=1, and the G7’s 12.8% is some way off the value seen in the accuracy study referenced earlier.
While the percentage of hypos not detected looks high for the G6 and G7 (at 27.3%), it was better than the outcome from the G6 in the “Six of the best” experiment.
The final table that we have shows the number of times that the different systems incorrectly called “Hypo” when blood said otherwise. The G6 and L3 were tied on this one, while the G7 never made that mistake. To some extent, the wider dispersion of the G6 data and the slightly negative bias that the consensus error grid appears to show for the Libre 3 seem to support these results.
Any other observations?
One of the complaints from users of the G7 has been that the glucose data is more jumpy than they expected. Part of the reason for this is that the G6 and G7 software no longer backsmooths the historic data points, but there were occasions where I saw noticeable differences between the G6 and G7. There were times when the G7 data did appear to be quite jumpy, which wasn’t what I expected to see. However, before drawing too many conclusions about this, I intend to run another G7 to see whether this is consistent or just down to the backsmoothing.
What is your takeaway from this n=1 experiment?
I was surprised at the performance of the Libre 3. Its performance was consistent with older versions of the Libre, in that it tended to read on the low side, but it was a lot closer to blood values than I expected. The data in this n=1 experiment suggested that it was the closest to fingerpricks without calibration. I’m interested to see whether that carries on across multiple sensors, as I know others have seen a much wider dispersion of values when compared to fingersticks.
The G7 was much more disappointing than I expected it to be. For what was supposed to be a more accurate version of the algorithm, the performance of the first four days was awful, and indeed, without calibration, I wonder what the results might have been. When I test the next G7, it will be done without any calibration.
Ultimately, for me, the key takeaway from this experiment is that anyone using the G7 should check the performance over the first few days, and if necessary, calibrate the sensor. It was an outcome that I was not expecting.