Perceptual Image Codec: What Matters in Practical Learned Image Compression

(apple.github.io)

33 points | by ksec 4 hours ago

4 comments

klodolph 49 minutes ago
Interesting, but when I look at the sweater in the second image, the knitting just looks completely lost in the PICO version. The knitting looks correct but soft in other codecs. In the PICO version, it looks just completely wrong to me. The yarn structure has been replaced with a bunch of fuzzy strips. Similar problem in the third picture.
I guess this is what happens when you chase after extremely low data rates but I’m not happy with the results.
- crazygringo 2 minutes ago
  I think it's fascinating because it seems to be a completely different type of compression.
  You can see it in the hair as well. It seems very clear that it is engaging in a kind of texture synthesis.
  So it seems to be looking at an area, and capturing the textural quality. And then reproducing that, so the overall effect is the same, but individual fibers or fuzzy bits are randomly generated from scratch.
  And so yes, if you zoom in enough, the knitting looks completely wrong because the regular geometric pattern of irregular yarn it is made of has been replaced by a completely irregular pattern of irregular yarn.
  In other words, it is essentially hallucination of details on a micro scale but not on a macro scale.
  And I think that raises a really interesting philosophical question of what we consider to be valid image reconstruction from lossy compression.
  Because on the one hand, this is no different from blurriness or even the kind of blocky JPEG compression we are familiar with. It's just pixels that are wrong. Those blocks don't appear in the original image. The blurriness isn't there in the original image.
  But on the other hand, we see blurriness as being somehow more "honest", and we are easily able to recognize that blockiness is an artifact.
  Whereas with textural hallucination, it is no longer clear what is being filled in versus what is original, because it's doing such a good job of emulating so many aspects of the original texture.
  And it's really hard to say if one approach is better or worse than the other. It's probably more accurate to say that one is more appropriate than the other in different contexts. Like if it is just a normal news photograph, I am perfectly happy with a sharper image because it's not changing anything substantial – it's not changing the face of a world leader or the number of people in the photo. But on the other hand, if I am doing online shopping for shirts and I want to be able to zoom in on the texture, then it's incredibly important that the texture be accurate and not loosely hallucinated.
- Npovview 32 minutes ago
  I saw similar artifacts mentioned in a video reviewing Nvidia's DLSS.
dahart 1 hour ago
Looks very cool, assuming all the comparisons are correct & fair and there are no major failure cases. Quick link to the HTML version of the paper to save you a couple of clicks: https://arxiv.org/html/2605.05148v1
Since this is by Apple, I’m certainly curious if this is aimed at becoming the new default format for Apple devices. What kind of effort does it take to do that, beyond getting the paper published?
On the PR summary page, the “speed” column should be labeled “time”. Time is lower-is-better, whereas speed means higher-is-better.
The BD rate column could also use a less cryptic label. (Though maybe the audience is paper reviewers and not me.) The paper itself doesn’t even write out what the BD acronym in “BD rate” stands for, but it seems like it would be fair and accurate and better to call the column maybe something like relative compressed size, and mention the exact metric in the caption — where there’s already an explanation of BD rate.
I’m somewhat confused by, and TBH slightly skeptical of, the device timings. Are they correct & fair? Why is the NN-only portion almost as fast on an iPhone 17 as on a V100 when the V100 has 4x the FP throughput? Is it comparing apples to apples (ha!), and is the GPU implementation reasonable? The data suggests the GPU implementation is not saturating the GPU.
Also why are there several different GPU models? And why is V100 even used? V100 is four generations old and not even supported anymore.
- ksec 34 minutes ago
  >what the BD acronym in “BD rate” stands for,
  The Bjontegaard Delta-Rate (BD Rate) metric, proposed in 2001 by Gisle Bjontegaard, is a method for calculating the average difference between two rate-distortion (RD) curves. It is usually reported as the average percentage change in bitrate at equal quality, so a negative number means the tested codec needs fewer bits.
  It is extremely common in codec comparison, along with terms like PSNR, SSIM and VMAF (which is newer and was developed by Netflix, so it tends to get explained a bit more).
  >I’m certainly curious if this is aimed at becoming the new default format for Apple devices.
  I certainly hope not. Not unless it is deterministic and much much higher quality.
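  For anyone who wants to see what a BD-rate number actually measures, here is a minimal sketch. It is not the paper's evaluation code, and it swaps in piecewise-linear interpolation of log-bitrate over quality instead of the cubic polynomial fit used in Bjontegaard's original proposal, so its results will differ slightly from standard tools. The function names and the sample rate/quality points are made up for illustration.

```python
import math


def _value_at(xs, ys, x):
    """Linearly interpolate the curve through (xs, ys) at x."""
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return ys[i] + t * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the curve")


def _integrate(xs, ys, lo, hi):
    """Integrate the piecewise-linear curve over [lo, hi] (exact trapezoids)."""
    # Split at every curve point inside the interval so each piece is linear.
    knots = [lo] + [x for x in xs if lo < x < hi] + [hi]
    total = 0.0
    for a, b in zip(knots, knots[1:]):
        total += 0.5 * (_value_at(xs, ys, a) + _value_at(xs, ys, b)) * (b - a)
    return total


def bd_rate(anchor, test):
    """Average bitrate change of `test` vs `anchor` at equal quality, in percent.

    Each argument is a list of (bitrate, quality) points with distinct
    quality values. Simplification: linear interpolation, not a cubic fit.
    """
    def prepare(points):
        pts = sorted(points, key=lambda p: p[1])
        return [q for _, q in pts], [math.log(r) for r, _ in pts]

    qa, log_ra = prepare(anchor)
    qt, log_rt = prepare(test)

    # Only compare over the quality range both curves cover.
    lo, hi = max(qa[0], qt[0]), min(qa[-1], qt[-1])
    if hi <= lo:
        raise ValueError("curves do not overlap in quality")

    avg_log_diff = (_integrate(qt, log_rt, lo, hi)
                    - _integrate(qa, log_ra, lo, hi)) / (hi - lo)
    return (math.exp(avg_log_diff) - 1.0) * 100.0


# Hypothetical data: (bitrate in kbps, PSNR in dB). The candidate is built
# to need exactly half the bitrate of the anchor at every quality level.
anchor = [(100, 30.0), (200, 33.0), (400, 36.0), (800, 39.0)]
candidate = [(50, 30.0), (100, 33.0), (200, 36.0), (400, 39.0)]
print(f"BD-rate: {bd_rate(anchor, candidate):.1f}%")
```

  The averaging happens in the log-bitrate domain, which is why the result reads as a relative size change rather than an absolute number of bits.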
- kllrnohj 24 minutes ago
  > Why is the NN-only portion almost as fast on an iPhone 17 compared to a V100 when the V100 has 4x the FP throughput?
  Might have some sequential section, or a block size that struggles to fill a V100, or a large chunk of CPU-only work, or any number of things like that.
kllrnohj 25 minutes ago
I find it very curious that their new image codec did not really compare itself against other image codecs, but instead primarily video codecs pretending to do images. As in, no JPEG or JPEG-XL.
150 ms to decode 12 MP is also incredibly slow. That's like PNG territory of slow. A more flagship 50 MP image would be... oof.
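To put a rough number on the "oof", here is a back-of-the-envelope sketch. It assumes decode time scales linearly with pixel count, which is only an assumption: the real codec may scale better or worse depending on tiling, memory limits and hardware. The function name is made up for illustration.

```python
# Figures quoted in the comment above: 150 ms to decode a 12 MP image.
REFERENCE_MS = 150
REFERENCE_MP = 12


def estimate_decode_ms(megapixels):
    """Estimate decode time, ASSUMING it grows linearly with pixel count."""
    return REFERENCE_MS / REFERENCE_MP * megapixels


# Implied throughput in megapixels per second at the quoted figures.
throughput_mp_per_s = REFERENCE_MP * 1000 / REFERENCE_MS

print(f"Throughput: {throughput_mp_per_s} MP/s")
print(f"Estimated 50 MP decode: {estimate_decode_ms(50)} ms")
```

Under that linear assumption a 50 MP image lands at roughly 625 ms per decode.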
a-dub 2 hours ago
this is interesting. would be cool to explore something like integrating a vlm to add a "semantic" term to the loss function. looking through the comparisons, some of the baseline codecs create meaningfully different details (as could be described by text) in the images.